[GH-ISSUE #186] wget Errors on latest master #132

Closed
opened 2026-03-01 14:40:52 +03:00 by kerem · 6 comments

Originally created by @n0ncetonic on GitHub (Mar 21, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/186

Describe the bug

`wget` times out after 30 seconds on the latest build of the `master` branch. When the same `wget` command is run outside of ArchiveBox, it works as expected.

Steps to reproduce

Steps to reproduce the behavior:

  1. Use the following `~/.ArchiveBox.conf` options:

```
# Example config file for ArchiveBox: The self-hosted internet archive.
# Copy this file to ~/.ArchiveBox.conf before editing it.
# Config file is in both Python and .env syntax (all strings must be quoted).
# For documentation, see:
#    https://github.com/pirate/ArchiveBox/wiki/Configuration

################################################################################
## General Settings
################################################################################
OUTPUT_PERMISSIONS=644
ONLY_NEW=True
TIMEOUT=30
MEDIA_TIMEOUT=3600
#TEMPLATES_DIR="archivebox/templates"
FOOTER_INFO="Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests."
FETCH_TITLE=True
FETCH_FAVICON=True
FETCH_WGET=True
FETCH_WARC=True
FETCH_PDF=True
FETCH_SCREENSHOT=False
FETCH_DOM=True
FETCH_GIT=True
FETCH_MEDIA=True
SUBMIT_ARCHIVE_DOT_ORG=False
#CHECK_SSL_VALIDITY=True
FETCH_WGET_REQUISITES=True
RESOLUTION="1440,900"
WGET_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
HEADLESS_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
GIT_DOMAINS="github.com,bitbucket.org,gitlab.com"
#COOKIES_FILE="path/to/cookies.txt"
#CHROME_USER_DATA_DIR="~/.config/google-chrome/Default"
USE_COLOR=false
SHOW_PROGRESS=false
```

  2. Run `./archive`:

     `echo "https://developer.apple.com/library/archive/technotes/tn2218/_index.html#//apple_ref/doc/uid/DTS40007625" | ./archive`

  3. See error

Screenshots or log output

```
wget
    Failed: TimeoutExpired Command '/usr/local/bin/wget' timed out after 30 seconds
    Run to see full output:
        cd /Volumes/home/www/archive/1553194400.182;
        /usr/local/bin/wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=unix --timeout=30 --warc-file=warc/1553194992 --page-requisites "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36" https://developer.apple.com/library/archive/technotes/tn2219/_index.html#//apple_ref/doc/uid/DTS10004624
```
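Note that the 30-second limit in both the error message and the `--timeout` flag comes from the `TIMEOUT=30` setting in the config above, not from wget itself. If slow storage rather than the network turns out to be the bottleneck, raising it gives each command more headroom. A sketch of the change (120 is an arbitrary illustrative value, not a number from this thread):

```
# In ~/.ArchiveBox.conf — raise the per-command timeout
# (120 is an illustrative value, not a recommendation from this thread)
TIMEOUT=120
```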

Software versions

(please complete the following information)

  • OS: macOS 10.14
  • ArchiveBox version: d798117
  • Python version: Python 3.7.2
  • Wget version: GNU Wget 1.19.5 built on darwin17.5.0.
kerem 2026-03-01 14:40:52 +03:00

@n0ncetonic commented on GitHub (Mar 21, 2019):

Somehow this appears to have resolved itself, although wget does appear to have been severely slowed down by something in the commits between `c79e1df` and `d798117`; I'm getting throughput of 1 URL archived every 30 or so seconds.


@pirate commented on GitHub (Mar 21, 2019):

Ok this is all helpful, thanks, I'll try to git bisect and see if it's the commit I think it is that slowed everything down so much.

Out of curiosity, how fast is your disk IO? I recently added some code that does ~5x more reading and rewriting of the index and output dir in order to provide a more real-time UI experience during the archiving process, so if disk IO is your bottleneck it would make sense that that change slowed it significantly for you.


@n0ncetonic commented on GitHub (Mar 21, 2019):

Unsure of how to determine disk I/O speed. I'm pointing ArchiveBox at a mounted NAS share on the local network; the NAS has a gigabit line and 3 x WD RED 8TB NAS drives.


@n0ncetonic commented on GitHub (Mar 21, 2019):

Update: I checked my settings and it looks like my NAS share was being mounted using SMB 2 by default. I've since changed this to SMB 3, which should help with any disk I/O issues caused by network latency.

I'm running ArchiveBox again on the same data set as before and the issue seems to be resolved with an average archive time of 15 seconds per link which is back to a fairly decent speed.
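For anyone hitting this later: on macOS the negotiated SMB dialect can be checked per mounted share with `smbutil` (a diagnostic sketch; the exact output fields vary by macOS version):

```
# macOS: list mounted SMB shares and their negotiated attributes;
# look for the SMB_VERSION row (e.g. SMB_3.02) in the output table
smbutil statshares -a
```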


@n0ncetonic commented on GitHub (Mar 22, 2019):

Closing this as it appears to have been fixed by switching to SMB3


@pirate commented on GitHub (Mar 22, 2019):

Also in case you ever need to check in the future, you can determine I/O speed like this:

```shell
$ dd if=/dev/urandom of=/testfile bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 9.27182 s, 57.9 MB/s
```
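A caveat on the `dd` test above: reading from `/dev/urandom` can be CPU-bound, so it may understate disk speed, and a write test alone says nothing about reads. A complementary sketch using `/dev/zero` instead (the `TESTFILE` path is a placeholder, not from the thread; point it at the NAS mount to benchmark the share itself):

```shell
# Rough disk throughput check. TESTFILE is a placeholder path;
# point it at the NAS mount (e.g. somewhere under /Volumes) to test the share.
TESTFILE=/tmp/archivebox-iotest

# Write test: 64 MiB of zeros (avoids the urandom CPU bottleneck)
dd if=/dev/zero of="$TESTFILE" bs=1M count=64 2>&1 | tail -n 1

# Read test: note the OS page cache may inflate this number
dd if="$TESTFILE" of=/dev/null bs=1M 2>&1 | tail -n 1

rm -f "$TESTFILE"
```

The write number is the one that matters for ArchiveBox's index rewriting; a cached read result is mostly a sanity check.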