[GH-ISSUE #865] Bug: wget timeout retrying unavailable ipv6 (Docker/Pihole) #536

Closed
opened 2026-03-01 14:44:23 +03:00 by kerem · 1 comment
Owner

Originally created by @lkubb on GitHub (Sep 30, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/865

This is not necessarily a bug in ArchiveBox, but an edge case in the default configuration that might trip up some users. I hope this template is fine nevertheless.

Describe the bug

I'm running ArchiveBox inside a Docker container configured with docker-compose. Pi-hole runs inside my network as the DNS server. Running the wget extractor on a site with a blocked resource results in a timeout, and the archive fails. wget retries the blocked resource repeatedly, running past the time limit set by ArchiveBox. This is probably specific to a combination of three things: the way Pi-hole blocks domains at the DNS level by default (answering with 0.0.0.0 and ::), some Docker IPv6 weirdness, and wget not treating Cannot assign requested address as fatal.

This is solvable by setting inet4_only = on in /etc/wgetrc. I'm not sure whether you want to support this edge case, but I would propose either documenting the issue or adding a configuration option WGET_FORCE_IPV4 that adds --inet4-only to the generated wget command. I can open a pull request if desired.
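As a rough illustration of the wgetrc workaround, the sketch below appends the setting to a wgetrc file only if it is not already present. `WGETRC_PATH` is a hypothetical variable introduced for this example, and it defaults to a scratch file so the snippet can be tried safely; inside the container the target would presumably be `/etc/wgetrc`, as mentioned above.

```bash
# Sketch only: WGETRC_PATH is a made-up variable for this example.
# It defaults to a throwaway file; point it at /etc/wgetrc inside the
# container to apply the workaround described in the issue.
WGETRC_PATH="${WGETRC_PATH:-$(mktemp)}"

# Append the setting only when no identical line exists yet,
# so running the snippet twice does not duplicate it.
if ! grep -qx 'inet4_only = on' "$WGETRC_PATH"; then
  printf '%s\n' 'inet4_only = on' >> "$WGETRC_PATH"
fi

# Show the resulting file for inspection.
cat "$WGETRC_PATH"
```

wget also reads the file named by the `WGETRC` environment variable, so a per-user file may be an alternative to editing the system-wide one.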

Related issue: https://github.com/ArchiveBox/ArchiveBox/issues/491

Steps to reproduce

  1. wget https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml

  2. docker-compose up -d

  3. force DNS queries for cdn.optimizely.com to resolve to ipv4 0.0.0.0 / ipv6 ::

    docker-compose exec -T archivebox tee -a /etc/hosts > /dev/null << EOF
    0.0.0.0 cdn.optimizely.com
    :: cdn.optimizely.com
    EOF
    
  4. try to archive https://medium.com using wget extractor, e.g. via web UI, result:

    Command '['wget', '--no-verbose', '--adjust-extension', '--convert-links', '--force-directories', '--backup-converted', '--span-hosts', '--no-parent', '-e', 'robots=off', '--timeout=60', '--restrict-file-names=windows', '--warc-file=/data/archive/1632992981.405285/warc/1632992981', '--page-requisites', '--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+https://github.com/ArchiveBox/ArchiveBox/) wget/GNU Wget 1.20.1', '--compression=auto', 'https://medium.com/']' timed out after 60 seconds
    
  5. or verify that the following command runs longer than 60s:

    docker-compose exec --user=archivebox archivebox wget --verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --page-requisites "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+https://github.com/ArchiveBox/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto https://medium.com
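
To make the check in step 5 less manual, a small timing wrapper could be used. This is a minimal sketch; `exceeds_limit` is a hypothetical helper name, not part of ArchiveBox or wget, and the `sleep` call stands in for the long wget command.

```bash
# Hypothetical helper: run a command, print how long it took, and
# return success (0) only if it ran longer than the given limit in seconds.
exceeds_limit() {
  limit="$1"
  shift
  start=$(date +%s)
  # Output and exit status of the wrapped command are deliberately ignored.
  "$@" >/dev/null 2>&1 || true
  end=$(date +%s)
  elapsed=$((end - start))
  echo "elapsed=${elapsed}s limit=${limit}s"
  [ "$elapsed" -gt "$limit" ]
}

# Stand-in demo: a 2 second sleep checked against a 0 second limit.
if exceeds_limit 0 sleep 2; then
  echo "command ran past the limit"
fi
```

For the real check, the arguments would be `60` followed by the full `docker-compose exec ... wget ...` command from step 5.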
    

Screenshots or log output

[Note: my pihole blocks another domain on that site, static.cloudflareinsights.com]

--2021-09-30 09:21:07--  https://medium.com/
Resolving medium.com (medium.com)... 162.159.152.4, 162.159.153.4, 2606:4700:7::a29f:9904, ...
Connecting to medium.com (medium.com)|162.159.152.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘medium.com/index.html’

medium.com/index.html                                 [ <=>                                                                                                       ]  13.87K  --.-KB/s    in 0.001s

2021-09-30 09:21:08 (16.4 MB/s) - ‘medium.com/index.html’ saved [68512]

--2021-09-30 09:21:08--  https://cdn.optimizely.com/js/16180790160.js
Resolving cdn.optimizely.com (cdn.optimizely.com)... 0.0.0.0, ::
Connecting to cdn.optimizely.com (cdn.optimizely.com)|0.0.0.0|:443... failed: Connection refused.
Connecting to cdn.optimizely.com (cdn.optimizely.com)|::|:443... failed: Cannot assign requested address.
Retrying.


[...]

--2021-09-30 09:23:33--  (try:20)  https://cdn.optimizely.com/js/16180790160.js
Connecting to cdn.optimizely.com (cdn.optimizely.com)|0.0.0.0|:443... failed: Connection refused.
Connecting to cdn.optimizely.com (cdn.optimizely.com)|::|:443... failed: Cannot assign requested address.
Giving up.

--2021-09-30 09:23:33--  https://glyph.medium.com/css/unbound.css
Resolving glyph.medium.com (glyph.medium.com)... 162.159.153.4, 162.159.152.4, 2606:4700:7::a29f:9904, ...
Connecting to glyph.medium.com (glyph.medium.com)|162.159.153.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/css]
Saving to: ‘glyph.medium.com/css/unbound.css’

glyph.medium.com/css/unbound.css                      [ <=>                                                                                                       ]     788  --.-KB/s    in 0s


[...]

--2021-09-30 09:23:34--  https://static.cloudflareinsights.com/beacon.min.js
Resolving static.cloudflareinsights.com (static.cloudflareinsights.com)... 0.0.0.0, ::
Connecting to static.cloudflareinsights.com (static.cloudflareinsights.com)|0.0.0.0|:443... failed: Connection refused.
Connecting to static.cloudflareinsights.com (static.cloudflareinsights.com)|::|:443... failed: Cannot assign requested address.
Retrying.

[...]

--2021-09-30 09:25:59--  (try:20)  https://static.cloudflareinsights.com/beacon.min.js
Connecting to static.cloudflareinsights.com (static.cloudflareinsights.com)|0.0.0.0|:443... failed: Connection refused.
Connecting to static.cloudflareinsights.com (static.cloudflareinsights.com)|::|:443... failed: Cannot assign requested address.
Giving up.

--2021-09-30 09:25:59--  https://glyph.medium.com/font/81d2bf1/0-3j_4g_6bu_6c4_6c8_6c9_6cc_6cd_6ci_6cm/charter-400-italic.woff
Connecting to glyph.medium.com (glyph.medium.com)|162.159.153.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/font-woff]
Saving to: ‘glyph.medium.com/font/81d2bf1/0-3j_4g_6bu_6c4_6c8_6c9_6cc_6cd_6ci_6cm/charter-400-italic.woff’

glyph.medium.com/font/81d2bf1/0-3j_4g_6bu_6c4_6c8     [ <=>                                                                                                       ]  10.48K  --.-KB/s    in 0s

2021-09-30 09:25:59 (32.3 MB/s) - ‘glyph.medium.com/font/81d2bf1/0-3j_4g_6bu_6c4_6c8_6c9_6cc_6cd_6ci_6cm/charter-400-italic.woff’ saved [10744]

[...]

FINISHED --2021-09-30 09:26:01--
Total wall clock time: 4m 53s
Downloaded: 59 files, 1.5M in 0.2s (9.64 MB/s)
Converting links in medium.com/index.html... 25-4
Converting links in glyph.medium.com/css/unbound.css... 36-0
Converted links in 2 files in 0.002 seconds.
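
The timestamps above are consistent with wget's default retry behaviour, assuming the defaults of 20 tries and a linear backoff between retries capped at 10 seconds (neither `--tries` nor `--waitretry` is set in the command). Under that assumption, the waiting time spent on one blocked resource can be worked out as follows:

```bash
# Assumed wget defaults (not set explicitly in the command above).
tries=20       # total attempts per resource, matching "(try:20)" in the log
waitretry=10   # cap in seconds for the linear backoff between attempts

total=0
i=1
# There are tries-1 waits; wait number i lasts min(i, waitretry) seconds.
while [ "$i" -lt "$tries" ]; do
  delay=$i
  if [ "$delay" -gt "$waitretry" ]; then
    delay=$waitretry
  fi
  total=$((total + delay))
  i=$((i + 1))
done

echo "seconds spent waiting per blocked resource: $total"
# prints: seconds spent waiting per blocked resource: 145
```

That matches the log: 09:21:08 to 09:23:33 and 09:23:34 to 09:25:59 are 145 seconds each, so the two blocked domains account for roughly 290 of the 293 seconds of total wall clock time.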

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-4.19.0-17-amd64-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /data
 √  SOURCES_DIR           128 files       valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           133 files       valid     ./archive
 √  CONFIG_FILE           95.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             1.4 MB          valid     ./index.sqlite3
kerem 2026-03-01 14:44:23 +03:00
Author
Owner

@pirate commented on GitHub (Sep 30, 2021):

You can set WGET_ARGS for this:

# this is the default
# archivebox config --set WGET_ARGS='["--no-verbose", "--adjust-extension", "--convert-links", "--force-directories", "--backup-converted", "--span-hosts", "--no-parent", "-e", "robots=off"]'

# you can add --inet4-only to the end like so:
archivebox config --set WGET_ARGS='["--no-verbose", "--adjust-extension", "--convert-links", "--force-directories", "--backup-converted", "--span-hosts", "--no-parent", "-e", "robots=off", "--inet4-only"]'

In general, every dependency used has a similar <dependency name>_ARGS config option to satisfy edge cases like this. 👍

See here for the source code for the WGET_ARGS option: https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/config.py#L150
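
If retyping the whole list feels error-prone, the new value can be derived from the default one. The sketch below only builds and prints the command; it assumes the default list shown above is still current for the installed version, and the variable names are made up for this example.

```bash
# Default WGET_ARGS as quoted in the comment above (assumed to be current).
default_args='["--no-verbose", "--adjust-extension", "--convert-links", "--force-directories", "--backup-converted", "--span-hosts", "--no-parent", "-e", "robots=off"]'
extra='--inet4-only'

# Strip the closing bracket, then append the extra flag as a new JSON string.
new_args="${default_args%\]}, \"${extra}\"]"

# Print the command instead of running it, so it can be reviewed first.
echo "archivebox config --set WGET_ARGS='${new_args}'"
```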
