[GH-ISSUE #1210] Bug: Cannot archive jpg at non-80 port 443 #3764

Closed
opened 2026-03-15 00:22:49 +03:00 by kerem · 1 comment

Originally created by @mlsteele on GitHub (Aug 11, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1210

Describe the bug

Archiving this URL fails every time:
https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg

Two weird things about this url:

  1. It is a jpg, not an html page. I don't think this is the issue, though wget_output_path does have html-specific elements.
  2. It is at port 443.

Steps to reproduce

  1. Run archivebox add https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg
  2. Find the timestamp on the webui by searching for the url.
  3. Run archivebox update -t timestamp 1691792281.185177 (your timestamp will vary)
  4. See wget failure and web ui lacking the content

Screenshots or log output

[i] [2023-08-11 23:09:20] ArchiveBox v0.6.2: archivebox update -t timestamp 1691792281.185177
    > /Users/miles/archivebox


[▶] [2023-08-11 23:09:22] Starting archiving of 1 snapshots in index...

[√] [2023-08-11 23:09:22] "newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg"
    https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg
    √ ./archive/1691792281.185177
      > wget
        Extractor failed:
             Wget failed or got an error from the server
            Got wget response code: 0.
            Total wall clock time: 0.3s
            Downloaded: 1 files, 149K in 0.05s (2.67 MB/s)
        Run to see full output:
            cd /Users/miles/archivebox/archive/1691792281.185177;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=/Users/miles/archivebox/archive/1691792281.185177/warc/1691795362 --page-requisites "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+https://github.com/ArchiveBox/ArchiveBox/) wget/GNU Wget 1.21.4" --compression=auto https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg

        10 files (1.1 MB) in 0:00:00s 

[√] [2023-08-11 23:09:23] Update of 1 pages complete (0.72 sec)
    - 0 links skipped
    - 1 links updated
    - 1 links had errors

    Hint: To manage your archive in a Web UI, run:
        archivebox server 0.0.0.0:8000

Possible fix

The issue is that in wget.py in wget_output_path this line:

    search_dir = Path(link.link_dir) / domain(link.url).replace(":", "+") / urldecode(full_path)

translates "newfs.s3.amazonaws.com:443" into "newfs.s3.amazonaws.com+443" and then goes on a wild goose chase looking for the file. The real wget did not do that translation; it threw out the port number when placing the jpg in the filesystem.

A localized solution is to use urlparse(link.url).hostname instead of domain(link.url).

A broader solution, which may do more good or break something far away, is to change in util.py:

- domain = lambda url: urlparse(url).netloc
+ domain = lambda url: urlparse(url).hostname
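
To make the difference between the two attributes concrete, here is a minimal standalone sketch using only the standard library. It does not import ArchiveBox; the variable names `netloc_dir` and `hostname_dir` are made up for illustration.

```python
from urllib.parse import urlparse

url = (
    "https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/"
    "Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg"
)
parsed = urlparse(url)

# netloc keeps the explicit port, so the ":" -> "+" replacement
# produces a directory name that includes it.
netloc_dir = parsed.netloc.replace(":", "+")

# hostname drops the port (it also lowercases the host and strips
# any user:password@ prefix).
hostname_dir = parsed.hostname

print(netloc_dir)    # newfs.s3.amazonaws.com+443
print(hostname_dir)  # newfs.s3.amazonaws.com
```

Because `.hostname` also lowercases the host and strips userinfo, swapping it in globally changes more than the port handling, which is one reason the broader `util.py` change could affect other callers.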

ArchiveBox version

ArchiveBox v0.6.2
Cpython Darwin macOS-13.5-x86_64-i386-64bit x86_64
IN_DOCKER=False DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /Users/miles/.pyenv/versions/3.11.1/bin/archivebox                          
 √  PYTHON_BINARY         v3.11.1         valid     /Users/miles/.pyenv/versions/3.11.1/bin/python3.11                          
 √  DJANGO_BINARY         v3.1.14         valid     /Users/miles/.pyenv/versions/3.11.1/lib/python3.11/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v8.1.2          valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.4         valid     /usr/local/bin/wget                                                         
 √  NODE_BINARY           v18.16.0        valid     /opt/nodejs/bin/node                                                        
 √  SINGLEFILE_BINARY     v1.0.47         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.0.6          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.41.0         valid     /usr/local/bin/git                                                          
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /Users/miles/.pyenv/versions/3.11.1/bin/youtube-dl                          
 √  CHROME_BINARY         v115.0.5790.75  valid     /Users/miles/Library/Caches/ms-playwright/chromium-1071/chrome-mac/Chromium.app/Contents/MacOS/Chromium
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/local/bin/rg                                                           

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /Users/miles/.pyenv/versions/3.11.1/lib/python3.11/site-packages/archivebox 
 √  TEMPLATES_DIR         3 files         valid     /Users/miles/.pyenv/versions/3.11.1/lib/python3.11/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /Users/miles/archivebox                                                     
 √  SOURCES_DIR           10 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           40 files        valid     ./archive                                                                   
 √  CONFIG_FILE           222.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             616.0 KB        valid     ./index.sqlite3                                                             


@pirate commented on GitHub (Jan 19, 2024):

I suspect that wget stripping the port for the URL you provided is just a special case of it being both https and :443.
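
As a rough illustration of that suspicion (not a verified description of wget's behavior), the rule "drop the port only when it is the scheme's default" could be modeled like this. `DEFAULT_PORTS` and `host_dir_guess` are hypothetical names, and the `+` separator is assumed from the `--restrict-file-names=windows` flag in the logged command.

```python
from urllib.parse import urlparse

# Assumption: only these two schemes matter for this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_dir_guess(url: str) -> str:
    """Guess the host directory name, omitting a scheme-default port."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    port = parsed.port
    if port is None or port == DEFAULT_PORTS.get(parsed.scheme):
        return host
    # Non-default port: keep it, joined with "+" instead of ":".
    return f"{host}+{port}"


print(host_dir_guess("https://newfs.s3.amazonaws.com:443/x.jpg"))
print(host_dir_guess("https://example.com:8443/x.jpg"))
```

Under this model the reported URL maps to `newfs.s3.amazonaws.com`, while a URL on port 8443 would still get a `+8443` suffix.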

I agree that your approach using urlparse is better, but the existing behavior in wget_output_path is written in blood haha.
There are so many subtle edge cases to how they rewrite URLs to be suitable for the filesystem, I really just wish they printed the final path in stdout!
I'm going to err on the safe side and add it as a fallback instead of modifying the first-pass behavior.
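
A fallback of this kind might look roughly like the sketch below. This is not the code from the fix commit; `find_wget_dir` is a hypothetical helper that only shows the ordering: try the existing netloc-based name first, then the port-less hostname.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def find_wget_dir(link_dir: Path, url: str) -> Optional[Path]:
    """Return the first candidate host directory that exists, else None."""
    parsed = urlparse(url)
    candidates = [
        parsed.netloc.replace(":", "+"),  # first pass: existing behavior
        parsed.hostname or "",            # fallback: port stripped
    ]
    for name in candidates:
        candidate = link_dir / name
        if name and candidate.is_dir():
            return candidate
    return None
```

Keeping the original lookup first means snapshots that already resolve correctly are unaffected; the hostname lookup only runs when the first one finds nothing.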

Fixed in 99bb02cd6c, will be released in v0.7.3.
