mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #1210] Bug: Cannot archive jpg at non-80 port 443 #2254
Originally created by @mlsteele on GitHub (Aug 11, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1210
Describe the bug
Trying to archive this URL fails repeatedly:
https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg
Two weird things about this url:
Steps to reproduce
archivebox add https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg
archivebox update -t timestamp 1691792281.185177 (your timestamp will vary)
Screenshots or log output
Possible fix
The issue is that in wget.py, in wget_output_path, this line:
translates "newfs.s3.amazonaws.com:443" into "newfs.s3.amazonaws.com+443". And then goes on a wild goose chase looking for the file. Whereas the real wget did not do that translation and instead threw out the port number when placing the jpg in the filesystem.
A localized solution is to use urlparse(link.url).hostname instead of domain(link.url).
A broader solution, which may do more good or break something far away, is to change in util.py:
ArchiveBox version
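To make the difference described under "Possible fix" concrete, here is a minimal sketch using only the Python standard library. The helper names netloc_dir and hostname_dir are hypothetical stand-ins, not ArchiveBox functions; netloc_dir merely assumes the first-pass behavior is "take the network location and replace the colon with a plus", as the report describes.

```python
from urllib.parse import urlparse

URL = (
    "https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/"
    "Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg"
)


def netloc_dir(url: str) -> str:
    """Hypothetical stand-in for the reported behavior: keep the port, swap ':' for '+'."""
    return urlparse(url).netloc.replace(":", "+")


def hostname_dir(url: str) -> str:
    """Proposed localized fix: use only the hostname, which drops the port."""
    return urlparse(url).hostname or ""


if __name__ == "__main__":
    print(netloc_dir(URL))    # directory name the lookup expects
    print(hostname_dir(URL))  # directory name wget reportedly used for this URL
```

For the URL in this report the first helper yields a name containing "+443" while the second yields the bare hostname, which is the mismatch that makes the lookup fail.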
@pirate commented on GitHub (Jan 19, 2024):
I suspect that wget stripping the port for the URL you provided is just a special case of it being both https and :443.
I agree that your approach using urlparse is better, but the existing behavior in wget_output_path is written in blood haha.
There are so many subtle edge cases to how they rewrite URLs to be suitable for the filesystem, I really just wish they printed the final path in stdout!
I'm going to err on the safe side and add it as a fallback instead of modifying the first-pass behavior.
Fixed in 99bb02cd6c, will be released in v0.7.3.
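The fallback idea from the comment above can be sketched roughly as follows. This is not the code from the referenced commit; candidate_dirs and find_output_dir are hypothetical names, and the sketch assumes the only two candidates are the port-preserving name (colon replaced with plus) and the bare hostname.

```python
from pathlib import Path
from typing import List, Optional
from urllib.parse import urlparse


def candidate_dirs(url: str) -> List[str]:
    """Return directory names to try, first-pass name first, hostname as fallback."""
    parsed = urlparse(url)
    first_pass = parsed.netloc.replace(":", "+")  # existing behavior, per the report
    fallback = parsed.hostname or ""              # port-free name suggested in the issue
    names: List[str] = []
    for name in (first_pass, fallback):
        if name and name not in names:            # skip empties and duplicates
            names.append(name)
    return names


def find_output_dir(snapshot_dir: Path, url: str) -> Optional[Path]:
    """Return the first candidate directory that exists under snapshot_dir, else None."""
    for name in candidate_dirs(url):
        path = snapshot_dir / name
        if path.is_dir():
            return path
    return None
```

Keeping the first-pass name at the front of the list leaves existing lookups unchanged; the hostname is consulted only when the first name does not exist on disk.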