[GH-ISSUE #1411] a bug of urllib.parse.urljoin #2360

Closed
opened 2026-03-01 17:58:29 +03:00 by kerem · 2 comments
Owner

Originally created by @tqobqbq on GitHub (Apr 16, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1411

```
from urllib.parse import urljoin
print(urljoin('https://web.archive.org/http://www.example.com/a/b.html',
              'c.html'))
# result is 'https://web.archive.org/http:/www.example.com/a/c.html'
```

The change from "http://" to "http:/" in the embedded URL is not what I expected.
This bug stems from the filter statement in `urljoin`:

![image](https://github.com/ArchiveBox/ArchiveBox/assets/49046901/46320463-f423-4dbe-81b3-18913456d33f)

Simply commenting out the filter statement returns the proper result in this case.
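For readers who cannot view the screenshot, the path-merging step being described can be approximated as below. This is a minimal sketch and not the CPython source verbatim: `merge_segments` and its `drop_empty` flag are hypothetical names introduced for illustration, and the handling of `.` and `..` segments is left out. It only shows how filtering out empty path segments turns `//` into `/`.

```python
def merge_segments(base_path: str, rel_path: str, drop_empty: bool = True) -> str:
    """Merge a relative path onto a base path, roughly the way urljoin does.

    Hypothetical helper for illustration; dot-segment handling is omitted.
    """
    base_parts = base_path.split('/')
    if base_parts[-1] != '':
        # The last item is a file name, not a directory, so it is replaced.
        del base_parts[-1]

    segments = base_parts + rel_path.split('/')
    if drop_empty:
        # Dropping empty strings between the first and last segment is what
        # collapses the '//' inside the embedded URL into a single '/'.
        segments[1:-1] = filter(None, segments[1:-1])
    return '/'.join(segments)


base = '/http://www.example.com/a/b.html'
print(merge_segments(base, 'c.html'))                    # /http:/www.example.com/a/c.html
print(merge_segments(base, 'c.html', drop_empty=False))  # /http://www.example.com/a/c.html
```

With `drop_empty=False` the empty segment produced by `//` survives the round trip through `split` and `join`, which matches the observation that commenting out the filter gives the expected result for this input.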
Author
Owner

@pirate commented on GitHub (Apr 25, 2024):

Woah good find, it looks like it's an old known issue in the cpython source:

https://github.com/python/cpython/issues/84774#issuecomment-2076080928

while digging I found out `urljoin` actually has many unfortunate problems:

- different behavior depending on scheme: https://github.com/python/cpython/issues/63028
- no support for non-standard schemes: https://github.com/python/cpython/issues/110463
- incorrect handling of relative paths e.g. `../`: https://github.com/python/cpython/issues/96015
- poor arg type enforcement: https://github.com/python/cpython/issues/63293
- incorrect handling of `?` empty query param: https://github.com/python/cpython/issues/76960
- incorrect handling of `#` empty search param: https://github.com/python/cpython/issues/83980
- other RFC compliance issues: https://github.com/python/cpython/issues/43453

I wish we could use JS's `new URL(part, base)` behavior instead:
https://developer.mozilla.org/en-US/docs/Web/API/URL/URL
Author
Owner

@pirate commented on GitHub (Apr 25, 2024):

Fixed in e5aba0d
