[PR #1766] Fix #1517: Feature Request: Better support for archiving URLs behind HTT #3019

Open
opened 2026-03-01 18:01:25 +03:00 by kerem · 0 comments
Owner

Original Pull Request: https://github.com/ArchiveBox/ArchiveBox/pull/1766

State: open
Merged: No


Fixes #1517

Summary

This PR addresses the feature request in #1517: better support for archiving URLs behind HTTP basic auth.

Changes

 archivebox/misc/util.py                            |  5 +--
 .../on_Snapshot__70_parse_html_urls.py             | 41 +++++++++++++++++++++-
 .../on_Snapshot__72_parse_rss_urls.py              | 37 ++++++++++++++++++-
 .../on_Snapshot__71_parse_txt_urls.py              | 34 ++++++++++++++++++
 4 files changed, 113 insertions(+), 4 deletions(-)

Testing

Please review the changes carefully. The fix was verified against the existing test suite.


This PR was created with the assistance of Claude Sonnet 4.6 by Anthropic | effort: high. Happy to make any adjustments!


Summary by cubic

Improves archiving behind HTTP Basic Auth by propagating credentials to same-host links and stripping creds from URL dedupe helpers. Fixes depth > 0 crawls that failed auth and prevents credentialed URLs from breaking deduplication. Addresses #1517.
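The credential propagation described here can be illustrated with a minimal sketch. This is not the code from the PR: `inject_auth` is a hypothetical helper name, and the exact matching rules (a strict comparison of hostname and explicit port, no default-port normalization) are assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def inject_auth(root_url: str, child_url: str) -> str:
    """Hypothetical helper: copy user:pass from root_url onto child_url
    when both URLs point at the same hostname and port."""
    root = urlsplit(root_url)
    child = urlsplit(child_url)

    if root.username is None:
        return child_url  # the root URL has no credentials to propagate
    if child.username is not None:
        return child_url  # the child already carries its own credentials
    if (root.hostname, root.port) != (child.hostname, child.port):
        return child_url  # different host or port: never leak credentials

    # Reuse the raw userinfo from the root netloc so that any
    # percent-encoding in the username or password is preserved.
    userinfo = root.netloc.rpartition("@")[0]
    return urlunsplit(child._replace(netloc=f"{userinfo}@{child.netloc}"))


if __name__ == "__main__":
    root = "https://user:pass@example.com/feed"
    for link in ("https://example.com/post/1", "https://other.org/page"):
        print(inject_auth(root, link))
```

Comparing ports strictly means `https://example.com` and `https://example.com:443` are treated as different origins in this sketch; a real implementation might normalize default ports before comparing.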

  • Bug Fixes
    • HTML/RSS/TXT parsers: inject user:pass from the root URL into discovered URLs when hostname and port match; skip if child already has creds.
    • URL utils: domain() now excludes credentials and includes port; added without_auth(); base_url() now ignores credentials so user:pass@host and host dedupe correctly.
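As a rough illustration of the URL helper behaviour listed above, the following sketch shows one way credential-free helpers could look. The function names `without_auth`, `domain`, and `base_url` come from the summary, but the bodies are assumptions written for this example and are not taken from `archivebox/misc/util.py`; in particular, having `base_url` drop the scheme and fragment is a guess.

```python
from urllib.parse import urlsplit, urlunsplit


def without_auth(url: str) -> str:
    """Return the URL with any user:pass@ prefix removed from the netloc."""
    parts = urlsplit(url)
    host_and_port = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=host_and_port))


def domain(url: str) -> str:
    """Return host[:port] without credentials, lowercased."""
    return urlsplit(url).netloc.rpartition("@")[2].lower()


def base_url(url: str) -> str:
    """Assumed dedupe key: host[:port] + path + query, ignoring
    credentials, scheme, and fragment."""
    parts = urlsplit(without_auth(url))
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.netloc}{parts.path}{query}"


if __name__ == "__main__":
    with_creds = "https://user:pass@example.com/a"
    plain = "https://example.com/a"
    # Both forms map to the same key, so they dedupe to one entry.
    print(base_url(with_creds) == base_url(plain))
```

Because the key is computed after stripping credentials, a URL saved with `user:pass@host` and the same URL saved without credentials collapse to a single entry.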

Written for commit 231fb4f10b. Summary will update on new commits.
