[PR #1766] Fix #1517: Feature Request: Better support for archiving URLs behind HTT #1510

Open
opened 2026-03-01 14:50:06 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1766
Author: @danielalanbates
Created: 2/21/2026
Status: 🔄 Open

Base: dev ← Head: fix/issue-1517


📝 Commits (1)

  • 231fb4f Fix #1517: Feature Request: Better support for archiving URLs beind HTT

📊 Changes

4 files changed (+113 additions, -4 deletions)

📝 archivebox/misc/util.py (+3 -2)
📝 archivebox/plugins/parse_html_urls/on_Snapshot__70_parse_html_urls.py (+40 -1)
📝 archivebox/plugins/parse_rss_urls/on_Snapshot__72_parse_rss_urls.py (+36 -1)
📝 archivebox/plugins/parse_txt_urls/on_Snapshot__71_parse_txt_urls.py (+34 -0)

📄 Description

Fixes #1517

Summary

This PR fixes: Feature Request: Better support for archiving URLs behind HTTP basic auth

Changes

archivebox/misc/util.py                            |  5 +--
 .../on_Snapshot__70_parse_html_urls.py             | 41 +++++++++++++++++++++-
 .../on_Snapshot__72_parse_rss_urls.py              | 37 ++++++++++++++++++-
 .../on_Snapshot__71_parse_txt_urls.py              | 34 ++++++++++++++++++
 4 files changed, 113 insertions(+), 4 deletions(-)

Testing

Please review the changes carefully. The fix was verified against the existing test suite.


This PR was created with the assistance of Claude Sonnet 4.6 by Anthropic | effort: high. Happy to make any adjustments!


Summary by cubic

Improves archiving behind HTTP Basic Auth by propagating credentials to same-host links and stripping creds from URL dedupe helpers. Fixes depth > 0 crawls that failed auth and prevents credentialed URLs from breaking deduplication. Addresses #1517.
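
The credential propagation described above can be illustrated with a minimal sketch. This is not the code from the PR: `propagate_auth` and `_effective_port` are hypothetical names, and treating a missing port as the scheme default (80 for http, 443 for https) is an assumption about what "port match" means.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: a URL without an explicit port uses its scheme's default port.
_DEFAULT_PORTS = {"http": 80, "https": 443}


def _effective_port(parts):
    """Return the explicit port, or the scheme default when none is given."""
    if parts.port is not None:
        return parts.port
    return _DEFAULT_PORTS.get(parts.scheme)


def propagate_auth(root_url: str, child_url: str) -> str:
    """Hypothetical helper: copy user:pass from root_url onto child_url
    when both point at the same hostname and port."""
    root = urlsplit(root_url)
    child = urlsplit(child_url)

    # Nothing to propagate if the root URL carries no credentials.
    if root.username is None:
        return child_url
    # Never overwrite credentials the child URL already has.
    if child.username is not None or child.password is not None:
        return child_url
    # Only inject when hostname and (effective) port both match.
    if root.hostname != child.hostname:
        return child_url
    if _effective_port(root) != _effective_port(child):
        return child_url

    # Reuse the root's userinfo verbatim so percent-encoding is preserved.
    userinfo = root.netloc.rpartition("@")[0]
    return urlunsplit(child._replace(netloc=f"{userinfo}@{child.netloc}"))


if __name__ == "__main__":
    root = "https://user:pass@example.com/feed"
    print(propagate_auth(root, "https://example.com/post/1"))
    print(propagate_auth(root, "https://other.example.org/post/1"))
```

Because the port comparison includes the scheme default, an `https` root never hands its credentials to a plain `http` link on the same host in this sketch; whether the PR behaves the same way is not stated in the summary.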

  • Bug Fixes
    • HTML/RSS/TXT parsers: inject user:pass from the root URL into discovered URLs when hostname and port match; skip if child already has creds.
    • URL utils: domain() now excludes credentials and includes port; added without_auth(); base_url() now ignores credentials so user:pass@host and host dedupe correctly.
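
As a rough illustration of the URL helper behaviour listed above, the sketch below shows one way to drop credentials before comparing URLs. The function names `without_auth` and `domain` are borrowed from the summary, but the bodies are stand-ins written for this example, not the implementations in archivebox/misc/util.py; `same_resource` is a hypothetical helper added only to show the dedupe effect.

```python
from urllib.parse import urlsplit, urlunsplit


def without_auth(url: str) -> str:
    """Return url with any user:pass@ prefix removed from the netloc."""
    parts = urlsplit(url)
    # Everything after the last "@" is host[:port]; without "@" the netloc is unchanged.
    host_and_port = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=host_and_port))


def domain(url: str) -> str:
    """Return host[:port] only, with credentials excluded."""
    return urlsplit(url).netloc.rpartition("@")[2]


def same_resource(a: str, b: str) -> bool:
    """Hypothetical dedupe check: two URLs match if they differ only in credentials."""
    return without_auth(a) == without_auth(b)


if __name__ == "__main__":
    print(without_auth("https://user:pass@example.com:8080/a?b=1"))
    print(domain("https://user:pass@example.com:8080/a"))
    print(same_resource("https://user:pass@example.com/a", "https://example.com/a"))
```

With helpers along these lines, `user:pass@host` and `host` reduce to the same key, which is the deduplication property the summary describes.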

Written for commit 231fb4f10b. Summary will update on new commits.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference
starred/ArchiveBox#1510