mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1517] Feature Request: Better support for archiving URLs behind HTTP basic auth #3915
Originally created by @agowa on GitHub (Sep 22, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1517
Type
What is the problem that your feature request solves
Archive a website behind a login form. I think this is mainly a documentation issue, but currently the documentation is very sparse on how to archive webpages behind sign-in prompts. And for HTTP Basic Auth websites it looks like the only possible way is currently to archive them using depth=0 and with the credentials embedded into the URL.
depth=1 doesn't work, as the credentials aren't carried over to the discovered URLs and therefore they throw an authentication error. Furthermore, it looks like URLs like http://user:pass@domain are considered to be distinct from http://domain URLs and not automatically detected as being the same but with the login credentials (aka links from other pages won't work, also when depth=1 was used for a site referencing them their version without login is also added to the index).
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
Document how to capture a website that is behind a credential prompt.
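Until that is documented, one possible workaround is to do the link discovery yourself and re-attach the credentials before adding each URL at depth=0. The following is only a minimal sketch in Python, not part of ArchiveBox: the helper names `strip_credentials`, `with_credentials` and `propagate_credentials` are hypothetical, and it assumes the same credentials are valid for every link on the seed URL's host.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def strip_credentials(url: str) -> str:
    """Return the URL without its user:pass@ part (useful for de-duplicating)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal, needs its brackets back
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def with_credentials(url: str, user: str, password: str) -> str:
    """Embed percent-encoded credentials into the URL, replacing any existing ones."""
    bare = urlsplit(strip_credentials(url))
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return urlunsplit(
        (bare.scheme, f"{userinfo}@{bare.netloc}", bare.path, bare.query, bare.fragment)
    )


def propagate_credentials(seed_url: str, discovered: list[str]) -> list[str]:
    """Copy the seed URL's credentials onto discovered links on the same host."""
    seed = urlsplit(seed_url)
    if seed.username is None:
        return list(discovered)
    user = unquote(seed.username)
    password = unquote(seed.password or "")
    result = []
    for link in discovered:
        if urlsplit(link).hostname == seed.hostname:
            result.append(with_credentials(link, user, password))
        else:
            # Never send the credentials to a different host.
            result.append(link)
    return result
```

The resulting list could then be added one URL at a time with depth=0, which is the mode reported to work above. `strip_credentials` illustrates the comparison that would be needed to treat http://user:pass@domain and http://domain as the same page.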
What hacks or alternative solutions have you tried to solve the problem?
If there is no native solution, probably having to use a mitmproxy and disable TLS certificate validation or something similar?
How badly do you want this new feature?
@pirate commented on GitHub (Sep 22, 2024):
HTTP basic auth is pretty rare these days so we only really support embedding credentials into the URL as you found, I'm unlikely to add first-class support for it beyond that.
As for other forms of auth, there's a bunch of docs on how to set that up:
@agowa commented on GitHub (Sep 22, 2024):
HTTP basic auth is not that uncommon. It is still used at universities, at enterprises for internal stuff, and outside of that for proxies. So having "first-class support for it" would still be kinda desirable.
But I'd also "settle" for some documentation on how I could transparently remap urls in the backend or something. Basically a way that when ArchiveBox tries to access Domain-A it actually goes to Domain-B instead. That probably also would come in handy for enterprises when you want ArchiveBox to go to the internal domain name instead of the external one but without breaking the URLs. One example use case of this would be to allow ArchiveBox to bypass rate-limiting and CDN caches.
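To make the Domain-A to Domain-B idea concrete, here is a rough sketch in Python of what such a remapping step could look like. It is purely illustrative: `HOST_MAP`, `remap_host` and the host names are made up, and nothing here is an existing ArchiveBox setting.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping: public host name -> host that should actually be fetched.
HOST_MAP = {"docs.example.com": "docs.internal.example"}


def remap_host(url: str, host_map: dict[str, str]) -> str:
    """Return the URL to fetch, swapping the host if it appears in host_map."""
    parts = urlsplit(url)
    target = host_map.get(parts.hostname or "")
    if target is None:
        return url  # host not remapped, fetch as-is
    netloc = target
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        # Keep any embedded credentials exactly as they were written.
        userinfo = parts.username
        if parts.password is not None:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

The archive index would keep the original URL, and only the fetch would use the remapped one, so links between snapshots stay intact.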
@pirate commented on GitHub (Sep 22, 2024):
There's a ticket open for supporting URL rewriting at some point, that is actually fairly versatile so it's higher up on my personal TODO list as it would allow building workarounds for a bunch of different cases: https://github.com/ArchiveBox/ArchiveBox/issues/1319
Well, it's rare enough that you're the only person to ask about it in 7 years, so in the context of ArchiveBox users it's a fairly niche need. Obviously in an ideal world it would be supported, but alas I don't have time to build everything. The real issue is that it's hard to handle across all the different extractors, as they each have different ways of dealing with it. By contrast, COOKIES.txt and Chrome profiles are quite standardized.
@agowa commented on GitHub (Sep 22, 2024):
I know, I've seen it earlier today. That's why I mentioned it :)
That's good to hear.
That's fair.
I feel you, looks like we all are struggling with the little time we have...
Well, this is where the above-mentioned ticket comes in handy. Not every problem needs a specially tailored solution; sometimes a generic one that addresses several problems is good enough, like the URL rewriting. HTTP Basic Auth is standardized in an "in URL" format, so being able to do a URL (or, even more generically, an HTTP request) rewrite would also be sufficient.
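For illustration, the "in URL" form and the header form carry the same information: the user:pass part of the URL becomes an `Authorization: Basic ...` header with the pair base64-encoded. A small Python sketch of that translation, as a request-rewrite step might do it, follows. `split_basic_auth` is a hypothetical helper, not anything ArchiveBox provides.

```python
import base64
from urllib.parse import unquote, urlsplit, urlunsplit


def split_basic_auth(url: str) -> tuple[str, dict[str, str]]:
    """Turn http://user:pass@host/... into (bare URL, Authorization header)."""
    parts = urlsplit(url)
    if parts.username is None:
        return url, {}  # nothing to rewrite
    user = unquote(parts.username)
    password = unquote(parts.password or "")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal, needs its brackets back
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    bare = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return bare, {"Authorization": f"Basic {token}"}
```

With something like this, the index could store only the bare URL while the credentials travel in the request, which would also avoid the duplicate-URL problem described in the original post.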
I did not expect that; for the few things I looked at, it was always a huge mess...
@vitSkalicky commented on GitHub (Sep 27, 2024):
I have the same problem and similar use case: I have a simple website that contains documentation which is secured using HTTP basic auth (because it's simple) and I would like to archive it for myself.
But thinking about it further... archivebox maybe isn't the right tool for this.