[GH-ISSUE #1517] Feature Request: Better support for archiving URLs behind HTTP basic auth #3915

Open
opened 2026-03-15 00:58:19 +03:00 by kerem · 5 comments

Originally created by @agowa on GitHub (Sep 22, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1517

Type

  • [x] General question or discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

Archive a website behind a login form. I think this is mainly a documentation issue, but the documentation is currently very sparse on how to archive webpages behind sign-in prompts. For HTTP Basic Auth websites, it looks like the only way at the moment is to archive them with depth=0 and the credentials embedded in the URL.
depth=1 doesn't work because the credentials aren't carried over to the discovered URLs, which therefore fail with an authentication error. Furthermore, it looks like URLs such as http://user:pass@domain are treated as distinct from http://domain URLs rather than being recognized as the same URL with login credentials attached (so links from other pages won't work, and when depth=1 is used on a site that references them, their version without login is also added to the index).
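To make the two behaviours described above concrete, here is a minimal sketch in Python. It is not ArchiveBox code: `strip_credentials` and `inherit_credentials` are hypothetical helper names, and the sketch assumes credentials should only be copied to links on the same scheme, host, and port as the seed URL.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def strip_credentials(url: str) -> str:
    """Return the URL without its user:pass@ part, e.g. for use as an index key."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literals need their brackets back
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def inherit_credentials(seed_url: str, link: str) -> str:
    """Resolve a discovered link against the seed URL and copy the seed's
    credentials onto it if it targets the same origin and has none of its own."""
    seed = urlsplit(seed_url)
    target = urlsplit(urljoin(strip_credentials(seed_url), link))
    same_origin = (target.scheme, target.hostname, target.port) == (
        seed.scheme,
        seed.hostname,
        seed.port,
    )
    if not same_origin or seed.username is None or target.username is not None:
        return urlunsplit(target)
    userinfo = seed.netloc.rpartition("@")[0]  # "user:pass", kept exactly as written
    return urlunsplit(target._replace(netloc=f"{userinfo}@{target.netloc}"))
```

With helpers like these, `http://user:pass@domain/docs/` and `http://domain/docs/` would map to the same index key, while links discovered at depth=1 on the same host would still be fetched with the seed's credentials. Links to other hosts are left untouched so the credentials are never sent elsewhere.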

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

Document how to capture a website that is behind a credential prompt.

What hacks or alternative solutions have you tried to solve the problem?

If there is no native solution, I'd probably have to use mitmproxy and disable TLS certificate validation, or something similar?

How badly do you want this new feature?

  • [x] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [ ] I'm willing to contribute dev time / money to fix this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up

@pirate commented on GitHub (Sep 22, 2024):

HTTP basic auth is pretty rare these days, so we only really support embedding credentials into the URL, as you found. I'm unlikely to add first-class support for it beyond that.

As for other forms of auth, there's a bunch of docs on how to set that up:

  • https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
  • https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#archiving-content-behind-log-ins--advanced-users-only
  • https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#cookies_file
  • https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_user_data_dir

@agowa commented on GitHub (Sep 22, 2024):

HTTP basic auth is not that uncommon. It is still used at universities, at enterprises for internal stuff, and elsewhere for proxies. So having "first-class support for it" would still be kinda desirable.

But I'd also "settle" for some documentation on how I could transparently remap URLs in the backend or something. Basically, a way that when ArchiveBox tries to access Domain-A, it actually goes to Domain-B instead. That would probably also come in handy for enterprises that want ArchiveBox to go to the internal domain name instead of the external one, but without breaking the URLs. One example use case would be letting ArchiveBox bypass rate-limiting and CDN caches.
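As a rough illustration of the Domain-A to Domain-B idea, the sketch below rewrites only the host used for fetching and returns the original URL alongside it so the index can keep the public address. Everything here is an assumption made for illustration: `HOST_REMAP`, `remap_for_fetch`, and the example hostnames are hypothetical, and ArchiveBox does not expose such a hook as far as this thread indicates.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from public hostnames to the internal hosts to fetch from.
HOST_REMAP = {"docs.example.com": "docs.internal.example"}


def remap_for_fetch(url: str, remap: dict[str, str] = HOST_REMAP) -> tuple[str, str]:
    """Return (fetch_url, original_url): the first is what gets requested,
    the second is what stays in the index so existing links keep working."""
    parts = urlsplit(url)
    new_host = remap.get(parts.hostname or "")
    if new_host is None:
        return url, url
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    if "@" in parts.netloc:  # keep any embedded credentials
        netloc = f"{parts.netloc.rpartition('@')[0]}@{netloc}"
    return urlunsplit(parts._replace(netloc=netloc)), url
```

Note that a plain host swap like this also changes the `Host` header and the name checked against the TLS certificate, so a real implementation would likely need to handle both.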


@pirate commented on GitHub (Sep 22, 2024):

There's a ticket open for supporting URL rewriting at some point. That is actually fairly versatile, so it's higher up on my personal TODO list, as it would allow building workarounds for a bunch of different cases: https://github.com/ArchiveBox/ArchiveBox/issues/1319

> HTTP basic auth is not that uncommon

Well, it's rare enough that you're the only person to ask about it in 7 years, so in the context of ArchiveBox users it's a fairly niche need. Obviously in an ideal world it would be supported, but alas I don't have time to build everything. The real issue is that it's hard to handle across all the different extractors, as they each have different ways of dealing with it. By contrast, COOKIES.txt and chrome profiles are quite standardized.


@agowa commented on GitHub (Sep 22, 2024):

> There's a ticket open for supporting URL rewriting at some point, that is actually fairly versatile

I know, I saw it earlier today. That's why I mentioned it :)

> so it's higher up on my personal TODO list as it would allow building workarounds for a bunch of different cases: #1319

That's good to hear.

> Well it's rare enough that you're the only person to ask about it in 7 years, so in the context of ArchiveBox users it's a fairly niche need.

That's fair.

> Obviously in an ideal world it would be supported but I alas I don't have time to build everything unfortunately.

I feel you, looks like we all are struggling with the little time we have...

> The real issue is that's it's hard to handle across all the different extractors as they each have different ways of dealing with it.

Well, this is where the above-mentioned ticket comes in handy. Not every problem needs a specially tailored solution; sometimes a generic one that addresses several problems is good enough, like the URL rewriting. HTTP Basic Auth is standardized in an "in URL" format, so being able to rewrite URLs (or, even more generically, HTTP requests) would also be sufficient.
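For reference, the "in URL" form and the request-level form carry the same information: Basic auth is just an `Authorization` header containing the base64 of `user:password`. The sketch below shows that conversion, so a generic request-rewrite step could attach the header while the stored URL stays credential-free. `split_basic_auth` is a hypothetical helper name, not an existing ArchiveBox function.

```python
import base64
from urllib.parse import unquote, urlsplit, urlunsplit


def split_basic_auth(url: str) -> tuple[str, dict[str, str]]:
    """Split http://user:pass@host/... into a credential-free URL and the
    equivalent Authorization header for the HTTP Basic scheme."""
    parts = urlsplit(url)
    if parts.username is None:
        return url, {}
    # Credentials in URLs are percent-encoded; the header carries the raw values.
    user = unquote(parts.username)
    password = unquote(parts.password or "")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    clean = urlunsplit(parts._replace(netloc=parts.netloc.rpartition("@")[2]))
    return clean, {"Authorization": f"Basic {token}"}
```

For example, `split_basic_auth("http://user:pass@domain/page")` yields `http://domain/page` plus the header `Authorization: Basic dXNlcjpwYXNz`. Each extractor would still need its own way of accepting that header, which is the per-extractor difficulty mentioned earlier in the thread.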

> By contrast COOKIES.txt and chrome profiles are quite standardized.

I did not expect that. For the few things I looked at, it was always a huge mess...


@vitSkalicky commented on GitHub (Sep 27, 2024):

I have the same problem and similar use case: I have a simple website that contains documentation which is secured using HTTP basic auth (because it's simple) and I would like to archive it for myself.

  1. I need to authenticate using HTTP basic auth
  2. I would like to archive an entire subtree/path/subdomain, not just depth=1

But thinking about it further... ArchiveBox maybe isn't the right tool for this.
