[GH-ISSUE #893] Feature Request: Whole-site archiving with link-rewriting to point to archived versions #555

Closed
opened 2026-03-01 14:44:32 +03:00 by kerem · 4 comments
Owner

Originally created by @charlesangus on GitHub (Nov 21, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/893

Type

  - [ ] General question or discussion
  - [x] Propose a brand new feature
  - [x] Request modification of existing behavior or design

What is the problem that your feature request solves

  • Archive an entire site, rewrite URLs to point to archives, so the archive is entirely self-contained. Like how archive.org does it - if I'm viewing an archived web page, the links I click point to other archived pages, not to the live versions on the web.
  • Organize snapshots imported this way hierarchically under the top-level domain, so essentially I can go to the top-level archived page, and then naturally navigate through the site, hitting only archived pages

If I'm archiving a site with, say, thousands of pages, each snapshot is an island unto itself, which basically breaks any kind of flow when revisiting the archived pages. To follow a link, I have to hover, see where the link is going, go back to the snapshots page, search for that link, and then click on the snapshot. Quite cumbersome.

Example

e.g. start page:

https://mygreatblog.com/

found links:

https://mygreatblog.com/2019/01/mygreatarticle.html
https://mygreatblog.com/2021/01/mygreatarticle-part-ii.html

ArchiveBox archives both pages and rewrites links so that internal links still work and point to local copies:

https://mygreatblog.com/ --> https://myarchiveboxinstance.com/archive/&lt;archive id&gt;/index.html
https://mygreatblog.com/2019/01/mygreatarticle.html --> https://myarchiveboxinstance.com/archive/&lt;archive id&gt;/index.html
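The mapping above could be sketched roughly as follows. This is a minimal illustration, not ArchiveBox's actual implementation: the snapshot paths in `ARCHIVE_MAP` are made-up timestamps (ArchiveBox stores snapshots under `archive/<timestamp>/`, but these values are hypothetical), and the regex only handles `href` attributes.

```python
import re

# Hypothetical mapping from original URLs to local snapshot paths.
# The timestamp directories are invented for illustration.
ARCHIVE_MAP = {
    "https://mygreatblog.com/": "/archive/1637500000.0/index.html",
    "https://mygreatblog.com/2019/01/mygreatarticle.html": "/archive/1637500001.0/index.html",
}

def rewrite_links(html: str, archive_map: dict) -> str:
    """Point archived internal links at their local snapshot;
    leave unarchived links untouched."""
    def repl(match):
        url = match.group(2)
        return match.group(1) + archive_map.get(url, url) + match.group(3)
    # Naive regex over href attributes -- a real implementation would use
    # an HTML parser and also rewrite src, srcset, CSS url(), etc.
    return re.sub(r'(href=")([^"]*)(")', repl, html)

page = ('<a href="https://mygreatblog.com/2019/01/mygreatarticle.html">Part I</a> '
        '<a href="https://example.com/">elsewhere</a>')
print(rewrite_links(page, ARCHIVE_MAP))
# The archived link is rewritten to its snapshot path; the external
# link to example.com is left pointing at the live web.
```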

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

Point to top-level page, say "archive whole site", and get an archive of the whole site, with all links pointing to the archived versions of the links.

What hacks or alternative solutions have you tried to solve the problem?

Currently using an external crawler to find all the pages on the site and adding them to ArchiveBox en masse, but this just gets me thousands of snapshots with no way to navigate between them.
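The discovery step such an external crawler performs can be sketched with the standard library alone: collect same-domain links from each fetched page, then feed the accumulated URL list to `archivebox add` (e.g. on stdin). This is a hypothetical sketch of one crawl step, not any particular crawler's code; fetching, recursion, and deduplication across pages are omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect same-domain links from one page's HTML -- the discovery
    step a crawler repeats before handing URLs to ArchiveBox."""
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.domain = urlparse(base_url).netloc
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative hrefs against the page URL, keep only
        # links on the same domain (skip external sites).
        absolute = urljoin(self.base, href)
        if urlparse(absolute).netloc == self.domain:
            self.links.add(absolute)

page = '''<a href="/2019/01/mygreatarticle.html">Part I</a>
<a href="https://mygreatblog.com/2021/01/mygreatarticle-part-ii.html">Part II</a>
<a href="https://othersite.com/">external</a>'''

c = LinkCollector("https://mygreatblog.com/")
c.feed(page)
for url in sorted(c.links):
    print(url)
```

Run over every discovered page until no new URLs turn up, the collected set is exactly what gets bulk-imported, which reproduces the "thousands of unlinked snapshots" situation described above until URL rewriting exists.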

How badly do you want this new feature?

  - [x] It's an urgent deal-breaker, I can't live without it
  - [ ] It's important to add it in the near-mid term future
  - [ ] It would be nice to have eventually

  - [ ] I'm willing to contribute dev time / money to fix this issue
  - [ ] I like ArchiveBox so far / would recommend it to a friend
  - [ ] I've had a lot of difficulty getting ArchiveBox set up
kerem 2026-03-01 14:44:32 +03:00
Author
Owner

@pirate commented on GitHub (Nov 23, 2021):

Duplicate of #191. 😉

Author
Owner

@charlesangus commented on GitHub (Nov 23, 2021):

Well, half a dupe, I'll admit.

Using an external crawler doesn't really work properly, because each archived page is a standalone thing.

In order to be able to browse the archives conveniently, ArchiveBox should rewrite URLs to point to the archived version.

If that was implemented, using an external crawler would basically work fine.

As it is, it works for incidental archiving of a page here and there, but trying to archive anything systematically gets impossible to browse.

Author
Owner

@pirate commented on GitHub (Nov 26, 2021):

URL rewriting is definitely a subset of the other issue, it wouldn't make sense to provide recursive whole-site archiving without it.

Author
Owner

@pirate commented on GitHub (Apr 11, 2024):

More detailed comment about URL rewriting for anyone landing here via Google: https://github.com/ArchiveBox/ArchiveBox/discussions/1395#discussioncomment-9063232
