[GH-ISSUE #893] Feature Request: Whole-site archiving with link-rewriting to point to archived versions #555

Closed
opened 2026-03-01 14:44:32 +03:00 by kerem · 4 comments
Owner

Originally created by @charlesangus on GitHub (Nov 21, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/893

Type

  - [ ] General question or discussion
  - [x] Propose a brand new feature
  - [x] Request modification of existing behavior or design

What is the problem that your feature request solves

  • Archive an entire site, rewrite URLs to point to archives, so the archive is entirely self-contained. Like how archive.org does it - if I'm viewing an archived web page, the links I click point to other archived pages, not to the live versions on the web.
  • Organize snapshots imported this way hierarchically under the top-level domain, so essentially I can go to the top-level archived page, and then naturally navigate through the site, hitting only archived pages

If I'm archiving a site with, say, thousands of pages, each snapshot is an island unto itself, which basically breaks any kind of flow when revisiting the archived pages. To follow a link, I have to hover, see where the link is going, go back to the snapshots page, search for that link, and then click on the snapshot. Quite cumbersome.

Example

e.g. start page:

https://mygreatblog.com/

found links:

https://mygreatblog.com/2019/01/mygreatarticle.html
https://mygreatblog.com/2021/01/mygreatarticle-part-ii.html

ArchiveBox archives both pages and rewrites links so that internal links still work and point to local copies:

https://mygreatblog.com/ --> https://myarchiveboxinstance.com/archive/&lt;archive id&gt;/index.html
https://mygreatblog.com/2019/01/mygreatarticle.html --> https://myarchiveboxinstance.com/archive/&lt;archive id&gt;/index.html
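The mapping above could be sketched roughly as follows. This is a minimal illustration, not ArchiveBox's actual implementation: the snapshot paths in `ARCHIVE_MAP` are made-up timestamps (ArchiveBox stores snapshots under `archive/<timestamp>/`, but these values are hypothetical), and the regex only handles `href` attributes.

```python
import re

# Hypothetical mapping from original URLs to local snapshot paths.
# The timestamp directories are invented for illustration.
ARCHIVE_MAP = {
    "https://mygreatblog.com/": "/archive/1637500000.0/index.html",
    "https://mygreatblog.com/2019/01/mygreatarticle.html": "/archive/1637500001.0/index.html",
}

def rewrite_links(html: str, archive_map: dict) -> str:
    """Point archived internal links at their local snapshot;
    leave unarchived links untouched."""
    def repl(match):
        url = match.group(2)
        return match.group(1) + archive_map.get(url, url) + match.group(3)
    # Naive regex over href attributes -- a real implementation would use
    # an HTML parser and also rewrite src, srcset, CSS url(), etc.
    return re.sub(r'(href=")([^"]*)(")', repl, html)

page = ('<a href="https://mygreatblog.com/2019/01/mygreatarticle.html">Part I</a> '
        '<a href="https://example.com/">elsewhere</a>')
print(rewrite_links(page, ARCHIVE_MAP))
# The archived link is rewritten to its snapshot path; the external
# link to example.com is left pointing at the live web.
```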

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

Point to top-level page, say "archive whole site", and get an archive of the whole site, with all links pointing to the archived versions of the links.

What hacks or alternative solutions have you tried to solve the problem?

Currently using an external crawler to find all the pages on the site and adding them to ArchiveBox en masse, but this just gets me thousands of snapshots with no way to navigate between them.
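The discovery step such an external crawler performs can be sketched with the standard library alone: collect same-domain links from each fetched page, then feed the accumulated URL list to `archivebox add` (e.g. on stdin). This is a hypothetical sketch of one crawl step, not any particular crawler's code; fetching, recursion, and deduplication across pages are omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect same-domain links from one page's HTML -- the discovery
    step a crawler repeats before handing URLs to ArchiveBox."""
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.domain = urlparse(base_url).netloc
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative hrefs against the page URL, keep only
        # links on the same domain (skip external sites).
        absolute = urljoin(self.base, href)
        if urlparse(absolute).netloc == self.domain:
            self.links.add(absolute)

page = '''<a href="/2019/01/mygreatarticle.html">Part I</a>
<a href="https://mygreatblog.com/2021/01/mygreatarticle-part-ii.html">Part II</a>
<a href="https://othersite.com/">external</a>'''

c = LinkCollector("https://mygreatblog.com/")
c.feed(page)
for url in sorted(c.links):
    print(url)
```

Run over every discovered page until no new URLs turn up, the collected set is exactly what gets bulk-imported, which reproduces the "thousands of unlinked snapshots" situation described above until URL rewriting exists.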

How badly do you want this new feature?

  - [x] It's an urgent deal-breaker, I can't live without it
  - [ ] It's important to add it in the near-mid term future
  - [ ] It would be nice to have eventually

  - [ ] I'm willing to contribute dev time / money to fix this issue
  - [ ] I like ArchiveBox so far / would recommend it to a friend
  - [ ] I've had a lot of difficulty getting ArchiveBox set up
kerem 2026-03-01 14:44:32 +03:00
Author
Owner

@pirate commented on GitHub (Nov 23, 2021):

Duplicate of #191. 😉

Author
Owner

@charlesangus commented on GitHub (Nov 23, 2021):

Well, half a dupe, I'll admit.

Using an external crawler doesn't really work properly, because each archived page is a standalone thing.

In order to be able to browse the archives conveniently, ArchiveBox should rewrite URLs to point to the archived version.

If that was implemented, using an external crawler would basically work fine.

As it is, it works for incidental archiving of a page here and there, but trying to archive anything systematically gets impossible to browse.

Author
Owner

@pirate commented on GitHub (Nov 26, 2021):

URL rewriting is definitely a subset of the other issue, it wouldn't make sense to provide recursive whole-site archiving without it.

Author
Owner

@pirate commented on GitHub (Apr 11, 2024):

More detailed comment about URL rewriting for anyone landing here via Google: https://github.com/ArchiveBox/ArchiveBox/discussions/1395#discussioncomment-9063232
