Mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-04-25 17:16:00 +03:00)
[GH-ISSUE #893] Feature Request: Whole-site archiving with link-rewriting to point to archived versions #3573
Originally created by @charlesangus on GitHub (Nov 21, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/893
What is the problem that your feature request solves?
If I'm archiving a site with, say, thousands of pages, each snapshot is an island unto itself, which basically breaks any kind of browsing flow across the archived pages. To follow a link, I have to hover over it to see where it's going, go back to the snapshots page, search for that URL, and then click on its snapshot. Quite cumbersome.
Example
e.g. start page:
https://mygreatblog.com/
found links:
https://mygreatblog.com/2019/01/mygreatarticle.html
https://mygreatblog.com/2021/01/mygreatarticle-part-ii.html
ArchiveBox archives both pages and rewrites their links so that internal links still work and point to the local copies:
https://mygreatblog.com/ --> https://myarchiveboxinstance.com/archive//index.html
https://mygreatblog.com/2019/01/mygreatarticle.html --> https://myarchiveboxinstance.com/archive/index.html
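The rewriting step the example describes could be sketched roughly as follows. This is a minimal illustration, not ArchiveBox code: the `ARCHIVE_MAP` mapping and its timestamped snapshot paths are hypothetical stand-ins for however the index would map original URLs to archived copies.

```python
import re
from urllib.parse import urljoin

# Hypothetical mapping from original URLs to archived snapshot URLs.
# The timestamped paths here are made up purely for illustration.
ARCHIVE_MAP = {
    "https://mygreatblog.com/":
        "https://myarchiveboxinstance.com/archive/1637500000/index.html",
    "https://mygreatblog.com/2019/01/mygreatarticle.html":
        "https://myarchiveboxinstance.com/archive/1637500001/index.html",
}

def rewrite_links(html: str, base_url: str, archive_map: dict) -> str:
    """Rewrite href attributes whose targets have an archived copy.

    Links without an archived copy are left untouched. A naive
    regex-based sketch; a real implementation would use an HTML parser.
    """
    def repl(match):
        # Resolve relative links against the page URL before lookup.
        target = urljoin(base_url, match.group(1))
        archived = archive_map.get(target)
        return f'href="{archived}"' if archived else match.group(0)

    return re.sub(r'href="([^"]*)"', repl, html)
```

So a relative internal link like `href="/2019/01/mygreatarticle.html"` on the start page would be rewritten to its archived copy, while links to pages that were never archived stay pointing at the live web.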
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
Point to top-level page, say "archive whole site", and get an archive of the whole site, with all links pointing to the archived versions of the links.
What hacks or alternative solutions have you tried to solve the problem?
Currently using an external crawler to find all the pages on the site and adding them to ArchiveBox en masse, but this just gets me thousands of snapshots with no way to navigate between them.
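That external-crawler workaround amounts to a breadth-first crawl that collects same-domain URLs and hands them to ArchiveBox in one batch. A rough sketch, under obvious simplifications (no robots.txt handling, rate limiting, or non-HTML filtering; the helper names are my own):

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

def extract_internal_links(html: str, base_url: str) -> list:
    """Return absolute same-domain links found in href attributes."""
    domain = urlparse(base_url).netloc
    links = []
    for href in re.findall(r'href="([^"#]+)"', html):
        link = urljoin(base_url, href)
        if urlparse(link).netloc == domain:
            links.append(link)
    return links

def crawl_site(start_url: str, max_pages: int = 1000) -> list:
    """Breadth-first crawl of one site, returning the pages visited."""
    seen, queue, found = set(), deque([start_url]), []
    while queue and len(found) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        found.append(url)
        queue.extend(extract_internal_links(html, url))
    return found

# The collected URLs can then be fed to ArchiveBox in one batch,
# e.g. by writing them to a file and passing it to `archivebox add`.
```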
How badly do you want this new feature?
@pirate commented on GitHub (Nov 23, 2021):
Duplicate of #191. 😉
@charlesangus commented on GitHub (Nov 23, 2021):
Well, half a dupe, I'll admit.
Using an external crawler doesn't really work properly, because each archived page is a standalone thing.
In order to be able to browse the archives conveniently, ArchiveBox should rewrite URLs to point to the archived versions.
If that was implemented, using an external crawler would basically work fine.
As it is, it works for incidental archiving of a page here and there, but anything archived systematically becomes impossible to browse.
@pirate commented on GitHub (Nov 26, 2021):
URL rewriting is definitely a subset of the other issue, it wouldn't make sense to provide recursive whole-site archiving without it.
@pirate commented on GitHub (Apr 11, 2024):
More detailed comment about URL rewriting for anyone landing here via Google: https://github.com/ArchiveBox/ArchiveBox/discussions/1395#discussioncomment-9063232