starred/ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-25 09:06:02 +03:00

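[GH-ISSUE #1085] Enhancement: Use the same URL layout as Archive.org for viewing ArchiveBox Snapshots `https://archive.org/web/<URL>` #678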

Open

opened 2026-03-01 14:45:28 +03:00 by kerem · 3 comments

kerem commented

2026-03-01 14:45:28 +03:00

Owner

Originally created by @pirate on GitHub (Jan 17, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1085

To visit an archived version of a website (or archive it automatically) on Archive.org, one can just visit `http://web.archive.org/web/https://example.com/` and it will redirect to `http://web.archive.org/web/20230116145642/https://example.com/` (or whatever the most recent snapshot timestamp is).

To really embody the tagline "ArchiveBox is a self-hosted version of archive.org" we should properly support their URL scheme too.

e.g.

- `https://demo.archivebox.io/web/https://example.com` should redirect to the most recent snapshot `https://demo.archivebox.io/web/20230116145642/https://example.com`
  - note: support both the ArchiveBox-style unix timestamp format, e.g. `1673919713`, and the Archive.org-style `20230116145642` format, including the truncated forms `2023`, `202301`, `20230116`
  - note: also support visiting snapshots using a [ulid](https://github.com/ulid/spec) uuid instead of a timestamp as the slug, e.g. `https://demo.archivebox.io/01ARZ3NDEKTSV4RRFFQ69G5FAV/...`
  - note: support auto prefix-matching slugs so that `2023` matches `202301`, `20230116`, `20230116145642` automatically, and `01AN4Z07BY` matches `01AN4Z07BY79KA1307SR9X4MV3` automatically

Full spec:

`https://demo.archivebox.io/web/<SLUG>` where `SLUG` can be:

- an original URL, with or without scheme, e.g. `https://example.com/index.html`, `example.com/index.html` ➡️ redirect to the most recent snapshot for that URL, `https://demo.archivebox.io/web/20230116145642/https://example.com/index.html`
- an ArchiveBox snapshot UUID in [`ulid/spec`](https://github.com/ArchiveBox/ArchiveBox/issues/74) format `01AN4Z07BY79KA1307SR9X4MV3/index.html` or timestamp prefix `01AN4Z07BY/index.html` ➡️ redirect to that exact snapshot `https://demo.archivebox.io/web/20230116145642/https://example.com/index.html`
- an ArchiveBox snapshot timestamp in `YYYYMMDDHHMMSS`, shortened forms like `YYYYMM`, or unix timestamp format, e.g. `20230116145642/index.html`, `202301161456/index.html`, `202301/index.html`, or `1673919713/index.html` ➡️ redirect to the most recent snapshot matching that prefix `https://demo.archivebox.io/web/20230116145642/https://example.com/index.html`

Subtasks:

- [x] add derived `ulid` field + migration to coalesce old uuid and timestamp fields into the new ulid format (+ assert all snapshot timestamps are valid and are between 1900 and 2100 AD) (done in v0.8.5)
- [x] update admin and index UI to show the ULID instead of the old UUID4 `xxxx-xxxx-xxxxxxx` format, add a ULID diagram in the docs breaking it down into timestamp and randomness
- [x] create a disambiguation page to show all the matching results for a given SLUG if it's the prefix for multiple possible snapshots
- [ ] reject Snapshot UUIDs being created that begin with `0`, `1`, `2`, `htt` to make prefix-matching faster and less error prone (avoids clashing with `199x*`/`20**` year, `1*` unix timestamp, `01*` ULIDs, or `http(s?)` URL slug prefixes)
- [ ] add docs examples on how to truly "self-host your own archive.org", add a side-by-side screenshot of URL bar examples for visiting snapshots on Archive.org and demo.archivebox.io

<img width="1295" alt="image" src="https://user-images.githubusercontent.com/511499/212802210-925be4a7-39ad-4474-8290-c3441bf4738d.png">
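The slug rules in the spec above could be sketched roughly as follows. This is only an illustration under stated assumptions, not ArchiveBox's actual routing code: `classify_slug` and the regex names are hypothetical, ULID prefixes are assumed to start with `01`, and scheme-less URLs are treated as the fallback case.

```python
import re

# Crockford base32 alphabet used by ULIDs (excludes I, L, O, U).
ULID_CHARS = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
# Assumption: ULID slugs start with "01"; any prefix up to the full 26 chars is accepted.
ULID_RE = re.compile(rf"^01[{ULID_CHARS}]{{0,24}}$")
# Archive.org-style timestamps: YYYY, YYYYMM, YYYYMMDD, ... up to YYYYMMDDHHMMSS.
DATE_RE = re.compile(r"^(?:19|20)\d{2}(?:\d{2}){0,5}$")
# 10-digit unix timestamps such as 1673919713.
UNIX_RE = re.compile(r"^1\d{9}$")
SCHEME_RE = re.compile(r"^https?://", re.IGNORECASE)


def classify_slug(slug: str) -> tuple[str, str, str]:
    """Split a /web/<SLUG> value into (kind, identifier, remaining_path)."""
    if SCHEME_RE.match(slug):
        return ("url", slug, "")
    head, _, rest = slug.partition("/")
    # Dates are checked first, so a 10-digit unix timestamp that begins
    # with "19" would be read as a date prefix by this ordering.
    if DATE_RE.match(head):
        return ("date_prefix", head, rest)
    if UNIX_RE.match(head):
        return ("unix_timestamp", head, rest)
    if ULID_RE.match(head.upper()):
        return ("ulid_prefix", head.upper(), rest)
    # Anything else is treated as an original URL given without a scheme.
    return ("url", slug, "")
```

The ordering of the checks is the design choice that matters here: it is the same ambiguity the unfinished subtask about rejecting identifiers starting with `0`, `1`, `2`, or `htt` is meant to reduce.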

kerem added the `why: functionality`, `size: medium`, `touches: docs`, `touches: API/CLI/Spec`, `status: wip`, `touches: views/replayers/html/css` labels 2026-03-01 14:45:28 +03:00

kerem commented

2026-03-01 14:45:29 +03:00

Author

Owner

@ArrayBolt3 commented on GitHub (Nov 20, 2024):

At least one project interested in using ArchiveBox (Kicksecure) would also be interested in this functionality, or any functionality that allows turning a URL into an archived URL via a simple transformation (i.e., prepend `https://archivebox.example.org/whatever/goes/here/` to a URL to get an archived URL). The use case for this is:

* We have two very large MediaWiki instances, with wikis that contain many links to external websites.
* For each of those links, we want to link to an archived version of the page we link to.
* If the corresponding URL for `https://example.com/my-page` is `https://archivebox.example.com/BN833Z` or something like that, there's no easy way to convert a link to an ArchiveBox link. Thus adding the archive links requires running a large "archive job" that archives all unarchived links, then gets the corresponding URLs and mass-edits them into the Wiki. This is a pain.
* If the corresponding URL for `https://example.com/my-page` is `https://archivebox.example.com/web/https://example.com/my-page`, no mass-editing is required. A MediaWiki plugin can be used to put a button after each link that offers an archived version of the webpage to the user. (This is what we already do with archive.org.)

Worthy of note, the format doesn't have to be *exactly* like archive.org for this to work. If ArchiveBox supported the MementoWeb API similar to how archive.today does, we would end up turning `https://example.com/my-page` into `https://archivebox.example.com/timegate/https://example.com/my-page`, which works just as well.

Is help wanted here? Depending on how suitable ArchiveBox is for Kicksecure's use case, this might be a feature we'd be willing to implement and work on upstreaming.
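The "simple transformation" described in this comment amounts to string prefixing. A minimal sketch, assuming a hypothetical helper `to_archive_url` and the example hostnames from the comment (neither is part of ArchiveBox or any MediaWiki plugin):

```python
def to_archive_url(original_url: str,
                   archive_base: str = "https://archivebox.example.com/web/") -> str:
    """Prepend the archive prefix to an external link, leaving the link itself untouched."""
    # Normalise the base so exactly one slash separates it from the original URL.
    return archive_base.rstrip("/") + "/" + original_url


# The same helper covers an archive.org-style prefix and a timegate-style prefix.
wayback_style = to_archive_url("https://example.com/my-page")
timegate_style = to_archive_url(
    "https://example.com/my-page",
    archive_base="https://archivebox.example.com/timegate/",
)
```

Because the archived link is derived purely from the original link, no lookup table or mass-edit of the wiki is needed, which is the point the comment makes.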

kerem commented

2026-03-01 14:45:29 +03:00

Author

Owner

@pirate commented on GitHub (Nov 20, 2024):

This is actually already supported 😃 It's just not well documented yet. You can visit `https://archivebox.example.com/archive/https://example.com/archived/url`, e.g.:
https://demo.archivebox.io/archive/https://arstechnica.com/tech-policy/2024/10/the-internet-archive-and-its-916-billion-saved-webpages-are-back-online/

Note you can also put any identifier for a snapshot after `/archive/` and it will redirect correctly, e.g.:

- `https://demo.archivebox.io/archive/<snapshot_timestamp>`
- `https://demo.archivebox.io/archive/<snapshot_URL>`
- `https://demo.archivebox.io/archive/<snapshot_UUID>`
- `https://demo.archivebox.io/archive/<snapshot_ABID>` (a new publicly sharable ID format added in >=v0.8.5, designed to make sharing snapshots between federated/distributed servers easier in future releases)

The REST API and admin pages for editing snapshots also allow fetching by any identifier (in >=`v0.8.5`):

- `https://archivebox.phantasm.group/admin/core/snapshot/<snapshot UUID> or <timestamp> or <ABID>`
- `https://archivebox.phantasm.group/api/v1/core/snapshot/<snapshot UUID> or <timestamp> or <ABID>`

(Using the URL is not supported for these yet because I don't think it's needed as much for admins/API users.)

In all cases you can also provide just the first few characters of the identifier to do a prefix search for all matching snapshots, e.g. to see all snapshots for `https://arstechnica.com/*` you can visit: https://demo.archivebox.io/archive/https://arstechnica.com/

![Image](https://github.com/user-attachments/assets/cedfc0b1-f4e7-407c-957c-e4964721edf1)

We use this feature extensively with several of our paying clients who have needs similar to what you describe.

It's not fully compatible with archive.org / memento, but I have plans to make it cross-compatible with both in the future, which is what this ticket is meant to track.
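The prefix-search behaviour described in this comment can be pictured with a small in-memory sketch. Everything here is hypothetical (the `Snapshot` fields, the `resolve` function, and the sample identifiers are made up for illustration); the real lookup lives in ArchiveBox's own views and database queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    timestamp: str  # unix timestamp stored as a string
    url: str        # original URL that was archived
    ident: str      # stand-in for a UUID/ABID-style identifier


def resolve(snapshots: list[Snapshot], prefix: str) -> tuple[str, list[Snapshot]]:
    """Match a prefix against every identifier field and decide what to show."""
    matches = [
        s for s in snapshots
        if any(value.startswith(prefix) for value in (s.timestamp, s.url, s.ident))
    ]
    if not matches:
        return ("not_found", [])
    if len(matches) == 1:
        # A unique match can be redirected to directly.
        return ("redirect", matches)
    # Several matches: list them, newest first, like a disambiguation page.
    return ("list", sorted(matches, key=lambda s: float(s.timestamp), reverse=True))
```

With this shape, a full identifier and a short prefix go through the same code path; only the number of matches decides between a redirect and a listing.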

kerem commented

2026-03-01 14:45:29 +03:00

Author

Owner

@ArrayBolt3 commented on GitHub (Nov 20, 2024):

Oh nice! Thanks for the info!

kerem referenced this issue

2026-03-01 17:55:08 +03:00

[GH-ISSUE #678] Error on Windows 10 when adding URL: UnicodeEncodeError: 'charmap' codec can't encode: character maps to <undefined> #1937

kerem referenced this issue

2026-03-14 22:56:45 +03:00

[GH-ISSUE #678] Error on Windows 10 when adding URL: UnicodeEncodeError: 'charmap' codec can't encode: character maps to <undefined> #3447

[GH-ISSUE #1085] Enhancement: Use the same URL layout as Archive.org for viewing ArchiveBox Snapshots `https://archive.org/web/<URL>` #678