Mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-04-25 17:16:00 +03:00)
[GH-ISSUE #74] Uniquely identify URLs by UUID/ULID/hash of url instead of archive timestamp #3074
Originally created by @cdzombak on GitHub (Mar 14, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/74
My Pinboard export contains several bookmarks with identical timestamps (presumably from imports from Delicious years ago).

The first time I run `archive.py`, I end up with several archive directories named like `1317249309`, `1317249309.0`, `1317249309.1`, …. These directory names correspond properly with entries in `index.json`, as expected.

If I run `archive.py` a second time with the same input, it appears to rewrite `index.json`, assigning different numerical suffixes to the `1317249309` timestamp. The entries in `index.json` no longer correspond with the contents of those archive directories on disk.

You can reproduce this with the following JSON file (`pinboard.json`):

Run the following commands:
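The original attachment and commands were not preserved in this mirror. As a hypothetical stand-in, a Pinboard-style export with duplicate timestamps (the trigger described above) could be generated like this; the field names follow Pinboard's JSON export format, but the values are illustrative:

```python
# Hypothetical reproduction data: a Pinboard-style export where several
# bookmarks share an identical "time" value, mimicking the Delicious-era
# duplicates described above. (Not the original attachment; values are
# illustrative.)
import json

bookmarks = [
    {"href": "https://example.com/a", "description": "Bookmark A", "time": "2011-09-28T22:35:09Z"},
    {"href": "https://example.com/b", "description": "Bookmark B", "time": "2011-09-28T22:35:09Z"},
    {"href": "https://example.com/c", "description": "Bookmark C", "time": "2011-09-28T22:35:09Z"},
]

with open("pinboard.json", "w") as f:
    json.dump(bookmarks, f, indent=2)
```

Feeding a file like this to the archiver twice should reproduce the suffix reshuffling.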
@pirate commented on GitHub (Mar 15, 2018):
I'm 90% sure this is due to the faulty cleanup/merging code I added recently. Can you try checking out `2aae6e0c27` (a known good version that I use on my server) and seeing if the problem exists there?

@cdzombak commented on GitHub (Mar 15, 2018):
I checked out `2aae6e0` locally and ran through the same process as described above. I still see the same thing — the same URL gets reassigned a different timestamp suffix each time I run the archiver, and `index.json` is no longer in sync with the disk.

FWIW, I'm going to solve this for myself by running the archiver on an export with these duplicate timestamps exactly once, then only running incremental updates via RSS, containing only newer entries with unique timestamps, in the future. That should at least let me avoid the issue.

@cdzombak commented on GitHub (Mar 16, 2018):
This does not work around the issue. After running `archive.py` on my Pinboard RSS feed containing only new links, all these very old links with duplicate timestamps seem to have been assigned different numbers, so my index.html/json are out of sync with what's on disk 😢

@pirate commented on GitHub (Apr 17, 2018):
Try pulling master or `1776bdf` and let me know if it works.

@pirate commented on GitHub (Apr 25, 2018):
I'm thinking about abolishing the incremental timestamp de-duping like `1523763242.1`, `1523763242.2`, `1523763242.3`, etc. because it's not really deterministic and was only causing problems.

The design is similar to buckets in a hash table to handle collisions, so I propose we take further inspiration from our hash-table roots and dedupe timestamps with a hash instead of an incrementing number:
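(The actual branch code is not shown in this thread. A minimal sketch of the idea, assuming the suffix is derived from a sha256 of the URL so the same URL always produces the same key, with a hypothetical `timestamp_hash_key` helper:)

```python
# Sketch: derive a deterministic "timestamp.hash" key instead of an
# incrementing ".1", ".2", ... suffix. Same URL -> same key, every run.
import hashlib

def timestamp_hash_key(timestamp: str, url: str, hash_chars: int = 8) -> str:
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:hash_chars]
    return f"{timestamp}.{url_hash}"

key = timestamp_hash_key("1523763242", "https://example.com")
```

Because the suffix depends only on the URL, re-running the archiver can never reshuffle folder names the way the incrementing scheme did.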
I'm testing this right now, I will push the code soon to a branch.

We might as well add a hash suffix to all links while we're at it. The `timestamp.hash` format as a primary key is very useful because it instantly makes all links unique and it retains the original timestamp order.

The real issue is migrating old archives to the new format. Right now a migration system doesn't really exist, and my last attempt to build one (`util.py:cleanup_archive()`) failed miserably and corrupted some people's archive folders. One of the main reasons I'm switching to Django is its excellent forwards & backwards migrations system.

Whatever new timestamp deduping solution we end up choosing will need to come with a migration script to force BA to reindex the links and move old folders to the new format.
@cdzombak commented on GitHub (May 23, 2018):
@pirate I finally had a chance to test this with the latest `master` (`a532d11549`).

Following the reproduction instructions in the OP issue, I end up with directories on disk whose index pages seem to line up with what `index.json` expects, but on further inspection the archive folders on disk contain resources from multiple archive entries. Further, the screenshots etc. are still mixed up. One example (note flickr screenshots for a non-flickr site):

@pirate commented on GitHub (May 24, 2018):
Thanks for the report @cdzombak, this is fairly critical, I'll take a look as soon as I can. In the meantime if you absolutely need it working I suggest writing a little script to pre-process your links to ensure they have unique timestamps.
@pirate commented on GitHub (Jun 11, 2018):
I found one of the bugs:
https://github.com/pirate/bookmark-archiver/blob/master/util.py#L281
Should be:
Very sneaky one-character bug 🤦‍♂️.
It will be fixed on master shortly.
@cdzombak commented on GitHub (Jun 11, 2018):
Oh, yikes. That's a tricky one to find.
If you let me know when that's fixed on `master`, I can re-run my test and let you know the result.

@pirate commented on GitHub (Aug 30, 2018):
@aurelg fyi you might be interested in following this issue
@pirate commented on GitHub (Jan 22, 2019):
A quick update to those waiting on this issue. This is still taking a lot of thought because there are some hard problems to consider, namely:
convenience of user access vs integrity of disk storage
Timestamps convey valuable information about when the website was archived, which is why other sites like archive.org and archive.is use them in URLs. I think timestamps will remain the primary way for users to access archived resources, but for database integrity and on-disk storage, it's much better to have things bucketed by a unique, immutable key. Because ArchiveBox needs to generate a static output, it can't just serve up two web endpoints that refer to one folder layout, it has to have both folder layouts accessible on disk and indexed statically. This means we have to use symlinks or hardlinks to represent a single folder layout without duplicating files.
folder and URL layout
We have to allow archives to be accessed by either hash OR timestamp to preserve backwards-compatibility.
If we change the directory structure, we'll have to create a second directory full of symlinks pointing to their equivalent folders.
Something like this could work:
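(The author's actual proposed layout is not preserved here. As an illustration of the symlink idea under assumed, hypothetical paths: keep the timestamp folders canonical, and expose a second hash-keyed view via symlinks:)

```python
# Sketch: canonical timestamp-named folders plus a parallel hash-keyed
# directory of symlinks, so both URL schemes resolve to one copy on disk.
# All paths here are hypothetical, not ArchiveBox's real layout.
import hashlib
import os
import tempfile

root = tempfile.mkdtemp()
canonical = os.path.join(root, "archive", "1317249309")
os.makedirs(canonical)

url = "https://example.com"
url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]

by_hash_dir = os.path.join(root, "archive-by-hash")
os.makedirs(by_hash_dir)
link = os.path.join(by_hash_dir, url_hash)
os.symlink(canonical, link)  # second entry point, no duplicated files
```

Both `archive/1317249309/` and `archive-by-hash/<hash>/` then resolve to the same folder, so a static export can serve either URL scheme.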
hash type
Some background: https://blog.codinghorror.com/url-shortening-hashes-in-practice/
I wanted to go with a base62 encoding of the first 32 bits of a sha256 for super-dense URL slugs, but unfortunately, macOS has a case-insensitive filesystem, so it's a disaster waiting to happen. We don't want two archives written to the same folder, and I'd rather explicitly pick a smaller hash algorithm that works for everyone than attempt to release two different hash options to users as a config var.
It seems dangerous to go with something so obscure for a potentially long-term project, but maybe a base32 of a few more sha256 bytes could work for URL- and filesystem-safe storage:
https://github.com/ulid/spec or https://github.com/jbittel/base32-crockford
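A sketch of why base32 sidesteps the macOS problem, using the stdlib's RFC 4648 base32 rather than Crockford's variant purely for illustration (the `url_slug` helper is hypothetical): base32 uses a single letter case, so two distinct slugs can never be conflated by a case-insensitive filesystem the way mixed-case base62 slugs can.

```python
# Sketch: filesystem-safe slug from the first few bytes of a sha256.
# RFC 4648 base32 (stdlib) is single-case, so a case-insensitive
# filesystem like macOS's cannot collapse two distinct slugs into one
# folder, unlike mixed-case base62.
import base64
import hashlib

def url_slug(url: str, nbytes: int = 5) -> str:
    digest = hashlib.sha256(url.encode("utf-8")).digest()[:nbytes]
    return base64.b32encode(digest).decode("ascii").rstrip("=")

slug = url_slug("https://example.com")  # 5 bytes -> exactly 8 base32 chars
```

Choosing `nbytes=5` gives an 8-character slug with no padding; more bytes lower the collision probability at the cost of longer URLs.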
migration
We have to carefully move all the archive data to the new format and link everything, and we only get one try because many people will run it the moment it's released
django server (this is done now)
The next highest priority issue is migrating to the new cli format + django server, and I think it will make this problem slightly easier because the database can keep track of timestamps and map them to hashes on disk.
Plan:
Rather than implement hashed storage on the current CLI ArchiveBox, I think I want to build the django server first, because it will allow me to run safe, rewindable migrations on the archive data without destroying people's folders by accident.
This migration will take place for users of the `./archive` CLI command as well.
Once the initial django version is released, all subsequent versions will automatically
migrate the data format forward to the latest schema when they start.
This should be a mostly invisible process to users as almost all migrations are non-destructive, and we will prompt to explain it to the user before doing destructive ones.
If any of you have ideas or input on this process, any help is welcome.
@karlicoss commented on GitHub (Apr 16, 2019):
Hey @pirate, thanks for your response! Some thoughts:
convenience of user access vs integrity of disk storage
Symlinks sound like a good compromise. However, there will still be an issue when two symlinks clash due to identical timestamps, right? But at least it won't be damaging to the actual backups.
I have to say, I don't really understand the concept of using historic timestamps from, say, a Pinboard backup or Chrome history. You can't retrieve the page as it was at that timestamp (sadly!); the only relevant timestamp is the current time, isn't it?
Also, if you are using historic timestamps and happen to have the same URL incoming from several sources, would they all end up as different archived directories? Sounds a bit wasteful...
hash type
sha256 is just 64 characters as hex, right? For URL shortening, that's a problem, I agree. But as part of an archive URL, which you presumably would not have to access that often, I don't think it's too bad.
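(For scale, a quick stdlib check confirms the 64-character figure, versus the ~8-character truncated slugs discussed above:)

```python
# A full sha256 hex digest is always 64 characters, regardless of input.
import hashlib

digest = hashlib.sha256(b"https://example.com").hexdigest()
print(len(digest))  # 64
```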
@pirate commented on GitHub (Apr 16, 2019):
Oh, I'm already halfway through the migration process away from timestamps, I forgot to update this issue :)

Edit: it's ended up taking longer than I expected.

Most of these problems go away as we start to use django more heavily, as the export folder structure can be changed dramatically now that we have a SQL database as the single source of truth with safe migrations.
In v0.4.0 I've already added hashes, and in a subsequent version they will become the primary unique key.
The archive will be served by django, with static folder exports becoming optional-only. This allows us to provide both timestamp and hash-based URLs via django, and static export format can be selected by specifying a flag like:
I might even add an option to do both with symlinks as discussed above, but for now I think letting the user decide is the simplest solution. Once we hear feedback from users on the new >v0.4.0 system, we can decide how to proceed with export formatting.
@pirate commented on GitHub (Dec 19, 2022):
We should use one of these better implementations instead of crockford-base32 directly:
@pirate commented on GitHub (May 12, 2024):
WIP: https://github.com/ArchiveBox/ArchiveBox/pull/1430/files