[GH-ISSUE #74] Uniquely identify URLs by UUID/ULID/hash of url instead of archive timestamp #54

Closed
opened 2026-03-01 14:40:11 +03:00 by kerem · 15 comments

Originally created by @cdzombak on GitHub (Mar 14, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/74

My Pinboard export contains several bookmarks with identical timestamps (presumably from imports from Delicious years ago).

The first time I run archive.py, I end up with several archive directories named like 1317249309, 1317249309.0, 1317249309.1, …. These directory names correspond properly with entries in index.json as expected.

If I run archive.py a second time with the same input, it appears to rewrite index.json, assigning different numerical suffixes to the 1317249309 timestamp. The entries in index.json no longer correspond with the contents of those archive directories on disk.

You can reproduce this with the following JSON file (pinboard.json):

```json
[{"href":"http:\/\/www.flickr.com\/groups\/photoshopsupport\/discuss\/72157600201629413\/","description":"Flickr: Discussing Index Of Topics: Compliments of LifeLive~ in Photoshop Support Group","extended":"","meta":"c9aa62c0eaa3c35a587903100870df43","hash":"8dd9951810c0eae6af67651341af5110","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography photoshop retouching"},
{"href":"http:\/\/allinthehead.com\/retro\/345\/whats-in-your-utility-belt","description":"What's In Your Utility Belt? \u2014 All in the head","extended":"","meta":"746e69822f36f2e78c16fc789a7545b5","hash":"ac4d0527bca6c7d6741fee117f45f631","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"php"},
{"href":"http:\/\/www.tyndellphotographic.com\/plasticwallet.html","description":"Plastic Wallet Boxes for Wallet sized photos","extended":"","meta":"c133eb53f29d97c35c3f31768ff7ce45","hash":"60bbf228c559518b818ed7d0ff997a69","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography supply"},
{"href":"http:\/\/www.arduino.cc\/","description":"Arduino - HomePage","extended":"","meta":"a80835b5f374965f5f8a5990da6cf2be","hash":"78532ff2155cd9feeac11aba18739bdc","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"arduino elecdiy"},
{"href":"http:\/\/mbed.org\/","description":"Rapid Prototyping for Microcontrollers | mbed","extended":"","meta":"644e8e0c9ae522eb1ca025c2af604f7d","hash":"fd2d014879e63a9aca6c18eb11e19b02","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy"},
{"href":"http:\/\/www.tasankokaiku.com\/jarse\/?p=268","description":"Jarse \u00bb Blog Archive \u00bb Kohtauskone","extended":"","meta":"8483f7b4d0423ddd0930142c55c909e3","hash":"e971d3670f0fe1b2638c343e458f88bd","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy arduino dmx512"}]
```

Run the following commands:

```shell
./archive.py ~/path/to/pinboard.json
# contents on disk match up with contents of index.json

./archive.py ~/path/to/pinboard.json
# timestamp suffixes in index.json have been changed and no longer match content on disk
```
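For illustration, the instability can be reproduced with a toy version of incremental suffix assignment (a hypothetical sketch, not ArchiveBox's actual code): the suffix a URL receives depends purely on processing order, so any change in input ordering between runs reshuffles the folder names.

```python
# Hypothetical sketch of order-dependent timestamp deduping (not
# ArchiveBox's actual implementation): the first link with a given
# timestamp keeps it bare, later ones get .0, .1, ... in iteration order.
def dedupe_timestamps(links):
    seen = {}
    for link in links:
        ts = link['timestamp']
        if ts in seen:
            seen[ts] += 1
            link['timestamp'] = f'{ts}.{seen[ts]}'
        else:
            seen[ts] = -1  # first occurrence keeps the bare timestamp
    return links

run1 = dedupe_timestamps([{'url': 'a', 'timestamp': '1317249309'},
                          {'url': 'b', 'timestamp': '1317249309'}])
run2 = dedupe_timestamps([{'url': 'b', 'timestamp': '1317249309'},
                          {'url': 'a', 'timestamp': '1317249309'}])
# run1 assigns '1317249309.0' to 'b'; run2 assigns it to 'a', so folders
# created during the first run no longer match a regenerated index.
```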

@pirate commented on GitHub (Mar 15, 2018):

I'm 90% sure this is due to the faulty cleanup/merging code I added recently. Can you try checking out 2aae6e0c27 (a known good version that I use on my server) and seeing if the problem exists there?


@cdzombak commented on GitHub (Mar 15, 2018):

I checked out 2aae6e0 locally and ran through the same process as described above. I still see the same thing — the same URL gets reassigned a different timestamp suffix each time I run the archiver, and index.json is no longer in sync with the disk.

~~FWIW, I'm going to solve this for myself by running the archiver on an export with these duplicate timestamps _exactly_ once, then only running incremental updates via RSS, containing _only_ newer entries with unique timestamps, in the future. That should at least let me avoid the issue.~~


@cdzombak commented on GitHub (Mar 16, 2018):

> FWIW, I'm going to solve this for myself by running the archiver on an export with these duplicate timestamps _exactly_ once, then only running incremental updates via RSS, containing _only_ newer entries with unique timestamps, in the future. That should at least let me avoid the issue.

This does not work around the issue. After running archive.py on my Pinboard RSS feed containing only new links, all these very old links with duplicate timestamps seem to have been assigned different numbers, so my index.html/json are out of sync with what's on disk 😢


@pirate commented on GitHub (Apr 17, 2018):

Try pulling master or 1776bdf and let me know if it works.


@pirate commented on GitHub (Apr 25, 2018):

I'm thinking about abolishing the incremental timestamp de-duping like: 1523763242.1, 1523763242.2, 1523763242.3, etc. because it's not really deterministic and was only causing problems.

The design is similar to buckets in a hash table to handle collisions, so I propose we take further inspiration from our hash-table roots and dedupe timestamps with a hash instead of an incrementing number:

I'm testing this right now, I will push the code soon to a branch:

```python
from hashlib import sha256

url_hash = sha256(link['url'].encode('utf-8')).hexdigest()
uniqueish_suffix = str(int(url_hash, base=16))[:10]  # ~10^9 is probably enough imo
link['timestamp'] = f'{link["timestamp"]}.{uniqueish_suffix}'

# timestamp   hash_of_url
# 1523763242.090329842341
```

We might as well add a hash suffix to all links while we're at it. The timestamp.hash format as a primary key is very useful because it instantly makes all links unique and retains the original timestamp order.

The real issue is migrating old archives to the new format. Right now a migration system doesn't really exist, and my last attempt to build one (`util.py:cleanup_archive()`) failed miserably and corrupted some people's archive folders. One of the main reasons I'm switching to Django is its excellent forwards & backwards migrations system.

Whatever new timestamp deduping solution we end up choosing will need to come with a migration script to force BA to reindex the links and move old folders to the new format.
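The key property of the hash-suffix proposal above is determinism: the suffix depends only on the URL, never on processing order. A quick sketch (with `hash_suffix` as a hypothetical helper name):

```python
from hashlib import sha256

def hash_suffix(url):
    # First 10 decimal digits of the URL's sha256, as proposed above.
    return str(int(sha256(url.encode('utf-8')).hexdigest(), 16))[:10]

# Stable across runs and independent of input order, so re-running the
# archiver on the same input can never reshuffle folder names.
assert hash_suffix('http://example.com') == hash_suffix('http://example.com')
```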


@cdzombak commented on GitHub (May 23, 2018):

@pirate I finally had a chance to test this with the latest master (a532d11549).

Following the reproduction instructions in the OP issue, I end up with directories on disk whose index pages seem to line up with what index.json expects, but on further inspection the archive folders on disk contain resources from multiple archive entries. Further, the screenshots etc. are still mixed up. One example (note flickr screenshots for a non-flickr site):

<img width="1465" alt="screen shot 2018-05-23 at 17 00 18" src="https://user-images.githubusercontent.com/102904/40450852-d6180cd0-5eaa-11e8-9bd6-5534324c262e.png">

@pirate commented on GitHub (May 24, 2018):

Thanks for the report @cdzombak, this is fairly critical, I'll take a look as soon as I can. In the meantime if you absolutely need it working I suggest writing a little script to pre-process your links to ensure they have unique timestamps.


@pirate commented on GitHub (Jun 11, 2018):

I found one of the bugs:

https://github.com/pirate/bookmark-archiver/blob/master/util.py#L281

```python
archive_org_txt = os.path.join(ARCHIVE_DIR, 'html/archive' + folder, 'archive.org.txt')
```

Should be:

```python
archive_org_txt = os.path.join(ARCHIVE_DIR, 'html/archive', folder, 'archive.org.txt')
```

Very sneaky one-character bug 🤦‍♂️.

It will be fixed on master shortly.


@cdzombak commented on GitHub (Jun 11, 2018):

Oh, yikes. That's a tricky one to find.

If you let me know when that's fixed on master, I can re-run my test and let you know the result.


@pirate commented on GitHub (Aug 30, 2018):

@aurelg fyi you might be interested in following this issue


@pirate commented on GitHub (Jan 22, 2019):

A quick update to those waiting on this issue. This is still taking a lot of thought because there are some hard problems to consider, namely:

  • convenience of user access vs integrity of disk storage
    Timestamps convey valuable information about when the website was archived, which is why other sites like archive.org and archive.is use them in URLs. I think timestamps will remain the primary way for users to access archived resources, but for database integrity and on-disk storage, it's much better to have things bucketed by a unique, immutable key. Because ArchiveBox needs to generate a static output, it can't just serve up two web endpoints that refer to one folder layout, it has to have both folder layouts accessible on disk and indexed statically. This means we have to use symlinks or hardlinks to represent a single folder layout without duplicating files.

  • folder and URL layout
    We have to allow archives to be accessed by either hash OR timestamp to preserve backwards-compatibility.
    If we change the directory structure, we'll have to create a second directory full of symlinks pointing to their equivalent folders.
    Something like this could work:

    ```
    output/
        index.html
        index.json
        archive/
            <timestamp>     -> output/assets/<hash>
        assets/
            <hash>/
                index.html
                index.json
        ...
    ```
    
    
  • hash type
    Some background: https://blog.codinghorror.com/url-shortening-hashes-in-practice/

    I wanted to go with a [base62](https://github.com/suminb/base62/blob/develop/base62.py) encoding of the first 32 bits of a sha256 for super dense URL slugs, but unfortunately, macOS has a case-insensitive filesystem, so it's a disaster waiting to happen. We don't want two archives written to the same folder, and I'd rather explicitly pick a smaller hash algorithm that works for everyone than attempt to release two different hash options to users as a config var.

    It seems dangerous to go with something so obscure for a potentially long-term project, but maybe a base32 of a few more sha256 bytes could work for URL and filesystem safe storage:

    ```python
    In [1]: base32_crockford.encode(int(hashlib.sha256(url).hexdigest(), 16) % (10 ** 32))   # take the first 32 bits out of 64
    Out[1]: '7P6HMQR2VTC7P6HMQR2VTC'
    ```
    

    https://github.com/ulid/spec or https://github.com/jbittel/base32-crockford

  • migration
    We have to carefully move all the archive data to the new format and link everything, and we only get one try, because many people will run it the moment it's released.

  • ~~django server~~ (this is done now)
    The next highest priority issue is migrating to the new cli format + django server, and I think it will make this problem slightly easier because the database can keep track of timestamps and map them to hashes on disk.
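The base32-over-sha256 idea from the hash type bullet above can be sketched with only the standard library, using the RFC 4648 base32 alphabet (uppercase A-Z plus 2-7) rather than Crockford's; `folder_slug` and the 10-byte prefix length are illustrative choices, not ArchiveBox's implementation:

```python
import base64
import hashlib

def folder_slug(url, nbytes=10):
    # 10 bytes (80 bits) of sha256 -> 16 base32 chars, with no '=' padding
    # since 80 bits divides evenly into 5-bit base32 groups. The alphabet
    # is single-case, so slugs stay unique on case-insensitive filesystems.
    digest = hashlib.sha256(url.encode('utf-8')).digest()
    return base64.b32encode(digest[:nbytes]).decode('ascii')

print(folder_slug('https://example.com'))  # a 16-character slug
```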

Plan:

Rather than implement hashed storage on the current CLI ArchiveBox, I think I want to build the django sever first, because it will allow me to run safe, rewindable migrations on the archive data without destroying people's folders by accident.

1. create django server and script to load existing archive folder into db
2. add sha256 hash field with database migration
3. serve both urls `/<hash>/example.com/index.html` and `/<timestamp>/example.com/index.html`
4. export archive to new folder layout using new sha256 hash folders
5. continue serving both url types with data from new folder layout

This migration will take place for users of the ./archive CLI command as well.
Once the initial django version is released, all subsequent versions will automatically
migrate the data format forward to the latest schema when they start.
This should be a mostly invisible process to users as almost all migrations are non-destructive, and we will prompt to explain it to the user before doing destructive ones.

If any of you have ideas or input on this process, any help is welcome.


@karlicoss commented on GitHub (Apr 16, 2019):

Hey @pirate, thanks for your response! Some thoughts:

  • convenience of user access vs integrity of disk storage

    > I think timestamps will remain the primary way for users to access archived resources

    > This means we have to use symlinks or hardlinks to represent a single folder layout without duplicating files.

    Symlinks sound like a good compromise. However, there will still be an issue when two symlinks clash due to the same timestamps, right? But at least it won't be damaging to the actual backups.

    Have to say, I don't really understand the concept of using historic timestamps from, say, Pinboard backup or chrome history. You can't retrieve the page at the time of that timestamp (sadly!), the only relevant timestamp is the current time, isn't it?
    Also, if you are using historic timestamps and happened to have same URL incoming from several sources, would they all end up as different archived directories? Sounds a bit wasteful...

  • hash type
    sha256 is just 64 characters as hex, right? For URL shortening, it's a problem, agreed. But as part of an archive URL, which you presumably would not have to access that often, I don't think it's too bad.
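As a quick check of the length claim (a sha256 hex digest is 256 bits at 4 bits per hex character):

```python
import hashlib

digest = hashlib.sha256(b'https://example.com').hexdigest()
print(len(digest))  # 64: 256 bits / 4 bits per hex character
```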


@pirate commented on GitHub (Apr 16, 2019):

~~Oh I'm already halfway through the migration process away from timestamps, I forgot to update this issue :)~~ Edit: it's ended up taking longer than I expected.

Most of these problems go away as we start to use django more heavily, as the export folder structure can be changed dramatically now that we have a SQL database as the single-source-of-truth with safe migrations.

In v0.4.0 I've already added hashes, and in a subsequent version they will become the primary unique key.

The archive will be served by django, with static folder exports becoming optional-only. This allows us to provide both timestamp and hash-based URLs via django, and static export format can be selected by specifying a flag like:

```bash
archivebox export --folders=timestamp
# or
archivebox export --folders=hash
```

I might even add an option to do both with symlinks as discussed above, but for now I think letting the user decide is the simplest solution. Once we hear feedback from users on the new >v0.4.0 system we can decide how to proceed with export formatting.


@pirate commented on GitHub (Dec 19, 2022):

We should use one of these better implementations instead of crockford-base32 directly:

  • https://github.com/jetify-com/typeid
  • https://uuid7.com/
  • https://github.com/ulid/spec

```
 01AN4Z07BY      79KA1307SR9X4MV3

|----------|    |----------------|
 Timestamp          Randomness
   48bits             80bits
   10char             16char
```

@pirate commented on GitHub (May 12, 2024):

WIP: https://github.com/ArchiveBox/ArchiveBox/pull/1430/files
