[GH-ISSUE #119] History entry timestamps aren't accurate #81

Closed
opened 2026-03-01 14:40:27 +03:00 by kerem · 3 comments

Originally created by @kergoth on GitHub (Dec 3, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/119

Firefox uses PRTime and Chrome uses WebKit timestamps, and neither matches bookmark-archiver's timestamp expectations as is. Firefox's timestamps need to be multiplied by 10, otherwise this year's history entries show up as 1974. Chrome's timestamps are in microseconds from 1601. To work around this, use `(last_visit_time-11644446702000000)*10` rather than `last_visit_time` for Chrome, and `last_visit_date*10` rather than `last_visit_date` for Firefox. I'm also testing the addition of Safari history export, but its dates require more massaging than the other two: they're Mac Absolute Time, in `<seconds from 2001>.<microseconds>` form, and just multiplying to eliminate the decimal doesn't work because the microseconds lack leading zero padding.

For reference, see:

- https://www.oreilly.com/library/view/practical-mobile-forensics/9781788839198/a78b0a6a-e0f5-45b8-bce9-b82c3e7a3b7a.xhtml
- https://linuxsleuthing.blogspot.com/2011/06/decoding-google-chrome-timestamps-in.html
- https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/Reference/PRTime
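The three formats described above can all be normalized to plain Unix seconds. What follows is a minimal sketch, not the project's parser: the function names are hypothetical, and the output is Unix seconds rather than whatever scale bookmark-archiver's parser expected at the time (the `*10` in the workaround is specific to that parser). It uses the commonly cited 1601-to-1970 offset of 11644473600 seconds, which differs from the constant in the workaround above. It also assumes the Safari value arrives as a decimal string, which sidesteps the padding problem by parsing the number instead of stripping the decimal point.

```python
from datetime import datetime, timezone
from decimal import Decimal

# Seconds from 1601-01-01 to 1970-01-01 (WebKit epoch to Unix epoch).
WEBKIT_EPOCH_OFFSET_S = 11_644_473_600
# Seconds from 1970-01-01 to 2001-01-01 (Unix epoch to Mac Absolute Time epoch).
MAC_EPOCH_OFFSET_S = 978_307_200
MICROS_PER_SECOND = 1_000_000


def firefox_to_unix(last_visit_date: int) -> float:
    """PRTime: microseconds since the Unix epoch."""
    return last_visit_date / MICROS_PER_SECOND


def chrome_to_unix(last_visit_time: int) -> float:
    """WebKit time: microseconds since 1601-01-01 UTC."""
    # Subtract the offset in integer microseconds before dividing,
    # so no precision is lost on the large intermediate value.
    unix_micros = last_visit_time - WEBKIT_EPOCH_OFFSET_S * MICROS_PER_SECOND
    return unix_micros / MICROS_PER_SECOND


def safari_to_unix(visit_time: str) -> float:
    """Mac Absolute Time: fractional seconds since 2001-01-01 UTC.

    Assumes the value is passed as a decimal string such as "565488000.5".
    """
    return float(Decimal(visit_time) + MAC_EPOCH_OFFSET_S)


def to_iso(unix_seconds: float) -> str:
    """Render Unix seconds as an ISO 8601 UTC string for eyeballing."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    print(to_iso(firefox_to_unix(1543795200000000)))
    print(to_iso(chrome_to_unix(13188268800000000)))
    print(to_iso(safari_to_unix("565488000.5")))
```

All three sample values in the `__main__` block correspond to 3 December 2018 (UTC), the date the issue was opened, so a history entry from "this year" should no longer land in 1974.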

@pirate commented on GitHub (Dec 4, 2018):

Thanks for pointing this out.

I think timestamps are fundamentally flawed as a unique identifier. The new design I'm working on makes them entirely optional and uses a sha256 of the URL instead, but it's going to be hard to change the folder layout of the archive to hashes when everyone's existing archives are timestamp-based.

Related to: https://github.com/pirate/bookmark-archiver/issues/74
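To illustrate the idea of keying entries on a sha256 of the URL instead of a timestamp, here is a minimal sketch. The helper name `url_id` is hypothetical and the choice of UTF-8 encoding and hex output is an assumption; the thread does not say how the new design actually derives or formats the hash.

```python
import hashlib


def url_id(url: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded URL (hypothetical helper)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two links bookmarked in the same second would share a timestamp,
    # but their URL hashes are distinct and stable across imports.
    print(url_id("https://example.com"))
    print(url_id("https://example.com/other"))
```

Unlike a timestamp, this identifier depends only on the URL, so it is unaffected by which browser or epoch the import came from.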


@pirate commented on GitHub (Mar 30, 2019):

@kergoth a quick update: v0.3.0 adds some improvements to the timestamp parsing, but it's still not perfect.

It doesn't yet handle Firefox's timestamps being off by 10x, and Chrome's timestamps aren't corrected for the 1601 epoch yet either, but it's a start:

https://github.com/pirate/ArchiveBox/blob/dev/archivebox/util.py#L369


@pirate commented on GitHub (Jul 24, 2020):

I think the latest `django` branch gets us as close as we're going to get without implementing custom offset parsing for different sources.

```bash
git checkout django
git pull
docker build . -t archivebox
docker run -v $PWD/output:/data archivebox init
docker run -v $PWD/output:/data archivebox add 'https://example.com'
docker run -v $PWD/output:/data archivebox remove --delete 'https://example.com'
docker run -v $PWD/output:/data archivebox update
```

Comment back here if you're still having trouble with timestamps being wildly off and I can reopen the ticket.
