mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #374] Bugfix: django branch start_ts error on init #1766
Originally created by @drpfenderson on GitHub (Jul 20, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/374
Describe the bug
When attempting to `archivebox init` with version 0.4.3 in old archive, archivebox fails at `Collecting links from any existing indexes and archive folders...` with `KeyError: 'start_ts'`
Steps to reproduce
`git clone` and `pip install .`
`archivebox init`
Screenshots or log output
Software versions
848977e
@pirate commented on GitHub (Jul 20, 2020):
Interesting, what version was archive folder originally made with? Can you post a sample `output/archive/<timestamp>/index.json` file from your archive and I'll investigate further.
@drpfenderson commented on GitHub (Jul 20, 2020):
The index.html says that it was created with version `a3a048d4`. Here is a gist containing the output of one of the most recent index.json files, with redacted personal info.
@pirate commented on GitHub (Jul 21, 2020):
Ah sorry I just noticed the error is actually in the main index parse, not the link details parse. Can you post a redacted/shortened version of your main index `output/index.json` file when you get a chance.
@drpfenderson commented on GitHub (Jul 21, 2020):
Here is a snippet from the beginning of the main index.json file. Here is another snippet from later in the file. Let me know if you would like/need more, or are looking for something in particular.
@cdvv7788 commented on GitHub (Jul 22, 2020):
@drpfenderson can you please test with the django branch again? We pushed a change that should help with your issue.
@drpfenderson commented on GitHub (Jul 22, 2020):
Different set of errors. This is with `36124f2`. I'm going to list how I did things, just in case I'm missing a step.
EDIT: To clarify, the error is thrown at the same point in the init process, `[*] Collecting links from any existing indexes and archive folders...`
@drpfenderson commented on GitHub (Jul 23, 2020):
Saw you had made some changes, and pulled `4cb671a`, rebuilt, ran on a copy of the archive. Finally went through the init process completely, though it listed a ton of the indexes with errors!
Here is a sampling of some of the indexes that it says are invalid. 1144634362. 1160641317. 1222742076.
@pirate commented on GitHub (Jul 23, 2020):
Perfect, thanks for those samples. It confirms our suspicion that you had a few links archived with a very old version before we introduced `start_ts`. We'll add a workaround that will handle that older schema and upgrade those files to the new style.
(Also thanks for the sponsorship @drpfenderson!)
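The kind of workaround described above can be sketched roughly as follows. This is only an illustration, not the code that ended up in ArchiveBox: the function name `normalize_history_entry`, the `fallback_ts` parameter, and the assumption that old entries lack both `start_ts` and `end_ts` are hypothetical. Only the `start_ts` key itself comes from the error reported in this thread.

```python
from typing import Any, Dict, Optional


def normalize_history_entry(
    entry: Dict[str, Any],
    fallback_ts: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of `entry` that always has `start_ts` and `end_ts` keys.

    Hypothetical helper: entries written by a very old version are assumed
    to be missing these keys, which is what makes entry["start_ts"] raise
    KeyError during init.
    """
    upgraded = dict(entry)  # copy, so the caller's original dict is untouched

    # .get() returns the fallback instead of raising KeyError on old entries.
    start = entry.get("start_ts", fallback_ts)
    upgraded["start_ts"] = start

    # Assumption: with no recorded end time, reuse the start value.
    upgraded["end_ts"] = entry.get("end_ts", start)
    return upgraded


if __name__ == "__main__":
    old_style = {"status": "succeeded"}  # no start_ts / end_ts
    print(normalize_history_entry(old_style, fallback_ts="2020-07-23T00:00:00"))
```

Entries that already carry both keys pass through unchanged, so the same function could be applied to every entry regardless of which version wrote it.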
@cdvv7788 commented on GitHub (Jul 24, 2020):
@drpfenderson can you test again please? fingers crossed
@drpfenderson commented on GitHub (Jul 24, 2020):
New error. :(
I hate being this problem person, but I do sincerely appreciate y'all continuing to support me with this severely out-of-date index. It's a very important, personal historical archive for me, and I would have just recreated it from scratch using the new version and the old list of links, but a lot of those sites in the original index are no longer online.
I uninstalled, deleted working dir, `git pull`, `pip install .`, copy archive, `archivebox init`. This is with `74ad79f`:
Let me know if there is anything else I can provide for you.
Side question: I know I can run `git rev-parse HEAD | head -c7` to get the current revision on the git directory, but is there a command to find out exactly which revision might be installed in pip? As sometimes they will not be in sync. This isn't really that important, was just curious.
@cdvv7788 commented on GitHub (Jul 24, 2020):
@drpfenderson one more try please.
Also, if you install it with `pip install -e .` you will always have installed the version of the code you are currently running (i.e. no need to pip install again after changing branches).
@drpfenderson commented on GitHub (Jul 24, 2020):
(Mostly) successful import with `5582d8a`! Thanks again for the quick responses and fixes. I would say that this specific bug is crushed, but want to make sure the next part of the error is unrelated first.
Not sure what that exactly means, but 149 is much easier to handle. From what I can tell of a random sampling, the properly-loaded 1369 links work fine. But the `archivebox init` listed specific pages with errors, so I looked a few up. They are all in the index, but partially corrupted. The main archive for each, found clicking the title of the link from the archive index, loads just fine. However, when clicking the Files link, where it shows the various versions that were captured, it actually loads the files for a different link. Testing the other link's version sometimes points to the first link, but sometimes it's a loop. Link A Files (click) > Link B files appear, Link B Files (click) > Link A files appear. But also, in some cases: Link A Files (click) > Link B files appear. Link B Files (click) > Link C files appear. Link C Files (click) > Link A files appear. Very weird.
If this error is unrelated, please feel free to mark this as closed, and I can definitely file a separate report.
@pirate commented on GitHub (Jul 25, 2020):
Ah I have seen that issue before, I think it might've been caused by a previous version actually. If I remember correctly the last time we saw this bug it was caused by the timestamp deduplication code switching the timestamps of two existing links during the deduping process.
The older versions didn't display the info on startup about which links were invalid/valid, so it's possible it just went un-noticed. Do you happen to have a backup from before you did the upgrade with `archivebox init`? If so, you can check if the swapped timestamps were present previously, and that would help us rule out a bug in this version. If not, no worries, we have some things we can try to make the new version auto-fix this type of situation.
@drpfenderson commented on GitHub (Jul 27, 2020):
Ah! I believe you are correct. The error exists in those links in my previous/backup version. The paths of the "loops" I mentioned align with the progression of a directory structure. For example:
To clarify, I click 1222742097 from the index, and it loads 1222742097.0, and vice-versa. So then, yes! This must be that duplication error you recognize from a previous version, and it sounds like this specific bug is closed.
Do you have any advice for the other error, or maybe link to an issue # if it already exists? Even if it requires me hand-editing all of the incorrect ones in nano, I would be super happy.
@pirate commented on GitHub (Jul 27, 2020):
Awesome, that's a relief to hear. We were worried it was a regression from the latest version. I'm going to close this issue for now but I'll keep responding to your comments here, don't worry.
If you post a ZIP (or email me) of a handful of those swapped folders I'll write you a bash script that fixes it.
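Before sending folders or hand-editing anything, it may help to list which folders are affected. The following is a read-only sketch, not the fix script mentioned above. It assumes the `archive/<timestamp>/index.json` layout referenced earlier in the thread, and it assumes each of those files stores its own timestamp under a `timestamp` key; that key name and the function name `find_mismatched_snapshots` are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Dict


def find_mismatched_snapshots(archive_dir: Path) -> Dict[str, str]:
    """Map folder name -> timestamp recorded in that folder's index.json,
    for every folder where the two differ. Nothing on disk is modified."""
    mismatches: Dict[str, str] = {}
    for folder in sorted(archive_dir.iterdir()):
        index_file = folder / "index.json"
        if not folder.is_dir() or not index_file.is_file():
            continue
        try:
            data = json.loads(index_file.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            continue  # unreadable index: out of scope for this check
        if not isinstance(data, dict):
            continue
        recorded = data.get("timestamp")  # assumed key name
        if recorded is not None and str(recorded) != folder.name:
            mismatches[folder.name] = str(recorded)
    return mismatches


if __name__ == "__main__":
    # Run from the directory that contains the archive/ folder.
    for name, recorded in find_mismatched_snapshots(Path("archive")).items():
        print(f"{name}/index.json claims timestamp {recorded}")
```

A pair of folders that each report the other's timestamp would match the two-link swap described above, and longer chains would show up as A reporting B, B reporting C, and so on.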