mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #705] Bug: NOT NULL constraint failed: core_archiveresult.output when upgrading v0.4.24 archive to v0.6 #444
Originally created by @pigmonkey on GitHub (Apr 14, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/705
I upgraded from v0.4.24 to v0.6.0 and ran archivebox init. After the list of migrations, it output:
@pirate commented on GitHub (Apr 14, 2021):
Ah sorry for the trouble, that shouldn't happen.
In the meantime while I investigate, if you have a backup from v0.4.24, can you try migrating to v0.5.6 first, then from there to v0.6?
The issue is caused by some extractor outputs being null in your old archive (which shouldn't happen, they shouldn't get saved in the first place if there is no output, but the old v0.4.x series had problems with this). I can add a case to handle this in v0.6 and create them as empty strings instead, but it will take a bit of time to test.
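For anyone wanting to check whether their own snapshots hit the null-output case described above, here is a minimal sketch, not part of ArchiveBox itself. It assumes each snapshot's index.json stores extractor results under a "history" key, as a mapping from extractor name to a list of result objects that each have an "output" field. That layout and the function names find_null_outputs and scan_archive are assumptions made for illustration, so check them against one of your own index.json files first.

```python
import json
from pathlib import Path


def find_null_outputs(index):
    """Return (extractor, position) pairs whose "output" is missing or null.

    ASSUMPTION: results live under a "history" key shaped like
    {extractor_name: [result, ...]}; adjust if your index.json differs.
    """
    hits = []
    if not isinstance(index, dict):
        return hits
    history = index.get("history") or {}
    for extractor, results in history.items():
        for pos, result in enumerate(results or []):
            if isinstance(result, dict) and result.get("output") is None:
                hits.append((extractor, pos))
    return hits


def scan_archive(archive_dir):
    """Map each index.json path under archive_dir to its null-output results."""
    report = {}
    for index_path in sorted(Path(archive_dir).glob("*/index.json")):
        try:
            index = json.loads(index_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            # Unreadable or malformed files are skipped in this sketch.
            continue
        hits = find_null_outputs(index)
        if hits:
            report[str(index_path)] = hits
    return report


if __name__ == "__main__":
    # Run from the root output directory, next to the archive/ folder.
    for path, hits in scan_archive("archive").items():
        print(path, hits)
```

The script only reads files and prints what it finds; it does not modify the archive or the database.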
Also helpful would be a sample ./archive/<timestamp>/index.json from one of your archive folders that's marked as invalid. It looks like you have lots more of them than normal and I'm wondering why so many are being considered invalid.
@pigmonkey commented on GitHub (Apr 14, 2021):
I downgraded to v0.5.6.
Rather than restoring a backup from v0.4.24, I deleted everything in the directory except the archive/ snapshot folder and ran archivebox init. I figured it might be easier for it to start fresh rather than trying to migrate an older database structure. It ended up failing with a new error:
Here is a random index.json from one of the snapshot directories that appeared in the invalid list for both v0.5.6 and v0.6.0:
@pigmonkey commented on GitHub (Apr 14, 2021):
One thing I notice looking at that JSON file is that it has a mix of absolute paths.
I originally had ArchiveBox in ~/library/bookmarks. I still have the old pre-Django version running there. When I began to experiment with the Django releases, I copied my archive from ~/library/bookmarks to ~/tmp/bookmarks. The above snapshot is one of the older URLs that was originally captured with the pre-Django versions, so I see the pwd key for some of the old archive methods, like screenshot, are pointing to my old directory, while the pwd key for some of the newer archive methods, like singlefile, are pointing to the new directory.
Maybe that is screwing it up somehow? I'm not sure why it cares about absolute paths, since I think the expectation is that archivebox is always run from the root output directory.
@pirate commented on GitHub (Apr 14, 2021):
It doesn't actually use those paths for anything, so that won't affect it. They're just added so human readers can find files more easily.
Instead of starting fresh on v0.5.6, can you try starting fresh on v0.6? Back up & delete the main index files, leaving only the archive/ folder, then run init. If that still fails then I'll push some fixes to v0.6 to account for null outputs.
@pigmonkey commented on GitHub (Apr 14, 2021):
Starting fresh with v0.6.0 results in the same NOT NULL constraint failed: core_archiveresult.output error as my original post.
@pirate commented on GitHub (Apr 14, 2021):
Ok, I'll push a fix for that one then. Hang tight, thanks for your patience.
@milosz commented on GitHub (Jan 7, 2022):
Is there anything that can be done to import old archives? Any guidance would be helpful.
I tried to add snapshot_id to the index.json inside an archived website (like sed -i -e "1 a\ \ \ \ \"snapshot_id\": \"$(uuidgen --time)\"," archive/1561294668/index.json). After that I executed archivebox update --status orphaned --index-only, but it does not help.
@pirate commented on GitHub (Jan 8, 2022):
What version are you trying to import @milosz? I recommend upgrading through 0.5 then to 0.6 after.
@milosz commented on GitHub (Jan 11, 2022):
I have a ~20G archive backup from 2019. Thanks, I will try this intermediate step. I am thrilled that it is possible.
@pirate commented on GitHub (Mar 23, 2022):
New instructions here: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives
Also note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting
Contributions/suggestions welcome there.