mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #454] ArchiveBox index corruption when running multiple import processes on v0.5.0 #3320
Originally created by @jaw-sh on GitHub (Aug 20, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/454
I am trying to import a lot of URLs from a text file. While that import was still running, I tried to import a specific URL in parallel. It spat out an error about a mismatched SQL / JSON index and I didn't think anything of it. Now almost every snapshot that had already been archived has no content.
https://tinf.io/admin/core/snapshot/?p=0
All of those archive.whatever URLs were successfully captured. After I tried to download a second page at the same time, it broke hundreds of them.
archivebox init yields:
I can verify these folders exist and have content. They are owned by archive:archive, the user that archivebox runs on.
@cdvv7788 commented on GitHub (Aug 20, 2020):
archivebox should not be run in parallel in the current version. Try running archivebox update https://some-url-here for one of the invalid archives, to check what happens.
@jaw-sh commented on GitHub (Aug 20, 2020):
Update output looks like this.
Nothing gets fixed by it.
@pirate commented on GitHub (Aug 20, 2020):
If you could post your archive/index.json file (redacted, remove all but 2 or 3 of the URLs and titles within), that would actually be more helpful than the traceback at this point. We just need a sample of the index for 2 or 3 of the broken URLs to get an idea of what's causing this.
The good news is that from what I can see, this situation is recoverable. There are multiple redundant indexes in the archive folder structure (see /opt/archive/archive/1597919542.046462/index.json for example), so you won't lose any of your archive data and it's safe to continue using in the meantime (but don't run multiple processes in parallel).
@jaw-sh commented on GitHub (Aug 21, 2020):
Attached. @pirate
1.txt
2.txt
@jaw-sh commented on GitHub (Sep 15, 2020):
The archive services I rely on continue to be garbage and I would very much like to switch to ArchiveBox.
@cdvv7788 commented on GitHub (Oct 7, 2020):
@jaw-sh can you try running archivebox init with the current master version? Now we try to move everything from the index.json to the index.sql to avoid conflicts. Maybe this helps with your issue.
@jaw-sh commented on GitHub (Oct 7, 2020):
@cdvv7788 I'm inexperienced with Python. I installed via pip. Would there be a straight-forward way of replacing the installation with the repository?
@cdvv7788 commented on GitHub (Oct 7, 2020):
The easiest way to run it is to use docker. The new version requires node too, so it is getting a little more complex. Maybe wait until the 0.5 version is officially released before we dive into this issue again?
@pirate commented on GitHub (Feb 1, 2021):
We're now on v0.5.4, this may be worth trying again (back up any important data first).
@pirate commented on GitHub (Apr 6, 2021):
I believe these issues are fixed in the latest versions. SQLite3 is now run in WAL mode, and it will complain about the database being locked if you attempt to write with too many concurrent threads. It's not perfect because it can't be run fully in parallel yet, but it should now fail safely instead of corrupting the archive.
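For readers who want to see the "database is locked" behaviour in isolation, here is a minimal sketch using only Python's standard sqlite3 module. It is not ArchiveBox's actual code: the file name index.sqlite3, the snapshot table, the 0.1 second timeout, and the helper names are all assumptions made for illustration. It opens two connections in WAL mode and shows that the second writer is refused with an error while the first holds the write lock, rather than both writing at once.

```python
import os
import sqlite3
import tempfile


def open_wal(path, timeout=0.1):
    """Open a connection in autocommit mode and switch the database to WAL."""
    # isolation_level=None lets us issue BEGIN/COMMIT explicitly.
    conn = sqlite3.connect(path, timeout=timeout, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def demo_lock_failure():
    """Return (error message seen by the second writer, final row count)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.sqlite3")  # hypothetical file name

        writer_a = open_wal(path)
        writer_a.execute("CREATE TABLE snapshot (url TEXT PRIMARY KEY)")
        writer_b = open_wal(path)

        # Writer A takes the single write lock and keeps its transaction open.
        writer_a.execute("BEGIN IMMEDIATE")
        writer_a.execute("INSERT INTO snapshot VALUES ('https://example.com/a')")

        # Writer B waits for `timeout` seconds, then fails instead of writing.
        error = None
        try:
            writer_b.execute("BEGIN IMMEDIATE")
        except sqlite3.OperationalError as exc:
            error = str(exc)

        # Once A commits, B can retry and succeed.
        writer_a.execute("COMMIT")
        writer_b.execute("BEGIN IMMEDIATE")
        writer_b.execute("INSERT INTO snapshot VALUES ('https://example.com/b')")
        writer_b.execute("COMMIT")

        rows = writer_a.execute("SELECT COUNT(*) FROM snapshot").fetchone()[0]
        writer_a.close()
        writer_b.close()
        return error, rows


if __name__ == "__main__":
    print(demo_lock_failure())
```

The point of the sketch is the failure mode: the second writer gets an sqlite3.OperationalError it can catch and retry, and both rows are intact at the end.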
Please comment back here if you're still encountering any corruption issues on the latest version and I'll reopen the issue to investigate.
@pirate commented on GitHub (Apr 12, 2022):
Note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting
Contributions/suggestions welcome there.