[GH-ISSUE #454] ArchiveBox index corruption when running multiple import processes on v0.5.0 #3320

Closed
opened 2026-03-14 22:05:38 +03:00 by kerem · 11 comments

Originally created by @jaw-sh on GitHub (Aug 20, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/454

I am trying to import a lot of URLs from a text file. While that import was still running, I tried to import a specific URL in parallel. It spat out an error about a mismatched SQL / JSON index and I didn't think anything of it. Now almost every snapshot that had already been archived has no content.

https://tinf.io/admin/core/snapshot/?p=0

All of those archive.whatever URLs were successfully captured. After I tried to download a second page at the same time, it broke hundreds of them.

archivebox init yields:

[*] Updating existing ArchiveBox collection in this folder...
    /opt/archive
------------------------------------------------------------------

[*] Verifying archive folder structure...
    √ /opt/archive/sources
    √ /opt/archive/archive
    √ /opt/archive/logs
    √ /opt/archive/ArchiveBox.conf

[*] Verifying main SQL index and running migrations...
    √ /opt/archive/index.sqlite3

    Operations to perform:
    Apply all migrations: admin, auth, contenttypes, core, sessions
    Running migrations:
    No migrations to apply.

[*] Collecting links from any existing indexes and archive folders...
    √ Loaded 574 links from existing main index.
    ! Skipped adding 540 invalid link data directories.
        X /opt/archive/archive/1597919542.044602 [1597919542.044602] https://archive.md/o/abcd/https://twitter.com/... "Some title"
        X ...(redacted)...

    Hint: For more information about the link data directories that were skipped, run:
        archivebox status
        archivebox list --status=invalid

[*] [2020-08-20 15:46:43] Writing 574 links to main index...

    √ /opt/archive/index.sqlite3

    √ /opt/archive/index.json

    √ /opt/archive/index.html

------------------------------------------------------------------
[√] Done. Verified and updated the existing ArchiveBox collection.

    Hint: To view your archive index, run:
        archivebox server  # then visit http://127.0.0.1:8000

    To add new links, you can run:
        archivebox add ~/some/path/or/url/to/list_of_links.txt

    For more usage and examples, run:
        archivebox help

I can verify these folders exist and have content. They are owned by archive:archive, the user that archivebox runs as.

/opt/archive# ls -l /opt/archive/archive/1597919542.046462
total 11108
drwxr-xr-x 4 archive archive    4096 Aug 20 12:25 archive.fo
-rwxr-xr-x 1 archive archive      66 Aug 20 12:26 archive.org.txt
-rwxr-xr-x 1 archive archive     686 Aug 20 12:25 favicon.ico
-rwxr-xr-x 1 archive archive   21021 Aug 20 12:26 index.html
-rwxr-xr-x 1 archive archive   11631 Aug 20 12:26 index.json
drwxr-xr-x 2 archive archive    4096 Aug 20 12:26 media
-rwxr-xr-x 1 archive archive  437299 Aug 20 12:26 output.html
-rwxr-xr-x 1 archive archive 6008835 Aug 20 12:25 output.pdf
drwxr-xr-x 2 archive archive    4096 Aug 20 12:26 readability
-rwxr-xr-x 1 archive archive  382290 Aug 20 12:25 screenshot.png
-rwxr-xr-x 1 archive archive 4473399 Aug 20 12:25 singlefile.html
drwxr-xr-x 2 archive archive    4096 Aug 20 12:25 warc

@cdvv7788 commented on GitHub (Aug 20, 2020):

archivebox should not be run in parallel in the current version. Try running archivebox update https://some-url-here for one of the invalid archives, to check what happens.
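
Until parallel runs are supported, one way to avoid overlapping imports is to take a lock before each invocation. The following is only a sketch, not part of ArchiveBox: `run_serialized` and the lock file path are hypothetical names, and it relies on `fcntl`, so it works on Unix-like systems only.

```python
import fcntl
import subprocess
from pathlib import Path


def run_serialized(cmd, lock_path, cwd=None):
    """Run cmd while holding an exclusive advisory lock on lock_path.

    A second caller using the same lock_path blocks until the first
    command has finished, so two imports never run at the same time.
    """
    lock_file = Path(lock_path)
    lock_file.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_file, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # waits for any other holder
        try:
            return subprocess.run(cmd, cwd=cwd, check=False).returncode
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


# Hypothetical usage (lock path is an assumption, not an ArchiveBox file):
# run_serialized(["archivebox", "add", "https://example.com"],
#                "/opt/archive/.import.lock", cwd="/opt/archive")
```

The lock is advisory, so it only helps if every import goes through the same wrapper.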


@jaw-sh commented on GitHub (Aug 20, 2020):

Update output looks like this.

[*] [2020-08-20 16:14:13] Writing 574 links to main index...
    √ /opt/archive/index.sqlite3
    √ /opt/archive/index.json
    √ /opt/archive/index.html

[*] [2020-08-20 16:14:50] Writing 574 links to main index...
    √ /opt/archive/index.sqlite3
    √ /opt/archive/index.json
    √ /opt/archive/index.html

Nothing gets fixed by it.


@pirate commented on GitHub (Aug 20, 2020):

If you could post your archive/index.json file (redacted, remove all but 2 or 3 of the URLs and titles within), that would actually be more helpful than the traceback at this point. We just need a sample of the index for 2 or 3 of the broken URLs to get an idea of what's causing this.

The good news is that from what I can see, this situation is recoverable. There are multiple redundant indexes in the archive folder structure (see /opt/archive/archive/1597919542.046462/index.json for example), so you won't lose any of your archive data and it's safe to continue using in the meantime (but don't run multiple processes in parallel).
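
To check what the per-snapshot indexes still contain, something like the sketch below could be used. It is not an ArchiveBox tool: `load_snapshot_indexes` is a hypothetical helper, and the `url`, `timestamp` and `title` keys are an assumption about the layout of each snapshot's `index.json`.

```python
import json
from pathlib import Path


def load_snapshot_indexes(archive_dir):
    """Read <archive_dir>/<timestamp>/index.json for every snapshot folder.

    Returns one dict per folder; unreadable or malformed indexes are
    reported with an "error" key instead of being skipped silently.
    """
    entries = []
    for index_path in sorted(Path(archive_dir).glob("*/index.json")):
        folder = index_path.parent.name
        try:
            data = json.loads(index_path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:
            entries.append({"dir": folder, "error": str(exc)})
            continue
        if not isinstance(data, dict):
            entries.append({"dir": folder, "error": "index is not a JSON object"})
            continue
        entries.append({
            "dir": folder,
            # Key names below are assumed, hence .get() rather than [].
            "url": data.get("url"),
            "timestamp": data.get("timestamp"),
            "title": data.get("title"),
        })
    return entries


# Example: load_snapshot_indexes("/opt/archive/archive")
```

The function only reads files, so it is safe to run against a live collection.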


@jaw-sh commented on GitHub (Aug 21, 2020):

Attached. @pirate

1.txt: https://github.com/pirate/ArchiveBox/files/5107857/1.txt
2.txt: https://github.com/pirate/ArchiveBox/files/5107859/2.txt

@jaw-sh commented on GitHub (Sep 15, 2020):

The archive services I rely on continue to be garbage and I would very much like to switch to ArchiveBox.


@cdvv7788 commented on GitHub (Oct 7, 2020):

@jaw-sh can you try running archivebox init with the current master version? It now tries to move everything from index.json into the SQL index (index.sqlite3) to avoid conflicts. Maybe this helps with your issue.


@jaw-sh commented on GitHub (Oct 7, 2020):

@cdvv7788 I'm inexperienced with Python. I installed via pip. Would there be a straightforward way of replacing the installation with the repository?


@cdvv7788 commented on GitHub (Oct 7, 2020):

The easiest way to run it is to use Docker. The new version requires Node too, so it is getting a little more complex. Maybe wait until the 0.5 version is officially released before we dive into this issue again?


@pirate commented on GitHub (Feb 1, 2021):

We're now on v0.5.4, this may be worth trying again (back up any important data first).


@pirate commented on GitHub (Apr 6, 2021):

I believe these issues are fixed in the latest versions. SQLite3 is now run in WAL mode, and it will complain about the database being locked if you attempt to write with too many concurrent threads. It's not perfect, because it can't be run fully in parallel yet, but it should now fail safely instead of corrupting the archive.

Please comment back here if you're still encountering any corruption issues on the latest version and I'll reopen the issue to investigate.
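
For anyone curious what "complain about the database being locked" looks like at the SQLite level, here is a small standalone sketch using Python's `sqlite3` module. It is not ArchiveBox code: `demo_wal_lock` and the `snapshot` table are made up for illustration, and `timeout=0` is chosen only so the second writer fails immediately instead of waiting.

```python
import sqlite3


def demo_wal_lock(db_path):
    """Show that a second writer is refused, not allowed to corrupt data."""
    # isolation_level=None gives explicit control over BEGIN/COMMIT.
    first = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    mode = first.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    first.execute("CREATE TABLE IF NOT EXISTS snapshot (url TEXT)")

    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)

    # The first connection takes the write lock and holds it.
    first.execute("BEGIN IMMEDIATE")
    first.execute("INSERT INTO snapshot VALUES ('https://example.com')")

    try:
        second.execute("BEGIN IMMEDIATE")  # a second writer at the same time
        second_writer = "acquired"
        second.execute("ROLLBACK")
    except sqlite3.OperationalError as exc:
        second_writer = str(exc)  # SQLite reports the lock as an error

    first.execute("COMMIT")
    rows = second.execute("SELECT COUNT(*) FROM snapshot").fetchone()[0]
    first.close()
    second.close()
    return mode, second_writer, rows
```

The second writer gets an error it can retry on, while the first writer's row is committed intact.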


@pirate commented on GitHub (Apr 12, 2022):

Note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting

Contributions/suggestions welcome there.
