[GH-ISSUE #202] Windows Docker error on resuming archiving process #3158

Closed
opened 2026-03-14 21:21:30 +03:00 by kerem · 3 comments
Owner

Originally created by @mrbenns on GitHub (Mar 30, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/202

## Describe the bug

Successfully ran several thousand archives, but then I had to restart Windows and now I'm unable to get ArchiveBox to resume.

## Screenshots or log output

```
PS E:\archivebox> docker-compose exec archivebox /bin/archive /data/sources/instapaper.html
Traceback (most recent call last):
  File "/bin/archive", line 136, in <module>
    main(*sys.argv)
  File "/bin/archive", line 98, in main
    update_archive_data(import_path=import_path, resume=resume)
  File "/bin/archive", line 106, in update_archive_data
    all_links, new_links = load_links_index(out_dir=OUTPUT_DIR, import_path=import_path)
  File "/home/pptruser/app/archivebox/index.py", line 61, in load_links_index
    existing_links = parse_json_links_index(out_dir)
  File "/home/pptruser/app/archivebox/index.py", line 108, in parse_json_links_index
    links = json.load(f)['links']
  File "/usr/lib/python3.5/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 2096737 column 29 (char 100663294)
PS E:\archivebox>
```

@pirate commented on GitHub (Mar 30, 2019):

Ah crap, unfortunately `master` didn't have power-off protection yet; we [just added that a few days ago](https://github.com/pirate/ArchiveBox/pull/197/commits/d2a34f260287af8348ca2ffc7f5b0116bb2ecaf1) on `dev` and it hasn't been released to `master` yet.

The error you're seeing is because your `index.json` file was only written halfway before your computer rebooted, leaving it damaged.
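The failure mode is easy to reproduce: truncating a JSON document mid-string raises exactly the error shown in the traceback (a hypothetical snippet for illustration, not part of the original thread):

```python
import json

# Simulate an index.json whose write was cut off partway by a reboot.
intact = '{"links": [{"url": "https://example.com"}]}'
damaged = intact[:30]   # '{"links": [{"url": "https://ex'

try:
    json.loads(damaged)
except json.JSONDecodeError as e:
    print(e.msg)   # Unterminated string starting at
```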

Here are the steps to recover everything:

  1. Create a new text file `recovered_urls.txt`
  2. Put all the links inside the damaged `output/index.json` into that text file
  3. Append every line from `output/sources/*.txt` into that text file
  4. Put all the links from all the `output/archive/<timestamp>/index.json` files into that file
  5. Once you've confirmed that all the URLs are in `recovered_urls.txt`, delete `output/index.json`
  6. Run archivebox and re-import all the URLs: `./archive recovered_urls.txt`

It will regenerate your `output/index.json` file, and in the process, it will pick up any existing archived data in the `output/archive/<timestamp>` folders without overwriting it (so you won't lose any archived pages, but the index metadata will start over from scratch). In theory, you shouldn't have any data loss of previously archived pages if you follow these steps.
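The recovery steps above can be sketched as a small script. This is an unofficial, hypothetical helper (the paths follow the layout named in the steps; the URL regex is a rough heuristic, since the damaged `index.json` can't be parsed as JSON and must be scraped as raw text):

```python
import re
from pathlib import Path

# Rough pattern for anything that looks like a URL, even inside broken JSON.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def extract_urls(text):
    """Return every URL-looking substring in text, in order of appearance."""
    return URL_RE.findall(text)

def recover(output_dir="output", dest="recovered_urls.txt"):
    out = Path(output_dir)
    urls = []
    # Steps 1-2: scrape the damaged top-level index.json as plain text.
    top = out / "index.json"
    if top.exists():
        urls += extract_urls(top.read_text(errors="ignore"))
    # Step 3: append every line from output/sources/*.txt.
    for src in (out / "sources").glob("*.txt"):
        urls += extract_urls(src.read_text(errors="ignore"))
    # Step 4: collect links from each per-snapshot index.json.
    for idx in (out / "archive").glob("*/index.json"):
        urls += extract_urls(idx.read_text(errors="ignore"))
    # De-duplicate while preserving first-seen order, then write the list out.
    seen = dict.fromkeys(urls)
    Path(dest).write_text("\n".join(seen) + "\n")
    return list(seen)
```

After running `recover()` and eyeballing `recovered_urls.txt`, steps 5 and 6 (deleting the damaged index and re-importing) are still manual.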

In the future, no one will ever experience this issue again, because as I mentioned we added atomic write enforcement on dev, you can see how that fix works here: https://github.com/pirate/ArchiveBox/pull/197/commits/d2a34f260287af8348ca2ffc7f5b0116bb2ecaf1
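The linked commit has the real fix; the general atomic-write pattern it relies on looks like this (a minimal illustrative sketch, not ArchiveBox's actual code):

```python
import os
import tempfile

def atomic_write(path, content):
    """Write content so readers never observe a half-written file:
    write to a temp file in the same directory, then rename it over
    the target. os.replace() swaps the whole file in one step."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())   # force bytes to disk before the rename
        os.replace(tmp, path)      # atomic swap: old file or new, never halfway
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on failure
        raise
```

If the machine loses power mid-write, only the temp file is damaged; the real `index.json` still holds its last complete contents.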


@mrbenns commented on GitHub (Mar 30, 2019):

Thank you for the very quick response :-)
Will try that


@pirate commented on GitHub (Mar 30, 2019):

Ok good luck! Let me know if you have trouble. If you want you can send me your archive folder and I can help recover it for you (if you're ok with sharing that data, no worries if not).
