[GH-ISSUE #265] JSONDecodingError while archiving a specific website #189

Closed
opened 2026-03-01 14:41:23 +03:00 by kerem · 7 comments
Owner

Originally created by @phretor on GitHub (Sep 12, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/265

Describe the bug

I'm getting this JSONDecodingError on a specific website.

Steps to reproduce

  1. Ran ArchiveBox with the "default" config (didn't change the docker-compose.yaml file much, apart from naming networks differently)
  2. Saw this output during archiving
u@h:~/docker/ArchiveBox$ echo "https://www.zdnet.com/article/new-simjacker-attack-exploited-in-the-wild-to-track-users-for-at-least-two-years/" | /usr/local/bin/docker-compose exec -T archivebox /bin/archive

Traceback (most recent call last):
  File "/bin/archive", line 136, in <module>
    main(*sys.argv)
  File "/bin/archive", line 98, in main
    update_archive_data(import_path=import_path, resume=resume)
  File "/bin/archive", line 106, in update_archive_data
    all_links, new_links = load_links_index(out_dir=OUTPUT_DIR, import_path=import_path)
  File "/home/pptruser/app/archivebox/index.py", line 61, in load_links_index
    existing_links = parse_json_links_index(out_dir)
  File "/home/pptruser/app/archivebox/index.py", line 108, in parse_json_links_index
    links = json.load(f)['links']
  File "/usr/lib/python3.5/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 5283 column 44 (char 271619)

kerem closed this issue 2026-03-01 14:41:23 +03:00

@pirate commented on GitHub (Sep 19, 2019):

Looks like your index got corrupted somehow, can you look inside the index.json file and see if it got truncated at line 5283? If so, don't worry you haven't lost any data, you'll just have to rebuild the index with ArchiveBox v0.4.x which has a new automatic index-rebuilding feature. This has happened in the past to users who used older versions that didn't have the atomic-writing index save feature to prevent corrupted indexes.
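As an illustrative aside (not part of ArchiveBox itself): Python's `json.JSONDecodeError` carries the line, column, and character offset of the failure, so a short script can show the text around the corruption point before deciding whether the file was truncated. The `find_corruption` helper name and the `index.json` path are hypothetical:

```python
import json

def find_corruption(path="index.json", context=40):
    """Return (lineno, colno, surrounding text) of the first JSON error,
    or None if the file parses cleanly. Purely a debugging aid."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        json.loads(text)
        return None  # file is valid JSON
    except json.JSONDecodeError as e:
        # e.pos is the character offset quoted in the traceback,
        # e.g. "Invalid control character at: line 5283 column 44 (char 271619)"
        snippet = text[max(0, e.pos - context):e.pos + context]
        return e.lineno, e.colno, snippet
```

If the returned snippet ends mid-object at the end of the file, the index was most likely truncated rather than corrupted in place.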


@logistic-bot commented on GitHub (Oct 7, 2019):

Getting similar error:

[*] [2019-10-07 21:31:42] "The History of TI Graphing Calculator Gaming - YouTube"
    https://www.youtube.com/watch?v=Jo_WgbUfNxc
    √ output/archive/1570023650
    ! Failed to archive link: JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Traceback (most recent call last):
  File "/usr/bin/archivebox", line 136, in <module>
    main(*sys.argv)
  File "/usr/bin/archivebox", line 98, in main
    update_archive_data(import_path=import_path, resume=resume)
  File "/usr/bin/archivebox", line 118, in update_archive_data
    archive_link(link_dir, link)
  File "/home/khais/bin/ArchiveBox/archivebox/archive_methods.py", line 87, in archive_link
    link = load_json_link_index(link_dir, link)
  File "/home/khais/bin/ArchiveBox/archivebox/index.py", line 234, in load_json_link_index
    **parse_json_link_index(out_dir),
  File "/home/khais/bin/ArchiveBox/archivebox/index.py", line 224, in parse_json_link_index
    link_json = json.load(f)
  File "/usr/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Temporarily solved it by deleting the output/archive/1570023650 folder.
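For anyone hitting the same thing: rather than finding corrupted per-link folders one traceback at a time, a small script can scan them all up front. This is a hedged sketch, not ArchiveBox code; the `find_broken_link_indexes` name and the `output/archive/<timestamp>/index.json` layout are assumptions based on the paths in this thread:

```python
import json
from pathlib import Path

def find_broken_link_indexes(archive_dir):
    """List per-link folders whose index.json fails to parse,
    so they can be removed and their URLs re-archived."""
    broken = []
    for index_file in Path(archive_dir).glob("*/index.json"):
        try:
            json.loads(index_file.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            broken.append(index_file.parent)
    return sorted(broken)
```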


@gjedeer commented on GitHub (Oct 16, 2019):

docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox /bin/archive --version                                    
ArchiveBox version 73cdb8daf

I have a malformed index; how do I rebuild it? I tried fixing it manually, but it's FUBAR.

As you can see I'm now running the latest version and I'm still getting the exception:

  File "/home/pptruser/app/archivebox/index.py", line 108, in parse_json_links_index
    links = json.load(f)['links']
...
json.decoder.JSONDecodeError: Extra data: line 2799 column 14 (char 132494)

@gjedeer commented on GitHub (Oct 16, 2019):

I've found out that development is now happening in branches: I cloned the latest v0.5.0 and built a Docker image from it, but I get the same problem.

➜  ArchiveBox git:(v0.5.0) echo http://www.tylerproject.org/gallery | docker run -i -v ~/ArchiveBox:/data archivebox:0.5.0 /bin/archive 
Traceback (most recent call last):
  File "/bin/archive", line 136, in <module>
    main(*sys.argv)
  File "/bin/archive", line 98, in main
    update_archive_data(import_path=import_path, resume=resume)
  File "/bin/archive", line 106, in update_archive_data
    all_links, new_links = load_links_index(out_dir=OUTPUT_DIR, import_path=import_path)
  File "/home/pptruser/app/archivebox/index.py", line 61, in load_links_index
    existing_links = parse_json_links_index(out_dir)
  File "/home/pptruser/app/archivebox/index.py", line 108, in parse_json_links_index
    links = json.load(f)['links']
  File "/usr/lib/python3.5/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2799 column 14 (char 132494)
➜  ArchiveBox git:(v0.5.0) docker run -i -v ~/ArchiveBox:/data archivebox:0.5.0 /bin/archive --version
ArchiveBox version 73cdb8daf


@pirate commented on GitHub (Oct 17, 2019):

@gjedeer I'm removing the JSON index entirely and sticking to SQLite for the final release, it's too hard to incrementally add JSON entries in an efficient way without corrupting the index during power outages or causing huge read/write spikes for no reason. Instead it will use SQLite for the core index, and export a JSON index if the user manually requests it, or once an archiving import process is completely finished.
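For context, the atomic-write pattern referred to above is usually: write the new index to a temporary file in the same directory, flush it to disk, then rename it over the destination. `os.replace()` is an atomic rename on POSIX, so a crash mid-write leaves the previous index intact instead of a truncated file. A minimal sketch with a hypothetical `atomic_write_json` helper, not the actual ArchiveBox implementation:

```python
import json
import os
import tempfile

def atomic_write_json(obj, dest_path):
    """Write obj as JSON to dest_path without ever exposing a partial file."""
    dirname = os.path.dirname(dest_path) or "."
    # Temp file must be on the same filesystem for os.replace() to be atomic
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(obj, f, indent=4)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk
        os.replace(tmp_path, dest_path)  # atomic swap; old file survives a crash
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```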


@gjedeer commented on GitHub (Oct 18, 2019):

Cool. I've just extracted the individual archived page URLs from $subdir/index.js, removed the corrupted main index.js file, and archived the URLs again; fortunately, they were still available.

BTW, there was no power loss or anything like that. I've seen in the sources that you had in-place JSON-modifying code; yeah, I agree that SQLite is a better solution.


@pirate commented on GitHub (May 9, 2020):

Closing this in favor of #234
