[GH-ISSUE #1347] Bug: JSONDecodeError when trying to load JSON with an array at the top-level #2333

Closed
opened 2026-03-01 17:58:16 +03:00 by kerem · 7 comments

Originally created by @philippemilink on GitHub (Feb 15, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1347

With ArchiveBox version 0.6.2, I used to import URLs stored in JSON files with content looking like the following:

```json
[
    {"url": "https://archivebox.io/", "tags": "test,archivebox"},
    {"url": "https://en.wikipedia.org/wiki/Linux", "tags": "test,wikipedia"}
]
```

```bash
archivebox add --parser json < ./links.json
```

Everything worked well.

With version 0.7.2, however, I get a `JSONDecodeError` exception during the import:

```
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/main.py", line 631, in add
    new_links += parse_links_from_source(write_ahead_log, root_url=None, parser=parser)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/util.py", line 116, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/index/__init__.py", line 278, in parse_links_from_source
    raw_links, parser_name = parse_links(source_path, root_url=root_url, parser=parser)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/util.py", line 116, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/parsers/__init__.py", line 103, in parse_links
    links, parser = run_parser_functions(file, timer, root_url=root_url, parser=parser)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/parsers/__init__.py", line 117, in run_parser_functions
    parsed_links = list(parser_func(to_parse, root_url=root_url))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/archivebox/venv/lib/python3.11/site-packages/archivebox/parsers/generic_json.py", line 23, in parse_generic_json_export
    links = json.loads(json_file_json_str)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 61 (char 60)
```

The error is caused by the following line, introduced in commit aaca74f: https://github.com/ArchiveBox/ArchiveBox/blob/3ad32509e985236f82f3558f31b856623b1eb261/archivebox/parsers/generic_json.py#L22
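For illustration, here is a standalone sketch (not the actual parser code) showing how skipping the first line of a pretty-printed top-level array reproduces this error: the opening `[` is dropped, so the first object parses on its own and the trailing comma triggers "Extra data".

```python
import json

content = (
    '[\n'
    '    {"url": "https://archivebox.io/", "tags": "test,archivebox"},\n'
    '    {"url": "https://en.wikipedia.org/wiki/Linux", "tags": "test,wikipedia"}\n'
    ']\n'
)

# Dropping the first line (which holds the opening '[') leaves invalid JSON:
remainder = content.split('\n', 1)[1]

try:
    json.loads(remainder)
except json.JSONDecodeError as e:
    print(e.msg)  # Extra data
```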


@pirate commented on GitHub (Feb 21, 2024):

ah interesting, all the files I tested with had an object at the top level instead of a list

I can add handling for lists pretty easily, there are so many different JSON formats to support haha


@jimwins commented on GitHub (Feb 26, 2024):

Instead of trying to figure out what is going on when the first line of the JSON file is garbage, it would be easier to first try parsing without skipping anything, and then retry with the first line skipped if that fails.

```python
try:
    links = json.load(json_file)
except json.decoder.JSONDecodeError:
    # sometimes the first line is a comment or other junk, so retry without it
    json_file.seek(0)
    first_line = json_file.readline()
    # print('      > Trying JSON parser without first line: "', first_line.strip(), '"', sep='')
    links = json.load(json_file)
    # we may fail again, which means we really don't know what to do
```

But maybe even this isn't necessary? It looks like the original "skip the first line" logic came about because ArchiveBox would add the filename to the file as the first line when putting in the sources directory, but that doesn't seem to happen any more (which seems like a much, much better way to go).


@pirate commented on GitHub (Feb 29, 2024):

> But maybe even this isn't necessary?

Yes, but I want to keep the workaround logic as a fallback because users still have the old "filename as first line" style imports in their `sources/` dir and they might want to re-import their sources again later on.

I do agree we should move it to a `try: except:` fallback though, as you showed above.


@jimwins commented on GitHub (Feb 29, 2024):

> Yes but I want to keep the workaround logic as a fallback because users still have the old "filename as first line" style imports in their `sources/` dir and they might want to re-import their sources again later on.
>
> I do agree we should move it to a `try: except:` fallback though as you showed above.

I don't see how the existing workaround ever worked for anything, because it chops off everything before the first `{`, which includes the `[`, and the rest of the parsing assumes that `links` is a list.
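To illustrate the point, a standalone sketch (not the actual parser code): slicing the input at the first `{` discards the opening `[`, so what remains is a bare object followed by a stray `]`, which is not valid JSON.

```python
import json

content = '[\n    {"url": "https://archivebox.io/"}\n]\n'

# The workaround drops everything before the first '{' ...
chopped = content[content.index('{'):]

# ... which discards the '[': json.loads parses the object,
# then chokes on the leftover ']' with "Extra data".
try:
    links = json.loads(chopped)
except json.JSONDecodeError as e:
    print(e.msg)  # Extra data
```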


@pirate commented on GitHub (Feb 29, 2024):

Ah sorry, it was long enough ago that I don't remember what it was for exactly... maybe it was to handle an extra newline at the start, or maybe I thought I was handling a JSON object at the top level instead of JSONL?

Either way I'm down to change it, this parser is broken enough that it's not useful in its current state anyway.


@jimwins commented on GitHub (Feb 29, 2024):

Handling JSONL wouldn't be hard to add as another fallback. We could try JSON, then JSONL, and then try them both again without the first line to handle old source lists that had that extra line added.


@pirate commented on GitHub (Mar 22, 2024):

This is done, thanks again @jimwins for all your great work here!

Will be out in the next release, or [pull `:dev` to get it early](https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch).
