[GH-ISSUE #68] parsing a pocket json download jsondecodeerror "extra data" #46

Closed
opened 2026-03-01 14:40:08 +03:00 by kerem · 2 comments

Originally created by @vargwolf on GitHub (Feb 22, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/68

I am not the best with Python, so I have not been able to find out which JSON line is causing it to fail.

I think line 95 may be the `"tags": "ifttt,reddit",` entry.

Thank you,

File contains invalid JSON: {
    "updated": "1518974038.00784",
    "sources": [
        "ril_export.html"
    ],
    "timestamp": "1495763887",
    "type": null,
    "base_url": "www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/",
    "latest": {
        "wget": "www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/index.html",
        "screenshot": "screenshot.png",
        "pdf": "output.pdf",
        "archive_org": "https://web.archive.org/web/20171226180936/https://www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/",
        "favicon": "favicon.ico"
    },
    "url": "https://www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/",
    "history": {
        "wget": [
            {
                "status": "succeded",
                "output": "www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/index.html",
                "timestamp": "1514311770",
                "duration": "427",
                "cmd": [
                    "wget",
                    "--timestamping",
                    "--adjust-extension",
                    "--no-parent",
                    "--page-requisites",
                    "--convert-links",
                    "https://www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/"
                ]
            }
        ],
        "screenshot": [
            {
                "status": "succeded",
                "output": "screenshot.png",
                "timestamp": "1514311773",
                "duration": "2222",
                "cmd": [
                    "chromium-browser",
                    "--headless",
                    "--disable-gpu",
                    "--screenshot",
                    "--window-size=1440,900",
                    "https://www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/"
                ]
            }
        ],
        "pdf": [
            {
                "status": "succeded",
                "output": "output.pdf",
                "timestamp": "1514311771",
                "duration": "2213",
                "cmd": [
                    "chromium-browser",
                    "--headless",
                    "--disable-gpu",
                    "--print-to-pdf",
                    "https://www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/"
                ]
            }
        ],
        "archive_org": [
            {
                "status": "succeded",
                "output": "https://web.archive.org/web/20171226180936/https://www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/",
                "timestamp": "1514311775",
                "duration": "835",
                "cmd": [
                    "curl",
                    "-I",
                    "https://web.archive.org/save/https://www.reddit.com/r/selfhosted/comments/6d9grx/looking_for_an_rss_homepage/"
                ]
            }
        ],
        "favicon": [
            {
                "status": "succeded",
                "output": "favicon.ico",
                "timestamp": "1514311776",
                "duration": "270",
                "cmd": [
                    "curl",
                    "https://www.google.com/s2/favicons?domain=www.reddit.com"
                ]
            }
        ]
    },
    "domain": "www.reddit.com",
    "tags": "ifttt,reddit",
    "title": "Looking for an RSS Homepage : selfhosted"
}}!
Traceback (most recent call last):
  File "/home/vargwolf/bookmark-archiver/archive.py", line 124, in <module>
    update_archive(archive_path, links, source=source, resume=resume, append=True)
  File "/home/vargwolf/bookmark-archiver/archive.py", line 70, in update_archive
    archive_links(archive_path, links, source=source, resume=resume)
  File "/home/vargwolf/bookmark-archiver/archive_methods.py", line 60, in archive_links
    raise e
  File "/home/vargwolf/bookmark-archiver/archive_methods.py", line 46, in archive_links
    archive_link(link_dir, link)
  File "/home/vargwolf/bookmark-archiver/archive_methods.py", line 70, in archive_link
    **parse_json_link_index(link_dir),
  File "/home/vargwolf/bookmark-archiver/index.py", line 122, in parse_json_link_index
    return json.load(f)
  File "/usr/lib/python3.5/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 95 column 2 (char 3608)
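
For readers unsure how to read the last line of the traceback: "Extra data" means the parser finished one complete JSON value and then found more text after it, and the line, column, and char numbers point at where that leftover text starts. In the output above, the closing `}}` looks like the likely culprit, since column 2 of that line would be the second brace. The following is a minimal sketch of how the position can be read from the exception; the `sample` string and the helper name `describe_json_error` are made up for illustration and are not the reporter's file or part of the archiver.

```python
import json

# Made-up sample (not the reporter's file): a valid object followed by a stray "}".
sample = '{\n    "tags": "ifttt,reddit",\n    "title": "example"\n}}'


def describe_json_error(text):
    """Return (message, line, column, char offset) for invalid JSON, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno are 1-based; pos is the 0-based character offset.
        return err.msg, err.lineno, err.colno, err.pos
    return None


print(describe_json_error(sample))
```

Here the reported column is 2 on the last line, which is the second closing brace, mirroring the "line 95 column 2" shape of the error in the traceback.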

@pirate commented on GitHub (Feb 22, 2018):

It looks like your `1495763887/index.json` file got corrupted somehow. Can you copy and paste the contents of the file into a comment here, instead of copying the output from the terminal, so I can check whether the JSON is valid?

If you just want a quick solution: delete `1495763887` in your archive folder and run it again; it should get re-added correctly.
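
Before deleting the folder, it can be useful to see exactly what follows the valid part of the file. The sketch below uses the standard library's `json.JSONDecoder.raw_decode`, which parses the first JSON value and reports where it ended; the helper name `split_trailing_data` is hypothetical and not part of the archiver.

```python
import json


def split_trailing_data(text):
    """Parse the first JSON value in text and return (value, leftover text)."""
    decoder = json.JSONDecoder()
    # raw_decode does not skip leading whitespace, so strip it first.
    stripped = text.lstrip()
    value, end = decoder.raw_decode(stripped)
    # Anything after the first value (ignoring whitespace) is the "extra data".
    return value, stripped[end:].strip()


# Made-up example with one stray closing brace.
value, leftover = split_trailing_data('{"title": "example"}}')
print(repr(leftover))
```

To check a real file, pass its contents instead of the example string, for instance the text read from `1495763887/index.json` in your archive folder (the exact path depends on your setup). An empty leftover string means the file holds a single valid JSON value.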


@pirate commented on GitHub (Jan 30, 2019):

Closing this due to inactivity. If you comment back I'll reopen it.
