mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 01:26:00 +03:00
[GH-ISSUE #1347] Bug: JSONDecodeError when trying to load JSON with an array at the top-level #825
Originally created by @philippemilink on GitHub (Feb 15, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1347
With ArchiveBox version 0.6.2, I used to import URLs stored in JSON files with content looking like the following:
Everything worked well.
With version 0.7.2, however, I have a `JSONDecodeError` exception during the import:
The error is caused by the following line, introduced by commit `aaca74f`:
github.com/ArchiveBox/ArchiveBox@3ad32509e9/archivebox/parsers/generic_json.py (L22)
@pirate commented on GitHub (Feb 21, 2024):
ah interesting, all the files I tested with had an object at the top level instead of a list
I can add handling for lists pretty easily, there are so many different JSON formats to support haha
@jimwins commented on GitHub (Feb 26, 2024):
Instead of trying to figure out what is going on when the first line of the JSON file is garbage, it would be easier to try not skipping it first, and then try again after skipping it if that fails.
But maybe even this isn't necessary? It looks like the original "skip the first line" logic came about because ArchiveBox would add the filename to the file as the first line when putting in the sources directory, but that doesn't seem to happen any more (which seems like a much, much better way to go).
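The "parse as-is first, then retry without the first line" idea can be sketched roughly as follows. This is a minimal illustration, not the patch discussed in this thread; the function name `parse_json_with_fallback` is hypothetical and not part of ArchiveBox.

```python
import json


def parse_json_with_fallback(text: str):
    """Parse `text` as JSON; if that fails, drop the first line and retry.

    Hypothetical helper for illustration, not ArchiveBox's actual code.
    """
    try:
        # First attempt: assume the whole file is valid JSON.
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: older source files may start with a non-JSON line
        # (e.g. a filename), so skip it and try once more.
        _, _, remainder = text.partition("\n")
        return json.loads(remainder)
```

With this ordering, a clean file with a top-level array or object is never altered before parsing; the first line is only discarded when the untouched input has already failed.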
@pirate commented on GitHub (Feb 29, 2024):
Yes, but I want to keep the workaround logic as a fallback because users still have the old "filename as first line" style imports in their `sources/` dir and they might want to re-import their sources again later on.
I do agree we should move it to a `try: except:` fallback though as you showed above.
@jimwins commented on GitHub (Feb 29, 2024):
I don't see how the existing workaround ever worked for anything because it chops off everything before the first `{`, which includes the `[`, and the rest of the parsing assumes that `links` is a list.
@pirate commented on GitHub (Feb 29, 2024):
Ah sorry, it was long enough ago that I don't remember what it was for exactly... maybe it was to handle an extra newline at the start, or maybe I thought I was handling a JSON object at the top level instead of JSONL?
Either way I'm down to change it, this parser is broken enough that it's not useful in its current state anyway.
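The failure mode described in this exchange can be reproduced in isolation. The snippet below assumes the workaround amounts to slicing the text from the first `{`, as described above; it does not reproduce the actual line in `generic_json.py`, and the sample input is made up.

```python
import json

# Example input with an array at the top level (made up for illustration).
text = '[{"url": "https://example.com"}, {"url": "https://example.org"}]'

# Parsing the whole document works and yields a list.
links = json.loads(text)

# Dropping everything before the first "{" also drops the opening "[",
# leaving an object followed by trailing data, which is not valid JSON.
truncated = text[text.index("{"):]
try:
    json.loads(truncated)
    truncated_ok = True
except json.JSONDecodeError:
    truncated_ok = False
```

Here `links` is a two-element list, while parsing `truncated` raises `JSONDecodeError`, so `truncated_ok` ends up `False`.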
@jimwins commented on GitHub (Feb 29, 2024):
Handling JSONL wouldn’t be hard to add as another fallback. We could try JSON, then JSONL, and then try them both again without the first line to handle old source lists that had that extra line added.
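One way to picture that ordering (JSON, then JSONL, then both again without the first line) is the sketch below. It is an assumption-laden illustration rather than the code that was eventually merged; `parse_jsonl` and `parse_links` are hypothetical names.

```python
import json


def parse_jsonl(text: str) -> list:
    """Parse JSON Lines: one JSON value per non-empty line."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        # Treat empty input as a failure so the caller can move on.
        raise ValueError("no JSON lines found")
    return [json.loads(line) for line in lines]


def parse_links(text: str):
    """Try JSON, then JSONL, then both again without the first line.

    Hypothetical helper; returns the result of the first parser that succeeds.
    """
    _, _, without_first_line = text.partition("\n")
    attempts = [
        (json.loads, text),
        (parse_jsonl, text),
        (json.loads, without_first_line),
        (parse_jsonl, without_first_line),
    ]
    last_error = None
    for parser, candidate in attempts:
        try:
            return parser(candidate)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            last_error = err
    raise last_error
```

Keeping the attempts in a list makes the precedence explicit: well-formed input is always parsed untouched, and the first line is only dropped after both formats have failed on the full text.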
@pirate commented on GitHub (Mar 22, 2024):
This is done, thanks again @jimwins for all your great work here!
Will be out in the next release, or pull `:dev` to get it early.