[PR #1370] [MERGED] Add generic_jsonl parser #4404


📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1370
Author: @jimwins
Created: 3/1/2024
Status: Merged
Merged: 3/15/2024
Merged by: @pirate

Base: dev ← Head: issue-1369


📝 Commits (10+)

  • 4e69d2c Add EXTRA_*_ARGS for wget, curl, and singlefile
  • ab8f395 Add YOUTUBEDL_EXTRA_ARGS
  • 4d9c5a7 Add CHROME_EXTRA_ARGS
  • 22f9a28 Use feedparser for RSS parsing in generic_rss and pinboard_rss parsers
  • 68326a6 Add cookies file to http request in download_url
  • fe11e1c check if COOKIE_FILE is file
  • 89ab18c Add generic_jsonl parser
  • a577d1e Merge branch 'dev' into title-cookies-file
  • 1f828d9 Add tests for generic_rss and pinboard_rss parsers
  • 9f462a8 Use feedparser for RSS parsing in generic_rss and pinboard_rss parsers

📊 Changes

23 files changed (+495 additions, -174 deletions)


📝 archivebox/config.py (+47 -23)
📝 archivebox/extractors/archive_org.py (+9 -2)
📝 archivebox/extractors/favicon.py (+13 -3)
📝 archivebox/extractors/headers.py (+9 -3)
📝 archivebox/extractors/media.py (+9 -2)
📝 archivebox/extractors/mercury.py (+10 -4)
📝 archivebox/extractors/singlefile.py (+6 -19)
📝 archivebox/extractors/title.py (+9 -2)
📝 archivebox/extractors/wget.py (+10 -3)
📝 archivebox/parsers/__init__.py (+2 -0)
📝 archivebox/parsers/generic_json.py (+57 -53)
➕ archivebox/parsers/generic_jsonl.py (+34 -0)
📝 archivebox/parsers/generic_rss.py (+20 -28)
📝 archivebox/parsers/pinboard_rss.py (+16 -25)
📝 archivebox/util.py (+40 -5)
📝 bin/test.sh (+1 -1)
📝 pyproject.toml (+3 -0)
📝 tests/mock_server/server.py (+1 -1)
➕ tests/mock_server/templates/example-single.jsonl (+1 -0)
➕ tests/mock_server/templates/example.atom (+24 -0)

...and 3 more files

📄 Description

Adds a JSONL parser and also fixes the JSON parser to reject what it suspects is a single-line JSONL file.
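
To illustrate the behavior described above, here is a minimal sketch, not the code from this PR. The function names `parse_jsonl_urls` and `parse_json_urls`, the `url` field, and the "top-level object means JSONL" heuristic are all assumptions chosen for illustration; the actual parsers in `archivebox/parsers/` may detect and handle these cases differently.

```python
import json
from typing import Iterable, Iterator, List


def parse_jsonl_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "url" field of each non-blank JSON Lines record (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines between records are skipped
        record = json.loads(line)  # each line must be a complete JSON value
        if isinstance(record, dict) and record.get("url"):
            yield record["url"]


def parse_json_urls(text: str) -> List[str]:
    """Parse a JSON list of records, refusing input that looks like JSONL (hypothetical helper)."""
    # Multi-line JSONL already fails here with json.JSONDecodeError ("Extra data").
    data = json.loads(text)
    if not isinstance(data, list):
        # Assumed heuristic: a bare top-level object is probably a single
        # JSONL record rather than a JSON export, so reject it.
        raise ValueError("expected a JSON list of objects; input may be JSONL")
    return [
        item["url"]
        for item in data
        if isinstance(item, dict) and item.get("url")
    ]
```

Under these assumptions, a one-line file such as `{"url": "https://example.com/a"}` is rejected by `parse_json_urls` and can then be picked up by `parse_jsonl_urls`, so each input is handled by the parser that matches its format.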

Changes these areas

  • [ ] Bugfixes
  • [X] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [ ] Snapshot data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
