[PR #1743] [MERGED] Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6) #1493

Closed
opened 2026-03-01 14:50:01 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1743
Author: @pirate
Created: 12/31/2025
Status: Merged
Merged: 12/31/2025
Merged by: @pirate

Base: dev ← Head: claude/review-code-quality-macdO


📝 Commits (2)

  • f3e11b6 Implement JSONL CLI pipeline architecture (Phases 1-4, 6)
  • bb52b59 Add unit tests for JSONL CLI pipeline commands (Phase 5 & 6)

📊 Changes

18 files changed (+1711 additions, -150 deletions)

View changed files

📝 TODO_archivebox_jsonl_cli.md (+17 -17)
📝 archivebox/cli/archivebox_archiveresult.py (+35 -20)
📝 archivebox/cli/archivebox_binary.py (+1 -15)
📝 archivebox/cli/archivebox_crawl.py (+36 -17)
📝 archivebox/cli/archivebox_machine.py (+1 -15)
📝 archivebox/cli/archivebox_process.py (+1 -15)
📝 archivebox/cli/archivebox_run.py (+68 -16)
📝 archivebox/cli/archivebox_snapshot.py (+18 -18)
📝 archivebox/cli/archivebox_tag.py (+1 -15)
➕ archivebox/cli/cli_utils.py (+46 -0)
📝 archivebox/cli/tests_piping.py (+124 -0)
📝 archivebox/core/models.py (+91 -1)
➕ archivebox/tests/conftest.py (+218 -0)
➕ archivebox/tests/test_cli_archiveresult.py (+264 -0)
➕ archivebox/tests/test_cli_crawl.py (+261 -0)
➕ archivebox/tests/test_cli_run.py (+254 -0)
➕ archivebox/tests/test_cli_snapshot.py (+274 -0)
📝 archivebox/workers/supervisord_util.py (+1 -1)

📄 Description

Summary

Related issues

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

Summary by cubic

Adds pass-through and create-or-update behavior to the JSONL CLI pipeline commands, centralizes the shared filter utilities in cli_utils.py, and adds pytest fixtures and unit tests for the crawl, snapshot, archiveresult, and run commands. Together these changes make the commands easier to chain with pipes and let run process piped records directly.

  • New Features
    • Pass-through added to crawl/snapshot/archiveresult create: non-target records are output unchanged.
    • run now supports create-or-update for Crawl/Snapshot/ArchiveResult and outputs processed records for chaining; cascades Crawl → Snapshots → ArchiveResults.
    • Shared apply_filters utility added (cli_utils.py) and used across 7 CLI files to remove duplication.
    • ArchiveResult.from_json()/from_jsonl() implemented; Snapshot.to_json now emits tags_str.
    • Supervisord default updated to use archivebox run instead of manage orchestrator.
    • New pytest fixtures and CLI tests for crawl, snapshot, archiveresult, run, plus pass-through and pipeline accumulation cases.
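The pass-through behavior listed above can be pictured with a small sketch. This is not the code from the PR: the function name `passthrough_create`, the `"type"` field used to tell records apart, and the `"processed"` marker are all assumptions made for illustration. The idea is only that a pipeline stage acts on records of its own type and re-emits everything else unchanged, so later stages still see the full stream.

```python
import json
from typing import Iterable, Iterator


def passthrough_create(lines: Iterable[str], target_type: str) -> Iterator[dict]:
    """Yield every JSONL record; only records of `target_type` are modified.

    Hypothetical sketch: the "type" key and the "processed" marker are
    assumptions, not the actual ArchiveBox record schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("type") == target_type:
            # Stand-in for the real create step of this pipeline stage.
            record = {**record, "processed": True}
        # Non-target records fall through here and are emitted unchanged.
        yield record


if __name__ == "__main__":
    stdin_lines = [
        '{"type": "Crawl", "id": "c1"}',
        '{"type": "Snapshot", "url": "https://example.com"}',
    ]
    for rec in passthrough_create(stdin_lines, target_type="Snapshot"):
        print(json.dumps(rec))
```

Because each stage re-emits what it does not handle, the output of one command can be piped into the next without losing records created earlier in the pipeline.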

Written for commit bb52b5902a. Summary will update on new commits.
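The summary mentions a shared apply_filters utility replacing duplicated filtering code in several CLI files. Its real signature is not shown in this PR page, so the following is only a sketch of the general pattern, using a hypothetical function `apply_filters_sketch` that works on plain dictionaries: optional filter values that were not supplied (None) are ignored, and the remaining ones must all match.

```python
from typing import Any, Dict, Iterable, List, Optional


def apply_filters_sketch(
    records: Iterable[Dict[str, Any]],
    filters: Dict[str, Optional[Any]],
) -> List[Dict[str, Any]]:
    """Keep records that match every filter whose value is not None.

    Hypothetical stand-in; the real utility in cli_utils.py may operate
    on database querysets and accept different arguments.
    """
    # Drop filters the user did not pass on the command line.
    active = {key: value for key, value in filters.items() if value is not None}
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in active.items())
    ]


if __name__ == "__main__":
    rows = [
        {"status": "queued", "url": "https://example.com/a"},
        {"status": "done", "url": "https://example.com/b"},
    ]
    print(apply_filters_sketch(rows, {"status": "done", "url": None}))
```

Keeping this logic in one helper means each list-style command only has to collect its options into a dictionary and call the helper, instead of repeating the same None checks.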


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
