[PR #1736] [MERGED] Review output file paths and data directory structure #4501

Closed
opened 2026-03-15 01:47:58 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1736
Author: @pirate
Created: 12/31/2025
Status: Merged
Merged: 12/31/2025
Merged by: @pirate

Base: dev ← Head: claude/review-output-paths-zk0BI


📝 Commits (1)

  • 04c23ba Fix output path structure for 0.9.x data directory

📊 Changes

3 files changed (+41 additions, -3 deletions)


📝 archivebox/config/configset.py (+4 -0)
📝 archivebox/crawls/models.py (+36 -2)
📝 archivebox/plugins/chrome/on_Snapshot__20_chrome_tab.bg.js (+1 -1)

📄 Description

  • Update Crawl.output_dir_parent to use username instead of user_id for consistency with Snapshot paths
  • Add domain from first URL to Crawl path structure for easier debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
  • Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab can find the shared Chrome session from the Crawl
  • Update comment in chrome_tab hook to reflect new config source
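
The path layout described above can be sketched roughly as follows. This is a minimal illustration, not the code from archivebox/crawls/models.py: the helper names `crawl_output_dir` and `hook_config`, their parameters, and the use of a plain dict for the hook config are all assumptions made for the example. Only the path template and the CRAWL_OUTPUT_DIR key come from the PR description.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def crawl_output_dir(username: str, created_at: datetime,
                     domain: str, crawl_id: str) -> PurePosixPath:
    """Build users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}.

    Hypothetical helper: the path is relative to the data directory,
    and the argument names are illustrative only.
    """
    date_part = created_at.strftime("%Y%m%d")  # e.g. 20251231
    return PurePosixPath("users") / username / "crawls" / date_part / domain / crawl_id


def hook_config(base_config: dict, crawl_dir: PurePosixPath) -> dict:
    """Return a copy of the config with CRAWL_OUTPUT_DIR added.

    Assumption: Snapshot hooks receive a flat key/value config, so a hook
    such as chrome_tab could look up the crawl's directory by this key.
    """
    return {**base_config, "CRAWL_OUTPUT_DIR": str(crawl_dir)}


if __name__ == "__main__":
    crawl_dir = crawl_output_dir(
        username="alice",
        created_at=datetime(2025, 12, 31, tzinfo=timezone.utc),
        domain="example.com",
        crawl_id="abc123",
    )
    print(crawl_dir)
    print(hook_config({"TIMEOUT": "60"}, crawl_dir))
```

Keying the directory on the username rather than the numeric user_id keeps Crawl paths readable and aligned with the Snapshot paths mentioned in the description.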

Summary

Related issues

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

Summary by cubic

Standardize crawl output paths and improve hook configuration. Paths now use usernames and include the first URL’s domain, and hooks get CRAWL_OUTPUT_DIR so chrome_tab can reuse the crawl’s Chrome session.

  • Refactors
    • Use users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/ for crawl output paths (replaces user_id, adds domain).
    • Add domain extraction that handles ports, subdomains, and file: and data: URLs.
    • Pass CRAWL_OUTPUT_DIR in config to Snapshot hooks; chrome_tab reads it to find the shared Chrome session.
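
As a rough illustration of the domain extraction mentioned above, the sketch below derives a directory-safe name from a crawl's first URL. It is not the implementation merged in this PR: the function name `domain_for_path`, the placeholder names used for file: and data: URLs, and the `_` separator for ports are assumptions chosen for the example.

```python
import re
from urllib.parse import urlsplit


def domain_for_path(url: str) -> str:
    """Return a filesystem-safe domain segment for a URL (hypothetical helper)."""
    parts = urlsplit(url)

    # file: and data: URLs have no meaningful host, so use fixed placeholders.
    # These placeholder names are assumptions, not taken from the PR.
    if parts.scheme == "file":
        return "localhost"
    if parts.scheme == "data":
        return "data"

    host = parts.hostname or "unknown"  # hostname is already lowercased

    # .port raises ValueError for a malformed port, so guard the lookup.
    try:
        port = parts.port
    except ValueError:
        port = None
    if port is not None:
        host = f"{host}_{port}"  # avoid ':' in directory names

    # Replace anything outside a conservative safe set.
    return re.sub(r"[^A-Za-z0-9._-]", "_", host)


if __name__ == "__main__":
    for example in (
        "https://sub.example.com:8080/page",
        "https://example.com/",
        "file:///tmp/page.html",
        "data:text/plain,hello",
    ):
        print(example, "->", domain_for_path(example))
```

Subdomains are kept intact here so that crawls of different hosts under the same registered domain land in separate directories.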

Written for commit 04c23badc2. Summary will update on new commits.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
