[PR #1736] [MERGED] Review output file paths and data directory structure #4501

Closed

opened 2026-03-15 01:47:58 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1736
Author: @pirate
Created: 12/31/2025
Status: ✅ Merged
Merged: 12/31/2025
Merged by: @pirate

Base: dev ← Head: claude/review-output-paths-zk0BI

📝 Commits (1)

04c23ba Fix output path structure for 0.9.x data directory

📊 Changes

3 files changed (+41 additions, -3 deletions)

📝 archivebox/config/configset.py (+4 -0)
📝 archivebox/crawls/models.py (+36 -2)
📝 archivebox/plugins/chrome/on_Snapshot__20_chrome_tab.bg.js (+1 -1)

📄 Description

- Update Crawl.output_dir_parent to use username instead of user_id, for consistency with Snapshot paths.
- Add the domain of the first URL to the Crawl path structure for easier debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to the config passed to Snapshot hooks so that chrome_tab can find the shared Chrome session from the Crawl.
- Update the comment in the chrome_tab hook to reflect the new config source.
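
To make the new layout concrete, here is a minimal sketch of how a path of the form users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/ could be assembled. The function names `crawl_output_dir_parent` and `crawl_output_dir` and their parameters are hypothetical and chosen for illustration; the actual code in archivebox/crawls/models.py may be structured differently.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def crawl_output_dir_parent(username: str, created_at: datetime, domain: str) -> PurePosixPath:
    """Hypothetical helper: build users/{username}/crawls/YYYYMMDD/{domain}."""
    date_str = created_at.strftime("%Y%m%d")  # e.g. 20251231
    return PurePosixPath("users") / username / "crawls" / date_str / domain


def crawl_output_dir(username: str, created_at: datetime, domain: str, crawl_id: str) -> PurePosixPath:
    """Hypothetical helper: append the crawl ID to the parent directory."""
    return crawl_output_dir_parent(username, created_at, domain) / crawl_id


if __name__ == "__main__":
    created = datetime(2025, 12, 31, tzinfo=timezone.utc)
    print(crawl_output_dir("alice", created, "example.com", "abc123"))
```

Keying the path on the username rather than a numeric user_id keeps Crawl directories aligned with the Snapshot layout that the description refers to.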

Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk

Summary by cubic

Standardize crawl output paths and improve hook configuration. Paths now use usernames and include the first URL’s domain, and hooks get CRAWL_OUTPUT_DIR so chrome_tab can reuse the crawl’s Chrome session.
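
As a rough illustration of the hook side of this change, the sketch below shows one way a CRAWL_OUTPUT_DIR entry could be added to a config mapping and then read back to locate a shared session directory. Everything except the CRAWL_OUTPUT_DIR key is an assumption: the helper names, the plain-dict config, and the `chrome` subdirectory are hypothetical, and the real hook in this PR is JavaScript (on_Snapshot__20_chrome_tab.bg.js), not Python.

```python
from pathlib import PurePosixPath
from typing import Optional


def build_snapshot_hook_config(base_config: dict, crawl_output_dir: str) -> dict:
    """Hypothetical: copy the base config and add CRAWL_OUTPUT_DIR for hooks."""
    config = dict(base_config)  # copy so the caller's config is not mutated
    config["CRAWL_OUTPUT_DIR"] = str(crawl_output_dir)
    return config


def find_shared_chrome_dir(config: dict, subdir: str = "chrome") -> Optional[PurePosixPath]:
    """Hypothetical: resolve the shared Chrome session dir under the crawl dir.

    The "chrome" subdirectory name is an assumption made for this sketch.
    """
    crawl_dir = config.get("CRAWL_OUTPUT_DIR")
    if not crawl_dir:
        return None  # no crawl context, so no shared session to reuse
    return PurePosixPath(crawl_dir) / subdir


if __name__ == "__main__":
    cfg = build_snapshot_hook_config(
        {"TIMEOUT": 60},
        "users/alice/crawls/20251231/example.com/abc123",
    )
    print(find_shared_chrome_dir(cfg))
```

Returning `None` when the key is absent lets a hook fall back to starting its own session instead of failing outright; whether the actual hook behaves this way is not stated in the PR.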

Refactors
- Use users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/ for crawl output paths (replaces user_id, adds domain).
- Add domain extraction that handles ports, subdomains, file, and data URLs.
- Pass CRAWL_OUTPUT_DIR in config to Snapshot hooks; chrome_tab reads it to find the shared Chrome session.
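
The domain extraction mentioned above could look roughly like the following sketch. The PR summary only says that ports, subdomains, file URLs, and data URLs are handled; how each case is mapped to a directory name is not stated, so the choices here (keeping subdomains, appending the port with an underscore, using the scheme name for file and data URLs, and an `unknown` fallback) are assumptions, and `domain_for_path` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit


def domain_for_path(url: str) -> str:
    """Hypothetical helper: derive a filesystem-safe directory name from a URL."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return "unknown"  # malformed URL (e.g. unbalanced IPv6 brackets)

    scheme = parts.scheme.lower()
    if scheme in ("file", "data"):
        return scheme  # assumption: these URLs have no host, so use the scheme

    host = parts.hostname or ""  # hostname is lowercased and excludes the port
    if not host:
        return "unknown"

    try:
        port = parts.port
    except ValueError:
        port = None  # non-numeric or out-of-range port

    name = f"{host}_{port}" if port else host
    # Replace anything that is unsafe in a directory name (e.g. ':' in IPv6 hosts).
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


if __name__ == "__main__":
    print(domain_for_path("https://sub.example.com:8080/page"))  # sub.example.com_8080
    print(domain_for_path("file:///tmp/index.html"))             # file
    print(domain_for_path("data:text/html,<h1>hi</h1>"))         # data
```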

Written for commit 04c23badc20e17273e2b7d9ede13a0ce69370c1a. Summary will update on new commits.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


kerem closed this issue and added the pull-request label on 2026-03-15 01:47:58 +03:00.
