[PR #1736] Review output file paths and data directory structure #2997


Closed

opened 2026-03-01 18:01:21 +03:00 by kerem · 0 comments

kerem commented

2026-03-01 18:01:21 +03:00

Owner

Original Pull Request: https://github.com/ArchiveBox/ArchiveBox/pull/1736

State: closed
Merged: Yes

- Update Crawl.output_dir_parent to use username instead of user_id for consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier debugging: users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source
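
As a rough illustration of the layout described above, here is a minimal sketch of how such a path and hook config could be assembled. The function names `crawl_output_dir` and `snapshot_hook_config` are hypothetical and not taken from the ArchiveBox codebase; only the path template and the `CRAWL_OUTPUT_DIR` key come from the PR description.

```python
from datetime import date
from pathlib import PurePosixPath


def crawl_output_dir(username: str, created: date, domain: str, crawl_id: str) -> PurePosixPath:
    """Build users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}, relative to the data dir.

    Hypothetical helper: the real logic lives on the Crawl model in ArchiveBox.
    """
    day = created.strftime("%Y%m%d")  # e.g. 20260301
    return PurePosixPath("users") / username / "crawls" / day / domain / crawl_id


def snapshot_hook_config(base_config: dict, crawl_dir: PurePosixPath) -> dict:
    """Return a copy of the config with CRAWL_OUTPUT_DIR added for Snapshot hooks.

    Hypothetical helper: shows the idea of passing the crawl directory down to
    hooks so they can locate resources shared at the Crawl level.
    """
    return {**base_config, "CRAWL_OUTPUT_DIR": str(crawl_dir)}
```

Under these assumptions, a hook such as chrome_tab would read `CRAWL_OUTPUT_DIR` from the config it receives instead of deriving the crawl directory itself.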

Summary

Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk

Summary by cubic

Standardize crawl output paths and improve hook configuration. Paths now use usernames and include the first URL’s domain, and hooks get CRAWL_OUTPUT_DIR so chrome_tab can reuse the crawl’s Chrome session.

Refactors
- Use users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/ for crawl output paths (replaces user_id, adds domain).
- Add domain extraction that handles ports, subdomains, file, and data URLs.
- Pass CRAWL_OUTPUT_DIR in config to Snapshot hooks; chrome_tab reads it to find the shared Chrome session.
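
The domain extraction mentioned in the list above might look roughly like the following. This is a sketch under stated assumptions, not the PR's actual code: the function name `domain_for_path`, the `_` separator before the port, and the `file`, `data`, and `unknown` placeholder names are all hypothetical choices made for illustration.

```python
import re
from urllib.parse import urlsplit


def domain_for_path(url: str) -> str:
    """Derive a filesystem-safe directory name from a URL (hypothetical helper)."""
    parts = urlsplit(url)

    # file: and data: URLs have no meaningful host, so fall back to the scheme
    # (assumed placeholder; the PR may use a different name).
    if parts.scheme in ("file", "data"):
        return parts.scheme

    host = parts.hostname or "unknown"  # hostname is lowercased; subdomains are kept

    try:
        port = parts.port  # raises ValueError if the port is not a valid number
    except ValueError:
        port = None
    if port is not None:
        host = f"{host}_{port}"  # assumed separator, avoids ':' in directory names

    # Replace anything that is not safe in a path segment.
    return re.sub(r"[^A-Za-z0-9._-]", "_", host)
```

For example, `domain_for_path("https://docs.example.com:8080/page")` yields `docs.example.com_8080` with these choices.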

Written for commit 04c23badc20e17273e2b7d9ede13a0ce69370c1a. Summary will update on new commits.


kerem closed this issue and added the pull-request label on 2026-03-01 18:01:21 +03:00.
