[GH-ISSUE #637] Add SNAPSHOT_ADD_STRATEGY=depth|breadth config flag to set archiving strategy #3415

Open
opened 2026-03-14 22:45:24 +03:00 by kerem · 0 comments
Owner

Originally created by @pirate on GitHub (Jan 30, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/637

This would let the user choose the order in which work is done when adding multiple URLs at once.

Do you want to archive each URL with every extractor before moving on to the next URL in the list?

SNAPSHOT_ADD_STRATEGY=depth
for url in urls_to_snapshot:
    for extractor in extractors:
        run_extractor(extractor, url)

Or do you want to first get the title & headers for every URL, then go down the list again and get the media/git, ... through each extractor?

SNAPSHOT_ADD_STRATEGY=breadth
for extractor in extractors:
    for url in urls_to_snapshot:
        run_extractor(extractor, url)
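
The two loops above differ only in which loop is on the outside, so the flag could be handled by one small dispatch function. This is a minimal sketch, not ArchiveBox's actual implementation: `get_strategy`, `plan_runs`, and the `depth` default are hypothetical and chosen purely for illustration.

```python
import os


def get_strategy() -> str:
    """Read the proposed flag from the environment (the "depth" default is an assumption)."""
    return os.environ.get("SNAPSHOT_ADD_STRATEGY", "depth").strip().lower()


def plan_runs(urls, extractors, strategy="depth"):
    """Return (extractor, url) pairs in the order the given strategy would run them."""
    urls, extractors = list(urls), list(extractors)
    if strategy == "depth":
        # Finish every extractor for one URL before moving to the next URL.
        return [(extractor, url) for url in urls for extractor in extractors]
    if strategy == "breadth":
        # Run one extractor across every URL before moving to the next extractor.
        return [(extractor, url) for extractor in extractors for url in urls]
    raise ValueError(f"SNAPSHOT_ADD_STRATEGY must be 'depth' or 'breadth', got {strategy!r}")
```

A caller would then loop over `plan_runs(urls_to_snapshot, extractors, get_strategy())` and call `run_extractor(extractor, url)` for each pair, so the strategy only affects ordering and not the extractors themselves.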

Depth-first is more straightforward and intuitive (the CLI output is linear and easy to follow).

<img src="https://user-images.githubusercontent.com/511499/106351076-d1920400-62a7-11eb-83fc-ec24cd87112c.png" width="300px"/>

But breadth-first has several benefits if you're willing to stomach the noisy CLI output:

  • gets a first-pass archive of all the sites faster, which is good when archiving is time-critical
  • avoids hammering a single URL 8 times in a row (once per extractor), so load is spread out more evenly
  • potentially easier to parallelize many calls to the same extractor (e.g. with multiple tabs in Puppeteer)
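
To illustrate the parallelization point, here is a rough sketch of breadth-first archiving where each extractor is fanned out across all URLs with a thread pool. It assumes `run_extractor(extractor, url)` is safe to call concurrently, which the issue does not confirm; `run_breadth_first` and `max_workers` are hypothetical names, not part of ArchiveBox.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def run_breadth_first(urls, extractors, run_extractor, max_workers=4):
    """Run each extractor over all URLs in parallel, one extractor at a time."""
    urls = list(urls)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for extractor in extractors:
            # Bind the current extractor so each worker only needs a URL.
            run_one = partial(run_extractor, extractor)
            # pool.map yields outputs in the same order as `urls`.
            for url, output in zip(urls, pool.map(run_one, urls)):
                results[(extractor, url)] = output
    return results
```

Because the next extractor only starts once the previous one has finished for every URL, each site still sees its requests spread out over time rather than back to back.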