[GH-ISSUE #226] Follow and archive all links to a given depth. Similar to #191 but not restricted to a domain. #3174

Closed
opened 2026-03-14 21:26:41 +03:00 by kerem · 1 comment

Originally created by @knowncolor on GitHub (Apr 30, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/226

Similar to #191, I would like ArchiveBox to automatically follow and archive links up to a certain depth across domains.

This is a fantastic project!

kerem closed this issue 2026-03-14 21:26:46 +03:00

@pirate commented on GitHub (Apr 30, 2019):

Thanks! I'm going to close this and tweak #191 slightly to make it clearer that it'll cover this use case as well 😁

The idea is that we'll expose the same flags on ArchiveBox as are available on `wget` itself:

  • `--mirror`
  • `--level=5`
  • `--span-hosts`
  • `--recursive`
  • `--no-parent`

https://www.gnu.org/software/wget/manual/wget.html#Recursive-Retrieval-Options-1

These flags together should cover all the use cases: archiving an entire domain, archiving an entire domain but only below the current directory level, and archiving recursively from a single page across multiple domains to a given depth.
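As a sketch of how those `wget` flags map onto the three use cases (this is plain `wget` usage, not ArchiveBox syntax, and the `example.com` URLs are placeholders):

```shell
# 1. Archive recursively from a single page, following links across
#    multiple domains (--span-hosts) up to 5 levels deep (--level=5):
wget --recursive --level=5 --span-hosts \
     --page-requisites --convert-links \
     'https://example.com/start-page.html'

# 2. Archive an entire domain: --mirror implies --recursive,
#    --level=inf, and timestamping, and stays on the starting host:
wget --mirror --page-requisites --convert-links 'https://example.com/'

# 3. Archive a domain but only below the current directory level,
#    using --no-parent to avoid ascending above the start path:
wget --mirror --no-parent --page-requisites --convert-links \
     'https://example.com/docs/'
```

`--page-requisites` and `--convert-links` are added here so each saved page is viewable offline; they aren't in the flag list above but are commonly paired with recursive retrieval.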

I anticipate it will take a while to get to this point though (3-6 months likely), as we first have to build or integrate a crawler of some sort, and web crawling is an extremely complex process with lots of subtle nuance around configuration and environment.
