[GH-ISSUE #226] Follow and archive all links to a given depth. Similar to #191 but not restricted to a domain. #3174

Closed
opened 2026-03-14 21:26:41 +03:00 by kerem · 1 comment

Originally created by @knowncolor on GitHub (Apr 30, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/226

Similar to #191, I would like ArchiveBox to automatically follow and archive links up to a certain depth across domains.

This is a fantastic project!

kerem closed this issue 2026-03-14 21:26:46 +03:00

@pirate commented on GitHub (Apr 30, 2019):

Thanks! I'm going to close this and tweak #191 slightly to make it clearer that it'll cover this use case as well 😁

The idea is that we'll expose the same flags on ArchiveBox as are available on `wget` itself:

  • `--mirror`
  • `--level=5`
  • `--span-hosts`
  • `--recursive`
  • `--no-parent`

https://www.gnu.org/software/wget/manual/wget.html#Recursive-Retrieval-Options-1

These flags together should cover all the use cases: archiving an entire domain, archiving an entire domain but only below the current directory level, and archiving recursively from a single page across multiple domains to a given depth.
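As a sketch of how those `wget` flags map onto the three use cases (this is plain `wget` usage, not ArchiveBox syntax, and the `example.com` URLs are placeholders):

```shell
# 1. Archive recursively from a single page, following links across
#    multiple domains (--span-hosts) up to 5 levels deep (--level=5):
wget --recursive --level=5 --span-hosts \
     --page-requisites --convert-links \
     'https://example.com/start-page.html'

# 2. Archive an entire domain: --mirror implies --recursive,
#    --level=inf, and timestamping, and stays on the starting host:
wget --mirror --page-requisites --convert-links 'https://example.com/'

# 3. Archive a domain but only below the current directory level,
#    using --no-parent to avoid ascending above the start path:
wget --mirror --no-parent --page-requisites --convert-links \
     'https://example.com/docs/'
```

`--page-requisites` and `--convert-links` are added here so each saved page is viewable offline; they aren't in the flag list above but are commonly paired with recursive retrieval.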

I anticipate it will take a while to get to this point though (3-6 months likely), as we first have to build or integrate a crawler of some sort, and web crawling is an extremely complex process with lots of subtle nuance around configuration and environment.
