[GH-ISSUE #191] Add support for recursive archiving of entire domains, or across domains to a given depth (using a crawler) #133

Open
opened 2026-03-01 14:40:53 +03:00 by kerem · 45 comments

Originally created by @diego898 on GitHub (Mar 22, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/191

Is it in scope to have it be possible to archive:

  • an entire blog (all posts) by only passing the root url?
  • have this archive process only be additive? (even if posts are later deleted, I can browse the local history myself)

@pirate commented on GitHub (Mar 23, 2019):

I'd say this is currently our 2nd most-requested feature :) It's definitely in the roadmap: #120 but not planned anytime soon because it's not ArchiveBox's primary use-case and it's extremely difficult to do well. For now I recommend using another software to do the crawling to produce a list of URLs of all the pages, and then pipe that list into archivebox to do the actual archiving.
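The crawl-elsewhere-then-pipe workflow above can be sketched in a few lines of shell. This is only an illustration, not an official recipe: the file names `crawled-urls.txt`/`urls.txt` and the toy URLs are made up, and the `archivebox add` step assumes you are inside an initialized ArchiveBox data directory.

```shell
# Toy stand-in for real crawler output (one URL per line); in practice this
# file would come from wget --spider, Scrapy, or any other crawler:
printf 'https://example.com/a\nhttps://example.com/b\nhttps://example.com/a\n' > crawled-urls.txt

# Dedupe the crawler output so archivebox isn't fed the same URL twice:
sort -u crawled-urls.txt > urls.txt

# Import the cleaned list (skipped gracefully if archivebox isn't installed):
if command -v archivebox >/dev/null 2>&1; then
  archivebox add < urls.txt
fi
```

Any tool that emits one URL per line can take the place of the first step; ArchiveBox only sees the final list.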

Eventually, I may consider exposing flags on ArchiveBox similar to those available on `wget`:

  • `--mirror`
  • `--level=5`
  • `--span-hosts`
  • `--recursive`
  • `--no-parent`

https://www.gnu.org/software/wget/manual/wget.html#Recursive-Retrieval-Options-1

**These flags together should cover all the use cases:**

  1. archiving an entire domain with all pages
  2. archiving an entire domain but only below the current directory level
  3. archiving recursively from a single page across all domains to a given depth

I anticipate it will take a while to get to this point though (multiple major versions likely), as we first have to build or integrate a crawler of some sort, and web crawling is an extremely complex process with lots of subtle nuance around configuration and environment (see scrapy for inspiration).

The process will also naturally be additive if multi-snapshot support is added: #179.

Unfortunately, doing mirroring / full-site crawling properly is extremely non-trivial, as it involves building or integrating with an existing crawler/spider. Even just the logic to parse URLs out of a page is deceptively complex, and there are tons of intricacies around mirroring that don't need to be considered when doing the kind of single-page archiving that ArchiveBox was designed for.

Currently this is blocked by [setting up our proxy archiver](#130), which has support for deduplicating response data in the WARC files. Then we'll also need to pick a crawler, or integrate with an existing one [from here](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects).

For people landing on this issue and looking for an immediate solution, I recommend using this command (which is exactly what's used by ArchiveBox right now, but with a few recursive options added):

```bash
wget --server-response \
     --no-verbose \
     --adjust-extension \
     --convert-links \
     --force-directories \
     --backup-converted \
     --compression=auto \
     -e robots=off \
     --restrict-file-names=unix \
     --timeout=60 \
     --page-requisites \
     --no-check-certificate \
     --no-hsts \
     --span-hosts \
     --no-parent \
     --recursive \
     --level=2 \
     --warc-file=$(date +%s) \
     --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" \
     https://example.com
```

Set `--level=[n]` to the depth of links you want to follow during archiving, or add `--mirror` and remove `--span-hosts` and `--no-parent` if you want to archive an entire domain.


@bcarothe commented on GitHub (Apr 2, 2019):

I am interested in this feature, particularly in limited depth/level recursion. I would like ArchiveBox to be able to archive a page from a specified URL and then archive any links/content that may not be on the same domain. My particular interest is being able to click around on the archived page and land on the local archived version rather than the actual URL (similar to how the Wayback Machine handles it).

Keep up the good work and good luck!


@theAkito commented on GitHub (Sep 22, 2019):

I downloaded this app and took this ability for granted. Only by accident did I find that it doesn't work yet. When can this be expected to be implemented?


@pirate commented on GitHub (Sep 23, 2019):

Not for a while; it's a very tricky feature to implement natively, and I'd rather integrate an existing crawler and use ArchiveBox just to process the generated stream of URLs. Don't expect this feature anytime soon unless you feel like implementing it yourself. For now you can check out some of the alternative software on the wiki:

https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community


@theAkito commented on GitHub (Sep 23, 2019):

@pirate Thank you very much for the reference. I will look into it for as long as the feature is not implemented in ArchiveBox.


@GlassedSilver commented on GitHub (Apr 11, 2021):

This got pinned I see. 👀

Does this mean this feature can be expected to see some level of work sometime soon?

I swear to God, once this gets added I'll be damn happy. It's something I've been dreaming of having: domain-level snapshotting to go back in time for various endeavors.

Just imagine... browsing playstation.com or the like 20-30 years later, reliving the PS4 and PS5 era in the context of them being retro, and being able to get all those articles etc. back.

When I find old screenshots of websites and my OS that are pushing 15 years or more now, I get nostalgia tickles. A feature like this would go beyond seeing what I saw back in the day; it would let me discover stuff that escaped my eyes entirely at the time, all at solely MY discretion.

The level of archivist comfort here is immeasurable.

This combined with automated re-snapshotting over time... Incredible!

Bonus question: would this also lay the foundation for use cases where I may want to archive not the entire domain, but all items of a certain user's feed?
e.g.: for future reference I may want to "subscribe" to a Twitter user locally and not miss any of their tweets. Surely the crawling method here could be utilized for this as well, right?
Correct me if I'm wrong and dreaming. :)


@pirate commented on GitHub (Apr 11, 2021):

You can already archive everything on a user's feed with the `--depth=1` flag, it just doesn't support depth >1.

However, you can achieve full recursive archiving if you do multiple passes of --depth=1 archiving (breadth-first), e.g.:

```bash
archivebox add --depth=1 https://example.com
archivebox list https://example.com | archivebox add --depth=1
archivebox list https://example.com | archivebox add --depth=1
archivebox list https://example.com | archivebox add --depth=1
...
# etc. as many levels deep as you want; it won't duplicate stuff, so it's safe to re-run until there's nothing new that it discovers
```

This is nice for a number of reasons: you can keep an eye on progress and make sure it's not accidentally downloading all of YouTube, and the URLs are added in order of depth and can be tagged separately during the adding process if you want. It also allows you to rate-limit manually, to avoid being blocked by (or taking down) the site you're archiving.
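The manual passes above can also be wrapped in a small loop. This is only a sketch: the function name `breadth_first_archive` is invented, it assumes an installed `archivebox` plus an initialized data directory, and the `ARCHIVEBOX` variable exists purely so the sketch can be dry-run with a stub command instead of the real binary.

```shell
# Hypothetical wrapper: one initial --depth=1 add, then N breadth-first
# rounds of list-and-re-add, with a crude sleep as a manual rate limit.
breadth_first_archive() {
  url=$1
  passes=$2
  "${ARCHIVEBOX:-archivebox}" add --depth=1 "$url"
  i=0
  while [ "$i" -lt "$passes" ]; do
    "${ARCHIVEBOX:-archivebox}" list "$url" | "${ARCHIVEBOX:-archivebox}" add --depth=1
    sleep 1   # pause between rounds to go easy on the target site
    i=$((i + 1))
  done
}
```

Because re-adding known URLs is a no-op, running more passes than the site's actual depth is harmless; the later rounds simply discover nothing new.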

I really don't want to implement my own full recursive crawler; it's a lot of work and really difficult to maintain. It's also a big support burden, as crawlers constantly break, need fixing, and need extra config options to handle different desired behavior on different kinds of sites. I would much rather people use one of the many great crawlers/spiders that are already available and pipe the URLs into ArchiveBox (e.g. Scrapy): https://scrapy-do.readthedocs.io/en/latest/quick-start.html

As it stands, I'm unlikely to add a crawler directly into ArchiveBox anytime soon because I barely have enough time to maintain ArchiveBox as-is, but I'm not opposed to improving the ergonomics around using it with a crawler via smaller PRs, or reviewing a proposed design if someone wants to contribute a way to build Scrapy or another existing crawler into AB.

This issue is pinned because we get a lot of requests for it and I'd rather make this thread easy to find so people know what the status is.


@larshaendler commented on GitHub (Jun 10, 2021):

@pirate I used a mix of the two suggestions you offered and built myself a workaround. That does the trick for a local version that I can navigate offline.

  1. Run wget first to get a list of all URLs of my page:
     `wget --spider --recursive --no-verbose --output-file=wgetlog.txt https://mydomain.com/`

  2. Run sed to remove all the wget clutter:
     `sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt | sed "s@&@\&amp;@" > myurls.txt`

  3. Manually open myurls.txt to remove any pages that look fishy or don't make sense to keep.

  4. Drop all URLs into archivebox in a single command:
     `xargs archivebox add < ~/Downloads/myurls.txt`

  4a. Alternatively, add each line with its own command:
      `xargs -0 -n 1 archivebox add < <(tr \\n \\0 < ~/Downloads/myurls.txt)`

The resulting pages can be navigated locally because archivebox is intelligent enough to find all linked offline versions, but it is of course not a single-page dump.
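The sed extraction in step 2 can be sanity-checked in isolation. The sample line below is invented to have the shape the pattern expects from `wget --no-verbose` log output (a timestamp, `URL:<url>`, then trailing status text); the `\+` repetition is a GNU sed extension.

```shell
# Feed one fake wget log line through the same two sed stages as step 2:
# the first extracts the URL, the second HTML-escapes the first "&".
line='2021-06-10 12:00:00 URL:https://mydomain.com/page?a=1&b=2 200 OK'
echo "$line" \
  | sed -n 's@.\+ URL:\([^ ]\+\) .\+@\1@p' \
  | sed 's@&@\&amp;@'
# prints: https://mydomain.com/page?a=1&amp;b=2
```

Note that the second stage only escapes the first `&` on each line; URLs with several query parameters would need a `g` flag on that substitution.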


@francwalter commented on GitHub (Jul 2, 2021):

> You can already archive everything on a user's feed with the `--depth=1` flag, it just doesn't support depth >1.
>
> However, you can achieve full recursive archiving if you do multiple passes of `--depth=1` archiving (breadth-first), e.g.:
>
> ```shell
> archivebox add --depth=1 https://example.com
> archivebox list https://example.com | archivebox add --depth=1
> archivebox list https://example.com | archivebox add --depth=1
> archivebox list https://example.com | archivebox add --depth=1
> ...
> # etc as many levels deep as you want, it wont duplicate stuff so it's safe to re-run until theres nothing new that it discovers
> ```
>
> This is nice for a number of reasons, you can keep an eye on progress and make sure it's not accidentally downloading all of youtube by accident, and the URLs are added in order of depth and can be tagged separately during the adding process if you want. It also allows you to manually rate-limit, to avoid being blocked by / taking down the site you're archiving.

Is there a way to exclude urls not within example.org in this setting? Is there an option maybe?
Thanks, frank


@pirate commented on GitHub (Jul 7, 2021):

@francwalter I've added a new option ~~`URL_WHITELIST`~~ `URL_ALLOWLIST` in 5a2c78e for this use case. Here's an example of how to exclude everything except for URLs matching `*.example.com`:

```bash
export URL_ALLOWLIST='^http(s)?:\/\/(.+)?example\.com\/?.*$'

# then run your archivebox commands
archivebox add --depth=1 'https://example.com'
archivebox list https://example.com | archivebox add --depth=1
archivebox list https://example.com | archivebox add --depth=1
...
# all URLs that don't match *.example.com will be excluded, e.g. a link to youtube.com would not be followed
# (note that all assets required to render each page are still archived; URL_DENYLIST/URL_ALLOWLIST does not apply to inline images, css, video, etc.)
```

I've also documented the new allowlist support here: https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#URL_ALLOWLIST

It will be out in the next [release v0.6.3](https://github.com/ArchiveBox/ArchiveBox/pull/721), but if you want to use it early you can run from the `dev` branch: https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch
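You can check which URLs that allowlist regex accepts before committing to an archive run. The snippet below uses GNU `grep -E` purely as a stand-in for the matching ArchiveBox does internally; the test URLs are made up.

```shell
# The allowlist pattern from above, tried against a few candidate URLs:
allow='^http(s)?:\/\/(.+)?example\.com\/?.*$'

echo 'https://sub.example.com/page'  | grep -Eq "$allow" && echo 'kept'
echo 'https://youtube.com/watch?v=x' | grep -Eq "$allow" || echo 'excluded'

# Caveat: because "(.+)?" is not anchored to a dot, unrelated hosts that
# merely end in "example.com" also match:
echo 'https://evil-example.com/' | grep -Eq "$allow" && echo 'also kept!'
```

If that looseness matters for your site, a common tightening is to require a dot before the domain, e.g. `(.+\.)?example\.com` in place of `(.+)?example\.com`; verify any variant against your own URLs before a long crawl.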


@kiwimato commented on GitHub (Dec 12, 2022):

I may be doing something wrong, but it doesn't work for me

```
$ export URL_WHITELIST='^http(s)?:\/\/(.+)?redacted\.com\/?.*$'
$ archivebox add --depth=1 'https://redacted.com'
[i] [2022-12-12 00:28:42] ArchiveBox v0.6.3: archivebox add --depth=1 https://redacted.com
    > /data

[+] [2022-12-12 00:28:42] Adding 1 links to index (crawl depth=1)...
[Errno 2] No such file or directory: 'https:/redacted.com'
    > Saved verbatim input to sources/1670804922-import.txt
    > Parsed 1 URLs from input (Generic TXT)

[*] Starting crawl of 1 sites 1 hop out from starting point
    > Downloading https://redacted.com contents
    > Saved verbatim input to sources/1670804922.925886-crawl-redacted.com.txt
    > Parsed 30 URLs from input (Generic TXT)
    > Found 0 new URLs not already in index

[*] [2022-12-12 00:28:43] Writing 0 links to main index...
    √ ./index.sqlite3

$ archivebox list https://redacted.com | archivebox add --depth=1
[i] [2022-12-12 00:29:55] ArchiveBox v0.6.3: archivebox list https://redacted.com
    > /data

[i] [2022-12-12 00:29:55] ArchiveBox v0.6.3: archivebox add --depth=1
    > /data

[+] [2022-12-12 00:29:56] Adding 2 links to index (crawl depth=1)...
[Errno 21] Is a directory: '/data/archive/1670804464.144965'
[Errno 2] No such file or directory: 'https:/redacted.com'
[Errno 2] No such file or directory: '"redacted.com"'
    > Saved verbatim input to sources/1670804996-import.txt
    > Parsed 2 URLs from input (Generic TXT)

[*] Starting crawl of 1 sites 1 hop out from starting point
    > Downloading https://redacted.com contents
    > Saved verbatim input to sources/1670804996.266747-crawl-redacted.com.txt
    > Parsed 30 URLs from input (Generic TXT)
    > Found 0 new URLs not already in index

[*] [2022-12-12 00:29:56] Writing 0 links to main index...
    √ ./index.sqlite3

$ archivebox list https://redacted.com
[i] [2022-12-12 00:30:20] ArchiveBox v0.6.3: archivebox list https://redacted.com
    > /data

/data/archive/1670804464.144965 https://redacted.com "redacted.com"
$
```


@JustGitting commented on GitHub (Dec 17, 2022):

@kiwimato I've got the same problem. It seems the `archivebox list` command does not output what the piped `archivebox add` command is expecting.

I've modified the list command to the following, and it seems to be working. Feedback and corrections welcome.

I used the substring option as I felt it was more flexible... I'm new to archivebox.

`2>/dev/null` redirects the extra info to null; it cleans up the output of the list command.

```bash
export URL_WHITELIST='^http(s)?:\/\/(.+)?example\.com\/?.*$'

# then run your archivebox commands
archivebox add --depth=1 'https://example.com'
archivebox list --csv url -t substring https://example.com 2>/dev/null | archivebox add --depth=1
```

@kiwimato commented on GitHub (Dec 17, 2022):

@JustGitting I got around this by actually using [browsertrix-crawler](https://github.com/webrecorder/browsertrix-crawler), which is awesome and does this out of the box. It also has the ability to create WACZ files directly, which can then be used with [web-replay-gen](https://github.com/webrecorder/web-replay-gen).

Thank you for posting a solution btw :)


@JustGitting commented on GitHub (Dec 17, 2022):

@kiwimato great to hear you found a workaround. I had a look at browsertrix-crawler, but hoped to avoid needing to run Docker. I like simple commands :-).


@grmrgecko commented on GitHub (Jan 18, 2023):

I'm stuck with using SiteSucker and storing into my own local directory until this is added. There are some sites I want to archive entirely for my personal browsing at a later date, because I do not know if the site owner will continue to keep their site up. For example, I have an archive of http://hampa.ch/ which will be useful for a lot of the things I do personally.


@BenJackGill commented on GitHub (Feb 24, 2023):

+1 for this feature


@KenwoodFox commented on GitHub (Jul 6, 2023):

I'm working out a way to hopefully archive webcomics. The trouble is most comics store their images as separate but very similar web pages, i.e. domain.org/d/20210617, then domain.org/d/20210618 and domain.org/d/20210619. Maybe rather than trying to cram an everything-crawler into ArchiveBox, we could just improve the connections that could be made to more generalized crawlers? Even a Python slot to type in some bs4 or something :3


@melyux commented on GitHub (Jul 11, 2023):

I saw @larshaendler said above that:

> The result pages can be navigated locally because archivebox is intelligent enough to find all linked offline versions. But it is of course not a single page dump.

Is this true? My internal hyperlinks on snapshots seem to go to the originals, not to their archived offline versions already on ArchiveBox. Even without adding a full-blown crawler, I'm sure a much simpler enhancement would be to, upon adding a new snapshot, rewrite all hyperlinks pointing to that snapshot's original URL to now point to the offline version. And check if any hyperlinks on the current snapshot exist offline in the archive and rewrite those to point to the offline versions.
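The link-rewriting enhancement suggested above could look something like this minimal sketch (purely illustrative; `rewrite_links` and the URL-to-snapshot mapping are hypothetical, not existing ArchiveBox code):

```python
import re

def rewrite_links(html, url_to_snapshot):
    """Rewrite href attributes whose target URL is already archived so they
    point at the local snapshot path instead of the original URL (sketch)."""
    def repl(match):
        url = match.group(2)
        local = url_to_snapshot.get(url)  # None if the URL isn't archived yet
        return f'{match.group(1)}{local or url}{match.group(3)}'
    return re.sub(r'(href=")([^"]+)(")', repl, html)
```

A real implementation would also need to normalize URLs (trailing slashes, query strings) before looking them up.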


@KenwoodFox commented on GitHub (Jul 13, 2023):

> I saw @larshaendler said above that:
>
> > The result pages can be navigated locally because archivebox is intelligent enough to find all linked offline versions. But it is of course not a single page dump.
>
> Is this true? My internal hyperlinks on snapshots seem to go to the originals, not to their archived offline versions already on ArchiveBox. Even without adding a full-blown crawler, I'm sure a much simpler enhancement would be to, upon adding a new snapshot, rewrite all hyperlinks pointing to that snapshot's original URL to now point to the offline version. And check if any hyperlinks on the current snapshot exist offline in the archive and rewrite those to point to the offline versions.

I've actually experienced this too; I tried it out just the other day in fact! Even if I archive one path, then archive another with that archived path in it, the links still point back to the original. Doesn't seem to matter what order I archive the sites in.


@KenwoodFox commented on GitHub (Jul 13, 2023):

Can archivebox display WACZ files? Because it would be great to pair that crawler with a Docker archivebox


@pirate commented on GitHub (Aug 16, 2023):

No, we don't have wacz support yet but I'm friends with the folks that designed that spec, and I'd love to integrate with browsertrix / ArchiveWeb.page + ReplayWeb.page at some point in the future to improve our wacz support.

For now there are many higher priority things on my plate, mainly the event-sourcing refactor of the ArchiveBox internals (using django-Huey-monitor), before I'm ready to add new extractors / UI.


@TomLucidor commented on GitHub (Dec 6, 2023):

@pirate this is exactly the feature I would look for (mixing with AI for searching pages) when HTTrack gets antiquated, and thanks to @larshaendler for the idea of extracting URLs using wget first before archiving. What is the current best way to do this assuming we are not using wget (on Windows)?
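As one wget-free answer to that question, the Python standard library alone can extract a page's links for piping into `archivebox add` (a sketch; `LinkCollector` is a hypothetical name, and fetching the page HTML is left to the reader):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute hrefs from one page of HTML; works on any platform,
    no wget required. Feed it fetched HTML and read `.links` afterwards."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))
```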


@Ember-ruby commented on GitHub (Jan 11, 2024):

> @ pirate this is exactly the feature I would look for (mixing with AI for searching pages)

There is literally no use for 'AI' to search pages; it would simply be less effective and far more wasteful than regular search


@TomLucidor commented on GitHub (May 23, 2024):

@Ember-ruby it's about getting data for AI to read and summarize, as part of a use case. That is why the wget method is so important.


@mclaudt commented on GitHub (Sep 15, 2024):

It is advertised as a "powerful, self-hosted internet archiving solution to collect, save, and view websites offline."
In reality, all it does is save a single URL through a nice GUI, which is cool but far from internet archiving.

Perhaps it would be better to reformulate the title to avoid obvious disappointment?


@TomLucidor commented on GitHub (Oct 7, 2024):

@mclaudt seconding this, in the hope that ArchiveBox switches things up; for example, there are whole website-shaped books out there where every page is a chapter.


@pirate commented on GitHub (Oct 7, 2024):

Work has already started on this in v0.8.5 ;)

It's still in early stages but the plan is to have a new pipable/chainable `archivebox crawl` command [as described here](https://github.com/ArchiveBox/ArchiveBox/issues/1363#issuecomment-2050736969). The new data model `core.models.Crawl` is already created and I'm working on the CLI now.

This is the rough roadmap for the roll-out of this feature:

- [x] 1. add `Crawl` model + basic Admin UI for creating/editing Crawl entries
  - [x] implement fields that store the `seed` (urls/parsable text), `depth`, `parser`, `tags`, and `crawler='basic_breadth_first'`
  - [ ] connect `archivebox add --depth=n ...` up to create a new `Crawl` with `Crawl.seed`, `.tags`, `.depth`, etc. set to whatever is passed to `add`
  - [ ] create a new huey job that listens for new `Crawl` entries and starts running them with the given `crawler` implementation, which yields new `Snapshot` rows in DB as it crawls out from the starting `seed` (this job does no archiving itself, it just creates the initial `Snapshot` index rows)
  - [ ] create a new huey job called `extract` that watches for new `Snapshot` index rows and runs the extractors on each, producing output to filesystem and `ArchiveResult` records
- [ ] 2. [NOT FINALIZED, still thinking about other ways to do this] add another field on `Crawl`: `schedule`, which contains a `cron`-style schedule string to make the `Crawl` happen repeatedly on a schedule (takes over the old `archivebox schedule` system + allows it to be managed in the UI now)
  - [ ] create a new huey job called `scheduler` that runs once every minute, kicking off new `Crawl` tasks that are due to run based on their `schedule` field + the current time
- [ ] 3. Expose a new Plugin hook interface to populate the `Crawl.crawler` field, and add a `Crawler.config` field that can take extra args to configure a strategy. This will allow plugins to contribute new crawler implementations. At first the only built-in crawler provided will be `basic_breadth_first`, but new crawlers can be added later that use external systems like `scrapy` or `browsertrix-crawler`, do `depth_first` instead of breadth-first, etc.
  - [ ] expose more config options to manage crawler behavior (e.g. like wget `--span-hosts`, `--no-parent`, `--only-domains=*.xyz.example.com`)

It's all still months away but I finally resolved most of the issues that were making me hesitant to implement this in the past, so I'm more eager to add this now.

I'm also thinking about a shortcut to quickly add crawl tasks from external bookmarking systems; for example, just tagging a URL with depth=1 should tell archivebox to automatically crawl the entire page when adding. I think tags could be a powerful way to configure archivebox behavior in general: we could treat any tag with an equals sign in it as an arg that tells it to do something when archiving.
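For anyone curious what the `basic_breadth_first` strategy boils down to, here is a minimal sketch (the injected `get_links` fetcher and this function's shape are illustrative assumptions, not the actual ArchiveBox implementation):

```python
from collections import deque

def basic_breadth_first(seed_urls, get_links, max_depth):
    """Yield (url, depth) pairs breadth-first out to max_depth.
    `get_links(url)` returns the outbound links of a page; in a real crawler
    it would fetch and parse the page, here it is injected by the caller."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        yield url, depth  # the real system would create a Snapshot row here
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
```

The `seen` set is what keeps a crawl over a cyclic link graph from looping forever.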


@KenwoodFox commented on GitHub (Oct 7, 2024):

Thats exciting news!


@miltuss commented on GitHub (Jan 13, 2025):

I'm looking forward to the ability to archive entire sites with ArchiveBox. This feature has been requested for a long time!


@nodecentral commented on GitHub (Jan 19, 2025):

Just to add, I've been monitoring and waiting for this capability too, so I can have it set up in docker. I'm currently using SiteSucker (iOS/mac) to achieve this, which it does seamlessly, but I would love to see ArchiveBox have it too


@OisinHick commented on GitHub (Jan 24, 2025):

I might need to self-host SiteSucker in that case. Don't really like that, as I'd prefer one tool to do everything, but I need this functionality


@pirate commented on GitHub (Feb 1, 2025):

Been using [ArchiveTeam/grab-site](https://github.com/ArchiveTeam/grab-site) lately to archive entire government domains to WARC files and it's been pretty fast. Currently running at 1GB/min just doing HTML crawling (no JS):

```bash
git clone https://github.com/ArchiveTeam/grab-site
cd grab-site
uv venv --python 3.8
rm pyproject.toml
uv pip install --no-binary lxml click psutil --upgrade -e .
./grab-site 'https://gml.noaa.gov/aftp/'
```

@BriseBolt commented on GitHub (Apr 17, 2025):

Hello, I've been eagerly awaiting this feature in Archivebox for years. Why hasn't it been implemented after all these years? Will it be released soon?


@marcbria commented on GitHub (Apr 22, 2025):

People, I have no relationship to the main developer of this project, and I'm in this thread because I myself think this would be a very interesting feature... but let me remind you that it's rude to demand timelines in open source projects.

Are you willing to develop this functionality yourself?
Are you going to contribute something in return?

I'm sorry, but I think the people who develop the software and offer it altruistically deserve our respect for all the time and affection dedicated to the project, or at least constructive proposals... not complaints and rushing.

Best regards,


@TomLucidor commented on GitHub (May 2, 2025):

@marcbria considering how many people are asking for it, how lucrative it is for business equivalents, and how accessible AI coding agents are getting, it is better to treat this issue as people asking each other for pointers rather than just wanting the devs to do it. Without wider unit/integration tests, and maybe even design docs, it is hard for people to collaborate on this endeavor.


@virtadpt commented on GitHub (May 3, 2025):

Ultimately, for this sort of use case wget seems to work the best. It supports generating WARC files, too.


@pirate commented on GitHub (May 3, 2025):

Hey guys I got an offer I couldn't refuse and I just joined browser-use.com as founding engineer, so ArchiveBox is back to being a side project for now. I was barely making enough to support myself on ArchiveBox clients, let alone my wife and first kid being born in a few months. As a result I'm unlikely to implement this recursive crawling anytime soon, so for now I highly recommend everyone check out:

- [Browsertrix](https://webrecorder.net/browsertrix/#get-started) + https://github.com/webrecorder/browsertrix-crawler + https://ArchiveWeb.page made by my friends at https://webrecorder.net, the best WARC/WACZ-based solution and highest-fidelity archiving of complex sites
- https://linkwarden.app + the [`archivebox2linkwarden.js` migration script](https://gist.github.com/daniel31x13/569233e04a987d86350f467d7ed83f29), the best bookmark management + archiving tool with auto-tagging, collaboration, and many other features
- https://github.com/gildas-lormeau/SingleFile, the best all-in-one-html-file archiving solution; has a CLI, the ability to save to remote storage, and many other features
- https://github.com/karakeep-app/karakeep
- and many more: https://github.com/stars/pirate/lists/internet-archiving + https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community

Browsertrix in particular is incredibly well done, it's way higher-fidelity than ArchiveBox ever was, and it already supports this feature and much more. They also have a very affordable [cloud](https://webrecorder.net/) which I recommend people pay for so that they can survive as a real company better than ArchiveBox did!

ArchiveBox is not abandoned, I plan to continue supporting it and adding small things in my spare time for many years, but it won't get any huge new features until I make enough $ from browser-use to hire my friends to keep growing ArchiveBox 😁

I also would like to direct people to https://github.com/ArchiveBox/archivebox-browser-extension/ which actually has gotten a ton of active development work from @benmuth recently! It's got some awesome new features like offline saving of the URL list and reddit auto-archiving.

All ArchiveBox captured data is in a very simple `archive/<timestamp>/{index.json,singlefile.html,media,...}` format that can be ingested by many other programs with just 15min of vibecoding to write a quick migration script. If you decide to stay on ArchiveBox, our Django migrations can chug along reliably for decades, and your data is easy to export if you decide to switch.
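To illustrate how simple that layout is to ingest, here is a migration-style reader sketch (the `iter_snapshots` helper is hypothetical, not part of ArchiveBox; it only assumes the archive/&lt;timestamp&gt;/index.json layout described above):

```python
import json
from pathlib import Path

def iter_snapshots(archive_dir):
    """Walk an ArchiveBox-style archive/<timestamp>/ tree, yielding
    (timestamp, parsed index.json) pairs for each snapshot directory."""
    for entry in sorted(Path(archive_dir).iterdir()):
        index = entry / "index.json"
        if entry.is_dir() and index.exists():
            yield entry.name, json.loads(index.read_text())
```

From there, each snapshot's `singlefile.html`, `media/`, etc. live alongside its `index.json`.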


@marcbria commented on GitHub (May 7, 2025):

Nick, congrats on your kid and your new job.

You've done incredible work (IMO it's the best tool of this kind I've tried), and if now it's time to park it a bit, since the code is shared, anyone with real urgency can contribute a PR.

In short: don't suffer for the project. It will be here if you get more time in the future.
As obvious as it may seem, sometimes we forget that the important things come first. ;-)


@Techcable commented on GitHub (Jul 20, 2025):

> Hey guys I got an offer I couldn't refuse and I just joined browser-use.com as founding engineer, so ArchiveBox is back to being a side project for now. I was barely making enough to support myself on ArchiveBox clients, let alone my wife and first kid being born in a few months. As a result I'm unlikely to implement this recursive crawling anytime soon, so for now I highly recommend everyone check out:
>
> [...]
>
> ArchiveBox is not abandoned, I plan to continue supporting it and adding small things in my spare time for many years, but it wont get any huge new features until I make enough $ from browser-use to hire my friends to keep growing ArchiveBox 😁

Would it be possible to make a full release in the 0.8.0 series? I'm encountering persistent issues with 0.7.3, like running out of threads and forgetting URLs. I use PikaPods for hosting, and I don't think they'll upgrade to a pre-release.

Even though it is technically pre-release software, the 0.8.0 series seems more stable in practice.


@pirate commented on GitHub (Aug 9, 2025):

I'm a bit reluctant to mic-drop release a 0.8.x "stable" version without providing support for it. Can you run it in Docker instead? Safer and easier that way, especially as the dependencies might be stale sometimes.

There are a couple of migration bugs and schema-design rough edges that I wanted to track down before a main release; I always fear inflicting buggy migrations on people because it's supposed to be durable archiving software that *doesn't* mess up your data.


@hannibalshosting88 commented on GitHub (Sep 1, 2025):

That is the best bad news ever! I am willing to help, although I know next to nothing; it might be a good learning project. The first year is the toughest, but you'll make it! Hope you and yours are healthy and happy and whole!!!


@GiveupEmeraude commented on GitHub (Sep 1, 2025):

We really want this feature. We definitely want to be able to save entire sites with ArchiveBox.


@BriseBolt commented on GitHub (Oct 10, 2025):

Please resume development of this very important and critical feature.


@Aholicknight commented on GitHub (Dec 17, 2025):

How has this issue been open for so long without this being implemented?


@pirate commented on GitHub (Dec 28, 2025):

This is now implemented on `dev` but it's still WIP. `--depth=N` now supports arbitrary depth.

You can see an example in `archivebox/tests/test_recursive_crawl.py`
