[GH-ISSUE #1615] Feature Request: download entire domain #2476

Closed
opened 2026-03-01 17:59:17 +03:00 by kerem · 1 comment
Owner

Originally created by @Paulie420 on GitHub (Dec 11, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1615

Originally assigned to: @pirate on GitHub.

What type of suggestion are you making?

New extractor / type of content to save

What is the problem that your feature request solves?

Is there a way to use ArchiveBox to produce an archive of a website/URL that is more like what archive.org provides? Meaning: we want to pull down ALL the data of a website. All the files, all the videos, all the data that makes up the website. I want to pull down everything that is accessible at example.com, even if it is very large.

What is your proposed solution?

If ArchiveBox were able to do this, my community and I would support the project in the ways I've mentioned here.

What hacks or alternative solutions have you tried to solve the problem?

Using the depth=1 option when adding a URL - but this doesn't provide the results I'm interested in...

Share the entire output of the archivebox version command for the current version you are using.

0.7.2
ArchiveBox v0.7.2 COMMIT_HASH=315c9f3 BUILD_TIME=2024-04-24 22:47:02 1713998822
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-6.1.0-27-amd64-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=3000:3000 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/local/bin/python3.11
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  ARCHIVEBOX_BINARY     v0.7.2          valid     /usr/local/bin/archivebox

 √  CURL_BINARY           v8.5.0          valid     /usr/bin/curl
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget
 √  NODE_BINARY           v20.12.2        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v1.1.46         valid     /app/node_modules/single-file-cli/single-file
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js
 √  GIT_BINARY            v2.39.2         valid     /usr/bin/git
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/yt-dlp
 √  CHROME_BINARY         v124.0.6367.29  valid     /usr/bin/chromium-browser
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None
 -  COOKIES_FILE          -               disabled  None

[i] Data locations:
 √  OUTPUT_DIR            6 files @       valid     /data
 √  SOURCES_DIR           145 files       valid     ./sources
 √  LOGS_DIR              3 files         valid     ./logs
 √  ARCHIVE_DIR           850 files @     valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             21.7 MB         valid     ./index.sqlite3

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [x] It would be nice to have eventually
  • [ ] I'm willing to work on a PR (https://github.com/ArchiveBox/ArchiveBox#archivebox-development) to develop this myself
  • [ ] I have donated money (https://github.com/ArchiveBox/ArchiveBox/wiki/Donations) to go towards fixing this issue

Mini Survey

  • [ ] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up
  • [x] I would pay $10/mo for a hosted version of ArchiveBox if it had this feature
kerem closed this issue 2026-03-01 17:59:17 +03:00
Author
Owner

@pirate commented on GitHub (Dec 18, 2024):

I think you mean you want to download an entire domain? e.g. example.com/*

If so, this is an often-requested feature, but it's not really the main use-case ArchiveBox is trying to serve.

We want to add it but it might take a while. Subscribe to this issue here for progress updates: https://github.com/ArchiveBox/ArchiveBox/issues/191
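
Until that lands, one possible stopgap is to collect the same-domain URLs yourself and hand them to ArchiveBox as a list. The following is only a minimal sketch of a single crawl step, not an ArchiveBox feature: `same_domain_links` and `_LinkCollector` are hypothetical helper names, it uses only the Python standard library, and it assumes you already have a page's HTML as a string. A real crawl would repeat this step for every newly found page and should respect robots.txt and rate limits.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return sorted absolute http(s) URLs found in `html` that share
    the host of `page_url`. Fragments (#...) are stripped so the same
    page is not listed twice."""
    host = urlparse(page_url).netloc.lower()
    collector = _LinkCollector()
    collector.feed(html)

    found = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL, then drop the fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        # Keep only web links on exactly the same host (no subdomains).
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() == host:
            found.add(absolute)
    return sorted(found)
```

The resulting URLs could be written one per line to a file and passed in with `archivebox add < urls.txt`. Matching on the exact host is a deliberate simplification; whether subdomains such as `www.example.com` should count as part of `example.com/*` is a choice left open here.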
