[GH-ISSUE #1717] Feature Request: Better Forum archiving #4042

Closed
opened 2026-03-15 01:22:50 +03:00 by kerem · 1 comment

Originally created by @observeroftime01 on GitHub (Dec 13, 2025).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1717

Originally assigned to: @pirate on GitHub.

What type of suggestion are you making?

Proposing a new feature

What is the problem that your feature request solves?

Let us assume I'm talking about a forum thread with 500 pages, located at https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME/ . The thread of interest has many pages, accessible via https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME/page-2, https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME/page-3, and so on.

As it stands currently, I have to generate a list of 500 individual URLs and add them to ArchiveBox (which will create 500 separate entries, one for every page). This will quickly clutter the dashboard and make finding anything a chore. All it takes is a few threads with many pages from the same forum, and things will quickly become unmanageable.

What is your proposed solution?

I would like to feed ArchiveBox a URL like https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME, be asked how many pages it should download (500 in this example), and at the very least have everything saved under one expanding entry / heading.

The logic for the download URLs does not have to be complicated. The user could provide a template URL like https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME/page-$NUMBER, which will download everything from https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME to https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME/page-500. The resulting download of 500 pages could then be stored under one single heading (perhaps the thread title itself would make sense to use) that expands to show all 500 pages once clicked on.
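To make the template idea concrete, here is a minimal sketch of the expansion logic. It is not ArchiveBox code: the function name `expand_thread_urls` is made up for illustration, and it assumes that page 1 lives at the bare thread URL (the template with `/page-$NUMBER` stripped off), as in the example above.

```python
from string import Template


def expand_thread_urls(template: str, last_page: int) -> list[str]:
    """Return one URL per page, from page 1 through last_page.

    Hypothetical helper, not part of ArchiveBox.
    """
    if last_page < 1:
        raise ValueError("last_page must be at least 1")
    # Assumption: page 1 is the thread URL without the "/page-N" suffix.
    first_page = template.rsplit("/page-", 1)[0]
    page_template = Template(template)
    urls = [first_page]
    # Pages 2..last_page come from substituting $NUMBER in the template.
    urls += [page_template.substitute(NUMBER=n) for n in range(2, last_page + 1)]
    return urls


if __name__ == "__main__":
    template = "https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME/page-$NUMBER"
    for url in expand_thread_urls(template, 3):
        print(url)
```

With `last_page=500` this yields the 500 URLs described above, one per line, which could be saved to a text file and imported as a list.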

I can manage to generate the download URLs using a simple python script myself, so the request to have pages belonging to the same thread be saved under one expanding entry on the dashboard is more urgent.
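As a rough illustration of the grouping I have in mind, the sketch below collapses page URLs into one entry per thread. It says nothing about how ArchiveBox stores snapshots internally; `thread_key`, `page_number` and `group_by_thread` are hypothetical names, and the `/page-N` suffix pattern is an assumption taken from the example forum above.

```python
import re
from collections import defaultdict

# Assumption: pages after the first end in "/page-N", optionally with a trailing slash.
PAGE_SUFFIX = re.compile(r"/page-(\d+)/?$")


def thread_key(url: str) -> str:
    """Strip the page suffix so every page of a thread maps to the same key."""
    return PAGE_SUFFIX.sub("", url).rstrip("/")


def page_number(url: str) -> int:
    """Return the page number encoded in the URL, treating the bare thread URL as page 1."""
    match = PAGE_SUFFIX.search(url)
    return int(match.group(1)) if match else 1


def group_by_thread(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by thread and sort each group numerically by page."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        groups[thread_key(url)].append(url)
    return {key: sorted(pages, key=page_number) for key, pages in groups.items()}
```

Each key would then be the single expandable heading on the dashboard, with the sorted list of pages underneath it.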

I don't know how feasible it is to have navigation within the saved pages themselves (say, to get from saved page https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME/page-2 to https://www.SOMEFORUM.COM/threads/SOME_THREAD_NAME/page-3). Provided that all 500 pages belonging to the same thread end up in the same "collection", navigation through the forum page buttons itself doesn't need to work. Maybe a simple "previous / next" button could be displayed atop the navigation when entering a "collection" created this way, which takes you to the next page in the list?
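The "previous / next" idea only needs an ordered list of the pages in a collection. A minimal sketch, assuming a collection is nothing more than a Python list of page URLs in thread order (the function name `neighbours` is hypothetical and not an ArchiveBox API):

```python
from typing import Optional


def neighbours(pages: list[str], current: str) -> tuple[Optional[str], Optional[str]]:
    """Return the (previous, next) page URLs around `current`.

    Either side is None at the start or end of the collection.
    Raises ValueError if `current` is not in `pages`.
    """
    index = pages.index(current)
    previous_url = pages[index - 1] if index > 0 else None
    next_url = pages[index + 1] if index < len(pages) - 1 else None
    return previous_url, next_url
```

The two returned URLs are all a navigation bar at the top of a saved page would need to render its buttons.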

In any case, if anybody has better suggestions or recommendations on how to back up forum content (and browse it properly) in a straightforward way, I'm all ears. Maybe this is all wildly out of scope, and there are better tools for this particular purpose that I am not aware of.

What hacks or alternative solutions have you tried to solve the problem?

Tried assigning tags to forum threads to aid in navigation / finding pages that belong to the same thread

Share the entire output of the archivebox version command for the current version you are using.

username@box:~/archivebox/data$ archivebox --version
0.7.1
ArchiveBox v0.7.1 Cpython Linux Linux-6.14.0-37-generic-x86_64-with-glibc2.39 x86_64
DEBUG=False IN_DOCKER=False IN_QEMU=False IS_TTY=True TZ=UTC FS_ATOMIC=True FS_REMOTE=False FS_USER=1000:1000 FS_PERMS=644 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.12.3         valid     /usr/bin/python3.12
 √  SQLITE_BINARY         v2.6.0          valid     /usr/lib/python3.12/sqlite3/dbapi2.py
 √  DJANGO_BINARY         v3.1.14         valid     /home/username/.local/lib/python3.12/site-packages/django/__init__.py
 √  ARCHIVEBOX_BINARY     v0.7.1          valid     /home/username/.local/bin/archivebox

 √  CURL_BINARY           v8.5.0          valid     /usr/bin/curl
 √  WGET_BINARY           v1.21.4         valid     /usr/bin/wget
 √  NODE_BINARY           v24.11.1        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v1.1.54         valid     ./node_modules/single-file-cli/single-file
 √  READABILITY_BINARY    v0.0.11         valid     ./node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/parser/cli.js
 √  GIT_BINARY            v2.43.0         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2025.12.08     valid     /home/username/.local/bin/yt-dlp
 √  CHROME_BINARY         v143.0.7499.40  valid     /usr/bin/chromium-browser
 √  RIPGREP_BINARY        v14.1.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           24 files        valid     /home/username/.local/lib/python3.12/site-packages/archivebox
 √  TEMPLATES_DIR         4 files         valid     /home/username/.local/lib/python3.12/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None
 -  COOKIES_FILE          -               disabled  None

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /home/username/archivebox/data
 √  SOURCES_DIR           14 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           35 files        valid     ./archive
 √  CONFIG_FILE           108.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             524.0 KB        valid     ./index.sqlite3

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [x] It would be nice to have eventually
  • [ ] I'm willing to start a PR to develop this myself
  • [ ] I have donated money to go towards fixing this issue

Mini Survey

  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [x] I've had a lot of difficulty getting ArchiveBox set up
  • [ ] I would pay $10/mo for a hosted version of ArchiveBox if it had this feature
kerem closed this issue 2026-03-15 01:22:55 +03:00

@pirate commented on GitHub (Dec 29, 2025):

forum-dl support and --depth=N recursive crawl support are now implemented in dev. Let me know if that helps! dev is still WIP, but it should be out in the next release.
