[GH-ISSUE #939] Does ArchiveBox support archiving a single page and all its related assets? #3602

Closed
opened 2026-03-14 23:40:21 +03:00 by kerem · 6 comments

Originally created by @zhiqiangxu on GitHub (Mar 2, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/939

Sometimes we want to archive a single page with all of its related assets rather than the whole site. Does ArchiveBox support this?


@varenc commented on GitHub (Mar 3, 2022):

That's literally exactly what ArchiveBox does. It saves individual URLs, and their associated assets. It does not try to crawl entire websites. What gave you the wrong impression?


@zhiqiangxu commented on GitHub (Mar 3, 2022):

Because I saw this command:

```
archivebox schedule --every=day --depth=1 https://example.com/rss.xml
```

It looks like it's going to crawl the entire website, especially given the word `schedule`. If it only archives a single URL, there's no need to `schedule`, right?

@akhilleusuggo commented on GitHub (Mar 3, 2022):

schedule is for feed/RSS pages whose content changes regularly (every=day/hour/month, etc.).
ArchiveBox will not re-download a URL it has already archived.
To re-download the same URL you need to take a new snapshot.

> It looks like it's going to crawl the entire website

No. As the command line states, --depth=1 means only one hop away from the URL you specify. For RSS/feed pages, for example, you will download the content of the URLs that the feed lists. Every day it will check whether new URLs have been added and download them; old ones (already archived) will be ignored.
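
To make the one-hop behaviour described above concrete, here is a minimal Python sketch of a daily pass over a feed. It is not ArchiveBox's actual code: `run_scheduled_pass`, `fetch_links`, and the in-memory `archived` set are hypothetical names used only for illustration, and the feed contents are made up.

```python
from typing import Callable, Dict, Iterable, List, Set


def run_scheduled_pass(
    feed_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    archived: Set[str],
) -> List[str]:
    """Simulate one scheduled run at depth=1 (illustrative only).

    Returns the URLs that would be newly archived on this pass.
    """
    # Depth 0 is the feed itself; depth 1 is every URL the feed links to.
    candidates = [feed_url, *fetch_links(feed_url)]
    new_urls: List[str] = []
    for url in candidates:
        if url in archived:
            continue  # already archived earlier: ignored, not re-downloaded
        archived.add(url)
        new_urls.append(url)
    # Links found on the depth-1 pages are never followed, so the
    # rest of the site is not crawled.
    return new_urls


# Hypothetical feed contents standing in for https://example.com/rss.xml.
FEED = "https://example.com/rss.xml"
feed_contents: Dict[str, List[str]] = {
    FEED: ["https://example.com/a", "https://example.com/b"],
}
archived: Set[str] = set()

day1 = run_scheduled_pass(FEED, lambda u: feed_contents[u], archived)

# The next day the feed lists one additional entry.
feed_contents[FEED].append("https://example.com/c")
day2 = run_scheduled_pass(FEED, lambda u: feed_contents[u], archived)

print(day1)
print(day2)
```

In this sketch the first pass picks up the feed and its two entries, while the second pass picks up only the newly listed entry, which is the reason a schedule is useful even though no whole-site crawl happens.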


@zhiqiangxu commented on GitHub (Mar 4, 2022):

This command fails to install:

```
curl -sSL 'https://get.archivebox.io' | sh
```

The final output is this:

```
sudo: python3.7: command not found

Python 3.6.9
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)
/usr/bin/python3: No module named django
```

My installed Python version is 2.7. Does ArchiveBox only support Python 3.x?

@pirate commented on GitHub (Mar 5, 2022):

Yes, we dropped Python 2.7 support long ago. @zhiqiangxu
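
For anyone hitting the same install error, a quick interpreter check may help before running the installer. The sketch below is only an illustration: the `(3, 7)` minimum is an assumption inferred from the installer's `python3.7` lookup in the output above, not a documented requirement, and `meets_minimum` is a hypothetical helper name.

```python
import sys
from typing import Tuple

# Assumed minimum, inferred only from the installer looking for `python3.7`.
ASSUMED_MINIMUM: Tuple[int, int] = (3, 7)


def meets_minimum(
    version: Tuple[int, ...],
    minimum: Tuple[int, int] = ASSUMED_MINIMUM,
) -> bool:
    """Return True if `version` (e.g. sys.version_info) is at least `minimum`."""
    # Compare only (major, minor); tuple comparison is element by element.
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    current = sys.version_info
    status = "ok" if meets_minimum(current) else "too old"
    print(f"Python {current.major}.{current.minor}: {status}")
```

Under that assumption, both interpreters mentioned in the report (2.7 and 3.6.9) would be reported as too old.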


@pirate commented on GitHub (Mar 5, 2022):

Docker
