[GH-ISSUE #404] Question: Archives and PDFs download #1779

Closed
opened 2026-03-01 17:53:35 +03:00 by kerem · 1 comment

Originally created by @walkero-gr on GitHub (Jul 30, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/404

When a page has links leading to archives or PDF files, should these be downloaded as well by default, or is a specific configuration needed? If not, is it possible to set those file extensions when FETCH_WGET_REQUISITES is true? For example, I would like either to set a specific list of file types to download, or to set specific file extensions to be excluded, whichever is more convenient.

wget has the following parameter for this, which might be helpful:
-R, --reject=LIST    comma-separated list of rejected extensions
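For context, a rough sketch of how those accept/reject lists behave when running wget directly (this is plain wget behavior, not an ArchiveBox option, and the URLs are placeholders):

# keep only linked PDFs/zips one hop out from the page (plain wget, for illustration)
$ wget --recursive --level=1 --accept=pdf,zip 'https://example.com/page-with-downloads/'
# or the inverse: follow everything one hop out except .exe and .iso files
$ wget --recursive --level=1 --reject=exe,iso 'https://example.com/page-with-downloads/'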

kerem 2026-03-01 17:53:35 +03:00

@pirate commented on GitHub (Jul 30, 2020):

It will not download linked PDFs, videos, or downloads unless you pass --depth=1; it will, however, download embedded media files, PDFs, etc. with --depth=0, i.e. if they are needed to render the page normally (this is what FETCH_WGET_REQUISITES, SAVE_MEDIA, SAVE_GIT, etc. control). A YouTube video or PDF inline in a page would be downloaded, but a download link or button going to an external PDF would not be followed with depth=0.

  • depth=0 = only archive the url provided and assets needed to render it
  • depth=1 = archive the url provided, then look for all links on the page, and archive those as well, including all assets needed to render those pages (1 hop outwards from the start)
  • depth=2-... not implemented yet; you probably want to use something like SiteSucker, wget, pywb, etc. to crawl for links, then pipe those into ArchiveBox with --depth=0 (see the sketch below). Recursive crawling is difficult and those tools do it better; just use ArchiveBox as the storage and archiving layer instead.
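One possible shape for that crawl-then-pipe workflow, as a rough sketch only (the wget spider trick and the grep extraction are illustrative, and it assumes your ArchiveBox version accepts URLs on stdin):

# crawl two hops out in spider mode (nothing is saved), collect the URLs wget visits,
# then hand the de-duplicated list to ArchiveBox for the actual archiving
$ wget --spider --recursive --level=2 --no-verbose 'https://example.com/start/' 2>&1 \
    | grep -oE 'https?://[^ ]+' | sort -u > urls.txt
$ archivebox add --depth=0 < urls.txt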

There is also the URL_BLACKLIST config option, which you can use to exclude certain URL patterns, like extensions or domains, e.g.:

# don't archive *.facebook.com, *.ebay.com, or anything ending in .exe
URL_BLACKLIST = (://(.*\.)?facebook\.com)|(://(.*\.)?ebay\.com)|(.*\.exe$)

To set that up you can do:

$ archivebox config --set URL_BLACKLIST='(://(.*\.)?facebook\.com)|(://(.*\.)?ebay\.com)|(.*\.exe$)'
$ archivebox add --depth=0 'https://example.com/just/this/page'
$ archivebox add --depth=1 'https://example.com/this/page/and/all/outlinks'
$ archivebox add --depth=2 "this doesn't work, use a crawler instead"

If you only want to download PDFs linked inside a page you could do:

$ env URL_BLACKLIST='^.*[^\.pdf]$' archivebox add --depth=1 'https://example.com/pagefullofpdfs.htm'
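One caveat with that last pattern: the character class [^\.pdf] only rejects URLs whose final character is one of ".", "p", "d", or "f", so it does not quite express "everything that doesn't end in .pdf". Assuming URL_BLACKLIST is evaluated as a Python regular expression (which supports negative lookahead), a sketch closer to that intent would be:

# blacklist every URL that does not end in .pdf (sketch; assumes Python-re negative lookahead)
$ env URL_BLACKLIST='^(?!.*\.pdf$).*$' archivebox add --depth=1 'https://example.com/pagefullofpdfs.htm'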