[GH-ISSUE #404] Question: Archives and PDFs download #1779
Originally created by @walkero-gr on GitHub (Jul 30, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/404
When a page has links leading to archives or PDF files, should these be downloaded as well by default, or is there a need for a specific configuration? If not, is it possible to set those file extensions when FETCH_WGET_REQUISITES is true? For example, I would like to set a specific list of file types to download, or specific file extensions to exclude, whichever is more convenient.
wget has the following parameter for this, which might be helpful:
-R, --reject=LIST comma-separated list of rejected extensions
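For reference, here is a minimal sketch of how that reject list is typically combined with wget's mirroring flags. The URL and extension list below are placeholders, not from the original issue:

```bash
# Hypothetical example: fetch one page plus its requisites, one hop deep,
# while skipping any linked PDF or ZIP downloads.
# The URL and extension list are illustrative placeholders.
wget --recursive --level=1 --page-requisites --convert-links \
     --reject=pdf,zip \
     'https://example.com/some-page/'
```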
@pirate commented on GitHub (Jul 30, 2020):
It will not download linked PDFs, videos, or downloads unless you pass `--depth=1`. It will, however, download embedded media files, PDFs, etc. with `--depth=0`, i.e. if they are needed to render the page normally (this is what `FETCH_WGET_REQUISITES`, `SAVE_MEDIA`, `SAVE_GIT`, etc. control). A YouTube video or PDF inline in a page would be downloaded, but a download link or button going to an external PDF would not be followed with depth=0.

- `depth=0` = only archive the URL provided and the assets needed to render it
- `depth=1` = archive the URL provided, then look for all links on the page and archive those as well, including all assets needed to render those pages (1 hop outwards from the start)
- `depth=2`... = not implemented yet; you probably want to use something like SiteSucker, wget, pywb, etc. to crawl for links, then pipe those into ArchiveBox with `--depth=0`. Recursive crawling is difficult and those tools do it better; just use ArchiveBox as the storage and archiving layer instead.
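A quick sketch of what the two depths look like on the command line (the URL is a placeholder, not from the original comment):

```bash
# depth=0: archive only this page plus the assets embedded in it
archivebox add --depth=0 'https://example.com/article'

# depth=1: archive this page and every page/file it links out to (one hop)
archivebox add --depth=1 'https://example.com/article'
```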
There is also the `URL_BLACKLIST` config option, which you can use to exclude certain URL patterns, such as extensions or domains. To set that up you can do:
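The code block from the original comment is not preserved in this mirror; below is a sketch of one way to set it, assuming the `archivebox config --set` command and a regex value. The pattern shown is only an illustration:

```bash
# Skip archiving any URL that ends in .pdf, .zip, or .exe.
# The regex is an illustrative placeholder, not taken from the original comment.
archivebox config --set URL_BLACKLIST='.*\.(pdf|zip|exe)$'
```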
If you only want to download the PDFs linked inside a page, you could do:
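The original example is also missing here; the following is a rough sketch of one possible approach, extracting PDF links from a page with grep and piping them into ArchiveBox. The URL and the extraction command are assumptions, not the author's original snippet:

```bash
# Fetch the page, pull out absolute links ending in .pdf, and archive each one.
# The URL and grep pattern are illustrative placeholders.
curl -s 'https://example.com/article' \
  | grep -oE 'https?://[^" ]+\.pdf' \
  | sort -u \
  | archivebox add --depth=0
```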