mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #1156] Question: How can I archive full-resolution linked PNGs, JPGs, etc. within a page to the same snapshot? #718
Originally created by @hadecake on GitHub (Jun 8, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1156
A site I am trying to snapshot has a lot of direct links to image media I am trying to also capture in full size (not just the thumbnail).
If I add an additional capture layer, it then grabs the entire page for every page linked at the top. Way too much data.
Ideally I would like to add more file formats to the "media" capture so that in addition to getting the full images, it stores them all in one snapshot.
Example link: https://boards.4chan.org/out/thread/2577670
Thank you for your assistance
@pirate commented on GitHub (Jun 10, 2023):
You can set `URL_WHITELIST` for a single `add` command so that it only matches URLs with image extensions on the same domain, and then run that command with `depth=1`. Something along those lines should work.
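As a rough illustration of the approach above, the pattern can be checked against a few sample URLs before any archiving is done. This is only a sketch under stated assumptions: `example.com` is a placeholder host, the extension list is a guess at what "image extensions" should cover, and the helper names `is_allowed` and `build_command` are hypothetical. It also assumes the pattern is applied case-insensitively with search semantics, and that the config value can be passed as an environment variable for a single `archivebox add --depth=1` run. Whether the starting page itself must also match the pattern is not confirmed here.

```python
import re
import shlex

# Hypothetical pattern: URLs on example.com (or a subdomain) that end in a
# common image extension. Replace the host and extensions for the real site.
URL_WHITELIST = r'^https?://([a-z0-9-]+\.)*example\.com/.*\.(png|jpe?g|gif|webp)$'

# Assumption: matching is case-insensitive and uses search(), not fullmatch().
pattern = re.compile(URL_WHITELIST, re.IGNORECASE)


def is_allowed(url: str) -> bool:
    """Return True if the URL would pass the whitelist pattern."""
    return pattern.search(url) is not None


def build_command(page_url: str) -> str:
    """Build a one-off shell command that sets URL_WHITELIST for a single add."""
    return (
        f"env URL_WHITELIST={shlex.quote(URL_WHITELIST)} "
        f"archivebox add --depth=1 {shlex.quote(page_url)}"
    )


if __name__ == "__main__":
    samples = [
        "https://img.example.com/out/123.png",   # image on a subdomain
        "https://example.com/a/B.JPG",           # upper-case extension
        "https://example.com/out/thread/1",      # ordinary page, not an image
        "https://other.org/a.png",               # image on a different site
    ]
    for url in samples:
        print(f"{'keep' if is_allowed(url) else 'skip'}  {url}")
    print(build_command("https://example.com/out/thread/1"))
```

Running the script prints which sample URLs the pattern would keep or skip, followed by a command line that can be reviewed before it is pasted into a shell.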
For more information on `URL_WHITELIST`, including information on how to test your regex pattern before running, see the wiki here: https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#url_whitelist

You may also want to follow this issue: https://github.com/ArchiveBox/ArchiveBox/issues/564 for progress on adding `gallery-dl` support, which will enable us to save pages that have many inline images nicely within a single snapshot.

Feel free to comment back here if you need any further help.