[GH-ISSUE #1156] Question: How can I archive full-resolution linked PNGs, JPGs, etc. within a page to the same snapshot? #2229

Open
opened 2026-03-01 17:57:29 +03:00 by kerem · 1 comment

Originally created by @hadecake on GitHub (Jun 8, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1156

A site I am trying to snapshot has a lot of direct links to image media I am trying to also capture in full size (not just the thumbnail).
If I add an additional capture layer, it then grabs the entire page for every page linked at the top. Way too much data.
Ideally I would like to add more file formats to the "media" capture so that in addition to getting the full images, it stores them all in one snapshot.

Example link: https://boards.4chan.org/out/thread/2577670

Thank you for your assistance


@pirate commented on GitHub (Jun 10, 2023):

You can set `URL_WHITELIST` for a single `add` command so that it only matches image-extension URLs on the same domain, and then run it with `--depth=1`.

```bash
env URL_WHITELIST='https://example.com/.*\.(png|jpg|gif)' archivebox add --depth=1 https://example.com/some/page.html
```

Something like that should work.

For more information on `URL_WHITELIST`, including how to test your regex pattern before running, see the wiki here: https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#url_whitelist
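Since `URL_WHITELIST` is a Python-style regular expression, you can sanity-check a pattern locally before running the crawl. The sketch below is illustrative only: the domain and URLs are placeholders, and it assumes the pattern is applied as a plain regex search against each discovered URL.

```python
import re

# Hypothetical allowlist pattern, same shape as the example command above:
# full-size image URLs (png/jpg/gif) on a single domain.
pattern = re.compile(r"https://example.com/.*\.(png|jpg|gif)")

# Sample URLs of the kinds a depth=1 crawl might discover (placeholders).
urls = [
    "https://example.com/images/photo1.png",  # full-size image: should match
    "https://example.com/some/page.html",     # HTML page: should not match
    "https://cdn.other.com/photo1.png",       # other domain: should not match
]

for url in urls:
    matched = bool(pattern.search(url))
    print(f"{'KEEP' if matched else 'SKIP'}  {url}")
```

Iterating on the pattern this way is much cheaper than re-running `archivebox add` and deleting unwanted snapshots afterwards.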

You may also want to follow this issue to track progress on adding `gallery-dl` support, which will make it possible to save pages with many inline images nicely within a single snapshot: https://github.com/ArchiveBox/ArchiveBox/issues/564

Feel free to comment back here if you need any further help.
