[GH-ISSUE #712] Feature Request: prioritize types of pages #445

Closed
opened 2026-03-01 14:43:40 +03:00 by kerem · 4 comments
Owner

Originally created by @dominictarr on GitHub (Apr 19, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/712

Type

  • [ ] General question or discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

Static sites often have thumbnail images in the article, each with a link to the large image.
It would be great to get those images as soon as possible, before other HTML pages on the site.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

An option to prioritize file types that are known to be leaves, i.e. files that won't add more links to the database.

What hacks or alternative solutions have you tried to solve the problem?

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [ ] I'm willing to contribute dev time / money to fix this issue
  • [ ] I like ArchiveBox so far / would recommend it to a friend
  • [x] I am evaluating ArchiveBox but haven't decided if it solves my problems
  • [ ] I've had a lot of difficulty getting ArchiveBox set up

@pirate commented on GitHub (Apr 19, 2021):

I think you're trying to use ArchiveBox for something it isn't primarily designed for. It's not built to archive entire domains recursively; there are better tools for that. You can use a scraper or spider to find the URLs you want to archive, sort them into whatever order you like, then pipe them into ArchiveBox in that order.
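
To make the suggestion above concrete, here is a minimal sketch of the "order the URLs yourself, then pipe them in" approach. It assumes you already have a list of URLs from some spider, one per line. The `LEAF_EXTENSIONS` set and the function names are hypothetical and not part of ArchiveBox. The script only reorders the list so that likely leaf files (images and similar) come first.

```python
import sys
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumption: extensions treated as "leaves", i.e. files that cannot link to more pages.
LEAF_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".pdf", ".zip"}


def is_leaf(url: str) -> bool:
    """Return True when the URL path ends in one of the assumed leaf extensions."""
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    return suffix in LEAF_EXTENSIONS


def prioritize_leaves(urls):
    """Return the URLs deduplicated, leaves first, otherwise in input order."""
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    # sorted() is stable, so URLs within each group keep their relative order.
    return sorted(unique, key=lambda u: not is_leaf(u))


# Only read stdin when URLs are actually piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    for url in prioritize_leaves(sys.stdin):
        print(url)
```

Saved under a hypothetical name such as `prioritize_leaves.py`, it could be used as `python prioritize_leaves.py < urls.txt | archivebox add`, assuming `urls.txt` holds the spider's output and that your ArchiveBox version accepts URLs on stdin.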


@dominictarr commented on GitHub (Apr 19, 2021):

Thanks. I was recommended ArchiveBox when I asked about a better `wget -kr 1`. Do you know of something that might suit me better?


@pirate commented on GitHub (Apr 21, 2021):

Maybe Photon? https://github.com/s0md3v/Photon


@pirate commented on GitHub (May 7, 2021):

You can find many more alternatives here too: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#other-archivebox-alternatives
