[GH-ISSUE #712] Feature Request: prioritize types of pages

kerem commented

2026-03-01 14:43:40 +03:00

Owner

Originally created by @dominictarr on GitHub (Apr 19, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/712

Type

- [ ] General question or discussion
- [x] Propose a brand new feature
- [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

Static sites often have thumbnail images in the article and a link to the large image.
It would be great to get those images as soon as possible, before other HTML pages on the site.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

An option to prioritize file types that are known to be leaves, i.e. files that won't add more links to the database.

What hacks or alternative solutions have you tried to solve the problem?

How badly do you want this new feature?

- [ ] It's an urgent deal-breaker, I can't live without it
- [x] It's important to add it in the near-mid term future
- [ ] It would be nice to have eventually

---

- [ ] I'm willing to contribute [dev time](https://github.com/ArchiveBox/ArchiveBox#archivebox-development) / [money](https://github.com/sponsors/pirate) to fix this issue
- [ ] I like ArchiveBox so far / would recommend it to a friend
- [x] I am evaluating ArchiveBox but haven't decided if it solves my problems
- [ ] I've had a lot of difficulty getting ArchiveBox set up

kerem

2026-03-01 14:43:40 +03:00

closed this issue and added the `status: idea-phase` label

kerem commented

2026-03-01 14:43:41 +03:00

Author

Owner

@pirate commented on GitHub (Apr 19, 2021):

I think you're trying to use ArchiveBox for something it's not primarily designed for. It's not built to archive entire domains recursively; there are better tools for that. You can use a scraper or spider to find the URLs you want to archive, sort them into whatever order you like, and then pipe them into ArchiveBox in that order.
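
As a rough illustration of that workflow, here is a minimal sketch of a reordering step that could sit between a crawler and ArchiveBox. It assumes the crawler has already written one URL per line, and it moves "leaf" files (images and similar) ahead of HTML pages. The extension set, the function names, and the script name are all hypothetical choices for this example, not part of ArchiveBox.

```python
import sys
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical set of extensions treated as "leaves": files that cannot
# contain further links to crawl. Adjust to taste.
LEAF_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".pdf", ".zip"}


def is_leaf(url: str) -> bool:
    """Return True if the URL path ends in one of the leaf extensions."""
    path = urlparse(url).path  # drops the query string and fragment
    return PurePosixPath(path).suffix.lower() in LEAF_EXTENSIONS


def prioritize(urls):
    """Return the URLs with leaves first, otherwise keeping input order."""
    # sorted() is stable, and False sorts before True, so leaf URLs
    # (key False) come first without reshuffling anything else.
    return sorted(urls, key=lambda u: not is_leaf(u))


if __name__ == "__main__":
    # Read one URL per line from stdin and print them in priority order.
    urls = [line.strip() for line in sys.stdin if line.strip()]
    for url in prioritize(urls):
        print(url)
```

Saved as, say, `prioritize.py`, the reordered list could then be piped onward, for example `python prioritize.py < urls.txt | archivebox add`, assuming the installed ArchiveBox version accepts URLs on stdin.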

kerem commented

2026-03-01 14:43:41 +03:00

Author

Owner

@dominictarr commented on GitHub (Apr 19, 2021):

Thanks. I was recommended ArchiveBox when I asked about a better `wget -kr 1`. Do you know of something that might suit me better?

kerem commented

2026-03-01 14:43:41 +03:00

Author

Owner

@pirate commented on GitHub (Apr 21, 2021):

Maybe Photon? https://github.com/s0md3v/Photon

kerem commented

2026-03-01 14:43:41 +03:00

Author

Owner

@pirate commented on GitHub (May 7, 2021):

You can find many more alternatives here too: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#other-archivebox-alternatives

kerem referenced this issue

2026-03-01 14:48:42 +03:00

[PR #446] [MERGED] fix: Issue with timeout in readability #1166

kerem referenced this issue

2026-03-01 17:53:50 +03:00

[GH-ISSUE #445] Bugfix: Archiving process dies when readability extractor falls back to download_url on a URL that's timing out #1809

kerem referenced this issue

2026-03-01 18:00:22 +03:00

[PR #446] [MERGED] fix: Issue with timeout in readability #2675

kerem referenced this issue

2026-03-14 22:05:38 +03:00

[GH-ISSUE #445] Bugfix: Archiving process dies when readability extractor falls back to download_url on a URL that's timing out #3319