[GH-ISSUE #1337] Feature Request: use cache control headers to determine if content has changed since last snapshot #2327

Open
opened 2026-03-01 17:58:14 +03:00 by kerem · 0 comments

Originally created by @Juliaria08 on GitHub (Jan 28, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1337

Type

  • [ ] General question or discussion
  • [ ] Propose a brand new feature
  • [x] Request modification of existing behavior or design

What is the problem that your feature request solves

I'd like to request that ArchiveBox send the If-Modified-Since header if it has already fetched the website previously and the website sent a Last-Modified header. Alternatively, it could send If-None-Match with the stored value of the ETag response header, so that feeds like Rachel Kroll's feed (https://rachelbythebay.com/w/feed/) can easily be fetched without having to wait a full day.

This would also make long-running fetches of sites easier on both our host and the remote host.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

Sorry, I think I should've read the entire thing before. It should send If-Modified-Since if it has already fetched the website previously, using the value the server sent in the Last-Modified header. Or it should send an If-None-Match with the value of the ETag header, if one was found.

I don't know if sending both is allowed, but I guess it'd be acceptable to prefer If-Modified-Since if both are available.
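
The request above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not ArchiveBox's actual fetch code; the function names `build_conditional_headers` and `fetch_if_changed` are hypothetical, and where the previous `Last-Modified` / `ETag` values are stored is left to the caller. On the open question of sending both: HTTP does permit it, and RFC 9110 has the server evaluate `If-None-Match` and ignore `If-Modified-Since` when both are present, so this sketch simply sends whichever validators are available rather than preferring one.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def build_conditional_headers(last_modified: Optional[str],
                              etag: Optional[str]) -> dict:
    """Build request validators from values saved at the previous snapshot.

    Both arguments are the raw header values the server sent last time
    (hypothetical storage; None means the server never sent that header).
    """
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


def fetch_if_changed(url: str,
                     last_modified: Optional[str] = None,
                     etag: Optional[str] = None,
                     timeout: float = 30.0
                     ) -> Tuple[bool, Optional[bytes], Optional[str], Optional[str]]:
    """Return (changed, body, last_modified, etag) for a conditional GET."""
    request = urllib.request.Request(
        url, headers=build_conditional_headers(last_modified, etag))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # 2xx: content is new or changed, so keep the fresh validators.
            return (True, response.read(),
                    response.headers.get("Last-Modified"),
                    response.headers.get("ETag"))
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304:
            return (False, None, last_modified, etag)
        raise
```

A caller would skip taking a new snapshot when `changed` is `False`, and otherwise store the returned validators alongside the new snapshot for the next run.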

What hacks or alternative solutions have you tried to solve the problem?

I've considered putting an HTTP proxy that would store those tags between ArchiveBox and the sites it fetches, but that doesn't look pretty.

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

I don't really mind too much, but I'd appreciate it being there, as ArchiveBox could put strain on servers, and we might get blocked from archiving things if we archive too deep.


  • [ ] I'm willing to contribute dev time / money to fix this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up

I'm a fairly "new" systems admin, and I haven't set ArchiveBox up in a public environment; it is only running on my laptop. I could easily set it up, as I have already deployed some other Django-based apps to a system, but I don't have time to do things.
