[GH-ISSUE #109] Add a MAX_URL_ATTEMPTS to stop retrying failed URLs #3095

Closed
opened 2026-03-14 20:59:49 +03:00 by kerem · 4 comments

Originally created by @pirate on GitHub (Oct 26, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/109

> One of my use cases is similar to the one @f0086 mentioned: I've imported a Pinboard JSON file which includes old URLs that now 404. It's annoying to have bookmark-archiver waste time on every run attempting to archive them. However, I don't want to enable `ONLY_NEW`. If I do that and I add a good URL, but my machine goes offline during the next run (or the machine serving the URL happens to be offline, or they happen to be restarting their web server at that second), the archive will fail and it will not try again next time, despite there being a good chance the next attempt would succeed. This use case would be better solved by tracking the number of attempts per URL and having some sort of `MAX_URL_ATTEMPTS` configuration option. If this was set to 1, it would act exactly the same as `ONLY_NEW`: try once and ignore forever on failure. To address my concerns about temporary connectivity issues, I could set it to a value like 5.
>
> -- https://github.com/pigmonkey

I agree, `MAX_URL_ATTEMPTS` stored in `index.json` would be really useful and not too hard to implement.
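A rough sketch of what per-link attempt tracking could look like (the `num_failed_attempts` and `last_status` fields and both helper functions are hypothetical illustrations, not existing ArchiveBox code):

```python
MAX_URL_ATTEMPTS = 5  # 0 or None disables the limit (same as ONLY_NEW=False today)

def should_retry(link: dict) -> bool:
    """Decide whether to re-attempt archiving a previously failed link."""
    if link.get('last_status') == 'succeeded':
        return False
    attempts = link.get('num_failed_attempts', 0)
    return not MAX_URL_ATTEMPTS or attempts < MAX_URL_ATTEMPTS

def record_failure(link: dict) -> dict:
    """Increment the per-link failure counter that would be stored in index.json."""
    link['num_failed_attempts'] = link.get('num_failed_attempts', 0) + 1
    link['last_status'] = 'failed'
    return link
```

With `MAX_URL_ATTEMPTS = 1` this behaves like `ONLY_NEW` (one try, then ignore on failure); with it unset, every failed link is retried on every run.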


@pigmonkey commented on GitHub (Oct 27, 2018):

Replying to @f0086's comment from #95 in this new issue:

> I've also thought about the fact that a page could potentially be offline at the time I scrape it. But for that case, you can simply run the archiver with `ONLY_NEW=False` once a day or so. Most of the time, a page goes offline either forever or for a short time (hours/days). So if I set `MAX_URL_ATTEMPTS` to 10 and run that script every hour, I have the same problem as with `ONLY_NEW` for a page which is offline for a day or two. I set up two cronjobs: once a week with `ONLY_NEW=False` and once an hour with `ONLY_NEW=True`.

I run bookmark-archiver daily, so for me a `MAX_URL_ATTEMPTS=10` setting would mean the URL would have to be down for 10 days before the archiver gives up. I'd feel pretty good about that assumption.

You're right that if the script was run hourly, a setting of 10 would be more likely to give up on URLs that are only temporarily inaccessible. I think I would want to set my maximum attempts such that every URL is tried across a period of something like 3-5 days before giving up. So if I was running the archiver hourly (and running it on an almost-always-online machine), maybe that value would be 100.

For extra assurance you could still use the dual cronjob approach. If `MAX_URL_ATTEMPTS` is unset (which should probably be the default) or set to something falsey (`False`, `0`), it should act the same as the current behaviour when `ONLY_NEW=False`, so that could be the weekly job.
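The dual-cronjob setup described in this thread might look like this in a crontab (the install path and `./archive` entry point are illustrative):

```
# Hourly run: only archive newly added links
0 * * * *  cd /opt/bookmark-archiver && ONLY_NEW=True ./archive
# Weekly run (Sunday 03:00): retry everything, including previously failed links
0 3 * * 0  cd /opt/bookmark-archiver && ONLY_NEW=False ./archive
```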


@karlicoss commented on GitHub (Nov 15, 2018):

Perhaps even better would be some sort of exponential backoff? A website being down for 10 days is not that uncommon (people sometimes forget to renew their domain registration, etc.), so it would not be great to give up on such a link completely. Another thing we could do is add a flag that clears the failed-attempt counts for the whole index, so you could give the failed URLs a chance again once in a while.
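As a sketch of that backoff idea (the delays, cap, and function name are illustrative assumptions, not an existing ArchiveBox API):

```python
from datetime import datetime, timedelta

BASE_DELAY = timedelta(hours=1)  # wait after the first failure
MAX_DELAY = timedelta(days=30)   # cap the backoff so links are never abandoned entirely

def next_retry_at(last_failed_at: datetime, num_failed_attempts: int) -> datetime:
    """Double the wait after each failure: 1h, 2h, 4h, ... capped at MAX_DELAY."""
    delay = min(BASE_DELAY * (2 ** (num_failed_attempts - 1)), MAX_DELAY)
    return last_failed_at + delay
```

Because the delay is capped rather than unbounded, a link that 404s for months still gets re-checked monthly instead of being dropped forever.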


@f0086 commented on GitHub (Nov 16, 2018):

Sure, you can run the archiver with `ONLY_NEW=False`, so all URLs without valid archived content attached will trigger a retry. Run the archiver once a week or month with `ONLY_NEW=False` and you are good to go.


@pirate commented on GitHub (Dec 29, 2025):

now implemented on `dev` ✅
