mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #109] Add a MAX_URL_ATTEMPTS to stop retrying failed URLs #3095
Originally created by @pirate on GitHub (Oct 26, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/109
-- https://github.com/pigmonkey

I agree, MAX_URL_ATTEMPTS stored in index.json would be really useful and not too hard to implement.

@pigmonkey commented on GitHub (Oct 27, 2018):
Replying to @f0086's comment from #95 in this new issue:

I run bookmark-archiver daily, so for me a MAX_URL_ATTEMPTS=10 setting would mean the URL would have to be down for 10 days before the archiver gives up. I'd feel pretty good about that assumption.

You're right that if the script were run hourly, a setting of 10 would be more likely to give up on URLs that are only temporarily inaccessible. I think I would want to set my maximum attempts such that every URL is tried across a period of something like 3-5 days before giving up. So if I were running the archiver hourly (on an almost-always-online machine), maybe that value would be 100.

For extra assurance you could still use the dual-cronjob approach. If MAX_URL_ATTEMPTS is unset (which should probably be the default) or set to something falsey (False, 0), it should act the same as the current behaviour when ONLY_NEW=False, so that could be the weekly job.

@karlicoss commented on GitHub (Nov 15, 2018):
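The proposed check could be sketched roughly like this (a minimal illustration only; the function name, the failed_attempts field, and the index layout are all hypothetical, not ArchiveBox's actual internals). A falsey setting preserves the current always-retry behaviour described above:

```python
# Hypothetical sketch of a MAX_URL_ATTEMPTS check. Names are illustrative
# assumptions; ArchiveBox's real index structure may differ.

def should_retry(link: dict, max_url_attempts) -> bool:
    """Decide whether a previously failed link should be retried.

    A falsey max_url_attempts (None, False, 0) means "retry forever",
    matching the current ONLY_NEW=False behaviour.
    """
    if not max_url_attempts:
        return True
    return link.get("failed_attempts", 0) < max_url_attempts

link = {"url": "https://example.com", "failed_attempts": 10}
print(should_retry(link, 10))  # False: cap reached, give up on this URL
print(should_retry(link, 0))   # True: falsey cap means always retry
```

Under this sketch, the weekly cronjob would simply run with the setting unset, retrying everything regardless of the counter.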
Perhaps even better would be some sort of exponential backoff? I guess a website being down for 10 days is not that uncommon (people sometimes forget to renew their domain lease, etc.), so it would not be great if we gave up on such a link completely. Another thing we could do is add a flag that clears the failed-attempt counts for the whole index, so you could give the failed URLs a chance once in a while.
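The backoff idea could look something like the following (again a hypothetical sketch, not anything ArchiveBox implements; the function name, base delay, and cap are illustrative assumptions): instead of a hard attempt cap, wait roughly 2**failed_attempts days before the next retry, capped so a long-dead link is still revisited occasionally.

```python
# Hypothetical exponential-backoff schedule for failed URLs.
# base_delay_days and max_delay_days are illustrative defaults.
from datetime import datetime, timedelta

def next_retry_time(last_attempt: datetime,
                    failed_attempts: int,
                    base_delay_days: float = 1.0,
                    max_delay_days: float = 30.0) -> datetime:
    """Return the earliest time this link should be retried.

    Delay doubles with each consecutive failure, capped at max_delay_days
    so even chronically-down sites get an occasional recheck.
    """
    delay = min(base_delay_days * (2 ** failed_attempts), max_delay_days)
    return last_attempt + timedelta(days=delay)

last = datetime(2018, 11, 1)
print(next_retry_time(last, 0))   # one day later: 2018-11-02
print(next_retry_time(last, 3))   # eight days later: 2018-11-09
```

The "clear failed attempts" flag suggested above would then just reset the counter, collapsing the delay back to one day.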
@f0086 commented on GitHub (Nov 16, 2018):
Sure, you can run the archiver with ONLY_NEW=false, so all URLs without valid content attached to them will trigger a retry. Run the archiver once a week or month with ONLY_NEW=false and you are good to go.

@pirate commented on GitHub (Dec 29, 2025):
Now implemented on dev ✅