Mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-04-25 09:06:02 +03:00)
[GH-ISSUE #566] Feature Request: Rate-Limiting Options #1870
Originally created by @autumn-birds on GitHub (Dec 3, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/566
(sorry in advance for the verbosity! I really appreciate that people are working on this sort of software 💖 thanks so much for taking the time.)
What is the problem that your feature request solves
I'd like to dump my --- probably rather extensive --- browsing history into this, but I haven't seen anything in the documentation that contradicts my assumption that if I do so 'all in one go' so to speak, ArchiveBox will generate a lot of requests on my behalf as fast as possible. I'm afraid this would annoy the people running the remote servers hosting the URLs I fed to it, and/or trigger automated abuse-detection mechanisms, which is bad enough without also considering the possibility that some of those sites might decide to IP-ban me or something. I don't want that, either for myself or for any other unfortunate creatures that might be behind the same public IP now or in the future.
I remember looking through the documentation for how to configure ArchiveBox a couple of times and not seeing anything that resembled a (say) `MAX_REQUEST_RATE`. Maybe it's not as much of a concern in practice? Though, I think it would also be nice for people with a lot to download who are on a slow connection and want to do other stuff at the same time. I apologize if I'm just completely missing something, and/or this issue is redundant (e.g. because I was just too lazy looking through documentation and such last time). It feels like it should have come up before..?
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
A parameter I could set called `MIN_SAME_DOMAIN_REQUEST_COOLDOWN`, which would cause ArchiveBox to reorder its fetches and potentially sleep/wait such that it never contacted any one domain more often than specified (e.g. `MIN_SAME_DOMAIN_REQUEST_COOLDOWN=1m`, or whatever is a reasonable value), feels like it would help with this concern. I don't know how hard that would be to implement given the existing architecture of the program.
It would also be nice (or alternatively nice, if the above is too difficult) to have a more coarse-grained `MIN_REQUEST_COOLDOWN` or `MAX_REQUEST_RATE` or something similar which I could use to throttle how fast ArchiveBox requests things regardless of where it's requesting from. Maybe at the very least someone with expertise/experience doing this could add something to the documentation mentioning whether or why such things aren't a concern currently, if that's the case?
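The per-domain cooldown proposed above could be sketched roughly as follows. This is an illustrative standalone helper, not actual ArchiveBox code: the `MIN_SAME_DOMAIN_REQUEST_COOLDOWN` name is the hypothetical setting from this issue, and `wait_for_domain` is an invented function name.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical setting from this issue; not a real ArchiveBox config option.
MIN_SAME_DOMAIN_REQUEST_COOLDOWN = 60.0  # seconds

# domain -> monotonic timestamp of that domain's last request
# (-inf means "never contacted", so the first request is never delayed)
_last_request = defaultdict(lambda: float("-inf"))

def wait_for_domain(url: str, cooldown: float = MIN_SAME_DOMAIN_REQUEST_COOLDOWN) -> None:
    """Block until `url`'s domain has been idle for at least `cooldown`
    seconds, then record the current time as its last-request timestamp."""
    domain = urlparse(url).netloc
    remaining = cooldown - (time.monotonic() - _last_request[domain])
    if remaining > 0:
        time.sleep(remaining)
    _last_request[domain] = time.monotonic()
```

Calling `wait_for_domain(url)` before each fetch would then enforce the cooldown: back-to-back requests to the same domain sleep out the remainder of the window, while requests to different domains proceed immediately.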
What hacks or alternative solutions have you tried to solve the problem?
I might look at other pieces of software --- I saw something in the Community links that looked proxy-like; alternatively I might also look at solutions using archive.org's own tools (Heritrix?) as I remember that it had some documentation around this sort of issue. But it also definitely looked a lot heavier to set up and administer than this.
Honestly, I could also look at hacking on ArchiveBox's code myself, but I'd feel a lot more confident with that if I knew what people with more experience think of this.
How badly do you want this new feature?
/ money to fix this issue (I'd be happy to try to hack on the source a little at some point, though I should warn you all that I'm not very reliable and may or may not be likely to produce good code. Still, any pointers would be appreciated if it comes to that.)

@pirate commented on GitHub (Dec 3, 2020):
We've discussed this previously as a subcomponent of another feature over here: https://github.com/ArchiveBox/ArchiveBox/issues/91#issuecomment-489799497, but thanks for opening this as it's nice to have a dedicated ticket to track this config option.
The reason it hasn't been added yet is because honestly ArchiveBox is pretty slow without parallel link archiving! 😆
We haven't had issues with hitting rate limits in practice because single-threaded archiving with a headless browser is not much faster than a human browsing those links by hand. Getting rate-limited is also not that detrimental to the archiving process, as ArchiveBox will just skip any extractors that fail, and you can auto-retry them later by running `archivebox update`. Rate limiting is not super hard to implement, but it doesn't make sense to add it until we have async/parallel archiving support, because otherwise everything will just block for many seconds between links and make the process much slower. If you're willing to wait some months until we get around to that and don't mind the slow archiving in the meantime, then it sounds like ArchiveBox can fit your needs.
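The interaction described above --- per-domain cooldowns only being practical once archiving is concurrent --- can be sketched with `asyncio`. This is an editorial illustration under stated assumptions, not ArchiveBox's implementation: `archive_url`, `archive_domain`, and `archive_all` are invented names, and the real extractor pipeline is replaced by a no-op.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

async def archive_url(url: str) -> None:
    # Stand-in for the real extractor pipeline (wget, headless Chrome, etc.).
    await asyncio.sleep(0)

async def archive_domain(urls: list[str], cooldown: float) -> None:
    # Within one domain, fetch sequentially with a cooldown between requests.
    for i, url in enumerate(urls):
        if i:
            await asyncio.sleep(cooldown)
        await archive_url(url)

async def archive_all(urls: list[str], cooldown: float = 60.0) -> None:
    # Group URLs by domain and run the per-domain queues concurrently,
    # so one domain's cooldown never stalls requests to other domains.
    by_domain: defaultdict[str, list[str]] = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)
    await asyncio.gather(*(archive_domain(q, cooldown) for q in by_domain.values()))
```

With concurrency, the cooldowns for different domains overlap, so total runtime is bounded by the busiest single domain rather than the sum of all sleeps --- which is why a cooldown added to today's single-threaded loop would slow everything down instead.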
Out of curiosity how many links are you planning on archiving? I recommend splitting it up into batches of 500 or 1,000 at a time. If you're doing more than 20,000 links then I recommend waiting until v0.5.0 is released in a couple weeks because we have many general performance improvements in that version.
@autumn-birds commented on GitHub (Dec 3, 2020):
I see! Okay, that makes some amount of sense, hehe. I'm not in a huge hurry to get things archived; I'm just of the 'vague long-standing irritation that I don't have something like this to search through/read offline/etc etc' type.
As far as number of links go, I'm... not honestly sure. My browser history for November seems to weigh in at almost 5,000 items on this machine, which has become my primary one. (...I might possibly have a habit of spending time wandering around the web just a little bit too much...) October is actually less, about 2,256 according to 'select all' in Firefox history. It doesn't go back too much before September, but I have some older history elsewhere... >.>
Thanks for the clarification though --- it's reassuring to hear that this has actually been thought about. Perhaps I'll have to give it a try!