Mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-04-25 09:06:02 +03:00)
[GH-ISSUE #566] Feature Request: Rate-Limiting Options #1870
Originally created by @autumn-birds on GitHub (Dec 3, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/566
(sorry in advance for the verbosity! I really appreciate that people are working on this sort of software 💖 thanks so much for taking the time.)
What is the problem that your feature request solves
I'd like to dump my --- probably rather extensive --- browsing history into this, but I haven't seen anything in the documentation that contradicts my assumption that if I do so 'all in one go' so to speak, ArchiveBox will generate a lot of requests on my behalf as fast as possible. I'm afraid this would annoy the people running the remote servers hosting the URLs I fed to it, and/or trigger automated abuse-detection mechanisms, which is bad enough without also considering the possibility that some of those sites might decide to IP-ban me or something. I don't want that, either for myself or for any other unfortunate creatures that might be behind the same public IP now or in the future.
I remember looking through the documentation for how to configure ArchiveBox a couple of times and not seeing anything that resembled a (say) `MAX_REQUEST_RATE`. Maybe it's not as much of a concern in practice? Though, I think it would also be nice for people with a lot to download who are on a slow connection and want to do other stuff at the same time. I apologize if I'm just completely missing something, and/or this issue is redundant (e.g. because I was just too lazy looking through documentation and such last time). It feels like it should have come up before..?
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
A parameter I could set called `MIN_SAME_DOMAIN_REQUEST_COOLDOWN`, which would cause ArchiveBox to reorder its fetches and potentially sleep/wait such that it never contacted any one domain more often than specified (e.g. `MIN_SAME_DOMAIN_REQUEST_COOLDOWN=1m`, or whatever is a reasonable value), feels like it would help with this concern. I don't know how hard that would be to implement given the existing architecture of the program.
It would also be nice (or alternatively nice, if the above is too difficult) to have a more coarse-grained `MIN_REQUEST_COOLDOWN` or `MAX_REQUEST_RATE` or something similar which I could use to throttle how fast ArchiveBox requests things regardless of where it's requesting from. Maybe at the very least someone with expertise/experience doing this could add something to the documentation mentioning whether or why such things aren't a concern currently, if that's the case?
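The per-domain cooldown proposed above could be sketched roughly as follows. This is an illustrative standalone helper, not actual ArchiveBox code: the `MIN_SAME_DOMAIN_REQUEST_COOLDOWN` name is the hypothetical setting from this issue, and `wait_for_domain` is an invented function name.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical setting from this issue; not a real ArchiveBox config option.
MIN_SAME_DOMAIN_REQUEST_COOLDOWN = 60.0  # seconds

# domain -> monotonic timestamp of that domain's last request
# (-inf means "never contacted", so the first request is never delayed)
_last_request = defaultdict(lambda: float("-inf"))

def wait_for_domain(url: str, cooldown: float = MIN_SAME_DOMAIN_REQUEST_COOLDOWN) -> None:
    """Block until `url`'s domain has been idle for at least `cooldown`
    seconds, then record the current time as its last-request timestamp."""
    domain = urlparse(url).netloc
    remaining = cooldown - (time.monotonic() - _last_request[domain])
    if remaining > 0:
        time.sleep(remaining)
    _last_request[domain] = time.monotonic()
```

Calling `wait_for_domain(url)` before each fetch would then enforce the cooldown: back-to-back requests to the same domain sleep out the remainder of the window, while requests to different domains proceed immediately.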
What hacks or alternative solutions have you tried to solve the problem?
I might look at other pieces of software --- I saw something in the Community links that looked proxy-like; alternatively I might also look at solutions using archive.org's own tools (Heritrix?) as I remember that it had some documentation around this sort of issue. But it also definitely looked a lot heavier to set up and administer than this.
Honestly, I could also look at hacking on ArchiveBox's code myself, but I'd feel a lot more confident with that if I knew what people with more experience think of this.
How badly do you want this new feature?
/ money to fix this issue (I'd be happy to try to hack on the source a little at some point, though I should warn you all that I'm not very reliable and may or may not be likely to produce good code. Still, any pointers would be appreciated if it comes to that.)

@pirate commented on GitHub (Dec 3, 2020):
We've discussed this previously as a subcomponent of another feature over here: https://github.com/ArchiveBox/ArchiveBox/issues/91#issuecomment-489799497, but thanks for opening this as it's nice to have a dedicated ticket to track this config option.
The reason it hasn't been added yet is because honestly ArchiveBox is pretty slow without parallel link archiving! 😆
We haven't had issues with hitting rate limits in practice because single-threaded archiving with a headless browser is not much faster than a human browsing those links by hand. Getting rate-limited is also not that detrimental to the archiving process, as ArchiveBox will just skip any extractors that fail, and you can auto-retry them later by running `archivebox update`. Rate limiting is not super hard to implement, but it doesn't make sense to add it until we have async/parallel archiving support, because otherwise everything will just block for many seconds between links and make the process much slower. If you're willing to wait some months until we get around to that and don't mind the slow archiving in the meantime, then it sounds like ArchiveBox can fit your needs.
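The interaction described above --- per-domain cooldowns only being practical once archiving is concurrent --- can be sketched with `asyncio`. This is an editorial illustration under stated assumptions, not ArchiveBox's implementation: `archive_url`, `archive_domain`, and `archive_all` are invented names, and the real extractor pipeline is replaced by a no-op.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

async def archive_url(url: str) -> None:
    # Stand-in for the real extractor pipeline (wget, headless Chrome, etc.).
    await asyncio.sleep(0)

async def archive_domain(urls: list[str], cooldown: float) -> None:
    # Within one domain, fetch sequentially with a cooldown between requests.
    for i, url in enumerate(urls):
        if i:
            await asyncio.sleep(cooldown)
        await archive_url(url)

async def archive_all(urls: list[str], cooldown: float = 60.0) -> None:
    # Group URLs by domain and run the per-domain queues concurrently,
    # so one domain's cooldown never stalls requests to other domains.
    by_domain: defaultdict[str, list[str]] = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)
    await asyncio.gather(*(archive_domain(q, cooldown) for q in by_domain.values()))
```

With concurrency, the cooldowns for different domains overlap, so total runtime is bounded by the busiest single domain rather than the sum of all sleeps --- which is why a cooldown added to today's single-threaded loop would slow everything down instead.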
Out of curiosity how many links are you planning on archiving? I recommend splitting it up into batches of 500 or 1,000 at a time. If you're doing more than 20,000 links then I recommend waiting until v0.5.0 is released in a couple weeks because we have many general performance improvements in that version.
@autumn-birds commented on GitHub (Dec 3, 2020):
I see! Okay, that makes some amount of sense, hehe. I'm not in a huge hurry to get things archived; I'm just of the 'vague long-standing irritation that I don't have something like this to search through/read offline/etc etc' type.
As far as number of links go, I'm... not honestly sure. My browser history for November seems to weigh in at almost 5,000 items on this machine, which has become my primary one. (...I might possibly have a habit of spending time wandering around the web just a little bit too much...) October is actually less, about 2,256 according to 'select all' in Firefox history. It doesn't go back too much before September, but I have some older history elsewhere... >.>
Thanks for the clarification though --- it's reassuring to hear that this has actually been thought about. Perhaps I'll have to give it a try!