[GH-ISSUE #340] Question: How to archive Facebook pages? #246

Closed
opened 2026-03-01 14:41:48 +03:00 by kerem · 1 comment

Originally created by @nihelmasell on GitHub (Apr 26, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/340

Hi, I'm trying to archive a Facebook group (it's actually a user rather than a group) which posts some interesting stuff about language learning. I used snscrape to get the list of URLs (nearly 5000). But when I tried to download all those URLs via ArchiveBox, I could only grab about a hundred before getting CAPTCHA pages on all subsequent captures. Is there any way to solve this? Has anyone faced a similar problem? (I could use Tor, but Tor is normally banned outright by Facebook.) Thanks

kerem closed this issue 2026-03-01 14:41:49 +03:00

@pirate commented on GitHub (Apr 29, 2020):

Rate limiting is a fundamentally difficult part of web archiving, and I'm afraid ArchiveBox is not designed to handle it internally.

The expectation is that users who are archiving thousands of pages on a single domain set up their own rate-limiting or distributed archiving infrastructure, e.g. a proxy network of some kind.

While we might add a `RATELIMITS` option in the future to self-regulate how many requests we make to a given domain per minute, hour, or day, it's not on the horizon anytime soon (expect >1-2 years), so I'm going to close this issue for now. Related issues:

- https://github.com/pirate/ArchiveBox/issues/91#issuecomment-489799497
- https://github.com/pirate/ArchiveBox/issues/249

For now I suggest drip-feeding chunks of ~20 or 30 URLs at a time to archivebox every few minutes in a small bash script.
