[GH-ISSUE #340] Question: How to archive Facebook pages? #246

Closed
opened 2026-03-01 14:41:48 +03:00 by kerem · 1 comment

Originally created by @nihelmasell on GitHub (Apr 26, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/340

Hi, I'm trying to archive a Facebook group (it's actually a user rather than a group) which posts some interesting stuff about language learning. I used snscrape to get the list of URLs (nearly 5000). But when I tried to download all those URLs via ArchiveBox, I could only grab about a hundred before getting CAPTCHA pages on all subsequent captures. Is there any way to solve this? Has anyone faced a similar problem? (I could use Tor, but Tor is normally banned outright by Facebook.) Thanks

kerem closed this issue 2026-03-01 14:41:49 +03:00

@pirate commented on GitHub (Apr 29, 2020):

Rate limiting is a fundamentally difficult part of web archiving, and I'm afraid ArchiveBox is not designed to handle it internally.

The expectation is that users who are archiving thousands of pages on a single domain set up their own rate-limiting or distributed archiving infrastructure, e.g. a proxy network of some kind.

While we might add a `RATELIMITS` option in the future to self-regulate how many requests we make to a given domain per minute, hour, or day, it's not on the horizon anytime soon (expect >1-2 years), so I'm going to close this issue for now. Related issues:

- https://github.com/pirate/ArchiveBox/issues/91#issuecomment-489799497
- https://github.com/pirate/ArchiveBox/issues/249

For now I suggest drip-feeding chunks of ~20 or 30 URLs at a time to archivebox every few minutes in a small bash script.
