[GH-ISSUE #340] Question: How to archive Facebook pages? #1756
Originally created by @nihelmasell on GitHub (Apr 26, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/340
Hi, I'm trying to archive a Facebook group (it's actually a user rather than a group) which posts some interesting stuff regarding language learning. I used snscrape to get the list of URLs (nearly 5000). But when I tried to download all those URLs via ArchiveBox, I could only grab about a hundred before getting CAPTCHA challenges on all the captured pages. Is there any way to solve this? Has anyone faced a similar problem? (I could use Tor, but it is normally banned from the start on Facebook.) Thanks
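For context, a URL list like the one described could be gathered with snscrape and saved to a file; the sketch below is a guess at the poster's setup (the scraper module and the account name are placeholders, not details from the issue):

```bash
# Hypothetical sketch: collect post URLs for a Facebook profile with snscrape.
# "some-language-learning-page" is a placeholder account name.
snscrape facebook-user some-language-learning-page > facebook_urls.txt

# Count the collected URLs (the poster reported nearly 5000).
wc -l facebook_urls.txt
```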
@pirate commented on GitHub (Apr 29, 2020):
Rate limiting is a fundamentally difficult part of web archiving, and I'm afraid ArchiveBox is not designed to handle it internally.
The expectation is that users who are archiving thousands of pages on a single domain set up their own rate-limiting or distributed archiving infrastructure, with a proxy/bot network of some kind.
While we might add a `RATELIMITS` option in the future to self-regulate how many requests we make to a given domain per minute/hour/day, it's not on the horizon anytime soon (expect >1-2 years), so I'm going to close this issue for now.

Related issues:

For now I suggest drip-feeding chunks of ~20 or 30 URLs at a time to archivebox every few minutes in a small bash script.
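A minimal sketch of that drip-feed approach, assuming a plain-text file with one URL per line; the file name, chunk size, and delay are illustrative values, not ArchiveBox defaults:

```bash
#!/usr/bin/env bash
# Drip-feed a large URL list to ArchiveBox in small batches.
# Run from inside your ArchiveBox data directory.
set -euo pipefail

URL_LIST="facebook_urls.txt"   # one URL per line, e.g. produced by snscrape
CHUNK_SIZE=25                  # ~20-30 URLs per batch, as suggested above
DELAY_SECONDS=300              # wait a few minutes between batches

# Split the big list into numbered chunk files: chunk_aa, chunk_ab, ...
split -l "$CHUNK_SIZE" "$URL_LIST" chunk_

for chunk in chunk_*; do
    # Feed one small batch of URLs to ArchiveBox on stdin, then back off.
    archivebox add < "$chunk"
    sleep "$DELAY_SECONDS"
done
```

Keeping batches small and spacing them out simply lowers the request rate per domain; it does not bypass CAPTCHAs that Facebook has already started serving to your IP.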