[GH-ISSUE #340] Question: How to archive Facebook pages? #1756
Originally created by @nihelmasell on GitHub (Apr 26, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/340
Hi, I'm trying to archive a Facebook group (it's actually a user rather than a group) which posts some interesting stuff regarding language learning. I used snscrape to get the list of URLs (nearly 5000). But when I tried to download all those URLs via ArchiveBox, I could only grab about a hundred before getting CAPTCHA challenges on all the captured pages. Is there any way to solve this? Has anyone faced a similar problem? (I could use Tor, but it is normally banned from the start on Facebook.) Thanks
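For context, a URL list like the one described could be gathered with snscrape and saved to a file; the sketch below is a guess at the poster's setup (the scraper module and the account name are placeholders, not details from the issue):

```bash
# Hypothetical sketch: collect post URLs for a Facebook profile with snscrape.
# "some-language-learning-page" is a placeholder account name.
snscrape facebook-user some-language-learning-page > facebook_urls.txt

# Count the collected URLs (the poster reported nearly 5000).
wc -l facebook_urls.txt
```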
@pirate commented on GitHub (Apr 29, 2020):
Rate limiting is a fundamentally difficult part of web archiving, and I'm afraid ArchiveBox is not designed to handle it internally.
The expectation is that users who are archiving thousands of pages on a single domain set up their own rate-limiting or distributed archiving infrastructure, with a proxy/bot network of some kind.
While we might add a `RATELIMITS` option in the future to self-regulate how many requests we make to a given domain per minute/hour/day, it's not on the horizon anytime soon (expect >1-2 years), so I'm going to close this issue for now.

Related issues:

For now I suggest drip-feeding chunks of ~20 or 30 URLs at a time to archivebox every few minutes in a small bash script.
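A minimal sketch of that drip-feed approach, assuming a plain-text file with one URL per line; the file name, chunk size, and delay are illustrative values, not ArchiveBox defaults:

```bash
#!/usr/bin/env bash
# Drip-feed a large URL list to ArchiveBox in small batches.
# Run from inside your ArchiveBox data directory.
set -euo pipefail

URL_LIST="facebook_urls.txt"   # one URL per line, e.g. produced by snscrape
CHUNK_SIZE=25                  # ~20-30 URLs per batch, as suggested above
DELAY_SECONDS=300              # wait a few minutes between batches

# Split the big list into numbered chunk files: chunk_aa, chunk_ab, ...
split -l "$CHUNK_SIZE" "$URL_LIST" chunk_

for chunk in chunk_*; do
    # Feed one small batch of URLs to ArchiveBox on stdin, then back off.
    archivebox add < "$chunk"
    sleep "$DELAY_SECONDS"
done
```

Keeping batches small and spacing them out simply lowers the request rate per domain; it does not bypass CAPTCHAs that Facebook has already started serving to your IP.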