# Distributed Scraping Infrastructure

This project is my attempt at building my own scraper infrastructure that can scale across several machines, hosted anywhere on the internet. The vision is a network of scrapers that can be controlled and monitored from a central admin dashboard.

## Components

- Scraper API Server (./api-server): A RESTful API server written in Python, meant to be deployed across several machines. Each instance runs scraper jobs independently. Each job first writes its data to a local JSONL file, which is eventually compressed and uploaded to a specific prefix in the configured S3 bucket (see the first sketch after this list).
- API docs (./api-docs): A Bruno collection of example requests for the API server.
- Local S3 instance (./local-s3): A local MinIO S3 instance for testing purposes, launched via Docker Compose.
- Admin Dashboard (./admin-dashboard): A web application for controlling and monitoring the scraper instances. It should be able to start/stop jobs and show the status of each job (see the example request after this list).
- Ansible Playbooks (./ansible): Playbooks to deploy multiple scraper nodes from a central 'control node'.
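
To make the job output flow concrete, here is a minimal sketch of the JSONL-then-upload pattern described above. It is not the actual implementation: the bucket name, prefix layout, job ID scheme, and the local MinIO endpoint and credentials (MinIO's defaults) are assumptions for illustration.

```python
import gzip
import json
import shutil
from pathlib import Path

import boto3

# Assumptions (not taken from the repo): the local MinIO instance listens on
# localhost:9000 with MinIO's default credentials, and the bucket is named
# "scraper-data".
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # point at the local MinIO container
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)


def write_records(job_id: str, records, out_dir: Path = Path("output")) -> Path:
    """Append scraped records to the job's local JSONL file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{job_id}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def upload_compressed(path: Path, bucket: str = "scraper-data", prefix: str = "jobs") -> None:
    """Gzip the JSONL file and upload it under a job-specific S3 prefix."""
    gz_path = path.with_suffix(".jsonl.gz")
    with path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    s3.upload_file(str(gz_path), bucket, f"{prefix}/{path.stem}/{gz_path.name}")


if __name__ == "__main__":
    jsonl = write_records("job-42", [{"url": "https://example.com", "status": 200}])
    upload_compressed(jsonl)
```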
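
The dashboard-to-node control path could look roughly like the following. The endpoint name echoes the Tasks/Execute request in the Bruno collection, but the exact path, host, payload shape, and authentication are assumptions, not the server's documented API.

```python
import requests

# Hypothetical control call from the admin dashboard to one scraper node.
# Host, payload, and auth header are placeholders for illustration.
NODE = "http://scraper-node-1:8000"

resp = requests.post(
    f"{NODE}/tasks/execute",
    json={"task": "example-spider", "params": {"start_url": "https://example.com"}},
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```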