# Distributed Scraping Infrastructure

This project is my attempt at building my own scraper infrastructure that can scale across several machines, hosted anywhere on the internet. The vision is a network of scrapers that can be controlled and monitored from a central admin dashboard.

## Components

- Scraper API Server (./api-server): A RESTful API server written in Python, meant to be deployed across several machines. Each instance runs scraper jobs independently. Each job first writes its data to a local JSONL file, which is eventually compressed and uploaded to a specific prefix in the configured S3 bucket (see the first sketch after this list).
- API docs (./api-docs): A Bruno collection of example requests for the API server.
- Local S3 instance (./local-s3): A local MinIO S3 instance for testing purposes, launched via Docker Compose.
- Admin Dashboard (./admin-dashboard): A web application for controlling and monitoring the scraper instances. It should be able to start/stop jobs and show the status of each job (see the example request after this list).
- Ansible Playbooks (./ansible): Playbooks to deploy multiple scraper nodes from a central 'control node'.
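
To make the job output flow concrete, here is a minimal sketch of the JSONL-then-upload pattern described above. It is not the actual implementation: the bucket name, prefix layout, job ID scheme, and the local MinIO endpoint and credentials (MinIO's defaults) are assumptions for illustration.

```python
import gzip
import json
import shutil
from pathlib import Path

import boto3

# Assumptions (not taken from the repo): the local MinIO instance listens on
# localhost:9000 with MinIO's default credentials, and the bucket is named
# "scraper-data".
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # point at the local MinIO container
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)


def write_records(job_id: str, records, out_dir: Path = Path("output")) -> Path:
    """Append scraped records to the job's local JSONL file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{job_id}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def upload_compressed(path: Path, bucket: str = "scraper-data", prefix: str = "jobs") -> None:
    """Gzip the JSONL file and upload it under a job-specific S3 prefix."""
    gz_path = path.with_suffix(".jsonl.gz")
    with path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    s3.upload_file(str(gz_path), bucket, f"{prefix}/{path.stem}/{gz_path.name}")


if __name__ == "__main__":
    jsonl = write_records("job-42", [{"url": "https://example.com", "status": 200}])
    upload_compressed(jsonl)
```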
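
The dashboard-to-node control path could look roughly like the following. The endpoint name echoes the Tasks/Execute request in the Bruno collection, but the exact path, host, payload shape, and authentication are assumptions, not the server's documented API.

```python
import requests

# Hypothetical control call from the admin dashboard to one scraper node.
# Host, payload, and auth header are placeholders for illustration.
NODE = "http://scraper-node-1:8000"

resp = requests.post(
    f"{NODE}/tasks/execute",
    json={"task": "example-spider", "params": {"start_url": "https://example.com"}},
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```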