MMA Almanac Scrapers
A Dockerized UFC/MMA data-collection pipeline using Playwright with Tor IP rotation and Cloudflare bypass.
About this project
What it is
A Python data-collection pipeline that scrapes Sherdog fighter profiles and UFC event and fight-statistics pages. It drives Playwright browser automation, rotates Tor exit IPs, and uses a custom Cloudflare-bypass HTTP client for pages that block automated traffic. Scraped data passes through a set of parsers and enrichers, including a fighter-stats interpolator and a name-matcher, before being seeded into PostgreSQL via Prisma. Runs are scheduled through GitHub Actions and AWS EventBridge, and the whole scraper runs inside Docker for reproducible execution.
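To give a feel for the name-matching step, here is a minimal sketch of how a fighter name from one source might be reconciled with a list of names from another. It is an illustration only, not the repo's actual matcher: the function names `normalize_name` and `match_name` and the 0.85 similarity cutoff are assumptions, and the sketch relies solely on Python's standard library.

```python
from __future__ import annotations

import difflib
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = "".join(
        ch if ch.isalnum() else " " for ch in without_accents.lower()
    )
    return " ".join(cleaned.split())


def match_name(
    query: str, candidates: list[str], cutoff: float = 0.85
) -> str | None:
    """Return the candidate closest to `query`, or None if none clears `cutoff`.

    The cutoff is a hypothetical default; a real pipeline would tune it
    against known mismatches.
    """
    # Map normalized spellings back to the original candidate strings.
    by_normalized = {normalize_name(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize_name(query), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None
```

Normalizing before comparing means accent and punctuation differences between sites do not count against the similarity score, so the cutoff only has to absorb genuine spelling variation.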
Engineering highlights
- Playwright browser automation with human-delay simulation to avoid bot detection
- Tor IP rotation (rotate_tor_ip) to cycle exit nodes between scraping sessions
- Cloudflare-bypass HTTP client for sites that block headless browsers
- Session-state save/load to resume scraping without re-authenticating
- Prisma ORM upsert seeders keep the PostgreSQL records in sync with scraper output
- Dockerized execution with GitHub Actions + EventBridge scheduling
- Explicit data-leakage test suite confirming no future statistics bleed into training features
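The data-leakage check in the last highlight can be illustrated with a small sketch. The idea is that a feature computed for a fight on a given date must depend only on earlier fights, so altering later records should leave it unchanged. Everything here is hypothetical: `FightRecord`, `career_avg_before`, and `assert_no_leakage` are illustrative names, and the feature (average significant strikes) is an assumed example rather than the project's actual feature set.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FightRecord:
    """Hypothetical per-fight row; the real schema may differ."""
    fighter: str
    fight_date: date
    sig_strikes: int


def career_avg_before(
    history: list[FightRecord], fighter: str, cutoff: date
) -> float | None:
    """Average significant strikes over fights strictly before `cutoff`."""
    prior = [
        r.sig_strikes
        for r in history
        if r.fighter == fighter and r.fight_date < cutoff
    ]
    return sum(prior) / len(prior) if prior else None


def assert_no_leakage(
    history: list[FightRecord], fighter: str, cutoff: date
) -> None:
    """Fail if changing records on or after `cutoff` changes the feature."""
    baseline = career_avg_before(history, fighter, cutoff)
    # Inflate every record dated on or after the cutoff; a leak-free
    # feature must be unaffected by this tampering.
    tampered = [
        FightRecord(r.fighter, r.fight_date, r.sig_strikes + 1000)
        if r.fight_date >= cutoff
        else r
        for r in history
    ]
    assert career_avg_before(tampered, fighter, cutoff) == baseline, (
        "feature depends on fights on or after the cutoff date"
    )
```

A test in this style does not need to know how the feature is computed; it only perturbs the future and checks that the past-facing value stays the same.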
Part of the MMA Almanac system
This repo is one service in the four-part MMA Almanac platform. The system diagram below shows how the scrapers, ML engine, web UI, and AWS infrastructure fit together.