MMA Almanac Scrapers

A Dockerized UFC/MMA data-collection pipeline using Playwright with Tor IP rotation and Cloudflare bypass.

Last pushed: Nov 2025. Languages: Python, Shell, Dockerfile.

About this project

What it is

A Python data-collection pipeline that scrapes Sherdog fighter profiles and UFC event and fight-statistics pages. It drives Playwright browser automation, rotates Tor exit IPs, and uses a custom Cloudflare-bypass HTTP client to keep scraping reliable. Scraped data passes through a set of parsers and enrichers, including a fighter-stats interpolator and a name-matcher, before being seeded into PostgreSQL via Prisma. Runs are scheduled through GitHub Actions and AWS EventBridge, and the scraper itself runs inside Docker for reproducible execution.
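This page does not show how the name-matcher works. As a rough illustration of the idea only, the sketch below reconciles a fighter name from one source with a list of known names from another by stripping accents and punctuation and then taking the closest string match. The function names `normalize_name` and `match_fighter`, the 0.85 cutoff, and the use of the standard-library `difflib` are all assumptions made for this example, not the project's actual code.

```python
import difflib
import unicodedata
from typing import List, Optional


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that accented and unaccented spellings compare equal.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or space with a space.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(cleaned.lower().split())


def match_fighter(name: str, candidates: List[str], cutoff: float = 0.85) -> Optional[str]:
    """Return the candidate closest to `name`, or None if nothing clears the cutoff.

    Hypothetical helper: the cutoff value is an assumption chosen for illustration.
    """
    by_normalized = {normalize_name(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize_name(name), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None


if __name__ == "__main__":
    known = ["Jiří Procházka", "Jon Jones"]
    print(match_fighter("Jiri Prochazka", known))
    print(match_fighter("Unknown Person", known))
```

A real matcher would likely need extra signals such as nickname, date of birth, or weight class to separate fighters who share a name. Plain string similarity is only a starting point.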

Engineering highlights

Playwright browser automation with human-delay simulation to avoid bot detection
Tor IP rotation (rotate_tor_ip) to cycle exit nodes between scraping sessions
Cloudflare-bypass HTTP client for sites that block headless browsers
Session-state save/load to resume scraping without re-authenticating
Prisma ORM upsert seeders keep the PostgreSQL data in sync with scraper output
Dockerized execution with GitHub Actions + EventBridge scheduling
Explicit data-leakage test suite confirming no future statistics bleed into training features
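The data-leakage check in the last highlight can be pictured roughly as follows. This is a minimal sketch assuming each training row records both the date of the fight it describes and the date its statistics were current. The `FeatureRow` type, its field names, and `find_leaks` are hypothetical and are not taken from the project's test suite.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass(frozen=True)
class FeatureRow:
    """One training example (hypothetical shape, for illustration only)."""
    fighter: str
    fight_date: date    # date of the fight being predicted
    stats_as_of: date   # date up to which the fighter's statistics were computed


def find_leaks(rows: Sequence[FeatureRow]) -> List[FeatureRow]:
    """Return rows whose statistics are dated on or after the fight itself.

    Statistics dated on the fight date count as leaks here, on the assumption
    that they could already include the outcome of that fight.
    """
    return [row for row in rows if row.stats_as_of >= row.fight_date]


def test_no_future_stats_in_features() -> None:
    rows = [
        FeatureRow("Fighter A", date(2024, 6, 1), date(2024, 3, 9)),
        FeatureRow("Fighter B", date(2024, 6, 1), date(2023, 12, 16)),
    ]
    assert find_leaks(rows) == []


if __name__ == "__main__":
    test_no_future_stats_in_features()
    leaky = FeatureRow("Fighter C", date(2024, 6, 1), date(2024, 9, 14))
    print(find_leaks([leaky]))
```

Returning the offending rows, instead of a bare pass or fail, makes a failure easier to trace back to the enricher that produced it.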

Stack

Python, Playwright, Tor, Prisma, PostgreSQL, Docker, GitHub Actions, AWS EventBridge

Part of the MMA Almanac system

This repo is one service in the four-part MMA Almanac platform. The system diagram below shows how the scrapers, ML engine, web UI, and AWS infrastructure fit together.

Related repositories: mma-almanac-ai, mma-almanac-ui, mma-almanac-aws
