Data platform | 2021 - present

NBA ELT Pipeline

An end-to-end NBA analytics platform with ingestion, dbt models, REST API, and a dashboard for season metrics, insights, and ML-powered win predictions, all hosted on cloud infrastructure.

Python, dbt, Terraform, AWS

What it is

A personal, production-grade data platform for NBA season analytics: daily ingestion, SQL transformations, ML win predictions, a public REST API, and an interactive Dash dashboard. Infrastructure is Terraform-managed on AWS, with orchestration on Step Functions for near-zero cost.

Full architecture write-up: NBA Project on Doqs.

What it does

  • Ingests raw data from basketball-reference, DraftKings, Reddit, and (historically) Twitter into Postgres bronze tables, with S3 backups for redundancy
  • Transforms data through a medallion architecture (bronze → silver → gold) using dbt, with dbt-expectations for data quality testing
  • Predicts daily game win probabilities with a logistic regression model using recent team performance, rest days, and active injuries as features
  • Serves gold marts through a Lambda-hosted REST API (api.jyablonski.dev) and a Dash frontend (nbadashboard.jyablonski.dev)
  • Runs daily on a single pipeline (ingestion → dbt → ML), with feature flags that control which sources get scraped based on the season schedule
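
To make the win-prediction step concrete, here is a minimal sketch of a logistic regression over the three feature families named above (recent team performance, rest days, active injuries). The `Matchup` fields, the home-minus-away differencing, and the training rows are all assumptions made up for illustration. They are not the project's actual schema, features, or data.

```python
from dataclasses import dataclass

import numpy as np
from sklearn.linear_model import LogisticRegression


@dataclass
class Matchup:
    """Hypothetical feature row for one game (field names are assumptions)."""

    home_recent_win_pct: float
    away_recent_win_pct: float
    home_rest_days: int
    away_rest_days: int
    home_active_injuries: int
    away_active_injuries: int

    def to_features(self) -> list[float]:
        # Express every feature as "home minus away" so one model covers both teams.
        return [
            self.home_recent_win_pct - self.away_recent_win_pct,
            float(self.home_rest_days - self.away_rest_days),
            float(self.home_active_injuries - self.away_active_injuries),
        ]


# Made-up training rows: (matchup, 1 if the home team won else 0).
history = [
    (Matchup(0.8, 0.3, 2, 1, 0, 2), 1),
    (Matchup(0.7, 0.4, 3, 1, 1, 3), 1),
    (Matchup(0.6, 0.5, 2, 2, 0, 1), 1),
    (Matchup(0.9, 0.2, 1, 1, 1, 4), 1),
    (Matchup(0.3, 0.8, 1, 2, 2, 0), 0),
    (Matchup(0.4, 0.7, 1, 3, 3, 1), 0),
    (Matchup(0.5, 0.6, 2, 2, 1, 0), 0),
    (Matchup(0.2, 0.9, 1, 1, 4, 1), 0),
]

X = np.array([matchup.to_features() for matchup, _ in history])
y = np.array([label for _, label in history])

model = LogisticRegression()
model.fit(X, y)


def home_win_probability(matchup: Matchup) -> float:
    """Return the modeled probability that the home team wins."""
    # predict_proba columns follow model.classes_, which is [0, 1] here.
    return float(model.predict_proba(np.array([matchup.to_features()]))[0, 1])


tonight = Matchup(0.7, 0.4, 2, 1, 1, 3)
p_home = home_win_probability(tonight)
print(f"home win probability: {p_home:.2f}")
```

In a setup like this, the features would presumably be read from a gold mart rather than built by hand, and the predicted probabilities written back to Postgres for the API and dashboard to serve.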

flowchart LR
  SRC[External sources]

  subgraph aws [AWS]
    SF[Step Functions]

    subgraph pipeline [Daily ECS pipeline]
      ING[Ingestion]
      DBT[dbt]
      ML[ML]
    end

    PG[(Postgres)]

    subgraph serve [User-facing]
      API[REST API]
      DASH[Dash dashboard]
    end
  end

  SRC --> ING
  SF -.-> pipeline
  ING --> PG
  PG --> DBT
  DBT --> PG
  PG --> ML
  ML --> PG
  PG --> API
  PG --> DASH
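
The daily run (ingestion → dbt → ML) with schedule-aware feature flags could be sketched roughly as follows. This is an illustration only: the `SourceFlag` structure, the flag values, and the season dates are hypothetical, and the step names are plain labels rather than real commands. In the actual project the stages run as ECS tasks orchestrated by Step Functions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceFlag:
    """Hypothetical feature flag for one ingestion source."""

    name: str
    enabled: bool
    in_season_only: bool


# Made-up season window and flag values, for illustration only.
SEASON_START = date(2024, 10, 22)
SEASON_END = date(2025, 6, 22)

FLAGS = [
    SourceFlag("basketball_reference", enabled=True, in_season_only=True),
    SourceFlag("draftkings", enabled=True, in_season_only=True),
    SourceFlag("reddit", enabled=True, in_season_only=False),
    SourceFlag("twitter", enabled=False, in_season_only=False),  # historical source
]


def sources_to_scrape(run_date: date, flags: list[SourceFlag] = FLAGS) -> list[str]:
    """Return the sources whose flags allow scraping on run_date."""
    in_season = SEASON_START <= run_date <= SEASON_END
    return [
        flag.name
        for flag in flags
        if flag.enabled and (in_season or not flag.in_season_only)
    ]


def run_daily_pipeline(run_date: date) -> list[str]:
    """Return the ordered step labels for one daily run."""
    steps = [f"ingest:{source}" for source in sources_to_scrape(run_date)]
    # Transformations and predictions always run after ingestion.
    steps += ["dbt", "ml"]
    return steps


print(run_daily_pipeline(date(2025, 1, 15)))
print(run_daily_pipeline(date(2025, 8, 1)))
```

Keeping the flags as data rather than branching logic means an off-season run skips game-dependent scrapers without any change to the pipeline definition itself.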

Why I built it

I wanted a real end-to-end analytics product rather than a notebook or a single script: something with scheduling, testing, and IAM that I could show in interviews. NBA data was a fun domain with messy sources, including APIs that block AWS IPs and force custom scraping solutions. Keeping monthly spend around $1 on the AWS free tier made cost discipline part of the design from day one.

Tech stack

| Layer | Tools |
| --- | --- |
| Ingestion | Python, Pandas, SQLAlchemy |
| Transform | dbt, dbt-expectations, Postgres |
| ML | Python, scikit-learn |
| API | FastAPI, AWS Lambda, CloudFront |
| Dashboard | Dash, Plotly |
| Infra | Terraform (custom modules), ECS Fargate, Step Functions, RDS, S3 |
| Shared libs | jyablonski_common_modules |

Related repos: dbt, ML, REST API, Dash, Terraform.