Jump Ball
Architecture
Control-tower view of the nbadb pipeline, public surface, and operational boundaries.
Think of nbadb as an arena control tower for NBA data: extraction brings the game film in, DuckDB stages and validates it, transformers reshape it into analytics-ready tables, and export lanes package it for downstream use.
Quick navigation
See the pipeline path
Start with The short version for the raw → staging → star flow before you drop into details.
Map commands to run modes
Jump to Command intent by run mode if your question is operational rather than structural.
Understand public vs internal tables
Use Validation and operational state when you need to know what is contract surface versus pipeline machinery.
Check generated-docs boundaries
Go to Docs boundary if you are editing docs and need to know what is command-owned.
Use this page when…
| If you need to answer… | Start here |
|---|---|
| “How does data move from the NBA API into the warehouse?” | The short version |
| “Which layer is public contract versus internal pipeline machinery?” | Validation and operational state |
| “What is actually different between init, daily, monthly, and full?” | Command intent by run mode |
| “Which docs are hand-written and which are command-owned?” | Docs boundary: what is generated vs. curated |
```mermaid
flowchart LR
    subgraph Tipoff["Tip-Off: source intake"]
        API["nba_api runtime surface<br/>143 registered extractors"] --> RAW["Raw Polars frames"]
    end
    subgraph Tunnel["Tunnel: load + validate"]
        RAW --> STG["DuckDB staging<br/>normalized operational layer"]
    end
    subgraph Scoreboard["Main floor: public model"]
        STG --> STAR["141 public tables/views<br/>star + analytics surface"]
    end
```
The short version
- Extraction wraps the NBA stats surface with registered extractors and produces Polars DataFrames.
- Staging lands result sets in DuckDB, where normalized schemas and pipeline state live.
- Transforms build the public analytical surface in dependency order.
- Exports write the same modeled data to SQLite, DuckDB, Parquet, and CSV.
- Distribution can publish the built dataset to Kaggle.
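The flow above can be sketched end to end. This is a minimal illustration only: it uses the stdlib `sqlite3` module as a stand-in for DuckDB, with hypothetical table and column names, whereas the real pipeline moves Polars frames into a DuckDB warehouse.

```python
import sqlite3

# Hypothetical raw payload, standing in for a Polars frame from an extractor.
raw_rows = [
    ("0022300001", "LAL", 110),
    ("0022300001", "BOS", 102),
]

con = sqlite3.connect(":memory:")  # stand-in for the DuckDB warehouse

# Staging: land raw rows under normalized names and types.
con.execute("CREATE TABLE stg_game_scores (game_id TEXT, team_code TEXT, points INTEGER)")
con.executemany("INSERT INTO stg_game_scores VALUES (?, ?, ?)", raw_rows)

# Transform: build a public fact table from staging in SQL.
con.execute("CREATE TABLE fact_team_game AS SELECT game_id, team_code, points FROM stg_game_scores")

row_count = con.execute("SELECT COUNT(*) FROM fact_team_game").fetchone()[0]
print(row_count)  # 2
```

The shape is the same at every scale: raw rows land in staging, and the public surface is built from staging with SQL.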
Pipeline path in one glance
| Stage | Primary input | Primary output | Why this stage exists |
|---|---|---|---|
| Extraction | nba_api runtime endpoints | Raw Polars frames | Capture source data with retries, rate controls, and extractor-specific handling |
| Staging | Raw frames | Normalized DuckDB staging tables | Standardize types, names, and warehouse-ready shape before modeling |
| Transform | Staging tables | Public dim_*, fact_*, bridge_*, agg_*, and analytics_* outputs | Turn endpoint-shaped payloads into analysis-friendly structures |
| Export | Public modeled surface | SQLite, DuckDB, Parquet, and CSV artifacts | Make the same warehouse available in the formats downstream tools expect |
Stage-by-stage walkthrough
1. Tip-off: extract
Extraction calls the upstream NBA surface, applies retries and rate controls, and produces raw Polars frames. This is the layer closest to the source API and the first place where data fidelity matters.
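The retry-and-rate-control idea can be sketched with a small helper. This is a hypothetical illustration, not nbadb's actual extraction code; the function and endpoint names are invented.

```python
import time

def fetch_with_retries(call, attempts=3, base_delay=0.0):
    """Retry a flaky source call with exponential backoff (sketch)."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # backoff doubles each retry

calls = {"n": 0}

def flaky_endpoint():
    """Simulated upstream endpoint that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient upstream error")
    return {"resultSet": "payload"}

payload = fetch_with_retries(flaky_endpoint)
print(payload["resultSet"])  # payload
```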
2. Tunnel: validate and normalize
Raw frames land in DuckDB staging, where schemas enforce normalized column names, types, nullability, and basic ranges before those tables feed transforms.
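nbadb's actual staging contracts live in Pandera schemas under src/nbadb/schemas/; the sketch below shows the kind of rule they enforce (nullability, typing, plausible ranges) with a hand-rolled stdlib check and invented column names.

```python
def staging_violations(row: dict) -> list[str]:
    """Return violations for one staged row; an empty list means it passes."""
    errors = []
    if row.get("game_id") is None:
        errors.append("game_id must not be null")
    points = row.get("points")
    if not isinstance(points, int):
        errors.append("points must be an integer")
    elif not 0 <= points <= 200:
        errors.append("points outside plausible range 0..200")
    return errors

clean = staging_violations({"game_id": "0022300001", "points": 110})
dirty = staging_violations({"game_id": None, "points": 999})
print(clean, dirty)
```

Catching a null key or an impossible score here, before transforms run, is what keeps contract problems out of the public surface.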
3. Main floor: transform
SQL-first transformers build the public model in dependency order. The result is the analytical surface most users query directly.
4. Outbound lanes: export
The same modeled surface is written to SQLite, DuckDB, Parquet, and CSV, then can be packaged for Kaggle distribution.
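The key property of the export stage is that every lane carries the identical modeled rows. A stdlib sketch with two stand-in lanes (in-memory CSV and SQLite, with hypothetical table names) makes that concrete:

```python
import csv
import io
import sqlite3

# One modeled table, written to two export lanes.
header = ["game_id", "team_code", "points"]
modeled_rows = [("0022300001", "LAL", 110), ("0022300001", "BOS", 102)]

# CSV lane (in-memory here; the real pipeline writes files).
csv_buf = io.StringIO()
writer = csv.writer(csv_buf)
writer.writerow(header)
writer.writerows(modeled_rows)

# SQLite lane: identical rows, identical shape.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_team_game (game_id TEXT, team_code TEXT, points INTEGER)")
con.executemany("INSERT INTO fact_team_game VALUES (?, ?, ?)", modeled_rows)

csv_data_rows = csv_buf.getvalue().strip().splitlines()[1:]
sqlite_count = con.execute("SELECT COUNT(*) FROM fact_team_game").fetchone()[0]
print(len(csv_data_rows) == sqlite_count)  # True
```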
Validation tiers in one glance
Validation runs at three tiers along the raw → staging → star path.
| Tier | Where it happens | Why it exists | What it catches |
|---|---|---|---|
| Raw | Immediately around extracted frames | Validate close to the upstream source | Source-shape problems before load |
| Staging | After DuckDB load into normalized tables | Enforce warehouse-ready normalization | Naming, typing, nullability, and range issues after load |
| Star | After transforms build the public model | Protect the reader-facing analytical contract | Public-surface contract problems before export and use |
Public model families
| Family | Count | Prefix | What it gives you |
|---|---|---|---|
| Dimensions | 17 | dim_ | Core entities such as players, teams, games, and seasons |
| Facts | 102 | fact_ | Grain-specific measurements and events |
| Bridges | 2 | bridge_ | Join helpers for many-to-many relationships |
| Aggregates | 16 | agg_ | Reusable rollups for repeated analysis |
| Analytics outputs | 4 | analytics_ | Analysis-ready convenience surfaces |
Most transforms are SQL-first and run in dependency order, which keeps the model readable and predictable for maintainers.
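"Dependency order" here is a topological sort: dimensions build before the facts that reference them, and aggregates build last. Python's stdlib `graphlib` expresses the idea directly; the dependency map below is a hypothetical example, not nbadb's real graph.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each output lists the tables it reads from.
deps = {
    "dim_team": set(),
    "dim_game": set(),
    "fact_team_game": {"dim_team", "dim_game"},
    "agg_team_season": {"fact_team_game"},
}

build_order = list(TopologicalSorter(deps).static_order())
print(build_order[-1])  # agg_team_season
```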
Directory map by responsibility
| Area | What lives there | Why you would care |
|---|---|---|
| src/nbadb/extract/ | Extractors wrapping stats and static NBA sources | Source-surface coverage and extraction behavior |
| src/nbadb/schemas/ | Raw, staging, and star Pandera schemas | Validation rules and warehouse contracts |
| src/nbadb/transform/ | Dimension, fact, bridge, aggregate, and analytics builders | The public model itself |
| src/nbadb/load/ | Export and load logic | How modeled data gets written back out |
| src/nbadb/orchestrate/ | Pipeline orchestration and staging map | Run ordering, checkpoints, and resume behavior |
| src/nbadb/cli/ | Typer CLI and Textual TUI surface | Current operator entry points |
| src/nbadb/docs_gen/ | Docs generators for schema, dictionary, ER, and lineage artifacts | Generator-owned docs boundaries |
Command intent by run mode
This is the quickest way to understand how the pipeline behaves operationally.
| Command | Primary intent | Scope |
|---|---|---|
| nbadb init | Full historical build | Historical seasons from --season-start through --season-end |
| nbadb daily | Current-season refresh play | Current season, recent games within NBADB_DAILY_LOOKBACK_DAYS, plus active player/team refresh |
| nbadb monthly | Broader roster-and-history sweep | The last 3 seasons |
| nbadb full | Recovery and gap-fill run | Retries failed journal entries, then scans the full season range while skipping already-extracted work |
One subtle but important behavior: daily, monthly, and full all finish by rebuilding downstream tables in replace mode. They are not row-level upsert commands against the public star surface.
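The replace-mode distinction matters when reasoning about partial runs. DuckDB supports CREATE OR REPLACE TABLE for this; the sketch below mimics it with stdlib `sqlite3` (DROP then CREATE) and invented table names to show why a rebuild always reflects the full staging contents rather than merging rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_scores (game_id TEXT, points INTEGER)")
con.execute("INSERT INTO stg_scores VALUES ('g1', 100)")
con.execute("CREATE TABLE fact_scores AS SELECT * FROM stg_scores")

# A later run lands one more staging row...
con.execute("INSERT INTO stg_scores VALUES ('g2', 95)")

# ...and the downstream table is rebuilt wholesale, not patched row by row.
con.execute("DROP TABLE fact_scores")
con.execute("CREATE TABLE fact_scores AS SELECT * FROM stg_scores")

rebuilt = con.execute("SELECT COUNT(*) FROM fact_scores").fetchone()[0]
print(rebuilt)  # 2
```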
Validation and operational state
The repo maintains 8 underscore-prefixed internal DuckDB tables for watermarks, journals, checkpoints, metrics, and schema history.
Public analytical contract
Treat dimensions, facts, bridges, aggregates, and analytics outputs as the reader-facing warehouse surface. That is the layer documented for analysts, downstream SQL, and exported datasets.
Internal pipeline machinery
Treat underscore-prefixed tables such as _pipeline_watermarks, _extraction_journal, _pipeline_metadata, and _transform_checkpoints as operational state unless a page explicitly calls them out for workflows like status inspection or resume behavior.
Internal tables to recognize quickly
| Table | Why it exists |
|---|---|
| _pipeline_watermarks | Tracks incremental extraction high-water marks |
| _extraction_journal | Records extraction run history |
| _pipeline_metadata | Stores pipeline configuration state |
| _pipeline_metrics | Captures per-transformer timing and row counts |
| _transform_checkpoints | Supports resume-safe interrupted transforms |
| _transform_metrics | Stores transform execution metrics |
| _schema_versions | Snapshots column hashes for drift detection |
| _schema_version_history | Keeps schema change history |
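The column-hash idea behind _schema_versions can be illustrated in a few lines. This is a hedged sketch of the technique, not nbadb's implementation: hash the sorted (name, type) pairs of a table so that any added, dropped, or retyped column changes the digest.

```python
import hashlib

def schema_fingerprint(columns: list[tuple[str, str]]) -> str:
    """Hash sorted (name, type) pairs so any column change flips the digest."""
    payload = "|".join(f"{name}:{dtype}" for name, dtype in sorted(columns))
    return hashlib.sha256(payload.encode()).hexdigest()

v1 = schema_fingerprint([("game_id", "VARCHAR"), ("points", "INTEGER")])
v2 = schema_fingerprint([("game_id", "VARCHAR"), ("points", "BIGINT")])
print(v1 != v2)  # True: the type change reads as drift
```

Comparing the stored fingerprint against a freshly computed one is enough to flag drift without diffing full schemas.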
Docs boundary: what is generated vs. curated
Use this command when generator-owned docs drift from the code:
```bash
uv run nbadb docs-autogen --docs-root docs/content/docs
```
That command owns these outputs:
- schema/raw-reference.mdx
- schema/staging-reference.mdx
- schema/star-reference.mdx
- data-dictionary/raw.mdx
- data-dictionary/staging.mdx
- data-dictionary/star.mdx
- diagrams/er-auto.mdx
- lineage/lineage-auto.mdx
- lineage/lineage.json
Practical boundary
| If the page is… | Treat it as… |
|---|---|
| A guide, entry page, architecture page, or CLI walkthrough | Hand-authored and safe to edit directly |
| A schema reference, data dictionary artifact, ER auto page, or lineage auto page listed above | Generator-owned; regenerate instead of hand-editing |
Best next reads
- CLI Reference for exact commands and operator behavior
- Schema Reference for the public table families
- Data Dictionary for field-level meaning
- Diagrams for visual maps
- Daily Updates for the recurring operational runbook