Jump Ball
Architecture
Control-tower view of the nbadb pipeline, public surface, and operational boundaries.
Think of nbadb as an arena control tower for NBA data: extraction brings the game film in, DuckDB stages and validates it, transformers reshape it into analytics-ready tables, and export lanes package it for downstream use.
Navigate by decision
See the pipeline path
Start with The short version when the question is how the ball moves from source intake to exports.
Pick the right run mode
Jump to Which run mode should I pick? when the question is operational rather than structural.
Separate public from internal tables
Use Public surface vs internal machinery when you need to know which tables are contract surface versus pipeline state.
Check docs ownership
Go to Docs boundary if you are editing docs and need to know what is command-owned.
```mermaid
flowchart LR
    subgraph Tipoff["Tip-Off: source intake"]
        API["nba_api runtime surface<br/>registered extractors"] --> RAW["Raw Polars frames"]
    end
    subgraph Tunnel["Tunnel: load + validate"]
        RAW --> STG["DuckDB staging<br/>normalized operational layer"]
    end
    subgraph Scoreboard["Main floor: public model"]
        STG --> STAR["Public tables/views<br/>star + analytics surface"]
    end
```

The short version
- Extraction wraps the NBA stats surface with registered extractors and produces Polars DataFrames.
- Staging lands result sets in DuckDB, where normalized schemas and pipeline state live.
- Transforms build the public analytical surface in dependency order.
- Exports write the same modeled data to SQLite, DuckDB, Parquet, and CSV.
- Distribution can publish the built dataset to Kaggle.
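The five bullets above can be sketched as a staged driver. Every function below is a hypothetical stand-in for illustration only, not nbadb's real internals:

```python
# Minimal sketch of the extract -> stage -> transform -> export flow.
# Each function is an illustrative stand-in, not nbadb's actual API.

def extract() -> list[dict]:
    # Stand-in for registered extractors producing raw frames.
    return [{"GAME_ID": "001", "PTS": 112}, {"GAME_ID": "002", "PTS": 98}]

def stage(raw: list[dict]) -> list[dict]:
    # Stand-in for the DuckDB staging load: normalize column names.
    return [{k.lower(): v for k, v in row.items()} for row in raw]

def transform(staged: list[dict]) -> dict[str, list[dict]]:
    # Stand-in for dependency-ordered transforms building the public model.
    return {"fact_game": staged}

def export(model: dict[str, list[dict]]) -> dict[str, int]:
    # Stand-in for the SQLite/DuckDB/Parquet/CSV export lanes.
    return {name: len(rows) for name, rows in model.items()}

def run_pipeline() -> dict[str, int]:
    return export(transform(stage(extract())))

print(run_pipeline())  # {'fact_game': 2}
```

The point is the shape, not the bodies: each stage consumes exactly what the previous stage produced, which is why the table below can describe the pipeline as a single path.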
Pipeline path in one glance
| Stage | Primary input | Primary output | Why this stage exists |
|---|---|---|---|
| Extraction | nba_api runtime endpoints | Raw Polars frames | Capture source data with retries, rate controls, and extractor-specific handling |
| Staging | Raw frames | Normalized DuckDB staging tables | Standardize types, names, and warehouse-ready shape before modeling |
| Transform | Staging tables | Public dim_*, fact_*, bridge_*, agg_*, and analytics_* outputs | Turn endpoint-shaped payloads into analysis-friendly structures |
| Export | Public modeled surface | SQLite, DuckDB, Parquet, and CSV artifacts | Make the same warehouse available in the formats downstream tools expect |
Stage-by-stage walkthrough
1. Tip-off: extract
Extraction calls the upstream NBA surface, applies retries and rate controls, and produces raw Polars frames. This is the layer closest to the source API and the first place where data fidelity matters.
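The retry-and-rate-control pattern can be sketched with stdlib pieces; `fetch` below is a hypothetical stand-in for an endpoint call, and the real extractors layer extractor-specific handling on top of a loop like this:

```python
import random
import time

def fetch_with_retries(fetch, attempts=4, base_delay=0.01):
    """Call a hypothetical endpoint function with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter also acts as a crude rate control.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Usage: a flaky stand-in that fails twice, then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient upstream error")
    return {"resultSets": []}

print(fetch_with_retries(flaky))  # {'resultSets': []}
```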
2. Tunnel: validate and normalize
Raw frames land in DuckDB staging, where schemas enforce normalized column names, types, nullability, and basic ranges before those tables feed transforms.
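Column-name normalization is the most visible part of that step. A minimal sketch, assuming snake_case is the staging convention (nbadb's real rules live in its staging schemas):

```python
import re

def to_snake_case(name: str) -> str:
    # Insert an underscore at lower/upper boundaries, then lowercase,
    # so 'teamName' -> 'team_name' and 'GAME_ID' -> 'game_id'.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def normalize_row(row: dict) -> dict:
    return {to_snake_case(k): v for k, v in row.items()}

raw = {"GAME_ID": "0022300001", "teamName": "Boston Celtics", "PTS": 120}
print(normalize_row(raw))
# {'game_id': '0022300001', 'team_name': 'Boston Celtics', 'pts': 120}
```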
3. Main floor: transform
SQL-first transformers build the public model in dependency order. The result is the analytical surface most users query directly.
4. Outbound lanes: export
The same modeled surface is written to SQLite, DuckDB, Parquet, and CSV, then can be packaged for Kaggle distribution.
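The "same rows, several formats" idea can be shown with two of the lanes using only the stdlib; the table and column names here are illustrative, and the CSV goes to an in-memory buffer for brevity:

```python
import csv
import io
import sqlite3

# One modeled result set, written to two export lanes.
rows = [("0022300001", 120), ("0022300002", 98)]

# SQLite lane.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_game_score (game_id TEXT, pts INTEGER)")
con.executemany("INSERT INTO fact_game_score VALUES (?, ?)", rows)
con.commit()

# CSV lane: identical rows, different container.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["game_id", "pts"])
writer.writerows(rows)

count = con.execute("SELECT COUNT(*) FROM fact_game_score").fetchone()[0]
print(count, len(buf.getvalue().splitlines()))  # 2 3
```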
Validation tiers in one glance
raw -> staging -> star

| Tier | Where it happens | Why it exists | What it catches |
|---|---|---|---|
| Raw | Immediately around extracted frames | Validate close to the upstream source | Source-shape problems before load |
| Staging | After DuckDB load into normalized tables | Enforce warehouse-ready normalization | Naming, typing, nullability, and range issues after load |
| Star | After transforms build the public model | Protect the reader-facing analytical contract | Public-surface contract problems before export and use |
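The tiers differ mainly in what they look at. A rough sketch, with invented column and table names (the real checks are Pandera schemas under src/nbadb/schemas/):

```python
def check_raw(frame: list[dict]) -> None:
    # Raw tier: the source shape must match the endpoint contract.
    expected = {"GAME_ID", "PTS"}
    for row in frame:
        missing = expected - row.keys()
        if missing:
            raise ValueError(f"raw: missing columns {missing}")

def check_staging(frame: list[dict]) -> None:
    # Staging tier: normalized names, nullability, and basic ranges.
    for row in frame:
        if row["pts"] is None or not 0 <= row["pts"] <= 200:
            raise ValueError("staging: pts out of range")

def check_star(tables: dict) -> None:
    # Star tier: the public contract, e.g. required tables exist and are non-empty.
    for name in ("dim_game", "fact_game_score"):
        if not tables.get(name):
            raise ValueError(f"star: {name} missing or empty")

check_raw([{"GAME_ID": "g1", "PTS": 120}])          # passes silently
check_staging([{"game_id": "g1", "pts": 120}])      # passes silently
check_star({"dim_game": [1], "fact_game_score": [1]})
```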
Which run mode should I pick?
```mermaid
flowchart TD
    START["Need to run the pipeline"] --> HIST{"First build or new historical range?"}
    HIST -- Yes --> INIT["nbadb init<br/>full historical build"]
    HIST -- No --> CURRENT{"Current-season refresh?"}
    CURRENT -- Yes --> DAILY["nbadb daily<br/>recent games + active surfaces"]
    CURRENT -- No --> RECENT{"Need a broader recent-history sweep?"}
    RECENT -- Yes --> MONTHLY["nbadb monthly<br/>last 3 seasons + live append"]
    RECENT -- No --> GAP["nbadb backfill run<br/>recovery + targeted gap fill"]
```
| Command | Primary intent | Scope |
|---|---|---|
| nbadb init | Full historical build | Historical seasons from --season-start through --season-end |
| nbadb daily | Current-season refresh play | Current season, recent games within NBADB_DAILY_LOOKBACK_DAYS, plus active player/team refresh |
| nbadb monthly | Broader roster-and-history sweep | The last 3 seasons |
| nbadb backfill run | Recovery and gap-fill run | Retries failed journal entries, then scans the requested season range while skipping already-extracted work |
One subtle but important behavior: daily, monthly, and backfill all finish by rebuilding downstream tables in replace mode. They are not row-level upsert commands against the public star surface.
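Replace-mode semantics can be illustrated with SQLite standing in for the warehouse; the table names are invented for the sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_score (game_id TEXT, pts INTEGER)")
con.executemany("INSERT INTO stg_score VALUES (?, ?)", [("g1", 110), ("g2", 95)])

def rebuild_fact(con):
    # Replace mode: drop and rebuild downstream wholesale from staging,
    # rather than upserting individual rows into the public table.
    con.execute("DROP TABLE IF EXISTS fact_score")
    con.execute("CREATE TABLE fact_score AS SELECT * FROM stg_score")

rebuild_fact(con)
# A later run sees a corrected staging row; the rebuild picks it up wholesale.
con.execute("UPDATE stg_score SET pts = 112 WHERE game_id = 'g1'")
rebuild_fact(con)
print(con.execute("SELECT pts FROM fact_score WHERE game_id='g1'").fetchone()[0])  # 112
```

The practical consequence: a bad staging row is fixed by correcting staging and re-running, not by patching the public table in place.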
If the run goes sideways
The run finished, but the output looks wrong
Start with Troubleshooting Playbook, then verify the public contract on Schema Reference or Data Dictionary.
You need a recurring operator route
Keep Daily Updates open when the job is the same refresh, validation, or handoff every day.
You need to explain the system to someone new
Use Pipeline Flow, ER Diagram, and Schema Reference for the shortest explain-it-once route.
Contracts changed and docs drifted
Regenerate command-owned docs with uv run nbadb docs-autogen --docs-root docs/content/docs, then verify the authored guidance still points at the right surfaces.
Public model families
| Family | Count | Prefix | What it gives you |
|---|---|---|---|
| Dimensions | 18 | dim_ | Core entities such as players, teams, games, and seasons |
| Facts | 196 | fact_ | Grain-specific measurements and events |
| Bridges | 6 | bridge_ | Join helpers for many-to-many relationships |
| Aggregates | 19 | agg_ | Reusable rollups for repeated analysis |
| Analytics outputs | 14 | analytics_ | Analysis-ready convenience surfaces |
Most transforms are SQL-first and run in dependency order, which keeps the model readable and predictable for maintainers.
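"Dependency order" is just a topological sort over the transform graph. A sketch with hypothetical table names, using the stdlib graph sorter:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency edges: each transform maps to the tables it reads.
deps = {
    "dim_player": set(),
    "dim_team": set(),
    "fact_box_score": {"dim_player", "dim_team"},
    "agg_player_season": {"fact_box_score"},
    "analytics_leaders": {"agg_player_season"},
}

# static_order yields every table after all of its inputs.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Dimensions come out first, then facts, then aggregates and analytics, which matches the family ordering in the table above.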
Public surface vs internal machinery
Public analytical contract
Treat dimensions, facts, bridges, aggregates, and analytics outputs as the warehouse surface documented for analysts, downstream SQL, and exported datasets.
Internal pipeline machinery
Treat underscore-prefixed tables such as _pipeline_watermarks,
_extraction_journal, _pipeline_metadata, and _transform_checkpoints as
operational state unless a page explicitly documents them for workflows like
status inspection or resume behavior.
Internal tables to recognize quickly
| Table | Why it exists |
|---|---|
| _pipeline_watermarks | Tracks incremental extraction high-water marks |
| _extraction_journal | Records extraction run history |
| _pipeline_metadata | Stores pipeline configuration state |
| _pipeline_metrics | Captures per-transformer timing and row counts |
| _transform_checkpoints | Supports resume-safe interrupted transforms |
| _transform_metrics | Stores transform execution metrics |
| _schema_versions | Snapshots column hashes for drift detection |
| _schema_version_history | Keeps schema change history |
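To make the watermark idea concrete, here is a sketch with SQLite standing in for DuckDB; the column names and helper are assumptions, not nbadb's real schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE _pipeline_watermarks
               (extractor TEXT PRIMARY KEY, high_water TEXT)""")

def advance_watermark(con, extractor, new_mark):
    # Only move forward: an incremental run never lowers the high-water mark.
    row = con.execute(
        "SELECT high_water FROM _pipeline_watermarks WHERE extractor=?",
        (extractor,)).fetchone()
    if row is None or new_mark > row[0]:
        con.execute(
            "INSERT INTO _pipeline_watermarks VALUES (?, ?) "
            "ON CONFLICT(extractor) DO UPDATE SET high_water=excluded.high_water",
            (extractor, new_mark))

advance_watermark(con, "game_log", "2024-03-01")
advance_watermark(con, "game_log", "2024-02-01")  # ignored: older than current mark
print(con.execute("SELECT high_water FROM _pipeline_watermarks").fetchone()[0])
# 2024-03-01
```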
Directory map by responsibility
| Area | What lives there | Why you would care |
|---|---|---|
| src/nbadb/extract/ | Extractors wrapping stats and static NBA sources | Source-surface coverage and extraction behavior |
| src/nbadb/schemas/ | Raw, staging, and star Pandera schemas | Validation rules and warehouse contracts |
| src/nbadb/transform/ | Dimension, fact, bridge, aggregate, and analytics builders | The public model itself |
| src/nbadb/load/ | Export and load logic | How modeled data gets written back out |
| src/nbadb/orchestrate/ | Pipeline orchestration and staging map | Run ordering, checkpoints, and resume behavior |
| src/nbadb/cli/ | Typer CLI and Textual TUI surface | Current operator entry points |
| src/nbadb/docs_gen/ | Docs generators for schema, dictionary, ER, and lineage artifacts | Generator-owned docs boundaries |
Key design decisions
Docs boundary: curated vs generated
Use this command when generator-owned docs drift from the code:
uv run nbadb docs-autogen --docs-root docs/content/docs

That command owns these outputs:

- schema/raw-reference.mdx
- schema/staging-reference.mdx
- schema/star-reference.mdx
- data-dictionary/raw.mdx
- data-dictionary/staging.mdx
- data-dictionary/star.mdx
- diagrams/er-auto.mdx
- lineage/lineage-auto.mdx
- docs/lib/generated/schema.json
- docs/lib/generated/lineage.json
- docs/lib/generated/schema-coverage.json
| If the page is… | Treat it as… |
|---|---|
| A guide, entry page, architecture page, or CLI walkthrough | Hand-authored and safe to edit directly |
| A schema reference, data dictionary artifact, ER auto page, or lineage auto page listed above | Generator-owned; regenerate instead of hand-editing |
Best next reads
- CLI Reference for exact commands and operator behavior
- Schema Reference for the public table families
- Data Dictionary for field-level meaning
- Diagrams for visual maps
- Daily Updates for the recurring operational runbook
Keep moving
Stay in the same possession
Keep the mental model warm with adjacent pages, section hubs, and search-friendly routes into the same topic cluster.
