Jump Ball
Architecture
Control-tower view of the nbadb pipeline, public surface, and operational boundaries.
Think of nbadb as an arena control tower for NBA data: extraction brings the game film in, DuckDB stages and validates it, transformers reshape it into analytics-ready tables, and export lanes package it for downstream use.
Navigate by decision
See the pipeline path
Start with The short version when the question is how the ball moves from source intake to exports.
Pick the right run mode
Jump to Which run mode should I pick? when the question is operational rather than structural.
Separate public from internal tables
Use Public surface vs internal machinery when you need to know which tables are contract surface versus pipeline state.
Check docs ownership
Go to Docs boundary if you are editing docs and need to know what is command-owned.
```mermaid
flowchart LR
    subgraph Tipoff["Tip-Off: source intake"]
        API["nba_api runtime surface<br/>registered extractors"] --> RAW["Raw Polars frames"]
    end
    subgraph Tunnel["Tunnel: load + validate"]
        RAW --> STG["DuckDB staging<br/>normalized operational layer"]
    end
    subgraph Scoreboard["Main floor: public model"]
        STG --> STAR["Public tables/views<br/>star + analytics surface"]
    end
```

The short version
- Extraction wraps the NBA stats surface with registered extractors and produces Polars DataFrames.
- Staging lands result sets in DuckDB, where normalized schemas and pipeline state live.
- Transforms build the public analytical surface in dependency order.
- Exports write the same modeled data to SQLite, DuckDB, Parquet, and CSV.
- Distribution can publish the built dataset to Kaggle.
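The five bullets above can be sketched as a staged driver. Every function below is a hypothetical stand-in for illustration only, not nbadb's real internals:

```python
# Minimal sketch of the extract -> stage -> transform -> export flow.
# Each function is an illustrative stand-in, not nbadb's actual API.

def extract() -> list[dict]:
    # Stand-in for registered extractors producing raw frames.
    return [{"GAME_ID": "001", "PTS": 112}, {"GAME_ID": "002", "PTS": 98}]

def stage(raw: list[dict]) -> list[dict]:
    # Stand-in for the DuckDB staging load: normalize column names.
    return [{k.lower(): v for k, v in row.items()} for row in raw]

def transform(staged: list[dict]) -> dict[str, list[dict]]:
    # Stand-in for dependency-ordered transforms building the public model.
    return {"fact_game": staged}

def export(model: dict[str, list[dict]]) -> dict[str, int]:
    # Stand-in for the SQLite/DuckDB/Parquet/CSV export lanes.
    return {name: len(rows) for name, rows in model.items()}

def run_pipeline() -> dict[str, int]:
    return export(transform(stage(extract())))

print(run_pipeline())  # {'fact_game': 2}
```

The point is the shape, not the bodies: each stage consumes exactly what the previous stage produced, which is why the table below can describe the pipeline as a single path.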
Pipeline path in one glance
| Stage | Primary input | Primary output | Why this stage exists |
|---|---|---|---|
| Extraction | nba_api runtime endpoints | Raw Polars frames | Capture source data with retries, rate controls, and extractor-specific handling |
| Staging | Raw frames | Normalized DuckDB staging tables | Standardize types, names, and warehouse-ready shape before modeling |
| Transform | Staging tables | Public dim_*, fact_*, bridge_*, agg_*, and analytics_* outputs | Turn endpoint-shaped payloads into analysis-friendly structures |
| Export | Public modeled surface | SQLite, DuckDB, Parquet, and CSV artifacts | Make the same warehouse available in the formats downstream tools expect |
Stage-by-stage walkthrough
1. Tip-off: extract
Extraction calls the upstream NBA surface, applies retries and rate controls, and produces raw Polars frames. This is the layer closest to the source API and the first place where data fidelity matters.
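The retry-and-rate-control pattern can be sketched with stdlib pieces; `fetch` below is a hypothetical stand-in for an endpoint call, and the real extractors layer extractor-specific handling on top of a loop like this:

```python
import random
import time

def fetch_with_retries(fetch, attempts=4, base_delay=0.01):
    """Call a hypothetical endpoint function with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter also acts as a crude rate control.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Usage: a flaky stand-in that fails twice, then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient upstream error")
    return {"resultSets": []}

print(fetch_with_retries(flaky))  # {'resultSets': []}
```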
2. Tunnel: validate and normalize
Raw frames land in DuckDB staging, where schemas enforce normalized column names, types, nullability, and basic ranges before those tables feed transforms.
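Column-name normalization is the most visible part of that step. A minimal sketch, assuming snake_case is the staging convention (nbadb's real rules live in its staging schemas):

```python
import re

def to_snake_case(name: str) -> str:
    # Insert an underscore at lower/upper boundaries, then lowercase,
    # so 'teamName' -> 'team_name' and 'GAME_ID' -> 'game_id'.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def normalize_row(row: dict) -> dict:
    return {to_snake_case(k): v for k, v in row.items()}

raw = {"GAME_ID": "0022300001", "teamName": "Boston Celtics", "PTS": 120}
print(normalize_row(raw))
# {'game_id': '0022300001', 'team_name': 'Boston Celtics', 'pts': 120}
```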
3. Main floor: transform
SQL-first transformers build the public model in dependency order. The result is the analytical surface most users query directly.
4. Outbound lanes: export
The same modeled surface is written to SQLite, DuckDB, Parquet, and CSV, then can be packaged for Kaggle distribution.
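The "same rows, several formats" idea can be shown with two of the lanes using only the stdlib; the table and column names here are illustrative, and the CSV goes to an in-memory buffer for brevity:

```python
import csv
import io
import sqlite3

# One modeled result set, written to two export lanes.
rows = [("0022300001", 120), ("0022300002", 98)]

# SQLite lane.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_game_score (game_id TEXT, pts INTEGER)")
con.executemany("INSERT INTO fact_game_score VALUES (?, ?)", rows)
con.commit()

# CSV lane: identical rows, different container.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["game_id", "pts"])
writer.writerows(rows)

count = con.execute("SELECT COUNT(*) FROM fact_game_score").fetchone()[0]
print(count, len(buf.getvalue().splitlines()))  # 2 3
```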
Validation tiers in one glance
raw -> staging -> star

| Tier | Where it happens | Why it exists | What it catches |
|---|---|---|---|
| Raw | Immediately around extracted frames | Validate close to the upstream source | Source-shape problems before load |
| Staging | After DuckDB load into normalized tables | Enforce warehouse-ready normalization | Naming, typing, nullability, and range issues after load |
| Star | After transforms build the public model | Protect the reader-facing analytical contract | Public-surface contract problems before export and use |
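The tiers differ mainly in what they look at. A rough sketch, with invented column and table names (the real checks are Pandera schemas under src/nbadb/schemas/):

```python
def check_raw(frame: list[dict]) -> None:
    # Raw tier: the source shape must match the endpoint contract.
    expected = {"GAME_ID", "PTS"}
    for row in frame:
        missing = expected - row.keys()
        if missing:
            raise ValueError(f"raw: missing columns {missing}")

def check_staging(frame: list[dict]) -> None:
    # Staging tier: normalized names, nullability, and basic ranges.
    for row in frame:
        if row["pts"] is None or not 0 <= row["pts"] <= 200:
            raise ValueError("staging: pts out of range")

def check_star(tables: dict) -> None:
    # Star tier: the public contract, e.g. required tables exist and are non-empty.
    for name in ("dim_game", "fact_game_score"):
        if not tables.get(name):
            raise ValueError(f"star: {name} missing or empty")

check_raw([{"GAME_ID": "g1", "PTS": 120}])          # passes silently
check_staging([{"game_id": "g1", "pts": 120}])      # passes silently
check_star({"dim_game": [1], "fact_game_score": [1]})
```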
Which run mode should I pick?
```mermaid
flowchart TD
    START["Need to run the pipeline"] --> HIST{"First build or new historical range?"}
    HIST -- Yes --> INIT["nbadb init<br/>full historical build"]
    HIST -- No --> CURRENT{"Current-season refresh?"}
    CURRENT -- Yes --> DAILY["nbadb daily<br/>recent games + active surfaces"]
    CURRENT -- No --> RECENT{"Need a broader recent-history sweep?"}
    RECENT -- Yes --> MONTHLY["nbadb monthly<br/>last 3 seasons + live append"]
    RECENT -- No --> GAP["nbadb backfill run<br/>recovery + targeted gap fill"]
```
| Command | Primary intent | Scope |
|---|---|---|
| nbadb init | Full historical build | Historical seasons from --season-start through --season-end |
| nbadb daily | Current-season refresh play | Current season, recent games within NBADB_DAILY_LOOKBACK_DAYS, plus active player/team refresh |
| nbadb monthly | Broader roster-and-history sweep | The last 3 seasons |
| nbadb backfill run | Recovery and gap-fill run | Retries failed journal entries, then scans the requested season range while skipping already-extracted work |
One subtle but important behavior: daily, monthly, and backfill all finish by rebuilding downstream tables in replace mode. They are not row-level upsert commands against the public star surface.
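Replace-mode semantics can be illustrated with SQLite standing in for the warehouse; the table names are invented for the sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_score (game_id TEXT, pts INTEGER)")
con.executemany("INSERT INTO stg_score VALUES (?, ?)", [("g1", 110), ("g2", 95)])

def rebuild_fact(con):
    # Replace mode: drop and rebuild downstream wholesale from staging,
    # rather than upserting individual rows into the public table.
    con.execute("DROP TABLE IF EXISTS fact_score")
    con.execute("CREATE TABLE fact_score AS SELECT * FROM stg_score")

rebuild_fact(con)
# A later run sees a corrected staging row; the rebuild picks it up wholesale.
con.execute("UPDATE stg_score SET pts = 112 WHERE game_id = 'g1'")
rebuild_fact(con)
print(con.execute("SELECT pts FROM fact_score WHERE game_id='g1'").fetchone()[0])  # 112
```

The practical consequence: a bad staging row is fixed by correcting staging and re-running, not by patching the public table in place.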
If the run goes sideways
The run finished, but the output looks wrong
Start with Troubleshooting Playbook, then verify the public contract on Schema Reference or Data Dictionary.
You need a recurring operator route
Keep Daily Updates open when the job is the same refresh, validation, or handoff every day.
You need to explain the system to someone new
Use Pipeline Flow, ER Diagram, and Schema Reference for the shortest explain-it-once route.
Contracts changed and docs drifted
Regenerate command-owned docs with uv run nbadb docs-autogen --docs-root docs/content/docs, then verify the authored guidance still points at the right surfaces.
Public model families
| Family | Count | Prefix | What it gives you |
|---|---|---|---|
| Dimensions | 18 | dim_ | Core entities such as players, teams, games, and seasons |
| Facts | 196 | fact_ | Grain-specific measurements and events |
| Bridges | 6 | bridge_ | Join helpers for many-to-many relationships |
| Aggregates | 19 | agg_ | Reusable rollups for repeated analysis |
| Analytics outputs | 14 | analytics_ | Analysis-ready convenience surfaces |
Most transforms are SQL-first and run in dependency order, which keeps the model readable and predictable for maintainers.
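"Dependency order" is just a topological sort over the transform graph. A sketch with hypothetical table names, using the stdlib graph sorter:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency edges: each transform maps to the tables it reads.
deps = {
    "dim_player": set(),
    "dim_team": set(),
    "fact_box_score": {"dim_player", "dim_team"},
    "agg_player_season": {"fact_box_score"},
    "analytics_leaders": {"agg_player_season"},
}

# static_order yields every table after all of its inputs.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Dimensions come out first, then facts, then aggregates and analytics, which matches the family ordering in the table above.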
Public surface vs internal machinery
Public analytical contract
Treat dimensions, facts, bridges, aggregates, and analytics outputs as the warehouse surface documented for analysts, downstream SQL, and exported datasets.
Internal pipeline machinery
Treat underscore-prefixed tables such as _pipeline_watermarks,
_extraction_journal, _pipeline_metadata, and _transform_checkpoints as
operational state unless a page explicitly documents them for workflows like
status inspection or resume behavior.
Internal tables to recognize quickly
| Table | Why it exists |
|---|---|
| _pipeline_watermarks | Tracks incremental extraction high-water marks |
| _extraction_journal | Records extraction run history |
| _pipeline_metadata | Stores pipeline configuration state |
| _pipeline_metrics | Captures per-transformer timing and row counts |
| _transform_checkpoints | Supports resume-safe interrupted transforms |
| _transform_metrics | Stores transform execution metrics |
| _schema_versions | Snapshots column hashes for drift detection |
| _schema_version_history | Keeps schema change history |
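To make the watermark idea concrete, here is a sketch with SQLite standing in for DuckDB; the column names and helper are assumptions, not nbadb's real schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE _pipeline_watermarks
               (extractor TEXT PRIMARY KEY, high_water TEXT)""")

def advance_watermark(con, extractor, new_mark):
    # Only move forward: an incremental run never lowers the high-water mark.
    row = con.execute(
        "SELECT high_water FROM _pipeline_watermarks WHERE extractor=?",
        (extractor,)).fetchone()
    if row is None or new_mark > row[0]:
        con.execute(
            "INSERT INTO _pipeline_watermarks VALUES (?, ?) "
            "ON CONFLICT(extractor) DO UPDATE SET high_water=excluded.high_water",
            (extractor, new_mark))

advance_watermark(con, "game_log", "2024-03-01")
advance_watermark(con, "game_log", "2024-02-01")  # ignored: older than current mark
print(con.execute("SELECT high_water FROM _pipeline_watermarks").fetchone()[0])
# 2024-03-01
```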
Directory map by responsibility
| Area | What lives there | Why you would care |
|---|---|---|
| src/nbadb/extract/ | Extractors wrapping stats and static NBA sources | Source-surface coverage and extraction behavior |
| src/nbadb/schemas/ | Raw, staging, and star Pandera schemas | Validation rules and warehouse contracts |
| src/nbadb/transform/ | Dimension, fact, bridge, aggregate, and analytics builders | The public model itself |
| src/nbadb/load/ | Export and load logic | How modeled data gets written back out |
| src/nbadb/orchestrate/ | Pipeline orchestration and staging map | Run ordering, checkpoints, and resume behavior |
| src/nbadb/cli/ | Typer CLI and Textual TUI surface | Current operator entry points |
| src/nbadb/docs_gen/ | Docs generators for schema, dictionary, ER, and lineage artifacts | Generator-owned docs boundaries |
Key design decisions
Docs boundary: curated vs generated
Use this command when generator-owned docs drift from the code:
uv run nbadb docs-autogen --docs-root docs/content/docs

That command owns these outputs:

- schema/raw-reference.mdx
- schema/staging-reference.mdx
- schema/star-reference.mdx
- data-dictionary/raw.mdx
- data-dictionary/staging.mdx
- data-dictionary/star.mdx
- diagrams/er-auto.mdx
- lineage/lineage-auto.mdx
- docs/lib/generated/schema.json
- docs/lib/generated/lineage.json
- docs/lib/generated/schema-coverage.json
| If the page is… | Treat it as… |
|---|---|
| A guide, entry page, architecture page, or CLI walkthrough | Hand-authored and safe to edit directly |
| A schema reference, data dictionary artifact, ER auto page, or lineage auto page listed above | Generator-owned; regenerate instead of hand-editing |
Best next reads
- CLI Reference for exact commands and operator behavior
- Schema Reference for the public table families
- Data Dictionary for field-level meaning
- Diagrams for visual maps
- Daily Updates for the recurring operational runbook
Keep moving
Stay in the same possession
Keep the mental model warm with adjacent pages, section hubs, and search-friendly routes into the same topic cluster.
