
Pipeline Flow

Playbook view of the ELT pipeline from extraction to export

This page reads like a set play: bring the ball in from the NBA API, validate each possession, stage the data, and fan it out into star-schema outputs that are ready for analysis and distribution.

Playbook cue: The validation checkpoints work like replay review — they stop bad possessions before they become downstream tables.

The shape of the play stays the same across init, daily, monthly, full, and export; only the scope and runtime change.

Quick navigation

  • Read the whole play left to right: start with Read the possession left to right when you need the fast version before any command or table detail.
  • Check the command lane: jump to Pipeline commands when you already know the stages and only need the run-mode route.
  • Focus on guardrails: use Read the possession left to right for the validation checkpoints that stop bad data before it reaches the star outputs.
  • Leave the playbook for dependency trace: skip to Next steps when the stage map is clear and you need endpoint coverage, ER shape, or lineage.

Use this page when…

| If you need to answer… | Start here |
| --- | --- |
| “Where does validation happen?” | Read the possession left to right |
| “What actually changes between init, daily, monthly, full, and export?” | Pipeline commands |
| “Which stages produce the public warehouse surface?” | Read the possession left to right |
| “Where should I go after the stage map?” | Next steps from pipeline flow |

nbadb follows an ELT (Extract, Load, Transform) pipeline pattern.

Mermaid diagram

```mermaid
flowchart TD
    subgraph Extract["1. Extract"]
        API["nba_api
stats + static + live"] --> Raw["Raw Polars
DataFrames"]
        Static["Static Data
Players & Teams"] --> Raw
    end
    subgraph Validate1["2. Raw Validation"]
        Raw --> RawSchema["Pandera Raw
Schema Check"]
    end
    %% Stages 3-8, summarized from the stage table below
    RawSchema --> Stg["3. Stage to DuckDB: stg_* tables"]
    Stg --> StgSchema["4. Staging Validation"]
    StgSchema --> Star["5. Transform: dim_* / fact_* / bridge_* / aggregates"]
    Star --> StarSchema["6. Star Validation"]
    StarSchema --> Out["7-8. Export: SQLite / DuckDB / Parquet / CSV / Kaggle"]
```
Read the possession left to right

| Stage | What to look for | Why it matters |
| --- | --- | --- |
| 1. Extract | Which endpoints and static feeds start the run | This is the inbound surface and the first place coverage gaps appear |
| 2. Raw validation | Structural checks on API-shaped payloads | Bad possessions get stopped before they are staged as if they were trustworthy |
| 3. Stage to DuckDB | Normalized stg_* landing zone | This is the operational layer most transforms depend on directly |
| 4. Staging validation | Type, nullability, and range checks | Naming is normalized here and contract drift becomes visible |
| 5. Transform | Dimension, fact, bridge, aggregate, and analytics builders | This is where warehouse shape and dependency fan-out happen |
| 6. Star validation | Final schema enforcement on public tables | It protects the analytical contract before export |
| 7-8. Export and distribute | SQLite, DuckDB, Parquet, CSV, and Kaggle lanes | This is the finish: same modeled surface, different packaging |

The short read

  1. Extract raw payloads from live endpoints and static reference sources.
  2. Validate the raw and staging layers before transform logic touches downstream models.
  3. Transform staging tables into public dimensions, facts, bridges, aggregates, and analytics views.
  4. Export and distribute the validated star surface to SQLite, DuckDB, Parquet, CSV, and Kaggle-ready artifacts.
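As a toy illustration of these four steps, here is a stdlib-only sketch. The rows, table names, and the validation rule are invented for illustration; nbadb itself uses Polars, DuckDB, and Pandera rather than plain sqlite3.

```python
import sqlite3

# 1. Extract: raw payloads (hard-coded rows standing in for API responses).
raw = [
    {"game_id": "001", "pts": 112},
    {"game_id": "002", "pts": -5},  # structurally invalid possession
]

# 2. Validate: stop bad possessions before they are staged.
staged = [r for r in raw if r["pts"] >= 0]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_game (game_id TEXT, pts INTEGER)")
con.executemany("INSERT INTO stg_game VALUES (:game_id, :pts)", staged)

# 3. Transform: build a public fact table from the staging layer.
con.execute("CREATE TABLE fact_game AS SELECT game_id, pts FROM stg_game")

# 4. Export: downstream consumers only ever see the validated surface.
fact_rows = con.execute("SELECT game_id, pts FROM fact_game").fetchall()
```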

Pipeline commands

| Command | Stages | Duration |
| --- | --- | --- |
| nbadb init | 1-8 (full rebuild) | ~2-4h |
| nbadb daily | 1-7 (incremental, 7-day lookback) | ~5-15m |
| nbadb monthly | 1-7 (dimension refresh) | ~30-60m |
| nbadb full | 1-7 (fill gaps, preserve existing) | ~2-4h |
| nbadb export | 7-8 (re-export only) | ~5-10m |

Key Technologies

  • Polars: Primary DataFrame engine for all transforms
  • DuckDB: Staging engine with zero-copy Arrow interchange
  • Pandera: 3-tier schema validation (raw, staging, star)
  • ADBC: Arrow Database Connectivity for SQLite export
  • zstd: Compression for Parquet output files

Next steps from pipeline flow

  • Reconnect each stage to actual source families: use Endpoint Map when you need to know which endpoint families feed the possession before it reaches the staging and transform layers.
  • Inspect the finishing lineup: open ER Diagram when the playbook has shown the movement and you now need the shape of the dimensions, facts, and bridges produced at the end.
  • Replay one dependency chain in slow motion: continue to Table Lineage when a pipeline stage is not specific enough and you need the exact tables involved in one downstream possession.
