

Architecture

Control-tower view of the nbadb pipeline, public surface, and operational boundaries.


Think of nbadb as an arena control tower for NBA data: extraction brings the game film in, DuckDB stages and validates it, transformers reshape it into analytics-ready tables, and export lanes package it for downstream use.

| Metric | Count | What it covers |
| --- | --- | --- |
| Extractors | 143 | Registered wrappers over the current nba_api runtime surface |
| Public outputs | 141 | Dimensions, facts, bridges, aggregates, and analytics surfaces |
| Export formats | 4 | SQLite, DuckDB, CSV, and Parquet |
| Internal state tables | 8 | Watermarks, journals, checkpoints, metrics, and schema history |

Quick navigation

  • See the pipeline path: start with The short version for the raw → staging → star flow before you drop into details.
  • Map commands to run modes: jump to Command intent by run mode if your question is operational rather than structural.
  • Understand public vs internal tables: use Validation and operational state when you need to know what is contract surface versus pipeline machinery.
  • Check generated-docs boundaries: go to Docs boundary if you are editing docs and need to know what is command-owned.

Use this page when…

| If you need to answer… | Start here |
| --- | --- |
| “How does data move from the NBA API into the warehouse?” | The short version |
| “Which layer is public contract versus internal pipeline machinery?” | Validation and operational state |
| “What is actually different between init, daily, monthly, and full?” | Command intent by run mode |
| “Which docs are hand-written and which are command-owned?” | Docs boundary: what is generated vs. curated |

Mermaid diagram

flowchart LR
    subgraph Tipoff["Tip-Off: source intake"]
        API["nba_api runtime surface
143 registered extractors"] --> RAW["Raw Polars frames"]
    end
    subgraph Tunnel["Tunnel: load + validate"]
        RAW --> STG["DuckDB staging
normalized operational layer"]
    end
    subgraph Scoreboard["Main floor: public model"]
        STG --> STAR["141 public tables/views
star + analytics surface"]
    end

The short version

  • Extraction wraps the NBA stats surface with registered extractors and produces Polars DataFrames.
  • Staging lands result sets in DuckDB, where normalized schemas and pipeline state live.
  • Transforms build the public analytical surface in dependency order.
  • Exports write the same modeled data to SQLite, DuckDB, Parquet, and CSV.
  • Distribution can publish the built dataset to Kaggle.
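The five bullets above can be sketched end to end. Every function, table, and column name in this sketch is illustrative, not the real nbadb API, and plain Python lists stand in for Polars frames and DuckDB tables:

```python
# Minimal sketch of the raw -> staging -> star flow (names are hypothetical).

def extract(endpoint: str) -> list[dict]:
    """Stand-in for an extractor; the real layer returns Polars DataFrames."""
    return [{"GAME_ID": "0022300001", "PTS": 120}]

def stage(raw_rows: list[dict]) -> list[dict]:
    """Normalize names and types; the real layer lands rows in DuckDB staging."""
    return [{"game_id": r["GAME_ID"], "pts": int(r["PTS"])} for r in raw_rows]

def transform(staged: list[dict]) -> dict[str, list[dict]]:
    """Build public surfaces from staging tables in dependency order."""
    return {"fact_game_points": staged}

star = transform(stage(extract("leaguegamelog")))
print(star["fact_game_points"])  # -> [{'game_id': '0022300001', 'pts': 120}]
```

The point of the shape, not the details: each stage consumes only the previous stage's output, which is what lets validation sit cleanly at each boundary.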

Pipeline path in one glance

| Stage | Primary input | Primary output | Why this stage exists |
| --- | --- | --- | --- |
| Extraction | nba_api runtime endpoints | Raw Polars frames | Capture source data with retries, rate controls, and extractor-specific handling |
| Staging | Raw frames | Normalized DuckDB staging tables | Standardize types, names, and warehouse-ready shape before modeling |
| Transform | Staging tables | Public dim_*, fact_*, bridge_*, agg_*, and analytics_* outputs | Turn endpoint-shaped payloads into analysis-friendly structures |
| Export | Public modeled surface | SQLite, DuckDB, Parquet, and CSV artifacts | Make the same warehouse available in the formats downstream tools expect |

Pipeline walkthrough

1. Tip-off: extract

Extraction calls the upstream NBA surface, applies retries and rate controls, and produces raw Polars frames. This is the layer closest to the source API and the first place where data fidelity matters.
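The retry-and-rate-control idea can be shown with a small wrapper. This is a sketch of the pattern, not nbadb's actual extractor internals; `fetch_with_retries` and `flaky_endpoint` are hypothetical names:

```python
import time

def fetch_with_retries(call, retries=3, backoff=0.05):
    """Retry a transiently failing upstream call with exponential backoff."""
    for attempt in range(retries):
        try:
            return call()
        except ConnectionError:
            if attempt == retries - 1:
                raise                                # out of retries: surface it
            time.sleep(backoff * (2 ** attempt))     # back off before retrying

attempts = []
def flaky_endpoint():
    """Simulated upstream call that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient upstream error")
    return {"rows": 5}

result = fetch_with_retries(flaky_endpoint)
print(result, "after", len(attempts), "attempts")  # -> {'rows': 5} after 3 attempts
```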

2. Tunnel: validate and normalize

Raw frames land in DuckDB staging, where schemas enforce normalized column names, types, nullability, and basic ranges before those tables feed transforms.
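A toy version of that normalization step, assuming a purely hypothetical column mapping (the real layer uses DuckDB plus Pandera schemas):

```python
# Illustrative rename/retype pass for one raw row.
RENAMES = {"PLAYER_ID": "player_id", "PTS": "pts"}   # hypothetical mapping
TYPES = {"player_id": int, "pts": int}

def normalize(row: dict) -> dict:
    out = {RENAMES.get(k, k.lower()): v for k, v in row.items()}
    for col, cast in TYPES.items():
        if out.get(col) is not None:
            out[col] = cast(out[col])                # enforce warehouse-ready types
    return out

normalized = normalize({"PLAYER_ID": "2544", "PTS": "27"})
print(normalized)  # -> {'player_id': 2544, 'pts': 27}
```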

3. Main floor: transform

SQL-first transformers build the public model in dependency order. The result is the analytical surface most users query directly.
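"Dependency order" here is a topological ordering of the table graph. A sketch with a made-up dependency map (the real model has 141 nodes):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each public table lists the tables it reads from.
deps = {
    "dim_player": set(),
    "dim_team": set(),
    "fact_player_game": {"dim_player", "dim_team"},
    "agg_player_season": {"fact_player_game"},
}

build_order = list(TopologicalSorter(deps).static_order())
print(build_order)  # dimensions come before the facts and aggregates that read them
```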

4. Outbound lanes: export

The same modeled surface is written to SQLite, DuckDB, Parquet, and CSV, then can be packaged for Kaggle distribution.
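Two of the four export lanes can be sketched with the standard library alone; the DuckDB and Parquet lanes need their own engines, and the table and rows here are made up:

```python
import csv
import io
import sqlite3

# The same modeled rows written to a CSV artifact and a SQLite artifact.
rows = [{"team_id": 1610612747, "wins": 47}]

csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["team_id", "wins"])
writer.writeheader()
writer.writerows(rows)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE agg_team_season (team_id INTEGER, wins INTEGER)")
con.executemany("INSERT INTO agg_team_season VALUES (:team_id, :wins)", rows)

print(csv_buf.getvalue().strip().splitlines()[0])                     # -> team_id,wins
print(con.execute("SELECT wins FROM agg_team_season").fetchone()[0])  # -> 47
```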

Validation tiers in one glance

raw -> staging -> star

| Tier | Where it happens | Why it exists | What it catches |
| --- | --- | --- | --- |
| Raw | Immediately around extracted frames | Validate close to the upstream source | Source-shape problems before load |
| Staging | After DuckDB load into normalized tables | Enforce warehouse-ready normalization | Naming, typing, nullability, and range issues after load |
| Star | After transforms build the public model | Protect the reader-facing analytical contract | Public-surface contract problems before export and use |
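Toy versions of the three tiers make the division of labor concrete. nbadb's real checks are Pandera schemas; these function names and rules are illustrative only:

```python
def check_raw(frame):
    """Raw tier: catch source-shape problems before load."""
    if not frame:
        raise ValueError("raw tier: extractor returned no rows")

def check_staging(frame):
    """Staging tier: catch nullability and range issues after load."""
    for row in frame:
        if row.get("pts") is None or row["pts"] < 0:
            raise ValueError("staging tier: nullability/range violation")

def check_star(tables):
    """Star tier: catch public-contract problems before export."""
    if "fact_player_game" not in tables:
        raise ValueError("star tier: public contract table is missing")

rows = [{"pts": 27}]
check_raw(rows)
check_staging(rows)
check_star({"fact_player_game": rows})
print("all three tiers passed")
```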

Public model families

| Family | Count | Prefix | What it gives you |
| --- | --- | --- | --- |
| Dimensions | 17 | dim_ | Core entities such as players, teams, games, and seasons |
| Facts | 102 | fact_ | Grain-specific measurements and events |
| Bridges | 2 | bridge_ | Join helpers for many-to-many relationships |
| Aggregates | 16 | agg_ | Reusable rollups for repeated analysis |
| Analytics outputs | 4 | analytics_ | Analysis-ready convenience surfaces |

Most transforms are SQL-first and run in dependency order, which keeps the model readable and predictable for maintainers.

Directory map by responsibility

| Area | What lives there | Why you would care |
| --- | --- | --- |
| src/nbadb/extract/ | Extractors wrapping stats and static NBA sources | Source-surface coverage and extraction behavior |
| src/nbadb/schemas/ | Raw, staging, and star Pandera schemas | Validation rules and warehouse contracts |
| src/nbadb/transform/ | Dimension, fact, bridge, aggregate, and analytics builders | The public model itself |
| src/nbadb/load/ | Export and load logic | How modeled data gets written back out |
| src/nbadb/orchestrate/ | Pipeline orchestration and staging map | Run ordering, checkpoints, and resume behavior |
| src/nbadb/cli/ | Typer CLI and Textual TUI surface | Current operator entry points |
| src/nbadb/docs_gen/ | Docs generators for schema, dictionary, ER, and lineage artifacts | Generator-owned docs boundaries |

Command intent by run mode

This is the quickest way to understand how the pipeline behaves operationally.

| Command | Primary intent | Scope |
| --- | --- | --- |
| nbadb init | Full historical build | Historical seasons from --season-start through --season-end |
| nbadb daily | Current-season refresh play | Current season, recent games within NBADB_DAILY_LOOKBACK_DAYS, plus active player/team refresh |
| nbadb monthly | Broader roster-and-history sweep | The last 3 seasons |
| nbadb full | Recovery and gap-fill run | Retries failed journal entries, then scans the full season range while skipping already-extracted work |

One subtle but important behavior: daily, monthly, and full all finish by rebuilding downstream tables in replace mode. They are not row-level upsert commands against the public star surface.
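The replace-versus-upsert distinction matters for stale rows. A toy model with dicts keyed by a hypothetical game_id shows why:

```python
# Replace-mode rebuild (what daily/monthly/full do downstream) vs. row-level upsert.
existing = {
    "0022200001": {"pts": 111},   # stale row left over from a prior build
    "0022300001": {"pts": 118},   # row that also appears in the rebuild
}
rebuilt = {
    "0022300001": {"pts": 120},
    "0022300002": {"pts": 99},
}

public_replace = dict(rebuilt)             # replace mode: table == rebuilt result set
public_upsert = {**existing, **rebuilt}    # an upsert would keep the stale row

print(sorted(public_replace))  # -> ['0022300001', '0022300002']
print(sorted(public_upsert))   # -> ['0022200001', '0022300001', '0022300002']
```

Replace mode guarantees the public table is exactly what the transforms produced, at the cost of rewriting rows that did not change.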

Validation and operational state

The repo maintains 8 underscore-prefixed internal DuckDB tables for watermarks, journals, checkpoints, metrics, and schema history.

What most readers care about

Public analytical contract

Treat dimensions, facts, bridges, aggregates, and analytics outputs as the reader-facing warehouse surface. That is the layer documented for analysts, downstream SQL, and exported datasets.

What operators care about

Internal pipeline machinery

Treat underscore-prefixed tables such as _pipeline_watermarks, _extraction_journal, _pipeline_metadata, and _transform_checkpoints as operational state unless a page explicitly calls them out for workflows like status inspection or resume behavior.

Internal tables to recognize quickly

| Table | Why it exists |
| --- | --- |
| _pipeline_watermarks | Tracks incremental extraction high-water marks |
| _extraction_journal | Records extraction run history |
| _pipeline_metadata | Stores pipeline configuration state |
| _pipeline_metrics | Captures per-transformer timing and row counts |
| _transform_checkpoints | Supports resuming interrupted transforms safely |
| _transform_metrics | Stores transform execution metrics |
| _schema_versions | Snapshots column hashes for drift detection |
| _schema_version_history | Keeps schema change history |
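The public/internal split follows a simple naming convention: underscore-prefixed tables are operational state, everything else is contract surface. A one-line predicate captures it (the catalog list below is illustrative):

```python
def is_internal(table_name: str) -> bool:
    """True for underscore-prefixed operational-state tables."""
    return table_name.startswith("_")

catalog = ["dim_player", "_pipeline_watermarks", "fact_game", "_extraction_journal"]
public_surface = [t for t in catalog if not is_internal(t)]
print(public_surface)  # -> ['dim_player', 'fact_game']
```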

Key design decisions

Docs boundary: what is generated vs. curated

Use this command when generator-owned docs drift from the code:

uv run nbadb docs-autogen --docs-root docs/content/docs

That command owns these outputs:

  • schema/raw-reference.mdx
  • schema/staging-reference.mdx
  • schema/star-reference.mdx
  • data-dictionary/raw.mdx
  • data-dictionary/staging.mdx
  • data-dictionary/star.mdx
  • diagrams/er-auto.mdx
  • lineage/lineage-auto.mdx
  • lineage/lineage.json

Practical boundary

| If the page is… | Treat it as… |
| --- | --- |
| A guide, entry page, architecture page, or CLI walkthrough | Hand-authored and safe to edit directly |
| A schema reference, data dictionary artifact, ER auto page, or lineage auto page listed above | Generator-owned; regenerate instead of hand-editing |
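The boundary can be enforced mechanically, for example in a lint step. This helper is a hypothetical sketch built from the generator-owned list above, with paths relative to the docs root:

```python
from pathlib import PurePosixPath

# Generator-owned outputs of `nbadb docs-autogen`, relative to the docs root.
GENERATED = {
    "schema/raw-reference.mdx", "schema/staging-reference.mdx",
    "schema/star-reference.mdx", "data-dictionary/raw.mdx",
    "data-dictionary/staging.mdx", "data-dictionary/star.mdx",
    "diagrams/er-auto.mdx", "lineage/lineage-auto.mdx", "lineage/lineage.json",
}

def is_generator_owned(page: str) -> bool:
    """True when the page should be regenerated, never hand-edited."""
    return str(PurePosixPath(page)) in GENERATED

print(is_generator_owned("schema/star-reference.mdx"))  # -> True
print(is_generator_owned("architecture.mdx"))           # -> False
```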
