Column Lineage

This page traces individual columns through the pipeline, from the NBA API source field through the raw, staging, and star schema layers. Use it like a replay review of one touch in a possession: find the exact field that drifted, changed names, or started failing validation.

Start here when the bug is field-shaped: a wrong percentage, a renamed key, a surprising nullability change, or a foreign key that no longer lands where you expect.

Entry surface

Trace identity fields

Start with Player identity lineage when the issue is a key, natural identifier, or rename across layers.

Entry surface

Check metric math

Use Shooting stats lineage or Advanced metrics lineage when a percentage or rating changed unexpectedly.

Entry surface

Follow context keys

Jump to Game context lineage or Team lineage when the breakage is about joins rather than metric math.

Under the hood

Inspect metadata sources

Go to Lineage metadata in code when you need to confirm how schema metadata and transformer dependencies encode the replay.

Scan modes

If the issue looks like…	Start here	Why
A renamed or drifting identifier	Player identity lineage	Keys usually reveal where naming changed between API, staging, and star layers
A wrong percentage or derived metric	Shooting stats lineage	Metric examples show where values are passed through versus recomputed
A join or season-context mismatch	Game context lineage	Shared keys like `game_id` and `season_year` explain most warehouse joins
A denormalization or dimension split	Shot chart lineage	These examples show how raw fields become normalized dimensions and foreign keys
A code-generation question	Lineage metadata in code	The schema metadata and `depends_on` declarations power the generated lineage surface

How Column Lineage Works

Each column passes through up to four stages, and some continue into analytics or aggregate outputs:

Source preview

flowchart LR
    A["NBA API Field
(UPPER_CASE)"] --> B["Raw Schema
(UPPER_CASE)"]
    B --> C["Staging Schema
(snake_case)"]
    C --> D["Star Schema
(snake_case + metadata)"]

The `source` metadata on staging schemas and the `description` and `fk_ref` metadata on star schemas encode this lineage.

Frame	Typical change	What to watch for
API → Raw	Usually a straight pass-through	Nullable or mixed-type payloads
Raw → Staging	Renames to `snake_case` + validation	Contract tightening, parsed types, and nullability changes
Staging → Star	Modeling decisions and FK wiring	Surrogate keys, dimension resolution, and derived fields
Star → Analytics/Aggs	Convenience joins or recomputation	Semantic renames and metric rollups
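
The Raw → Staging row above can be sketched as a rename plus a tightened contract. This is a minimal illustration in plain Python, not the pipeline's actual staging code: `stage_row` and the sample values are hypothetical, and the not-null and `gt=0` checks simply mirror the `person_id` constraints listed on this page.

```python
def stage_row(raw_row: dict) -> dict:
    """Rename UPPER_CASE raw keys to snake_case and enforce a staging contract."""
    # Raw -> Staging rename: API names are already underscore-separated,
    # so lowercasing is enough to reach snake_case.
    staged = {name.lower(): value for name, value in raw_row.items()}

    # Contract tightening: raw allows None, staging does not.
    person_id = staged.get("person_id")
    if person_id is None:
        raise ValueError("person_id must not be null in staging")
    if person_id <= 0:
        raise ValueError("person_id must be greater than 0")
    return staged


# Made-up sample values, for illustration only.
raw_row = {"PERSON_ID": 101, "DISPLAY_FIRST_LAST": "Example Player"}
staged_row = stage_row(raw_row)
print(staged_row)
```

A row that passes raw validation with `PERSON_ID = None` would fail here, which is the kind of nullability change worth checking first when a field starts failing between layers.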

Field-level replay

Player Identity Lineage

player_id

Source preview

flowchart LR
    A["CommonPlayerInfo
PERSON_ID"] --> B["raw_player_info
PERSON_ID (int|None)"]
    B --> C["stg_player_info
person_id (int, not null, gt=0)"]
    C --> D["dim_player
player_id (int, not null)
+ player_sk (surrogate)"]
    D --> E["fact_box_score_player
player_id (FK: dim_player)"]

Stage	Column Name	Type	Constraints
API Response	`PERSON_ID`	varies	none
Raw	`PERSON_ID`	`int | None`	nullable
Staging	`person_id`	`int`	`not null, gt=0`
Star (dim)	`player_id`	`int`	`not null, gt=0, NK`
Star (fact)	`player_id`	`int`	`not null, FK: dim_player.player_id`

Key transformation: Raw `PERSON_ID` is renamed to `person_id` in staging and to `player_id` in `dim_player`, where it serves as the natural key alongside the generated `player_sk` surrogate key. SCD2 logic creates a new row for a player whenever team, position, or jersey changes, so one `player_id` can map to several `player_sk` values.
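
To make the natural key versus surrogate key distinction concrete, here is a small sketch of resolving the SCD2 version that was active on a given date. The `valid_from` and `valid_to` columns, the sample rows, and the `resolve_player_sk` helper are all assumptions for illustration; the actual `dim_player` validity columns are not shown on this page.

```python
from datetime import date
from typing import Optional

# Hypothetical dim_player rows: one natural key (player_id), two SCD2 versions.
dim_player = [
    {"player_sk": 1, "player_id": 101, "team_id": 10,
     "valid_from": date(2023, 10, 1), "valid_to": date(2024, 2, 8)},
    {"player_sk": 2, "player_id": 101, "team_id": 20,
     "valid_from": date(2024, 2, 8), "valid_to": None},  # None marks the current row
]


def resolve_player_sk(player_id: int, as_of: date) -> Optional[int]:
    """Return the surrogate key of the version active on `as_of`, if any."""
    for row in dim_player:
        if row["player_id"] != player_id:
            continue
        # Half-open interval: valid_from is inclusive, valid_to is exclusive.
        starts_ok = row["valid_from"] <= as_of
        ends_ok = row["valid_to"] is None or as_of < row["valid_to"]
        if starts_ok and ends_ok:
            return row["player_sk"]
    return None


print(resolve_player_sk(101, date(2023, 12, 25)))  # 1
print(resolve_player_sk(101, date(2024, 3, 1)))    # 2
```

Joining on `player_id` alone would match both rows here, so a join that suddenly doubles its row count is often a sign that the date condition was dropped.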

player_name

Source preview

flowchart LR
    A["CommonPlayerInfo
DISPLAY_FIRST_LAST"] --> B["raw_player_info
DISPLAY_FIRST_LAST"]
    B --> C["stg_player_info
display_first_last"]
    C --> D["dim_player
full_name"]
    D --> E["analytics_player_game_complete
player_name"]

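Key transformation: The name passes through raw unchanged, is lowercased to `display_first_last` in staging, becomes `full_name` in `dim_player`, and surfaces as `player_name` in the current `analytics_*` outputs for user-friendly querying.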

Metric math

Shooting Stats Lineage

fg_pct (Field Goal Percentage)

Source preview

flowchart LR
    A["BoxScoreTraditionalV3
FG_PCT"] --> B["raw_box_score_traditional
FG_PCT (float|None)"]
    B --> C["stg_box_score_traditional
fg_pct (float, ge=0, le=1)"]
    C --> D["fact_player_game_traditional
fg_pct (float)"]
    D --> E["agg_player_season
fg_pct = SUM(fgm)/SUM(fga)"]

Stage	Column	Notes
API	`FG_PCT`	Pre-computed by NBA
Raw	`FG_PCT`	Passed through
Staging	`fg_pct`	Validated: 0.0 - 1.0
Fact	`fg_pct`	Per-game value
Aggregate	`fg_pct`	Re-computed from season totals for accuracy

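Key transformation: In `agg_player_season`, the season `fg_pct` is recomputed as SUM(fgm) / SUM(fga) rather than averaging per-game percentages. A simple average weights every game equally regardless of shot attempts, so it generally does not equal the player's actual season percentage.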
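
The difference is easy to see with two games of very different volume. The rows below are made up, and the zero-attempt guard is an assumption about how an empty season might be handled rather than the documented behavior.

```python
# Hypothetical per-game rows for one player: field goals made and attempted.
games = [
    {"fgm": 1, "fga": 2},    # 0.500 on low volume
    {"fgm": 9, "fga": 10},   # 0.900 on high volume
]

# Season value as described above: SUM(fgm) / SUM(fga).
total_fgm = sum(g["fgm"] for g in games)
total_fga = sum(g["fga"] for g in games)
season_fg_pct = total_fgm / total_fga if total_fga else None

# Naive alternative: average the per-game percentages.
per_game = [g["fgm"] / g["fga"] for g in games if g["fga"]]
naive_fg_pct = sum(per_game) / len(per_game)

print(round(season_fg_pct, 3))  # 0.833
print(round(naive_fg_pct, 3))   # 0.7
```

The player made 10 of 12 shots, so 0.833 is the season percentage; the 0.7 average overstates the weight of the two-attempt game.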

ts_pct (True Shooting Percentage)

Source preview

flowchart LR
    A["BoxScoreAdvancedV3
TS_PCT"] --> B["raw_box_score_advanced
TS_PCT"]
    B --> C["stg_box_score_advanced
ts_pct (ge=0, le=1)"]
    C --> D["fact_player_game_advanced
ts_pct"]
    D --> E["agg_player_season
avg_ts_pct = AVG(ts_pct)"]

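Key transformation: In the current implementation, season-level `avg_ts_pct` is a simple average of per-game values, which weights every game equally regardless of shot volume. For a more accurate season figure, recompute from season totals of points, field goal attempts, and free throw attempts: PTS / (2 * (FGA + 0.44 * FTA)).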
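
A sketch of the recompute-from-totals approach, next to the per-game average it would replace. The game rows are invented, `true_shooting_pct` is a hypothetical helper, and returning `None` for a zero denominator is an assumption.

```python
from typing import Optional


def true_shooting_pct(pts: float, fga: float, fta: float) -> Optional[float]:
    """PTS / (2 * (FGA + 0.44 * FTA)); None when there were no attempts."""
    denominator = 2 * (fga + 0.44 * fta)
    return pts / denominator if denominator else None


# Hypothetical per-game rows for one player.
games = [
    {"pts": 8, "fga": 4, "fta": 0},
    {"pts": 20, "fga": 15, "fta": 10},
]

# Current behavior as described above: average the per-game values.
per_game = [true_shooting_pct(g["pts"], g["fga"], g["fta"]) for g in games]
avg_ts_pct = sum(per_game) / len(per_game)

# Alternative: sum the inputs first, then apply the formula once.
season_ts_pct = true_shooting_pct(
    sum(g["pts"] for g in games),
    sum(g["fga"] for g in games),
    sum(g["fta"] for g in games),
)

print(avg_ts_pct, season_ts_pct)
```

With these rows the averaged value is noticeably higher than the totals-based one, because the efficient four-attempt game counts as much as the fifteen-attempt game.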

Shared context keys

Game Context Lineage

game_id

Source preview

flowchart LR
    A["Multiple Endpoints
GAME_ID"] --> B["raw_* tables
GAME_ID (str)"]
    B --> C["stg_* tables
game_id (str, not null)"]
    C --> D["dim_game
game_id (PK)"]
    D --> E["All fact tables
game_id (FK: dim_game)"]

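The `game_id` is the most widely referenced key in the schema. Its value flows unchanged through all stages; only the name changes, from `GAME_ID` to `game_id` in staging, and the star layer adds the PK constraint in `dim_game` and FK constraints in the fact tables.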

season_year

Source preview

flowchart LR
    A["ScheduleLeagueV2
SEASON_ID"] --> B["raw_schedule
SEASON_ID"]
    B --> C["stg_schedule
season_id"]
    C --> D["dim_game
season_year (int)"]
    D --> E["analytics_player_game_complete
season_year"]

Key transformation: The API returns SEASON_ID as a string like "22024" (type prefix + year). The staging layer parses this to extract the integer year. dim_game stores it as season_year (int).
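
The parse described above might look like the following. This is a sketch that assumes the format is always a one-digit type prefix followed by a four-digit year, which is inferred from the single `"22024"` example; `parse_season_id` is a hypothetical name.

```python
def parse_season_id(season_id: str) -> tuple[str, int]:
    """Split a SEASON_ID such as "22024" into (type_prefix, season_year)."""
    # Assumed shape: exactly five digits, prefix first, year last.
    if len(season_id) != 5 or not season_id.isdigit():
        raise ValueError(f"unexpected SEASON_ID format: {season_id!r}")
    return season_id[0], int(season_id[1:])


prefix, season_year = parse_season_id("22024")
print(prefix, season_year)  # 2 2024
```

Failing loudly on an unexpected shape is deliberate in this sketch: a silently wrong `season_year` would surface later as a join mismatch that is harder to trace.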

Team Lineage

team_id (in game context)

Source preview

flowchart LR
    A["BoxScoreTraditionalV3
TEAM_ID"] --> B["raw_box_score_traditional
TEAM_ID"]
    B --> C["stg_box_score_traditional
team_id"]
    C --> D["fact_player_game_traditional
team_id"]
    D --> E["analytics_player_game_complete
team_id"]

Key transformation: Player game rows carry team_id directly from the box score feed into fact_player_game_traditional, and analytics_player_game_complete preserves that team context alongside season and date metadata.

Location and dimension resolution

Shot Chart Lineage

loc_x, loc_y (Court Coordinates)

Source preview

flowchart LR
    A["ShotChartDetail
LOC_X, LOC_Y"] --> B["raw_shot_chart
LOC_X, LOC_Y"]
    B --> C["stg_shot_chart
loc_x, loc_y (float)"]
    C --> D["fact_shot_chart
loc_x, loc_y"]

Coordinate system: LOC_X ranges from -250 to 250 (tenths of feet from basket center, left-right). LOC_Y ranges from -50 to 890 (tenths of feet from basket, towards half-court). The basket is at (0, 0). Current analytics rollups summarize shot zones in agg_shot_zones, but the raw coordinates remain available at fact_shot_chart grain.
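
Because the coordinates are in tenths of feet with the basket at the origin, shot distance in feet follows directly. The helpers below are hypothetical and only restate the ranges given above; they are not part of the documented schema.

```python
import math


def shot_distance_feet(loc_x: float, loc_y: float) -> float:
    """Straight-line distance from the basket at (0, 0), in feet."""
    # Coordinates are tenths of feet, so divide the norm by 10.
    return math.hypot(loc_x, loc_y) / 10.0


def in_documented_range(loc_x: float, loc_y: float) -> bool:
    """Check a point against the ranges stated on this page."""
    return -250 <= loc_x <= 250 and -50 <= loc_y <= 890


print(shot_distance_feet(0, 0))      # 0.0
print(shot_distance_feet(30, 40))    # 5.0
print(in_documented_range(300, 40))  # False
```

A range check like this is a quick way to spot a unit change upstream, for example coordinates arriving in feet or inches instead of tenths of feet.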

shot_zone (Dimension Resolution)

Source preview

flowchart LR
    A["ShotChartDetail
SHOT_ZONE_BASIC
SHOT_ZONE_AREA
SHOT_ZONE_RANGE"] --> B["stg_shot_chart
shot_zone_basic
shot_zone_area
shot_zone_range"]
    B --> C["dim_shot_zone
zone_id (PK)
zone_basic
zone_area
…

Key transformation: The three zone fields are denormalized in the API response. The transform extracts distinct combinations into dim_shot_zone and replaces the three text columns with a single zone_id FK in the fact table.
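
The dimension resolution described above can be sketched with plain dictionaries. Everything here is illustrative: the sample rows, the `shot_id` column, the sequential `zone_id` assignment, and the `zone_range` column name are assumptions (the diagram is truncated after `zone_area`), and the real transform may assign keys differently.

```python
ZONE_COLUMNS = ("shot_zone_basic", "shot_zone_area", "shot_zone_range")

# Hypothetical staging rows with placeholder zone labels.
stg_shot_chart = [
    {"shot_id": 1, "shot_zone_basic": "Zone A", "shot_zone_area": "Left", "shot_zone_range": "Short"},
    {"shot_id": 2, "shot_zone_basic": "Zone B", "shot_zone_area": "Center", "shot_zone_range": "Long"},
    {"shot_id": 3, "shot_zone_basic": "Zone A", "shot_zone_area": "Left", "shot_zone_range": "Short"},
]

# Assign one zone_id per distinct combination, in first-seen order.
zone_ids = {}
for row in stg_shot_chart:
    combo = tuple(row[col] for col in ZONE_COLUMNS)
    if combo not in zone_ids:
        zone_ids[combo] = len(zone_ids) + 1

# Dimension table: one row per distinct combination.
dim_shot_zone = [
    {"zone_id": zone_id, "zone_basic": basic, "zone_area": area, "zone_range": rng}
    for (basic, area, rng), zone_id in zone_ids.items()
]

# Fact table: the three text columns are replaced by the zone_id foreign key.
fact_shot_chart = [
    {"shot_id": row["shot_id"],
     "zone_id": zone_ids[tuple(row[col] for col in ZONE_COLUMNS)]}
    for row in stg_shot_chart
]

print(dim_shot_zone)
print(fact_shot_chart)
```

Three staging rows collapse to two dimension rows, and shots 1 and 3 share the same `zone_id`.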

Ratings and rollups

Advanced Metrics Lineage

off_rating / def_rating / net_rating

Source preview

flowchart LR
    A["BoxScoreAdvancedV3
OFF_RATING / DEF_RATING / NET_RATING"] --> B["stg_box_score_advanced"]
    B --> C["fact_player_game_advanced
off_rating, def_rating, net_rating"]
    C --> D["agg_player_season
avg_off_rating, avg_def_rating, avg_net_rating"]
    C --> E["agg_team_pace_and_efficiency
avg_ortg, avg_drtg, avg_net_rtg"]

Key transformation: Per-game ratings flow directly to the fact table. Aggregate tables compute player-season averages in agg_player_season and team-level pace/efficiency summaries in agg_team_pace_and_efficiency.

How the replay gets encoded

Lineage Metadata in Code

Staging: `source` metadata

Staging schemas track the original API column name:

person_id: int = pa.Field(
    nullable=False,
    gt=0,
    metadata={"source": "PERSON_ID"},
)

Star: `fk_ref` metadata

Star schemas track foreign key relationships:

team_id: int | None = pa.Field(
    nullable=True,
    gt=0,
    metadata={
        "description": "Team identifier",
        "fk_ref": "dim_team.team_id",
    },
)

Transform: `depends_on` class variable

Transformers declare their upstream dependencies:

class AggPlayerSeasonTransformer(BaseTransformer):
    output_table = "agg_player_season"
    depends_on = [
        "fact_player_game_traditional",
        "fact_player_game_advanced",
        "fact_player_game_misc",
    ]

Together, these three metadata sources (`source`, `fk_ref`, `depends_on`) enable fully automated lineage generation via `nbadb.docs_gen.lineage`.
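
As a rough picture of how three metadata sources can combine into one lineage graph, the sketch below flattens them into labeled edges. It does not show how `nbadb.docs_gen.lineage` is implemented: the dictionaries are simplified stand-ins for what the schemas and transformers declare, `lineage_edges` is a hypothetical helper, and attaching the `team_id` example to `fact_player_game_traditional` is an assumption.

```python
# Simplified stand-ins for the declared metadata (hypothetical structures).
staging_source = {
    "stg_player_info": {"person_id": "PERSON_ID"},
}
star_fk_refs = {
    "fact_player_game_traditional": {"team_id": "dim_team.team_id"},
}
depends_on = {
    "agg_player_season": [
        "fact_player_game_traditional",
        "fact_player_game_advanced",
        "fact_player_game_misc",
    ],
}


def lineage_edges() -> list[tuple[str, str, str]]:
    """Return (upstream, downstream, kind) edges from all three sources."""
    edges = []
    # Column-level: API field -> staging column.
    for table, columns in staging_source.items():
        for column, api_field in columns.items():
            edges.append((api_field, f"{table}.{column}", "source"))
    # Column-level: star column -> referenced dimension column.
    for table, columns in star_fk_refs.items():
        for column, target in columns.items():
            edges.append((f"{table}.{column}", target, "fk_ref"))
    # Table-level: upstream table -> transformer output.
    for table, upstreams in depends_on.items():
        for upstream in upstreams:
            edges.append((upstream, table, "depends_on"))
    return edges


for edge in lineage_edges():
    print(edge)
```

Keeping the edge kind on each tuple preserves the distinction between column-level links (`source`, `fk_ref`) and table-level links (`depends_on`).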

Next replay angle

Next steps from column lineage

Next stop

Zoom back out to table-level movement

Continue to Table Lineage when the issue has spread beyond one field and you need the full upstream/downstream dependency chain.

Next stop

Check naming and semantic intent

Use Field Reference or the Glossary when the lineage is clear but the meaning of the metric, suffix, or field family still is not.

Next stop

Verify the exact generated contract

Open Staging Reference or Star Reference when you need the current schema-backed type, nullability, and constraint details for the field you just traced.

