Ball Movement
Column Lineage
Column-level lineage examples showing field transformations across pipeline stages
This page traces individual columns through the pipeline, from their NBA API source field through the raw, staging, and star schema layers. Use it like replay review for one touch in the possession: it pinpoints the exact field that drifted, changed names, or started failing validation.
Start here when the bug is field-shaped: a wrong percentage, a renamed key, a surprising nullability change, or a foreign key that no longer lands where you expect.
Quick navigation
Trace identity fields
Start with Player identity lineage when the issue is a key, natural identifier, or rename across layers.
Check metric math
Use Shooting stats lineage or Advanced metrics lineage when a percentage or rating changed unexpectedly.
Follow context keys
Jump to Game context lineage or Team lineage when the breakage is about joins rather than metric math.
Inspect metadata sources
Go to Lineage metadata in code when you need to confirm how schema metadata and transformer dependencies encode the replay.
Scan modes
| If the issue looks like… | Start here | Why |
|---|---|---|
| A renamed or drifting identifier | Player identity lineage | Keys usually reveal where naming changed between API, staging, and star layers |
| A wrong percentage or derived metric | Shooting stats lineage | Metric examples show where values are passed through versus recomputed |
| A join or season-context mismatch | Game context lineage | Shared keys like game_id and season_year explain most warehouse joins |
| A denormalization or dimension split | Shot chart lineage | These examples show how raw fields become normalized dimensions and foreign keys |
| A code-generation question | Lineage metadata in code | The schema metadata and depends_on declarations power the generated lineage surface |
How Column Lineage Works
Each column passes through up to four core stages, and some continue into analytics or aggregate tables:
```mermaid
flowchart LR
    A["NBA API Field<br/>(UPPER_CASE)"] --> B["Raw Schema<br/>(UPPER_CASE)"]
    B --> C["Staging Schema<br/>(snake_case)"]
    C --> D["Star Schema<br/>(snake_case + metadata)"]
```

The source metadata on staging schemas, together with the description and fk_ref metadata on star schemas, encodes this lineage.

| Frame | Typical change | What to watch for |
|---|---|---|
| API → Raw | Usually a straight pass-through | Nullable or mixed-type payloads |
| Raw → Staging | Renames to snake_case + validation | Contract tightening, parsed types, and nullability changes |
| Staging → Star | Modeling decisions and FK wiring | Surrogate keys, dimension resolution, and derived fields |
| Star → Analytics/Aggs | Convenience joins or recomputation | Semantic renames and metric rollups |
Player Identity Lineage
player_id
```mermaid
flowchart LR
    A["CommonPlayerInfo<br/>PERSON_ID"] --> B["raw_player_info<br/>PERSON_ID (int|None)"]
    B --> C["stg_player_info<br/>person_id (int, not null, gt=0)"]
    C --> D["dim_player<br/>player_id (int, not null)<br/>+ player_sk (surrogate)"]
    D --> E["fact_box_score_player<br/>player_id (FK: dim_player)"]
```

| Stage | Column Name | Type | Constraints |
|---|---|---|---|
| API Response | PERSON_ID | varies | none |
| Raw | PERSON_ID | int \| None | nullable |
| Staging | person_id | int | not null, gt=0 |
| Star (dim) | player_id | int | not null, gt=0, NK |
| Star (fact) | player_id | int | not null, FK: dim_player.player_id |
Key transformation: Raw PERSON_ID is renamed to person_id in staging. In dim_player, it becomes the natural key (NK) player_id alongside the generated player_sk surrogate key. SCD2 (slowly changing dimension, type 2) logic creates an additional row for a player whenever team, position, or jersey changes.
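As a rough illustration of the raw-to-staging step for this column, the sketch below renames PERSON_ID to person_id and applies the staging constraints from the table (not null, gt=0). The helper name `stage_person_id` and the sample rows are hypothetical; the pipeline itself enforces these rules through its schema definitions, not through this function.

```python
def stage_person_id(raw_row: dict) -> dict:
    """Rename PERSON_ID to person_id and enforce not-null, gt=0.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    value = raw_row.get("PERSON_ID")
    if value is None:
        raise ValueError("person_id must not be null in staging")
    person_id = int(value)  # raw values may arrive as int or numeric string
    if person_id <= 0:
        raise ValueError(f"person_id must be > 0, got {person_id}")
    return {"person_id": person_id}


# Illustrative rows only; the IDs are made up.
print(stage_person_id({"PERSON_ID": 101}))

for bad_row in ({"PERSON_ID": None}, {"PERSON_ID": 0}):
    try:
        stage_person_id(bad_row)
    except ValueError as exc:
        print("rejected:", exc)
```

A check like this is a quick way to reproduce a staging validation failure on a single row before digging into the full schema.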
player_name
```mermaid
flowchart LR
    A["CommonPlayerInfo<br/>DISPLAY_FIRST_LAST"] --> B["raw_player_info<br/>DISPLAY_FIRST_LAST"]
    B --> C["stg_player_info<br/>display_first_last"]
    C --> D["dim_player<br/>full_name"]
    D --> E["analytics_player_game_complete<br/>player_name"]
```

Key transformation: The name passes through raw unchanged, is lower-cased to display_first_last in staging, becomes full_name in dim_player, and is exposed as player_name in the current analytics_* outputs for user-friendly querying.
Shooting Stats Lineage
fg_pct (Field Goal Percentage)
```mermaid
flowchart LR
    A["BoxScoreTraditionalV3<br/>FG_PCT"] --> B["raw_box_score_traditional<br/>FG_PCT (float|None)"]
    B --> C["stg_box_score_traditional<br/>fg_pct (float, ge=0, le=1)"]
    C --> D["fact_player_game_traditional<br/>fg_pct (float)"]
    D --> E["agg_player_season<br/>fg_pct = SUM(fgm)/SUM(fga)"]
```

| Stage | Column | Notes |
|---|---|---|
| API | FG_PCT | Pre-computed by NBA |
| Raw | FG_PCT | Passed through |
| Staging | fg_pct | Validated: 0.0 - 1.0 |
| Fact | fg_pct | Per-game value |
| Aggregate | fg_pct | Re-computed from season totals for accuracy |
Key transformation: In agg_player_season, the season fg_pct is recomputed as SUM(fgm) / SUM(fga) rather than averaging per-game percentages, which would be statistically incorrect.
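To see why the two approaches diverge, here is a minimal sketch with made-up game rows. The function names are hypothetical and only mirror the SUM(fgm) / SUM(fga) expression above; they are not the transformer's actual code.

```python
# Illustrative per-game rows; the numbers are invented for the example.
games = [
    {"fgm": 1, "fga": 2},   # 0.500 on low volume
    {"fgm": 9, "fga": 30},  # 0.300 on high volume
]


def season_fg_pct(rows):
    """SUM(fgm) / SUM(fga); None when there are no attempts."""
    made = sum(r["fgm"] for r in rows)
    attempted = sum(r["fga"] for r in rows)
    return made / attempted if attempted else None


def mean_of_game_pcts(rows):
    """Unweighted average of per-game percentages (the approach to avoid)."""
    pcts = [r["fgm"] / r["fga"] for r in rows if r["fga"]]
    return sum(pcts) / len(pcts) if pcts else None


print(round(season_fg_pct(games), 4))      # 10 / 32
print(round(mean_of_game_pcts(games), 4))  # (0.5 + 0.3) / 2
```

The unweighted average gives the two-attempt game the same weight as the thirty-attempt game, which is why it overstates the season figure here.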
ts_pct (True Shooting Percentage)
```mermaid
flowchart LR
    A["BoxScoreAdvancedV3<br/>TS_PCT"] --> B["raw_box_score_advanced<br/>TS_PCT"]
    B --> C["stg_box_score_advanced<br/>ts_pct (ge=0, le=1)"]
    C --> D["fact_player_game_advanced<br/>ts_pct"]
    D --> E["agg_player_season<br/>avg_ts_pct = AVG(ts_pct)"]
```

Key transformation: In the current implementation, season-level avg_ts_pct is a simple average of per-game values, so low-volume games count as much as high-volume ones. For a volume-weighted season figure, recompute from season totals: SUM(PTS) / (2 * (SUM(FGA) + 0.44 * SUM(FTA))).
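The difference between averaging per-game values and recomputing from totals can be sketched as follows. The helper name and the game rows are hypothetical; only the formula PTS / (2 * (FGA + 0.44 * FTA)) comes from this page.

```python
def true_shooting_pct(pts, fga, fta):
    """PTS / (2 * (FGA + 0.44 * FTA)); None when the denominator is zero."""
    denom = 2 * (fga + 0.44 * fta)
    return pts / denom if denom else None


# Illustrative per-game rows; the numbers are invented for the example.
games = [
    {"pts": 30, "fga": 20, "fta": 10},
    {"pts": 4, "fga": 5, "fta": 0},
]

# What AVG(ts_pct) does: average the per-game values.
per_game = [true_shooting_pct(g["pts"], g["fga"], g["fta"]) for g in games]
avg_ts_pct = sum(per_game) / len(per_game)

# Recompute from season totals instead.
season_ts_pct = true_shooting_pct(
    sum(g["pts"] for g in games),
    sum(g["fga"] for g in games),
    sum(g["fta"] for g in games),
)

print(f"average of games: {avg_ts_pct:.3f}")
print(f"from totals:      {season_ts_pct:.3f}")
```

With these rows the totals-based value is higher, because the efficient game also carries most of the shooting volume.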
Game Context Lineage
game_id
```mermaid
flowchart LR
    A["Multiple Endpoints<br/>GAME_ID"] --> B["raw_* tables<br/>GAME_ID (str)"]
    B --> C["stg_* tables<br/>game_id (str, not null)"]
    C --> D["dim_game<br/>game_id (PK)"]
    D --> E["All fact tables<br/>game_id (FK: dim_game)"]
```

The game_id is the most widely referenced key in the schema. Its value flows unchanged through all stages (only the column name is lower-cased in staging), and it gains PK and FK constraints in the star layer.
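When a join on game_id stops landing, a quick orphan check against dim_game usually narrows it down. The sketch below uses plain Python collections and made-up IDs; `orphaned_game_ids` is a hypothetical helper, not a pipeline function.

```python
def orphaned_game_ids(fact_rows, dim_game_ids):
    """Return game_id values present in fact rows but missing from dim_game."""
    referenced = {row["game_id"] for row in fact_rows}
    return sorted(referenced - set(dim_game_ids))


# Illustrative data only; game_id stays a string at every stage.
dim_game_ids = ["0022400001", "0022400002"]
fact_rows = [
    {"game_id": "0022400001", "pts": 12},
    {"game_id": "0022400003", "pts": 7},
]

print(orphaned_game_ids(fact_rows, dim_game_ids))
```

Keeping the comparison on strings matters here: casting game_id to an integer would drop leading zeros and produce false mismatches.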
season_year
```mermaid
flowchart LR
    A["ScheduleLeagueV2<br/>SEASON_ID"] --> B["raw_schedule<br/>SEASON_ID"]
    B --> C["stg_schedule<br/>season_id"]
    C --> D["dim_game<br/>season_year (int)"]
    D --> E["analytics_player_game_complete<br/>season_year"]
```

Key transformation: The API returns SEASON_ID as a string like "22024" (type prefix + year). The staging layer parses this to extract the integer year. dim_game stores it as season_year (int).
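A minimal sketch of that parse, assuming the prefix is exactly one digit and the year is the remaining four digits, as in the "22024" example. The function name is hypothetical and the real transform may handle more formats.

```python
def parse_season_year(season_id: str) -> int:
    """Extract the integer year from a SEASON_ID such as "22024".

    Assumes a one-digit type prefix followed by a four-digit year.
    """
    if len(season_id) != 5 or not season_id.isdigit():
        raise ValueError(f"unexpected SEASON_ID format: {season_id!r}")
    return int(season_id[1:])


print(parse_season_year("22024"))
```

Rejecting anything that does not match the five-digit shape keeps a malformed ID from silently turning into a plausible-looking year.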
Team Lineage
team_id (in game context)
```mermaid
flowchart LR
    A["BoxScoreTraditionalV3<br/>TEAM_ID"] --> B["raw_box_score_traditional<br/>TEAM_ID"]
    B --> C["stg_box_score_traditional<br/>team_id"]
    C --> D["fact_player_game_traditional<br/>team_id"]
    D --> E["analytics_player_game_complete<br/>team_id"]
```

Key transformation: Player game rows carry team_id directly from the box score feed into fact_player_game_traditional, and analytics_player_game_complete preserves that team context alongside season and date metadata.
Shot Chart Lineage
loc_x, loc_y (Court Coordinates)
```mermaid
flowchart LR
    A["ShotChartDetail<br/>LOC_X, LOC_Y"] --> B["raw_shot_chart<br/>LOC_X, LOC_Y"]
    B --> C["stg_shot_chart<br/>loc_x, loc_y (float)"]
    C --> D["fact_shot_chart<br/>loc_x, loc_y"]
```

Coordinate system: LOC_X ranges from -250 to 250 (tenths of feet from basket center, left-right). LOC_Y ranges from -50 to 890 (tenths of feet from basket, towards half-court). The basket is at (0, 0). Current analytics rollups summarize shot zones in agg_shot_zones, but the raw coordinates remain available at fact_shot_chart grain.
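Because the coordinates are in tenths of feet with the basket at the origin, shot distance follows directly from the Pythagorean theorem. The helper below is a hypothetical sketch for working with fact_shot_chart rows, not a function from the pipeline.

```python
import math


def shot_distance_feet(loc_x: float, loc_y: float) -> float:
    """Distance from the basket at (0, 0), converting tenths of feet to feet."""
    return math.hypot(loc_x, loc_y) / 10.0


# Illustrative coordinates: 3 ft to the side and 4 ft out from the basket.
print(shot_distance_feet(30, 40))
```

The sign of loc_x only tells you which side of the basket the shot came from, so it has no effect on the distance.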
shot_zone (Dimension Resolution)
```mermaid
flowchart LR
    A["ShotChartDetail<br/>SHOT_ZONE_BASIC<br/>SHOT_ZONE_AREA<br/>SHOT_ZONE_RANGE"] --> B["stg_shot_chart<br/>shot_zone_basic<br/>shot_zone_area<br/>shot_zone_range"]
    B --> C["dim_shot_zone<br/>zone_id (PK)<br/>zone_basic<br/>zone_area<br/>…
```

Key transformation: The three zone fields are denormalized in the API response. The transform extracts distinct combinations into dim_shot_zone and replaces the three text columns with a single zone_id FK in the fact table.
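The dimension split described here can be sketched in a few lines. Everything below is an assumption made for illustration: the function name, the sequential zone_id assignment, the sample zone labels, and the zone_range column name (the source diagram is truncated after zone_area).

```python
def build_shot_zone_dim(staged_shots):
    """Split staged shots into dim_shot_zone rows and fact rows keyed by zone_id."""
    zone_ids = {}  # (basic, area, range) -> zone_id
    dim_rows, fact_rows = [], []
    for shot in staged_shots:
        key = (
            shot["shot_zone_basic"],
            shot["shot_zone_area"],
            shot["shot_zone_range"],
        )
        if key not in zone_ids:
            # Assumed ID scheme: sequential integers in first-seen order.
            zone_ids[key] = len(zone_ids) + 1
            dim_rows.append({
                "zone_id": zone_ids[key],
                "zone_basic": key[0],
                "zone_area": key[1],
                "zone_range": key[2],  # assumed column name
            })
        # Drop the three text columns and keep a single FK instead.
        fact = {k: v for k, v in shot.items() if not k.startswith("shot_zone_")}
        fact["zone_id"] = zone_ids[key]
        fact_rows.append(fact)
    return dim_rows, fact_rows


# Illustrative staged rows only.
paint = {"shot_zone_basic": "Restricted Area",
         "shot_zone_area": "Center(C)",
         "shot_zone_range": "Less Than 8 ft."}
corner = {"shot_zone_basic": "Left Corner 3",
          "shot_zone_area": "Left Side(L)",
          "shot_zone_range": "24+ ft."}
staged = [
    {"loc_x": 5.0, "loc_y": 10.0, **paint},
    {"loc_x": -230.0, "loc_y": 20.0, **corner},
    {"loc_x": -8.0, "loc_y": 4.0, **paint},
]

dim_rows, fact_rows = build_shot_zone_dim(staged)
print(len(dim_rows), [row["zone_id"] for row in fact_rows])
```

Three staged shots collapse to two dimension rows, and each fact row carries only the zone_id that points back to its combination.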
Advanced Metrics Lineage
off_rating / def_rating / net_rating
```mermaid
flowchart LR
    A["BoxScoreAdvancedV3<br/>OFF_RATING / DEF_RATING / NET_RATING"] --> B["stg_box_score_advanced"]
    B --> C["fact_player_game_advanced<br/>off_rating, def_rating, net_rating"]
    C --> D["agg_player_season<br/>avg_off_rating, avg_def_rating, avg_net_rating"]
    C --> E["agg_team_pace_and_efficiency<br/>avg_ortg, avg_drtg, avg_net_rtg"]
```

Key transformation: Per-game ratings flow directly to the fact table. Aggregate tables compute player-season averages in agg_player_season and team-level pace/efficiency summaries in agg_team_pace_and_efficiency.
Lineage Metadata in Code
Staging: source metadata
Staging schemas track the original API column name:
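```python
# Fragment of a staging schema class body; `pa` is assumed to be pandera
# (import pandera as pa).
person_id: int = pa.Field(
    nullable=False,
    gt=0,
    metadata={"source": "PERSON_ID"},  # original API column name
)
```

Star: fk_ref metadata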
Star schemas track foreign key relationships:
```python
# Fragment of a star schema class body.
team_id: int | None = pa.Field(
    nullable=True,
    gt=0,
    metadata={
        "description": "Team identifier",
        "fk_ref": "dim_team.team_id",  # table.column this key points to
    },
)
```

Transform: depends_on class variable
Transformers declare their upstream dependencies:
```python
class AggPlayerSeasonTransformer(BaseTransformer):
    output_table = "agg_player_season"
    depends_on = [
        "fact_player_game_traditional",
        "fact_player_game_advanced",
        "fact_player_game_misc",
    ]
```

Together, these three metadata sources (source, fk_ref, depends_on) enable fully automated lineage generation via nbadb.docs_gen.lineage.
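To make the idea concrete, here is a minimal sketch of how depends_on declarations could be turned into table-level lineage edges and a Mermaid diagram. It is not the implementation in nbadb.docs_gen.lineage: the BaseTransformer stub and the `table_edges` and `to_mermaid` helpers are hypothetical stand-ins.

```python
class BaseTransformer:
    """Hypothetical stand-in for the real base class."""
    output_table = ""
    depends_on = []


class AggPlayerSeasonTransformer(BaseTransformer):
    output_table = "agg_player_season"
    depends_on = [
        "fact_player_game_traditional",
        "fact_player_game_advanced",
        "fact_player_game_misc",
    ]


def table_edges(transformers):
    """Return (upstream, downstream) pairs from each transformer's depends_on."""
    return [
        (upstream, t.output_table)
        for t in transformers
        for upstream in t.depends_on
    ]


def to_mermaid(edges):
    """Render the edges as Mermaid flowchart source."""
    lines = ["flowchart LR"]
    lines += [f"    {up} --> {down}" for up, down in edges]
    return "\n".join(lines)


edges = table_edges([AggPlayerSeasonTransformer])
print(to_mermaid(edges))
```

Column-level lineage would need the source and fk_ref metadata as well; this sketch covers only the table-to-table edges that depends_on provides.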
Next steps from column lineage
Zoom back out to table-level movement
Continue to Table Lineage when the issue has spread beyond one field and you need the full upstream/downstream dependency chain.
Check naming and semantic intent
Use Field Reference or the Glossary when the lineage is clear but the meaning of the metric, suffix, or field family still is not.
Verify the exact generated contract
Open Staging Reference or Star Reference when you need the current schema-backed type, nullability, and constraint details for the field you just traced.