# Parquet Usage
Practice-facility guide for reading nbadb Parquet exports efficiently and correctly.
Use this guide when your workflow lives in Polars, Pandas, PyArrow, or DuckDB over files instead of inside `nba.duckdb`.
## Know the floor layout first
nbadb writes Parquet exports under your data directory like this:
```
nbadb/parquet/
├── dim_player/
│   └── dim_player.parquet
├── dim_team/
│   └── dim_team.parquet
└── fact_shot_chart/
    ├── season_year=2023/
    │   └── part0.parquet
    └── season_year=2024/
        └── part0.parquet
```

| Pattern | Shape | Typical use |
|---|---|---|
| `parquet/<table>/<table>.parquet` | Single-file table | Dimensions and non-partitioned tables |
| `parquet/<table>/season_year=<year>/part0.parquet` | Partitioned directory tree | Season-based fact tables |
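Because the `season_year=<year>` directory names are plain hive-style strings, a small stdlib helper can enumerate which seasons exist before you decide what to scan. A sketch — `list_season_partitions` is a hypothetical helper, not part of nbadb:

```python
import pathlib


def list_season_partitions(table_dir: str) -> list[int]:
    """Return sorted season years found under a hive-partitioned table.

    Hypothetical helper: parses the `<key>=<value>` directory names that
    the layout above uses for season-based fact tables.
    """
    return sorted(
        int(p.name.split("=", 1)[1])
        for p in pathlib.Path(table_dir).glob("season_year=*")
        if p.is_dir()
    )


# e.g. list_season_partitions("nbadb/parquet/fact_shot_chart")
```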
Lead with column selection and lazy scans. Parquet rewards narrow, targeted possessions much more than “load every column and sort it out later.”
## Choose the right tool for the possession
| Tool | Best for |
|---|---|
| Polars | Fast lazy scans and selective reads |
| DuckDB | SQL across many Parquet tables or partitions |
| Pandas | Compatibility with notebook and plotting ecosystems |
| PyArrow | Metadata, schema inspection, and lower-level Arrow workflows |
## Read with Polars

### Single-file table

```python
import polars as pl

players = pl.read_parquet("nbadb/parquet/dim_player/dim_player.parquet")
```

### Select only the columns you need

```python
players = pl.read_parquet(
    "nbadb/parquet/dim_player/dim_player.parquet",
    columns=["player_id", "first_name", "last_name", "position"],
)
```

### Lazy-scan a partitioned fact table

```python
shots = (
    pl.scan_parquet("nbadb/parquet/fact_shot_chart/**/*.parquet")
    .filter(pl.col("shot_made_flag") == 1)
    .select(["game_id", "player_id", "shot_zone_basic", "shot_distance"])
    .collect()
)
```

## Query with DuckDB over Parquet files
### Single-file table

```sql
SELECT *
FROM read_parquet('nbadb/parquet/dim_player/dim_player.parquet')
LIMIT 10;
```

### Partitioned directory glob

```sql
SELECT season_year, COUNT(*) AS attempts
FROM read_parquet('nbadb/parquet/fact_shot_chart/**/*.parquet')
GROUP BY 1
ORDER BY 1;
```

### Join across Parquet files directly
```sql
SELECT
  p.first_name || ' ' || p.last_name AS player,
  AVG(sc.shot_distance) AS avg_shot_distance
FROM read_parquet('nbadb/parquet/fact_shot_chart/**/*.parquet') sc
JOIN read_parquet('nbadb/parquet/dim_player/dim_player.parquet') p
  ON sc.player_id = p.player_id
GROUP BY 1
ORDER BY avg_shot_distance DESC
LIMIT 10;
```

## Read with Pandas
```python
import pandas as pd

df = pd.read_parquet("nbadb/parquet/dim_player/dim_player.parquet")

subset = pd.read_parquet(
    "nbadb/parquet/dim_player/dim_player.parquet",
    columns=["player_id", "first_name", "last_name"],
)
```

## Inspect with PyArrow
```python
import pyarrow.parquet as pq

metadata = pq.read_metadata("nbadb/parquet/dim_player/dim_player.parquet")
print(f"Rows: {metadata.num_rows}")
print(f"Columns: {metadata.num_columns}")

schema = pq.read_schema("nbadb/parquet/dim_player/dim_player.parquet")
print(schema)
```

## Habits that pay off
| Do this | Why |
|---|---|
| Filter on `season_year` as early as possible for partitioned fact tables | You reduce scan cost before collection |
| Select only the columns you need | Narrow reads keep memory use predictable |
| Use DuckDB when you need joins across many tables or partitions | SQL over files is often simpler than manual DataFrame stitching |
| Use Parquet as the handoff format when downstream systems do not need the whole DuckDB file | Smaller, more modular delivery surface |
## Related routes
- Analytics Quickstart for the fastest first query path
- Shot Chart Analysis for a concrete Parquet/DuckDB analysis pattern
- Kaggle Setup for where these files come from and how they are published