
Parquet Usage

Practice-facility guide for reading nbadb Parquet exports efficiently and correctly.

Use this guide when your workflow lives in Polars, Pandas, PyArrow, or DuckDB over files instead of inside nba.duckdb.

Know the floor layout first

nbadb writes Parquet exports under your data directory like this:

```
nbadb/parquet/
├── dim_player/
│   └── dim_player.parquet
├── dim_team/
│   └── dim_team.parquet
└── fact_shot_chart/
    ├── season_year=2023/
    │   └── part0.parquet
    └── season_year=2024/
        └── part0.parquet
```
| Pattern | Shape | Typical use |
| --- | --- | --- |
| `parquet/<table>/<table>.parquet` | Single-file table | Dimensions and non-partitioned tables |
| `parquet/<table>/season_year=<year>/part0.parquet` | Partitioned directory tree | Season-based fact tables |

Lead with column selection and lazy scans. Parquet rewards narrow, targeted possessions much more than “load every column and sort it out later.”

Choose the right tool for the possession

| Tool | Best for |
| --- | --- |
| Polars | Fast lazy scans and selective reads |
| DuckDB | SQL across many Parquet tables or partitions |
| Pandas | Compatibility with notebook and plotting ecosystems |
| PyArrow | Metadata, schema inspection, and lower-level Arrow workflows |

Read with Polars

Single-file table

```python
import polars as pl

players = pl.read_parquet("nbadb/parquet/dim_player/dim_player.parquet")
```

Select only the columns you need

```python
players = pl.read_parquet(
    "nbadb/parquet/dim_player/dim_player.parquet",
    columns=["player_id", "first_name", "last_name", "position"],
)
```

Lazy-scan a partitioned fact table

```python
made_shots = (
    pl.scan_parquet("nbadb/parquet/fact_shot_chart/**/*.parquet")
    .filter(pl.col("shot_made_flag") == 1)
    .select(["game_id", "player_id", "shot_zone_basic", "shot_distance"])
    .collect()
)
```

Query with DuckDB over Parquet files

Single-file table

```sql
SELECT *
FROM read_parquet('nbadb/parquet/dim_player/dim_player.parquet')
LIMIT 10;
```

Partitioned directory glob

```sql
SELECT season_year, COUNT(*) AS attempts
FROM read_parquet('nbadb/parquet/fact_shot_chart/**/*.parquet')
GROUP BY 1
ORDER BY 1;
```

Join across Parquet files directly

```sql
SELECT
    p.first_name || ' ' || p.last_name AS player,
    AVG(sc.shot_distance) AS avg_shot_distance
FROM read_parquet('nbadb/parquet/fact_shot_chart/**/*.parquet') sc
JOIN read_parquet('nbadb/parquet/dim_player/dim_player.parquet') p
    ON sc.player_id = p.player_id
GROUP BY 1
ORDER BY avg_shot_distance DESC
LIMIT 10;
```

Read with Pandas

```python
import pandas as pd

df = pd.read_parquet("nbadb/parquet/dim_player/dim_player.parquet")

subset = pd.read_parquet(
    "nbadb/parquet/dim_player/dim_player.parquet",
    columns=["player_id", "first_name", "last_name"],
)
```

Inspect with PyArrow

```python
import pyarrow.parquet as pq

metadata = pq.read_metadata("nbadb/parquet/dim_player/dim_player.parquet")
print(f"Rows: {metadata.num_rows}")
print(f"Columns: {metadata.num_columns}")

schema = pq.read_schema("nbadb/parquet/dim_player/dim_player.parquet")
print(schema)
```

Habits that pay off

| Do this | Why |
| --- | --- |
| Filter on `season_year` as early as possible for partitioned fact tables | You reduce scan cost before collection |
| Select only the columns you need | Narrow reads keep memory use predictable |
| Use DuckDB when you need joins across many tables or partitions | SQL over files is often simpler than manual DataFrame stitching |
| Use Parquet as the handoff format when downstream systems do not need the whole DuckDB file | Smaller, more modular delivery surface |
