
Parquet Usage

Practice-facility guide for reading nbadb Parquet exports efficiently and correctly.

Use this guide when your workflow lives in Polars, Pandas, PyArrow, or DuckDB over files instead of inside nba.duckdb.

Know the floor layout first

nbadb writes Parquet exports under your data directory like this:

```
nbadb/parquet/
├── dim_player/
│   └── dim_player.parquet
├── dim_team/
│   └── dim_team.parquet
└── fact_shot_chart/
    ├── season_year=2023/
    │   └── part0.parquet
    └── season_year=2024/
        └── part0.parquet
```
| Pattern | Shape | Typical use |
| --- | --- | --- |
| `parquet/<table>/<table>.parquet` | Single-file table | Dimensions and non-partitioned tables |
| `parquet/<table>/season_year=<year>/part0.parquet` | Partitioned directory tree | Season-based fact tables |

Lead with column selection and lazy scans. Parquet rewards narrow, targeted possessions much more than “load every column and sort it out later.”

Choose the right tool for the possession

| Tool | Best for |
| --- | --- |
| Polars | Fast lazy scans and selective reads |
| DuckDB | SQL across many Parquet tables or partitions |
| Pandas | Compatibility with notebook and plotting ecosystems |
| PyArrow | Metadata, schema inspection, and lower-level Arrow workflows |

Read with Polars

Single-file table

```python
import polars as pl

players = pl.read_parquet("nbadb/parquet/dim_player/dim_player.parquet")
```

Select only the columns you need

```python
players = pl.read_parquet(
    "nbadb/parquet/dim_player/dim_player.parquet",
    columns=["player_id", "first_name", "last_name", "position"],
)
```

Lazy-scan a partitioned fact table

```python
made_shots = (
    pl.scan_parquet("nbadb/parquet/fact_shot_chart/**/*.parquet")
    .filter(pl.col("shot_made_flag") == 1)
    .select(["game_id", "player_id", "shot_zone_basic", "shot_distance"])
    .collect()
)
```

Query with DuckDB over Parquet files

Single-file table

```sql
SELECT *
FROM read_parquet('nbadb/parquet/dim_player/dim_player.parquet')
LIMIT 10;
```

Partitioned directory glob

```sql
SELECT season_year, COUNT(*) AS attempts
FROM read_parquet('nbadb/parquet/fact_shot_chart/**/*.parquet')
GROUP BY 1
ORDER BY 1;
```

Join across Parquet files directly

```sql
SELECT
    p.first_name || ' ' || p.last_name AS player,
    AVG(sc.shot_distance) AS avg_shot_distance
FROM read_parquet('nbadb/parquet/fact_shot_chart/**/*.parquet') sc
JOIN read_parquet('nbadb/parquet/dim_player/dim_player.parquet') p
    ON sc.player_id = p.player_id
GROUP BY 1
ORDER BY avg_shot_distance DESC
LIMIT 10;
```

Read with Pandas

```python
import pandas as pd

df = pd.read_parquet("nbadb/parquet/dim_player/dim_player.parquet")

subset = pd.read_parquet(
    "nbadb/parquet/dim_player/dim_player.parquet",
    columns=["player_id", "first_name", "last_name"],
)
```

Inspect with PyArrow

```python
import pyarrow.parquet as pq

metadata = pq.read_metadata("nbadb/parquet/dim_player/dim_player.parquet")
print(f"Rows: {metadata.num_rows}")
print(f"Columns: {metadata.num_columns}")

schema = pq.read_schema("nbadb/parquet/dim_player/dim_player.parquet")
print(schema)
```

Habits that pay off

| Do this | Why |
| --- | --- |
| Filter on `season_year` as early as possible for partitioned fact tables | You reduce scan cost before collection |
| Select only the columns you need | Narrow reads keep memory use predictable |
| Use DuckDB when you need joins across many tables or partitions | SQL over files is often simpler than manual DataFrame stitching |
| Use Parquet as the handoff format when downstream systems do not need the whole DuckDB file | Smaller, more modular delivery surface |
