Kaggle Setup

Download, inspect, and publish nbadb through the Kaggle delivery lane.

Use this guide as the loading dock for the public nbadb dataset on Kaggle: `wyattowalsh/basketball`.

Pick the right delivery format

| If you want… | Use… |
| --- | --- |
| A single portable database file | `nba.sqlite` |
| Fast local SQL and inspection | `nba.duckdb` |
| Columnar files for Polars/Pandas/Arrow | `parquet/` |
| Broadest compatibility | `csv/` |

`nbadb download` copies the latest Kaggle dataset into your configured data directory. If Kaggle provides `nba.sqlite` but not `nba.duckdb`, nbadb seeds DuckDB from the SQLite file automatically.

Choose your Kaggle route

| If you need to… | Start here | Why |
| --- | --- | --- |
| Get a ready-to-use local dataset through the CLI | Download via the nbadb CLI | Fastest path into the rest of the nbadb command surface |
| Control the download inside Python | Download via kagglehub | Easier notebook or script integration |
| Publish your own refreshed build | Upload your own build | Handles metadata generation and dataset upload |

Download via the nbadb CLI

```shell
nbadb download
```

That command downloads the dataset, copies the files into your data directory, and leaves the local folder ready for the rest of the CLI.

Download via kagglehub

```python
import kagglehub

path = kagglehub.dataset_download("wyattowalsh/basketball")
print(f"Dataset downloaded to: {path}")
```

Use this route when you want direct control over download handling inside Python.

What lands on disk

A default local layout looks like this:

```
nbadb/
├── nba.sqlite
├── nba.duckdb
├── parquet/
│   └── <table>/...
├── csv/
│   └── <table>.csv
└── dataset-metadata.json
```

| Path | What it is | Reach for it when… |
| --- | --- | --- |
| `nba.sqlite` | Portable relational export | You need maximum tool compatibility |
| `nba.duckdb` | Fast analytical local database | You want immediate SQL without file globs |
| `parquet/` | Columnar table exports | Your workflow is Polars, Pandas, Arrow, or DuckDB-over-files |
| `csv/` | Flat text exports | A downstream system cannot read DuckDB or Parquet |
| `dataset-metadata.json` | Kaggle dataset metadata | You are preparing or inspecting a publish handoff |

Load the files in Python

| Format | Best for |
| --- | --- |
| SQLite | Portable inspection or compatibility-oriented tools |
| DuckDB | Fast SQL, joins, and ad hoc analysis |
| Parquet | DataFrame-first or file-based analytical workflows |

SQLite

```python
import sqlite3

conn = sqlite3.connect("nbadb/nba.sqlite")
rows = conn.execute("SELECT * FROM dim_player LIMIT 5").fetchall()
```

DuckDB

```python
import duckdb

conn = duckdb.connect("nbadb/nba.duckdb")
df = conn.sql("SELECT * FROM dim_player LIMIT 5").pl()
```

Parquet with Polars

```python
import polars as pl

df = pl.read_parquet("nbadb/parquet/dim_player/dim_player.parquet")
```

Parquet with Pandas

```python
import pandas as pd

df = pd.read_parquet("nbadb/parquet/dim_player/dim_player.parquet")
```

Upload your own build

```shell
nbadb upload
```

Preflight checklist

| Confirm this first | Why |
| --- | --- |
| Your target data directory already contains the dataset you want to publish | `upload` publishes what is on disk |
| Kaggle credentials are available to the environment that kagglehub uses | Upload cannot authenticate without them |
| `NBADB_KAGGLE_DATASET` points at the correct slug if you are not using the default | Metadata and upload target must agree |
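The credentials item on the checklist can be verified programmatically. Kaggle's client libraries conventionally look for the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables or a `~/.kaggle/kaggle.json` file; the helper below is a hedged sketch based on that convention, not an nbadb function.

```python
import os
from pathlib import Path


def kaggle_credentials_present() -> bool:
    """Best-effort check for Kaggle API credentials.

    Mirrors the usual lookup order: environment variables first,
    then kaggle.json in the user's home directory.
    """
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return True
    return (Path.home() / ".kaggle" / "kaggle.json").is_file()
```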

Version notes and metadata

```shell
nbadb upload --message "Post-trade-deadline refresh"
```

- The CLI default message is "Automated update".
- nbadb ensures `dataset-metadata.json` exists before upload.
- The metadata generator uses the configured Kaggle dataset slug as the dataset id.
