Skip to content

Instantly share code, notes, and snippets.

@rwcitek
Last active November 21, 2025 00:10
Show Gist options
  • Select an option

  • Save rwcitek/4c4b2b089d03f02728bca822e9a051b4 to your computer and use it in GitHub Desktop.

Select an option

Save rwcitek/4c4b2b089d03f02728bca822e9a051b4 to your computer and use it in GitHub Desktop.
duckdb-intro.ipynb
Display the source blob
Display the rendered blob
Raw
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyPb5fMaPgBpHSybyrN/Zsi1",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/rwcitek/4c4b2b089d03f02728bca822e9a051b4/duckdb-intro.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# DuckDB\n",
"\n"
],
"metadata": {
"id": "GJdWCl_SxfJt"
}
},
{
"cell_type": "markdown",
"source": [
"## Motivation\n",
"\n",
"- Query/Analyze data in files without having to create a database\n",
"- performance without operational overhead\n",
"\n"
],
"metadata": {
"id": "-eY6qXLMHpOO"
}
},
{
"cell_type": "markdown",
"source": [
"## What is DuckDB?\n",
"\n",
"- An In-Process OLAP Database (The key paradigm shift).\n",
"\n",
"- Embedded Simplicity (for Sysadmins/Hobbyists): Runs inside your application (like SQLite for transactions), zero external dependencies, zero server configuration/maintenance.\n",
"\n",
"- OLAP Power (for Analysts/DBAs): Optimized for complex, analytical queries (aggregates, joins) on large datasets.\n",
"\n",
"- Installation: Single-line simplicity.\n"
],
"metadata": {
"id": "bAH-QodZIFvV"
}
},
{
"cell_type": "markdown",
"source": [
"## Documentation\n",
"\n",
"https://duckdb.org/docs/stable/\n"
],
"metadata": {
"id": "lIIHE3s4YO2C"
}
},
{
"cell_type": "markdown",
"source": [
"## Installation\n"
],
"metadata": {
"id": "5GRfiZJ2IObM"
}
},
{
"cell_type": "code",
"source": [
"%%bash\n",
"curl https://install.duckdb.org | sh\n"
],
"metadata": {
"id": "AK1l_Z6AINjU"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Reading files\n"
],
"metadata": {
"id": "NnegoKc2xhSN"
}
},
{
"cell_type": "markdown",
"source": [
"### CSV\n"
],
"metadata": {
"id": "mO_YUU5qxhPE"
}
},
{
"cell_type": "markdown",
"source": [
"#### Download the CSV file then query\n",
"\n"
],
"metadata": {
"id": "sJaEFgg8Sz0l"
}
},
{
"cell_type": "code",
"source": [
"!curl -O 'https://duckdb.org/data/flights.csv'\n"
],
"metadata": {
"id": "xIyyQUAPJfpF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"SELECT count(1) FROM 'flights.csv';\n"
],
"metadata": {
"id": "xtuD2ofeIiNf"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"SELECT * FROM 'flights.csv';\n"
],
"metadata": {
"id": "C82phQ5gUNGE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### Query directly\n"
],
"metadata": {
"id": "9JY7On34S2of"
}
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL httpfs; -- For reading remote files (URLs)\n",
"\n",
"SELECT count(1)\n",
"FROM 'https://duckdb.org/data/flights.csv'\n",
";\n"
],
"metadata": {
"id": "9d98_8NhKdc2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL httpfs;\n",
"\n",
"SELECT *\n",
"FROM 'https://duckdb.org/data/flights.csv'\n",
";\n"
],
"metadata": {
"id": "OhZR3G4VKq-J"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### Query a US Census CSV file directly"
],
"metadata": {
"id": "GmyW8RNzTAoP"
}
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL httpfs ;\n",
"\n",
"SELECT\n",
" count(1) as \"total number of lines\"\n",
"FROM\n",
" read_csv(\n",
" 'https://www2.census.gov/programs-surveys/acs/data/eeo_tabulation/EEO_2006_2010/EEO_2006_2010_Tables_County(050)_CSV/EEO_10_5YR_EEONCT7WP2_Data.csv',\n",
" encoding = 'ISO_8859_1'\n",
" )\n",
";\n"
],
"metadata": {
"id": "V422ubPGTLzV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL httpfs ;\n",
"\n",
"SELECT *\n",
"FROM\n",
"read_csv(\n",
" 'https://www2.census.gov/programs-surveys/acs/data/eeo_tabulation/EEO_2006_2010/EEO_2006_2010_Tables_County(050)_CSV/EEO_10_5YR_EEONCT7WP2_Data.csv',\n",
" encoding = 'ISO_8859_1'\n",
")\n",
"limit 10\n",
";\n"
],
"metadata": {
"id": "wMsY3cXdNU8n"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL httpfs ;\n",
"\n",
"SELECT *\n",
"FROM\n",
" read_csv(\n",
" 'https://www2.census.gov/programs-surveys/acs/data/eeo_tabulation/EEO_2006_2010/EEO_2006_2010_Tables_County(050)_CSV/EEO_10_5YR_EEONCT7WP2_Data.csv',\n",
" encoding = 'ISO_8859_1'\n",
" )\n",
"offset 1\n",
"limit 10\n",
";\n"
],
"metadata": {
"id": "ja0752jZT426"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Excel\n"
],
"metadata": {
"id": "T4mb3KuAxhML"
}
},
{
"cell_type": "markdown",
"source": [
"#### Download the Excel file\n",
"\n",
"> Note: only `.xlsx` files are supported. `.xls` files are NOT supported."
],
"metadata": {
"id": "tB5sZ21rc-oQ"
}
},
{
"cell_type": "code",
"source": [
"!curl -s -O 'https://www2.census.gov/ces/bds/estab/age_initial_size_state/bds_e_ageisz_AZ_release.xlsx'\n"
],
"metadata": {
"id": "KsnZEEZoWWwt"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL excel;\n",
"LOAD excel;\n",
"\n",
"SELECT *\n",
"FROM\n",
" 'bds_e_ageisz_AZ_release.xlsx'\n",
"LIMIT 20\n",
";"
],
"metadata": {
"id": "BsigByKBWdzR"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### Query directly\n"
],
"metadata": {
"id": "GAirCqOCc5T-"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dzb1InbUxbD3"
},
"outputs": [],
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL httpfs ;\n",
"INSTALL excel;\n",
"LOAD excel;\n",
"\n",
"SELECT *\n",
"FROM\n",
" 'https://ddc-datascience.s3.us-west-1.amazonaws.com/animals.xlsx'\n",
"LIMIT 10\n",
";"
]
},
{
"cell_type": "markdown",
"source": [
"#### Specify sheet\n",
"\n"
],
"metadata": {
"id": "JKF7neRvclkU"
}
},
{
"cell_type": "markdown",
"source": [
"Implicitly, DuckDB reads the first sheet. However, you can specify the sheet you want to read."
],
"metadata": {
"id": "kJAE4cUydSSP"
}
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL httpfs ;\n",
"INSTALL excel ;\n",
"LOAD excel ;\n",
"\n",
"SELECT *\n",
"FROM\n",
" read_xlsx(\n",
" 'https://ddc-datascience.s3.us-west-1.amazonaws.com/animals.xlsx',\n",
" sheet=\"Sheet1\"\n",
" )\n",
"LIMIT 10\n",
";\n"
],
"metadata": {
"id": "Ee0oDusCZvRp"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL httpfs ;\n",
"INSTALL excel ;\n",
"LOAD excel ;\n",
"\n",
"SELECT *\n",
"FROM\n",
" read_xlsx(\n",
" 'https://ddc-datascience.s3.us-west-1.amazonaws.com/animals.xlsx',\n",
" sheet=\"Sheet3\"\n",
" )\n",
"LIMIT 10\n",
";\n"
],
"metadata": {
"id": "6Iuy3wyRZvNP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "fhhYwGUyZvJ5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### JSON\n"
],
"metadata": {
"id": "MmhWpsp5ddmT"
}
},
{
"cell_type": "code",
"source": [
"!curl -O 'https://duckdb.org/data/json/todos.json'\n"
],
"metadata": {
"id": "JTFcHvZDZvD-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"!jq . todos.json | head\n"
],
"metadata": {
"id": "HrmWa5VQmthv"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"\n",
"SELECT *\n",
"FROM read_json('todos.json')\n",
"LIMIT 5;"
],
"metadata": {
"id": "ZAgXUXTtZvA_"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"!curl -O 'https://ddc-datascience.s3.us-west-1.amazonaws.com/apd.incidents.json'\n"
],
"metadata": {
"id": "jZA4fv7yf842"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"!jq .features[:50] apd.incidents.json > sample.50.json\n",
"!head -20 sample.50.json\n"
],
"metadata": {
"id": "RAjvv3Tjmr0f"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"\n",
"SELECT\n",
" attributes.objectid,\n",
" attributes.blockaddress,\n",
" attributes.incidenttype,\n",
" attributes.reportdatetime,\n",
" geometry.x AS geometry_x,\n",
" geometry.y AS geometry_y\n",
"FROM\n",
" read_json(\n",
" 'sample.50.json'\n",
" ) as t1\n",
"limit 10\n",
";\n"
],
"metadata": {
"id": "hkrH2rTUirSJ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"I could not figure out how to unnest deeper nested data.\n"
],
"metadata": {
"id": "ZgOnegrRsT5y"
}
},
{
"cell_type": "markdown",
"source": [
"### Parquet\n"
],
"metadata": {
"id": "e0TldQoAdYtz"
}
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"\n",
"SELECT *\n",
"FROM\n",
" read_parquet('https://ddc-datascience.s3.us-west-1.amazonaws.com/fake.names.parquet')\n",
"limit 10\n",
";\n"
],
"metadata": {
"id": "TZRmJtMCVJTt"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Zip files and Geospatial files\n"
],
"metadata": {
"id": "Ln5Bce4gfcSP"
}
},
{
"cell_type": "code",
"source": [
"!curl -O 'https://www2.census.gov/geo/tiger/TIGER2024/TRACT/tl_2024_35_tract.zip'\n"
],
"metadata": {
"id": "CJF3dlKmqwMN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%script /root/.duckdb/cli/latest/duckdb\n",
"INSTALL spatial;\n",
"LOAD spatial;\n",
"\n",
"SELECT *\n",
"FROM st_read('/vsizip/tl_2024_35_tract.zip')\n",
"limit 10\n",
";\n"
],
"metadata": {
"id": "Kp_FCR7angz2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"I could not figure out how to get read the zip file directly without downloading first.\n"
],
"metadata": {
"id": "grjLw3rbsdzp"
}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment