Last active
November 21, 2025 00:10
-
-
Save rwcitek/4c4b2b089d03f02728bca822e9a051b4 to your computer and use it in GitHub Desktop.
duckdb-intro.ipynb
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| { | |
| "nbformat": 4, | |
| "nbformat_minor": 0, | |
| "metadata": { | |
| "colab": { | |
| "provenance": [], | |
| "authorship_tag": "ABX9TyPb5fMaPgBpHSybyrN/Zsi1", | |
| "include_colab_link": true | |
| }, | |
| "kernelspec": { | |
| "name": "python3", | |
| "display_name": "Python 3" | |
| }, | |
| "language_info": { | |
| "name": "python" | |
| } | |
| }, | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "view-in-github", | |
| "colab_type": "text" | |
| }, | |
| "source": [ | |
| "<a href=\"https://colab.research.google.com/gist/rwcitek/4c4b2b089d03f02728bca822e9a051b4/duckdb-intro.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "# DuckDB\n", | |
| "\n" | |
| ], | |
| "metadata": { | |
| "id": "GJdWCl_SxfJt" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Motivation\n", | |
| "\n", | |
| "- Query/Analyze data in files without having to create a database\n", | |
| "- performance without operational overhead\n", | |
| "\n" | |
| ], | |
| "metadata": { | |
| "id": "-eY6qXLMHpOO" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## What is DuckDB?\n", | |
| "\n", | |
| "- An In-Process OLAP Database (The key paradigm shift).\n", | |
| "\n", | |
| "- Embedded Simplicity (for Sysadmins/Hobbyists): Runs inside your application (like SQLite for transactions), zero external dependencies, zero server configuration/maintenance.\n", | |
| "\n", | |
| "- OLAP Power (for Analysts/DBAs): Optimized for complex, analytical queries (aggregates, joins) on large datasets.\n", | |
| "\n", | |
| "- Installation: Single-line simplicity.\n" | |
| ], | |
| "metadata": { | |
| "id": "bAH-QodZIFvV" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Documentation\n", | |
| "\n", | |
| "https://duckdb.org/docs/stable/\n" | |
| ], | |
| "metadata": { | |
| "id": "lIIHE3s4YO2C" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Installation\n" | |
| ], | |
| "metadata": { | |
| "id": "5GRfiZJ2IObM" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%bash\n", | |
| "curl https://install.duckdb.org | sh\n" | |
| ], | |
| "metadata": { | |
| "id": "AK1l_Z6AINjU" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Reading files\n" | |
| ], | |
| "metadata": { | |
| "id": "NnegoKc2xhSN" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "### CSV\n" | |
| ], | |
| "metadata": { | |
| "id": "mO_YUU5qxhPE" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "#### Download the CSV file then query\n", | |
| "\n" | |
| ], | |
| "metadata": { | |
| "id": "sJaEFgg8Sz0l" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!curl -O 'https://duckdb.org/data/flights.csv'\n" | |
| ], | |
| "metadata": { | |
| "id": "xIyyQUAPJfpF" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "SELECT count(1) FROM 'flights.csv';\n" | |
| ], | |
| "metadata": { | |
| "id": "xtuD2ofeIiNf" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "SELECT * FROM 'flights.csv';\n" | |
| ], | |
| "metadata": { | |
| "id": "C82phQ5gUNGE" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "#### Query directly\n" | |
| ], | |
| "metadata": { | |
| "id": "9JY7On34S2of" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL httpfs; -- For reading remote files (URLs)\n", | |
| "\n", | |
| "SELECT count(1)\n", | |
| "FROM 'https://duckdb.org/data/flights.csv'\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "9d98_8NhKdc2" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL httpfs;\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM 'https://duckdb.org/data/flights.csv'\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "OhZR3G4VKq-J" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "#### Query a US Census CSV file directly" | |
| ], | |
| "metadata": { | |
| "id": "GmyW8RNzTAoP" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL httpfs ;\n", | |
| "\n", | |
| "SELECT\n", | |
| " count(1) as \"total number of lines\"\n", | |
| "FROM\n", | |
| " read_csv(\n", | |
| " 'https://www2.census.gov/programs-surveys/acs/data/eeo_tabulation/EEO_2006_2010/EEO_2006_2010_Tables_County(050)_CSV/EEO_10_5YR_EEONCT7WP2_Data.csv',\n", | |
| " encoding = 'ISO_8859_1'\n", | |
| " )\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "V422ubPGTLzV" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL httpfs ;\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM\n", | |
| "read_csv(\n", | |
| " 'https://www2.census.gov/programs-surveys/acs/data/eeo_tabulation/EEO_2006_2010/EEO_2006_2010_Tables_County(050)_CSV/EEO_10_5YR_EEONCT7WP2_Data.csv',\n", | |
| " encoding = 'ISO_8859_1'\n", | |
| ")\n", | |
| "limit 10\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "wMsY3cXdNU8n" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL httpfs ;\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM\n", | |
| " read_csv(\n", | |
| " 'https://www2.census.gov/programs-surveys/acs/data/eeo_tabulation/EEO_2006_2010/EEO_2006_2010_Tables_County(050)_CSV/EEO_10_5YR_EEONCT7WP2_Data.csv',\n", | |
| " encoding = 'ISO_8859_1'\n", | |
| " )\n", | |
| "offset 1\n", | |
| "limit 10\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "ja0752jZT426" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "### Excel\n" | |
| ], | |
| "metadata": { | |
| "id": "T4mb3KuAxhML" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "#### Download the Excel file\n", | |
| "\n", | |
| "> Note: only `.xlsx` files are supported. `.xls` files are NOT supported." | |
| ], | |
| "metadata": { | |
| "id": "tB5sZ21rc-oQ" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!curl -s -O 'https://www2.census.gov/ces/bds/estab/age_initial_size_state/bds_e_ageisz_AZ_release.xlsx'\n" | |
| ], | |
| "metadata": { | |
| "id": "KsnZEEZoWWwt" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL excel;\n", | |
| "LOAD excel;\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM\n", | |
| " 'bds_e_ageisz_AZ_release.xlsx'\n", | |
| "LIMIT 20\n", | |
| ";" | |
| ], | |
| "metadata": { | |
| "id": "BsigByKBWdzR" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "#### Query directly\n" | |
| ], | |
| "metadata": { | |
| "id": "GAirCqOCc5T-" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "id": "dzb1InbUxbD3" | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL httpfs ;\n", | |
| "INSTALL excel;\n", | |
| "LOAD excel;\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM\n", | |
| " 'https://ddc-datascience.s3.us-west-1.amazonaws.com/animals.xlsx'\n", | |
| "LIMIT 10\n", | |
| ";" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "#### Specify sheet\n", | |
| "\n" | |
| ], | |
| "metadata": { | |
| "id": "JKF7neRvclkU" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "Implicitly, DuckDB reads the first sheet. However, you can specify the sheet you want to read." | |
| ], | |
| "metadata": { | |
| "id": "kJAE4cUydSSP" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL httpfs ;\n", | |
| "INSTALL excel ;\n", | |
| "LOAD excel ;\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM\n", | |
| " read_xlsx(\n", | |
| " 'https://ddc-datascience.s3.us-west-1.amazonaws.com/animals.xlsx',\n", | |
| " sheet=\"Sheet1\"\n", | |
| " )\n", | |
| "LIMIT 10\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "Ee0oDusCZvRp" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL httpfs ;\n", | |
| "INSTALL excel ;\n", | |
| "LOAD excel ;\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM\n", | |
| " read_xlsx(\n", | |
| " 'https://ddc-datascience.s3.us-west-1.amazonaws.com/animals.xlsx',\n", | |
| " sheet=\"Sheet3\"\n", | |
| " )\n", | |
| "LIMIT 10\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "6Iuy3wyRZvNP" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [], | |
| "metadata": { | |
| "id": "fhhYwGUyZvJ5" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "### JSON\n" | |
| ], | |
| "metadata": { | |
| "id": "MmhWpsp5ddmT" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!curl -O 'https://duckdb.org/data/json/todos.json'\n" | |
| ], | |
| "metadata": { | |
| "id": "JTFcHvZDZvD-" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!jq . todos.json | head\n" | |
| ], | |
| "metadata": { | |
| "id": "HrmWa5VQmthv" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM read_json('todos.json')\n", | |
| "LIMIT 5;" | |
| ], | |
| "metadata": { | |
| "id": "ZAgXUXTtZvA_" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!curl -O 'https://ddc-datascience.s3.us-west-1.amazonaws.com/apd.incidents.json'\n" | |
| ], | |
| "metadata": { | |
| "id": "jZA4fv7yf842" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!jq .features[:50] apd.incidents.json > sample.50.json\n", | |
| "!head -20 sample.50.json\n" | |
| ], | |
| "metadata": { | |
| "id": "RAjvv3Tjmr0f" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "\n", | |
| "SELECT\n", | |
| " attributes.objectid,\n", | |
| " attributes.blockaddress,\n", | |
| " attributes.incidenttype,\n", | |
| " attributes.reportdatetime,\n", | |
| " geometry.x AS geometry_x,\n", | |
| " geometry.y AS geometry_y\n", | |
| "FROM\n", | |
| " read_json(\n", | |
| " 'sample.50.json'\n", | |
| " ) as t1\n", | |
| "limit 10\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "hkrH2rTUirSJ" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "I could not figure out how to unnest deeper nested data.\n" | |
| ], | |
| "metadata": { | |
| "id": "ZgOnegrRsT5y" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "### Parquet\n" | |
| ], | |
| "metadata": { | |
| "id": "e0TldQoAdYtz" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM\n", | |
| " read_parquet('https://ddc-datascience.s3.us-west-1.amazonaws.com/fake.names.parquet')\n", | |
| "limit 10\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "TZRmJtMCVJTt" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "### Zip files and Geospatial files\n" | |
| ], | |
| "metadata": { | |
| "id": "Ln5Bce4gfcSP" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "!curl -O 'https://www2.census.gov/geo/tiger/TIGER2024/TRACT/tl_2024_35_tract.zip'\n" | |
| ], | |
| "metadata": { | |
| "id": "CJF3dlKmqwMN" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%script /root/.duckdb/cli/latest/duckdb\n", | |
| "INSTALL spatial;\n", | |
| "LOAD spatial;\n", | |
| "\n", | |
| "SELECT *\n", | |
| "FROM st_read('/vsizip/tl_2024_35_tract.zip')\n", | |
| "limit 10\n", | |
| ";\n" | |
| ], | |
| "metadata": { | |
| "id": "Kp_FCR7angz2" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "I could not figure out how to get read the zip file directly without downloading first.\n" | |
| ], | |
| "metadata": { | |
| "id": "grjLw3rbsdzp" | |
| } | |
| } | |
| ] | |
| } |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment