camriddell/dutc-smarter-pandas-2025-05-02.ipynb

## dutc-smarter-pandas-2025-05-02.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "59ac0c47",
   "metadata": {},
   "source": [
    "# Smarter pandas for Data Science\n",
    "\n",
    "## A Seminar by ‘Don’t Use This Code’\n",
    "\n",
    "**Presenter**: Cameron Riddell <cameron@dutc.io>\n",
    "\n",
    "<div style=\"display: flex; justify-content: center; font-size: 2em; width: auto; padding: .25em 5em .25em 5em;\">\n",
    "    <p style=\"text-align: center\">\n",
    "        Join us on <a href=\"https://discord.gg/ZhJPKYSfNp\">Discord (https://discord.gg/ZhJPKYSfNp)</a> for discussion and guidance!\n",
    "    </p>\n",
    "</div>\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. [A Seminar by ‘Don’t Use This Code’](#a-seminar-by-‘don’t-use-this-code’)\n",
    "2. [Book a Class!](#book-a-class!)\n",
    "3. [About](#about)\n",
    "   * [Don’t Use This Code; Training & Consulting](#don’t-use-this-code;-training-&-consulting)\n",
    "\n",
    "## Book a Class!\n",
    "\n",
    "<big><big>Book a class or training for your team!</big></big>\n",
    "\n",
    "Please reach out to us at [learning@dutc.io](mailto:learning@dutc.io) if are\n",
    "interested in bringing this material, or any of our other material, to your\n",
    "team.\n",
    "\n",
    "We have courses on topics such as:\n",
    "- intro Python\n",
    "- expert Python\n",
    "- data engineering with Python\n",
    "- data science and scientific computing with `numpy`, `pandas`, and `xarray`\n",
    "\n",
    "If you reach out to us, we can also provide a printable copy of the notes,\n",
    "cleaned-up and in .pdf format, as well as a professionally edited video\n",
    "recording of this presentation.\n",
    "\n",
    "## About\n",
    "\n",
    "### Don’t Use This Code; Training & Consulting\n",
    "\n",
    "Don’t Use This Code is a professional training, coaching, and consulting\n",
    "company. We are deeply invested in the open source scientific computing\n",
    "community, and are dedicated to bringing better processes, better tools, and\n",
    "better understanding to the world.\n",
    "\n",
    "**Don’t Use This Code is growing! We are currently seeking new partners, new\n",
    "clients, and new engagements for our expert consulting and training\n",
    "services.**\n",
    "\n",
    "Our ideal client is an organization, large or small, using open source\n",
    "technologies, centering around the PyData stack for scientififc and numeric\n",
    "computing. Organizations looking to better employ these tools would benefit\n",
    "from our wide range of training courses on offer, ranging from an intensive\n",
    "introduction to Python fundamentals to advanced applications of Python for\n",
    "building large-scale, production systems. Working with your team, we can craft\n",
    "targeted curricula to meet your training goals. We are also available for\n",
    "consulting services such as building scientific computing and numerical\n",
    "analysis systems using technologies like Python and React.\n",
    "\n",
    "We pride ourselves on delivering top-notch training. We are committed to\n",
    "providing quality training that is uniquely valuable to each individual\n",
    "attendee, and we do so by investing in three key areas: our\n",
    "content, our processes, and our contributors.\n",
    "\n",
    "# Topics Covered"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2294098",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91d08be4",
   "metadata": {},
   "source": [
    "**Why use pandas**\n",
    "- Compare Python list, numpy.ndarray, pandas.Series\n",
    "- Restricted Computation Domain\n",
    "- What is Index alignment and why its important.\n",
    "\n",
    "**Simple analysis**\n",
    "- Loading data from files & in-memory\n",
    "- indexing/slicing/filtering\n",
    "- simple data exploration\n",
    "    - aggregations\n",
    "- simple plots\n",
    "\n",
    "**Data Cleaning**\n",
    "- column name normalization\n",
    "- mixed types: casting to specific dtypes\n",
    "- missing data\n",
    "- duplicates\n",
    "- unnecessary columns/rows\n",
    "- entity normalization\n",
    "\n",
    "**Grouped Operations**\n",
    "- apply, agg, transform\n",
    "    - use specific verbs (agg/transform > apply)\n",
    "    - avoid 1 function call per group\n",
    "\n",
    "**Reshaping Data**\n",
    "- wide vs long\n",
    "- melt         → wide to long\n",
    "- pivot        → long to wide\n",
    "- pivot_table  → long to wide w/ agg\n",
    "- stack        → wide to long (via the index)\n",
    "- unstack      → long to wide (via the index)\n",
    "- lreshape     → generic wide to long\n",
    "- wide_to_long → patterned widt to long\n",
    "\n",
    "# Why Use Pandas at all?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21d6e7ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac82c840",
   "metadata": {},
   "source": [
    "**Can’t we analyze tabular data in Python?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2a4f1a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "\n",
    "buffer = StringIO('''\n",
    "product,quantity,price\n",
    "apple,10,0.5\n",
    "banana,5,0.3\n",
    "apple,4,0.5\n",
    "banana,7,0.3\n",
    "orange,8,0.7\n",
    "'''.lstrip())\n",
    "\n",
    "# ① What product has the highest revenue (quantity * price)\n",
    "# ② How often does each product appear?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9e1ff61",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "\n",
    "buffer = StringIO('''\n",
    "product,quantity,price\n",
    "apple,10,0.5\n",
    "banana,5,0.3\n",
    "apple,4,0.5\n",
    "banana,7,0.3\n",
    "orange,8,0.7\n",
    "'''.lstrip())\n",
    "\n",
    "from pandas import read_csv\n",
    "\n",
    "df = read_csv(buffer)\n",
    "\n",
    "# Now using pandas…\n",
    "# ① What product has the highest revenue (quantity * price)\n",
    "# ② How often does each product appear?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29dbbfd5",
   "metadata": {},
   "source": [
    "## Restricted Computation Domains\n",
    "\n",
    "*Array vs List Memory*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de4f86f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sys import getsizeof\n",
    "from pandas import Series\n",
    "\n",
    "class T:\n",
    "    pass\n",
    "\n",
    "# [PythonObject(...)]\n",
    "xs = [1   for _ in range(10_000)]\n",
    "# xs = [T() for _ in range(10_000)]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06bc6b71",
   "metadata": {},
   "source": [
    "*Array vs List Growing*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "539235df",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array, append as np_append, concatenate\n",
    "from numpy.random import default_rng\n",
    "from _lib import timed\n",
    "from functools import partial\n",
    "timed = partial(timed, padding=40)\n",
    "\n",
    "# numpy/pandas are used to compactly represent data\n",
    "#   repeatedly appending to these structures is SLOW\n",
    "\n",
    "rng = default_rng(0)\n",
    "np_xs = rng.normal(0, 10, size=10_000)\n",
    "py_xs = np_xs.tolist()\n",
    "\n",
    "# ⓪ python list.append\n",
    "# ① numpy  array.append\n",
    "# ② list.append → numpy.concatenate"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "148b6737",
   "metadata": {},
   "source": [
    "*Array vs List Speed of Operations*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "341c10a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array, sum as np_sum\n",
    "from numpy.random import default_rng\n",
    "from _lib import timed\n",
    "\n",
    "# numpy/pandas are used to compactly represent data\n",
    "#   repeatedly appending to these structures is SLOW\n",
    "#   if youre going an array, stay using that array (or array functions)\n",
    "\n",
    "rng = default_rng(0)\n",
    "np_xs = rng.normal(0, 10, size=1_000_000)\n",
    "py_xs = np_xs.tolist()\n",
    "\n",
    "py_sum = sum\n",
    "np_sum = np_sum"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7457a8f3",
   "metadata": {},
   "source": [
    "If one remains within the restricted computation domain of numpy/pandas\n",
    "it will often be FASTER to traverse the data multiple times with simple\n",
    "supported transformations than it will be to traverse the data once with\n",
    "a complex transformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48fdc4e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array, sum as np_sum\n",
    "from numpy.random import default_rng\n",
    "from _lib import timed\n",
    "from functools import partial\n",
    "\n",
    "timed = partial(timed, padding=30)\n",
    "\n",
    "rng = default_rng(0)\n",
    "np_xs = rng.normal(100, 10, size=1_000_000)\n",
    "py_xs = np_xs.tolist()\n",
    "\n",
    "with timed('multiple pass transform'):\n",
    "    res1 = (\n",
    "        (np_xs**2 + 3*np_xs - 5)\n",
    "        * (np_xs % 7)\n",
    "        / (np_xs + 2)**1.5\n",
    "        + (np_xs**0.5)\n",
    "        * (np_xs // 2)\n",
    "        - np_xs**1.1\n",
    "    )\n",
    "\n",
    "with timed('single pass transformation'):\n",
    "    res2 = [\n",
    "        (\n",
    "            (x**2 + 3*x - 5)\n",
    "            * (x % 7)\n",
    "            / (x + 2)**1.5\n",
    "            + (x**0.5)\n",
    "            * (x // 2)\n",
    "            - x**1.1\n",
    "        )\n",
    "        for x in np_xs\n",
    "    ]\n",
    "\n",
    "from numpy import allclose\n",
    "assert allclose(res1, res2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aee4a5d6",
   "metadata": {},
   "source": [
    "**How Does NumPy Do This?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f438225",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array\n",
    "\n",
    "py_xs = [1.1, 2, '3']\n",
    "np_xs = array(py_xs)\n",
    "\n",
    "# [PythonObject, ...]\n",
    "print(py_xs, [type(x) for x in py_xs])\n",
    "print(np_xs, np_xs.dtype)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "714bab36",
   "metadata": {},
   "source": [
    "```zsh\n",
    "# credit: Jake Vanderplas\n",
    "# source: https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png\n",
    "\n",
    "eog images/jake_vanderplas_numpy_list.png\n",
    "```\n",
    "\n",
    "## Data Types (preview)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb1ce7ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Categorical, DataFrame, date_range, timedelta_range\n",
    "\n",
    "class T:\n",
    "    pass\n",
    "\n",
    "df = DataFrame({\n",
    "    'bool':         [True, False, True],\n",
    "    'int':          [1, 2, 3],\n",
    "    'float':        [1.1, 2.2, 3.3],\n",
    "    'datetime':     date_range('2000-01-01', periods=3, freq='d'),\n",
    "    'timedelta':    timedelta_range('1 day', periods=3),\n",
    "    'string':       ['a', 'b', 'c'],\n",
    "    'categorical':  Categorical(['a', 'b', 'c']),\n",
    "    'object':       ['a', 1, T()],\n",
    "}).astype({'string': 'string'})\n",
    "\n",
    "print(\n",
    "    df,\n",
    "    df.dtypes,\n",
    "    sep='\\n{}\\n'.format('\\N{box drawings light horizontal}' * 100)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "866b1e97",
   "metadata": {},
   "source": [
    "- bool: stores True/False values.\n",
    "- int: stores whole numbers with optional signed/unsigned and bit-length variations.\n",
    "- float: stores decimal numbers with IEEE 754 double-precision floating-point representation.\n",
    "- datetime: Represents timestamps with nanosecond precision, supporting time zone awareness.\n",
    "- timedelta: Captures differences between timestamps, useful for time-based calculations.\n",
    "- string: Optimized string storage with variable-length UTF-8 encoding for efficient memory usage.\n",
    "- categorical: Stores repeated string-like values as integer codes mapped to categories, reducing memory usage.\n",
    "- object: Generic type storing arbitrary Python objects, often inefficient for large datasets.\n",
    "\n",
    "*Special Datatyped Operations*\n",
    "- [datetime|timedelta → `.dt`](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html#pandas.Series.dt)\n",
    "- [category → `.cat`](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.html#pandas.Series.cat)\n",
    "- [object|string → `.str`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html#pandas.Series.str)\n",
    "\n",
    "## Core Pieces\n",
    "\n",
    "A Restricted Computation Domain is an \"area\" where one\n",
    "can perform *fast* computations, only if one abides by the rules\n",
    "of that space.\n",
    "\n",
    "*pandas.Series*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61bf9fd5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series\n",
    "\n",
    "# ① construction\n",
    "s = Series([1,2,3,4], index=[*'abcd'])\n",
    "\n",
    "# ② .dtype\n",
    "print(s.dtype)\n",
    "\n",
    "# ③ .index\n",
    "# print(s.index)\n",
    "\n",
    "# ④ .loc\n",
    "# print(s.loc['b'])\n",
    "\n",
    "# ⑤ .iloc\n",
    "# print(s.iloc[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa4c3c49",
   "metadata": {},
   "source": [
    "*pandas.DataFrame*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0959afce",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "# ① construction\n",
    "df = DataFrame({\n",
    "    'a': [1,2,3],\n",
    "    'b': [4,5,6],\n",
    "    'c': [*'xyz'],\n",
    "})\n",
    "\n",
    "# ② .dtypes\n",
    "print(df.dtypes) # datatypes for each column\n",
    "\n",
    "# ③ .index\n",
    "# print(df.index) # the index that all underlying Series share\n",
    "                # (row index)\n",
    "\n",
    "# ④ .columns\n",
    "# print(df.index) # the names of all underlying Series\n",
    "                # (column index)\n",
    "\n",
    "# ⑤ .loc\n",
    "# print(df.loc[1, :])   # row w/ index label 1, all columns\n",
    "# print(df.loc[:, 'a']) # all rows, only column 'a'\n",
    "# print(df.loc[1, 'a']) # row w/ index label 1, only column 'a'\n",
    "\n",
    "# ⑥ .iloc\n",
    "# print(df.iloc[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28a0aa29",
   "metadata": {},
   "source": [
    "## Rules of Domain Operations\n",
    "\n",
    "*NumPy Broadcasting*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8cc684e",
   "metadata": {},
   "outputs": [],
   "source": [
    "xs = [1, 2, 3, 4]\n",
    "ys = [5, 6, 7, 8]\n",
    "\n",
    "print(xs + ys)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ddc2d00",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array\n",
    "\n",
    "xs = array([ 1,  1,  1,  1])\n",
    "ys = array([\n",
    "    [ 1,  2,  3,  4],\n",
    "    [10, 20, 30, 40],\n",
    "])\n",
    "\n",
    "# ① array vs scalar\n",
    "print(f'{xs + 10 = }')\n",
    "\n",
    "# ② array vs array\n",
    "print(xs + ys)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f3a55f1",
   "metadata": {},
   "source": [
    "*pandas Index Alignment*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "228bb135",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [],
   "source": [
    "from pandas import Series, DataFrame\n",
    "\n",
    "s1 = Series([ 1,  2,  3,  4], index=['a', 'b', 'c', 'd'          ])\n",
    "s2 = Series([10, 20, 30, 40], index=[     'b',      'd', 'e', 'f'])\n",
    "print(s1)\n",
    "print(s2)\n",
    "\n",
    "# ① adding 2 series\n",
    "# ② combine into single DataFrame\n",
    "# ③ scalar broadcast (if any element in the operation does not have an Index, it uses broadcasting rules)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98720279",
   "metadata": {},
   "source": [
    "# Simple Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d611f11",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38e53388",
   "metadata": {},
   "source": [
    "## Loading Data\n",
    "\n",
    "### DataFrames IO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52e14084",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas\n",
    "\n",
    "for fname in dir(pandas):\n",
    "    if not fname.startswith('read_'):\n",
    "        continue\n",
    "    print(fname)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18204bff",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "from textwrap import dedent\n",
    "from pandas import read_csv\n",
    "\n",
    "buffer = StringIO(dedent('''\n",
    "    col1,col2\n",
    "    1,2\n",
    "    3,4\n",
    "''').strip())\n",
    "\n",
    "df = read_csv(buffer)\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85a046ef",
   "metadata": {},
   "source": [
    "**text vs binary based formats**\n",
    "\n",
    "- text based formats, easy to inspect on disk\n",
    "- binary based formats, smaller size and faster IO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e1f3458",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, date_range, Categorical, read_csv, read_parquet\n",
    "from io import StringIO, BytesIO\n",
    "from contextlib import contextmanager\n",
    "\n",
    "df = DataFrame({\n",
    "\t'date': date_range('2000-01-01', periods=4, freq='ME'),\n",
    "\t'ticker': Categorical(['abcd', 'efgh', 'ijkl', 'mnop']),\n",
    "\t'value': [100, 200, 300, 400],\n",
    "}).apply(lambda s: s.repeat(10_000))\n",
    "# print(df)\n",
    "\n",
    "df.to_csv('data/test.csv', index=False) # text-based file format\n",
    "df.to_parquet('data/test.parquet') # binary-based file format\n",
    "\n",
    "# ① Round trip\n",
    "csv = read_csv(\n",
    "    'data/test.csv',\n",
    "    parse_dates=['date'],\n",
    "    date_format={\n",
    "        'date': '%Y-%m-%d'\n",
    "    }\n",
    ")\n",
    "parq = read_parquet('data/test.parquet')\n",
    "\n",
    "print(\n",
    "    DataFrame({\n",
    "        'orig': df.dtypes,\n",
    "        'csv': csv.dtypes,\n",
    "        'parq': parq.dtypes\n",
    "    })\n",
    ")\n",
    "\n",
    "# ② On-disk size (below)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "faad24f1",
   "metadata": {},
   "source": [
    "```zsh\n",
    "head -n5 data/test.csv\n",
    "# head -n5 data/test.parquet\n",
    "\n",
    "# du -sh data/test.{parquet,csv}\n",
    "```\n",
    "\n",
    "### Comon DataFrames Constructors\n",
    "\n",
    "from_dict\n",
    "from_records\n",
    "json_normalize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92d6637e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "# df = DataFrame([[1],[2],[3]], columns=['a'])\n",
    "# DataFrame.__init__ → type(data) [is dict]       → from_dict\n",
    "# DataFrame.__init__ → type(data) [is list[dict]] → from_recrods\n",
    "\n",
    "# df = DataFrame({'a': [1,2,3]})\n",
    "# df = DataFrame.from_dict({'a': [1,2,3]})\n",
    "\n",
    "# df = DataFrame([{'a': 1}, {'a': 2}, {'b': 3}])\n",
    "df = DataFrame.from_records([{'a': 1}, {'a': 2}, {'b': 3}])\n",
    "\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e22fb3a2",
   "metadata": {},
   "source": [
    "*json vs records — nested data*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3e50199",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [\n",
    "  {\n",
    "    \"order_id\": 1,\n",
    "    \"customer\": {\n",
    "      \"name\": \"Alice\",\n",
    "      \"email\": \"alice@example.com\"\n",
    "    },\n",
    "    \"items\": [\n",
    "      {\"product\": \"Laptop\", \"price\": 999.99, \"quantity\": 1},\n",
    "      {\"product\": \"Mouse\", \"price\": 19.99, \"quantity\": 2}\n",
    "    ],\n",
    "    \"shipping_address\": {\n",
    "      \"street\": \"123 Elm St\",\n",
    "      \"city\": \"Metropolis\"\n",
    "    }\n",
    "  },\n",
    "  {\n",
    "    \"order_id\": 2,\n",
    "    \"customer\": {\n",
    "      \"name\": \"Bob\",\n",
    "      \"email\": \"bob@example.com\"\n",
    "    },\n",
    "    \"items\": [\n",
    "      {\"product\": \"Keyboard\", \"price\": 49.99, \"quantity\": 1}\n",
    "    ],\n",
    "    \"shipping_address\": {\n",
    "      \"street\": \"456 Oak St\",\n",
    "      \"city\": \"Smallville\"\n",
    "    }\n",
    "  }\n",
    "]\n",
    "\n",
    "from pandas import json_normalize\n",
    "\n",
    "df = json_normalize(\n",
    "    data,\n",
    "    # record_path=['items'],\n",
    "    # meta=[\n",
    "    #     ['order_id'],\n",
    "    #     ['customer', 'name'],\n",
    "    #     ['customer', 'email'],\n",
    "    # ]\n",
    ")\n",
    "\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa03141a",
   "metadata": {},
   "source": [
    "### Selecting Columns & Rows\n",
    "\n",
    "**projection**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a794f8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "df = DataFrame({\n",
    "    'abc': [1, 2, 3, 4],\n",
    "    'bcd': [5, 6, 7, 8],\n",
    "    'cde': [*'wxyz'],\n",
    "})\n",
    "print(df, end='\\n\\n')\n",
    "\n",
    "# ① .__getitem__ → scalar, list, boolean\n",
    "print(\n",
    "    # df['a'],\n",
    "    # df[['a', 'c']],\n",
    "    # df[[True, True, False, True]],\n",
    "    # df.columns == 'a',\n",
    "    # df.columns.isin(['a', 'b']),\n",
    "\n",
    "    # df.loc[[True, False, False, True], :], # .loc[row_indexer, column_indexer]\n",
    "    # df.loc[:, [True, False, False]],\n",
    "    sep='\\n',\n",
    ")\n",
    "\n",
    "# ② .filter → items, like, regex\n",
    "print(\n",
    "    # df.filter(items=['abc', 'bcd']),\n",
    "    # df.filter(like='b'),\n",
    ")\n",
    "\n",
    "# ③ .select_dtypes(…)\n",
    "print(\n",
    "    # df.select_dtypes('int')\n",
    "    # df.select_dtypes('object')\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53cd164d",
   "metadata": {},
   "source": [
    "Making new columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e81ec55",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "orig = DataFrame({\n",
    "    'abc': [1, 2, 3],\n",
    "    'def': [4, 5, 6],\n",
    "})\n",
    "print(orig, end='\\n\\n')\n",
    "\n",
    "# ① __setitem__\n",
    "# ② .assign"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13164236",
   "metadata": {},
   "source": [
    "**filtering**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e22f1e3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import chain, repeat\n",
    "from pandas import DataFrame\n",
    "\n",
    "df = DataFrame({\n",
    "    'group': [*chain(repeat('a', 5), repeat('b', 5))],\n",
    "    'value': [*range(10)],\n",
    "})\n",
    "# print(df)\n",
    "\n",
    "# ① .loc → scalar, slice, list, boolean, callable\n",
    "print(\n",
    "    # df.loc[1], # scalar\n",
    "    # df.loc[4:], # slice\n",
    "    # df.loc[[3, 4, 5]], # list\n",
    "    # df.loc[df['group'] == 'a'] # boolean\n",
    "\n",
    "    # df.loc[df['value'] > 5],\n",
    "    # df.loc[lambda d: d['value'] > 5],\n",
    "\n",
    "    # df.assign(\n",
    "    #     value2=df['value'] ** 2\n",
    "    # )\n",
    "    # .loc[df['value2'] > 16]\n",
    "    # ,\n",
    "    # df,\n",
    "    sep='\\n',\n",
    ")\n",
    "\n",
    "# ② .iloc → scalar, slice, list\n",
    "# print(\n",
    "#     df.iloc[1], # scalar\n",
    "#     df.iloc[4:], # slice\n",
    "#     df.iloc[[0, 1, 2]], # list\n",
    "#     sep='\\n',\n",
    "# )\n",
    "\n",
    "# ③ .drop(…) → remove rows by index\n",
    "print(\n",
    "    # df,\n",
    "    # df.drop(5, axis='rows'),\n",
    "    # df.drop('group', axis='columns'),\n",
    "\n",
    "    sep='\\n'\n",
    ")\n",
    "\n",
    "# ④ .query → convenience syntax\n",
    "print(\n",
    "    # df.assign(\n",
    "    #     value2=df['value'] ** 2\n",
    "    # )\n",
    "    # .query('value2 > 16')\n",
    "    # .loc[lambda d: d['value2 test'] > 16]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39773027",
   "metadata": {},
   "source": [
    "## Views, Copies, & SettingWithCopyWarning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4b56c28",
   "metadata": {},
   "outputs": [],
   "source": [
    "xs = [1, 2, 3]\n",
    "# ys = [1, 2, 3]\n",
    "ys = xs # xs & ys both refer to the SAME list\n",
    "\n",
    "xs[0] = -99\n",
    "print(xs)\n",
    "print(ys)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f775b65a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import array, shares_memory\n",
    "\n",
    "xs = array([1, 2, 3])\n",
    "# ys = xs[:-1] # view\n",
    "# ys = xs[[0, 2]] # copy\n",
    "# ys = xs[[True, False, True]] # copy\n",
    "\n",
    "# xs = [1, 2, 3]\n",
    "# ys = xs (but when we read it, just stop at one before the end position)\n",
    "\n",
    "xs[0] = -99\n",
    "\n",
    "print(xs)\n",
    "print(ys)\n",
    "print(f'{shares_memory(xs, ys) = }')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c91dedc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "df = DataFrame({'A': [1, 1, 1, 2, 2], 'B': [*'vwxyz']})\n",
    "\n",
    "subset = df.loc[:3] # view of the underlying data\n",
    "\n",
    "# pandas is not sure if you excpected a view or copy\n",
    "subset['A'] = 10 # parent dataframe to also be updated\n",
    "\n",
    "print(subset)\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34b20a2a",
   "metadata": {},
   "source": [
    "## “Feeling” Your Data\n",
    "\n",
    "sample\n",
    "info\n",
    "describe\n",
    "sort_values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc2ff4f1",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "from itertools import chain, repeat\n",
    "from pandas import DataFrame, date_range\n",
    "\n",
    "df = DataFrame({\n",
    "    'group': [*chain(repeat('a', 5), repeat('b', 5))],\n",
    "    'date':  date_range('2000-01-01', periods=10),\n",
    "    'value': [*range(10)],\n",
    "}).apply(lambda s: s.repeat(10_000))\n",
    "\n",
    "print(\n",
    "    # df.head()\n",
    "    # df.sample(n=5, random_state=0)\n",
    "\n",
    "    # df.info(),\n",
    "    # df.describe(),\n",
    "    # df.sort_values('value', ascending=False)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a97539aa",
   "metadata": {},
   "source": [
    "## Aggregation\n",
    "\n",
    "```\n",
    "Aggregation name   | .agg interface   | method interface\n",
    "---------------------------------------------------\n",
    "Mean               | df.agg('mean')   | df.mean()\n",
    "Median             | df.agg('median') | df.median()\n",
    "Mode               | df.agg('mode')   | df.mode()\n",
    "Sum                | df.agg('sum')    | df.sum()\n",
    "Count              | df.agg('count')  | df.count()\n",
    "Max                | df.agg('max')    | df.max()\n",
    "Min                | df.agg('min')    | df.min()\n",
    "Standard Deviation | df.agg('std')    | df.std()\n",
    "Variance           | df.agg('var')    | df.var()\n",
    "Skewness           | df.agg('skew')   | df.skew()\n",
    "Kurtosis           | df.agg('kurt')   | df.kurt()\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "778cdb32",
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import chain, repeat\n",
    "from pandas import DataFrame, date_range\n",
    "\n",
    "df = DataFrame({\n",
    "    'group': [*chain(repeat('a', 5), repeat('b', 6))],\n",
    "    'date':  date_range('2000-01-01', periods=11),\n",
    "    'value': [*range(11)],\n",
    "})\n",
    "\n",
    "print(\n",
    "    # df['value'].std(),\n",
    "    # df['value'].agg(['mean', 'std'])\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61758bbb",
   "metadata": {},
   "source": [
    "## Hows The Weather?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "047c2d3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "\n",
    "buffer = StringIO('''\n",
    "      date  temperature  humidity  precipitation\n",
    "2020-01-01           73       215           True\n",
    "2020-01-02           77       217           True\n",
    "2020-01-03           74       218           True\n",
    "2020-01-04           NA       219          False\n",
    "2020-01-05           78       220          False\n",
    "2020-01-06           72       221           True\n",
    "2020-01-07           90       209           True\n",
    "2020-01-09           85       231           True\n",
    "2020-01-10           84       211           True\n",
    "'''.lstrip())\n",
    "\n",
    "\n",
    "from pandas import read_csv\n",
    "\n",
    "df = read_csv(\n",
    "    buffer, sep=r'\\s{2,}', engine='python',\n",
    "    parse_dates=['date'],\n",
    "    date_format={\n",
    "        'date': '%Y-%m-%d'\n",
    "    }\n",
    ")\n",
    "# print(df)\n",
    "# df.info()\n",
    "\n",
    "# What is the weather on January 2nd?\n",
    "# How many days did it rain? How many days did it not rain?\n",
    "# What days did it rain?\n",
    "# What days were hotter than the preceding day?\n",
    "# What was the hottest day when it rained?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f81e165",
   "metadata": {},
   "source": [
    "Further Reading\n",
    "- [Don’t Forget About the Index](https://www.dontusethiscode.com/blog/2023-07-05_pandas_index.html)\n",
    "- [Python Sets & the pandas Index](https://www.dontusethiscode.com/blog/2024-03-06_indexes_and_sets.html)\n",
    "- [Good pandas needs good Python](https://www.dontusethiscode.com/blog/2024-01-24_pandas_needs_python.html)\n",
    "\n",
    "# Preparing for Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a277d665",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91833e8a",
   "metadata": {},
   "source": [
    "Like many things in life, data work also follows the 80/20 rule. You spend…\n",
    "- 40% of your time cleaning data\n",
    "- 20% analyzing\n",
    "- another 40% cleaning it better than you did the first time.\n",
    "\n",
    "## Practice Problem\n",
    "\n",
    "who made this?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71573608",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "\n",
    "buffer = StringIO('''\n",
    "customer_id,date,amount,country\n",
    "001,2024/01/01,$100.50,US\n",
    "002,2024-01-05,$50.25,UK\n",
    ",2024-01-08,,$50.75\n",
    "004,2024-01-11,$75,Canada\n",
    "005,,,$0.00\n",
    "006,2024-01-14,$30.10,UK\n",
    "007,2024-01-16,$45.00,US\n",
    "008,2024-01-20,$60.75,\n",
    "''')\n",
    "\n",
    "from pandas import read_csv, to_datetime, to_numeric\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a098b93",
   "metadata": {},
   "source": [
    "comma separated, but not tabular"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6e6cbb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from io import StringIO\n",
    "from textwrap import dedent\n",
    "\n",
    "buffer = StringIO(dedent('''\n",
    "    device,upgrade_dates\n",
    "    device-1,2000-01-01,2000-02-01,2000-03-01\n",
    "    device-2,2000-01-01,2000-04-01\n",
    "    device-3,2000-01-01,2000-03-01,2000-05-01,2000-10-01\n",
    "    device-4,2000-01-01,2000-07-01,2000-09-01\n",
    "''').strip())\n",
    "\n",
    "# from pandas import read_csv, DataFrame\n",
    "# df = read_csv(buffer, on_bad_lines='skip')\n",
    "\n",
    "# ① Load with 2 columns, where upgrade_dates is a column of lists\n",
    "# ② Load with 2 columns, where upgrade_dates is \"exploded\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29ee9e4d",
   "metadata": {},
   "source": [
    "## Help! My Data Are Missing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4246eb89",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series, NA\n",
    "from numpy import nan\n",
    "\n",
    "# the default numpy backend ONLY represents NaNs as floating-point values\n",
    "s = Series([1, 2, None, 4, None, 6, None], dtype='float64') # numpy\n",
    "\n",
    "# pandas uses a separate array to track missing values\n",
    "# s = Series([1, 2, NA, 4, NA, 6, NA], dtype='Int64') # numpy[nullable]\n",
    "# s = Series([1, 2, NA, 4, NA, 6, NA], dtype='int64[pyarrow]') # pyarrow\n",
    "\n",
    "# print(s)\n",
    "\n",
    "# ⓪ nan removal\n",
    "# print(s.dropna())\n",
    "\n",
    "# ① scalar fill\n",
    "# print(s.fillna(-99))\n",
    "# print(s.fillna(s.mean()))\n",
    "\n",
    "# ② forward fill\n",
    "# print(s.ffill())\n",
    "\n",
    "# ③ backward fill\n",
    "# print(s)\n",
    "# print(s.bfill())\n",
    "\n",
    "# ④ interpolation\n",
    "# print(s.interpolate())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc02602f",
   "metadata": {},
   "source": [
    "## More Dtypes & Dtype Backends\n",
    "\n",
    "### NaN vs NA?!?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89883848",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, NA, Series\n",
    "from numpy import nan\n",
    "\n",
    "data = DataFrame({\n",
    "    'np':          [1, nan, 2],\n",
    "    'np_nullable': [0, nan, 2],\n",
    "}).astype({'np': 'float64', 'np_nullable': 'Int64'})\n",
    "\n",
    "# ⓪ NaN naturally occur with impossible algebra\n",
    "s = Series([0, 1, 2, 3])\n",
    "print(s / s)\n",
    "\n",
    "# s = data['np_nullable']\n",
    "# print(\n",
    "#     (s / s).astype('float64').dropna(),\n",
    "#     # (s / s).fillna(-99),\n",
    "#     sep='\\n'\n",
    "# )\n",
    "\n",
    "# print(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8256c51f",
   "metadata": {},
   "source": [
    "Lets dive into these features starting with a closer look at a couple of\n",
    "data types.\n",
    "\n",
    "**Datetimes**\n",
    "\n",
    "- Datetime\n",
    "- Timedelta\n",
    "- Period"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0c145af",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, date_range\n",
    "\n",
    "# 1970-01-01\n",
    "# 0\n",
    "# 1\n",
    "\n",
    "df = DataFrame({\n",
    "    'date': date_range('2000-01-01', periods=5, freq='min'),\n",
    "}).assign(\n",
    "    delta=lambda d: d['date'].diff(),\n",
    "    period=lambda d: d['date'].dt.to_period('Q')\n",
    ")\n",
    "\n",
    "# print(df)\n",
    "# print(df.dtypes)\n",
    "\n",
    "print(\n",
    "    # df['date'].dt.strftime('%Y & %m & %d')\n",
    "    DataFrame({\n",
    "        'day' :df['date'].dt.day,\n",
    "        'month' :df['date'].dt.month,\n",
    "        'year' :df['date'].dt.year,\n",
    "    })\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "538b6da8",
   "metadata": {},
   "source": [
    "- Datetimes → represents a single point in time\n",
    "- timedelta → represents the distance between 2 datetimes\n",
    "- Period    → represents a SPAN of time\n",
    "    - datetime to anchor it\n",
    "    - timedelta to span it\n",
    "\n",
    "*useful datetime index*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ad4e32d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series, date_range\n",
    "from numpy import arange\n",
    "\n",
    "s = Series(\n",
    "    index=date_range('2023-01-05', periods=100, freq='5h'),\n",
    "    data=arange(100),\n",
    "    name='signal 1',\n",
    ")\n",
    "\n",
    "print(\n",
    "    s.loc['2023-01-10 05']\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d29f8cb",
   "metadata": {},
   "source": [
    "*Resampling*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62940bc4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series, to_datetime, to_timedelta\n",
    "from numpy.random import default_rng\n",
    "\n",
    "rng = default_rng(0)\n",
    "\n",
    "s = Series(\n",
    "    100 * rng.normal(1, scale=.001, size=(size := 5_000)).cumprod(),\n",
    "    index=(\n",
    "        to_datetime('2023-02-15')\n",
    "        + to_timedelta(rng.integers(100, 1_000, size=size).cumsum(), unit='s')\n",
    "    )\n",
    ")\n",
    "\n",
    "\n",
    "print(\n",
    "    # s.resample('12h').mean(),\n",
    "    # s.resample('D').mean(),\n",
    "    # s.resample('ME').mean(),\n",
    "\n",
    "    # s.resample('MS').mean().to_period(),\n",
    "    sep='\\n',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "274fc023",
   "metadata": {},
   "source": [
    "**Categorical**\n",
    "\n",
    "Issue: strings take up more memory than integers do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2495e234",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series, DataFrame\n",
    "\n",
    "s = Series(['a', 'b', 'c', 'a', 'a', 'b']).repeat(10_000)\n",
    "# s.astype('string') # a promise into the future that pandas will better support this\n",
    "                     # right now, it is pretty much a wrapper for the object dtype\n",
    "\n",
    "# print(s.factorize())\n",
    "\n",
    "# df = DataFrame({\n",
    "#     'orig': s,\n",
    "#     'cat': s.astype('category')\n",
    "# })\n",
    "# print(df)\n",
    "# print(df.memory_usage(deep=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df3c4da0",
   "metadata": {},
   "source": [
    "## Dtype Backends\n",
    "\n",
    "A new(er) feature of pandas is to have different array backends for in-memory\n",
    "data storage. (pandas >= 2.0)\n",
    "\n",
    "NumPy vs Numpy[nullable] vs Pyarrow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7df16dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, concat\n",
    "\n",
    "df = DataFrame({\n",
    "    'int':   [1  , 2  , 3  ],\n",
    "    'float': [1.1, 2.2, 3.3],\n",
    "})\n",
    "\n",
    "orig = concat(\n",
    "    [\n",
    "        df.add_prefix('numpy_'),\n",
    "        *(\n",
    "            df.convert_dtypes(dtype_backend=backend)\n",
    "            .add_prefix(f'{backend}_')\n",
    "            for backend in ['numpy_nullable', 'pyarrow']\n",
    "        )\n",
    "    ],\n",
    "    axis=1\n",
    ")\n",
    "\n",
    "df = orig.copy()\n",
    "df.loc[0, 'numpy_int'] = None\n",
    "df.loc[0, 'numpy_nullable_int'] = None\n",
    "df.loc[0, 'pyarrow_int'] = None\n",
    "\n",
    "print(\n",
    "    orig,\n",
    "    df,\n",
    "    DataFrame({\n",
    "        'orig': orig.dtypes,\n",
    "        'new' : df.dtypes,\n",
    "    }),\n",
    "    sep='\\n{}\\n'.format('\\N{box drawings light horizontal}' * 80)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d10e9a27",
   "metadata": {},
   "source": [
    "**When Does This Really Make a Difference?**\n",
    "\n",
    "Working with strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93f1044d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from random import Random\n",
    "from string import ascii_lowercase\n",
    "from pandas import Series, DataFrame\n",
    "from _lib import timed\n",
    "from functools import partial\n",
    "\n",
    "timed = partial(timed, padding=30, fmt='.3f')\n",
    "\n",
    "rnd = Random(0)\n",
    "chars = [\n",
    "    ''.join(rnd.choices(\n",
    "        [*ascii_lowercase], k=rnd.randint(2, 10)\n",
    "    ))\n",
    "    for _ in range(1_000_000)\n",
    "]\n",
    "\n",
    "df = DataFrame({\n",
    "    'object': (s := Series(chars)),\n",
    "    'string':  s.astype('string'),\n",
    "    'pyarrow': s.astype('string[pyarrow]')\n",
    "})\n",
    "\n",
    "# ⓪ string operations speed by datatype\n",
    "# ① string operations memory by datatype"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7092f0bd",
   "metadata": {},
   "source": [
    "Further Reading\n",
    "- [Dealing with dates in pandas (part 1)](https://www.dontusethiscode.com/blog/2023-02-01_datetimes-pandas.html)\n",
    "- [Dealing with dates in pandas (part 2)](https://www.dontusethiscode.com/blog/2023-02-08_datetimes-pandas-2.html)\n",
    "- [Dealing with dates in pandas (part 3)](https://www.dontusethiscode.com/blog/2023-02-15_datetimes-pandas-3.html)\n",
    "- [Using a Categorical to Represent Tag Sets](https://www.dontusethiscode.com/blog/2024-06-19_enum-categoricals.html)\n",
    "- [pandas: Months, Days, and Categoricals](https://www.dontusethiscode.com/blog/2024-08-28_months_days_categoricals.html)\n",
    "- [pandas/Polars NaNs vs Null](https://www.dontusethiscode.com/blog/2025-01-08_null_vs_nan.html)\n",
    "- [More NaNs & Nulls in pandas/Polars](https://www.dontusethiscode.com/blog/2025-01-22_null_vs_nan_applied.html)\n",
    "\n",
    "# Group by"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68732360",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0337ba1d",
   "metadata": {},
   "source": [
    "*modalities: apply, agg, transform*\n",
    "but what about filter?\n",
    "\n",
    "```\n",
    "df: G×N\n",
    "   10 groups\n",
    "  200 rows\n",
    "```\n",
    "\n",
    "- .agg       → G rows (mean)\n",
    "- .transform → N rows (cumsum)\n",
    "    - equivalent to window functions in SQL\n",
    "- .apply     → ? rows (anything, very flexible)\n",
    "    - flexible, but can lead to slow performance\n",
    "\n",
    "```zsh\n",
    "# https://www.dontusethiscode.com/blog/2022-09-21_groupby-split-apply-combine.html\n",
    "eog images/groupby-apply.png\n",
    "# eog images/groupby-agg.png\n",
    "# eog images/groupby-transform-nonreduction.png\n",
    "# eog images/groupby-transform-reduction.png\n",
    "# eog images/groupby-agg-transform-summary.png\n",
    "```\n",
    "\n",
    "## The Syntax"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47190a0d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(...).agg(...)\n",
    "df.groupby(...).transform(...)\n",
    "df.groupby(...).apply(...)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3dd5ab90",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "df = DataFrame({\n",
    "    'groups': [*'aaabbbccc'],\n",
    "    'values': range(9),\n",
    "})\n",
    "\n",
    "print(\n",
    "    df,\n",
    "    # df.groupby('groups').mean(),\n",
    "    # df.groupby('groups').agg(lambda g: g.mean()),\n",
    "    # df.groupby('groups').agg('mean'),\n",
    "    # df.groupby('groups').agg(\n",
    "    #     avg=('values', 'mean'),\n",
    "    #     total=('values', 'sum')\n",
    "    # ),\n",
    "    # df.groupby('groups').agg(lambda g: g.head(2)),\n",
    "\n",
    "    # df.groupby('groups').transform('cumsum'),\n",
    "    # df.groupby('groups').transform('sum'),\n",
    "    # df.groupby('groups').transform(lambda g: g.head(2)),\n",
    "\n",
    "    # df.groupby('groups').apply('mean'),\n",
    "    # df.groupby('groups').apply(lambda g: g.head(2)),\n",
    "    sep='\\n{}\\n'.format('\\N{box drawings light horizontal}' * 100)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c32da6e",
   "metadata": {},
   "source": [
    "## Using the Correct Verb\n",
    "\n",
    "Grouped, weighted means"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30f3c550",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import unique\n",
    "from numpy.random import default_rng\n",
    "from pandas import DataFrame\n",
    "from _lib import timed\n",
    "from string import ascii_uppercase, ascii_lowercase\n",
    "\n",
    "rng = default_rng(0)\n",
    "categories = unique(\n",
    "    rng.choice([*ascii_uppercase], size=(10_000, length := 4), replace=True)\n",
    "    .view(f'<U{length}')\n",
    ")\n",
    "print(f'{categories.size} unique categories')\n",
    "\n",
    "df = DataFrame({\n",
    "    'groups' : categories.repeat(reps := 100),\n",
    "    'weights': rng.uniform(0, 2,   size=(reps * categories.size)),\n",
    "    'values' : rng.normal(100, 10, size=(reps * categories.size)),\n",
    "})\n",
    "\n",
    "# print(df['values'].mean())\n",
    "# print(df.groupby('groups')['values'].mean())\n",
    "\n",
    "# ⓪ use a for-loop\n",
    "# ① everything in groupby.apply\n",
    "# ② temp column, then groupby apply\n",
    "# ③ temp column, groupby agg (no user-defined function)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59b0d0ee",
   "metadata": {},
   "source": [
    "Further Reading\n",
    "- [Split → ??? → Combine](https://www.dontusethiscode.com/blog/2022-09-21_groupby-split-apply-combine.html)\n",
    "- [pandas & Polars: group by & window functions](https://www.dontusethiscode.com/blog/2024-06-05_dataframe-window.html)\n",
    "\n",
    "# Reshaping Data in pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68951583",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2941cb9",
   "metadata": {},
   "source": [
    "## Data Shapes\n",
    "\n",
    "Wide vs Long"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc4b2759",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, NA\n",
    "\n",
    "wide_df = DataFrame({\n",
    "    'product': [*'ABCD'],\n",
    "    'jan'    : [NA, 11, 12, 13],\n",
    "    'feb'    : [14, 15, 16, 17],\n",
    "    'mar'    : [18, 19, NA, NA],\n",
    "}).convert_dtypes('numpy_nullable')\n",
    "\n",
    "long_df = (\n",
    "    DataFrame({\n",
    "        'product': [*'BCDABCDAB'],\n",
    "        'month'  : [*'jjjffffmm'],\n",
    "        'sales'  : [*range(11, 11+9)],\n",
    "    })\n",
    "    .assign( # replace 'j' → 'jan', etc.\n",
    "        month=lambda d:\n",
    "            d['month'].map({'j': 'jan', 'f': 'feb', 'm': 'mar'})\n",
    "    )\n",
    ")\n",
    "\n",
    "print(\n",
    "    wide_df,\n",
    "    long_df,\n",
    "    sep='\\n\\n',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "163f73cd",
   "metadata": {},
   "source": [
    "## Standard Reshaping Functions\n",
    "\n",
    "- melt        → wide to long\n",
    "- pivot       → long to wide\n",
    "- pivot_table → long to wide w/ agg\n",
    "- stack       → wide to long\n",
    "- unstack     → long to wide\n",
    "\n",
    "**wide → long**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac77de25",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, NA\n",
    "\n",
    "wide_df = DataFrame({\n",
    "    'product': [*'ABCD'],\n",
    "    'jan'    : [NA, 11, 12, 13],\n",
    "    'feb'    : [14, 15, 16, 17],\n",
    "    'mar'    : [18, 19, NA, NA],\n",
    "}).convert_dtypes(dtype_backend='pyarrow')\n",
    "\n",
    "# ① melt\n",
    "# ② stack\n",
    "from pandas import IndexSlice\n",
    "print(\n",
    "    wide_df.set_index('product').stack()\n",
    "    # .loc[['A', 'B'], 'mar']\n",
    "    # .loc[[('A', 'feb'), ('B', 'mar')]]\n",
    "    # .loc[IndexSlice[:, 'mar']]\n",
    "    # .loc[slice(None), 'mar']\n",
    "    # .xs('mar', level=1)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3462426c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, NA\n",
    "\n",
    "wide_df = DataFrame({\n",
    "    'product': [*'ABCD'],\n",
    "    'jan_sales'  : [NA, 11, 12, 13],\n",
    "    'jan_units'  : [NA,  1,  2,  3],\n",
    "    'jan_clicks' : [NA, 71, 72, 73],\n",
    "    'feb_sales'  : [14, 15, 16, 17],\n",
    "    'feb_units'  : [ 4,  5,  6,  7],\n",
    "    'feb_clicks' : [74, 75,  6, 77],\n",
    "    'mar_sales'  : [18, 19, NA, NA],\n",
    "    'mar_units'  : [8 ,  9, NA, NA],\n",
    "    'mar_clicks' : [78, 79, NA, NA],\n",
    "})\n",
    "print(wide_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd70028b",
   "metadata": {},
   "source": [
    "**long → wide**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a84945ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "long_df = (\n",
    "    DataFrame({\n",
    "        'product': [*'BCDABCDAB'],\n",
    "        'month'  : [*'jjjffffmm'],\n",
    "        'sales'  : [*range(11, 11+9)],\n",
    "    })\n",
    "    .assign( # replace 'j' → 'jan', etc.\n",
    "        month=lambda d:\n",
    "            d['month'].map({'j': 'jan', 'f': 'feb', 'm': 'mar'})\n",
    "    )\n",
    ")\n",
    "\n",
    "# print(long_df)\n",
    "\n",
    "# ① pivot\n",
    "# ② unstack"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91df64f6",
   "metadata": {},
   "source": [
    "## Working with multiple DataFrames\n",
    "\n",
    "- `pandas.merge` joins two DataFrames horizontally on a specific column (or optionally, the index).\n",
    "- `pandas.join` joins two DataFrames horizontally on their indexes.\n",
    "- `pandas.concat` joins many DataFrames either horizontally or vertically according to their row/column index.\n",
    "\n",
    "### merge & join"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14ae69c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "\n",
    "df_left = DataFrame({\n",
    "    'value': [ 1 ,  2 ,  3 ,  4            ],\n",
    "    'group': ['a', 'b', 'c', 'd'           ],\n",
    "})\n",
    "df_right = DataFrame({\n",
    "    'group': [          'c', 'd', 'e', 'f' ],\n",
    "    'value': [ -3,  -4,  -5,  -6 ],\n",
    "})\n",
    "\n",
    "# ① merge vs join\n",
    "#  → join only uses the index of both tables\n",
    "#  → merge can either use columns or optionally the index of either table\n",
    "\n",
    "# ② left, right, inner, outer, cross\n",
    "print(\n",
    "    df_left.merge(df_right, on='group', how='inner', indicator=True),\n",
    "    # df_left.merge(df_right, on='group', how='outer', indicator=True),\n",
    "    # df_left.merge(df_right, on='group', how='left', indicator=True),\n",
    "    # df_left.merge(df_right, on='group', how='right', indicator=True),\n",
    "    sep='\\n\\n'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a141d9aa",
   "metadata": {},
   "source": [
    "### Align → Work\n",
    "\n",
    "We have two signals: one is an event-based signal while the other is a polling\n",
    "signal. The difference here is that the event signal records the timestamp when\n",
    "an event occurred, whereas the polling signal gives us back the result at the\n",
    "time of check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6fb4a1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame, Timedelta\n",
    "\n",
    "state_polling_df = DataFrame(\n",
    "    columns=[      'timestamp', 'device_id',  'state'],\n",
    "    data=[\n",
    "        ['2000-01-01 04:00:00',       'abc', 'state1'],\n",
    "        ['2000-01-01 04:30:00',       'abc', 'state1'],\n",
    "        ['2000-01-01 05:00:00',       'abc', 'state1'],\n",
    "        ['2000-01-01 05:30:00',       'abc', 'state3'],\n",
    "        ['2000-01-01 06:00:00',       'abc', 'state3'],\n",
    "        ['2000-01-01 06:30:00',       'abc', 'state2'],\n",
    "        ['2000-01-01 07:00:00',       'abc', 'state2'],\n",
    "        ['2000-01-01 07:30:00',       'abc', 'state1'],\n",
    "\n",
    "        ['2000-01-01 04:00:00',       'def', 'state2'],\n",
    "        ['2000-01-01 04:30:00',       'def', 'state1'],\n",
    "        ['2000-01-01 05:00:00',       'def', 'state3'],\n",
    "        ['2000-01-01 05:30:00',       'def', 'state3'],\n",
    "        ['2000-01-01 06:00:00',       'def', 'state1'],\n",
    "        ['2000-01-01 06:30:00',       'def', 'state1'],\n",
    "    ]\n",
    ").astype({'timestamp': 'datetime64[ns]'})\n",
    "\n",
    "alert_events = DataFrame(\n",
    "    columns=[      'timestamp', 'device_id'],\n",
    "    data=[\n",
    "        ['2000-01-01 03:15:00',       'abc'],\n",
    "        ['2000-01-01 04:05:00',       'abc'],\n",
    "        ['2000-01-01 04:17:00',       'abc'],\n",
    "        ['2000-01-01 04:44:00',       'abc'],\n",
    "        ['2000-01-01 05:10:00',       'abc'],\n",
    "        ['2000-01-01 05:23:00',       'abc'],\n",
    "        ['2000-01-01 05:43:00',       'abc'],\n",
    "        ['2000-01-01 05:53:00',       'abc'],\n",
    "        ['2000-01-01 06:02:00',       'abc'],\n",
    "        ['2000-01-01 06:08:00',       'abc'],\n",
    "        ['2000-01-01 06:10:00',       'abc'],\n",
    "        ['2000-01-01 06:23:00',       'abc'],\n",
    "        ['2000-01-01 06:51:00',       'abc'],\n",
    "\n",
    "        ['2000-01-01 03:05:00',       'def'],\n",
    "        ['2000-01-01 04:15:00',       'def'],\n",
    "        ['2000-01-01 04:27:00',       'def'],\n",
    "        ['2000-01-01 04:34:00',       'def'],\n",
    "        ['2000-01-01 05:20:00',       'def'],\n",
    "        ['2000-01-01 05:33:00',       'def'],\n",
    "        ['2000-01-01 06:22:00',       'def'],\n",
    "        ['2000-01-01 06:29:00',       'def'],\n",
    "        ['2000-01-01 06:43:00',       'def'],\n",
    "        ['2000-01-01 07:01:00',       'def'],\n",
    "    ]\n",
    ").astype({'timestamp': 'datetime64[ns]'})\n",
    "\n",
    "# How many alert events occurred within each device & state?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc82d085",
   "metadata": {},
   "source": [
    "### concat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0bdd99b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import arange\n",
    "import pandas as pd\n",
    "\n",
    "data = arange(0, 18).reshape(3, 3, 2)\n",
    "\n",
    "# take note where these indexes and columns overlap and where they do not\n",
    "dfs = {\n",
    "    'df1': pd.DataFrame(data[0], index=['a', 'b', 'c'          ], columns=['x', 'y'     ]),\n",
    "    'df2': pd.DataFrame(data[1], index=[     'b', 'c', 'd'     ], columns=['x',      'z']),\n",
    "    'df3': pd.DataFrame(data[2], index=[     'b',      'd', 'e'], columns=['x',      'z']),\n",
    "}\n",
    "\n",
    "# for k, df in dfs.items():\n",
    "#     print(k)\n",
    "#     print(df)\n",
    "#     print()\n",
    "\n",
    "# concat across axes\n",
    "print(\n",
    "    # pd.concat(dfs, axis='columns', join='inner')\n",
    "    # pd.concat(dfs, axis='columns', join='outer')\n",
    "\n",
    "    # pd.concat(dfs, axis='rows')\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62beb83d",
   "metadata": {},
   "source": [
    "#### pandas.concat Performance Considerations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ca97016",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series, DataFrame, concat\n",
    "\n",
    "# def append(self, other):\n",
    "# \treturn concat([self, other])\n",
    "\n",
    "# Series.append = append\n",
    "# DataFrame.append = append\n",
    "\n",
    "# s = Series([1, 2, 3])\n",
    "# print(\n",
    "# \ts.append(Series([4, 5, 6]))\n",
    "# )\n",
    "\n",
    "# df = DataFrame()\n",
    "# for file in ['file1.csv', 'file2.csv', 'file3.csv']:\n",
    "#     # df.append(\n",
    "#     #     read_csv(file)\n",
    "#     # )\n",
    "#     df = concat([df, read_csv(file)])\n",
    "\n",
    "parts = {}\n",
    "for file in ['file1.csv', 'file2.csv', 'file3.csv']:\n",
    "    parts[file] = read_csv(file)\n",
    "concat(parts, axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc88eb56",
   "metadata": {},
   "source": [
    "- list → designed to grow in size\n",
    "- pandas.Series/numpy.array → static size, needs reallocation to grow\n",
    "- pandas.DataFrame → doesn’t want to change the number of rows, but adding columns is fast.\n",
    "\n",
    "Further Reading\n",
    "- [Wide & Long Data Formats in pandas](https://www.dontusethiscode.com/blog/2024-11-20_pandas_reshaping.html)\n",
    "- [The pandas reshaping functions you didn’t know about](https://www.dontusethiscode.com/blog/2024-12-04_pandas_reshaping2.html)\n",
    "- [pandas Equality Joins](https://www.dontusethiscode.com/blog/2024-04-17_joins.html)\n",
    "- [DataFrame Joins and Multisets](https://www.dontusethiscode.com/blog/2024-04-24_joins_equality_dup.html)\n",
    "- [pandas concat & index alignment](https://www.dontusethiscode.com/blog/2024-07-24_pandas-concat.html)\n",
    "- [DataFrame inequality & asof joins](https://www.dontusethiscode.com/blog/2024-07-10_ineq_joins.html)\n",
    "\n",
    "# Concepts Review\n",
    "\n",
    "**Why use pandas**\n",
    "- Restricted Computation Domain\n",
    "- Index alignment happens EVERYWHERE.\n",
    "\n",
    "**Simple analysis**\n",
    "- Loading data from files & in-memory\n",
    "- indexing/slicing/filtering\n",
    "- simple data exploration\n",
    "    - aggregations\n",
    "- simple plots\n",
    "\n",
    "**Data Cleaning**\n",
    "- column name normalization\n",
    "- mixed types: casting to specific dtypes\n",
    "- missing data\n",
    "- duplicates\n",
    "- unnecessary columns/rows\n",
    "- entity normalization\n",
    "\n",
    "**Grouped Operations**\n",
    "- apply, agg, transform\n",
    "    - use specific verbs (agg/transform > apply)\n",
    "    - avoid 1 function call per group\n",
    "\n",
    "**Reshaping Data**\n",
    "- wide vs long\n",
    "- melt         → wide to long\n",
    "- pivot        → long to wide\n",
    "- pivot_table  → long to wide w/ agg\n",
    "- stack        → wide to long (via the index)\n",
    "- unstack      → long to wide (via the index)\n",
    "- lreshape     → generic wide to long\n",
    "- wide_to_long → patterned widt to long\n",
    "\n",
    "**Working with multiple DataFrames**\n",
    "- join\n",
    "- join_asof\n",
    "- merge\n",
    "- concat\n",
    "- inequality join?\n",
    "\n",
    "## Standard Reshaping Functions\n",
    "\n",
    "- melt        → wide to long\n",
    "- pivot       → long to wide\n",
    "- pivot_table → long to wide w/ agg\n",
    "- stack       → wide to long\n",
    "- unstack     → long to wide\n",
    "\n",
    "## Convenience Reshaping Functions\n",
    "- lreshape\n",
    "- wide_to_long\n",
    "\n",
    "## Working with multiple DataFrames\n",
    "- merge\n",
    "- join\n",
    "- join_asof\n",
    "- concat\n",
    "- inequality join?\n",
    "\n",
    "# Hard(er) Problems"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd166ee6",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Let's Get Started!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f361c818",
   "metadata": {},
   "source": [
    "## Scrabble Anyone?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "844dfe5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "from pandas import DataFrame, Series\n",
    "from string import ascii_lowercase\n",
    "\n",
    "words = ['hello', 'world', 'test', 'python', 'think']\n",
    "tiles = [*'pythnoik']\n",
    "\n",
    "# What words can I play?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6e17225",
   "metadata": {},
   "source": [
    "## Watch your transactions!\n",
    "\n",
    "You have your own credit card transactions from the past 3 years and want to\n",
    "analyze these data to investigate your own spending habits.\n",
    "\n",
    "The only caveat is that we have 3 different credit cards, and each creditor\n",
    "has a different structure for their transactional reports.\n",
    "\n",
    "### Data Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "253d7c0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.random import default_rng\n",
    "from pandas import date_range, DataFrame, Timestamp, to_timedelta\n",
    "\n",
    "category_map = {\n",
    "    \"Dining\": [\n",
    "      \"The Coffee Spot\", \"Quick Bites Grill\", \"JavaPoint Café\"\n",
    "    ],\n",
    "    \"Grocery\": [\n",
    "      \"Grocery Depot\", \"Green Earth Grocers\", \"CityMart\"\n",
    "    ],\n",
    "    \"Electronics\": [\"Tech Haven\", \"Smart Electronics\"],\n",
    "    \"Travel\": [\"Luxe Hotels\", \"TravelEase Agency\"],\n",
    "    \"Shopping\": [\n",
    "      \"Urban Outfitters\", \"Fashion Forward\", \"HighEnd Fashions\", \"Home Essentials\"\n",
    "    ],\n",
    "    \"Fitness\": [\n",
    "      \"Fitness World\", \"Mountain Gear\"\n",
    "    ],\n",
    "    \"Books\": [\"Global Books\"],\n",
    "    \"Gas\": [\"Speedy Gas\"]\n",
    "}\n",
    "\n",
    "average_transactions = {\n",
    "    \"The Coffee Spot\": 15,\n",
    "    \"Quick Bites Grill\": 20,\n",
    "    \"JavaPoint Café\": 12,\n",
    "    \"Grocery Depot\": 50,\n",
    "    \"Green Earth Grocers\": 45,\n",
    "    \"CityMart\": 60,\n",
    "    \"Tech Haven\": 500,\n",
    "    \"Smart Electronics\": 300,\n",
    "    \"Luxe Hotels\": 1200,\n",
    "    \"TravelEase Agency\": 800,\n",
    "    \"Urban Outfitters\": 100,\n",
    "    \"Fashion Forward\": 150,\n",
    "    \"HighEnd Fashions\": 200,\n",
    "    \"Home Essentials\": 80,\n",
    "    \"Fitness World\": 60,\n",
    "    \"Mountain Gear\": 120,\n",
    "    \"Global Books\": 25,\n",
    "    \"Speedy Gas\": 40\n",
    "}\n",
    "\n",
    "def gen_finix(rng):\n",
    "    transactions = []\n",
    "    for category, merchants in category_map.items():\n",
    "        for merchant in merchants:\n",
    "            avg_amount = average_transactions[merchant]\n",
    "            for _ in range(rng.integers(50, 100)):\n",
    "                transaction_date = Timestamp('2020-01-01') + to_timedelta(rng.integers(0, 365*3), unit='d')\n",
    "                transactions.append({\n",
    "                    \"store\": merchant,\n",
    "                    \"category\": category,\n",
    "                    \"date\": f\"{transaction_date:%m/%d/%Y}\",\n",
    "                    \"amount\": rng.normal(avg_amount, avg_amount * .2) * rng.choice([1, -1], p=[.99, .01]),\n",
    "                })\n",
    "    return DataFrame.from_records(transactions)\n",
    "\n",
    "def gen_credo(rng):\n",
    "    transactions = []\n",
    "    for category, merchants in category_map.items():\n",
    "        for merchant in merchants:\n",
    "            avg_amount = average_transactions[merchant]\n",
    "            for _ in range(rng.integers(50, 100)):\n",
    "                transaction_date = Timestamp('2020-01-01') + to_timedelta(rng.integers(0, 365*3), unit='d')\n",
    "                transaction_type = rng.choice(['purchase', 'refund'], p=[.98, .02])\n",
    "                transactions.append({\n",
    "                    \"Merchant\": merchant.replace(' ', ''),\n",
    "                    \"Category\": category.lower(),\n",
    "                    \"transaction_Date\": f\"{transaction_date:%Y-%m-%d}\",\n",
    "                    \"posted_Date\": f\"{transaction_date + to_timedelta(rng.integers(0, 2), unit='d'):%Y-%m-%d}\",\n",
    "                    transaction_type: rng.normal(avg_amount, avg_amount * .2),\n",
    "                })\n",
    "    return DataFrame.from_records(transactions)\n",
    "\n",
    "def gen_axpy(rng):\n",
    "    transactions = []\n",
    "    for category, merchants in category_map.items():\n",
    "        for merchant in merchants:\n",
    "            avg_amount = average_transactions[merchant]\n",
    "            for _ in range(rng.integers(80, 200)):\n",
    "                transaction_date = Timestamp('2020-01-01') + to_timedelta(rng.integers(0, 365*3), unit='d')\n",
    "                transaction_type = rng.choice(['purchase', 'refund'], p=[.98, .02])\n",
    "                transactions.append({\n",
    "                    \"Merchant\": merchant.replace(' ', ''),\n",
    "                    \"Category\": category,\n",
    "                    \"Date\": f\"{transaction_date:%Y-%m-%d}\",\n",
    "                    \"Type\": transaction_type,\n",
    "                    \"Amount\": rng.normal(avg_amount, avg_amount * .2),\n",
    "                })\n",
    "    return DataFrame.from_records(transactions)\n",
    "\n",
    "\n",
    "rng = default_rng(0)\n",
    "frames = {\n",
    "    'finix': gen_finix(rng),\n",
    "    'credo': gen_credo(rng),\n",
    "    'axpy':  gen_axpy(rng),\n",
    "}\n",
    "\n",
    "from pathlib import Path\n",
    "from pandas import to_datetime\n",
    "p = Path('data', 'transactions')\n",
    "p.mkdir(exist_ok=True)\n",
    "\n",
    "for name, df in frames.items():\n",
    "    if name == 'axpy':\n",
    "        dump_path = p / name\n",
    "        dump_path.mkdir(exist_ok=True)\n",
    "        for dt, group in df.groupby(df.pop('Date')):\n",
    "            group.to_csv(dump_path / f'{dt}.csv', index=False)\n",
    "    else:\n",
    "        df.to_csv(p / f'{name}.csv', index=False)\n",
    "    print(name, df.head(), sep='\\n', end='\\n{}\\n'.format('-' * 40))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1523095f",
   "metadata": {},
   "source": [
    "Running the above code results in a file structure like…\n",
    "\n",
    "```\n",
    "transactions/\n",
    "├── credo.csv\n",
    "├── finix.csv\n",
    "└── axpy/\n",
    "    ├── 2020-01-01.csv\n",
    "    ├── 2020-01-02.csv\n",
    "    ├── 2020-01-03.csv\n",
    "    ├── 2020-01-04.csv\n",
    "    ├── 2020-01-05.csv\n",
    "    ...\n",
    "```\n",
    "\n",
    "We want to determine...\n",
    "1. How much did we receive in refunds? How much did we spend?\n",
    "2. What is our favorite merchant to visit?\n",
    "3. What is our highest spending category across all cards?\n",
    "4. How much did we spend each month?\n",
    "    - What is our most expensive month?\n",
    "    - What is the most expensive month, averaging across years?\n",
    "5. What day did we make the most transactions?\n",
    "    - What day did we visit the most unique merchants?\n",
    "6. Were there any days that we did not make a transaction?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbdcea52",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "045a1feb",
   "metadata": {},
   "source": [
    "## Star Trader\n",
    "\n",
    "### Data Generation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f566177",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.random import default_rng\n",
    "from pandas import DataFrame, MultiIndex, date_range, Series, to_timedelta, IndexSlice, CategoricalIndex\n",
    "from pathlib import Path\n",
    "from sys import exit\n",
    "from scipy.stats import skewnorm\n",
    "import sys; sys.breakpointhook = exit\n",
    "\n",
    "data_dir = Path('data')\n",
    "data_dir.mkdir(exist_ok=True, parents=True)\n",
    "\n",
    "full_dates = date_range('2020-01-01', periods=180*2, freq='D')\n",
    "dates = date_range('2020-01-01', periods=180, freq='D')\n",
    "\n",
    "assets = '''\n",
    "    Equipment Medicine Metals Software StarGems Uranium\n",
    "'''.split()\n",
    "assets = CategoricalIndex(assets)\n",
    "\n",
    "stars = '''\n",
    "    Sol\n",
    "    Boyd Fate Gaol Hook Ivan Kirk Kris Quin\n",
    "    Reef Sand Sink Stan Task York\n",
    "'''.split()\n",
    "stars = CategoricalIndex(stars)\n",
    "\n",
    "players = '''\n",
    "    Alice Bob Charlie Dana\n",
    "'''.split()\n",
    "players = CategoricalIndex(players)\n",
    "\n",
    "rng = default_rng(0)\n",
    "\n",
    "inventory = (\n",
    "    Series(\n",
    "        index=(idx :=\n",
    "            MultiIndex.from_product([\n",
    "                players,\n",
    "                assets,\n",
    "            ], names='player asset'.split())\n",
    "        ),\n",
    "        data=rng.normal(loc=1, scale=.25, size=len(idx)),\n",
    "        name='volume',\n",
    "    ) * Series({\n",
    "        'Equipment': 1_000,\n",
    "        'Medicine':    500,\n",
    "        'Metals':    1_250,\n",
    "        'Software':    350,\n",
    "        'StarGems':      5,\n",
    "        'Uranium':      50,\n",
    "    }, name='volume').rename_axis(index='asset')\n",
    ").pipe(lambda s:\n",
    "    s\n",
    "        .sample(len(s) - 3, random_state=rng)\n",
    "        .sort_index()\n",
    ").pipe(lambda s:\n",
    "    s\n",
    "        .astype('int')\n",
    "        .reindex(idx)\n",
    "        .astype('Int64')\n",
    ")\n",
    "\n",
    "base_prices = Series({\n",
    "    'Equipment':    7,\n",
    "    'Medicine':    40,\n",
    "    'Metals':       3,\n",
    "    'Software':    20,\n",
    "    'StarGems': 1_000,\n",
    "    'Uranium':    500,\n",
    "}, name='price').rename_axis('asset')\n",
    "\n",
    "price_shifts = (\n",
    "    Series(\n",
    "        index=(idx :=\n",
    "            MultiIndex.from_product([\n",
    "                full_dates,\n",
    "                stars,\n",
    "                assets,\n",
    "            ], names='date star asset'.split())\n",
    "        ),\n",
    "        data=(\n",
    "            rng.normal(loc=1, scale=0.05, size=(len(stars), len(assets))).clip(0, 1.5)\n",
    "            *\n",
    "            rng.normal(loc=1, scale=0.02, size=(len(full_dates), len(stars), len(assets))).clip(0, 1.5).cumprod(axis=0)\n",
    "        ).ravel(),\n",
    "        name='price',\n",
    "    )\n",
    ")\n",
    "spreads = (\n",
    "    Series(\n",
    "        index=(idx :=\n",
    "            MultiIndex.from_product([\n",
    "                full_dates,\n",
    "                stars,\n",
    "                assets,\n",
    "            ], names='date star asset'.split())\n",
    "        ),\n",
    "        data=skewnorm(a=1, loc=.02, scale=.01).rvs(len(idx), random_state=rng).clip(-0.01, +.05),\n",
    "        name='price',\n",
    "    )\n",
    ")\n",
    "\n",
    "market = DataFrame({\n",
    "    'buy':  base_prices * price_shifts * (1 + spreads),\n",
    "    'sell': base_prices * price_shifts,\n",
    "}).rename_axis(columns='direction').pipe(\n",
    "    lambda df: df.set_axis(\n",
    "        df.columns.astype('category'),\n",
    "        axis='columns',\n",
    "    )\n",
    ")\n",
    "\n",
    "loc_ps = {\n",
    "    pl: (p := rng.integers(10, size=len(stars))) / p.sum()\n",
    "    for pl in players\n",
    "}\n",
    "locations = (\n",
    "    DataFrame(\n",
    "        index=(idx := dates),\n",
    "        data={\n",
    "            pl: rng.choice(stars, p=loc_ps[pl], size=len(idx))\n",
    "            for pl in players\n",
    "        },\n",
    "    )\n",
    "    .rename_axis(index='date', columns='player')\n",
    "    .pipe(lambda s:\n",
    "        s\n",
    "        .set_axis(\n",
    "            s.columns.astype(players.dtype),\n",
    "            axis='columns',\n",
    "        )\n",
    "        .astype(\n",
    "              stars.dtype,\n",
    "        )\n",
    "    )\n",
    "    .stack('player')\n",
    "    .rename('star')\n",
    "    .pipe(\n",
    "        lambda s: s\n",
    "            .sample(frac=.75, random_state=rng)\n",
    "            .reindex(s.index)\n",
    "            .groupby('player', observed=False).ffill()\n",
    "            .groupby('player', observed=False).bfill()\n",
    "            .sort_index()\n",
    "    )\n",
    ")\n",
    "\n",
    "trips = (\n",
    "    locations.groupby('player', group_keys=False, observed=False).apply(\n",
    "        lambda g: g[g != g.shift()]\n",
    "    ).sort_index()\n",
    ")\n",
    "\n",
    "standard_volumes = (10_000 / base_prices).round(0)\n",
    "\n",
    "trades = (\n",
    "    DataFrame(\n",
    "        index=(idx :=\n",
    "            MultiIndex.from_product([\n",
    "                dates,\n",
    "                players,\n",
    "                assets,\n",
    "                range(10),\n",
    "            ], names='date player asset trade#'.split())\n",
    "        ),\n",
    "        data={\n",
    "            'sentiment': rng.normal(loc=0, scale=.025, size=len(idx)),\n",
    "            'regret': rng.normal(loc=0, scale=.0005, size=len(idx)),\n",
    "            'edge': rng.normal(loc=1, scale=.001, size=len(idx)).clip(.75, 1.25),\n",
    "        },\n",
    "    )\n",
    "    .pipe(\n",
    "        lambda df: df\n",
    "            .assign(\n",
    "                buy=lambda df: (df.groupby(['player', 'asset'], observed=False)['sentiment'].rolling(3).mean() > 0).values,\n",
    "                sign=lambda df: df['buy'] * -1 + ~df['buy'],\n",
    "                direction=lambda df: df['buy'].map({True: 'buy', False: 'sell'}).astype(market.columns.dtype),\n",
    "                volume=lambda df: df['sign'] * rng.normal(loc=1, scale=.5, size=len(df)).clip(0, 2),\n",
    "            )\n",
    "            .assign(\n",
    "                star=lambda df:\n",
    "                    locations.loc[\n",
    "                        MultiIndex.from_arrays([\n",
    "                            df.index.get_level_values('date'),\n",
    "                            df.index.get_level_values('player'),\n",
    "                        ])\n",
    "                    ].values,\n",
    "                asset_price=lambda df: (\n",
    "                    market.stack('direction').loc[\n",
    "                        MultiIndex.from_arrays([\n",
    "                            df.index.get_level_values('date'),\n",
    "                            df['star'],\n",
    "                            df.index.get_level_values('asset'),\n",
    "                            df['direction'],\n",
    "                        ])\n",
    "                    ].values\n",
    "                ),\n",
    "                price=lambda df: (df['asset_price'] * df['edge']).round(4),\n",
    "                mark=lambda df: (df['price'] * (1 + df['regret'])).round(4),\n",
    "                volume=lambda df: (df['volume'] * standard_volumes).round(-1).astype(int),\n",
    "            )\n",
    "    )\n",
    "    .pipe(\n",
    "        lambda df: df\n",
    "            .loc[lambda df: df['volume'] != 0]\n",
    "            .sample(frac=.5, random_state=rng)\n",
    "            .sort_index()\n",
    "    )\n",
    ")\n",
    "\n",
    "data_dir = Path('data', 'startrader')\n",
    "data_dir.mkdir(exist_ok=True, parents=True)\n",
    "\n",
    "marks = trades['mark'].groupby(['date', 'player', 'asset'], observed=False).last()\n",
    "\n",
    "market.to_pickle(data_dir / 'market.pkl')\n",
    "trips.to_pickle(data_dir / 'trips.pkl')\n",
    "inventory.to_pickle(data_dir / 'inventory.pkl')\n",
    "locations.to_pickle(data_dir / 'locations.pkl')\n",
    "trades.loc[:, ['volume', 'star', 'price']].assign(\n",
    "    price=lambda df: df['price'].round(2),\n",
    ").to_pickle(data_dir / 'trades.pkl')\n",
    "\n",
    "print('done!')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ba042e7",
   "metadata": {},
   "source": [
    "### Inventory only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5dbe024c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import read_pickle\n",
    "inventory = read_pickle('data/startrader/inventory.pkl')\n",
    "\n",
    "# Who had the most Uranium? What about second most?\n",
    "# The most of each asset?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5b07af9",
   "metadata": {},
   "source": [
    "### Inventory & Market"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd550661",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import read_pickle\n",
    "inventory = read_pickle('data/startrader/inventory.pkl')\n",
    "market = read_pickle('data/startrader/market.pkl')\n",
    "\n",
    "print(inventory.head(), market.head(), sep='\\n' * 2)\n",
    "\n",
    "# what is everyone’s inventory worth on Sol?\n",
    "# what is everyone’s inventory worth on everywhere?\n",
    "# where is everyones inventory worth the most?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7713a210",
   "metadata": {},
   "source": [
    "### Inventory & Trades"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a68f1cd5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import read_pickle\n",
    "inventory = read_pickle('data/startrader/inventory.pkl')\n",
    "trades = read_pickle('data/startrader/trades.pkl')\n",
    "\n",
    "# what is maximum number of Star Gems each player had, and when were those maximums observed?\n",
    "# did Alice every have more Star Gems than Bob? If so, when?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70671018",
   "metadata": {},
   "source": [
    "### Trips"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b2738a5",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [],
   "source": [
    "from pandas import read_pickle\n",
    "from pandas import Grouper\n",
    "trips = read_pickle('data/startrader/trips.pkl')\n",
    "\n",
    "# Where was Alice at on 2024-01-02?\n",
    "print(\n",
    "    trips.head(8),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afee193d",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
No results found