Created
September 28, 2016 18:39
-
-
Save briandk/25908433b422b21058f6a0e7cf91d69b to your computer and use it in GitHub Desktop.
Day 2 of our Flint Water Quality Data Exercise
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# The Flint, MI water crisis - part II\n", | |
| "\n", | |
| "## Student names" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Type the names of everybody in your group here!" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Learning Goals (why are we asking you to do this?)\n", | |
| "\n", | |
| "As discussed in last class, there are two main reasons:\n", | |
| "\n", | |
| "1. Because data analysis is something that can and should be used for (among other things) improving local and federal government, and serving humanity.\n", | |
| "2. Because data analysis and visualization are two of the most important parts of modeling and understanding a system.\n", | |
| "\n", | |
| "In today's class, we'll be pursuing both of these objectives." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Today's activity\n", | |
| "\n", | |
| "We'll be looking at the [Flint Water Quality dataset](http://flintwaterstudy.org/), as we did in the last class. However, now we'll be looking at lead content data for the same houses, but sampled three times over almost a year. \n", | |
| "\n", | |
| "**Your goal for today is to answer this question:** Is the water quality for the residents of Flint, Michigan, getting better, getting worse, or not substantially changing?\n", | |
| "\n", | |
| "The dataset you have been provided is a much larger and richer dataset than you worked with in the last class session, and while it has data for fewer houses (162 total) it includes data for more elements as well as 3 bottles (immediately, after 45 seconds, after 2 minutes) for three different dates: August 2015, March 2016, and July 2016, for a total of 1458 records.\n", | |
| "\n", | |
| "You are deliberately being given very little in the way of concrete guidance for this project. Talk with your group members about how you want to do your analysis, and then create whatever plots, charts, or statistical analyses are necessary in order to answer the question. Use the rest of this notebook to show your work!\n", | |
| "\n", | |
| "**Note:** make sure that the Excel spreadsheet Flint_Lead_Kits_ICP_Data.xlsx is in the same directory as this notebook. You may also wish to look at the [\"pandas cheat sheet\"](pandas-cheat-sheet.ipynb) notebook and your pre-class assignment for inspiration, as well as the [\"10 minutes to Pandas\" tutorial](http://pandas.pydata.org/pandas-docs/stable/10min.html).\n", | |
| "\n", | |
| "The cell below is code that we wrote to read in the Flint water quality dataset - do not change it! Add your code in the cells below it.\n", | |
| "\n", | |
| "### [Click here for the Pandas cheat sheet](pandas-cheat-sheet.ipynb)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "# THIS CELL READS IN THE FLINT DATASET - DO NOT CHANGE ANYTHING!\n", | |
| "\n", | |
| "# Make plots inline\n", | |
| "%matplotlib inline\n", | |
| "\n", | |
| "# Make inline plots vector graphics instead of raster graphics\n", | |
| "from IPython.display import set_matplotlib_formats\n", | |
| "set_matplotlib_formats('pdf', 'svg')\n", | |
| "\n", | |
| "# import modules for plotting and data analysis\n", | |
| "import matplotlib.pyplot # as plt\n", | |
| "import pandas\n", | |
| "import numpy as np\n", | |
| "import functools\n", | |
| "\n", | |
| "def add_bottle_id_column(data, key_name):\n", | |
| " data['bottleID'] = np.repeat(key_name, data.shape[0])\n", | |
| " return data\n", | |
| "\n", | |
| "'''\n", | |
| "Loads the flint water quality dataset from the spreadsheet. \n", | |
| "This manipulation is necessary because (1) the data is in a spreadsheet\n", | |
| "rather than a CSV file or something else, and (2) the data is spread out\n", | |
| "across multiple sheets in the spreadsheet.\n", | |
| "'''\n", | |
| "def load_flint_water_data():\n", | |
| " flint_water_data = pandas.read_excel(\n", | |
| " # NOTE: uncomment the following line and comment out the one after that if\n", | |
| " # you have problems getting this to run on a Windows machine.\n", | |
| " # io = “https://github.com/ComputationalModeling/flint-water-data/raw/f6093bba145b1745b68bac2964b341fa30f3a08a/Flint%20Lead%20Kits%20ICP%20Data.xlsx”,\n", | |
| " io = \"Flint_Lead_Kits_ICP_Data.xlsx\",\n", | |
| " sheetname = [\n", | |
| " \"Sub_B1-8.15\",\n", | |
| " \"Sub_B2-8.15\",\n", | |
| " \"Sub_B3-8.15\",\n", | |
| " \"Sub_B1-3.16\",\n", | |
| " \"Sub_B2-3.16\",\n", | |
| " \"Sub_B3-3.16\",\n", | |
| " \"Sub_B1-7.16\",\n", | |
| " \"Sub_B2-7.16\",\n", | |
| " \"Sub_B3-7.16\"],\n", | |
| " header = 0,\n", | |
| " skiprows = 3,\n", | |
| " names = [\n", | |
| " \"Sample\",\n", | |
| " \"208Pb\",\n", | |
| " \"\",\n", | |
| " \"23Na\",\n", | |
| " \"25Mg\",\n", | |
| " \"27Al\",\n", | |
| " \"28Si\",\n", | |
| " \"31P\",\n", | |
| " \"PO4\",\n", | |
| " \"34S\",\n", | |
| " \"35Cl\",\n", | |
| " \"39K\",\n", | |
| " \"43Ca\",\n", | |
| " \"47Ti\",\n", | |
| " \"51V\",\n", | |
| " \"52Cr\",\n", | |
| " \"54Fe\",\n", | |
| " \"55Mn\",\n", | |
| " \"59Co\",\n", | |
| " \"60Ni\",\n", | |
| " \"65Cu\",\n", | |
| " \"66Zn\",\n", | |
| " \"75As\",\n", | |
| " \"78Se\",\n", | |
| " \"88Sr\",\n", | |
| " \"95Mo\",\n", | |
| " \"107Ag\",\n", | |
| " \"111Cd\",\n", | |
| " \"112Sn\",\n", | |
| " \"137Ba\",\n", | |
| " \"238U\"\n", | |
| " ] \n", | |
| " )\n", | |
| " data_with_id = [\n", | |
| " add_bottle_id_column(value, key)\n", | |
| " for key, value\n", | |
| " in flint_water_data.items()]\n", | |
| " # collapse dataframes into one long dataframe\n", | |
| " flint_water_data = functools.reduce(lambda x,y: x.append(y), data_with_id)\n", | |
| " return flint_water_data\n", | |
| "\n", | |
| "def add_date_and_bottle_number(flint_data): \n", | |
| " flint_data['bottle_number'] = flint_data['bottleID'].apply(lambda x: x.split('-')[0])\n", | |
| " flint_data['date_collected'] = flint_data['bottleID'].apply(lambda x: x.split('-')[1])\n", | |
| " return(flint_data)\n", | |
| "\n", | |
| "bottle_map = {\n", | |
| " 'Sub_B1': 'bottle1',\n", | |
| " 'Sub_B2': 'bottle2',\n", | |
| " 'Sub_B3': 'bottle3'\n", | |
| "}\n", | |
| "\n", | |
| "date_map = {\n", | |
| " '8.15': '2015-08-01',\n", | |
| " '3.16': '2016-03-01',\n", | |
| " '7.16': '2016-07-01'\n", | |
| "}\n", | |
| "\n", | |
| "\n", | |
| "flint_data = load_flint_water_data()\n", | |
| "flint_data = add_date_and_bottle_number(flint_data)\n", | |
| "flint_data = flint_data.replace(\n", | |
| " {'bottle_number': bottle_map,\n", | |
| " 'date_collected': date_map })\n", | |
| "flint_data['date_collected'] = pandas.DatetimeIndex(flint_data['date_collected'])\n", | |
| "flint_data = flint_data.drop('bottleID', axis = 1)\n", | |
| "\n", | |
| "# the end result is that you have a data frame called \"flint_data\"" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "What you should now have is a data frame called **flint_data**, which contains water quality data with many elements for three different dates (2015-08-01, 2016-03-01, and 2016-07-01), and with three different bottles per date (bottle1, bottle2, bottle3, corresponding to the three samples). You can see all of the columns by typing `flint_data.columns` - note that all quantities are in parts per billion, not mg/L! Note also that the sample numbers in the leftmost column may be misleading - this is a result of combining together several datasets. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "# Put all of your code, plots, and explanations here.\n", | |
| "# If necessary, insert additional cells below this one!\n", | |
| "\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Feedback (this is required)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "from IPython.display import HTML\n", | |
| "HTML(\n", | |
| "\"\"\"\n", | |
| "<iframe \n", | |
| "\tsrc=\"https://goo.gl/forms/DjojWCmPAwzlHrpQ2?embedded=true\" \n", | |
| "\twidth=\"80%\" \n", | |
| "\theight=\"1200px\" \n", | |
| "\tframeborder=\"0\" \n", | |
| "\tmarginheight=\"0\" \n", | |
| "\tmarginwidth=\"0\">\n", | |
| "\tLoading...\n", | |
| "</iframe>\n", | |
| "\"\"\"\n", | |
| ")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "collapsed": true | |
| }, | |
| "source": [ | |
| "## Now submit your notebook!\n", | |
| "\n", | |
| "Submit this notebook to the \"Day 5\" dropbox, and make sure to send a copy to everybody in your group." | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "anaconda-cloud": {}, | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.5.1" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 0 | |
| } |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment