viniciusmss/Matrices, Factors, and Data Frames.ipynb

## Matrices, Factors, and Data Frames.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## R Prep Minicourse\n",
    "### Week 9: Matrices, Factors, and Data Frames\n",
    "\n",
    "**Credits:** [Datacamp's Introduction to R Course](https://campus.datacamp.com/courses/free-introduction-to-r)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "#### Recap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "230"
      ]
     },
     "execution_count": 1,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Poker winnings from Monday to Friday\n",
    "poker_vector <- c(140, -50, 20, -120, 240)\n",
    "\n",
    "# Assign days as names of poker_vector\n",
    "names(poker_vector) <- c(\"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\")\n",
    "\n",
    "# Total winnings with poker\n",
    "total_poker <- sum(poker_vector)\n",
    "total_poker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<dl class=dl-horizontal>\n",
       "\t<dt>Monday</dt>\n",
       "\t\t<dd>140</dd>\n",
       "\t<dt>Wednesday</dt>\n",
       "\t\t<dd>20</dd>\n",
       "\t<dt>Friday</dt>\n",
       "\t\t<dd>240</dd>\n",
       "</dl>\n"
      ]
     },
     "execution_count": 2,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check which days you won money\n",
    "poker_wins <- poker_vector[poker_vector > 0]\n",
    "poker_wins"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "133.333333333333"
      ]
     },
     "execution_count": 3,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# How much do you earn on average in winning days?\n",
    "mean(poker_wins)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Matrices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.\n",
    "\n",
    "You can construct a matrix in R with the `matrix()` function. Consider the following example:\n",
    "\n",
    "`matrix(1:9, byrow = TRUE, nrow = 3)`\n",
    "\n",
    "In the `matrix()` function:\n",
    "\n",
    "- The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use `1:9` which is a shortcut for `c(1, 2, 3, 4, 5, 6, 7, 8, 9)`.\n",
    "- The argument `byrow` indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just place `byrow = FALSE`.\n",
    "- The third argument `nrow` indicates that the matrix should have three rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<tbody>\n",
       "\t<tr><td>1</td><td>2</td><td>3</td></tr>\n",
       "\t<tr><td>4</td><td>5</td><td>6</td></tr>\n",
       "\t<tr><td>7</td><td>8</td><td>9</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ]
     },
     "execution_count": 4,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "matrix(1:9, byrow = TRUE, nrow = 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Similar to vectors, you can add names for the rows and the columns of a matrix\n",
    "\n",
    "`rownames(my_matrix) <- row_names_vector`\n",
    "\n",
    "`colnames(my_matrix) <- col_names_vector`\n",
    "`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                           US non-US\n",
      "A New Hope              461.0  314.4\n",
      "The Empire Strikes Back 290.5  247.9\n",
      "Return of the Jedi      309.3  165.8\n"
     ]
    }
   ],
   "source": [
    "# Box office Star Wars (in millions!)\n",
    "new_hope <- c(461.0, 314.4)\n",
    "empire_strikes <- c(290.5, 247.9)\n",
    "return_jedi <- c(309.3, 165.8)\n",
    "\n",
    "# Construct matrix\n",
    "star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)\n",
    "\n",
    "# Vectors region and titles, used for naming\n",
    "region <- c(\"US\", \"non-US\")\n",
    "titles <- c(\"A New Hope\", \"The Empire Strikes Back\", \"Return of the Jedi\")\n",
    "\n",
    "# Name the columns with region\n",
    "colnames(star_wars_matrix) <- region\n",
    "\n",
    "# Name the rows with titles\n",
    "rownames(star_wars_matrix) <- titles\n",
    "\n",
    "# Print out star_wars_matrix\n",
    "print(star_wars_matrix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                           US non-US\n",
      "A New Hope              461.0  314.4\n",
      "The Empire Strikes Back 290.5  247.9\n",
      "Return of the Jedi      309.3  165.8\n"
     ]
    }
   ],
   "source": [
    "# This also works:\n",
    "\n",
    "box_office <- c(461.0, 314.4, 290.5, 247.9, 309.3, 165.8)\n",
    "star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,\n",
    "                           dimnames = list(c(\"A New Hope\", \"The Empire Strikes Back\", \"Return of the Jedi\"), c(\"US\", \"non-US\")))\n",
    "print(star_wars_matrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "In R, the function `rowSums()` conveniently calculates the totals for each row of a matrix. This function creates a new vector:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             A New Hope The Empire Strikes Back      Return of the Jedi \n",
      "                  775.4                   538.4                   475.1 \n"
     ]
    }
   ],
   "source": [
    "# Calculate worldwide box office figures\n",
    "worldwide_vector <- rowSums(star_wars_matrix)\n",
    "\n",
    "print(worldwide_vector)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "You can add a column or multiple columns to a matrix with the `cbind()` function, which merges matrices and/or vectors together by column. For example:\n",
    "\n",
    "`big_matrix <- cbind(matrix1, matrix2, vector1 ...)`\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                           US non-US Total\n",
      "A New Hope              461.0  314.4 775.4\n",
      "The Empire Strikes Back 290.5  247.9 538.4\n",
      "Return of the Jedi      309.3  165.8 475.1\n"
     ]
    }
   ],
   "source": [
    "# Bind the new variable worldwide_vector as a column to star_wars_matrix\n",
    "all_wars_matrix <- cbind(star_wars_matrix, Total = worldwide_vector)\n",
    "\n",
    "print(all_wars_matrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Whereas `cbind()` can paste columns together, `rbind()` does the same thing, but with rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                           US non-US\n",
      "A New Hope              461.0  314.4\n",
      "The Empire Strikes Back 290.5  247.9\n",
      "Return of the Jedi      309.3  165.8\n",
      "The Phantom Menace      474.5  552.5\n",
      "Attack of the Clones    310.7  338.7\n",
      "Revenge of the sith     380.3  468.5\n"
     ]
    }
   ],
   "source": [
    "box_office2 <- c(474.5, 552.5, 310.7, 338.7, 380.3, 468.5)\n",
    "star_wars_matrix2 <- matrix(box_office2, nrow = 3, byrow = TRUE,\n",
    "                           dimnames = list(c(\"The Phantom Menace\", \"Attack of the Clones\", \"Revenge of the sith\"), c(\"US\", \"non-US\")))\n",
    "\n",
    "# Combine both Star Wars trilogies in one matrix\n",
    "all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)\n",
    "\n",
    "print(all_wars_matrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Similarly, we also have a `colSums()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    US non-US \n",
      "2226.3 2087.8 \n"
     ]
    }
   ],
   "source": [
    "# Total revenue for US and non-US\n",
    "total_revenue_vector <- colSums(all_wars_matrix)\n",
    "\n",
    "print(total_revenue_vector)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Similar to vectors, you can use the square brackets `[  ]` to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. For example:\n",
    "\n",
    "- `my_matrix[1,2]` selects the element at the first row and second column.\n",
    "- `my_matrix[1:3,2:4]` results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.\n",
    "\n",
    "If you want to select all elements of a row or a column, no number is needed before or after the comma, respectively:\n",
    "\n",
    "- `my_matrix[,1]` selects all elements of the first column.\n",
    "- `my_matrix[1,]` selects all elements of the first row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] 347.9667\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] 375.75\n"
     ]
    }
   ],
   "source": [
    "# Select the non-US revenue for all movies\n",
    "non_us_all <- all_wars_matrix[,2]\n",
    "\n",
    "# Average non-US revenue\n",
    "print(mean(non_us_all))\n",
    "\n",
    "# Select the US revenue for first two movies\n",
    "us_some <- all_wars_matrix[1:2, 1]\n",
    "\n",
    "# Average non-US revenue for first two movies\n",
    "print(mean(us_some))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                        US non-US\n",
      "A New Hope              92     63\n",
      "The Empire Strikes Back 58     50\n",
      "Return of the Jedi      62     33\n",
      "The Phantom Menace      95    110\n",
      "Attack of the Clones    62     68\n",
      "Revenge of the sith     76     94\n"
     ]
    }
   ],
   "source": [
    "# You can also use arithmetic operators with matrices.\n",
    "# They will apply to every element of the matrix.\n",
    "\n",
    "# Assume the ticket price was $5. Estimate the visitors:\n",
    "visitors <- all_wars_matrix / 5\n",
    "\n",
    "# Print the estimate to the console\n",
    "print(visitors, digits=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                         US non-US\n",
      "A New Hope               92     63\n",
      "The Empire Strikes Back  48     41\n",
      "Return of the Jedi       44     24\n",
      "The Phantom Menace      119    138\n",
      "Attack of the Clones     69     75\n",
      "Revenge of the sith      78     96\n"
     ]
    }
   ],
   "source": [
    "# Or if you have more granular information on the ticket prices...\n",
    "ticket_prices_matrix <- matrix(c(5, 5, 6, 6, 7, 7, 4, 4, 4.5, 4.5, 4.9, 4.9), nrow = 6, byrow = TRUE)\n",
    "\n",
    "visitors2 <- all_wars_matrix / ticket_prices_matrix\n",
    "print(visitors2, digits=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "##### Summary:\n",
    "\n",
    "- `matrix(1:9, byrow = TRUE, nrow = 3)`\n",
    "- `colnames`, `rownames`\n",
    "- `colSums`, `rowSums`\n",
    "- `cbind`, `rbind`\n",
    "- `[row_index, col_index]`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Factors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.\n",
    "\n",
    "To create factors in R, you make use of the function `factor()`. First thing that you have to do is create a vector that contains all the observations that belong to a limited number of categories. The function `factor()` will encode the vector as a factor:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1] Male       Female     Female     Male       Non-Binary\n",
      "Levels: Female Male Non-Binary\n"
     ]
    }
   ],
   "source": [
    "# Gender vector\n",
    "gender_vector <- c(\"Male\", \"Female\", \"Female\", \"Male\", \"Non-Binary\")\n",
    "\n",
    "# Convert gender_vector to a factor\n",
    "factor_gender_vector <- factor(gender_vector)\n",
    "\n",
    "# Print out factor_gender_vector\n",
    "print(factor_gender_vector)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Sometimes, you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function `levels()`:\n",
    "\n",
    "`levels(factor_vector) <- c(\"name1\", \"name2\",...)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>Male</li>\n",
       "\t<li>Female</li>\n",
       "\t<li>Female</li>\n",
       "\t<li>Male</li>\n",
       "\t<li>Non-Binary</li>\n",
       "</ol>\n",
       "\n",
       "<details>\n",
       "\t<summary style=display:list-item;cursor:pointer>\n",
       "\t\t<strong>Levels</strong>:\n",
       "\t</summary>\n",
       "\t<ol class=list-inline>\n",
       "\t\t<li>'Female'</li>\n",
       "\t\t<li>'Male'</li>\n",
       "\t\t<li>'Non-Binary'</li>\n",
       "\t</ol>\n",
       "</details>"
      ]
     },
     "execution_count": 15,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Code to build factor_survey_vector\n",
    "survey_vector <- c(\"M\", \"F\", \"F\", \"M\", \"NB\")\n",
    "factor_survey_vector <- factor(survey_vector)\n",
    "\n",
    "# Specify the levels of factor_survey_vector\n",
    "levels(factor_survey_vector) <- c(\"Female\", \"Male\", \"Non-Binary\")\n",
    "\n",
    "factor_survey_vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "`summary()` gives you a quick overview of the contents of a variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   Length     Class      Mode \n",
       "        5 character character "
      ]
     },
     "execution_count": 16,
     "metadata": {
     },
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/html": [
       "<dl class=dl-horizontal>\n",
       "\t<dt>Female</dt>\n",
       "\t\t<dd>2</dd>\n",
       "\t<dt>Male</dt>\n",
       "\t\t<dd>2</dd>\n",
       "\t<dt>Non-Binary</dt>\n",
       "\t\t<dd>1</dd>\n",
       "</dl>\n"
      ]
     },
     "execution_count": 16,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate summary for survey_vector\n",
    "summary(survey_vector)\n",
    "\n",
    "# Generate summary for factor_survey_vector\n",
    "summary(factor_survey_vector)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "##### Summary:\n",
    "\n",
    "- `factor_vector <- factor(vector)`\n",
    "- `levels(factor_vector) <- c(\"name1\", \"name2\",...)`\n",
    "- `summary()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Data Frames\n",
    "\n",
    "A data frame has the variables of a data set as columns and the observations as rows. Contrary to matrices, data frames can hold elements of different types.\n",
    "\n",
    "Let's import one built-in data frame to use as an example. The function `data()` allows you to load a data set. `?x`, which is equivalent to `help(x)`, shows the documentation associated with the variable x (if it exists)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<table width=\"100%\" summary=\"page for mtcars {datasets}\"><tr><td>mtcars {datasets}</td><td style=\"text-align: right;\">R Documentation</td></tr></table>\n",
       "\n",
       "<h2>Motor Trend Car Road Tests</h2>\n",
       "\n",
       "<h3>Description</h3>\n",
       "\n",
       "<p>The data was extracted from the 1974 <em>Motor Trend</em> US magazine,\n",
       "and comprises fuel consumption and 10 aspects of\n",
       "automobile design and performance for 32 automobiles (1973&ndash;74\n",
       "models).\n",
       "</p>\n",
       "\n",
       "\n",
       "<h3>Usage</h3>\n",
       "\n",
       "<pre>mtcars</pre>\n",
       "\n",
       "\n",
       "<h3>Format</h3>\n",
       "\n",
       "<p>A data frame with 32 observations on 11 variables.\n",
       "</p>\n",
       "\n",
       "<table summary=\"Rd table\">\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [, 1] </td><td style=\"text-align: left;\"> mpg  </td><td style=\"text-align: left;\"> Miles/(US) gallon </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [, 2] </td><td style=\"text-align: left;\"> cyl  </td><td style=\"text-align: left;\"> Number of cylinders </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [, 3] </td><td style=\"text-align: left;\"> disp </td><td style=\"text-align: left;\"> Displacement (cu.in.) </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [, 4] </td><td style=\"text-align: left;\"> hp   </td><td style=\"text-align: left;\"> Gross horsepower </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [, 5] </td><td style=\"text-align: left;\"> drat </td><td style=\"text-align: left;\"> Rear axle ratio </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [, 6] </td><td style=\"text-align: left;\"> wt   </td><td style=\"text-align: left;\"> Weight (1000 lbs) </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [, 7] </td><td style=\"text-align: left;\"> qsec </td><td style=\"text-align: left;\"> 1/4 mile time </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [, 8] </td><td style=\"text-align: left;\"> vs   </td><td style=\"text-align: left;\"> V/S </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [, 9] </td><td style=\"text-align: left;\"> am   </td><td style=\"text-align: left;\"> Transmission (0 = automatic, 1 = manual) </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [,10] </td><td style=\"text-align: left;\"> gear </td><td style=\"text-align: left;\"> Number of forward gears </td>\n",
       "</tr>\n",
       "<tr>\n",
       " <td style=\"text-align: right;\">\n",
       "    [,11] </td><td style=\"text-align: left;\"> carb </td><td style=\"text-align: left;\"> Number of carburetors\n",
       "  </td>\n",
       "</tr>\n",
       "\n",
       "</table>\n",
       "\n",
       "\n",
       "\n",
       "<h3>Source</h3>\n",
       "\n",
       "<p>Henderson and Velleman (1981),\n",
       "Building multiple regression models interactively.\n",
       "<em>Biometrics</em>, <b>37</b>, 391&ndash;411.\n",
       "</p>\n",
       "\n",
       "\n",
       "<h3>Examples</h3>\n",
       "\n",
       "<pre>\n",
       "require(graphics)\n",
       "pairs(mtcars, main = \"mtcars data\")\n",
       "coplot(mpg ~ disp | as.factor(cyl), data = mtcars,\n",
       "       panel = panel.smooth, rows = 1)\n",
       "</pre>\n",
       "\n",
       "<hr /><div style=\"text-align: center;\">[Package <em>datasets</em> version 3.4.4 ]</div>"
      ]
     },
     "execution_count": 17,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data(mtcars)\n",
    "\n",
    "?mtcars"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "The function `head()` enables you to show the first observations of a data frame. Similarly, the function `tail()` prints out the last observations in your data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>mpg</th><th scope=col>cyl</th><th scope=col>disp</th><th scope=col>hp</th><th scope=col>drat</th><th scope=col>wt</th><th scope=col>qsec</th><th scope=col>vs</th><th scope=col>am</th><th scope=col>gear</th><th scope=col>carb</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>Mazda RX4</th><td>21.0 </td><td>6    </td><td>160  </td><td>110  </td><td>3.90 </td><td>2.620</td><td>16.46</td><td>0    </td><td>1    </td><td>4    </td><td>4    </td></tr>\n",
       "\t<tr><th scope=row>Mazda RX4 Wag</th><td>21.0 </td><td>6    </td><td>160  </td><td>110  </td><td>3.90 </td><td>2.875</td><td>17.02</td><td>0    </td><td>1    </td><td>4    </td><td>4    </td></tr>\n",
       "\t<tr><th scope=row>Datsun 710</th><td>22.8 </td><td>4    </td><td>108  </td><td> 93  </td><td>3.85 </td><td>2.320</td><td>18.61</td><td>1    </td><td>1    </td><td>4    </td><td>1    </td></tr>\n",
       "\t<tr><th scope=row>Hornet 4 Drive</th><td>21.4 </td><td>6    </td><td>258  </td><td>110  </td><td>3.08 </td><td>3.215</td><td>19.44</td><td>1    </td><td>0    </td><td>3    </td><td>1    </td></tr>\n",
       "\t<tr><th scope=row>Hornet Sportabout</th><td>18.7 </td><td>8    </td><td>360  </td><td>175  </td><td>3.15 </td><td>3.440</td><td>17.02</td><td>0    </td><td>0    </td><td>3    </td><td>2    </td></tr>\n",
       "\t<tr><th scope=row>Valiant</th><td>18.1 </td><td>6    </td><td>225  </td><td>105  </td><td>2.76 </td><td>3.460</td><td>20.22</td><td>1    </td><td>0    </td><td>3    </td><td>1    </td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ]
     },
     "execution_count": 18,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "head(mtcars)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Another method that is often used to get a rapid overview of your data is the function `str()`. The function `str()` shows you the structure of your data set. For a data frame, it tells you:\n",
    "\n",
    "- The total number of observations (e.g. 32 car types)\n",
    "- The total number of variables (e.g. 11 car features)\n",
    "- A full list of the variables names (e.g. `mpg`, `cyl`)\n",
    "- The data type of each variable (e.g. `num`)\n",
    "- The first observations\n",
    "\n",
    "\n",
    "\n",
    "Applying the `str()` function will often be the first thing that you do when receiving a new data set or data frame. It is a great way to get more insight in your data set before diving into the real analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'data.frame':\t32 obs. of  11 variables:\n",
      " $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...\n",
      " $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...\n",
      " $ disp: num  160 160 108 258 360 ...\n",
      " $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...\n",
      " $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...\n",
      " $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...\n",
      " $ qsec: num  16.5 17 18.6 19.4 17 ...\n",
      " $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...\n",
      " $ am  : num  1 1 1 0 0 0 0 0 0 0 ...\n",
      " $ gear: num  4 4 4 3 3 3 3 4 4 4 ...\n",
      " $ carb: num  4 4 1 1 2 1 4 2 2 4 ...\n"
     ]
    }
   ],
   "source": [
    "str(mtcars)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Now, let's imagine you want to construct a data frame that describes the main characteristics of eight planets in our solar system. According to your good friend Buzz, the main features of a planet are:\n",
    "\n",
    "- The type of planet (Terrestrial or Gas Giant).\n",
    "- The planet's diameter relative to the diameter of the Earth.\n",
    "- The planet's rotation across the sun relative to that of the Earth.\n",
    "- If the planet has rings or not (TRUE or FALSE).\n",
    "\n",
    "You construct a data frame with the `data.frame()` function. As arguments, you pass vectors: they will become the different columns of your data frame. Because every column has the same length, the vectors you pass should also have the same length. But don't forget that it is possible (and likely) that they contain different types of data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th scope=col>name</th><th scope=col>type</th><th scope=col>diameter</th><th scope=col>rotation</th><th scope=col>rings</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><td>Mercury           </td><td>Terrestrial planet</td><td> 0.382            </td><td>  58.64           </td><td>FALSE             </td></tr>\n",
       "\t<tr><td>Venus             </td><td>Terrestrial planet</td><td> 0.949            </td><td>-243.02           </td><td>FALSE             </td></tr>\n",
       "\t<tr><td>Earth             </td><td>Terrestrial planet</td><td> 1.000            </td><td>   1.00           </td><td>FALSE             </td></tr>\n",
       "\t<tr><td>Mars              </td><td>Terrestrial planet</td><td> 0.532            </td><td>   1.03           </td><td>FALSE             </td></tr>\n",
       "\t<tr><td>Jupiter           </td><td>Gas giant         </td><td>11.209            </td><td>   0.41           </td><td> TRUE             </td></tr>\n",
       "\t<tr><td>Saturn            </td><td>Gas giant         </td><td> 9.449            </td><td>   0.43           </td><td> TRUE             </td></tr>\n",
       "\t<tr><td>Uranus            </td><td>Gas giant         </td><td> 4.007            </td><td>  -0.72           </td><td> TRUE             </td></tr>\n",
       "\t<tr><td>Neptune           </td><td>Gas giant         </td><td> 3.883            </td><td>   0.67           </td><td> TRUE             </td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ]
     },
     "execution_count": 20,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Definition of vectors\n",
    "name <- c(\"Mercury\", \"Venus\", \"Earth\", \"Mars\", \"Jupiter\", \"Saturn\", \"Uranus\", \"Neptune\")\n",
    "type <- c(\"Terrestrial planet\", \"Terrestrial planet\", \"Terrestrial planet\", \"Terrestrial planet\", \"Gas giant\", \"Gas giant\", \"Gas giant\", \"Gas giant\")\n",
    "diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)\n",
    "rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)\n",
    "rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)\n",
    "\n",
    "# Create a data frame from the vectors\n",
    "planets_df <- data.frame(name, type, diameter, rotation, rings)\n",
    "planets_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'data.frame':\t8 obs. of  5 variables:\n",
      " $ name    : Factor w/ 8 levels \"Earth\",\"Jupiter\",..: 4 8 1 3 2 6 7 5\n",
      " $ type    : Factor w/ 2 levels \"Gas giant\",\"Terrestrial planet\": 2 2 2 2 1 1 1 1\n",
      " $ diameter: num  0.382 0.949 1 0.532 11.209 ...\n",
      " $ rotation: num  58.64 -243.02 1 1.03 0.41 ...\n",
      " $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...\n"
     ]
    }
   ],
   "source": [
    "str(planets_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Similar to vectors and matrices, you select elements from a data frame with the help of square brackets `[  ]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "0.382"
      ]
     },
     "execution_count": 22,
     "metadata": {
     },
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>name</th><th scope=col>type</th><th scope=col>diameter</th><th scope=col>rotation</th><th scope=col>rings</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>4</th><td>Mars              </td><td>Terrestrial planet</td><td>0.532             </td><td>1.03              </td><td>FALSE             </td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ]
     },
     "execution_count": 22,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print out diameter of Mercury (row 1, column 3)\n",
    "planets_df[1,3]\n",
    "\n",
    "# Print out data for Mars (entire fourth row)\n",
    "planets_df[4,]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>0.382</li>\n",
       "\t<li>0.949</li>\n",
       "\t<li>1</li>\n",
       "\t<li>0.532</li>\n",
       "\t<li>11.209</li>\n",
       "</ol>\n"
      ]
     },
     "execution_count": 23,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select first 5 values of diameter column\n",
    "planets_df[1:5, \"diameter\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "You will often want to select an entire column, namely one specific variable from a data frame. If your columns have names, you can use the `$` sign:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol class=list-inline>\n",
       "\t<li>FALSE</li>\n",
       "\t<li>FALSE</li>\n",
       "\t<li>FALSE</li>\n",
       "\t<li>FALSE</li>\n",
       "\t<li>TRUE</li>\n",
       "\t<li>TRUE</li>\n",
       "\t<li>TRUE</li>\n",
       "\t<li>TRUE</li>\n",
       "</ol>\n"
      ]
     },
     "execution_count": 24,
     "metadata": {
     },
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>name</th><th scope=col>type</th><th scope=col>diameter</th><th scope=col>rotation</th><th scope=col>rings</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>5</th><td>Jupiter  </td><td>Gas giant</td><td>11.209   </td><td> 0.41    </td><td>TRUE     </td></tr>\n",
       "\t<tr><th scope=row>6</th><td>Saturn   </td><td>Gas giant</td><td> 9.449   </td><td> 0.43    </td><td>TRUE     </td></tr>\n",
       "\t<tr><th scope=row>7</th><td>Uranus   </td><td>Gas giant</td><td> 4.007   </td><td>-0.72    </td><td>TRUE     </td></tr>\n",
       "\t<tr><th scope=row>8</th><td>Neptune  </td><td>Gas giant</td><td> 3.883   </td><td> 0.67    </td><td>TRUE     </td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ]
     },
     "execution_count": 24,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select the rings variable from planets_df\n",
    "rings_vector <- planets_df$rings\n",
    "  \n",
    "# Print out rings_vector\n",
    "rings_vector\n",
    "\n",
    "# What are the planetse with rings?\n",
    "planets_df[rings_vector, ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Now, let us move up one level and use the function `subset()`. You should see the `subset()` function as a short-cut to do exactly the same as what you did in the previous exercises.\n",
    "\n",
    "`subset(my_df, subset = some_condition)`\n",
    "\n",
    "The first argument of `subset()` specifies the data set for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>name</th><th scope=col>type</th><th scope=col>diameter</th><th scope=col>rotation</th><th scope=col>rings</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>5</th><td>Jupiter  </td><td>Gas giant</td><td>11.209   </td><td> 0.41    </td><td>TRUE     </td></tr>\n",
       "\t<tr><th scope=row>6</th><td>Saturn   </td><td>Gas giant</td><td> 9.449   </td><td> 0.43    </td><td>TRUE     </td></tr>\n",
       "\t<tr><th scope=row>7</th><td>Uranus   </td><td>Gas giant</td><td> 4.007   </td><td>-0.72    </td><td>TRUE     </td></tr>\n",
       "\t<tr><th scope=row>8</th><td>Neptune  </td><td>Gas giant</td><td> 3.883   </td><td> 0.67    </td><td>TRUE     </td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ]
     },
     "execution_count": 25,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "subset(planets_df, subset = rings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>name</th><th scope=col>type</th><th scope=col>diameter</th><th scope=col>rotation</th><th scope=col>rings</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>1</th><td>Mercury           </td><td>Terrestrial planet</td><td>0.382             </td><td>  58.64           </td><td>FALSE             </td></tr>\n",
       "\t<tr><th scope=row>2</th><td>Venus             </td><td>Terrestrial planet</td><td>0.949             </td><td>-243.02           </td><td>FALSE             </td></tr>\n",
       "\t<tr><th scope=row>4</th><td>Mars              </td><td>Terrestrial planet</td><td>0.532             </td><td>   1.03           </td><td>FALSE             </td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ]
     },
     "execution_count": 26,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select planets with diameter < 1\n",
    "subset(planets_df, diameter < 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "##### Summary:\n",
    "\n",
    "- `data()`, `?`, `help()`\n",
    "- `head()`, `tail()`, `str()`\n",
    "- `data.frame(vector1, vector2, vector3, ...)`\n",
    "- `df[row_index, col_index]`, `df[row_index, col_name]`, `df$col_name`\n",
    "- `subset()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "### Lists\n",
    "\n",
    "A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.\n",
    "\n",
    "To construct a list you use the function `list()`:\n",
    "\n",
    "`my_list <- list(comp1, comp2 ...)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
   ],
   "source": [
    "# Vector with numerics from 1 up to 10\n",
    "my_vector <- 1:10 \n",
    "\n",
    "# Matrix with numerics from 1 up to 9\n",
    "my_matrix <- matrix(1:9, ncol = 3)\n",
    "\n",
    "# First 10 elements of the built-in data frame mtcars\n",
    "my_df <- mtcars[1:10,]\n",
    "\n",
    "# Construct list with these different elements:\n",
    "my_list <- list(my_vector, my_matrix, my_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "It is helpful to give names to the components of your list. You can do it with either the `names()` function or in the following way:\n",
    "\n",
    "`my_list <- list(name1 = your_comp1, name2 = your_comp2)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
   ],
   "source": [
    "my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)\n",
    "\n",
    "# which is equivalent to...\n",
    "my_list <- list(my_vector, my_matrix, my_df)\n",
    "names(my_list) <- c(\"vec\", \"mat\", \"df\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "One way to select a component is using the numbered position of that component in double square brackets `[[  ]]`. For example, to \"grab\" the first component of `my_list` you type\n",
    "\n",
    "`my_list[[1]]`\n",
    "\n",
    "It is important to remember that to select elements from vectors, you use single square brackets `[  ]`, and from lists, double square brackets `[[  ]]`. Don't forget!\n",
    "\n",
    "You can also refer to the names of the components, with `[[  ]]` or with the `$` sign.\n",
    "\n",
    "`my_list[[\"df\"]]`\n",
    "\n",
    "`my_list$df`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<tbody>\n",
       "\t<tr><td>1</td><td>4</td><td>7</td></tr>\n",
       "\t<tr><td>2</td><td>5</td><td>8</td></tr>\n",
       "\t<tr><td>3</td><td>6</td><td>9</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ]
     },
     "execution_count": 29,
     "metadata": {
     },
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<thead><tr><th></th><th scope=col>mpg</th><th scope=col>cyl</th><th scope=col>disp</th><th scope=col>hp</th><th scope=col>drat</th><th scope=col>wt</th><th scope=col>qsec</th><th scope=col>vs</th><th scope=col>am</th><th scope=col>gear</th><th scope=col>carb</th></tr></thead>\n",
       "<tbody>\n",
       "\t<tr><th scope=row>Mazda RX4 Wag</th><td>21   </td><td>6    </td><td>160  </td><td>110  </td><td>3.9  </td><td>2.875</td><td>17.02</td><td>0    </td><td>1    </td><td>4    </td><td>4    </td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ]
     },
     "execution_count": 29,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print out the matrix\n",
    "my_list$mat\n",
    "\n",
    "# Print the second row of the data frame\n",
    "my_list$df[2,]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "To conveniently add elements to lists you can use the `c()` function:\n",
    "\n",
    "`ext_list <- c(my_list, my_name = new_value)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "TRUE"
      ]
     },
     "execution_count": 30,
     "metadata": {
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "my_new_list <- c(my_list, logical = TRUE)\n",
    "\n",
    "my_new_list[[4]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "##### Summary:\n",
    "\n",
    "- `my_list <- list(name1 = your_comp1, name2 = your_comp2)`\n",
    "- `my_list[[1]]`, `my_list[[\"name1\"]]`, `my_list$name1`\n",
    "- `ext_list <- c(my_list, my_name = new_value)`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Congratulations! At this point in the course you are already familiar with:\n",
    "\n",
    "- **Vectors** (one dimensional array): can hold numeric, character or logical values. The elements in a vector all have the same data type.\n",
    "- **Matrices** (two dimensional array): can hold numeric, character or logical values. The elements in a matrix all have the same data type.\n",
    "- **Lists** (one dimensional object): can hold numeric, character or logical values. The elements in a list can be of different data types.\n",
    "- **Data frames** (two dimensional object): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data types."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R (R-Project)",
   "language": "r",
   "name": "ir"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
No results found