Last active
April 3, 2021 12:53
| { | |
| "cells": [ | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "# **Exploratory Data Analysis Lab**\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Estimated time needed: **30** minutes\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "In this module you will work with the cleaned dataset from the previous module.\n\nIn this assignment you will perform exploratory data analysis.\nYou will examine the distribution of the data, check for outliers, and determine the correlation between columns in the dataset.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "## Objectives\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "In this lab you will perform the following:\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "- Identify the distribution of data in the dataset.\n\n- Identify outliers in the dataset.\n\n- Remove outliers from the dataset.\n\n- Identify correlation between features in the dataset.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "* * *\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "## Hands-on Lab\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Import pandas and the other libraries used in this lab.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "import pandas as pd\nimport seaborn as sns\nimport matplotlib.pyplot as plt\nimport numpy as np", | |
| "execution_count": 90, | |
| "outputs": [] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Load the dataset into a dataframe.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "df = pd.read_csv(\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m2_survey_data.csv\")", | |
| "execution_count": 91, | |
| "outputs": [] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "df.shape", | |
| "execution_count": 130, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 130, | |
| "data": { | |
| "text/plain": "(11398, 85)" | |
| }, | |
| "metadata": {} | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "## Distribution\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "### Determine how the data is distributed\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "The column `ConvertedComp` contains each respondent's salary converted to an annual USD figure using the exchange rate on 2019-02-01.\n\nThe conversion assumes 12 working months and 50 working weeks per year.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Plot the distribution curve for the column `ConvertedComp`.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\ndf.ConvertedComp.plot.density(color='green')\nplt.title('Visualization of Converted Compensation')\nplt.show()", | |
| "execution_count": 92, | |
| "outputs": [ | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": "<Figure size 432x288 with 1 Axes>", | |
| "image/png": "\n" | |
| }, | |
| "metadata": { | |
| "needs_background": "light" | |
| } | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Plot the histogram for the column `ConvertedComp`.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\ndf.hist(column='ConvertedComp', bins=5)", | |
| "execution_count": 93, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 93, | |
| "data": { | |
| "text/plain": "array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f2bd637ca90>]],\n dtype=object)" | |
| }, | |
| "metadata": {} | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": "<Figure size 432x288 with 1 Axes>", | |
| "image/png": "\n" | |
| }, | |
| "metadata": { | |
| "needs_background": "light" | |
| } | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "What is the median of the column `ConvertedComp`?\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\ndf['ConvertedComp'].median()", | |
| "execution_count": 94, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 94, | |
| "data": { | |
| "text/plain": "57745.0" | |
| }, | |
| "metadata": {} | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "\nHow many respondents identified themselves only as a **Man**?\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\ndf1=df.loc[df['Gender']=='Man']\ndf1['Gender'].count()", | |
| "execution_count": 95, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 95, | |
| "data": { | |
| "text/plain": "10480" | |
| }, | |
| "metadata": {} | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "df['Gender'].value_counts() # cross-check the count computed above", | |
| "execution_count": 96, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 96, | |
| "data": { | |
| "text/plain": "Man 10480\nWoman 731\nNon-binary, genderqueer, or gender non-conforming 63\nMan;Non-binary, genderqueer, or gender non-conforming 26\nWoman;Non-binary, genderqueer, or gender non-conforming 14\nWoman;Man 9\nWoman;Man;Non-binary, genderqueer, or gender non-conforming 2\nName: Gender, dtype: int64" | |
| }, | |
| "metadata": {} | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Find the median `ConvertedComp` of respondents who identified themselves only as a **Woman**.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
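| "source": "# your code goes here\n# Keep the two relevant columns, then filter to respondents who answered exactly 'Woman'\ndf2=df[['ConvertedComp','Gender']]\ndf3=df2.loc[df2['Gender']=='Woman']\n# numeric_only=True skips the non-numeric Gender column\ndf3.median(numeric_only=True)", | |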
| "execution_count": 97, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 97, | |
| "data": { | |
| "text/plain": "ConvertedComp 57708.0\ndtype: float64" | |
| }, | |
| "metadata": {} | |
| } | |
| ] | |
| }, | |
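| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "The per-group medians can also be computed in one pass with `groupby`. The cell below is a minimal sketch on a small made-up sample: the `sample` frame and its values are hypothetical and are not taken from the survey data. In this notebook the same call could be applied to `df`.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "import pandas as pd\n\n# Hypothetical sample, for illustration only (not survey data)\nsample = pd.DataFrame({\n    'Gender': ['Man', 'Woman', 'Man', 'Woman', 'Man'],\n    'ConvertedComp': [50000.0, 60000.0, 70000.0, 40000.0, float('nan')],\n})\n\n# Median of ConvertedComp for each Gender value; missing values are skipped\nmedians = sample.groupby('Gender')['ConvertedComp'].median()\nprint(medians)", | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |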
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Give the five-number summary for the column `Age`.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "**Double click here for hint**.\n\n<!--\nmin,q1,median,q3,max of a column are its five number summary.\n-->\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\n# age_-prefixed names avoid shadowing the built-in min() and max()\nage_min=df['Age'].min()\nage_max=df['Age'].max()\nage_median=df['Age'].median()\nage_q1=df['Age'].quantile(q=0.25)\nage_q3=df['Age'].quantile(q=0.75)\nprint('Age Distribution')\nprint('Minimum age: ', age_min)\nprint('Maximum age: ', age_max)\nprint()\nprint('Quartile 1 (25%)is: ', age_q1)\nprint('Quartile 3 (75%)is: ', age_q3)\nprint()\nprint('The median is: ', age_median)\ndf['Age'].describe() # cross-check against describe()", | |
| "execution_count": 98, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "text": "Age Distribution\nMinimum age: 16.0\nMaximum age: 99.0\n\nQuartile 1 (25%)is: 25.0\nQuartile 3 (75%)is: 35.0\n\nThe median is: 29.0\n", | |
| "name": "stdout" | |
| }, | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 98, | |
| "data": { | |
| "text/plain": "count 11111.000000\nmean 30.778895\nstd 7.393686\nmin 16.000000\n25% 25.000000\n50% 29.000000\n75% 35.000000\nmax 99.000000\nName: Age, dtype: float64" | |
| }, | |
| "metadata": {} | |
| } | |
| ] | |
| }, | |
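| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "The five values can also be obtained with a single `quantile` call. The cell below is a minimal sketch: `five_number_summary` is a hypothetical helper name, and the `ages` values are made up for illustration rather than taken from the survey data.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "import pandas as pd\n\ndef five_number_summary(series):\n    # min, Q1, median, Q3 and max from one quantile call\n    summary = series.quantile([0, 0.25, 0.5, 0.75, 1])\n    summary.index = ['min', 'q1', 'median', 'q3', 'max']\n    return summary\n\n# Hypothetical ages, for illustration only (not survey data)\nages = pd.Series([16, 25, 29, 35, 99])\nprint(five_number_summary(ages))", | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |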
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Plot a histogram of the column `Age`.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\ndf.hist(column='Age', figsize=(8,4))", | |
| "execution_count": 100, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 100, | |
| "data": { | |
| "text/plain": "array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f2bd6314d10>]],\n dtype=object)" | |
| }, | |
| "metadata": {} | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": "<Figure size 576x288 with 1 Axes>", | |
| "image/png": "\n" | |
| }, | |
| "metadata": { | |
| "needs_background": "light" | |
| } | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "df.boxplot(column='Age', figsize=(10,10))", | |
| "execution_count": 145, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 145, | |
| "data": { | |
| "text/plain": "<matplotlib.axes._subplots.AxesSubplot at 0x7f2bd639b150>" | |
| }, | |
| "metadata": {} | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": "<Figure size 720x720 with 1 Axes>", | |
| "image/png": "\n" | |
| }, | |
| "metadata": { | |
| "needs_background": "light" | |
| } | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "## Outliers\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "### Finding outliers\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Use a box plot to find out whether outliers exist in the column `ConvertedComp`.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\n# Print the minimum, selected quantiles, and the maximum before drawing the box plot\nquantile=df['ConvertedComp'].quantile([0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])\nprint(quantile)\ndf.boxplot(column='ConvertedComp', figsize=(8,10))", | |
| "execution_count": 108, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "text": "0.0 0.0\n0.2 20628.0\n0.3 32842.2\n0.4 45797.0\n0.5 57745.0\n0.6 70998.0\n0.7 88000.0\n0.8 113967.4\n0.9 170000.0\n1.0 2000000.0\nName: ConvertedComp, dtype: float64\n", | |
| "name": "stdout" | |
| }, | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 108, | |
| "data": { | |
| "text/plain": "<matplotlib.axes._subplots.AxesSubplot at 0x7f2bd735f1d0>" | |
| }, | |
| "metadata": {} | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "data": { | |
| "text/plain": "<Figure size 576x720 with 1 Axes>", | |
| "image/png": "\n" | |
| }, | |
| "metadata": { | |
| "needs_background": "light" | |
| } | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Find the interquartile range (IQR) for the column `ConvertedComp`.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\nq1=df['ConvertedComp'].quantile(q=0.25)\nq3=df['ConvertedComp'].quantile(q=0.75)\niqr=q3-q1\nprint('The Inter Quartile Range for the column ConvertedComp is: ', iqr)", | |
| "execution_count": 113, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "text": "The Inter Quartile Range for the column ConvertedComp is: 73132.0\n", | |
| "name": "stdout" | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Find out the upper and lower bounds, defined as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\n# Outlier bounds built from q1, q3 and iqr computed in the previous cell\nlower_bound=q1 - 1.5 * iqr\nupper_bound=q3 + 1.5 * iqr\n\nprint('The lower boundary is: ', lower_bound)\nprint('The upper boundary is: ', upper_bound)", | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
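| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "The IQR rule used in this section can be wrapped in a small reusable function. The cell below is a minimal sketch, assuming the common 1.5 multiplier: `tukey_fences` and its parameter `k` are hypothetical names, and the `values` series is made up for illustration rather than taken from the survey data.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "import pandas as pd\n\ndef tukey_fences(series, k=1.5):\n    # Lower and upper outlier bounds: Q1 - k*IQR and Q3 + k*IQR\n    first = series.quantile(0.25)\n    third = series.quantile(0.75)\n    spread = third - first\n    return first - k * spread, third + k * spread\n\n# Hypothetical values, for illustration only (not survey data)\nvalues = pd.Series([10, 12, 13, 15, 16, 18, 200])\nlower, upper = tukey_fences(values)\n\n# Count the values that fall outside the two bounds\nn_outliers = ((values < lower) | (values > upper)).sum()\nprint(lower, upper, n_outliers)", | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |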
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Identify how many outliers there are in the `ConvertedComp` column.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\n# Count values outside the bounds q1 - 1.5*IQR and q3 + 1.5*IQR\noutliers=((df['ConvertedComp'] < (q1 - 1.5 * iqr)) | (df['ConvertedComp'] > (q3 + 1.5 * iqr))).sum()\nprint(f'There are {outliers} outliers in the column ConvertedComp.')", | |
| "execution_count": 125, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "text": "There are 879 outliers in the column ConvertedComp.\n", | |
| "name": "stdout" | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Create a new dataframe by removing the outliers from the `ConvertedComp` column.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "# your code goes here\n# Drop rows above the upper bound; rows with a missing ConvertedComp are kept\ndel_convertedcomp=df[~(df['ConvertedComp'] > (q3 + 1.5 * iqr))]\nremoved=df.shape[0]- del_convertedcomp.shape[0]\noriginal_df=df.shape[0]\nnew_df=del_convertedcomp.shape[0]\nprint(removed, 'outliers have been removed from the original dataframe.')\nprint('The original dataframe contained ',original_df,' entries.')\nprint('The new dataframe contains ',new_df,' entries.')", | |
| "execution_count": 142, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "text": "879 outliers have been removed from the original dataframe.\nThe original dataframe contained 11398 entries.\nThe new dataframe contains 10519 entries.\n", | |
| "name": "stdout" | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "del_median=del_convertedcomp['ConvertedComp'].median()\nprint('The median after removing the outliers is: ', del_median)", | |
| "execution_count": 143, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "text": "The median after removing the outliers is: 52704.0\n", | |
| "name": "stdout" | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "del_mean=del_convertedcomp['ConvertedComp'].mean()\nprint('The mean after removing the outliers is: ', del_mean)", | |
| "execution_count": 144, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "text": "The mean after removing the outliers is: 59883.20838915799\n", | |
| "name": "stdout" | |
| } | |
| ] | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "## Correlation\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "### Finding correlation\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Find the correlation between `Age` and all other numerical columns.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
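| "source": "# your code goes here\n# Restrict to numeric columns so corr() also works on pandas versions that do not drop text columns automatically\ndf.select_dtypes(include='number').corr()", | |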
| "execution_count": 148, | |
| "outputs": [ | |
| { | |
| "output_type": "execute_result", | |
| "execution_count": 148, | |
| "data": { | |
| "text/plain": " Respondent CompTotal ConvertedComp WorkWeekHrs CodeRevHrs \\\nRespondent 1.000000 -0.013490 0.002181 -0.015314 0.004621 \nCompTotal -0.013490 1.000000 0.001037 0.003510 0.007063 \nConvertedComp 0.002181 0.001037 1.000000 0.021143 -0.033865 \nWorkWeekHrs -0.015314 0.003510 0.021143 1.000000 0.026517 \nCodeRevHrs 0.004621 0.007063 -0.033865 0.026517 1.000000 \nAge 0.004041 0.006970 0.105386 0.036518 -0.020469 \n\n Age \nRespondent 0.004041 \nCompTotal 0.006970 \nConvertedComp 0.105386 \nWorkWeekHrs 0.036518 \nCodeRevHrs -0.020469 \nAge 1.000000 ", | |
| "text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>CompTotal</th>\n <th>ConvertedComp</th>\n <th>WorkWeekHrs</th>\n <th>CodeRevHrs</th>\n <th>Age</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>Respondent</th>\n <td>1.000000</td>\n <td>-0.013490</td>\n <td>0.002181</td>\n <td>-0.015314</td>\n <td>0.004621</td>\n <td>0.004041</td>\n </tr>\n <tr>\n <th>CompTotal</th>\n <td>-0.013490</td>\n <td>1.000000</td>\n <td>0.001037</td>\n <td>0.003510</td>\n <td>0.007063</td>\n <td>0.006970</td>\n </tr>\n <tr>\n <th>ConvertedComp</th>\n <td>0.002181</td>\n <td>0.001037</td>\n <td>1.000000</td>\n <td>0.021143</td>\n <td>-0.033865</td>\n <td>0.105386</td>\n </tr>\n <tr>\n <th>WorkWeekHrs</th>\n <td>-0.015314</td>\n <td>0.003510</td>\n <td>0.021143</td>\n <td>1.000000</td>\n <td>0.026517</td>\n <td>0.036518</td>\n </tr>\n <tr>\n <th>CodeRevHrs</th>\n <td>0.004621</td>\n <td>0.007063</td>\n <td>-0.033865</td>\n <td>0.026517</td>\n <td>1.000000</td>\n <td>-0.020469</td>\n </tr>\n <tr>\n <th>Age</th>\n <td>0.004041</td>\n <td>0.006970</td>\n <td>0.105386</td>\n <td>0.036518</td>\n <td>-0.020469</td>\n <td>1.000000</td>\n </tr>\n </tbody>\n</table>\n</div>" | |
| }, | |
| "metadata": {} | |
| } | |
| ] | |
| }, | |
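| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "To read off only the correlations that involve `Age`, the `Age` column of the correlation matrix can be selected and ordered by strength. The cell below is a minimal sketch: the `numeric_sample` frame and its values are hypothetical and are not taken from the survey data. In this notebook the same steps could be applied to the matrix computed above.\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "code", | |
| "source": "import pandas as pd\n\n# Hypothetical numeric sample, for illustration only (not survey data)\nnumeric_sample = pd.DataFrame({\n    'Age': [22, 28, 35, 41, 50],\n    'WorkWeekHrs': [40, 45, 38, 50, 42],\n    'ConvertedComp': [30000, 52000, 61000, 80000, 75000],\n})\n\n# Correlation of Age with every other column (Age itself is dropped)\nage_corr = numeric_sample.corr()['Age'].drop('Age')\n\n# Order by absolute strength, strongest first\norder = age_corr.abs().sort_values(ascending=False).index\nprint(age_corr.reindex(order))", | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |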
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "## Authors\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Ramesh Sannareddy\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "### Other Contributors\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Rav Ahuja\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "## Change Log\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ----------------- | ---------------------------------- |\n| 2020-10-17 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n" | |
| }, | |
| { | |
| "metadata": {}, | |
| "cell_type": "markdown", | |
| "source": "Copyright \u00a9 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license).\n" | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "name": "python3", | |
| "display_name": "Python 3.7", | |
| "language": "python" | |
| }, | |
| "language_info": { | |
| "name": "python", | |
| "version": "3.7.10", | |
| "mimetype": "text/x-python", | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "pygments_lexer": "ipython3", | |
| "nbconvert_exporter": "python", | |
| "file_extension": ".py" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 4 | |
| } |