@adeways2000
Created March 23, 2021 02:48
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# **Data Wrangling Lab**\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Estimated time needed: **45 to 60** minutes\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "In this assignment you will be performing data wrangling.\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Objectives\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "In this lab you will perform the following:\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "- Identify duplicate values in the dataset.\n\n- Remove duplicate values from the dataset.\n\n- Identify missing values in the dataset.\n\n- Impute the missing values in the dataset.\n\n- Normalize data in the dataset.\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<hr>\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Hands on Lab\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Import pandas module.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "import pandas as pd\nimport numpy as np",
"execution_count": 2,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Load the dataset into a dataframe.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "df = pd.read_csv(\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m1_survey_data.csv\")\ndf.head()",
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 3,
"data": {
"text/plain": " Respondent MainBranch Hobbyist \\\n0 4 I am a developer by profession No \n1 9 I am a developer by profession Yes \n2 13 I am a developer by profession Yes \n3 16 I am a developer by profession Yes \n4 17 I am a developer by profession Yes \n\n OpenSourcer \\\n0 Never \n1 Once a month or more often \n2 Less than once a month but more than once per ... \n3 Never \n4 Less than once a month but more than once per ... \n\n OpenSource Employment \\\n0 The quality of OSS and closed source software ... Employed full-time \n1 The quality of OSS and closed source software ... Employed full-time \n2 OSS is, on average, of HIGHER quality than pro... Employed full-time \n3 The quality of OSS and closed source software ... Employed full-time \n4 The quality of OSS and closed source software ... Employed full-time \n\n Country Student EdLevel \\\n0 United States No Bachelor\u2019s degree (BA, BS, B.Eng., etc.) \n1 New Zealand No Some college/university study without earning ... \n2 United States No Master\u2019s degree (MA, MS, M.Eng., MBA, etc.) \n3 United Kingdom No Master\u2019s degree (MA, MS, M.Eng., MBA, etc.) \n4 Australia No Bachelor\u2019s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor ... \\\n0 Computer science, computer engineering, or sof... ... \n1 Computer science, computer engineering, or sof... ... \n2 Computer science, computer engineering, or sof... ... \n3 NaN ... \n4 Computer science, computer engineering, or sof... ... \n\n WelcomeChange \\\n0 Just as welcome now as I felt last year \n1 Just as welcome now as I felt last year \n2 Somewhat more welcome now than last year \n3 Just as welcome now as I felt last year \n4 Just as welcome now as I felt last year \n\n SONewContent Age Gender Trans \\\n0 Tech articles written by other developers;Indu... 22.0 Man No \n1 NaN 23.0 Man No \n2 Tech articles written by other developers;Cour... 28.0 Man No \n3 Tech articles written by other developers;Indu... 
26.0 Man No \n4 Tech articles written by other developers;Indu... 29.0 Man No \n\n Sexuality Ethnicity Dependents \\\n0 Straight / Heterosexual White or of European descent No \n1 Bisexual White or of European descent No \n2 Straight / Heterosexual White or of European descent Yes \n3 Straight / Heterosexual White or of European descent No \n4 Straight / Heterosexual Hispanic or Latino/Latina;Multiracial No \n\n SurveyLength SurveyEase \n0 Appropriate in length Easy \n1 Appropriate in length Neither easy nor difficult \n2 Appropriate in length Easy \n3 Appropriate in length Neither easy nor difficult \n4 Appropriate in length Easy \n\n[5 rows x 85 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>...</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Age</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>4</td>\n <td>I am a developer by profession</td>\n <td>No</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor\u2019s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>22.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>New Zealand</td>\n <td>No</td>\n <td>Some college/university study without earning ...</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>NaN</td>\n <td>23.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Bisexual</td>\n <td>White 
or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>2</th>\n <td>13</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once a month but more than once per ...</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Master\u2019s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Somewhat more welcome now than last year</td>\n <td>Tech articles written by other developers;Cour...</td>\n <td>28.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>3</th>\n <td>16</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United Kingdom</td>\n <td>No</td>\n <td>Master\u2019s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>26.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>4</th>\n <td>17</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once a month but more than once per ...</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Australia</td>\n <td>No</td>\n <td>Bachelor\u2019s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other 
developers;Indu...</td>\n <td>29.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>Hispanic or Latino/Latina;Multiracial</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows \u00d7 85 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Finding duplicates\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "In this section you will identify duplicate values in the dataset.\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": " Find how many duplicate rows exist in the dataframe.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\nduplicateDFRow = df[df.duplicated()].count()\nprint(duplicateDFRow)\n",
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": "Respondent 154\nMainBranch 154\nHobbyist 154\nOpenSourcer 154\nOpenSource 154\n ... \nSexuality 149\nEthnicity 146\nDependents 150\nSurveyLength 154\nSurveyEase 154\nLength: 85, dtype: int64\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Removing duplicates\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Remove the duplicate rows from the dataframe.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\ndf.drop_duplicates(subset=None, keep='first', inplace=False)\n",
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 5,
"data": {
"text/plain": " Respondent MainBranch Hobbyist \\\n0 4 I am a developer by profession No \n1 9 I am a developer by profession Yes \n2 13 I am a developer by profession Yes \n3 16 I am a developer by profession Yes \n4 17 I am a developer by profession Yes \n... ... ... ... \n11547 25136 I am a developer by profession Yes \n11548 25137 I am a developer by profession Yes \n11549 25138 I am a developer by profession Yes \n11550 25141 I am a developer by profession Yes \n11551 25142 I am a developer by profession Yes \n\n OpenSourcer \\\n0 Never \n1 Once a month or more often \n2 Less than once a month but more than once per ... \n3 Never \n4 Less than once a month but more than once per ... \n... ... \n11547 Never \n11548 Never \n11549 Less than once per year \n11550 Less than once a month but more than once per ... \n11551 Less than once a month but more than once per ... \n\n OpenSource Employment \\\n0 The quality of OSS and closed source software ... Employed full-time \n1 The quality of OSS and closed source software ... Employed full-time \n2 OSS is, on average, of HIGHER quality than pro... Employed full-time \n3 The quality of OSS and closed source software ... Employed full-time \n4 The quality of OSS and closed source software ... Employed full-time \n... ... ... \n11547 OSS is, on average, of HIGHER quality than pro... Employed full-time \n11548 The quality of OSS and closed source software ... Employed full-time \n11549 The quality of OSS and closed source software ... Employed full-time \n11550 OSS is, on average, of LOWER quality than prop... Employed full-time \n11551 OSS is, on average, of HIGHER quality than pro... Employed full-time \n\n Country Student \\\n0 United States No \n1 New Zealand No \n2 United States No \n3 United Kingdom No \n4 Australia No \n... ... ... \n11547 United States No \n11548 Poland No \n11549 United States No \n11550 Switzerland No \n11551 United Kingdom No \n\n EdLevel \\\n0 Bachelor\u2019s degree (BA, BS, B.Eng., etc.) 
\n1 Some college/university study without earning ... \n2 Master\u2019s degree (MA, MS, M.Eng., MBA, etc.) \n3 Master\u2019s degree (MA, MS, M.Eng., MBA, etc.) \n4 Bachelor\u2019s degree (BA, BS, B.Eng., etc.) \n... ... \n11547 Master\u2019s degree (MA, MS, M.Eng., MBA, etc.) \n11548 Master\u2019s degree (MA, MS, M.Eng., MBA, etc.) \n11549 Master\u2019s degree (MA, MS, M.Eng., MBA, etc.) \n11550 Secondary school (e.g. American high school, G... \n11551 Other doctoral degree (Ph.D, Ed.D., etc.) \n\n UndergradMajor ... \\\n0 Computer science, computer engineering, or sof... ... \n1 Computer science, computer engineering, or sof... ... \n2 Computer science, computer engineering, or sof... ... \n3 NaN ... \n4 Computer science, computer engineering, or sof... ... \n... ... ... \n11547 Computer science, computer engineering, or sof... ... \n11548 Computer science, computer engineering, or sof... ... \n11549 Computer science, computer engineering, or sof... ... \n11550 NaN ... \n11551 A natural science (ex. biology, chemistry, phy... ... \n\n WelcomeChange \\\n0 Just as welcome now as I felt last year \n1 Just as welcome now as I felt last year \n2 Somewhat more welcome now than last year \n3 Just as welcome now as I felt last year \n4 Just as welcome now as I felt last year \n... ... \n11547 Just as welcome now as I felt last year \n11548 A lot more welcome now than last year \n11549 A lot more welcome now than last year \n11550 Somewhat less welcome now than last year \n11551 Just as welcome now as I felt last year \n\n SONewContent Age Gender Trans \\\n0 Tech articles written by other developers;Indu... 22.0 Man No \n1 NaN 23.0 Man No \n2 Tech articles written by other developers;Cour... 28.0 Man No \n3 Tech articles written by other developers;Indu... 26.0 Man No \n4 Tech articles written by other developers;Indu... 29.0 Man No \n... ... ... ... ... \n11547 Tech articles written by other developers;Cour... 
36.0 Man No \n11548 Tech articles written by other developers;Tech... 25.0 Man No \n11549 Tech articles written by other developers;Indu... 34.0 Man No \n11550 NaN 25.0 Man No \n11551 Tech articles written by other developers;Tech... 30.0 Man No \n\n Sexuality Ethnicity \\\n0 Straight / Heterosexual White or of European descent \n1 Bisexual White or of European descent \n2 Straight / Heterosexual White or of European descent \n3 Straight / Heterosexual White or of European descent \n4 Straight / Heterosexual Hispanic or Latino/Latina;Multiracial \n... ... ... \n11547 Straight / Heterosexual White or of European descent \n11548 Straight / Heterosexual White or of European descent \n11549 Straight / Heterosexual White or of European descent \n11550 Straight / Heterosexual White or of European descent \n11551 Bisexual White or of European descent \n\n Dependents SurveyLength SurveyEase \n0 No Appropriate in length Easy \n1 No Appropriate in length Neither easy nor difficult \n2 Yes Appropriate in length Easy \n3 No Appropriate in length Neither easy nor difficult \n4 No Appropriate in length Easy \n... ... ... ... \n11547 No Appropriate in length Difficult \n11548 No Appropriate in length Neither easy nor difficult \n11549 Yes Too long Easy \n11550 No Appropriate in length Easy \n11551 No Appropriate in length Easy \n\n[11398 rows x 85 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>...</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Age</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>4</td>\n <td>I am a developer by profession</td>\n <td>No</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor\u2019s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>22.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>New Zealand</td>\n <td>No</td>\n <td>Some college/university study without earning ...</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>NaN</td>\n <td>23.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Bisexual</td>\n <td>White 
or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>2</th>\n <td>13</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once a month but more than once per ...</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Master\u2019s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Somewhat more welcome now than last year</td>\n <td>Tech articles written by other developers;Cour...</td>\n <td>28.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>3</th>\n <td>16</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United Kingdom</td>\n <td>No</td>\n <td>Master\u2019s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>26.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>4</th>\n <td>17</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once a month but more than once per ...</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Australia</td>\n <td>No</td>\n <td>Bachelor\u2019s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other 
developers;Indu...</td>\n <td>29.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>Hispanic or Latino/Latina;Multiracial</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>11547</th>\n <td>25136</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Never</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Master\u2019s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Cour...</td>\n <td>36.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Difficult</td>\n </tr>\n <tr>\n <th>11548</th>\n <td>25137</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Poland</td>\n <td>No</td>\n <td>Master\u2019s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>A lot more welcome now than last year</td>\n <td>Tech articles written by other developers;Tech...</td>\n <td>25.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>11549</th>\n <td>25138</td>\n 
<td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once per year</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Master\u2019s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>A lot more welcome now than last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>34.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>Yes</td>\n <td>Too long</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>11550</th>\n <td>25141</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once a month but more than once per ...</td>\n <td>OSS is, on average, of LOWER quality than prop...</td>\n <td>Employed full-time</td>\n <td>Switzerland</td>\n <td>No</td>\n <td>Secondary school (e.g. American high school, G...</td>\n <td>NaN</td>\n <td>...</td>\n <td>Somewhat less welcome now than last year</td>\n <td>NaN</td>\n <td>25.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>11551</th>\n <td>25142</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once a month but more than once per ...</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>United Kingdom</td>\n <td>No</td>\n <td>Other doctoral degree (Ph.D, Ed.D., etc.)</td>\n <td>A natural science (ex. 
biology, chemistry, phy...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Tech...</td>\n <td>30.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Bisexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>11398 rows \u00d7 85 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Verify if duplicates were actually dropped.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\ndf.drop_duplicates().duplicated().any()",
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 6,
"data": {
"text/plain": "False"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Finding Missing values\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Find the missing values for all columns.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\n\nmissing_data = df.isnull()\nmissing_data.head(5)",
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 7,
"data": {
"text/plain": " Respondent MainBranch Hobbyist OpenSourcer OpenSource Employment \\\n0 False False False False False False \n1 False False False False False False \n2 False False False False False False \n3 False False False False False False \n4 False False False False False False \n\n Country Student EdLevel UndergradMajor ... WelcomeChange \\\n0 False False False False ... False \n1 False False False False ... False \n2 False False False False ... False \n3 False False False True ... False \n4 False False False False ... False \n\n SONewContent Age Gender Trans Sexuality Ethnicity Dependents \\\n0 False False False False False False False \n1 True False False False False False False \n2 False False False False False False False \n3 False False False False False False False \n4 False False False False False False False \n\n SurveyLength SurveyEase \n0 False False \n1 False False \n2 False False \n3 False False \n4 False False \n\n[5 rows x 85 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>...</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Age</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>...</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n </tr>\n <tr>\n <th>1</th>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>...</td>\n <td>False</td>\n <td>True</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n </tr>\n <tr>\n <th>2</th>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>...</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n </tr>\n <tr>\n <th>3</th>\n <td>False</td>\n <td>False</td>\n 
<td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>True</td>\n <td>...</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n </tr>\n <tr>\n <th>4</th>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>...</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows \u00d7 85 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Find out how many rows are missing in the column 'WorkLoc'\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\ndf.WorkLoc.isnull().sum()\n",
"execution_count": 28,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 28,
"data": {
"text/plain": "32"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Imputing missing values\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Find the value counts for the column WorkLoc.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\ndf['WorkLoc'].value_counts()",
"execution_count": 29,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 29,
"data": {
"text/plain": "Office 6905\nHome 3638\nOther place, such as a coworking space or cafe 977\nName: WorkLoc, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Identify the value that is most frequent (majority) in the WorkLoc column.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "#make a note of the majority value here, for future reference\n#\ndf['WorkLoc'].value_counts().idxmax()",
"execution_count": 33,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 33,
"data": {
"text/plain": "'Office'"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Impute (replace) all the empty rows in the column WorkLoc with the value that you have identified as majority.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\ndf[\"WorkLoc\"].replace(np.nan, \"Office\", inplace=True)",
"execution_count": 36,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "After imputation there should ideally not be any empty rows in the WorkLoc column.\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Verify if imputing was successful.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\ndf.WorkLoc.isnull().sum()",
"execution_count": 40,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 40,
"data": {
"text/plain": "0"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Normalizing data\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "There are two columns in the dataset that talk about compensation.\n\nOne is \"CompFreq\". This column shows how often a developer is paid (Yearly, Monthly, Weekly).\n\nThe other is \"CompTotal\". This column talks about how much the developer is paid per Year, Month, or Week depending upon his/her \"CompFreq\". \n\nThis makes it difficult to compare the total compensation of the developers.\n\nIn this section you will create a new column called 'NormalizedAnnualCompensation' which contains the 'Annual Compensation' irrespective of the 'CompFreq'.\n\nOnce this column is ready, it makes comparison of salaries easy.\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "<hr>\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "List out the various categories in the column 'CompFreq'\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\nCompFreq = pd.Series([\"Yearly\", \"Monthly\", \"Weekly\"], dtype=\"category\")\nCompFreq\ndf.info()",
"execution_count": 127,
"outputs": [
{
"output_type": "stream",
"text": "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 11552 entries, 0 to 11551\nData columns (total 86 columns):\n # Column Non-Null Count Dtype \n--- ------ -------------- ----- \n 0 Respondent 11552 non-null int64 \n 1 MainBranch 11552 non-null object \n 2 Hobbyist 11552 non-null object \n 3 OpenSourcer 11552 non-null object \n 4 OpenSource 11471 non-null object \n 5 Employment 11552 non-null object \n 6 Country 11552 non-null object \n 7 Student 11499 non-null object \n 8 EdLevel 11436 non-null object \n 9 UndergradMajor 10812 non-null object \n 10 EduOther 11388 non-null object \n 11 OrgSize 11454 non-null object \n 12 DevType 11485 non-null object \n 13 YearsCode 11543 non-null object \n 14 Age1stCode 11539 non-null object \n 15 YearsCodePro 11536 non-null object \n 16 CareerSat 11552 non-null object \n 17 JobSat 11551 non-null object \n 18 MgrIdiot 11054 non-null object \n 19 MgrMoney 11050 non-null object \n 20 MgrWant 11054 non-null object \n 21 JobSeek 11552 non-null object \n 22 LastHireDate 11552 non-null object \n 23 LastInt 11129 non-null object \n 24 FizzBuzz 11515 non-null object \n 25 JobFactors 11549 non-null object \n 26 ResumeUpdate 11511 non-null object \n 27 CurrencySymbol 11552 non-null object \n 28 CurrencyDesc 11552 non-null object \n 29 CompTotal 10737 non-null float64 \n 30 CompFreq 11346 non-null category\n 31 ConvertedComp 10730 non-null float64 \n 32 WorkWeekHrs 11427 non-null float64 \n 33 WorkPlan 11429 non-null object \n 34 WorkChallenge 11384 non-null object \n 35 WorkRemote 11544 non-null object \n 36 WorkLoc 11552 non-null object \n 37 ImpSyn 11547 non-null object \n 38 CodeRev 11551 non-null object \n 39 CodeRevHrs 9083 non-null float64 \n 40 UnitTests 11523 non-null object \n 41 PurchaseHow 11354 non-null object \n 42 PurchaseWhat 11514 non-null object \n 43 LanguageWorkedWith 11541 non-null object \n 44 LanguageDesireNextYear 11415 non-null object \n 45 DatabaseWorkedWith 11096 non-null object \n 46 
DatabaseDesireNextYear 10497 non-null object \n 47 PlatformWorkedWith 11130 non-null object \n 48 PlatformDesireNextYear 10991 non-null object \n 49 WebFrameWorkedWith 10139 non-null object \n 50 WebFrameDesireNextYear 9918 non-null object \n 51 MiscTechWorkedWith 9343 non-null object \n 52 MiscTechDesireNextYear 10078 non-null object \n 53 DevEnviron 11523 non-null object \n 54 OpSys 11518 non-null object \n 55 Containers 11470 non-null object \n 56 BlockchainOrg 9198 non-null object \n 57 BlockchainIs 8915 non-null object \n 58 BetterLife 11452 non-null object \n 59 ITperson 11517 non-null object \n 60 OffOn 11514 non-null object \n 61 SocialMedia 11251 non-null object \n 62 Extraversion 11532 non-null object \n 63 ScreenName 11039 non-null object \n 64 SOVisit1st 11227 non-null object \n 65 SOVisitFreq 11547 non-null object \n 66 SOVisitTo 11551 non-null object \n 67 SOFindAnswer 11549 non-null object \n 68 SOTimeSaved 11501 non-null object \n 69 SOHowMuchTime 9616 non-null object \n 70 SOAccount 11551 non-null object \n 71 SOPartFreq 10404 non-null object \n 72 SOJobs 11546 non-null object \n 73 EntTeams 11547 non-null object \n 74 SOComm 11552 non-null object \n 75 WelcomeChange 11463 non-null object \n 76 SONewContent 9557 non-null object \n 77 Age 11255 non-null float64 \n 78 Gender 11477 non-null object \n 79 Trans 11429 non-null object \n 80 Sexuality 11005 non-null object \n 81 Ethnicity 10869 non-null object \n 82 Dependents 11408 non-null object \n 83 SurveyLength 11533 non-null object \n 84 SurveyEase 11538 non-null object \n 85 NormalizedAnnualCompensation 10737 non-null float64 \ndtypes: category(1), float64(6), int64(1), object(78)\nmemory usage: 7.5+ MB\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Create a new column named 'NormalizedAnnualCompensation'. Use the hint given below if needed.\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Double click to see the **Hint**.\n\n<!--\n\nUse the below logic to arrive at the values for the column NormalizedAnnualCompensation.\n\nIf the CompFreq is Yearly then use the exising value in CompTotal\nIf the CompFreq is Monthly then multiply the value in CompTotal with 12 (months in an year)\nIf the CompFreq is Weekly then multiply the value in CompTotal with 52 (weeks in an year)\n\n-->\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# your code goes here\ndf['NormalizedAnnualCompensation'] = df[CompTotal]\ndf['NormalizedAnnualCompensation'] = np.where(df['CompFreq']=='Monthly', df[Comptotal]*12,np.where(df['CompFreq'] == 'Weekly', df[Comptotal]*52,df[default= Comptotal]))\n",
"execution_count": 136,
"outputs": [
{
"output_type": "error",
"ename": "SyntaxError",
"evalue": "invalid syntax (<ipython-input-136-2feb4879d67d>, line 3)",
"traceback": [
"\u001b[0;36m File \u001b[0;32m\"<ipython-input-136-2feb4879d67d>\"\u001b[0;36m, line \u001b[0;32m3\u001b[0m\n\u001b[0;31m df['NormalizedAnnualCompensation'] = np.where(df['CompFreq']=='Monthly', df[Comptotal]*12,np.where(df['CompFreq'] == 'Weekly', df[Comptotal]*52,df[default= Comptotal]))\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
]
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Authors\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Ramesh Sannareddy\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Other Contributors\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Rav Ahuja\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Change Log\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ----------------- | ---------------------------------- |\n| 2020-10-17 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": " Copyright \u00a9 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.7",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.10",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
}
},
"nbformat": 4,
"nbformat_minor": 4
}