olayinka04/1 - ExtractTransformLoad_V2.ipynb

## 1 - ExtractTransformLoad_V2.ipynb
{"cells":[{"cell_type":"markdown","metadata":{},"source":["<center>\n","    <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\"  />\n","</center>\n"]},{"cell_type":"markdown","metadata":{},"source":["# **Extract Transform Load (ETL) Lab**\n"]},{"cell_type":"markdown","metadata":{},"source":["Estimated time needed: **30** minutes\n"]},{"cell_type":"markdown","metadata":{},"source":["## Objectives\n","\n","After completing this lab you will be able to:\n","\n","*   Read CSV and JSON file types.\n","*   Extract data from the above file types.\n","*   Transform data.\n","*   Save the transformed data in a ready-to-load format which data engineers can use to load into an RDBMS.\n"]},{"cell_type":"markdown","metadata":{},"source":["Import the required modules and functions\n"]},{"cell_type":"code","execution_count":12,"metadata":{},"outputs":[],"source":["import glob                         # this module helps in selecting files \n","import pandas as pd                 # this module helps in processing CSV files\n","import xml.etree.ElementTree as ET  # this module helps in processing XML files.\n","from datetime import datetime"]},{"cell_type":"markdown","metadata":{},"source":["## Download Files\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/source.zip"]},{"cell_type":"markdown","metadata":{},"source":["## Unzip Files\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["!unzip source.zip"]},{"cell_type":"markdown","metadata":{},"source":["## Set Paths\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["tmpfile    = \"temp.tmp\"               # file used to store all extracted data\n","logfile    = \"logfile.txt\"            # all event logs will be stored in this file\n","targetfile = \"transformed_data.csv\"   # file where transformed data is stored"]},{"cell_type":"markdown","metadata":{},"source":["## Extract\n"]},{"cell_type":"markdown","metadata":{},"source":["### CSV Extract Function\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def extract_from_csv(file_to_process):\n","    dataframe = pd.read_csv(file_to_process)\n","    return dataframe"]},{"cell_type":"markdown","metadata":{},"source":["### JSON Extract Function\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def extract_from_json(file_to_process):\n","    dataframe = pd.read_json(file_to_process,lines=True)\n","    return dataframe"]},{"cell_type":"markdown","metadata":{},"source":["### XML Extract Function\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def extract_from_xml(file_to_process):\n","    dataframe = pd.DataFrame(columns=[\"name\", \"height\", \"weight\"])\n","    tree = ET.parse(file_to_process)\n","    root = tree.getroot()\n","    for person in root:\n","        name = person.find(\"name\").text\n","        height = float(person.find(\"height\").text)\n","        weight = float(person.find(\"weight\").text)\n","        dataframe = dataframe.append({\"name\":name, \"height\":height, \"weight\":weight}, ignore_index=True)\n","    return dataframe"]},{"cell_type":"markdown","metadata":{},"source":["### Extract Function\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def extract():\n","    extracted_data = pd.DataFrame(columns=['name','height','weight']) # create an empty data frame to hold extracted data\n","    \n","    #process all csv files\n","    for csvfile in glob.glob(\"*.csv\"):\n","        extracted_data = extracted_data.append(extract_from_csv(csvfile), ignore_index=True)\n","        \n","    #process all json files\n","    for jsonfile in glob.glob(\"*.json\"):\n","        extracted_data = extracted_data.append(extract_from_json(jsonfile), ignore_index=True)\n","    \n","    #process all xml files\n","    for xmlfile in glob.glob(\"*.xml\"):\n","        extracted_data = extracted_data.append(extract_from_xml(xmlfile), ignore_index=True)\n","        \n","    return extracted_data"]},{"cell_type":"markdown","metadata":{},"source":["## Transform\n"]},{"cell_type":"markdown","metadata":{},"source":["The transform function does the following tasks.\n","\n","1.  Convert height which is in inches to millimeter\n","2.  Convert weight which is in pounds to kilograms\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def transform(data):\n","        #Convert height which is in inches to millimeter\n","        #Convert the datatype of the column into float\n","        #data.height = data.height.astype(float)\n","        #Convert inches to meters and round off to two decimals(one inch is 0.0254 meters)\n","        data['height'] = round(data.height * 0.0254,2)\n","        \n","        #Convert weight which is in pounds to kilograms\n","        #Convert the datatype of the column into float\n","        #data.weight = data.weight.astype(float)\n","        #Convert pounds to kilograms and round off to two decimals(one pound is 0.45359237 kilograms)\n","        data['weight'] = round(data.weight * 0.45359237,2)\n","        return data"]},{"cell_type":"markdown","metadata":{},"source":["## Loading\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def load(targetfile,data_to_load):\n","    data_to_load.to_csv(targetfile)  "]},{"cell_type":"markdown","metadata":{},"source":["## Logging\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def log(message):\n","    timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second\n","    now = datetime.now() # get current timestamp\n","    timestamp = now.strftime(timestamp_format)\n","    with open(\"logfile.txt\",\"a\") as f:\n","        f.write(timestamp + ',' + message + '\\n')"]},{"cell_type":"markdown","metadata":{},"source":["## Running ETL Process\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"ETL Job Started\")"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"Extract phase Started\")\n","extracted_data = extract()\n","log(\"Extract phase Ended\")\n","extracted_data"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"Transform phase Started\")\n","transformed_data = transform(extracted_data)\n","log(\"Transform phase Ended\")\n","transformed_data "]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"Load phase Started\")\n","load(targetfile,transformed_data)\n","log(\"Load phase Ended\")"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"ETL Job Ended\")"]},{"cell_type":"markdown","metadata":{},"source":["# Exercise\n"]},{"cell_type":"markdown","metadata":{},"source":["Using the example above complete the exercise below.\n"]},{"cell_type":"markdown","metadata":{},"source":["## Download Files\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["--2021-01-20 15:58:13--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/datasource.zip\n","Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\n","Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 4249 (4.1K) [application/zip]\n","Saving to: ‘datasource.zip’\n","\n","datasource.zip      100%[===================>]   4.15K  --.-KB/s    in 0.001s  \n","\n","2021-01-20 15:58:13 (8.09 MB/s) - ‘datasource.zip’ saved [4249/4249]\n","\n"]}],"source":["!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/datasource.zip"]},{"cell_type":"markdown","metadata":{},"source":["## Unzip Files\n"]},{"cell_type":"code","execution_count":2,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["Archive:  datasource.zip\n","  inflating: dealership_data/used_car_prices1.csv  \n","  inflating: dealership_data/used_car_prices2.csv  \n","  inflating: dealership_data/used_car_prices3.csv  \n","  inflating: dealership_data/used_car_prices1.json  \n","  inflating: dealership_data/used_car_prices2.json  \n","  inflating: dealership_data/used_car_prices3.json  \n","  inflating: dealership_data/used_car_prices1.xml  \n","  inflating: dealership_data/used_car_prices2.xml  \n","  inflating: dealership_data/used_car_prices3.xml  \n"]}],"source":["!unzip datasource.zip -d dealership_data"]},{"cell_type":"markdown","metadata":{},"source":["## About the Data\n"]},{"cell_type":"markdown","metadata":{},"source":["The file `dealership_data` contains CSV, JSON, and XML files for used car data which contain features named `car_model`, `year_of_manufacture`, `price`, and `fuel`.\n"]},{"cell_type":"markdown","metadata":{},"source":["## Set Paths\n"]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[],"source":["tmpfile    = \"dealership_temp.tmp\"               # file used to store all extracted data\n","logfile    = \"dealership_logfile.txt\"            # all event logs will be stored in this file\n","targetfile = \"dealership_transformed_data.csv\"   # file where transformed data is stored"]},{"cell_type":"markdown","metadata":{},"source":["## Extract\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 1: CSV Extract Function\n"]},{"cell_type":"code","execution_count":4,"metadata":{},"outputs":[],"source":["# Add the CSV extract function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","    \n","def extract_from_csv(file_to_process):\n","    dataframe = pd.read_csv(file_to_process)\n","    return dataframe\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 2: JSON Extract Function\n"]},{"cell_type":"code","execution_count":5,"metadata":{},"outputs":[],"source":["# Add the JSON extract function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","    \n","def extract_from_json(file_to_process):\n","    dataframe = pd.read_json(file_to_process,lines=True)\n","    return dataframe\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 3: XML Extract Function\n"]},{"cell_type":"code","execution_count":6,"metadata":{},"outputs":[],"source":["# Add the XML extract function below, it is the same as the xml extract function above but the column names need to be renamed.\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","    \n","def extract_from_xml(file_to_process):\n","    dataframe = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel'])\n","    tree = ET.parse(file_to_process)\n","    root = tree.getroot()\n","    for person in root:\n","        car_model = person.find(\"car_model\").text\n","        year_of_manufacture = int(person.find(\"year_of_manufacture\").text)\n","        price = float(person.find(\"price\").text)\n","        fuel = person.find(\"fuel\").text\n","        dataframe = dataframe.append({\"car_model\":car_model, \"year_of_manufacture\":year_of_manufacture, \"price\":price, \"fuel\":fuel}, ignore_index=True)\n","    return dataframe\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 4: Extract Function\n","\n","Call the specific extract functions you created above by replacing the `ADD_FUNCTION_CALL` with the proper function call.\n"]},{"cell_type":"code","execution_count":7,"metadata":{},"outputs":[],"source":["def extract():\n","    extracted_data = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel']) # create an empty data frame to hold extracted data\n","    \n","    #process all csv files\n","    for csvfile in glob.glob(\"dealership_data/*.csv\"):\n","        extracted_data = extracted_data.append('ADD_FUNCTION_CALL', ignore_index=True)\n","        \n","    #process all json files\n","    for jsonfile in glob.glob(\"dealership_data/*.json\"):\n","        extracted_data = extracted_data.append('ADD_FUNCTION_CALL', ignore_index=True)\n","    \n","    #process all xml files\n","    for xmlfile in glob.glob(\"dealership_data/*.xml\"):\n","        extracted_data = extracted_data.append('ADD_FUNCTION_CALL', ignore_index=True)\n","        \n","    return extracted_data"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","    \n","def extract():\n","    extracted_data = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel']) # create an empty data frame to hold extracted data\n","    \n","    #process all csv files\n","    for csvfile in glob.glob(\"dealership_data/*.csv\"):\n","        extracted_data = extracted_data.append(extract_from_csv(csvfile), ignore_index=True)\n","        \n","    #process all json files\n","    for jsonfile in glob.glob(\"dealership_data/*.json\"):\n","        extracted_data = extracted_data.append(extract_from_json(jsonfile), ignore_index=True)\n","    \n","    #process all xml files\n","    for xmlfile in glob.glob(\"dealership_data/*.xml\"):\n","        extracted_data = extracted_data.append(extract_from_xml(xmlfile), ignore_index=True)\n","        \n","    return extracted_data\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Transform\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 5: Transform\n","\n","Round the `price` columns to 2 decimal places\n"]},{"cell_type":"code","execution_count":8,"metadata":{},"outputs":[],"source":["# Add the transform function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","\n","def transform(data):\n","        data['price'] = round(data.price, 2)\n","        return data\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Loading\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 6: Load\n"]},{"cell_type":"code","execution_count":9,"metadata":{},"outputs":[],"source":["# Add the load function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","\n","def load(targetfile,data_to_load):\n","    data_to_load.to_csv(targetfile)  \n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Logging\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 7: Log\n","\n","Make sure to change the name of the logfile to the one specified in the set paths section. Change the timestamp order to Hour-Minute-Second-Monthname-Day-Year.\n"]},{"cell_type":"code","execution_count":10,"metadata":{},"outputs":[],"source":["# Add the log function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","\n","def log(message):\n","    timestamp_format = '%H:%M:%S-%h-%d-%Y' #Hour-Minute-Second-MonthName-Day-Year\n","    now = datetime.now() # get current timestamp\n","    timestamp = now.strftime(timestamp_format)\n","    with open(\"dealership_logfile.txt\",\"a\") as f:\n","        f.write(timestamp + ',' + message + '\\n') \n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Running ETL Process\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 8: ETL Process\n","\n","Run all functions to extract, transform, and load the data. Make sure to log all events using the `log` function. Place your code under each comment.\n"]},{"cell_type":"code","execution_count":14,"metadata":{},"outputs":[],"source":["# Log that you have started the ETL process\n","\n","\n","# Log that you have started the Extract step\n","\n","# Call the Extract function\n","\n","# Log that you have completed the Extract step\n","\n","\n","# Log that you have started the Transform step\n","\n","# Call the Transform function\n","\n","# Log that you have completed the Transform step\n","\n","\n","# Log that you have started the Load step\n","\n","# Call the Load function\n","\n","# Log that you have completed the Load step\n","\n","\n","# Log that you have completed the ETL process"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","\n","log(\"ETL Job Started\")\n","\n","log(\"Extract phase Started\")\n","extracted_data = extract()\n","log(\"Extract phase Ended\")\n","\n","log(\"Transform phase Started\")\n","transformed_data = transform(extracted_data)\n","log(\"Transform phase Ended\")\n","\n","log(\"Load phase Started\")\n","load(targetfile,transformed_data)\n","log(\"Load phase Ended\")\n","\n","log(\"ETL Job Ended\")\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Authors\n"]},{"cell_type":"markdown","metadata":{},"source":["Ramesh Sannareddy\n","\n","Joseph Santarcangelo\n","\n","Azim Hirjani\n"]},{"cell_type":"markdown","metadata":{},"source":["## Change Log\n"]},{"cell_type":"markdown","metadata":{},"source":["| Date (YYYY-MM-DD) | Version | Changed By        | Change Description                 |\n","| ----------------- | ------- | ----------------- | ---------------------------------- |\n","| 2020-11-25        | 0.1     | Ramesh Sannareddy | Created initial version of the lab |\n"]},{"cell_type":"markdown","metadata":{},"source":["Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-2021-01-01&cm_mmc=Email_Newsletter-\\_-Developer_Ed%2BTech-\\_-WW_WW-\\_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork-23455645&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.0"}},"nbformat":4,"nbformat_minor":4}

## 2 - Webscraping_Engineer_Peer_Review_Assignment.ipynb

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              2 - Webscraping_Engineer_Peer_Review_Assignment.ipynb
            
          
      Loading

      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## 3 - API_Engineer_Peer_Review_Assignment_Extract.ipynb

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              3 - API_Engineer_Peer_Review_Assignment_Extract.ipynb
            
          
      Loading

      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
No results found