Created
November 22, 2021 09:15
-
-
Save olayinka04/975d41b85e3bed16aee9f5eeec3f3b5d to your computer and use it in GitHub Desktop.
DE IBM 3 - Python Project for Data Engineering
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| {"cells":[{"cell_type":"markdown","metadata":{},"source":["<center>\n"," <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n","</center>\n"]},{"cell_type":"markdown","metadata":{},"source":["# **Extract Transform Load (ETL) Lab**\n"]},{"cell_type":"markdown","metadata":{},"source":["Estimated time needed: **30** minutes\n"]},{"cell_type":"markdown","metadata":{},"source":["## Objectives\n","\n","After completing this lab you will be able to:\n","\n","* Read CSV and JSON file types.\n","* Extract data from the above file types.\n","* Transform data.\n","* Save the transformed data in a ready-to-load format which data engineers can use to load into an RDBMS.\n"]},{"cell_type":"markdown","metadata":{},"source":["Import the required modules and functions\n"]},{"cell_type":"code","execution_count":12,"metadata":{},"outputs":[],"source":["import glob # this module helps in selecting files \n","import pandas as pd # this module helps in processing CSV files\n","import xml.etree.ElementTree as ET # this module helps in processing XML files.\n","from datetime import datetime"]},{"cell_type":"markdown","metadata":{},"source":["## Download Files\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/source.zip"]},{"cell_type":"markdown","metadata":{},"source":["## Unzip Files\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["!unzip source.zip"]},{"cell_type":"markdown","metadata":{},"source":["## Set Paths\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["tmpfile = \"temp.tmp\" # file used to store all extracted data\n","logfile = \"logfile.txt\" # all event logs will be stored in this file\n","targetfile = \"transformed_data.csv\" # file where transformed data is stored"]},{"cell_type":"markdown","metadata":{},"source":["## Extract\n"]},{"cell_type":"markdown","metadata":{},"source":["### CSV Extract Function\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def extract_from_csv(file_to_process):\n"," dataframe = pd.read_csv(file_to_process)\n"," return dataframe"]},{"cell_type":"markdown","metadata":{},"source":["### JSON Extract Function\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def extract_from_json(file_to_process):\n"," dataframe = pd.read_json(file_to_process,lines=True)\n"," return dataframe"]},{"cell_type":"markdown","metadata":{},"source":["### XML Extract Function\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def extract_from_xml(file_to_process):\n"," dataframe = pd.DataFrame(columns=[\"name\", \"height\", \"weight\"])\n"," tree = ET.parse(file_to_process)\n"," root = tree.getroot()\n"," for person in root:\n"," name = person.find(\"name\").text\n"," height = float(person.find(\"height\").text)\n"," weight = float(person.find(\"weight\").text)\n"," dataframe = dataframe.append({\"name\":name, \"height\":height, \"weight\":weight}, ignore_index=True)\n"," return dataframe"]},{"cell_type":"markdown","metadata":{},"source":["### Extract Function\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def extract():\n"," extracted_data = pd.DataFrame(columns=['name','height','weight']) # create an empty data frame to hold extracted data\n"," \n"," #process all csv files\n"," for csvfile in glob.glob(\"*.csv\"):\n"," extracted_data = extracted_data.append(extract_from_csv(csvfile), ignore_index=True)\n"," \n"," #process all json files\n"," for jsonfile in glob.glob(\"*.json\"):\n"," extracted_data = extracted_data.append(extract_from_json(jsonfile), ignore_index=True)\n"," \n"," #process all xml files\n"," for xmlfile in glob.glob(\"*.xml\"):\n"," extracted_data = extracted_data.append(extract_from_xml(xmlfile), ignore_index=True)\n"," \n"," return extracted_data"]},{"cell_type":"markdown","metadata":{},"source":["## Transform\n"]},{"cell_type":"markdown","metadata":{},"source":["The transform function does the following tasks.\n","\n","1. Convert height which is in inches to millimeter\n","2. Convert weight which is in pounds to kilograms\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def transform(data):\n"," #Convert height which is in inches to millimeter\n"," #Convert the datatype of the column into float\n"," #data.height = data.height.astype(float)\n"," #Convert inches to meters and round off to two decimals(one inch is 0.0254 meters)\n"," data['height'] = round(data.height * 0.0254,2)\n"," \n"," #Convert weight which is in pounds to kilograms\n"," #Convert the datatype of the column into float\n"," #data.weight = data.weight.astype(float)\n"," #Convert pounds to kilograms and round off to two decimals(one pound is 0.45359237 kilograms)\n"," data['weight'] = round(data.weight * 0.45359237,2)\n"," return data"]},{"cell_type":"markdown","metadata":{},"source":["## Loading\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def load(targetfile,data_to_load):\n"," data_to_load.to_csv(targetfile) "]},{"cell_type":"markdown","metadata":{},"source":["## Logging\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["def log(message):\n"," timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second\n"," now = datetime.now() # get current timestamp\n"," timestamp = now.strftime(timestamp_format)\n"," with open(\"logfile.txt\",\"a\") as f:\n"," f.write(timestamp + ',' + message + '\\n')"]},{"cell_type":"markdown","metadata":{},"source":["## Running ETL Process\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"ETL Job Started\")"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"Extract phase Started\")\n","extracted_data = extract()\n","log(\"Extract phase Ended\")\n","extracted_data"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"Transform phase Started\")\n","transformed_data = transform(extracted_data)\n","log(\"Transform phase Ended\")\n","transformed_data "]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"Load phase Started\")\n","load(targetfile,transformed_data)\n","log(\"Load phase Ended\")"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["log(\"ETL Job Ended\")"]},{"cell_type":"markdown","metadata":{},"source":["# Exercise\n"]},{"cell_type":"markdown","metadata":{},"source":["Using the example above complete the exercise below.\n"]},{"cell_type":"markdown","metadata":{},"source":["## Download Files\n"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["--2021-01-20 15:58:13-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/datasource.zip\n","Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\n","Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 4249 (4.1K) [application/zip]\n","Saving to: ‘datasource.zip’\n","\n","datasource.zip 100%[===================>] 4.15K --.-KB/s in 0.001s \n","\n","2021-01-20 15:58:13 (8.09 MB/s) - ‘datasource.zip’ saved [4249/4249]\n","\n"]}],"source":["!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/datasource.zip"]},{"cell_type":"markdown","metadata":{},"source":["## Unzip Files\n"]},{"cell_type":"code","execution_count":2,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["Archive: datasource.zip\n"," inflating: dealership_data/used_car_prices1.csv \n"," inflating: dealership_data/used_car_prices2.csv \n"," inflating: dealership_data/used_car_prices3.csv \n"," inflating: dealership_data/used_car_prices1.json \n"," inflating: dealership_data/used_car_prices2.json \n"," inflating: dealership_data/used_car_prices3.json \n"," inflating: dealership_data/used_car_prices1.xml \n"," inflating: dealership_data/used_car_prices2.xml \n"," inflating: dealership_data/used_car_prices3.xml \n"]}],"source":["!unzip datasource.zip -d dealership_data"]},{"cell_type":"markdown","metadata":{},"source":["## About the Data\n"]},{"cell_type":"markdown","metadata":{},"source":["The file `dealership_data` contains CSV, JSON, and XML files for used car data which contain features named `car_model`, `year_of_manufacture`, `price`, and `fuel`.\n"]},{"cell_type":"markdown","metadata":{},"source":["## Set Paths\n"]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[],"source":["tmpfile = \"dealership_temp.tmp\" # file used to store all extracted data\n","logfile = \"dealership_logfile.txt\" # all event logs will be stored in this file\n","targetfile = \"dealership_transformed_data.csv\" # file where transformed data is stored"]},{"cell_type":"markdown","metadata":{},"source":["## Extract\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 1: CSV Extract Function\n"]},{"cell_type":"code","execution_count":4,"metadata":{},"outputs":[],"source":["# Add the CSV extract function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n"," \n","def extract_from_csv(file_to_process):\n"," dataframe = pd.read_csv(file_to_process)\n"," return dataframe\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 2: JSON Extract Function\n"]},{"cell_type":"code","execution_count":5,"metadata":{},"outputs":[],"source":["# Add the JSON extract function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n"," \n","def extract_from_json(file_to_process):\n"," dataframe = pd.read_json(file_to_process,lines=True)\n"," return dataframe\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 3: XML Extract Function\n"]},{"cell_type":"code","execution_count":6,"metadata":{},"outputs":[],"source":["# Add the XML extract function below, it is the same as the xml extract function above but the column names need to be renamed.\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n"," \n","def extract_from_xml(file_to_process):\n"," dataframe = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel'])\n"," tree = ET.parse(file_to_process)\n"," root = tree.getroot()\n"," for person in root:\n"," car_model = person.find(\"car_model\").text\n"," year_of_manufacture = int(person.find(\"year_of_manufacture\").text)\n"," price = float(person.find(\"price\").text)\n"," fuel = person.find(\"fuel\").text\n"," dataframe = dataframe.append({\"car_model\":car_model, \"year_of_manufacture\":year_of_manufacture, \"price\":price, \"fuel\":fuel}, ignore_index=True)\n"," return dataframe\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 4: Extract Function\n","\n","Call the specific extract functions you created above by replacing the `ADD_FUNCTION_CALL` with the proper function call.\n"]},{"cell_type":"code","execution_count":7,"metadata":{},"outputs":[],"source":["def extract():\n"," extracted_data = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel']) # create an empty data frame to hold extracted data\n"," \n"," #process all csv files\n"," for csvfile in glob.glob(\"dealership_data/*.csv\"):\n"," extracted_data = extracted_data.append('ADD_FUNCTION_CALL', ignore_index=True)\n"," \n"," #process all json files\n"," for jsonfile in glob.glob(\"dealership_data/*.json\"):\n"," extracted_data = extracted_data.append('ADD_FUNCTION_CALL', ignore_index=True)\n"," \n"," #process all xml files\n"," for xmlfile in glob.glob(\"dealership_data/*.xml\"):\n"," extracted_data = extracted_data.append('ADD_FUNCTION_CALL', ignore_index=True)\n"," \n"," return extracted_data"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n"," \n","def extract():\n"," extracted_data = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel']) # create an empty data frame to hold extracted data\n"," \n"," #process all csv files\n"," for csvfile in glob.glob(\"dealership_data/*.csv\"):\n"," extracted_data = extracted_data.append(extract_from_csv(csvfile), ignore_index=True)\n"," \n"," #process all json files\n"," for jsonfile in glob.glob(\"dealership_data/*.json\"):\n"," extracted_data = extracted_data.append(extract_from_json(jsonfile), ignore_index=True)\n"," \n"," #process all xml files\n"," for xmlfile in glob.glob(\"dealership_data/*.xml\"):\n"," extracted_data = extracted_data.append(extract_from_xml(xmlfile), ignore_index=True)\n"," \n"," return extracted_data\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Transform\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 5: Transform\n","\n","Round the `price` columns to 2 decimal places\n"]},{"cell_type":"code","execution_count":8,"metadata":{},"outputs":[],"source":["# Add the transform function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","\n","def transform(data):\n"," data['price'] = round(data.price, 2)\n"," return data\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Loading\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 6: Load\n"]},{"cell_type":"code","execution_count":9,"metadata":{},"outputs":[],"source":["# Add the load function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","\n","def load(targetfile,data_to_load):\n"," data_to_load.to_csv(targetfile) \n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Logging\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 7: Log\n","\n","Make sure to change the name of the logfile to the one specified in the set paths section. Change the timestamp order to Hour-Minute-Second-Monthname-Day-Year.\n"]},{"cell_type":"code","execution_count":10,"metadata":{},"outputs":[],"source":["# Add the log function below\n"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","\n","def log(message):\n"," timestamp_format = '%H:%M:%S-%h-%d-%Y' #Hour-Minute-Second-MonthName-Day-Year\n"," now = datetime.now() # get current timestamp\n"," timestamp = now.strftime(timestamp_format)\n"," with open(\"dealership_logfile.txt\",\"a\") as f:\n"," f.write(timestamp + ',' + message + '\\n') \n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Running ETL Process\n"]},{"cell_type":"markdown","metadata":{},"source":["### Question 8: ETL Process\n","\n","Run all functions to extract, transform, and load the data. Make sure to log all events using the `log` function. Place your code under each comment.\n"]},{"cell_type":"code","execution_count":14,"metadata":{},"outputs":[],"source":["# Log that you have started the ETL process\n","\n","\n","# Log that you have started the Extract step\n","\n","# Call the Extract function\n","\n","# Log that you have completed the Extract step\n","\n","\n","# Log that you have started the Transform step\n","\n","# Call the Transform function\n","\n","# Log that you have completed the Transform step\n","\n","\n","# Log that you have started the Load step\n","\n","# Call the Load function\n","\n","# Log that you have completed the Load step\n","\n","\n","# Log that you have completed the ETL process"]},{"cell_type":"markdown","metadata":{},"source":["<details><summary>Click here for the solution</summary>\n","\n","```\n","\n","log(\"ETL Job Started\")\n","\n","log(\"Extract phase Started\")\n","extracted_data = extract()\n","log(\"Extract phase Ended\")\n","\n","log(\"Transform phase Started\")\n","transformed_data = transform(extracted_data)\n","log(\"Transform phase Ended\")\n","\n","log(\"Load phase Started\")\n","load(targetfile,transformed_data)\n","log(\"Load phase Ended\")\n","\n","log(\"ETL Job Ended\")\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{},"source":["## Authors\n"]},{"cell_type":"markdown","metadata":{},"source":["Ramesh Sannareddy\n","\n","Joseph Santarcangelo\n","\n","Azim Hirjani\n"]},{"cell_type":"markdown","metadata":{},"source":["## Change Log\n"]},{"cell_type":"markdown","metadata":{},"source":["| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","| ----------------- | ------- | ----------------- | ---------------------------------- |\n","| 2020-11-25 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n"]},{"cell_type":"markdown","metadata":{},"source":["Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-2021-01-01&cm_mmc=Email_Newsletter-\\_-Developer_Ed%2BTech-\\_-WW_WW-\\_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork-23455645&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.0"}},"nbformat":4,"nbformat_minor":4} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This is very helpful, thanks a loads!! :)