{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "# **Hands-on Lab : Web Scraping**\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Estimated time needed: **30 to 45** minutes\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Objectives\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In this lab you will perform the following:\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "- Extract information from a given web site \n- Write the scraped data into a csv file.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Extract information from the given web site\n\nYou will extract the data from the below web site: <br> \n"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": "#this url contains the data you need to scrape\nurl = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html\""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The data you need to scrape is the **name of the programming language** and **average annual salary**.<br> It is a good idea to open the url in your web broswer and study the contents of the web page before you start to scrape.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Import the required libraries\n"
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": "# Your code here\nimport pandas as pd\nfrom bs4 import BeautifulSoup # this module helps in web scrapping.\nimport requests # this module helps us to download a web page"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Download the webpage at the url\n"
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": "#your code goes here\nurl = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html\""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Create a soup object\n"
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": "#your code goes here\ndata = requests.get(url).text\nsoup = BeautifulSoup(data,\"html5lib\") "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Scrape the `Language name` and `annual average salary`.\n"
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": "#your code goes here\ntech_data = pd.DataFrame(columns=[\"Language_name\", \"annual average salary\",])\ntable = soup.find('table')"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Save the scrapped data into a file named _popular-languages.csv_\n"
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Language--->Average Annual Salary\nPython--->$114,383\nJava--->$101,013\nR--->$92,037\nJavascript--->$110,981\nSwift--->$130,801\nC++--->$113,865\nC#--->$88,726\nPHP--->$84,727\nSQL--->$84,793\nGo--->$94,082\n"
}
],
"source": "# your code goes here\nfor row in table.find_all('tr'): # in html table row is represented by the tag <tr>\n # Get all columns in each row.\n cols = row.find_all('td') # in html a column is represented by the tag <td>\n language_name = cols[1].getText() # store the value in column 1 as language_name\n annual_average_salary= cols[3].getText() # store the value in column 4 as annual_average_salary\n print(\"{}--->{}\".format(language_name,annual_average_salary))\n tech_data.to_csv('popular-languages.csv')\n "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Authors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Ramesh Sannareddy\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Other Contributors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Rav Ahuja\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Change Log\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ----------------- | ---------------------------------- |\n| 2020-10-17 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": " Copyright \u00a9 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.7",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}