Challenge 0004 - IMDB Movies Data Analyzer with Python.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyPxVB+9oC0GKxEo+WmHIBvh",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/gabrielsimas/633bfc0e01a4aa3fe19b71e01e1d350a/challenge-0004-imdb-movies-data-analyzer-with-python.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# Challenge 0004 - Movie IMDb Data Analyzer with Python\n",
"\n",
"## Project Description\n",
"In this project, your task is to analyze IMDb data about movies: their genres, release years, ratings, and more.\n",
"\n",
"## Expected Output\n",
"The goal is to visualize the movie ratings from the data given in the CSV file provided in the link above. Specifically, your program should generate a distribution of movie ratings.\n",
"\n",
"The graph shows the frequency of each rating. For example, the ratings 8.5 and 8.6 are the most frequent ones in the CSV file.\n",
"\n",
"We recommend using pandas, matplotlib, and seaborn to produce the graph above.\n",
"\n",
"## Learning Benefits\n",
"Learn how to load, analyze, and manipulate movie data using Pandas.\n",
"\n",
"Understand how to visualize trends over time using line charts.\n",
"\n",
"Gain skills in data exploration, sorting, and filtering.\n",
"\n",
"### In this challenge we'll use Plotly to plot the graphs and, at the end, export every graph to an HTML file and publish them on GitHub Pages.\n",
"\n",
"Let's go!\n",
"\n"
],
"metadata": {
"id": "p4YD-VkD3GUd"
}
},
{
"cell_type": "code",
"source": [
"# TODO: Convert it to an OOP approach!"
],
"metadata": {
"id": "IzyHGtK58c6c"
},
"execution_count": 136,
"outputs": []
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {
"id": "23IinkA-24vs"
},
"outputs": [],
"source": [
"# Installing necessary libraries\n",
"# Uncomment the line below if you are not running in Google Colab\n",
"# !pip install pandas plotly"
]
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"from plotly.subplots import make_subplots\n",
"import plotly.graph_objects as go\n",
"import plotly.express as px\n",
"from plotly.io import to_html"
],
"metadata": {
"id": "EyHKhQgu4KjZ"
},
"execution_count": 138,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Step 2: Load the Dataset\n",
"# from google.colab import files\n",
"# uploads = files.upload()\n",
"df = pd.read_csv('/content/imdb_top_1000.csv')"
],
"metadata": {
"id": "zV2npfO04VUQ"
},
"execution_count": 139,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Step 3: Clean and Prepare the Data\n",
"# Convert 'Gross' and 'Runtime' to numeric, and handle missing values\n",
"df['Gross'] = df['Gross'].str.replace(\",\",\"\").astype(float).fillna(0)\n",
"df['Runtime'] = df['Runtime'].str.replace(\" min\", \"\").astype(float)\n",
"\n",
"# Use Gross as a proxy for profit, since no budget is provided\n",
"# (missing Gross values were already filled with 0 above)\n",
"df['Profit'] = df['Gross']"
],
"metadata": {
"id": "T86eJOlk4wUu"
},
"execution_count": 140,
"outputs": []
},
{
"cell_type": "code",
"source": [
"df['Primary_Genre'] = df['Genre'].apply(lambda x: x.split(',')[0])"
],
"metadata": {
"id": "mSO4l1IrKiDV"
},
"execution_count": 141,
"outputs": []
},
{
"cell_type": "code",
"source": [
"html_content = \"\"\"<html>\n",
"<head>\n",
" <title>IMDb Directors by Genre</title>\n",
" <style>\n",
" body {\n",
" display: flex;\n",
" font-family: Arial, sans-serif;\n",
" }\n",
" #menu {\n",
" width: 20%;\n",
" background-color: #f4f4f4;\n",
" padding: 20px;\n",
" box-shadow: 2px 0 5px rgba(0, 0, 0, 0.1);\n",
" overflow-y: auto;\n",
" height: 100vh;\n",
" }\n",
" #content {\n",
" width: 80%;\n",
" padding: 20px;\n",
" }\n",
" a {\n",
" text-decoration: none;\n",
" color: #333;\n",
" display: block;\n",
" margin-bottom: 10px;\n",
" }\n",
" a:hover {\n",
" color: #007BFF;\n",
" }\n",
" iframe {\n",
" width: 100%;\n",
" height: 80vh;\n",
" border: none;\n",
" }\n",
" </style>\n",
"</head>\n",
"<body>\n",
" <div id=\"menu\">\n",
" <h2>Graphs by Genre</h2>\n",
"\"\"\""
],
"metadata": {
"id": "iqNXGcmEwDkR"
},
"execution_count": 142,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Step 4: Create Visualization with Plotly\n",
"graph_title = 'Distribution of IMDB Ratings'\n",
"graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
"# 1. Distribution of IMDb Ratings\n",
"fig1 = px.histogram(\n",
" df,\n",
" x = 'IMDB_Rating',\n",
" nbins=10,\n",
" title=graph_title,\n",
" labels={'IMDB_Rating': 'IMDB Rating'}\n",
").update_layout(bargap=0.2)\n",
"fig1.write_html(graph_file_name)\n",
"html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\"\n"
],
"metadata": {
"id": "Gx5yIgJV5ekG"
},
"execution_count": 143,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 2. Trend of Average IMDb Ratings Over Years\n",
"graph_title = 'Trend of Average IMDb Ratings Over Years'\n",
"graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
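"# Coerce non-numeric year entries to NaN so the x-axis sorts chronologically\n",
"df['Released_Year'] = pd.to_numeric(df['Released_Year'], errors='coerce')\n",
"average_rating_per_year = df.groupby('Released_Year')['IMDB_Rating'].mean().reset_index()\n",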
"fig2 = px.line(\n",
" average_rating_per_year,\n",
" x='Released_Year',\n",
" y='IMDB_Rating',\n",
" title=graph_title,\n",
" labels={'Released_Year': 'Year', 'IMDB_Rating': 'Average Rating'}\n",
")\n",
"fig2.write_html(graph_file_name)\n",
"html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\""
],
"metadata": {
"id": "UfLoLpno61_u"
},
"execution_count": 144,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 3. Top Genres by Total Gross Revenue\n",
"graph_title = 'Top Genres by Total Gross Revenue'\n",
"graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
"total_gross_by_genre = df.groupby('Primary_Genre')['Gross'].sum().reset_index()\n",
"fig3 = px.bar(\n",
" total_gross_by_genre,\n",
" x='Primary_Genre',\n",
" y='Gross',\n",
" title=graph_title,\n",
" labels={'Primary_Genre': 'Genre', 'Gross': 'Total Gross Revenue'}\n",
").update_layout(xaxis={'categoryorder': 'total descending'})\n",
"fig3.write_html(graph_file_name)\n",
"html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\""
],
"metadata": {
"id": "YBdHwO5-77hr"
},
"execution_count": 145,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 4. Votes vs. IMDb Ratings\n",
"graph_title = 'Votes vs. IMDb Ratings'\n",
"graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
"fig4 = px.scatter(\n",
" df,\n",
" x='No_of_Votes',\n",
" y='IMDB_Rating',\n",
" size='Gross',\n",
" color='Primary_Genre',\n",
" title=graph_title,\n",
" labels={'No_of_Votes': 'Number of Votes', 'IMDB_Rating': 'IMDb Rating'}\n",
")\n",
"fig4.write_html(graph_file_name)\n",
"html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\""
],
"metadata": {
"id": "t6lS9Kj8_BYa"
},
"execution_count": 146,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 5. Most Profitable Movies by genre\n",
"graph_title = 'Most Profitable Movies by genre'\n",
"graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
"most_profitable_movies = df.loc[df.groupby('Primary_Genre')['Profit'].idxmax()]\n",
"fig5 = px.bar(\n",
" most_profitable_movies,\n",
" x='Primary_Genre',\n",
" y='Profit',\n",
" text='Series_Title',\n",
" title=graph_title,\n",
" labels={'Primary_Genre': 'Genre', 'Profit': 'Profit (in $)'}\n",
").update_traces(textposition='outside')\n",
"fig5.write_html(graph_file_name)\n",
"html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\""
],
"metadata": {
"id": "eLGqQn9kBrk_"
},
"execution_count": 147,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 6. Most Profitable Cast\n",
"graph_title = \"Most Profitable Cast\"\n",
"graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
"actor_columns = ['Star1', 'Star2', 'Star3', 'Star4']\n",
"cast_profits = (\n",
" pd.concat(\n",
" [df[['Profit', col]].rename(columns={col: 'Cast'}) for col in actor_columns]\n",
" )\n",
" .groupby('Cast')\n",
" .sum()\n",
" .sort_values(by='Profit', ascending=False)\n",
" .reset_index()\n",
" .head(10)\n",
")\n",
"\n",
"fig6 = px.bar(\n",
" cast_profits, x='Cast',\n",
" y='Profit',\n",
" title=graph_title,\n",
" labels={'Cast': 'Cast Member', 'Profit': 'Total Profit (in $)'}\n",
").update_layout(xaxis={'categoryorder': 'total descending'})\n",
"fig6.write_html(graph_file_name)\n",
"html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\""
],
"metadata": {
"id": "gpFvcfjsE10A"
},
"execution_count": 148,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 7. Most Profitable Directors\n",
"graph_title = \"Most Profitable Directors\"\n",
"graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
"most_profitable_directors = df.groupby('Director').agg(total_profit=('Profit', 'sum')).sort_values(by='total_profit', ascending=False).head(10).reset_index()\n",
"fig7 = px.bar(\n",
" most_profitable_directors,\n",
" x='Director',\n",
" y='total_profit',\n",
" title=graph_title,\n",
" labels={'Director': 'Director', 'total_profit': 'Total Profit (in $)'}\n",
").update_layout(xaxis={'categoryorder': 'total descending'})\n",
"fig7.write_html(graph_file_name)\n",
"html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\""
],
"metadata": {
"id": "bfqNZB8IGRtB"
},
"execution_count": 149,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 8. Top 10 Movies in Each Genre by Ratings\n",
"genres = df['Primary_Genre'].unique()\n",
"top_movies_by_genre = (\n",
" df.sort_values(by=['Primary_Genre', 'IMDB_Rating'], ascending=[True, False])\n",
" .groupby('Primary_Genre')\n",
" .head(10)\n",
")\n",
"\n",
"for genre in genres:\n",
" genre_data = top_movies_by_genre[top_movies_by_genre['Primary_Genre'] == genre].head(10)\n",
" graph_title = f\"Top 10 Movies in {genre} by Ratings\"\n",
" graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
" fig8 = px.bar(\n",
" genre_data,\n",
" x='Primary_Genre',\n",
" y='IMDB_Rating',\n",
" color='Series_Title',\n",
" title=graph_title,\n",
" labels={'Primary_Genre':'Genre', 'IMDB_Rating':'Rating', 'Series_Title':'Movie'},\n",
" text='Series_Title',\n",
" barmode='group'\n",
" ).update_traces(textposition='outside')\n",
" fig8.write_html(graph_file_name)\n",
" html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\""
],
"metadata": {
"id": "Co5GQpKWH8Y8"
},
"execution_count": 150,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 9. Directors with the Highest Average Ratings\n",
"graph_title = \"Directors with the Highest Average Ratings\"\n",
"graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
"director_ratings = df.groupby('Director').agg(\n",
" avg_rating=('IMDB_Rating', 'mean'),\n",
" movie_count=('Series_Title', 'count')\n",
").sort_values(by='avg_rating', ascending=False).head(10).reset_index()\n",
"\n",
"fig9 = px.bar(\n",
" director_ratings,\n",
" x='Director',\n",
" y='avg_rating',\n",
" text='movie_count',\n",
" title=graph_title,\n",
" labels={'Director': 'Director', 'avg_rating':'Average Rating', 'movie_count': 'Movies Directed'}\n",
").update_traces(textposition='outside')\n",
"fig9.write_html(graph_file_name)\n",
"html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\""
],
"metadata": {
"id": "BkhzEpd2J3qN"
},
"execution_count": 151,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 10. Directors with the Highest Average Ratings by Genre\n",
"genres = df['Primary_Genre'].unique()\n",
"director_ratings_by_genre = (\n",
" df.groupby(['Primary_Genre', 'Director'])\n",
" .agg(avg_rating=('IMDB_Rating', 'mean'), movie_count=('Series_Title', 'count'))\n",
" .sort_values(by=['Primary_Genre', 'avg_rating'], ascending=[True, False])\n",
" .reset_index()\n",
")\n",
"\n",
"for genre in genres:\n",
" graph_title = f\"Top Directors in {genre} by Average IMDb Ratings\"\n",
" graph_file_name = f\"{graph_title.lower().replace(' ', '_')}.html\"\n",
" genre_data = director_ratings_by_genre[director_ratings_by_genre['Primary_Genre'] == genre].head(10)\n",
" print(graph_file_name)\n",
" fig10 = px.bar(\n",
" genre_data,\n",
" x='Director',\n",
" y='avg_rating',\n",
" title=graph_title,\n",
" labels={'Director': 'Director', 'avg_rating': 'Average IMDb Rating'},\n",
" text='avg_rating'\n",
" ).update_traces(textposition='outside', marker=dict(line=dict(width=2)))\n",
" fig10.write_html(graph_file_name)\n",
" html_content += f\"\"\"<a href=\"#\" onclick=\"showGraph('{graph_file_name}')\">{graph_title}</a>\"\"\""
],
"metadata": {
"id": "dfM4j9kWPPDi",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "3bc2e6e7-c1df-4169-f1ce-396734add7a5"
},
"execution_count": 152,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"top_directors_in_drama_by_average_imdb_ratings.html\n",
"top_directors_in_crime_by_average_imdb_ratings.html\n",
"top_directors_in_action_by_average_imdb_ratings.html\n",
"top_directors_in_biography_by_average_imdb_ratings.html\n",
"top_directors_in_western_by_average_imdb_ratings.html\n",
"top_directors_in_comedy_by_average_imdb_ratings.html\n",
"top_directors_in_adventure_by_average_imdb_ratings.html\n",
"top_directors_in_animation_by_average_imdb_ratings.html\n",
"top_directors_in_horror_by_average_imdb_ratings.html\n",
"top_directors_in_mystery_by_average_imdb_ratings.html\n",
"top_directors_in_film-noir_by_average_imdb_ratings.html\n",
"top_directors_in_fantasy_by_average_imdb_ratings.html\n",
"top_directors_in_family_by_average_imdb_ratings.html\n",
"top_directors_in_thriller_by_average_imdb_ratings.html\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"html_content += \"\"\"\n",
" </div>\n",
" <div id=\"content\">\n",
" <h2>Graph Viewer</h2>\n",
" <div id=\"loading\">Loading...</div>\n",
" <iframe id=\"graph-container\" onload=\"hideLoading()\"></iframe>\n",
" </div>\n",
"\n",
" <script>\n",
" window.onload = function () {\n",
" window.showGraph = function(graph_filename) {\n",
" document.getElementById('loading').style.display = 'block';\n",
" document.getElementById('graph-container').src = graph_filename;\n",
" };\n",
"\n",
" window.hideLoading = function() {\n",
" document.getElementById('loading').style.display = 'none';\n",
" };\n",
"\n",
" document.getElementById('graph-container').onload = hideLoading;\n",
"};\n",
"\n",
" </script>\n",
"\"\"\"\n",
"\n",
"# Close the HTML document\n",
"html_content += \"</body></html>\"\n",
"\n",
"with open(\"index.html\", 'w') as f:\n",
" f.write(html_content)"
],
"metadata": {
"id": "3ZRBz4hUb_FB"
},
"execution_count": 153,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import os\n",
"import glob\n",
"\n",
"def delete_files(extension):\n",
"    # Directory to clean; /content is the default working directory in Colab\n",
"    directory = \"/content\"\n",
"\n",
"    # Find all files with the specified extension (e.g. \".html\")\n",
"    files_to_delete = glob.glob(os.path.join(directory, f\"*{extension}\"))\n",
"\n",
"    # Loop through and delete each file\n",
"    for file in files_to_delete:\n",
"        try:\n",
"            os.remove(file)\n",
"            print(f\"Deleted: {file}\")\n",
"        except Exception as e:\n",
"            print(f\"Error deleting {file}: {e}\")\n"
],
"metadata": {
"id": "8h2rHHEBW1VK"
},
"execution_count": 154,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# delete_files(\".html\")"
],
"metadata": {
"id": "6CIJPmbvW9i_"
},
"execution_count": 155,
"outputs": []
}
]
}
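
The cells above repeat the same two steps for every graph: derive a file name from the title, then append a sidebar link to `html_content`. Below is a minimal sketch of how that pattern could be factored into helpers, as a first step toward the notebook's TODO. `graph_file_name` and `menu_link` are hypothetical names that do not appear in the notebook, and escaping the link text with `html.escape` is an addition the original cells do not make.

```python
import html


def graph_file_name(graph_title: str) -> str:
    """Derive the output file name from a graph title.

    Same rule as the notebook cells: lowercase the title and
    replace spaces with underscores.
    """
    return f"{graph_title.lower().replace(' ', '_')}.html"


def menu_link(graph_title: str) -> str:
    """Build one sidebar link that loads the graph through showGraph()."""
    file_name = graph_file_name(graph_title)
    # Assumption: escaping the visible text is a precaution added here;
    # the notebook inserts the title as-is.
    return (
        f"""<a href="#" onclick="showGraph('{file_name}')">"""
        f"{html.escape(graph_title)}</a>"
    )
```

With these in place, each plotting cell could end with `fig.write_html(graph_file_name(graph_title))` followed by `html_content += menu_link(graph_title)`, instead of rebuilding the f-strings by hand.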