@DICOT4
Last active September 20, 2024 22:14
Recommendation System
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "V28",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "TPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/DICOT4/f99339dd67ae889b8ca85865d0e543d8/recommendation_engine_movies.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# Recommendation Engine\n",
"\n",
"In this notebook, we will develop a movie recommendation engine using three datasets: [TMDb Movie Metadata](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata), [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset), and the [MovieLens dataset](https://grouplens.org/datasets/movielens/latest/). Together they provide detailed information on movies, including titles, genres, cast, crew, and user ratings.\n",
"\n",
"We will explore two popular approaches for building recommendation systems:\n",
"* Content-based filtering\n",
"* Collaborative filtering\n"
],
"metadata": {
"id": "Fr-YOE1QP-2N"
}
},
{
"cell_type": "code",
"source": [
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "yoUNIw3cqptT",
"outputId": "1ad51b4e-251f-42db-97b0-416ff5e8bfe0"
},
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Mounted at /content/drive\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!ls \"/content/drive/My Drive/movie-dataset/ml-latest-small\""
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "V1ea0AGLwz7-",
"outputId": "5f3fff29-6642-4baf-bcf1-5023ba6681c0"
},
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"links.csv movies.csv ratings.csv README.txt\ttags.csv\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## **Visualisation**"
],
"metadata": {
"id": "jsrPnLu4QKJ-"
}
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "hPl9Z5-xMhbu"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"source": [
"df_credits = pd.read_csv('/content/drive/My Drive/movie-dataset/TMDB-5000-Movie-Dataset/tmdb_5000_credits.csv')\n",
"df_credits.info()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "XUtAVSWTO7X-",
"outputId": "309ba406-e197-4af5-fc23-a3808f2acc97"
},
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 4803 entries, 0 to 4802\n",
"Data columns (total 4 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 movie_id 4803 non-null int64 \n",
" 1 title 4803 non-null object\n",
" 2 cast 4803 non-null object\n",
" 3 crew 4803 non-null object\n",
"dtypes: int64(1), object(3)\n",
"memory usage: 150.2+ KB\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"df_movies = pd.read_csv('/content/drive/My Drive/movie-dataset/TMDB-5000-Movie-Dataset/tmdb_5000_movies.csv')\n",
"df_movies.info()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ilbqRSLJOupl",
"outputId": "9afef70c-d0ad-4634-be71-73e9b706d002"
},
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 4803 entries, 0 to 4802\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 budget 4803 non-null int64 \n",
" 1 genres 4803 non-null object \n",
" 2 homepage 1712 non-null object \n",
" 3 id 4803 non-null int64 \n",
" 4 keywords 4803 non-null object \n",
" 5 original_language 4803 non-null object \n",
" 6 original_title 4803 non-null object \n",
" 7 overview 4800 non-null object \n",
" 8 popularity 4803 non-null float64\n",
" 9 production_companies 4803 non-null object \n",
" 10 production_countries 4803 non-null object \n",
" 11 release_date 4802 non-null object \n",
" 12 revenue 4803 non-null int64 \n",
" 13 runtime 4801 non-null float64\n",
" 14 spoken_languages 4803 non-null object \n",
" 15 status 4803 non-null object \n",
" 16 tagline 3959 non-null object \n",
" 17 title 4803 non-null object \n",
" 18 vote_average 4803 non-null float64\n",
" 19 vote_count 4803 non-null int64 \n",
"dtypes: float64(3), int64(4), object(13)\n",
"memory usage: 750.6+ KB\n"
]
}
]
},
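{
"cell_type": "code",
"source": [
"# Join the two datasets on the 'id' column.\n",
"# The credits columns are renamed so the key matches df_movies['id']; the credits title\n",
"# column becomes 'tittle' (sic), which keeps it from colliding with df_movies['title'].\n",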
"df_credits.columns = ['id', 'tittle', 'cast', 'crew']\n",
"df_movies = df_movies.merge(df_credits, on='id')\n",
"df_movies.info()"
],
"metadata": {
"id": "VFXD5sLFBptu",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "38e6cf6e-a82f-4949-ea88-00e1fc006cdb"
},
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 4803 entries, 0 to 4802\n",
"Data columns (total 23 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 budget 4803 non-null int64 \n",
" 1 genres 4803 non-null object \n",
" 2 homepage 1712 non-null object \n",
" 3 id 4803 non-null int64 \n",
" 4 keywords 4803 non-null object \n",
" 5 original_language 4803 non-null object \n",
" 6 original_title 4803 non-null object \n",
" 7 overview 4800 non-null object \n",
" 8 popularity 4803 non-null float64\n",
" 9 production_companies 4803 non-null object \n",
" 10 production_countries 4803 non-null object \n",
" 11 release_date 4802 non-null object \n",
" 12 revenue 4803 non-null int64 \n",
" 13 runtime 4801 non-null float64\n",
" 14 spoken_languages 4803 non-null object \n",
" 15 status 4803 non-null object \n",
" 16 tagline 3959 non-null object \n",
" 17 title 4803 non-null object \n",
" 18 vote_average 4803 non-null float64\n",
" 19 vote_count 4803 non-null int64 \n",
" 20 tittle 4803 non-null object \n",
" 21 cast 4803 non-null object \n",
" 22 crew 4803 non-null object \n",
"dtypes: float64(3), int64(4), object(16)\n",
"memory usage: 863.2+ KB\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## **Content-based filtering**\n",
"\n",
"By analyzing metadata such as the plot, cast, director, and keywords, the system recommends movies whose features are similar to those of a given movie.\n"
],
"metadata": {
"id": "puRGS5GWPd2A"
}
},
{
"cell_type": "markdown",
"source": [
"### Plot-based recommendations\n",
"We will calculate a similarity score for each pair of movies from their plot descriptions and recommend the highest-scoring matches. The plot descriptions are stored in the `overview` column of our dataset. Let's examine the data."
],
"metadata": {
"id": "7Eo0_0urQleu"
}
},
{
"cell_type": "code",
"source": [
"df_movies['overview'].head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 241
},
"id": "gf8N3DUIPLt4",
"outputId": "6b4137d1-acd6-4eb4-b9bc-2c4b095abf90"
},
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 In the 22nd century, a paraplegic Marine is di...\n",
"1 Captain Barbossa, long believed to be dead, ha...\n",
"2 A cryptic message from Bond’s past sends him o...\n",
"3 Following the death of District Attorney Harve...\n",
"4 John Carter is a war-weary, former military ca...\n",
"Name: overview, dtype: object"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>overview</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>In the 22nd century, a paraplegic Marine is di...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Captain Barbossa, long believed to be dead, ha...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A cryptic message from Bond’s past sends him o...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Following the death of District Attorney Harve...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>John Carter is a war-weary, former military ca...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div><br><label><b>dtype:</b> object</label>"
]
},
"metadata": {},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
"source": [
"Before we can compare overviews, each document's text must be converted into a numerical representation. One common method is to compute `Term Frequency-Inverse Document Frequency (TF-IDF)` vectors.\n",
"\n",
"`Term Frequency (TF)` measures how often a word appears in a document, calculated as the number of occurrences of the term divided by the total number of terms in the document.\n",
"\n",
"`Inverse Document Frequency (IDF)` quantifies how rare a word is across the collection. It is calculated as the logarithm of the total number of documents divided by the number of documents containing that term. The combined measure, TF-IDF, is the product of the two: it is high for words that are frequent in one document but rare elsewhere, and low for words that occur in most documents.\n",
"\n",
"The result is a matrix where each row represents a document and each column corresponds to a word in the vocabulary *(i.e., the words that appear in at least one document)*.\n",
"\n",
"Scikit-learn provides a built-in `TfidfVectorizer` class that builds this matrix for us. Its default weighting differs slightly from the textbook definition above (it uses raw term counts, a smoothed IDF, and L2-normalizes each row), but the intuition is the same."
],
"metadata": {
"id": "IzUypCOF38C3"
}
},
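{
"cell_type": "markdown",
"source": [
"As a rough illustration of the TF and IDF definitions, the next cell computes the weights by hand for a three-sentence toy corpus. The corpus and the helper name `toy_tfidf` are made up for this sketch, and it uses the plain textbook formulas with a natural logarithm, so the numbers will not match the output of scikit-learn's `TfidfVectorizer` exactly."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import math\n",
"from collections import Counter\n",
"\n",
"# Hypothetical toy corpus (not taken from the TMDb data).\n",
"toy_docs = [\n",
"    'space marine explores alien moon',\n",
"    'pirate captain sails ocean',\n",
"    'space pirate captain explores space',\n",
"]\n",
"\n",
"# Textbook TF-IDF: tf = count / document length, idf = ln(N / document frequency).\n",
"def toy_tfidf(docs):\n",
"    tokenized = [doc.split() for doc in docs]\n",
"    n_docs = len(tokenized)\n",
"    # Document frequency: the number of documents that contain each term.\n",
"    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))\n",
"    weights = []\n",
"    for tokens in tokenized:\n",
"        counts = Counter(tokens)\n",
"        weights.append({\n",
"            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])\n",
"            for term, count in counts.items()\n",
"        })\n",
"    return weights\n",
"\n",
"toy_weights = toy_tfidf(toy_docs)\n",
"\n",
"# 'alien' occurs in one document only, so it outweighs 'space', which occurs in two.\n",
"print(toy_weights[0])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},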
{
"cell_type": "code",
"source": [
"# Import the TfidfVectorizer class from scikit-learn\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# Instantiate a TF-IDF Vectorizer object, removing common English stop words such as 'the' and 'a'\n",
"tfidf = TfidfVectorizer(stop_words='english')\n",
"\n",
"# Replace any missing values (NaN) in the 'overview' column with an empty string\n",
"df_movies['overview'] = df_movies['overview'].fillna('')\n",
"\n",
"# Create the TF-IDF matrix by fitting and transforming the text data from the 'overview' column\n",
"tfidf_matrix = tfidf.fit_transform(df_movies['overview'])\n",
"\n",
"# Display the dimensions of the resulting TF-IDF matrix\n",
"tfidf_matrix.shape\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "bnfr2Sdk4SZB",
"outputId": "c640a0e6-17a9-4f03-c179-8d315e975a09"
},
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(4803, 20978)"
]
},
"metadata": {},
"execution_count": 8
}
]
},
{
"cell_type": "markdown",
"source": [
"The vocabulary contains 20,978 distinct terms (after removing English stop words) across the 4,803 movie overviews.\n",
"\n",
"With the `TF-IDF` matrix ready, we can now compute similarity scores between these movies. There are several possible approaches to calculating similarity, including Euclidean distance, Pearson correlation, and cosine similarity. No single metric is universally the best; different similarity measures perform better in different contexts, so experimenting with multiple methods can be beneficial.\n",
"\n",
"For this task, we will use cosine similarity, which measures the cosine of the angle between two vectors. It depends only on the direction of the vectors, not on their magnitude, and it is cheap to compute.\n",
"\n",
"The formula for **cosine similarity** between two vectors $\\mathbf{A}$ and $\\mathbf{B}$ is:\n",
"\n",
"$$\n",
"\\text{similarity} = \\cos(\\theta) = \\frac{\\mathbf{A} \\cdot \\mathbf{B}}{\\|\\mathbf{A}\\| \\|\\mathbf{B}\\|} = \\frac{\\sum_{i=1}^{n} A_i B_i}{\\sqrt{\\sum_{i=1}^{n} A_i^2} \\cdot \\sqrt{\\sum_{i=1}^{n} B_i^2}}\n",
"$$\n",
"\n",
"Where:\n",
"- $\\mathbf{A} \\cdot \\mathbf{B}$ represents the dot product of the vectors $\\mathbf{A}$ and $\\mathbf{B}$.\n",
"- $\\|\\mathbf{A}\\|$ and $\\|\\mathbf{B}\\|$ are the magnitudes of vectors $\\mathbf{A}$ and $\\mathbf{B}$, respectively.\n",
"- $A_i$ and $B_i$ represent the components of the vectors at position $i$.\n",
"- $n$ is the number of dimensions or features in the vectors.\n",
"\n",
"`TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so every vector has unit length and the dot product of two rows is already their cosine similarity. We will therefore use scikit-learn's `linear_kernel()`, which computes plain dot products, instead of `cosine_similarity()`, which would normalize the rows again first.\n"
],
"metadata": {
"id": "3KY6Zjoo4453"
}
},
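{
"cell_type": "markdown",
"source": [
"The dot product equals the cosine similarity only when both vectors have unit length, which holds for rows produced by `TfidfVectorizer` with its default `norm='l2'` setting. The cell below checks this on two small vectors using only the standard library; the vectors and the helper names `dot`, `norm`, and `l2_normalize` are made up for illustration."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import math\n",
"\n",
"def dot(a, b):\n",
"    return sum(x * y for x, y in zip(a, b))\n",
"\n",
"def norm(a):\n",
"    return math.sqrt(dot(a, a))\n",
"\n",
"def l2_normalize(a):\n",
"    length = norm(a)\n",
"    return [x / length for x in a]\n",
"\n",
"# Hypothetical TF-IDF-like vectors before normalization.\n",
"vec_a = [1.0, 2.0, 0.0]\n",
"vec_b = [2.0, 1.0, 2.0]\n",
"\n",
"# Cosine similarity computed from the formula above.\n",
"cosine = dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))\n",
"\n",
"# Plain dot product after scaling both vectors to unit length.\n",
"dot_of_unit_vectors = dot(l2_normalize(vec_a), l2_normalize(vec_b))\n",
"\n",
"# The two values agree up to floating-point rounding.\n",
"print(cosine, dot_of_unit_vectors)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},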
{
"cell_type": "code",
"source": [
"# Import the linear_kernel function from Scikit-learn\n",
"from sklearn.metrics.pairwise import linear_kernel\n",
"\n",
"# Calculate the cosine similarity matrix using the dot product of the TF-IDF matrix\n",
"cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)\n"
],
"metadata": {
"id": "EFmEQ6BR6RDt"
},
"execution_count": 9,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We will define a function that accepts a movie title as input and returns a list of the *10 most similar* movies. To achieve this, we first need to create a reverse mapping between movie titles and their corresponding DataFrame indices. This will allow us to efficiently retrieve the index of a movie based on its title within the metadata DataFrame."
],
"metadata": {
"id": "st0_MXp3-H80"
}
},
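{
"cell_type": "code",
"source": [
"# Create a reverse mapping of movie titles to their corresponding row indices.\n",
"indices = pd.Series(df_movies.index, index=df_movies['title'])\n",
"# Keep one row per title so indices[title] is a single integer; drop_duplicates() would compare the (already unique) values, not the titles.\n",
"indices = indices[~indices.index.duplicated(keep='first')]"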
],
"metadata": {
"id": "P3C1BIti9--0"
},
"execution_count": 10,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We are now ready to define our recommendation function, following these steps:\n",
"\n",
"1. Retrieve the index of the movie based on its title.\n",
"2. Obtain the list of cosine similarity scores between the selected movie and all other movies. Convert this into a list of tuples, where the first element represents the movie index, and the second represents the similarity score.\n",
"3. Sort the list of tuples in descending order based on the similarity scores *(the second element)*.\n",
"4. Select the top 10 entries from the sorted list. Ignore the first entry, as it refers to the movie itself *(the highest similarity will be with the movie itself)*.\n",
"5. Return the titles corresponding to the indices of the top entries."
],
"metadata": {
"id": "q7AKLfmQ-hy8"
}
},
{
"cell_type": "code",
"source": [
"# Function that takes a movie title as input and returns the most similar movies\n",
"def get_recommendations(title, cosine_sim=cosine_sim):\n",
"    # Retrieve the index of the movie that matches the provided title\n",
"    idx = indices[title]\n",
"\n",
"    # Get the similarity scores for all movies compared to the selected movie\n",
"    sim_scores = list(enumerate(cosine_sim[idx]))\n",
"\n",
"    # Sort the similarity scores in descending order\n",
"    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)\n",
"\n",
"    # Select the top 10 most similar movies, excluding the first (which is the movie itself)\n",
"    sim_scores = sim_scores[1:11]\n",
"\n",
"    # Extract the movie indices of the top 10 similar movies\n",
"    movie_indices = [i[0] for i in sim_scores]\n",
"\n",
"    # Return the titles of the top 10 most similar movies\n",
"    return df_movies['title'].iloc[movie_indices]\n"
],
"metadata": {
"id": "7znrd_s3-1vs"
},
"execution_count": 11,
"outputs": []
},
{
"cell_type": "code",
"source": [
"def get_recommendations_with_score(title, cosine_sim=cosine_sim):\n",
"    # Check if the title exists in the indices\n",
"    if title not in indices:\n",
"        print(f'\"{title}\" not found in the dataset.')\n",
"        return None\n",
"\n",
"    # Retrieve the index of the movie that matches the provided title\n",
"    idx = indices[title]\n",
"\n",
"    # Get the similarity scores for all movies compared to the selected movie\n",
"    sim_scores = list(enumerate(cosine_sim[idx]))\n",
"\n",
"    # Sort the similarity scores in descending order\n",
"    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)\n",
"\n",
"    # Select the top 10 most similar movies, excluding the first (which is the movie itself)\n",
"    sim_scores = sim_scores[1:11]\n",
"\n",
"    # Extract the movie indices and similarity scores\n",
"    movie_indices = [i[0] for i in sim_scores]\n",
"    similarity_scores = [i[1] for i in sim_scores]\n",
"\n",
"    # Get the movie titles\n",
"    movie_titles = df_movies['title'].iloc[movie_indices].values\n",
"\n",
"    # Create a DataFrame with titles and similarity scores\n",
"    recommendations = pd.DataFrame({\n",
"        'title': movie_titles,\n",
"        'similarity_score': similarity_scores\n",
"    })\n",
"\n",
"    return recommendations"
],
"metadata": {
"id": "Vt2tM5o7DbtH"
},
"execution_count": 12,
"outputs": []
},
{
"cell_type": "code",
"source": [
"get_recommendations(\"Pirates of the Caribbean: At World's End\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 398
},
"id": "icxaN4h6_Unj",
"outputId": "041263c1-3e91-4671-c2b7-497c7a20ed42"
},
"execution_count": 13,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"2542 What's Love Got to Do with It\n",
"3095 My Blueberry Nights\n",
"2102 The Descendants\n",
"1280 Disturbia\n",
"3632 90 Minutes in Heaven\n",
"792 Just Like Heaven\n",
"1709 Space Pirate Captain Harlock\n",
"1799 Original Sin\n",
"2652 Bathory: Countess of Blood\n",
"4423 Bang Bang Baby\n",
"Name: title, dtype: object"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2542</th>\n",
" <td>What's Love Got to Do with It</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3095</th>\n",
" <td>My Blueberry Nights</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2102</th>\n",
" <td>The Descendants</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1280</th>\n",
" <td>Disturbia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3632</th>\n",
" <td>90 Minutes in Heaven</td>\n",
" </tr>\n",
" <tr>\n",
" <th>792</th>\n",
" <td>Just Like Heaven</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1709</th>\n",
" <td>Space Pirate Captain Harlock</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1799</th>\n",
" <td>Original Sin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2652</th>\n",
" <td>Bathory: Countess of Blood</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4423</th>\n",
" <td>Bang Bang Baby</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div><br><label><b>dtype:</b> object</label>"
]
},
"metadata": {},
"execution_count": 13
}
]
},
{
"cell_type": "code",
"source": [
"get_recommendations('Snowpiercer')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 398
},
"id": "EkT493qV_uGe",
"outputId": "87534388-9db4-4b99-c43e-5ac0104e487b"
},
"execution_count": 14,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"4350 An Inconvenient Truth\n",
"1643 Howard the Duck\n",
"4710 Antarctic Edge: 70° South\n",
"4427 Charly\n",
"3840 Train\n",
"2410 Good Boy!\n",
"2768 21 & Over\n",
"16 The Avengers\n",
"1704 The Big Short\n",
"3330 The Wave\n",
"Name: title, dtype: object"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4350</th>\n",
" <td>An Inconvenient Truth</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1643</th>\n",
" <td>Howard the Duck</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4710</th>\n",
" <td>Antarctic Edge: 70° South</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4427</th>\n",
" <td>Charly</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3840</th>\n",
" <td>Train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2410</th>\n",
" <td>Good Boy!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2768</th>\n",
" <td>21 &amp; Over</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>The Avengers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1704</th>\n",
" <td>The Big Short</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3330</th>\n",
" <td>The Wave</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div><br><label><b>dtype:</b> object</label>"
]
},
"metadata": {},
"execution_count": 14
}
]
},
{
"cell_type": "markdown",
"source": [
"Although the system finds movies whose overviews share vocabulary with the query, the recommendations are not always what a viewer would want. For example, the results for *Pirates of the Caribbean: At World's End* are chosen purely on plot wording, whereas people who enjoyed that movie may be more interested in other films by the same director. The current system cannot capture this."
],
"metadata": {
"id": "EdD_9g9Y_-wU"
}
},
{
"cell_type": "markdown",
"source": [
"### Credits-, genre-, and keyword-based recommendations\n",
"\n",
"Richer metadata should improve the quality of the recommendations, and that is what this section addresses. We will build a recommendation system based on the following metadata: the top 3 actors, the director, related genres, and the movie plot keywords.\n",
"\n",
"From the cast, crew, and keyword features, we will extract the three most prominent actors, the director, and the relevant keywords for each movie. Currently, this data is stored as \"stringified\" lists, and we need to convert it into a structured and usable format."
],
"metadata": {
"id": "GAzkSOVbAd1i"
}
},
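{
"cell_type": "markdown",
"source": [
"To make the stringified format concrete, the cell below parses one such value with `ast.literal_eval` and pulls out the `name` fields. The sample string is a shortened, hypothetical value in the shape of a `genres` entry, not a row copied from the dataset, and the variable names are illustrative only."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from ast import literal_eval\n",
"\n",
"# Hypothetical value in the shape of a 'genres' cell: a string, not a list.\n",
"raw_value = '[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"name\": \"Adventure\"}]'\n",
"\n",
"# literal_eval safely evaluates the string as a Python literal (here, a list of dicts).\n",
"parsed = literal_eval(raw_value)\n",
"\n",
"# Once parsed, the names can be read like any other list of dictionaries.\n",
"names = [item['name'] for item in parsed]\n",
"\n",
"print(type(raw_value).__name__, '->', type(parsed).__name__)\n",
"print(names)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},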
{
"cell_type": "code",
"source": [
"# Convert the stringified features into their respective Python objects.\n",
"from ast import literal_eval\n",
"\n",
"features = ['crew', 'cast', 'genres', 'keywords']\n",
"for feature in features:\n",
"    df_movies[feature] = df_movies[feature].apply(literal_eval)"
],
"metadata": {
"id": "gw4R6RXRAZvu"
},
"execution_count": 15,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Next up, we will define functions to extract the necessary information from each feature."
],
"metadata": {
"id": "PcW_i3sACUB-"
}
},
{
"cell_type": "code",
"source": [
"# Extract the director's name from the 'crew' feature. If the director is not listed, return NaN.\n",
"def get_director_from_crew(crew_list):\n",
"    for member in crew_list:\n",
"        if member['job'] == 'Director':\n",
"            return member['name']\n",
"    return np.nan\n"
],
"metadata": {
"id": "qovVTdCyBUpo"
},
"execution_count": 16,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Return the top 3 elements from the list or the entire list if fewer than 3 elements are present.\n",
"def get_list(feature_list):\n",
"    if isinstance(feature_list, list):\n",
"        names = [element['name'] for element in feature_list]\n",
"\n",
"        # If the list contains more than 3 elements, return only the first three; otherwise, return the full list.\n",
"        if len(names) > 3:\n",
"            return names[:3]\n",
"        return names\n",
"\n",
"    # Return an empty list if the data is missing or malformed.\n",
"    return []\n"
],
"metadata": {
"id": "p0s2-JuFCvBX"
},
"execution_count": 17,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Create new features for director, cast, genres, and keywords in a more usable format.\n",
"df_movies['director'] = df_movies['crew'].apply(get_director_from_crew)\n",
"\n",
"# Apply the 'get_list' function to the 'cast', 'genres', and 'keywords' features to extract relevant information.\n",
"features = ['cast', 'genres', 'keywords']\n",
"for feature in features:\n",
"    df_movies[feature] = df_movies[feature].apply(get_list)\n"
],
"metadata": {
"id": "O8cG-J8JC902"
},
"execution_count": 18,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Let's see the new features\n",
"df_movies[['title', 'cast', 'director', 'genres', 'keywords']].head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 293
},
"id": "sB32qIuuDeYc",
"outputId": "619ec9e2-f173-4dd8-f3ec-12bb4514317e"
},
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" title \\\n",
"0 Avatar \n",
"1 Pirates of the Caribbean: At World's End \n",
"2 Spectre \n",
"3 The Dark Knight Rises \n",
"4 John Carter \n",
"\n",
" cast director \\\n",
"0 [Sam Worthington, Zoe Saldana, Sigourney Weaver] James Cameron \n",
"1 [Johnny Depp, Orlando Bloom, Keira Knightley] Gore Verbinski \n",
"2 [Daniel Craig, Christoph Waltz, Léa Seydoux] Sam Mendes \n",
"3 [Christian Bale, Michael Caine, Gary Oldman] Christopher Nolan \n",
"4 [Taylor Kitsch, Lynn Collins, Samantha Morton] Andrew Stanton \n",
"\n",
" genres keywords \n",
"0 [Action, Adventure, Fantasy] [culture clash, future, space war] \n",
"1 [Adventure, Fantasy, Action] [ocean, drug abuse, exotic island] \n",
"2 [Action, Adventure, Crime] [spy, based on novel, secret agent] \n",
"3 [Action, Crime, Drama] [dc comics, crime fighter, terrorist] \n",
"4 [Action, Adventure, Science Fiction] [based on novel, mars, medallion] "
],
"text/html": [
"\n",
" <div id=\"df-eeb80354-aa47-4b8a-8718-425dc5dd0b5e\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>cast</th>\n",
" <th>director</th>\n",
" <th>genres</th>\n",
" <th>keywords</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Avatar</td>\n",
" <td>[Sam Worthington, Zoe Saldana, Sigourney Weaver]</td>\n",
" <td>James Cameron</td>\n",
" <td>[Action, Adventure, Fantasy]</td>\n",
" <td>[culture clash, future, space war]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Pirates of the Caribbean: At World's End</td>\n",
" <td>[Johnny Depp, Orlando Bloom, Keira Knightley]</td>\n",
" <td>Gore Verbinski</td>\n",
" <td>[Adventure, Fantasy, Action]</td>\n",
" <td>[ocean, drug abuse, exotic island]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Spectre</td>\n",
" <td>[Daniel Craig, Christoph Waltz, Léa Seydoux]</td>\n",
" <td>Sam Mendes</td>\n",
" <td>[Action, Adventure, Crime]</td>\n",
" <td>[spy, based on novel, secret agent]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>The Dark Knight Rises</td>\n",
" <td>[Christian Bale, Michael Caine, Gary Oldman]</td>\n",
" <td>Christopher Nolan</td>\n",
" <td>[Action, Crime, Drama]</td>\n",
" <td>[dc comics, crime fighter, terrorist]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>John Carter</td>\n",
" <td>[Taylor Kitsch, Lynn Collins, Samantha Morton]</td>\n",
" <td>Andrew Stanton</td>\n",
" <td>[Action, Adventure, Science Fiction]</td>\n",
" <td>[based on novel, mars, medallion]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-eeb80354-aa47-4b8a-8718-425dc5dd0b5e')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-eeb80354-aa47-4b8a-8718-425dc5dd0b5e button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-eeb80354-aa47-4b8a-8718-425dc5dd0b5e');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-fa1bfa7f-e090-4944-b2c7-ca26dce74112\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-fa1bfa7f-e090-4944-b2c7-ca26dce74112')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-fa1bfa7f-e090-4944-b2c7-ca26dce74112 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df_movies[['title', 'cast', 'director', 'genres', 'keywords']]\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Pirates of the Caribbean: At World's End\",\n \"John Carter\",\n \"Spectre\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cast\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"director\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Gore Verbinski\",\n \"Andrew Stanton\",\n \"Sam Mendes\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"genres\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"keywords\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 19
}
]
},
{
"cell_type": "markdown",
"source": [
"Next, we convert the names and keywords to lowercase and remove the spaces inside them, so that each name becomes a single token (for example, *\\\"Johnny Depp\\\"* becomes `johnnydepp`). Otherwise the vectorizer would split names into separate words and count the shared token `johnny` as a match between *\\\"Johnny Depp\\\"* and *\\\"Johnny Galecki\\\"*."
],
"metadata": {
"id": "7ES_5YLiEiGe"
}
},
{
"cell_type": "code",
"source": [
"# Lowercase strings and strip spaces so that each name becomes a single token\n",
"def clean_data(value):\n",
"    if isinstance(value, list):\n",
"        return [item.replace(\" \", \"\").lower() for item in value]\n",
"    # A plain string (e.g., the director's name) is cleaned the same way\n",
"    if isinstance(value, str):\n",
"        return value.replace(\" \", \"\").lower()\n",
"    # Missing or malformed values (e.g., NaN) become an empty string\n",
"    return ''\n"
],
"metadata": {
"id": "twEho5bNEtNn"
},
"execution_count": 20,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Apply the clean_data function to the relevant features (cast, keywords, director, and genres).\n",
"features = ['cast', 'director', 'genres', 'keywords']\n",
"\n",
"for feature in features:\n",
" df_movies[feature] = df_movies[feature].apply(clean_data)\n"
],
"metadata": {
"id": "ADaQaekWFeWR"
},
"execution_count": 21,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We are now ready to create our *metadata summary*: a single string that joins all the relevant metadata (cast, director, genres, and keywords) and serves as the input for our vectorizer."
],
"metadata": {
"id": "oZ1R2jhMF09f"
}
},
{
"cell_type": "code",
"source": [
"# Function to create a metadata summary by combining cast, director, genres, and keywords into a single string\n",
"def create_metadata_summary(row):\n",
" return ' '.join(row['cast']) + ' ' + row['director'] + ' ' + ' '.join(row['genres']) + ' ' + ' '.join(row['keywords'])\n",
"\n",
"# Apply the create_metadata_summary function to generate the summary for each movie\n",
"df_movies['meta_summary'] = df_movies.apply(create_metadata_summary, axis=1)\n"
],
"metadata": {
"id": "A3m7nEJVGFpy"
},
"execution_count": 22,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The next steps mirror the plot-based recommender. The key difference is that we use `CountVectorizer()` instead of TF-IDF weighting, because we do not want to down-weight an actor or director simply because they appear in many movies. Here, keeping every credit at full weight makes more intuitive sense."
],
"metadata": {
"id": "441AzSi2Hiaj"
}
},
{
"cell_type": "code",
"source": [
"# Import CountVectorizer and create a count matrix based on the metadata summary\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# Initialize CountVectorizer and apply it to the 'meta_summary' column, excluding common English stop words\n",
"count = CountVectorizer(stop_words='english')\n",
"count_matrix = count.fit_transform(df_movies['meta_summary'])\n"
],
"metadata": {
"id": "gXDbw9AzIT-L"
},
"execution_count": 23,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Compute the cosine similarity matrix based on the count matrix\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"# Calculate the cosine similarity between all movies using the count matrix\n",
"cosine_sim2 = cosine_similarity(count_matrix, count_matrix)\n"
],
"metadata": {
"id": "m__Da0ZUIizZ"
},
"execution_count": 24,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Reset the index of the main DataFrame and create a reverse mapping from movie titles to their indices\n",
"df_movies = df_movies.reset_index()\n",
"\n",
"# Create a reverse mapping where the index is the movie title and the value is the corresponding DataFrame index\n",
"indices = pd.Series(df_movies.index, index=df_movies['title'])\n"
],
"metadata": {
"id": "6zkcrsnKIsZ7"
},
"execution_count": 25,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We can now reuse the `get_recommendations_with_score()` function by passing the newly created `cosine_sim2` matrix as the second argument."
],
"metadata": {
"id": "LFZcgj-yI3hX"
}
},
{
"cell_type": "code",
"source": [
"get_recommendations_with_score('Iron Man', cosine_sim2)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
},
"id": "gNf42u8HI-PF",
"outputId": "d1df527d-2c53-4aad-927d-9a8ea522612d"
},
"execution_count": 26,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" title similarity_score\n",
"0 Iron Man 2 0.600000\n",
"1 Avengers: Age of Ultron 0.400000\n",
"2 The Avengers 0.400000\n",
"3 Captain America: Civil War 0.400000\n",
"4 Iron Man 3 0.400000\n",
"5 TRON: Legacy 0.400000\n",
"6 The Helix... Loaded 0.365148\n",
"7 The Lovers 0.358569\n",
"8 After Earth 0.335410\n",
"9 Six-String Samurai 0.335410"
],
"text/html": [
"\n",
" <div id=\"df-34ed8997-e3d5-4ccd-ae57-253c1f5b6a5e\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>similarity_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Iron Man 2</td>\n",
" <td>0.600000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Avengers: Age of Ultron</td>\n",
" <td>0.400000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>The Avengers</td>\n",
" <td>0.400000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Captain America: Civil War</td>\n",
" <td>0.400000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Iron Man 3</td>\n",
" <td>0.400000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>TRON: Legacy</td>\n",
" <td>0.400000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>The Helix... Loaded</td>\n",
" <td>0.365148</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>The Lovers</td>\n",
" <td>0.358569</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>After Earth</td>\n",
" <td>0.335410</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Six-String Samurai</td>\n",
" <td>0.335410</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-34ed8997-e3d5-4ccd-ae57-253c1f5b6a5e')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-34ed8997-e3d5-4ccd-ae57-253c1f5b6a5e button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-34ed8997-e3d5-4ccd-ae57-253c1f5b6a5e');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-7ce19fe2-d47e-4486-8be8-916fb7b34517\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-7ce19fe2-d47e-4486-8be8-916fb7b34517')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-7ce19fe2-d47e-4486-8be8-916fb7b34517 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"get_recommendations_with_score('Iron Man', cosine_sim2)\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"After Earth\",\n \"Avengers: Age of Ultron\",\n \"TRON: Legacy\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"similarity_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.07547766385676735,\n \"min\": 0.33541019662496846,\n \"max\": 0.6,\n \"num_unique_values\": 5,\n \"samples\": [\n 0.4,\n 0.33541019662496846,\n 0.3651483716701108\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 26
}
]
},
{
"cell_type": "code",
"source": [
"get_recommendations_with_score('Superman', cosine_sim2)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
},
"id": "F58gWgSKJE0z",
"outputId": "a897fe91-127e-4bea-ca9d-5529b1cb821a"
},
"execution_count": 27,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" title similarity_score\n",
"0 Superman II 0.600000\n",
"1 Superman IV: The Quest for Peace 0.572078\n",
"2 Superman Returns 0.500000\n",
"3 Man of Steel 0.500000\n",
"4 Superman III 0.500000\n",
"5 Batman v Superman: Dawn of Justice 0.400000\n",
"6 The Mummy: Tomb of the Dragon Emperor 0.358569\n",
"7 The Monkey King 2 0.335410\n",
"8 Indiana Jones and the Kingdom of the Crystal S... 0.316228\n",
"9 The Sorcerer's Apprentice 0.316228"
],
"text/html": [
"\n",
" <div id=\"df-b725bc64-5335-414d-87ab-b997a4932eec\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>similarity_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Superman II</td>\n",
" <td>0.600000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Superman IV: The Quest for Peace</td>\n",
" <td>0.572078</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Superman Returns</td>\n",
" <td>0.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Man of Steel</td>\n",
" <td>0.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Superman III</td>\n",
" <td>0.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Batman v Superman: Dawn of Justice</td>\n",
" <td>0.400000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>The Mummy: Tomb of the Dragon Emperor</td>\n",
" <td>0.358569</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>The Monkey King 2</td>\n",
" <td>0.335410</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Indiana Jones and the Kingdom of the Crystal S...</td>\n",
" <td>0.316228</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>The Sorcerer's Apprentice</td>\n",
" <td>0.316228</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b725bc64-5335-414d-87ab-b997a4932eec')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-b725bc64-5335-414d-87ab-b997a4932eec button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-b725bc64-5335-414d-87ab-b997a4932eec');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-e5831da0-50ce-4667-a837-747c6908e75c\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-e5831da0-50ce-4667-a837-747c6908e75c')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-e5831da0-50ce-4667-a837-747c6908e75c button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"get_recommendations_with_score('Superman', cosine_sim2)\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Indiana Jones and the Kingdom of the Crystal Skull\",\n \"Superman IV: The Quest for Peace\",\n \"Batman v Superman: Dawn of Justice\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"similarity_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.10731754189425172,\n \"min\": 0.31622776601683794,\n \"max\": 0.6,\n \"num_unique_values\": 7,\n \"samples\": [\n 0.6,\n 0.5720775535473555,\n 0.33541019662496846\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 27
}
]
},
{
"cell_type": "markdown",
"source": [
"By incorporating the extra metadata, our recommender captures more information and gives *(arguably)* better recommendations. It can still be improved. Fans of Marvel or DC comics are likely to prefer movies from the same production house, so we could add `production_company` to the features. We could also increase the director's influence by including the `director` feature several times in the metadata summary."
],
"metadata": {
"id": "vBCDDhflJ1R7"
}
},
{
"cell_type": "markdown",
"source": [
"## Collaborative Filtering\n",
"\n",
"Collaborative filtering builds a recommendation model from user-movie interaction data. Instead of comparing movie attributes, it uses the ratings that users have given to various movies to recommend titles that match each user's tastes."
],
"metadata": {
"id": "YkLGRsKBJ_5Z"
}
},
{
"cell_type": "markdown",
"source": [
"The dataset we have used so far has no `userId` column *(which collaborative filtering requires)*, so let's load the ratings from the MovieLens dataset (`ml-latest-small`). We'll use the `Surprise` library to implement `SVD`."
],
"metadata": {
"id": "DQPopPiEOvas"
}
},
{
"cell_type": "code",
"source": [
"!pip install scikit-surprise"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_Sfl5kGkZRAE",
"outputId": "1e669e3a-e7b6-4b8c-e56a-8e7247f792bc"
},
"execution_count": 28,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Collecting scikit-surprise\n",
"  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)\n",
"     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.4/154.4 kB 1.4 MB/s eta 0:00:00\n",
"  Installing build dependencies ... done\n",
"  Getting requirements to build wheel ... done\n",
"  Preparing metadata (pyproject.toml) ... done\n",
"Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise) (1.4.2)\n",
"Requirement already satisfied: numpy>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise) (1.26.4)\n",
"Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise) (1.13.1)\n",
"Building wheels for collected packages: scikit-surprise\n",
"  Building wheel for scikit-surprise (pyproject.toml) ... done\n",
" Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357286 sha256=d1b7065f3ca8aa9b20568fb4277f8402b961ad7860d371c4716b7bd4c712062d\n",
" Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54\n",
"Successfully built scikit-surprise\n",
"Installing collected packages: scikit-surprise\n",
"Successfully installed scikit-surprise-1.1.4\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"from surprise import Reader, Dataset, SVD\n",
"from surprise.model_selection import train_test_split\n",
"from surprise import accuracy\n",
"from collections import defaultdict"
],
"metadata": {
"id": "ZZA7DkgJ40Os"
},
"execution_count": 29,
"outputs": []
},
{
"cell_type": "code",
"source": [
"ratings = pd.read_csv('/content/drive/My Drive/movie-dataset/ml-latest-small/ratings.csv')\n",
"ratings.info()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cMJ36e5hOsKR",
"outputId": "783b9960-f92c-495b-94a9-04f79cb77a25"
},
"execution_count": 30,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 100836 entries, 0 to 100835\n",
"Data columns (total 4 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 userId 100836 non-null int64 \n",
" 1 movieId 100836 non-null int64 \n",
" 2 rating 100836 non-null float64\n",
" 3 timestamp 100836 non-null int64 \n",
"dtypes: float64(1), int64(3)\n",
"memory usage: 3.1 MB\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"reader = Reader(rating_scale=(0, 5)) # Define the rating scale; 0 min and 5 max rating\n",
"ratings_data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)"
],
"metadata": {
"id": "VVRQ4gD0bu1u"
},
"execution_count": 31,
"outputs": []
},
{
"cell_type": "code",
"source": [
"trainset, testset = train_test_split(ratings_data, test_size=0.30, shuffle=True)"
],
"metadata": {
"id": "88vyutdBdRnq"
},
"execution_count": 32,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Compute Precision and Recall at k for each user.\n",
"def precision_recall_at_k(predictions, k=5, threshold=3.5):\n",
" # Map the predictions to each user.\n",
" user_est_true = defaultdict(list)\n",
" for uid, iid, true_r, est, details in predictions:\n",
" user_est_true[uid].append((est, true_r))\n",
"\n",
" precisions = []\n",
" recalls = []\n",
"\n",
" for uid, user_ratings in user_est_true.items():\n",
" # Sort user ratings by estimated value\n",
" user_ratings.sort(key=lambda x: x[0], reverse=True)\n",
"\n",
" # Number of relevant items\n",
" n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)\n",
"\n",
" # Number of recommended items in top k\n",
" n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])\n",
"\n",
" # Number of relevant and recommended items in top k\n",
" n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))\n",
" for (est, true_r) in user_ratings[:k])\n",
"\n",
" # Precision@K\n",
" if n_rec_k != 0:\n",
" precisions.append(n_rel_and_rec_k / n_rec_k)\n",
" else:\n",
" precisions.append(0)\n",
"\n",
" # Recall@K\n",
" if n_rel != 0:\n",
" recalls.append(n_rel_and_rec_k / n_rel)\n",
" else:\n",
" recalls.append(0)\n",
"\n",
" # Average precision and recall over all users\n",
" avg_precision = sum(precisions) / len(precisions)\n",
" avg_recall = sum(recalls) / len(recalls)\n",
"\n",
" return avg_precision, avg_recall"
],
"metadata": {
"id": "4BwEvmD447Oh"
},
"execution_count": 33,
"outputs": []
},
{
"cell_type": "code",
"source": [
"n_epochs = 100\n",
"rmse_values = []\n",
"precision_values = []\n",
"recall_values = []\n",
"\n",
"for epoch in range(1, n_epochs + 1):\n",
" svd_model = SVD(n_epochs=epoch, reg_all=0.1)\n",
" svd_model.fit(trainset)\n",
"\n",
" # Evaluate on test set\n",
" predictions = svd_model.test(testset)\n",
" rmse = accuracy.rmse(predictions, verbose=False)\n",
" rmse_values.append(rmse)\n",
"\n",
" # Compute precision and recall on test set\n",
" precision, recall = precision_recall_at_k(predictions, k=5, threshold=3.5)\n",
" precision_values.append(precision)\n",
" recall_values.append(recall)\n",
"\n",
" print(f\"Epoch: {epoch} | RMSE: {rmse:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f}\")\n",
"\n"
],
"metadata": {
"id": "6GNLOKkwdV71",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "f4f9c10c-ec4b-4426-8cbb-d89d1cc58bde"
},
"execution_count": 34,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Epoch: 1 | RMSE: 0.9403 | Precision: 0.7572 | Recall: 0.3068\n",
"Epoch: 2 | RMSE: 0.9178 | Precision: 0.7772 | Recall: 0.3139\n",
"Epoch: 3 | RMSE: 0.9062 | Precision: 0.7802 | Recall: 0.3074\n",
"Epoch: 4 | RMSE: 0.8986 | Precision: 0.7833 | Recall: 0.3110\n",
"Epoch: 5 | RMSE: 0.8940 | Precision: 0.7805 | Recall: 0.3076\n",
"Epoch: 6 | RMSE: 0.8898 | Precision: 0.7865 | Recall: 0.3089\n",
"Epoch: 7 | RMSE: 0.8874 | Precision: 0.7816 | Recall: 0.3087\n",
"Epoch: 8 | RMSE: 0.8847 | Precision: 0.7823 | Recall: 0.3078\n",
"Epoch: 9 | RMSE: 0.8819 | Precision: 0.7844 | Recall: 0.3074\n",
"Epoch: 10 | RMSE: 0.8798 | Precision: 0.7805 | Recall: 0.3067\n",
"Epoch: 11 | RMSE: 0.8783 | Precision: 0.7839 | Recall: 0.3026\n",
"Epoch: 12 | RMSE: 0.8771 | Precision: 0.7826 | Recall: 0.3046\n",
"Epoch: 13 | RMSE: 0.8761 | Precision: 0.7903 | Recall: 0.3038\n",
"Epoch: 14 | RMSE: 0.8754 | Precision: 0.7908 | Recall: 0.3038\n",
"Epoch: 15 | RMSE: 0.8740 | Precision: 0.7829 | Recall: 0.3015\n",
"Epoch: 16 | RMSE: 0.8729 | Precision: 0.7823 | Recall: 0.3007\n",
"Epoch: 17 | RMSE: 0.8724 | Precision: 0.7892 | Recall: 0.3034\n",
"Epoch: 18 | RMSE: 0.8718 | Precision: 0.7823 | Recall: 0.3009\n",
"Epoch: 19 | RMSE: 0.8709 | Precision: 0.7815 | Recall: 0.2973\n",
"Epoch: 20 | RMSE: 0.8702 | Precision: 0.7848 | Recall: 0.3020\n",
"Epoch: 21 | RMSE: 0.8698 | Precision: 0.7814 | Recall: 0.2982\n",
"Epoch: 22 | RMSE: 0.8698 | Precision: 0.7815 | Recall: 0.2980\n",
"Epoch: 23 | RMSE: 0.8691 | Precision: 0.7792 | Recall: 0.3015\n",
"Epoch: 24 | RMSE: 0.8679 | Precision: 0.7810 | Recall: 0.3002\n",
"Epoch: 25 | RMSE: 0.8682 | Precision: 0.7721 | Recall: 0.2992\n",
"Epoch: 26 | RMSE: 0.8674 | Precision: 0.7783 | Recall: 0.2986\n",
"Epoch: 27 | RMSE: 0.8672 | Precision: 0.7769 | Recall: 0.3000\n",
"Epoch: 28 | RMSE: 0.8667 | Precision: 0.7831 | Recall: 0.2989\n",
"Epoch: 29 | RMSE: 0.8661 | Precision: 0.7809 | Recall: 0.3015\n",
"Epoch: 30 | RMSE: 0.8658 | Precision: 0.7839 | Recall: 0.2970\n",
"Epoch: 31 | RMSE: 0.8654 | Precision: 0.7815 | Recall: 0.2993\n",
"Epoch: 32 | RMSE: 0.8650 | Precision: 0.7770 | Recall: 0.2987\n",
"Epoch: 33 | RMSE: 0.8652 | Precision: 0.7801 | Recall: 0.2964\n",
"Epoch: 34 | RMSE: 0.8645 | Precision: 0.7795 | Recall: 0.3020\n",
"Epoch: 35 | RMSE: 0.8638 | Precision: 0.7804 | Recall: 0.2968\n",
"Epoch: 36 | RMSE: 0.8630 | Precision: 0.7846 | Recall: 0.3010\n",
"Epoch: 37 | RMSE: 0.8624 | Precision: 0.7810 | Recall: 0.3004\n",
"Epoch: 38 | RMSE: 0.8643 | Precision: 0.7749 | Recall: 0.2998\n",
"Epoch: 39 | RMSE: 0.8629 | Precision: 0.7813 | Recall: 0.2974\n",
"Epoch: 40 | RMSE: 0.8620 | Precision: 0.7856 | Recall: 0.3002\n",
"Epoch: 41 | RMSE: 0.8620 | Precision: 0.7714 | Recall: 0.2980\n",
"Epoch: 42 | RMSE: 0.8620 | Precision: 0.7885 | Recall: 0.3027\n",
"Epoch: 43 | RMSE: 0.8607 | Precision: 0.7868 | Recall: 0.3013\n",
"Epoch: 44 | RMSE: 0.8616 | Precision: 0.7829 | Recall: 0.2986\n",
"Epoch: 45 | RMSE: 0.8607 | Precision: 0.7921 | Recall: 0.3014\n",
"Epoch: 46 | RMSE: 0.8608 | Precision: 0.7798 | Recall: 0.2999\n",
"Epoch: 47 | RMSE: 0.8595 | Precision: 0.7853 | Recall: 0.3014\n",
"Epoch: 48 | RMSE: 0.8602 | Precision: 0.7769 | Recall: 0.2981\n",
"Epoch: 49 | RMSE: 0.8586 | Precision: 0.7890 | Recall: 0.3006\n",
"Epoch: 50 | RMSE: 0.8588 | Precision: 0.7904 | Recall: 0.3021\n",
"Epoch: 51 | RMSE: 0.8583 | Precision: 0.7837 | Recall: 0.3032\n",
"Epoch: 52 | RMSE: 0.8581 | Precision: 0.7914 | Recall: 0.3025\n",
"Epoch: 53 | RMSE: 0.8577 | Precision: 0.7903 | Recall: 0.3022\n",
"Epoch: 54 | RMSE: 0.8583 | Precision: 0.7831 | Recall: 0.3024\n",
"Epoch: 55 | RMSE: 0.8584 | Precision: 0.7874 | Recall: 0.3003\n",
"Epoch: 56 | RMSE: 0.8579 | Precision: 0.7925 | Recall: 0.3035\n",
"Epoch: 57 | RMSE: 0.8568 | Precision: 0.7933 | Recall: 0.3053\n",
"Epoch: 58 | RMSE: 0.8572 | Precision: 0.7942 | Recall: 0.3030\n",
"Epoch: 59 | RMSE: 0.8577 | Precision: 0.7851 | Recall: 0.3013\n",
"Epoch: 60 | RMSE: 0.8570 | Precision: 0.7850 | Recall: 0.2969\n",
"Epoch: 61 | RMSE: 0.8567 | Precision: 0.7856 | Recall: 0.2987\n",
"Epoch: 62 | RMSE: 0.8563 | Precision: 0.7892 | Recall: 0.3022\n",
"Epoch: 63 | RMSE: 0.8563 | Precision: 0.7886 | Recall: 0.3020\n",
"Epoch: 64 | RMSE: 0.8565 | Precision: 0.7889 | Recall: 0.3015\n",
"Epoch: 65 | RMSE: 0.8554 | Precision: 0.7960 | Recall: 0.3063\n",
"Epoch: 66 | RMSE: 0.8552 | Precision: 0.7883 | Recall: 0.3051\n",
"Epoch: 67 | RMSE: 0.8557 | Precision: 0.7936 | Recall: 0.3043\n",
"Epoch: 68 | RMSE: 0.8570 | Precision: 0.7942 | Recall: 0.3030\n",
"Epoch: 69 | RMSE: 0.8554 | Precision: 0.7973 | Recall: 0.3074\n",
"Epoch: 70 | RMSE: 0.8563 | Precision: 0.7821 | Recall: 0.2997\n",
"Epoch: 71 | RMSE: 0.8546 | Precision: 0.7974 | Recall: 0.3024\n",
"Epoch: 72 | RMSE: 0.8556 | Precision: 0.7955 | Recall: 0.3037\n",
"Epoch: 73 | RMSE: 0.8552 | Precision: 0.7948 | Recall: 0.3026\n",
"Epoch: 74 | RMSE: 0.8543 | Precision: 0.8004 | Recall: 0.3054\n",
"Epoch: 75 | RMSE: 0.8546 | Precision: 0.7935 | Recall: 0.3004\n",
"Epoch: 76 | RMSE: 0.8550 | Precision: 0.7894 | Recall: 0.3040\n",
"Epoch: 77 | RMSE: 0.8543 | Precision: 0.7864 | Recall: 0.3037\n",
"Epoch: 78 | RMSE: 0.8552 | Precision: 0.7924 | Recall: 0.3045\n",
"Epoch: 79 | RMSE: 0.8547 | Precision: 0.7880 | Recall: 0.3048\n",
"Epoch: 80 | RMSE: 0.8536 | Precision: 0.7957 | Recall: 0.3036\n",
"Epoch: 81 | RMSE: 0.8537 | Precision: 0.7928 | Recall: 0.3035\n",
"Epoch: 82 | RMSE: 0.8538 | Precision: 0.7956 | Recall: 0.3051\n",
"Epoch: 83 | RMSE: 0.8536 | Precision: 0.7919 | Recall: 0.3024\n",
"Epoch: 84 | RMSE: 0.8529 | Precision: 0.7930 | Recall: 0.3043\n",
"Epoch: 85 | RMSE: 0.8524 | Precision: 0.7969 | Recall: 0.3020\n",
"Epoch: 86 | RMSE: 0.8519 | Precision: 0.7971 | Recall: 0.3034\n",
"Epoch: 87 | RMSE: 0.8538 | Precision: 0.7935 | Recall: 0.3063\n",
"Epoch: 88 | RMSE: 0.8529 | Precision: 0.8011 | Recall: 0.3060\n",
"Epoch: 89 | RMSE: 0.8521 | Precision: 0.7981 | Recall: 0.3065\n",
"Epoch: 90 | RMSE: 0.8524 | Precision: 0.7859 | Recall: 0.3024\n",
"Epoch: 91 | RMSE: 0.8514 | Precision: 0.7959 | Recall: 0.3088\n",
"Epoch: 92 | RMSE: 0.8533 | Precision: 0.7976 | Recall: 0.3078\n",
"Epoch: 93 | RMSE: 0.8523 | Precision: 0.7967 | Recall: 0.3068\n",
"Epoch: 94 | RMSE: 0.8542 | Precision: 0.7899 | Recall: 0.3036\n",
"Epoch: 95 | RMSE: 0.8520 | Precision: 0.7898 | Recall: 0.3041\n",
"Epoch: 96 | RMSE: 0.8534 | Precision: 0.7961 | Recall: 0.3043\n",
"Epoch: 97 | RMSE: 0.8529 | Precision: 0.7974 | Recall: 0.3073\n",
"Epoch: 98 | RMSE: 0.8530 | Precision: 0.7926 | Recall: 0.3051\n",
"Epoch: 99 | RMSE: 0.8534 | Precision: 0.7981 | Recall: 0.3085\n",
"Epoch: 100 | RMSE: 0.8517 | Precision: 0.7978 | Recall: 0.3075\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"epochs = range(1, n_epochs + 1)\n",
"\n",
"plt.figure(figsize=(10, 6))\n",
"plt.plot(epochs, rmse_values, label='RMSE')\n",
"plt.plot(epochs, precision_values, label='Precision')\n",
"plt.plot(epochs, recall_values, label='Recall')\n",
"plt.xlabel('Number of Epochs')\n",
"plt.ylabel('Values')\n",
"plt.legend()\n",
"plt.grid(True)\n",
"plt.show()"
],
"metadata": {
"id": "AFgbkgBWihoq",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 542
},
"outputId": "1a0247ae-be50-483d-d777-e93574966ecc"
},
"execution_count": 35,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
]
},
{
"cell_type": "code",
"source": [
"user_id = 16\n",
"movie_id = 501\n",
"predicted_rating = svd_model.predict(user_id, movie_id)\n",
"print(predicted_rating)"
],
"metadata": {
"id": "h9mvZj7ydhx0",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "b0c3d860-43ad-4c05-feee-bc3073125955"
},
"execution_count": 36,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"user: 16 item: 501 r_ui = None est = 3.61 {'was_impossible': False}\n"
]
}
]
}
]
}
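
The `precision_recall_at_k` helper in the notebook can be sanity-checked by hand before trusting the per-epoch numbers. The sketch below is a compact restatement of the same logic, run on a few made-up predictions. The tuples only mimic the `(uid, iid, true_r, est, details)` layout that the notebook unpacks. The user names, item names, and ratings are hypothetical, and nothing here calls Surprise itself.

```python
from collections import defaultdict


def precision_recall_at_k(predictions, k=5, threshold=3.5):
    """Return (mean precision@k, mean recall@k) averaged over users."""
    user_est_true = defaultdict(list)
    for uid, _iid, true_r, est, _details in predictions:
        user_est_true[uid].append((est, true_r))

    precisions, recalls = [], []
    for user_ratings in user_est_true.values():
        # Rank this user's items by estimated rating, best first.
        user_ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = user_ratings[:k]

        n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
        n_rec_k = sum(est >= threshold for est, _ in top_k)
        n_rel_and_rec_k = sum(
            est >= threshold and true_r >= threshold for est, true_r in top_k
        )

        # Ratios with an empty denominator are counted as 0.
        precisions.append(n_rel_and_rec_k / n_rec_k if n_rec_k else 0.0)
        recalls.append(n_rel_and_rec_k / n_rel if n_rel else 0.0)

    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


# Hypothetical toy predictions: (user, item, true rating, estimate, details).
toy_predictions = [
    ("u1", "a", 5.0, 4.8, {}),
    ("u1", "b", 2.0, 4.1, {}),
    ("u1", "c", 4.0, 3.0, {}),
    ("u2", "a", 4.5, 4.0, {}),
    ("u2", "d", 1.0, 2.0, {}),
]

precision, recall = precision_recall_at_k(toy_predictions, k=2, threshold=3.5)
print(f"precision@2 = {precision:.3f}, recall@2 = {recall:.3f}")
```

For `u1`, the top two items by estimate are `a` and `b`. Both are recommended, but only `a` is relevant, so precision is 1/2. Recall is also 1/2, because `c` is relevant but ranked third. For `u2`, both metrics are 1. Averaging over the two users gives 0.75 for each metric.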