{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "vncDsAP0Gaoa"
},
"source": [
"# **Netflix Movies And TV Shows Clustering**\n",
"\n",
"---------------------------------------------"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FJNUwmbgGyua"
},
"source": [
"# **Project Summary -**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "F6v_1wHtG2nS"
},
"source": [
"Netflix Movies and TV Shows Clustering is a data analysis and machine learning project that groups Netflix's library of content into similar categories. The process analyzes the characteristics of each title, such as genre, cast, director, country, and plot description, and uses unsupervised machine learning algorithms to identify patterns and similarities between titles. Grouping titles this way supports personalized recommendations, with the goal of improving user engagement and satisfaction and, in turn, retention and company revenue.\n",
"\n",
"The algorithms used in this process include clustering techniques such as k-means and hierarchical clustering, preceded by principal component analysis (PCA), which is a dimensionality reduction step rather than a clustering algorithm. Together they group movies and TV shows with similar features into distinct clusters, each of which can be interpreted as a genre-like category.\n",
"\n",
"The ultimate goal of this clustering is to improve the user experience on Netflix by providing personalized content recommendations based on cosine similarity scores between titles. With these scores, Netflix can suggest titles that are more likely to match a user's interests, making it more likely that users stay engaged with the platform.\n",
"\n",
"In addition to improving user satisfaction, clustering analysis and a recommender system can also help Netflix make data-driven decisions about content production and licensing. By understanding underlying trends and patterns in user behavior, Netflix can make informed decisions about which titles to produce or acquire and which to remove from its platform. This ultimately helps increase customer retention and company revenue."
]
},
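{
"cell_type": "markdown",
"metadata": {},
"source": [
"The recommender mentioned above relies on cosine similarity. As a minimal sketch of that measure (not this notebook's implementation; the toy vectors and the helper name `cosine_similarity_sketch` are made up for illustration), the score is the dot product of two vectors divided by the product of their lengths:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def cosine_similarity_sketch(a, b):\n",
"    # Hypothetical helper: cosine of the angle between two equal-length vectors\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    if norm_a == 0 or norm_b == 0:\n",
"        return 0.0  # convention chosen here for all-zero vectors\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"# Toy term-weight vectors for three imaginary titles\n",
"title_a = [1.0, 2.0, 0.0]\n",
"title_b = [2.0, 4.0, 0.0]  # same direction as title_a\n",
"title_c = [0.0, 0.0, 3.0]  # shares no terms with title_a\n",
"\n",
"sim_ab = cosine_similarity_sketch(title_a, title_b)\n",
"sim_ac = cosine_similarity_sketch(title_a, title_c)\n",
"```\n",
"\n",
"Because the score depends only on direction, two titles with the same mix of terms score close to 1 regardless of how long their text is, while titles that share no terms score 0."
]
},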
{
"cell_type": "markdown",
"metadata": {
"id": "BLpFGEcARMA2"
},
"source": [
"In this project we carried out the following steps:\n",
"\n",
"* Understanding the dataset and problem statement.\n",
"* Data Wrangling.\n",
"* Handling null values and Data Cleaning.\n",
"* Text Preprocessing.\n",
"* Text Vectorization using TF-IDF.\n",
"* Clustering Analysis using different clustering algorithms.\n",
"* Examining the contents of each cluster with the help of word clouds.\n",
"* Building a recommender system using cosine similarity.\n",
"* Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w6K7xa23Elo4"
},
"source": [
"# **GitHub Link 👇🏻**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h1o69JH3Eqqn"
},
"source": [
"https://github.com/HarshJain41/Netflix-Movies-TVShow-Clustering"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yQaldy8SH6Dl"
},
"source": [
"# **Problem Statement👇🏻**\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y3lxredqlCYt"
},
"source": [
"### Importing Necessary Libraries👇🏻"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "M8Vqi-pPk-HR"
},
"outputs": [],
"source": [
"# Necessary libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import missingno as msno\n",
"import matplotlib.cm as cm\n",
"from wordcloud import WordCloud, STOPWORDS\n",
"%matplotlib inline\n",
"\n",
"\n",
"#for nlp\n",
"from sklearn import preprocessing\n",
"from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
"from sklearn.model_selection import train_test_split, KFold\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem.snowball import SnowballStemmer\n",
"\n",
"from sklearn.metrics import silhouette_score\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.metrics import silhouette_samples\n",
"import scipy.cluster.hierarchy as sch\n",
"\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3RnN4peoiCZX"
},
"source": [
"### Loading the given dataset👇🏻\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "4CkvbW_SlZ_R"
},
"outputs": [],
"source": [
"# Load Dataset\n",
"data = pd.read_csv('NETFLIX MOVIES AND TV SHOWS CLUSTERING (1).csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x71ZqKXriCWQ"
},
"source": [
"### Dataset First View👇🏻"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 337
},
"executionInfo": {
"elapsed": 55,
"status": "ok",
"timestamp": 1676699085083,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "LWNFOSvLl09H",
"outputId": "fed5f463-d849-48a9-e19b-a62ae62f9fe8"
},
"outputs": [
{
"data": {
"text/plain": [
" show_id type title director \\\n",
"count 7787 7787 7787 5398 \n",
"unique 7787 2 7787 4049 \n",
"top s6307 Movie The Hollow Point Raúl Campos, Jan Suter \n",
"freq 1 5377 1 18 \n",
"mean NaN NaN NaN NaN \n",
"std NaN NaN NaN NaN \n",
"min NaN NaN NaN NaN \n",
"25% NaN NaN NaN NaN \n",
"50% NaN NaN NaN NaN \n",
"75% NaN NaN NaN NaN \n",
"max NaN NaN NaN NaN \n",
"\n",
" cast country date_added release_year \\\n",
"count 7069 7280 7777 7787.000000 \n",
"unique 6831 681 1565 NaN \n",
"top David Attenborough United States January 1, 2020 NaN \n",
"freq 18 2555 118 NaN \n",
"mean NaN NaN NaN 2013.932580 \n",
"std NaN NaN NaN 8.757395 \n",
"min NaN NaN NaN 1925.000000 \n",
"25% NaN NaN NaN 2013.000000 \n",
"50% NaN NaN NaN 2017.000000 \n",
"75% NaN NaN NaN 2018.000000 \n",
"max NaN NaN NaN 2021.000000 \n",
"\n",
" rating duration listed_in \\\n",
"count 7780 7787 7787 \n",
"unique 14 216 492 \n",
"top TV-MA 1 Season Documentaries \n",
"freq 2863 1608 334 \n",
"mean NaN NaN NaN \n",
"std NaN NaN NaN \n",
"min NaN NaN NaN \n",
"25% NaN NaN NaN \n",
"50% NaN NaN NaN \n",
"75% NaN NaN NaN \n",
"max NaN NaN NaN \n",
"\n",
" description \n",
"count 7787 \n",
"unique 7769 \n",
"top Multiple women report their husbands as missin... \n",
"freq 3 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Dataset Describe\n",
"data.describe(include = 'all')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bKJF3rekwFvQ"
},
"source": [
"### Data Wrangling Code"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 74,
"status": "ok",
"timestamp": 1676699088001,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "wk-9a2fpoLcV",
"outputId": "7ee26ad4-fcc7-4c80-fda0-9b5376c7959f"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to\n",
"[nltk_data] C:\\Users\\ADMIN\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Download the NLTK 'punkt' tokenizer models required by word_tokenize\n",
"import nltk\n",
"nltk.download('punkt')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 69,
"status": "ok",
"timestamp": 1676699088002,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "iGQzh0z_wMDk",
"outputId": "bf8ed126-0031-4243-b12c-893042c92f47"
},
"outputs": [
{
"data": {
"text/plain": [
"show_id 0\n",
"type 0\n",
"title 0\n",
"director 2389\n",
"cast 718\n",
"country 507\n",
"date_added 10\n",
"release_year 0\n",
"rating 7\n",
"duration 0\n",
"listed_in 0\n",
"description 0\n",
"dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "TcNt1y5xT5OG"
},
"outputs": [],
"source": [
"# Replace missing (NaN) values: a placeholder for 'cast', the most frequent value (mode) for 'country',\n",
"# and an empty string for 'director'. Assigning the result back avoids chained in-place modification.\n",
"data['cast'] = data['cast'].fillna('No cast')\n",
"data['country'] = data['country'].fillna(data['country'].mode()[0])\n",
"data['director'] = data['director'].fillna('')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "Rhd-J-qWwW2B"
},
"outputs": [],
"source": [
"# Only 10 rows are missing 'date_added' and 7 are missing 'rating', so we drop those rows instead of imputing them.\n",
"data.dropna(subset=['date_added','rating'],inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 66,
"status": "ok",
"timestamp": 1676699088004,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "7E06p0Mz6TyW",
"outputId": "c29dda97-fd19-4d53-af47-ad9796630059"
},
"outputs": [
{
"data": {
"text/plain": [
"show_id 0\n",
"type 0\n",
"title 0\n",
"director 0\n",
"cast 0\n",
"country 0\n",
"date_added 0\n",
"release_year 0\n",
"rating 0\n",
"duration 0\n",
"listed_in 0\n",
"description 0\n",
"dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Re-check the number of missing values per column after filling and dropping\n",
"data.isnull().sum()\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "riotQugGqtWx"
},
"source": [
"### As we can see, the null values have been handled and the data is ready for EDA and text preprocessing."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yLjJCtPM0KBk"
},
"source": [
"## **Feature Engineering & Data Pre-processing**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Iwf50b-R2tYG"
},
"source": [
"### Textual Data Preprocessing 👇🏻"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 64,
"status": "ok",
"timestamp": 1676699103203,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "lpSr8voLRYoN",
"outputId": "9e40e7e8-d5aa-4842-a582-b64022187a53"
},
"outputs": [
{
"data": {
"text/plain": [
"show_id 0\n",
"type 0\n",
"title 0\n",
"director 0\n",
"cast 0\n",
"country 0\n",
"date_added 0\n",
"release_year 0\n",
"rating 0\n",
"duration 0\n",
"listed_in 0\n",
"description 0\n",
"dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-lNShkeaQ_uG"
},
"source": [
"All the null values in our dataset have now been handled."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 458
},
"executionInfo": {
"elapsed": 54,
"status": "ok",
"timestamp": 1676699103204,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "He1mWK1W07Os",
"outputId": "970b14ef-2e8c-4c3b-b2ae-ff1cbe8d60ea"
},
"outputs": [
{
"data": {
"text/plain": [
" show_id type title director \\\n",
"0 s1 TV Show 3% \n",
"1 s2 Movie 7:19 Jorge Michel Grau \n",
"2 s3 Movie 23:59 Gilbert Chan \n",
"3 s4 Movie 9 Shane Acker \n",
"4 s5 Movie 21 Robert Luketic \n",
"\n",
" cast country \\\n",
"0 João Miguel, Bianca Comparato, Michel Gomes, R... Brazil \n",
"1 Demián Bichir, Héctor Bonilla, Oscar Serrano, ... Mexico \n",
"2 Tedd Chan, Stella Chung, Henley Hii, Lawrence ... Singapore \n",
"3 Elijah Wood, John C. Reilly, Jennifer Connelly... United States \n",
"4 Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... United States \n",
"\n",
" date_added release_year rating duration \\\n",
"0 August 14, 2020 2020 TV-MA 4 Seasons \n",
"1 December 23, 2016 2016 TV-MA 93 min \n",
"2 December 20, 2018 2011 R 78 min \n",
"3 November 16, 2017 2009 PG-13 80 min \n",
"4 January 1, 2020 2008 PG-13 123 min \n",
"\n",
" listed_in \\\n",
"0 International TV Shows, TV Dramas, TV Sci-Fi &... \n",
"1 Dramas, International Movies \n",
"2 Horror Movies, International Movies \n",
"3 Action & Adventure, Independent Movies, Sci-Fi... \n",
"4 Dramas \n",
"\n",
" description \n",
"0 In a future where the elite inhabit an island ... \n",
"1 After a devastating earthquake hits Mexico Cit... \n",
"2 When an army recruit is found dead, his fellow... \n",
"3 In a postapocalyptic world, rag-doll robots hi... \n",
"4 A brilliant group of students become card-coun... "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "kEJZf8DXTpp1"
},
"outputs": [],
"source": [
"# Merge all text columns into a single text column to work with\n",
"\n",
"data['organized'] = data['description'] + ' ' + data['listed_in'] + ' ' + data['country'] + ' ' + data['cast'] + ' ' + data['director']\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "rw84Yt4bXZw1"
},
"outputs": [],
"source": [
"# Fill any remaining missing values with empty strings\n",
"data['organized'] = data['organized'].fillna(\"\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 53
},
"executionInfo": {
"elapsed": 53,
"status": "ok",
"timestamp": 1676699103206,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "Wsas3AdhXgQ1",
"outputId": "d6f5a1eb-604d-4030-d5a8-d384c29e780d"
},
"outputs": [
{
"data": {
"text/plain": [
"\"When an army recruit is found dead, his fellow soldiers are forced to confront a terrifying secret that's haunting their jungle island training camp. Horror Movies, International Movies Singapore Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim Gilbert Chan\""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['organized'][2]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"id": "seRsQHVnTpEm"
},
"outputs": [],
"source": [
"# Text cleaning\n",
"import re\n",
"\n",
"def cleaned(x):\n",
"    # Keep only ASCII letters and spaces; digits, punctuation and accented letters are removed\n",
"    return re.sub(r\"[^a-zA-Z ]\", \"\", str(x))\n",
"\n",
"data['organized'] = data['organized'].apply(cleaned)"
]
},
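{
"cell_type": "markdown",
"metadata": {},
"source": [
"The regular expression above keeps only ASCII letters and spaces, so accented characters are deleted rather than transliterated (a name such as `João` ends up as `Joo`). The sketch below reproduces that behaviour on a sample string and shows one possible alternative, assumed here purely for illustration and not used anywhere in this notebook, that strips the accents with `unicodedata` before filtering. The helper name `clean_keep_base_letters` is hypothetical.\n",
"\n",
"```python\n",
"import re\n",
"import unicodedata\n",
"\n",
"def cleaned(x):\n",
"    # Same pattern as the notebook: drop everything except ASCII letters and spaces\n",
"    return re.sub(r'[^a-zA-Z ]', '', str(x))\n",
"\n",
"def clean_keep_base_letters(x):\n",
"    # Hypothetical variant: decompose accented letters (NFKD), drop the combining marks,\n",
"    # then apply the same filter\n",
"    decomposed = unicodedata.normalize('NFKD', str(x))\n",
"    without_marks = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))\n",
"    return re.sub(r'[^a-zA-Z ]', '', without_marks)\n",
"\n",
"# Normalize to NFC so the accented letters are single precomposed characters\n",
"sample = unicodedata.normalize('NFC', 'João Miguel, Zezé Motta')\n",
"original_result = cleaned(sample)\n",
"variant_result = clean_keep_base_letters(sample)\n",
"```\n",
"\n",
"Whether the variant is preferable depends on the goal: it keeps cast and director names closer to their original spelling, but it changes the vocabulary that the later vectorization step sees."
]
},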
{
"cell_type": "markdown",
"metadata": {
"id": "WVIkgGqN3qsr"
},
"source": [
"#### 2. Lower Casing"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"id": "88JnJ1jN3w7j"
},
"outputs": [],
"source": [
"# Lower Casing\n",
"data['organized']= data['organized'].str.lower()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XkPnILGE3zoT"
},
"source": [
"#### 3. Removing Punctuation"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"id": "JNXs4Uif5Pj1"
},
"outputs": [],
"source": [
"import string\n",
"\n",
"def remove_punctuation(text):\n",
"    # Create a translation table that deletes every punctuation character\n",
"    translator = str.maketrans('', '', string.punctuation)\n",
"    # Apply the translation table to remove punctuation from the text\n",
"    text = text.translate(translator)\n",
"    return text"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 55,
"status": "ok",
"timestamp": 1676699103210,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "SLKcXBZ75gIN",
"outputId": "e0bd6d24-6605-4b33-dcbf-cf03c35a2ffd"
},
"outputs": [
{
"data": {
"text/plain": [
"0 in a future where the elite inhabit an island ...\n",
"1 after a devastating earthquake hits mexico cit...\n",
"2 when an army recruit is found dead his fellow ...\n",
"3 in a postapocalyptic world ragdoll robots hide...\n",
"4 a brilliant group of students become cardcount...\n",
" ... \n",
"7782 when lebanons civil war deprives zozo of his f...\n",
"7783 a scrappy but poor boy worms his way into a ty...\n",
"7784 in this documentary south african rapper nasty...\n",
"7785 dessert wizard adriano zumbo looks for the nex...\n",
"7786 this documentary delves into the mystique behi...\n",
"Name: organized, Length: 7770, dtype: object"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
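"# Preview only: the result is not assigned back (cleaned() above has already removed punctuation)\n",
"data['organized'].apply(remove_punctuation)"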
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mT9DMSJo4nBL"
},
"source": [
"#### 4. Removing Stopwords"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 48,
"status": "ok",
"timestamp": 1676699103211,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "vqbBqNaA33c0",
"outputId": "fda25fda-bfb0-4786-a6c2-c3607b499878"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to\n",
"[nltk_data] C:\\Users\\ADMIN\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
]
}
],
"source": [
"#necessary import for nlp\n",
"import nltk\n",
"nltk.download('stopwords')\n",
"from nltk.corpus import stopwords\n",
"from nltk.tokenize import word_tokenize\n",
"from nltk.stem.snowball import SnowballStemmer\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"id": "24wQ8Jkn6Rnh"
},
"outputs": [],
"source": [
"# Removing Stopwords\n",
"# Build the stopword set once; calling stopwords.words('english') for every word is very slow\n",
"english_stopwords = set(stopwords.words('english'))\n",
"\n",
"def remove_stopwords(text):\n",
"    # Tokenize the text into words\n",
"    words = nltk.word_tokenize(text)\n",
"    # Keep only the words that are not stopwords\n",
"    words = [word for word in words if word.lower() not in english_stopwords]\n",
"    # Join the remaining words back into a single string\n",
"    return ' '.join(words)\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 46620,
"status": "ok",
"timestamp": 1676699149791,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "FcdeJmcZ7KKF",
"outputId": "19e3a50c-54e3-43bb-a234-6545882571bf"
},
"outputs": [
{
"data": {
"text/plain": [
"0 future elite inhabit island paradise far crowd...\n",
"1 devastating earthquake hits mexico city trappe...\n",
"2 army recruit found dead fellow soldiers forced...\n",
"3 postapocalyptic world ragdoll robots hide fear...\n",
"4 brilliant group students become cardcounting e...\n",
" ... \n",
"7782 lebanons civil war deprives zozo family hes le...\n",
"7783 scrappy poor boy worms way tycoons dysfunction...\n",
"7784 documentary south african rapper nasty c hits ...\n",
"7785 dessert wizard adriano zumbo looks next willy ...\n",
"7786 documentary delves mystique behind bluesrock t...\n",
"Name: organized, Length: 7770, dtype: object"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
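"# Preview only: the result is not assigned back, so data['organized'] still contains stopwords\n",
"data['organized'].apply(remove_stopwords)"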
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9ExmJH0g5HBk"
},
"source": [
"#### 5. Text Normalization"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"id": "l32X3m9CYVgw"
},
"outputs": [],
"source": [
"#stemming\n",
"stemmer = SnowballStemmer('english')\n",
"stop_words = set(stopwords.words('english'))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"id": "2sxKgKxu4Ip3"
},
"outputs": [],
"source": [
"def stem_text(text):\n",
"    words = nltk.word_tokenize(text)  # tokenize the text into words\n",
"    stemmed_words = [stemmer.stem(word) for word in words]  # apply the Snowball stemmer to each word\n",
"    return ' '.join(stemmed_words)  # join the stemmed words back into a single string\n",
"\n",
"# Apply stem_text to the 'organized' column and store the result in a new 'org_new' column\n",
"data['org_new'] = data['organized'].apply(stem_text)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 13,
"status": "ok",
"timestamp": 1676699157357,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "Jz6jRmB5ZRkY",
"outputId": "26b5df0e-a377-4d2c-b797-5868dba89f31"
},
"outputs": [
{
"data": {
"text/plain": [
"0 in a futur where the elit inhabit an island pa...\n",
"1 after a devast earthquak hit mexico citi trap ...\n",
"2 when an armi recruit is found dead his fellow ...\n",
"3 in a postapocalypt world ragdol robot hide in ...\n",
"4 a brilliant group of student becom cardcount e...\n",
" ... \n",
"7782 when lebanon civil war depriv zozo of his fami...\n",
"7783 a scrappi but poor boy worm his way into a tyc...\n",
"7784 in this documentari south african rapper nasti...\n",
"7785 dessert wizard adriano zumbo look for the next...\n",
"7786 this documentari delv into the mystiqu behind ...\n",
"Name: org_new, Length: 7770, dtype: object"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.org_new"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 53
},
"executionInfo": {
"elapsed": 11,
"status": "ok",
"timestamp": 1676699157358,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "gojjm2KeDUcP",
"outputId": "3c4781c7-aa84-4776-8fb4-22a0f56d70fd"
},
"outputs": [
{
"data": {
"text/plain": [
"'in a futur where the elit inhabit an island paradis far from the crowd slum you get one chanc to join the save from squalor intern tv show tv drama tv scifi fantasi brazil joo miguel bianca comparato michel gome rodolfo valent vaneza oliveira rafael lozano vivian porto mel fronckowiak sergio mamberti zez motta celso frateschi'"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.org_new.iloc[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z9jKVxE06BC1"
},
"source": [
"We used the Snowball stemmer because it is a popular choice for stemming in NLP: it supports multiple languages and is fast. Stemming reduces each word to a stem that is not always a dictionary word; for example, 'future' becomes 'futur' and 'army' becomes 'armi' in the output above."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T0VqWOYE6DLQ"
},
"source": [
"#### 6. Text Vectorization"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 738
},
"executionInfo": {
"elapsed": 529,
"status": "ok",
"timestamp": 1676699176059,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "NP1PRoPqDudF",
"outputId": "5dfe0b58-e201-49bb-c1db-d4e19ec28453"
},
"outputs": [
{
"data": {
"text/plain": [
" show_id type title director \\\n",
"0 s1 TV Show 3% \n",
"1 s2 Movie 7:19 Jorge Michel Grau \n",
"2 s3 Movie 23:59 Gilbert Chan \n",
"3 s4 Movie 9 Shane Acker \n",
"4 s5 Movie 21 Robert Luketic \n",
"\n",
" cast country \\\n",
"0 João Miguel, Bianca Comparato, Michel Gomes, R... Brazil \n",
"1 Demián Bichir, Héctor Bonilla, Oscar Serrano, ... Mexico \n",
"2 Tedd Chan, Stella Chung, Henley Hii, Lawrence ... Singapore \n",
"3 Elijah Wood, John C. Reilly, Jennifer Connelly... United States \n",
"4 Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... United States \n",
"\n",
" date_added release_year rating duration \\\n",
"0 August 14, 2020 2020 TV-MA 4 Seasons \n",
"1 December 23, 2016 2016 TV-MA 93 min \n",
"2 December 20, 2018 2011 R 78 min \n",
"3 November 16, 2017 2009 PG-13 80 min \n",
"4 January 1, 2020 2008 PG-13 123 min \n",
"\n",
" listed_in \\\n",
"0 International TV Shows, TV Dramas, TV Sci-Fi &... \n",
"1 Dramas, International Movies \n",
"2 Horror Movies, International Movies \n",
"3 Action & Adventure, Independent Movies, Sci-Fi... \n",
"4 Dramas \n",
"\n",
" description \\\n",
"0 In a future where the elite inhabit an island ... \n",
"1 After a devastating earthquake hits Mexico Cit... \n",
"2 When an army recruit is found dead, his fellow... \n",
"3 In a postapocalyptic world, rag-doll robots hi... \n",
"4 A brilliant group of students become card-coun... \n",
"\n",
" organized \\\n",
"0 in a future where the elite inhabit an island ... \n",
"1 after a devastating earthquake hits mexico cit... \n",
"2 when an army recruit is found dead his fellow ... \n",
"3 in a postapocalyptic world ragdoll robots hide... \n",
"4 a brilliant group of students become cardcount... \n",
"\n",
" org_new \n",
"0 in a futur where the elit inhabit an island pa... \n",
"1 after a devast earthquak hit mexico citi trap ... \n",
"2 when an armi recruit is found dead his fellow ... \n",
"3 in a postapocalypt world ragdol robot hide in ... \n",
"4 a brilliant group of student becom cardcount e... "
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"id": "cBxFBVPG3yDP"
},
"outputs": [],
"source": [
"new_df = data[['title', 'org_new']]"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"executionInfo": {
"elapsed": 6,
"status": "ok",
"timestamp": 1676699247013,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "lnE3n7hbD_Qz",
"outputId": "aabe11ec-858e-49a1-8e9a-f34287b5c1f7"
},
"outputs": [
{
"data": {
"text/plain": [
" title org_new\n",
"0 3% in a futur where the elit inhabit an island pa...\n",
"1 7:19 after a devast earthquak hit mexico citi trap ...\n",
"2 23:59 when an armi recruit is found dead his fellow ...\n",
"3 9 in a postapocalypt world ragdol robot hide in ...\n",
"4 21 a brilliant group of student becom cardcount e..."
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"id": "yBRtdhth6JDE"
},
"outputs": [],
"source": [
"#using tfidf\n",
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"id": "D_ut5aDyZfm-"
},
"outputs": [],
"source": [
"t_vectorizer = TfidfVectorizer(max_features=20000)\n",
"X= t_vectorizer.fit_transform(new_df['org_new'])\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 17,
"status": "ok",
"timestamp": 1676699370133,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "vPw2Mv6KWfi8",
"outputId": "3ef9fa2c-67bf-4674-e24b-b70fe283e15c"
},
"outputs": [
{
"data": {
"text/plain": [
"(7770, 20000)"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "su2EnbCh6UKQ"
},
"source": [
"Here we have used Tf-idf vectorization beacause it takes into account the importance of each word in a document. TF-IDF also assigns higher values to rare words that are unique to a particular document, making them more important in the representation."
]
},
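{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the weighting described above, using a toy corpus that is not part of the Netflix data. The names `docs`, `toy_vectorizer` and `toy_matrix` are hypothetical and exist only for this illustration. In the first document the term `robot` appears in no other document, while `world` appears in all three, so `robot` should receive the larger weight.\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# Toy corpus (hypothetical): 'robot' occurs in one document, 'world' in all three\n",
"docs = ['robot world', 'student world', 'soldier world']\n",
"toy_vectorizer = TfidfVectorizer()\n",
"toy_matrix = toy_vectorizer.fit_transform(docs).toarray()\n",
"\n",
"vocab = toy_vectorizer.vocabulary_  # maps each term to its column index\n",
"rare_weight = toy_matrix[0, vocab['robot']]\n",
"common_weight = toy_matrix[0, vocab['world']]\n",
"\n",
"# The rare term outweighs the term shared by every document\n",
"print(rare_weight > common_weight)\n",
"```\n",
"\n",
"Both terms occur once in the first document, so the difference in weight comes entirely from the inverse document frequency."
]
},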
{
"cell_type": "markdown",
"metadata": {
"id": "BqgSdoDiEjsP"
},
"source": [
"### **Dimensionality Reduction using PCA**"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 799598,
"status": "ok",
"timestamp": 1676700322087,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "DIpLz1k_Ea2g",
"outputId": "3f30dfa4-cfd4-4c95-baf6-5a276d5b8cf0"
},
"outputs": [
{
"data": {
"text/plain": [
"PCA()"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"pca = PCA()\n",
"pca.fit(X.toarray())"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 548
},
"executionInfo": {
"elapsed": 728,
"status": "ok",
"timestamp": 1676700345325,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "j5RZHG93Ea5t",
"outputId": "4def5484-d999-48e1-fed7-6b0857bd7075"
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Lets plot explained var v/s comp to check how many components to be considered.\n",
"plt.figure(figsize=(14,5), dpi=120)\n",
"plt.plot(np.cumsum(pca.explained_variance_ratio_))\n",
"plt.xlabel('number of components')\n",
"plt.ylabel('cumulative explained variance')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RGz3cOZmFW2e"
},
"source": [
"* We can see from the above plot almost 95% of the variance can be explained by 5000 components.\n",
"* Since choosing 5000 could be tricky we will set the value to be 95% in sklearn."
]
},
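{
"cell_type": "markdown",
"metadata": {},
"source": [
"The variance-fraction option mentioned above can be sketched as follows. This is an illustration on synthetic data, not the notebook's TF-IDF matrix; `X_demo`, `pca_demo` and `X_demo_reduced` are hypothetical names. It assumes the documented scikit-learn behaviour that a float `n_components` between 0 and 1 (with the full SVD solver) keeps the fewest components whose cumulative explained variance exceeds that fraction.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Synthetic stand-in for a dense feature matrix (hypothetical data)\n",
"rng = np.random.default_rng(0)\n",
"X_demo = rng.normal(size=(200, 50))\n",
"\n",
"# A float in (0, 1) is read as the target fraction of explained variance\n",
"pca_demo = PCA(n_components=0.95, svd_solver='full')\n",
"X_demo_reduced = pca_demo.fit_transform(X_demo)\n",
"\n",
"# Number of components kept and the variance they explain together\n",
"print(X_demo_reduced.shape[1], pca_demo.explained_variance_ratio_.sum())\n",
"```\n",
"\n",
"Passing a fraction avoids hard-coding a component count that would need to be re-read from the plot whenever the vocabulary or the data changes."
]
},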
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 834096,
"status": "ok",
"timestamp": 1676701222021,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "c25x73lKEa88",
"outputId": "46ee11ae-58da-460b-9f69-bd8834b54736"
},
"outputs": [
{
"data": {
"text/plain": [
"(7770, 5000)"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pca_tuned = PCA(n_components=5000)\n",
"pca_tuned.fit(X.toarray())\n",
"X_transformed = pca_tuned.transform(X.toarray())\n",
"X_transformed.shape"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 4,
"status": "ok",
"timestamp": 1676701273365,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "FgFTGWv5Ea_5",
"outputId": "8ec504a8-0b1c-4571-85bf-2fb48d871d82"
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1.16827901e-01, -3.10369903e-02, -3.41246062e-03, ...,\n",
" -1.79854520e-03, -7.63243167e-03, 8.98526686e-03],\n",
" [-4.36200516e-02, -3.76015060e-02, 2.87479309e-02, ...,\n",
" 1.66970683e-03, 1.01036253e-02, 9.25303735e-03],\n",
" [-5.64130207e-02, -7.12843503e-02, 8.35385566e-03, ...,\n",
" -4.06459845e-04, -2.25336637e-03, -9.23281249e-03],\n",
" ...,\n",
" [-5.48001710e-02, 1.94571030e-01, 9.70241682e-02, ...,\n",
" -9.47824400e-03, -1.31496047e-02, 8.57450338e-03],\n",
" [ 1.03010848e-01, 1.85228855e-04, 8.46923825e-03, ...,\n",
" 6.55925665e-03, 1.35469571e-02, 3.65357917e-03],\n",
" [-3.48841516e-02, 3.90448132e-01, 8.66812216e-02, ...,\n",
" 5.52689990e-03, -4.31618769e-03, -5.86522660e-03]])"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_transformed"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eRyf-cTqG-Cf"
},
"source": [
"Above we have used Principal component analysis which is one of the dimensionality reduction technique. We have used it in order to capture the maximum variance of our data into small number of features."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VfCC591jGiD4"
},
"source": [
"# **ML Model Implementation**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PhtGMjSh_ytU"
},
"source": [
"## **Recommender System**"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"id": "38hPncgI_H9D"
},
"outputs": [],
"source": [
"#removing stopwords\n",
"tfidf = TfidfVectorizer(stop_words='english')"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"executionInfo": {
"elapsed": 473,
"status": "ok",
"timestamp": 1676714753548,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "5onYUbWco2YB",
"outputId": "776ec8fb-2b77-4ed4-ece6-29a2284ec9c6"
},
"outputs": [
{
"data": {
"text/plain": [
"(7770, 41973)"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#get the tf-idf scores\n",
"#create TF-IDF matrix by fitting and transforming the data\n",
"tfidf_matrix = tfidf.fit_transform(data['org_new'])\n",
"\n",
"#shape of tfidf_matrix\n",
"tfidf_matrix.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mLGj0IEqBh0o"
},
"source": [
"We'll use cosine similarity over tf-idf because-:\n",
"* Cosine similarity handles high dimensional sparse data better. \n",
"\n",
"* Cosine similarity captures the meaning of the text better than tf-idf. For example, if two items contain similar words but in different orders, cosine similarity would still consider them similar, while tf-idf may not. This is because tf-idf only considers the frequency of words in a document and not their order or meaning."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kOgz-J28Bo7k"
},
"source": [
"### **Using Cosine Similarity**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3-1DxvwuSxll"
},
"source": [
"Cosine similarity is a measure of similarity between two non-zero vectors in a multidimensional space. It measures the cosine of the angle between the two vectors, which ranges from -1 (opposite direction) to 1 (same direction), with 0 indicating orthogonality (the vectors are perpendicular to each other).\n",
"\n",
"In this project we have used cosine similarity which is used to determine how similar two documents or pieces of text are. We represent the documents as vectors in a high-dimensional space, where each dimension represents a word or term in the corpus. We can then calculate the cosine similarity between the vectors to determine how similar the documents are based on their word usage."
]
},
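{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the definition concrete, here is a small sketch that computes cosine similarity by hand, cos(theta) = (u . v) / (||u|| ||v||), and compares it with scikit-learn's `cosine_similarity`. The helper `cosine` and the vectors `a`, `b` and `c` are hypothetical and chosen only for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"def cosine(u, v):\n",
"    # Dot product divided by the product of the two vector lengths\n",
"    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))\n",
"\n",
"a = np.array([1.0, 2.0, 0.0])\n",
"b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length\n",
"c = np.array([0.0, 0.0, 3.0])  # shares no non-zero dimension with a\n",
"\n",
"print(cosine(a, b), cosine(a, c))\n",
"\n",
"# The manual formula agrees with scikit-learn's implementation\n",
"print(np.isclose(cosine(a, b), cosine_similarity([a], [b])[0, 0]))\n",
"```\n",
"\n",
"Vectors pointing the same way score 1 regardless of length, and vectors with no shared non-zero dimension score 0, which is the situation for two descriptions that have no terms in common."
]
},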
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"id": "r522RJBlA8i6"
},
"outputs": [],
"source": [
"from sklearn.metrics.pairwise import cosine_similarity\n",
"cosine_sim = cosine_similarity(tfidf_matrix)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"id": "DT0i29gBD23p"
},
"outputs": [],
"source": [
"programme_list=new_df['title'].to_list()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"id": "F-69HjWWD29i"
},
"outputs": [],
"source": [
"def recommend(title, cosine_similarity=cosine_sim):\n",
" index = programme_list.index(title) #finds the index of the input title in the programme_list.\n",
" sim_score = list(enumerate(cosine_sim[index])) #creates a list of tuples containing the similarity score and index between the input title and all other programmes in the dataset.\n",
" \n",
" #position 0 is the movie itself, thus exclude\n",
" sim_score = sorted(sim_score, key= lambda x: x[1], reverse=True)[1:11] #sorts the list of tuples by similarity score in descending order.\n",
" recommend_index = [i[0] for i in sim_score]\n",
" rec_movie = new_df['title'].iloc[recommend_index]\n",
" rec_score = [round(i[1],4) for i in sim_score]\n",
" rec_table = pd.DataFrame(list(zip(rec_movie,rec_score)), columns=['Recommended movie','Similarity(0-1)'])\n",
" return rec_table\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1XrNWkiTUDUf"
},
"source": [
"This function calculates the cosine similarity scores between the input title and all other titles in the dataset, sorts them in descending order, and returns the top 10 movies with the highest similarity scores as recommendations."
]
},
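{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ranking step can also be sketched with NumPy on a toy similarity matrix. This is an alternative illustration, not the function used in this notebook; `top_k_similar` and `toy_sim` are hypothetical names. Instead of assuming the query sits at position 0 after sorting, it removes the query by its index.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def top_k_similar(sim_matrix, index, k=3):\n",
"    # Indices sorted by descending similarity to the query row\n",
"    order = np.argsort(sim_matrix[index])[::-1]\n",
"    # Drop the query item itself, then keep the k best matches\n",
"    order = order[order != index]\n",
"    return order[:k].tolist()\n",
"\n",
"# Hypothetical 4x4 similarity matrix (symmetric, ones on the diagonal)\n",
"toy_sim = np.array([\n",
"    [1.0, 0.2, 0.7, 0.1],\n",
"    [0.2, 1.0, 0.3, 0.9],\n",
"    [0.7, 0.3, 1.0, 0.4],\n",
"    [0.1, 0.9, 0.4, 1.0],\n",
"])\n",
"\n",
"print(top_k_similar(toy_sim, 0, k=2))\n",
"```\n",
"\n",
"Filtering by index may be more robust when two titles have identical descriptions: both then score 1.0 against the query, and slicing off the first sorted position could discard the duplicate rather than the query itself."
]
},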
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
},
"executionInfo": {
"elapsed": 747,
"status": "ok",
"timestamp": 1676720468181,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "gt186LvoD3EG",
"outputId": "63b1900a-4776-4211-eb9f-899e6b0d1b38"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Recommended movie
\n",
"
Similarity(0-1)
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Indiana Jones and the Raiders of the Lost Ark
\n",
"
0.3160
\n",
"
\n",
"
\n",
"
1
\n",
"
Indiana Jones and the Kingdom of the Crystal S...
\n",
"
0.1974
\n",
"
\n",
"
\n",
"
2
\n",
"
Indiana Jones and the Temple of Doom
\n",
"
0.1949
\n",
"
\n",
"
\n",
"
3
\n",
"
Monty Python and the Holy Grail
\n",
"
0.1294
\n",
"
\n",
"
\n",
"
4
\n",
"
Lincoln
\n",
"
0.1252
\n",
"
\n",
"
\n",
"
5
\n",
"
A Bridge Too Far
\n",
"
0.1225
\n",
"
\n",
"
\n",
"
6
\n",
"
The Adventures of Tintin
\n",
"
0.1087
\n",
"
\n",
"
\n",
"
7
\n",
"
The Battle of Midway
\n",
"
0.1073
\n",
"
\n",
"
\n",
"
8
\n",
"
Pawn Stars
\n",
"
0.1060
\n",
"
\n",
"
\n",
"
9
\n",
"
Patriot Games
\n",
"
0.1023
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Recommended movie Similarity(0-1)\n",
"0 Indiana Jones and the Raiders of the Lost Ark 0.3160\n",
"1 Indiana Jones and the Kingdom of the Crystal S... 0.1974\n",
"2 Indiana Jones and the Temple of Doom 0.1949\n",
"3 Monty Python and the Holy Grail 0.1294\n",
"4 Lincoln 0.1252\n",
"5 A Bridge Too Far 0.1225\n",
"6 The Adventures of Tintin 0.1087\n",
"7 The Battle of Midway 0.1073\n",
"8 Pawn Stars 0.1060\n",
"9 Patriot Games 0.1023"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recommend(\"Indiana Jones and the Last Crusade\")"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
},
"executionInfo": {
"elapsed": 573,
"status": "ok",
"timestamp": 1676720472920,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "01qpvvoZD3KD",
"outputId": "e656c77a-d864-42ac-db30-fbb846435f9c"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Recommended movie
\n",
"
Similarity(0-1)
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Poshter Girl
\n",
"
0.1284
\n",
"
\n",
"
\n",
"
1
\n",
"
Agent Raghav
\n",
"
0.1200
\n",
"
\n",
"
\n",
"
2
\n",
"
Anjaan: Rural Myths
\n",
"
0.1077
\n",
"
\n",
"
\n",
"
3
\n",
"
Bard of Blood
\n",
"
0.0991
\n",
"
\n",
"
\n",
"
4
\n",
"
Gunjan Saxena: The Kargil Girl
\n",
"
0.0917
\n",
"
\n",
"
\n",
"
5
\n",
"
Manusangada
\n",
"
0.0897
\n",
"
\n",
"
\n",
"
6
\n",
"
Fear Files... Har Mod Pe Darr
\n",
"
0.0866
\n",
"
\n",
"
\n",
"
7
\n",
"
Battlefield Recovery
\n",
"
0.0853
\n",
"
\n",
"
\n",
"
8
\n",
"
Back with the Ex
\n",
"
0.0844
\n",
"
\n",
"
\n",
"
9
\n",
"
Sacred Games
\n",
"
0.0841
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Recommended movie Similarity(0-1)\n",
"0 Poshter Girl 0.1284\n",
"1 Agent Raghav 0.1200\n",
"2 Anjaan: Rural Myths 0.1077\n",
"3 Bard of Blood 0.0991\n",
"4 Gunjan Saxena: The Kargil Girl 0.0917\n",
"5 Manusangada 0.0897\n",
"6 Fear Files... Har Mod Pe Darr 0.0866\n",
"7 Battlefield Recovery 0.0853\n",
"8 Back with the Ex 0.0844\n",
"9 Sacred Games 0.0841"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recommend('Betaal')"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
},
"executionInfo": {
"elapsed": 14,
"status": "ok",
"timestamp": 1676720475568,
"user": {
"displayName": "Harsh Jain",
"userId": "02354714666944083536"
},
"user_tz": -330
},
"id": "j45r5r7X_8Ll",
"outputId": "b71a924c-4f1b-4c01-f412-5af2e14358b6"
},
"outputs": [
{
"data": {
"text/html": [
"