Spaces: diivien/Music-Popularity-Prediction


diivien committed on Apr 27, 2023

Commit f17df96

0 Parent(s):

Initial commit


Files changed (7)

.gitignore +5 -0
Data Cleaning.ipynb +1564 -0
Exploratory Data Analysis.ipynb +0 -0
Model Building.ipynb +0 -0
README.md +79 -0
cleaned_dataset.csv +0 -0
dataset.csv +0 -0

.gitignore ADDED

	@@ -0,0 +1,5 @@

+*.txt
+.venv/
+.ipynb_checkpoints/
+catboost_info/
+my_study.db
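
The Data Cleaning notebook added in this commit drops three columns (`Unnamed: 0`, `track_id`, `album_name`), drops rows with null values, and then deals with rows that share the same `artists` and `track_name`. The following is a minimal sketch of those three steps as one function, run on a small made-up frame rather than `dataset.csv`. The function name `clean_tracks` and the choice to keep the first copy of each duplicate (`keep="first"`) are assumptions for illustration; the notebook's own deduplication code is not shown in this excerpt and may differ.

```python
import pandas as pd


def clean_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the notebook's cleaning steps."""
    # Drop identifier-like columns (names taken from the notebook);
    # errors="ignore" lets the function run on frames that lack them.
    df = df.drop(columns=["Unnamed: 0", "track_id", "album_name"], errors="ignore")
    # Drop every row that has at least one missing value.
    df = df.dropna()
    # Assumption: among rows with the same artists and track_name,
    # keep the first occurrence and discard the rest.
    df = df.drop_duplicates(subset=["artists", "track_name"], keep="first")
    return df


# Toy data (invented for this example, not taken from dataset.csv).
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2, 3],
        "track_id": ["a", "b", "c", "d"],
        "artists": ["Jason Mraz", "Jason Mraz", None, "Gen Hoshino"],
        "album_name": ["w", "x", "y", "z"],
        "track_name": ["I'm Yours", "I'm Yours", "Untitled", "Comedy"],
        "popularity": [80, 75, 10, 73],
    }
)

cleaned = clean_tracks(toy)
print(cleaned)
```

Keeping the first occurrence is only one option. Because duplicated tracks in the dataset can carry different `popularity` values, keeping the highest-popularity copy or averaging over copies would be equally defensible, and each choice changes the target the model is trained on.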

Data Cleaning.ipynb ADDED

	@@ -0,0 +1,1564 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "1de28f74",
+   "metadata": {},
+   "source": [
+    "# Data Cleaning"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "bc4c415f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "6455bf8f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = pd.read_csv('dataset.csv')\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "1c2440e4",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',\n",
+      "       'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',\n",
+      "       'key', 'loudness', 'mode', 'speechiness', 'acousticness',\n",
+      "       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',\n",
+      "       'track_genre'],\n",
+      "      dtype='object')\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Unnamed: 0</th>\n",
+       "      <th>track_id</th>\n",
+       "      <th>artists</th>\n",
+       "      <th>album_name</th>\n",
+       "      <th>track_name</th>\n",
+       "      <th>popularity</th>\n",
+       "      <th>duration_ms</th>\n",
+       "      <th>explicit</th>\n",
+       "      <th>danceability</th>\n",
+       "      <th>energy</th>\n",
+       "      <th>...</th>\n",
+       "      <th>loudness</th>\n",
+       "      <th>mode</th>\n",
+       "      <th>speechiness</th>\n",
+       "      <th>acousticness</th>\n",
+       "      <th>instrumentalness</th>\n",
+       "      <th>liveness</th>\n",
+       "      <th>valence</th>\n",
+       "      <th>tempo</th>\n",
+       "      <th>time_signature</th>\n",
+       "      <th>track_genre</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>0</td>\n",
+       "      <td>5SuOikwiRyPMVoIQDJUgSV</td>\n",
+       "      <td>Gen Hoshino</td>\n",
+       "      <td>Comedy</td>\n",
+       "      <td>Comedy</td>\n",
+       "      <td>73</td>\n",
+       "      <td>230666</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.676</td>\n",
+       "      <td>0.4610</td>\n",
+       "      <td>...</td>\n",
+       "      <td>-6.746</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.1430</td>\n",
+       "      <td>0.0322</td>\n",
+       "      <td>0.000001</td>\n",
+       "      <td>0.3580</td>\n",
+       "      <td>0.715</td>\n",
+       "      <td>87.917</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>1</td>\n",
+       "      <td>4qPNDBW1i3p13qLCt0Ki3A</td>\n",
+       "      <td>Ben Woodward</td>\n",
+       "      <td>Ghost (Acoustic)</td>\n",
+       "      <td>Ghost - Acoustic</td>\n",
+       "      <td>55</td>\n",
+       "      <td>149610</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.420</td>\n",
+       "      <td>0.1660</td>\n",
+       "      <td>...</td>\n",
+       "      <td>-17.235</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0763</td>\n",
+       "      <td>0.9240</td>\n",
+       "      <td>0.000006</td>\n",
+       "      <td>0.1010</td>\n",
+       "      <td>0.267</td>\n",
+       "      <td>77.489</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>2</td>\n",
+       "      <td>1iJBSr7s7jYXzM8EGcbK5b</td>\n",
+       "      <td>Ingrid Michaelson;ZAYN</td>\n",
+       "      <td>To Begin Again</td>\n",
+       "      <td>To Begin Again</td>\n",
+       "      <td>57</td>\n",
+       "      <td>210826</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.438</td>\n",
+       "      <td>0.3590</td>\n",
+       "      <td>...</td>\n",
+       "      <td>-9.734</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0557</td>\n",
+       "      <td>0.2100</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1170</td>\n",
+       "      <td>0.120</td>\n",
+       "      <td>76.332</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>3</td>\n",
+       "      <td>6lfxq3CG4xtTiEg7opyCyx</td>\n",
+       "      <td>Kina Grannis</td>\n",
+       "      <td>Crazy Rich Asians (Original Motion Picture Sou...</td>\n",
+       "      <td>Can't Help Falling In Love</td>\n",
+       "      <td>71</td>\n",
+       "      <td>201933</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.266</td>\n",
+       "      <td>0.0596</td>\n",
+       "      <td>...</td>\n",
+       "      <td>-18.515</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0363</td>\n",
+       "      <td>0.9050</td>\n",
+       "      <td>0.000071</td>\n",
+       "      <td>0.1320</td>\n",
+       "      <td>0.143</td>\n",
+       "      <td>181.740</td>\n",
+       "      <td>3</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>4</td>\n",
+       "      <td>5vjLSffimiIP26QG5WcN2K</td>\n",
+       "      <td>Chord Overstreet</td>\n",
+       "      <td>Hold On</td>\n",
+       "      <td>Hold On</td>\n",
+       "      <td>82</td>\n",
+       "      <td>198853</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.618</td>\n",
+       "      <td>0.4430</td>\n",
+       "      <td>...</td>\n",
+       "      <td>-9.681</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0526</td>\n",
+       "      <td>0.4690</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.0829</td>\n",
+       "      <td>0.167</td>\n",
+       "      <td>119.949</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>5 rows × 21 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   Unnamed: 0                track_id                 artists  \\\n",
+       "0           0  5SuOikwiRyPMVoIQDJUgSV             Gen Hoshino   \n",
+       "1           1  4qPNDBW1i3p13qLCt0Ki3A            Ben Woodward   \n",
+       "2           2  1iJBSr7s7jYXzM8EGcbK5b  Ingrid Michaelson;ZAYN   \n",
+       "3           3  6lfxq3CG4xtTiEg7opyCyx            Kina Grannis   \n",
+       "4           4  5vjLSffimiIP26QG5WcN2K        Chord Overstreet   \n",
+       "\n",
+       "                                          album_name  \\\n",
+       "0                                             Comedy   \n",
+       "1                                   Ghost (Acoustic)   \n",
+       "2                                     To Begin Again   \n",
+       "3  Crazy Rich Asians (Original Motion Picture Sou...   \n",
+       "4                                            Hold On   \n",
+       "\n",
+       "                   track_name  popularity  duration_ms  explicit  \\\n",
+       "0                      Comedy          73       230666     False   \n",
+       "1            Ghost - Acoustic          55       149610     False   \n",
+       "2              To Begin Again          57       210826     False   \n",
+       "3  Can't Help Falling In Love          71       201933     False   \n",
+       "4                     Hold On          82       198853     False   \n",
+       "\n",
+       "   danceability  energy  ...  loudness  mode  speechiness  acousticness  \\\n",
+       "0         0.676  0.4610  ...    -6.746     0       0.1430        0.0322   \n",
+       "1         0.420  0.1660  ...   -17.235     1       0.0763        0.9240   \n",
+       "2         0.438  0.3590  ...    -9.734     1       0.0557        0.2100   \n",
+       "3         0.266  0.0596  ...   -18.515     1       0.0363        0.9050   \n",
+       "4         0.618  0.4430  ...    -9.681     1       0.0526        0.4690   \n",
+       "\n",
+       "   instrumentalness  liveness  valence    tempo  time_signature  track_genre  \n",
+       "0          0.000001    0.3580    0.715   87.917               4     acoustic  \n",
+       "1          0.000006    0.1010    0.267   77.489               4     acoustic  \n",
+       "2          0.000000    0.1170    0.120   76.332               4     acoustic  \n",
+       "3          0.000071    0.1320    0.143  181.740               3     acoustic  \n",
+       "4          0.000000    0.0829    0.167  119.949               4     acoustic  \n",
+       "\n",
+       "[5 rows x 21 columns]"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "print(df.columns)\n",
+    "df.head()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "f1a88b42",
+   "metadata": {},
+   "source": [
+    "### Remove unique columns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "ece13796",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = df.drop(['Unnamed: 0','track_id', 'album_name'],axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "060fbd33",
+   "metadata": {
+    "scrolled": false
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>artists</th>\n",
+       "      <th>track_name</th>\n",
+       "      <th>popularity</th>\n",
+       "      <th>duration_ms</th>\n",
+       "      <th>explicit</th>\n",
+       "      <th>danceability</th>\n",
+       "      <th>energy</th>\n",
+       "      <th>key</th>\n",
+       "      <th>loudness</th>\n",
+       "      <th>mode</th>\n",
+       "      <th>speechiness</th>\n",
+       "      <th>acousticness</th>\n",
+       "      <th>instrumentalness</th>\n",
+       "      <th>liveness</th>\n",
+       "      <th>valence</th>\n",
+       "      <th>tempo</th>\n",
+       "      <th>time_signature</th>\n",
+       "      <th>track_genre</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Gen Hoshino</td>\n",
+       "      <td>Comedy</td>\n",
+       "      <td>73</td>\n",
+       "      <td>230666</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.676</td>\n",
+       "      <td>0.4610</td>\n",
+       "      <td>1</td>\n",
+       "      <td>-6.746</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.1430</td>\n",
+       "      <td>0.0322</td>\n",
+       "      <td>0.000001</td>\n",
+       "      <td>0.3580</td>\n",
+       "      <td>0.7150</td>\n",
+       "      <td>87.917</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>Ben Woodward</td>\n",
+       "      <td>Ghost - Acoustic</td>\n",
+       "      <td>55</td>\n",
+       "      <td>149610</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.420</td>\n",
+       "      <td>0.1660</td>\n",
+       "      <td>1</td>\n",
+       "      <td>-17.235</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0763</td>\n",
+       "      <td>0.9240</td>\n",
+       "      <td>0.000006</td>\n",
+       "      <td>0.1010</td>\n",
+       "      <td>0.2670</td>\n",
+       "      <td>77.489</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>Ingrid Michaelson;ZAYN</td>\n",
+       "      <td>To Begin Again</td>\n",
+       "      <td>57</td>\n",
+       "      <td>210826</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.438</td>\n",
+       "      <td>0.3590</td>\n",
+       "      <td>0</td>\n",
+       "      <td>-9.734</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0557</td>\n",
+       "      <td>0.2100</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1170</td>\n",
+       "      <td>0.1200</td>\n",
+       "      <td>76.332</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Kina Grannis</td>\n",
+       "      <td>Can't Help Falling In Love</td>\n",
+       "      <td>71</td>\n",
+       "      <td>201933</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.266</td>\n",
+       "      <td>0.0596</td>\n",
+       "      <td>0</td>\n",
+       "      <td>-18.515</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0363</td>\n",
+       "      <td>0.9050</td>\n",
+       "      <td>0.000071</td>\n",
+       "      <td>0.1320</td>\n",
+       "      <td>0.1430</td>\n",
+       "      <td>181.740</td>\n",
+       "      <td>3</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>Chord Overstreet</td>\n",
+       "      <td>Hold On</td>\n",
+       "      <td>82</td>\n",
+       "      <td>198853</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.618</td>\n",
+       "      <td>0.4430</td>\n",
+       "      <td>2</td>\n",
+       "      <td>-9.681</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0526</td>\n",
+       "      <td>0.4690</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.0829</td>\n",
+       "      <td>0.1670</td>\n",
+       "      <td>119.949</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>Tyrone Wells</td>\n",
+       "      <td>Days I Will Remember</td>\n",
+       "      <td>58</td>\n",
+       "      <td>214240</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.688</td>\n",
+       "      <td>0.4810</td>\n",
+       "      <td>6</td>\n",
+       "      <td>-8.807</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.1050</td>\n",
+       "      <td>0.2890</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1890</td>\n",
+       "      <td>0.6660</td>\n",
+       "      <td>98.017</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>A Great Big World;Christina Aguilera</td>\n",
+       "      <td>Say Something</td>\n",
+       "      <td>74</td>\n",
+       "      <td>229400</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.407</td>\n",
+       "      <td>0.1470</td>\n",
+       "      <td>2</td>\n",
+       "      <td>-8.822</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0355</td>\n",
+       "      <td>0.8570</td>\n",
+       "      <td>0.000003</td>\n",
+       "      <td>0.0913</td>\n",
+       "      <td>0.0765</td>\n",
+       "      <td>141.284</td>\n",
+       "      <td>3</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>Jason Mraz</td>\n",
+       "      <td>I'm Yours</td>\n",
+       "      <td>80</td>\n",
+       "      <td>242946</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.703</td>\n",
+       "      <td>0.4440</td>\n",
+       "      <td>11</td>\n",
+       "      <td>-9.331</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0417</td>\n",
+       "      <td>0.5590</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.0973</td>\n",
+       "      <td>0.7120</td>\n",
+       "      <td>150.960</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>Jason Mraz;Colbie Caillat</td>\n",
+       "      <td>Lucky</td>\n",
+       "      <td>74</td>\n",
+       "      <td>189613</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.625</td>\n",
+       "      <td>0.4140</td>\n",
+       "      <td>0</td>\n",
+       "      <td>-8.700</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0369</td>\n",
+       "      <td>0.2940</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1510</td>\n",
+       "      <td>0.6690</td>\n",
+       "      <td>130.088</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>Ross Copperman</td>\n",
+       "      <td>Hunger</td>\n",
+       "      <td>56</td>\n",
+       "      <td>205594</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.442</td>\n",
+       "      <td>0.6320</td>\n",
+       "      <td>1</td>\n",
+       "      <td>-6.770</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0295</td>\n",
+       "      <td>0.4260</td>\n",
+       "      <td>0.004190</td>\n",
+       "      <td>0.0735</td>\n",
+       "      <td>0.1960</td>\n",
+       "      <td>78.899</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                artists                  track_name  \\\n",
+       "0                           Gen Hoshino                      Comedy   \n",
+       "1                          Ben Woodward            Ghost - Acoustic   \n",
+       "2                Ingrid Michaelson;ZAYN              To Begin Again   \n",
+       "3                          Kina Grannis  Can't Help Falling In Love   \n",
+       "4                      Chord Overstreet                     Hold On   \n",
+       "5                          Tyrone Wells        Days I Will Remember   \n",
+       "6  A Great Big World;Christina Aguilera               Say Something   \n",
+       "7                            Jason Mraz                   I'm Yours   \n",
+       "8             Jason Mraz;Colbie Caillat                       Lucky   \n",
+       "9                        Ross Copperman                      Hunger   \n",
+       "\n",
+       "   popularity  duration_ms  explicit  danceability  energy  key  loudness  \\\n",
+       "0          73       230666     False         0.676  0.4610    1    -6.746   \n",
+       "1          55       149610     False         0.420  0.1660    1   -17.235   \n",
+       "2          57       210826     False         0.438  0.3590    0    -9.734   \n",
+       "3          71       201933     False         0.266  0.0596    0   -18.515   \n",
+       "4          82       198853     False         0.618  0.4430    2    -9.681   \n",
+       "5          58       214240     False         0.688  0.4810    6    -8.807   \n",
+       "6          74       229400     False         0.407  0.1470    2    -8.822   \n",
+       "7          80       242946     False         0.703  0.4440   11    -9.331   \n",
+       "8          74       189613     False         0.625  0.4140    0    -8.700   \n",
+       "9          56       205594     False         0.442  0.6320    1    -6.770   \n",
+       "\n",
+       "   mode  speechiness  acousticness  instrumentalness  liveness  valence  \\\n",
+       "0     0       0.1430        0.0322          0.000001    0.3580   0.7150   \n",
+       "1     1       0.0763        0.9240          0.000006    0.1010   0.2670   \n",
+       "2     1       0.0557        0.2100          0.000000    0.1170   0.1200   \n",
+       "3     1       0.0363        0.9050          0.000071    0.1320   0.1430   \n",
+       "4     1       0.0526        0.4690          0.000000    0.0829   0.1670   \n",
+       "5     1       0.1050        0.2890          0.000000    0.1890   0.6660   \n",
+       "6     1       0.0355        0.8570          0.000003    0.0913   0.0765   \n",
+       "7     1       0.0417        0.5590          0.000000    0.0973   0.7120   \n",
+       "8     1       0.0369        0.2940          0.000000    0.1510   0.6690   \n",
+       "9     1       0.0295        0.4260          0.004190    0.0735   0.1960   \n",
+       "\n",
+       "     tempo  time_signature track_genre  \n",
+       "0   87.917               4    acoustic  \n",
+       "1   77.489               4    acoustic  \n",
+       "2   76.332               4    acoustic  \n",
+       "3  181.740               3    acoustic  \n",
+       "4  119.949               4    acoustic  \n",
+       "5   98.017               4    acoustic  \n",
+       "6  141.284               3    acoustic  \n",
+       "7  150.960               4    acoustic  \n",
+       "8  130.088               4    acoustic  \n",
+       "9   78.899               4    acoustic  "
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.head(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "d801195c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "artists              object\n",
+       "track_name           object\n",
+       "popularity            int64\n",
+       "duration_ms           int64\n",
+       "explicit               bool\n",
+       "danceability        float64\n",
+       "energy              float64\n",
+       "key                   int64\n",
+       "loudness            float64\n",
+       "mode                  int64\n",
+       "speechiness         float64\n",
+       "acousticness        float64\n",
+       "instrumentalness    float64\n",
+       "liveness            float64\n",
+       "valence             float64\n",
+       "tempo               float64\n",
+       "time_signature        int64\n",
+       "track_genre          object\n",
+       "dtype: object"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.dtypes"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "aeb25f1a",
+   "metadata": {},
+   "source": [
+    "### Drop Null Values"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "ce3c3319",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "artists             1\n",
+      "track_name          1\n",
+      "popularity          0\n",
+      "duration_ms         0\n",
+      "explicit            0\n",
+      "danceability        0\n",
+      "energy              0\n",
+      "key                 0\n",
+      "loudness            0\n",
+      "mode                0\n",
+      "speechiness         0\n",
+      "acousticness        0\n",
+      "instrumentalness    0\n",
+      "liveness            0\n",
+      "valence             0\n",
+      "tempo               0\n",
+      "time_signature      0\n",
+      "track_genre         0\n",
+      "dtype: int64\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(df.isna().sum())\n",
+    "df=df.dropna()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "de7960de",
+   "metadata": {},
+   "source": [
+    "### Drop Duplicated Rows (Same artists and track_name)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "eb46cc03",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>artists</th>\n",
+       "      <th>track_name</th>\n",
+       "      <th>popularity</th>\n",
+       "      <th>duration_ms</th>\n",
+       "      <th>explicit</th>\n",
+       "      <th>danceability</th>\n",
+       "      <th>energy</th>\n",
+       "      <th>key</th>\n",
+       "      <th>loudness</th>\n",
+       "      <th>mode</th>\n",
+       "      <th>speechiness</th>\n",
+       "      <th>acousticness</th>\n",
+       "      <th>instrumentalness</th>\n",
+       "      <th>liveness</th>\n",
+       "      <th>valence</th>\n",
+       "      <th>tempo</th>\n",
+       "      <th>time_signature</th>\n",
+       "      <th>track_genre</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>18</th>\n",
+       "      <td>Jason Mraz;Colbie Caillat</td>\n",
+       "      <td>Lucky</td>\n",
+       "      <td>68</td>\n",
+       "      <td>189613</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.625</td>\n",
+       "      <td>0.414</td>\n",
+       "      <td>0</td>\n",
+       "      <td>-8.700</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0369</td>\n",
+       "      <td>0.29400</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1510</td>\n",
+       "      <td>0.6690</td>\n",
+       "      <td>130.088</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>20</th>\n",
+       "      <td>Jason Mraz</td>\n",
+       "      <td>I'm Yours</td>\n",
+       "      <td>75</td>\n",
+       "      <td>242946</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.703</td>\n",
+       "      <td>0.444</td>\n",
+       "      <td>11</td>\n",
+       "      <td>-9.331</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0417</td>\n",
+       "      <td>0.55900</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.0973</td>\n",
+       "      <td>0.7120</td>\n",
+       "      <td>150.960</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>22</th>\n",
+       "      <td>A Great Big World;Christina Aguilera</td>\n",
+       "      <td>Say Something</td>\n",
+       "      <td>70</td>\n",
+       "      <td>229400</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.407</td>\n",
+       "      <td>0.147</td>\n",
+       "      <td>2</td>\n",
+       "      <td>-8.822</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0355</td>\n",
+       "      <td>0.85700</td>\n",
+       "      <td>0.000003</td>\n",
+       "      <td>0.0913</td>\n",
+       "      <td>0.0765</td>\n",
+       "      <td>141.284</td>\n",
+       "      <td>3</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>28</th>\n",
+       "      <td>Jason Mraz</td>\n",
+       "      <td>Winter Wonderland</td>\n",
+       "      <td>0</td>\n",
+       "      <td>131760</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.620</td>\n",
+       "      <td>0.309</td>\n",
+       "      <td>5</td>\n",
+       "      <td>-9.209</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0495</td>\n",
+       "      <td>0.78800</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1460</td>\n",
+       "      <td>0.6640</td>\n",
+       "      <td>145.363</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>29</th>\n",
+       "      <td>Jason Mraz</td>\n",
+       "      <td>Winter Wonderland</td>\n",
+       "      <td>0</td>\n",
+       "      <td>131760</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.620</td>\n",
+       "      <td>0.309</td>\n",
+       "      <td>5</td>\n",
+       "      <td>-9.209</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0495</td>\n",
+       "      <td>0.78800</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1460</td>\n",
+       "      <td>0.6640</td>\n",
+       "      <td>145.363</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>113845</th>\n",
+       "      <td>Hillsong Worship;Brooke Ligertwood</td>\n",
+       "      <td>King Of Kings - Live at Hillsong Conference</td>\n",
+       "      <td>40</td>\n",
+       "      <td>291565</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.454</td>\n",
+       "      <td>0.427</td>\n",
+       "      <td>2</td>\n",
+       "      <td>-8.049</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0290</td>\n",
+       "      <td>0.02050</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.6900</td>\n",
+       "      <td>0.1840</td>\n",
+       "      <td>135.887</td>\n",
+       "      <td>4</td>\n",
+       "      <td>world-music</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>113882</th>\n",
+       "      <td>Bryan &amp; Katie Torwalt</td>\n",
+       "      <td>Good News - Live</td>\n",
+       "      <td>23</td>\n",
+       "      <td>266632</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.473</td>\n",
+       "      <td>0.474</td>\n",
+       "      <td>6</td>\n",
+       "      <td>-9.175</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0558</td>\n",
+       "      <td>0.39500</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1630</td>\n",
+       "      <td>0.2510</td>\n",
+       "      <td>140.746</td>\n",
+       "      <td>4</td>\n",
+       "      <td>world-music</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>113917</th>\n",
+       "      <td>Hillsong Worship;Mi-kaisha Rose</td>\n",
+       "      <td>Never Walk Alone - Live</td>\n",
+       "      <td>41</td>\n",
+       "      <td>348619</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.420</td>\n",
+       "      <td>0.553</td>\n",
+       "      <td>5</td>\n",
+       "      <td>-8.049</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0332</td>\n",
+       "      <td>0.14100</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1030</td>\n",
+       "      <td>0.2140</td>\n",
+       "      <td>143.804</td>\n",
+       "      <td>4</td>\n",
+       "      <td>world-music</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>113951</th>\n",
+       "      <td>Passion;Kristian Stanfill</td>\n",
+       "      <td>More Like Jesus - Live</td>\n",
+       "      <td>44</td>\n",
+       "      <td>338694</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.404</td>\n",
+       "      <td>0.676</td>\n",
+       "      <td>10</td>\n",
+       "      <td>-5.468</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0354</td>\n",
+       "      <td>0.02740</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.3520</td>\n",
+       "      <td>0.1630</td>\n",
+       "      <td>144.056</td>\n",
+       "      <td>3</td>\n",
+       "      <td>world-music</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>113991</th>\n",
+       "      <td>Chris Tomlin</td>\n",
+       "      <td>At The Cross (Love Ran Red)</td>\n",
+       "      <td>32</td>\n",
+       "      <td>250629</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.387</td>\n",
+       "      <td>0.531</td>\n",
+       "      <td>8</td>\n",
+       "      <td>-4.788</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0290</td>\n",
+       "      <td>0.00305</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.2010</td>\n",
+       "      <td>0.1530</td>\n",
+       "      <td>146.003</td>\n",
+       "      <td>4</td>\n",
+       "      <td>world-music</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>32656 rows × 18 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                     artists  \\\n",
+       "18                 Jason Mraz;Colbie Caillat   \n",
+       "20                                Jason Mraz   \n",
+       "22      A Great Big World;Christina Aguilera   \n",
+       "28                                Jason Mraz   \n",
+       "29                                Jason Mraz   \n",
+       "...                                      ...   \n",
+       "113845    Hillsong Worship;Brooke Ligertwood   \n",
+       "113882                 Bryan & Katie Torwalt   \n",
+       "113917       Hillsong Worship;Mi-kaisha Rose   \n",
+       "113951             Passion;Kristian Stanfill   \n",
+       "113991                          Chris Tomlin   \n",
+       "\n",
+       "                                         track_name  popularity  duration_ms  \\\n",
+       "18                                            Lucky          68       189613   \n",
+       "20                                        I'm Yours          75       242946   \n",
+       "22                                    Say Something          70       229400   \n",
+       "28                                Winter Wonderland           0       131760   \n",
+       "29                                Winter Wonderland           0       131760   \n",
+       "...                                             ...         ...          ...   \n",
+       "113845  King Of Kings - Live at Hillsong Conference          40       291565   \n",
+       "113882                             Good News - Live          23       266632   \n",
+       "113917                      Never Walk Alone - Live          41       348619   \n",
+       "113951                       More Like Jesus - Live          44       338694   \n",
+       "113991                  At The Cross (Love Ran Red)          32       250629   \n",
+       "\n",
+       "        explicit  danceability  energy  key  loudness  mode  speechiness  \\\n",
+       "18         False         0.625   0.414    0    -8.700     1       0.0369   \n",
+       "20         False         0.703   0.444   11    -9.331     1       0.0417   \n",
+       "22         False         0.407   0.147    2    -8.822     1       0.0355   \n",
+       "28         False         0.620   0.309    5    -9.209     1       0.0495   \n",
+       "29         False         0.620   0.309    5    -9.209     1       0.0495   \n",
+       "...          ...           ...     ...  ...       ...   ...          ...   \n",
+       "113845     False         0.454   0.427    2    -8.049     1       0.0290   \n",
+       "113882     False         0.473   0.474    6    -9.175     1       0.0558   \n",
+       "113917     False         0.420   0.553    5    -8.049     1       0.0332   \n",
+       "113951     False         0.404   0.676   10    -5.468     1       0.0354   \n",
+       "113991     False         0.387   0.531    8    -4.788     1       0.0290   \n",
+       "\n",
+       "        acousticness  instrumentalness  liveness  valence    tempo  \\\n",
+       "18           0.29400          0.000000    0.1510   0.6690  130.088   \n",
+       "20           0.55900          0.000000    0.0973   0.7120  150.960   \n",
+       "22           0.85700          0.000003    0.0913   0.0765  141.284   \n",
+       "28           0.78800          0.000000    0.1460   0.6640  145.363   \n",
+       "29           0.78800          0.000000    0.1460   0.6640  145.363   \n",
+       "...              ...               ...       ...      ...      ...   \n",
+       "113845       0.02050          0.000000    0.6900   0.1840  135.887   \n",
+       "113882       0.39500          0.000000    0.1630   0.2510  140.746   \n",
+       "113917       0.14100          0.000000    0.1030   0.2140  143.804   \n",
+       "113951       0.02740          0.000000    0.3520   0.1630  144.056   \n",
+       "113991       0.00305          0.000000    0.2010   0.1530  146.003   \n",
+       "\n",
+       "        time_signature  track_genre  \n",
+       "18                   4     acoustic  \n",
+       "20                   4     acoustic  \n",
+       "22                   3     acoustic  \n",
+       "28                   4     acoustic  \n",
+       "29                   4     acoustic  \n",
+       "...                ...          ...  \n",
+       "113845               4  world-music  \n",
+       "113882               4  world-music  \n",
+       "113917               4  world-music  \n",
+       "113951               3  world-music  \n",
+       "113991               4  world-music  \n",
+       "\n",
+       "[32656 rows x 18 columns]"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "duplicated_rows = df[df.duplicated(['artists', 'track_name'])]\n",
+    "\n",
+    "# print duplicated rows\n",
+    "duplicated_rows"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "251df65d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = df.drop_duplicates(['artists', 'track_name'], keep='first')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "d6eea5b5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "study          996\n",
+       "black-metal    991\n",
+       "comedy         987\n",
+       "heavy-metal    985\n",
+       "bluegrass      978\n",
+       "              ... \n",
+       "rock           167\n",
+       "reggae         166\n",
+       "house          134\n",
+       "indie          107\n",
+       "reggaeton       63\n",
+       "Name: track_genre, Length: 113, dtype: int64"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.shape\n",
+    "df['track_genre'].value_counts()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "363cf332",
+   "metadata": {},
+   "source": [
+    "### Drop artists and track name columns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "2f11bf72",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>popularity</th>\n",
+       "      <th>duration_ms</th>\n",
+       "      <th>explicit</th>\n",
+       "      <th>danceability</th>\n",
+       "      <th>energy</th>\n",
+       "      <th>key</th>\n",
+       "      <th>loudness</th>\n",
+       "      <th>mode</th>\n",
+       "      <th>speechiness</th>\n",
+       "      <th>acousticness</th>\n",
+       "      <th>instrumentalness</th>\n",
+       "      <th>liveness</th>\n",
+       "      <th>valence</th>\n",
+       "      <th>tempo</th>\n",
+       "      <th>time_signature</th>\n",
+       "      <th>track_genre</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>73</td>\n",
+       "      <td>230666</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.676</td>\n",
+       "      <td>0.4610</td>\n",
+       "      <td>1</td>\n",
+       "      <td>-6.746</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.1430</td>\n",
+       "      <td>0.0322</td>\n",
+       "      <td>0.000001</td>\n",
+       "      <td>0.3580</td>\n",
+       "      <td>0.715</td>\n",
+       "      <td>87.917</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>55</td>\n",
+       "      <td>149610</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.420</td>\n",
+       "      <td>0.1660</td>\n",
+       "      <td>1</td>\n",
+       "      <td>-17.235</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0763</td>\n",
+       "      <td>0.9240</td>\n",
+       "      <td>0.000006</td>\n",
+       "      <td>0.1010</td>\n",
+       "      <td>0.267</td>\n",
+       "      <td>77.489</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>57</td>\n",
+       "      <td>210826</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.438</td>\n",
+       "      <td>0.3590</td>\n",
+       "      <td>0</td>\n",
+       "      <td>-9.734</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0557</td>\n",
+       "      <td>0.2100</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1170</td>\n",
+       "      <td>0.120</td>\n",
+       "      <td>76.332</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>71</td>\n",
+       "      <td>201933</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.266</td>\n",
+       "      <td>0.0596</td>\n",
+       "      <td>0</td>\n",
+       "      <td>-18.515</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0363</td>\n",
+       "      <td>0.9050</td>\n",
+       "      <td>0.000071</td>\n",
+       "      <td>0.1320</td>\n",
+       "      <td>0.143</td>\n",
+       "      <td>181.740</td>\n",
+       "      <td>3</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>82</td>\n",
+       "      <td>198853</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.618</td>\n",
+       "      <td>0.4430</td>\n",
+       "      <td>2</td>\n",
+       "      <td>-9.681</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0526</td>\n",
+       "      <td>0.4690</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.0829</td>\n",
+       "      <td>0.167</td>\n",
+       "      <td>119.949</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   popularity  duration_ms  explicit  danceability  energy  key  loudness  \\\n",
+       "0          73       230666     False         0.676  0.4610    1    -6.746   \n",
+       "1          55       149610     False         0.420  0.1660    1   -17.235   \n",
+       "2          57       210826     False         0.438  0.3590    0    -9.734   \n",
+       "3          71       201933     False         0.266  0.0596    0   -18.515   \n",
+       "4          82       198853     False         0.618  0.4430    2    -9.681   \n",
+       "\n",
+       "   mode  speechiness  acousticness  instrumentalness  liveness  valence  \\\n",
+       "0     0       0.1430        0.0322          0.000001    0.3580    0.715   \n",
+       "1     1       0.0763        0.9240          0.000006    0.1010    0.267   \n",
+       "2     1       0.0557        0.2100          0.000000    0.1170    0.120   \n",
+       "3     1       0.0363        0.9050          0.000071    0.1320    0.143   \n",
+       "4     1       0.0526        0.4690          0.000000    0.0829    0.167   \n",
+       "\n",
+       "     tempo  time_signature track_genre  \n",
+       "0   87.917               4    acoustic  \n",
+       "1   77.489               4    acoustic  \n",
+       "2   76.332               4    acoustic  \n",
+       "3  181.740               3    acoustic  \n",
+       "4  119.949               4    acoustic  "
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df = df.drop(['artists','track_name'],axis=1)\n",
+    "df.head()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "e7d572f5",
+   "metadata": {},
+   "source": [
+    "### Drop invalid tempo and time signature according to Spotify API"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "69b1cceb",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "4    71986\n",
+       "3     6944\n",
+       "5     1488\n",
+       "1      775\n",
+       "0      150\n",
+       "Name: time_signature, dtype: int64"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df['time_signature'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "39a08b22",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Keep only rows with a time signature of 3 or more and a positive tempo\n",
+    "df = df[df['time_signature'] > 2]\n",
+    "df = df[df['tempo'] > 0]"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "0b7c8cea",
+   "metadata": {},
+   "source": [
+    "### Save the cleaned dataset into csv"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "8c064cb0",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>popularity</th>\n",
+       "      <th>duration_ms</th>\n",
+       "      <th>explicit</th>\n",
+       "      <th>danceability</th>\n",
+       "      <th>energy</th>\n",
+       "      <th>key</th>\n",
+       "      <th>loudness</th>\n",
+       "      <th>mode</th>\n",
+       "      <th>speechiness</th>\n",
+       "      <th>acousticness</th>\n",
+       "      <th>instrumentalness</th>\n",
+       "      <th>liveness</th>\n",
+       "      <th>valence</th>\n",
+       "      <th>tempo</th>\n",
+       "      <th>time_signature</th>\n",
+       "      <th>track_genre</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>73</td>\n",
+       "      <td>230666</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.676</td>\n",
+       "      <td>0.4610</td>\n",
+       "      <td>1</td>\n",
+       "      <td>-6.746</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0.1430</td>\n",
+       "      <td>0.0322</td>\n",
+       "      <td>0.000001</td>\n",
+       "      <td>0.3580</td>\n",
+       "      <td>0.715</td>\n",
+       "      <td>87.917</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>55</td>\n",
+       "      <td>149610</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.420</td>\n",
+       "      <td>0.1660</td>\n",
+       "      <td>1</td>\n",
+       "      <td>-17.235</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0763</td>\n",
+       "      <td>0.9240</td>\n",
+       "      <td>0.000006</td>\n",
+       "      <td>0.1010</td>\n",
+       "      <td>0.267</td>\n",
+       "      <td>77.489</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>57</td>\n",
+       "      <td>210826</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.438</td>\n",
+       "      <td>0.3590</td>\n",
+       "      <td>0</td>\n",
+       "      <td>-9.734</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0557</td>\n",
+       "      <td>0.2100</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.1170</td>\n",
+       "      <td>0.120</td>\n",
+       "      <td>76.332</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>71</td>\n",
+       "      <td>201933</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.266</td>\n",
+       "      <td>0.0596</td>\n",
+       "      <td>0</td>\n",
+       "      <td>-18.515</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0363</td>\n",
+       "      <td>0.9050</td>\n",
+       "      <td>0.000071</td>\n",
+       "      <td>0.1320</td>\n",
+       "      <td>0.143</td>\n",
+       "      <td>181.740</td>\n",
+       "      <td>3</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>82</td>\n",
+       "      <td>198853</td>\n",
+       "      <td>False</td>\n",
+       "      <td>0.618</td>\n",
+       "      <td>0.4430</td>\n",
+       "      <td>2</td>\n",
+       "      <td>-9.681</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0.0526</td>\n",
+       "      <td>0.4690</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.0829</td>\n",
+       "      <td>0.167</td>\n",
+       "      <td>119.949</td>\n",
+       "      <td>4</td>\n",
+       "      <td>acoustic</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   popularity  duration_ms  explicit  danceability  energy  key  loudness  \\\n",
+       "0          73       230666     False         0.676  0.4610    1    -6.746   \n",
+       "1          55       149610     False         0.420  0.1660    1   -17.235   \n",
+       "2          57       210826     False         0.438  0.3590    0    -9.734   \n",
+       "3          71       201933     False         0.266  0.0596    0   -18.515   \n",
+       "4          82       198853     False         0.618  0.4430    2    -9.681   \n",
+       "\n",
+       "   mode  speechiness  acousticness  instrumentalness  liveness  valence  \\\n",
+       "0     0       0.1430        0.0322          0.000001    0.3580    0.715   \n",
+       "1     1       0.0763        0.9240          0.000006    0.1010    0.267   \n",
+       "2     1       0.0557        0.2100          0.000000    0.1170    0.120   \n",
+       "3     1       0.0363        0.9050          0.000071    0.1320    0.143   \n",
+       "4     1       0.0526        0.4690          0.000000    0.0829    0.167   \n",
+       "\n",
+       "     tempo  time_signature track_genre  \n",
+       "0   87.917               4    acoustic  \n",
+       "1   77.489               4    acoustic  \n",
+       "2   76.332               4    acoustic  \n",
+       "3  181.740               3    acoustic  \n",
+       "4  119.949               4    acoustic  "
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "12bea66e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.to_csv(\"cleaned_dataset.csv\", index=False)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

Exploratory Data Analysis.ipynb ADDED Viewed

The diff for this file is too large to render. See raw diff

Model Building.ipynb ADDED Viewed

The diff for this file is too large to render. See raw diff

README.md ADDED Viewed

	@@ -0,0 +1,79 @@

+# Music Popularity Prediction
+This repository contains a data science project that aims to predict the popularity of music using machine learning techniques.
+## Dataset
+This project uses the [Spotify Tracks Dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset) available on Kaggle. The dataset contains information about Spotify tracks across 125 different genres. Each track comes with metadata and audio features such as popularity, explicitness, danceability, energy, key, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and time signature.
+You can download the dataset from the Kaggle website and use it to follow along with the analysis in this project.
+## Overview
+The project is a binary classification problem: the goal is to predict whether a song will be popular or not. The dataset is imbalanced, meaning that one class is significantly more common than the other.
+The project consists of three main parts: Data Cleaning, Exploratory Data Analysis, and Model Building.
+### Data Cleaning
+In the [Data Cleaning](https://github.com/diivien/Music-Popularity-Prediction/blob/master/Data%20Cleaning.ipynb) notebook, I clean and preprocess the data to prepare it for analysis. This involves several steps such as:
+- Removing identifier columns (such as the row index and `track_id`)
+- Dropping null values
+- Dropping duplicated rows (same artists and track name)
+- Dropping artists and track name columns
+- Dropping invalid tempo and time signature according to Spotify API
+- Saving the cleaned dataset into a CSV file
+To get started with the data cleaning process, you can follow the instructions in the Data Cleaning notebook. This will guide you through the steps involved in cleaning and preprocessing the data.
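The cleaning steps listed above can be sketched on a handful of in-memory rows. This is a minimal standard-library illustration, not the notebook's pandas code; the sample rows and the helper name `clean_rows` are made up for the example.

```python
# Hypothetical sample rows; the notebook itself works on dataset.csv with pandas.
rows = [
    {"artists": "A", "track_name": "Song", "tempo": 120.0, "time_signature": 4, "popularity": 50},
    {"artists": "A", "track_name": "Song", "tempo": 120.0, "time_signature": 4, "popularity": 10},  # duplicate
    {"artists": "B", "track_name": "Intro", "tempo": 0.0, "time_signature": 0, "popularity": 5},  # invalid
    {"artists": "C", "track_name": "Waltz", "tempo": 90.0, "time_signature": 3, "popularity": 30},
    {"artists": None, "track_name": "Lost", "tempo": 100.0, "time_signature": 4, "popularity": 1},  # null
]


def clean_rows(rows):
    """Apply the README's cleaning steps in order to a list of dict rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Drop rows that contain null values.
        if any(value is None for value in row.values()):
            continue
        # Drop duplicates of (artists, track_name), keeping the first occurrence.
        key = (row["artists"], row["track_name"])
        if key in seen:
            continue
        seen.add(key)
        # Drop rows with an invalid time signature or tempo.
        if row["time_signature"] <= 2 or row["tempo"] <= 0:
            continue
        # Drop the artists and track name columns.
        cleaned.append({k: v for k, v in row.items() if k not in ("artists", "track_name")})
    return cleaned


cleaned = clean_rows(rows)
```

Keeping the first occurrence mirrors the `keep='first'` argument used in the notebook's `drop_duplicates` call.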
+### Exploratory Data Analysis
+In the [Exploratory Data Analysis](https://github.com/diivien/Music-Popularity-Prediction/blob/master/Exploratory%20Data%20Analysis.ipynb) notebook, I explore the data and gain insights into the relationships between the features and the target variable. This involves generating various visualizations such as:
+- Correlation heatmaps to summarize the pairwise correlations among continuous features
+- Histograms to check the distribution of continuous features
+- Bar charts to visualize categorical features
+- Scatter plots to examine the relationships between pairs of continuous features
+- Box plots to examine the distribution of continuous features by category
+- Stacked bar charts to visualize conditional distributions
+These visualizations help me understand the data better and inform my decisions when building machine learning models.
+To get started with the exploratory data analysis process, you can follow the instructions in the Exploratory Data Analysis notebook. This will guide you through the steps involved in exploring and visualizing the data.
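As an illustration of what a correlation heatmap displays, the sketch below computes Pearson correlations for three features using the five sample rows shown by `df.head()` in the Data Cleaning notebook. The helper `pearson` is written here for the example and is not taken from the notebook; with only five rows the numbers are illustrative, not conclusions about the full dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Values copied from the first five rows of the cleaned dataset.
features = {
    "energy": [0.461, 0.166, 0.359, 0.0596, 0.443],
    "loudness": [-6.746, -17.235, -9.734, -18.515, -9.681],
    "acousticness": [0.0322, 0.924, 0.21, 0.905, 0.469],
}

# Nested dict holding the symmetric correlation matrix a heatmap would color.
names = list(features)
corr = {a: {b: pearson(features[a], features[b]) for b in names} for a in names}
```

Each cell `corr[a][b]` is the value a heatmap would map to a color, with the diagonal equal to 1.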
+### Model Building
+In the [Model Building](https://github.com/diivien/Music-Popularity-Prediction/blob/master/Model%20Building.ipynb) notebook, I build and evaluate machine learning models to predict music popularity. The models used in this analysis include Linear SVC, Random Forest Classifier, LightGBM, and CatBoost. As part of this process, I perform several preprocessing steps such as scaling the data using a MinMax scaler and encoding categorical variables using a target encoder. I also apply SMOTE-NC inside an imbalanced-learn pipeline so that oversampling is fitted on the training data only, which prevents data leakage.
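The idea behind leakage-free preprocessing can be shown with a small standard-library sketch: both the min-max scaler and the target encoder learn their statistics from the training split only and are then applied unchanged to the test split. The helper names, the toy values, and the unsmoothed mean encoding are assumptions made for this example; the notebook's own encoder may apply smoothing.

```python
def fit_min_max(values):
    """Learn min and max from training values; return a scaling function."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return lambda v: (v - lo) / span


def fit_target_encoder(categories, targets):
    """Learn the mean target per category from training data only."""
    totals, counts = {}, {}
    for category, target in zip(categories, targets):
        totals[category] = totals.get(category, 0) + target
        counts[category] = counts.get(category, 0) + 1
    global_mean = sum(targets) / len(targets)
    means = {c: totals[c] / counts[c] for c in totals}
    # Categories unseen during training fall back to the global mean.
    return lambda c: means.get(c, global_mean)


# Hypothetical toy splits.
train_tempo = [80.0, 120.0, 160.0]
train_genre = ["acoustic", "pop", "pop"]
train_popular = [0, 1, 0]
test_tempo = [100.0, 200.0]
test_genre = ["pop", "metal"]

scale = fit_min_max(train_tempo)
encode = fit_target_encoder(train_genre, train_popular)
test_features = [(scale(t), encode(g)) for t, g in zip(test_tempo, test_genre)]
```

A test value outside the training range (tempo 200 here) scales to more than 1, which is expected when the scaler never sees test data.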
+To tune the hyperparameters of the models, I use Optuna for multi-objective optimization and generate a Pareto front plot to choose the best hyperparameters.
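A Pareto front is the set of trials that no other trial beats on every objective at once. The sketch below finds it for a list of hypothetical two-objective scores, both maximized. The trial values and the function name `pareto_front` are invented for illustration and do not come from the notebook or from Optuna's API.

```python
def pareto_front(trials):
    """Return the trials not dominated by any other trial (both objectives maximized)."""
    front = []
    for i, a in enumerate(trials):
        dominated = any(
            b[0] >= a[0] and b[1] >= a[1] and (b[0] > a[0] or b[1] > a[1])
            for j, b in enumerate(trials)
            if j != i
        )
        if not dominated:
            front.append(a)
    return front


# Hypothetical (objective_1, objective_2) scores for five trials.
trials = [(0.60, 0.70), (0.65, 0.66), (0.58, 0.69), (0.65, 0.60), (0.70, 0.50)]
front = pareto_front(trials)
```

Choosing among the points on the front is a trade-off decision, which is why a plot of the front is useful.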
+To evaluate the performance of the models, I use several metrics that are appropriate for imbalanced datasets, such as F1 score, balanced accuracy, and PR AUC (area under the precision-recall curve).
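To see why plain accuracy is misleading on imbalanced data, the sketch below computes F1 and balanced accuracy from scratch for a toy label set with 8 negatives and 2 positives. The labels and function names are assumptions for the example; in practice a library such as scikit-learn would provide these metrics.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (recall + specificity) / 2


y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10                      # 80% accurate, but finds no positives
some_positives = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # finds one of the two positives

accuracy_naive = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
```

The always-negative predictor reaches 0.8 accuracy yet scores 0.0 on F1 and 0.5 on balanced accuracy, while the second predictor scores 0.4 and 0.625 respectively.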
+To get started with the model building process, you can follow the instructions in the Model Building notebook. This will guide you through the steps involved in building and evaluating machine learning models to predict music popularity.
+## Future Work
+I am currently working on several improvements and extensions to this project. Some include:
+- Testing a neural network classifier to see if it can improve predictive performance
+- Deploying an app with Gradio to make it easier for users to interact with the models and make predictions
+## Citations
+If you use any of the following libraries in your project, please cite them as follows:
+- imbalanced-learn: Lemaître, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research, 18(17), 1-5.
+- Matplotlib: Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90-95.
+- Seaborn: Waskom, M., Botvinnik, O., O’Kane, D., Hobson, P., Lukauskas, S., Gemperline, D. C., ... & de Ruiter, J. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021.
+- Joblib: Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., ... & Duchesnay, E. (2013). API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238.
+- Feature-engine: Sole-Ribalta A. (2020) Feature-engine: A Python Package for Feature Engineering and Preprocessing in Machine Learning. In: Martínez-Villaseñor L., Batyrshin I., Mendoza O., Kuri-Morales Á. (eds) Advances in Artificial Intelligence - IBERAMIA 2020. IBERAMIA 2020. Lecture Notes in Computer Science, vol 12422. Springer, Cham.
+- LightGBM: Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems.
+- CatBoost: Prokhorenkova L.O., Gusev G.L., Vorobev A.V., Dorogush A.V., Gulin A.A. (2018). CatBoost: unbiased boosting with categorical features. Advances in Neural Information Processing Systems.
+- Category Encoders: Micci-Barreca D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1), 27–32.
+- NumPy: Harris C.R. et al. (2020). Array programming with NumPy. Nature, 585(7825), 357–362.
+- SDV (Synthetic Data Vault): Patki N. et al. (2016). The Synthetic Data Vault. IEEE International Conference on Data Science and Advanced Analytics.
+- Optuna: Akiba T. et al. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
+- PyTorch: Paszke A. et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems.
+- SciKeras: Varma P. et al. (2020). SciKeras: a high-level Scikit-Learn compatible API for TensorFlow's Keras module.

cleaned_dataset.csv ADDED Viewed

The diff for this file is too large to render. See raw diff

dataset.csv ADDED Viewed

The diff for this file is too large to render. See raw diff