Music Recommendation Systems¶

Context¶

With the advent of technology, societies have become more efficient. At the same time, individual lives have become faster-paced and more distracted, leaving little time to explore artistic pursuits. Technology has also made it easier to find and consume art and entertainment, even for people who are short on time. One of the key challenges for companies is therefore to figure out what kind of content their customers are most likely to consume. The revenue of almost every internet-based company depends on the time consumers spend on its platform, so these companies need to work out which content will increase that time and improve the customer experience.

Spotify is one such audio content provider, with a large user base across the world. It has grown significantly in part because of its ability to recommend the 'best' next song to each customer, based on the huge preference database it has gathered over time from millions of customers and songs. This is done with recommendation systems that suggest songs based on users' likes and dislikes.

Objective¶

To recommend songs to a user based on their likelihood of liking those songs.

The key questions¶

  • What are all the songs a user has listened to?
  • What are the most favored songs and artists?

Problem Formulation¶

Build a recommendation system to propose the top 10 songs for a user based on the likelihood of listening to those songs.

Data Dictionary¶

The core data is the Taste Profile Subset released by The Echo Nest as part of the Million Song Dataset. There are two files in this dataset. One contains the details about the song id, titles, release, artist name, and the year of release. The second file contains the user id, song id, and the play count of users.

song_data

  • song_id - A unique id given to every song
  • title - Title of the song
  • release - Name of the released album
  • artist_name - Name of the artist
  • year - Year of release

count_data

  • user_id - A unique id given to the user
  • song_id - A unique id given to the song
  • play_count - Number of times the song was played

Data Source¶

http://millionsongdataset.com/

Importing libraries and Reading dataset¶

In [1]:
#Mounting the drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
import warnings                                                               # Used to ignore the warning given as output of the code.
warnings.filterwarnings('ignore')

import numpy as np                                                            # Basic libraries of python for numeric and dataframe computations.
import pandas as pd

import matplotlib.pyplot as plt                                               # Basic library for data visualization.
import seaborn as sns                                                         # Slightly advanced library for data visualization

from sklearn.metrics.pairwise import cosine_similarity                        # To compute the cosine similarity between two vectors.
from collections import defaultdict                                           # A dictionary that returns a default value for missing keys instead of raising KeyError

from sklearn.metrics import mean_squared_error                                # A performance metric from sklearn.
In [ ]:
#importing the datasets
count_df = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Seven_-_Recommendation_Systems/Music_Recommendation_System/count_data.csv')
song_df  = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Seven_-_Recommendation_Systems/Music_Recommendation_System/song_data.csv')
In [ ]:
count_df.shape
Out[ ]:
(2000000, 4)
In [ ]:
song_df.shape
Out[ ]:
(1000000, 5)

Understanding the data by viewing a few observations¶

In [ ]:
count_df.head(10)
Out[ ]:
Unnamed: 0 user_id song_id play_count
0 0 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1
1 1 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2
2 2 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1
3 3 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1
4 4 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1
5 5 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODDNQT12A6D4F5F7E 5
6 6 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODXRTY12AB0180F3B 1
7 7 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFGUAY12AB017B0A8 1
8 8 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFRQTD12A81C233C0 1
9 9 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOHQWYZ12A6D4FA701 1
In [ ]:
song_df
Out[ ]:
song_id title release artist_name year
0 SOQMMHC12AB0180CB8 Silent Night Monster Ballads X-Mas Faster Pussy cat 2003
1 SOVFVAK12A8C1350D9 Tanssi vaan Karkuteillä Karkkiautomaatti 1995
2 SOGTUKN12AB017F4F1 No One Could Ever Butter Hudson Mohawke 2006
3 SOBNYVR12A8C13558C Si Vos Querés De Culo Yerba Brava 2003
4 SOHSBXH12A8C13B0DF Tangle Of Aspens Rene Ablaze Presents Winter Sessions Der Mystic 0
... ... ... ... ... ...
999995 SOTXAME12AB018F136 O Samba Da Vida Pacha V.I.P. Kiko Navarro 0
999996 SOXQYIQ12A8C137FBB Jago Chhadeo Naale Baba Lassi Pee Gya Kuldeep Manak 0
999997 SOHODZI12A8C137BB3 Novemba Dub_Connected: electronic music Gabriel Le Mar 0
999998 SOLXGOR12A81C21EB7 Faraday The Trance Collection Vol. 2 Elude 0
999999 SOWXJXQ12AB0189F43 Fernweh feat. Sektion Kuchikäschtli So Oder So Texta 2004

1000000 rows × 5 columns

Let us check the data types and missing values of each column¶

In [ ]:
count_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 4 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   Unnamed: 0  int64 
 1   user_id     object
 2   song_id     object
 3   play_count  int64 
dtypes: int64(2), object(2)
memory usage: 61.0+ MB
In [ ]:
song_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   song_id      1000000 non-null  object
 1   title        999983 non-null   object
 2   release      999993 non-null   object
 3   artist_name  1000000 non-null  object
 4   year         1000000 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 38.1+ MB

Observations and Insights:¶

  • The count_df dataframe contains user_id, song_id, and the number of times a particular song has been played by a particular user. There are 4 columns and 2,000,000 observations in the dataset.
  • The Unnamed: 0 column is a copy of the dataframe index. We can drop this column.
  • The song_df dataframe has the features of each song: title, released album, artist name, and year of release. There are 5 columns and 1,000,000 observations in the dataset. The title and release columns have a few missing values (17 and 7 respectively).
In [ ]:
df = pd.merge(count_df, song_df.drop_duplicates(['song_id']), on="song_id", how="left")
df = df.drop(['Unnamed: 0'],axis=1)
df
Out[ ]:
user_id song_id play_count title release artist_name year
0 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1 The Cove Thicker Than Water Jack Johnson 0
1 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2 Entre Dos Aguas Flamenco Para Niños Paco De Lucia 1976
2 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1 Stronger Graduation Kanye West 2007
3 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1 Constellations In Between Dreams Jack Johnson 2005
4 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1 Learn To Fly There Is Nothing Left To Lose Foo Fighters 1999
... ... ... ... ... ... ... ...
1999995 d8bfd4ec88f0f3773a9e022e3c1a0f1d3b7b6a92 SOJEYPO12AAA8C6B0E 2 Ignorance (Album Version) Ignorance Paramore 0
1999996 d8bfd4ec88f0f3773a9e022e3c1a0f1d3b7b6a92 SOJJYDE12AF729FC16 4 Two Is Better Than One Love Drunk Boys Like Girls featuring Taylor Swift 2009
1999997 d8bfd4ec88f0f3773a9e022e3c1a0f1d3b7b6a92 SOJKQSF12A6D4F5EE9 3 What I've Done (Album Version) What I've Done Linkin Park 2007
1999998 d8bfd4ec88f0f3773a9e022e3c1a0f1d3b7b6a92 SOJUXGA12AC961885C 1 Up My Worlds Justin Bieber 2010
1999999 d8bfd4ec88f0f3773a9e022e3c1a0f1d3b7b6a92 SOJYOLS12A8C13C06F 1 Soil_ Soil (Album Version) The Con Tegan And Sara 2007

2000000 rows × 7 columns

In [ ]:
df.play_count.describe()
Out[ ]:
play_count
count 2.000000e+06
mean 3.045485e+00
std 6.579720e+00
min 1.000000e+00
25% 1.000000e+00
50% 1.000000e+00
75% 3.000000e+00
max 2.213000e+03

The song_id and user_id columns hold long, anonymized string identifiers. To make them easier to process, we label-encode both columns into integers.

In [ ]:
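# Label-encode user_id and song_id into integer ids
from sklearn.preprocessing import LabelEncoder

# Use a separate encoder per column so that each mapping can be inverted later
user_encoder = LabelEncoder()
df['user_id'] = user_encoder.fit_transform(df['user_id'])

song_encoder = LabelEncoder()
df['song_id'] = song_encoder.fit_transform(df['song_id'])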

As this dataset is very large, with 2,000,000 observations, building a model on all of it is computationally expensive. Moreover, many users have listened to only a few songs, and some songs have been listened to by very few users. Hence we can reduce the dataset by applying some logical assumptions.

Here, we keep users who have listened to at least 90 songs, and songs that have been listened to by at least 120 users.

In [ ]:
# Get the column containing the users
users = df.user_id

# Create a dictionary that maps users(listeners) to the number of songs that they have listened to
playing_count = dict()

for user in users:
    # If we already have the user, just add 1 to their playing count
    if user in playing_count:
        playing_count[user] += 1

    # Otherwise, set their playing count to 1
    else:
        playing_count[user] = 1
In [ ]:
# We want our users to have listened to at least 90 songs
SONG_COUNT_CUTOFF = 90

# Create a list of users who need to be removed
remove_users = []

for user, num_songs in playing_count.items():

    if num_songs < SONG_COUNT_CUTOFF:
        remove_users.append(user)

df = df.loc[ ~ df.user_id.isin(remove_users)]
In [ ]:
# Get the column containing the songs
songs = df.song_id

# Create a dictionary that maps songs to its number of users(listeners)
playing_count = dict()

for song in songs:
    # If we already have the song, just add 1 to their playing count
    if song in playing_count:
        playing_count[song] += 1

    # Otherwise, set their playing count to 1
    else:
        playing_count[song] = 1
In [ ]:
# We want each song to have been listened to by at least 120 users to be considered
LISTENER_COUNT_CUTOFF = 120

remove_songs = []

for song, num_users in playing_count.items():
    if num_users < LISTENER_COUNT_CUTOFF:
        remove_songs.append(song)

df_final= df.loc[ ~ df.song_id.isin(remove_songs)]
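
The counting loops and cutoff filters above can also be written with pandas operations. The following is a minimal sketch on a small made-up dataframe; the helper name filter_by_min_count, the toy data, and the toy cutoffs are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def filter_by_min_count(frame, column, cutoff):
    """Keep the rows whose value in `column` occurs at least `cutoff` times."""
    # value_counts() gives the number of rows per id; map() attaches it to each row
    counts = frame[column].map(frame[column].value_counts())
    return frame[counts >= cutoff]


# Toy interaction data (hypothetical): one row per (user, song) pair
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2, 3, 3, 3],
    'song_id':    [10, 11, 12, 10, 11, 10, 11, 13],
    'play_count': [1, 2, 1, 5, 1, 1, 3, 1],
})

# Same order as the notebook: filter users first, then songs
by_user = filter_by_min_count(toy, 'user_id', cutoff=3)
toy_final = filter_by_min_count(by_user, 'song_id', cutoff=2)
print(toy_final)
```

Because the song filter runs after the user filter, a user can end up with fewer rows than the user cutoff. In the toy data, users 1 and 3 each pass the cutoff of 3 but keep only 2 rows once the rarely played songs are removed.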

Almost 90% of the user-song records have a play_count of 5 or less. So for building the recommendation system, let us keep only those records.

In [ ]:
# Keep only records of songs with play_count less than or equal to (<=) 5
df_final=df_final[df_final.play_count<=5]
In [ ]:
df_final.shape
Out[ ]:
(117876, 7)
In [ ]:
df_final.groupby("play_count").count()
Out[ ]:
user_id song_id title release artist_name year
play_count
1 72473 72473 72473 72473 72473 72473
2 23890 23890 23890 23890 23890 23890
3 10774 10774 10774 10774 10774 10774
4 5874 5874 5874 5874 5874 5874
5 4865 4865 4865 4865 4865 4865
In [ ]:
# See the shape of the data
df_final.shape
Out[ ]:
(117876, 7)

Exploratory Data Analysis¶

Let's check the total number of unique users, songs, artists in the data¶

Total number of unique user id

In [ ]:
df_final['user_id'].nunique()
Out[ ]:
3155

Total number of unique song id

In [ ]:
df_final['song_id'].nunique()
Out[ ]:
563

Total number of unique artists

In [ ]:
df_final['artist_name'].nunique()
Out[ ]:
232

Observations and Insights:¶

  • There are 3155 unique users, 563 unique songs, and 232 artists in the final dataset.

Let's find out about the most interacted songs and interacted users¶

Most interacted songs

In [ ]:
df_final['title'].value_counts()
Out[ ]:
title
Use Somebody                       751
Dog Days Are Over (Radio Edit)     748
Sehr kosmisch                      713
Clocks                             662
The Scientist                      652
                                  ... 
Who's Real                         103
Brave The Elements                 102
Creil City                         101
Heaven Must Be Missing An Angel     97
The Big Gundown                     96
Name: count, Length: 561, dtype: int64

Most interacted users

In [ ]:
df_final['user_id'].value_counts()
Out[ ]:
user_id
61472    243
15733    227
37049    202
9570     184
23337    177
        ... 
19776      1
45476      1
17961      1
14439      1
10412      1
Name: count, Length: 3155, dtype: int64

Observations and Insights:¶

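  • The song 'Use Somebody' has the most interactions: it appears in 751 user-song records. Note that this counts the users who played it, not the total number of plays.
  • The user with ID 61472 is the most active user, with 243 songs in the final dataset.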

Songs released on yearly basis

In [ ]:
count_songs = song_df.groupby('year').count()['title']
count = pd.DataFrame(count_songs)
count.drop(count.index[0], inplace=True)
count.tail()
Out[ ]:
title
year
2007 39414
2008 34770
2009 31051
2010 9397
2011 1
In [ ]:
plt.figure(figsize=(30,10))
ax = sns.barplot(x = count.index,
            y = 'title',
            data = count,
            estimator = np.median,)
for item in ax.get_xticklabels(): item.set_rotation(90)
plt.ylabel('number of songs released')
# Show the plot
plt.show()
[Figure: bar plot of the number of songs released per year]

Observations and Insights:¶

  • We can observe that the number of songs released per year has been increasing over the years, up to 2007.
  • As per the data, the highest number of songs was released in 2007, i.e. 39,414.
  • The number of songs drops sharply in 2010, and 2011 has a single entry. Since these are the last years in the dataset, the drop is probably because we have only partial data for them.

Now that we have explored the data, let's apply different algorithms to build recommendation systems

Popularity Based Recommendation Systems¶

As we have now explored the data, let's start building Recommendation systems¶

Model 1: Create Rank-Based Recommendation System¶

Rank-based recommendation systems provide recommendations based on the most popular songs. This kind of recommendation system is useful when we have cold start problems. Cold start refers to the issue when we get a new user into the system and the machine is not able to recommend songs to the new user, as the user did not have any historical interactions in the dataset. In those cases, we can use a rank-based recommendation system to recommend songs to the new user.

To build the rank-based recommendation system, we take the average of all the play_counts recorded for each song and then rank the songs by their average play_count.

In [ ]:
#Calculating average play_count
average_count = df_final.groupby('song_id')['play_count'].mean()

#Counting the number of user interactions (rows) per song
play_freq = df_final.groupby('song_id')['play_count'].count()

#Making a dataframe with the average_count and play_freq
final_play = pd.DataFrame({'avg_count':average_count, 'play_freq':play_freq})

Now, let's create a function to find the top n songs for a recommendation based on the average play count of each song. We can also add a threshold on the number of interactions (play_freq) a song must have to be considered for recommendation.

In [ ]:
def top_n_songs(data, n, min_interactions=100):

    #Keeping songs with more than min_interactions interactions
    recommendations = data[data['play_freq'] > min_interactions]

    #Sorting values w.r.t average count
    recommendations = recommendations.sort_values(by='avg_count', ascending=False)

    return recommendations.index[:n]

We can use this function with different n's and minimum interactions to get songs to recommend

Recommending top 10 Songs with 100 minimum interactions based on popularity¶
In [ ]:
list(top_n_songs(final_play, 10, 100))
Out[ ]:
[7224, 6450, 9942, 5531, 5653, 8483, 2220, 657, 614, 352]
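
The list above contains label-encoded song ids. As a minimal sketch of the same ranking idea on made-up data, the example below also shows why the interaction threshold matters and how ids could be mapped back to titles. The toy dataframe, the titles, and the helper name top_n are assumptions for illustration only.

```python
import pandas as pd

# Toy interactions (hypothetical): song 2 has the highest average but few listeners
toy = pd.DataFrame({
    'song_id':    [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'title':      ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'play_count': [5, 4, 3, 5, 5, 1, 2, 1, 2],
})

# One row per song: average play_count and number of interactions
summary = toy.groupby('song_id')['play_count'].agg(avg_count='mean', play_freq='count')


def top_n(summary, n, min_interactions):
    """Rank songs by average play_count among those with enough interactions."""
    eligible = summary[summary['play_freq'] > min_interactions]
    return list(eligible.sort_values('avg_count', ascending=False).index[:n])


# Lookup table from song_id to title, one row per song
titles = toy.drop_duplicates('song_id').set_index('song_id')['title']

strict = top_n(summary, 2, min_interactions=2)   # song 2 is excluded
loose = top_n(summary, 2, min_interactions=1)    # song 2 is included and ranks first
print(strict, titles.loc[strict].tolist())
print(loose, titles.loc[loose].tolist())
```

With a low threshold, a song with only two listeners tops the ranking; raising the threshold trades some coverage for more reliable averages.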

Collaborative Filtering Based Recommendation System¶

In this type of recommendation system, we do not need any information about the users or songs. We only need user-item interaction data to build a collaborative recommendation system. For example -

  1. Ratings provided by users. For example - ratings of books on Goodreads, movie ratings on IMDb, etc.
  2. Likes of users on different Facebook posts, likes on YouTube videos
  3. Use/buying of a product by users. For example - buying different items on e-commerce sites
  4. Reading of articles by readers on various blogs

Types of Collaborative Filtering¶

  • Similarity/Neighborhood based
    • User-user similarity based
    • Item-item similarity based
  • Model based

Building a baseline user user similarity based recommendation system¶

  • Below we build a similarity-based recommendation system that uses cosine similarity and KNN to find the users who are the nearest neighbors of a given user.
  • We will use a new library, surprise, to build the remaining models. Let's first import the necessary classes and functions from this library.
  • Use the following command to install the surprise library. You only need to do this once, when running the code for the first time.

!pip install surprise

In [ ]:
!pip install surprise
Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.4/154.4 kB 2.4 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise->surprise) (1.4.2)
Requirement already satisfied: numpy>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise->surprise) (1.25.2)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise->surprise) (1.11.4)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357248 sha256=a99df192f535f9ebc101bac5235e8df41a0cb57803d94bd7d6a50a718c12e5f0
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.4 surprise-0.1
In [ ]:
# To compute the accuracy of models
from surprise import accuracy

# class is used to parse a file containing play_counts, data should be in structure - user; item ; play_count
from surprise.reader import Reader

# class for loading datasets
from surprise.dataset import Dataset

# for tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# for splitting the data in train and test dataset
from surprise.model_selection import train_test_split

# for implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# for implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# for implementing KFold cross-validation
from surprise.model_selection import KFold

#For implementing clustering-based recommendation system
from surprise import CoClustering

Before building the recommendation systems, let's go over some basic terminologies we are going to use:¶

Relevant song - A song whose actual play_count is at or above the threshold (here 1.5) is relevant. If the actual play_count is below the threshold, it is a non-relevant song.

Recommended song - A song whose predicted play_count is at or above the threshold (here 1.5) is a recommended song. If the predicted play_count is below the threshold, that song will not be recommended to the user.

False Negative (FN) - It is the frequency of relevant songs that are not recommended to the user. If the relevant songs are not recommended to the user, then the user might not listen to the song. This would result in the loss of opportunity for the service provider which they would like to minimize.

False Positive (FP) - It is the frequency of recommended songs that are actually not relevant. In this case, the recommendation system is not doing a good job of finding and recommending the relevant songs to the user. This would result in loss of resources for the service provider which they would also like to minimize.

Recall - It is the fraction of actually relevant songs that are recommended to the user, i.e. if out of 10 relevant songs, 6 are recommended to the user, then recall is 0.60. The higher the recall, the better the model. It is one of the metrics used to assess the performance of classification models.

Precision - It is the fraction of recommended songs that are actually relevant, i.e. if out of 10 recommended items, 6 are found relevant by the user, then precision is 0.60. The higher the precision, the better the model. It is one of the metrics used to assess the performance of classification models.

When building a recommendation system, it is customary to assess the performance of the model in terms of how many recommendations are relevant and how many relevant items are recommended. Below are the most commonly used performance metrics for assessing recommendation systems.

Precision@k and Recall@k¶

Precision@k - It is the fraction of recommended songs that are relevant in top k predictions. Value of k is the number of recommendations to be provided to the user. One can choose a variable number of recommendations to be given to a unique user.

Recall@k - It is the fraction of relevant songs that are recommended to the user in top k predictions.

F1-Score@k - It is the harmonic mean of Precision@k and Recall@k. When precision@k and recall@k both seem to be important then it is useful to use this metric because it is representative of both of them.
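
To make these definitions concrete, here is a small worked example for a single user. It is a sketch that mirrors the per-user logic of the notebook's helper function; the function name precision_recall_f1_at_k and the (estimated, actual) pairs are made up for illustration.

```python
def precision_recall_f1_at_k(pairs, k, threshold):
    """Compute precision@k, recall@k and F1@k for one user.

    pairs: list of (estimated_play_count, actual_play_count) tuples.
    """
    # Rank the user's songs by estimated play count, highest first
    ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]

    # Relevant songs are counted over all of the user's songs
    n_rel = sum(actual >= threshold for _, actual in ranked)
    # Recommended songs are counted only within the top k
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    n_rel_and_rec_k = sum(est >= threshold and actual >= threshold
                          for est, actual in top_k)

    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0.0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (estimated, actual) play counts for one user
pairs = [(2.4, 3), (2.1, 1), (1.9, 2), (1.6, 1), (1.2, 2), (1.0, 1)]
precision, recall, f1 = precision_recall_f1_at_k(pairs, k=4, threshold=1.5)
print(precision, recall, f1)
```

Here 4 songs are recommended in the top 4 and 2 of them are relevant, so precision@4 is 2/4. The user has 3 relevant songs overall and 2 were recommended, so recall@4 is 2/3, and the F1 score is 4/7.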

Some useful functions¶

  • The function below takes a recommendation model as input and prints the RMSE, precision@k, recall@k, and F_1 score for that model.
  • To compute precision and recall, the top k predictions are considered for each user.
In [ ]:
def precision_recall_at_k(model, k=30, threshold=1.5):
    """Print RMSE and the precision@k, recall@k, and F_1 score averaged over all users in the global testset"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)

    #Making predictions on the test data
    predictions = model.test(testset)

    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, playing_count in user_est_true.items():

        # Sort play count by estimated value
        playing_count.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in playing_count)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in playing_count[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in playing_count[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    #Mean of all the predicted precisions are calculated.
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)),3)
    #Mean of all the predicted recalls are calculated.
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)),3)

    accuracy.rmse(predictions)
    print('Precision: ', precision) # Print the overall precision
    print('Recall: ', recall) # Print the overall recall
    # F_1 is the harmonic mean of precision and recall; report 0 when both are 0 to avoid dividing by zero
    f_1 = round((2*precision*recall)/(precision+recall),3) if (precision + recall) != 0 else 0
    print('F_1 score: ', f_1)

Below we are loading the dataset, which is a pandas dataframe, into a different format called surprise.dataset.DatasetAutoFolds which is required by this library. To do this we will be using the classes Reader and Dataset

You will also notice that we read the dataset by providing a scale of ratings. However, we do not have ratings data for the songs. In this case, we use play_count as a proxy for ratings, on the assumption that the more a user listens to a song, the higher the chance that they like it.

In [ ]:
# instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))

# loading the dataset
data = Dataset.load_from_df(df_final[['user_id', 'song_id', 'play_count']], reader)

# splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.40, random_state=42)
  • Now we are ready to build the first baseline similarity-based recommendation system using cosine similarity.
  • KNNBasic is an algorithm provided by the surprise package. With user_based set to True, it finds the users most similar to a given user and predicts play counts from their listening history.
  • To compute precision and recall, a threshold of 1.5 and a k value of 30 are used for the recommended and relevant play counts.
  • The intuition behind the threshold of 1.5 is that if the model predicts that a user will listen to a song more than 1.5 times (in practice, at least 2 plays, since actual play counts are whole numbers), then that song should be recommended to that user.
  • In the present case, precision and recall both need to be optimized, as the service provider would like to minimize both of the losses discussed above. Hence, a suitable performance measure is the F_1 score.
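
The idea behind a user-user similarity model can be sketched in a few lines of NumPy: compute the cosine similarity between the target user and every other user over the songs both have played, then predict a play count as the similarity-weighted average of the nearest neighbors' play counts. This is a simplified illustration of the technique on made-up data, not the surprise implementation; the function names and the toy matrix are assumptions.

```python
import numpy as np

# Rows are users, columns are songs; 0 means "no interaction" (toy data)
plays = np.array([
    [2, 0, 1, 3],   # user 0: target user, has not played song 1
    [2, 4, 1, 3],   # user 1
    [1, 5, 1, 2],   # user 2
    [5, 1, 4, 0],   # user 3
], dtype=float)


def cosine_on_common(u, v):
    """Cosine similarity restricted to the songs both users have played."""
    common = (u > 0) & (v > 0)
    if not common.any():
        return 0.0
    a, b = u[common], v[common]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict(plays, user, song, k):
    """Similarity-weighted average over the k most similar users who played `song`."""
    candidates = []
    for other in range(plays.shape[0]):
        if other != user and plays[other, song] > 0:
            sim = cosine_on_common(plays[user], plays[other])
            candidates.append((sim, plays[other, song]))
    neighbors = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    weight = sum(sim for sim, _ in neighbors)
    if weight == 0:
        return None  # no usable neighbors; a real system would fall back to a default
    return sum(sim * count for sim, count in neighbors) / weight


estimate = predict(plays, user=0, song=1, k=2)
print(round(estimate, 2))
```

In this toy matrix every pairwise similarity is close to 1, because play counts are all positive and cosine similarity only measures the angle between vectors. That is one reason it can be worth comparing cosine with mean-centered measures such as Pearson.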
In [ ]:
#Declaring the similarity options.
sim_options = {'name': 'cosine',
               'user_based': True}

#KNNBasic finds the nearest neighbors (here, similar users) of each user.
sim_user_user = KNNBasic(sim_options=sim_options, verbose=False, random_state=1)

# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k =30.
precision_recall_at_k(sim_user_user)
RMSE: 1.0878
Precision:  0.396
Recall:  0.692
F_1 score:  0.504
  • We have calculated RMSE to check how far the overall predicted play counts are from the actual play counts.
  • Intuition of Recall - We are getting a recall of about 0.69, which means that out of all the relevant songs, about 69% are recommended.
  • Intuition of Precision - We are getting a precision of about 0.40, which means that out of all the recommended songs, 39.6% are relevant.
  • The F_1 score of the baseline model is about 0.504. The model recovers most of the relevant songs, but fewer than half of the songs it recommends are relevant. We will try to improve this later by tuning the hyperparameters of this algorithm with GridSearchCV.
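
For reference, RMSE is the square root of the mean squared difference between actual and predicted play counts. The sketch below computes it by hand on made-up numbers; the function name rmse and the sample values are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted play counts for four user-song pairs
actual = [1, 2, 5, 1]
predicted = [1.5, 1.5, 3.0, 1.0]

error = rmse(actual, predicted)
print(round(error, 4))
```

Three of the four predictions are within 0.5 of the truth, yet the single miss of 2.0 on the heavily played song accounts for most of the error. Because errors are squared, RMSE is sensitive to such large misses.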

Let's now predict play_counts for a user with user_id=6958 and song_id=1671 as shown below.

In [ ]:
# predicting play_count for a sample user with a listened song.
sim_user_user.predict(6958, 1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.80   {'actual_k': 40, 'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.8009387435128914, details={'actual_k': 40, 'was_impossible': False})
  • The above output shows that the actual play count for this user-item pair is 2 and the predicted is 1.80 by this user-user-similarity-based baseline model.

Below we are predicting play_count for the same user_id=6958 but for a song which this user has not listened to yet, i.e. song_id=3232.

In [ ]:
#predicting play_count for a sample user with a song not-listened by the user.
sim_user_user.predict(6958, 3232, verbose=True)
user: 6958       item: 3232       r_ui = None   est = 1.64   {'actual_k': 40, 'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.6386860897998294, details={'actual_k': 40, 'was_impossible': False})

As we can see the predicted play count for this user-item pair is 1.64 based on this user-user-similarity-based baseline model.

Improving similarity-based recommendation system by tuning its hyper-parameters¶

Below we will be tuning hyperparameters for the KNNBasic algorithm. Let's try to understand some of the hyperparameters of the KNNBasic algorithm:

  • k (int) – The (max) number of neighbors to take into account for aggregation. Default is 40.
  • min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all play_counts. Default is 1.
  • sim_options (dict) – A dictionary of options for the similarity measure. There are four similarity measures available in surprise:
    • cosine
    • msd (default)
    • pearson
    • pearson_baseline
In [ ]:
# setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine",'pearson',"pearson_baseline"],
                              'user_based': [True], "min_support":[2,4]}
              }

# performing 3-fold cross validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting the data
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
1.0462441381791592
{'k': 30, 'min_k': 9, 'sim_options': {'name': 'pearson_baseline', 'user_based': True, 'min_support': 2}}

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above
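
To get a feel for the cost of this search, the sketch below enumerates the parameter combinations by hand with itertools. It assumes, as I understand the library's behaviour, that the lists nested inside sim_options are expanded into every combination just like the top-level lists; the variable names are illustrative.

```python
from itertools import product

# Same grid as in the notebook
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['cosine', 'pearson', 'pearson_baseline'],
                              'user_based': [True], 'min_support': [2, 4]}}

# Expand the nested similarity options into one dict per combination
sim_grid = param_grid['sim_options']
sim_combos = [dict(zip(sim_grid, values)) for values in product(*sim_grid.values())]

# Cross every similarity setting with every (k, min_k) pair
combos = [
    {'k': k, 'min_k': min_k, 'sim_options': sim}
    for k, min_k, sim in product(param_grid['k'], param_grid['min_k'], sim_combos)
]

n_folds = 3  # matches cv=3 in the notebook
n_fits = len(combos) * n_folds
print(len(combos), n_fits)
```

Under this assumption the grid has 54 combinations, so 3-fold cross-validation trains 162 models. Adding one more value to any list multiplies that number, which is worth keeping in mind before widening the grid.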

Now let's build the final model by using tuned values of the hyperparameters which we received by using grid search cross-validation

In [ ]:
# using the optimal similarity measure for user-user based collaborative filtering
sim_options = {'name': 'pearson_baseline',
               'user_based': True, "min_support":2}

# creating an instance of KNNBasic with optimal hyperparameter values
sim_user_user_optimized = KNNBasic(sim_options=sim_options, k=30, min_k=9, random_state=1, verbose=False)

# training the algorithm on the trainset
sim_user_user_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k =30.
precision_recall_at_k(sim_user_user_optimized)
RMSE: 1.0521
Precision:  0.413
Recall:  0.721
F_1 score:  0.525
  • We can see from above that after tuning hyperparameters, F_1 score of the tuned model is better than the baseline model. Along with this the RMSE of the model has gone down as compared to the model before hyperparameter tuning. Hence, we can say that the model performance has improved after hyperparameter tuning.

Let us now predict play_count for the user with user_id=6958 and song_id=1671 with the optimized model, as shown below.

In [ ]:
sim_user_user_optimized.predict(6958, 1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.96   {'actual_k': 24, 'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.962926073914969, details={'actual_k': 24, 'was_impossible': False})
  • Here the model gives a good prediction in comparison to the actual play_count(2).

Below we are predicting play_count for the same user_id=6958 but for a song which this user has not listened to before, i.e. song_id=3232, using the optimized model.

In [ ]:
sim_user_user_optimized.predict(6958,3232, verbose=True)
user: 6958       item: 3232       r_ui = None   est = 1.45   {'actual_k': 10, 'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.4516261428486725, details={'actual_k': 10, 'was_impossible': False})

Identifying similar users to a given user (nearest neighbors)¶

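We can also find the users most similar to a given user, i.e. its nearest neighbors, with this KNNBasic model. Below we find the 5 most similar users to the first user in the list, with inner id 0, based on the pearson_baseline similarity that the optimized model was trained with. Note that get_neighbors takes and returns inner ids of the trainset, not raw user_id values.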

In [ ]:
sim_user_user_optimized.get_neighbors(0,5) #Here 0 is the inner id of the above user.
Out[ ]:
[42, 1131, 17, 186, 249]
In [ ]:
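# iloc selects a row of df_final by position; it does not translate the inner id 1131
# into a raw user_id (trainset.to_raw_uid(1131) does that)
df_final.iloc[1131,:]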
Out[ ]:
user_id                   51415
song_id                    2115
play_count                    1
title                  Tive Sim
release        Nova Bis-Cartola
artist_name             Cartola
year                       1974
Name: 15513, dtype: object
In [ ]:
df_final[df_final.user_id==6958]
Out[ ]:
user_id song_id play_count title release artist_name year
200 6958 447 1 Daisy And Prudence Distillation Erin McKeown 2000
202 6958 512 1 The Ballad of Michael Valentine Sawdust The Killers 2004
203 6958 549 1 I Stand Corrected (Album) Vampire Weekend Vampire Weekend 2007
204 6958 703 1 They Might Follow You Tiny Vipers Tiny Vipers 2007
205 6958 719 1 Monkey Man You Know I'm No Good Amy Winehouse 2007
206 6958 892 1 Bleeding Hearts Hell Train Soltero 0
209 6958 1050 5 Wet Blanket Old World Underground_ Where Are You Now? Metric 2003
213 6958 1480 1 Fast As I Can Monday Morning Cold Erin McKeown 2000
215 6958 1671 2 Sleeping In (Album) Give Up Postal Service 2003
216 6958 1752 1 Gimme Sympathy Gimme Sympathy Metric 2009
217 6958 1756 1 You Mustn't Kick It Around Distillation Erin McKeown 2000
218 6958 1787 2 Help I'm Alive Fantasies Metric 2009
219 6958 1818 1 Teenager Modapop Camera Obscura 0
221 6958 2107 1 Stadium Love Fantasies Metric 2009
225 6958 2289 1 Satellite Mind Fantasies Metric 2009
226 6958 2304 1 Daddy's Eyes Sawdust The Killers 2006
227 6958 2425 1 Señorita Justified Justin Timberlake 2002
228 6958 2501 1 Camaro Because Of The Times Kings Of Leon 2007
232 6958 2701 1 Tron Antidotes Foals 2008
235 6958 2898 1 Twilight Galaxy Fantasies Metric 2009
237 6958 2994 1 Elephant Gun The Gulag Orkestar Beirut 2006
239 6958 3074 1 Catch You Baby (Steve Pitron & Max Sanna Radio... Catch You Baby Lonnie Gordon 0
244 6958 3491 1 Bling (Confession Of A King) Sam's Town The Killers 2006
246 6958 3551 1 You're A Cad Ray Guns Are Not Just The Future the bird and the bee 2009
247 6958 3718 2 The Penalty The Flying Club Cup Beirut 2007
249 6958 3801 1 Baby Ray Guns Are Not Just The Future the bird and the bee 2009
251 6958 3907 1 What's In The Middle Ray Guns Are Not Just The Future the bird and the bee 2009
262 6958 5193 1 Goodnight Bad Morning Midnight Boom The Kills 2008
264 6958 5340 1 Postcards From Italy The Gulag Orkestar Beirut 2005
267 6958 5441 1 Where The White Boys Dance Sam's Town The Killers 2006
269 6958 5566 5 The Bachelor and the Bride Her Majesty The Decemberists The Decemberists 2003
271 6958 5894 1 Caring Is Creepy Garden State - Music From The Motion Picture The Shins 2001
272 6958 6305 1 Rhode Island Is Famous For You Sing You Sinners Erin McKeown 2007
280 6958 7738 1 Nantes The Flying Club Cup Beirut 2007
281 6958 8029 1 I CAN'T GET STARTED It's The Time Ron Carter 0
282 6958 8037 1 Gold Guns Girls Fantasies Metric 2009
286 6958 8425 1 Love Letter To Japan Ray Guns Are Not Just The Future the bird and the bee 2009
290 6958 9065 1 Balloons (Single version) Balloons Foals 2007
293 6958 9351 2 The Police And The Private Live It Out Metric 2005

Implementing the recommendation algorithm based on optimized KNNBasic model¶

Below we implement a function with the following input parameters:

  • data: a song dataset
  • user_id: the user id for which we want recommendations
  • top_n: the number of songs we want to recommend
  • algo: the algorithm we want to use for predicting the play_count

The function returns the top_n songs recommended for the given user_id by the given algorithm, as a list of (song_id, predicted play_count) pairs.
In [ ]:
def get_recommendations(data, user_id, top_n, algo):

    # creating an empty list to store the recommended song ids
    recommendations = []

    # creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='user_id', columns='song_id', values='play_count')

    # extracting those song ids which the user_id has not played yet
    user_row = user_item_interactions_matrix.loc[user_id]
    non_interacted_songs = user_row[user_row.isnull()].index.tolist()

    # looping through each of the song ids which user_id has not interacted with yet
    for item_id in non_interacted_songs:

        # predicting the play_count for those non played song ids by this user
        est = algo.predict(user_id, item_id).est

        # appending the predicted play_count
        recommendations.append((item_id, est))

    # sorting the predicted play_count in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n] # returning the top_n songs with the highest predicted play_count for this user

Predicted top 5 songs for userId=6958 with the user-user similarity-based recommendation system, ranked by recommendation score (the model's predicted play_count, used as a proxy for how likely the user is to like the song)¶

In [ ]:
#Making top 5 recommendations for user_id 6958 with a similarity-based recommendation engine.
recommendations = get_recommendations(df_final,6958, 5, sim_user_user)
In [ ]:
#Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"
pd.DataFrame(recommendations, columns=['song_id', 'predicted_play_count'])
Out[ ]:
song_id predicted_play_count
0 7224 3.141147
1 614 2.525000
2 5653 2.514023
3 352 2.425000
4 6450 2.394927

Correcting the play_counts and Ranking the above songs¶

When comparing two songs, the predicted play_count alone does not fully describe how likely the user is to like a song. The number of users who have listened to the song also matters, because a value backed by only a few listeners is less reliable. For example, a play count of 4 for a song with a rating_count of 3 is weaker evidence than a play count of 3 for a song with a rating_count of 50. We therefore calculate a "corrected_play_count" for each song by subtracting a penalty of 1/sqrt(play_freq) from the predicted play_count, where play_freq plays the role of the rating_count. The penalty shrinks as play_freq grows, so songs with more listeners are penalized less.
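To make the size of this correction concrete, here is a small standalone sketch of the rule used in this notebook, predicted_play_count - 1/sqrt(play_freq). The helper name `corrected_play_count` is introduced only for illustration; the numbers for song_id 7224 (predicted play_count 3.141147, play_freq 107) are taken from the notebook outputs.

```python
import math

def corrected_play_count(predicted_play_count, play_freq):
    """Subtract 1/sqrt(play_freq) so songs with few listeners are penalized more."""
    return predicted_play_count - 1 / math.sqrt(play_freq)

# the penalty shrinks as play_freq grows
for play_freq in (3, 50, 107, 748):
    print(play_freq, round(1 / math.sqrt(play_freq), 4))

# song_id 7224: predicted play_count 3.141147 with play_freq 107
print(round(corrected_play_count(3.141147, 107), 6))  # 3.044473
```

For the songs considered here, all of which have a play_freq of roughly 100 or more, the penalty is below 0.1, so it mostly acts as a tie-breaker between songs with similar predicted play counts.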

In [ ]:
def ranking_songs(recommendations, playing_count):
  # sort the songs based on play counts
  ranked_songs = playing_count.loc[[items[0] for items in recommendations]].sort_values('play_freq', ascending=False)[['play_freq']].reset_index()

  # merge with the recommended songs to get predicted play_count
  ranked_songs = ranked_songs.merge(pd.DataFrame(recommendations, columns=['song_id', 'predicted_play_count']), on='song_id', how='inner')

  # rank the songs based on corrected play_counts
  ranked_songs['corrected_play_count'] = ranked_songs['predicted_play_count'] - 1 / np.sqrt(ranked_songs['play_freq'])

  # sort the songs based on corrected play_counts
  ranked_songs = ranked_songs.sort_values('corrected_play_count', ascending=False)

  return ranked_songs
In [ ]:
#Applying the ranking_songs function and sorting it based on corrected play_counts.
ranking_songs(recommendations, final_play)
Out[ ]:
song_id play_freq predicted_play_count corrected_play_count
3 7224 107 3.141147 3.044473
1 614 373 2.525000 2.473222
2 5653 108 2.514023 2.417798
0 352 748 2.425000 2.388436
4 6450 102 2.394927 2.295913

Item Item Similarity-based collaborative filtering recommendation systems¶

  • Above we have seen similarity-based collaborative filtering where similarity is computed between users. Now let us look into similarity-based collaborative filtering where similarity is computed between songs.
In [ ]:
#Declaring the similarity options.
sim_options = {'name': 'pearson',
               'user_based': False}

#KNN algorithm is used to find desired similar items.
sim_item_item = KNNBasic(sim_options=sim_options, random_state=1, verbose=False)

# Train the algorithm on the trainset, and predict play_count for the testset
sim_item_item.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k =30.
precision_recall_at_k(sim_item_item)
RMSE: 1.0588
Precision:  0.376
Recall:  0.538
F_1 score:  0.443
  • The baseline item-item model gives an F_1 score of 0.443 with an RMSE of 1.0588. We will try to improve this later by tuning different hyperparameters of this algorithm with GridSearchCV.

Let's now predict the play_count for a user with userId=6958 and song_id=1671 as shown below. Here the user has already listened to the song with song_id 1671.

In [ ]:
#predicting play_count for a sample user with a listened song.
sim_item_item.predict(6958, 1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.92   {'actual_k': 10, 'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.91669781984001, details={'actual_k': 10, 'was_impossible': False})
  • The above output shows that item-item similarity based model is making a good prediction where the actual play_count is 2.

Below we are predicting play count for the same userId=6958 but for a song which this user has not listened to yet i.e. song_id=3232

In [ ]:
#predicting play count for a sample user with song not listened to by the user.
sim_item_item.predict(6958,3232, verbose=True)
user: 6958       item: 3232       r_ui = None   est = 1.00   {'actual_k': 5, 'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.0, details={'actual_k': 5, 'was_impossible': False})

As we can see the predicted play_count for this user-song pair is low based on this item-item similarity-based baseline model.

Improving similarity-based recommendation system by tuning its hyper-parameters¶

Below we will be tuning hyperparameters for the KNNBasic algorithms. Let's try to understand some of the hyperparameters of the KNNBasic algorithm:

  • k (int) – The (max) number of neighbors to take into account for aggregation. Default is 40.
  • min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all play_counts. Default is 1.
  • sim_options (dict) – A dictionary of options for the similarity measure. There are four similarity measures available in surprise:
    • cosine
    • msd (default)
    • pearson
    • pearson_baseline
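The roles of k and min_k are easier to see in code. The following is a simplified, standalone sketch of how a KNNBasic-style estimate can be formed, as we understand the algorithm: a similarity-weighted average over the k most similar neighbors, with a fallback when fewer than min_k neighbors are usable. The function name `knn_basic_predict` and its inputs are hypothetical and are not the surprise API.

```python
import heapq

def knn_basic_predict(neighbors, k=40, min_k=1, global_mean=None):
    """Similarity-weighted average over the k most similar neighbors.

    neighbors: list of (similarity, play_count) pairs, one per neighbor
    that has a known play_count for the target.
    Returns (estimate, actual_k, was_impossible).
    """
    # keep the k neighbors with the highest similarity
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])

    # only neighbors with positive similarity contribute
    used = [(sim, r) for sim, r in top_k if sim > 0]
    actual_k = len(used)

    if actual_k < min_k:
        # too few usable neighbors: fall back to the global mean
        return global_mean, actual_k, True

    sum_sim = sum(sim for sim, _ in used)
    estimate = sum(sim * r for sim, r in used) / sum_sim
    return estimate, actual_k, False

neighbors = [(0.9, 2), (0.5, 1), (0.1, 5), (-0.2, 3)]
print(knn_basic_predict(neighbors, k=3, min_k=2, global_mean=1.7))
print(knn_basic_predict(neighbors, k=3, min_k=4, global_mean=1.7))  # (1.7, 3, True)
```

In the second call only three neighbors are usable, which is below min_k=4, so the sketch returns the global mean and flags the prediction as impossible. This is the same situation reported as 'Not enough neighbors.' in the prediction details.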
In [ ]:
# setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine",'pearson',"pearson_baseline"],
                              'user_based': [False], "min_support":[2,4]}
              }

# performing 3-fold cross validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting the data
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
1.0228249658265955
{'k': 30, 'min_k': 6, 'sim_options': {'name': 'pearson_baseline', 'user_based': False, 'min_support': 2}}

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above
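To get a feel for the cost of this search, the sketch below enumerates the combinations in the item-item parameter grid, expanding the nested sim_options dictionary as well. The helper `expand` is a hypothetical illustration written for this purpose; it is not how surprise implements GridSearchCV internally.

```python
from itertools import product

param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['cosine', 'pearson', 'pearson_baseline'],
                              'user_based': [False], 'min_support': [2, 4]}}

def expand(grid):
    """Yield one parameter dict per combination in the grid."""
    # expand nested dicts (such as sim_options) into a list of dicts first
    options = {key: list(expand(val)) if isinstance(val, dict) else val
               for key, val in grid.items()}
    keys = list(options)
    for values in product(*(options[key] for key in keys)):
        yield dict(zip(keys, values))

combinations = list(expand(param_grid))
print(len(combinations))      # 54 candidate settings
print(len(combinations) * 3)  # 162 model fits with 3-fold cross-validation
print(combinations[0])
```

Each additional value in any list multiplies the number of fits, which is why the grids in this notebook are kept small.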

Now let's build the final model by using tuned values of the hyperparameters which we received by using grid search cross-validation

In [ ]:
# similarity options for item-item based collaborative filtering
# (min_support is set to 4 here; the grid search above reported 2)
sim_options = {'name': 'pearson_baseline',
               'user_based': False, "min_support":4}

# creating an instance of KNNBasic with optimal hyperparameter values
sim_item_item_optimized = KNNBasic(sim_options=sim_options, k=30, min_k=6, random_state=1, verbose=False)

# training the algorithm on the trainset
sim_item_item_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k =30.
precision_recall_at_k(sim_item_item_optimized)
RMSE: 1.0328
Precision:  0.405
Recall:  0.696
F_1 score:  0.512
  • After tuning the hyperparameters, the F_1 score rises from 0.443 to 0.512 and the RMSE falls from 1.0588 to 1.0328. Hence the tuned model is doing better than the baseline item-item model.

Let us now predict play_count for the user with userId=6958 and song_id=1671 with the optimized model, as shown below.

In [ ]:
sim_item_item_optimized.predict(6958, 1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.96   {'actual_k': 10, 'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.9634957386781853, details={'actual_k': 10, 'was_impossible': False})
  • Here the optimized model predicts a play_count of 1.96 for the song, very close to the actual play_count of 2.

Below we are predicting play_count for the same userId=6958 but for a song which this user has not listened to before, i.e. song_id=3232, by using the optimized model as shown below -

In [ ]:
sim_item_item_optimized.predict(6958, 3232, verbose=True)
user: 6958       item: 3232       r_ui = None   est = 1.70   {'was_impossible': True, 'reason': 'Not enough neighbors.'}
Out[ ]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.6989607635206787, details={'was_impossible': True, 'reason': 'Not enough neighbors.'})
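  • For this unheard song the prediction is flagged with was_impossible=True ('Not enough neighbors.'): fewer than min_k=6 usable neighbors were found, so the reported play_count of 1.70 is the fallback value (the global mean of the training play_counts) rather than a neighbor-based estimate.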

Identifying similar songs to a given song (nearest neighbors)¶

Because this model is item-based (user_based=False), get_neighbors returns similar songs rather than similar users. Below we find the 5 songs most similar to the song with inner id 0, based on the pearson_baseline similarity used by the optimized model.

In [ ]:
sim_item_item_optimized.get_neighbors(0, k=5)
Out[ ]:
[124, 523, 173, 205, 65]

Predicted top 5 songs for userId=6958 with similarity based recommendation system¶

In [ ]:
#Making top 5 recommendations for user_id 6958 with similarity-based recommendation engine.
recommendations = get_recommendations(df_final, 6958, 5, sim_item_item)
In [ ]:
#Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"
pd.DataFrame(recommendations, columns=['songs_id', 'predicted_play_count'])
Out[ ]:
songs_id predicted_play_count
0 750 5.000000
1 4377 4.206578
2 139 3.875420
3 5616 3.868549
4 861 3.840408
In [ ]:
#Applying the ranking_songs function and sorting it based on corrected play_counts.
ranking_songs(recommendations, final_play)
Out[ ]:
song_id play_freq predicted_play_count corrected_play_count
2 750 123 5.000000 4.909833
0 4377 159 4.206578 4.127273
3 139 119 3.875420 3.783750
4 5616 113 3.868549 3.774477
1 861 126 3.840408 3.751321
  • Now that we have seen similarity-based collaborative filtering algorithms, let us look at model-based collaborative filtering algorithms.

Model Based Collaborative Filtering - Matrix Factorization¶

Model-based collaborative filtering is a personalized recommendation approach: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use latent features to find recommendations for each user.

Singular Value Decomposition (SVD)¶

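SVD is used to compute the latent features from the user-song matrix. Classical SVD, however, is not defined when the user-item matrix has missing values, so the SVD algorithm in surprise instead learns the latent factors only from the observed play_counts using stochastic gradient descent.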

Building a baseline matrix factorization recommendation system¶

In [ ]:
# using SVD matrix factorization
svd = SVD(random_state=1)

# training the algorithm on the trainset
svd.fit(trainset)

# Let us compute precision@k and recall@k with k =30.
precision_recall_at_k(svd)
RMSE: 1.0252
Precision:  0.41
Recall:  0.633
F_1 score:  0.498
  • The baseline SVD model gives a reasonable F_1 score of 0.498 (about 49.8%) with an RMSE of 1.0252.
  • Let's now predict the play_count for a user with userId=6958 and song_id=1671 as shown below
  • Here the user has already listened to the song.
In [ ]:
#Making prediction.
svd.predict(6958, 1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.27   {'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.267473397214638, details={'was_impossible': False})

As we can see, the actual play_count for this user-song pair is 2 and the play_count predicted by this matrix factorization-based baseline model is 1.27, so the model has under-estimated it. We will try to fix this later by tuning the hyperparameters of the model using GridSearchCV.

Below we are predicting play_count for the same userId=6958 but for a song which this user has not listened before i.e. song_id=3232, as shown below -

In [ ]:
#Making prediction.
svd.predict(6958, 3232, verbose=True)
user: 6958       item: 3232       r_ui = None   est = 1.56   {'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.5561675084403663, details={'was_impossible': False})

We can see that the estimated play_count for this user-song pair is 1.56 based on this matrix factorization-based baseline model.

Improving matrix factorization based recommendation system by tuning its hyper-parameters¶

In SVD, play_count is predicted as -

$$\hat{r}_{u i}=\mu+b_{u}+b_{i}+q_{i}^{T} p_{u}$$

If user $u$ is unknown, then the bias $b_{u}$ and the factors $p_{u}$ are assumed to be zero. The same applies for item $i$ with $b_{i}$ and $q_{i}$.

To estimate all the unknowns, we minimize the following regularized squared error:

$$\sum_{r_{u i} \in R_{\text {train }}}\left(r_{u i}-\hat{r}_{u i}\right)^{2}+\lambda\left(b_{i}^{2}+b_{u}^{2}+\left\|q_{i}\right\|^{2}+\left\|p_{u}\right\|^{2}\right)$$

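The minimization is performed by a very straightforward stochastic gradient descent, where $e_{u i}=r_{u i}-\hat{r}_{u i}$ is the prediction error, $\gamma$ is the learning rate, and $\lambda$ is the regularization term: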

$$\begin{aligned} b_{u} & \leftarrow b_{u}+\gamma\left(e_{u i}-\lambda b_{u}\right) \\ b_{i} & \leftarrow b_{i}+\gamma\left(e_{u i}-\lambda b_{i}\right) \\ p_{u} & \leftarrow p_{u}+\gamma\left(e_{u i} \cdot q_{i}-\lambda p_{u}\right) \\ q_{i} & \leftarrow q_{i}+\gamma\left(e_{u i} \cdot p_{u}-\lambda q_{i}\right) \end{aligned}$$
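The update rules above translate almost line by line into code. The following is a minimal pure-Python sketch on a toy dataset, not the surprise implementation: the function name `train_sgd_mf`, the toy triples, and the initialization are assumptions made for illustration, with `lr` standing for $\gamma$ and `reg` for $\lambda$.

```python
import random

def train_sgd_mf(ratings, n_factors=2, n_epochs=30, lr=0.01, reg=0.2, seed=1):
    """Fit mu, b_u, b_i, p_u, q_i with the SGD updates given above.

    ratings: list of (user, item, play_count) triples.
    Returns a predict(user, item) function.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    # small random initial factors
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + b_u[u] + b_i[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            e = r - predict(u, i)  # e_ui
            b_u[u] += lr * (e - reg * b_u[u])
            b_i[i] += lr * (e - reg * b_i[i])
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]  # old values are used for both updates
                p[u][f] += lr * (e * qf - reg * pf)
                q[i][f] += lr * (e * pf - reg * qf)
    return predict

# toy (user, item, play_count) triples, invented for this example
toy = [(1, 'a', 5), (1, 'b', 1), (2, 'a', 4), (2, 'c', 2), (3, 'b', 1), (3, 'c', 1)]
predict = train_sgd_mf(toy, n_epochs=200)
print(round(predict(1, 'a'), 2), round(predict(3, 'b'), 2))
```

Only observed triples are visited in the inner loop, which is why this approach works on a sparse user-song matrix. A larger `reg` shrinks the biases and factors toward zero, pulling predictions back toward the global mean.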

There are many hyperparameters to tune in this algorithm; the full list is given in the surprise documentation for the SVD class.

Below we will be tuning only three hyperparameters -

  • n_epochs: The number of iterations of the SGD algorithm
  • lr_all: The learning rate for all parameters
  • reg_all: The regularization term for all parameters
In [ ]:
# set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# performing 3-fold gridsearch cross validation
gs_ = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting data
gs_.fit(data)

# best RMSE score
print(gs_.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_.best_params['rmse'])
1.0123682332653112
{'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.2}

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above

Now we will build the final model by using tuned values of the hyperparameters which we received by using grid search cross-validation

In [ ]:
# building the optimized SVD model using optimal hyperparameter search
svd_optimized = SVD(n_epochs=30, lr_all=0.01, reg_all=0.2, random_state=1)

# training the algorithm on the trainset
svd_optimized=svd_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k =30.
precision_recall_at_k(svd_optimized)
RMSE: 1.0141
Precision:  0.415
Recall:  0.635
F_1 score:  0.502
  • The tuned model shows a slightly better F_1 score (0.502 versus 0.498) and a slightly lower RMSE (1.0141 versus 1.0252). Hence the tuned model is doing marginally better than the baseline SVD model.

Let's now predict the play_count for a user with userId=6958 and song_id=1671 with the optimized model as shown below

In [ ]:
#Using the svd_optimized model to predict play_count for userId 6958 and song_id 1671.
svd_optimized.predict(6958, 1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.34   {'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.3432395286125098, details={'was_impossible': False})

Here the predicted play_count is 1.34 for a song whose actual play_count is 2.

In [ ]:
#Using the svd_optimized model to predict play_count for userId 6958 and song_id 3232, which this user has not listened to.
svd_optimized.predict(6958, 3232, verbose=True)
user: 6958       item: 3232       r_ui = None   est = 1.44   {'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.4425484461176483, details={'was_impossible': False})

For an unseen song the play_count given by the optimized model is 1.44.

In [ ]:
#Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm.
svd_recommendations = get_recommendations(df_final, 6958, 5, svd_optimized)
In [ ]:
#Ranking songs based on above recommendations
ranking_songs(svd_recommendations, final_play)
Out[ ]:
song_id play_freq predicted_play_count corrected_play_count
2 7224 107 2.601899 2.505225
1 5653 108 2.108728 2.012502
4 8324 96 2.014091 1.912029
0 9942 150 1.940115 1.858465
3 6450 102 1.952493 1.853478

Cluster Based Recommendation System¶

In clustering-based recommendation systems, we explore the similarities and differences in people's tastes in songs based on their play_counts for different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.

  • Co-clustering is a set of techniques in cluster analysis. Given some matrix A, we want to cluster the rows of A and the columns of A simultaneously, which is a common task for user-item matrices.

  • As it clusters both the rows and columns simultaneously, it is also called bi-clustering. To understand the working of the algorithm, let A be an m x n matrix. The goal is to generate co-clusters: a subset of rows that exhibit similar behavior across a subset of columns, or vice versa.

  • Co-clustering is defined by two map functions, rows -> row cluster indexes and columns -> column cluster indexes, which are learned simultaneously. This is different from other clustering techniques where we first cluster the rows and then the columns.
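Once the two maps are known, a prediction can be assembled from cluster averages. The sketch below assumes the rule described in the surprise documentation for CoClustering, as we understand it: the co-cluster average, plus the user's deviation from their user-cluster average, plus the song's deviation from its song-cluster average. The cluster assignments and play counts are hand-picked toy values; the real algorithm learns the assignments iteratively.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def cocluster_predict(ratings, user_cluster, item_cluster, u, i):
    """Predict r_ui from co-cluster, user-cluster and item-cluster averages.

    ratings: list of (user, item, play_count) triples
    user_cluster / item_cluster: dicts mapping ids to cluster indexes
    Assumes the co-cluster of (u, i) contains at least one observed triple.
    """
    by_user, by_item = defaultdict(list), defaultdict(list)
    by_ucl, by_icl, by_cocl = defaultdict(list), defaultdict(list), defaultdict(list)
    for user, item, r in ratings:
        by_user[user].append(r)
        by_item[item].append(r)
        by_ucl[user_cluster[user]].append(r)
        by_icl[item_cluster[item]].append(r)
        by_cocl[(user_cluster[user], item_cluster[item])].append(r)

    cu, ci = user_cluster[u], item_cluster[i]
    return (mean(by_cocl[(cu, ci)])
            + (mean(by_user[u]) - mean(by_ucl[cu]))
            + (mean(by_item[i]) - mean(by_icl[ci])))

# toy data: hypothetical users, songs and cluster assignments
ratings = [('u1', 's1', 5), ('u1', 's2', 3), ('u2', 's1', 4),
           ('u2', 's3', 1), ('u3', 's3', 2), ('u3', 's2', 1)]
user_cluster = {'u1': 0, 'u2': 0, 'u3': 1}
item_cluster = {'s1': 0, 's2': 0, 's3': 1}
print(cocluster_predict(ratings, user_cluster, item_cluster, 'u2', 's2'))  # 2.0
```

Here the co-cluster average is 4, but user u2 and song s2 both sit below their cluster averages (by 0.75 and 1.25), so the estimate drops to 2.0.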

In [ ]:
# using CoClustering algorithm.
clust_baseline = CoClustering(random_state=1)

# training the algorithm on the trainset
clust_baseline.fit(trainset)

# Let us compute precision@k and recall@k with k =30.
precision_recall_at_k(clust_baseline)
RMSE: 1.0487
Precision:  0.397
Recall:  0.582
F_1 score:  0.472
  • Here the F_1 score of the baseline model is 0.472, with a precision of 0.397 and a recall of 0.582: about 40% of the recommended songs were relevant, and about 58% of the relevant songs were recommended. We will try to improve this later by tuning different hyperparameters of this algorithm with GridSearchCV.
  • Let's now predict the play_count for a user with userId=6958 and song_id=1671 as shown below
  • Here the user has already listened to the song.
In [ ]:
#Making prediction for user_id 6958 and song_id 1671.
clust_baseline.predict(6958,1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.29   {'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.2941824757363074, details={'was_impossible': False})

As we can see, the actual play_count for this user-song pair is 2 and the play_count predicted by this Co-clustering based baseline model is 1.29, so the model has under-estimated it. We will try to fix this later by tuning the hyperparameters of the model using GridSearchCV.

Below we are predicting play_count for the same userId=6958 but for a song to which this user has not listened before i.e. song_id=3232, as shown below -

In [ ]:
#Making prediction for userid 6958 and song_id 3232.
clust_baseline.predict(6958,3232, verbose=True)
user: 6958       item: 3232       r_ui = None   est = 1.48   {'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.4785259100797417, details={'was_impossible': False})

We can see that estimated play_count for this user-song pair is 1.48 based on this Co-clustering based baseline model.

Improving clustering-based recommendation system by tuning its hyper-parameters¶

Below we will be tuning hyper-parameters for the CoClustering algorithms. Let's try to understand different hyperparameters of this algorithm -

  • n_cltr_u (int) – Number of user clusters. Default is 3.
  • n_cltr_i (int) – Number of item clusters. Default is 3.
  • n_epochs (int) – Number of iterations of the optimization loop. Default is 20.
  • random_state (int, RandomState instance from NumPy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
  • verbose (bool) – If True, the current epoch will be printed. Default is False.
In [ ]:
# set the parameter space to tune
param_grid = {'n_cltr_u':[5,6,7,8], 'n_cltr_i': [5,6,7,8], 'n_epochs': [10,20,30]}

# performing 3-fold gridsearch cross validation
gs = GridSearchCV(CoClustering, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting data
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
1.0613293131139294
{'n_cltr_u': 5, 'n_cltr_i': 5, 'n_epochs': 30}

Once the grid search is complete, we can get the optimal values for each of those hyperparameters as shown above

Now we will build final model by using tuned values of the hyperparameters which we received by using grid search cross-validation

In [ ]:
# CoClustering with manually chosen hyperparameters
# (the grid search above reported n_cltr_u=5, n_cltr_i=5, n_epochs=30)
clust_tuned = CoClustering(n_cltr_u=3,n_cltr_i=2, n_epochs=60, random_state=1)

# training the algorithm on the trainset
clust_tuned.fit(trainset)

# Let us compute precision@k and recall@k with k =30.
precision_recall_at_k(clust_tuned)
RMSE: 1.0471
Precision:  0.396
Recall:  0.572
F_1 score:  0.468
  • The F_1 score of the tuned co-clustering model on the test set (0.468) is almost equal to that of the baseline co-clustering model (0.472), and the RMSE barely changes (1.0471 versus 1.0487). The two models can be considered nearly equivalent.
  • Let's now predict play_count for a user with userId=6958 and for song_id=1671 as shown below
  • Here the user has already listened to the song.
In [ ]:
#Using the clust_tuned model to predict play_count for userId 6958 and song_id 1671.
clust_tuned.predict(6958, 1671, r_ui=2, verbose=True)
user: 6958       item: 1671       r_ui = 2.00   est = 1.59   {'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=1671, r_ui=2, est=1.585941833604144, details={'was_impossible': False})

The optimized model predicted the play_count as 1.59, whereas the actual play_count is 2.

In [ ]:
#Using the clust_tuned model to predict play_count for userId 6958 and song_id 3232, which this user has not listened to.
clust_tuned.predict(6958, 3232, verbose=True)
user: 6958       item: 3232       r_ui = None   est = 1.77   {'was_impossible': False}
Out[ ]:
Prediction(uid=6958, iid=3232, r_ui=None, est=1.7702852679475787, details={'was_impossible': False})

The optimized model predicted the play_count as 1.77.

Implementing the recommendation algorithm based on optimized CoClustering model¶

In [ ]:
#Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm.
clustering_recommendations = get_recommendations(df_final, 6958, 5, clust_tuned)

Correcting the play_count and Ranking the above songs¶

In [ ]:
#Ranking songs based on above recommendations
ranking_songs(clustering_recommendations, final_play)
Out[ ]:
song_id play_freq predicted_play_count corrected_play_count
3 6450 102 2.626819 2.527805
1 5653 108 2.578936 2.482711
2 7224 107 2.525240 2.428567
0 9942 150 2.506799 2.425149
4 4831 97 2.415542 2.314008

Content Based Recommendation Systems¶

In a content-based recommendation system, we use text as the feature. In this dataset, we don't have any song reviews, but we can combine the columns title, release, and artist_name to create a text-based feature and apply the TF-IDF feature extraction technique to it. We later use the extracted features to find similar songs.

In [ ]:
df_final.shape
Out[ ]:
(117876, 7)
In [ ]:
df_small = df_final.copy()  # work on a copy so that the new text column is not added to df_final
In [ ]:
df_small['text'] = df_small['title'] + ' ' + df_small['release'] + ' ' + df_small['artist_name']
df_small.head()
Out[ ]:
user_id song_id play_count title release artist_name year text
200 6958 447 1 Daisy And Prudence Distillation Erin McKeown 2000 Daisy And Prudence Distillation Erin McKeown
202 6958 512 1 The Ballad of Michael Valentine Sawdust The Killers 2004 The Ballad of Michael Valentine Sawdust The Ki...
203 6958 549 1 I Stand Corrected (Album) Vampire Weekend Vampire Weekend 2007 I Stand Corrected (Album) Vampire Weekend Vamp...
204 6958 703 1 They Might Follow You Tiny Vipers Tiny Vipers 2007 They Might Follow You Tiny Vipers Tiny Vipers
205 6958 719 1 Monkey Man You Know I'm No Good Amy Winehouse 2007 Monkey Man You Know I'm No Good Amy Winehouse

Now, we can keep only five columns - user_id, song_id, play_count, title, and text. We will drop the duplicate titles from the data and make the title column the index of the dataframe.

In [ ]:
df_small = df_small[['user_id', 'song_id', 'play_count', 'title', 'text']]
df_small = df_small.drop_duplicates(subset=['title'])
df_small = df_small.set_index('title')
df_small.head()
Out[ ]:
user_id song_id play_count text
title
Daisy And Prudence 6958 447 1 Daisy And Prudence Distillation Erin McKeown
The Ballad of Michael Valentine 6958 512 1 The Ballad of Michael Valentine Sawdust The Ki...
I Stand Corrected (Album) 6958 549 1 I Stand Corrected (Album) Vampire Weekend Vamp...
They Might Follow You 6958 703 1 They Might Follow You Tiny Vipers Tiny Vipers
Monkey Man 6958 719 1 Monkey Man You Know I'm No Good Amy Winehouse
In [ ]:
df_small.shape
Out[ ]:
(561, 4)
In [ ]:
indices = pd.Series(df_small.index)
indices[:5]
Out[ ]:
0                 Daisy And Prudence
1    The Ballad of Michael Valentine
2          I Stand Corrected (Album)
3              They Might Follow You
4                         Monkey Man
Name: title, dtype: object
In [ ]:
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Out[ ]:
True
In [ ]:
import re
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity  # used below to compare songs

We will create a function to pre-process the text data:

  • stopwords: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that does not contain information in the text and can be ignored.
  • Lemmatization: Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item. For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all these words.
In [ ]:
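# build these once instead of once per token
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    # keep letters only and lowercase the text
    text = re.sub(r"[^a-zA-Z]"," ",text.lower())
    tokens = word_tokenize(text)
    # drop stop words
    words = [word for word in tokens if word not in stop_words]
    # reduce each remaining word to its lemma
    text_lems = [lemmatizer.lemmatize(lem).strip() for lem in words]

    return text_lems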
In [ ]:
tfidf = TfidfVectorizer(tokenizer=tokenize)
song_tfidf = tfidf.fit_transform(df_small['text'].values).toarray()

We have extracted TF-IDF features from the text data. We can now measure how similar two songs are by computing the cosine similarity between their feature vectors.

In [ ]:
song_tfidf
Out[ ]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
In [ ]:
song_tfidf.shape
Out[ ]:
(561, 1437)
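
The shape above means one row per song (561) and one column per distinct term (1437). As a rough illustration of how such a matrix is filled, the sketch below computes TF-IDF weights in plain Python using the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The whitespace tokenizer, the three-song corpus, and the name `tfidf_matrix` are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows) with one L2-normalised TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


# Hypothetical three-song corpus in the same "title ... artist" style.
docs = [
    "learn to fly foo fighters",
    "everlong foo fighters",
    "monkey man amy winehouse",
]
vocab, rows = tfidf_matrix(docs)
print(len(rows), len(vocab))  # 3 10
```

Most entries in each row are zero because a song's text contains only a handful of the vocabulary's terms, which is why the printed `song_tfidf` array shows mostly zeros.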
In [ ]:
similar_songs = cosine_similarity(song_tfidf, song_tfidf)
similar_songs
Out[ ]:
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
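
The diagonal of ones reflects that every song is identical to itself, and a zero off the diagonal means two songs share no terms. A minimal sketch of the underlying calculation, using hypothetical term-weight vectors rather than the notebook's data:

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors over a 4-term vocabulary.
song_a = [1.0, 1.0, 0.0, 0.0]
song_b = [1.0, 0.0, 1.0, 0.0]   # shares one term with song_a
song_c = [0.0, 0.0, 0.0, 2.0]   # shares no terms with song_a

print(round(cosine(song_a, song_a), 3))  # 1.0
print(round(cosine(song_a, song_b), 3))  # 0.5
print(round(cosine(song_a, song_c), 3))  # 0.0
```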

Finally, let's create a function that finds the songs most similar to a given song, so that they can be recommended.

In [ ]:
# Function that takes a song title as input and returns the top 10 recommended songs
def recommendations(title, similar_songs):

    # Getting the index of the song that matches the title
    idx = indices[indices == title].index[0]

    # Creating a Series with the similarity scores in descending order
    score_series = pd.Series(similar_songs[idx]).sort_values(ascending = False)

    # Getting the indexes of the 10 most similar songs
    # (position 0 is skipped because it is expected to be the song itself)
    top_10_indexes = list(score_series.iloc[1:11].index)
    print(top_10_indexes)

    # Collecting the titles of the 10 best-matching songs
    recommended_songs = [df_small.index[i] for i in top_10_indexes]

    return recommended_songs

Recommending 10 songs similar to 'Learn To Fly':

In [ ]:
recommendations('Learn To Fly', similar_songs)
[509, 234, 423, 345, 394, 370, 371, 372, 373, 375]
Out[ ]:
['Everlong',
 'The Pretender',
 'Nothing Better (Album)',
 'From Left To Right',
 'Lifespan Of A Fly',
 'Under The Gun',
 'I Need A Dollar',
 'Feel The Love',
 'All The Pretty Faces',
 'Bones']
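
Because `recommendations` slices positions 1 to 10 of the sorted scores regardless of their values, it always returns ten titles: songs with a similarity of zero can fill the list when fewer than ten songs share a term with the query, and the slice assumes the query itself sorts first. The following is a hedged sketch of one possible variation, not the notebook's method. It skips the query by index, drops zero scores, and returns the scores alongside the titles. The name `top_k_similar` and the four-song data are hypothetical.

```python
def top_k_similar(title, titles, similarity, k=10):
    """Return up to k (title, score) pairs, best first, skipping the query and zero scores."""
    query_idx = titles.index(title)
    scored = [
        (score, idx)
        for idx, score in enumerate(similarity[query_idx])
        if idx != query_idx and score > 0
    ]
    # Sort by score descending; break ties by original position for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(titles[idx], score) for score, idx in scored[:k]]


# Hypothetical 4-song example; in the notebook, `titles` would be
# list(df_small.index) and `similarity` the `similar_songs` array.
titles = ["A", "B", "C", "D"]
similarity = [
    [1.0, 0.6, 0.0, 0.2],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.2, 0.0, 0.0, 1.0],
]
print(top_k_similar("A", titles, similarity, k=3))
# [('B', 0.6), ('D', 0.2)]
```

Returning fewer than `k` items is a design choice: a shorter list of genuinely related songs may be preferable to padding the list with unrelated ones.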

Conclusions¶

In this case study, we built recommendation systems using six different algorithms. They are as follows:

  • Rank-based using averages
  • User-user similarity-based collaborative filtering
  • Item-item similarity-based collaborative filtering
  • Model-based (matrix factorization) collaborative filtering
  • Clustering-based recommendation systems
  • Content-based recommendation systems

We have seen how these approaches differ from one another and what kind of data each one needs. The techniques can also be combined to build a hybrid recommendation system.
The user-user similarity-based, item-item similarity-based, and model-based (matrix factorization) collaborative filtering systems were built with the surprise library. For each of these algorithms, grid search cross-validation was used to find the best-performing model, and that model was then used to make the corresponding predictions.

Proposal for the final solution design:¶

We will use the user-user similarity-based collaborative filtering recommendation system as the final solution, since it is more robust and gives a high F_1 score. We have predicted the play counts for all users who have not listened to a particular song.
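
As a reminder of what the F_1 score measures, it is the harmonic mean of precision and recall, F_1 = 2PR / (P + R), so it is high only when both are high. A small sketch follows; the helper name and the input values are hypothetical and are not results from this notebook.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, not results from this notebook.
print(round(f1_from_precision_recall(0.4, 0.6), 3))  # 0.48
```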
