A Track Recommender System for the Spotify Million Playlist Dataset

Golden State Bridge. Shot on Kodak Gold 200.

December, 2021

This is a final project for CSE 258 - Recommender Systems. For the project we are tasked with using what we've learned and come up with a Recommender system for a task that we've chosen.

1. Spotify Million Playlist Dataset

For this assignment, we have chosen the Spotify Million Playlist dataset [1] as our subject of study. The dataset of 1,000,000 playlists is sampled from the 4 billion public playlists created by US spotify users between January 2010 and November 2017. The dataset consist of over 2 million unique tracks by nearly 300,000 artists and was originally created for the Spotify Million Playlist Challenge that ran from January to July 2018. The version we are using in this assignment was a re-release on aicrowd.com on September 2020.

Each entry of the playlist dataset contains the playlist title, along with the list of tracks in the playlist, as well as some metadata (number of followers, edit timestamp, etc.). The tracks in each playlist are characterized by the track name, artist name and unique in-spotify uniform resource identifier (URI) corresponding to the track and artist, along with metadata. The following table shows the statistics of the Spofity Million Playlist Dataset.

1.1. Adjustments Made for This Assignment

Due to the limited computing power we possess, we soon realized that it would be impossible for us to utilize the whole dataset. As an effort to make the dataset more manageable, we filtered the dataset and only kept playlists with 25-30 tracks. After filtering, we are left with 76,289 playlists.

Following this filtering step, the dataset was split into training, validation, and testing sets with a 8:1:1 split. The track count distribution of the training set is displayed in the following figure. By plotting the log of the track count, it is easy to see that a majority of the tracks included in the dataset occur only once, implying there is limited training data for our models to build robust representations of these rare tracks.

Sorted log popularity of songs present in the training data. Taking the log of the popularity shows that a majority of the tracks in the corpus have only a single interaction to train on.

2. Predictive Task

Originally, the Association of Computing Machinery (ACM) Recommender System Challenge focused on the task of Automatic Playlist Continuation (APC), where models should recommend tracks that fit the characteristics of an existing playlist. To define the task formally, let $T$ be the universe of tracks in the catalog, for each playlist $P$ we are given $k$ visible tracks $T_{p} = { t_{p1}, t_{p2} ... t_{pk} }$ . The task is then to rank the tracks in $T - T_{p}$ to be recommended to the playlist. For the testing set of the challenge, playlists are grouped by different values of $k$ , signifying different levels of available information of the playlist. In the case that $k = 0$ , the APC needs to rely solely on other metadata of the playlist to generate the recommendations (cold start).

By the definition of the original problem, we would have to get every pairwise ranking of tracks in $T - T_{p}$ to generate the recommendations. This would make the time complexity for testing quadratic in the number of tracks. To make the task more manageable, for this assignment, we have simplified the task into a pairwise classification problem.

The pairwise classification problem in the context of recommender systems, given an user $u$ and two items $i$ and $j$ , $r_{u, i}$ is the model's confidence of recommending item $i$ to user $u$ . The output is a binary value $p_{u, i, j}$ , indicating whether the model will have higher confidence recommending $i$ to $u$ than $j$ . For example:

p_{u, i, j}=\begin{cases} 1 & if r_{u, i} > r_{u, j} \\ 0 & otherwise \end{cases}

Under our settings, for a playlist $P$ with $n$ tracks, we are given $k=10$ visible tracks per playlist, while the remaining $n - k$ tracks are hidden. For each hidden track $t_{h}$ , we sample a negative track $t_{n}$ from the set $T - T_{p}$ . The model outputs a binary value, indicating if the first item should be ranked higher than the second.

The model's output is considered correct if it ranks the positive track (hidden) higher than the negative track (sampled). The performance of the models are measured using AUC [2]

3. Models Used

For the task of pairwise classification problem, we have chosen three models to rank the tracks: Jaccard Similarity model, Item2vec and FISM. We shall give the detail of each implementation in the following sections.

3.1. FISM

Factored Item Similarity Model (FISM) [3] is a kind of user-free model. In FISM, items are embedded into a latent space where an user-item query is defined as matrix multiplication between an item query matrix of items that user consumed and a target item vector.

Let $p$ denote a playlist and $i$ and $j$ denote tracks. We define a playlist-track interaction $r_{pi}$ . The recommendation score of FISM is then:

\hat r_{pi} = b_p + b_i + |p\setminus\{i\}|^{-\alpha}\sum_{j\in p\setminus\{i\}}\textbf{p}_j\textbf{q}_i^T,

where $b_p$ and $b_i$ are offset values for $p$ and $i$ respectively, $|p\setminus\{i\}|$ is the size of the playlist excluding $i$ , $\alpha$ is neighborhood agreement value in the range of $[0, 1]$ as in the original paper, and $\textbf{p}_j$ , $\textbf{q}_i$ are latent factor representations.

To train our model, we design a loss based on Bayesian Personalized Ranking (BPR) [4] which optimizes the area under the curve (AUC). Specifically, for every pair of positive sample and negative sample, we want to maximize the likelihood of our model successfully ranking the positive sample higher:

\mathcal{L}_{AUC}=-\sum_{p\in P}\sum_{i\in p, j \notin p}\ln\sigma(\hat r_{pi} - \hat r_{pj}),

where $P$ is the set of all playlists. Note that $b_p$ is canceled out in the loss so we get a model that is agnostic to specific playlists.

Our optimization problem is then to minimize the regularized loss:

\mathop{\mathrm{minimize}}\limits_{\textbf{b},\textbf{P},\textbf{Q}}~~\mathcal{L}_{AUC} + \lambda_1(||\textbf{b}||^2_2) + \lambda_2(||\textbf{P}||^2_2 + ||\textbf{Q}||^2_2).

In practice, we randomly sample a batch of $(p, i, j)$ in every step of gradient descent. We use $lambda_1=lambda_2=0.01$ , $alpha=1$ and a latent dimension of 5. We use Adam as the optimizer with a learning rate of 0.01, and we use a batch size of 32.

3.2. Item2vec

Item2vec is the generalized form of the popular Word2vec collaborative filtering model, which finds a similarity metric between all items in a corpus based on recurring proximity across many sequences. Similar to the way in which Word2vec develops a semantic understanding of word orderings and associations when the training data consists of sentences, we implement skip-gram Item2vec to learn associations between songs based on their inclusion and relative position in playlists. The fundamental assumption behind implementing Item2vec for playlist continuation is that a user assembling a playlist will tend to put similar songs near one another.

The loss function used during training contains the vital conditional probability that track $t_j$ is within a window of $w$ adjacent tracks to $t_i$ , iterated over all combinations of tracks $t_i$ and $t_j$ presented in the playlist data. The window hyperparameter $w$ will change the number of $t_i$ and $t_j$ pairs, and therefore the number of training examples $N$.

\begin{aligned} min \; -\frac{1}{N} \; \sum_{i} \; \sum_{j} \; log P(t_j | t_i) \\ \end{aligned}

The conditional probability between a pair of tracks $P(t_i |t_j)$ is estimated using an $M$ dimensional embedding vector for each track, $v_i$ and $u_j$ respectively, as well as the softmax function.

\begin{aligned} P(t_j | t_i) \; = \; \frac{exp(u_{j}^{T} v_{i})}{\sum_{k \in corpus} exp(u_{k}^{T} v_i)} \\ \end{aligned}

For each selection of window size $w$ and embedding size $M$ , the full training dataset is processed and weights are updated after each example to eventually arrive at a set of embedding vectors of dimension $||corpus||$ x $M$ .

To determine the optimality of parameters $w$ and $M$ , the trained model is evaluated on the validation set by observing the set of hidden tracks belonging to each playlist, randomly sampling an equal number of tracks from the corpus that are not in the playlist, then evaluating each pair of hidden/negative tracks to see which has a higher joint similarity to all visible tracks in the playlist. The set of parameters that yielded the best validation AUC of 0.74 included a window size of 11 and embedding size of 2275.

3.3. Jaccard Similarity

In the Jaccard similarity model, we record the interaction history of all playlist and all tracks. A set containing every track each playlist contains, and another containing all the playlists that a track is in. After generating these two data structure from the training set, we can define how can we recommend the tracks inside the withheld part of the playlist.

To determine the ranking between two tracks $t_{i}$ and $t_{j}$ in playlist $p$ , we sum the Jaccard Similarity [5] between the track in question ( $i$ and $j$ ) with the 10 visible songs in the playlist. And the track with the higher sum will be ranked higher.

4. Experiment Results

4.1. Pairwise Recommendation

For the testing set, we first filter the tracks in each playlist, removing tracks that are not present in the training set. Then, for each track in the hidden tracks, we randomly sampled a track from the set $T_{train} - T_{visible} - T_{withheld}$ , where $T_{train}$ is the set of all tracks that appeared in the training set.

We measure the performance using the AUC metric. The model's output is considered correct if it ranks the positive track (hidden) higher than the negative track (sampled). The following table shows a summary of the performance of each model on the testing set.

The three algorithms implemented to detect trends in how users construct playlists all performed above random chance with FISM performing the best, Jaccard performing second best, and Item2vec performing the worst. Interestingly, the Jaccard based heuristic of recommending songs present in the most similar playlists to the playlist in question outperformed the unsupervised, deep learning Item2vec method. One explanation for this result could come from Item2vec's requirement of a windowing parameter, which further subdivides training examples into small chunks, and breaks the correlation between two tracks that appear within the same playlist, but are more than $\frac{w-1}{2}$ tracks away from one another. This problem also makes the algorithm biased towards larger playlists, where the likelihood of two similar songs occurring near one another is lower than in smaller playlists. The Item2vec algorithm was initially chosen thinking that it's success in identifying synonyms after training on a collection of sentences would be advantageous in identifying synonymous tracks, which occur infrequently, but show up in close proximity to highly popular tracks.

The factored item similarity model (FISM), which is similar to Item2vec in its aim to learn embedded representations of tracks agnostic of user based features, performed extremely well relative to the two alternative models. FISM is effective at learning item-to-item representations even with increasingly sparse interaction data. In the context of the playlist completion task, sparse interaction data takes the form of playlists that contain tracks only appearing a handful of times in the entire dataset.

4.2. Automatic Playlist Continuation

For the sake of experiment, we ran the Jaccard based model on the slightly tweaked automatic playlist continuation task. Since we do not have access to the ground truth of the 500 recommended songs for each playlist, we have to evaluate our model in a slightly tweaked setting. In the tweaked version, we are no longer extending the playlist to 500 tracks. Instead, we generate recommendation to fill-in the withheld tracks.

To extend each playlist, for each track in all tracks in the set $T_{train} - T_{visible} - T_{withheld}$ , we calculate it's pairwise Jaccard Similarity with all the visible tracks in the playlist. The Jaccard Similarity is calculated with the set of playlist each track was in. The sum of these similarities would be the similarity score for that track to the playlist we are extending. We then rank all the tracks based on this score, and the top $n - k$ ranked tracks will be our recommendations.

To evaluate the model, we considered two cases of available tracks. Following the 2018 RecSys contest: Type 6 denotes samples where only the first 25 tracks are visible to us, while in Type 7, we can see25 random tracks. We used only 10 percent of the samples in type6 and type7 of the challenge set, and the results is reported in the following table

An interesting observation can be made on the results. Models trained on randomly selected visible set performed significantly better than the sequentially selected ones. We believe that this is because tracks in playlist has the tendency to display locality, meaning adjacent tracks in a playlist tend to share some characteristics (same artist, genre or albums). Random samples of the tracks gives a more holistic view of the playlist, and we get a better representation from the visible set.