Representation, Exploration and Recommendation of Playlists

  • Piyush Papreja
  • Hemanth Venkateswara
  • Sethuraman Panchanathan

Conference paper. Part of the Communications in Computer and Information Science book series (CCIS, volume 1168).


Playlists have become a significant part of our listening experience because of digital cloud-based services such as Spotify, Pandora, and Apple Music, making playlist recommendation crucial to music services today. With an aim towards playlist discovery and recommendation, we leverage sequence-to-sequence modeling to learn fixed-length representations of playlists in an unsupervised manner. We evaluate our work on a recommendation task, along with embedding-evaluation tasks, to study the extent to which semantic characteristics such as genre and song order are captured by the playlist embeddings and how they can be leveraged for music recommendation.


Keywords: Playlists · Sequence-to-sequence · Recommendation

1 Introduction

In this age of cloud-based music streaming services such as Spotify, Pandora, and Apple Music, users have grown accustomed to the extended music listening experiences typically provided by playlists. As a result, playlist recommendation has received considerable attention over the past few years. However, the playlist recommendation task has so far been treated as analogous to playlist prediction [1] and continuation [2] rather than discovery. With billions of playlists already in existence, and thousands added every day, playlist discovery forms a significant part of the overall playlist recommendation pipeline. This work focuses on finding and recommending these existing playlists. We take inspiration from research in natural language processing and model playlist embeddings the way sentence embeddings are modeled, leveraging the analogy playlists:songs :: sentences:words and the sequence-to-sequence [3] learning technique.

In this work, we learn playlist embeddings in an unsupervised manner. We consider two main kinds of embedding models for this work: (a) Seq2seq models and (b) Bag of Words (BoW) models. We evaluate the models using recommendation and embedding-evaluation tasks, with the goal of analyzing the extent of information encoded by different models, and assessing the suitability of our approach for the purpose of recommendation. To the best of our knowledge, our work is the first attempt at modeling and extensively analyzing compact playlist representations for playlist recommendation. The demo, dataset, and slides for our work can be accessed online at

2 Seq2Seq Learning

Here we briefly describe the RNN Encoder-Decoder framework, first proposed in [4] and later improved in [3], upon which our model is based. Given a sequence of input vectors \(x = \{x_{1}, x_{2}, x_{3},\ldots ,x_{T}\}\), the encoder reads this sequence and outputs a vector c called the context vector. The context vector is a compressed representation of the input sequence, which is fed to the decoder to predict tokens of the target sequence. A significant limitation of this approach is that the model fails to capture long-term dependencies for relatively long sequences [5]. This problem was partially mitigated in [3] by using LSTM units instead of vanilla RNN units and by feeding the input sequence in reversed order.

Bahdanau et al. [6] introduced the attention mechanism to address this problem: when predicting the output at a particular time step, the decoder focuses on a specific portion of the input sequence. The attention mechanism ensures that the encoder does not have to compress all the information into a single context vector. In this setting, the context vector \(c_i\) is computed as a weighted sum of the encoder hidden states \(h_{j}\):
$$\begin{aligned} c_{i}=\sum _{j=1}^{T_{x}} \alpha _{i j} h_{j} \end{aligned}$$
where \(\alpha _{i j}\) is calculated as follows:
$$\begin{aligned} \alpha _{i j}=\frac{\exp \left( e_{i j}\right) }{\sum _{k=1}^{T_{x}} \exp \left( e_{i k}\right) } \end{aligned}$$
where \(e_{i j}=a\left( s_{i-1}, h_{j}\right) \), \(s_{i-1}\) is the decoder state at time step \(i-1\), and \(h_j\) is the encoder state at time step j. Here a(.) is the alignment model, which scores how well the output at time step i aligns with the input at time step j. The alignment model is a shallow feed-forward neural network trained jointly with the rest of the network.
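The attention computation above can be sketched in a few lines of NumPy. For illustration, the alignment model a(.) is replaced by a plain dot product between states, rather than the trained feed-forward network described in the text:

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Compute the attention context vector c_i for one decoder step.

    decoder_state:  s_{i-1}, shape (d,)
    encoder_states: h_1..h_T stacked as rows, shape (T, d)
    """
    # e_ij = a(s_{i-1}, h_j): alignment scores (dot-product stand-in)
    scores = encoder_states @ decoder_state
    # alpha_ij = softmax over j (numerically stable)
    scores = scores - scores.max()
    alphas = np.exp(scores) / np.exp(scores).sum()
    # c_i = sum_j alpha_ij * h_j
    context = alphas @ encoder_states
    return context, alphas
```

With a trained alignment network, only the `scores` line changes; the softmax normalization and the weighted sum are identical.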

3 Embedding Models

In this section, we present the embedding models that we consider for this work:
  1.

    Bag-of-words Model (BoW): For baseline comparison we apply a variant [7] of BoW, which uses a weighted averaging scheme to obtain the embedding vectors, followed by a modification using singular-value decomposition (SVD). This method of generating sentence embeddings is a stronger baseline than traditional (unweighted) averaging.

  2.

    Base Seq2seq Encoder (base-seq2seq): We use a deep, unidirectional RNN-based model with global attention for our base seq2seq model.

  3.

    Bidirectional Seq2seq Encoder (bi-seq2seq): For this model, the encoder hidden states \(h_t\), where \(t\in \{1,\ldots ,n\}\), are the concatenation of the states of a forward RNN and a backward RNN that read the sequence in opposite directions. Global attention is used for this model as well.
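A minimal sketch of how a bidirectional encoder produces the concatenated states \(h_t\). Vanilla tanh RNN cells are used here for brevity; the models above use LSTM/GRU units with attention:

```python
import numpy as np

def rnn_states(xs, W, U, h0):
    """Vanilla RNN: h_t = tanh(W x_t + U h_{t-1}); returns all states."""
    hs, h = [], h0
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        hs.append(h)
    return hs

def bidirectional_states(xs, W_f, U_f, W_b, U_b, d):
    """h_t = [fwd_t ; bwd_t]: concatenation of a forward pass and a
    backward pass over the reversed sequence, as in bi-seq2seq."""
    fwd = rnn_states(xs, W_f, U_f, np.zeros(d))
    # Run over the reversed input, then re-reverse so index t aligns
    bwd = rnn_states(xs[::-1], W_b, U_b, np.zeros(d))[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each \(h_t\) thus has twice the hidden dimension of either directional RNN, which is what the attention mechanism then attends over.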


4 Experimental Setup

4.1 Data: Source and Filtering

We created the corpus by downloading 1 million publicly available playlists from Spotify using the Spotify developer API. As part of cleaning the data before training, we follow [8] in discarding the less frequent and less relevant items from our dataset. First, we remove tracks occurring in fewer than 3 playlists, thereby removing rare songs from the corpus. This is a common preprocessing step in NLP-based works, equivalent to denoising the data by making the association weights between the more popular words stronger through the removal of their associations with less frequent words, as mentioned in [9]. All duplicate tracks within playlists are also removed. Finally, playlists with lengths in the range \(\{10\ldots 5000\}\) are retained and the rest are discarded. This resulted in a total of 745,543 unique playlists, 2,470,756 unique tracks, and 2680 unique genres, which we use as training data.
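The filtering steps above can be sketched as follows; `playlists` is assumed to be a list of track-id lists, and the thresholds mirror those in the text:

```python
from collections import Counter

def filter_corpus(playlists, min_track_playlists=3, min_len=10, max_len=5000):
    """Preprocessing sketch: drop tracks appearing in fewer than
    `min_track_playlists` playlists, de-duplicate tracks within a playlist
    (keeping the first occurrence), and keep only playlists whose final
    length lies in [min_len, max_len]."""
    # Document frequency: number of playlists each track occurs in
    df = Counter(t for pl in playlists for t in set(pl))
    kept = []
    for pl in playlists:
        seen, cleaned = set(), []
        for t in pl:
            if df[t] >= min_track_playlists and t not in seen:
                seen.add(t)
                cleaned.append(t)
        if min_len <= len(cleaned) <= max_len:
            kept.append(cleaned)
    return kept
```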

4.2 Data Labeling: Genre Assignment

The songs in our dataset do not have genre labels; however, artists do. Although every song maps to an artist, we do not simply use the artist's genre for the song because (1) an artist can have songs of different genres, and (2) since genres are subjective in nature (rock vs. soft-rock vs. classic rock), having a large number of genres for songs would introduce ambiguity between the genres with respect to empirical evaluation (classification) and add to the complexity of the problem. Hence, we aim to reduce the number of genres so that they are relatively mutually disjoint.

To achieve this we train a word2vec model [10]1 on our corpus to get song embeddings, which capture the semantic characteristics (such as genre) of the songs by virtue of their co-occurrence in playlists. Separate models are trained for embedding sizes \(k = \{500,750,1000\}\). For each of the embedding sizes, the resulting song embeddings are clustered into 200 clusters2. For each cluster, the artist genre is applied to the corresponding song and a genre-frequency (count) dictionary is created. From this dictionary, the genre having a clear majority3 is assigned as the genre for all the songs in that cluster. All the songs in a cluster with no clear genre majority are discarded from the corpus. Based on the observed genre distribution in the data, and as a result of clustering sub-genres (such as soft-rock) into parent genres (such as rock), the genres finally chosen for annotating the clusters are: Rock, Metal, Blues, Country, Classical, Electronic, Hip Hop, Reggae and Latin. To validate our approach, we train a classifier on our dataset of annotated song embeddings. With training and test sets kept separate, we achieve 94% test4 accuracy.
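The majority-vote labeling of a single cluster might look like the following sketch. The `majority` threshold of 50% is an assumption for concreteness, since the paper describes the decision as subjective (see footnote 3), and the artist genres are assumed to already be collapsed to parent genres (e.g. soft-rock to rock):

```python
from collections import Counter

def cluster_genre(artist_genres, majority=0.5):
    """Assign a genre to one cluster by majority vote.

    artist_genres: dict mapping each song in the cluster to its
                   artist-level (parent) genre.
    Returns the majority genre, or None when there is no clear majority,
    in which case the cluster's songs are discarded from the corpus.
    """
    counts = Counter(artist_genres.values())
    genre, n = counts.most_common(1)[0]
    return genre if n / sum(counts.values()) > majority else None
```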

For playlist-genre annotation, only the playlists having annotations for all the songs are considered, which leaves us with 339,998 playlists in total. This is done to perform a confident evaluation of the playlist embeddings by not making any assumptions about the genre information of songs that are not annotated. Further, since we use hard-labels [11] for the annotation process to make the evaluation task simpler, only those playlists are assigned genres for which more than 70% of the songs have the same genre. These playlists are used for the GDPred and the Recommendation evaluation tasks described in Sect. 5.
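The hard-labeling rule for playlists can be sketched as below; `None` stands for a song without an annotation, which disqualifies the whole playlist:

```python
from collections import Counter

def playlist_genre(song_genres, threshold=0.7):
    """Hard-label a playlist: it receives a genre only when every song is
    annotated and more than `threshold` of the songs share one genre;
    otherwise it is left unlabeled and excluded from evaluation."""
    if any(g is None for g in song_genres):
        return None
    genre, n = Counter(song_genres).most_common(1)[0]
    return genre if n / len(song_genres) > threshold else None
```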

4.3 Training

We now outline our approach for estimating playlist embeddings using the following models:
  1.

    BoW Model: We experiment with a weighted BoW model where the weight assigned to each song w is \(a /(a+p(w))\). Here, a is the control parameter, chosen between \(e^{-3}\) and \(e^{-5}\), and p(w) is the (estimated) song frequency.

  2.

    Seq2seq-based Models: We train our seq2seq models as autoencoders (the target sequence is the same as the source sequence, i.e., a playlist), where the encoders and decoders are 3-layer networks with hidden-state size \(k\in \{500,750,1000\}\). We experiment with both LSTM and GRU units, using Adam and SGD optimizers. We also clip the gradient norm at 1 to prevent exploding gradients.
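The weighted BoW scheme of [7] can be sketched as follows, assuming `song_vecs` and `freqs` are hypothetical lookups (song id to embedding, and song id to estimated frequency p(w)) built from the corpus:

```python
import numpy as np

def weighted_bow_embeddings(playlists, song_vecs, freqs, a=1e-4):
    """Weighted-BoW sketch after Arora et al. [7]: each song vector is
    weighted by a / (a + p(w)), playlist vectors are the weighted averages,
    and the projection onto the corpus' first singular vector (the common
    component) is removed via SVD."""
    X = np.stack([
        np.mean([song_vecs[s] * (a / (a + freqs[s])) for s in pl], axis=0)
        for pl in playlists
    ])
    # First right singular vector of the playlist-embedding matrix
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    # Subtract each embedding's projection onto the common direction
    return X - np.outer(X @ u, u)
```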

Fig. 1.

Permute Classification Task Results: Seq2seq models outperform the BoW model in capturing the order of songs in a playlist. The performance of the seq2seq models also improves as the proportion of permuted songs increases.

5 Evaluation Tasks

In this section, we outline the criteria used to evaluate our playlist embeddings for information content and playlist recommendation. As per the definition of a playlist [12] and the characteristics that make up a good playlist [13], a good playlist embedding should encode, among other traits, the genre of the songs it contains, the order of the songs, the length of the playlist (which directly shapes the user's listening experience), and the songs themselves. Based on this, we propose the following experiments for playlist-embedding evaluation:
  • Genre Diversity Prediction Task (GDPred-Task): This task measures the extent to which the playlist embedding captures the homogeneity and diversity of the songs (with regards to their genre) constituting it. Given a playlist embedding, the goal of the classifier is to predict the number of genres spanned by the songs in that playlist. The task is formulated as multi-class classification, with 3 output classes being low diversity (0–3 genres), medium diversity (3–6 genres) and high diversity (6–9 genres).

  • Song-content Task (SC-Task): This closely follows the Word Content (WC) task [14], which evaluates whether the original words in a sentence can be recovered from its sentence embedding. We pick 750 mid-frequency songs (the middle 750 songs in our corpus sorted by occurrence count), and sample equal numbers of playlists that contain one and only one of these songs. We formulate it as a 750-way classification problem where the aim of the classifier is to predict, given the playlist embedding, which of the 750 songs the playlist contains.

  • Permute Classification Task: Through this task we aim to answer the question: can the proposed embedding models capture song order, and if so, to what extent? We split this task into two subtasks: (i) the Shuffle task, and (ii) the Reversal task. In the Shuffle task, for each playlist in our task-specific dataset5, we randomly select a fraction of the songs in that playlist and shuffle them to create a permuted playlist. We then train a binary classifier to distinguish between the original and the permuted playlist embeddings. The Reversal task is similar, except that the randomly selected sub-sequence of songs is reversed.

  • Recommendation Task: Recommendation, being inherently subjective, is best evaluated with user-labeled data. In the absence of such annotated datasets, we evaluate our approach by measuring the extent to which the playlist space created by the embedding models is relevant, in terms of the similarity of the genre and length of closely-lying playlists. We quantify the relevance of the embedding space by computing precision-recall scores over genre and length labels for a set of query playlists. We use approximate nearest-neighbor search via Spotify's ANNOY library [15] to index the genre-annotated playlist embeddings described in Sect. 4.2. A query playlist is selected at random and the search results are compared with it in terms of genre and length. There are nine possible genre labels. For comparing length, ten output classes (spanning the range \(\{30\ldots 250\}\)) corresponding to bins of size 20 are created. The final precision value is the average over 100 queries for each recall value.
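The precision computation for this task can be sketched as below, with a brute-force cosine nearest-neighbor search standing in for the ANNOY index; function names here are illustrative:

```python
import numpy as np

def precision_at_k(query, embeddings, labels, k):
    """Retrieve the k nearest playlists to a query embedding and report
    the fraction whose label (genre, or length bin) matches the query's."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = embeddings[query] / np.linalg.norm(embeddings[query])
    sims = E @ q
    sims[query] = -np.inf              # exclude the query itself
    nn = np.argsort(-sims)[:k]
    return np.mean([labels[i] == labels[query] for i in nn])

def length_bin(n, lo=30, width=20):
    """Length label: one of ten bins of width 20 over {30..250}."""
    return min((n - lo) // width, 9)
```

Averaging `precision_at_k` over many random queries, at varying k, yields precision values comparable to the precision-recall curves reported in Fig. 2.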

6 Results

The results for the GDPred-Task and the SC-Task are outlined in Table 1. In the GDPred-Task, the BoW model performs better than the seq2seq models, achieving 80% accuracy, while the seq2seq models achieve 76%.
Fig. 2.

Recommendation Tasks Results. (a) Genre Recommendation (b) Length Recommendation. BoW model captures genre information better, whereas seq2seq models capture length information better.

For the SC-Task, the seq2seq models perform poorly compared to the BoW model. However, our results for the seq2seq models closely match the results for the same task in [14], where the authors attribute the seq2seq models' inability to capture content-based information to the complexity of the way the information is encoded.

As seen in Fig. 1, for the Permute Classification Task, the seq2seq model correctly distinguishes the permuted playlists from the original ones as the proportion of permuted songs increases6. The BoW model, on the other hand, fails the task because it cannot capture order information, making the seq2seq models better suited for capturing song order in a playlist.

The Recommendation task, as shown in Fig. 2a and b, yields some interesting insights into the effectiveness of the different models at capturing different characteristics. First, the high precision values demonstrate the relevance of the playlist embedding space, which is the first and foremost requirement of a recommendation system. Also, BoW models capture genre information7 better than seq2seq models (Fig. 2a), while length information is better captured by the seq2seq models (Fig. 2b), demonstrating the suitability of different models for different tasks.
Table 1.

Evaluation task accuracies for the embedding models for size 750.
7 Conclusions

We have presented a sequence-to-sequence-based approach for learning playlist embeddings, which can be used for tasks such as playlist comparison and recommendation. First, we define the problem of learning a playlist embedding and describe how we formulate it as a seq2seq problem. We compare our proposed model with the weighted BoW model on embedding-evaluation tasks as well as on a recommendation task. We show that our proposed approach is effective in capturing the semantic properties of playlists, and suitable for recommendation purposes.


  1. Word2vec details: algorithm: Skip-gram; playlist length range: \(\{30\ldots 3000\}\); min. song frequency threshold: 5; negative sampling: 5; window size: 5.

  2. This number was chosen to obtain as many clusters as possible while keeping the total small enough to make annotating the data feasible.

  3. This was a subjective decision. For example, a cluster with the dictionary {rock: 5, indie-rock: 3, blues: 2, soft-rock: 7} is assigned the genre rock.

  4. Result achieved for embedding size 750. Comparable results were achieved for other sizes.

  5. A list of 38,168 playlists with lengths in the range \(\{50,\ldots ,100\}\).

  6. Results for the bi-seq2seq model follow a similar trend.

  7. Since BoW-created playlist embeddings lie in the song space (being computed as the arithmetic mean of song embeddings), where genre annotation happens, they perform better.



The authors thank ASU, Adidas, and the National Science Foundation for their funding support. This material is partially based upon work supported by Adidas and by the National Science Foundation under Grant No. 1828010.


  1. Andric, A., Haus, G.: Automatic playlist generation based on tracking user’s listening habits. Multimed. Tools Appl. 29(2), 127–151 (2006)
  2. Volkovs, M., Rai, H., Cheng, Z., Wu, G., Lu, Y., Sanner, S.: Two-stage model for automatic playlist continuation at scale. In: Proceedings of the ACM Recommender Systems Challenge 2018, p. 9. ACM (2018)
  3. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in NIPS (2014)
  4. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
  5. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
  6. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
  7. Arora, S., Liang, Y., Ma, T.: A simple but tough-to-beat baseline for sentence embeddings (2016)
  8. De Boom, C., et al.: Large-scale user modeling with recurrent neural networks for music discovery on multiple time scales. Multimed. Tools Appl. 77(12), 15385–15407 (2018)
  9. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3, 211–225 (2015)
  10. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
  11. Galstyan, A., Cohen, P.R.: Empirical comparison of “hard” and “soft” label propagation for relational classification. In: Blockeel, H., Ramon, J., Shavlik, J., Tadepalli, P. (eds.) ILP 2007. LNCS (LNAI), vol. 4894, pp. 98–111. Springer, Heidelberg (2008)
  12. Fields, B., Lamere, P.: Finding a path through the juke box: the playlist tutorial. In: 11th International Society for Music Information Retrieval Conference (ISMIR) (2010)
  13. De Mooij, A.M., Verhaegh, W.F.J.: Learning preferences for music playlists. Artif. Intell. 97(1–2), 245–271 (1997)
  14. Conneau, A., Kruszewski, G., Lample, G., Barrault, L., Baroni, M.: What you can cram into a single vector: probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070 (2018)
  15. Bernhardsson, E.: Annoy: approximate nearest neighbors in C++/Python optimized for memory usage and loading/saving to disk (2013)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Piyush Papreja, Arizona State University, Tempe, USA
  • Hemanth Venkateswara, Arizona State University, Tempe, USA
  • Sethuraman Panchanathan, Arizona State University, Tempe, USA
