The main idea of our approach is to compute recommendations based on the listening histories of users and contextual information regarding audio content and situational features. Particularly, we model and exploit pairwise interaction effects between these different contexts, between users and contexts and between tracks and contexts.
An overview of the proposed framework is given in Fig. 1, where the steps taken to extract contextual information that is leveraged in the recommendation computation are outlined. As input for the proposed approach, we require a dataset of playlists (i.e., sets of tracks) assembled by users as presented in Section 4. Based on this dataset (shown in Fig. 1 as “Spotify Playlists Dataset”), we compute two types of contextual information for the computation of multi-context-aware track recommendations: (i) playlist archetypes (clusters) and (ii) situational clusters. For playlist archetypes (“Acoustic Cluster Component” in Fig. 1), the input comprises the track id and the acoustic features for each track as provided by the Spotify API (cf. Section 4). This component computes the assignment of each track to an acoustic cluster. We describe this procedure in detail in Section 5.1. For computing situational clusters, the input comprises the track id and the names of the playlists the track is contained in. This component (“Situational Cluster Component” in Fig. 1) computes the assignment of each track to a situational cluster. We detail this procedure in Section 5.2.
The extracted context information allows modeling user preferences for tracks contained in certain playlist archetypes in a given situation. We refer to the clusters mined from acoustic features as acoustic feature clusters (AC) and to the clusters mined from playlist names as situational clusters (SC). To finally incorporate this information (user, track, AC, and SC assignments) as input into a context-aware recommender system tackling the problem as stated in Section 3, we propose to utilize Factorization Machines (FM) [46] in a recommendation component (“Recommendation Component” in Fig. 1). This allows capturing a user’s preference towards a certain archetype of music in a certain situational context and exploiting the interaction effects between these two notions of context. This procedure results in a list of tracks sorted by the predicted relevance score for the given user in a given situation. We describe the recommendation computation in more detail in Section 5.3.
Playlist archetypes
The proposed approach relies on clusters of playlists (archetypes) that share similar acoustic features (e.g., the tempo of the tracks contained). The major steps of this computation are also depicted in Fig. 1. In a first step, we aggregate the eight acoustic features obtained via the Spotify API (cf. Section 4) of each playlist using the arithmetic mean. To ensure that the arithmetic mean is indeed representative, we analyze the dispersion of the tracks forming a playlist by comparing the mean and mean absolute deviation (MAD) [33] for each feature of each playlist. Here, we argue that the MAD is a robust measure with respect to outliers. With this analysis, we find that except for loudness, the variance of each of the acoustic characteristics of the tracks inside a playlist is low and the MAD is rarely higher than the mean. This allows us to conclude that aggregating the characteristics of the individual tracks to playlist characteristics using the mean is representative. For loudness, the variance among the tracks of a playlist is too high: in 99.99% of all cases, the MAD is higher than the mean. Therefore, we drop the loudness characteristic for the conducted playlist analyses and refer to [43] for further analyses of the clusters. This aggregation step provides us with a lower-dimensional m × n matrix AFM (acoustic feature matrix), where each row represents a playlist and each column represents one of the proposed acoustic features. To find archetypes of music a user listens to, we conduct a Principal Component Analysis (PCA) [40] via SVD [21] on the standardized matrix AFM (all columns are scaled to a mean of 0 and a standard deviation of 1).
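The mean aggregation and the MAD-based dispersion check can be sketched as follows. This is a minimal illustration with numpy; the function name and the toy feature values are our own and do not come from the original implementation.

```python
import numpy as np

def aggregate_playlist_features(track_features):
    """Aggregate per-track acoustic features (rows) into playlist-level
    means, and flag features whose mean absolute deviation (MAD) exceeds
    the mean, i.e., features the mean does not represent well."""
    X = np.asarray(track_features, dtype=float)  # shape: (n_tracks, n_features)
    mean = X.mean(axis=0)
    mad = np.abs(X - mean).mean(axis=0)          # mean absolute deviation per feature
    unreliable = mad > np.abs(mean)              # True where MAD exceeds |mean|
    return mean, mad, unreliable

# Toy playlist: three tracks, two features (a tempo-like and a loudness-like value)
mean, mad, unreliable = aggregate_playlist_features(
    [[120.0, -5.0], [124.0, 3.0], [118.0, -7.0]])
```

In this toy example, the second (loudness-like) feature is flagged as unreliable because its MAD exceeds the magnitude of its mean, mirroring the reason loudness is dropped above.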
The principal components (PCs) obtained by the conducted PCA allow explaining differences in playlists and, more importantly, estimating the number of acoustic clusters (ACs) to be obtained via the explained variance of each PC (squared singular values \({s_{i}^{2}}\) (diagonal of Σ)). For k = 5 clusters, the cumulative explained variance of the principal components is 85.64% and hence exceeds the 80% threshold. Thus, we set the number of acoustic feature clusters to be computed to k = 5. We compute the 5 clusters by applying k-means on the dimension-reduced matrix AFM. The clustering assigns each playlist and hence, implicitly, each track to one of five playlist archetypes that allow capturing a user’s preferences towards certain types of music. We depict the result of this approach in Fig. 2, where each playlist is represented by an integer that represents the cluster assignment. The clusters are marked by individual colors and are annotated with the respective acoustic features. From the conducted PCA, we observe that playlists that are highly influenced by instrumental and acoustic features are separated from the remaining playlists by the first PC (PC1). Furthermore, PC1 and PC2 separate energetic playlists with high tempo from the remaining playlists. Finally, we are also able to separate playlists with high valence and danceability characteristics by PC1 and PC2. PC3, not visible in Fig. 2, separates playlists with high speechiness values from other playlists. The clusters (archetypes) obtained serve as one notion of context to be used for the computation of multi-context-aware track recommendations. We refer to our previous work in [43] for further details on this approach and analyses of the resulting clusters.
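The variance-based choice of the number of clusters can be sketched as follows. This is a simplified numpy sketch of the PCA-via-SVD step; the function name, the 80% default, and the toy matrix are illustrative, not the original code.

```python
import numpy as np

def choose_num_clusters(afm, threshold=0.80):
    """Standardize the acoustic feature matrix, run PCA via SVD, and return
    the smallest number of components whose cumulative explained variance
    exceeds `threshold` (used here as the number k of clusters)."""
    X = np.asarray(afm, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance per column
    _, s, _ = np.linalg.svd(X, full_matrices=False)
    explained = s**2 / np.sum(s**2)            # explained-variance ratio per PC
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return k, cumulative
```

With a real AFM, k-means (e.g., scikit-learn's `KMeans`) would then be run on the first k principal components to obtain the playlist archetypes.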
Situational clusters
Besides capturing musical preferences, we also aim to contextualize playlists by extracting situational context from the names of playlists. The underlying assumption here is that the names of playlists provide information about the situational context in which the playlist’s tracks are listened to (e.g., “Summer Fun”, “Workout Mix”, or “Christmas”). Along the lines of [42, 44], we mine for activities and other descriptors (seasons, events, etc.) in the names of playlists.
As depicted in Fig. 1, we first lemmatize all terms contained in playlist names using WordNet [39]. Next, we remove stop words and non-contextual terms (e.g., genre, artist, and track names) as these do not provide any contextual information. Furthermore, we utilize AlchemyAPI’s entity recognition services to remove playlist names that do not provide any contextual information. These are mostly playlist names that consist of artist names, track names, or genre descriptions. This results in a set of cleaned lemmata per playlist. However, those playlist names are rather short and heterogeneous, which makes it challenging to create a meaningful distance matrix suitable for clustering playlists based on their names. Therefore, we again use WordNet to enrich the lemmata of each playlist with semantically matching synonyms and hypernyms, yielding a more expressive bag-of-words representation. For each resulting bag of lemmata representing a playlist name, we compute the term frequency-inverse document frequency (tf-idf). Playlist similarities can now be computed by the pairwise cosine similarity of the resulting vectors. Based on these similarities, we span a distance matrix and find contextually similar playlists by applying k-means clustering. Along the lines of [42] (cf. Section 6), we empirically determine the number of clusters and set this to k = 23. This provides us with a set of 23 situational clusters capturing in which context a user listened to certain tracks. For instance, one of the clusters comprises Christmas songs, whereas another cluster comprises playlists and tracks related to a “summer” theme (e.g., containing playlist names such as “my summer playlist”, “summer 2015 tracks”, “finally summer” and “hot outside”).
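The tf-idf and cosine-similarity steps can be sketched as follows. This minimal sketch assumes the lemmatization and WordNet enrichment have already happened upstream; the function names and the toy bags of lemmata are our own illustration, not the original pipeline.

```python
import numpy as np
from collections import Counter

def tfidf_matrix(bags):
    """Build a tf-idf matrix from bags of lemmata (one bag per playlist name)."""
    vocab = sorted({t for bag in bags for t in bag})
    idx = {t: i for i, t in enumerate(vocab)}
    n = len(bags)
    tf = np.zeros((n, len(vocab)))
    for r, bag in enumerate(bags):
        for term, count in Counter(bag).items():
            tf[r, idx[term]] = count / len(bag)      # term frequency
    df = np.count_nonzero(tf > 0, axis=0)            # document frequency
    idf = np.log(n / df)                             # inverse document frequency
    return tf * idf, vocab

def cosine_similarity(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                          # guard against all-zero rows
    U = M / norms
    return U @ U.T

# Toy playlist-name bags: two "summer" playlists and one "christmas" playlist
M, vocab = tfidf_matrix([["summer", "sun", "beach"],
                         ["summer", "holiday"],
                         ["christmas", "winter"]])
sim = cosine_similarity(M)
```

The resulting similarity matrix would then be turned into a distance matrix and clustered with k-means, as described above.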
We refer to our previous work [42, 44] for further details on the computation of situational clusters and their usage in recommendation scenarios. In the next section, we present how we incorporate the gained contextual information in the computation of recommendations.
Recommendation computation
The context extraction steps described in Sections 5.1 and 5.2 provide us with information about (i) a user’s preference for playlist archetypes, and (ii) the situational context in which a user listens to certain tracks. This information is extracted in the form of user-cluster assignments. We now combine these clusters and the listening history of users in a joint user model that informs the track recommender system.
In this work, we propose to use FMs [46] for the computation of recommendations, i.e., to compute a predicted rating \(\hat {r}\) for a given user i and a given track j, incorporating situational clusters (SCs) and acoustic feature-based clusters (ACs). We process the input for the rating prediction task as follows: first, <user,track>-pairs are enriched by the corresponding contextual cluster assignments, now forming <user,track,AC,SC>-tuples (as can also be seen in Fig. 1). By adding a fifth column—rating r—to each entry in the dataset, we derive the input matrix R for our rating prediction problem to be solved (holding user, track, AC, SC, and rating columns).
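The construction of the enriched tuples can be sketched as follows. This is a hypothetical helper (the function name and dict-based cluster lookups are our own); it only illustrates joining each observed <user,track> pair with its AC and SC assignments and a rating of 1.

```python
def build_rating_matrix(interactions, track_ac, track_sc):
    """Enrich <user, track> pairs with the track's acoustic cluster (AC)
    and situational cluster (SC), and attach rating r = 1 for each
    observed interaction. `track_ac` and `track_sc` map track id -> cluster id."""
    rows = []
    for user, track in interactions:
        rows.append((user, track, track_ac[track], track_sc[track], 1))
    return rows

# Toy data: one user, two tracks, hypothetical cluster ids
rows = build_rating_matrix([("u1", "t1"), ("u1", "t2")],
                           {"t1": 0, "t2": 3},
                           {"t1": 7, "t2": 7})
```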
Our dataset does not contain any implicit feedback by users (i.e., play counts, skipping behavior, or session duration). Therefore, we cannot estimate any preference towards an item as, e.g., proposed by [18]. However, we assume that adding a track to a playlist signals a user’s preference for the track. As the recommendation task is transformed into a rating prediction task, we require the dataset to also include negative examples. Therefore, for each user, we randomly add tracks the user did not interact with in a given situation (i.e., tracks tj with ri,j = 0 for the given user ui) to the dataset until the listening history of each user in both the training and test sets is filled with 50% relevant and 50% non-relevant items for the user. We balance the two classes in this way to avoid class imbalance and, hence, a bias towards the negative class (naturally, the number of tracks a user did not listen to is much larger than the number of tracks listened to, as users only listen to a small fraction of the available songs). Hence, for each unique <user,track,AC,SC>-tuple, the rating rijsc is defined as stated in (4).
$$ r_{ijsc}= \begin{cases} 1 & \text{if}\ u_{i}\ \text{listened to}\ t_{j}\ \text{in}\ SC_{s}\ \text{and}\ AC_{c} \\ 0 & \text{otherwise} \end{cases} $$
(4)
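The balanced negative sampling can be sketched as follows. This is a simplified sketch under the stated 50/50 assumption; the function name, the fixed seed, and the cluster-lookup dicts are our own illustration.

```python
import random

def add_negative_samples(positives, all_tracks, track_ac, track_sc, seed=0):
    """For each user, sample as many unobserved tracks (rating 0) as the
    user has observed tracks (rating 1), yielding a 50/50 class balance.
    `positives` is a list of (user, track, ac, sc, 1) tuples."""
    rng = random.Random(seed)
    by_user = {}
    for user, track, *_ in positives:
        by_user.setdefault(user, set()).add(track)
    negatives = []
    for user, listened in by_user.items():
        candidates = [t for t in all_tracks if t not in listened]
        for track in rng.sample(candidates, min(len(listened), len(candidates))):
            negatives.append((user, track, track_ac[track], track_sc[track], 0))
    return positives + negatives
```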
Based on this dataset, for computing the predicted rating \(\hat {r}\), we model the influence of a user i, a track j, the situational cluster s, and the content-based cluster c on \(\hat {r}\) in a FM. FMs allow us to model all pairwise interactions, i.e., the influence of the simultaneous occurrence of two variable values, such as a track j and the contexts s and c, or a user i and the contexts s and c. Furthermore, we model the interaction of the contexts c and s, which can be interpreted as the influence of the current activity of a user (SC) on the playlist archetype (AC) and vice versa. This is shown in (5): the FM computes \(\hat {r}\) by estimating a global bias (w0), the influence of the user, the track, and the contexts (\({\sum }_{i=1}^{n} w_{i} x_{i}\)), and the quadratic interaction effects between those (\({\sum }_{j=i+1}^{n} \langle {\mathbf {v}}_{\mathbf {i}},{\mathbf {v}}_{\mathbf {j}} \rangle x_{i}x_{j}\)). However, instead of learning an individual weight wi,j for each interaction effect, as traditional approaches such as logistic regression with quadratic interaction effects do, FMs rely on factorization to model the interaction as the inner product 〈vi,vj〉 of low-dimensional vectors [46].
$$ \hat{r}_{FM} = w_{0} + {\sum}_{i=1}^{n} w_{i} x_{i} + {\sum}_{i=1}^{n} {\sum}_{j=i+1}^{n} \langle {\mathbf{v}}_{\mathbf{i}},{\mathbf{v}}_{\mathbf{j}} \rangle x_{i} x_{j} $$
(5)
The weights of the latter interaction effects are computed by applying matrix factorization during the FM optimization using a Markov Chain Monte Carlo (MCMC) solver as proposed by [15, 46].
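The prediction rule in (5) can be sketched as follows. This is a minimal numpy sketch of FM inference only (training via MCMC, as used above, is handled by libraries such as libFM); the function name is our own. It uses Rendle's O(kn) reformulation of the pairwise sum, which is algebraically equal to the double sum in (5).

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Score a single input vector x with a 2-way Factorization Machine:
    global bias + linear terms + factorized pairwise interactions.
    V holds one k-dimensional factor vector per feature (row)."""
    linear = w0 + np.dot(w, x)
    xv = V.T @ x                                       # shape (k,)
    # 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    pairwise = 0.5 * np.sum(xv**2 - (V.T**2) @ (x**2))
    return linear + pairwise
```

In the one-hot encoding used here, x activates exactly one user, one track, one AC, and one SC index, so the pairwise term covers exactly the user-context, track-context, and context-context interactions discussed above.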
Recently, higher-order Factorization Machines (HOFM) have been introduced that allow incorporating higher-order interaction effects [7, 37]. Aiming at further advancing the presented approach, we propose to also exploit 3-way interaction effects. The HOFM model is depicted in (6), where a further term capturing 3-way interactions is added in comparison to the 2-way Factorization Machine depicted in (5). Again, we rely on the Markov Chain Monte Carlo (MCMC) learning method.
$$ \begin{array}{@{}rcl@{}} \hat{r}_{HOFM} &=& w_{0} + {\sum}_{i=1}^{n} w_{i} x_{i} +{\sum}_{i=1}^{n} {\sum}_{j=i+1}^{n} \langle{\mathbf{v}}_{\mathbf{i}},{\mathbf{v}}_{\mathbf{j}}\rangle x_{i}x_{j} \\ &&+ {\sum}_{i=1}^{n} {\sum}_{j=i+1}^{n} {\sum}_{l=j+1}^{n} \langle{\mathbf{v}}_{\mathbf{i}},{\mathbf{v}}_{\mathbf{j}},{\mathbf{v}}_{\mathbf{l}}\rangle x_{i}x_{j}x_{l} \end{array} $$
(6)
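The 3-way term in (6) can be sketched as follows. This is a deliberately naive O(n³) enumeration for clarity (practical HOFM solvers use dynamic-programming recurrences [7]); the function name and separate factor matrices V2 and V3 for the 2-way and 3-way terms are our own illustration.

```python
import numpy as np
from itertools import combinations

def hofm3_predict(x, w0, w, V2, V3):
    """Score x with a 3rd-order FM: bias + linear terms + 2-way interactions
    from factors V2 + 3-way interactions whose weights are the triple
    products <v_i, v_j, v_l> = sum_f v_if * v_jf * v_lf from factors V3."""
    n = len(x)
    score = w0 + np.dot(w, x)
    for i, j in combinations(range(n), 2):
        score += np.dot(V2[i], V2[j]) * x[i] * x[j]
    for i, j, l in combinations(range(n), 3):
        score += np.sum(V3[i] * V3[j] * V3[l]) * x[i] * x[j] * x[l]
    return score
```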