Current Challenges and Visions in Music Recommender Systems Research

Music recommender systems (MRS) have experienced a boom in recent years, thanks to the emergence and success of online streaming services, which nowadays make almost all of the world's music available at the user's fingertips. While today's MRS considerably help users find interesting music in these huge catalogs, MRS research still faces substantial challenges. In particular, when it comes to building, incorporating, and evaluating recommendation strategies that go beyond simple user--item interactions or content-based descriptors and dig deep into the very essence of listener needs, preferences, and intentions, MRS research becomes a major endeavor and related publications remain quite sparse. The purpose of this trends and survey article is twofold. We first identify and shed light on what we believe are the most pressing challenges MRS research is facing, from both academic and industry perspectives. We review the state of the art towards solving these challenges and discuss its limitations. Second, we detail possible future directions and visions we contemplate for the further evolution of the field. The article should therefore serve two purposes: giving the interested reader an overview of current challenges in MRS research and providing guidance for young researchers by identifying interesting, yet under-researched, directions in the field.


INTRODUCTION
Research in music recommender systems (MRS) has recently experienced a substantial gain in interest both in academia and industry [121]. Thanks to music streaming services like Spotify, Pandora, or Apple Music, music aficionados are nowadays given access to tens of millions of music pieces. By filtering this abundance of music items, thereby limiting choice overload [14], MRS are often very successful in suggesting songs that fit their users' preferences. However, such systems are still far from being perfect and frequently produce unsatisfactory recommendations. This is partly because users' tastes and musical needs are highly dependent on a multitude of factors that are not considered in sufficient depth in current MRS approaches, which are typically centered on the core concept of user-item interactions, or sometimes content-based item descriptors. In contrast, we argue that satisfying the users' musical entertainment needs requires taking into account intrinsic, extrinsic, and contextual aspects of the listeners [2], as well as richer interaction information. For instance, personality and emotional state of the listeners (intrinsic) [50,108] as well as their activity (extrinsic) [54,142] are known to influence musical tastes and needs. So are users' contextual factors including weather conditions, social surrounding, or places of interest [2,74]. Also the composition and annotation of a music playlist or a listening session reveals information about which songs go well together or are suited for a certain occasion [95,150]. Therefore, researchers and designers of MRS should reconsider their users in a holistic way in order to build systems tailored to the specificities of each user.
Against this background, in this trends and survey article, we elaborate on what we believe to be amongst the most pressing current challenges in MRS research, by discussing the respective state of the art and its restrictions (Section 2). Not being able to cover all challenges exhaustively, we focus on cold start, automatic playlist continuation, and evaluation of MRS. While these problems are to some extent prevalent in other recommendation domains too, certain characteristics of music pose particular challenges in these contexts. Among them are the short duration of items (compared to movies), the high emotional connotation of music, and the acceptance of users for duplicate recommendations. In the second part, we present our visions for future directions in MRS research (Section 3). More precisely, we elaborate on the topics of psychologically-inspired music recommendation (considering human personality and emotion), situation-aware music recommendation, and culture-aware music recommendation. We conclude this article with a summary and an identification of possible starting points for the interested researcher to face the discussed challenges (Section 4). The composition of the authors allows us to take academic as well as industrial perspectives, which are both reflected in this article. Furthermore, we would like to highlight that particularly the ideas presented as Challenge 2: Automatic playlist continuation in Section 2 play an important role in the task definition, organization, and execution of the ACM Recommender Systems Challenge 2018, which focuses on this use case. This article may therefore also serve as an entry point for potential participants in this challenge.

GRAND CHALLENGES
In the following, we identify and detail a selection of the grand challenges, which we believe the research field of music recommender systems is currently facing, i.e., overcoming the cold start problem, automatic playlist continuation, and properly evaluating music recommender systems. We review the state of the art of the respective tasks and its current limitations.

Particularities of music recommendation
Before we start digging deeper into these challenges, we would first like to highlight the major aspects that make music recommendation a particular challenge and distinguish it from recommending other items, such as movies, books, or products. These aspects have been adopted from a tutorial on music recommender systems [120], co-presented by one of the authors at the ACM Recommender Systems 2017 conference.

Duration of items: While in traditional movie recommendation the items of interest have a typical duration of 90 minutes or more, the duration of music items usually ranges between 3 and 5 minutes (except for classical music). Because of this, items may be considered more disposable.
Magnitude of items: The size of common commercial music catalogs is in the range of tens of millions of music pieces, while movie streaming services have to deal with much smaller catalog sizes, typically thousands up to tens of thousands of movies and series. Scalability is therefore a much more important issue in music recommendation.
Sequential consumption: Unlike movies, music pieces are most frequently consumed sequentially, several pieces in a row, i.e., in a listening session. This yields a number of challenges for a MRS, which relate to identifying the right arrangement of items in a recommendation list.
Repeated recommendations: Recommending the same music piece again, at a later point in time, may be appreciated by the user of a MRS, in contrast to a movie or product recommender, where repeated recommendations are usually not preferred.
Consumption behavior: Music is often consumed passively, in the background. While this is not a problem per se, it can affect preference elicitation. In particular when using implicit feedback to infer listener preferences, the fact that a listener is not paying attention to the music (therefore, e.g., not skipping a song) might be wrongly interpreted as a positive signal.
Listening intent and context: Listeners' intents for consuming or sharing a music piece are manifold and should be taken into account when building a MRS. For instance, a listener will likely create a different playlist when preparing for a romantic dinner than when warming up with friends before going out on a Friday night.
This also highlights the importance of the social component of music listening. Furthermore, a lot of listeners strongly identify with their liked artists. In this vein, music is also often used for self-expression. Another important and frequent intent is regulating the listener's mood, which is discussed below. Similar to intent, the listening context may strongly influence the listeners' preferences. Among others, context may relate to location, for instance, listening at the workplace, when commuting, or when relaxing at home. It might also relate to the use of different listening devices, e.g., earplugs on a smartphone vs. a hi-fi stereo at home, just to give a few examples. The importance of considering such intent and context factors in MRS research is acknowledged by discussing situation-aware MRS as a trending research direction, cf. Section 3.2.
Emotional connotation: Music is known to evoke very strong emotions, and one of the most frequent reasons to listen to music is indeed mood regulation [93]. At the same time, the emotion of the listener is usually neglected in current MRS. This is the reason why we selected emotion-aware MRS as one of the main future directions in MRS research, cf. Section 3.1.

Challenge 1: Cold start problem
Problem definition: One of the major problems of recommender systems in general [43,112], and music recommender systems in particular [73,89], is the cold start problem, i.e., when a new user registers to the system or a new item is added to the catalog and the system does not have sufficient data associated with these items/users. In such a case, the system cannot properly recommend existing items to a new user (new user problem) or recommend a new item to the existing users (new item problem) [3,43,73,123].
Another sub-problem of cold start is the sparsity problem, which refers to the fact that the number of given ratings is much lower than the number of possible ratings; this is particularly likely when the number of users and items is large. Sparsity is accordingly defined as one minus the ratio between given and possible ratings. High sparsity translates into low rating coverage, since most users tend to rate only a tiny fraction of items. The effect is that recommendations often become unreliable [73]. Typical values of sparsity are quite close to 100% in most real-world recommender systems. In the music domain, this is a particularly substantial problem. Dror et al. [34], for instance, analyzed the Yahoo! Music dataset, which as of the time of writing represents the largest music recommendation dataset. They report a sparsity of 99.96%. For comparison, the Netflix dataset of movies has a sparsity of "only" 98.82%.
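To make the notion concrete, the following minimal Python sketch computes sparsity from counts of users, items, and given ratings; the numbers are invented for illustration and are not taken from the cited datasets.

```python
def sparsity(num_users: int, num_items: int, num_ratings: int) -> float:
    """Sparsity = 1 - (given ratings / possible ratings)."""
    return 1.0 - num_ratings / (num_users * num_items)

# Toy example with made-up numbers:
# 1,000,000 ratings over 20,000 users x 100,000 items -> 99.95% sparsity
print(f"{sparsity(num_users=20_000, num_items=100_000, num_ratings=1_000_000):.2%}")
```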

State of the art:
A number of approaches have already been proposed to tackle the cold start problem in the music recommendation domain, foremost content-based approaches, hybridization, cross-domain recommendation, and active learning.
Content-based recommendation (CB) algorithms do not require the ratings of other users. Therefore, as long as some pieces of information about the user's own preferences are available, such techniques can be used in cold start scenarios. Furthermore, in the most severe case, when a new item is added to the catalog, content-based methods still enable recommendations, because they can extract features from the new item and use them to make recommendations. It is noteworthy that while collaborative filtering (CF) systems have cold start problems both for new users and new items, content-based systems have only cold start problems for new users [5]. As for the new item problem, a standard approach is to extract a number of features that define the acoustic properties of the audio signal and use content-based learning of the user interest (user profile learning) in order to effect recommendations. This is advantageous not only to address the new item problem but also because an accurate feature representation can be highly predictive of users' tastes and interests, which can be leveraged in the subsequent information filtering stage [5]. Such feature extraction from audio signals can be done in two main manners: (1) by extracting a feature vector from each item individually, independent of other items, or (2) by considering the cross-relations between items in the training dataset. The difference is that in (1) the same process is performed in the training and testing phases of the system, and the extracted feature vectors can be used off-the-shelf in the subsequent processing stage, for example to compute similarities between items in a one-to-one fashion at testing time. In contrast, in (2) first a model is built from all features extracted in the training phase, whose main role is to map the features into a new (acoustic) space in which the similarities between items are better represented and exploited. An example of approach (1) is the block-level feature framework [126,127], which creates a feature vector of about 10,000 dimensions, independently for each song in the given music collection. This vector describes aspects such as spectral patterns, recurring beats, and correlations between frequency bands. An example of strategy (2) is to create a low-dimensional i-vector representation from the Mel-frequency cepstral coefficients (MFCCs), which model musical timbre to some extent [39]. To this end, a universal background model is created from the MFCC vectors of the whole music collection, using a Gaussian mixture model (GMM). Performing factor analysis on a representation of the GMM eventually yields i-vectors.
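To illustrate strategy (1), the following minimal sketch represents each track by a per-item audio feature vector (here simple MFCC statistics) and scores a cold-start item by its content similarity to tracks the user already liked. The file paths and the pooling scheme are illustrative assumptions; this is not the block-level framework of [126,127].

```python
import numpy as np
import librosa
from sklearn.metrics.pairwise import cosine_similarity

def track_features(path: str, n_mfcc: int = 20) -> np.ndarray:
    # Extract frame-wise MFCCs and pool them into one fixed-length vector per track.
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical catalog: the new (cold-start) item plus two known tracks.
catalog = {tid: track_features(p) for tid, p in [("new_song", "new_song.mp3"),
                                                 ("a", "a.mp3"), ("b", "b.mp3")]}
liked_by_user = ["a", "b"]  # items with known positive feedback

# Score the cold-start item by its average similarity to the user's liked tracks.
new_vec = catalog["new_song"].reshape(1, -1)
liked = np.vstack([catalog[t] for t in liked_by_user])
score = float(cosine_similarity(new_vec, liked).mean())
print(f"content-based score for 'new_song': {score:.3f}")
```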
In scenarios where some form of semantic labels, e.g., genres or musical instruments, are available, it is possible to build models that learn the intermediate mapping between low-level audio features and semantic representations using machine learning techniques, and subsequently use the learned models for prediction. A good point of reference for such semantic-inferred approaches can be found in [13,24].
An alternative technique to tackle the new item problem is hybridization. A review of different hybrid and ensemble recommender systems can be found in [6,18]. In [33] the authors propose a music recommender system which combines an acoustic CB and an item-based CF recommender. For the content-based component, it computes acoustic features including spectral properties, timbre, rhythm, and pitch. The content-based component then assists the collaborative filtering recommender in tackling the cold start problem, since the features of the former are automatically derived via audio content analysis. The solution proposed in [146] is a hybrid recommender system that combines CF and acoustic CB strategies, also by feature hybridization. However, in this work the feature-level hybridization is not done in the original feature domain. Instead, a set of latent variables referred to as conceptual genres is introduced, whose role is to provide a common shared feature space for the two recommenders and enable hybridization. The weights associated with the latent variables reflect the musical taste of the target user and are learned during the training stage. In [128] the authors propose a hybrid recommender system incorporating item-item CF and acoustic CB based on similarity metric learning. The proposed metric learning is an optimization model that aims to learn the weights associated with the audio content features (when combined in a linear fashion) so that a degree of consistency between the CF-based similarity and the acoustic CB similarity measure is established. The optimization problem can be solved using quadratic programming techniques.
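The general idea of such hybrids can be sketched as a blend of an item-item CF similarity with a content-based similarity, falling back to content alone for cold items. The fixed blending weight and the toy data below are illustrative assumptions; the cited works learn such weights rather than fixing them.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

interactions = np.array([[1, 1, 0, 0],       # users x items (implicit feedback)
                         [1, 0, 1, 0],
                         [0, 1, 1, 0]])
audio_features = np.random.rand(4, 32)        # items x audio feature dims (stand-in)

sim_cf = cosine_similarity(interactions.T)    # item-item similarity from usage data
sim_cb = cosine_similarity(audio_features)    # item-item similarity from content

alpha = 0.7                                   # trust CF more when usage data exists
sim_hybrid = alpha * sim_cf + (1 - alpha) * sim_cb

# For a brand-new item (no interactions yet), fall back to content similarity only.
cold_item = 3
sim_hybrid[cold_item, :] = sim_cb[cold_item, :]
sim_hybrid[:, cold_item] = sim_cb[:, cold_item]
print(np.round(sim_hybrid, 2))
```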
Another solution to cold start are cross-domain recommendation techniques, which aim at improving recommendations in one domain (here music) by making use of information about the user preferences in an auxiliary domain [20,46]. Hence, the knowledge of the preferences of the user is transferred from an auxiliary domain to the music domain, resulting in a more complete and accurate user model. Similarly, it is also possible to integrate additional pieces of information about the (new) users, which are not directly related to music, such as their personality, in order to improve the estimation of the user's music preferences. Several studies conducted on user personality characteristics support the conjecture that it may be useful to exploit this information in music recommender systems [48,52,63,98,108]. For a more detailed literature review of cross-domain recommendation, we refer to [21,47,76].
In addition to the aforementioned approaches, active learning has shown promising results in dealing with the cold start problem. Active learning addresses this problem at its origin by identifying and eliciting (high-quality) data that can represent the preferences of users better than what they provide on their own initiative [43,112]. Such a system therefore interactively requests specific user feedback to maximize the improvement of system performance.
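As a minimal sketch of the idea, the snippet below selects the items a new user should be asked to rate using a simple popularity-times-variance heuristic (items many users know and that divide opinions tend to be informative). This is one common strategy, not the specific methods of [43,112], and the rating matrix is made up.

```python
import numpy as np

ratings = np.array([[5, 4, 0, 1],      # users x items, 0 = unknown
                    [4, 0, 2, 1],
                    [5, 4, 0, 2],
                    [1, 5, 3, 0]], dtype=float)

known = ratings > 0
popularity = known.sum(axis=0)          # how many users rated each item
variance = np.array([ratings[known[:, i], i].var() if known[:, i].any() else 0.0
                     for i in range(ratings.shape[1])])
informativeness = popularity * variance

ask_order = np.argsort(-informativeness)
print("ask the new user to rate items in this order:", ask_order)
```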

Limitations:
The state-of-the-art approaches discussed above are restricted by certain limitations. When using content-based filtering, for instance, almost all existing approaches rely on a number of predefined audio features that have been used over and over again, including spectral features, MFCCs, and a great number of derivatives [80]. However, doing so assumes that (all) these features are predictive of the user's music taste, while in practice it has been shown that the acoustic properties that are important for the perception of music are highly subjective [100]. Furthermore, listeners' different tastes and degrees of interest in different pieces of music influence the perception of item similarity [117]. This subjectiveness calls for CB recommenders that incorporate personalization in their mathematical model. For example, in [44] the authors propose a hybrid (CB+CF) recommender model, namely regression-based latent factor models (RLFM). In [4] the authors propose a user-specific feature-based similarity model (UFSM), which defines a similarity function for each user, leading to a high degree of personalization. Although not designed specifically for the music domain, the authors of [4] provide an interesting literature review of similar user-specific models.
While hybridization can therefore alleviate the cold start problem to a certain extent, as seen in the examples above, respective approaches are often complex, computationally expensive, and lack transparency [19]. In particular, results of hybrids employing latent factor models are typically hard to understand for humans.
A major problem with cross-domain recommender systems is their need for data that connects two or more target domains, e.g., books, movies, and music [21]. In order for such approaches to work properly, items, users, or both therefore need to overlap to a certain degree [27]. In the absence of such overlap, relationships between the domains must be established otherwise, e.g., by inferring semantic relationships between items in different domains or assuming similar rating patterns of users in the involved domains. However, whether respective approaches are capable of transferring knowledge between domains is disputed [26]. A related issue in cross-domain recommendation is the lack of established datasets with a clear definition of domains and recommendation scenarios [76]. Because of this, the majority of existing works on cross-domain RS transform some conventional recommendation dataset to suit their needs.
Finally, active learning techniques also suffer from a number of issues. First of all, typical active learning techniques propose to the users the items with the highest predicted ratings in order to elicit the true ratings. This is indeed the default strategy in recommender systems, as users tend to rate what has been recommended to them. Moreover, users typically browse and rate interesting items which they like. However, it has been shown that doing so creates a strong bias in the dataset and expands it disproportionately with high ratings. This in turn may substantially influence the prediction algorithm and decrease the recommendation accuracy [42]. Moreover, not all active learning strategies are necessarily personalized. Users differ very much in the amount of information they have about the items, their preferences, and the way they make decisions. Hence, it is clearly inefficient to request all users to rate the same set of items, because many users may have very limited knowledge, be unaware of many items, and be unable to properly rate them. Properly designed active learning techniques should take this into account and propose different items to different users to rate. This can be very beneficial and increase the chance of acquiring ratings of higher quality [40].

Challenge 2: Automatic playlist continuation
Problem definition: In its most generic definition, a playlist is simply a sequence of tracks intended to be listened to together. The task of automatic playlist generation (APG) then refers to the automated creation of these sequences of tracks. Considered a variation of APG, the task of automatic playlist continuation (APC) consists of adding one or more tracks to a playlist in a way that fits the same target characteristics of the original playlist. This has benefits for both the listening and creation of playlists: users can enjoy listening to continuous sessions beyond the end of a finite-length playlist, while also finding it easier to create longer, more compelling playlists without needing extensive musical familiarity.
A large part of the APC task is to accurately infer the intended purpose of a given playlist. This is challenging not only because of the broad range of these intended purposes (when they even exist), but also because of the diversity in the underlying features or characteristics that might be needed to infer those purposes.
Related to Challenge 1, an extreme cold start scenario for this task is when a playlist is created with some metadata (a title, for example), but no song has been added to the playlist yet. This problem can be cast as an ad-hoc information retrieval task, where the task is to rank songs in response to a user-provided metadata query.
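A minimal sketch of this retrieval view, assuming each track has a short textual metadata description (the toy strings below are invented), is to rank tracks against the playlist title with TF-IDF and cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tracks = {
    "t1": "acoustic calm evening singer-songwriter",
    "t2": "high energy gym workout electronic",
    "t3": "upbeat pop running motivation",
}
playlist_title = "high energy gym workout mix"   # the only available metadata

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(list(tracks.values()))
query_vec = vectorizer.transform([playlist_title])

scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranking = sorted(zip(tracks.keys(), scores), key=lambda x: -x[1])
print(ranking)  # tracks whose metadata best matches the title come first
```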
The APC task can also potentially benefit from user profiling, e.g., making use of previous playlists and the long-term listening history of the user. We call this personalized playlist continuation.
According to a study carried out in 2016 by the Music Business Association as part of their Music Biz Consumer Insights program, playlists accounted for 31% of music listening time among listeners in the USA, more than albums (22%), but less than single tracks (46%). Other studies, conducted by MIDiA, show that 55% of streaming music service subscribers create music playlists, with some streaming services such as Spotify currently hosting over 2 billion playlists. Studies like these suggest a growing importance of playlists as a mode of music consumption, and as such, the study of APG and APC has never been more relevant.
State of the art: APG has been studied ever since digital multimedia transmission made huge catalogs of music available to users. Bonnin and Jannach provide a comprehensive survey of this field in [15]. In it, the authors frame the APG task as the creation of a sequence of tracks that fulfill some "target characteristics" of a playlist, given some "background knowledge" of the characteristics of the catalog of tracks from which the playlist tracks are drawn. Existing APG systems tackle both of these problems in many different ways.
In early approaches [8,9,101], the target characteristics of the playlist are specified as multiple explicit constraints, which include musical attributes or metadata such as artist, tempo, and style. In others, the target characteristics are a single seed track [92] or a start and an end track [8,22,53]. Other approaches create a circular playlist that comprises all tracks in a given music collection, in such a way that consecutive songs are as similar as possible [79,107]. In other works, playlists are created based on the context of the listener, either as a single source [116] or in combination with content-based similarity [23,110].
A common approach to build the background knowledge of the music catalog for playlist generation is using machine learning techniques to extract that knowledge from manually curated playlists. The assumption here is that curators of these playlists are encoding rich latent information about which tracks go together to create a satisfying listening experience for an intended purpose. Some proposed APG and APC systems are trained on playlists from sources such as online radio stations [22,94], online playlist websites [95,139], and music streaming services [106]. In the study by Pichl et al. [106], the names of playlists on Spotify were analyzed to create contextual clusters, which were then used to improve recommendations.
Limitations: While some work on automatic playlist continuation highlights the special characteristics of playlists, i.e., their sequential order, it is not well understood to which extent and in which cases taking into account the order of tracks in playlists helps create better models for recommendation. For instance, in [139] Vall et al. recently demonstrated on two datasets of hand-curated playlists that the song order seems to be negligible for accurate playlist continuation when a lot of popular songs are present. On the other hand, the authors argue that order does matter when creating playlists with tracks from the long tail. Another study by McFee and Lanckriet [95] also suggests that transition effects play an important role in modeling playlist continuity. In another recent user study [135] conducted by Tintarev et al., the authors found that many participants did not care about the order of tracks in recommended playlists; sometimes they did not even notice that there was a particular order. However, this study was restricted to 20 participants who used the Discover Weekly service of Spotify (https://www.spotify.com/discoverweekly). Another challenge for APC is evaluation: in other words, how to assess the quality of a playlist. Evaluation in general is discussed in more detail in the next section, but there are specific questions around the evaluation of playlists in particular that should be pointed out here. As Bonnin and Jannach [15] put it, the ultimate criterion for this is user satisfaction, but that is not easy to measure. In [96], McFee and Lanckriet categorize the main approaches to APG evaluation as human evaluation, semantic cohesion, and sequence prediction. Human evaluation comes closest to measuring user satisfaction directly, but suffers from problems of scale and reproducibility. Semantic cohesion as a quality metric is easily measurable and reproducible, but assumes that users prefer playlists where tracks are similar along a particular semantic dimension, which may not always be true, see for instance the studies carried out by Slaney and White [131] and by Lee [88]. Sequence prediction casts APC as an information retrieval task, but in the domain of music, an inaccurate prediction need not be a bad recommendation, and this again leads to a potential disconnect between this metric and the ultimate criterion of user satisfaction.
Investigating which factors are potentially important for a positive user perception of a playlist, Lee conducted a qualitative user study [88], investigating playlists that had been automatically created based on content-based similarity. They made several interesting observations. A concern frequently raised by participants was that of consecutive songs being too similar, and a general lack of variety. However, different people had different interpretations of variety, e.g., variety in genres or styles vs. different artists in the playlist. Similarly, different criteria were mentioned when listeners judged the coherence of songs in a playlist, including lyrical content, tempo, and mood. When creating playlists, participants mentioned that similar lyrics, a common theme (e.g., music to listen to on the train), story (e.g., music for Independence Day), or era (e.g., rock music from the 1980s) are important, and that tracks not complying negatively affect the flow of the playlist. These aspects can be extended by responses of participants in a study conducted by Cunningham et al. [29], who further identified the following categories of playlists: same artist, genre, style, or orchestration; playlists for a certain event or activity (e.g., party or holiday); romance (e.g., love songs or breakup songs); playlists intended to send a message to their recipient (e.g., protest songs); and challenges or puzzles (e.g., cover songs liked more than the original or songs whose title contains a question mark).
Lee also found that personal preferences play a major role. In fact, already a single song which is very much liked or hated by a listener can have a strong influence on how they judge the entire playlist [88], in particular if it is a highly disliked song [31]. Furthermore, a good mix of familiar and unknown songs was often mentioned as an important requirement for a good playlist. Supporting the discovery of interesting new songs, still contextualized by familiar ones, increases the serendipity [119,149] of a playlist. Finally, participants also reported that their familiarity with a playlist's genre or theme influenced their judgment of its quality. In general, listeners were more picky about playlists whose tracks they were familiar with or liked a lot.
Supported by the studies summarized above, we argue that the question of what makes a great playlist is highly subjective and further depends on the intent of the creator or listener. Important criteria when creating or judging a playlist include track similarity/coherence and variety/diversity, but also the user's personal preferences and familiarity with the tracks, as well as the intention of the playlist creator. Unfortunately, current automatic approaches to playlist continuation are agnostic of the underlying psychological and sociological factors that influence the decision of which songs users choose to include in a playlist. Since knowing about such factors is vital to understand the intent of the playlist creator, we believe that algorithmic methods for automatic playlist continuation need to holistically learn such aspects from manually created playlists and integrate respective intent models. However, we are aware that in today's era, where billions of playlists are shared by users of online streaming services, a large-scale analysis of psychological and sociological background factors is impossible. Nevertheless, in the absence of explicit information about user intent, a possible starting point to create intent models might be the metadata associated with user-generated playlists, such as title or description. To foster this kind of research, the playlists provided in the dataset for the ACM Recommender Systems Challenge 2018 will include playlist titles.

Challenge 3: Evaluating music recommender systems
Problem definition: Evaluation measures that are tailored to the recommendation problem and often take a user-centric perspective have emerged in recent years. These so-called beyond-accuracy measures [72] address the particularities of recommender systems and gauge, for instance, the utility, novelty, or serendipity of an item for a user. However, a major problem with these kinds of measures is that they integrate factors that are hard to describe mathematically, for instance, the aspect of surprise in case of serendipity measures. For this reason, there sometimes exists a variety of different definitions to quantify the same beyond-accuracy aspect.
State of the art: The following performance measures are the ones most frequently reported when evaluating recommender systems. They can be roughly categorized into accuracy-related measures, such as prediction error (e.g., MAE and RMSE) or standard IR measures (e.g., precision and recall), and beyond-accuracy measures, such as diversity, novelty, and serendipity. Furthermore, while some of the metrics quantify the ability of recommender systems to find good items, e.g., precision, MAE, or RMSE, others consider the ranking of items and therefore assess the system's ability to position good recommendations at the top of the recommendation list, e.g., MAP, NDCG, or MPR.
Mean absolute error (MAE) is one of the most common metrics for evaluating the prediction power of recommender algorithms. It computes the average absolute deviation between the predicted ratings and the actual ratings provided by users [58]. Indeed, MAE indicates how close the rating predictions generated by an MRS are to the real user ratings. MAE is computed as follows:

MAE = \frac{1}{|T|} \sum_{(u,i) \in T} | \hat{r}_{u,i} - r_{u,i} |

where r_{u,i} and \hat{r}_{u,i} respectively denote the actual and the predicted ratings of item i for user u; MAE thus sums the absolute prediction errors over all ratings in the test set T.

Root mean square error (RMSE) is a similar metric that is computed as:

RMSE = \sqrt{ \frac{1}{|T|} \sum_{(u,i) \in T} ( \hat{r}_{u,i} - r_{u,i} )^2 }

It is an extension of MAE in that the error term is squared, which penalizes larger differences between predicted and true ratings more than smaller ones. This is motivated by the assumption that, for instance, a rating prediction of 1 when the true rating is 4 is much more severe than a prediction of 3 for the same item.

Precision at top K recommendations (P@K) is a common metric that measures the accuracy of the system in recommending relevant items. In order to compute the precision, for each user, the top K recommended items whose ratings also appear in the test set T are considered.
This metric was originally designed for binary relevance judgments. Therefore, in case relevance information is available at different levels, such as a five-point Likert scale, the labels should be binarized, e.g., by considering ratings greater than or equal to 4 (out of 5) as relevant. Precision@K is computed as follows:

P@K = \frac{1}{|U|} \sum_{u \in U} \frac{ | \hat{L}_u \cap L_u | }{ K }

where L_u is the set of relevant items of user u in the test set T, \hat{L}_u denotes the recommended set containing the K items in T with the highest predicted ratings for user u, and U is the set of all users.

Mean average precision (MAP) is a metric that computes the overall precision of a recommender system based on precision at different recall levels [90]. It is computed as the arithmetic mean of the average precision (AP) over the entire set of users in the test set, where AP is defined as follows:

AP = \frac{1}{M} \sum_{k=1}^{N} P@k \cdot rel(k)

where rel(k) is an indicator signaling whether the k-th recommended item is relevant, i.e., rel(k) = 1, or not, i.e., rel(k) = 0; M is the number of relevant items and N is the number of recommended items in the top-N recommendation list. Note that AP implicitly incorporates recall, because it also considers relevant items that are not in the recommendation list.
Recall at top K recommendations (R@K) is presented here for the sake of completeness, even though it is not a crucial measure from a consumer's perspective. Indeed, the listener is typically not interested in being recommended all or a large number of relevant items, but rather in having good recommendations at the top of the recommendation list. R@K is defined as:

R@K = \frac{1}{|U|} \sum_{u \in U} \frac{ | \hat{L}_u \cap L_u | }{ | L_u | }

where, as above, L_u is the set of relevant items of user u in the test set T and \hat{L}_u denotes the recommended set containing the K items in T with the highest predicted ratings for user u.

Normalized discounted cumulative gain (NDCG) measures the ranking quality of the recommendations. This metric was originally proposed to evaluate the effectiveness of information retrieval systems [69]. It is nowadays also frequently used for evaluating music recommender systems [91,104,143]. Assuming that the recommendations for user u are sorted according to the predicted rating values in descending order, DCG_u is defined as follows:

DCG_u = \sum_{i=1}^{N} \frac{ r_{u,i} }{ \log_2(i+1) }

where r_{u,i} is the true rating (as found in test set T) for the item ranked at position i for user u, and N is the length of the recommendation list. Since the rating distribution depends on the users' behavior, the DCG values of different users are not directly comparable. Therefore, the cumulative gain for each user should be normalized. This is done by computing the ideal DCG for user u, denoted IDCG_u, which is the DCG_u value for the best possible ranking, obtained by ordering the items by true ratings in descending order. The normalized discounted cumulative gain for user u is then calculated as:

NDCG_u = \frac{ DCG_u }{ IDCG_u }

Finally, the overall normalized discounted cumulative gain NDCG is computed by averaging NDCG_u over the entire set of users.
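As a small illustrative implementation of two of these measures (P@K and NDCG, with relevance taken as rating >= 4, and the discount and gain following the definitions above), consider the following sketch on made-up toy data:

```python
import numpy as np

def precision_at_k(recommended: list, relevant: set, k: int) -> float:
    # Fraction of the top-k recommended items that are relevant.
    top_k = recommended[:k]
    return len(set(top_k) & relevant) / k

def ndcg(recommended_ratings: list) -> float:
    """recommended_ratings: true ratings of items, in recommended order."""
    gains = np.asarray(recommended_ratings, dtype=float)
    discounts = np.log2(np.arange(2, len(gains) + 2))   # log2(i+1) for i = 1..N
    dcg = float((gains / discounts).sum())
    idcg = float((np.sort(gains)[::-1] / discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

recommended = ["a", "b", "c", "d"]                # ranked list for one user
true_ratings = {"a": 2, "b": 5, "c": 4, "d": 1}   # ratings found in the test set

relevant = {i for i, r in true_ratings.items() if r >= 4}
print("P@3  =", precision_at_k(recommended, relevant, k=3))
print("NDCG =", round(ndcg([true_ratings[i] for i in recommended]), 3))
```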
In the following, we present further common quantitative evaluation metrics, which have been particularly designed or adopted to assess recommender system performance, even though some of them have their origin in information retrieval and machine learning.
Half-life utility (HLU) measures the utility of a recommendation list for a user under the assumption that the likelihood of viewing/choosing a recommended item decays exponentially with the item's position in the ranking [16,102]. Formally written, HLU for user u is defined as:

HLU_u = \sum_{i=1}^{N} \frac{ \max( r_{u,i} - d, 0 ) }{ 2^{ ( rank_{u,i} - 1 ) / ( h - 1 ) } }

where r_{u,i} and rank_{u,i} denote the rating and the rank of item i for user u, respectively, in the recommendation list of length N; d represents a default rating (e.g., the average rating) and h is the half-life, i.e., the rank at which there is a 50% chance that the user will eventually listen to a music item in the list. HLU_u can be further normalized by the maximum achievable utility (similar to NDCG), and the final HLU is the average over the half-life utilities obtained for all users in the test set. A larger HLU corresponds to superior recommendation performance.
Mean percentile rank (MPR) estimates the users' satisfaction with items in the recommendation list and is computed as the average of the percentile ranks of the test items within the ranked list of recommendations for each user [66]. The percentile rank of an item is the percentage of items whose position in the recommendation list is equal to or lower than the position of the item itself. Formally, the percentile rank PR_u for user u is defined as:

PR_u = \frac{ \sum_{i} r_{u,i} \cdot rank_{u,i} }{ \sum_{i} r_{u,i} }

where r_{u,i} is the true rating (as found in test set T) for item i rated by user u and rank_{u,i} is the percentile rank of item i within the ordered list of recommendations for user u. MPR is then the arithmetic mean of the individual PR_u values over all users. A randomly ordered recommendation list has an expected MPR value of 50%. A smaller MPR value is therefore assumed to correspond to superior recommendation performance.
Spread is a metric of how well the recommender algorithm can spread its attention across a larger set of items [78]. In more detail, spread is the entropy of the distribution of the items recommended to the users in the test set. It is formally defined as:

Spread = - \sum_{i \in I} P(i) \log P(i)

where I represents the entirety of items in the dataset and P(i) = count(i) / \sum_{i' \in I} count(i'), such that count(i) denotes the total number of times a given item i showed up in the recommendation lists. It may be infeasible to expect an algorithm to achieve perfect spread (i.e., recommending each item an equal number of times) without resorting to irrelevant recommendations or unfulfillable rating requests. Accordingly, moderate spread values are usually preferable.
Coverage of a recommender system is defined as the proportion of items over which the system is capable of generating recommendations [58]:

Coverage = \frac{ | \hat{T} | }{ | T | }

where |T| is the size of the test set and |\hat{T}| is the number of ratings in T for which the system can predict a value. This is particularly important in cold start situations, when recommender systems are not able to accurately predict the ratings of new users or new items and hence obtain low coverage. Recommender systems with lower coverage are therefore limited in the number of items they can recommend. A simple remedy to improve low coverage is to implement some default recommendation strategy for an unknown user-item entry. For example, we can consider the average rating of users for an item as an estimate of its rating. This may come at the price of accuracy, and therefore the trade-off between coverage and accuracy needs to be considered in the evaluation process [7].

Novelty measures the ability of a recommender system to recommend new items that the user did not know about before [1]. A recommendation list may be accurate, but if it contains a lot of items that are not novel to a user, it is not necessarily a useful list [149]. While novelty should be defined on an individual user level, considering the actual freshness of the recommended items, it is common to use the self-information of the recommended items relative to their global popularity:

Novelty = \frac{1}{|U|} \sum_{u \in U} \frac{1}{N} \sum_{i \in L_u} - \log_2( pop_i )

where pop_i is the popularity of item i measured as the percentage of users who rated i, and L_u is the recommendation list of the top N recommendations for user u [149,151]. The above definition assumes that the likelihood of the user selecting a previously unknown item is proportional to its global popularity and is used as an approximation of novelty. In order to obtain more accurate information about novelty or freshness, explicit user feedback is needed, in particular since the user might have listened to an item through other channels before. It is often assumed that users prefer recommendation lists with more novel items. However, if the presented items are too novel, then the user is unlikely to have any knowledge of them, nor be able to understand or rate them. Therefore, moderate values indicate better performance [78].
Serendipity aims at evaluating an MRS based on how relevant and surprising its recommendations are. While the need for serendipity is commonly agreed upon [59], the question of how to measure the degree of serendipity of a recommendation list is controversial.
This particularly holds for the question of whether the factor of surprise implies that items must be novel to the user [72]. On a general level, the serendipity of a recommendation list L_u provided to a user u can be defined as:

Serendipity(L_u) = \frac{ | L_u^{unexp} \cap L_u^{useful} | }{ | L_u | }

where L_u^{unexp} and L_u^{useful} denote the subsets of L_u that contain, respectively, recommendations unexpected to and useful for the user. The usefulness of an item is commonly assessed by explicitly asking users or by taking user ratings as a proxy [72]. The unexpectedness of an item is typically quantified by some measure of distance from expected items, i.e., those similar to the items already rated by the user. In the context of MRS, Zhang et al. [149] propose an "unserendipity" measure that is defined as the average similarity between the items in the user's listening history and the new recommendations. Similarity between two items in this case is calculated by an adapted cosine measure that integrates co-liking information, i.e., the number of users who like both items. It is assumed that lower values correspond to more surprising recommendations, since lower values indicate that recommendations deviate from the user's traditional behavior [149].
Diversity is another important beyond-accuracy measure, as already discussed in the limitations part of Challenge 1. It gauges the extent to which recommended items are different from each other, where difference can relate to various aspects, e.g., musical style, artist, lyrics, or instrumentation, just to name a few. Similar to serendipity, diversity can be defined in several ways. One of the most common is to compute the pairwise distance between all items in the recommendation set, either averaged [152] or summed [132]. In the former case, the diversity of a recommendation list L is calculated as follows:

Diversity(L) = \frac{1}{ |L| ( |L| - 1 ) } \sum_{i \in L} \sum_{j \in L, j \neq i} dist(i, j)

where dist(i, j) is some distance function defined between items i and j. Common choices are inverse cosine similarity [111], inverse Pearson correlation [141], or Hamming distance [75].
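The following small sketch implements the popularity-based novelty and the averaged pairwise intra-list diversity described above; the popularity values and item descriptors are made up for illustration, and cosine distance is used as the (arbitrary) choice of dist(i, j).

```python
import numpy as np
from scipy.spatial.distance import cosine

item_popularity = {"a": 0.50, "b": 0.05, "c": 0.01}      # fraction of users who rated the item
item_vectors = {"a": np.array([1.0, 0.0, 0.2]),           # stand-in content descriptors
                "b": np.array([0.9, 0.1, 0.3]),
                "c": np.array([0.0, 1.0, 0.8])}
recommendations = ["a", "b", "c"]

# Novelty: average self-information of the recommended items.
novelty = np.mean([-np.log2(item_popularity[i]) for i in recommendations])

# Diversity: average pairwise distance over all ordered pairs in the list.
pairs = [(i, j) for i in recommendations for j in recommendations if i != j]
diversity = np.mean([cosine(item_vectors[i], item_vectors[j]) for i, j in pairs])

print(f"novelty = {novelty:.2f} bits, intra-list diversity = {diversity:.2f}")
```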
Limitations: As of today, the vast majority of evaluation approaches in recommender systems research focuses on quantitative measures, either accuracy-like or beyond-accuracy, which are often computed in offline studies. While doing so has the advantage of facilitating the reproducibility of evaluation results, these approaches typically fall short of grasping some of the most important user requirements that relate to user acceptance or satisfaction, among others.
Despite the acknowledged need for more user-centric evaluation strategies [117], the human factor, i.e., the user or, in the case of MRS, the listener, is still far too often neglected or not properly addressed. For instance, while there exist quantitative measures for serendipity and diversity, as discussed above, perceived serendipity and diversity can be highly different from the measured ones [140]. Even beyond-accuracy measures can therefore not fully capture the real user satisfaction with a recommender system.
Addressing both objective and subjective evaluation criteria, Knijnenburg et al. [81] propose a holistic framework for the user-centric evaluation of recommender systems. Figure 1 provides an overview of its components. The objective system aspects (OSA) are considered unbiased factors of the RS, including aspects of the user interface, computing time of the algorithm, or the number of items shown to the user. They are typically easy to specify or compute. The OSA influence the subjective system aspects (SSA), which are caused by momentary, primary evaluative feelings while interacting with the system [57]. This results in a different perception of the system by different users. SSA are therefore highly individual aspects and are typically assessed by user questionnaires. Examples of SSA include the general appeal of the system, usability, and perceived recommendation diversity or novelty. The aspect of experience (EXP) describes the user's attitude towards the system and is commonly also investigated by questionnaires. It addresses the user's perception of the interaction with the system. The experience is highly influenced by the other components, which means that changing any of the other components likely results in a change of EXP aspects. Experience can be broken down into the evaluation of the system, the decision process, and the final decisions made, i.e., the outcome. The interaction (INT) aspects describe the observable behavior of the user, such as the time spent viewing an item, or clicking and purchasing behavior. Therefore, they belong to the objective measures and are usually determined via logging by the system. Finally, Knijnenburg et al.'s framework includes personal characteristics (PC) and situational characteristics (SC), which influence the user experience. PC include aspects that do not exist without the user, such as user demographics, knowledge, or perceived control, while SC include aspects of the interaction context, such as when and where the system is used, or situation-specific trust and privacy concerns. Knijnenburg et al. [81] also propose a questionnaire to assess the factors defined in their framework, for instance, perceived recommendation quality, perceived system effectiveness, perceived recommendation variety, choice satisfaction, intention to provide feedback, general trust in technology, and system-specific privacy concerns.
While this framework is a generic one, tailoring it to MRS would allow for a user-centric evaluation thereof. Especially the aspects of personal and situational characteristics should be adapted to the particularities of music listeners and listening situations, respectively, cf. Section 2.1. To this end, researchers in MRS should consider the aspects relevant for the perception and preference of music, and their implications on MRS, which have been identified in several studies, e.g., [30,86,87,117,118]. In addition to the general ones mentioned by Knijnenburg et al., of great importance in the music domain seem to be psychological factors, including affect and personality, social influence, musical training and experience, and physiological condition.
We believe that carefully and holistically evaluating MRS by means of accuracy and beyond-accuracy measures, objective and subjective criteria, and offline and online experiments would lead to a better understanding of the listeners' needs and requirements vis-à-vis MRS, and eventually a considerable improvement of current MRS.

FUTURE DIRECTIONS AND VISIONS
While the challenges identified in the previous section are already being researched intensively, in the following, we provide a more forward-looking analysis and discuss some MRS-related trending topics which we assume will be influential for the next generation of MRS. All of them have in common that their aim is to create more personalized recommendations. More precisely, we first outline how psychological constructs such as personality and emotion could be integrated into MRS. Subsequently, we address situation-aware MRS and argue for the need of multifaceted user models that describe contextual and situational preferences. To round off, we discuss the influence of users' cultural background on recommendation preferences, which needs to be considered when building culture-aware MRS.

Figure 1: Evaluation framework of the user experience for recommender systems, according to [81].

Psychologically-inspired music recommendation
Personality and emotion are important psychological constructs. While personality characteristics of humans are a predictable and stable measure that shapes human behaviors, emotions are short-term affective responses to a particular stimulus [137]. Both have been shown to influence music tastes [50,114,118] and user requirements for MRS [48,52]. However, in the context of (music) recommender systems, personality and emotion do not play a major role yet. Given the strong evidence that both influence listening preferences [108,118] and the recent emergence of approaches to accurately predict them from user-generated data [83,129], we believe that psychologically-inspired MRS is an upcoming area.
Personality: In psychology research, personality is often defined as a "consistent behavior pattern and interpersonal processes originating within the individual" [17].
This definition accounts for the individual differences in people's emotional, interpersonal, experiential, attitudinal, and motivational styles [71]. Several prior works have studied the relation between decision making and personality factors. In [108], as an example, it has been shown that personality can influence the human decision making process as well as tastes and interests. Due to this direct relation, people with similar personality factors are very likely to share similar interests and tastes.
Earlier studies conducted on user personality characteristics support the potential benefits that personality information could have in recommender systems [41,62,64,136,138]. As a well-known example, psychological studies [108] have shown that extraverted people are likely to prefer upbeat and conventional music. Accordingly, a personality-based MRS could use this information to better predict which songs are more likely than others to please extraverted people [63]. Another example of potential usage is to exploit personality information in order to compute similarity among users and hence identify like-minded users [136]. This similarity information could then be integrated into a neighborhood-based collaborative filtering approach.
In order to use personality information in a recommender system, the system first has to elicit this information from the users, which can be done either explicitly or implicitly. In the former case, the system can ask the user to complete a personality questionnaire using one of the personality evaluation inventories, e.g., the Ten-Item Personality Inventory [55] or the Big Five Inventory [70]. In the latter case, the system can learn the personality by tracking and observing users' behavioral patterns [84,129]. Not too surprisingly, it has been shown that systems that explicitly elicit personality characteristics achieve superior recommendation outcomes, e.g., in terms of user satisfaction, ease of use, and prediction accuracy [35]. On the downside, however, many users are not willing to fill in long questionnaires before being able to use the RS. A way to alleviate this problem is to ask users only the most informative questions of a personality instrument [122]. Which questions are most informative, though, first needs to be determined based on existing user data and is dependent on the recommendation domain. Other studies showed that users are to some extent willing to provide further information in return for a better quality of recommendations [134].
Personality information can be used in various ways, particularly to generate recommendations when traditional rating or consumption data is missing. Otherwise, the personality traits can be seen as an additional feature that extends the user profile, which can be used to identify similar users in neighborhood-based recommender systems or be directly fed into extended matrix factorization models [46].
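As a minimal sketch of the neighborhood-based variant, the snippet below blends rating-based user similarity with the similarity of Big Five trait vectors when selecting neighbors; the trait values and the blending weight are invented for illustration and are not drawn from the cited studies.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([[5, 3, 0, 1],                 # users x items, 0 = unknown
                    [4, 0, 0, 1],
                    [1, 1, 5, 4]], dtype=float)
big_five = np.array([[0.8, 0.6, 0.4, 0.5, 0.3],   # openness..neuroticism per user
                     [0.7, 0.5, 0.5, 0.6, 0.2],
                     [0.2, 0.4, 0.9, 0.3, 0.7]])

sim_ratings = cosine_similarity(ratings)
sim_personality = cosine_similarity(big_five)

beta = 0.5                                  # rely on personality more for cold users
sim_users = beta * sim_personality + (1 - beta) * sim_ratings

target_user = 1
neighbors = [u for u in np.argsort(-sim_users[target_user]) if u != target_user]
print("nearest neighbors for user 1:", neighbors)
```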
Emotion: The emotional state of the MRS user is an important factor in identifying his or her short-term musical preferences, in particular since emotion regulation is known to be one main reason why people listen to music [93]. Indeed, a music piece can be seen as emotion-laden content and in turn can be described by emotions. Musical content contains various elements that can affect the emotional state of a person, such as rhythm, key, tempo, melody, harmony, and lyrics. For instance, a musical piece in a major key is typically perceived as brighter and happier than one in a minor key, and a piece in rapid tempo is perceived as more exciting or more tense than a slow one [85].
Several studies have already shown that listeners' emotions have a strong impact on their musical preferences [73]. As an example, people may listen to completely different musical genres or styles when they are sad compared to when they are happy. Indeed, prior research on music psychology discovered that people may choose the type of music which moderates their emotional condition [82]. More recent findings show that music may mainly be chosen so as to augment the emotional situation perceived by the listener [99].
Similar to personality traits, the emotional state of a user can be elicited explicitly or implicitly. In the former case, the user is typically presented with one of various categorical models (emotions are described by distinct emotion words such as happiness, sadness, anger, or fear) [61,148] or dimensional models (emotions are described by scores with respect to two or three dimensions, e.g., valence and arousal) [113]. For a more detailed elaboration on emotion models in the context of music, we refer to [118,144]. The implicit acquisition of emotional states can be effected, for instance, by analyzing user-generated text [32], speech [45], or facial expressions in video [38].
Since music can be viewed as emotionally laden content, as it is capable of evoking intense emotions in a listener, it can also be annotated with emotional labels [68,145,148]. Doing so automatically is a task referred to as music emotion recognition (MER) and is discussed in detail, for instance, in [77,144]. While such automatic emotion labeling of music items could be beneficial for MRS, MER has been shown to be a highly challenging task [77].
Nowadays, emotion-based recommender systems typically consider emotional scores as contextual factors that characterize the situation the user is experiencing. Hence, the recommender systems exploit emotions in order to pre-filter the preferences of users or post-filter the generated recommendations. Unfortunately, this neglects the psychological background, in particular the subjective and complex interrelationships between expressed, perceived, and induced emotions [118], which are of special importance in the music domain, as music is known to evoke stronger emotions than, for instance, products [120]. It has also been shown that personality influences which kind of emotionally laden music is preferred by listeners in which emotional state [50]. Therefore, even if automated MER approaches were able to accurately predict the perceived or induced emotion of a given music piece, in the absence of deep psychological listener profiles, matching emotion annotations of items and listeners may not yield satisfying recommendations. We hence believe that the field of MRS should embrace psychological theories, elicit the respective user-specific traits, and integrate them into recommender systems, in order to build decent emotion-aware MRS.
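To make the post-filtering idea concrete, the following sketch re-ranks candidate tracks by how close their valence-arousal annotations are to the listener's current (or desired) emotional state. All numbers are invented; a real system would obtain track emotion labels via MER and the user state via explicit or implicit elicitation, and would have to account for the expressed/perceived/induced distinction discussed above.

```python
import numpy as np

track_emotion = {               # (valence, arousal) in [0, 1]
    "calm_piano":   (0.6, 0.2),
    "happy_pop":    (0.9, 0.7),
    "dark_ambient": (0.2, 0.3),
}
candidates = {"calm_piano": 0.8, "happy_pop": 0.7, "dark_ambient": 0.6}  # base CF scores

user_state = np.array([0.8, 0.6])   # elicited valence-arousal of the listener
gamma = 0.5                          # weight of the emotional match

def emotional_match(track: str) -> float:
    # 1 at a perfect match, 0 at the maximum possible distance in the unit square.
    return 1.0 - np.linalg.norm(np.array(track_emotion[track]) - user_state) / np.sqrt(2)

reranked = sorted(candidates,
                  key=lambda t: -((1 - gamma) * candidates[t] + gamma * emotional_match(t)))
print(reranked)  # tracks matching the user's emotional state move up
```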

Situation-aware music recommendation
Most existing music recommender systems make recommendations solely based on a set of user-specific and item-specific signals. However, in real-world scenarios, many additional signals are available that can be used to improve recommendation performance. A large subset of these are situational signals: the music taste of a user depends on the situation at the moment of recommendation. Location is one example; for instance, the music a user prefers in a library differs from that in a gym [23]. Therefore, considering location as a situation-specific signal could lead to substantial improvements in recommendation performance. Time of day is another situational signal; for instance, the music a user would like to listen to in the morning differs from that at night [28].
There are many other situational signals, including but not limited to the user's current activity [142], the weather [105], the day of the week [60], and the user's mood [97]. It is worth noting that situational features have been proven to be strong signals for improving retrieval performance in search engines [12,147]. Therefore, we believe that researching and building situation-aware music recommender systems should be one central topic in MRS research.
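As one way such situational signals could be exploited, the sketch below applies simple contextual pre-filtering: only listening events recorded in a situation similar to the current one (same location type and time of day) contribute to the profile from which recommendations are drawn. The signal set, matching rule, and toy data are assumptions made for this illustration.

```python
from collections import Counter

def prefilter_history(history, context, keys=("location", "daytime")):
    """Keep only listening events whose situational signals match the
    current context on the given keys (contextual pre-filtering)."""
    return [e for e in history
            if all(e["context"].get(k) == context.get(k) for k in keys)]

def recommend_for_situation(history, context, n=3):
    """Recommend the tracks played most often in similar situations,
    falling back to the full history if no matching events exist."""
    matching = prefilter_history(history, context) or history
    counts = Counter(e["track"] for e in matching)
    return [track for track, _ in counts.most_common(n)]

# Hypothetical listening events with attached situational signals.
history = [
    {"track": "calm_piano", "context": {"location": "library", "daytime": "morning"}},
    {"track": "calm_piano", "context": {"location": "library", "daytime": "morning"}},
    {"track": "workout_mix", "context": {"location": "gym", "daytime": "evening"}},
]
print(recommend_for_situation(history, {"location": "library", "daytime": "morning"}))
```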
While several situation-aware MRS already exist, e.g., [11,23,67,74,116,142], they commonly exploit only one or very few such situational signals, or are restricted to a certain usage context, e.g., music consumption in a car or in a tourist scenario.
Those systems that take a more comprehensive view and consider a variety of different signals, on the other hand, suffer from a low number of data instances or users, which makes it very hard to build accurate context models [54]. What is still missing, in our opinion, are (commercial) systems that integrate a variety of situational signals on a very large scale in order to truly understand the listeners' needs and intents in any given situation and recommend music accordingly. While we are aware that data availability and privacy concerns counteract the realization of such systems on a large commercial scale, we believe that MRS will eventually integrate decent multifaceted user models inferred from contextual and situational factors.

Culture-aware music recommendation
While most humans share an inclination to listen to music, independent of their location or cultural background, the way music is performed, perceived, and interpreted evolves in a culture-specific manner. However, research in MRS seems to be agnostic of this fact. In music information retrieval (MIR) research, on the other hand, cultural aspects have been studied to some extent in recent years, after preceding (and still ongoing) criticism of the predominance of Western music in this community. Arguably the most comprehensive culture-specific research in this domain has been conducted as part of the CompMusic project, in which five non-Western music traditions have been analyzed in detail in order to advance automatic description of music by emphasizing cultural specificity. The analyzed music traditions include Indian Hindustani and Carnatic [36], Turkish Makam [37], Arab-Andalusian [133], and Beijing Opera [109]. However, the project's focus was on music creation, content analysis, and ethnomusicological aspects rather than on the music consumption side [25,124,125]. Recently, analyzing content-based audio features describing rhythm, timbre, harmony, and melody for a corpus of a larger variety of world and folk music with given country information, Panteli et al. found distinct acoustic patterns in the music created in individual countries [103]. They also identified geographical and cultural proximities that are reflected in music features by looking at outliers and misclassifications in a classification experiment using country as the target class. For instance, Vietnamese music was often confused with Chinese and Japanese music, and South African music with Botswanan.
In contrast to this by now quite extensive work on culture-specific analysis of music traditions, little effort has been made to analyze cultural differences and patterns of music consumption behavior, which is, we believe, a crucial step towards building culture-aware MRS. The few studies investigating such cultural differences include [65], in which Hu and Lee found differences in the perception of moods between American and Chinese listeners. By analyzing the music listening behavior of users from 49 countries, Ferwerda et al. found relationships between music listening diversity and Hofstede's cultural dimensions [49,51]. Skowron et al. used the same dimensions to predict genre preferences of listeners with different cultural backgrounds [130]. Schedl analyzed a large corpus of listening histories created by Last.fm users in 47 countries and identified distinct preference patterns [115]. Further analyses revealed the countries closest to what can be considered the global mainstream (e.g., the Netherlands, UK, and Belgium) and the countries farthest from it (e.g., China, Iran, and Slovakia). However, all of these works define culture in terms of country borders, which often makes sense but is sometimes problematic, for instance in countries with large minorities of inhabitants with a different cultural background.
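To sketch how such consumption analyses might feed into a culture-aware MRS, the snippet below derives per-country genre distributions from listening events and ranks countries by their cosine distance to the global ("mainstream") distribution; the distance measure and the toy data are assumptions for illustration, not the methodology of the cited studies.

```python
from collections import Counter, defaultdict
import math

def genre_distribution(events):
    """Relative genre frequencies for a list of (country, genre) events."""
    counts = Counter(genre for _, genre in events)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def cosine_distance(p, q):
    """Cosine distance between two genre-frequency dictionaries."""
    genres = set(p) | set(q)
    dot = sum(p.get(g, 0.0) * q.get(g, 0.0) for g in genres)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm

def mainstream_distances(events):
    """Distance of each country's genre distribution to the global one."""
    global_dist = genre_distribution(events)
    by_country = defaultdict(list)
    for country, genre in events:
        by_country[country].append((country, genre))
    return {c: round(cosine_distance(genre_distribution(ev), global_dist), 3)
            for c, ev in by_country.items()}

# Toy listening events as (country, genre) pairs.
events = [("NL", "pop"), ("NL", "rock"), ("UK", "pop"),
          ("IR", "folk"), ("IR", "folk"), ("CN", "pop"), ("CN", "folk")]
print(sorted(mainstream_distances(events).items(), key=lambda item: item[1]))
```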
In our opinion, when building MRS, the analysis of cultural patterns of music consumption behavior, the subsequent creation of respective cultural listener models, and their integration into recommender systems are vital steps towards improving the personalization and serendipity of recommendations. Culture should be defined on various levels though, not only by country borders. Other examples include a shared historical background, a common language, shared beliefs or religion, and differences between urban and rural cultures. We believe that MRS which are aware of cross-cultural differences and similarities in music perception and taste, and which are able to recommend music that a listener in the same or another culture may like, would substantially benefit both users and providers of MRS.

CONCLUSIONS
In this trends and survey paper, we identified several grand challenges the research field of music recommender systems (MRS) is facing.
These are, to the best of our knowledge, in the focus of current research in the area of MRS. We discussed (1) the cold start problem for items and users, with its particularities in the music domain, (2) the challenge of automatic playlist continuation, which is gaining particular importance due to the recently emerged user request of being recommended musical experiences rather than single tracks [120], and (3) the challenge of holistically evaluating music recommender systems, in particular capturing aspects beyond accuracy.
In addition to the grand challenges, which are currently highly researched, we also presented a visionary outlook on what we believe to be the most interesting future research directions in MRS. In particular, we discussed (1) psychologically inspired MRS, which consider in the recommendation process factors such as listeners' emotion and personality, (2) situation-aware MRS, which holistically model contextual and environmental aspects of the music consumption process, infer listener needs and intents, and eventually integrate these models at large scale into the recommendation process, and (3) culture-aware MRS, which exploit the fact that music taste highly depends on the cultural background of the listener, where culture can be defined in manifold ways, including historical, political, linguistic, or religious similarities.
We hope that this article helped pinpoint major challenges, highlight recent trends, and identify interesting research questions in the area of music recommender systems. Believing that research addressing the discussed challenges and trends will pave the way for the next generation of music recommender systems, we look forward to exciting, innovative approaches and systems that improve user satisfaction and experience, rather than just accuracy measures.

ACKNOWLEDGEMENTS
We would like to thank all researchers in the fields of recommender systems, information retrieval, music research, and multimedia with whom we had the pleasure to discuss and collaborate in recent years, and who in turn influenced and helped shape this article. Special thanks go to Peter Knees and Fabien Gouyon for the fruitful discussions while preparing the ACM Recommender Systems 2017 tutorial on music recommender systems. We would also like to thank Eelco Wiechert for providing additional pointers to relevant literature. Furthermore, the many personal discussions with actual users of MRS unveiled important shortcomings of current approaches, which were in turn considered in this article.