Empirical Analysis of Session-Based Recommendation Algorithms

Recommender systems are tools that support online users by pointing them to potential items of interest in situations of information overload. In recent years, the class of session-based recommendation algorithms has received more attention in the research literature. These algorithms base their recommendations solely on the observed interactions with the user in an ongoing session and do not require the existence of long-term preference profiles. Most recently, a number of deep learning based ("neural") approaches to session-based recommendations have been proposed. However, previous research indicates that today's complex neural recommendation methods are not always better than comparably simple algorithms in terms of prediction accuracy. With this work, our goal is to shed light on the state-of-the-art in the area of session-based recommendation and on the progress that is made with neural approaches. For this purpose, we compare twelve algorithmic approaches, among them six recent neural methods, under identical conditions on various datasets. We find that the progress in terms of prediction accuracy that is achieved with neural methods is still limited. In most cases, our experiments show that simple heuristic methods based on nearest-neighbors schemes are preferable to conceptually and computationally more complex methods. Observations from a user study furthermore indicate that recommendations based on heuristic methods were also well accepted by the study participants. To support future progress and reproducibility in this area, we publicly share the session-rec evaluation framework that was used in our research.


INTRODUCTION
Recommender systems (RS) are software applications that help users in situations of information overload, and they have become a common feature of many modern online services. Collaborative filtering (CF) techniques, which are based on behavioral data collected from larger user communities, are among the most successful technical approaches in practice. Historically, these approaches mostly rely on the assumption that information about longer-term preferences of the individual users is available, e.g., in the form of a user-item rating matrix [34]. In many real-world applications, however, such longer-term information is often not available, because users are not logged in or because they are first-time users. In such cases, techniques that leverage behavioral patterns in a community can still be applied [20]. The difference is that instead of the long-term preference profiles only the observed interactions with the user in the ongoing session can be used to adapt the recommendations to the assumed needs, preferences, or intents of the user. Such a setting is usually termed a session-based recommendation problem [32].
Interestingly, research on session-based recommendation was very scarce for many years despite the high practical relevance of the problem setting. Only in recent years can we observe an increased interest in the topic in academia [41], which is at least partially caused by the recent availability of public datasets, in particular from the e-commerce domain. This increased interest in session-based recommendations coincides with the recent boom of deep learning (neural) methods in various application areas. Accordingly, it is not surprising that several neural session-based recommendation approaches were proposed in recent years, with gru4rec being one of the pioneering and most cited works in this context [15].
From the perspective of the evaluation of session-based algorithms, the research community, at the time when the first neural techniques were proposed, had not yet reached the level of maturity that exists for problem setups based on the traditional user-item rating matrix. This led to challenges that concerned both the question of what represents the state-of-the-art in terms of algorithms and the question of the evaluation protocol when time-ordered user interaction logs are the input instead of a rating matrix. Partly due to this unclear situation, it soon turned out that in some cases comparably simple non-neural techniques, in particular ones based on nearest-neighbors approaches, can lead to very competitive or even better results than neural techniques [19,23]. Besides being competitive in terms of accuracy, such simpler approaches often have the advantage that their recommendations are more transparent and can more easily be explained to the users. Furthermore, these simpler methods can often be updated online when new data becomes available, without requiring expensive model retraining.
However, during the last few years after the publication of gru4rec, we have mostly observed new proposals in the area of complex models. With this work, our aim is to assess the progress that was made in the last few years in a reproducible way. To this end, we have conducted an extensive set of experiments in which we compared twelve session-based recommendation techniques under identical conditions on a number of datasets. Among the examined techniques, there are six recent neural approaches, which were published at highly-ranked publication outlets such as KDD, AAAI, or SIGIR after the publication of the first version of gru4rec in 2015 (see footnote 2). The main outcome of our offline experiments is that the progress that is achieved with neural approaches to session-based recommendation is still limited. In most experiment configurations, one of the simple techniques outperforms all the neural approaches. In some cases, we could also not confirm that a more recently proposed neural method consistently outperforms the much earlier gru4rec method. Generally, our analyses point to certain underlying methodological issues, which were also observed in other application areas of applied machine learning. Similar observations regarding the competitiveness of established and often simpler approaches were made before, e.g., for the domains of information retrieval, time-series forecasting, and recommender systems [2,8,27,43], and it is important to note that these phenomena are not tied to deep learning approaches.
To help overcome some of these problems for the domain of session-based recommendation, we share our evaluation framework session-rec online (see footnote 3). The framework not only includes the algorithms that are compared in this paper, it also supports different evaluation procedures, implements a number of metrics, and provides pointers to the public datasets that were used in our experiments.
Since offline experiments cannot inform us about the quality of the recommendation as perceived by users, we have furthermore conducted a user study. In this study, we compared heuristic methods with a neural approach and the recommendations produced by a commercial system (Spotify) in the context of an online radio station. The main outcome of this study is that heuristic methods also lead to recommendations (playlists in this case) that are well accepted by users. The study furthermore sheds some light on the importance of other quality factors in the particular domain, i.e., the capability of an algorithm to help users discover new items.
The paper is organized as follows. Next, in Section 2, we provide an overview of the algorithms that were used in our experiments. Section 3 describes our offline evaluation methodology in more detail and Section 4 presents the outcomes of the experiments. In Section 5, we report the results of our user study. Finally, we summarize our findings and their implications in Section 6.
Footnote 2: Compared to our preliminary work presented in [26], our present analysis includes considerably more recent deep learning techniques and baseline approaches. We also provide the outcomes of additional measurements regarding the scalability and stability of different algorithms. Finally, we also contrast the outcomes of the offline experiments with the findings obtained in a user study [24].
Footnote 3: https://github.com/rn5l/session-rec

ALGORITHMS
Algorithms of various types were proposed over the years for session-based recommendation problems. A detailed overview of the more general family of sequence-aware recommender systems, of which session-based ones are a part, can be found in [32]. In the context of this work, we limit ourselves to a brief summary of parts of the historical development and of how we selected algorithms for inclusion in our evaluations.

Historical Development and Algorithm Selection
Nowadays, different forms of session-based recommendations can be found in practical applications. The recommendation of related items for a given reference object can, for example, be seen as a basic and very typical form of session-based recommendations in practice. In such settings, the selection of the recommendations is usually based solely on the very last item viewed by the user. Common examples are the recommendation of additional articles on news web sites or recommendations of the form "Customers who bought . . . also bought" on e-commerce sites. Another common application scenario is the creation of automated playlists, e.g., on YouTube, Spotify, or last.fm. Here, the system creates a virtually endless list of next-item recommendations based on some seed item and additional observations, e.g., skips or likes, while the media is played. These application domains (web page and news recommendation, e-commerce, music playlists) also represent the main driving scenarios in academic research.
For the recommendation of web pages to visit, Mobasher et al. proposed one of the earliest session-based approaches based on frequent pattern mining in 2002 [29]. In 2005, Shani et al. [37] investigated the use of an MDP-based (Markov Decision Process) approach for session-based recommendations in e-commerce and also demonstrated its value from a business perspective. Alternative technical approaches based on Markov processes were later on proposed in 2012 and 2013 for the news domain in [9] and [10].
An early approach to music playlist generation was proposed in 2005 [33], where the selection of items was based on the similarity with a seed song. The music domain was, however, also very important for collaborative approaches. In 2012, the authors of [12] used a session-based nearest-neighbors technique as part of their approach for playlist generation. This nearest-neighbors method and improved versions thereof later on turned out to be highly competitive with today's neural methods [23]. More complex methods were also proposed for the music domain, e.g., an approach based on Latent Markov Embeddings [5] from 2012. Some novel technical proposals in the years 2014 and 2015 were based on a non-public e-commerce dataset from a European fashion retailer and either used Markov processes and side information [39] or a simple re-ranking scheme based on short-term intents [18]. More importantly, however, in the year 2015, the ACM RecSys conference hosted a challenge, where the problem was to predict if a consumer will make a purchase in a given session, and if so, to predict which item will be purchased. A corresponding dataset (YOOCHOOSE) was released by an industrial partner, which is very frequently used today for benchmarking session-based algorithms. Technically, the winning team used a two-stage classification approach and invested a lot of effort into feature engineering to make accurate predictions [35].
In late 2015, Hidasi et al. [15] then published what was probably the first deep learning based method for session-based recommendation, called gru4rec, a method which was continuously improved later on, e.g., in [14] or [38]. In their work, they also use the mentioned YOOCHOOSE dataset for evaluation, although with a slightly different optimization goal, i.e., to predict the immediate next item click event. As one of their baselines, they used an item-based nearest-neighbors technique. They found that their neural method is significantly better than this technique in terms of prediction accuracy. The proposal of their method and the booming interest in neural approaches subsequently led to a still ongoing wave of new proposals that apply deep learning approaches to session-based recommendation problems.
In this present work, we consider a selection of algorithms that reflects these historical developments. We consider basic algorithms based on item co-occurrences, sequential patterns and Markov processes as well as methods that implement session-based nearest-neighbors techniques. Looking at neural approaches, we benchmark the latest versions of gru4rec as well as five other methods that were published later and which state that they outperform at least the initial version of gru4rec to a significant extent.
Regarding the selected neural approaches, we limit ourselves to methods that do not use side information about the items in order to make our work easily reproducible and not dependent on such meta-data. Another constraint for the inclusion in our comparison is that the work was published at one of the major conferences, i.e., one that is rated A or A* according to the Australian CORE scheme. Finally, while in theory algorithms should be reproducible based on the technical descriptions in the paper, there are usually many small implementation details that can influence the outcome of the measurement. Therefore, as in [8], we only considered approaches where the source code was available and could be integrated in our evaluation framework with reasonable effort.

Considered Algorithms
In total, we considered 12 algorithms in our comparison. Table 1 provides an overview of the non-neural methods. Table 2 correspondingly shows the neural methods considered in our analysis, ordered by their publication date.
Except for the ct method, the non-neural methods from Table 1 are conceptually very simple or almost trivial. As mentioned above, this can lead to a number of potential practical advantages compared to more complex models, e.g., regarding online updates and explainability. From the perspective of the computational costs, the time needed to "train" the simple methods is often low, as this phase often reduces to counting item co-occurrences in the training data or to preparing some in-memory data structures. To make the nearest-neighbors technique scalable, we implemented the internal data structures and data sampling strategies proposed in [19]. As a result, the ct method is the only one from the set of non-neural methods for which we encountered scalability issues in the form of memory consumption and prediction time when the set of recommendable items is huge.
Regarding alternative non-neural approaches, note that in the evaluation in [23], a number of additional methods were considered. We do not include these methods (iknn, fpmc, mc, smf, bpr-mf, fism, fossil) in our present analysis, because previous research showed that these methods either are generally not competitive or only lead to competitive results in a few special cases.
The development over time regarding the neural approaches is summarized in Table 3. The table also indicates which baselines were used in the original papers. The analysis shows that gru4rec was considered as a baseline in all papers. Most papers refer to the original gru4rec publication from 2016 or an early improved version that was proposed shortly afterwards (which we term gru4rec+ here, see [38]). Most papers, however, do not refer to the improved version (gru4rec2) discussed in [14]. Since the public code for gru4rec was constantly updated, we however assume that the authors ran benchmarks against the updated versions. narm, as one of the earlier neural techniques, is the only neural method other than gru4rec that is considered quite frequently by more recent works.
The analysis of the baselines furthermore showed that only one of the more recent papers proposing a neural method, i.e., [40], considers session-based nearest-neighbors techniques as a baseline, even though their competitiveness was documented in a publication at the ACM Recommender Systems conference in 2017 [19]. The authors of [40], however, only consider the original proposal and not the improved versions from 2018 [23]. The only other papers in our analysis that consider session-based nearest-neighbors techniques as baselines are about non-neural techniques (ct and stan). The paper proposing stan is furthermore an exception in that it considers quite a number of neural approaches (gru4rec2, stamp, narm, sr-gnn) in its comparison.

ar This simple "Association Rules" method counts pairwise item co-occurrences in the training sessions. Recommendations for an ongoing session are generated by returning those items that most frequently co-occurred with the last item of the current session in the past. For a formal definition, see [23].
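The counting scheme described for ar can be illustrated with a minimal Python sketch. This is not the session-rec implementation: the function names `train_ar` and `recommend_ar` are hypothetical, and counting each item pair at most once per session is a simplifying assumption (see [23] for the formal definition).

```python
from collections import defaultdict


def train_ar(sessions):
    """Count how often two distinct items co-occur in the same training session.

    Assumption: each pair is counted at most once per session.
    """
    rules = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        items = set(session)
        for a in items:
            for b in items:
                if a != b:
                    rules[a][b] += 1
    return rules


def recommend_ar(rules, current_session, k=20):
    """Return the k items that co-occurred most often with the last item."""
    last_item = current_session[-1]
    candidates = rules.get(last_item, {})
    # Sort by descending count; ties are broken by item id for determinism.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [item for item, _ in ranked[:k]]
```

Because only the last item of the ongoing session is looked up, "training" amounts to a single pass over the data and predictions are simple dictionary lookups.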
sr This method, called "Sequential Rules", was proposed in [23]. It is similar to ar in that it counts pairwise item co-occurrences in the training sessions. Unlike ar, however, it also considers the order of the items in a session and the distance between them using a decay function. The method often led to competitive results, in particular in terms of the Mean Reciprocal Rank, in the analysis in [23].
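A rough sketch of the sr idea follows, assuming a linear decay of 1/distance and a hypothetical `max_steps` window. Both are illustrative choices, not the tuned configuration from the experiments.

```python
from collections import defaultdict


def train_sr(sessions, max_steps=10, decay=lambda distance: 1.0 / distance):
    """Accumulate order-aware rules prev -> current, weighted by a decay function.

    Only items that appear BEFORE the current item in a session contribute,
    and items further back contribute less (decay on the position distance).
    """
    rules = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        for i, current in enumerate(session):
            for j in range(max(0, i - max_steps), i):
                prev = session[j]
                if prev != current:
                    rules[prev][current] += decay(i - j)
    return rules


def recommend_sr(rules, current_session, k=20):
    """Rank candidate items by the rule weights of the last item in the session."""
    candidates = rules.get(current_session[-1], {})
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [item for item, _ in ranked[:k]]
```

In contrast to the symmetric co-occurrence counts of ar, the rules here are directional: an item only receives a score from items that preceded it in the training sessions.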
sknn / v-sknn The analysis in [19] showed that a simple session-based nearest-neighbors method similar to the one from [13] was competitive with the first version of gru4rec. Conceptually, the idea is to find past sessions that contain the same elements as the ongoing session. The recommendations are then generated by selecting items that appeared in the most similar past sessions. Since the sequence in which items are consumed in the ongoing user session might be of importance in the recommendation process, a number of "sequential extensions" to the sknn method were proposed in [23]. Here, the order of the items in a session proved to be helpful, both when calculating the similarities and in the item scoring process. Furthermore, according to [25] it can be beneficial to put more emphasis on less popular items by applying an Inverse-Document-Frequency (IDF) weighting scheme. In this paper, all those extensions are implemented in the v-sknn method.
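The basic neighborhood idea (without the sequential and IDF extensions) might be sketched as follows. The cosine similarity on binary item sets, the brute-force neighbor search, and the exclusion of items already in the ongoing session are assumptions made for this illustration; the sampling and indexing strategies from [19] are omitted.

```python
from collections import defaultdict


def recommend_sknn(train_sessions, current_session, k_neighbors=100, top_n=20):
    """Score items by the similarity of the past sessions in which they appear."""
    current = set(current_session)

    # Cosine similarity between the binary item sets of two sessions.
    similarities = []
    for idx, past in enumerate(train_sessions):
        past_items = set(past)
        overlap = len(current & past_items)
        if overlap > 0:
            sim = overlap / ((len(current) * len(past_items)) ** 0.5)
            similarities.append((sim, idx))

    # Keep the k most similar past sessions as neighbors.
    similarities.sort(key=lambda t: (-t[0], t[1]))
    neighbors = similarities[:k_neighbors]

    # An item's score is the sum of the similarities of the neighbors containing it.
    scores = defaultdict(float)
    for sim, idx in neighbors:
        for item in set(train_sessions[idx]):
            if item not in current:  # assumption: do not re-recommend seen items
                scores[item] += sim

    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [item for item, _ in ranked[:top_n]]
```

No model is learned here; all work happens at prediction time, which is why efficient in-memory data structures matter for this family of methods.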
stan This method, called "Sequence and Time Aware Neighborhood", was presented at SIGIR '19 [11]. stan is based on sknn [19], but it additionally takes into account the following factors for making recommendations: i) the position of an item in the current session, ii) the recency of a past session w.r.t. the current session, and iii) the position of a recommendable item in a neighboring session. Their results show that stan significantly improves over sknn, and is even comparable to recently proposed state-of-the-art deep learning approaches.
vstan This method, which we propose in this present paper, combines the ideas from stan and v-sknn in a single approach. It incorporates all three previously mentioned particularities of stan, which already share some similarities with the v-sknn method. Furthermore, we add a sequence-aware item scoring procedure as well as the IDF weighting scheme from v-sknn.
ct This technique is based on Context Trees, which were originally proposed for lossless data compression. It is a non-parametric method and based on variable-order Markov models. The method was proposed in [28], where it showed promising results.

EVALUATION METHODOLOGY
We benchmarked all methods under the same conditions, using the evaluation framework that we share online to ensure reproducibility of our results.

Datasets
We considered eight datasets from two domains for our evaluation, e-commerce and music. Six of them are public and several of them were previously used to benchmark session-based recommendation algorithms. Table 4 briefly describes the datasets.
Table 2 (neural methods considered in our analysis, ordered by publication date):
gru4rec This method [15] was the first neural approach that employed RNNs for session-based recommendation. This technique uses Gated Recurrent Units (GRU) [6] to deal with the vanishing gradient problem. The technique was later on improved using more effective loss functions [14].
narm This model [21] extends gru4rec and improves its session modeling with the introduction of a hybrid encoder with an attention mechanism. The attention mechanism is in particular used to consider items that appeared earlier in the session and which are similar to the last clicked one. The recommendation scores for each candidate item are computed with a bilinear matching scheme based on the unified session representation.
stamp In contrast to narm, this model [22] does not rely on an RNN. A short-term attention/memory priority model is proposed, which is (a) capable of capturing the users' general interests from the long-term memory of a session context, and which (b) also takes the users' most recent interests from the short-term memory into account. The users' general interests are captured by an external memory built from all the historical clicks in a session prefix (including the last click). The attention mechanism is built on top of the embedding of the last click that represents the user's current interests.
nextitnet This recent model [44] also discards RNNs to model user sessions. In contrast to stamp, convolutional neural networks are adopted with a few domain-specific enhancements. The generative model is designed to explicitly encode item inter-dependencies, which allows it to directly estimate the distribution of the output sequence (rather than the desired item) over the raw item sequence. Moreover, to ease the optimization of the deep generative architecture, the authors propose to wrap the convolutional layer(s) in residual blocks.
sr-gnn This method [42] models session sequences as graph-structured data (i.e., directed graphs). Based on the session graph, sr-gnn is capable of capturing transitions of items and generating item embedding vectors correspondingly, which are difficult to reveal with conventional sequential methods like MC-based and RNN-based methods. With the help of item embedding vectors, sr-gnn furthermore aims to construct reliable session representations from which the next-click item can be inferred.
csrm This method [40] is a hybrid framework that uses collaborative neighborhood information in session-based recommendations. csrm consists of two parallel modules: an Inner Memory Encoder (IME) and an Outer Memory Encoder (OME). The IME models a user's own information in the current session with the help of Recurrent Neural Networks (RNNs) and an attention mechanism. The OME exploits collaborative information to better predict the intent of current sessions by investigating neighborhood sessions. Then, a fusion gating mechanism is used to selectively combine information from the IME and OME to obtain the final representation of the current session. Finally, csrm obtains a recommendation score for each candidate item by computing a bi-linear match with the final representation of the current session.
Table 3. Overview of the baseline techniques that each neural session-based approach was originally compared to. The methods are ordered chronologically by the date of publication. The marks (✗) indicate which baselines were used in the comparison.
Columns: Method, Publication, iknn, sknn, bpr-mf, fpmc, gru4rec, narm, stamp (table body not recoverable).
Table 4 (fragment): A private music dataset with hand-crafted playlists.
We pre-processed the original datasets in a way that all sessions with only one interaction were removed. As done in previous works, we also removed from sessions items that appeared less than 5 times in the dataset. Furthermore, we use an evaluation procedure where we run repeated measurements on several subsets (splits) of the original data, see Section 3.2. The average characteristics of the subsets for each dataset are shown in Table 5.
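The two pre-processing filters can be sketched in a few lines of Python. The function name `preprocess` is hypothetical, and the order of the filters as well as the final removal of sessions that became too short after item filtering are assumptions; the text does not specify these details.

```python
from collections import Counter


def preprocess(sessions, min_item_support=5, min_session_length=2):
    """Drop single-interaction sessions and rarely occurring items."""
    # 1) Remove sessions with only one interaction.
    sessions = [s for s in sessions if len(s) >= min_session_length]

    # 2) Remove items that appear fewer than `min_item_support` times.
    support = Counter(item for s in sessions for item in s)
    filtered = [[i for i in s if support[i] >= min_item_support] for s in sessions]

    # 3) Assumption: sessions shortened below the minimum length are dropped too.
    return [s for s in filtered if len(s) >= min_session_length]
```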
We share all datasets except ZALANDO and 8TRACKS online.

Evaluation Procedure and Metrics
Data Splitting Approach. We apply the following procedure to create train-test splits. Since most datasets consist of time-ordered events, usual cross-validation procedures with the randomized allocation of events across data splits cannot be applied. Several authors only use one single time-ordered training-test split for their measurements. This, however, can lead to undesired random effects. We therefore rely on a protocol where we create five non-overlapping and contiguous subsets (splits) of the datasets. As done in previous works, we use the last n days of each split for evaluation (testing) and the other days for training the models (see footnote 4). The reported measurements correspond to the averaged results obtained for each split.
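One way to realize such a protocol is sketched below. The representation of a session as a `(start_timestamp, items)` tuple with integer timestamps in seconds, the equal-width time slices, and the name `make_splits` are assumptions for illustration; the actual split sizes per dataset may differ.

```python
SECONDS_PER_DAY = 86400


def make_splits(sessions, num_splits=5, test_days=1):
    """Cut the data into contiguous, non-overlapping time slices.

    Within each slice, the sessions of the last `test_days` days form the
    test set and all earlier sessions form the training set.
    """
    start = min(t for t, _ in sessions)
    span = max(t for t, _ in sessions) + 1 - start

    splits = []
    for i in range(num_splits):
        # Integer arithmetic keeps slice borders exact and gap-free.
        lo = start + (i * span) // num_splits
        hi = start + ((i + 1) * span) // num_splits
        cutoff = hi - test_days * SECONDS_PER_DAY
        chunk = [s for s in sessions if lo <= s[0] < hi]
        train = [s for s in chunk if s[0] < cutoff]
        test = [s for s in chunk if s[0] >= cutoff]
        splits.append((train, test))
    return splits
```

Each model would then be trained and evaluated once per split, and the metric values averaged over the five splits.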
The playlist datasets (AOTM and 8TRACKS) are exceptions here as they do not have timestamps. For these datasets, we therefore randomly generated timestamps, which allows us to use the same procedure as for the other datasets.
Hyper-parameter Optimization. Proper hyper-parameter tuning is essential when comparing machine learning approaches. We therefore tuned all hyper-parameters for all methods and datasets in a systematic approach, using MRR@20 as an optimization target as done in previous works. Technically, we created subsets from the training data for validation. The size of the validation set was chosen in a way that it covered the same number of days that was used in the final test set. We applied a random hyper-parameter optimization approach with 100 iterations as done in [14,21,22]. Since narm and csrm only have a smaller set of hyper-parameters, we only had to do 50 iterations for these methods. For the sr-gnn method, we had to limit the number of iterations for the ZALANDO dataset to 40, because tuning was particularly time-consuming. The final hyper-parameter values for each method and dataset can be found online, along with a description of the investigated ranges.
Accuracy Measures. For each session in the test set, we incrementally reveal one event of a session after the other, as was proposed in [15]. The task of the recommendation algorithm is to generate a prediction for the next event(s) in the session in the form of a ranked list of items. The resulting list can then be used to apply standard accuracy measures from information retrieval. The measurement can be done in two different ways.
• As in [15] and other works, we can measure if the immediate next item is part of the resulting list and at which position it is ranked. The corresponding measures are the Hit Rate and the Mean Reciprocal Rank.
• In typical information retrieval scenarios, however, one is usually not interested in having one item right (e.g., the first search result), but in having as many predictions as possible right in a longer list that is displayed to the user. For session-based recommendation scenarios, this applies as well, as usually, e.g., on music and e-commerce sites, more than one recommendation is displayed. Therefore, we measure Precision and Recall in the usual way, by comparing the objects of the returned list with the entire remaining session, assuming that not only the immediate next item is relevant for the user. In addition to Precision and Recall, we also report the Mean Average Precision metric.
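The two measurement variants can be illustrated with the following sketch, which reveals a test session item by item. The `recommend` callable (mapping a session prefix to a ranked item list) is a hypothetical interface, dividing Precision by the cut-off k is an assumption, and Mean Average Precision is omitted for brevity.

```python
def evaluate_session(recommend, session, k=20):
    """Incrementally reveal a session and average HR, MRR, Precision, Recall @k."""
    totals = {"HR": 0.0, "MRR": 0.0, "Precision": 0.0, "Recall": 0.0}
    steps = 0
    for pos in range(1, len(session)):
        prefix, remaining = session[:pos], session[pos:]
        ranked = recommend(prefix)[:k]

        # Variant 1: only the immediate next item counts as relevant.
        next_item = remaining[0]
        if next_item in ranked:
            totals["HR"] += 1.0
            totals["MRR"] += 1.0 / (ranked.index(next_item) + 1)

        # Variant 2: every item in the rest of the session counts as relevant.
        relevant = set(remaining)
        correct = len(set(ranked) & relevant)
        totals["Precision"] += correct / k
        totals["Recall"] += correct / len(relevant)
        steps += 1

    if steps == 0:
        return totals
    return {name: value / steps for name, value in totals.items()}
```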
The most common cut-off threshold in the literature is 20, probably because this was the chosen threshold by the authors of gru4rec [15]. We have made measurements for alternative list lengths as well, but will only report the results when using 20 as a list length in this paper. We report additional results for cut-off thresholds of 5 and 10 in an online appendix (see footnote 5).
Coverage and Popularity. Depending on the application domain, factors other than prediction accuracy might be relevant as well, including coverage, novelty, diversity, or serendipity [36]. Since we do not have information about item characteristics, we focus on questions of coverage and novelty in this work.
With coverage, we here refer to what is sometimes called "aggregate diversity" [1]. Specifically, we measure the fraction of items of the catalog that ever appears in any top-n list presented to the users in the test set. This coverage measure in some ways also measures the level of context adaptation, i.e., if an algorithm tends to recommend the same set of items to everyone or specifically varies the recommendations for a given session.
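As a minimal sketch, this coverage measure could be computed as follows; the function name and the list-of-lists input format are assumptions for illustration.

```python
def catalog_coverage(recommendation_lists, catalog, n=20):
    """Fraction of catalog items that appear in at least one top-n list."""
    catalog = set(catalog)
    recommended = set()
    for ranked in recommendation_lists:
        recommended.update(ranked[:n])
    return len(recommended & catalog) / len(catalog)
```

An algorithm that returns the same popular items for every session would score close to n divided by the catalog size, whereas stronger adaptation to the session context yields higher values.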
We approximate the novelty level of an algorithm by measuring how popular the recommended items are on average. The underlying assumption is that recommending more unpopular items leads to higher novelty and discovery effects. Algorithms that mostly focus on the recommendation of popular items might be undesirable from a business perspective, e.g., when the goal is to leverage the potential of the long tail in e-commerce settings. Technically, we measure the popularity level of an algorithm as follows. First, we compute min-max normalized popularity values of each item in the training set. Then, during evaluation, we compute the popularity level of an algorithm by determining the average popularity value of the items that appear in its top-n recommendation lists. Higher values correspondingly mean that an algorithm has a tendency to recommend rather popular items.
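The popularity measurement can be sketched as below. Using raw interaction counts as the popularity signal and averaging over all recommended items at once (rather than per list first) are assumptions; items unseen in training are given a popularity of 0.

```python
from collections import Counter


def normalized_popularity(train_sessions):
    """Min-max normalize item interaction counts from the training set to [0, 1]."""
    counts = Counter(item for session in train_sessions for item in session)
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    return {item: (c - lo) / span if span else 0.0 for item, c in counts.items()}


def popularity_level(recommendation_lists, popularity, n=20):
    """Average normalized popularity of all items shown in the top-n lists."""
    values = [
        popularity.get(item, 0.0)
        for ranked in recommendation_lists
        for item in ranked[:n]
    ]
    return sum(values) / len(values) if values else 0.0
```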
Running Times. Complex neural models can need substantial computational resources to be trained. Training a "model", i.e., calculating the statistics, for co-occurrence based approaches like sr or ar can, in contrast, be done very efficiently. For nearest-neighbors based approaches, actually no model is learned at all. Instead, some of our nearest-neighbors implementations need some time to create internal data structures that allow for efficient recommendation at prediction time. In the context of this paper, we will report running times for some selected datasets from e-commerce.
We executed all experiments on the same physical machine. The running times for the neural methods were determined using a GPU; the non-neural methods used a CPU. In theory, running times should be compared on the same hardware. Since the running times of the neural methods are much longer even when a GPU can be used, we can assume that the true difference in computational complexity is in fact even higher than we can see in our measurements.
Stability with Respect to New Data. In some application domains, e.g., news recommendation or e-commerce, new user-item interaction data can come in at a high rate. Since retraining the models to accommodate the new data can be costly, a desirable characteristic of an algorithm can be that the performance of the model does not degenerate too quickly before the retraining happens. To put it differently, it is desirable that the models do not overfit too much to the training data.
To investigate this particular form of model stability, we proceeded as follows. First, we trained a model on the training data T0 of a given train-test split (see footnote 6). Then, we made measurements using two different protocols, which we term retraining and no-retraining, respectively.
• In the retraining configuration, we first evaluated the model that was trained on T0 using the data of the first day of the test set. Then, we added this first day of the test set to T0 and retrained the model on this extended dataset, which we name T1. Then, we continued with the evaluation with the data from the second day of the test data, using the model trained on T1. This process of adding more data to the training set, retraining the full model, and evaluating on the next day of the test set was done for all days of the test set.
• In the no-retraining configuration, we also evaluated the performance day by day on the test data, but did not retrain the models, i.e., we used the model trained on T0 for all days in the test data.
To enable a fair comparison in both configurations, we only considered items in the evaluation phase that appeared at least once in the original training data T0.
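Both protocols can be expressed with one loop, as in the following sketch. The `fit` and `evaluate` callables are hypothetical placeholders for training and measuring a model, and filtering unknown items out of the test sessions (and dropping sessions that become shorter than two items) is only one possible reading of the restriction to items from T0.

```python
def stability_protocol(fit, evaluate, train_t0, test_days, retrain):
    """Evaluate day by day, optionally retraining after each test day.

    fit(train_sessions) -> model
    evaluate(model, sessions) -> metric value for one test day
    test_days: list of per-day lists of sessions, in chronological order
    """
    train = list(train_t0)
    known_items = {item for session in train_t0 for item in session}
    model = fit(train)

    results = []
    for day_sessions in test_days:
        # Fair comparison: evaluate only on items that already occur in T0.
        filtered = [[i for i in s if i in known_items] for s in day_sessions]
        filtered = [s for s in filtered if len(s) >= 2]
        results.append(evaluate(model, filtered))

        if retrain:
            # T1, T2, ...: extend the training data by the day just evaluated.
            train = train + list(day_sessions)
            model = fit(train)
    return results
```

Comparing the per-day results of `retrain=True` and `retrain=False` then indicates how quickly a model trained only on T0 loses accuracy.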
Note that the absolute accuracy values for a given test day depends on the characteristics of the recorded data on that day. In some cases, the accuracy for the second test day can therefore even be higher than for the first test day, even if there was no retraining. An exact comparison of absolute values is therefore not too meaningful. However, we consider the relative accuracy drop when using the initial model T 0 for a number of consecutive days as an indicator of the generalizability or stability of the learned models, provided that the investigated algorithms start from a comparable accuracy level.
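The two protocols can be illustrated with a minimal Python sketch. It is not the code of the session-rec framework: the popularity-based stand-in model, the function names, and the simplified hit rate are hypothetical, and the sketch assumes that the relative drop is measured against the retrained model's accuracy on the same day, which the description above does not spell out.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Session = List[str]  # ordered item ids of one session


def train_popularity_model(sessions: Sequence[Session], k: int) -> List[str]:
    """Hypothetical stand-in model: the k most frequent training items."""
    counts = Counter(item for s in sessions for item in s)
    return [item for item, _ in counts.most_common(k)]


def hit_rate(model: List[str], sessions: Sequence[Session], known: Set[str]) -> float:
    """Share of next-item targets that appear in the recommendation list."""
    hits = total = 0
    for s in sessions:
        for target in s[1:]:
            if target not in known:  # only items that occur in T0 are evaluated
                continue
            total += 1
            hits += target in model
    return hits / total if total else 0.0


def run_protocols(t0: Sequence[Session], test_days: Sequence[Sequence[Session]],
                  k: int = 20) -> Tuple[List[float], List[float]]:
    """Return per-day accuracy for the retraining and no-retraining protocols."""
    known = {item for s in t0 for item in s}
    base_model = train_popularity_model(t0, k)  # trained once on T0
    train_data = list(t0)
    retrain, no_retrain = [], []
    for day in test_days:
        current_model = train_popularity_model(train_data, k)  # T0, T1, ...
        retrain.append(hit_rate(current_model, day, known))
        no_retrain.append(hit_rate(base_model, day, known))
        train_data.extend(day)  # add the evaluated day before the next retraining
    return retrain, no_retrain


def relative_decrease(retrain: Sequence[float], no_retrain: Sequence[float]) -> float:
    """Average relative drop (in percent) of no-retraining vs. retraining."""
    drops = [100.0 * (r - n) / r for r, n in zip(retrain, no_retrain) if r > 0]
    return sum(drops) / len(drops) if drops else 0.0
```

On the first test day both protocols use the same model, so the two accuracy values coincide by construction; differences can only appear from the second day onward.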

RESULTS
In this section, we report the results of our offline evaluation. We will first focus on accuracy, then look at alternative quality measures, and finally discuss aspects of scalability and the stability of different models over time.

Accuracy Results
E-Commerce Datasets. Table 6 shows the results for the e-commerce datasets. The highest value across all techniques is printed in bold; the highest value obtained by the other family of algorithms (neural or non-neural) is underlined. Stars indicate significant differences (p<0.05) according to a Kruskal-Wallis test between all the models and a Wilcoxon signed-rank test between the best-performing techniques from each category. The results for the individual datasets can be summarized as follows.
• On the RETAIL dataset, the nearest-neighbors methods consistently lead to the highest accuracy results on all the accuracy measures. Among the complex models, the best results were obtained by gru4rec on all the measures except for MRR, where sr-gnn led to the best value. The results for narm and gru4rec are almost identical on most measures.
• The results for the DIGI dataset are comparable, with the neighborhood methods leading to the best accuracy results. gru4rec is again the best method across the complex models on all the measures.
• For the ZALANDO dataset, the neighborhood methods dominate all accuracy measures, except for the MRR. Here, gru4rec is minimally better than the simple sr method. Among the complex models, gru4rec achieves the best HR value, and the recent sr-gnn method is the best one on the other accuracy measures.
• Only for the RSC15 dataset, we can observe that a neural method (narm) is able to slightly outperform our best simple baseline vstan in terms of MAP, Precision and Recall. Interestingly, however, narm is one of the earlier neural methods in this comparison. The best Hit Rate is achieved by vstan; the best MRR by sr-gnn. The differences between the best neural and non-neural methods are often tiny, in most cases around or less than 1 %.
Looking at the results across the different datasets, we can make the following additional observations.
• Across all e-commerce datasets, the vstan method proposed in this paper is, for most measures, the best neighborhood-based method. This suggests that it is reasonable to include it as a baseline in future performance comparisons.
• The ranking of the neural methods varies largely across the datasets and does not follow the order in which the methods were proposed. As for the non-neural methods, the specific ranking therefore seems to depend strongly on the dataset characteristics. This makes it particularly difficult to judge the progress that is made when only one or two datasets are used for the evaluation.
• The results for the RSC15 dataset are generally different from the other results. Specifically, we found that some neural methods are competitive and slightly outperform our baselines. stamp is not among the top performers except for this dataset. Unlike for other e-commerce datasets, ct works particularly well for this dataset in terms of the MRR. Given these observations, it seems that the RSC15 dataset has some unique characteristics that are different from the other e-commerce datasets. Therefore, it seems advisable to consider multiple datasets with different characteristics in future evaluations.
• We did not include measurements for nextitnet, one of the most recent methods, for the larger ZALANDO and RSC15 datasets. We found that this method does not scale well and we could not complete the hyperparameter tuning process within weeks on our machines (also for two music datasets).
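For readers less familiar with the two next-item measures referred to above, the following minimal sketch shows how a Hit Rate and an MRR at a cut-off k are commonly computed. It assumes the usual textbook definitions (one hidden target item per prediction); the function names are hypothetical and the sketch is not taken from the evaluation framework.

```python
from typing import Sequence


def hit_rate_at_k(ranked_lists: Sequence[Sequence[str]],
                  targets: Sequence[str], k: int = 20) -> float:
    """Fraction of predictions whose target item is among the top-k items."""
    hits = sum(target in list(ranked[:k])
               for ranked, target in zip(ranked_lists, targets))
    return hits / len(targets)


def mrr_at_k(ranked_lists: Sequence[Sequence[str]],
             targets: Sequence[str], k: int = 20) -> float:
    """Mean reciprocal rank of the target item; 0 if it is not in the top-k."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = list(ranked[:k])
        if target in top:
            total += 1.0 / (top.index(target) + 1)  # ranks start at 1
    return total / len(targets)
```

The Hit Rate only records whether the target appears in the list, whereas the MRR also rewards placing it near the top, which is why the two measures can rank algorithms differently.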
Music Domain. In Table 7 we present the results for the music datasets. In general, the observations are in line with what we observed for the e-commerce domain regarding the competitiveness of the simple methods.
• Again, no consistent ranking of the algorithms can be found across the datasets. In particular, the neural approaches take largely varying positions in the rankings across the datasets. Generally, narm seems to be a technique which performs consistently well on most datasets and measures.

Coverage and Popularity
Table 6 and Table 7 also contain information about the popularity bias of the individual algorithms and coverage information. Recall that we described in Section 3.2 how the numbers were calculated. From the results, we can identify the following trends regarding individual algorithms and the different algorithm families.
Popularity Bias.
• The ct method is very different from all other methods in terms of its popularity bias, which is much higher than for any other method.
• The gru4rec method, on the other hand, is the method that almost consistently recommends the most unpopular (or: novel) items to the users.
• The neighborhood-based methods are often somewhere in the middle. There are, however, also neural methods, in particular sr-gnn, which seem to have a similar or sometimes even stronger popularity bias than the nearest-neighbors approaches. The assumption that nearest-neighbors methods in general focus more on popular items than neural methods can therefore not be confirmed through our experiments.
Coverage.
• In terms of coverage, we found that gru4rec often leads to the highest values.
• The coverage of the neighborhood-based methods varies quite a lot, depending on the specific algorithm variant. In some configurations, their coverage is almost as high as for gru4rec, while in others the coverage can be low.
• The coverage values of the other neural methods also do not show a clear ranking, and they are often in the range of the neighborhood-based methods and sometimes even very low.
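To make the two measures more tangible, the sketch below shows one plausible way to compute them. The exact procedure is the one given in Section 3.2; here we merely assume that coverage is the share of catalog items that appear in at least one recommendation list, and that popularity bias is the mean min-max-normalized training frequency of the recommended items. All function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Set


def item_coverage(rec_lists: Sequence[Sequence[str]], catalog: Set[str]) -> float:
    """Share of catalog items that occur in at least one recommendation list."""
    recommended = {item for recs in rec_lists for item in recs}
    return len(recommended & catalog) / len(catalog)


def popularity_bias(rec_lists: Sequence[Sequence[str]],
                    train_sessions: Sequence[Sequence[str]]) -> float:
    """Mean min-max-normalized training popularity of all recommended items."""
    counts = Counter(item for s in train_sessions for item in s)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero if all counts are equal
    scores = [(counts.get(item, lo) - lo) / span  # unseen items count as least popular
              for recs in rec_lists for item in recs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, a method that always recommends the same few best-sellers obtains a popularity value close to 1 and a coverage close to 0, which is the pattern that separates, e.g., ct from gru4rec in the discussion above.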

Scalability
We present selected results regarding the running times of the algorithms for two e-commerce datasets and one music dataset in Table 8. The reported times were measured for training and predicting for one data split. The numbers reported for predicting correspond to the average time needed to generate a recommendation for a session beginning in the test set. For this measurement, we used a workstation computer with an Intel Core i7-4790k processor and an Nvidia Geforce GTX 1080 Ti graphics card (Cuda 10.1/CuDNN 7.5).
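The measurement procedure described above can be sketched as follows. This is a simplified illustration and not the timing code of our framework; `train_fn` and `predict_fn` are hypothetical callables, and we assume that one prediction is made for every session beginning (prefix) of each test session.

```python
import time
from typing import Any, Callable, List, Sequence, Tuple


def measure_running_times(
    train_fn: Callable[[Sequence[List[str]]], Any],
    predict_fn: Callable[[Any, List[str]], Any],
    train_sessions: Sequence[List[str]],
    test_sessions: Sequence[List[str]],
) -> Tuple[float, float, int]:
    """Return (training seconds, average seconds per prediction, #predictions)."""
    start = time.perf_counter()
    model = train_fn(train_sessions)
    train_seconds = time.perf_counter() - start

    n_predictions = 0
    start = time.perf_counter()
    for session in test_sessions:
        # One recommendation per session beginning: reveal the first
        # `pos` items and ask the model for the next-item list.
        for pos in range(1, len(session)):
            predict_fn(model, session[:pos])
            n_predictions += 1
    predict_seconds = time.perf_counter() - start

    avg_predict = predict_seconds / n_predictions if n_predictions else 0.0
    return train_seconds, avg_predict, n_predictions
```

Wall-clock measurements of this kind depend on the hardware and on background load, so they are best read as indications of orders of magnitude rather than as exact values.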
The results generally show that the computational complexity of neural methods is, as expected, much higher than for the non-neural approaches. In some cases, researchers therefore only use a smaller fraction of the original datasets, e.g., 1/4 or 1/64 of the RSC15 dataset. Several algorithms, both neural ones and the ct method, exhibit major scalability issues when the number of recommendable items increases. For the nextitnet method, for example, training on the ZALANDO dataset with its almost 190k items and its particularly long sessions did not complete within a reasonable time frame in our experiments.
In some cases, like for ct or sr-gnn, not only the training time increases, but also the prediction time. Prediction times in particular can, however, be subject to strict time constraints in production settings. The prediction times for the nearest-neighbors methods are often slightly higher than those measured for methods like gru4rec, but usually lie within the time constraints of real-time recommendation (e.g., requiring about 30ms for one prediction for the ZALANDO dataset).
Since datasets in real-world environments can be even larger, this leaves us with questions regarding the practicability of some of the approaches. In general, even in cases where a complex neural method would slightly outperform one of the simpler ones in an offline evaluation, it remains open whether it is worth the effort to put such complex methods into production. For the ZALANDO dataset, for example, the best neural method (sr-gnn) needs several orders of magnitude more time to train than the best non-neural method vstan, which also only needs half the time for recommending.

Stability With Respect to New Data
We report the stability results for the examined neural and non-neural algorithms on two datasets in Table 9. We used two months of training data and 10 days of test data for both datasets, DIGI and NOWP. The reported values show how much the accuracy results of each algorithm degrade (in percent), averaged across the test days, when there is no daily retraining.
We can see from the results that the drop in accuracy without retraining can vary a lot across datasets (domains). For the DIGI dataset, the decrease in performance ranges between 0 and 10 percent across the different algorithms and performance measures. The NOWP dataset from the music domain seems to be more short-lived, with more recent trends that have to be considered. Here, the decrease in performance ranges from about 15 to 50 percent in terms of HR and from about 15 to 85 percent in terms of MRR. Looking at the detailed results, we see that in both families of algorithms, i.e., neural and non-neural ones, some algorithms are much more stable than others when new data are added to a given dataset. For the family of non-neural approaches, we see that nearest-neighbor approaches are generally better than the other baseline techniques based on association rules or context trees. Among the neural methods, narm is the most stable one on the DIGI dataset, but often falls behind the other deep learning methods on the NOWP dataset. On this latter dataset, the csrm method leads to the most stable results. In general, however, no clear pattern across the datasets can be found regarding the performance of the neural methods when new data comes in and no retraining is done.

Table 9. Relative accuracy decrease (in percent) for the evaluated algorithms on two datasets, ordered by HR@20. The best results for each metric are highlighted in bold font. The next best results from the other category (neural or non-neural) are underlined. Columns: HR@20 and MRR@20 for DIGI; HR@20 and MRR@20 for NOWP.
Overall, given that the computational costs of training complex models can be high, it can be advisable to look at the stability of algorithms with respect to new data when choosing a method for production. According to our analysis, there can be strong differences across the algorithms. Furthermore, the nearest-neighbors methods appear to be quite stable in this comparison.

OBSERVATIONS FROM A USER STUDY
Offline evaluations, while predominant in the literature, can have certain limitations, in particular when it comes to the question of how the quality of the provided recommendations is perceived by users. We therefore conducted a controlled experiment, in which we compared different algorithmic approaches for session-based recommendation in the context of an online radio station. In the following sections, we report the main insights of this experiment. While the study did not include all algorithms from our offline analysis, we consider it helpful to obtain a more comprehensive picture regarding the performance of session-based recommenders. More details about the study can be found in [24].

Research Questions and Study Setup
Research Questions. Our offline analysis indicated that simple methods are often more competitive than the more complex ones. Our main research question therefore was how the recommendations generated by such simple methods are perceived by users in different dimensions, in particular compared to recommendations by a complex method. Furthermore, we were interested in how users perceive the recommendations of a commercial music streaming service, in our case Spotify, in the same situation.
Study Setup. An online music listening application in the form of an "automated radio station" was developed for the purpose of the study. Similar to existing commercial services, users of the application could select a track they like (called a "seed track"), based on which the application creates a playlist of subsequent tracks that are played automatically. While the music was played, the users could listen to a track to the end before moving to the next track, skip the track if they did not like it, or press a "like" button. In case of a like action, the list of upcoming tracks was updated. Users were visually notified that such an update took place.
Besides recording skips and like actions, additional feedback was collected from the study participants. Before going to the next track, they had to answer for each listened track (i) if they already knew the track, (ii) to what extent the track matched the previously played track, and (iii) to what extent they liked the track (independent of the playlist), see Figure 1.
Once the participants had listened to and rated at least 15 tracks, they were forwarded to a post-task questionnaire. In this questionnaire, we asked the participants 11 questions about how they perceived the service, see also [31]. Specifically, the participants were asked to provide answers to the questions using seven-point Likert scale items, ranging from "completely disagree" to "completely agree". The questions, which include a twelfth question as an attention check, are listed in Table 10.
The study itself was based on a between-subjects design, where the treatments for each user group correspond to different algorithmic approaches to generate the recommendations. We included algorithms from different families in our study.
• ar: Association rules of length two, as described in Section 2. We included this method as a simple baseline.
• cagh: Another relatively simple baseline, which recommends the greatest hits of artists similar to those liked in the current session. This music-specific method is often competitive in offline evaluations as well, see [3].
• sknn: The basic nearest-neighbors method described above. We took the simple variant as a representative for the family of such approaches, as it performed particularly well in the ACM RecSys 2018 challenge [25].
• gru4rec: The RNN-based approach discussed above, used as a representative for neural methods. narm would have been a stable alternative, but did not scale well for the used dataset.
• spotify: Recommendations in this treatment group were retrieved in real time from Spotify's API.

Table 10. Items of the post-task questionnaire (Q8 is the attention check).
Q1 I liked the automatically generated radio station.
Q2 The radio suited my general taste in music.
Q3 The tracks on the radio musically matched the track I selected in the beginning.
Q4 The radio was tailored to my preferences the more positive feedback I gave.
Q5 The radio was diversified in a good way.
Q6 The tracks on the radio surprised me.
Q7 I discovered some unknown tracks that I liked in the process.
Q8 I am participating in this study with care so I change this slider to two.
Q9 I would listen to the same radio station based on that track again.
Q10 I would use this system again, e.g., with a different first song.
Q11 I would recommend this radio station to a friend.
Q12 I would recommend this system to a friend.
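As an illustration of the simplest treatment listed above, the following sketch shows how an association-rules baseline of length two can be realized. It is a simplified reading of the ar method and not the implementation used in the study: we assume that rule weights are plain co-occurrence counts of two items within a session and that only the last item of the current session is used for the prediction. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List, Sequence


def fit_ar(sessions: Sequence[Sequence[str]]) -> Dict[str, Dict[str, int]]:
    """Count how often two distinct items co-occur in the same session."""
    rules: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        # Each unordered pair contributes once per session, in both directions.
        for a, b in permutations(set(session), 2):
            rules[a][b] += 1
    return rules


def recommend_ar(rules: Dict[str, Dict[str, int]],
                 current_session: Sequence[str], k: int = 20) -> List[str]:
    """Rank items by their co-occurrence count with the last session item."""
    last_item = current_session[-1]
    candidates = rules.get(last_item, {})
    # Sort by descending count; ties are broken alphabetically for determinism.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]
```

Because only the last item is looked up, earlier tracks of the session have no influence on the resulting list in this sketch.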
We optimized and trained all models on the Million Playlist Dataset (MPD) provided by Spotify. We then recruited study participants using Amazon's Mechanical Turk crowdsourcing platform. After excluding participants who did not pass the attention checks, we ended up with N=250 participants, i.e., 50 for each treatment group, for whom we were confident that they provided reliable feedback.
Most of the recruited participants (almost 80%) were US-based. The most typical age range was between 25 and 34, with more than 50% of the participants falling into this category. On average, the participants considered themselves to be music enthusiasts, with an average response of 5.75 (on the seven-point scale) to a corresponding survey question. As usual, the participants received a compensation for their efforts through the crowdsourcing platform.

User Study Outcomes
The main observations can be summarized as follows.
Feedback During the Listening Experience. Looking at the feedback that was collected during the listening session, we observed the following.
• Number of Likes. There were significant differences regarding the number of likes we observed across the treatment groups. Recommendations by the simple ar method received the highest number of likes (6.48), followed by sknn (5.63), cagh (5.38), gru4rec (5.36) and spotify (4.48).
• Popularity of Tracks. We found a clear correlation (r=0.89) between the general popularity of a track in the MPD dataset and the number of likes in the study. The ar and cagh methods recommended, on average, the most popular tracks. The recommendations by spotify and gru4rec were more oriented towards tracks with lower popularity.
• Track Familiarity. There were also clear differences in terms of how many of the recommended tracks were already known by the users. The cagh (10.83 %) and sknn (10.13 %) methods recommended the largest number of known tracks. The ar method, even though it recommended very popular tracks, led to much more unfamiliar recommendations (8.61 %). gru4rec was somewhere in the middle (9.30 %), and spotify recommended the most novel tracks to users (7.00 %).
• Suitability of Track Continuations. The continuations created by sknn and cagh were perceived to be the most suitable ones. The differences between sknn and ar, gru4rec, and spotify were significant. The recommendations made by the ar method were considered to match the playlist the least. This is not too surprising because the ar method only considers the very last played track for the recommendation of subsequent tracks.
• Individual Track Ratings. The differences regarding the individual ratings for each track are generally small and not significant. Interestingly, the playlist-independent ratings for tracks recommended by the ar method were the lowest ones, even though these recommendations received the highest number of likes. An analysis of the rating distribution shows that the ar method often produces very bad recommendations, with a mode value of 1 on the 1-7 rating scale.
Post-Task Questionnaire. The post-task questionnaire revealed the following aspects:
• Q1: The radio station based on sknn was significantly more liked than the stations that used gru4rec, ar, and spotify.
• Q2: All radio stations matched the users' general taste quite well, with median values between 5 and 6 on a seven-point scale. Only the station based on the ar method received a significantly lower rating than the others.
• Q3: The sknn method was found to perform significantly better than ar and gru4rec with respect to identifying tracks that musically match the seed track.
• Q4: The adaptation of the playlist based on the like statements was considered good for all radio stations. Again, the feedback for the ar method was significantly lower than for the other methods.
• Q5 and Q6: No significant differences were found regarding the diversity and surprise level of the different recommendation strategies.
• Q7: Regarding the capability of recommending unknown tracks that the users liked, the recommendations by spotify were perceived to be much better than for the other methods, with significant differences compared to all other methods.
• Q9 to Q12: The best performing methods in terms of the intention to reuse and the intention to recommend the radio station to others were sknn, cagh, and spotify. gru4rec and ar were slightly worse, sometimes with differences that were statistically significant.
Overall, the study confirmed that methods like sknn do not only perform well in an offline evaluation, but are also able to generate recommendations that are well perceived in different dimensions by the users. The study also revealed a number of additional insights.
First, we found that optimizing for like statements can be misleading. The ar method received the highest number of likes, but was consistently worse than other techniques in almost all other dimensions. Apparently, this was caused by the fact that the ar method made a number of bad recommendations; see also [4] for an analysis of the effects of bad recommendations in the music domain.
Second, it turned out that discovery support seems to be an important factor in this particular application domain. While the recommendations of spotify were slightly less appreciated than those by sknn, we found no difference in terms of the users' intention to reuse the system or to recommend it to friends. We hypothesize that the better discovery support of spotify's recommendations was an important factor for this phenomenon. This observation points to the importance of considering multiple potential quality factors when comparing systems.

CONCLUSIONS AND WAYS FORWARD
Our work reveals that despite a continuous stream of papers that propose new neural approaches for session-based recommendation, the progress in the field seems still limited. According to our evaluations, today's deep learning techniques are in many cases not outperforming much simpler heuristic methods. Overall, this indicates that there still is a huge potential for more effective neural recommendation methods in the future in this area. In particular, methods that leverage deep learning techniques to incorporate side information represent a promising way forward, see [7,16,17,30].
In a related analysis of deep learning techniques for recommender systems [8], the authors found that different factors contribute to what they call phantom progress. A first problem is related to the reproducibility of the reported results. They found that in less than a third of the investigated papers, the code was made available to other researchers. The problem also exists to some extent for session-based recommendation approaches. To further increase the level of reproducibility, we share our evaluation framework publicly, so that other researchers can easily benchmark their own methods with a comprehensive set of neural and non-neural approaches on different datasets.
Through sharing our evaluation framework, we hope to also address other methodological and procedural issues mentioned in [8] that can make the comparison of algorithms unreliable or inconclusive. Regarding methodological issues, we for example found works that determined the optimal number of training epochs on the test set and furthermore determined the best Hit Rate and MRR values across different optimization epochs. Regarding procedural issues, we found that while researchers seemingly rely on the same datasets as previous works, they sometimes apply different data pre-processing strategies. Furthermore, the choice of the baselines can make the results inconclusive. Most investigated works do not consider the sknn method and its variants as a baseline. Some works only compare variants of one method and include a non-neural, but not necessarily strong other baseline. In many cases, little is also said about the optimization of the hyper-parameters of the baselines. The session-rec framework used in our evaluation should help to avoid these problems, as it contains all the code for data pre-processing, evaluation, and hyper-parameter optimization.
Finally, our analyses indicated that optimizing solely for accuracy can be insufficient also for session-based recommendation scenarios. Depending on the application domain, other quality factors such as coverage, diversity, or novelty should be considered, because they can be crucial for the adoption and success of the recommendation service. Given the insights from our controlled experiment, we furthermore argue that more user studies and field tests are necessary to understand the characteristics of successful recommendations in a given application domain.