## Abstract

Most real-world recommender systems are deployed in a commercial context or designed to represent a value-adding service, e.g., on shopping or Social Web platforms, and typical success indicators for such systems include conversion rates, customer loyalty or sales numbers. In academic research, in contrast, the evaluation and comparison of different recommendation algorithms is mostly based on offline experimental designs and accuracy or rank measures which are used as proxies to assess an algorithm’s recommendation quality. In this paper, we show that popular recommendation techniques—despite often being similar when compared with the help of accuracy measures—can be quite different with respect to which items they recommend. We report the results of an in-depth analysis in which we compare several recommendation strategies from different perspectives, including accuracy, catalog coverage and their bias to recommend popular items. Our analyses reveal that some recent techniques that perform well with respect to accuracy measures focus their recommendations on a tiny fraction of the item spectrum or recommend mostly top sellers. We analyze the reasons for some of these biases in terms of algorithmic design and parameterization and show how the characteristics of the recommendations can be altered by hyperparameter tuning. Finally, we propose two novel algorithmic schemes to counter these popularity biases.


## Notes

See (Jannach et al. 2012b) for an analysis of the literature on recommender systems, which covers over 300 research papers that were published in the five years after the Netflix Prize.

In contrast to “per-user” diversity measures, this measure determines how many different items are recommended across all users.

Table 1 shall be considered as an illustrative example. A systematic comparison of the recommendation lists for all users (Sect. 3.3) shows that the average overlap of the first 10 items for BPR and Funk-SVD is only about 6 %. The overlap of the two matrix factorization (MF) methods Koren-MF and Funk-SVD is similarly small.
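Such an average list overlap can be computed with a few lines of code. The following is an illustrative sketch with made-up data, not the evaluation framework's actual implementation; all function and variable names are ours:

```python
# Sketch: average pairwise overlap of two algorithms' top-10 lists across
# users. Each argument maps a user id to that user's ranked top-10 items.

def avg_top10_overlap(lists_a, lists_b):
    """Return the mean fraction of shared items in the users' top-10 lists."""
    users = lists_a.keys() & lists_b.keys()
    overlaps = [
        len(set(lists_a[u][:10]) & set(lists_b[u][:10])) / 10
        for u in users
    ]
    return sum(overlaps) / len(overlaps)

# Two users whose lists share 1 and 0 items, respectively
bpr = {"u1": list(range(10)), "u2": list(range(10))}
svd = {"u1": [0] + list(range(100, 109)), "u2": list(range(200, 210))}
print(avg_top10_overlap(bpr, svd))  # 0.05, i.e., a 5 % average overlap
```

An overlap of 6 % as reported above thus means that, on average, the two algorithms agree on well under one item per top-10 list.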

To make our results reproducible, we publish the source code of our evaluation framework, see http://ls13-www.cs.tu-dortmund.de/homepage/recommender101/.

We did not use the officially released MovieLens datasets because we were not able to retrieve content information for all movies. The largest MovieLens dataset we used in our experiments had about 1 million ratings. However, for this sample we could not run the simulation experiment using the User-kNN method within reasonable time. To make our research comparable to previous works, we report the other accuracy results for the official MovieLens1M release and a Netflix Prize sample in the Appendix.

The best results are printed in bold face in case the numbers were significantly different from all other results (p \(<\) 0.05 with Bonferroni correction). Throughout the paper we used paired two-tailed Student’s *t*-tests with a p \(=\) 0.05 significance level. In most tests, p \(<\) 0.01 holds, but we report p \(<\) 0.05 for consistency and because this is the most common significance level in the literature.

In addition, “external” and application-specific measures like sales numbers or box office figures could be used.

Not shown in Fig. 3.

This observation also applies for FM (MCMC).

Compared to the sampling function shown in Fig. 12 for *i*, a corresponding function for *j* would have to be flipped horizontally.

Note that even after a 10 % drop with respect to *recall (All)*, BPR would still be the best-performing technique in our comparison in Sect. 3.

Changing the rating prediction by small amounts is usually sufficient to push an item several places up in the recommendation lists.

While the absolute values can vary on each run even when the same data is used, the *existence* of the biases is not affected by the random initialization.

## References

Adamopoulos, P., Tuzhilin, A.: On over-specialization and concentration bias of recommendations: probabilistic neighborhood selection in collaborative filtering systems. In: Proceedings of the 2014 ACM Conference on Recommender Systems (RecSys ’14), pp. 153–160, Foster City (2014a)

Adamopoulos, P.: Beyond rating prediction accuracy: on new perspectives in recommender systems. In: Proceedings of the 2013 ACM Conference on Recommender Systems (RecSys ’13), pp. 459–462. Hong Kong (2013)

Adamopoulos, P., Tuzhilin, A.: On unexpectedness in recommender systems: or how to better expect the unexpected. ACM Trans. Intell. Syst. Technol. **5**(4), 54:1–54:32 (2014b)

Adomavicius, G., Kwon, Y.: Improving aggregate recommendation diversity using ranking-based techniques. IEEE Trans. Knowl. Data Eng. **24**(5), 896–911 (2012)

Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. **17**(6), 734–749 (2005)

Adomavicius, G., Zhang, J.: Impact of data characteristics on recommender systems performance. ACM Trans. Manag. Inform. Syst. **3**(1), 3:1–3:17 (2012)

Bradley, K., Smyth, B.: Improving recommendation diversity. In: Proceedings of the 12th National Conference in Artificial Intelligence and Cognitive Science (AICS ’01), pp. 75–84. Maynooth (2001)

Breese, J.S., Heckerman, D., Kadie, C.M.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52. Madison (1998)

Celma, O., Herrera, P.: A new approach to evaluating novel recommendations. In: Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys ’08), pp. 179–186. Lausanne (2008)

Chau, P.Y.K., Ho, S.Y., Ho, K.K.W., Yao, Y.: Examining the effects of malfunctioning personalized services on online users’ distrust and behaviors. Decision Support Syst. **56**, 180–191 (2013)

Cremonesi, P., Koren, Y., Turrin, R.: Performance of recommender algorithms on top-n recommendation tasks. In: Proceedings of the 2010 ACM Conference on Recommender Systems (RecSys ’10), pp. 39–46. Barcelona (2010)

Cremonesi, P., Garzotto, F., Negro, S., Papadopoulos, A.V., Turrin, R.: Looking for “good” recommendations: a comparative evaluation of recommender systems. In: Proceedings of the 13th International Conference on Human-Computer Interaction (INTERACT ’11), vol. 6948 of Lecture Notes in Computer Science, pp. 152–168. Lisbon. Springer, Berlin (2011)

Cremonesi, P., Garzotto, F., Turrin, R.: Investigating the persuasion potential of recommender systems from a quality perspective: an empirical study. ACM Trans. Interact. Intell. Syst. **2**(2), 11:1–11:41 (2012)

Cremonesi, P., Garzotto, F., Quadrana, M.: Evaluating top-n recommendations “when the best are gone”. In: Proceedings of the 2013 ACM Conference on Recommender Systems (RecSys ’13), pp. 339–342. Hong Kong (2013a)

Cremonesi, P., Garzotto, F., Turrin, R.: User-centric vs. system-centric evaluation of recommender systems. In: Proceedings of the 15th International Conference on Human-Computer Interaction (INTERACT ’13), vol. 8119 of Lecture Notes in Computer Science, pp. 334–351. Cape Town. Springer, Berlin (2013b)

Dias, M.B., Locher, D., Li, M., El-Deredy, W., Lisboa, P.J.: The value of personalised recommender systems to e-business: a case study. In: Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys ’08), pp. 291–294. Lausanne (2008)

Ekstrand, M.D., Ludwig, M., Konstan, J.A., Riedl, J.T.: Rethinking the recommender research ecosystem: Reproducibility, openness, and LensKit. In: Proceedings of the 2011 ACM Conference on Recommender Systems (RecSys ’11), pp. 133–140. Chicago (2011)

Ekstrand, M.D., Harper, F.M., Willemsen, M.C., Konstan, J.A.: User perception of differences in recommender algorithms. In: Proceedings of the 2014 ACM Conference on Recommender Systems (RecSys ’14), pp. 161–168. Foster City (2014)

Fleder, D., Hosanagar, K.: Blockbuster culture’s next rise or fall: the impact of recommender systems on sales diversity. Manag. Sci. **55**(5), 697–712 (2009)

Funk, S.: Netflix update: try this at home. http://sifter.org/~simon/journal/20061211.html (2006). Accessed June 2015

Gantner, Z., Rendle, S., Drumond, L., Freudenthaler, C.: MyMediaLite: example experiments. Results are published online at http://mymedialite.net/examples/datasets.html (2014). Accessed June 2015

Garcin, F., Faltings, B., Donatsch, O., Alazzawi, A., Bruttin, C., Huber, A.: Offline and online evaluation of news recommender systems at swissinfo.ch. In: Proceedings of the 2014 ACM Conference on Recommender Systems (RecSys ’14), pp. 169–176. Foster City (2014)

Ge, M., Jannach, D., Gedikli, F.: Bringing diversity to recommendation lists - an analysis of the placement of diverse items. In: Proceedings 14th International Conference on Enterprise Information Systems (ICEIS ’12), Springer Lecture Notes in Business Information Processing, vol. 141, pp. 293–305. Wroclaw (2013)

Gedikli, F., Bagdat, F., Ge, M., Jannach, D.: Rf-rec: fast and accurate computation of recommendations based on rating frequencies. In: Proceedings of the 2011 IEEE 13th Conference on Commerce and Enterprise Computing (CEC ’11), pp. 50–57. Luxembourg (2011)

Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inform. Syst. **22**(1), 5–53 (2004)

Jambor, T., Wang, J.: Optimizing multiple objectives in collaborative filtering. In: Proceedings of the 2010 ACM Conference on Recommender Systems (RecSys ’10), pp. 55–62. Barcelona (2010)

Jannach, D., Hegelich, K.: A case study on the effectiveness of recommendations in the mobile internet. In: Proceedings of the 2009 ACM Conference on Recommender Systems (RecSys ’09), pp. 205–208. New York (2009)

Jannach, D., Karakaya, Z., Gedikli, F.: Accuracy improvements for multi-criteria recommender systems. In: Proceedings of the 13th ACM Conference on Electronic Commerce (EC ’12), pp. 674–689. Valencia (2012a)

Jannach, D., Zanker, M., Ge, M., Gröning, M.: Recommender systems in computer science and information systems - a landscape of research. In: Proceedings 13th International Conference on E-Commerce and Web Technologies (EC-WEB ’12), pp. 76–87. Vienna (2012b)

Jannach, D., Lerche, L., Gedikli, F., Bonnin, G.: What recommenders recommend - an analysis of accuracy, popularity, and sales diversity effects. In: Proceedings of the 21st International Conference on User Modeling, Adaptation and Personalization (UMAP ’13), pp. 25–37. Rome (2013)

Javari, A., Jalili, M.: Accurate and novel recommendations: an algorithm based on popularity forecasting. ACM Trans. Intell. Syst. Technol. **5**(4), 56:1–56:20 (2014)

Kirshenbaum, E., Forman, G., Dugan, M.: A live comparison of methods for personalized article recommendation at Forbes.com. In: Proceedings European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD ’12), pp. 51–66. Bristol (2012)

Konstan, J.A., Riedl, J.: Recommender systems: from algorithms to user experience. User Model. User-Adapt. Interact. **22**(1–2), 101–123 (2012)

Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08), pp. 426–434. Las Vegas (2008)

Koren, Y.: Factor in the neighbors: scalable and accurate collaborative filtering. ACM Trans. Knowl. Discov. Data (TKDD) **4**(1), 1:1–1:24 (2010)

Lee, J., Sun, M., Lebanon, G.: A comparative study of collaborative filtering algorithms. ACM Comput. Res. Repos., arxiv:1205.3193 (2012)

Lemire, D., Maclachlan, A.: Slope one predictors for online rating-based collaborative filtering. In: Proceedings of the SIAM Conference on Data Mining, pp. 471–480. Newport Beach (2005)

Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Comput. **7**(1), 76–80 (2003)

McNee, S.M., Riedl, J., Konstan, J.A.: Being accurate is not enough: how accuracy metrics have hurt recommender systems. In: Proceedings of the 2006 Conference on Human Factors in Computing Systems (CHI ’06), pp. 1097–1101, Montreal (2006)

Murakami, T., Mori, K., Orihara, R.: Metrics for evaluating the serendipity of recommendation lists. In: Proceedings of the 2007 Conference on New Frontiers in Artificial Intelligence (JSAI ’07), pp. 40–46. Miyazaki (2008)

Niemann, K., Wolpers, M.: A new collaborative filtering approach for increasing the aggregate diversity of recommender systems. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’13), pp. 955–963. Chicago (2013)

Pan, W., Zhong, H., Xu, C., Ming, Z.: Adaptive bayesian personalized ranking for heterogeneous implicit feedbacks. Knowledge-Based Syst. **73**, 173–180 (2015)

Park, Y.-J., Tuzhilin, A.: The long tail of recommender systems and how to leverage it. In: Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys ’08), pp. 11–18. Lausanne (2008)

Prawesh, S., Padmanabhan, B.: The “top N” news recommender: Count distortion and manipulation resistance. In: Proceedings of the 2011 ACM Conference on Recommender Systems (RecSys ’11), pp. 237–244. Chicago (2011)

Pu, P., Chen, L., Hu, R.: A user-centric evaluation framework for recommender systems. In: Proceedings of the 2011 ACM Conference on Recommender Systems (RecSys ’11), pp. 157–164. Chicago (2011)

Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI ’09), pp. 452–461. Montreal (2009)

Rendle, S.: Factorization machines with libFM. ACM Trans. Intell. Syst. Technol. **3**(3), 57:1–57:22 (2012)

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: Grouplens: an open architecture for collaborative filtering of netnews. In: Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work (CSCW ’94), pp. 175–186. Chapel Hill (1994)

Said, A., Tikk, D., Stumpf, K., Shi, Y., Larson, M., Cremonesi, P.: Recommender systems evaluation: A 3D benchmark. In: Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE ’12), pp. 21–23. Dublin. CEUR-WS Vol. 910 (2012)

Said, A., Bellogín, A., de Vries, A.: A top-n recommender system evaluation protocol inspired by deployed systems. In: ACM RecSys 2013 Workshop on Large-Scale Recommender Systems (LSRS ’13), Hong Kong (2013a)

Said, A., Fields, B., Jain, B.J., Albayrak, S.: User-centric evaluation of a k-furthest neighbor collaborative filtering recommender algorithm. In: Proceedings of the 2013 Conference on Computer Supported Cooperative Work (CSCW ’13), pp. 1399–1408. San Antonio (2013b)

Shani, G., Gunawardana, A.: Evaluating recommendation systems. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp. 257–297. Springer, New York (2011)

Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., Oliver, N., Hanjalic, A.: CLiMF: Learning to maximize reciprocal rank with collaborative less-is-more filtering. In: Proceedings of the 2012 ACM Conference on Recommender Systems (RecSys ’12), pp. 139–146. Dublin (2012)

Shi, L.: Trading-off among accuracy, similarity, diversity, and long-tail: a graph-based recommendation approach. In: Proceedings of the 2013 ACM Conference on Recommender Systems (RecSys ’13), pp. 57–64. Hong Kong (2013)

Steck, H.: Item popularity and recommendation accuracy. In: Proceedings of the 2011 ACM Conference on Recommender Systems (RecSys ’11), pp. 125–132. Chicago (2011)

Vargas, S., Castells, P.: Rank and relevance in novelty and diversity metrics for recommender systems. In: Proceedings of the 2011 ACM Conference on Recommender Systems (RecSys ’11), pp. 109–116. Chicago (2011)

Zanker, M., Bricman, M., Gordea, S., Jannach, D., Jessenitschnig, M.: Persuasive online-selling in quality & taste domains. In: Proceedings of the 7th International Conference on Electronic Commerce and Web Technologies (EC-WEB ’06), pp. 51–60. Krakow (2006)

Zhang, M., Hurley, N.: Avoiding monotony: Improving the diversity of recommendation lists. In: Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys ’08), pp. 123–130. Lausanne (2008)

Zhang, M., Hurley, N.: Niche product retrieval in top-n recommendation. In: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT ’10), pp. 74–81. Toronto (2010)

Zhang, Y. C., Séaghdha, D. Ó., Quercia, D., Jambor, T.: Auralist: Introducing serendipity into music recommendation. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM ’12), pp. 13–22. Seattle (2012)

Zhang, N., Zhang, Y., Tang, J.: A tag recommendation system for folksonomy. In: Proceedings of the 2nd Workshop on Social Web Search and Mining (SWSM ’09), pp. 9–16, Hong Kong (2009)

Zhang, M.: Enhancing the diversity of collaborative filtering recommender systems. PhD thesis, University College Dublin (2010)

Zhou, T., Kuscsik, Z., Liu, J.-G., Medo, M., Wakeling, J.R., Zhang, Y.-C.: Solving the apparent diversity-accuracy dilemma of recommender systems. Proc. National Acad. Sci. **107**(10), 4511–4515 (2010)

Ziegler, C.-N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pp. 22–32. Chiba (2005)


## Appendix


### 1.1 Gini index

The Gini index can be derived from the Lorenz curve, which is a cumulative distribution function as shown in Fig. 17. The diagonal corresponds to an even distribution. The higher the deviation of the Lorenz curve from the diagonal, the stronger is the unevenness of the distribution. The Gini index measures the strength of the inequality of a distribution and can be calculated as twice the difference between the area below the diagonal and the area below the curve (Zhang 2010).

In our application setting, we calculate how often each item was included in a top-10 list, sort the items according to their popularity in increasing order and group them into *n* bins \(x_1, ..., x_n\), each containing 30 items.

For such a discrete distribution, the Gini index *G* can be computed using the formula

\[ G = \frac{2\,q_n}{n\,p_n} - \frac{n+1}{n} \]

where \(p_n\) is the cumulative sum of the first *n* bins, i.e.,

\[ p_n = \sum_{i=1}^{n} x_i \]

With \(q_n\), we weight each \(x_i\) according to its rank position, i.e.,

\[ q_n = \sum_{i=1}^{n} i \cdot x_i \]

To normalize *G*, we divide it by \(G_{max} = 1-(1/n)\) to finally obtain \(G_{norm}\):

\[ G_{norm} = \frac{G}{G_{max}} = \frac{G}{1-(1/n)} \]
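The computation can be sketched in a few lines. This is an illustrative implementation assuming the standard rank-weighted discrete Gini formulation (consistent with \(G_{max} = 1-(1/n)\) for a maximally unequal distribution); the function name and example data are ours:

```python
# Sketch: normalized Gini index over binned recommendation frequencies.
# Items are sorted by how often they appeared in a top-10 list, grouped
# into bins (size 30 following the text), and the index is computed from
# the rank-weighted bin sums.

def gini_norm(counts, bin_size=30):
    """counts: per-item recommendation frequencies; returns G_norm in [0, 1]."""
    xs = sorted(counts)  # increasing order of popularity
    bins = [sum(xs[i:i + bin_size]) for i in range(0, len(xs), bin_size)]
    n = len(bins)
    p_n = sum(bins)                                  # total mass over all bins
    q_n = sum(i * x for i, x in enumerate(bins, 1))  # rank-weighted sum
    g = 2 * q_n / (n * p_n) - (n + 1) / n
    g_max = 1 - 1 / n
    return g / g_max

# All recommendations concentrated on the most popular bin -> maximal inequality
print(round(gini_norm([0] * 90 + [100] * 30), 3))  # 1.0

# Perfectly even distribution -> no inequality
print(round(gini_norm([5] * 120), 3))  # 0.0
```

A value near 1 thus indicates that the algorithm concentrates its recommendations on a small set of items, while a value near 0 indicates an even spread over the catalog.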

### 1.2 Results for other datasets

Table 11 reports statistics for the datasets used in our evaluations. Tables 12, 13, 14, 15, 16, 17 and 18 show the corresponding results for the evaluated metrics (notation: P10T = Precision@10 (TS), R10A = Recall@10 (All), etc.). Furthermore, AvgR denotes the average rating and AvgP the average popularity of the top-10 recommended items. Div is the diversity in terms of inverse ILS, and NbRec the overall number of different items recommended by the algorithms. In each column, the highest value is highlighted in case the observed difference is statistically significant (p \(<\) 0.05) when compared to the other algorithms. Due to its high computational complexity, we did not test the User-kNN method for the 7 million Netflix and 1 million MovieLens datasets. The content-based algorithm could only be benchmarked on datasets for which content information was available.

### 1.3 Artificial popularity on other datasets

Tables 19, 20 and 21 show the results of the artificial popularity bias experiment (see Sect. 4.2) for the three datasets MovieLens400k, MovieLens1M and Yahoo!Movies on the precision and recall strategies *All* (all items in the test set) and *TS* (only items with known ratings in the test set). The algorithm only recommends items that were rated by at least *p* users in the training set.
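The eligibility filter described above can be sketched as follows. This is an illustrative reading, not the paper's actual code; the ranking of eligible items by popularity and all names are our assumption:

```python
# Sketch of the artificial-popularity-bias baseline: only items rated by
# at least `p` users in the training set are eligible for recommendation.

from collections import Counter

def popularity_filtered_top_n(train_ratings, p, n=10):
    """train_ratings: iterable of (user, item) pairs from the training set."""
    freq = Counter(item for _, item in train_ratings)
    eligible = [item for item, c in freq.items() if c >= p]
    # Rank eligible items by training-set popularity, most-rated first.
    return sorted(eligible, key=lambda i: freq[i], reverse=True)[:n]

# Toy training set: item A rated by 50 users, B by 20, C by only 3
train = [("u%d" % u, "A") for u in range(50)] + \
        [("u%d" % u, "B") for u in range(20)] + \
        [("u%d" % u, "C") for u in range(3)]
print(popularity_filtered_top_n(train, p=10))  # ['A', 'B'] -- C falls below the threshold
```

Raising *p* thus shrinks the set of recommendable items toward the top sellers, which is exactly the bias the experiment makes explicit.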

### 1.4 Detailed results for the PBA algorithm

Tables 22 and 23 show the detailed results for the PBA method when applied to the output of Koren-MF and FM (ALS), respectively. As before, the table headers are shortened (P10T = Precision@10 (TS), etc.). Furthermore, AvgP denotes the average popularity of the top-10 recommended items and NbRec the overall number of different recommendations. The column \(\lambda \) shows which value was used for the regularization variable to produce the results in the corresponding row. The first row (with the \(\lambda \) value left blank) contains the raw output of the underlying algorithm, unaltered by the PBA method.

### 1.5 Partial derivatives for the PBA algorithm

To minimize the optimization goal of the PBA algorithm (see Eq. 8) via a gradient descent strategy, we have to calculate the partial derivatives of the minimization function to estimate the step width. The derivative for a specific \(x_{ui}\) can be calculated as follows:

with the first part being reducible in the following way

and the latter part being reducible to

Thus, the combined derivatives form the following assignment rule for each gradient descent step:


## About this article

### Cite this article

Jannach, D., Lerche, L., Kamehkhosh, I. *et al.* What recommenders recommend: an analysis of recommendation biases and possible countermeasures.
*User Model User-Adap Inter* **25**, 427–491 (2015). https://doi.org/10.1007/s11257-015-9165-3

