
Benchmarking

A Methodology for Ensuring the Relative Quality of Recommendation Systems in Software Engineering

Chapter in: Recommendation Systems in Software Engineering

Abstract

This chapter describes the concepts involved in benchmarking recommendation systems. Benchmarking is used to assess the quality of a research or production system relative to other systems, whether algorithmically, infrastructurally, or according to any other sought-after quality. Specifically, the chapter presents the evaluation of recommendation systems according to recommendation accuracy, technical constraints, and business values, within a multi-dimensional benchmarking and evaluation model that combines any number of quality measures into a single comparable metric. The chapter first introduces concepts related to the evaluation and benchmarking of recommendation systems, continues with an overview of the current state of the art, and then presents the multi-dimensional approach in detail. It concludes with a brief discussion of the introduced concepts and a summary.
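To make the multi-dimensional idea concrete, the following sketch combines normalized scores along an accuracy, a technical, and a business dimension into one weighted figure. The dimension names, weights, and normalization bounds are illustrative assumptions, not the specific model defined in the chapter.

    # A minimal sketch of a multi-dimensional benchmark score (Python).
    # The dimensions, weights, and normalization bounds are illustrative
    # assumptions, not the specific model defined in this chapter.

    def normalize(value, worst, best):
        """Map a raw measurement onto [0, 1], where 1 is best."""
        return max(0.0, min(1.0, (value - worst) / (best - worst)))

    def benchmark_score(measures, weights):
        """Weighted average of normalized quality dimensions."""
        total = sum(weights[d] for d in measures)
        return sum(weights[d] * measures[d] for d in measures) / total

    # Hypothetical measurements of one system along three dimensions.
    measures = {
        "accuracy":  normalize(0.82, worst=0.0, best=1.0),       # e.g., precision@10
        "technical": normalize(120.0, worst=1000.0, best=50.0),  # e.g., response time (ms)
        "business":  normalize(0.031, worst=0.0, best=0.05),     # e.g., conversion rate
    }
    weights = {"accuracy": 0.5, "technical": 0.2, "business": 0.3}

    print("Benchmark score: %.3f" % benchmark_score(measures, weights))

Varying the weights shifts the benchmark toward whichever quality matters most in a given deployment, while the normalization keeps dimensions measured on different scales comparable.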


Notes

  1. See, for example, the UCI Machine Learning Repository, which contains a large selection of machine learning benchmark datasets. Recommendation-system-related benchmark datasets can also be found in KONECT, e.g., under the category ratings.

  2. Better recommendations postpone or eliminate the content glut effect [32], a variation on the idea of information overload, and thus increase customer lifetime, which translates into additional revenue for Netflix's monthly-plan-based subscription service.

  3. The date-based partition of the NP dataset into training/testing sets reflects the original aim of recommendation systems: predicting users' future interests from their past ratings and activities. A sketch of such a date-based split is given after these notes.

  4. Additional recommendation datasets can be found at the Recommender Systems Wiki.

  5. Due to the different context of this dataset, no number of items is given; the dataset instead contains two sets of event types (search and download). A density cannot be calculated because there is no fixed set of items.

  6. See mloss.org for additional general ML software and the Recommender Systems Wiki for recommendation-specific software.

  7. Editors’ note: More broadly, recommendation systems in software engineering do not only or always deal with the information overload problem [46]; thus, the definition of user satisfaction needs to be broadened in such situations.

  8. In some webshop implementations, clicking on a recommended item can directly add it to the cart, thus reducing the number of steps and simplifying the purchase process.
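As mentioned in note 3, offline benchmarks commonly partition a dataset by date, training on interactions recorded before a cut-off and evaluating on those recorded afterwards. The following sketch illustrates such a split on a toy interaction log; the field names and the cut-off date are hypothetical and not taken from any of the datasets discussed in the chapter.

    # A minimal sketch of a date-based training/testing split (Python).
    # The field names and the cut-off date are hypothetical.
    from datetime import datetime

    def split_by_date(interactions, cutoff):
        """Interactions recorded before the cut-off form the training set;
        the remaining interactions form the testing set."""
        training = [x for x in interactions if x["timestamp"] < cutoff]
        testing = [x for x in interactions if x["timestamp"] >= cutoff]
        return training, testing

    interactions = [
        {"user": 1, "item": 10, "rating": 4, "timestamp": datetime(2005, 3, 1)},
        {"user": 1, "item": 11, "rating": 2, "timestamp": datetime(2005, 9, 15)},
        {"user": 2, "item": 10, "rating": 5, "timestamp": datetime(2005, 12, 24)},
    ]

    training, testing = split_by_date(interactions, cutoff=datetime(2005, 6, 1))
    print(len(training), "training and", len(testing), "testing interactions")

Splitting by date rather than at random prevents the model from being trained on interactions that occur after those it is evaluated on, which would leak future information.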

References

  1. Adomavicius, G., Zhang, J.: Stability of recommendation algorithms. ACM Trans. Inform. Syst. 30(4), 23:1–23:31 (2012). doi:10.1145/2382438.2382442


  2. Amatriain, X., Basilico, J.: Netflix recommendations: Beyond the 5 stars (Part 1). The Netflix Tech Blog (2012). URL http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html. Accessed 9 October 2013

  3. Avazpour, I., Pitakrat, T., Grunske, L., Grundy, J.: Dimensions and metrics for evaluating recommendation systems. In: Robillard, M., Maalej, W., Walker, R.J., Zimmermann, T. (eds.) Recommendation Systems in Software Engineering, Chap. 10. Springer, New York (2014)


  4. Bajracharya, S.K., Lopes, C.V.: Analyzing and mining a code search engine usage log. Empir. Software Eng. 17(4–5), 424–466 (2012). doi:10.1007/s10664-010-9144-6


  5. Barber, W., Badre, A.: Culturability: The merging of culture and usability. In: Proceedings of the Conference on Human Factors & the Web, Basking Ridge, NJ, USA, 5 June 1998


  6. Bell, R., Koren, Y., Volinsky, C.: Chasing $1,000,000: How we won the Netflix Progress Prize. ASA Stat. Comput. Graph. Newslett. 18(2), 4–12 (2007)


  7. Boxwell Jr., R.J.: Benchmarking for Competitive Advantage. McGraw-Hill, New York (1994)


  8. Butkiewicz, M., Madhyastha, H.V., Sekar, V.: Understanding website complexity: Measurements, metrics, and implications. In: Proceedings of the ACM SIGCOMM Conference on Internet Measurement, pp. 313–328, Berlin, Germany, 2 November 2011. doi:10.1145/2068816.2068846


  9. Carenini, G.: User-specific decision-theoretic accuracy metrics for collaborative filtering. In: Proceedings of the International Conference on Intelligent User Interfaces, San Diego, CA, USA, 10–13 January 2005


  10. Celma, Ò., Lamere, P.: If you like the Beatles you might like…: A tutorial on music recommendation. In: Proceedings of the ACM International Conference on Multimedia, pp. 1157–1158. ACM, New York (2008). doi:10.1145/1459359.1459615

  11. Chen, L., Pu, P.: A cross-cultural user evaluation of product recommender interfaces. In: Proceedings of the ACM Conference on Recommender Systems, pp. 75–82, Lausanne, Switzerland, 23–25 October 2008. doi:10.1145/1454008.1454022

  12. Cilibrasi, R.L., Vitányi, P.M.B.: The Google similarity distance. IEEE Trans. Knowl. Data Eng. 19(3), 370–383 (2007). doi:10.1109/TKDE.2007.48


  13. Cremonesi, P., Garzotto, F., Negro, S., Papadopoulos, A.V., Turrin, R.: Looking for “good” recommendations: A comparative evaluation of recommender systems. In: Proceedings of the IFIP TC13 International Conference on Human–Computer Interaction, Part III, pp. 152–168, Lisbon, Portugal, 5–9 September 2011. doi:10.1007/978-3-642-23765-2_11

  14. Cremonesi, P., Garzotto, F., Turrin, R.: Investigating the persuasion potential of recommender systems from a quality perspective: An empirical study. ACM Trans. Interact. Intell. Syst. 2(2), 11:1–11:41 (2012). doi:10.1145/2209310.2209314


  15. Desrosiers, C., Karypis, G.: A comprehensive survey of neighborhood-based recommendation methods. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp. 107–144. Springer, Boston (2011). doi:10.1007/978-0-387-85820-3_4


  16. Dias, M.B., Locher, D., Li, M., El-Deredy, W., Lisboa, P.J.G.: The value of personalised recommender systems to e-business: A case study. In: Proceedings of the ACM Conference on Recommender Systems, pp. 291–294, Lausanne, Switzerland, 23–25 October 2008. doi:10.1145/1454008.1454054

  17. Ehrgott, M., Gandibleux, X. (eds.): Multiple Criteria Optimization: State of the Art Annotated Bibliographic Surveys. Kluwer, Boston (2002). doi:10.1007/b101915


  18. Fraser, G., Arcuri, A.: Sound empirical evidence in software testing. In: Proceedings of the ACM/IEEE International Conference on Software Engineering, pp. 178–188, Zurich, Switzerland, 2–9 June 2012. doi:10.1109/ICSE.2012.6227195


  19. Goh, D., Razikin, K., Lee, C.S., Chu, A.: Investigating user perceptions of engagement and information quality in mobile human computation games. In: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 391–392, Washington, DC, USA, 10–14 June 2012. doi:10.1145/2232817.2232906


  20. Gomez-Uribe, C.: Challenges and limitations in the offline and online evaluation of recommender systems: A Netflix case study. In: Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE, CEUR Workshop Proceedings, vol. 910, p. 1 (2012)


  21. Gunawardana, A., Shani, G.: A survey of accuracy evaluation metrics of recommendation tasks. J. Mach. Learn. Res. 10, 2935–2962 (2009)


  22. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inform. Syst. 22(1), 5–53 (2004). doi:10.1145/963770.963772


  23. Hu, R.: Design and user issues in personality-based recommender systems. In: Proceedings of the ACM Conference on Recommender Systems, pp. 357–360, Barcelona, Spain, 26–30 September 2010. doi:10.1145/1864708.1864790

  24. Jambor, T., Wang, J.: Optimizing multiple objectives in collaborative filtering. In: Proceedings of the ACM Conference on Recommender Systems, pp. 55–62, Barcelona, Spain, 26–30 September 2010. doi:10.1145/1864708.1864723

  25. Koenigstein, N., Dror, G., Koren, Y.: Yahoo! music recommendations: Modeling music ratings with temporal dynamics and item taxonomy. In: Proceedings of the ACM Conference on Recommender Systems, pp. 165–172, Chicago, IL, USA, 23–27 October 2011. doi:10.1145/2043932.2043964


  26. Kung, H.T., Luccio, F., Preparata, F.P.: On finding the maxima of a set of vectors. J. ACM 22(4), 469–476 (1975). doi:10.1145/321906.321910


  27. Lai, J.Y.: Assessment of employees’ perceptions of service quality and satisfaction with e-business. In: Proceedings of the ACM SIGMIS CPR Conference on Computer Personnel Research, pp. 236–243, Claremont, CA, USA, 13–15 April 2006. doi:10.1145/1125170.1125228


  28. Liu, J., Dolan, P., Pedersen, E.R.: Personalized news recommendation based on click behavior. In: Proceedings of the International Conference on Intelligent User Interfaces, pp. 31–40, Hong Kong, China, 7–10 February 2010. doi:10.1145/1719970.1719976


  29. McNee, S., Lam, S.K., Guetzlaff, C., Konstan, J.A., Riedl, J.: Confidence displays and training in recommender systems. In: Proceedings of the IFIP TC13 International Conference on Human–Computer Interaction, pp. 176–183, Zurich, Switzerland, 1–5 September 2003

  30. Nah, F.F.H.: A study on tolerable waiting time: How long are Web users willing to wait? Behav. Inform. Technol. 23(3), 153–163 (2004). doi:10.1080/01449290410001669914


  31. Netflix Prize: The Netflix Prize rules (2006). URL http://www.netflixprize.com/rules. Accessed 9 October 2013

  32. Perry, R., Lancaster, R.: Enterprise content management: Expected evolution or vendor positioning? Tech. rep., The Yankee Group (2002)


  33. Peška, L., Vojtáš, P.: Evaluating the importance of various implicit factors in E-commerce. In: Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE, CEUR Workshop Proceedings, vol. 910, pp. 51–55, Dublin, Ireland, 9 September 2012


  34. Pilászy, I., Tikk, D.: Recommending new movies: Even a few ratings are more valuable than metadata. In: Proceedings of the ACM Conference on Recommender Systems, pp. 93–100, New York, NY, USA, 23–25 October 2009. doi:10.1145/1639714.1639731


  35. Pu, P., Chen, L., Hu, R.: A user-centric evaluation framework for recommender systems. In: Proceedings of the ACM Conference on Recommender Systems, pp. 157–164, Chicago, IL, USA, 23–27 October 2011. doi:10.1145/2043932.2043962


  36. Said, A., Fields, B., Jain, B.J., Albayrak, S.: User-centric evaluation of a K-furthest neighbor collaborative filtering recommender algorithm. In: Proceedings of the ACM Conference on Computer Supported Cooperative Work, pp. 1399–1408, San Antonio, TX, USA, 23–27 February 2013. doi:10.1145/2441776.2441933


  37. Said, A., Tikk, D., Shi, Y., Larson, M., Stumpf, K., Cremonesi, P.: Recommender systems evaluation: A 3D benchmark. In: Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE, CEUR Workshop Proceedings, vol. 910, pp. 21–23, Dublin, Ireland, 9 September 2012


  38. Sarwat, M., Bao, J., Eldawy, A., Levandoski, J.J., Magdy, A., Mokbel, M.F.: Sindbad: A location-based social networking system. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 649–652, Scottsdale, AZ, USA, 20–24 May 2012. doi:10.1145/2213836.2213923


  39. Schein, A.I., Popescul, A., Ungar, L.H., Pennock, D.M.: Methods and metrics for cold-start recommendations. In: Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 253–260, Tampere, Finland, 11–15 August 2002. doi:10.1145/564376.564421


  40. Schein, A.I., Popescul, A., Ungar, L.H., Pennock, D.M.: CROC: A new evaluation criterion for recommender systems. Electron. Commerce Res. 5(1), 51–74 (2005). doi:10.1023/B:ELEC.0000045973.51289.8c


  41. Schütze, H., Silverstein, C.: Projections for efficient document clustering. In: Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 74–81, Philadelphia, PA, USA, 27–31 July 1997. doi:10.1145/258525.258539


  42. Sumner, T., Khoo, M., Recker, M., Marlino, M.: Understanding educator perceptions of “quality” in digital libraries. In: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 269–279, Houston, TX, USA, 27–31 May 2003. doi:10.1109/JCDL.2003.1204876

  43. Takács, G., Pilászy, I., Németh, B., Tikk, D.: Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res. 10, 623–656 (2009)


  44. Terveen, L., Hill, W.: Beyond recommender systems: Helping people help each other. In: Carroll, J.M. (ed.) Human–Computer Interaction in the New Millennium. Addison-Wesley, New York (2001)


  45. Van Veldhuizen, D.A., Lamont, G.B.: Multiobjective evolutionary algorithms: Analyzing the state-of-the-art. Evol. Comput. 8(2), 125–147 (2000). doi:10.1162/106365600568158


  46. Walker, R.J.: Recent advances in recommendation systems for software engineering. In: Proceedings of the International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, Lecture Notes in Computer Science, vol. 7906, pp. 372–381. Springer, Heidelberg (2013). doi:10.1007/978-3-642-38577-3_38


  47. Zheng, H., Wang, D., Zhang, Q., Li, H., Yang, T.: Do clicks measure recommendation relevancy?: An empirical user study. In: Proceedings of the ACM Conference on Recommender Systems, pp. 249–252, Barcelona, Spain, 26–30 September 2010. doi:10.1145/1864708.1864759


  48. Ziegler, C.N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: Proceedings of the International Conference on the World Wide Web, pp. 22–32, Chiba, Japan, 10–14 May 2005. doi:10.1145/1060745.1060754


  49. Zitzler, E., Deb, K., Thiele, L.: Comparison of multiobjective evolutionary algorithms: Empirical results. Evol. Comput. 8(2), 173–195 (2000). doi:10.1162/106365600568202



Acknowledgments

The authors would like to thank Martha Larson from TU Delft, Brijnesh J. Jain from TU Berlin, and Alejandro Bellogín from CWI for their contributions and suggestions to this chapter.

This work was partially carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 246016.

Author information


Corresponding author

Correspondence to Alan Said.



Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Said, A., Tikk, D., Cremonesi, P. (2014). Benchmarking. In: Robillard, M., Maalej, W., Walker, R., Zimmermann, T. (eds) Recommendation Systems in Software Engineering. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45135-5_11


  • DOI: https://doi.org/10.1007/978-3-642-45135-5_11


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45134-8

  • Online ISBN: 978-3-642-45135-5

  • eBook Packages: Computer Science, Computer Science (R0)
