Adaptive Objective Functions and Distance Metrics for Recommendation Systems

  • Michael C. Burkhart
  • Kourosh Modarresi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11537)


We describe, develop, and implement different models for the standard matrix completion problem from the field of recommendation systems. We benchmark these models against the publicly available Netflix Prize challenge dataset, consisting of users’ ratings of movies on a 1–5 scale. While the original competition scored submissions solely on RMSE, we experiment with different objective functions for model training, ensemble construction, and model/ensemble testing.
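As a concrete illustration of the competition metric mentioned above, the sketch below computes RMSE over only the observed entries of a partially filled rating matrix. The toy 3×3 matrix and the constant predictor are illustrative assumptions, not data from the paper.

```python
import numpy as np

# RMSE, the Netflix Prize metric: compare predictions to observed
# entries only, since the rating matrix is mostly missing.
def rmse(predicted, actual, observed_mask):
    diff = (predicted - actual)[observed_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy 3-user x 3-movie example on the 1-5 scale; zeros mark
# unobserved cells in this illustration.
actual = np.array([[5.0, 3.0, 0.0],
                   [4.0, 0.0, 1.0],
                   [0.0, 2.0, 5.0]])
mask = actual > 0
predicted = np.full_like(actual, 3.0)  # naive constant predictor
print(round(rmse(predicted, actual, mask), 3))  # → 1.528
```

Swapping the squared error inside `rmse` for another loss (e.g. absolute error) is the kind of objective-function variation the abstract refers to.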

Our best-performing estimators were (1) a linear ensemble of base models trained using linear regression (see ensemble \(e_1\), RMSE: 0.912) and (2) a neural network that aggregated predictions from individual models (see ensemble \(e_4\), RMSE: 0.912). Many of the constituent models in our ensembles had yet to be developed when the Netflix competition concluded in 2009, and to our knowledge little research has established best practices for combining these models into ensembles. We consider this problem, with a particular emphasis on the role that the choice of objective function plays in ensemble construction.
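A linear ensemble of base models, as described above, can be sketched as a least-squares blend of held-out base-model predictions. The three base models and their noise levels below are hypothetical stand-ins, not the paper's actual constituents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical held-out predictions from three base models (columns)
# for 100 ratings, plus the true ratings; data are purely illustrative.
true_ratings = rng.uniform(1, 5, size=100)
base_preds = np.column_stack([
    true_ratings + rng.normal(0, 0.9, 100),   # e.g. a neighborhood model
    true_ratings + rng.normal(0, 0.7, 100),   # e.g. matrix factorization
    true_ratings + rng.normal(0, 1.1, 100),   # e.g. a factorization machine
])

# Linear stacking: least-squares weights (with intercept) blending
# the base-model outputs to minimize squared error on the held-out set.
X = np.column_stack([np.ones(len(true_ratings)), base_preds])
weights, *_ = np.linalg.lstsq(X, true_ratings, rcond=None)
blended = X @ weights

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# On the fitting set, the blend is at least as accurate as any
# single base model, since each single model is a special case.
best_single = min(rmse(base_preds[:, j], true_ratings) for j in range(3))
print(rmse(blended, true_ratings) <= best_single)  # → True
```

A neural-network aggregator, as in ensemble \(e_4\), would replace the linear map `X @ weights` with a small feed-forward network trained on the same held-out predictions.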



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Adobe Inc., San José, USA
