A Sparse Probabilistic Model of User Preference Data

  • Matthew Smith
  • Laurent Charlin
  • Joelle Pineau
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10233)


Modern recommender systems rely on user preference data to understand, analyze and provide items of interest to users. However, for some domains, collecting and sharing such data can be problematic: it may be expensive to gather data from several users, or it may be undesirable to share real user data for privacy reasons. We therefore propose a new model for generating realistic preference data. Our Sparse Probabilistic User Preference (SPUP) model produces synthetic data by sparsifying an initially dense user preference matrix generated by a standard matrix factorization model. The model incorporates aggregate statistics of the original data, such as user activity level and item popularity, as well as their interaction, to produce realistic data. We show empirically that our model can reproduce real-world datasets from different domains to a high degree of fidelity according to several measures. Our model can be used by both researchers and practitioners to generate new datasets or to extend existing ones, enabling the sound testing of new models and providing an improved form of bootstrapping in cases where limited data is available.
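The "dense first, then sparsify" idea described in the abstract can be sketched in a few lines. The following is only a toy illustration under stated assumptions, not the SPUP model's actual equations, which the abstract does not give: the function name `generate_sparse_preferences`, the Gaussian latent factors, the Pareto-distributed activity and popularity weights, and the sigmoid interaction term are all hypothetical choices made for this example.

```python
import math
import random


def generate_sparse_preferences(n_users, n_items, n_factors=5, density=0.1, seed=0):
    """Toy 'dense then sparsify' generator (illustrative, not the SPUP equations)."""
    rng = random.Random(seed)

    # Step 1: dense preferences from a basic matrix factorization, R = U V^T.
    U = [[rng.gauss(0.0, 1.0) for _ in range(n_factors)] for _ in range(n_users)]
    V = [[rng.gauss(0.0, 1.0) for _ in range(n_factors)] for _ in range(n_items)]
    dense = [[sum(u * v for u, v in zip(U[i], V[j])) for j in range(n_items)]
             for i in range(n_users)]

    # Step 2: heavy-tailed user activity and item popularity weights.
    # Assumption: drawn from a Pareto distribution here; in practice they
    # would be estimated from the aggregate statistics of a real dataset.
    activity = [rng.paretovariate(2.0) for _ in range(n_users)]
    popularity = [rng.paretovariate(2.0) for _ in range(n_items)]
    mean_a = sum(activity) / n_users
    mean_p = sum(popularity) / n_items

    # Step 3: keep entry (i, j) with a probability that grows with the user's
    # activity, the item's popularity, and the dense preference itself
    # (a stand-in for the user-item interaction).
    observed = {}
    for i in range(n_users):
        for j in range(n_items):
            affinity = 1.0 / (1.0 + math.exp(-dense[i][j]))  # squashed to (0, 1)
            p_keep = min(1.0, 2.0 * density * affinity
                         * (activity[i] / mean_a) * (popularity[j] / mean_p))
            if rng.random() < p_keep:
                observed[(i, j)] = dense[i][j]
    return dense, observed


if __name__ == "__main__":
    dense, observed = generate_sparse_preferences(n_users=50, n_items=80)
    print(f"kept {len(observed)} of {50 * 80} entries")
```

Because the keep probability is a product of a per-user weight, a per-item weight, and a per-pair affinity, active users and popular items end up with more observed entries, which is roughly the kind of degree structure the abstract says the model aims to reproduce.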


Recommender System, Synthetic Data, Degree Distribution, User Preference, Rating Matrix



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. School of Computer Science, McGill University, Montréal, Canada
  2. HEC Montréal, Montréal, Canada
