PMD: An Optimal Transportation-Based User Distance for Recommender Systems
Abstract
Collaborative filtering predicts a user’s preferences by aggregating ratings from similar users and thus the user similarity (or distance) measure is key to good performance. Existing similarity measures either consider only the co-rated items for a pair of users (but co-rated items are rare in real-world sparse datasets), or try to utilize the non-co-rated items via some heuristics. We propose a novel user distance measure, called Preference Mover’s Distance (PMD), based on the optimal transportation theory. PMD exploits all ratings made by each user and works even if users do not share co-rated items at all. In addition, PMD is a metric and has favorable properties such as triangle inequality and zero self-distance. Experimental results show that PMD achieves superior recommendation accuracy compared with the state-of-the-art similarity measures, especially on highly sparse datasets.
Keywords
Recommendation · User similarity · Optimal transport
1 Introduction
Collaborative filtering (CF) is one of the most widely used recommendation techniques [14, 47]. Given a user, CF recommends items by aggregating the preferences of similar users. Among CF recommendation approaches, methods based on nearest-neighbors (NN) are widely used, thanks to their simplicity, efficiency and ability to produce accurate and personalized recommendations [13, 35, 44]. Although deep learning (DL) methods [16, 19, 43] have attracted much attention in the recommendation community over the past few years, a very recent study [12] shows that NN-based CF is still a strong baseline and outperforms many DL methods. For NN-based methods, the user similarity measure plays an important role. It serves as the criterion to select a group of similar users whose ratings form the basis of recommendations, and is used to weigh users so that more similar users have greater impact on recommendations. Besides CF, user similarity is also important for applications such as link prediction [4], community detection [34] and so on.
Related Work. Traditional similarity measures, such as the cosine distance (COS) [9], Pearson's Correlation Coefficient (PCC) [9] and their variants [18, 29, 38, 39], have been widely used in CF [13, 44]. However, such measures consider only the co-rated items and ignore ratings on all other items, and thus may capture users' preferences only coarsely, as ratings are sparse and co-rated items are rare in many real-world datasets [35, 40, 44]. Other similarity measures, such as Jaccard [22], MSD [39], JMSD [8], URP [27], NHSM [27], PIP [5] and BS [14], do not utilize all of the rating information [6]. For example, Jaccard uses only the number of rated items and omits the specific rating values, while URP uses only the mean and the variance of the ratings. Critically, all these measures give a similarity of zero when there are no co-rated items, which harms recommendation performance. Recently, BCF [35] and HUSM [44] were proposed to alleviate the co-rating issue by modeling user similarity as a weighted sum of item similarities, where the weights are obtained using heuristics. As the weights are not derived in a principled manner, these measures fail to satisfy properties such as the triangle inequality and zero self-distance, which are desirable for a high-quality similarity measure.
The Earth Mover's Distance (EMD) is a distance metric on probability distributions that originates from optimal transportation theory [25, 37]. EMD has been used in many applications, such as computer vision [7], natural language processing [17, 23] and signal processing [41]. EMD has also been applied to CF [48], but there it serves as a regularizer that forces the latent variable to fit a Gaussian prior during auto-encoder training, rather than as a user similarity measure.
Our Solution. We propose the Preference Mover’s Distance (PMD), which considers all ratings made by each user and is able to evaluate user similarity even in the absence of co-rated items. Similar to BCF and HUSM, PMD uses the item similarity as side information and assumes that if two users have similar opinions on similar items, then their tastes are similar. But the key difference is: PMD formulates the distance between a pair of users as an optimal transportation problem [26, 36] such that the weights for item similarities can be derived in a principled manner. In fact, PMD can be viewed as a special case of EMD [33, 37, 45], which is a metric that satisfies important properties such as triangle inequality and zero self-distance. We also make PMD practical for large datasets by employing the Sinkhorn algorithm [10] to speed up distance computation and using HNSW [30] to further accelerate the search for similar users. Experimental results show that PMD leads to superior recommendation accuracy over the state-of-the-art similarity measures, especially on sparse datasets.
2 Preference Mover’s Distance
Problem Definition. Let \(\mathcal {U}\) be a set of m users, and \(\mathcal {I}\) a set of n items. The user-item interaction matrix is denoted by \( \mathbf {R} \in \mathbb {R}^{m\times n}\), where \(\mathbf {R}(u,i) \ge 0\) is the rating user u gives to item i. \(\mathbf {R}\) is a partially observed matrix and usually highly sparse. For user \(u \in \mathcal {U}\), her rated items are denoted by \(\mathcal {I}_u \subset \mathcal {I}\). Item relations are described by a distance matrix \(\mathbf {D}\), where \(\mathbf {D}(i,j)\ge 0\) denotes the distance between items i and j (a smaller distance means more similar items). Item similarities can be derived from the ratings on items [35, 44] or from content information [46], such as item tags, comments, etc. In this paper, we assume \(\mathbf {D}\) is given. We are interested in computing the distance between any pair (u, v) of users in \(\mathcal {U}\) given \(\mathbf {R}\) and \(\mathbf {D}\). User similarity can easily be derived from user distance, as the two are negatively correlated.
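To make the setup concrete, a user's ratings can be turned into a mass distribution over her rated items. The exact construction of \(\mathbf {p}_u\) used by the paper is defined in equations not shown in this excerpt; the sketch below assumes plain L1 normalization, which is one natural choice.

```python
import numpy as np

def rating_distribution(user_ratings):
    """Turn a user's ratings {item_id: rating} into a mass distribution p_u.

    Each rated item receives mass proportional to its rating, so the masses
    sum to 1. (Illustrative assumption: the paper's exact normalization is
    not shown in this excerpt.)
    """
    items = sorted(user_ratings)
    mass = np.array([user_ratings[i] for i in items], dtype=float)
    return items, mass / mass.sum()
```

For example, a user who rated item "i1" with 4.0 and "i2" with 1.0 gets the distribution (0.8, 0.2) over those two items.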
Illustration. Intuitively, \(d(\mathbf {p}_u,\mathbf {p}_v)\) can be viewed as the minimum cost of transforming the ratings of user u to the ratings of user v, which we show in Fig. 1. \(\mathbf {p}_u\) and \(\mathbf {p}_v\) define two distributions of mass, while \(\mathbf {D}(i,j)\) models the cost of moving one unit of mass from \(\mathbf {p}_u(i)\) to \(\mathbf {p}_v(j)\). Therefore, PMD can model the similarity between u and v even if they have no co-rated items. If two users like similar items, \(\mathbf {W}_{u,v}(i,j)\) takes a large value for item pairs with small \(\mathbf {D}(i,j)\), which results in a small distance. This is the case for \(u_0\) and \(u_1\) in Fig. 1 as they both like science fiction movies. In contrast, if two users like dissimilar items, \(\mathbf {W}_{u,v}(i,j)\) is large for item pairs with large \(\mathbf {D}(i,j)\), which produces a large distance. In Fig. 1, \(u_0\) likes science fiction movies while \(u_2\) likes romantic movies, and thus \(d(\mathbf {p}_{u_0},\mathbf {p}_{u_2})\) is large. Even if \(u_0\) has no co-rated movies with \(u_1\) and \(u_2\), PMD still gives \(d(\mathbf {p}_{u_0},\mathbf {p}_{u_1})<d(\mathbf {p}_{u_0},\mathbf {p}_{u_2})\), which implies that \(u_0\) is more similar to \(u_1\) than to \(u_2\).
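The ordering claimed above can be reproduced on a toy example. The sketch below solves the transport problem exactly as a linear program via scipy; the item distances and user rating vectors are made up purely for illustration and are not taken from the paper or its Fig. 1.

```python
import numpy as np
from scipy.optimize import linprog

def pmd_exact(p, q, D):
    """Exact optimal-transport cost between distributions p and q with
    ground-cost matrix D, written as a linear program over the plan W."""
    m, n = D.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):                 # row i of the plan must sum to p[i]
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                 # column j of the plan must sum to q[j]
        A_eq[m + j, j::n] = 1.0
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    assert res.success
    return res.fun

# Toy items: two similar sci-fi movies and one dissimilar romance movie.
D = np.array([[0.0, 0.2, 1.5],
              [0.2, 0.0, 1.5],
              [1.5, 1.5, 0.0]])
p_u0 = np.array([1.0, 0.0, 0.0])   # u0 rated only sci-fi movie 1
p_u1 = np.array([0.0, 1.0, 0.0])   # u1 rated only sci-fi movie 2
p_u2 = np.array([0.0, 0.0, 1.0])   # u2 rated only the romance movie
```

Although u0 shares no co-rated movies with either user, `pmd_exact(p_u0, p_u1, D)` is 0.2 while `pmd_exact(p_u0, p_u2, D)` is 1.5, so u0 is correctly judged closer to the fellow sci-fi fan.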
Computation Speedup. Solving the optimization problem in Eq. (3) exactly takes \(O(q^3\log q)\) time [36], where \(q=|\mathcal {I}_u \cup \mathcal {I}_v|\). To reduce the complexity, we use the Sinkhorn algorithm [10], which produces a high-quality approximate solution in \(O(q^2)\) time. To speed up the search for similar users in large datasets, we employ HNSW [30], the state-of-the-art algorithm for similarity search. HNSW builds a multi-layer k-nearest-neighbour (KNN) graph for the dataset and returns high-quality nearest neighbours for a query with \(O(\log N)\) distance computations, where N is the number of users. With these two techniques, looking up the top 100 neighbours of a user takes only 0.02 s on average while achieving a high recall of 99.2% on the Epinions dataset in our experiments. We conduct the experiments on a machine with two 2.0 GHz Intel(R) Xeon(R) E5-2620 CPUs (12 physical cores in total), 48 GB RAM, a 450 GB SATA disk (6 Gb/s, 10k rpm, 64 MB cache), and 64-bit CentOS release 7.2.
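The Sinkhorn iteration itself is compact. A minimal numpy sketch is shown below; the regularization strength `eps` and the iteration count are illustrative hyperparameters, not values from the paper. Each iteration costs two matrix-vector products against the \(q \times q\) Gibbs kernel, which is where the \(O(q^2)\) per-iteration complexity comes from.

```python
import numpy as np

def sinkhorn_distance(p, q, D, eps=0.1, n_iter=200):
    """Entropy-regularized optimal transport cost between distributions
    p and q under ground-cost matrix D (Cuturi's Sinkhorn iterations).
    eps and n_iter are illustrative defaults, not the paper's settings."""
    K = np.exp(-D / eps)             # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iter):          # each iteration: two O(q^2) products
        v = q / (K.T @ u)
        u = p / (K @ v)
    P = u[:, None] * K * v[None, :]  # approximate transport plan
    return float((P * D).sum())
```

For a pair of users with disjoint rated-item sets, p and q simply live on the union of the two item sets, padded with zeros where a user has no ratings.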
3 Experiments
Table 1. Data statistics.

|              | MovieLens | Epinions |
| ------------ | --------- | -------- |
| #user        | 6,040     | 116,260  |
| #item        | 3,959     | 41,269   |
| #rating      | 1,000,000 | 181,394  |
| sparsity     | 4.14%     | 0.0038%  |
| #rating/user | 166       | 1.56     |
| #rating/item | 250       | 4.40     |
Table 2. CPMD under different \(\mu \) (left, fixed K) and different K (right, fixed \(\mu =0.6\)).

| \(\mu \) | MovieLens (\(K=200\)) |        | Epinions (\(K=50\)) |        | K   | MovieLens (\(\mu =0.6\)) |        | Epinions (\(\mu =0.6\)) |        |
| -------- | -------- | ------ | ------ | ------ | --- | ------ | ------ | ------ | ------ |
|          | MAE      | RMSE   | MAE    | RMSE   |     | MAE    | RMSE   | MAE    | RMSE   |
| 0.2      | 0.7126   | 0.9019 | 0.8542 | 1.1340 | 30  | 0.7148 | 0.9064 | 0.8518 | 1.1294 |
| 0.4      | 0.6970   | 0.8851 | 0.8506 | 1.1302 | 50  | 0.7084 | 0.9052 | 0.8458 | 1.1260 |
| 0.6      | 0.6918   | 0.8817 | 0.8458 | 1.1260 | 100 | 0.6972 | 0.8898 | 0.8550 | 1.1456 |
| 0.8      | 0.6955   | 0.8875 | 0.8550 | 1.1456 | 200 | 0.6918 | 0.8817 | 0.8592 | 1.1435 |
| 0.95     | 0.6989   | 0.8915 | 0.8596 | 1.1520 | 300 | 0.6938 | 0.8846 | 0.8667 | 1.1506 |
Item Similarity. Both MovieLens and Epinions come with side information for computing item similarities. For MovieLens, we compute movie similarity using Tag-genomes [3, 42]. For Epinions, we evaluate item similarity by applying Doc2Vec [24] to the comments. Since both Tag-genome and Doc2Vec yield cosine similarities, we convert the similarity s(i, j) into a distance using \(\mathbf {D}(i,j)=\arccos (s(i,j))\), which is a metric on the item space. For fair comparison, the same item similarity matrix is used for PMD, BCF and HUSM^{1}.
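The similarity-to-distance conversion is a one-liner; a sketch is given below. The clipping is a numerical safeguard we add against floating-point cosine values slightly outside \([-1, 1]\), not something stated in the paper.

```python
import numpy as np

def cosine_to_angular_distance(sim):
    # arccos of a cosine similarity gives the angle between the two item
    # vectors, which is a proper metric on the item space; the clip guards
    # against floating-point values slightly outside [-1, 1].
    return np.arccos(np.clip(sim, -1.0, 1.0))
```

Identical items (similarity 1) thus get distance 0, orthogonal items distance \(\pi/2\), and opposite items distance \(\pi\).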
Table 3. Comparison with other user similarity measures.

| Dataset   | Metric | COS    | PCC    | MSD    | Jaccard | JMSD   | NHSM   | BCF    | HUSM   | PMD    | CPMD   |
| --------- | ------ | ------ | ------ | ------ | ------- | ------ | ------ | ------ | ------ | ------ | ------ |
| MovieLens | MAE    | 0.7477 | 0.7234 | 0.7387 | 0.7109  | 0.7024 | 0.7079 | 0.7044 | 0.7034 | 0.7019 | 0.6918 |
|           | RMSE   | 0.9394 | 0.9182 | 0.9293 | 0.9125  | 0.8982 | 0.9080 | 0.9089 | 0.9067 | 0.8935 | 0.8817 |
| Epinions  | MAE    | 1.0476 | 1.0468 | 1.0449 | 1.0340  | 1.0392 | 1.0213 | 0.9846 | 0.9734 | 0.8757 | 0.8458 |
|           | RMSE   | 1.4412 | 1.4384 | 1.4380 | 1.4226  | 1.4291 | 1.3969 | 1.3014 | 1.2846 | 1.1701 | 1.1260 |
Table 4. Comparison with latent factor models.

| Dataset   | Metric | NMF    | SVD    | SVD++  | PMD    | CPMD   |
| --------- | ------ | ------ | ------ | ------ | ------ | ------ |
| MovieLens | MAE    | 0.7252 | 0.6864 | 0.6739 | 0.7019 | 0.6918 |
|           | RMSE   | 0.9177 | 0.8741 | 0.8629 | 0.8935 | 0.8817 |
| Epinions  | MAE    | 0.9444 | 0.9482 | 0.9439 | 0.8757 | 0.8458 |
|           | RMSE   | 1.2096 | 1.2154 | 1.2091 | 1.1701 | 1.1260 |
We report the performance of the various similarity measures in Table 3, where PMD is based on Eq. (3) and CPMD is based on Eq. (4). The results show that PMD and CPMD consistently outperform the other similarity measures, and the improvement is more significant on the much sparser Epinions dataset. We believe our methods perform well on sparse datasets mainly because they utilize all rating information and derive the item weights via optimal transportation, which works even when there are few or no co-rated items. This is favorable as ratings are sparse in many real-world datasets [40]. CPMD achieves better performance than PMD, which suggests that it is beneficial to distinguish positive and negative feedback.
We also compare our methods with the latent factor models in Table 4. On the sparse Epinions dataset, both PMD and CPMD outperform the latent factor models. We report the performance of CPMD-based NN CF under different configurations of K and \(\mu \) in Table 2. CPMD performs best when \(\mu \) is around 0.6 on both datasets, possibly because positive ratings represent a user's taste better than negative ratings. In contrast, the optimal value of K is dataset-dependent.
4 Conclusions
We proposed PMD, a novel user distance measure based on optimal transportation, which addresses the limitation of existing methods in dealing with datasets with few co-rated items. PMD also has the favorable properties of a metric. Experimental results show that PMD leads to better recommendation accuracy for NN-based CF than the state-of-the-art user similarity measures, especially when the ratings are highly sparse.
Footnotes
1. BCF and HUSM originally compute item similarity using the Bhattacharyya coefficient or the KL-divergence of ratings, but we found that using the Tag-genomes and Doc2Vec provides better performance.
Acknowledgement
The authors thank Prof. Julian McAuley for his valuable suggestions on this paper, and Prof. Shengyu Zhang for his support. This work was supported by ITF 6904945, and GRF 14208318 & 14222816, and the National Natural Science Foundation of China (NSFC) (Grant No. 61672552).
References
- 1.
- 2.
- 3.
- 4. Aghabozorgi, F., Khayyambashi, M.R.: A new similarity measure for link prediction based on local structures in social networks. Phys. A: Stat. Mech. Appl. 501, 12–23 (2018)
- 5. Ahn, H.J.: A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem. Inf. Sci. 178(1), 37–51 (2008)
- 6. Al-bashiri, H., Abdulgabber, M.A., Romli, A., Hujainah, F.: Collaborative filtering similarity measures: revisiting. In: Proceedings of the International Conference on Advances in Image Processing, pp. 195–200. ACM (2017)
- 7. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
- 8. Bobadilla, J., Serradilla, F., Bernal, J.: A new collaborative filtering metric that improves the behavior of recommender systems. Knowl.-Based Syst. 23(6), 520–528 (2010)
- 9. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 43–52. Morgan Kaufmann Publishers Inc. (1998)
- 10. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: Advances in Neural Information Processing Systems, vol. 26, pp. 2292–2300 (2013)
- 11. Cuturi, M., Solomon, J.M.: A primer on optimal transport. In: Tutorial of the 31st Conference on Neural Information Processing Systems (2017)
- 12. Dacrema, M.F., Cremonesi, P., Jannach, D.: Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In: Proceedings of the 13th ACM Conference on Recommender Systems, pp. 101–109. ACM (2019)
- 13. Desrosiers, C., Karypis, G.: A comprehensive survey of neighborhood-based recommendation methods. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp. 107–144. Springer, Boston (2011). https://doi.org/10.1007/978-0-387-85820-3_4
- 14. Guo, G., Zhang, J., Yorke-Smith, N.: A novel Bayesian similarity measure for recommender systems. In: Twenty-Third International Joint Conference on Artificial Intelligence (2013)
- 15. Guo, G., Zhang, J., Yorke-Smith, N.: TrustSVD: collaborative filtering with both the explicit and implicit influence of user trust and of item ratings. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
- 16. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.-S.: Neural collaborative filtering. In: Proceedings of the 26th International Conference on World Wide Web, pp. 173–182. International World Wide Web Conferences Steering Committee (2017)
- 17. Huang, G., et al.: Supervised word mover's distance. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2016, pp. 4869–4877 (2016)
- 18. Jamali, M., Ester, M.: TrustWalker: a random walk model for combining trust-based and item-based recommendation. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 397–406. ACM (2009)
- 19. Karamanolakis, G., Cherian, K.R., Narayan, A.R., Yuan, J., Tang, D., Jebara, T.: Item recommendation with variational autoencoders and heterogeneous priors. In: Proceedings of the 3rd Workshop on Deep Learning for Recommender Systems, pp. 10–14. ACM (2018)
- 20. Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 426–434. ACM (2008)
- 21. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009)
- 22. Koutrika, G., Bercovitz, B., Garcia-Molina, H.: FlexRecs: expressing and combining flexible recommendations. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 745–758. ACM (2009)
- 23. Kusner, M.J., Sun, Y., Kolkin, N.I., Weinberger, K.Q.: From word embeddings to document distances. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 957–966 (2015)
- 24. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)
- 25. Levina, E., Bickel, P.J.: The earth mover's distance is the Mallows distance: some insights from statistics. In: Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, vol. 2, pp. 251–256 (2001)
- 26. Ling, H., Okada, K.: An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Trans. Pattern Anal. Mach. Intell. 29(5), 840–853 (2007)
- 27. Liu, H., Zheng, H., Mian, A., Tian, H., Zhu, X.: A new user similarity model to improve the accuracy of collaborative filtering. Knowl.-Based Syst. 56, 156–166 (2014)
- 28. Luo, X., Zhou, M., Xia, Y., Zhu, Q.: An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems. IEEE Trans. Ind. Inform. 10(2), 1273–1284 (2014)
- 29. Ma, H., King, I., Lyu, M.R.: Effective missing data prediction for collaborative filtering. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–46. ACM (2007)
- 30. Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 824–836 (2018)
- 31. Meng, Y., Chen, G., Li, J., Zhang, S.: PsRec: social recommendation with pseudo ratings. In: Proceedings of the 12th ACM Conference on Recommender Systems, pp. 397–401. ACM (2018)
- 32. Mnih, A., Salakhutdinov, R.R.: Probabilistic matrix factorization. In: Advances in Neural Information Processing Systems, pp. 1257–1264 (2008)
- 33. Monge, G.: Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie royale des sciences de Paris (1781)
- 34. Pan, Y., Li, D.-H., Liu, J.-G., Liang, J.-Z.: Detecting community structure in complex networks via node similarity. Phys. A: Stat. Mech. Appl. 389(14), 2849–2857 (2010)
- 35. Patra, B.K., Launonen, R., Ollikainen, V., Nandi, S.: A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data. Knowl.-Based Syst. 82, 163–177 (2015)
- 36. Pele, O., Werman, M.: Fast and robust earth mover's distances. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 460–467. IEEE (2009)
- 37. Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications to image databases. In: Sixth International Conference on Computer Vision, pp. 59–66 (1998)
- 38. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: WWW, pp. 285–295 (2001)
- 39. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating "word of mouth". In: CHI, vol. 95, pp. 210–217 (1995)
- 40. Symeonidis, P., Nanopoulos, A., Papadopoulos, A.N., Manolopoulos, Y.: Collaborative filtering: fallacies and insights in measuring similarity. Universitaet Kassel (2006)
- 41. Thorpe, M., Park, S., Kolouri, S., Rohde, G.K., Slepčev, D.: A transportation \(L^p\) distance for signal analysis. J. Math. Imaging Vis. 59(2), 187–210 (2017)
- 42. Vig, J., Sen, S., Riedl, J.: The tag genome: encoding community knowledge to support novel interaction. ACM Trans. Interact. Intell. Syst. (TiiS) 2(3), 13 (2012)
- 43. Wang, H., Wang, N., Yeung, D.-Y.: Collaborative deep learning for recommender systems. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235–1244. ACM (2015)
- 44. Wang, Y., Deng, J., Gao, J., Zhang, P.: A hybrid user similarity model for collaborative filtering. Inf. Sci. 418, 102–118 (2017)
- 45. Wolsey, L.A., Nemhauser, G.L.: Integer and Combinatorial Optimization. Wiley, Hoboken (2014)
- 46. Yao, Y., Harper, F.M.: Judging similarity: a user-centric study of related item recommendations. In: Proceedings of the 12th ACM Conference on Recommender Systems, pp. 288–296. ACM (2018)
- 47. Zheng, V.W., Cao, B., Zheng, Y., Xie, X., Yang, Q.: Collaborative filtering meets mobile recommendation: a user-centered approach. In: Twenty-Fourth AAAI Conference on Artificial Intelligence (2010)
- 48. Zhong, J., Zhang, X.: Wasserstein autoencoders for collaborative filtering. arXiv preprint arXiv:1809.05662 (2018)