Abstract
Correlation analysis reveals relationships in data that would otherwise go unnoticed, which makes it a valuable tool for data mining. In this paper, we use correlations to cluster data and combine the resulting clusters with recommendation algorithms. We propose two correlation clustering algorithms, RBACC and LGBACC, that produce high-quality hierarchical clusters: RBACC is based on Spearman’s rank correlation coefficient among data points, and LGBACC uses a dimensionality reduction approach (PCA) together with graph theory. Both algorithms were tested on real-life data (the New York yellow cabs dataset, taken from http://www.nyc.gov) using distributed and parallel computing (Spark and R). They are found to be scalable and to perform better than existing hierarchical clustering algorithms. The two approaches are then used to replace the similarity measures in recommendation algorithms, yielding a correlation clustering based recommendation system model that combines correlation analysis with prediction analysis. The model was validated on a large, real-life data set (the MovieLens dataset, taken from http://grouplens.org/datasets/movielens/latest) and makes better-quality recommendations than the random recommendation model. The results show that combining correlated points with the predictive power of recommendation algorithms produces better-quality recommendations that are faster to compute. LGBACC has approximately 25% better prediction capability than RBACC but takes significantly more prediction time.
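To make the idea of rank-correlation-based hierarchical clustering concrete, the following is a minimal sketch and not the RBACC algorithm itself, whose details are not given in this abstract. It assumes that each data point is a numeric vector, that similarity is Spearman's rho between two vectors, and that clusters are merged bottom-up by average linkage until no pair reaches a chosen threshold. The function names (`average_ranks`, `spearman`, `correlation_clusters`) and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    # A constant vector has no defined rank correlation; treat it as 0 here.
    return cov / (sxx * syy) ** 0.5 if sxx and syy else 0.0


def correlation_clusters(points, threshold=0.8):
    """Agglomerative average-linkage clustering on Spearman's rho.

    Assumption for illustration: merging stops once the best pair of
    clusters has a mean rho below `threshold`. Returns the final clusters
    (lists of point indices) and the merge history, which encodes the
    hierarchy.
    """
    n = len(points)
    rho = {(i, j): spearman(points[i], points[j])
           for i, j in combinations(range(n), 2)}

    def linkage(a, b):
        vals = [rho[(min(i, j), max(i, j))] for i in a for j in b]
        return sum(vals) / len(vals)

    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        (a, b), best = max(
            ((pair, linkage(*pair)) for pair in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
        merges.append((a, b, best))
    return clusters, merges


# Toy data: two increasing series and two decreasing series.
points = [
    [1, 2, 3, 4, 5],
    [2, 4, 5, 8, 20],
    [5, 4, 3, 2, 1],
    [9, 7, 6, 2, 0],
]
clusters, merges = correlation_clusters(points, threshold=0.8)
print(sorted(clusters))
```

Because Spearman's rho depends only on ranks, the second series groups with the first despite its nonlinear growth. The pairwise computation here is quadratic in the number of points, so a sketch like this would not scale to the data sizes discussed in the paper without the distributed processing the authors describe.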
Change history
12 July 2022
License copyright has been updated.
Data availability statement (DAS)
Data available in a public repository and the links have been provided in relevant section.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pandove, D., Malhi, A. A Correlation Based Recommendation System for Large Data Sets. J Grid Computing 19, 42 (2021). https://doi.org/10.1007/s10723-021-09585-9
DOI: https://doi.org/10.1007/s10723-021-09585-9