A Correlation Based Recommendation System for Large Data Sets

Pandove, Divya; Malhi, Avleen

doi:10.1007/s10723-021-09585-9

Open access
Published: 18 October 2021

Volume 19, article number 42 (2021)

Journal of Grid Computing

Divya Pandove¹ & Avleen Malhi²,³

This article has been updated

Abstract

Correlation analysis brings out relationships in data that had not been seen before, and data mining benefits from exploiting them. In this paper, we use correlations to cluster data and merge the resulting clusters with recommendation algorithms. We propose two correlation clustering algorithms, RBACC and LGBACC, that produce high quality hierarchical clusters: RBACC is based on finding Spearman’s rank correlation coefficient among data points, and LGBACC uses a dimensionality reduction approach (PCA) along with graph theory. Both algorithms have been tested on real-life data (the New York yellow cabs dataset, taken from http://www.nyc.gov) using distributed and parallel computing (Spark and R). They are found to be scalable and to perform better than the existing hierarchical clustering algorithms. The two approaches are then used to replace the similarity measures in recommendation algorithms, yielding a correlation clustering based recommendation system model that combines correlation analysis with prediction analysis. This model makes better quality recommendations than the random recommendation model. It has been validated on a large, real-life data set (the MovieLens dataset, taken from http://grouplens.org/datasets/movielens/latest). The results show that combining correlated points with the predictive power of recommendation algorithms produces better quality recommendations that are faster to compute. LGBACC has approximately 25% better prediction capability than RBACC, but it takes significantly more prediction time.
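To make concrete the idea of replacing a similarity measure with a rank correlation, the following is a minimal sketch in plain Python. It is not the RBACC or LGBACC algorithm, neither of which is specified in the abstract. It only shows Spearman's rank correlation computed as the Pearson correlation of rank vectors, used as the neighbour weight in a simple user-based rating prediction. The function names (`average_ranks`, `spearman`, `predict`), the toy ratings, and the choice to ignore non-positive correlations are all assumptions made for illustration.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def predict(ratings, user, item):
    """Correlation-weighted mean of ratings given to `item` by other users.

    Assumption for this sketch: only positively correlated neighbours with
    at least two co-rated items contribute. Returns None if none qualify.
    """
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        common = [m for m in ratings[user] if m in other_ratings]
        if len(common) < 2:
            continue
        rho = spearman([ratings[user][m] for m in common],
                       [other_ratings[m] for m in common])
        if rho <= 0:
            continue
        num += rho * other_ratings[item]
        den += rho
    return num / den if den else None

# Hypothetical toy data: user -> {item: rating}.
toy_ratings = {
    "u1": {"A": 5, "B": 3, "C": 1},
    "u2": {"A": 4, "B": 2, "C": 1, "D": 5},
    "u3": {"A": 1, "B": 3, "C": 5, "D": 1},
}
predicted = predict(toy_ratings, "u1", "D")
```

In this toy example, `u2` ranks the shared items in the same order as `u1` and `u3` ranks them in the opposite order, so only `u2` contributes to the prediction for item `D`. Because Spearman's coefficient depends only on rank order, users who agree on ordering but use the rating scale differently are still treated as similar.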

Change history

12 July 2022
License copyright has been updated.

Author information

Authors and Affiliations

Glover Park Group, Washington, DC, USA
Divya Pandove
Aalto University, Helsinki, Finland
Avleen Malhi
Bournemouth University, Bournemouth, UK
Avleen Malhi

Corresponding author

Correspondence to Avleen Malhi.

Additional information

Data availability statement (DAS)

The data are available in a public repository, and the links are provided in the relevant section.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Pandove, D., Malhi, A. A Correlation Based Recommendation System for Large Data Sets. J Grid Computing 19, 42 (2021). https://doi.org/10.1007/s10723-021-09585-9

Received: 24 June 2021
Accepted: 29 August 2021
Published: 18 October 2021
DOI: https://doi.org/10.1007/s10723-021-09585-9
