A Correlation Based Recommendation System for Large Data Sets


Correlation analysis brings out relationships in data that were not previously apparent, and using correlations effectively is essential for data mining. In this paper, we use correlations to cluster data and merge this clustering with recommendation algorithms. We propose two correlation clustering algorithms, RBACC and LGBACC, which produce high-quality hierarchical clusters: the former is based on Spearman’s rank correlation coefficient among data points, and the latter uses a dimensionality reduction approach (PCA) together with graph theory. Both algorithms were tested on real-life data (the New York yellow cabs dataset) using distributed and parallel computing (Spark and R). They are found to be scalable and to perform better than existing hierarchical clustering algorithms. The two approaches replace the similarity measures in recommendation algorithms to yield a correlation clustering based recommendation system model, combining the power of correlation analysis with that of prediction analysis. This model makes better-quality recommendations than the random recommendation model. It was validated on a real-time, large data set (the MovieLens dataset). The results show that combining correlated points with the predictive power of recommendation algorithms produces better-quality recommendations that are faster to compute. LGBACC has approximately 25% better prediction capability than RBACC but takes significantly more prediction time.
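To make the idea of grouping data points by rank correlation concrete, the following is a minimal sketch and not the RBACC algorithm itself. It assumes that each data point is a numeric vector, that the dissimilarity between two points is one minus their Spearman coefficient, and that clusters are merged by average linkage; the function names (average_ranks, spearman, correlation_clusters) and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_clusters(points, n_clusters):
    """Average-linkage agglomerative clustering on the distance 1 - rho.

    Assumed illustration only: merges the two closest clusters until
    n_clusters remain and returns lists of point indices.
    """
    dist = {
        (i, j): 1.0 - spearman(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Toy data (hypothetical): rows 0-2 rise monotonically, rows 3-4 fall.
points = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [1, 4, 9, 16, 25],
    [5, 4, 3, 2, 1],
    [50, 40, 30, 20, 10],
]
clusters = correlation_clusters(points, n_clusters=2)
print(clusters)
```

On this toy input the three rising rows end up in one cluster and the two falling rows in the other, even though their magnitudes differ widely, because Spearman's coefficient depends only on rank order. The pairwise loop is quadratic in the number of points, so a sketch like this would not scale to the data sizes discussed in the paper without the distributed computation the authors describe.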



Author information



Corresponding author

Correspondence to Avleen Malhi.

Additional information

Data availability statement (DAS)

The data are available in a public repository, and the links are provided in the relevant section.

Open Access

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pandove, D., Malhi, A. A Correlation Based Recommendation System for Large Data Sets. J Grid Computing 19, 42 (2021).

  • Correlation clustering
  • Recommendation system model