Cluster Representation and Discrimination Based on Regression Line

  • M. S. Bhargavi
  • Sahana D. Gowda
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 801)


Clustering groups data into coherent groups based on the nearness of samples in a multi-dimensional feature space, where this coherence makes each cluster distinct from the others. A cluster representation based on intrinsic cohesiveness would aid efficient discrimination between clusters. This paper proposes a novel method for representing and discriminating between clusters. The distances computed between every pair of samples in a cluster reveal the cohesiveness of the samples in multi-dimensional space. Because the number of pairwise distances grows rapidly with the number of samples, the distances are summarized in histograms. The range of the bins in a histogram specifies the range of distances among the samples in a cluster. For effective discrimination, each histogram is further transformed into a regression line by constructing a cumulative histogram. Each cluster is represented by the slope, intercept and error characterizing its regression line. The extent and angle of the slope are determined by the diameter of the cluster, which the bins span, and by the distribution of distances across the bins. To discriminate between clusters represented by regression lines, a statistical hypothesis test based on the probability value (p-value) is performed. Based on the probability obtained, the clusters are judged to be similar or dissimilar. Experiments on real and synthetic clusters demonstrate the efficiency of the proposed approach in extracting a unique cluster representation for discrimination.
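The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the ten equal-width bins, the use of upper bin edges as regression inputs, and all function names are assumptions made for illustration. The abstract does not specify the exact test, so the sketch compares slopes only, using a pooled residual variance and a normal approximation in place of Student's t-distribution so that it needs only the Python standard library. It also assumes each cluster has at least three distinct samples.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist


def pairwise_distances(points):
    """Euclidean distance between every pair of samples in one cluster."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def cumulative_histogram(distances, n_bins):
    """Cumulative relative frequencies over equal-width bins on [0, diameter]."""
    width = max(distances) / n_bins
    counts = [0] * n_bins
    for d in distances:
        # The largest distance falls on the last edge, so clamp it into the last bin.
        counts[min(int(d / width), n_bins - 1)] += 1
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(running / len(distances))
    upper_edges = [(i + 1) * width for i in range(n_bins)]
    return upper_edges, cumulative


def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return {"slope": slope, "intercept": intercept,
            "error": math.sqrt(sse / (n - 2)),  # residual standard error
            "sse": sse, "sxx": sxx, "n": n}


def represent(points, n_bins=10):
    """Slope, intercept and error of the line fitted to the cumulative histogram."""
    edges, cumulative = cumulative_histogram(pairwise_distances(points), n_bins)
    return fit_line(edges, cumulative)


def slope_p_value(a, b):
    """Two-sided p-value for equal slopes (normal approximation, an assumption)."""
    pooled_var = (a["sse"] + b["sse"]) / (a["n"] + b["n"] - 4)
    se = math.sqrt(pooled_var / a["sxx"] + pooled_var / b["sxx"])
    if se == 0.0:  # both fits are exact; fall back to a direct comparison
        return 1.0 if a["slope"] == b["slope"] else 0.0
    z = (a["slope"] - b["slope"]) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical synthetic clusters: same centre, different spread.
random.seed(0)
tight = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(40)]
loose = [(random.gauss(0, 5.0), random.gauss(0, 5.0)) for _ in range(40)]

rep_tight, rep_loose = represent(tight), represent(loose)
p = slope_p_value(rep_tight, rep_loose)
print(rep_tight["slope"], rep_loose["slope"], p)
```

Because the bins span each cluster's own diameter, a compact cluster reaches a cumulative frequency of 1 over a short distance range and so yields a steeper line than a dispersed one. A small p-value would then suggest the two clusters are dissimilar; the threshold is left to the reader, since the abstract does not state one.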


Cluster cohesion, Representation, Discrimination, Histogram, Regression



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. BIT, Bangalore, India
  2. BNMIT, Bangalore, India
