Interblend fusing of genetic algorithm-based attribute selection for clustering heterogeneous data set

  • J. Dhayanithi
  • J. Akilandeswari


Researchers have explored various clustering strategies for partitioning heterogeneous data sets that contain numeric, binary, categorical and ordinal attributes. Data sets from real-life applications are often heterogeneous in nature, and converting them to a homogeneous form leads to information loss. In this paper, we propose an interblend fusing of genetic algorithm-based attribute selection to increase clustering accuracy in credit risk assessment. The proposed technique groups similar objects together without changing the characteristics of the heterogeneous data sets. The algorithm also identifies the importance of attributes when clustering a large number of objects with many attributes. The fusing technique yields a contextual distance measure for clustering the objects. The results presented in this paper provide a clear interpretation of applying our methodology to the data sets. The performance of the algorithm compares favorably with that reported in the related literature.
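As a rough illustration of the idea of a distance over mixed attribute types restricted to a selected attribute subset, the sketch below computes a Gower-style distance under a 0/1 attribute mask, which is the kind of encoding a genetic algorithm chromosome could use for attribute selection. This is an assumption-laden sketch and not the authors' contextual distance measure: the function name `mixed_distance`, the attribute kinds, the range-based scaling, and the sample records are all hypothetical.

```python
# Hypothetical attribute kinds, for illustration only (not from the paper).
NUMERIC, CATEGORICAL, ORDINAL = "numeric", "categorical", "ordinal"


def mixed_distance(a, b, kinds, ranges, mask):
    """Gower-style distance over the attributes switched on in `mask`.

    a, b   -- two records (tuples of attribute values)
    kinds  -- attribute kind per position; binary attributes are
              treated as categorical here (an assumption)
    ranges -- value range per numeric/ordinal attribute (None otherwise)
    mask   -- 0/1 flags, e.g. a genetic algorithm chromosome
    """
    total, used = 0.0, 0
    for x, y, kind, rng, keep in zip(a, b, kinds, ranges, mask):
        if not keep:
            continue  # attribute not selected
        if kind == CATEGORICAL:
            d = 0.0 if x == y else 1.0  # simple mismatch
        else:
            # numeric values and ordinal rank codes, scaled to [0, 1]
            d = abs(x - y) / rng if rng else 0.0
        total += d
        used += 1
    # average over selected attributes; 0.0 if nothing is selected
    return total / used if used else 0.0


# Made-up records: (age, housing, job grade, has phone)
kinds = (NUMERIC, CATEGORICAL, ORDINAL, CATEGORICAL)
ranges = (40, None, 4, None)
a = (30, "own", 2, 1)
b = (50, "rent", 3, 1)

print(mixed_distance(a, b, kinds, ranges, (1, 1, 1, 1)))  # all attributes
print(mixed_distance(a, b, kinds, ranges, (1, 0, 1, 0)))  # selected subset
```

In a wrapper-style search, a genetic algorithm would evolve the mask and score each candidate by the quality of the clustering obtained with this distance; the scoring function and the genetic operators are left out here because the abstract does not specify them.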


Distance measures, Similarity measures, Clustering, Heterogeneous data, Genetic algorithm, Fusing technique


Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Sona College of Technology, Salem, India
