Unsupervised Classification of Mixed Data Type of Attributes Using Genetic Algorithm (Numeric, Categorical, Ordinal, Binary, Ratio-Scaled)
Data mining uncovers hidden, previously unknown, and potentially useful information in large amounts of data. Compared with traditional statistical and machine learning techniques for data analysis, data mining emphasizes providing a convenient and complete environment for the analysis, and it has become a popular technology for analyzing complex data. Clustering is one of the core techniques of data mining. In data mining and data clustering, it is highly desirable to perform cluster analysis on large data sets with mixed numeric, categorical, ordinal, binary, nominal, and ratio-scaled values. However, most available clustering algorithms are effective for numeric data rather than for mixed data sets. This paper therefore presents a new clustering algorithm for such mixed data sets, obtained by modifying a common cost function, the trace of the within-cluster dispersion matrix. A genetic algorithm (GA) is used to optimize the new cost function and obtain a valid clustering result. Our comparison and analysis indicate that the GA-based clustering algorithm is feasible for high-dimensional data sets with mixed data values taken from real-life data.
Core Idea of Our Paper: We describe a technique for estimating cost function metrics from databases with mixed numeric, categorical, and other attribute types, using an uncertain grade-of-membership clustering model combined with the efficiency of a genetic algorithm. This technique can be applied to the problem of opportunity analysis for business decision-making. The general approach could be adapted to many other applications in which a decision agent needs to assess the value of items from a set of opportunities with respect to a reference set representing its business.
Numeric attributes are processed directly rather than being generalized. A prototype was developed for experiments with synthetic and real data sets, and its results were compared with those of the traditional approaches. The results confirmed the feasibility of the framework and the superiority of the extended techniques.
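As an illustration of the idea summarized above, the following minimal Python sketch shows one way a within-cluster cost can be extended to mixed data and then minimized with a genetic algorithm. It is not the authors' implementation. It assumes a k-prototypes-style cost: squared Euclidean distance to the cluster mean for numeric attributes, plus a weighted count of mismatches against the cluster mode for categorical attributes. The names `mixed_cost` and `ga_cluster`, the weight `gamma`, and all GA settings (population size, truncation selection, one-point crossover, mutation rate) are hypothetical choices made for this example. Ordinal or binary attributes could be encoded as numbers and placed in the numeric part, which is also an assumption.

```python
import random
from collections import Counter


def mixed_cost(numeric, categorical, labels, k, gamma=1.0):
    """Within-cluster cost for mixed data (assumed k-prototypes-style form).

    Numeric part: squared distance of each point to its cluster mean.
    Categorical part: gamma * number of attributes differing from the cluster mode.
    """
    n_dims = len(numeric[0])
    c_dims = len(categorical[0])
    total = 0.0
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        means = [sum(numeric[i][d] for i in members) / len(members)
                 for d in range(n_dims)]
        modes = [Counter(categorical[i][d] for i in members).most_common(1)[0][0]
                 for d in range(c_dims)]
        for i in members:
            total += sum((numeric[i][d] - means[d]) ** 2 for d in range(n_dims))
            total += gamma * sum(categorical[i][d] != modes[d] for d in range(c_dims))
    return total


def ga_cluster(numeric, categorical, k, pop_size=30, generations=60,
               mutation_rate=0.05, gamma=1.0, seed=0):
    """Search for a label vector with low mixed_cost using a simple GA.

    Each chromosome is a list of cluster labels, one per data point.
    Requires at least two data points.
    """
    rng = random.Random(seed)
    n = len(numeric)

    def cost(labels):
        return mixed_cost(numeric, categorical, labels, k, gamma)

    population = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        elite = population[: pop_size // 2]  # truncation selection; best survive
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: occasionally reassign a point to a random cluster.
            child = [rng.randrange(k) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = elite + children
    best = min(population, key=cost)
    return best, cost(best)


if __name__ == "__main__":
    # Toy data: two numeric groups that coincide with two categories.
    numeric = [[1.0], [3.0], [10.0], [12.0]]
    categorical = [["a"], ["a"], ["b"], ["b"]]
    labels, value = ga_cluster(numeric, categorical, k=2)
    print(labels, value)
```

Because the better half of each generation is carried over unchanged, the best cost in the population never increases from one generation to the next. The weight `gamma` controls how strongly categorical mismatches count relative to numeric dispersion and would need tuning for real data.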
Keywords: Clustering algorithms, Categorical dataset, Numerical dataset, Clustering, Data mining, Pattern discovery, Genetic algorithm
The authors would like to thank the reviewers for their valuable suggestions. They would also like to thank rev. HOD-CSE, Prof. Dr. R. Radhakrishnan, for his involvement and valuable suggestions on soft computing in the early stages of this paper. They thank their friends, family, and seniors for their motivation and encouragement. Last but not least, they thank almighty God, without whose grace this paper would not have succeeded.