A Framework for Data Clustering of Large Datasets in a Distributed Environment
The chief motivation of this work is to develop a framework for clustering large datasets in a distributed manner. The proposal addresses both numerical and categorical data and includes an effective approach to handling noisy information. Two basic models, called the primary model and the connected model, are developed to design the distributed approach. After clusters are formed separately from the numerical and the categorical features, an evolutionary approach is suggested to merge the clusters for optimization. A modification of the multiple kernel-based FCM algorithm (MKFCM) of Chen et al. (A multiple kernel fuzzy c-means algorithm for image segmentation 41:1263–1274, 2011) is used to implement the proposal. A comprehensive view of the designed method and algorithm is presented in this paper. A comparison of the results on a few sample datasets indicates that the proposed approach is more effective than the existing one.
Keywords: Clustering; Categorical and numerical data; Large dataset
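For orientation, the kind of kernel-based fuzzy c-means update on which MKFCM builds can be sketched as follows. This is only a minimal illustration under stated assumptions: it uses a single Gaussian kernel on numerical features, keeps the prototypes in the input space, and is not the modified MKFCM proposed in the paper. It includes no distributed processing, categorical features, noise handling, or evolutionary merging. The names `init_centers`, `memberships`, and `kernel_fcm`, the maximin seeding, and the parameters `m` and `sigma` are hypothetical choices made for this sketch.

```python
import numpy as np


def init_centers(X, n_clusters, rng):
    """Maximin seeding (an assumption of this sketch): each new center is
    the data point farthest from the centers chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(n_clusters - 1):
        nearest = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[nearest.argmax()])
    return np.array(centers, dtype=float)


def memberships(X, V, m, sigma):
    """Return fuzzy memberships U (clusters x points) and kernel values K."""
    sq_dist = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dist / sigma**2)          # Gaussian kernel K(x_k, v_i)
    d = np.maximum(1.0 - K, 1e-12)           # kernel-induced distance, floored
    U = d ** (-1.0 / (m - 1.0))
    return U / U.sum(axis=0, keepdims=True), K


def kernel_fcm(X, n_clusters, m=2.0, sigma=1.0, n_iter=100, tol=1e-6, seed=0):
    """Single-kernel fuzzy c-means; alternates membership and center updates."""
    X = np.asarray(X, dtype=float)
    V = init_centers(X, n_clusters, np.random.default_rng(seed))
    for _ in range(n_iter):
        U, K = memberships(X, V, m, sigma)
        W = (U ** m) * K                     # weights for the center update
        V_new = (W @ X) / W.sum(axis=1, keepdims=True)
        shift = np.abs(V_new - V).max()
        V = V_new
        if shift < tol:                      # stop once the centers settle
            break
    U, _ = memberships(X, V, m, sigma)       # memberships for the final centers
    return U, V


# Usage on two synthetic, well-separated groups of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
U, V = kernel_fcm(X, n_clusters=2, sigma=2.0)
labels = U.argmax(axis=0)                    # hard labels from fuzzy memberships
```

Each column of `U` sums to one, so a point can belong partly to several clusters. A multiple-kernel variant would replace the single Gaussian kernel with a combination of several kernels.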