Abstract
This paper analyzes different strategies for clustering large data sets derived from social contexts. Effective and efficient clustering methods for large databases have been investigated only in recent years, following the emergence of data mining as a field. A sequential approach that uses a multiobjective genetic algorithm as the clustering technique is proposed. The strategy is applied to a real-life data set of approximately 1.5 million workers, and the results are compared with those obtained by other methods in order to identify an unambiguous partition of the data.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bocci, L., Mingo, I. (2012). Clustering Large Data Set: An Applied Comparative Study. In: Di Ciaccio, A., Coli, M., Angulo Ibanez, J. (eds) Advanced Statistical Methods for the Analysis of Large Data-Sets. Studies in Theoretical and Applied Statistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21037-2_1
DOI: https://doi.org/10.1007/978-3-642-21037-2_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21036-5
Online ISBN: 978-3-642-21037-2
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)