Clustering Large Data Set: An Applied Comparative Study

Bocci, Laura; Mingo, Isabella

doi:10.1007/978-3-642-21037-2_1


Part of the book series: Studies in Theoretical and Applied Statistics (STASSPSS)


Abstract

This paper analyzes different strategies for clustering large data sets drawn from a social context. Effective and efficient clustering methods for large databases have been tested only in recent years, following the emergence of data mining as a field. A sequential approach that uses a multiobjective genetic algorithm as the clustering technique is proposed. The proposed strategy is applied to a real-life data set of approximately 1.5 million workers, and the results are compared with those obtained by other methods in order to identify an unambiguous partition of the data.



Author information

Authors and Affiliations

Department of Communication and Social Research, Sapienza University of Rome, Via Salaria 113, Rome, Italy
Laura Bocci & Isabella Mingo


Corresponding author

Correspondence to Laura Bocci.

Editor information

Editors and Affiliations

Dept. of Statistics, University of Roma "La Sapienza", P.le Aldo Moro 5, Roma, 00185, Italy
Agostino Di Ciaccio
Dip. di Metodi Quantitativi e Teoria Eco, University "G. d'Annunzio" of Chieti-Pes, V.le Pindaro 42, Pescara, Italy
Mauro Coli
Fac. Ciencias, Depto. Estadística e Investigación Opera, Universidad de Granada, Avenida Fuente Nueva s/n, Granada, 18071, Spain
Jose Miguel Angulo Ibanez


About this paper

Cite this paper

Bocci, L., Mingo, I. (2012). Clustering Large Data Set: An Applied Comparative Study. In: Di Ciaccio, A., Coli, M., Angulo Ibanez, J. (eds) Advanced Statistical Methods for the Analysis of Large Data-Sets. Studies in Theoretical and Applied Statistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21037-2_1


DOI: https://doi.org/10.1007/978-3-642-21037-2_1
Published: 28 December 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21036-5
Online ISBN: 978-3-642-21037-2
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
