Clustering Large Data Set: An Applied Comparative Study

  • Conference paper
Advanced Statistical Methods for the Analysis of Large Data-Sets

Part of the book series: Studies in Theoretical and Applied Statistics (STASSPSS)

Abstract

The aim of this paper is to analyze different strategies for clustering large data sets derived from social contexts. Effective and efficient clustering methods for large databases have been tested only in recent years, following the emergence of the field of data mining. In this paper, a sequential approach that uses a multiobjective genetic algorithm as the clustering technique is proposed. The proposed strategy is applied to a real-life data set of approximately 1.5 million workers, and the results are compared with those obtained by other methods in order to identify an unambiguous partitioning of the data.
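The abstract does not say how the partitions from different methods are compared. As a minimal sketch, assuming agreement between two partitions is measured with the adjusted Rand index (a common choice, not confirmed as the paper's own measure), the comparison could look like the following. The function name `adjusted_rand_index` and the toy label lists are hypothetical and used for illustration only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two flat partitions of the same items.

    Assumption: this is one possible agreement measure, not necessarily
    the one used in the paper.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same items")
    n = len(labels_a)
    if n < 2:
        # With fewer than two items there are no pairs to disagree on.
        return 1.0

    # Contingency counts: items in cluster i of A and cluster j of B.
    cell_counts = Counter(zip(labels_a, labels_b))
    row_counts = Counter(labels_a)
    col_counts = Counter(labels_b)

    # Number of item pairs placed together in both, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())
    total_pairs = comb(n, 2)

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # largest attainable agreement
    if max_index == expected:
        # Degenerate case, e.g. both partitions are a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example: the same grouping under different label names.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Toy example: two groupings that never place the same pair together.
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A value near 1 indicates that two methods produce essentially the same partition regardless of how clusters are labeled, while values near or below 0 indicate agreement no better than chance.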

Author information

Correspondence to Laura Bocci.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bocci, L., Mingo, I. (2012). Clustering Large Data Set: An Applied Comparative Study. In: Di Ciaccio, A., Coli, M., Angulo Ibanez, J. (eds) Advanced Statistical Methods for the Analysis of Large Data-Sets. Studies in Theoretical and Applied Statistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21037-2_1
