The Journal of Supercomputing, Volume 69, Issue 1, pp 452–467

A parallel clustering method combined information bottleneck theory and centroid-based clustering

  • Zhanquan Sun
  • Geoffrey Fox
  • Weidong Gu
  • Zhao Li


Clustering is an important research topic in data mining. Clustering methods based on information bottleneck theory are well suited to complicated clustering problems because their information loss metric can measure arbitrary statistical relationships between samples, and they have been applied in many areas. As information technology develops, electronic datasets grow ever larger. The classical information bottleneck clustering method cannot handle large-scale datasets because of its high computational cost. Parallel clustering based on the MapReduce model is the most efficient way to deal with large-scale, data-intensive clustering problems. This paper develops a parallel clustering method based on the MapReduce model. In the method, a parallel information bottleneck clustering method based on MapReduce is proposed to determine the initial cluster centers. An objective method is proposed to determine the final number of clusters automatically. A parallel centroid-based clustering method is then proposed to determine the final clustering result. The clustering results are visualized with the interpolation MDS dimension reduction method. The efficiency of the method is illustrated with a practical DNA clustering example.
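
The abstract does not spell out the information loss metric. As a minimal sketch, assuming the standard agglomerative information bottleneck formulation (the paper's exact variant may differ), the cost of merging two clusters is their combined prior weight times the weighted Jensen-Shannon divergence between their conditional distributions. The function names `kl` and `merge_cost` below are hypothetical, chosen only for illustration.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(w_i, p_i, w_j, p_j):
    """Information loss from merging clusters i and j (assumed standard form).

    w_i, w_j: prior probabilities of the two clusters (both > 0)
    p_i, p_j: conditional distributions p(y | cluster) as equal-length lists
    """
    w = w_i + w_j
    pi_i, pi_j = w_i / w, w_j / w
    # Conditional distribution of the merged cluster.
    merged = [pi_i * a + pi_j * b for a, b in zip(p_i, p_j)]
    # Weighted Jensen-Shannon divergence between the two conditionals.
    js = pi_i * kl(p_i, merged) + pi_j * kl(p_j, merged)
    return w * js

# Identical conditionals lose no information; disjoint ones lose the most.
same = merge_cost(0.5, [0.5, 0.5], 0.5, [0.5, 0.5])
disjoint = merge_cost(0.5, [1.0, 0.0], 0.5, [0.0, 1.0])
```

Under this assumption, an agglomerative pass would repeatedly merge the pair with the smallest cost, and the resulting clusters could seed a centroid-based refinement of the kind the abstract describes.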


Keywords: Clustering; Information bottleneck theory; MapReduce; Centroid-based clustering



This work is partially supported by the National Youth Science Foundation (No. 61004115), the National Science Foundation (No. 61272433) and the Provincial Fund for Nature Project (No. ZR2010FQ018).



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Zhanquan Sun (1)
  • Geoffrey Fox (2)
  • Weidong Gu (1)
  • Zhao Li (3)

  1. Key Laboratory for Computer Network of Shandong Province, Shandong Computer Science Center, Jinan, China
  2. School of Informatics and Computing, Pervasive Technology Institute, Indiana University Bloomington, Bloomington, USA
  3. School of Software Engineering, Beijing Jiaotong University, Beijing, China
