Abstract
Cluster analysis has been widely applied by researchers in many scientific fields over the last decades. Advances in the understanding of biological phenomena have revived interest in cluster analysis, due in part to the large amount of microarray data available. Apart from requiring user-defined parameters, traditional clustering algorithms show clear limitations in handling microarray data owing to its inherent characteristics: it is high-dimensional with small sample sizes, highly redundant, and noisy. This has motivated the study of clustering algorithms tailored to the analysis of microarray data, which continue to be developed and adapted. This chapter reviews clustering methods based on different cluster analysis approaches in the challenging context of microarray data. The validation of clustering results is also briefly discussed by means of validity indexes used to assess the goodness of the number of clusters and of the induced cluster assignments.
Acknowledgements
This work has been partially supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund (ERDF) under grants TIN2014-53749-C2-2R and TIN2017-85949-C2-1-R.
Copyright information
© 2019 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Franco, M., Vivo, JM. (2019). Cluster Analysis of Microarray Data. In: Bolón-Canedo, V., Alonso-Betanzos, A. (eds) Microarray Bioinformatics. Methods in Molecular Biology, vol 1986. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9442-7_7
DOI: https://doi.org/10.1007/978-1-4939-9442-7_7
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9441-0
Online ISBN: 978-1-4939-9442-7
eBook Packages: Springer Protocols