Abstract
We develop a new technique for analysing microarray data that combines principal component analysis with consensus ensemble k-clustering to find robust clusters and gene markers. We apply our method to a public microarray breast cancer dataset containing gene expression levels in normal samples and in three pathological stages of disease: atypical ductal hyperplasia (ADH), ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC). The method averages over clustering techniques and data perturbations to find stable, robust clusters and gene markers. We identify the clusters and their pathways with distinct subtypes of breast cancer (Luminal, Basal and Her2+). We confirm that the cancer phenotype develops early (at the early hyperplasia, or ADH, stage) and find that each subtype progresses from ADH to DCIS to IDC along its own specific pathway, as if each were a distinct disease.
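The general idea of combining principal component analysis with consensus clustering under data perturbation can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a single clustering technique (plain k-means) where the abstract describes averaging over several, uses random subsampling of samples as the only perturbation, and runs on synthetic data. The function names (pca_project, kmeans, consensus_matrix) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def pca_project(X, n_components):
    """Project samples (rows of X) onto the top principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one integer label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def consensus_matrix(X, k, n_runs=100, frac=0.8, seed=0):
    """Fraction of runs in which each pair of samples is co-clustered,
    counted only over runs where both samples were drawn."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for _ in range(n_runs):
        # Perturb the data by clustering a random subset of the samples.
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)

# Synthetic stand-in for an expression matrix (assumption, not real data):
# 40 samples x 50 genes, with the second group shifted in 10 genes.
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(20, 50))
group_b = rng.normal(0.0, 1.0, size=(20, 50))
group_b[:, :10] += 4.0
X = np.vstack([group_a, group_b])

scores = pca_project(X, n_components=2)
C = consensus_matrix(scores, k=2)
within = C[:20, :20].mean()    # average consensus inside the first group
between = C[:20, 20:].mean()   # average consensus across the two groups
```

Pairs of samples with consensus values near 1 are grouped together regardless of which subset was drawn, which is one way to read "robust" clusters; a fuller treatment would also vary the clustering algorithm and the number of clusters.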
Abbreviations
- ADH: atypical ductal hyperplasia
- DCIS: ductal carcinoma in situ
- FDR: false discovery rate
- IDC: invasive ductal carcinoma
- PCA: principal component analysis
- SNR: signal-to-noise ratio
- WV: weighted voting
Cite this article
Alexe, G., Dalgin, G.S., Ganesan, S. et al. Analysis of breast cancer progression using principal component analysis and clustering. J Biosci 32 (Suppl 1), 1027–1039 (2007). https://doi.org/10.1007/s12038-007-0102-4
DOI: https://doi.org/10.1007/s12038-007-0102-4