Analysis of breast cancer progression using principal component analysis and clustering

Journal of Biosciences

Abstract

We develop a new technique for analysing microarray data that combines principal component analysis with consensus ensemble k-clustering to find robust clusters and gene markers. We apply the method to a public microarray breast cancer dataset containing gene expression levels in normal samples and in three pathological stages of disease: atypical ductal hyperplasia (ADH), ductal carcinoma in situ (DCIS) and invasive ductal carcinoma (IDC). The method averages over clustering techniques and data perturbations to find stable, robust clusters and gene markers. We identify the clusters and their pathways with distinct subtypes of breast cancer (Luminal, Basal and Her2+). We confirm that the cancer phenotype develops early (at the early hyperplasia, or ADH, stage) and find that each subtype progresses from ADH to DCIS to IDC along its own specific pathway, as if each were a distinct disease.
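The general idea of combining a principal component projection with consensus clustering over perturbed data can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a samples-by-genes matrix, uses only a plain k-means as the base clusterer (the paper averages over several clustering techniques), and perturbs the data by random subsampling. The function names `pca_project`, `kmeans` and `consensus_matrix`, the parameters `n_runs` and `frac`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X (samples x genes) onto the top principal components."""
    Xc = X - X.mean(axis=0)                       # centre each gene
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # sample scores on the top PCs

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one integer label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):               # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def consensus_matrix(X, k, n_runs=100, frac=0.8, seed=0):
    """Fraction of subsampled runs in which each pair of samples co-clusters."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))                   # times a pair shared a cluster
    sampled = np.zeros((n, n))                    # times a pair was drawn together
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    return together / np.maximum(sampled, 1.0)    # avoid division by zero

# Synthetic example: two groups of 10 samples, 50 "genes" each (made-up data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(10, 50)),
               rng.normal(5.0, 1.0, size=(10, 50))])
scores = pca_project(X, n_components=2)
C = consensus_matrix(scores, k=2, n_runs=50)
print(np.round(C[:3, :3], 2))                     # consensus among the first three samples
```

Entries of the consensus matrix close to 1 mark pairs of samples that cluster together regardless of which subsample was drawn, which is one way to read "stable, robust clusters"; entries near 0 mark pairs that are consistently separated.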

Abbreviations

ADH: atypical ductal hyperplasia

DCIS: ductal carcinoma in situ

FDR: false discovery rate

IDC: invasive ductal carcinoma

PCA: principal component analysis

SNR: signal-to-noise ratio

WV: weighted voting

Correspondence to C. DeLisi or G. Bhanot.

Alexe, G., Dalgin, G.S., Ganesan, S. et al. Analysis of breast cancer progression using principal component analysis and clustering. J Biosci 32 (Suppl 1), 1027–1039 (2007). https://doi.org/10.1007/s12038-007-0102-4

  • DOI: https://doi.org/10.1007/s12038-007-0102-4
