High-Throughput Approaches to Biomarker Discovery and the Challenges of Subsequent Validation

  • Boris Veytsman
  • Ancha Baranova
Living reference work entry


Recently introduced high-throughput technologies produce unprecedented volumes of biomedical data available for mining and analysis. However, early predictions of imminent breakthroughs in our understanding of human diseases, and of easy predictive diagnostics, turned out to be largely overoptimistic.

We argue that this situation is not coincidental but is caused by the statistical properties of the data collected. A typical high-throughput biological dataset is deeply imbalanced: the data matrix includes many measured quantities, or “levels,” for a relatively small number of subjects. Any attempt to analyze such datasets is therefore undermined by the so-called “curse of dimensionality,” which may be addressed by removing the majority of variables. Feature selection aimed at increasing classification power may be done using data mining or correlation-based approaches. This chapter discusses in detail both theory-driven and data-driven approaches to dealing with complexity in biological systems.
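The imbalance described above can be illustrated with a minimal sketch. It is not taken from the chapter: the function names, the sample sizes (20 subjects, 5000 features), and the choice of Pearson correlation as the selection score are all assumptions made for illustration. Every feature is pure noise, generated independently of the outcome labels, yet the best-scoring feature among thousands still looks associated with the outcome.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def noise_correlations(n_subjects, n_features, seed=0):
    """Return |r| between each pure-noise feature and a binary outcome.

    Hypothetical helper: the labels and features are synthetic, so any
    apparent association is due to chance alone.
    """
    rng = random.Random(seed)
    # Balanced case/control labels, assigned arbitrarily.
    labels = [i % 2 for i in range(n_subjects)]
    correlations = []
    for _ in range(n_features):
        # Each "level" is drawn independently of the labels.
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        correlations.append(abs(pearson(feature, labels)))
    return correlations


corr = noise_correlations(n_subjects=20, n_features=5000)
best_of_10 = max(corr[:10])    # a handful of measured quantities
best_of_5000 = max(corr)       # a high-throughput-sized panel
print(f"strongest |r| among 10 noise features:   {best_of_10:.2f}")
print(f"strongest |r| among 5000 noise features: {best_of_5000:.2f}")
```

Under these assumptions the strongest chance correlation grows as features are added while the number of subjects stays fixed, which is one reason a selected feature should be checked on subjects that were not used to select it.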


Biomarker; Molecular signature; Feature selection; Dimensionality curse; Knowledge-based algorithms



The authors are grateful for the general support provided by the College of Science, George Mason University; by State Contract 14.607.21.0098 dated November 27, 2014 (Ministry of Science and Education, Russia); and by the Human Proteome Scientific Program of the Federal Agency of Scientific Organizations, Russia.



Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  1. Center for the Study of Chronic Metabolic Diseases, School of System Biology, George Mason University, Fairfax, USA
  2. Research Centre for Medical Genetics, Russian Academy of Medical Sciences, Moscow, Russia
