High-Throughput Approaches to Biomarker Discovery and the Challenges of Subsequent Validation
Recently introduced high-throughput technologies are producing unprecedented volumes of biomedical data for mining and analysis. However, early predictions of imminent breakthroughs in our understanding of human diseases, and of easy predictive diagnostics, turned out to be largely overoptimistic.
We argue that this situation is not coincidental but is caused by the statistical properties of the data collected. A typical high-throughput biological dataset is deeply imbalanced: the data matrix includes many measured quantities, or “levels,” in a relatively small number of subjects. Thus, any attempt to analyze these datasets is undermined by the so-called “dimensionality curse,” which may be mitigated by removing the majority of variables. Feature selection aimed at increasing classification power may be done using data mining or correlation-based approaches. In this chapter, both theory-driven and data-driven approaches to dealing with complexity in biological systems are discussed in detail.
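The imbalance described above can be illustrated with a minimal sketch. It is not the chapter's method: the matrix sizes, the random seed, and the helper names `abs_pearson_with_labels` and `select_top_k` are assumptions chosen for illustration. The sketch ranks features by absolute Pearson correlation with a class label, one simple correlation-based selection scheme, on a matrix of pure noise with many features and few subjects.

```python
import numpy as np


def abs_pearson_with_labels(X, y):
    """Absolute Pearson correlation of each column of X with the label vector y.

    Assumes no column of X is constant (otherwise the denominator is zero).
    """
    Xc = X - X.mean(axis=0)          # center each feature
    yc = y - y.mean()                # center the labels
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / denom


def select_top_k(X, y, k):
    """Indices of the k features most correlated with y, best first."""
    scores = abs_pearson_with_labels(X, y)
    return np.argsort(scores)[::-1][:k]


# Hypothetical dimensions: few subjects, many measured "levels".
rng = np.random.default_rng(0)
n_subjects, n_features = 20, 5000

# Pure noise: by construction, no feature carries information about y.
X = rng.normal(size=(n_subjects, n_features))
y = np.repeat([0.0, 1.0], n_subjects // 2)   # balanced two-class labels

scores = abs_pearson_with_labels(X, y)
top = select_top_k(X, y, k=10)

print("largest |r| among pure-noise features:", scores.max())
print("number of features with |r| > 0.5:", int((scores > 0.5).sum()))
print("indices of the 10 top-ranked features:", top)
```

Because the features here are noise, any strong correlation the ranking finds is spurious. This suggests why a signature selected and evaluated on the same small cohort can look convincing and still fail on independent subjects; evaluating the selected features on held-out data is one common safeguard.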
Keywords: Biomarker, Molecular signature, Feature selection, Dimensionality curse, Knowledge-based algorithms
The authors express gratitude for the general support provided by the College of Science, George Mason University; by State Contract 14.607.21.0098 dated November 27, 2014 (Ministry of Science and Education, Russia); and by the Human Proteome Scientific Program of the Federal Agency of Scientific Organizations, Russia.