On the Sample Complexity of Cancer Pathways Identification
In this work we propose a framework for analyzing the sample complexity of problems that arise in the study of genomic datasets. The framework builds on tools from combinatorial analysis and statistical learning theory that have been used to analyze machine learning and probably approximately correct (PAC) learning. We apply it to the problem of identifying cancer pathways through mutual exclusivity analysis of mutations from large cancer sequencing studies. We analytically derive matching upper and lower bounds on the sample complexity of this problem, showing that sample sizes much larger than those currently available may be required to identify all the cancer genes in a pathway. We also provide two algorithms for finding a cancer pathway in a large genomic dataset and show, on simulated and cancer data, that they can identify cancer pathways from large genomic datasets.
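To make the notion of mutual exclusivity concrete, the following minimal sketch scores a candidate gene set by the fraction of samples that carry a mutation in exactly one of its genes. This is only an illustration under stated assumptions, not the scoring function or either of the algorithms from the paper: the function name `exclusive_fraction`, the dictionary-of-sets data layout, and the toy gene and sample names are all hypothetical.

```python
from typing import Dict, Iterable, Set


def exclusive_fraction(mutations: Dict[str, Set[str]], gene_set: Iterable[str]) -> float:
    """Fraction of samples with exactly one mutated gene from gene_set.

    `mutations` maps a sample id to the set of genes mutated in that sample
    (a hypothetical layout chosen for this sketch).
    """
    genes = set(gene_set)
    if not mutations:
        return 0.0
    # A sample supports mutual exclusivity if it hits the gene set exactly once.
    exclusive = sum(1 for mutated in mutations.values() if len(mutated & genes) == 1)
    return exclusive / len(mutations)


# Toy data (invented for illustration): five samples, three candidate genes.
toy = {
    "s1": {"geneA"},
    "s2": {"geneB"},
    "s3": {"geneC"},
    "s4": {"geneA", "geneB"},  # two hits in the set: not exclusive
    "s5": set(),               # no hit in the set: not covered
}
score = exclusive_fraction(toy, ["geneA", "geneB", "geneC"])
print(score)
```

With few samples, a score like this is a noisy estimate of the underlying mutation pattern, which is the kind of question a sample complexity analysis addresses.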