Abstract
Methods for translating gene expression signatures into clinically relevant information have typically relied upon having many samples from patients with similar molecular phenotypes. Here, we address the question of what can be done when it is relatively easy to obtain samples from healthy patients, but abnormalities corresponding to disease states may be rare and one-of-a-kind. The associated computational challenge, anomaly detection, is a well-studied machine learning problem. However, due to the dimensionality and variability of expression data, existing methods based on feature space analysis or individual anomalously expressed genes are insufficient. We present a novel approach, CSAX, that identifies pathways in an individual sample in which the normal expression relationships are disrupted. To evaluate our approach, we have compiled and released a compendium of public microarray data sets, reformulated to create a testbed for anomaly detection. We demonstrate the accuracy of CSAX on the data sets in our compendium, compare it to other leading anomaly-detection methods, and show that CSAX aids both in identifying anomalies and in explaining their underlying biology. We note the potential for the use of such methods in identifying subclasses of disease. We also describe an approach to characterizing the difficulty of specific expression anomaly detection tasks and discuss how one can estimate the feasibility of a specific task. Our approach provides an important step towards identification of individual disease patterns in the era of personalized medicine.
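The anomaly-detection setting described in the abstract (a reference set of healthy samples, followed by scoring one new sample at a time) can be illustrated with a deliberately naive baseline. The sketch below is not CSAX; it is an example of the individual-gene style of scoring that the abstract describes as insufficient for expression data. The function names, the max-|z| scoring rule, and the toy values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def fit_normal_profile(normal_samples):
    """Estimate per-gene mean and standard deviation from healthy samples.

    normal_samples: list of equal-length lists, one expression vector per sample.
    """
    genes = list(zip(*normal_samples))  # transpose: one tuple of values per gene
    return [(mean(g), stdev(g)) for g in genes]


def gene_z_scores(profile, sample, eps=1e-9):
    """Return |z| for each gene of one new sample against the normal profile."""
    return [abs(x - mu) / (sd + eps) for (mu, sd), x in zip(profile, sample)]


def anomaly_score(profile, sample):
    """Naive sample-level score: the largest per-gene |z| (an assumed rule)."""
    return max(gene_z_scores(profile, sample))


# Toy data (made up): 4 healthy samples x 3 genes.
normals = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 2.1],
    [0.9, 4.9, 1.9],
    [1.1, 5.0, 2.0],
]
profile = fit_normal_profile(normals)

typical = [1.0, 5.0, 2.0]  # resembles the healthy samples
unusual = [1.0, 9.0, 2.0]  # second gene far outside the healthy range

print(anomaly_score(profile, typical))
print(anomaly_score(profile, unusual))
```

A score built from single genes in this way treats each gene independently, so it cannot see disrupted relationships among genes in a pathway, which is the gap the abstract says CSAX is meant to address.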
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Noto, K., Brodley, C., Majidi, S., Bianchi, D.W., Slonim, D.K. (2014). CSAX: Characterizing Systematic Anomalies in eXpression Data. In: Sharan, R. (ed.) Research in Computational Molecular Biology. RECOMB 2014. Lecture Notes in Computer Science, vol 8394. Springer, Cham. https://doi.org/10.1007/978-3-319-05269-4_18
DOI: https://doi.org/10.1007/978-3-319-05269-4_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05268-7
Online ISBN: 978-3-319-05269-4
eBook Packages: Computer Science, Computer Science (R0)