Information Processing at the Genomics Level

Abstract

A central objective in biology is to identify and characterize the mechanistic underpinnings (e.g., gene and protein interactions) of a biological phenomenon (e.g., a phenotype). Today, it is technologically feasible and commonplace to measure a great number of biomolecular features in a biological system at once, and to systematically investigate relationships between these features and a phenotype or phenomenological feature of interest across multiple spatial and temporal scales. The canonical starting point for such an investigation is a real-valued data matrix of N genomic features × M sample features, where N and M are integers and N is often orders of magnitude greater than M. In this chapter we describe and rationalize the broad concepts and general principles underlying the analytic steps that start from this data matrix and lead to the identification of coherent mathematical patterns in the data that represent potential, testable mechanistic associations. A key challenge in this analysis is dealing with false positives, which largely arise from the high dimensionality of the data. False positives are either mathematical patterns that are not coherent (from a technical or statistical standpoint) or coherent patterns that do not correspond to a true mechanistic association (from a biological standpoint).
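
To make the false-positive problem concrete, the following is a minimal sketch and not the chapter's own method. It assumes a pure-noise matrix of N = 2000 features × M = 10 samples drawn from a standard normal distribution, an arbitrary split of the samples into two groups, and a two-sided z-test with known unit variance. The function names `null_p_values` and `benjamini_hochberg` are hypothetical. The second function follows the step-up false discovery rate procedure of Benjamini and Hochberg.

```python
import math
import random


def null_p_values(n_features, n_samples, seed=0):
    """Two-sided z-test p-values for pure-noise features (no true signal).

    Assumption: every value is N(0, 1), so the variance is known to be 1.
    """
    rng = random.Random(seed)
    half = n_samples // 2
    rest = n_samples - half
    # Standard error of the difference of two group means when sigma = 1.
    se = math.sqrt(1.0 / half + 1.0 / rest)
    p_values = []
    for _ in range(n_features):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        diff = sum(row[:half]) / half - sum(row[half:]) / rest
        z = diff / se
        # P(|Z| > |z|) for a standard normal Z.
        p_values.append(math.erfc(abs(z) / math.sqrt(2.0)))
    return p_values


def benjamini_hochberg(p_values, q=0.05):
    """Return the sorted indices rejected by the BH step-up procedure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its BH threshold.
        if p_values[idx] <= q * rank / n:
            cutoff = rank
    return sorted(order[:cutoff])


p = null_p_values(n_features=2000, n_samples=10)
naive_hits = sum(1 for v in p if v < 0.05)
bh_hits = len(benjamini_hochberg(p, q=0.05))
print("features with p < 0.05:", naive_hits)
print("features kept after BH at q = 0.05:", bh_hits)
```

Because no feature in this simulation carries a true signal, every feature that passes the unadjusted threshold is a false positive, and roughly 5% of the 2000 features are expected to pass by chance alone. The adjusted procedure is expected to keep far fewer, which is the motivation for controlling error rates when N greatly exceeds M.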

Keywords

Data Matrix, Tail Length, Central Dogma, Expression Quantitative Trait Locus, Coherent Pattern
These keywords were added by machine and not by the authors.

Abbreviations

ANOVA   analysis of variance
DNA     deoxyribonucleic acid
EST     expressed sequence tag
FDR     false discovery rate
GC      Granger causality
GO      gene ontology
GWAS    genome-wide association scan
PCA     principal component analysis
R&R     gage repeatability and reproducibility
RNA     ribonucleic acid
ROC     receiver operating characteristic
SAGE    serial analysis of gene expression
SNP     single-nucleotide polymorphism
SOM     self-organizing map
eQTL    expression quantitative trait loci

Copyright information

© Springer-Verlag 2014

Authors and Affiliations

  1. Boston Children's Hospital, Boston, USA
  2. Informatics Program, Harvard Medical School/Boston Children's Hospital, Boston, USA
