Topics in Computational Genomics

Basics of Bioinformatics

Abstract

Genomics began around 1990 with large-scale sequencing of the human genome and the genomes of many model organisms; the rapid accumulation of vast genomic data poses a great challenge: how to decipher such massive molecular information. Like bioinformatics in general, genome informatics is data driven; many computational tools can soon become obsolete when new technologies and data types become available. A student who wants to work in this fascinating new field should keep this in mind and must be able to adapt quickly and to “shoot the moving targets” with “just-in-time ammunition.”

Author information

Corresponding author

Correspondence to Michael Q. Zhang.

Copyright information

© 2013 Tsinghua University Press, Beijing and Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Zhang, M.Q., Smith, A.D. (2013). Topics in Computational Genomics. In: Jiang, R., Zhang, X., Zhang, M. (eds) Basics of Bioinformatics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38951-1_3

  • DOI: https://doi.org/10.1007/978-3-642-38951-1_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38950-4

  • Online ISBN: 978-3-642-38951-1

  • eBook Packages: Computer Science, Computer Science (R0)
