Computational Molecular Biology of Genome Expression and Regulation

  • Michael Q. Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3776)


Technological advances in experimental and computational molecular biology have revolutionized biology and medicine. Large-scale sequencing, expression and localization data give us a great opportunity to study biology at the systems level. I will introduce some outstanding problems in genome expression and regulatory networks for which better statistical and machine learning methods are badly needed.

The recent revolution in genomics has transformed life science. For the first time in history, mankind has been able to sequence its own entire genome. Bioinformatics, especially computational molecular biology, has played a vital role in extracting knowledge from the vast amount of information generated by high-throughput genomics technologies. Today, I am very happy to deliver this key lecture at the First International Conference on Pattern Recognition and Machine Intelligence at the world-renowned Indian Statistical Institute (ISI), where such luminaries as Mahalanobis, Bose, Rao and others worked. It is also timely that genomics has attracted a new generation of talented young statisticians, reminding us that statistics was essentially conceived from, and has been continuously nurtured by, biological problems. Pattern/rule recognition is at the heart of every learning process and hence of every scientific discipline, and comparison is the fundamental method: it is the similarities that allow us to infer common rules, and it is the differences that allow us to derive new rules.

Gene expression, which normally refers to the cellular processes that lead to protein production, is controlled and regulated at multiple levels. Cells use this elaborate system of “circuits” and “switches” to decide when, where and by how much each gene should be turned on (activated, expressed) or off (repressed, silenced) in response to environmental cues. Genome expression and regulation refer to the coordinated expression and regulation of many genes at large scale, for which advanced computational methods become indispensable. Due to space limitations, I can only highlight some of the pattern recognition problems in transcriptional regulation, which is the most important and best-studied level of control.

Currently, there are two general outstanding problems in transcriptional regulation studies: (1) how to find the regulatory regions, in particular the promoter regions, in the genome (throughout most of this lecture, we use promoter to refer to proximal promoters, e.g. ~1 kb of DNA at the beginning of each gene); and (2) how to identify functional cis-regulatory DNA elements within each such region.
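To make the two problems concrete, the following minimal sketch pairs a toy version of each: taking a fixed window next to an annotated transcription start site (TSS) as a stand-in for the proximal promoter, and scanning that window with a log-odds position weight matrix as one simple way to flag candidate cis-regulatory elements. It is an illustration under stated assumptions, not the method of the lecture: the window is assumed to lie entirely upstream of the TSS, the function names, the toy counts, the uniform background and the threshold are all hypothetical, and real promoter finding must first predict the TSS rather than take it as given.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def proximal_promoter(chrom_seq, tss, strand, length=1000):
    """Return up to `length` bases immediately upstream of a 0-based TSS.

    Assumption: the proximal promoter is taken as the window just upstream
    of the TSS, read 5'->3' on the strand of the gene.
    """
    if strand == "+":
        return chrom_seq[max(0, tss - length):tss]
    # Minus strand: upstream lies at higher coordinates on the plus strand.
    return reverse_complement(chrom_seq[tss + 1:tss + 1 + length])


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def scan(seq, matrix, threshold):
    """Return (offset, site, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(matrix[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits


# Toy data (hypothetical): a TATA-like motif a few bases upstream of the TSS.
toy_chrom = "GGGGTATAAACCCCATGAAA"
toy_counts = [{base: 8} for base in "TATAAA"]
toy_promoter = proximal_promoter(toy_chrom, tss=14, strand="+", length=10)
toy_hits = scan(toy_promoter, log_odds_matrix(toy_counts), threshold=6.0)
```

A single weight matrix with a fixed threshold is the simplest possible scoring model; it treats positions as independent and ignores conservation, expression and binding data, which is exactly where better statistical and machine learning methods are needed.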


Transcription Factor Binding Site; Core Promoter; Multivariate Adaptive Regression Spline; Genome Expression; Relevance Vector Machine



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Michael Q. Zhang, Cold Spring Harbor Laboratory, Cold Spring Harbor, USA
