Abstract
Technological advances in experimental and computational molecular biology have revolutionized the fields of biology and medicine. Large-scale sequencing, expression and localization data have provided us with a great opportunity to study biology at the system level. I will introduce some outstanding problems in genome expression and regulatory networks for which better statistical and machine learning technologies are desperately needed.
The recent revolution in genomics has transformed the life sciences. For the first time in history, mankind has been able to sequence its own entire genome. Bioinformatics, especially computational molecular biology, has played a vital role in extracting knowledge from the vast amount of information generated by high-throughput genomics technologies. Today, I am very happy to deliver this key lecture at the First International Conference on Pattern Recognition and Machine Intelligence at the world-renowned Indian Statistical Institute (ISI), where such luminaries as Mahalanobis, Bose, Rao and others once worked. It is also very timely that genomics has attracted a new generation of talented young statisticians, reminding us that statistics was essentially conceived from, and has been continuously nurtured by, biological problems. Pattern/rule recognition is at the heart of every learning process and hence of every scientific discipline, and comparison is the fundamental method: it is the similarities that allow us to infer common rules, and it is the differences that allow us to derive new rules.
Gene expression, which normally refers to the cellular processes that lead to protein production, is controlled and regulated at multiple levels. Cells use this elaborate system of “circuits” and “switches” to decide when, where and by how much each gene should be turned on (activated, expressed) or off (repressed, silenced) in response to environmental cues. Genome expression and regulation refer to the coordinated expression and regulation of many genes at large scale, for which advanced computational methods become indispensable. Due to space limitations, I can only highlight some of the pattern recognition problems in transcriptional regulation, which is the most important and best-studied level of control.
Currently, there are two general outstanding problems in transcriptional regulation studies: (1) how to find the regulatory regions, in particular the promoter regions, in the genome (throughout most of this lecture, we use promoter to refer to proximal promoters, e.g. ~1 kb of DNA at the beginning of each gene); and (2) how to identify functional cis-regulatory DNA elements within each such region.
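To make the two problems concrete, here is a minimal Python sketch. It is an illustration only, not a method from this lecture. It assumes that the transcription start site (TSS) is already known, that the proximal promoter is taken as the window immediately upstream of the TSS, and that a cis-element can be stood in for by an exact word match. Real promoter and motif finders relax all three assumptions. The function names `proximal_promoter` and `find_word` are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def proximal_promoter(chrom_seq: str, tss: int, strand: str, length: int = 1000) -> str:
    """Return up to `length` bases immediately upstream of a 0-based TSS.

    The result is read 5'->3' on the gene's own strand. Treating the
    promoter as a purely upstream window is an assumption of this sketch.
    """
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss].upper()
    if strand == "-":
        # Upstream of a minus-strand gene lies at higher coordinates.
        end = min(len(chrom_seq), tss + 1 + length)
        return reverse_complement(chrom_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


def find_word(region: str, word: str) -> list[int]:
    """Return 0-based offsets of every exact occurrence of `word`, overlaps included."""
    region, word = region.upper(), word.upper()
    return [i for i in range(len(region) - len(word) + 1)
            if region[i:i + len(word)] == word]


# Toy example: an 8-base "promoter" window and a TATA-like word.
toy_chrom = "ACGTATAAAGGCC"
window = proximal_promoter(toy_chrom, tss=10, strand="+", length=8)
hits = find_word(window, "TATAAA")
```

Exact word matching is used here only because it is the simplest possible stand-in. In practice, binding sites are degenerate, so problem (2) calls for statistical models of the site rather than a fixed string.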
Keywords
- Transcription Factor Binding Site
- Core Promoter
- Multivariate Adaptive Regression Spline
- Genome Expression
- Relevance Vector Machine
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Zhang, M.Q. (2005). Computational Molecular Biology of Genome Expression and Regulation. In: Pal, S.K., Bandyopadhyay, S., Biswas, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2005. Lecture Notes in Computer Science, vol 3776. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11590316_5
DOI: https://doi.org/10.1007/11590316_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30506-4
Online ISBN: 978-3-540-32420-1
eBook Packages: Computer Science (R0)