Statistical Methods in Bioinformatics

Liu, Jun S.; Jiang, Bo

doi:10.1007/978-3-642-38951-1_4

Abstract

The linear biopolymers DNA, RNA, and proteins are the three central molecular building blocks of life. DNA is an information storage molecule. All of the hereditary information of an individual organism is contained in its genome, which consists of sequences of the four DNA bases (nucleotides): A, T, C, and G. RNA has a wide variety of roles, including a small but important set of functions. Proteins, which are chains of 20 different amino acid residues, are the action molecules of life: they are responsible for nearly all the functions of living beings and form many of life's structures. All protein sequences are coded by segments of the genome called genes. The universal genetic code is used to translate triplets of DNA bases, called codons, into the 20-letter alphabet of proteins. The flow of genetic information from DNA to RNA and then to protein is regarded as the central dogma of molecular biology. Genome sequencing projects, together with the emergence of microarray techniques, have resulted in rapidly growing, publicly available databases of DNA and protein sequences, structures, and genome-wide expression. One of the most interesting questions for scientists is how to extract useful biological information by "mining" these databases.
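
The codon-to-amino-acid translation mentioned above can be illustrated with a minimal Python sketch. It is not taken from the chapter: the names `CODON_TABLE` and `translate` are hypothetical, the table is the standard genetic code written in its conventional compact form (bases ordered T, C, A, G; `*` marks a stop codon), and the input is assumed to be a coding-strand DNA sequence read in frame from its first base.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order.
# '*' denotes a stop codon. (Illustrative helper names, not from the chapter.)
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 64 codons (TTT, TTC, TTA, ...) to a one-letter amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

Here the 64 codons map onto 20 amino acids plus the stop signal, which is why several codons share the same letter in the table.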

Author information

Authors and Affiliations

Department of Biostatistics, Harvard University, Cambridge, MA, 02138, USA
Jun S. Liu
MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing, 100084, China
Bo Jiang

Corresponding author

Correspondence to Jun S. Liu.

Editor information

Editors and Affiliations

Department of Automation, Tsinghua University, Beijing, People's Republic of China
Rui Jiang
Department of Automation, Tsinghua University, Beijing, People's Republic of China
Xuegong Zhang
Department of Molecular and Cell Biology, The University of Texas at Dallas, Richardson, Texas, USA
Michael Q. Zhang

Cite this chapter

Liu, J.S., Jiang, B. (2013). Statistical Methods in Bioinformatics. In: Jiang, R., Zhang, X., Zhang, M. (eds) Basics of Bioinformatics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38951-1_4

DOI: https://doi.org/10.1007/978-3-642-38951-1_4
Published: 01 July 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38950-4
Online ISBN: 978-3-642-38951-1
eBook Packages: Computer Science, Computer Science (R0)
