Abstract
The linear biopolymers DNA, RNA, and proteins are the three central molecular building blocks of life. DNA is an information storage molecule. All of the hereditary information of an individual organism is contained in its genome, which consists of sequences of the four DNA bases (nucleotides): A, T, C, and G. RNA has a wide variety of roles, including a small but important set of functions. Proteins, which are chains of 20 different amino acid residues, are the action molecules of life: they are responsible for nearly all the functions of living beings and form many of life’s structures. All protein sequences are coded by segments of the genome called genes. The universal genetic code is used to translate triplets of DNA bases, called codons, into the 20-letter alphabet of proteins. The flow of genetic information from DNA to RNA and then to protein is regarded as the central dogma of molecular biology. Genome sequencing projects, together with the emergence of microarray techniques, have produced rapidly growing, publicly available databases of DNA and protein sequences, structures, and genome-wide expression. One of the most interesting questions facing scientists is how to extract useful biological information by “mining” these databases.
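As a minimal illustration of the codon-to-amino-acid mapping described above, the following Python sketch builds the standard genetic code table and translates a DNA coding sequence. The names `CODON_TABLE` and `translate` are hypothetical helpers introduced here for illustration and are not taken from the chapter. The sketch assumes an uppercase coding-strand sequence read in frame from its first base, and it stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# at each of the three positions; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Assumes uppercase A/T/C/G input read in frame from position 0.
    Translation stops at the first stop codon; a trailing incomplete
    codon is ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

The table holds 64 codons that map to 20 amino acids plus the stop signal, so several codons encode the same residue.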
Copyright information
© 2013 Tsinghua University Press, Beijing and Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Liu, J.S., Jiang, B. (2013). Statistical Methods in Bioinformatics. In: Jiang, R., Zhang, X., Zhang, M. (eds) Basics of Bioinformatics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38951-1_4
DOI: https://doi.org/10.1007/978-3-642-38951-1_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38950-4
Online ISBN: 978-3-642-38951-1
eBook Packages: Computer Science, Computer Science (R0)