
Statistical Methods in Bioinformatics

Basics of Bioinformatics

Abstract

The linear biopolymers DNA, RNA, and proteins are the three central molecular building blocks of life. DNA is an information storage molecule: all of the hereditary information of an individual organism is contained in its genome, which consists of sequences of the four DNA bases (nucleotides) A, T, C, and G. RNA has a wide variety of roles, including a small but important set of functions. Proteins, which are chains of 20 different amino acid residues, are the action molecules of life; they are responsible for nearly all the functions of living beings and form many of life’s structures. All protein sequences are coded by segments of the genome called genes. The universal genetic code is used to translate triplets of DNA bases, called codons, into the 20-letter alphabet of proteins. The flow of genetic information from DNA to RNA and then to protein is regarded as the central dogma of molecular biology. Genome sequencing projects, together with the emergence of microarray techniques, have produced rapidly growing, publicly available databases of DNA and protein sequences, structures, and genome-wide expression. One of the most interesting questions facing scientists is how to extract useful biological information by mining these databases.



Author information

Corresponding author

Correspondence to Jun S. Liu.

Copyright information

© 2013 Tsinghua University Press, Beijing and Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Liu, J.S., Jiang, B. (2013). Statistical Methods in Bioinformatics. In: Jiang, R., Zhang, X., Zhang, M. (eds) Basics of Bioinformatics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38951-1_4

  • DOI: https://doi.org/10.1007/978-3-642-38951-1_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38950-4

  • Online ISBN: 978-3-642-38951-1

  • eBook Packages: Computer Science, Computer Science (R0)
