

Statistical Sequence Analysis Using N-Grams


Abstract: Statistical analysis of amino acid and nucleotide sequences, especially sequence alignment, is one of the most commonly performed tasks in modern molecular biology. However, for many tasks in bioinformatics, the requirement that the features in an alignment be consecutive is restrictive, and ‘n-grams’ (also known as k-tuples) have been used as features instead. N-grams are usually short nucleotide or amino acid sequences of length n, but the unit for a gram may be chosen arbitrarily. The n-gram concept is borrowed from language technologies, where n-grams of words form the fundamental units in statistical language models. Despite the demonstrated utility of n-gram statistics in the biology domain, there is currently no publicly accessible generic tool for the efficient calculation of such statistics. Most sequence analysis tools disregard matches to short sequences because such matches lack statistical significance. This article presents the integrated Biological Language Modeling Toolkit (BLMT), which allows efficient calculation of n-gram statistics for arbitrary sequence datasets.
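To make the n-gram idea concrete, the following is a minimal Python sketch that counts overlapping n-grams in a single sequence and converts the counts to relative frequencies. It is an illustration only and not BLMT's own implementation; the function names `ngram_counts` and `ngram_frequencies` and the toy peptide string are assumptions introduced here.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count every overlapping n-gram (substring of length n) in a sequence.

    Hypothetical helper for illustration; the unit of a gram here is a
    single character (one residue or one nucleotide).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence of length L contains L - n + 1 overlapping n-grams;
    # range() is empty when the sequence is shorter than n.
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_frequencies(sequence, n):
    """Return the relative frequency of each n-gram observed in the sequence."""
    counts = ngram_counts(sequence, n)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Toy amino acid string, made up for this example.
peptide = "MKVLAAGIVGLAAG"
bigram_counts = ngram_counts(peptide, 2)
bigram_freqs = ngram_frequencies(peptide, 2)
```

A dictionary-based count like this is adequate for short n and modest datasets. For whole-genome data or large n, an index structure such as a suffix array is a common alternative, at the cost of a more involved implementation.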

Availability: BLMT can be downloaded from and installed for standalone use on any Unix platform or Unix shell emulation such as Cygwin on the Windows® platform. Specific tools and usage details are described in a ‘readme’ file. The n-gram computations carried out by the BLMT are part of a broader set of tools borrowed from language technologies and modified for statistical analysis of biological sequences; these are available at




  1. Other notations used instead of n-grams are k-grams, k- or n-tuples and k- or n-mers.




This research was supported by National Science Foundation grants NSF0225656 and NSF0225636, and the Sofya Kovalevskaya Program of the Alexander von Humboldt-Foundation/Zukunftsinvestitionsprogramm der Bundesregierung Deutschland.

The authors have no conflicts of interest directly relevant to the content of this review.

Author information

Correspondence to Dr Judith Klein-Seetharaman.


About this article

Cite this article

Ganapathiraju, M., Manoharan, V. & Klein-Seetharaman, J. BLMT. Appl Bioinformatics 3, 193–200 (2004).



  • Language Model
  • Language Technology
  • Suffix Array
  • Amino Acid Preference
  • Longest Common Prefix