SWORDS: A statistical tool for analysing large DNA sequences

Journal of Biosciences

Abstract

In this article, we present some simple yet effective statistical techniques for analysing and comparing large DNA sequences. These techniques are based on frequency distributions of DNA words in a large sequence and have been packaged into a software tool called SWORDS. Using sequences available in public-domain databases on the Internet, we demonstrate how molecular biologists and geneticists can conveniently use SWORDS to uncover biologically important features hidden in large sequences and to assess their statistical significance.
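
As a rough illustration of what a frequency distribution of DNA words looks like in practice, the following Python sketch counts overlapping words of a fixed length and compares two sequences by their word frequencies. It is not taken from SWORDS: the function names `word_frequencies` and `frequency_distance`, the choice of overlapping windows, the skipping of words containing non-ACGT characters, and the sum-of-absolute-differences distance are all assumptions made for this example.

```python
from collections import Counter


def word_frequencies(sequence: str, k: int) -> dict[str, float]:
    """Relative frequencies of overlapping length-k words in a DNA sequence.

    Hypothetical helper for illustration; not part of SWORDS.
    """
    seq = sequence.upper()
    alphabet = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet  # assumption: skip ambiguous bases such as N
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def frequency_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Sum of absolute differences between two word-frequency distributions.

    Assumed distance measure, chosen only because it is simple to read.
    """
    words = set(p) | set(q)
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)


if __name__ == "__main__":
    a = word_frequencies("ACGTACGTAC", 2)
    b = word_frequencies("AAAACCCCGG", 2)
    print(sorted(a.items()))
    print(frequency_distance(a, b))
```

For genome-scale inputs one would likely read the sequence in chunks from a FASTA file rather than hold a Python string per window, but the counting logic would stay the same.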





Corresponding author

Correspondence to Probal Chaudhuri.


About this article

Cite this article

Chaudhuri, P., Das, S. SWORDS: A statistical tool for analysing large DNA sequences. J Biosci 27, 1–6 (2002). https://doi.org/10.1007/BF02703678


  • DOI: https://doi.org/10.1007/BF02703678
