Words in DNA sequences: some case studies based on their frequency statistics

Basu, Srabashi; Burma, Debi Prosad; Chaudhuri, Probal

doi:10.1007/s00285-002-0185-3

Published: June 2003

Volume 46, pages 479–503, (2003)

Journal of Mathematical Biology

Srabashi Basu¹,
Debi Prosad Burma² &
Probal Chaudhuri¹

Abstract.

One of the critical requirements of data analysis involving large DNA sequences is an effective statistical summarization of those sequences. In this article, DNA sequences are analyzed on the basis of word frequencies. Our analysis focuses on detecting the structural signature of a genome reflected in word frequencies and on identifying phylogenetic relationships among different species reflected in the variation of word distributions in their DNA sequences. We have carried out a statistical study of the complete genome of baker's yeast, of various ribosomal RNA sequences from different prokaryotic and eukaryotic organisms, and of the full genomes of some bacteriophages. Our exploratory analysis demonstrates the usefulness of DNA word frequencies in reducing the dimensionality of large sequences while retaining some of the structural information that can have biological significance. Some conceptual issues that arise in the course of our investigation are addressed. A few interesting problems related to the statistics of DNA words are pointed out, with some indication of their possible solutions. The work has been partially motivated by the fact that sequence alignment and homology techniques, which are quite popular for comparing and analyzing relatively small DNA sequences of nearly equal sizes, are not applicable to data consisting of large sequences with widely varying sizes. Such sequences may contain segments with unknown or no biological function, and consequently their comparison through functional homology is either impossible or extremely difficult.
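
To make the idea of summarizing a sequence by its word frequencies concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' procedure: the choice of overlapping windows, the word length `k`, the function names `word_frequencies` and `l1_distance`, and the toy sequences are all hypothetical, and the full article may define and normalize word counts differently.

```python
from collections import Counter
from itertools import product


def word_frequencies(seq, k):
    """Relative frequencies of all length-k words over A, C, G, T.

    Assumption: words are counted in overlapping windows, and windows
    containing any other symbol (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    return {w: (counts[w] / total if total else 0.0) for w in words}


def l1_distance(f, g):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(f[w] - g[w]) for w in f)


# Toy sequences of different lengths (hypothetical data).
a = word_frequencies("ACGTACGTAC", 2)
b = word_frequencies("AAAAACCCCCGGGGGTTTTT", 2)
print(len(a), l1_distance(a, b))
```

Each sequence, whatever its length, is reduced to a vector of 4^k relative frequencies, so sequences of widely varying sizes can be compared without alignment.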

Author information

Authors and Affiliations

Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B.T. Road, Calcutta 700108, India. e-mail: srabashi@isical.ac.in; probal@isical.ac.in
Srabashi Basu & Probal Chaudhuri
Molecular Biology Unit, Institute of Medical Sciences, Banaras Hindu University, Varanasi, 221005, India
Debi Prosad Burma

Additional information

Received: 15 October 2000 / Revised version: 8 October 2002 / Published online: 28 February 2003

Current address: CF186, Salt Lake, Calcutta 700064, India

Research presented here was supported in part by a grant from Indian Statistical Institute.

Key words or phrases: Average linkage clustering – Chernoff's faces – Dendrograms – DNA words – F-ranks of words – F-ratios of words – l₁-distance – Phylogenetic relationships – Rank correlation – Single linkage clustering

About this article

Cite this article

Basu, S., Burma, D. & Chaudhuri, P. Words in DNA sequences: some case studies based on their frequency statistics. J. Math. Biol. 46, 479–503 (2003). https://doi.org/10.1007/s00285-002-0185-3

Issue Date: June 2003
DOI: https://doi.org/10.1007/s00285-002-0185-3
