Evolutionary insights from suffix array-based genome sequence analysis

Poddar, Anindya; Chandra, Nagasuma; Ganapathiraju, Madhavi; Sekar, K.; Klein-Seetharaman, Judith; Reddy, Raj; Balakrishnan, N.

doi:10.1007/s12038-007-0087-z

Published: 06 November 2007

Journal of Biosciences, Volume 32, pages 871–881 (2007)

Anindya Poddar¹, Nagasuma Chandra¹,², Madhavi Ganapathiraju¹,³, K. Sekar¹,², Judith Klein-Seetharaman³, Raj Reddy³ & N. Balakrishnan¹,³

Abstract

Gene and protein sequence analyses, central components of studies in modern biology, are readily amenable to string matching and pattern recognition algorithms. The growing need to analyse whole genome sequences more efficiently and thoroughly has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures that are well known in many other areas and are highly suited to sequence analysis as well. Here we report an improvement to the design of suffix array construction. The enhanced versatility and scalability enabled by this approach are demonstrated using real-life examples.
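For readers unfamiliar with the data structure, the following is a minimal Python sketch of a suffix array and a binary-search lookup on it. It is not the improved construction reported in the article, which the abstract does not describe. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction is only practical for short illustrative sequences.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction (an assumption for illustration): sorting suffix slices
    costs O(n^2 log n) in the worst case, far too slow for whole genomes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of `pattern` in `text` using `sa`."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All matches occupy the contiguous slice sa[start:lo].
    return sorted(sa[start:lo])


if __name__ == "__main__":
    dna = "ACGTACGT"  # toy sequence, not real data
    sa = build_suffix_array(dna)
    print(find_occurrences(dna, sa, "CGT"))
```

Because every occurrence of a pattern is a prefix of some suffix, all matches sit in one contiguous block of the sorted array, which is why two binary searches are enough to locate them.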

The scalability of the algorithm to whole genomes makes it suitable for addressing many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bigrams and higher n-grams, which indicates that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for their coverage of the possible peptide space. This analysis indicates that as much as a quarter of the total space at the tetra-peptide level is left unsampled in prokaryotic organisms, although almost all tri-peptides can be seen in one protein or another in a proteome. In addition, distinct patterns begin to emerge in the counts of particular tetra- and higher peptides, indicative of a ‘meaning’ for tetra- and higher n-grams.
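The n-gram counts and peptide-space coverage mentioned above can be illustrated with a small Python sketch. It is an assumption-laden toy, not the authors' toolkit: the helper names `ngram_counts` and `peptide_space_coverage` are hypothetical, the suffix array is built naively, and coverage is taken to mean the fraction of the 20^n possible peptides over the 20 standard amino acids that occur at least once.

```python
from itertools import groupby

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_counts(sequence, n):
    """Count every length-n substring of `sequence` via a naive suffix array."""
    sa = sorted(range(len(sequence)), key=lambda i: sequence[i:])
    # Suffixes shorter than n cannot contribute an n-gram, so skip them.
    prefixes = (sequence[i:i + n] for i in sa if i + n <= len(sequence))
    # Equal n-grams are adjacent in suffix-array order, so groupby can count them.
    return {gram: sum(1 for _ in group) for gram, group in groupby(prefixes)}


def peptide_space_coverage(proteins, n):
    """Return the fraction of the 20**n possible n-peptides seen in `proteins`."""
    alphabet = set(AMINO_ACIDS)
    observed = set()
    for protein in proteins:
        # Ignore n-grams containing non-standard symbols such as 'X'.
        observed.update(g for g in ngram_counts(protein, n) if set(g) <= alphabet)
    return len(observed) / len(AMINO_ACIDS) ** n


if __name__ == "__main__":
    toy_proteome = ["ACAC", "CA"]  # made-up sequences for illustration
    print(ngram_counts("ACAC", 2))
    print(peptide_space_coverage(toy_proteome, 2))
```

The size of the peptide space grows as 20^n (8,000 tri-peptides, 160,000 tetra-peptides), which gives some intuition for why coverage can be nearly complete at n = 3 yet incomplete at n = 4.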

The toolkit has also been used to demonstrate the usefulness of efficiently identifying repeats in whole proteomes. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv, have been found to contain a repeating sequence of 300 amino acids.
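One common way to find repeats with a suffix array is to compare lexicographically adjacent suffixes, since any repeated substring is a shared prefix of two neighbouring suffixes. The sketch below illustrates that general idea on a single toy sequence. It is not the procedure used in the article, and the function name `longest_repeat` is hypothetical.

```python
def longest_repeat(text):
    """Return the longest substring that occurs at least twice in `text`.

    Occurrences may overlap. The suffix array is built naively, so this is
    only meant for short illustrative sequences.
    """
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    best_len, best_pos = 0, 0
    for a, b in zip(sa, sa[1:]):
        # Length of the common prefix of two adjacent suffixes.
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        if k > best_len:
            best_len, best_pos = k, a
    return text[best_pos:best_pos + best_len]


if __name__ == "__main__":
    toy_protein = "MKVLAAGMKVLA"  # made-up sequence for illustration
    print(longest_repeat(toy_protein))
```

To search for repeats shared across several proteins, the sequences would typically be concatenated with unique separator symbols before building the array. That extension is omitted here.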

Author information

Authors and Affiliations

Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, 560 012, India
Anindya Poddar, Nagasuma Chandra, Madhavi Ganapathiraju, K. Sekar & N. Balakrishnan
Bioinformatics Centre, Indian Institute of Science, Bangalore, 560 012, India
Nagasuma Chandra & K. Sekar
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Madhavi Ganapathiraju, Judith Klein-Seetharaman, Raj Reddy & N. Balakrishnan

Corresponding author

Correspondence to N. Balakrishnan.

About this article

Cite this article

Poddar, A., Chandra, N., Ganapathiraju, M. et al. Evolutionary insights from suffix array-based genome sequence analysis. J Biosci 32 (Suppl 1), 871–881 (2007). https://doi.org/10.1007/s12038-007-0087-z

Published: 06 November 2007
Issue Date: August 2007
DOI: https://doi.org/10.1007/s12038-007-0087-z
