Sequence Comparison Tools

Bioinformatics

Abstract

The evolution of methods for capturing genetic sequence data has inspired a parallel evolution of computational tools for analyzing and comparing those data. Indeed, much of the progress in modern biological research has stemmed from the application of such technology. In this chapter we provide an overview of the main classes of tools currently used for sequence comparison. For each class we describe how the tools work, their history, and their current state. Hundreds of different tools have been produced to align, cluster, filter, or otherwise analyze sequence data, and it would be impossible to list them all in this chapter, so we cover only the tools that most readers are likely to encounter. We apologize to researchers who feel that their particular piece of software should have been included here. The reader will notice considerable overlap in concept and application between tools; in many cases one tool or algorithm forms part of another tool’s implementation. Most of the more popular sequence comparison tools are based on ideas and algorithms that can be traced back to the 1960s and 1970s, when the cost of computing power first became low enough to enable widespread development in this area. Where applicable, we describe the original algorithms and then list the iterations of the idea (often by different people in different labs), noting the important changes introduced at each stage. Finally, we describe the software packages currently used by bioinformaticians. A quick search will turn up many papers that formally compare different implementations of a particular algorithm, so while we may note that one algorithm is more efficient or accurate than another, we stress that we have not performed any formal benchmarking or comparison analysis here.

Author information

Corresponding author

Correspondence to Michael Imelfort.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Imelfort, M. (2009). Sequence Comparison Tools. In: Edwards, D., Stajich, J., Hansen, D. (eds) Bioinformatics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92738-1_2
