Skip to main content

Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

  • Protocol
  • First Online:
Multiple Sequence Alignment Methods

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1079))

Abstract

Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies—based on simulation, consistency, protein structure, and phylogeny—and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should base their choice of benchmark depending on the context of application—with a keen awareness of the assumptions underlying each benchmarking strategy.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Protocol
USD 49.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 119.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 109.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    Stefano Iantorno and Kevin Gori contributed equally to this work.

References

  1. Kemena C, Notredame C (2009) Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 25(19):2455–2465

    Article  PubMed  CAS  Google Scholar 

  2. Aniba MR, Poch O, Thompson JD (2010) Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Res 38(21):7353–7363

    Article  PubMed  CAS  Google Scholar 

  3. Edgar RC (2010) Quality measures for protein alignment benchmarks. Nucleic Acids Res 38(7):2145–2153

    Article  PubMed  CAS  Google Scholar 

  4. Thompson JD, Linard B, Lecompte O, Poch O (2011) A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One 6(3):e18093

    Article  PubMed  CAS  Google Scholar 

  5. Löytynoja A (2012) Alignment methods: strategies, challenges, benchmarking, and comparative overview. Methods Mol Biol 855:203–235

    Article  PubMed  Google Scholar 

  6. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22):4673–4680

    Article  PubMed  CAS  Google Scholar 

  7. Morrison DA (2009) Why would phylogeneticists ignore computerized sequence alignment? Syst Biol 58(1):150–158

    Article  PubMed  CAS  Google Scholar 

  8. Mardis ER (2008) The impact of next-generation sequencing technology on genetics. Trends Genet 24(3):133–141. doi:10.1016/j.tig.2007.12.007

    Article  PubMed  CAS  Google Scholar 

  9. Anisimova M, Cannarozzi G, Liberles D (2010) Finding the balance between the mathematical and biological optima in multiple sequence alignment. Trends Evol Biol 2(1):e7

    Article  Google Scholar 

  10. Stebbings LA, Mizuguchi K (2004) HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database. Nucleic Acids Res 32(Database issue):D203–D207

    Article  PubMed  CAS  Google Scholar 

  11. Thompson JD, Koehl P, Ripp R, Poch O (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 61:127–136

    Article  PubMed  CAS  Google Scholar 

  12. Stoye J, Evers D, Meyer F (1998) Rose: generating sequence families. Bioinformatics 14(2):157–163

    Article  PubMed  CAS  Google Scholar 

  13. Cartwright RA (2005) DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics 21(Suppl 3):iii31–iii38

    Article  PubMed  CAS  Google Scholar 

  14. Hall BG (2008) Simulating DNA coding sequence evolution with EvolveAGene 3. Mol Biol Evol 25(4):688–695

    Article  PubMed  CAS  Google Scholar 

  15. Fletcher W, Yang Z (2009) INDELible: a flexible simulator of biological sequence evolution. Mol Biol Evol 26(8):1879–1888

    Article  PubMed  CAS  Google Scholar 

  16. Sipos B, Massingham T, Jordan GE, Goldman N (2011) PhyloSim – Monte Carlo simulation of sequence evolution in the R statistical computing environment. BMC Bioinformatics 12(1):104

    Article  PubMed  Google Scholar 

  17. Koestler T, Av H, Ebersberger I (2012) REvolver: modeling sequence evolution under domain constraints. Mol Biol Evol 29(9):2133–2145

    Article  PubMed  CAS  Google Scholar 

  18. Dalquen DA, Anisimova M, Gonnet GH, Dessimoz C (2012) ALF-a simulation framework for genome evolution. Mol Biol Evol 29(4):1115–1123

    Article  PubMed  CAS  Google Scholar 

  19. Thompson JD, Plewniak F, Poch O (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 27(13):2682–2690, gkc432 [pii]

    Article  PubMed  CAS  Google Scholar 

  20. Blackburne BP, Whelan S (2012) Measuring the distance between multiple sequence alignments. Bioinformatics 28(4):495–502. doi:10.1093/bioinformatics/btr701

    Article  PubMed  CAS  Google Scholar 

  21. Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320(5883):1632–1635. doi:10.1126/science.1158395

    Article  PubMed  Google Scholar 

  22. Golubchik T, Wise MJ, Easteal S, Jermiin LS (2007) Mind the gaps: evidence of bias in estimates of multiple sequence alignments. Mol Biol Evol 24(11):2433–2442

    Article  PubMed  CAS  Google Scholar 

  23. Huelsenbeck JP (1995) Performance of phylogenetic methods in simulation. Syst Biol 44(1):17–48

    Google Scholar 

  24. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 15(2):330–340. doi:10.1101/gr.2821705

    Article  PubMed  CAS  Google Scholar 

  25. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205–217. doi:10.1006/jmbi.2000.4042

    Article  PubMed  CAS  Google Scholar 

  26. Lassmann T, Sonnhammer ELL (2005) Automatic assessment of alignment quality. Nucleic Acids Res 33(22):7120–7128

    Article  PubMed  CAS  Google Scholar 

  27. Landan G, Graur D (2007) Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol 24(6):1380–1383

    Article  PubMed  CAS  Google Scholar 

  28. Hall BG (2008) How well does the HoT score reflect sequence alignment accuracy? Mol Biol Evol 25(8):1576–1580

    Article  PubMed  CAS  Google Scholar 

  29. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5(4):823

    PubMed  CAS  Google Scholar 

  30. Mizuguchi K, Deane CM, Blundell TL, Overington JP (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci 7(11):2469–2471. doi:10.1002/pro.5560071126

    Article  PubMed  CAS  Google Scholar 

  31. Thompson JD, Plewniak F, Poch O (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15(1):87–88, btc017 [pii]

    Article  PubMed  CAS  Google Scholar 

  32. Van Walle I, Lasters I, Wyns L (2005) SABmark – a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 21(7):1267–1268. doi:10.1093/bioinformatics/bth493

    Article  PubMed  Google Scholar 

  33. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797. doi:10.1093/nar/gkh340

    Article  PubMed  CAS  Google Scholar 

  34. Gardner P, Wilm A, Washietl S (2005) A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res 33(8):2433–2439

    Article  PubMed  CAS  Google Scholar 

  35. Kim J, Sinha S (2010) Towards realistic benchmarks for multiple alignments of non-coding sequences. BMC Bioinformatics 11:54

    Article  PubMed  Google Scholar 

  36. Mathews DH (2005) Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics 21(10):2246–2253. doi:10.1093/bioinformatics/bti349

    Article  PubMed  CAS  Google Scholar 

  37. Havgaard JH, Lyngso RB, Stormo GD, Gorodkin J (2005) Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics 21(9):1815–1824. doi:10.1093/bioinformatics/bti279

    Article  PubMed  CAS  Google Scholar 

  38. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410

    PubMed  CAS  Google Scholar 

  39. Thompson JD, Fdr P, Ripp R, Thierry J-C, Poch O (2001) Towards a reliable objective function for multiple sequence alignments1. J Mol Biol 314(4):937–951. doi:10.1006/jmbi.2001.5187

    Article  PubMed  CAS  Google Scholar 

  40. Raghava GP, Searle SM, Audley PC, Barber JD, Barton GJ (2003) OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 4:47. doi:10.1186/1471-2105-4-47

    Article  PubMed  CAS  Google Scholar 

  41. Russell RB, Barton GJ (1992) Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins 14(2):309–323. doi:10.1002/prot.340140216

    Article  PubMed  CAS  Google Scholar 

  42. Pop M, Salzberg SL (2008) Bioinformatics challenges of new sequencing technology. Trends Genet 24(3):142–149. doi:10.1016/j.tig.2007.12.006

    Article  PubMed  CAS  Google Scholar 

  43. Berger SA, Stamatakis A (2011) Aligning short reads to reference alignments and trees. Bioinformatics 27(15):2068–2075. doi:10.1093/bioinformatics/btr320

    Article  PubMed  CAS  Google Scholar 

  44. Dessimoz C, Gil M (2010) Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol 11(4):R37

    Article  PubMed  Google Scholar 

  45. Jordan G, Goldman N (2011) The effects of alignment error and alignment filtering on the sitewise detection of positive selection. Mol Biol Evol 29:1125. doi:10.1093/molbev/msr272

    Article  PubMed  Google Scholar 

  46. Blackshields G, Wallace IM, Larkin M, Higgins DG (2006) Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol 6(4):321–339

    PubMed  CAS  Google Scholar 

  47. Lassmann T, Sonnhammer EL (2002) Quality assessment of multiple alignment programs. FEBS Lett 529(1):126–130, S0014579302031897 [pii]

    Article  PubMed  CAS  Google Scholar 

  48. Strope CL, Abel K, Scott SD, Moriyama EN (2009) Biological sequence simulation for testing complex evolutionary hypotheses: indel-Seq-Gen version 2.0. Mol Biol Evol 26(11):2581–2593. doi:10.1093/molbev/msp174

    Article  PubMed  CAS  Google Scholar 

  49. Lassmann T, Sonnhammer EL (2006) Kalign, Kalignvu and Mumsa: web servers for multiple sequence alignment. Nucleic Acids Res 34(Web Server issue):W596–W599. doi:10.1093/nar/gkl191

    Article  PubMed  CAS  Google Scholar 

  50. Kemena C, Taly JF, Kleinjung J, Notredame C (2011) STRIKE: evaluation of protein MSAs using a single 3D structure. Bioinformatics 27(24):3385–3391. doi:10.1093/bioinformatics/btr587

    Article  PubMed  CAS  Google Scholar 

Download references

Acknowledgments

The authors thank Julie Thompson for helpful feedback on the manuscript. CD is supported by SNSF advanced researcher fellowship #136461. This article started as assignment for the graduate course “Reviews in Computational Biology” at the Cambridge Computational Biology Institute, University of Cambridge.

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Iantorno, S., Gori, K., Goldman, N., Gil, M., Dessimoz, C. (2014). Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment. In: Russell, D. (eds) Multiple Sequence Alignment Methods. Methods in Molecular Biology, vol 1079. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-646-7_4

Download citation

  • DOI: https://doi.org/10.1007/978-1-62703-646-7_4

  • Published:

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-62703-645-0

  • Online ISBN: 978-1-62703-646-7

  • eBook Packages: Springer Protocols

Publish with us

Policies and ethics