Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Alignment accuracy is therefore crucial to a vast range of analyses, often in ways that are difficult to assess within those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies (based on simulation, consistency, protein structure, and phylogeny) and discuss their respective advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should choose a benchmark according to the context of application, with a keen awareness of the assumptions underlying each benchmarking strategy.
The authors thank Julie Thompson for helpful feedback on the manuscript. CD is supported by SNSF advanced researcher fellowship #136461. This article started as an assignment for the graduate course “Reviews in Computational Biology” at the Cambridge Computational Biology Institute, University of Cambridge.