Safe and Complete Contig Assembly Via Omnitigs
Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs: a set of strings that are promised to appear in any genome that could have generated the reads. Since the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question has never been answered: given a genome graph G (e.g., a de Bruijn graph or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we finally answer this question, and also give a polynomial-time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and that 29% of dbSNP locations have more neighbors in omnitigs than in unitigs.
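For orientation, the unitig baseline that the abstract compares against can be sketched in a few lines. This is only an illustrative sketch, not the omnitig algorithm of the paper. It assumes an edge-centric de Bruijn graph (nodes are (k-1)-mers, edges are k-mers) built from error-free reads, and it ignores reverse complements. The function names `kmers` and `unitigs` are hypothetical, chosen for this example.

```python
from collections import defaultdict


def kmers(reads, k):
    """Return the set of distinct k-mers occurring in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def unitigs(reads, k):
    """Spell the maximal non-branching paths of an edge-centric de Bruijn graph.

    Nodes are (k-1)-mers and edges are k-mers. A node is "internal" when it
    has exactly one incoming and one outgoing edge; a unitig is a path whose
    interior nodes are all internal.
    """
    edges = kmers(reads, k)
    out_edges = defaultdict(list)
    in_deg = defaultdict(int)
    for e in sorted(edges):
        out_edges[e[:-1]].append(e)
        in_deg[e[1:]] += 1

    def is_internal(node):
        return in_deg[node] == 1 and len(out_edges[node]) == 1

    nodes = set(out_edges) | set(in_deg)
    used = set()
    result = []

    # Start a path at every edge that leaves a non-internal node and extend
    # it while the path's end node is internal.
    for node in sorted(nodes):
        if is_internal(node):
            continue
        for e in out_edges[node]:
            path, cur = e, e[1:]
            used.add(e)
            while is_internal(cur):
                nxt = out_edges[cur][0]
                used.add(nxt)
                path += nxt[-1]
                cur = nxt[1:]
            result.append(path)

    # Edges not yet used lie on isolated cycles made only of internal nodes.
    for e in sorted(edges):
        if e in used:
            continue
        path, cur = e, e[1:]
        used.add(e)
        while out_edges[cur][0] not in used:
            nxt = out_edges[cur][0]
            used.add(nxt)
            path += nxt[-1]
            cur = nxt[1:]
        result.append(path)

    return sorted(result)
```

Calling `unitigs(["ATGGCGTGCA"], 3)` on this toy read splits the sequence at the branching nodes `TG` and `GC`, which illustrates why unitigs tend to be short near repeats and why a larger class of safe strings is of interest.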
Keywords: Block Size, Graph Model, Full Version, Contig Assembly, Contig Length
We would like to thank Daniel Lokshtanov for initial discussions, Rayan Chikhi for feedback on the manuscript, and Nidia Obscura Acosta for helpful discussions. This work was supported in part by NSF awards DBI-1356529, IIS-1453527, and IIS-1421908 to PM, and by Academy of Finland grant 274977 to AT.
- 38. Tomescu, A.I., Medvedev, P.: Safe and complete contig assembly via omnitigs (2016). http://arxiv.org/abs/1601.02932