Estimating Sequence Similarity from Contig Sets

Ryšavý, Petr; Železný, Filip

doi:10.1007/978-3-319-68765-0_23

Petr Ryšavý¹⁶ &
Filip Železný¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10584))

Included in the following conference series:

International Symposium on Intelligent Data Analysis

1003 Accesses
2 Citations

Abstract

A key task in computational biology is to determine mutual similarity of two genomic sequences. Current bio-technologies are usually not able to determine the full sequential content of a genome from biological material, and rather produce a set of large substrings (contigs) whose order and relative mutual positions within the genome are unknown. Here we design a function estimating the sequential similarity (in terms of the inverse Levenshtein distance) of two genomes, given their respective contig-sets. Our approach consists of two steps, based respectively on an adaptation of the tractable Smith-Waterman local alignment algorithm, and a problem reduction to the weighted interval scheduling problem soluble efficiently with dynamic programming. In hierarchical-clustering experiments with Influenza and Hepatitis genomes, our approach outperforms the standard baseline where only the longest contigs are compared. For high-coverage settings, it also outperforms estimates produced by the recent method [8] that avoids contig construction completely.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
The threshold of 20 nucleotides is set to prevent very short random overlaps.
2.
While the problem is polynomial, the time complexity of the brute-force solution incurs a polynomial of order 4 rendering it unusable even on small datasets.
3.
For example for string ACCGGATT its reversed complement is AATCCGGT.
4.
Often long substrings, called repeats, occur multiple times in a DNA sequence. Assembly algorithms may identify a repeat as a single contig or as two contigs based on the number of mutations.
5.
AF389115, AF389119, AY260942, AY260945, AY260949, AY260955, CY011131, CY011135, CY011143, HE584750, J02147, K00423 and outgroup AM050555. The genomes are available at http://www.ebi.ac.uk/ena/data/view/<accession>.
6.
\((\alpha , l) \in \{0.1, 0.3, 0.5, 0.7, 1, 1.5, 2, 2.5, 3, 4, 5, 7, 10, 15, 20, 30,40,50,70,100\} \times \{3,5,10, 15, 20, 25, 30, 40, 50, 70, 100, 150,200,500\}\).
7.
A sample implementation and supplementary material are available on https://github.com/petrrysavy/ida2017.

References

Fowlkes, E.B., Mallows, C.L.: A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78(383), 553–569 (1983)
Article MATH Google Scholar
Hernandez, D., et al.: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18(5), 802–809 (2008)
Article Google Scholar
Huang, W., Li, L., Myers, J.R., Marth, G.T.: ART: a next-generation sequencing read simulator. Bioinformatics 28(4), 593–594 (2012)
Article Google Scholar
Kleinberg, J., Tardos, E.: Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., Boston (2005)
Google Scholar
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10(8) (1966)
Google Scholar
Marzal, A., Vidal, E.: Computation of normalized edit distance and applications. IEEE Trans. Pattern Anal. Mach. Intell. 15(9), 926–932 (1993)
Article Google Scholar
Nurk, S., Bankevich, A., Antipov, D., Gurevich, A., Korobeynikov, A., Lapidus, A., Prjibelsky, A., Pyshkin, A., Sirotkin, A., Sirotkin, Y., Stepanauskas, R., McLean, J., Lasken, R., Clingenpeel, S.R., Woyke, T., Tesler, G., Alekseyev, M.A., Pevzner, P.A.: Assembling genomes and mini-metagenomes from highly chimeric reads. In: Deng, M., Jiang, R., Sun, F., Zhang, X. (eds.) RECOMB 2013. LNCS, vol. 7821, pp. 158–170. Springer, Heidelberg (2013). doi:10.1007/978-3-642-37195-0_13
Chapter Google Scholar
Ryšavý, P., Železný, F.: Estimating sequence similarity from read sets for clustering sequencing data. In: Boström, H., Knobbe, A., Soares, C., Papapetrou, P. (eds.) IDA 2016. LNCS, vol. 9897, pp. 204–214. Springer, Cham (2016). doi:10.1007/978-3-319-46349-0_18
Chapter Google Scholar
Ryšavý, P., Železný, F.: Estimating Sequence Similarity from Read Sets for Clustering Next-Generation Sequencing Data (preprint, 2017), http://arxiv.org/abs/1705.06125
Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)
Google Scholar
Simpson, J.T., et al.: ABySS: a parallel assembler for short read sequence data. Genome Res. 9(6), 1117–1123 (2009)
Article Google Scholar
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
Article Google Scholar
Warren, R.L., et al.: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23(4), 500–501 (2007)
Article Google Scholar
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)
Article Google Scholar

Download references

Acknowledgment

This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS17/189/OHK3/3T/13. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information

Authors and Affiliations

Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic
Petr Ryšavý & Filip Železný

Authors

Petr Ryšavý
View author publications
You can also search for this author in PubMed Google Scholar
Filip Železný
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Petr Ryšavý .

Editor information

Editors and Affiliations

Imperial College London, London, United Kingdom
Niall Adams
Brunel University London, Uxbridge, United Kingdom
Allan Tucker
Birkbeck, University of London, London, United Kingdom
David Weston

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ryšavý, P., Železný, F. (2017). Estimating Sequence Similarity from Contig Sets. In: Adams, N., Tucker, A., Weston, D. (eds) Advances in Intelligent Data Analysis XVI. IDA 2017. Lecture Notes in Computer Science(), vol 10584. Springer, Cham. https://doi.org/10.1007/978-3-319-68765-0_23

Download citation

DOI: https://doi.org/10.1007/978-3-319-68765-0_23
Published: 04 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68764-3
Online ISBN: 978-3-319-68765-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics