Skip to main content

Estimating Sequence Similarity from Contig Sets

  • Conference paper
  • First Online:
Advances in Intelligent Data Analysis XVI (IDA 2017)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10584))

Included in the following conference series:

Abstract

A key task in computational biology is to determine mutual similarity of two genomic sequences. Current bio-technologies are usually not able to determine the full sequential content of a genome from biological material, and rather produce a set of large substrings (contigs) whose order and relative mutual positions within the genome are unknown. Here we design a function estimating the sequential similarity (in terms of the inverse Levenshtein distance) of two genomes, given their respective contig-sets. Our approach consists of two steps, based respectively on an adaptation of the tractable Smith-Waterman local alignment algorithm, and a problem reduction to the weighted interval scheduling problem soluble efficiently with dynamic programming. In hierarchical-clustering experiments with Influenza and Hepatitis genomes, our approach outperforms the standard baseline where only the longest contigs are compared. For high-coverage settings, it also outperforms estimates produced by the recent method [8] that avoids contig construction completely.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    The threshold of 20 nucleotides is set to prevent very short random overlaps.

  2. 2.

    While the problem is polynomial, the time complexity of the brute-force solution incurs a polynomial of order 4 rendering it unusable even on small datasets.

  3. 3.

    For example for string ACCGGATT its reversed complement is AATCCGGT.

  4. 4.

    Often long substrings, called repeats, occur multiple times in a DNA sequence. Assembly algorithms may identify a repeat as a single contig or as two contigs based on the number of mutations.

  5. 5.

    AF389115, AF389119, AY260942, AY260945, AY260949, AY260955, CY011131, CY011135, CY011143, HE584750, J02147, K00423 and outgroup AM050555. The genomes are available at http://www.ebi.ac.uk/ena/data/view/<accession>.

  6. 6.

    \((\alpha , l) \in \{0.1, 0.3, 0.5, 0.7, 1, 1.5, 2, 2.5, 3, 4, 5, 7, 10, 15, 20, 30,40,50,70,100\} \times \{3,5,10, 15, 20, 25, 30, 40, 50, 70, 100, 150,200,500\}\).

  7. 7.

    A sample implementation and supplementary material are available on https://github.com/petrrysavy/ida2017.

References

  1. Fowlkes, E.B., Mallows, C.L.: A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78(383), 553–569 (1983)

    Article  MATH  Google Scholar 

  2. Hernandez, D., et al.: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18(5), 802–809 (2008)

    Article  Google Scholar 

  3. Huang, W., Li, L., Myers, J.R., Marth, G.T.: ART: a next-generation sequencing read simulator. Bioinformatics 28(4), 593–594 (2012)

    Article  Google Scholar 

  4. Kleinberg, J., Tardos, E.: Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., Boston (2005)

    Google Scholar 

  5. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10(8) (1966)

    Google Scholar 

  6. Marzal, A., Vidal, E.: Computation of normalized edit distance and applications. IEEE Trans. Pattern Anal. Mach. Intell. 15(9), 926–932 (1993)

    Article  Google Scholar 

  7. Nurk, S., Bankevich, A., Antipov, D., Gurevich, A., Korobeynikov, A., Lapidus, A., Prjibelsky, A., Pyshkin, A., Sirotkin, A., Sirotkin, Y., Stepanauskas, R., McLean, J., Lasken, R., Clingenpeel, S.R., Woyke, T., Tesler, G., Alekseyev, M.A., Pevzner, P.A.: Assembling genomes and mini-metagenomes from highly chimeric reads. In: Deng, M., Jiang, R., Sun, F., Zhang, X. (eds.) RECOMB 2013. LNCS, vol. 7821, pp. 158–170. Springer, Heidelberg (2013). doi:10.1007/978-3-642-37195-0_13

    Chapter  Google Scholar 

  8. Ryšavý, P., Železný, F.: Estimating sequence similarity from read sets for clustering sequencing data. In: Boström, H., Knobbe, A., Soares, C., Papapetrou, P. (eds.) IDA 2016. LNCS, vol. 9897, pp. 204–214. Springer, Cham (2016). doi:10.1007/978-3-319-46349-0_18

    Chapter  Google Scholar 

  9. Ryšavý, P., Železný, F.: Estimating Sequence Similarity from Read Sets for Clustering Next-Generation Sequencing Data (preprint, 2017), http://arxiv.org/abs/1705.06125

  10. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)

    Google Scholar 

  11. Simpson, J.T., et al.: ABySS: a parallel assembler for short read sequence data. Genome Res. 9(6), 1117–1123 (2009)

    Article  Google Scholar 

  12. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)

    Article  Google Scholar 

  13. Warren, R.L., et al.: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23(4), 500–501 (2007)

    Article  Google Scholar 

  14. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)

    Article  Google Scholar 

Download references

Acknowledgment

This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS17/189/OHK3/3T/13. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Petr Ryšavý .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ryšavý, P., Železný, F. (2017). Estimating Sequence Similarity from Contig Sets. In: Adams, N., Tucker, A., Weston, D. (eds) Advances in Intelligent Data Analysis XVI. IDA 2017. Lecture Notes in Computer Science(), vol 10584. Springer, Cham. https://doi.org/10.1007/978-3-319-68765-0_23

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-68765-0_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68764-3

  • Online ISBN: 978-3-319-68765-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics