Circular Sequence Comparison with q-grams

  • Roberto Grossi
  • Costas S. Iliopoulos
  • Robert Mercaş
  • Nadia Pisanti
  • Solon P. Pissis
  • Ahmad Retha
  • Fatima Vayani
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9289)

Abstract

Sequence comparison is a fundamental step in many important tasks in bioinformatics. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. Circular genome structure is a common phenomenon in nature, yet specialized alignment techniques for circular sequence comparison are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. In this paper, we introduce a new distance measure based on q-grams, and show how it can be computed efficiently for circular sequence comparison. Experimental results, using real and synthetic data, demonstrate that our approach is orders of magnitude more efficient, while maintaining accuracy highly competitive with the state of the art.
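
To make the idea of a q-gram-based distance between circular sequences concrete, here is a minimal brute-force sketch in Python. It assumes the classical q-gram distance (the sum of absolute differences between q-gram occurrence counts) and simply tries every rotation of the first sequence. The function names `qgram_profile`, `qgram_distance`, and `best_rotation` are hypothetical and introduced only for illustration. This is not the algorithm of the paper, whose contribution is a measure that can be computed efficiently; the sketch only illustrates what is being measured.

```python
from collections import Counter


def qgram_profile(s, q):
    """Count every length-q substring (q-gram) of the linear string s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x, y, q):
    """Sum of absolute differences between the q-gram counts of x and y."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    # A Counter returns 0 for q-grams it has never seen.
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def best_rotation(x, y, q):
    """Brute force: try every rotation of x, keep the first one that
    minimises the q-gram distance to y. Returns (rotation, distance)."""
    best = None
    for r in range(max(len(x), 1)):
        d = qgram_distance(x[r:] + x[:r], y, q)
        if best is None or d < best[1]:
            best = (r, d)
    return best
```

For example, `best_rotation("GATTACA", "TACAGAT", 2)` finds that rotating the first string by 3 positions makes the two q-gram profiles identical. Each call recomputes a full profile per rotation, so the sketch is only suitable for short inputs.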

Keywords

Edit Distance; Suffix Array; Pairwise Sequence Alignment; Random Dataset; Circular Sequence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Roberto Grossi (1)
  • Costas S. Iliopoulos (2)
  • Robert Mercaş (2, 3)
  • Nadia Pisanti (1)
  • Solon P. Pissis (2), corresponding author
  • Ahmad Retha (2)
  • Fatima Vayani (2)
  1. Department of Computer Science, University of Pisa and Erable Team, INRIA, Pisa, Italy
  2. Department of Informatics, King’s College London, London, UK
  3. Department of Computer Science, Kiel University, Kiel, Germany
