The Colored Longest Common Prefix Array Computed via Sequential Scans

Garofalo, Fabio; Rosone, Giovanna; Sciortino, Marinella; Verzotto, Davide

doi:10.1007/978-3-030-00479-8_13

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11147)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Due to the increased availability of large datasets of biological sequences, tools for sequence comparison now rely to a greater extent on efficient alignment-free approaches. Most alignment-free approaches require the computation of statistics when comparing sequences, even though such computations may not scale well in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that makes it possible to tackle several problems efficiently with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, which consists of the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances. Experimental results confirm the high practical efficiency of our approach.
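
As an illustration of the quantity behind the multi-string ACS problem, the following is a minimal brute-force sketch in Python. It assumes the usual reading of the average common substring score: for every starting position of the query string, take the length of the longest substring beginning there that also occurs in the target string, then average these lengths. The function names are hypothetical, the sketch runs entirely in internal memory, and it does not use the cLCP or the sequential-scan strategy of the paper. The normalization that turns scores into ACS induced distances is not shown.

```python
def longest_match_lengths(query: str, target: str) -> list[int]:
    """For each start i in query, return the length of the longest
    prefix of query[i:] that occurs somewhere in target."""
    lengths = []
    for i in range(len(query)):
        k = 0
        # Extend the match one character at a time while it still occurs.
        while i + k < len(query) and query[i:i + k + 1] in target:
            k += 1
        lengths.append(k)
    return lengths


def acs_score(query: str, target: str) -> float:
    """Average of the longest match lengths over all query positions."""
    if not query:
        raise ValueError("query must be non-empty")
    lengths = longest_match_lengths(query, target)
    return sum(lengths) / len(lengths)


def multi_acs_scores(query: str, collection: list[str]) -> list[float]:
    """Naive multi-string variant: one score per string in the collection."""
    return [acs_score(query, target) for target in collection]
```

For the query "abab" and the collection ["bab", "xyz"], the match lengths against "bab" are 2, 3, 2 and 1, so the scores are 2.0 and 0.0. This naive version rescans each target for every query position, which is exactly the kind of cost that motivates a lightweight approach on large collections.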

G.R. and M.S. are partially supported and D.V. is supported by the project Italian MIUR-SIR CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences”) grant n. RBSI146R5L.

Author information

Authors and Affiliations

University of Palermo, Palermo, Italy
Fabio Garofalo & Marinella Sciortino
University of Pisa, Pisa, Italy
Giovanna Rosone & Davide Verzotto

Corresponding author

Correspondence to Giovanna Rosone.

Editor information

Editors and Affiliations

Diego Portales University, Santiago, Chile
Travis Gagie
The University of Melbourne, Melbourne, VIC, Australia
Alistair Moffat
University of Chile, Santiago, Chile
Gonzalo Navarro
Universidad de Ingeniería y Tecnología, Lima, Peru
Ernesto Cuadros-Vargas

About this paper

Cite this paper

Garofalo, F., Rosone, G., Sciortino, M., Verzotto, D. (2018). The Colored Longest Common Prefix Array Computed via Sequential Scans. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science, vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_13

DOI: https://doi.org/10.1007/978-3-030-00479-8_13
Published: 14 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00478-1
Online ISBN: 978-3-030-00479-8
eBook Packages: Computer Science, Computer Science (R0)
