Efficient Online String Matching Based on Characters Distance Text Sampling

Faro, Simone; Marino, Francesco Pio; Pavone, Arianna

doi:10.1007/s00453-020-00732-4

Published: 20 June 2020

Volume 82, pages 3390–3412, (2020)

Algorithmica

Abstract

Searching for all occurrences of a pattern in a text is a fundamental problem in computer science, with applications in many other fields such as natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach, recently introduced to overcome the prohibitive space requirements of index construction on the one hand, and to drastically reduce the searching time of online solutions on the other. In this paper we present a new algorithm for the sampled string matching problem, based on a characters distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search the sampled data online for any occurrence of the sampled pattern, before verifying each candidate against the original text. From a theoretical point of view we prove that, under suitable conditions, our solution can achieve both linear worst-case time complexity and optimal average-time complexity. From a practical point of view, our solution shows sub-linear behaviour and speeds up online searching by a factor of up to 9, using limited additional space ranging from 11% down to 2.8% of the text size, a gain of up to 50% compared with previous solutions.
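
The idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `sample_distances` and `search` are hypothetical, the sampled sequence is scanned naively rather than with a fast online matcher, the text is sampled on every call rather than once in a preprocessing step, and the fallback to a plain scan for patterns containing fewer than two pivot characters is an assumption made only to keep the example complete.

```python
def sample_distances(s: str, pivot: str) -> tuple[list[int], list[int]]:
    """Return the pivot positions in s and the distances between consecutive ones."""
    positions = [i for i, ch in enumerate(s) if ch == pivot]
    distances = [b - a for a, b in zip(positions, positions[1:])]
    return positions, distances


def search(text: str, pattern: str, pivot: str) -> list[int]:
    """Return the start positions of all occurrences of pattern in text."""
    t_pos, t_dist = sample_distances(text, pivot)
    p_pos, p_dist = sample_distances(pattern, pivot)

    if len(p_pos) < 2:
        # Assumption for this sketch: with fewer than two pivots there is no
        # sampled pattern to look for, so fall back to a plain scan.
        return [i for i in range(len(text) - len(pattern) + 1)
                if text.startswith(pattern, i)]

    hits = []
    k = len(p_dist)
    # Scan the sampled text for the sampled pattern.
    for j in range(len(t_dist) - k + 1):
        if t_dist[j:j + k] == p_dist:
            # Align the first pivot of the pattern with the j-th pivot of the
            # text, then verify the candidate against the original text.
            start = t_pos[j] - p_pos[0]
            if start >= 0 and text.startswith(pattern, start):
                hits.append(start)
    return hits
```

For example, `search("abracadabra abracadabra", "cadab", "a")` returns `[4, 16]`: the sampled pattern is the single distance 2, which occurs five times in the sampled text, and verification against the original text discards the three false candidates. Any true occurrence is always found, because the pivot characters inside an occurrence are consecutive pivot occurrences of the text.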

Notes

Search speed of an online string matching algorithm may depend on the length of the pattern. Typical search speed of a fast solution, on a modern laptop computer, goes from 1 GB/s (in the case of short patterns) to 5 GB/s (in the case of very long patterns) [5].
The search speed of a fast offline solution does not depend on the length of the text and is typically under 1 ms per query.
According to their theoretical evaluation and their experimental results, when searching an English text the best performance is obtained when the least 13 characters are removed from the original alphabet.
In practical cases we can implement our solution with a block size \(k=256\), which allows the elements of the sequence \(\dot{y}\) to be represented using a single byte. In such a case the assumption \(k\ge \sigma\) is plausible for any practical application.
The Smart tool is available online for download at http://www.dmi.unict.it/~faro/smart/ or at https://github.com/smart-tool/smart.
Specifically, the text buffer is the concatenation of two different texts: the King James version of the Bible (3.9 MB) and the CIA World Factbook (2.4 MB). The first 5 MB of the resulting text buffer were used in our experiments.

References

Aho, A.V., Hopcroft, J.E., Ullman, J.D.: The Design and Analysis of Computer Algorithms. Addison-Wesley, London (1974)
Apostolico, A.: The myriad virtues of suffix trees. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words. NATO Advanced Science Institutes, Series F, vol. 12, pp. 85–96. Springer, Berlin (1985)
Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Commun. ACM 20(10), 762–772 (1977)
Cantone, D., Faro, S., Giaquinta, E.: Adapting Boyer-Moore-like algorithms for searching Huffman encoded texts. Int. J. Found. Comput. Sci. 23(2), 343–356 (2012)
Cantone, D., Faro, S., Pavone, A.: Speeding up string matching by weak factor recognition. Stringology 2017, 42–50 (2017)
Claude, F., Navarro, G., Peltola, H., Salmela, L., Tarhio, J.: String matching with alphabet sampling. J. Discrete Algorithms 11, 37–50 (2012)
Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., Rytter, W.: Speeding up two string-matching algorithms. Algorithmica 12(4), 247–267 (1994)
Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13 (2013)
Faro, S., Lecroq, T., Borzì, S., Di Mauro, S., Maggio, A.: The String Matching Algorithms Research Tool. In: Proceedings of Stringology, pp. 99–111 (2016)
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Fredriksson, K., Grabowski, S.: A general compression algorithm that supports fast searching. Inf. Process. Lett. 100(6), 226–232 (2006)
Grabowski, S., Raniszewski, M.: Sampling the suffix array with minimizers. In: Proceedings of String Processing and Information Retrieval (SPIRE 2015), Lecture Notes in Computer Science, vol. 9309, pp. 287–298. Springer (2015)
Horspool, R.N.: Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980)
Karkkainen, J., Ukkonen, E.: Sparse suffix trees. In: Proceedings of 2nd Annual International Conference on Computing and Combinatorics (COCOON), LNCS 1090, pp. 219–230 (1996)
Klein, S.T., Shapira, D.: A new compression method for compressed matching. In: Data Compression Conference, IEEE, pp. 400–409 (2000)
Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. ACM Trans. Inf. Syst. 15(2), 124–136 (1997)
Manber, U., Myers, G.: Suffix arrays: a new method for online string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Moura, E., Navarro, G., Ziviani, N., Baeza-Yates, R.: Fast and flexible word searching on compressed text. ACM Trans. Inf. Syst. 18(2), 113–139 (2000)
Navarro, G., Tarhio, J.: LZgrep: a Boyer-Moore string matching tool for Ziv-Lempel compressed text. Softw. Pract. Exp. 35, 1107–1130 (2005)
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: CIAC, pp. 306–315 (2000)
Yao, A.C.: The complexity of pattern matching for a random string. SIAM J. Comput. 8(3), 368–387 (1979)

Author information

Authors and Affiliations

Dipartimento di Matematica e Informatica, Università di Catania, viale A.Doria n.6, 95125, Catania, Italy
Simone Faro & Francesco Pio Marino
Dipartimento di Scienze Cognitive, Università di Messina, via Concezione n.6, 98122, Messina, Italy
Arianna Pavone

Corresponding author

Correspondence to Simone Faro.

About this article

Cite this article

Faro, S., Marino, F.P. & Pavone, A. Efficient Online String Matching Based on Characters Distance Text Sampling. Algorithmica 82, 3390–3412 (2020). https://doi.org/10.1007/s00453-020-00732-4

Received: 23 April 2018
Accepted: 03 June 2020
Published: 20 June 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s00453-020-00732-4
