Optimal String Mining Under Frequency Constraints

Fischer, Johannes; Heun, Volker; Kramer, Stefan

doi:10.1007/11871637_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4213)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings and strings that pass other statistical tests, e.g., the χ²-test. In contrast to the presented result for strings, no optimal algorithms are known for other pattern domains such as itemsets. The key to our approach are several recent results on index structures for strings, among them suffix- and lcp-arrays, and a new preprocessing scheme for range minimum queries. The advantages of array-based data structures (compared with dynamic data structures such as trees) are good locality behavior and extensibility to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
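
To make the index structures named in the abstract concrete, the following is a minimal sketch, not the paper's optimal algorithm. It assumes that the frequency of a pattern is the number of database strings containing it as a substring, builds the suffix array naively by sorting (so it is not linear time), and counts distinct strings by scanning the matching suffix-array interval instead of using range minimum queries. The names `SEP`, `build_index`, `frequency`, `is_emerging`, `min_pos` and `max_neg` are hypothetical and chosen only for this illustration. The lcp-array is computed with the method of Kasai et al. purely to show the second index structure; the frequency query below does not use it.

```python
from bisect import bisect_right

SEP = "\x00"  # assumed separator; must not occur in any database string


def suffix_array(text):
    """Naive construction: sort suffix start positions by suffix (not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai et al.: lcp[i] is the longest common prefix of suffixes sa[i-1] and sa[i]."""
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


def build_index(db):
    """Concatenate all strings (each followed by SEP) and index the result."""
    text = "".join(s + SEP for s in db)
    starts, pos = [], 0
    for s in db:
        starts.append(pos)  # start offset of each database string
        pos += len(s) + 1
    return text, suffix_array(text), starts


def _bound(text, sa, pattern, strict):
    """First suffix-array index whose length-m prefix is >= (or > if strict) pattern."""
    lo, hi, m = 0, len(sa), len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (strict and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def frequency(index, pattern):
    """Number of database strings that contain the non-empty pattern."""
    text, sa, starts = index
    left = _bound(text, sa, pattern, strict=False)
    right = _bound(text, sa, pattern, strict=True)
    # Map each occurrence to the string it lies in and count distinct strings.
    return len({bisect_right(starts, sa[k]) - 1 for k in range(left, right)})


def is_emerging(pos_index, neg_index, pattern, min_pos, max_neg):
    """Hypothetical predicate: frequent in one database, infrequent in the other."""
    return (frequency(pos_index, pattern) >= min_pos
            and frequency(neg_index, pattern) <= max_neg)


pos = build_index(["abab", "bab", "ca"])
neg = build_index(["cc", "ca"])
print(frequency(pos, "ab"))               # 2
print(is_emerging(pos, neg, "ab", 2, 0))  # True
```

All occurrences of a pattern form one contiguous interval of the suffix array, which is why two binary searches suffice to locate them. Scanning that interval to count distinct strings costs time proportional to the number of occurrences, which is the step a linear-time approach would need to avoid.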

Author information

Authors and Affiliations

Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstr. 17, D-80333, München
Johannes Fischer & Volker Heun
Institut für Informatik/I12, Technische Universität München, Boltzmannstr. 3, D-85748, Garching b. München
Stefan Kramer

Editor information

Editors and Affiliations

Knowledge Engineering Group, Technische Universität Darmstadt
Johannes Fürnkranz
Max Planck Institute for Computer Science, Saarbrücken, Germany
Tobias Scheffer
Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany
Myra Spiliopoulou

Cite this paper

Fischer, J., Heun, V., Kramer, S. (2006). Optimal String Mining Under Frequency Constraints. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds) Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science, vol 4213. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11871637_17

DOI: https://doi.org/10.1007/11871637_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45374-1
Online ISBN: 978-3-540-46048-0
eBook Packages: Computer Science, Computer Science (R0)
