Optimal String Mining Under Frequency Constraints

  • Johannes Fischer
  • Volker Heun
  • Stefan Kramer
Conference paper

DOI: 10.1007/11871637_17

Part of the Lecture Notes in Computer Science book series (LNCS, volume 4213)
Cite this paper as:
Fischer J., Heun V., Kramer S. (2006) Optimal String Mining Under Frequency Constraints. In: Fürnkranz J., Scheffer T., Spiliopoulou M. (eds) Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science, vol 4213. Springer, Berlin, Heidelberg

Abstract

We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings, and strings that pass other statistical tests, e.g., the χ²-test. In contrast to this result for strings, no optimal algorithms are known for other pattern domains such as itemsets. Our approach builds on several recent results on index structures for strings, among them suffix and lcp arrays, and on a new preprocessing scheme for range minimum queries. The advantages of array-based data structures (compared with dynamic data structures such as trees) are good locality behavior and extensibility to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
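
The index structures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm and does not achieve the optimal bounds claimed above: the suffix array is built naively, the range minimum structure is a plain sparse table rather than the paper's preprocessing scheme, and frequency is counted as occurrences within a single string rather than over a database of strings. All function names are hypothetical and chosen for illustration only.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; for illustration only,
    # far from the linear-time constructions an optimal method needs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # Kasai-style computation: lcp[i] is the length of the longest common
    # prefix of the suffixes at sa[i - 1] and sa[i]; lcp[0] is 0.
    n = len(text)
    rank = [0] * n
    for i, s in enumerate(sa):
        rank[s] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


def build_sparse_table(values):
    # table[k][i] holds min(values[i : i + 2**k]).
    table = [list(values)]
    n = len(values)
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - (1 << k) + 1)])
        k += 1
    return table


def range_min(table, lo, hi):
    # Minimum of values[lo..hi], both ends inclusive, with lo <= hi.
    k = (hi - lo + 1).bit_length() - 1
    return min(table[k][lo], table[k][hi - (1 << k) + 1])


def suffix_lcp(text, sa, table, i, j):
    # Longest common prefix of the suffixes starting at positions i and j,
    # answered as a range minimum query over the lcp array.
    if i == j:
        return len(text) - i
    rank = {s: r for r, s in enumerate(sa)}  # rebuilt per call for brevity
    a, b = sorted((rank[i], rank[j]))
    return range_min(table, a + 1, b)


def count_occurrences(text, sa, pattern):
    # Size of the suffix-array interval whose suffixes start with pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


text = "banana"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
table = build_sparse_table(lcp)
print(sa, lcp, count_occurrences(text, sa, "ana"))
```

All occurrences of a pattern occupy one contiguous interval of the suffix array, so a frequency query reduces to finding the interval boundaries. The sketch keeps everything in flat lists, which reflects the array-based layout the abstract contrasts with tree-based structures.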

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Johannes Fischer (1)
  • Volker Heun (1)
  • Stefan Kramer (2)

  1. Institut für Informatik, Ludwig-Maximilians-Universität München, München
  2. Institut für Informatik/I12, Technische Universität München, Garching b. München