Optimal String Mining Under Frequency Constraints

  • Johannes Fischer
  • Volker Heun
  • Stefan Kramer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4213)

Abstract

We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings, and strings that pass other statistical tests, e.g., the χ²-test. In contrast to this result for strings, no optimal algorithms are known for other pattern domains such as itemsets. Our approach rests on several recent results on index structures for strings, among them suffix arrays and lcp arrays, and on a new preprocessing scheme for range minimum queries. Compared with dynamic data structures such as trees, array-based data structures offer good locality behavior and extend naturally to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
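The index structures named in the abstract can be illustrated with a minimal Python sketch. It is not the framework of the paper: the suffix array is built by naive sorting rather than in linear time, and the frequency query is answered by binary search instead of the lcp- and RMQ-based traversal the abstract alludes to. The function names (`suffix_array`, `lcp_array`, `sa_interval`, `support`) and the `"\x00"` separator are assumptions made for illustration only.

```python
def suffix_array(text):
    # Naive construction by sorting all suffixes (NOT linear time;
    # used here only to keep the sketch short).
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # Kasai-style computation: lcp[i] is the length of the longest common
    # prefix of the suffixes starting at sa[i-1] and sa[i]; lcp[0] = 0.
    n = len(text)
    rank = [0] * n
    for i, s in enumerate(sa):
        rank[s] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def sa_interval(text, sa, pattern):
    # Binary search for the half-open range [left, right) of suffix-array
    # positions whose suffixes start with `pattern`.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


def support(db, pattern):
    # Number of database strings that contain `pattern` at least once.
    # Assumes no string in `db` (and no pattern) contains the separator.
    text = "\x00".join(db)
    doc_of = []  # doc_of[p] = index of the database string covering position p
    for d, s in enumerate(db):
        doc_of.extend([d] * (len(s) + 1))  # +1 accounts for the separator
    sa = suffix_array(text)
    left, right = sa_interval(text, sa, pattern)
    return len({doc_of[sa[k]] for k in range(left, right)})
```

With this sketch, a frequency constraint such as "contained in at least two strings of the database" is checked as `support(db, pattern) >= 2`. Rebuilding the index for every query is wasteful and is done here only for brevity; a real implementation would build it once and enumerate all qualifying strings in a single pass.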

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Johannes Fischer (1)
  • Volker Heun (1)
  • Stefan Kramer (2)
  1. Institut für Informatik, Ludwig-Maximilians-Universität München, München
  2. Institut für Informatik/I12, Technische Universität München, Garching b. München
