A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

  • Conference paper
Algorithmic Learning Theory (ALT 1998)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1501)

Abstract

We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-words association pattern is an expression such as (TATA, 30, AGGAGGT) ⇒ C, which expresses the rule that if a text contains a subword TATA followed by another subword AGGAGGT at a distance of no more than 30 letters, then a property C will hold with high probability. The optimized confidence pattern problem is to compute frequent patterns (α, k, β) that optimize the confidence with respect to a given collection of texts. Although this problem can be solved in polynomial time by a straightforward algorithm that enumerates all possible patterns in time O(n^5), we focus on the development of more efficient algorithms that can be applied to large text databases. We present an algorithm that solves the optimized confidence pattern problem in time O(max{k, m} n^2) and space O(kn), where m and n are the number and the total length of classification examples, respectively, and k is a small constant, around 30 to 50. For fast computation, this algorithm combines the suffix tree data structure from combinatorial string matching with the orthogonal range query technique from computational geometry. Furthermore, for most random texts such as DNA sequences, we show that a modification of the algorithm runs very efficiently in time O(kn log^3 n) and space O(kn). We also discuss some heuristics, such as sampling and pruning, as practical improvements. We then evaluate the efficiency and the performance of the algorithm with experiments on genetic sequences. A relationship with efficient agnostic PAC-learning is also discussed.
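
To make the pattern semantics concrete, the following is a minimal Python sketch of the straightforward enumeration approach that the abstract mentions as a baseline. It is not the authors' suffix-tree algorithm. Several details are assumptions made for illustration only: "distance" is taken to be the number of letters between the end of α and the start of β, support is the fraction of texts a pattern matches, confidence is the fraction of matched texts that have property C, and the names matches, substrings, best_pattern, min_support and max_len are hypothetical.

```python
from itertools import product


def matches(text, alpha, k, beta):
    """True if alpha occurs in text and beta starts at most k letters after alpha ends.

    Assumption: the gap is counted from the end of alpha to the start of beta.
    """
    start = text.find(alpha)
    while start != -1:
        end = start + len(alpha)
        # beta must lie entirely inside text[end : end + k + len(beta)],
        # which means it starts somewhere in [end, end + k].
        if text.find(beta, end, end + k + len(beta)) != -1:
            return True
        start = text.find(alpha, start + 1)
    return False


def substrings(texts, max_len):
    """All distinct subwords of the texts with length 1..max_len."""
    subs = set()
    for t in texts:
        for i in range(len(t)):
            for j in range(i + 1, min(i + max_len, len(t)) + 1):
                subs.add(t[i:j])
    return subs


def best_pattern(examples, k, min_support, max_len=3):
    """Brute-force search over pairs of subwords for a fixed distance k.

    examples: list of (text, label) pairs, label True if property C holds.
    Returns (confidence, alpha, beta) for a frequent pattern with maximal
    confidence, or None if no pattern reaches min_support.
    """
    texts = [t for t, _ in examples]
    words = sorted(substrings(texts, max_len))
    best = None
    for alpha, beta in product(words, repeat=2):
        hit = [label for t, label in examples if matches(t, alpha, k, beta)]
        if not hit or len(hit) < min_support * len(examples):
            continue  # not frequent enough
        conf = sum(hit) / len(hit)
        if best is None or conf > best[0]:
            best = (conf, alpha, beta)
    return best
```

For example, matches("xxTATAyyyAGGAGGTzz", "TATA", 30, "AGGAGGT") holds under these assumptions, while swapping the order of the two subwords in the text makes it fail. The sketch bounds subword length with max_len only to keep the enumeration small; the cost of this kind of exhaustive search is what motivates the faster algorithms described in the abstract.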

Presently working for Fujitsu LTD.

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Arimura, H., Wataki, A., Fujino, R., Arikawa, S. (1998). A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases. In: Richter, M.M., Smith, C.H., Wiehagen, R., Zeugmann, T. (eds) Algorithmic Learning Theory. ALT 1998. Lecture Notes in Computer Science, vol 1501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49730-7_19

  • DOI: https://doi.org/10.1007/3-540-49730-7_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-65013-3

  • Online ISBN: 978-3-540-49730-1

  • eBook Packages: Springer Book Archive
