A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

Arimura, Hiroki; Wataki, Atsushi; Fujino, Ryoichi; Arikawa, Setsuo

doi:10.1007/3-540-49730-7_19


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1501)

Included in the following conference series:

International Conference on Algorithmic Learning Theory


Abstract

We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-words association pattern is an expression such as (TATA, 30, AGGAGGT) ⇒ C, which expresses the rule that if a text contains a subword TATA followed by another subword AGGAGGT at a distance of no more than 30 letters, then a property C will hold with high probability. The optimized confidence pattern problem is to compute frequent patterns (α, k, β) that optimize the confidence with respect to a given collection of texts. Although this problem can be solved in polynomial time by a straightforward algorithm that enumerates all the possible patterns in time O(n⁵), we focus on the development of more efficient algorithms that can be applied to large text databases. We present an algorithm that solves the optimized confidence pattern problem in time O(max{k, m} n²) and space O(kn), where m and n are the number and the total length of classification examples, respectively, and k is a small constant around 30 to 50. For fast computation, this algorithm combines the suffix tree data structure from combinatorial string matching with the orthogonal range query technique from computational geometry. Furthermore, for most random texts like DNA sequences, we show that a modification of the algorithm runs very efficiently in time O(kn log³ n) and space O(kn). We also discuss some heuristics, such as sampling and pruning, as practical improvements. Then, we evaluate the efficiency and the performance of the algorithm with experiments on genetic sequences. A relationship with efficient agnostic PAC-learning is also discussed.
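
To make the problem statement concrete, the following is a minimal brute-force sketch in Python of matching a pattern (α, k, β) and scoring it on labeled texts. It is not the suffix-tree and range-query algorithm of the paper; it only illustrates the straightforward enumeration baseline on tiny inputs. Several details are assumptions that the abstract does not fix: the distance is taken as the gap between the end of α and the start of β (from 0 to k letters), support is the fraction of texts matched, confidence is the fraction of matched texts that have property C, and the names `matches`, `support_and_confidence`, `best_pattern`, `min_support` and `max_len` are hypothetical.

```python
from itertools import product


def matches(text, alpha, k, beta):
    """Return True if beta starts within k letters after some occurrence of alpha.

    Assumption: 'distance' is the gap between the end of alpha and the
    start of beta, allowed to range over 0..k.
    """
    start = text.find(alpha)
    while start != -1:
        end = start + len(alpha)
        # Any occurrence of beta inside this window starts in [end, end + k].
        window = text[end:end + k + len(beta)]
        if beta in window:
            return True
        start = text.find(alpha, start + 1)
    return False


def support_and_confidence(examples, alpha, k, beta):
    """Score a pattern on examples given as (text, label) pairs.

    Assumption: label is True when the text has property C.
    """
    hits = [label for text, label in examples if matches(text, alpha, k, beta)]
    support = len(hits) / len(examples) if examples else 0.0
    confidence = sum(hits) / len(hits) if hits else 0.0
    return support, confidence


def best_pattern(examples, k, min_support, max_len=4):
    """Naive search: try every pair of subwords of length <= max_len.

    Returns (confidence, alpha, beta) for a frequent pattern of maximum
    confidence, or None if no pattern reaches min_support.
    """
    subwords = sorted({
        text[i:j]
        for text, _ in examples
        for i in range(len(text))
        for j in range(i + 1, min(i + max_len, len(text)) + 1)
    })
    best = None
    for alpha, beta in product(subwords, repeat=2):
        support, confidence = support_and_confidence(examples, alpha, k, beta)
        if support >= min_support and (best is None or confidence > best[0]):
            best = (confidence, alpha, beta)
    return best
```

For example, `support_and_confidence(examples, "TATA", 30, "AGGAGGT")` scores the pattern from the abstract on a list of `(text, label)` pairs. The enumeration in `best_pattern` grows quickly with the text length, which is the cost that the paper's algorithm is designed to avoid.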

Presently working for Fujitsu LTD.

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, Hakozaki 6-10-1, Fukuoka, 812-8581, Japan
Hiroki Arimura, Atsushi Wataki, Ryoichi Fujino & Setsuo Arikawa

Editor information

Editors and Affiliations

AG Künstliche Intelligenz - Expertensysteme, Universität Kaiserslautern, Postfach 3049, D-67653, Kaiserslautern, Germany
Michael M. Richter
Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
Carl H. Smith
AG Algorithmisches Lernen, Universität Kaiserslautern, Postfach 3049, D-67653, Kaiserslautern, Germany
Rolf Wiehagen
Graduate School of Information Science and Electrical Engineering, Department of Informatics, Kyushu University, Kasuga, 816-8580, Japan
Thomas Zeugmann

About this paper

Cite this paper

Arimura, H., Wataki, A., Fujino, R., Arikawa, S. (1998). A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases. In: Richter, M.M., Smith, C.H., Wiehagen, R., Zeugmann, T. (eds) Algorithmic Learning Theory. ALT 1998. Lecture Notes in Computer Science, vol 1501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49730-7_19

DOI: https://doi.org/10.1007/3-540-49730-7_19
Published: 24 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65013-3
Online ISBN: 978-3-540-49730-1
eBook Packages: Springer Book Archive
