Abstract
We consider a data mining problem in a large collection of unstructured texts based on association rules over subwords of texts. A two-words association pattern is an expression such as (TATA, 30, AGGAGGT) ⇒ C, which expresses the rule that if a text contains the subword TATA followed by another subword AGGAGGT at a distance of no more than 30 letters, then a property C holds with high probability. The optimized confidence pattern problem is to compute frequent patterns (α, k, β) that optimize the confidence with respect to a given collection of texts. Although this problem can be solved in polynomial time by a straightforward algorithm that enumerates all possible patterns in time O(n^5), we focus on the development of more efficient algorithms that can be applied to large text databases. We present an algorithm that solves the optimized confidence pattern problem in time O(max{k, m} n^2) and space O(kn), where m and n are the number and the total length of the classification examples, respectively, and k is a small constant of around 30 to 50. For fast computation, this algorithm combines the suffix tree data structure from combinatorial string matching with the orthogonal range query technique from computational geometry. Furthermore, for most random texts such as DNA sequences, we show that a modification of the algorithm runs very efficiently in time O(kn log^3 n) and space O(kn). We also discuss some heuristics, such as sampling and pruning, as practical improvements. We then evaluate the efficiency and the performance of the algorithm with experiments on genetic sequences. A relationship with efficient agnostic PAC-learning is also discussed.
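To make the pattern semantics concrete, the following Python sketch checks whether a text matches a pattern (α, k, β) and computes the support and confidence of that pattern over a labeled collection. It is a naive illustration only, not the suffix-tree and range-query algorithm of the paper. The function names `find_all`, `matches`, and `support_and_confidence` are hypothetical, and the sketch assumes that "distance" means the gap between the end of α and the start of β, with 0 ≤ gap ≤ k; the paper may define it differently.

```python
from typing import List, Sequence, Tuple


def find_all(text: str, word: str) -> List[int]:
    """Return every start index at which `word` occurs in `text` (overlaps allowed)."""
    starts = []
    i = text.find(word)
    while i != -1:
        starts.append(i)
        i = text.find(word, i + 1)
    return starts


def matches(text: str, alpha: str, k: int, beta: str) -> bool:
    """True if some occurrence of alpha is followed by beta within k letters.

    Assumption (not from the paper): the distance is the gap between the end
    of alpha and the start of beta, and it must satisfy 0 <= gap <= k.
    """
    beta_starts = find_all(text, beta)
    for a in find_all(text, alpha):
        end = a + len(alpha)
        if any(0 <= b - end <= k for b in beta_starts):
            return True
    return False


def support_and_confidence(
    examples: Sequence[Tuple[str, bool]], alpha: str, k: int, beta: str
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule (alpha, k, beta) => C.

    support    = fraction of all texts that match the pattern
    confidence = fraction of matching texts that have property C (label True)
    """
    hits = [label for text, label in examples if matches(text, alpha, k, beta)]
    if not hits:
        return 0.0, 0.0
    return len(hits) / len(examples), sum(hits) / len(hits)


# Toy labeled collection: (text, has property C). Purely illustrative data.
examples = [
    ("CCTATAGGGAGGAGGTCC", True),   # TATA, gap of 3 letters, then AGGAGGT
    ("TATAAGGAGGT", False),         # gap of 0 letters
    ("AGGAGGTTATA", True),          # wrong order, so no match
    ("GGGGCCCC", False),            # neither subword occurs
]
print(support_and_confidence(examples, "TATA", 30, "AGGAGGT"))
```

Enumerating every candidate triple (α, k, β) with a check of this kind is the brute-force approach the abstract contrasts with; the sketch only fixes what "matching" and "confidence" mean for a single pattern.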
Presently working for Fujitsu LTD.
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Arimura, H., Wataki, A., Fujino, R., Arikawa, S. (1998). A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases. In: Richter, M.M., Smith, C.H., Wiehagen, R., Zeugmann, T. (eds) Algorithmic Learning Theory. ALT 1998. Lecture Notes in Computer Science, vol 1501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49730-7_19
DOI: https://doi.org/10.1007/3-540-49730-7_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65013-3
Online ISBN: 978-3-540-49730-1
eBook Packages: Springer Book Archive