Don’t Be Afraid of Simpler Patterns
This paper investigates the trade-off between the expressiveness of the pattern language and the performance of the pattern miner in structured data mining. We study this trade-off in the context of correlated pattern mining, which is concerned with finding the k best patterns according to a convex criterion, for the pattern languages of itemsets, multi-itemsets, sequences, trees and graphs. The criteria used in our investigation are the typical ones in data mining, computational cost and predictive accuracy, and the domain is that of mining molecular graph databases. More specifically, we provide empirical answers to two questions: how does the expressive power of the language affect the computational cost, and what is the trade-off between the expressiveness of the pattern language and the predictive accuracy of the learned model? While answering the first question, we also introduce a novel stepwise approach to correlated pattern mining, in which the results of mining a simpler pattern language are employed as a starting point for mining in a more complex one. This stepwise approach typically leads to significant speed-ups (up to a factor of 1000) for mining graphs.
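To make the stepwise idea concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that the convex criterion is the chi-square statistic, that the bound used for pruning is the vertex bound of Morishita and Sese, and that "starting point" means seeding the pruning threshold with the k-th best score found in the simpler language. For brevity the simpler language is single items and the more complex one is all itemsets; the toy data and the names `chi2`, `bound` and `top_k` are hypothetical.

```python
import heapq

def chi2(p, n, P, N):
    """Chi-square of the 2x2 table for a pattern covering p of P positives, n of N negatives."""
    total, covered = P + N, p + n
    if covered == 0 or covered == total or P == 0 or N == 0:
        return 0.0
    num = total * (p * (N - n) - n * (P - p)) ** 2
    return num / (P * N * covered * (total - covered))

def bound(p, n, P, N):
    """Upper bound on chi2 for every specialization: a convex function is maximal at a vertex."""
    return max(chi2(p, 0, P, N), chi2(0, n, P, N))

def top_k(data, labels, k, max_len=None, threshold=0.0):
    """Depth-first top-k itemset search; `threshold` may be seeded from a simpler language."""
    items = sorted({i for t in data for i in t})
    P = sum(labels)
    N = len(labels) - P
    heap = []      # min-heap of (score, pattern), holds at most k entries
    visited = 0    # number of candidate patterns evaluated

    def expand(prefix, tids, start):
        nonlocal threshold, visited
        for idx in range(start, len(items)):
            new_tids = [t for t in tids if items[idx] in data[t]]
            if not new_tids:
                continue
            visited += 1
            p = sum(labels[t] for t in new_tids)
            n = len(new_tids) - p
            pattern = prefix + (items[idx],)
            score = chi2(p, n, P, N)
            if score >= threshold:
                heapq.heappush(heap, (score, pattern))
                if len(heap) > k:
                    heapq.heappop(heap)
                if len(heap) == k:
                    threshold = max(threshold, heap[0][0])
            # Descend only if some specialization could still reach the threshold.
            can_grow = max_len is None or len(pattern) < max_len
            if can_grow and bound(p, n, P, N) >= threshold:
                expand(pattern, new_tids, idx + 1)

    expand((), list(range(len(data))), 0)
    return sorted(heap, reverse=True), visited

# Toy data (hypothetical): each transaction is a set of items with a binary class label.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"a", "c"},
        {"b", "c"}, {"c", "d"}, {"d"}, {"b", "d"}]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
k = 3

# Baseline: mine the complex language (all itemsets) from scratch.
full, visited_full = top_k(data, labels, k)

# Stepwise: mine the simple language (single items) first, then reuse its k-th best score.
simple, _ = top_k(data, labels, k, max_len=1)
seed = simple[-1][0] if len(simple) == k else 0.0
stepwise, visited_stepwise = top_k(data, labels, k, threshold=seed)

print(full)
print(visited_full, visited_stepwise)
```

Seeding is safe in this sketch because every single item is also an itemset, so the k-th best score of the simple language can never exceed the k-th best score of the complex one. Both runs therefore return the same patterns, and the seeded run never evaluates more candidates than the baseline.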