Abstract
Pattern-based classification has demonstrated its power in recent studies. Because mining discriminative patterns as classification features is very expensive, several efficient algorithms have been proposed to reduce this cost. These algorithms assume that the feature values of the mined patterns are binary, i.e., a pattern either occurs or it does not. In some problems, however, the number of times a pattern appears is more informative than whether it appears at all. To address these deficiencies, we propose a mathematical programming method that directly mines discriminative patterns as numerical features for classification. We also propose a novel search-space shrinking technique that addresses the inefficiencies of iterative pattern mining algorithms. Finally, we show that our method is an order of magnitude faster, significantly more memory efficient, and more accurate than current approaches.
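To make the distinction between binary and numerical pattern features concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes that patterns are contiguous subsequences and that overlapping occurrences are counted; the function names `count_occurrences` and `featurize` and the toy data are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def count_occurrences(sequence: Sequence[str], pattern: Sequence[str]) -> int:
    """Count contiguous, possibly overlapping, occurrences of pattern in sequence."""
    n, m = len(sequence), len(pattern)
    if m == 0 or m > n:
        return 0
    target = tuple(pattern)
    return sum(1 for i in range(n - m + 1) if tuple(sequence[i:i + m]) == target)


def featurize(
    sequences: List[Sequence[str]], patterns: List[Sequence[str]]
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (binary, numerical) feature matrices, one row per sequence."""
    binary, numerical = [], []
    for seq in sequences:
        counts = [count_occurrences(seq, p) for p in patterns]
        numerical.append(counts)                            # how often each pattern occurs
        binary.append([1 if c > 0 else 0 for c in counts])  # whether it occurs at all
    return binary, numerical


# Toy data (assumed for illustration): two sequences, two candidate patterns.
sequences = [list("abababab"), list("abcc")]
patterns = [list("ab"), list("cc")]
binary, numerical = featurize(sequences, patterns)
print(binary)     # [[1, 0], [1, 1]]
print(numerical)  # [[4, 0], [1, 1]]
```

In this toy case the pattern "ab" is present in both sequences, so its binary feature cannot separate them, while its occurrence count (4 versus 1) can.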
Research was sponsored in part by the U.S. National Science Foundation under grants CCF-0905014 and CNS-0931975, by Air Force Office of Scientific Research MURI award FA9550-08-1-0265, and by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (NS-CTA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. The second author was supported by the National Science Foundation under grant OCI-07-25070 and by the state of Illinois. The third author was supported by an NDSEG PhD Fellowship.
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, H., Kim, S., Weninger, T., Han, J., Abdelzaher, T. (2010). NDPMine: Efficiently Mining Discriminative Numerical Features for Pattern-Based Classification. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2010. Lecture Notes in Computer Science, vol 6322. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15883-4_3
DOI: https://doi.org/10.1007/978-3-642-15883-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15882-7
Online ISBN: 978-3-642-15883-4
eBook Packages: Computer Science, Computer Science (R0)