Fast, Effective Molecular Feature Mining by Local Optimization

  • Albrecht Zimmermann
  • Björn Bringmann
  • Ulrich Rückert
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6323)

Abstract

In structure-activity relationship (SAR) modeling, one aims to find classifiers that predict the biological or chemical activity of a compound from its molecular graph. Many approaches to SAR use sets of binary substructure features, which test for the occurrence of certain substructures in the molecular graph. As an alternative to enumerating very large sets of frequent patterns, numerous pattern set mining and pattern set selection techniques have been proposed. Existing approaches can be broadly classified into two groups: those that focus on minimizing correspondences, that is, the number of pairs of training instances from different classes with identical encodings, and those that focus on maximizing the number of equivalence classes, that is, unique encodings in the training data. In this paper we evaluate a number of techniques to investigate which criterion is a better indicator of predictive accuracy. We find that minimizing correspondences is a necessary but not sufficient condition for good predictive accuracy, that the number of equivalence classes is a better indicator of success, and that it is important to have a good match between training set size and pattern set size. Based on these results we propose a new, improved algorithm that performs local minimization of correspondences, yet evaluates the effect of patterns on equivalence classes globally. Empirical experiments demonstrate its efficacy and its superior run time behavior.
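The two criteria named in the abstract can be illustrated on a toy example. The sketch below is not the authors' implementation; it only follows the definitions given above (correspondences are pairs of training instances from different classes with identical encodings, equivalence classes are the unique encodings). The function names, the feature vectors, and the labels are all hypothetical.

```python
from collections import Counter, defaultdict


def count_equivalence_classes(encodings):
    """Number of distinct binary encodings in the training data."""
    return len({tuple(enc) for enc in encodings})


def count_correspondences(encodings, labels):
    """Number of instance pairs that share an encoding but differ in class."""
    # Group class counts by encoding (hypothetical helper structure).
    per_encoding = defaultdict(Counter)
    for enc, label in zip(encodings, labels):
        per_encoding[tuple(enc)][label] += 1

    total = 0
    for class_counts in per_encoding.values():
        n = sum(class_counts.values())
        all_pairs = n * (n - 1) // 2
        same_class_pairs = sum(c * (c - 1) // 2 for c in class_counts.values())
        total += all_pairs - same_class_pairs
    return total


# Toy data: each tuple records whether two made-up substructures occur
# in a molecule; the labels are made up as well.
encodings = [(1, 0), (1, 0), (0, 1), (1, 1), (1, 0)]
labels = ["active", "inactive", "active", "inactive", "active"]

n_classes = count_equivalence_classes(encodings)
n_corr = count_correspondences(encodings, labels)
print(n_classes, n_corr)  # prints: 3 2
```

Here three molecules share the encoding (1, 0), and two of the three resulting pairs mix classes, so the pattern set leaves two correspondences while inducing three equivalence classes.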

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Albrecht Zimmermann (1)
  • Björn Bringmann (1)
  • Ulrich Rückert (2)

  1. Katholieke Universiteit Leuven, Leuven, Belgium
  2. EECS Department, UC Berkeley, Berkeley, USA
