Machine Learning

, Volume 83, Issue 2, pp 137–161 | Cite as

Effective feature construction by maximum common subgraph sampling

  • Leander Schietgat
  • Fabrizio Costa
  • Jan Ramon
  • Luc De Raedt
Article

Abstract

The standard approach to feature construction and predictive learning in molecular datasets is to employ computationally expensive graph mining techniques and to bias the feature search exploration using frequency or correlation measures. These features are then typically employed in predictive models that can be constructed using, for example, SVMs or decision trees. We take a different approach: rather than mining for all optimal local patterns, we extract features from the set of pairwise maximum common subgraphs. The maximum common subgraphs are computed under the block-and-bridge-preserving subgraph isomorphism from the outerplanar examples in polynomial time. We empirically observe a significant increase in predictive performance when using maximum common subgraph features instead of correlated local patterns on 60 benchmark datasets from NCI. Moreover, we show that when we randomly sample the pairs of graphs from which to extract the maximum common subgraphs, we obtain a smaller set of features that still allows the same predictive performance as methods that exhaustively enumerate all possible patterns. The sampling strategy turns out to be a very good compromise between a slight decrease in predictive performance (although still remaining comparable with state-of-the-art methods) and a significant runtime reduction (two orders of magnitude on a popular medium size chemoinformatics dataset). This suggests that maximum common subgraphs are interesting and meaningful features.

Keywords

Feature generation Subgraph mining Structure-activity learning Chemoinformatics 

References

  1. Ben-David, S., Eiron, N., & Simon, H. U. (2002). Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3, 441–461. CrossRefMathSciNetGoogle Scholar
  2. Bringmann, B., Zimmermann, A., Raedt, L. D., & Nijssen, S. (2006). Don’t be afraid of simpler patterns. In Proceedings of the tenth European conference on principles and practice of knowledge discovery in databases (pp. 55–66). Google Scholar
  3. Bunke, H., & Shearer, K. (1998). A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19, 255–259. CrossRefMATHGoogle Scholar
  4. Ceroni, A., Costa, F., & Frasconi, P. (2007). Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics, 23(16), 2038–2045. CrossRefGoogle Scholar
  5. Chaoji, V., Al Hasan, M., Salem, S., Besson, J., & Zaki, J. M. (2008). Origami: a novel and effective approach for mining representative orthogonal graph patterns. Statistical Analysis and Data Mining, 1(2), 67–84. CrossRefMathSciNetGoogle Scholar
  6. De Raedt, L. (2008). Logical and relational learning. Berlin: Springer. CrossRefMATHGoogle Scholar
  7. De Raedt, L., & Ramon, J. (2009). Deriving distance metrics from generality relations. Pattern Recognition Letters, 30(3), 187–191. CrossRefMATHGoogle Scholar
  8. Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30. Google Scholar
  9. Deshpande, M., Kuramochi, M., Wale, N., & Karypis, G. (2005). Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering, 17(8), 1036–1050. CrossRefGoogle Scholar
  10. Diestel, R. (2000). Graph theory. Berlin: Springer. Google Scholar
  11. Garey, M. R., & Johnson, D. (1979). Computers and intractability: a guide to the theory of NP-completeness. New York: Freeman. MATHGoogle Scholar
  12. Gärtner, T. (2005). Kernels for structured data. PhD thesis, University of Bonn, Germany. Google Scholar
  13. Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123. CrossRefGoogle Scholar
  14. He, H., & Singh, A. K. (2006). Graphrank: statistical modeling and mining of significant subgraphs in the feature space. In ICDM ’06: proceedings of the sixth international conference on data mining, Washington, DC, USA (pp. 885–890). Las Alamitos: IEEE Comput. Soc. CrossRefGoogle Scholar
  15. Horváth, T., Gärtner, T., & Wrobel, S. (2004). Cyclic pattern kernels for predictive graph mining. In KDD ’04: proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 158–167). Google Scholar
  16. Horváth, T., Ramon, J., & Wrobel, S. (2006). Frequent subgraph mining in outerplanar graphs. In Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, PA, August 2006, pp. 197–206. Google Scholar
  17. Joachims, T. (2002). Learning to classify text using support vector machines: methods, theory, and algorithms. Berlin: Springer. Google Scholar
  18. Karunaratne, T., & Boström, H. (2006). Learning to classify structured data by graph propositionalization. In Proceedings of the second IASTED international conference on computational intelligence (pp. 393–398). Google Scholar
  19. Kramer, S., De Raedt, L., & Helma, C. (2001). Molecular feature mining in HIV data. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, KDD-01 (pp. 136–143). New York: ACM. CrossRefGoogle Scholar
  20. Kramer, S., Lavrač, N., & Flach, P. (2001). Propositionalization approaches to relational data mining. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 262–291). Berlin: Springer. Google Scholar
  21. Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1), 32–38. CrossRefMATHMathSciNetGoogle Scholar
  22. Plotkin, G. (1971). A further note on inductive generalization. In Machine intelligence (Vol. 6, pp. 101–124). Edinburgh: Edinburgh University Press. Google Scholar
  23. Provost, F., & Fawcett, T. (1998). Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In Proceedings of the third international conference on knowledge discovery and data mining (pp. 43–48). Menlo Park: AAAI Press. Google Scholar
  24. Raymond, J., & Willett, P. (2002). Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of Computer-Aided Molecular Design, 16, 521–533. CrossRefGoogle Scholar
  25. Schietgat, L., Ramon, J., Bruynooghe, M., & Blockeel, H. (2008). An efficiently computable graph-based metric for the classification of small molecules. In Lecture notes in artificial intelligence : Vol. 5255. Proceedings of the eleventh international conference on discovery science (pp. 197–209). Berlin: Springer. Google Scholar
  26. Sebag, M. (1997). Distance induction in first order logic. In N. Lavrač & S. Džeroski (Eds.), Lecture notes in artificial intelligence : Vol. 1297. Proceedings of the seventh international workshop on inductive logic programming (pp. 264–272). Berlin: Springer. Google Scholar
  27. Swamidass, S. J., Chen, J., Bruand, J., Phung, P., Ralaivola, L., & Baldi, P. (2005). Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 21(suppl_1), 359–368. CrossRefGoogle Scholar
  28. Wale, N., Watson, I., & Karypis, G. (2008). Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14, 347–375. CrossRefGoogle Scholar
  29. Watanabe, S. (1960). Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1), 66–82. CrossRefMATHGoogle Scholar
  30. Willett, P. (2006). Similarity-based virtual screening using 2D fingerprints. Drug Discovery Today, 11(23/24), 1046–1051. CrossRefGoogle Scholar
  31. Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE international conference on data mining, ICDM 2002, Japan (pp. 721–724). Las Alamitos: IEEE Comput. Soc. Google Scholar

Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Leander Schietgat
    • 1
  • Fabrizio Costa
    • 1
  • Jan Ramon
    • 1
  • Luc De Raedt
    • 1
  1. 1.Department of Computer ScienceKatholieke Universiteit LeuvenLeuvenBelgium

Personalised recommendations