Fast, Effective Molecular Feature Mining by Local Optimization

Zimmermann, Albrecht; Bringmann, Björn; Rückert, Ulrich

doi:10.1007/978-3-642-15939-8_36


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6323)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases


Abstract

In structure-activity relationship (SAR) modeling, one aims to find classifiers that predict the biological or chemical activity of a compound from its molecular graph. Many approaches to SAR use sets of binary substructure features, which test for the occurrence of certain substructures in the molecular graph. As an alternative to enumerating very large sets of frequent patterns, numerous pattern set mining and pattern set selection techniques have been proposed. Existing approaches can be broadly classified into those that focus on minimizing correspondences, that is, the number of pairs of training instances from different classes with identical encodings, and those that focus on maximizing the number of equivalence classes, that is, unique encodings in the training data. In this paper we evaluate a number of techniques to investigate which criterion is a better indicator of predictive accuracy. We find that minimizing correspondences is a necessary but not sufficient condition for good predictive accuracy, that equivalence classes are a better indicator of success, and that it is important to have a good match between training set size and pattern set size. Based on these results we propose a new, improved algorithm that minimizes correspondences locally, yet evaluates the effect of patterns on equivalence classes globally. Empirical experiments demonstrate its efficacy and its superior run-time behavior.
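The two criteria contrasted in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not the authors' algorithm: it assumes each training compound has already been encoded as a tuple of binary substructure features, and the function names `count_equivalence_classes` and `count_correspondences` are hypothetical. It counts equivalence classes as distinct encodings, and correspondences as pairs of instances that share an encoding but carry different class labels.

```python
from collections import Counter, defaultdict


def count_equivalence_classes(encodings):
    """Number of distinct binary encodings in the training data."""
    return len({tuple(e) for e in encodings})


def count_correspondences(encodings, labels):
    """Number of instance pairs with identical encodings but different labels."""
    # Group instances by encoding and count labels within each group.
    by_encoding = defaultdict(Counter)
    for encoding, label in zip(encodings, labels):
        by_encoding[tuple(encoding)][label] += 1

    total = 0
    for class_counts in by_encoding.values():
        n = sum(class_counts.values())
        all_pairs = n * (n - 1) // 2
        same_class_pairs = sum(c * (c - 1) // 2 for c in class_counts.values())
        # Pairs inside this group that mix two different classes.
        total += all_pairs - same_class_pairs
    return total


# Toy data (made up for illustration): five compounds, two substructure features.
encodings = [(1, 0), (1, 0), (0, 1), (1, 0), (0, 0)]
labels = ["active", "inactive", "active", "inactive", "inactive"]

print(count_equivalence_classes(encodings))      # distinct encodings
print(count_correspondences(encodings, labels))  # mixed-class pairs
```

In this toy data the encoding `(1, 0)` is shared by one active and two inactive compounds, so it contributes two correspondences, while the three distinct encodings form three equivalence classes. A pattern set that splits that group further would reduce correspondences and increase the number of equivalence classes at the same time.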



Author information

Authors and Affiliations

Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001, Leuven, Belgium
Albrecht Zimmermann & Björn Bringmann
EECS Department, UC Berkeley, 750 Sutardja Dai Hall #1776, Berkeley, CA 94720-1776, USA
Ulrich Rückert


Editor information

Editors and Affiliations

Departamento de Matemáticas, Estadística y Computación, Universidad de Cantabria, Avenida de los Castros, s/n, 39071, Santander, Spain
José Luis Balcázar
Yahoo! Research Barcelona, Avinguda Diagonal 177, 08018, Barcelona, Spain
Francesco Bonchi
Yahoo! Research Barcelona, Avinguda Diagonal 177, 08018, Barcelona, Spain
Aristides Gionis
TAO, CNRS-INRIA-LRI, Université Paris-Sud, 91405, Orsay, France
Michèle Sebag


Cite this paper

Zimmermann, A., Bringmann, B., Rückert, U. (2010). Fast, Effective Molecular Feature Mining by Local Optimization. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2010. Lecture Notes in Computer Science, vol 6323. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15939-8_36


DOI: https://doi.org/10.1007/978-3-642-15939-8_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15938-1
Online ISBN: 978-3-642-15939-8
eBook Packages: Computer Science, Computer Science (R0)
