A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction

Wicker, Jörg; Fenner, Kathrin; Kramer, Stefan

doi:10.1007/978-3-319-31858-5_5


Part of the book series: Studies in Computational Intelligence (SCI, volume 645)


Abstract

One of the main tasks in the chemical industry regarding the sustainability of a product is the prediction of its environmental fate, i.e., its degradation products and pathways. Current methods for predicting the biodegradation products and pathways of organic environmental pollutants either do not take domain knowledge into account or do not provide probability estimates. In this chapter, we propose a hybrid knowledge-based and machine learning-based approach to overcome these limitations in the context of the University of Minnesota Pathway Prediction System (UM-PPS). The proposed solution performs relative reasoning in a machine learning framework and obtains one probability estimate for each biotransformation rule of the system. Since the application of a rule then depends on a threshold for the probability estimate, the trade-off between recall (sensitivity) and precision (selectivity) can be addressed and leveraged in practice. Results from leave-one-out cross-validation show that a recall and precision of approximately 0.8 can be achieved for a subset of 13 transformation rules. The set of rules used is further extended using multi-label classification, where dependencies among the transformation rules are exploited to improve the predictions. While the results regarding recall and precision vary, the area under the ROC curve can be improved using multi-label classification. Therefore, it is possible to optimize precision without compromising recall. Recently, we integrated the presented approach into enviPath, a complete redesign and re-implementation of UM-PPS.
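The threshold mechanism described in the abstract can be illustrated with a minimal sketch. It assumes that each biotransformation rule already has a probability estimate for a given compound; the rule identifiers, the probability values, and the helper names `apply_rules` and `precision_recall` are hypothetical and are not taken from UM-PPS or enviPath.

```python
from typing import Dict, Set, Tuple


def apply_rules(probabilities: Dict[str, float], threshold: float) -> Set[str]:
    """Keep only the rules whose estimated probability reaches the threshold."""
    return {rule for rule, p in probabilities.items() if p >= threshold}


def precision_recall(predicted: Set[str], correct: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the accepted rules against the known-correct rules."""
    true_positives = len(predicted & correct)
    # Convention assumed here: an empty denominator yields 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(correct) if correct else 1.0
    return precision, recall


# Hypothetical probability estimates for one compound (illustrative values only).
estimates = {"r1": 0.92, "r2": 0.65, "r3": 0.40, "r4": 0.10}
# Hypothetical set of rules that are actually observed to apply to this compound.
observed = {"r1", "r3"}

for threshold in (0.3, 0.5, 0.8):
    accepted = apply_rules(estimates, threshold)
    precision, recall = precision_recall(accepted, observed)
    print(f"threshold={threshold:.1f} accepted={sorted(accepted)} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, raising the threshold from 0.3 to 0.8 increases precision from 2/3 to 1.0 while recall drops from 1.0 to 0.5. This is the trade-off that a user of such a system would tune.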


Notes

1. For an overview of multi-label classification, see the paper by Tsoumakas et al. [23].
2. Those 25 pesticides were also tested in our previous experiments investigating the sensitivity and selectivity of the method (see Table 6 in [7]). In addition, 22 other xenobiotics (pharmaceuticals) were used only for determining the reduction of predictions (see Table 4) because their degradation products are not known.
3. We count the false negatives in a slightly different way than in a previous paper [7], as we only consider products that are suggested by any of the biotransformation rules. In other words, we do not take into account products of reactions that are not subsumed by any of the rules. This is done because the method proposed here becomes effective only for the products suggested by the UM-PPS: the classifiers can only restrict the rules, not extend them.
4. Note that any other machine learning algorithm for classification and, similarly, any other method for the computation of substructural or other molecular descriptors could be applied to the problem.
5. We cannot compare our results with those of CATABOL because the system is proprietary and cannot be trained to predict the probability of individual rules; the pathway structure has to be fixed for training (for details we refer to Sect. 7). This means that CATABOL addresses a different problem than the approach presented here.
6. In other words, it shows that informed classifiers do not pay for the rest of the rules.
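The counting convention in note 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the function name `count_outcomes`, the product labels, and the definition of false positives used here are hypothetical. Only the restriction of false negatives to rule-suggested products comes from the note.

```python
from typing import Set, Tuple


def count_outcomes(accepted: Set[str], suggested: Set[str],
                   observed: Set[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives).

    accepted:  products kept after the classifiers restricted the rules
    suggested: products proposed by any biotransformation rule
    observed:  products known from experiments
    """
    # Observed products that no rule suggests are ignored, because the
    # classifiers can only restrict the rules, not extend them.
    reachable = observed & suggested
    true_positives = len(accepted & reachable)
    false_negatives = len(reachable - accepted)
    # Assumption: an accepted product that was not observed is a false positive.
    false_positives = len(accepted - observed)
    return true_positives, false_positives, false_negatives


# Hypothetical example: P9 is observed but not suggested by any rule.
suggested = {"P1", "P2", "P3", "P4"}
accepted = {"P1", "P2"}
observed = {"P1", "P3", "P9"}

print(count_outcomes(accepted, suggested, observed))
```

Here P3 counts as a false negative because a rule suggests it and the classifiers reject it, whereas P9 is not counted at all because no rule produces it.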

References

1. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
2. Button, W.G., Judson, P.N., Long, A., Vessey, J.D.: Using absolute and relative reasoning in the prediction of the potential metabolism of xenobiotics. J. Chem. Inf. Comput. Sci. 43(5), 1371–1377 (2003)
3. Cortes, C., Mohri, M.: AUC optimization vs. error rate minimization. In: Proceedings of the 2003 Conference on Advances in Neural Information Processing Systems, vol. 16, pp. 313–320 (2004)
4. Dimitrov, S., Kamenska, V., Walker, J., Windle, W., Purdy, R., Lewis, M., Mekenyan, O.: Predicting the biodegradation products of perfluorinated chemicals using CATABOL. SAR QSAR Environ. Res. 15(1), 69–82 (2004)
5. Dimitrov, S., Pavlov, T., Nedelcheva, D., Reuschenbach, P., Silvani, M., Bias, R., Comber, M., Low, L., Lee, C., Parkerton, T., et al.: A kinetic model for predicting biodegradation. SAR QSAR Environ. Res. 18(5–6), 443–457 (2007)
6. Ellis, L.B., Roe, D., Wackett, L.P.: The University of Minnesota biocatalysis/biodegradation database: the first decade. Nucleic Acids Res. 34(Database issue), D517–D521 (2006)
7. Fenner, K., Gao, J., Kramer, S., Ellis, L., Wackett, L.: Data-driven extraction of relative reasoning rules to limit combinatorial explosion in biodegradation pathway prediction. Bioinformatics 24(18), 2079–2085 (2008)
8. Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K.: Multilabel classification via calibrated label ranking. Mach. Learn. 73(2), 133–153 (2008)
9. Greene, N., Judson, P., Langowski, J., Marchant, C.: Knowledge-based expert systems for toxicity and metabolism prediction: DEREK, StAR and METEOR. SAR QSAR Environ. Res. 10(2–3), 299–314 (1999)
10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
11. Hou, B.K., Ellis, L.B., Wackett, L.P.: Encoding microbial metabolic logic: predicting biodegradation. J. Ind. Microbiol. Biotechnol. 31(6), 261–272 (2004)
12. http://nar.oxfordjournals.org/content/44/D1/D502
13. Joachims, T., Hofmann, T., Yue, Y., Yu, C.N.: Predicting structured objects with support vector machines. Commun. ACM 52(11), 97–104 (2009)
14. Klopman, G., Tu, M., Talafous, J.: META 3 a genetic algorithm for metabolic transform priorities optimization. J. Chem. Inf. Comput. Sci. 37(2), 329–334 (1997)
15. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning, pp. 185–208. MIT Press (1999)
16. REACH: Regulation (EC) no 1907/2006 of the European Parliament and of the council of 18 December 2006 concerning the registration, evaluation, authorisation and restriction of chemicals (REACH). Off. J. Eur. Union 49, L396 (2006)
17. Read, J., Pfahringer, B., Holmes, G.: Multi-label classification using ensembles of pruned sets. In: 8th IEEE International Conference on Data Mining, pp. 995–1000. IEEE (2008)
18. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Machine Learning and Knowledge Discovery in Databases, pp. 254–269. Springer (2009)
19. Rückert, U., Kramer, S.: Frequent free tree discovery in graph data. In: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 564–570. ACM (2004)
20. Sinclair, C.J., Boxall, A.B.: Assessing the ecotoxicity of pesticide transformation products. Environ. Sci. Technol. 37(20), 4617–4625 (2003)
21. Tsoumakas, G., Dimou, A., Spyromitros, E., Mezaris, V., Kompatsiaris, I., Vlahavas, I.: Correlation-based pruning of stacked binary relevance models for multi-label learning. In: Tsoumakas, G., Zhang, M.L., Zhou, Z.H. (eds.) Proceedings of ECML/PKDD 2009 Workshop on Learning from Multi-Label Data, pp. 101–116 (2009)
22. Tsoumakas, G., Katakis, I., Vlahavas, I.: A review of multi-label classification methods. In: Proceedings of the 2nd ADBIS Workshop on Data Mining and Knowledge Discovery, pp. 99–109 (2006)
23. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer (2010)
24. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: a java library for multi-label learning. J. Mach. Learn. Res. 12, 2411–2414 (2011)
25. Wicker, J., Fenner, K., Ellis, L., Wackett, L., Kramer, S.: Machine learning and data mining approaches to biodegradation pathway prediction. In: Bridewell, W., Calders, T., de Medeiros, A.K., Kramer, S., Pechenizkiy, M., Todorovski, L. (eds.) Proceedings of the 2nd International Workshop on the Induction of Process Models at ECML PKDD 2008 (2008)
26. Wicker, J., Fenner, K., Ellis, L., Wackett, L., Kramer, S.: Predicting biodegradation products and pathways: a hybrid knowledge- and machine learning-based approach. Bioinformatics 26(6), 814–821 (2010)
27. Wicker, J., Pfahringer, B., Kramer, S.: Multi-label classification using Boolean matrix decomposition. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 179–186. ACM (2012)
28. Zhang, M.L., Zhou, Z.H.: A k-nearest neighbor based algorithm for multi-label classification. In: IEEE International Conference on Granular Computing, vol. 2, pp. 718–721. IEEE (2005)


Author information

Authors and Affiliations

Institut für Informatik, Johannes Gutenberg-Universität Mainz, Staudingerweg 9, 55128, Mainz, Germany
Jörg Wicker & Stefan Kramer
Eawag, Swiss Federal Institute for Aquatic Science and Technology, CH-8600, Dübendorf, Switzerland
Kathrin Fenner


Corresponding author

Correspondence to Jörg Wicker.

Editor information

Editors and Affiliations

Department of Computer Science, University of Applied Sciences Zittau/Gö, Goerlitz, Germany
Jörg Lässig
Fraunhofer IAIS, University of Bonn, Sankt Augustin, Germany
Kristian Kersting
Department of Computer Science, TU Dortmund University, Dortmund, Germany
Katharina Morik


About this chapter

Cite this chapter

Wicker, J., Fenner, K., Kramer, S. (2016). A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction. In: Lässig, J., Kersting, K., Morik, K. (eds) Computational Sustainability. Studies in Computational Intelligence, vol 645. Springer, Cham. https://doi.org/10.1007/978-3-319-31858-5_5


DOI: https://doi.org/10.1007/978-3-319-31858-5_5
Published: 21 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31856-1
Online ISBN: 978-3-319-31858-5
eBook Packages: Engineering, Engineering (R0)
