A Hybrid Machine Learning and Knowledge Based Approach to Limit Combinatorial Explosion in Biodegradation Prediction

Chapter
Part of the Studies in Computational Intelligence book series (SCI, volume 645)

Abstract

One of the main tasks in the chemical industry regarding the sustainability of a product is the prediction of its environmental fate, i.e., its degradation products and pathways. Current methods for predicting the biodegradation products and pathways of organic environmental pollutants either do not take domain knowledge into account or do not provide probability estimates. In this chapter, we propose a hybrid knowledge-based and machine learning-based approach to overcome these limitations in the context of the University of Minnesota Pathway Prediction System (UM-PPS). The proposed solution performs relative reasoning in a machine learning framework and obtains one probability estimate for each biotransformation rule of the system. Since the application of a rule then depends on a threshold for its probability estimate, the trade-off between recall (sensitivity) and precision (selectivity) can be addressed and leveraged in practice. Results from leave-one-out cross-validation show that a recall and precision of approximately 0.8 can be achieved for a subset of 13 transformation rules. The set of rules used is further extended using multi-label classification, where dependencies among the transformation rules are exploited to improve the predictions. While the results regarding recall and precision vary, the area under the ROC curve can be improved using multi-label classification. Therefore, it is possible to optimize precision without compromising recall. Recently, we integrated the presented approach into enviPath, a complete redesign and re-implementation of UM-PPS.
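The role of the probability threshold can be illustrated with a minimal sketch. It is not the chapter's implementation: the rule names, the probability values, and the set of correct rules below are made-up toy data, and the helper functions `select_rules` and `precision_recall` are hypothetical names introduced only for illustration. The sketch assumes that some classifier has already produced one probability estimate per transformation rule for a given compound, and shows how moving the threshold trades recall against precision.

```python
from typing import Dict, List, Set, Tuple


def select_rules(probabilities: Dict[str, float], threshold: float) -> Set[str]:
    """Keep only the rules whose estimated probability reaches the threshold."""
    return {rule for rule, p in probabilities.items() if p >= threshold}


def precision_recall(predicted: Set[str], correct: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the predicted rule set against the correct one.

    Empty denominators are mapped to 1.0 by convention (an assumption made
    here to keep the toy example free of divisions by zero).
    """
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(correct) if correct else 1.0
    return precision, recall


# Toy data (assumed, not taken from the chapter): per-rule probability
# estimates for one compound, and the rules that are actually observed.
EXAMPLE_PROBABILITIES = {"rule_a": 0.9, "rule_b": 0.6, "rule_c": 0.3, "rule_d": 0.1}
EXAMPLE_CORRECT = {"rule_a", "rule_c"}


def sweep(thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each threshold."""
    results = []
    for t in thresholds:
        predicted = select_rules(EXAMPLE_PROBABILITIES, t)
        precision, recall = precision_recall(predicted, EXAMPLE_CORRECT)
        results.append((t, precision, recall))
    return results


if __name__ == "__main__":
    for t, precision, recall in sweep([0.2, 0.5, 0.8]):
        print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, a low threshold applies more rules and recovers both correct ones at the cost of a false positive, while a high threshold applies only the most probable rule and misses one correct rule. In a real setting the threshold would be chosen from cross-validated estimates rather than from a single compound.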

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Institut für Informatik, Johannes Gutenberg-Universität Mainz, Mainz, Germany
  2. Eawag, Swiss Federal Institute for Aquatic Science and Technology, Dübendorf, Switzerland
