Binary Classification Models Comparison: On the Similarity of Datasets and Confusion Matrix for Predictive Toxicology Applications

Makhtar, Mokhairi; Neagu, Daniel C.; Ridley, Mick J.

doi:10.1007/978-3-642-23208-4_11

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6865)

Abstract

Nowadays, generating predictive models by applying machine learning and model ensemble techniques is a faster task, facilitated by the development of more user-friendly data mining tools. However, such progress raises issues related to model management: once developed, many classifiers, for example, become accessible in collections of models. Choosing the relevant model from the collection can reduce the cost of generating new predictive models: calculating the similarity of predictive models is the key to ranking them, which may improve model selection or combination. To this end, we introduce a methodology to measure the similarity of classifiers by comparing their datasets, transfer functions and confusion matrices. We propose the Dataset Similarity Coefficient to calculate the similarity of datasets, and the Similarity of Models measure to calculate the similarity between such predictive models. In this paper we focus on toxicology applications of binary classification models. The results show that our methodology performs well in measuring model similarity across a collection of classifiers.
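
To make the idea of comparing classifiers by their datasets and confusion matrices concrete, here is a minimal sketch in Python. It is not the Dataset Similarity Coefficient or the Similarity of Models measure proposed in the paper, since their definitions do not appear in this abstract. As stand-ins, it uses the Jaccard index for dataset overlap and a normalised L1 distance between confusion matrices, and it leaves out transfer functions entirely. All function names, the equal weighting, and the example data are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

# Confusion matrix of a binary classifier as (TP, FP, FN, TN).
ConfusionMatrix = Tuple[int, int, int, int]


def jaccard_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Overlap of two datasets, identified here by record IDs (stand-in measure)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty datasets are treated as identical
    return len(set_a & set_b) / len(union)


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionMatrix:
    """Count TP, FP, FN, TN for labels encoded as 1 (positive) and 0 (negative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def confusion_similarity(cm_a: ConfusionMatrix, cm_b: ConfusionMatrix) -> float:
    """1 minus half the L1 distance between the two normalised matrices."""
    total_a, total_b = sum(cm_a), sum(cm_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("confusion matrices must contain at least one prediction")
    rates_a = [count / total_a for count in cm_a]
    rates_b = [count / total_b for count in cm_b]
    # The L1 distance between two probability vectors lies in [0, 2],
    # so halving it maps the distance onto [0, 1].
    distance = sum(abs(x - y) for x, y in zip(rates_a, rates_b)) / 2
    return 1.0 - distance


def model_similarity(dataset_sim: float, cm_sim: float, weight: float = 0.5) -> float:
    """Hypothetical combined score: weighted mean of the two similarities."""
    return weight * dataset_sim + (1.0 - weight) * cm_sim


# Example with made-up record IDs and predictions for two classifiers.
data_sim = jaccard_similarity(["c1", "c2", "c3"], ["c2", "c3", "c4"])
cm_1 = confusion_matrix([1, 1, 0, 0], [1, 0, 0, 1])
cm_2 = confusion_matrix([1, 1, 0, 0], [1, 1, 0, 0])
score = model_similarity(data_sim, confusion_similarity(cm_1, cm_2))
```

Scores from a pairwise comparison like this could then be used to rank the models in a collection against a reference model, which is the use case the abstract motivates.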

Author information

Authors and Affiliations

School of Computing, Informatics and Media, University of Bradford, Bradford, BD7 1DP, UK
Mokhairi Makhtar, Daniel C. Neagu & Mick J. Ridley

Editor information

Editors and Affiliations

Department of Computer Science, Ludwig-Maximilians-Universität, Oettingenstrasse 67, 80538, München, Germany
Christian Böhm
Department of Computer Science, San José State University, One Washington Square, 95192-0249, San José, CA, U.S.A.
Sami Khuri
Faculty of Electrical Engineering, Department of Cybernetics, Czech Technical University, Technicka 2, 166 27, Prague 6, Czech Republic
Lenka Lhotská
Dipartimento di Informatica, Università di Pisa, Italy
Nadia Pisanti

Cite this paper

Makhtar, M., Neagu, D.C., Ridley, M.J. (2011). Binary Classification Models Comparison: On the Similarity of Datasets and Confusion Matrix for Predictive Toxicology Applications. In: Böhm, C., Khuri, S., Lhotská, L., Pisanti, N. (eds) Information Technology in Bio- and Medical Informatics. ITBAM 2011. Lecture Notes in Computer Science, vol 6865. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23208-4_11

DOI: https://doi.org/10.1007/978-3-642-23208-4_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23207-7
Online ISBN: 978-3-642-23208-4
eBook Packages: Computer Science, Computer Science (R0)
