Binary Classification Models Comparison: On the Similarity of Datasets and Confusion Matrix for Predictive Toxicology Applications

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6865)

Abstract

Generating predictive models with machine learning and model ensemble techniques has become faster as data mining tools have grown more user-friendly. This progress, however, raises model management issues: once developed, classifiers accumulate in collections of models. Choosing a relevant model from such a collection can reduce the cost of generating new predictive models, and calculating the similarity of predictive models is the key to ranking them, which may improve model selection or combination. To this end, we introduce a methodology that measures the similarity of classifiers by comparing their datasets, transfer functions and confusion matrices. We propose the Dataset Similarity Coefficient to calculate the similarity of datasets, and the Similarity of Models measure to calculate the similarity between predictive models. In this paper we focus on toxicology applications of binary classification models. The results show that our methodology performs well in measuring model similarity across a collection of classifiers.
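The abstract does not give the definitions of the Dataset Similarity Coefficient or the Similarity of Models measure, so the following is only an illustrative sketch of the general idea of scoring two binary classifiers by their datasets and confusion matrices. Everything in it is an assumption rather than the authors' method: the Jaccard index stands in for a dataset similarity, one minus the total variation distance between normalised confusion matrices stands in for a confusion-matrix similarity, and the names `jaccard`, `confusion_similarity`, `model_similarity` and the `weight` parameter are hypothetical.

```python
from typing import Set, Tuple

# Binary confusion matrix as counts: (TP, FN, FP, TN).
ConfusionMatrix = Tuple[int, int, int, int]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two sets, e.g. descriptor names or compound IDs.

    Assumption: used here as a stand-in dataset similarity, not the
    paper's Dataset Similarity Coefficient.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def confusion_similarity(m1: ConfusionMatrix, m2: ConfusionMatrix) -> float:
    """1 minus the total variation distance between normalised matrices.

    Returns a value in [0, 1]; 1 means identical cell proportions.
    """
    n1, n2 = sum(m1), sum(m2)
    if n1 == 0 or n2 == 0:
        raise ValueError("confusion matrices must contain at least one instance")
    p1 = [c / n1 for c in m1]
    p2 = [c / n2 for c in m2]
    return 1.0 - 0.5 * sum(abs(x - y) for x, y in zip(p1, p2))


def model_similarity(
    data1: Set[str],
    data2: Set[str],
    m1: ConfusionMatrix,
    m2: ConfusionMatrix,
    weight: float = 0.5,
) -> float:
    """Hypothetical weighted blend of dataset and confusion-matrix similarity."""
    return weight * jaccard(data1, data2) + (1.0 - weight) * confusion_similarity(m1, m2)


if __name__ == "__main__":
    # Made-up descriptor sets and confusion matrices, for illustration only.
    reference = ({"logP", "MW", "LUMO"}, (40, 10, 5, 45))
    candidates = {
        "model_A": ({"logP", "MW", "LUMO"}, (38, 12, 6, 44)),
        "model_B": ({"logP", "HOMO"}, (20, 30, 25, 25)),
    }
    ranked = sorted(
        candidates,
        key=lambda name: model_similarity(reference[0], candidates[name][0],
                                          reference[1], candidates[name][1]),
        reverse=True,
    )
    print(ranked)
```

Ranking a collection against a reference model in this way mirrors the model selection use case the abstract describes; a transfer-function comparison, which the abstract also mentions, is left out of the sketch.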




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Makhtar, M., Neagu, D.C., Ridley, M.J. (2011). Binary Classification Models Comparison: On the Similarity of Datasets and Confusion Matrix for Predictive Toxicology Applications. In: Böhm, C., Khuri, S., Lhotská, L., Pisanti, N. (eds) Information Technology in Bio- and Medical Informatics. ITBAM 2011. Lecture Notes in Computer Science, vol 6865. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23208-4_11

  • DOI: https://doi.org/10.1007/978-3-642-23208-4_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23207-7

  • Online ISBN: 978-3-642-23208-4

  • eBook Packages: Computer Science, Computer Science (R0)
