A peek into the black box: exploring classifiers by randomization

Henelius, Andreas; Puolamäki, Kai; Boström, Henrik; Asker, Lars; Papapetrou, Panagiotis

doi:10.1007/s10618-014-0368-8

A peek into the black box: exploring classifiers by randomization

Published: 23 July 2014

Volume 28, pages 1503–1529, (2014)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Andreas Henelius¹,
Kai Puolamäki¹,
Henrik Boström²,
Lars Asker² &
…
Panagiotis Papapetrou²

2555 Accesses
64 Citations
Explore all metrics

Abstract

Classifiers are often opaque and cannot easily be inspected to gain understanding of which factors are of importance. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm is demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A random forest guided tour

Article 19 April 2016

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

Article Open access 15 November 2019

Notes

The algorithm can be downloaded from https://bitbucket.org/aheneliu/goldeneye/ (accessed 7 July 2014) or easily installed using the command install_bitbucket from the devtools package (Wickham and Chang 2014) as follows: install_bitbucket(repo = “goldeneye”, username = “aheneliu”).

References

Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl Based Syst 8(6):373–389
Article Google Scholar
Bache K, Lichman M (2014) UCI machine learning repository. http://archive.ics.uci.edu/ml
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Article MATH Google Scholar
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
MATH Google Scholar
Chanda P, Cho YR, Zhang A, Ramanathan M (2009) Mining of attribute interactions using information theoretic metrics. In: IEEE International Conference on Data Mining Workshops, pp 350–355
Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3(4):261–283
Google Scholar
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–293
MATH Google Scholar
De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’11, pp 564–572
De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446
Article MATH MathSciNet Google Scholar
Domingos P, Pazzani MJ (1997) On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning 29(2–3):103–130
Article MATH Google Scholar
Freitas AA (2001) Understanding the crucial role of attribute interaction in data mining. Artif Intell Rev 16(3):177–199
Article MATH MathSciNet Google Scholar
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18
Article Google Scholar
Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something i don’t know: Randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’09, pp 379–388
Henelius A, Korpela J, Puolamäki K (2013) Explaining interval sequences by randomization. In: Blockeel H, Kersting K, Nijssen S, Z̆elezný Filip (eds) Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol 8188, pp 337–352
Hornik K, Buchta C, Zeileis A (2009) Open-source machine learning: R meets Weka. Comput Stat 24(2):225–232. doi:10.1007/s00180-008-0119-7
Article MATH MathSciNet Google Scholar
Ishibuchi H, Nojima Y (2007) Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int J Approx Reason 44(1):4–31
Article MATH MathSciNet Google Scholar
Jakulin A, Bratko I, Smrke D, Demsar J, Zupan B (2003) Attribute interactions in medical data analysis. In: 9th Conference on Artificial Intelligence in Medicine in Europe, pp 229–238
Janitza S, Strobl C, Boulesteix AL (2013) An auc-based permutation variable importance measure for random forests. BMC Bioinform 14:119
Article Google Scholar
Johansson U, König R, Niklasson L (2003) Rule extraction from trained neural networks using genetic programming. In: 13th International Conference on Artificial Neural Networks, pp 13–16
Liaw A, Wiener M (2002) Classification and regression by randomforest. R News 2(3):18–22, http://CRAN.R-project.org/doc/Rnews/
Lijffijt J, Papapetrou P, Puolamäki K (2014) A statistical significance testing approach to mining the most informative set of patterns. Data Min Knowl Discov 28:238–263. doi:10.1007/s10618-012-0298-2
Article MATH MathSciNet Google Scholar
Misra G, Golshan B, Terzi E (2012) A Framework for Evaluating the Smoothness of Data-Mining Results. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, vol II, pp 660–675
Ojala M, Garriga GC (2010) Permutation tests for studying classier performance. J Mach Learn Res 11:1833–1863
MATH MathSciNet Google Scholar
Plate T (1999) Accuracy versus interpretability in flexible modeling: implementing a tradeoff using gaussian process models. Behaviormetrika 26:29–50
Article Google Scholar
Pulkkinen P, Koivisto H (2008) Fuzzy classifier identification using decision tree and multiobjective evolutionary algorithms. Int J Approx Reason 48(2):526–543
Article Google Scholar
Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
Google Scholar
R Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org/
Segal MR, Cummings MP, Hubbard AE (2001) Relating amino acid sequence to phenotype: analysis of peptide-binding data. Biometrics 57(2):632–643
Article MATH MathSciNet Google Scholar
Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(25):
Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinform 9:307
Article Google Scholar
Wickham H, Chang W (2014) devtools: Tools to make developing R code easier. http://CRAN.R-project.org/package=devtools, r package version 1.5
Zacarias OP, Boström H (2013) Comparing support vector regression and random forests for predicting malaria incidence in Mozambique. In: International conference on advances in ICT for Emerging regions, IEEE, pp 217–221
Zhao Z, Liu H (2007) Searching for interacting features. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp 1156–1161
Zhao Z, Liu H (2009) Searching for interacting features in subset selection. Intell Data Anal 13(2):207–228
Google Scholar

Download references

Acknowledgments

AH and KP were partly supported by the Revolution of Knowledge Work project, funded by Tekes. HB and LA were partly supported by the project High-Performance Data Mining for Drug Effect Detection at Stockholm University, funded by Swedish Foundation for Strategic Research under grant IIS11-0053.

Author information

Authors and Affiliations

Finnish Institute of Occupational Health, Topeliuksenkatu 41 a A, 00250 , Helsinki, Finland
Andreas Henelius & Kai Puolamäki
Department of Computer and Systems Sciences, Stockholm University, Forum 100, 164 40 , Kista, Sweden
Henrik Boström, Lars Asker & Panagiotis Papapetrou

Authors

Andreas Henelius
View author publications
You can also search for this author in PubMed Google Scholar
Kai Puolamäki
View author publications
You can also search for this author in PubMed Google Scholar
Henrik Boström
View author publications
You can also search for this author in PubMed Google Scholar
Lars Asker
View author publications
You can also search for this author in PubMed Google Scholar
Panagiotis Papapetrou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Andreas Henelius.

Additional information

Responsible editor: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Henelius, A., Puolamäki, K., Boström, H. et al. A peek into the black box: exploring classifiers by randomization. Data Min Knowl Disc 28, 1503–1529 (2014). https://doi.org/10.1007/s10618-014-0368-8

Download citation

Received: 14 February 2014
Accepted: 19 June 2014
Published: 23 July 2014
Issue Date: September 2014
DOI: https://doi.org/10.1007/s10618-014-0368-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A peek into the black box: exploring classifiers by randomization

Abstract

Access this article

Similar content being viewed by others

A random forest guided tour

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A peek into the black box: exploring classifiers by randomization

Abstract

Access this article

Similar content being viewed by others

A random forest guided tour

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation