Abstract
Classifiers are often opaque and cannot easily be inspected to gain understanding of which factors are of importance. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm are demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.
Notes
The algorithm can be downloaded from https://bitbucket.org/aheneliu/goldeneye/ (accessed 7 July 2014) or easily installed using the command install_bitbucket from the devtools package (Wickham and Chang 2014) as follows: install_bitbucket(repo = "goldeneye", username = "aheneliu").
References
Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl Based Syst 8(6):373–389
Bache K, Lichman M (2014) UCI machine learning repository. http://archive.ics.uci.edu/ml
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
Chanda P, Cho YR, Zhang A, Ramanathan M (2009) Mining of attribute interactions using information theoretic metrics. In: IEEE International Conference on Data Mining Workshops, pp 350–355
Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3(4):261–283
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–293
De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’11, pp 564–572
De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446
Domingos P, Pazzani MJ (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29(2–3):103–130
Freitas AA (2001) Understanding the crucial role of attribute interaction in data mining. Artif Intell Rev 16(3):177–199
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18
Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’09, pp 379–388
Henelius A, Korpela J, Puolamäki K (2013) Explaining interval sequences by randomization. In: Blockeel H, Kersting K, Nijssen S, Železný F (eds) Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol 8188, pp 337–352
Hornik K, Buchta C, Zeileis A (2009) Open-source machine learning: R meets Weka. Comput Stat 24(2):225–232. doi:10.1007/s00180-008-0119-7
Ishibuchi H, Nojima Y (2007) Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int J Approx Reason 44(1):4–31
Jakulin A, Bratko I, Smrke D, Demsar J, Zupan B (2003) Attribute interactions in medical data analysis. In: 9th Conference on Artificial Intelligence in Medicine in Europe, pp 229–238
Janitza S, Strobl C, Boulesteix AL (2013) An AUC-based permutation variable importance measure for random forests. BMC Bioinform 14:119
Johansson U, König R, Niklasson L (2003) Rule extraction from trained neural networks using genetic programming. In: 13th International Conference on Artificial Neural Networks, pp 13–16
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22, http://CRAN.R-project.org/doc/Rnews/
Lijffijt J, Papapetrou P, Puolamäki K (2014) A statistical significance testing approach to mining the most informative set of patterns. Data Min Knowl Discov 28:238–263. doi:10.1007/s10618-012-0298-2
Misra G, Golshan B, Terzi E (2012) A framework for evaluating the smoothness of data-mining results. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, vol II, pp 660–675
Ojala M, Garriga GC (2010) Permutation tests for studying classifier performance. J Mach Learn Res 11:1833–1863
Plate T (1999) Accuracy versus interpretability in flexible modeling: implementing a tradeoff using Gaussian process models. Behaviormetrika 26:29–50
Pulkkinen P, Koivisto H (2008) Fuzzy classifier identification using decision tree and multiobjective evolutionary algorithms. Int J Approx Reason 48(2):526–543
Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
R Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org/
Segal MR, Cummings MP, Hubbard AE (2001) Relating amino acid sequence to phenotype: analysis of peptide-binding data. Biometrics 57(2):632–643
Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform 8:25
Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinform 9:307
Wickham H, Chang W (2014) devtools: Tools to make developing R code easier. http://CRAN.R-project.org/package=devtools, R package version 1.5
Zacarias OP, Boström H (2013) Comparing support vector regression and random forests for predicting malaria incidence in Mozambique. In: International conference on advances in ICT for Emerging regions, IEEE, pp 217–221
Zhao Z, Liu H (2007) Searching for interacting features. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp 1156–1161
Zhao Z, Liu H (2009) Searching for interacting features in subset selection. Intell Data Anal 13(2):207–228
Acknowledgments
AH and KP were partly supported by the Revolution of Knowledge Work project, funded by Tekes. HB and LA were partly supported by the project High-Performance Data Mining for Drug Effect Detection at Stockholm University, funded by the Swedish Foundation for Strategic Research under grant IIS11-0053.
Additional information
Responsible editor: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.
About this article
Cite this article
Henelius, A., Puolamäki, K., Boström, H. et al. A peek into the black box: exploring classifiers by randomization. Data Min Knowl Disc 28, 1503–1529 (2014). https://doi.org/10.1007/s10618-014-0368-8
DOI: https://doi.org/10.1007/s10618-014-0368-8