Abstract
The literature lacks definitions for quantitative measures of model interpretability that would support automatic model selection for both high accuracy and high interpretability; hence, we define inherent model interpretability. We extend the work of Lipton et al. and Liu et al. from qualitative and subjective concepts of model interpretability to objective criteria and quantitative measures. We also develop another new measure called simplicity of sensitivity and illustrate prior, initial and posterior measurement. The measures are tested and validated, and some are recommended for use. It is demonstrated that high accuracy and high interpretability are jointly achievable with little to no sacrifice in either.
Keywords
 Model interpretability
 Model transparency
 Support vector machines
 Kernels
1 Introduction
For machine learning (ML) models, data and results, there is a demand for transparency, ease of understanding and explanations [24] to satisfy a citizen’s “right to explanation” in the European Union [20] and to meet health care requirements for justification and explanation [7, 22].
Without quantitative measures of transparency and understandability, doctors (or users) will select models which maximize accuracy but may unnecessarily or unintentionally neglect or sacrifice transparency and understandability, or they will choose models in an ad hoc manner to try and meet all criteria. We refer to the transparency and understandability of models as inherent model interpretability—defined further in Sect. 3.
We propose criteria and measures of inherent model interpretability to help a doctor select ML models (Table 1 steps 1 and 2) which are more transparent and understandable, in a quantitative and objective manner. More transparent models can offer additional views of results (Table 1 step 3) for interpretation. Our measures facilitate the inclusion of better models as candidates and the selection of better models for use.
Some of our proposed measures are specific to support vector machines (SVM), as one popular ML method. We perform experiments to validate the SVM measures against a set of propositions and evaluate their utility by concordance or matched pair agreement.
Notably, the proposed measures do not provide an interpretation or explanation. They also do not indicate how useful or meaningful a model is in the context of data. For example, a model that always classifies patient data as belonging to the positive class is very understandable (interpretable). We can easily construct the explanation of the model and result—all patients are classified as positive—but that does not mean that the model is useful, meaningful, appropriate, or unbiased. Accuracy and common sense address the latter issues. The proposed measures only indicate how understandable a model is, i.e., how likely we are able to provide an interpretation, as the necessary basis for subsequent explanation.
Making ML more interpretable facilitates its use in health care because there is a perception that ML is a black box [31] lacking interpretability, which inhibits its use. Greater use is important because for a good number of health care problems and data, ML methods offer better accuracy in classification [12, 15, 41] than common alternatives among statistical methods, decision trees, rule-based methods and instance-based methods. Interpretable ML also facilitates research on models and model fit.
2 Notation
A machine learning task begins with data in a matrix X consisting of N instances \(\underline{x}_{i}\) which are vectors, each containing \(n\) features.
Entry \(x_{i,j}\) in the matrix is the \(j^{\text{th}}\) feature of instance \(\underline{x}_{i}\). We assume real-valued features, converting any atomic data type to reals as needed (Appendix A).
A supervised learning task also has N targets (or outcomes) in a vector \(\underline{y}\), which are binary in classification or continuous in regression.
In binary classification there are \(N^{+}\) instances in the positive class and \(N^{-}\) instances in the negative class.
We refer to a posterior model (e.g., Fig. 1), or simply model, as a learning method (e.g., SVM, neural networks) with all of its associated learning/estimation functions (e.g., kernels and transfer functions), hyperparameters, structure (e.g., layers, connections, components in a composite kernel), constraints and learned model weights, in the context of specific data. A model only learns from, and has meaning in, the context of specific data.
We refer to an initial model as a model in the context of specific data with initial model weights prior to learning/iteration.
We refer to a family of models, or a prior model, as the set of models possible when hyperparameters are variables (not specified)—e.g., SVM with a Gaussian RBF kernel with unspecified box constraint and kernel width.
The prior, initial and posterior models are available at different points in the machine learning and/or statistical learning process (Fig. 2).
Other notation is introduced in the context of discussion.
3 Inherent Model Interpretability Concept and Measures
We propose the concept of inherent model interpretability as distinguished from an individual’s understanding and we propose two measures for any learning method or model with numeric inputs.
Feynman said that if we understand a concept we must be able to describe it at a freshman level, which often requires simplification or reduction, otherwise we don’t really understand it [21]. Badii et al. express that complexity is closely related to understanding and that understanding comes from accurate models which use condensed information or reduction schemes [4]. Miller indicates that selection is a key attribute of explanations [38]. Hence, we posit that the simpler a model is, the easier it is to understand, interpret and describe, with all other aspects of the model being equal. This leads to the following general measure.
3.1 A General Measure of Inherent Model Interpretability
As stated above, the simpler a model is, the more interpretable it is, inherently. Formally, we propose the following definition.
Definition 1
Inherent model interpretability (or understandability) U, is a measure with range \(\left[ 0,1\right] \) based on either: a measure of model transparency T in the same range, the inverse of semi-infinite model complexity \(H_{\infty }\), or the inverse of finite model complexity \(H_{b}\), respectively as follows:
where:

\(H_{\infty }\) and \(H_{b}\) are measures of model complexity based on parts [4] in the categories of information, entropy, code length or dimension [33],

inherent indicates that the measure is independent of an individual, e.g., their specific learning and forgetting curves [44], and

the multiplicative inverse [29] in (4)ii or the additive inverse [57] in (4)iii is applied as needed for an absolute or relative measure, respectively, according to the comparison required. The relative measure is preferred where applicable since it is more intuitive and interpretable (not shown).

For example, to compare a set of models where the range \(\left[ a,\,b\right] \) is known to encompass them all, a relative measure (iii) suffices; however, to compare them to any future model where the maximum b is not known, use an absolute measure (ii), i.e., let \(b=\infty \).

The separation of model interpretability into at least two parts, one part that is inherent to the model (and data) and another part that depends on the individual, aligns with the functionally-grounded approach [17].
In order to use this general measure, one must further define T, \(H_{\infty }\) or \(H_{b}\), as we do in subsequent sections. We note also that measurement may be performed prior to, initially at, or posterior to, optimizing the model weights (Fig. 2).
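As an illustration only, the two inverse forms described in Definition 1 can be sketched in Python. The exact normalizations below are assumptions chosen so that complexity in \(\left[ a,\infty \right) \) or \(\left[ a,\,b\right] \) maps onto \(\left[ 0,1\right] \); the function names are hypothetical and are not taken from any library.

```python
import math


def interpretability_absolute(h, a):
    """Multiplicative-inverse form for a complexity h in [a, inf).

    Assumed normalization: U = 1 / (1 + (h - a)), so U = 1 at the minimum
    complexity a and U tends to 0 as h grows without bound.
    """
    if h < a:
        raise ValueError("complexity h must be at least the minimum a")
    if math.isinf(h):
        return 0.0
    return 1.0 / (1.0 + (h - a))


def interpretability_relative(h, a, b):
    """Additive-inverse form for a complexity h in [a, b].

    Assumed normalization: U = 1 - (h - a) / (b - a), so U = 1 at h = a
    and U = 0 at h = b.
    """
    if not a <= h <= b:
        raise ValueError("complexity h must lie in [a, b]")
    if b == a:
        return 1.0
    return 1.0 - (h - a) / (b - a)
```

Under these assumptions the relative form changes linearly with complexity over a known range, whereas the absolute form remains defined when no upper bound is known.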
3.2 A New Measure: Simplicity of Output Sensitivity
We consider the continuous underlying output of a classifier (e.g., Fig. 1) to be the most accurate representation of a classifier’s behaviour. It is available for most learning classifiers, in machine learning or statistical learning, such as neural networks, SVM, logistic regression and naive Bayes. It is also facilitated by most implementations, e.g., for SVM it is available in Matlab, R, Python, SPSS, Weka, libsvm and Orange, where the output may be the probability of the positive class or a non-probabilistic value, e.g., “classification score”.
Some measure or analyze a classifier’s behaviour based on its binary output instead [46]; this approach lacks fine-grained behavioural information. Others measure classifier behaviour by modeling its responses with a separate explanation model that provides a continuous output [5, 46]; this post hoc approach may not meet untested legal, assurance or business requirements.
We use the underlying continuous output, and logic similar to that of the previous measure, to posit that:
If a model is uniformly sensitive in its output to changing values in input features and instances, then its sensitivity is simple to describe, understand and interpret (as one value). Conversely, a model that is differently sensitive to each feature and instance is more difficult to describe, understand and interpret, in those terms or from that perspective. Formally, we propose the following definition:
Definition 2
The simplicity of output sensitivity \(U_{Hs}\) is a measure of inherent model interpretability. It describes the simplicity of the sensitivity of the model’s continuous output (e.g., Fig. 1) to changes in input. It is specified as the inverse of Shannon entropy \(H_{\text {s}}\) with a finite range (4)iii, repeated below:
where s is the set of sensitivities \(S_{j,q}\) of the model’s continuous output \(\hat{y}_{\text {c}}\) (the value which is underlying for a classifier) to small changes \(\varepsilon =\left( 0.1\right) \cdot 3\sigma \) in each input instance j, one feature q at a time,
and where \(N_{s}\) is the number of bins according to standard binning methods for histograms [18, 47, 53].
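A minimal sketch of Definition 2 follows, under stated assumptions: the perturbation for feature q is taken as \(\varepsilon _{q}=0.1\cdot 3\sigma _{q}\) with \(\sigma _{q}\) the population standard deviation of that feature, the central difference is divided by \(2\varepsilon _{q}\), and the entropy is normalized by its maximum \(\log _{2}N_{s}\). These choices, and all function names, are illustrative and may differ from the implementation used in the experiments.

```python
import math
import statistics


def sensitivities(predict, X, step_fraction=0.1):
    """Central-difference sensitivity of predict() per instance and feature."""
    n_features = len(X[0])
    # Assumed per-feature perturbation: eps_q = step_fraction * 3 * sigma_q.
    eps = [step_fraction * 3.0 * statistics.pstdev([row[q] for row in X])
           for q in range(n_features)]
    s = []
    for row in X:
        for q in range(n_features):
            if eps[q] == 0.0:
                continue  # constant feature: no perturbation scale exists
            up = list(row)
            up[q] += eps[q]
            down = list(row)
            down[q] -= eps[q]
            s.append((predict(up) - predict(down)) / (2.0 * eps[q]))
    return s


def shannon_entropy_bits(values, n_bins):
    """Shannon entropy (bits) of an equal-width histogram of values."""
    lo, hi = min(values), max(values)
    # Treat values equal up to rounding error as a single bin.
    if math.isclose(hi, lo, rel_tol=1e-9, abs_tol=1e-12):
        return 0.0
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def simplicity_of_sensitivity(predict, X, n_bins):
    """Assumed form: U = 1 - H_s / log2(N_s), which lies in [0, 1]."""
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    s = sensitivities(predict, X)
    if not s:
        raise ValueError("no non-constant features to perturb")
    return 1.0 - shannon_entropy_bits(s, n_bins) / math.log2(n_bins)
```

With this sketch, a model whose output responds identically to every feature at every instance yields a single occupied bin, zero entropy and the maximum value of 1.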
We use entropy to measure the global complexity of sensitivities across the space for input data. In the literature, entropy has been applied quite differently to measure the information loss of perturbed features, to indicate their influence—we use entropy instead to measure the complexity of influence with perturbed features.
Our measure uses a first-order central difference (first derivative approximation) as a standard and easy to understand approach to sensitivity that does not require knowing or differentiating the model’s formulas. We can generalize this idea to second- and third-order differences/derivatives, and so on, like the derivatives in deep Taylor decomposition [39], but the latter requires a model’s formulas and derivatives. Whereas [39] examines the local behaviours of a model, we do that and compute the complexity of the values.
We treat the entries \(S_{j,q}\) as a set or random variable s (8) because we are measuring model interpretability overall, across features and instances, not within a feature nor within an instance.
We note that instead of Shannon entropy, it may be possible to apply other types of entropy, such as Rényi entropy, Tsallis entropy, effective entropy or total information [19, 45, 56] and/or Kullback-Leibler (KL) divergence [14]; however, such a change would require validation. Prior to this study we experimented with discrete KL divergence as implemented by four measures in the ITK toolkit [54, 55], as an alternative to Shannon entropy. However, our experimental results with KL divergence did not sufficiently match our expectations, so we focused on Shannon entropy as a more popular and credible measure.
We also implemented differential entropy [14], which is the continuous version of entropy and is defined as the KL divergence from a uniform probability density function (pdf) to the pdf of interest, but put that aside based on the previously mentioned KL divergence results and also because it was more compute intensive as it required a kernel density estimate.
Finally we note that the sensitivity portion of our measure (i.e., entropy aspect aside) differs from how other authors compute sensitivity globally across both instances and features [27].
4 Criteria for Model Transparency and a Measure for SVM
We identify criteria for model transparency from the literature (Table 2) for any model, and propose new criteria in most cases, which are objective, not subjective, and thus suitable for a (quantitative) measure of model transparency.
We apply the proposed criteria (Table 2) for any model, to create a measure specific to kernel methods or support vector machines (SVM).
We use the seven proposed criteria for inherent prior model interpretability (Sect. 4) to define 6 Dirac (binary) measures for SVM (Table 3) meeting each criterion without overlap, except for criterion d (since all SVM kernels are generalized linear models).
We define an overall measure as follows:
A benefit of this measure is that while independent of the data, it requires little computation and it informs model selection prior to optimization.
5 Creating More Measures Specific to SVM
In this section we propose measures specific to SVM.
Support Vectors: In SVM, a subset of the patients in the data set are key to defining the model. They are known as support vectors since they support the definition of the model’s class boundary and decision surface. For example, the decision regarding whether a patient has a disease or not, is determined by a subset of patients, e.g., 5 out of 200 patients, the model learned/picked as positive and negative examples of disease.
The more support vectors there are, the more complex the model is, with all other things being equal: \(H_{sv}=sv\). SVM models have at least three support vectors in general—at least two to define the line, curve, hyperplane or surface that is the class boundary, and at least one to define the margin, so \(sv\ge 3,\;sv\,{\in }\,\mathbb {N}\).
To select a model for one data set, or to compare results between two data sets, we know the maximum number of patients N, so \(sv\le N\), and we apply (4)iii to obtain a relative measure, \(U_{sv,r}\). Or to obtain an absolute measure \(U_{sv,a}\), to compare against any current or future data set, we assume \(N=\infty \) and apply (4)ii.
Degrees of Freedom: Akaike includes all method and kernel hyperparameters and weights as among the degrees of freedom [50]. We calculate the prior complexity measure \(\check{H}_{dof}\) with three terms: the number of SVM hyperparameters, e.g., 1 for C; the number of kernel hyperparameters, e.g., 1 for the kernel width for a Gaussian RBF kernel; and the number of independent inputs, e.g., 1 for a Gaussian RBF kernel or stationary kernel, 2 otherwise. We calculate the posterior complexity measure \(H_{dof}\) with an additional term for the support vectors and apply the general measure for model interpretability.
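The counting described above can be sketched as follows. Combining the terms by simple addition, and taking the support-vector term to be the number of support vectors, are assumptions made for illustration; the function names are hypothetical.

```python
def prior_dof(n_svm_hyperparameters, n_kernel_hyperparameters, stationary_kernel):
    """Prior complexity: hyperparameter counts plus independent inputs.

    Assumes the three terms are summed. A stationary kernel (e.g., Gaussian
    RBF) depends on one independent input, other kernels on two.
    """
    n_independent_inputs = 1 if stationary_kernel else 2
    return n_svm_hyperparameters + n_kernel_hyperparameters + n_independent_inputs


def posterior_dof(prior, n_support_vectors):
    """Posterior complexity: prior count plus an assumed support-vector term."""
    return prior + n_support_vectors
```

For example, a Gaussian RBF kernel with a box constraint C and a kernel width gives a prior count of 1 + 1 + 1 = 3 under these assumptions.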
Relevant Dimensionality Estimate: The relevant dimensionality estimate (rde) [9] provides a way to measure the complexity of the SVM feature space induced by a kernel. There are two complexity measures \(H_{rde\text {T}}\) and \(H_{rde\text {L}}\) corresponding to two rde methods: the two-component model and the leave-one-out method, respectively.
6 Validation of Measures
We validate our proposed measures with sanity checks on formulas (not shown) and by agreement with propositions that describe our expectations and knowledge about model complexity and interpretability.
We create propositions based on expected relationships between measures, and check/test the propositions with a statement P and its inverse \(\mathbf {P}^{-1}\) such as the following,
where \(\overset{\text {usually}}{\rightarrow }\) is a notation that means “implies the majority of the time”. For brevity \(\mathbf {P}^{-1}\) is implied but not shown in statements that follow. We measure how much our results agree with these propositions using either Kendall’s W coefficient of rank correlation [26] or matched pair agreement [48], where the latter is applied to control for confounding factors.
If a proposition is robust, then the percentage of the concordance coefficient or matched pair agreement indicates how correct and useful the measure is, from that perspective. A measure has some utility, if it is correct the majority of the time, for different models/kernels and data sets, with a confidence interval that does not include 50%.
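The utility criterion above can be sketched as a proportion of matched pairs that agree with a proposition, together with a confidence interval. The normal-approximation interval at the 95% level is an assumption for illustration and may not be the interval used in the experiments; the function names are hypothetical.

```python
import math


def matched_pair_agreement(outcomes, z=1.96):
    """Return (proportion in agreement, half-width of the interval).

    outcomes holds one boolean per matched pair: True if the proposition
    held for that pair. Uses a normal approximation (an assumption).
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one matched pair is required")
    p = sum(1 for o in outcomes if o) / n
    return p, z * math.sqrt(p * (1.0 - p) / n)


def has_utility(p, half_width):
    """True if the measure is correct most of the time and the interval excludes 50%."""
    return p - half_width > 0.5
```

A measure that agrees in 81 of 100 pairs would pass this check, while one that agrees in 50 of 100 would not.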
We validate our propositions using two types of experiments (#1 and #2 as below). We run each experiment five times on each of three data sets from the University of California at Irvine repository: the Statlog Heart, Hepatitis and Bupa Liver data sets. Missing data in the Hepatitis data set are imputed with Stata, taking one of three multiple imputations with Monte Carlo Markov Chains. Bupa Liver is used with the common target [36] rather than the clinically meaningful target.

Experiment Type #1: For each of 90 points chosen randomly in the hyperparameter space, we choose a pair of models, matched pairs [48], that differ by one hyperparameter/dof that is fixed in one and free in the other, and check propositions as the percentage truth of the propositions. We use 3 pairs of kernels that differ by a single dof, e.g., a polynomial kernel of varying degree versus a linear kernel, a Gaussian RBF kernel with/without a fixed kernel width and a Mercer sigmoid kernel [11] with/without a fixed horizontal shift.

Experiment Type #2: From experiment type #1 we identify three points in the hyperparameter space which perform well for each kernel. For each of 3 fixed points, we choose 30 values of C equally spaced (as logarithms) throughout the range from \(10^{-3}\) to \(10^{6}\) and check propositions as the concordance of the left-hand side with the right-hand side in the propositions, using Kendall’s W coefficient of concordance. If the right-hand side should have opposite rank to the left-hand side then we apply a negative to the measure on the right-hand side for concordance to measure agreement of rank. We use the following kernels: linear, polynomial, Gaussian RBF and Mercer sigmoid kernel [11].
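The concordance check in experiment type #2 can be sketched with the textbook form of Kendall's W, \(W=12S/\left( m^{2}\left( n^{3}-n\right) \right) \), for m rankings of n items. This sketch assigns average ranks to ties but omits the tie correction of [26], which is a simplification; the function names and the example values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(*measures):
    """Kendall's coefficient of concordance without tie correction."""
    m = len(measures)
    n = len(measures[0])
    rank_rows = [average_ranks(list(x)) for x in measures]
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2.0
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))


# Made-up values: support vectors fall as C rises, so the right-hand
# side is negated before measuring agreement of rank.
c_values = [0.001, 0.1, 10.0, 1000.0]
support_vectors = [120, 80, 40, 25]
w = kendalls_w(c_values, [-sv for sv in support_vectors])
```

In this made-up example the negated support-vector counts rank identically to C, so the two rankings are in complete concordance.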
6.1 Propositions
Proposition 1
The majority of the time we expect that a model with fewer degrees of freedom \( \check{dof} _{1}\), with all other things being equal when compared to another model with \( \check{dof} _{2}\), will be simpler and have a relevant dimensionality estimate (rde) [9] that is less than or equal to the other model and therefore be more interpretable/understandable (\(U_{rde}^{*}\)):
This applies to rde with the two-component model (rdeT) and the leave-one-out method (rdeL).
Proposition 2
In SVM, the hyperparameter C is called the box constraint or cost of error. Authors have remarked [49, Remark 7.31] that C is not an intuitive parameter, although it has a lower bound for use \(C\ge \frac{1}{N}\) and its behaviour suggests \(C\doteq \frac{1}{\nu N}\), where \(\nu \) is a proportion of support vectors. We therefore expect that a model with a higher value \(C_{1}\) versus a second model with \(C_{2}\) will have fewer support vectors (sv) and consequently be more interpretable/understandable (\(U_{Hs}\)):
This applies to simplicity of sensitivity \(U_{Hs}\) with any binning method.
Our experiment uses three binning methods: Scott \(U_{H\text {sc}}\), Freedman-Diaconis \(U_{H\text {fd}}\) and Sturges \(U_{H\text {st}}\).
Proposition 3
The majority of the time we expect that, if a prior measure is useful, then it reflects the same rankings as the posterior measure,
Proposition 4
We expect that the linear kernel is the simplest of all kernels with greater transparency than other kernels such as the polynomial, Gaussian RBF kernel, sigmoid and Mercer sigmoid kernels, whereby,
7 Results
We summarize the results of our validation tests (Tables 4 and 5) as follows: we recommend \(\check{U}_{\partial }\) and \(U_{sv}\) as good measures. We find that \(U_{rdeT}^{*}\), \(U_{rdeL}^{*}\) and \(U_{H\text {st}}\) are measures which are of limited use, because they may be wrong one third of the time when providing guidance on decisions. \(U_{H\text {sc}}\) and \(U_{H\text {fd}}\) are not distinguished from chance by our propositions and are therefore not recommended. If \(U_{H\text {st}}\) is validated to a greater degree in the future, then the initial measure \(U_{H\text {st}}^{*}\) has been shown to be a good proxy for it, incurring some loss of information (Table 5).
Our proposed measure of kernel transparency \(\check{U}_{\partial }\), a prior measure, scored 100% agreement. This is a good measure that may be used a priori, but it is high-level and not specific to the match between a model and data. No surprises or complexities arose regarding the attributes of kernels.
The general measure based on the number of support vectors, \(U_{sv}\), scored \(81\pm 2.3\%\) agreement—this is a good measure.
Our proposed simplicity of sensitivity measure with Sturges binning \(U_{H\text {st}}\) scored \(64\pm 3.2\)% and \(62\pm 3.5\)%, which is of limited use—we are interested in agreement that is sufficiently greater than chance (50%), enough to be reliable.
The same measure with Scott binning (\(U_{H\text {sc}}\)), however, is barely distinguishable from chance in one test, and not distinguishable in another, and with Freedman-Diaconis binning (\(U_{H\text {fd}}\)) it is not distinguishable from chance in either test. We recommend further validation to examine the role of confounding factors such as kernel width/scale along with C per [6, 16].
If the simplicity of sensitivity measure \(U_{H\text {st}}\) can be validated to a greater degree in the future, then the initial measure \(U_{H\text {st}}^{*}\) which scores \(80\pm 3.2\%\) agreement with it, may be used in its place to avoid optimization, or to gain an initial estimate prior to optimization.
The general measure based on the relevant dimensionality of the feature space, \(U_{rdeT}^{*}\) and \(U_{rdeL}^{*}\) scored \(62\pm 5.0\%\) and \(59\pm 5.2\%\) agreement, respectively. These are of some use. We did not include Braun’s noise estimate, which in hindsight should improve the measure.
8 Application
We apply model interpretability to results in a toy problem. When we select results for maximum accuracy with the Gaussian RBF kernel, we find that the top result in our sorted list of results achieves 100% accuracy (rounded to no decimal places) with 51 support vectors, while the second best result also achieves 100% accuracy with 40 support vectors and the fifth best result according to the list also achieves 100% accuracy with 25 support vectors.
Selecting results for maximum interpretability \(U_{sv,r}\), we find the top result uses 9 support vectors for 99% accuracy and the fourth best result uses 10 support vectors for the same accuracy.
We plot the results (Fig. 3) of accuracy versus interpretability \(U_{sv,r}\) (above 80% in each) and find that there are many results which are highly accurate and highly interpretable, i.e., above 96% in both. These results indicate that there is not a tradeoff between accuracy and model interpretability based on support vectors in this data set.
We also plot the results of accuracy versus interpretability \(U_{sv,r}\) for other data sets (Figs. 4 and 5) and it is clear that there is no trend in all points showing a tradeoff between accuracy and model interpretability, although this trend may be present at the Pareto front. A tradeoff trend would show as an inverse correlation, a trend line running from the top left to the bottom right. Instead, high interpretability is consistently achievable with high accuracy, i.e., there are points toward the top right of a bounding box for all points.
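To examine the Pareto front mentioned above, the non-dominated results can be extracted with a short sketch. The function name and the example pairs are hypothetical; each pair is (accuracy, interpretability), with higher values better in both.

```python
def pareto_front(results):
    """Return the (accuracy, interpretability) pairs no other pair dominates.

    A pair is dominated if another pair is at least as good in both
    objectives and strictly better in at least one.
    """
    front = []
    for i, (acc_i, u_i) in enumerate(results):
        dominated = any(
            acc_j >= acc_i and u_j >= u_i and (acc_j > acc_i or u_j > u_i)
            for j, (acc_j, u_j) in enumerate(results)
            if j != i
        )
        if not dominated:
            front.append((acc_i, u_i))
    return sorted(front)


# Made-up results for illustration only.
example = [(1.00, 0.76), (1.00, 0.88), (0.99, 0.97), (0.95, 0.90), (0.90, 0.99)]
front = pareto_front(example)
```

A tradeoff confined to the front would appear as interpretability falling along the sorted front as accuracy rises, even when the full cloud of points shows no such trend.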
9 Related Work
Lipton [30] provides a good taxonomy for model interpretability with concepts falling into two broad categories: transparency (the opposite of a black box) and post-hoc interpretability.
Post-hoc interpretability involves an explanatory model separate from the predictive model, or visuals that transform data where the transformation is also a separate explanatory model. Liang [28] cautions against explaining a black box predictive model with another black box explanatory model.
Ribeiro et al. [46] create an external local linear model to approximate the prediction model in a post-hoc approach called LIME. They jointly optimize accuracy and model complexity but they do not elucidate much about model complexity as in our work. LIME perturbs features in a separate binary representation of features, which sometimes map to non-local features in the original space of data. In their examples they use the binary model output, only referring in passing to the possibility of using a continuous output for classifiers, as we do.
Transparency, on the other hand, focuses on the predictive model itself, and has three aspects: decomposability, simulatability and algorithmic transparency [30].
Decomposability refers to being able to see and understand the parts of the model, e.g., kernels and parameters, and the parts of the data, i.e., features and instances, and how they contribute to a result from the predictive model. Some authors refer to the output from decomposition as an interpretation, e.g., initial understanding, separate from an explanation [24, 39] that may require analysis, selection or perhaps synthesis. Miller adds that explanations are selected and social [38].
Since the social and synthesis tasks are more suitable to a person than a computer, it is reasonable for our work to focus on inherent measures of interpretability, rather than explanations.
[34] express that some types of models are more intelligible (i.e., decomposable) than others. We include categories for generalized linear and generalized additive models in our measures as a result of their work.
Simulatability, as another aspect of transparency, refers to a model that a person can mentally simulate or manually compute in reasonable time [30] and is correlated, for example, with the number of features in a linear model, or the depth of the tree in a decision tree. Model complexity is implied in Lipton’s examples but the term is not invoked, although other authors refer to it [10, 35, 42].
Ockham’s razor, also called the principle of parsimony [50], is a well known principle related to model complexity. Regarding models, it says that among sufficient explanations (e.g., equally accurate models; see Note 1), the simplest (see Note 2) should be preferred. A quick note on sufficiency: for multiple equally accurate models, none are necessary, because any one of them is sufficient. Model accuracy is sought first, then simplicity. Using our proposed measure one can search for the model with highest interpretability among equally accurate models.
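The search described above can be sketched using the accuracies and support-vector counts reported in Sect. 8. Treating models as equally accurate when they fall within a fixed tolerance of the best accuracy is an assumption standing in for a test of statistical significance; the function name and model labels are hypothetical.

```python
def select_simplest(models, tolerance_pct=0):
    """Among models within tolerance_pct of the best accuracy, pick the
    one with the fewest support vectors.

    Each model is (label, accuracy in whole percent, support vectors).
    """
    best = max(acc for _, acc, _ in models)
    candidates = [m for m in models if best - m[1] <= tolerance_pct]
    return min(candidates, key=lambda m: m[2])


# Accuracy and support-vector counts from Sect. 8; labels are illustrative.
models = [
    ("accuracy-rank-1", 100, 51),
    ("accuracy-rank-2", 100, 40),
    ("accuracy-rank-5", 100, 25),
    ("interpretability-rank-1", 99, 9),
    ("interpretability-rank-4", 99, 10),
]
```

With a tolerance of zero the sketch returns the 25-support-vector model; if a one-point difference in accuracy is treated as indistinguishable, it returns the 9-support-vector model instead.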
Backhaus et al. propose a quantitative measure of model interpretability [3]—but that is for a different meaning or definition—the ability for a model to interpret data, with relevance in relevance vector machines as the context.
Related to our work, sensitivity analysis of model outputs (SAMO) [2, 23] describes how sensitive a model output is to a change in feature values, one at a time, which is the approach of our proposed general measure.
In variance-based sensitivity analysis, Sobol [51] finds the variance in the output explained by an input feature. Liu et al. [32] perform entropy-based sensitivity analysis, called global response probabilistic sensitivity analysis (GRPSA), to find the influence of input features, where entropy is used to compute the effect as information loss. Lemaire et al. [27] apply sensitivity analysis but their perturbations are non-local and could easily create points outside of any known clusters of instances and true states of nature. Poulin et al. [43] provide effective visualization and analysis tools but for SVM they only apply their method to linear SVM and its binary output.
Automatic model selection methods have been proposed for accuracy [1, 40]; these are based on rules computed from many data sets. The rule-based approach is brittle in comparison to our measures, since it only works with a fixed set of candidate kernels.
10 Conclusions
We developed and validated measures for inherent model interpretability to enable automatic model selection and ongoing research. Two measures are recommended: our proposed kernel transparency measure \(\check{U}_{\partial }\) which is an inexpensive prior measure, and a posterior measure based on support vectors \(U_{sv}\). Three other measures, \(U_{rdeT}^{*}\), \(U_{rdeL}^{*}\) and \(U_{H\text {st}}\) were found to be of limited use but may be further validated by future work.
We also contributed ideas as a foundation for these measures: the concept of inherent model interpretability, a general measure, a simplicity of sensitivity measure, and measurement of interpretability at different points in the learning process, i.e., via prior, initial and posterior models.
We applied our measure to model selection and demonstrated that choosing a model based on a sorted list of accuracy alone can result in models with substantively less inherent model interpretability despite the consistent availability of models with high accuracy and high interpretability in multiple data sets. The notion of a tradeoff between accuracy and interpretability does not hold for these data sets.
Notes
1. Where accuracy cannot be distinguished with statistical significance.
2. [Sober] refers to [Akaike]’s definition of the simplest model as the model with the least degrees of freedom, i.e., least number of (independent) coefficients.
3. E.g., a combination of atomic data types can make up a complex data type; for instance, a combination of letters or symbols (nominals) makes up a string as a complex data type.
References
Ali, S., Smith, K.A.: On learning algorithm selection for classification. Appl. Soft Comput. 6(2), 119–138 (2006)
Auder, B., Iooss, B.: Global sensitivity analysis based on entropy. In: Proceedings of the ESREL 2008 Safety, reliability and risk analysis Conference, pp. 2107–2115 (2008)
Backhaus, A., Seiffert, U.: Quantitative measurements of model interpretability for the analysis of spectral data. In: IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp. 18–25. IEEE (2013)
Badii, R., Politi, A.: Complexity: Hierarchical Structures and Scaling in Physics, vol. 6. Cambridge University Press, Cambridge (1999)
Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.R.: How to explain individual classification decisions. J. Mach. Learn. Res. 11, 1803–1831 (2010)
Ben-Hur, A., Weston, J.: A user’s guide to support vector machines. In: Data Mining Techniques for the Life Sciences, pp. 223–239. Springer (2010)
Berner, E.S.: Clinical Decision Support Systems. Springer, New York (2007). https://doi.org/10.1007/978-0-387-38319-4
Boughorbel, S., Tarel, J.P., Boujemaa, N.: Conditionally positive definite kernels for SVM based image recognition. In: IEEE International Conference on Multimedia and Expo, ICME 2005, pp. 113–116. IEEE (2005)
Braun, M.L., Buhmann, J.M., Müller, K.R.: On relevant dimensions in kernel feature spaces. J. Mach. Learn. Res. 9, 1875–1908 (2008)
Breiman, L.: Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat. Sci. 16(3), 199–231 (2001)
Carrington, A.M., Fieguth, P.W., Chen, H.H.: A new mercer sigmoid kernel for clinical data classification. In: 36th Annual International Conference on Engineering in Medicine and Biology Society (EMBC), pp. 6397–6401. IEEE (2014)
Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 161–168. ACM (2006)
Cotter, A., Keshet, J., Srebro, N.: Explicit approximations of the Gaussian kernel. arXiv preprint arXiv:1109.4603 (2011)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Hoboken (2012)
Cruz, J.A., Wishart, D.S.: Applications of machine learning in cancer prediction and prognosis. Cancer Inform. 2, 59–78 (2006)
Devos, O., Ruckebusch, C., Durand, A., Duponchel, L., Huvenne, J.P.: Support vector machines (SVM) in near infrared (NIR) spectroscopy: focus on parameters optimization and model interpretation. Chemom. Intell. Lab. Syst. 96(1), 27–33 (2009)
Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning (2017)
Freedman, D., Diaconis, P.: On the histogram as a density estimator: L\(_2\) theory. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57(4), 453–476 (1981)
Gell-Mann, M., Lloyd, S.: Information measures, effective complexity, and total information. Complexity 2(1), 44–52 (1996)
Goodman, B., Flaxman, S.: European Union regulations on algorithmic decision-making and a “right to explanation”. In: 1st Workshop on Human Interpretability in Machine Learning, International Conference of Machine Learning (2016)
Goodstein, D.L., Goodstein, J.R.: Feynman’s Lost Lecture: The Motion of Planets Around the Sun, vol. 1. W. W. Norton & Company, New York (1996)
Greenes, R.A.: Clinical Decision Support: The Road Ahead. Academic Press, San Diego (2011)
Hanson, K.M., Hemez, F.M.: Sensitivity analysis of model output. In: Proceedings of the 4th International Conference on Sensitivity Analysis of Model Output (SAMO 2004), Santa Fe, 8–11 March 2004. Los Alamos National Laboratory (2005)
Holzinger, A., Biemann, C., Pattichis, C.S., Kell, D.B.: What do we need to build explainable AI systems for the medical domain? arXiv preprint arXiv:1712.09923 (2017)
Jernigan, M.E., Fieguth, P.: Introduction to Pattern Recognition. University of Waterloo (2004)
Kendall, M.G.: The treatment of ties in ranking problems. Biometrika 33(3), 239–251 (1945)
Lemaire, V., Féraud, R., Voisine, N.: Contact personalization using a score understanding method. In: IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IJCNN 2008, pp. 649–654. IEEE (2008)
Liang, P.: Provenance and contracts in machine learning. In: Proceedings of the 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016) (2016)
Lin, D.: An information-theoretic definition of similarity. ICML 98, 296–304 (1998)
Lipton, Z.C., et al.: The mythos of model interpretability. In: IEEE Spectrum (2016)
Lisboa, P.J.G.: Interpretability in machine learning – principles and practice. In: Masulli, F., Pasi, G., Yager, R. (eds.) WILF 2013. LNCS (LNAI), vol. 8256, pp. 15–21. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-03200-9_2
Liu, H., Chen, W., Sudjianto, A.: Relative entropy based method for probabilistic sensitivity analysis in engineering design. J. Mech. Des. 128(2), 326–336 (2006)
Lloyd, S.: Measures of complexity: a nonexhaustive list. IEEE Control Syst. Mag. 21(4), 7–8 (2001)
Lou, Y., Caruana, R., Gehrke, J.: Intelligible models for classification and regression. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150–158. ACM (2012)
Martens, D., Baesens, B.: Building acceptable classification models. In: Stahlbock, R., Crone, S., Lessmann, S. (eds.) Data Mining. Annals of Information Systems, pp. 53–74. Springer, Boston (2010). https://doi.org/10.1007/978-1-4419-1280-0_3
McDermott, J., Forsyth, R.S.: Diagnosing a disorder in a classification benchmark. Pattern Recognit. Lett. 73, 41–43 (2016)
Mercer, J.: Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. R. Soc. Lond. Ser. A 209, 415–446 (1909). Containing papers of a mathematical or physical character
Miller, T., Howe, P., Sonenberg, L.: Explainable AI: beware of inmates running the asylum. In: IJCAI-17 Workshop on Explainable AI (XAI), p. 36 (2017)
Montavon, G., Lapuschkin, S., Binder, A., Samek, W., Müller, K.R.: Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit. 65, 211–222 (2017)
Nahar, J., Ali, S., Chen, Y.P.P.: Microarray data classification using automatic SVM kernel selection. DNA Cell Biol. 26(10), 707–712 (2007)
Olson, R.S., La Cava, W., Orzechowski, P., Urbanowicz, R.J., Moore, J.H.: PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Min. 10(1), 36 (2017)
Perez, P.S., Nozawa, S.R., Macedo, A.A., Baranauskas, J.A.: Windowing improvements towards more comprehensible models. Knowl. Based Syst. 92, 9–22 (2016)
Poulin, B., et al.: Visual explanation of evidence with additive classifiers. In: Proceedings of the National Conference On Artificial Intelligence, vol. 21, p. 1822. AAAI Press, Menlo Park (1999). MIT Press, Cambridge (2006)
Pusic, M.V., Boutis, K., Hatala, R., Cook, D.A.: Learning curves in health professions education. Acad. Med. 90(8), 1034–1042 (2015)
Rényi, A., et al.: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California (1961)
Ribeiro, M.T., Singh, S., Guestrin, C.: Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM (2016)
Scott, D.W.: On optimal and data-based histograms. Biometrika 66(3), 605–610 (1979)
Selvin, S.: Statistical Analysis of Epidemiologic Data. Oxford University Press, New York (2004)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, New York (2004)
Sober, E.: Parsimony and predictive equivalence. Erkenntnis 44(2), 167–197 (1996)
Sobol, I.M.: Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Math. Comput. Simul. 55(1), 271–280 (2001)
Stevens, S.S.: On the theory of scales of measurement (1946)
Sturges, H.A.: The choice of a class interval. J. Am. Stat. Assoc. 21(153), 65–66 (1926)
Szabó, Z., Póczos, B., Lőrincz, A.: Undercomplete blind subspace deconvolution. J. Mach. Learn. Res. 8, 1063–1095 (2007)
Szabó, Z., Póczos, B., Lőrincz, A.: Separation theorem for independent subspace analysis and its consequences. Pattern Recognit. 45, 1782–1791 (2012)
Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52(1), 479–487 (1988)
Tussy, A., Gustafson, R.: Elementary Algebra. Nelson Education (2012)
Donders, A.R.T., Van Der Heijden, G.J.M.G., Stijnen, T., Moons, K.G.M.: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59(10), 1087–1091 (2006)
A Appendix: Treating Features of Any Atomic Data Type as Continuous
Assuming that we are not given a fixed pretrained model, but can instead select the machine learning method and model, we can choose one that handles continuous values, and we can treat features of any atomic data type (defined below) as continuous. This treatment requires three steps. Most of the content in these steps is standard practice, with a few exceptions denoted by an asterisk (*).
We define atomic data types (Table 7) as the following set of data types which are fundamental building blocks for all electronic data^{Footnote 3}: reals, integers, datetimes, dates, ordinals, binary and nominals. These atomic data types are based on Stevens’s scales of measurement [52], but are specified at a level that is more interpretable and useful.
Although binary values may also be considered nominals, we identify them separately because there are methods in the literature specific to binary data (e.g., for imputation and similarity measurement) and the data type is specifically defined in programming languages, machine learning platforms, database schema and data extraction tools.
1. Treat missing data. Assuming data are missing completely at random (MCAR), do the following; otherwise refer to [58].
   (a) Impute missing data for reals, integers, datetimes, dates and ordinals, using whichever method meets requirements, e.g., multiple imputation with Markov chain Monte Carlo, expectation maximization, hot-deck imputation or mean imputation.
   (b) Impute missing data for nominals using the mode, i.e., the most frequent level.
   (c) Impute missing binary data with a method that will produce continuous values and which is appropriate for binary distributions, e.g., multiple imputation or expectation maximization. We refer to the output as continuously-imputed binary data.
2. Convert nominals to binary indicators, one for each level.
3. Center and normalize data.
   (a) For continuously-imputed binary data, bottom-code and top-code the data to the limits, then min-max normalize the data to the range [–1, +1] for SVM or [0, 1] for neural networks and logistic regression.
   (b) For binary data, min-max normalize the data to the set {–1, +1} for SVM or {0, 1} for neural networks and logistic regression. These data will be treated as reals by the methods/models, but {–1, +1} makes more sensible use of the symmetric kernel geometry in SVM than {0, 1}.
   (c) For all other data types, center and normalize each feature using z-score normalization (or scalar variations based on 2 or 3 sigma instead of 1 sigma).
Now all of the data are ready to be treated as reals by the methods/models.
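The three steps can be illustrated with a minimal Python sketch on toy data. It is not the authors' implementation: the helper names (`impute_mean`, `impute_mode`, `one_hot`, `minmax`, `zscore`) and the example columns are hypothetical, mean imputation stands in for the multiple-imputation and expectation-maximization methods named in step 1, and the population standard deviation is assumed for the z-score.

```python
from collections import Counter
from statistics import mean, pstdev


def impute_mean(values):
    """Step 1(a)/(c): mean imputation (a simple stand-in for MI or EM).

    On 0/1 data the imputed value is a proportion, i.e. continuous.
    """
    m = float(mean([v for v in values if v is not None]))
    return [m if v is None else float(v) for v in values]


def impute_mode(values):
    """Step 1(b): replace missing nominals with the most frequent level."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def one_hot(values):
    """Step 2: one binary indicator column per nominal level."""
    return {level: [1.0 if v == level else 0.0 for v in values]
            for level in sorted(set(values))}


def minmax(values, lo=0.0, hi=1.0, for_svm=True):
    """Step 3(a)/(b): bottom-code and top-code to [lo, hi], then rescale
    to [-1, +1] for SVM or [0, 1] otherwise."""
    clipped = [min(max(v, lo), hi) for v in values]
    unit = [(v - lo) / (hi - lo) for v in clipped]
    return [2.0 * u - 1.0 for u in unit] if for_svm else unit


def zscore(values):
    """Step 3(c): center and scale by one (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s > 0 else 0.0 for v in values]


# Hypothetical toy columns; None marks a missing value.
age = [30, None, 50, 40]            # integer feature
smoker = [1, None, 0, 1]            # binary feature
blood_type = ["A", "B", None, "A"]  # nominal feature

age_ready = zscore(impute_mean(age))
smoker_ready = minmax(impute_mean(smoker))
blood_ready = {level: minmax(column)
               for level, column in one_hot(impute_mode(blood_type)).items()}
```

Each resulting column holds only real values, so a method that expects continuous inputs (here scaled for SVM) can consume all three features in the same way.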
© 2018 IFIP International Federation for Information Processing
Carrington, A., Fieguth, P., Chen, H. (2018). Measures of Model Interpretability for Model Selection. In: Holzinger, A., Kieseberg, P., Tjoa, A., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2018. Lecture Notes in Computer Science, vol 11015. Springer, Cham. https://doi.org/10.1007/978-3-319-99740-7_24
DOI: https://doi.org/10.1007/978-3-319-99740-7_24
Publisher Name: Springer, Cham
Print ISBN: 9783319997391
Online ISBN: 9783319997407
eBook Packages: Computer Science (R0)