
1 Introduction

Machine learning models are optimized for predictive performance, but it is often necessary to understand models, e.g., to debug them, gain trust in their predictions, or satisfy regulatory requirements. Many post-hoc interpretation methods either quantify effects of features on predictions, compute feature importances, or explain individual predictions; see [17, 24] for more comprehensive overviews. While model-agnostic post-hoc interpretation methods can be applied regardless of model complexity [30], their reliability and compactness deteriorate when models use a large number of features, have strong feature interactions, or have complex feature main effects. Model complexity and interpretability are therefore deeply intertwined, and reducing complexity can make model interpretation more reliable and compact. Model-agnostic complexity measures are needed to strike a balance between interpretability and predictive performance [4, 31].

Contributions. We propose and implement three model-agnostic measures of machine learning model complexity that relate to post-hoc interpretability. To the best of our knowledge, these are the first model-agnostic measures that describe the global interaction strength, the complexity of main effects, and the number of features a model uses. We apply the measures to different datasets and machine learning models, and we argue that minimizing these three measures improves the reliability and compactness of post-hoc interpretation methods. Finally, we illustrate the use of the proposed measures in multi-objective optimization.

2 Related Work and Background

In this section, we introduce the notation, review related work, and describe the functional decomposition on which we base the proposed complexity measures.

Notation: We consider machine learning prediction functions \(f: \mathbb{R}^p \mapsto \mathbb{R}\), where \(f(x)\) is a prediction (e.g., a regression output or a classification score). For the decomposition of f, we write \(f_S: \mathbb{R}^{|S|} \mapsto \mathbb{R}\), \(S \subseteq \{1, \ldots, p\}\), to denote a function that maps a vector \(x_S \in \mathbb{R}^{|S|}\) with a subset of features to a marginal prediction. If the subset S contains a single feature j, we write \(f_j\). We refer to the training data of the machine learning model as the tuples \(\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^n\) and to the value of the j-th feature of the i-th instance as \(x_j^{(i)}\). We write \(X_j\) to refer to the j-th feature as a random variable.

Complexity and Interpretability Measures: In the literature, model complexity and (lack of) model interpretability are often equated. Many complexity measures are model-specific, i.e., only models of the same class can be compared (e.g., decision trees). Model size is often used as a measure of interpretability (e.g., number of decision rules, tree depth, number of non-zero coefficients) [3, 16, 20, 22, 31,32,33,34]. Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are more widely applicable measures of the trade-off between goodness of fit and degrees of freedom. In [26], the authors propose model-agnostic measures of model stability; in [27], the authors propose explanation fidelity and stability of local explanation models. Further approaches measure interpretability through experimental studies with humans, e.g., whether humans can predict the outcome of the model [8, 13, 20, 28, 35].

Functional Decomposition: Any high-dimensional prediction function can be decomposed into a sum of components with increasing dimensionality:

$$\begin{aligned} f(x) = \overbrace{f_0}^{\text{Intercept}} + \overbrace{\sum_{j=1}^p f_j(x_j)}^{\text{1st order effects}} + \overbrace{\sum_{j<k}^p f_{jk}(x_j, x_k)}^{\text{2nd order effects}} + \ldots + \overbrace{f_{1,\ldots,p}(x_1, \ldots, x_p)}^{\text{p-th order effect}} \end{aligned}$$
(1)

This decomposition is only unique with additional constraints on the components. Accumulated Local Effects (ALE) were proposed in [1] as a tool for visualizing feature effects (e.g., Fig. 1) and as a unique decomposition of the prediction function with components \(f_S = f_{S,ALE}\). The ALE decomposition is unique under an orthogonality-like property described in [1].
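As a small worked illustration of Eq. 1 (our example, assuming two independent features \(X_1, X_2\) uniformly distributed on \([-1, 1]\)), the prediction function \(f(x_1, x_2) = x_1 + x_2^2 + x_1 x_2\) decomposes into

$$f_0 = \mathbb{E}[f(X)] = \frac{1}{3}, \quad f_1(x_1) = x_1, \quad f_2(x_2) = x_2^2 - \frac{1}{3}, \quad f_{12}(x_1, x_2) = x_1 x_2$$

Each main effect is centered to have zero mean over its marginal distribution, and the four components sum exactly to f; the term \(x_1 x_2\) cannot be attributed to either feature alone and remains as a second order effect.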

The ALE main effect \(f_{j,ALE}\) of a feature \(x_j, j \in \{1,\ldots ,p\}\) for a prediction function f is defined as

$$\begin{aligned} f_{j,ALE}(x_j) = \int_{z_{0,j}}^{x_j} \mathbb{E}\left[ \frac{\partial f(X_1, \ldots, X_p)}{\partial X_j} \bigg| X_j = z_j \right] dz_j - c_j \end{aligned}$$
(2)

Here, \(z_{0,j}\) is a lower bound of \(X_j\) (usually the minimum of \(x_j\)) and the expectation \(\mathbb{E}\) is computed conditional on \(X_j = z_j\), i.e., over the conditional distribution of the remaining features. The constant \(c_j\) is chosen so that the mean of \(f_{j,ALE}(x_j)\) with respect to the marginal distribution of \(X_j\) is zero, which ensures that the ALE components sum to the full prediction function. By integrating the expected derivative of f with respect to \(X_j\), the effect of \(x_j\) on the prediction function f is isolated from the effects of all other features. ALE main effects are estimated with finite differences, i.e., access to the gradient of the prediction function is not required (see [1]). We base our proposed measures on the ALE decomposition because ALE components are computationally cheap (worst case O(n) per main effect), they can be computed sequentially instead of simultaneously, they do not require knowledge of the joint distribution, and several software implementations exist [2, 25].
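To make the estimation concrete, the following is a minimal R sketch of a first-order ALE estimator with quantile-based intervals and finite differences. It is an illustration, not the reference implementation from [2, 25]; `predict_fun` (a function returning one numeric prediction per row of a data frame), the interval count `K`, and the simplified centering step are our assumptions.

```r
# Minimal sketch of first-order ALE estimation via finite differences.
ale_main_effect <- function(predict_fun, X, j, K = 20) {
  # Interval boundaries from quantiles of the j-th feature
  z <- unname(unique(quantile(X[[j]], probs = seq(0, 1, length.out = K + 1))))
  K <- length(z) - 1
  # Assign each instance to the interval containing its feature value
  interval <- cut(X[[j]], breaks = z, include.lowest = TRUE, labels = FALSE)
  diffs <- numeric(K)
  counts <- numeric(K)
  for (k in seq_len(K)) {
    idx <- which(interval == k)
    counts[k] <- length(idx)
    if (counts[k] == 0) next
    X_lo <- X[idx, , drop = FALSE]; X_lo[[j]] <- z[k]
    X_hi <- X[idx, , drop = FALSE]; X_hi[[j]] <- z[k + 1]
    # Finite difference of predictions across the k-th interval
    diffs[k] <- mean(predict_fun(X_hi) - predict_fun(X_lo))
  }
  # Accumulate the local effects, then center so that the (approximate)
  # mean over the data distribution is zero
  ale <- cumsum(diffs)
  ale - sum(ale * counts) / sum(counts)
}
```

The returned vector approximates \(f_{j,ALE}\) at the right interval boundaries; a full implementation would refine the centering and interpolate within intervals.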

3 Functional Complexity

In this section, we motivate complexity measures based on functional decomposition. Based on Eq. 1, we decompose the prediction function into a constant (estimated as \(f_0 = \frac{1}{n}\sum_{i=1}^n f(x^{(i)})\)), main effects (estimated by ALE), and a remainder term containing interactions (i.e., the difference between the full model and constant + main effects).

$$\begin{aligned} f(x) = \underbrace{f_0 + \sum_{j=1}^p \overbrace{f_{j,ALE}(x_j)}^{\text{MEC: How complex?}} + \overbrace{IA(x)}^{\text{IAS: Interaction strength?}}}_{\text{NF: How many features were used?}} \end{aligned}$$
(3)

This arrangement of components emphasizes a decomposition of the prediction function into a main effect model and an interaction remainder. We can analyze how well the main effect model itself approximates f by looking at the magnitude of the interaction measure IAS. The average main effect complexity (MEC) captures how many parameters are needed to describe the one-dimensional main effects on average. The number of features used (NF) describes how many features were used in the full prediction function.

3.1 Number of Features (NF)

We propose an approach based on feature permutation to determine how many features a model uses. We regard a feature as “used” when changing it changes the prediction. If available, the model-specific number of features is preferable. The model-agnostic version is useful when the prediction function is only accessible via an API or when the machine learning pipeline is complex.

The proposed procedure is formally described in Algorithm 1. To estimate whether the j-th feature was used, we sample instances from data \(\mathcal {D}\), replace their j-th feature values with random values from the distribution of \(X_j\) (e.g., by sampling \(x_j\) from other instances from \(\mathcal {D}\)), and observe whether the predictions change. If the prediction of any sample changes, the feature was used.

[Algorithm 1: Permutation-based estimation of the number of features used (NF)]
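A minimal R sketch of this heuristic follows (our illustration of the procedure as described above, not the paper's reference implementation; `predict_fun`, the numeric tolerance `tol`, and the sampling details are assumptions):

```r
# Sketch of the permutation-based NF heuristic: a feature counts as
# "used" if perturbing it changes any sampled prediction.
estimate_nf <- function(predict_fun, X, M = 500, tol = 1e-8) {
  used <- logical(ncol(X))
  for (j in seq_len(ncol(X))) {
    idx <- sample(nrow(X), size = M, replace = TRUE)
    X_perm <- X[idx, , drop = FALSE]
    # Replace the j-th feature with values drawn from its marginal
    # distribution, here by sampling from other instances in the data
    X_perm[[j]] <- X[sample(nrow(X), size = M, replace = TRUE), j]
    changed <- abs(predict_fun(X_perm) - predict_fun(X[idx, , drop = FALSE])) > tol
    used[j] <- any(changed)
  }
  sum(used)
}
```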

We tested the NF heuristic on the Boston Housing data. We trained decision trees (CART) with maximum depths \(\in \{1, 2, 10\}\), leading to 1, 2 and 4 features used, and L1-regularized linear models with penalties \(\lambda \in \{10, 5, 2, 1, 0.1, 0.001\}\), leading to 0, 2, 3, 4, 11 and 13 features used. For each model, we estimated NF with sample sizes \(M \in \{10, 50, 500\}\) and repeated each estimation 100 times. For the regularized linear models, NF was always equal to the number of non-zero weights. For CART, the mean absolute differences between NF and the number of features used in the trees were 0.300 (\(M = 10\)), 0.020 (\(M = 50\)) and 0.000 (\(M = 500\)).

3.2 Interaction Strength (IAS)

Interactions between features mean that the prediction cannot be expressed as a sum of independent feature effects; rather, the effect of a feature depends on the values of other features [24]. We propose to measure interaction strength as the scaled approximation error between the ALE main effect model and the prediction function f. Based on the ALE decomposition, the ALE main effect model is defined as the sum of first order ALE effects:

$$f_{ALE1st}(x) = f_0 + f_{1,ALE}(x_1) + \ldots + f_{p,ALE}(x_p)$$

We define interaction strength as the approximation error measured with loss L:

$$\begin{aligned} IAS = \frac{\mathbb{E}(L(f, f_{ALE1st}))}{\mathbb{E}(L(f, f_0))} \ge 0 \end{aligned}$$
(4)

Here, \(f_0\) is the mean of the predictions and can be interpreted as the functional decomposition with all feature effects set to zero. With the L2 loss, IAS equals 1 minus the R-squared measure, where the true targets \(y^{(i)}\) are replaced with the predictions \(f(x^{(i)})\):

$$IAS = \frac{\sum_{i=1}^n \left(f(x^{(i)}) - f_{ALE1st}(x^{(i)})\right)^2}{\sum_{i=1}^n \left(f(x^{(i)}) - f_0\right)^2} = 1 - R^2$$

If \(IAS=0\), then \(L(f,f_{ALE1st})=0\), which means that the first order ALE model perfectly approximates f and the model has no interactions.
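With the L2 loss, the estimate of Eq. 4 is a short computation. The following R sketch assumes the model predictions `f_hat` and the first-order ALE model predictions `f_ale1st` (intercept plus summed ALE main effects) are precomputed on the same instances:

```r
# Sketch of IAS with the L2 loss on n instances.
ias <- function(f_hat, f_ale1st) {
  f0 <- mean(f_hat)  # intercept: mean of the model predictions
  sum((f_hat - f_ale1st)^2) / sum((f_hat - f0)^2)
}
```

Because the targets are replaced by the model's own predictions, IAS measures how much of the prediction variance is left unexplained by the additive approximation of the model.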

3.3 Main Effect Complexity (MEC)

To determine the average shape complexity of ALE main effects \(f_{j,ALE}\), we propose the main effect complexity (MEC) measure. For a single ALE main effect, we define \(\text {MEC}_j\) as the number of parameters needed to approximate the curve with piece-wise linear models. For the entire model, MEC is the average \(\text {MEC}_j\) over all main effects, weighted with their variance. Figure 1 shows an ALE plot (= main effect) and its approximation with two linear segments.

Fig. 1. ALE curve (solid line) approximated by two linear segments (dotted line).

We use piece-wise linear regression to approximate the ALE curve. Within the segments, linear models are estimated with ordinary least squares. The breakpoints that define the segments are found by a greedy, exhaustive search along the interval boundaries of the ALE curve. Greedy here means that we first optimize the first breakpoint, then the second breakpoint with the first breakpoint fixed, and so on. We measure the degrees of freedom as the number of non-zero coefficients for intercepts and slopes of the linear models. The approximation allows some error; e.g., an almost linear main effect may have \(\text{MEC}_j = 1\), even if dozens of parameters would be needed to describe it perfectly. The approximation quality is measured with R-squared (\(R^2\)), i.e., the proportion of variance of \(f_{j,ALE}\) that is explained by the approximation with linear segments. An approximation has to reach an \(R^2 \ge 1 - \epsilon\), where \(\epsilon\) is the user-defined maximum approximation error. We also introduce the parameter \(max_{seg}\), the maximum number of segments. If an approximation cannot reach \(R^2 \ge 1 - \epsilon\) with the given \(max_{seg}\), \(\text{MEC}_j\) is computed with the maximum number of segments.

The selected maximum approximation error \(\epsilon\) should be small, but not too small. We found \(\epsilon\) between 0.01 and 0.1 visually meaningful (i.e., a subjectively good approximation) and used \(\epsilon = 0.05\) throughout the paper. We apply a post-processing step that greedily sets slopes of the linear segments to zero, as long as \(R^2\) remains within \([1 - \epsilon, 1]\). This post-processing potentially decreases \(\text{MEC}_j\), especially for models with constant segments such as decision trees. \(\text{MEC}_j\) is averaged over all features to obtain the global main effect complexity, where each \(\text{MEC}_j\) is weighted with the variance of the corresponding ALE main effect to give more weight to features that contribute more to the prediction. Algorithm 2 describes the MEC computation in detail.

[Algorithm 2: Main effect complexity (MEC)]
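A simplified R sketch of the per-feature computation follows, operating on an ALE curve given as grid points. It illustrates the greedy breakpoint search; the slope post-processing, the variance weighting across features, and the exact non-zero-coefficient counting of Algorithm 2 are omitted, and the final parameter count (one slope per segment plus one intercept per additional segment) is our simplification:

```r
# Simplified sketch of MEC_j for a single ALE main effect, given as grid
# points (x, y) with y = f_{j,ALE}(x) evaluated at x.
mec_j <- function(x, y, epsilon = 0.05, max_seg = 5) {
  rsq <- function(pred) 1 - sum((y - pred)^2) / sum((y - mean(y))^2)
  fit_segments <- function(breaks) {
    # Piecewise linear fit: a separate OLS model within each segment
    seg <- cut(x, breaks = c(-Inf, breaks, Inf), labels = FALSE)
    pred <- numeric(length(x))
    for (s in unique(seg)) {
      idx <- seg == s
      pred[idx] <- fitted(lm(y[idx] ~ x[idx]))
    }
    pred
  }
  breaks <- numeric(0)
  while (rsq(fit_segments(breaks)) < 1 - epsilon &&
         length(breaks) + 1 < max_seg) {
    # Greedy search: try each remaining interior grid value as the next
    # breakpoint and keep the one with the best fit
    candidates <- setdiff(x[-c(1, length(x))], breaks)
    if (length(candidates) == 0) break
    scores <- sapply(candidates,
                     function(b) rsq(fit_segments(sort(c(breaks, b)))))
    breaks <- sort(c(breaks, candidates[which.max(scores)]))
  }
  # One slope per segment plus one intercept per additional segment,
  # so a single linear segment yields MEC_j = 1
  2 * length(breaks) + 1
}
```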

4 Application of Complexity Measures

In the following experiment, we train various machine learning models on different prediction tasks and compute the model complexities. The goal is to analyze how the complexity measures behave across different datasets and models. The datasets are: Bike Rentals [10] (n = 731; 3 numerical, 6 categorical features), Boston Housing (n = 506; 12 numerical, 1 categorical feature), (downsampled) Superconductivity [18] (n = 2000; 81 numerical, 0 categorical features) and Abalone [9] (n = 4177; 7 numerical, 1 categorical feature).

Table 1. Model performance and complexity on 4 regression tasks for various learners: linear models (lm), cross-validated regularized linear models (cvglmnet), kernel support vector machine (ksvm), random forest (rf), gradient boosted generalized additive model (gamboost), decision tree (cart) and decision tree with depth 2 (cart2).

Table 1 shows the performance and complexity of the models. As desired, the main effect complexity for linear models is 1 (except when categorical features with more than two categories are present, as in the bike data), and higher for more flexible methods like random forests. The interaction strength (IAS) is zero for additive models (boosted GAM, (regularized) linear models). Across datasets, we observe that the underlying complexity, measured as the range of MEC and IAS across the models, varies. The bike dataset seems to be adequately described by additive effects alone, since even random forests, which often model strong interactions, show low interaction strength here. In contrast, the superconductivity dataset is better explained by models with more interactions. For the abalone dataset there are two models with low MSE: the support vector machine and the random forest. We might prefer the SVM, since its main effects can each be described with a single number (\(MEC = 1\)) and its interaction strength is low.

5 Improving Post-hoc Interpretation

Minimizing the number of features (NF), the interaction strength (IAS), and the main effect complexity (MEC) improves reliability and compactness of post-hoc interpretation methods such as partial dependence plots, ALE plots, feature importance, interaction effects and local surrogate models.

Fewer Features, More Compact Interpretations. Minimizing the number of features improves the readability of post-hoc analysis results. The computational complexity and output size of most interpretation methods scale with \(O(\text{NF})\), as for feature effect plots [1, 14] or feature importance [6, 11]. As demonstrated in Table 2, a model with fewer features has a more compact representation. If additionally \(IAS = 0\), the ALE main effects fully characterize the prediction function. Interpretation methods that analyze 2-way feature interactions scale with \(O(\text{NF}^2)\). A complete functional decomposition requires estimating \(\sum_{k=1}^{\text{NF}} \binom{\text{NF}}{k} = 2^{\text{NF}} - 1\) components, i.e., a computational complexity of \(O(2^{\text{NF}})\).

Less Interaction, More Reliable Feature Effects. Feature effect plots such as partial dependence plots and ALE plots visualize the marginal relationship between a feature and the prediction. The estimated effects are averages across instances. The effects can vary greatly for individual instances and even have opposite directions when the model includes feature interactions.

In the following simulation, we trained three models with different capacities for modeling interactions between features: a linear regression model, a support vector machine (radial basis kernel, C = 0.05), and gradient boosted trees. We simulated 500 data points with 4 features and a continuous target based on [15]. Figure 2 shows that the interaction strength increases across these models. The more interaction a model contains, the less reliable the feature effect curves become as a summary of the model behavior.

Fig. 2. The higher the interaction strength in a model (IAS increases from left to right), the less representative the partial dependence plot (light, thick line) becomes of individual instances, represented by their individual conditional expectation curves (dark, thin lines).

Less Complex Main Effects, More Compact Summaries. In linear models, a feature effect can be expressed by a single number, the regression coefficient. If effects are non-linear, the method of choice is visualization [1, 14]. Summarizing the effects with a single number (e.g., using average marginal effects [23]) can be misleading; e.g., the average effect might be zero for U-shaped feature effects. As a by-product of MEC, there is a third option: instead of reporting a single number, the coefficients of the segmented linear model can be reported. Minimizing MEC means preferring models with main effects that can be described with fewer coefficients, offering a more compact model description.
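For example (notation ours), a main effect approximated by two linear segments with breakpoint \(b_1\) can be reported as

$$f_{j,ALE}(x_j) \approx \begin{cases} \beta_{0,1} + \beta_{1,1} x_j & \text{if } x_j \le b_1 \\ \beta_{0,2} + \beta_{1,2} x_j & \text{if } x_j > b_1 \end{cases}$$

so that the breakpoint and the segment coefficients fully summarize the effect, in contrast to a single averaged number.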

6 Application: Multi-objective Optimization

We demonstrate model selection for performance and complexity in a multi-objective optimization approach. For this example, we predict wine quality (on a scale from 0 to 10) [7] from the physico-chemical properties, such as alcohol and residual sugar, of 4870 white wines. It is difficult to know the desired compromise between model complexity and performance before modeling the data. A solution is multi-objective optimization [12]. We suggest searching over a wide spectrum of model classes and hyperparameter settings, which allows selecting a suitable compromise between model complexity and performance.

We used the mlrMBO model-based optimization framework [19] with ParEGO [21] (500 iterations) to find the best models based on four objectives: the number of features used (NF), the main effect complexity (MEC), the interaction strength (IAS) and the mean absolute error (MAE), estimated with 5-fold cross-validation. We optimized over the space of the following model classes (and hyperparameters): CART (maximum tree depth and complexity parameter cp), support vector machine (cost C and inverse kernel width sigma), elastic net regression (regularization alpha and penalization lambda), gradient boosted trees (maximum depth, number of iterations), gradient boosted generalized additive model (number of iterations nrounds) and random forest (number of split features mtry).

Results. The multi-objective optimization resulted in 27 models. The measures had the following ranges: MAE 0.41–0.63, number of features 1–11, main effect complexity 1–9 and interaction strength 0–0.71. For a more informative view, Table 2 visualizes the main effects of selected models together with the measures. The selected models show different trade-offs between the measures.

Table 2. A selection of four models from the Pareto optimal set, along with their ALE main effect curves. From left to right, the columns show the models with (1) lowest MAE, (2) lowest MAE with \(MEC = 1\), (3) lowest MAE with \(IAS \le 0.2\), and (4) lowest MAE with \(NF \le 7\).

7 Discussion

We proposed three measures for machine learning model complexity based on functional decomposition: number of features used, interaction strength and main effect complexity. Due to their model-agnostic nature, the measures allow model selection and comparison across different types of models and they can be used as objectives in automated machine learning frameworks. This also includes “white-box” models: For example, the interaction strength of interaction terms in a linear model or the complexity of smooth effects in generalized additive models can be quantified and compared across models. We argued that minimizing these measures for a machine learning model improves its post-hoc interpretation. We demonstrated that the measures can be optimized directly with multi-objective optimization to make the trade-off between performance and post-hoc interpretability explicit.

Limitations. The proposed decomposition of the prediction function and the definition of the complexity measures will not be appropriate in every situation. For example, all higher order effects are combined into a single interaction strength measure that does not distinguish between two-way interactions and higher order interactions. However, the framework of accumulated local effect decomposition allows estimating higher order effects and constructing different interaction measures. The main effect complexity measure only considers linear segments, but not, e.g., seasonal components or other structures. Furthermore, the complexity measures quantify machine learning models from a functional point of view and ignore the structure of the model (e.g., whether it can be represented by a tree). For example, the main effect complexity and interaction strength measures can be large for short decision trees (e.g., in Table 1).

Implementation. The code for this paper is available at https://github.com/compstat-lmu/paper_2019_iml_measures. For the examples and experiments we relied on the mlr package [5] in R [29].