Transforming Feature Space to Interpret Machine Learning Models

Model-agnostic tools for interpreting machine-learning models struggle to summarize the joint effects of strongly dependent features in high-dimensional feature spaces, which play an important role in pattern recognition, for example in remote sensing of landcover. This contribution proposes a novel approach that interprets machine-learning models through the lens of feature space transformations. It can be used to enhance unconditional as well as conditional post-hoc diagnostic tools including partial dependence plots, accumulated local effects plots, or permutation feature importance assessments. While the approach can also be applied to nonlinear transformations, we focus on linear ones, including principal component analysis (PCA) and a partial orthogonalization technique. Structured PCA and diagnostics along paths offer opportunities for representing domain knowledge. The new approach is implemented in the R package `wiml`, which can be combined with existing explainable machine-learning packages. A case study on remote-sensing landcover classification with 46 features is used to demonstrate the potential of the proposed approach for model interpretation by domain experts.


Introduction
Interpreting complex nonlinear machine-learning models is an inherently difficult task. A common approach is the post-hoc analysis of black-box models for dataset-level interpretation (Murdoch et al. 2019) using model-agnostic techniques such as permutation-based variable importance, and graphical displays such as partial dependence plots that visualize main effects while integrating over the remaining dimensions (Molnar, Casalicchio, and Bischl 2020).
These tools are so far limited to displaying the relationship between the response and one (or sometimes two) predictor(s), while attempting to control for the influence of the other predictors. This can be rather unsatisfactory when dealing with a large number of highly correlated predictors, which are often semantically grouped. While the literature on explainable machine learning has often focused on dealing with dependencies affecting individual features, e.g. by introducing conditional diagnostics (Strobl et al. 2008; Molnar, König, Bischl, et al. 2020), no practical solutions are available yet for dealing with model interpretation in high-dimensional feature spaces with strongly dependent features (Molnar, Casalicchio, and Bischl 2020; Molnar, König, Herbinger, et al. 2020).
These situations routinely occur in environmental remote sensing and other geographical and ecological analyses (Landgrebe 2002; Zortea, Haertel, and Clarke 2007), which motivated the present proposal to enhance existing model interpretation tools by offering a new, transformed perspective. For example, vegetation 'greenness' as a measure of photosynthetic activity is often used to classify landcover or land use from satellite imagery acquired at multiple time points throughout the growing season (Peña and Brenning 2015; Peña, Liao, and Brenning 2017). Spectral reflectances of equivalent spectral bands (the features) are usually strongly correlated within the same phenological stage since vegetation characteristics vary gradually. Similarly, when using texture features to characterize image structure based on a filter bank, features with similar filter settings can be strongly correlated, as is the case in our case study (Brenning, Long, and Fieguth 2012).
Although it may be tempting in these situations to use feature engineering or feature selection techniques to reduce the complexity of feature space, experience shows that this may lead to a decline in predictive performance. Also, re-training a model using modified features is not normally an option in post-hoc analyses of machine-learning models.
Considering these challenges, and the inherent need to reduce complexity at the time of interpreting an already trained model, a novel strategy to visualize machine-learning models along cross-sections through feature space is proposed in this paper. In many situations, principal components (PCs) offer a particularly appealing perspective onto feature space from a practitioner's point of view, although the proposed approach is not limited to this transformation. In addition, a modification is proposed that focuses on subgroups of features and their principal axes in order to allow for a more structured approach to model interpretation that is more consistent with the data analyst's domain knowledge.
Turning our attention back to the importance of individual features, we also propose an orthogonalization technique that can be used to single out the effect of individual features on model predictions, avoiding the sometimes complex structure of PCs. This approach can, in principle, be applied to arbitrary paths through feature space, such as nonlinear curves defined by domain-specific perspectives, or data-driven transitions between clusters of observations.
The proposed approaches can be combined with commonly used plot types and diagnostics including partial dependence plots, accumulated local effects (ALE) plots, and permutation-based variable importance measures, among other model-agnostic techniques that only have access to the trained model (Apley and Zhu 2020; Molnar 2019). While the focus of this contribution is on visualizing main effects, analyses of conditional relationships may also benefit from this perspective (Strobl et al. 2008; Molnar, König, Bischl, et al. 2020).

Proposed Method
Let's consider a regression or classification model with prediction function f that was fitted to a training sample L in the (original, untransformed) p-dimensional feature space X ⊂ R^p. We will assume f(x) ∈ R; in the case of classification problems, f(x) shall therefore represent predictions of some real-valued quantity such as the probability or logit of a selected target class. One of the features, referred to as x_s, is selected as the feature of interest, and the remaining features are denoted by x_C.

Example: partial dependence plots
In this situation, the partial dependence plot of f with respect to x_s can formally be defined as

f_PD(x_s) := E_{x_C}[f(x_s, x_C)] ≈ (1/n) Σ_{i=1}^{n} f(x_s, x_C^(i)),

i.e. the average prediction obtained when fixing the feature of interest at x_s while averaging over the training sample's values of the remaining features (Molnar 2019). This plot, which can be generalized to more than one x_s dimension, was introduced by Friedman (2001) to visualize main effects of predictors in machine-learning models.
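To make this definition concrete, the following is a minimal sketch in Python with illustrative toy data (the paper's own implementation is the R package `wiml`; all variable names here are hypothetical). It computes the partial dependence curve by fixing the feature of interest at each grid value and averaging the model's predictions over the training sample:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data: two strongly correlated features, nonlinear response.
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = np.sin(x1) + 0.5 * x2 + 0.1 * rng.normal(size=n)

f = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def partial_dependence(model, X, s, grid):
    """f_PD(x_s) = (1/n) * sum_i f(x_s, x_C^(i)): average prediction
    over the training sample with feature s fixed at each grid value."""
    pd_values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, s] = v                      # fix feature s everywhere
        pd_values.append(model.predict(Xv).mean())
    return np.array(pd_values)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pd_curve = partial_dependence(f, X, s=0, grid=grid)
```

Note that fixing x1 while retaining the observed x2 values already illustrates the extrapolation issue discussed next: many of the evaluation points (x1 fixed, x2 observed) lie far from the correlated training data.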
Partial dependence plots have some disadvantages such as the extrapolation of f beyond the region in X for which training data is available (Apley and Zhu 2020; Molnar, König, Herbinger, et al. 2020). This is especially the case when predictors are strongly correlated, as in our case study. Nevertheless, without loss of generality, this simple plot type helps to illustrate the proposed approach.

Transformed feature space
When several predictors are strongly correlated and/or express the same domain-specific concept such as 'early-season vegetation vigour' in vegetation remote-sensing, we may be more interested in exploring the overall effect of these predictors. Principal component analysis (PCA) and related data transformation techniques such as factor analysis are tools that are often used by practitioners to synthesize and interpret multivariate data (Basille et al. 2008).
More generally speaking, we could think of a bijective (invertible) transformation function T : X → W that can be used to re-express the features in our data set. We will assume that T is continuous and differentiable. PCA is one such example.
Through the composition of the back transformation T^{-1} and the model function f, we can now formally define a model ĝ on W, ĝ := f ∘ T^{-1}, which predicts the real-valued response based on 'data' in W although it was trained using a learning sample L ⊂ X in the untransformed space.
We can use this to formally re-express the partial dependence plot as a function of w_s:

ĝ_PD(w_s) := (1/n) Σ_{i=1}^{n} ĝ(w_s, w_C^(i)).

Note that T^{-1}, when used only on data in Im T(X) := T(X), does not create x values outside the data-supported region X, and it therefore avoids extrapolation of f. Also, when choosing PCA for T, the w variables in T(L) are uncorrelated, and statistically independent if L arises from a multivariate normal distribution. Thus, the PCA approach overcomes one of the limitations of partial dependence plots and broadens their applicability.
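The composition ĝ = f ∘ T^{-1} can be sketched in a few lines of Python (again with illustrative toy data and hypothetical names; `wiml` itself is an R package). Here T is a PCA rotation, and the partial dependence of ĝ is computed along the first principal axis:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Toy data: five nearly identical (strongly correlated) features.
n = 400
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n) for _ in range(5)])
y = X.mean(axis=1) ** 2 + 0.1 * rng.normal(size=n)

f = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# T : X -> W is the (invertible) PCA rotation; T^{-1} maps scores back.
T = PCA().fit(X)
W = T.transform(X)                        # training sample in W space

def g(w):
    """g := f o T^{-1}: predict from PCA scores."""
    return f.predict(T.inverse_transform(w))

def pd_along_pc(s, grid):
    """Partial dependence of g with respect to component w_s."""
    out = []
    for v in grid:
        Wv = W.copy()
        Wv[:, s] = v                      # fix score s everywhere
        out.append(g(Wv).mean())
    return np.array(out)

grid = np.linspace(W[:, 0].min(), W[:, 0].max(), 15)
curve = pd_along_pc(0, grid)
```

Because the perturbed scores stay inside T(X) along the dominant axis, the back-transformed evaluation points remain close to the data-supported region, unlike in the original-space partial dependence plot.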

Orthogonalization approaches
In some instances, PCs (and other multivariate transformations) of large and complex feature sets can be difficult to interpret, and analysts would therefore like to focus on individual features that are perhaps 'representative' of a larger group of features; for example, vegetation greenness in mid-June may be a good proxy for vegetation greenness a few weeks earlier and later, as expressed by other features in the feature set (Peña and Brenning 2015).
This can be addressed by proposing a transformation of X in which w_s := x_s is retained, while making the remaining basis vectors linearly independent of x_s. This can be achieved through orthogonalization:

w_i := x_i − b_i x_s   for all i ≠ s,

where b_i equals Pearson's correlation coefficient of x_s and x_i, and where we assume for simplicity of notation that all features are zero-mean with a unit standard deviation.
This defines a linear transformation T : X → W, which can be represented by its coefficient matrix. Note that T can be inverted using

x_i = w_i + b_i w_s,

since x_s = w_s, and assuming that all |b_i| < 1. A related iterative orthogonalization approach has previously been proposed in the context of feature ranking (Adebayo and Kagal 2016).
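A minimal numerical sketch of this partial orthogonalization (illustrative Python, not the `wiml` implementation) confirms that the transformation removes the correlation with x_s and is exactly invertible:

```python
import numpy as np

rng = np.random.default_rng(2)

# Standardized toy features (zero mean, unit sd), as assumed in the text.
n = 1000
a = rng.normal(size=n)
X = np.column_stack([a, 0.8 * a + 0.6 * rng.normal(size=n)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

s = 0
b = np.corrcoef(X, rowvar=False)[s]       # b_i = cor(x_s, x_i); b_s = 1

# Forward transformation: w_s := x_s, w_i := x_i - b_i * x_s (i != s)
W = X - np.outer(X[:, s], b)
W[:, s] = X[:, s]

# Back transformation: x_i = w_i + b_i * w_s, with x_s = w_s
X_back = W + np.outer(W[:, s], b)
X_back[:, s] = W[:, s]

# w_i (i != s) is now uncorrelated with w_s = x_s.
r = np.corrcoef(W[:, 0], W[:, 1])[0, 1]
```

The remaining component w_2 captures the part of x_2 that is not linearly explained by x_s, which is exactly what a dependence plot for x_s in the transformed space holds fixed.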

Dependence plots along paths
Data analysts may more generally want to visualize the effect of a real-valued function of multiple features.
As an example, knowing that several features are strongly correlated, how does the response vary with the mean value of these features, or more generally a linear combination? This information is sometimes hidden in an ocean of individual main effects plots or variable importance measures.
In other situations, there may be simple process-based models that have the potential to provide deeper insights into black-box models.These models may be candidates for an enhancement of feature space, or they might express specific theories or hypotheses.
Any of these transformations can be thought of as a function that defines a one-dimensional path in feature space.
In different use cases there may be different ways of constructing paths of interest to the data analyst:
• A group of strongly positively correlated features could be averaged to obtain an overall signal, or contrasts between groups of features could be calculated.
• A linear path can be drawn from one cluster centre to another, where cluster centres c_1, . . ., c_k ∈ X are obtained by unsupervised clustering in feature space (e.g., k-means). The path between clusters 1 and 2 is simply defined as c_1 + t(c_2 − c_1), t ∈ [0, 1], etc.
• A linear path can be drawn between user-defined points in feature space, e.g. in remote sensing so-called endmembers representing spectral characteristics of 'pure' surface types such as asphalt or water (Somers et al. 2016).
Evidently, creating a dependence plot for a feature x_s is just a special case that follows one of the basis vectors of feature space. In other words, without loss of generality we can formally include t as an additional feature in our feature set and denote it with x_s. This is just a formal inclusion, without actually offering this new variable to the model for training.
For a formally more accurate treatment, and keeping t out of the feature space X ⊂ R^p, we write w_s := t and construct the w_i's as in the previous section by orthogonalization, this time, however, for all i = 1, . . ., p. Thus, w_s comes on top of the other p dimensions, so that W ⊂ R^(p+1).
Due to the redundancy of w_s, only the back transformation needs to be defined, as a mapping T^{-1} : W → X from (p + 1)- to p-dimensional space, with the x_i's being recovered from the w_i's as in the previous section.
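The second bullet above, a dependence profile along a path between cluster centres, can be sketched as follows (illustrative Python with toy data and hypothetical names; any fitted model with a predict method would do):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Toy data forming two well-separated clusters in feature space.
n = 300
X = np.vstack([rng.normal(-2.0, 0.5, size=(n // 2, 3)),
               rng.normal(+2.0, 0.5, size=(n // 2, 3))])
y = X.sum(axis=1) + 0.1 * rng.normal(size=n)

f = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Cluster centres from k-means; the path c(t) = c_1 + t * (c_2 - c_1),
# t in [0, 1], runs from one centre to the other.
centres = KMeans(n_clusters=2, n_init=10,
                 random_state=0).fit(X).cluster_centers_
c1, c2 = centres

t_grid = np.linspace(0.0, 1.0, 11)
path = c1[None, :] + t_grid[:, None] * (c2 - c1)[None, :]
profile = f.predict(path)       # model response along the path
```

Plotting `profile` against `t_grid` shows how the model's prediction transitions between the two regimes of feature space, which would otherwise be scattered across three separate main-effects plots.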

Other model-agnostic plots
The same principles outlined in the previous section can be applied to ALE plots and related model-agnostic tools, including permutation-based variable importance and their conditional modifications (see reviews by Molnar et al. 2020a and Molnar 2021). Also, this is not limited to x_s ∈ R: the principle equally applies to bivariate relationships with x_s ∈ R^2, which can be used to display pairwise interactions. Clearly, in a high-dimensional situation, the need to reduce dimensionality in post-hoc model interpretation is even more pressing when interpreting up to p(p − 1)/2 pairwise interactions, and the proposed approach offers a practical tool to address this in situations where dimension reduction is viable.

Implementation
The proposed methods are provided as an R package wiml (https://github.com/alexanderbrenning/wiml). It implements transformation functions called 'warpers' based on PCA (of all features or a subset of features), structured PCA (for multiple groups of features), and feature orthogonalization, all of which are based on rotation matrices and therefore share a common core. Due to the modular and object-oriented structure, users can easily implement their own transformations without requiring changes to the package.
These warpers can be used to implement the composition f ∘ T^{-1} by 'warping' a fitted machine-learning model. The resulting object behaves like a regular fitted machine-learning model in R, offering an adapted predict method. From a user's perspective, the resulting model feels like it had been fitted to the transformed data T(L), except that it hasn't. This 'warped' fitted model can, in principle, be used with any model-agnostic tool that doesn't require refitting. An implementation of the composition f ∘ T^{-1} involving the untrained model f is also available; this can be used for drop-and-relearn or permute-and-relearn techniques (Hooker and Mentch 2019).
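The wrapper pattern behind 'warping' can be sketched language-agnostically; the following Python class is a minimal illustration of the idea only, not the wiml API (which is implemented in R). The wrapped object exposes a predict method that accepts W-space data and delegates to the original fitted model after back-transformation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

class WarpedModel:
    """Sketch of the 'warper' idea: wrap a fitted model f and a bijective
    transformation T so that the wrapped object predicts from W-space data.
    (Pattern only; hypothetical names, not the wiml API.)"""

    def __init__(self, fitted_model, transform):
        self.model = fitted_model
        self.T = transform   # must provide transform / inverse_transform

    def predict(self, W):
        # g(w) := f(T^{-1}(w))
        return self.model.predict(self.T.inverse_transform(W))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + 0.1 * rng.normal(size=200)

f = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
T = PCA().fit(X)

g = WarpedModel(f, T)
W = T.transform(X)
# The warped model reproduces f's predictions on back-transformed data:
same = np.allclose(g.predict(W), f.predict(X))
```

Because the wrapper only intercepts predict, any model-agnostic tool that merely queries predictions can be pointed at `g` and the transformed data without retraining anything.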
The package has been tested and works well with the iml package for interpretable machine learning (Molnar, Bischl, and Casalicchio 2018), but it can also be combined with other frameworks since it only builds thin wrappers around standard R model functions. Initial tests with the DALEX framework for explainable machine learning (Biecek 2018) and its interactive environment modelStudio (Baniecki and Biecek 2019) have been successful, as have been tests with the pdp package (Greenwell 2017).

Case Study
The potential of the proposed methods will be demonstrated in a case study from land cover classification, which is a common machine-learning task in environmental remote sensing (e.g., Mountrakis, Im, and Ogole 2011; Peña and Brenning 2015). One particularly challenging task is the detection of rock glaciers, which, unlike 'true' glaciers, do not present visible ice on their surface; they are rather the visible expression of creeping ice-rich mountain permafrost. In the present case study, we look at a subset of a well-documented data set consisting of a sample of 1000 labelled point locations (500 presence and 500 absence locations of flow structures on rock glaciers) in the Andes of central Chile (Brenning, Long, and Fieguth 2012).
There are 46 features in total, which are divided into two unequal subsets: six features are terrain attributes (local slope angle, potential incoming solar radiation, mean slope angle of the catchment area, and logarithm of catchment height and catchment area), which are proxies for processes related to rock glacier formation.
The other 40 features are Gabor texture features (Clausi and Jernigan 2000), which are designed to detect the furrow-and-ridge structure in high-resolution (1 m × 1 m) satellite imagery, in this case panchromatic IKONOS imagery (see Brenning, Long, and Fieguth 2012 for details). The 40 Gabor features correspond to different filter bandwidths (5, 10, 20, 30 and 50 m), anisotropy factors (1 or 2), and types of aggregation over different filter orientations (minimum, median, maximum, and range).
Texture features with similar filter settings are often strongly correlated with each other. This is especially true for minimum and median aggregation with otherwise equal settings, and for maximum and range aggregation.
Overall, the median of each feature's strongest Pearson correlation is 0.92 (minimum: 0.80). Correlations among terrain attributes are much smaller (median strongest correlation: 0.60). Terrain attributes and texture features are weakly correlated (maximum correlation: 0.32). Correlation statistics are very similar for Spearman's rank-based correlation.

To explore the feature sets, PCAs were performed for the entire set of 46 features and for the subset of 40 Gabor features (Figure 1). In the entire feature set, 63.6% of the variance is concentrated in the first two PCs (first six PCs: 83.7%). In the more strongly correlated Gabor feature set, in contrast, the first two PCs make up 72.2% of the variance (first six PCs: 89.5%). The main PCs turned out to be interpretable by domain experts. PC #1 of the Gabor feature set ('Gabor1' in the figures) is basically an overall average of all texture features, meaning that it expresses the overall presence of striped patterns of any characteristics.
Gabor PC #2 represents the contrast between minimum and median aggregated anisotropic Gabor features and the rest; large values are interpreted as incoherent patterns with no distinct, repeated stripe pattern. Gabor PC #3 expresses differences between large-wavelength range or maximum-aggregated features versus the short-wavelength features, which represents the heterogeneity in the width of stripes, and thus the size of linear surface structures. Very large values correspond to distinct patterns of large amplitude.

A random forest classifier is used for the classification of rock glaciers based on the features introduced above. Its overall accuracy, estimated by spatial cross-validation between the two sub-regions (Brenning 2012), is 80.1%. Omitting terrain attributes from the feature set has a greater impact on performance than omitting the texture features (Table 1).

Results
With 46 features that are grouped into two semantic feature types (terrain attributes, texture features), it can be challenging to interpret the patterns represented by marginal effects plots (Figure 2). Although there appears to be some consistency in direction among many of the texture features, it is difficult to identify an overall pattern that can be summarized verbally, and it would be unreasonable to present such detailed visual information to a conference audience that is expecting a concise and coherent narrative.
The ALE plots along principal axes distill 71.6% of the feature variance into only three plots (Figure 3). Nevertheless, considering the semantic differences and weak correlations between texture features and terrain attributes, it seems unnecessary to combine all features in a joint PCA, which results in PCs with an at least slightly mixed meaning.
The structured PCA approach, in contrast, allows us to explicitly separate the model's representation of effects of texture features and terrain attributes, which is desirable from a domain expert's perspective and statistically justifiable based on the weak correlations between these feature groups. Larger overall texture signals (Gabor PC #1) are associated with higher predicted rock glacier probability (Figure 4). However, a large contrast between minimum/median anisotropic texture features and the remaining texture features, as expressed by a high Gabor PC #2 value, is more often associated with an absence of rock glaciers. In other words, the absence of coherently oriented, repeated stripes is not typical of rock glaciers; these may be more typical of non-repeated stripes (e.g., erosion gullies, jagged rock slopes).
The permutation-based assessment of the importance of texture PCs and terrain attributes shows that subsequent PCs contribute much less to the predictive performance, and that slope angle is the most salient feature overall (Figure 5). Clearly, the combined importance of Gabor features as summarized by Gabor PCs #1 and #2 provides a more comprehensible summary than an incoherent litany of individual feature importances of strongly correlated features, which should not be permuted independently of each other.

Discussion
Overall, interpretation plots along the principal axes are capable of distilling complex high-dimensional relationships into low-dimensional summaries, thus providing a tidier, better structured and more focused approach to model interpretation than traditional tools that focus on individual predictors in an ocean of highly correlated features. This behaviour is highly desirable from a domain expert's perspective, and applying it in a structured manner allows the analyst to honour domain knowledge and feature semantics.
Of course, fitting the classifier to PCA-transformed data as input features could have provided direct access to ALE plots along principal axes. However, we would want our feature engineering decisions to be directed towards improving predictive performance, and we would therefore prefer not to risk compromising an optimal performance to satisfy our desire to interpret our model. While this is not an issue in the present case study (overall accuracy 0.791 with PC features versus 0.801 with the original predictors), our experience shows that PCA-transformed predictors can worsen the predictive performance. Also, model-agnostic post-hoc analysis tools are precisely meant to be applicable to black-box models that are provided 'as is,' without the possibility of altering their input features, in which case the proposed 'hands-off' access to transformed perspectives is particularly valuable.

The proposed use of PCA and related linear transformation techniques appears to be in contradiction to the use of complex nonlinear machine-learning models. Nevertheless, it could be argued that linear cross-sections of feature space along the original feature axes are no less arbitrary and limiting, considering the often strong correlations with other features that would have to be interpreted simultaneously. From that perspective, principal axes provide a 'tidier' perspective and smarter peek into feature space than traditional ALE or partial dependence plots. Linear transformations similar to PCA may further enhance interpretability by offering a more structured or target-oriented perspective based on simple components (Rousson and Gasser 2004) or discriminant functions (Cunningham and Ghahramani 2015).
One could even argue that ALE plots can be misleading for highly correlated features as they look at the often tiny contributions of individual features in an isolated way, while the proposed approach focuses on the bigger picture and captures the combined effect of a bundle of features.This also becomes evident in permutation-based feature importance assessments, where individual texture features consistently achieve discouragingly low importances, while the first two PCs of the texture features are ranked very highly.For the specific case of permutation assessments, it has also been proposed to jointly permute groups of features (Molnar 2019); unlike the techniques proposed here, this approach is not transferable to other interpretation tools that are not based on permutations.
Beyond linear transformations, the proposed approach provides a general framework even for nonlinear perspectives on feature space and model functions. In particular, paths proposed in section 2.4 may well be nonlinear, as e.g. defined by a physical model that could be used by domain experts to check model plausibility. Also, curvilinear component analyses (CCA) or autoencoders as state-of-the-art multivariate nonlinear transformation methods provide a logical extension of PCA and highlight the link between explainable machine learning and projection-based visual analytics (Schaefer et al. 2013).
Finally, due to the orthogonality and thus linear independence of PCs, the more naturally interpretable partial dependence plots become a more viable option for the interpretation of black-box machine-learning models.In original feature space, in contrast, the less intuitive and sometimes rather coarse ALE plots should usually be preferred despite their limitations (compare Figures 6 and 7).

Conclusions
Despite the inherent limitations of post-hoc machine-learning model interpretation, feature space transformations, and structured PCA transformations in particular, are a powerful tool that allows us to distill complex nonlinear relationships into an even smaller number of univariate plots than previously possible, representing perspectives that are informed by domain knowledge. These transformations provide an intuitive access to feature space, and they can be easily wrapped around existing model implementations. Model interpretation through the lens of feature transformation and dimension reduction allows us to peek into the feature space at an oblique angle, a strategy that many of us have successfully applied when checking if our kids are asleep in their beds, and a much more successful strategy than staring along the walls, i.e. the original feature axes, especially when these are nearly parallel.

Figure 1: Feature (sub)space diagrams for the first PCs of the entire feature set.

Figure 4: ALE plots along the first principal axes of texture features, and for the most important terrain attributes.

Table 1: Overall accuracies of random forest classification of rock glaciers using the entire feature set, and omitting either the terrain attributes or the texture features.