1 Introduction

In the past 20 years, manifold visualisation methods have been a major development in unsupervised learning. The trend that started with ISOMAP in 2000 (Tenenbaum et al., 2000) has resulted in hundreds of methods, popular examples of which include t-SNE (van der Maaten & Hinton, 2008) and UMAP (McInnes et al., 2020). Manifold visualisation methods can be used to embed data into typically two or three dimensions while preserving some of the relevant features of the data. These methods have proven to be invaluable and central to exploring and understanding complex datasets in fields from genetics (Kobak & Berens, 2019; Diaz-Papkovich et al., 2021) to astronomy (Anders et al., 2018) and linguistics (Levine et al., 2020).

Another recent development is explainable artificial intelligence (XAI), where the objective is to understand and explore black box supervised learning algorithms; see Guidotti et al. (2019) for a recent survey. The explanation methods can roughly be divided into global and local methods. Global methods try to explain the global behaviour of a supervised learning method by constructing a global understandable (white box) surrogate model that approximates the complex black box model. The drawback of the global approach is that for a sufficiently complex model, there is no simple surrogate model that would replicate the whole model with reasonable fidelity.

The alternative is local explanations focusing on how individual data items are classified or regressed. The advantage is that it is often possible to give high-fidelity, interpretable local explanations. The obvious disadvantage is that an explanation that is good for one data item may be useless for the other data items. A common model-agnostic approach for local explanations is to locally approximate the black box model with an interpretable white box model. These white box models are used to better understand the decision process by, e.g., showing which variables affect the outcome and how to achieve a different outcome.

In this paper, we combine the above two developments, namely, manifold visualisations and local explanations, to obtain global supervised manifold visualisations of the space of local explanations by using outputs of various black box supervised learning algorithms. We call the algorithm slisemap.

The idea of slisemap is straightforward: we want to find an embedding of data points into a (typically) two-dimensional plane such that the same interpretable model explains the supervised learning model of the data points nearby in the embedding. The embedding of the data points and the local models associated with each point in the embedding form a global explanation of the supervised learning model as a combination of the local explanations. At the same time, our method produces a visualisation of the data where the data points that are being classified (or regressed) with the same rules are shown nearby.

Fig. 1

PCA (left) and slisemap (right) embeddings of a toy dataset described in the text. The toy data matrix consists of 4-dimensional Gaussian noise in \(\mathbf{{X}}\in {\mathbb {R}}^{99\times 4}\), and the response vector \(\mathbf{y} \in \mathbb {R}^{99}\) comes from a black box model \(f(\mathbf{{x}})=\max \mathbf{{x}}_{1:3}\). The legend in the plots corresponds to the value of \(\mathbf{{c}}_i=\arg \max \nolimits _{j\in \{1,2,3\}}{\mathbf{{X}}_{ij}}\). We have added some jitter to the slisemap embeddings to make the points in the clusters stand out

Example 1

First, consider a toy regression example where we have 99 data points composed of 4-dimensional covariates represented by rows of matrix \(\mathbf{{X}}\in {\mathbb {R}}^{99\times 4}\) and a pretrained black box regression model given by function \(f:{\mathbb {R}}^4\rightarrow {\mathbb {R}}\), which we want to study. The response vector \(\mathbf{{y}}\in {\mathbb {R}}^{99}\) is given by the regression estimates as \(\mathbf{{y}}_i=f(\mathbf{{X}}_{i\cdot })\), where \(\mathbf{{X}}_{i\cdot }\) denotes the ith row of the matrix \(\mathbf{{X}}\). Unknown to the user, the elements of matrix \(\mathbf{{X}}\) have been sampled at random from a normal distribution with zero mean and unit variance, and the regression function f is given by \(f(\mathbf{{x}})=\max {\mathbf{x}_{1:3}}=\max \nolimits _{j\in \{1,2,3\}}{\mathbf{{x}}_j}\), where \(\mathbf{{x}}\in {\mathbb {R}}^4\). In other words, the regression utilises the first three attributes in a nonlinear manner while ignoring the fourth attribute altogether.

Now, assume the user wishes to study the black box regression function and the dataset by embedding this 4-dimensional toy dataset into two dimensions. Any dimensionality reduction method that only takes the covariate matrix \(\mathbf{{X}}\) into account, and ignores the response variables in \(\mathbf{{y}}\), would see only Gaussian noise, resulting in limited insight about the data and the regression function, as shown in the PCA visualisation of Fig. 1 (left).

Then, consider a variant of slisemap, where ordinary least squares linear regression is used as an interpretable white box model. slisemap will produce an embedding where the data are split into three clusters indexed by \(\mathbf{{c}}_i=\arg {\max \nolimits _{j\in \{1,2,3\}}{\mathbf{{X}}_{ij}}}\), as shown in Fig. 1 (right). Each of the clusters corresponds to a different white box model. The white box models are denoted by \(g_i:{\mathbb {R}}^4\rightarrow {\mathbb {R}}\) for all \(i\in \{1,\ldots ,99\}\) and are in this example simply given by \(g_i(\mathbf{{x}})=\mathbf{{x}}_{\mathbf{{c}}_i}\).
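The claim that each cluster is modelled exactly by a single linear model can be checked with a minimal NumPy sketch. It regenerates toy data of the same shape and fits one ordinary least squares model per cluster. The random seed and the variable names are our own assumptions, and the cluster labels are computed directly from the data rather than discovered by slisemap.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Toy data as in Example 1: 99 points with 4 Gaussian covariates.
X = rng.normal(size=(99, 4))
y = X[:, :3].max(axis=1)      # black box f(x) = max(x_1, x_2, x_3)
c = X[:, :3].argmax(axis=1)   # cluster index c_i (0-based here)

# Fit one ordinary least squares model per cluster (no intercept needed).
coefficients = {}
for cluster in range(3):
    members = c == cluster
    b, *_ = np.linalg.lstsq(X[members], y[members], rcond=None)
    coefficients[cluster] = b
    print(cluster, np.round(b, 6))
```

Within cluster \(\mathbf{{c}}_i\), the fitted coefficient vector should pick out the single attribute \(\mathbf{{x}}_{\mathbf{{c}}_i}\) and assign zero weight to the others, including the uninformative fourth attribute.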

For these toy data, slisemap can, therefore, partition the data into three clusters, each modelled locally to good accuracy by a separate linear white box model. The slisemap embedding, together with the white box models, produces a global explanation of the black box model. The slisemap embedding could help an analyst find “functional groups” of data points, with each group modelled by a simple linear model. This kind of “reverse-engineering” of the black box model can be instrumental in understanding how the black box model works and where it does not.

Uninformative directions in the data space, such as the 4th attribute in this example, are automatically ignored; slisemap follows the possibly nonlinear manifold relevant for the supervised learning task. Note that in each of the three clusters in the slisemap embedding, the value of the response variable \(\mathbf{{y}}_i\) obeys an identical distribution: nearby points in the slisemap embedding have similar white box models, not necessarily similar values of the response variables!

Example 2

A property of local explanations is that there may be several explanations, with roughly equally good fidelity, for any given data point. Consider the toy dataset described above, but let us assume that the user wants to add a new point \(\mathbf{x}'\) where some of the maximal variables are identical, \(\mathbf{x}'_i=\mathbf{x}'_j=\max {\mathbf{x}'_{1:3}}\), where \(i,j\in \{1,2,3\}\) and \(i\ne j\). This new point would fit equally well into both clusters i and j and, hence, has two potential local explanations. As shown later in the experiments, this also occurs with real datasets, and slisemap can be used to reveal this ambiguity, unlike more traditional local explanation methods that output only one white box model.

Example 3

As a more realistic and complex example, Fig. 2 shows the visualisation of a black box model that classifies 2 versus 3 in the classic MNIST (Lecun et al., 1998) dataset of hand-written digits. Here, the black box model is a convolutional neural network, and the white box model is a logistic regression classifier that takes the flattened image pixels as an input vector. The images are projected onto a two-dimensional plane such that the black box classifier for nearby images can, with reasonable fidelity, be approximated by the same logistic regression model. The digits are split into four visually separable clusters, with digits in each cluster classified by different sets of pixels. The logistic regression coefficients for different image pixels are shown in Fig. 2 (right). For example, the classifier separates the 2s and 3s at the bottom right mainly by identifying the "lower curve" in 3s, while in the images on the left, the classifier is looking for black pixels below the centre. Seeing this visualisation enables us to find similar points in terms of the supervised learning problem and understand how the model classifies the data items.

Fig. 2

slisemap visualisation of 2s and 3s in the MNIST dataset with a black box deep learning classifier that tries to classify the digits into 2s and 3s. The left shows the embedding of the digits in two dimensions, with a random selection of digits shown as images. The white box models are logistic regressions that use the image pixels as attributes. The right shows the same embedding, but the images show the regression coefficients associated with each pixel for the same selection of digits. The colour intensity indicates the magnitude of the coefficient. We can see from the right image that nearby digits are described by similar white box models

The benefits of slisemap compared to prior manifold visualisation or explanation methods include the following: (i) slisemap finds visual patterns, like clusters, such that all data items within the same cluster are explained by the same simple model. For example, in Fig. 1, slisemap reveals three clusters, while Fig. 2 shows roughly four clusters of digits that can be separated by a given subset of pixels. (ii) Unlike existing local explanation methods, slisemap provides both global and local explanations of the data. For example, Fig. 2 compactly shows the explanations for all digits, in addition to the fact that roughly four linear models are sufficient to explain the classification of all digits to a reasonable fidelity. (iii) slisemap can be used to discover a nonlinear structure in a dataset, as shown in Fig. 2 and later in Sect. 4.4.

1.1 Contributions

The contributions of this paper are as follows: (i) We define a criterion for a supervised manifold embedding that shows local explanations and give an efficient algorithm to find such embeddings. (ii) We show experimentally that our method results in informative and useful visualisations, and local white box models can be used to explain and understand supervised learning models. (iii) We compare our contribution to manifold visualisation methods and local explanation methods.

2 Related work

This section briefly reviews the explainable AI and dimensionality reduction methods.

2.1 Explainable artificial intelligence

Explainable AI aims to provide insights into how black box (machine learning) models operate through various kinds of explanations. The explanations are used to analyse the black box models (Lapuschkin et al., 2019), when checking the results, e.g., looking for bias and systematic misclassifications (Selvaraju et al., 2020), and to discover issues and deficiencies in the data (Ribeiro et al., 2016). Furthermore, explanations can sometimes be a legal requirement (Goodman & Flaxman, 2017) or be used as a tool for facilitating AI and human collaboration (Samek et al., 2019).

The explanations of black box models can be generally divided into the exploration of global aspects, i.e., the entire model (Baehrens et al., 2010; Henelius et al., 2014, 2017; Adler et al., 2018; Datta et al., 2016), or inspection of local attributes, i.e., individual decisions (Ribeiro et al., 2016, 2018; Fong & Vedaldi, 2017; Lundberg & Lee, 2017); see Guidotti et al. (2019) for a recent survey and references. On a global level, the scope of the explanations is on understanding how the model has produced predictions, where the why is usually beyond human comprehension due to model complexity. On this level, we can examine which features affect the predictions most (Fisher et al., 2019) and what interactions there are between features (Goldstein et al., 2015; Henelius et al., 2014, 2017).

However, we are interested in local explanation methods, specifically those that can be used for any type of model (model-agnostic) and do not require any model modifications (post hoc). A common approach in this niche is to locally approximate the black box model with a simpler white box model. One of the first such methods, lime (Ribeiro et al., 2016), generates interpretations for user-defined areas of interest by perturbing the data and training a linear model based on the predictions. Another similar method is shap (Lundberg & Lee, 2017), which finds the weights based on Shapley value estimation (Shapley, 1951). Nonlinear white box models, such as decision rules (Guidotti et al., 2018; Ribeiro et al., 2018), can also be used.

Many of these methods generate local explanations based on perturbed data, but designing a good data generation process is nontrivial (Guidotti et al., 2019; Laugel et al., 2018; Molnar, 2019), e.g., replacing pixels in an image with random noise seldom results in natural-looking images. One method that only utilises existing data is called slise (Björklund et al., 2019, 2022), which finds the largest subset of data items that can be approximated (up to a given accuracy) by a sparse linear model. The work presented here can be seen as a global extension of slise.

2.2 Dimensionality reduction

Another way of assessing high-dimensional data is to reduce the number of covariates by, e.g., removing uninformative and redundant features or combining multiple features into single elements, thus making the data more interpretable. Dimensionality reduction has clear advantages: it removes correlated features in the data and allows for easier visualisation, e.g., in two dimensions. Still, combined features can also become less interpretable, and some information will inevitably be lost. The most straightforward dimensionality reduction techniques are methods operating on the whole dataset by keeping the most dominant features with, e.g., backward elimination and forward selection, or by finding a combination of new features.

These methods include principal component analysis (PCA) and other linear methods (Cunningham & Ghahramani, 2015). Other approaches include locally linear embedding (LLE, MLLE) (Roweis & Saul, 2000; Zhang & Wang, 2006), spectral embedding (Belkin & Niyogi, 2003) and multidimensional scaling (MDS) (Kruskal, 1964), global-distance preserving MDS (Mead, 1992), ISOMAP (Tenenbaum et al., 2000), t-SNE (van der Maaten & Hinton, 2008), and UMAP (McInnes et al., 2020). Recently, some supervised methods have also become available, based on t-SNE (Kang et al., 2021; Hajderanj et al., 2019) and UMAP (McInnes et al., 2018).

There are some recent developments toward combining dimensionality reduction with explainable AI. Anbtawi (2019) presents an interactive tool which embeds data with standard t-SNE, and the user can examine the explanations of individual data items created by lime. However, there are no interactions between t-SNE and lime. Meanwhile, Bibal et al. (2020) use lime to explain the t-SNE embedding, with no supervised learning method involved.

2.3 Local linear models

Local linear models, such as Nelles et al. (2000), estimate a response variable by fitting linear models to neighbourhoods of data items. Cheng and Wu (2013) improve the computational efficiency by using dimensionality reduction, after which they apply local linear models on the embedding. These methods use local models, similar to slisemap. However, they are regression methods and do not produce visualisations or explanations.

3 Definitions and algorithms

3.1 Problem definition

A dataset consists of n data points \((\mathbf{x}_1,\mathbf{y}_1),\ldots ,(\mathbf{x}_n,\mathbf{y}_n)\), where \(\mathbf{x}_i\in \mathcal{X}\) are the covariates and \(\mathbf{y}_i\in \mathcal{Y}\) are responses for one data point and \(i\in [n]=\{1,\ldots ,n\}\). \(\mathcal{X}\) and \(\mathcal{Y}\) are the domains of the covariates and responses, respectively. In this paper and in our software implementation, we restrict ourselves to real spaces, \(\mathcal{X}={\mathbb {R}}^m\) and \(\mathcal{Y}={\mathbb {R}}^p\), but the derivations in this subsection are general and would be valid, for example, for categorical variables as well.

The goal is to find a local white box model \(g_i:\mathcal{X}\rightarrow \mathcal{Y}\) for every data point \((\mathbf{x}_i, \mathbf{y}_i)\), where \(i\in [n]\). We use \(\tilde{\mathbf{y}}_{ij}=g_i(\mathbf{x}_j)\) to denote the estimate of \(\mathbf{y}_j\) obtained by a white box model associated with data point \(\mathbf{x}_i\). Again, while the derivation is general, in this paper, we focus on cases where the white box model, \(g_i\), is either a linear projection (for regression problems) or multinomial logistic regression (for classification problems), as defined later in Sect. 3.3.

If we have access to a trained black box supervised learning algorithm \(f:\mathcal{X}\rightarrow \mathcal{Y}\), then we can use estimates given by the model \(\hat{\mathbf{y}}_i=f(\mathbf{x}_i)\) instead of \(\mathbf{y}_i\). This will make the local models \(g_i\) local approximations of the black box model. These approximations can then also be used to explain the predictions of the black box model as in Björklund et al. (2019).

Additionally, we want to find a lower-dimensional embedding \(\mathbf{Z}_{i\cdot }\) for every data point \(i\in [n]\), where \(\mathbf{Z}_{i\cdot }\) denotes the ith row of matrix \(\mathbf{Z}\in \mathbb {R}^{n \times d}\). Our objective is that neighbouring data items in the embedding space have similar local models \(g_i\). Since we focus on visualisation, in our examples, \(\mathbf{Z}_{i\cdot }\) is typically 2-dimensional (\(d=2\)).

We denote by \(\mathbf{D}_{ij}\) the Euclidean distance between the points \(\mathbf{Z}_{i\cdot }\) and \(\mathbf{Z}_{j\cdot }\) in the embedding, where

$$\begin{aligned} \mathbf{D}_{ij}=\left( \sum \nolimits _{k=1}^d{\left( \mathbf{Z}_{ik}-\mathbf{Z}_{jk}\right) ^2}\right) ^{1/2}. \end{aligned}$$
(1)

We define the soft neighbourhood by using a softmax function as follows:

$$\begin{aligned} \mathbf{W}_{ij}=\frac{e^{-\mathbf{D}_{ij}}}{\sum \nolimits _{k=1}^n{e^{-\mathbf{D}_{ik}}}}. \end{aligned}$$
(2)

We define the radius of the d-dimensional embedding to be the square root of the variance of the embedding or

$$\begin{aligned} {\text {radius}}(\mathbf{Z})=\left( \frac{1}{n}\sum _{i=1}^{n}{\sum _{k=1}^{d}{\mathbf{Z}_{ik}^2}} \right) ^{1/2}. \end{aligned}$$
(3)
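Equations (1) to (3) translate almost directly into array code. The following is a minimal NumPy sketch; the function names `pairwise_distances`, `soft_neighbourhoods`, and `radius` are our own labels for illustration and are not taken from the slisemap implementation.

```python
import numpy as np

def pairwise_distances(Z):
    """Euclidean distances D_ij between the rows of Z, as in Eq. (1)."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def soft_neighbourhoods(D):
    """Row-wise softmax of -D, as in Eq. (2); every row sums to one."""
    E = np.exp(-D)
    return E / E.sum(axis=1, keepdims=True)

def radius(Z):
    """Square root of the mean squared row norm of Z, as in Eq. (3)."""
    return np.sqrt((Z ** 2).sum() / Z.shape[0])

# A tiny embedding with three points in two dimensions.
Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
D = pairwise_distances(Z)
W = soft_neighbourhoods(D)
print(D)
print(W)
print(radius(Z))
```

Because \(\mathbf{D}_{ii}=0\) is the smallest distance in each row, the diagonal entry is the largest weight in each row of \(\mathbf{W}\).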

We further define a loss function \(l:\mathcal{Y}\times \mathcal{Y}\rightarrow {\mathbb {R}_{\ge 0}}\) for the white box models. Here, we use the shorthand notation

$$\begin{aligned} \mathbf{L}_{ij}=l(\tilde{\mathbf{y}}_{ij},\mathbf{y}_j)=l(g_i(\mathbf{x}_j),\mathbf{y}_j). \end{aligned}$$
(4)

In this work, we use quadratic losses (for regression) and Hellinger distances between multinomial distributions (for classification), which we define later in Sect. 3.3.

The local white box model \(g_i\) can optionally have a regularisation term, which we denote by \(G_i\). Since slisemap consists of local models, regularisation can be important to handle small neighbourhoods. In this paper, we will use Lasso regularisation (Tibshirani, 1996) to be later defined in Eqs. (9) and (12) in Sect. 3.3.

Recall that the goal is for all points in the (soft) neighbourhood of point \(\mathbf{Z}_{i\cdot }\) to be modelled well by the local white box model \(g_i\). Mathematically, this can be formalised as minimising the following weighted loss:

$$\begin{aligned} \mathcal{L}_i=\sum \nolimits _{j=1}^n{\mathbf{W}_{ij}{} \mathbf{L}_{ij}}+G_i. \end{aligned}$$
(5)

Each local model \(g_i\) has its own set of weights \(\mathbf{W}_{i\cdot }\), of which \(\mathbf{W}_{ii}\) is the largest (due to \(\mathbf{D}_{ii} = 0\)). This is what makes the models local. If the embedding, and therefore \(\mathbf{W}\), is fixed, we can obtain the local models simply by minimising the loss of Eq. (5).
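For a fixed embedding, the minimisation of Eq. (5) has a closed form in one special case: a linear white box model with quadratic loss and no regularisation term (\(G_i=0\)). Each local model is then a weighted least squares fit. The sketch below illustrates only that special case; the function name `fit_local_models` and the demo data are assumptions for illustration, and it is not the optimiser used in the paper.

```python
import numpy as np

def fit_local_models(X, y, W):
    """For each i, minimise sum_j W_ij * (x_j^T b_i - y_j)^2 (no regulariser)."""
    n, m = X.shape
    B = np.empty((n, m))
    for i in range(n):
        sw = np.sqrt(W[i])[:, None]  # square-root weights turn WLS into OLS
        B[i], *_ = np.linalg.lstsq(sw * X, sw[:, 0] * y, rcond=None)
    return B

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
b_true = np.array([1.0, -2.0, 0.5])
y = X @ b_true                      # one global linear model, no noise

# Soft neighbourhoods from a random 2-dimensional embedding.
Z = rng.normal(size=(20, 2))
D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
W = np.exp(-D) / np.exp(-D).sum(axis=1, keepdims=True)

B = fit_local_models(X, y, W)
print(np.round(B[:3], 6))
```

When the data follow a single linear model, every local fit recovers the same coefficients regardless of the weights; local models differ only where the neighbourhoods see different behaviour.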

Our final loss function is obtained by summing all losses given by Eq. (5). We summarise everything in the main problem definition:

Problem 1

slisemap Given dataset \((\mathbf{x}_1,\mathbf{y}_1),\ldots ,(\mathbf{x}_n,\mathbf{y}_n)\), white box functions \(g_i\) and regularisation terms \(G_i\) for \(i\in [n]\), loss function l, and the desired radius of the embedding \(z_\mathrm{radius}>0\), find the parameters for \(g_1,\ldots ,g_n\) and embedding \(\mathbf{Z}\in {\mathbb {R}}^{n\times d}\) that minimise the loss given by

$$\begin{aligned} \mathcal{L}= \sum _{i=1}^n{\sum _{j=1}^n{\mathbf{W}_{ij}{} \mathbf{L}_{ij}}} +\sum _{i=1}^n{G_i}, \end{aligned}$$
(6)

where \(\mathbf{L}_{ij}=l(g_i(\mathbf{x}_j),\mathbf{y}_j)\), \(\mathbf{W}_{ij}=e^{-\mathbf{D}_{ij}}/\sum \nolimits _{k=1}^n{e^{-\mathbf{D}_{ik}}}\), and \(\mathbf{D}_{ij}=(\sum \nolimits _{k=1}^d{(\mathbf{Z}_{ik}-\mathbf{Z}_{jk})^2})^{1/2}\), with the constraint that \(\mathrm{{radius}}(\mathbf{Z}) = z_\mathrm{radius}\).

The loss function is invariant with respect to rotations of the embedding, which means that the embedding is determined only up to a rotation. The \(z_\mathrm{radius}\) parameter essentially fixes the sizes of the neighbourhoods. In the limit of small \(z_\mathrm{radius}\), all points will be compressed close to the origin, and hence, all points will be described by the same local model. On the other hand, if \(z_\mathrm{radius}\) is very large, the points are far away from each other, and the neighbourhood of each of the points consists only of the point itself.
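The two limits of \(z_\mathrm{radius}\) can be observed numerically. The following NumPy sketch rescales one random embedding to different radii and inspects the resulting soft neighbourhoods; the helper names, the seed, and the chosen radii are illustrative assumptions.

```python
import numpy as np

def soft_neighbourhoods(Z):
    """Softmax weights of Eq. (2) computed from the embedding Z."""
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
    E = np.exp(-D)
    return E / E.sum(axis=1, keepdims=True)

def rescale(Z, z_radius):
    """Scale Z so that radius(Z), as in Eq. (3), equals z_radius."""
    current = np.sqrt((Z ** 2).sum() / Z.shape[0])
    return Z * (z_radius / current)

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))

results = {}
for z_radius in (1e-3, 1.0, 1e4):
    W = soft_neighbourhoods(rescale(Z, z_radius))
    results[z_radius] = W
    # Smallest and largest self-weight W_ii for this radius.
    print(z_radius, W.diagonal().min(), W.diagonal().max())
```

For a tiny radius, every weight approaches \(1/n\), so all local models see the same data. For a huge radius, the self-weights \(\mathbf{W}_{ii}\) approach one, so each local model effectively sees only its own point.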

3.2 Adding new data points to an existing solution

Often, it is useful to add new data points to an existing embedding without recomputing the whole embedding. Here, we define an auxiliary problem to this end.

Assume that we have a new data point denoted by \((\mathbf{x}_{n+1},\mathbf{y}_{n+1})\). Define parameters for a new local model \(g_{n+1}\) and a new embedding matrix by \(\mathbf{Z}'\in {\mathbb {R}}^{(n+1)\times d}\), such that the first n rows are the solution to Problem 1. We formulate the problem of adding a new point to an existing slisemap solution as follows:

Problem 2

slisemap-new Given the definitions above and a new data point \((\mathbf{x}_{n+1},\mathbf{y}_{n+1})\), find the parameters for \(g_{n+1}\) and \(\mathbf{Z}_{n+1,\cdot }'\in {\mathbb {R}}^d\) such that the loss of Eq. (6) is minimised when \(g_{n+1}\) is added to the set of local models and \(\mathbf{Z}\) is replaced by \(\mathbf{Z}'\).

Solving Problem 2 is much easier than solving the full Problem 1 because in Problem 2, only the parameters for the new point need to be found, as opposed to the parameters for the n points in the full Problem 1. As a drawback, solving the full problem should result in a slightly smaller loss. However, the difference should asymptotically vanish in the limit of large n. We study this difference experimentally in Sect. 4.6.

3.3 Slisemap for regression and classification

While the definitions in Sect. 3.1 were general, in this paper, we focus on regression and classification problems where the covariates are given by m-dimensional real vectors, or \(\mathcal{X}={\mathbb {R}}^m\). We denote the data matrix by \(\mathbf{X}\in {\mathbb {R}}^{n\times m}\), where the rows correspond to the covariates or \(\mathbf{X}_{i\cdot }=\mathbf{x}_i\). If necessary, we include in the data matrix a column of ones to account for the intercept terms.

Regression In regression problems, we use linear regression as the white box model. More specifically, we assume that the dependent variables are real numbers or \(\mathcal{Y}={\mathbb {R}}\). The white box regression model is given by a linear function

$$\begin{aligned} g_R(\mathbf{x},\mathbf{b})=\mathbf{x}^T\mathbf{b}, \end{aligned}$$
(7)

where \(\mathbf{b}\in {\mathbb {R}}^m\), and the loss is quadratic,

$$\begin{aligned} l_R(\tilde{\mathbf{y}},\mathbf{y})=\left( \tilde{\mathbf{y}}-\mathbf{y}\right) ^2. \end{aligned}$$
(8)

The linear regression model \(g_R\) is parametrised by the vector \(\mathbf{b} \in \mathbb {R}^m\). If we gather the parameter vectors from all the local models in Problem 1 into one matrix \(\mathbf{B} \in \mathbb {R}^{n \times m}\) such that the row \(\mathbf{B}_{i\cdot }\) gives the parameter vector of the local model \(g_i\), then the parameters being optimised in Problem 1 are \(\mathbf{B}\) and \(\mathbf{Z}\).

We use Lasso regularisation, see Eq. (5), for any \(i\in [n]\) given by

$$\begin{aligned} G_i^R=\lambda \times \sum _{j=1}^m{\left| \mathbf{B}_{ij}\right| }, \end{aligned}$$
(9)

where \(\lambda\) is a parameter setting the strength of the regularisation. We can then write Eq. (6) to be optimised explicitly as \(\mathcal{L}_R(\mathbf{X},\mathbf{y},\mathbf{B},\mathbf{Z})\) with \(\mathbf{L}_{ij}=\left( (\mathbf{B}{} \mathbf{X}^T)_{ij}-\mathbf{y}_j\right) ^2\), since \((\mathbf{B}{} \mathbf{X}^T)_{ij}=\mathbf{x}_j^T\mathbf{B}_{i\cdot }=g_i(\mathbf{x}_j)\).
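Putting Eqs. (1), (2), (4), (6), (7), (8), and (9) together, the regression loss can be evaluated in a few lines. This is a minimal NumPy sketch rather than the PyTorch implementation; the function name `slisemap_regression_loss` is hypothetical. The matrix `Y_tilde` holds the predictions so that `Y_tilde[i, j]` equals \(g_i(\mathbf{x}_j)=\mathbf{x}_j^T\mathbf{B}_{i\cdot }\), following Eqs. (4) and (7).

```python
import numpy as np

def slisemap_regression_loss(X, y, B, Z, lam=1e-4):
    """Loss of Eq. (6) with linear models, quadratic loss and Lasso penalty."""
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))  # Eq. (1)
    E = np.exp(-D)
    W = E / E.sum(axis=1, keepdims=True)                               # Eq. (2)
    Y_tilde = B @ X.T                   # Y_tilde[i, j] = g_i(x_j), Eq. (7)
    L = (Y_tilde - y[None, :]) ** 2     # L[i, j], Eqs. (4) and (8)
    return (W * L).sum() + lam * np.abs(B).sum()                       # Eqs. (6), (9)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
b = np.array([1.0, -2.0, 0.0])
y = X @ b
B = np.tile(b, (10, 1))    # every local model equals the true model
Z = rng.normal(size=(10, 2))

loss = slisemap_regression_loss(X, y, B, Z)
print(loss)
```

With perfect local models, the weighted loss term vanishes and only the Lasso penalty \(\lambda \sum _{i,j}|\mathbf{B}_{ij}|\) remains, whatever the embedding is.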

Classification In classification problems, we assume that the black box classifier outputs class probabilities for p classes. We use multinomial logistic regression as the white box model. The dependent variables are multinomial probabilities in p-dimensional simplex or \(\mathcal{Y}=\{\mathbf{y}\in {\mathbb {R}}^p_{\ge 0}\mid \sum \nolimits _{i=1}^p{\mathbf{y}_i}=1\}\). Multinomial logistic regression can be parametrised by \(\mathbf{b} \in \mathbb {R}^{(p-1) m}\). The white box classification model is that of the multinomial logistic regression (Hastie et al., 2009),

$$\begin{aligned} \tilde{\mathbf{y}}_i= g_C(\mathbf{x},\mathbf{b})_i= {\left\{ \begin{array}{ll} \frac{\exp \left( {\mathbf{x}^T\mathbf{b}_{((i-1)m+1):(im)}}\right) }{1+\sum \nolimits _{j=1}^{p-1}{\exp \left( {\mathbf{x}^T\mathbf{b}_{((j-1)m+1):(jm)}}\right) }} &{} \text {if }i<p \\ \frac{1}{1+\sum \nolimits _{j=1}^{p-1}{\exp \left( {\mathbf{x}^T\mathbf{b}_{((j-1)m+1):(jm)}}\right) }} &{} \text {if }i=p \end{array}\right. }, \end{aligned}$$
(10)

We use \(\mathbf{b}_{a:b}\) to denote a \((b-a+1)\)-dimensional vector \((\mathbf{b}_a,\mathbf{b}_{a+1},\ldots ,\mathbf{b}_b)^T\). When using \(g_C\) as the white box model in Problem 1, we can express the parameters for all the local models using a matrix \(\mathbf{B} \in \mathbb {R}^{n \times (p-1)m}\), where the ith row \(\mathbf{B}_{i\cdot }\) corresponds to the parameter vector of the ith data point.
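The block indexing in Eq. (10) amounts to reshaping \(\mathbf{b}\) into a \((p-1)\times m\) matrix and treating class p as a reference class with a logit of zero. The sketch below is a minimal NumPy illustration under that reading; the function name `multinomial_logistic` is our own, and subtracting the maximum logit is a standard stabilisation trick that does not change the result.

```python
import numpy as np

def multinomial_logistic(x, b, p):
    """Class probabilities of Eq. (10) for covariates x (length m) and
    parameters b (length (p-1)*m); class p is the reference class."""
    m = x.shape[0]
    logits = b.reshape(p - 1, m) @ x      # x^T b_{((i-1)m+1):(im)} for i < p
    logits = np.append(logits, 0.0)       # reference class has logit 0
    logits = logits - logits.max()        # stabilise the exponentials
    probs = np.exp(logits)
    return probs / probs.sum()

x = np.array([0.5, -1.0, 2.0])
b3 = np.array([0.1, 0.2, 0.3, -0.3, 0.0, 0.4])   # p = 3, m = 3
print(multinomial_logistic(x, b3, p=3))

# With p = 2 the first probability is the usual logistic sigmoid.
b2 = np.array([0.1, 0.2, 0.3])
print(multinomial_logistic(x, b2, p=2)[0], 1.0 / (1.0 + np.exp(-x @ b2)))
```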

The loss function could be any distance measure between multinomial probabilities, such as Kullback–Leibler (KL) divergence. Here, however, we choose the more numerically stable squared Hellinger distance (Ali & Silvey, 1966; Liese & Vajda, 2006),

$$\begin{aligned} l_C(\tilde{\mathbf{y}},\mathbf{y}) = \frac{1}{2}\sum _{i=1}^p{\left( \sqrt{\tilde{\mathbf{y}}_i}-\sqrt{\mathbf{y}_i}\right) ^2} = 1-\sum _{i=1}^p{\sqrt{\tilde{\mathbf{y}}_i\mathbf{y}_i}}. \end{aligned}$$
(11)

The squared Hellinger distance is symmetric and bounded in the interval [0, 1], unlike the KL divergence, which is neither symmetric nor bounded from above. The squared Hellinger distance has convenient information-theoretic properties; for example, it is proportional to a tight lower bound for the KL divergence.
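The two forms of Eq. (11), and the symmetry and boundedness of the loss, can be verified with a short NumPy sketch. The function names and the example probability vectors are ours and serve only as an illustration.

```python
import numpy as np

def squared_hellinger(y_tilde, y):
    """First form of Eq. (11): half the squared distance of the square roots."""
    return 0.5 * np.sum((np.sqrt(y_tilde) - np.sqrt(y)) ** 2)

def squared_hellinger_alt(y_tilde, y):
    """Second form of Eq. (11), valid when both inputs sum to one."""
    return 1.0 - np.sum(np.sqrt(y_tilde * y))

q = np.array([0.2, 0.3, 0.5])
r = np.array([0.6, 0.3, 0.1])

print(squared_hellinger(q, r), squared_hellinger_alt(q, r))
print(squared_hellinger(q, q))                                  # identical inputs
print(squared_hellinger(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # disjoint support
```

The loss is zero for identical distributions and reaches its maximum of one when the distributions have disjoint support.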

Note that when there are only two classes (\(p=2\)), the multinomial logistic regression reduces to the standard logistic regression.

As in the regression formulation, we use Lasso regularisation for \(i\in [n]\) given by

$$\begin{aligned} G_i^C=\lambda \times \sum _{j=1}^{(p-1)m}{\left| \mathbf{B}_{ij}\right| }, \end{aligned}$$
(12)

where \(\lambda\) is a parameter setting the strength of the regularisation.

We can then write Eq. (6) to be explicitly optimised as \(\mathcal{L}_C(\mathbf{X},\mathbf{y},\mathbf{B},\mathbf{Z})\) with \(\mathbf{L}_{ij}=l_C(g_{Ci}(\mathbf{x}_j),\mathbf{y}_j)\) expressed by using the Hellinger loss \(l_C\) of Eq. (11) and multinomial logistic regression \(g_C\) of Eq. (10).

Alternative formulation for binary classification In case the targets are given by a black box model, we can also use an alternative formulation for binary classification (\(p=2\)). Here, we simply transform the probability \(\hat{y}_1\) with a logit function, \(\hat{y}_1' = \log (\hat{y}_1 / (1 - \hat{y}_1))\), from the interval [0, 1] to the interval \([-\infty ,\infty ]\) and then run slisemap for regression with quadratic loss, as above. Using a logit transformation followed by a linear model matches the behaviour of shap (Lundberg & Lee, 2017) and slise (Björklund et al., 2019).
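The logit transformation is a one-liner, but probabilities of exactly 0 or 1 map to infinite values. The sketch below clips the probabilities first; the clipping constant `eps` is an assumption added here for numerical safety and is not part of the formulation above.

```python
import numpy as np

def logit(prob, eps=1e-6):
    """Map probabilities to the real line; eps-clipping is our own safeguard."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return np.log(prob / (1.0 - prob))

def sigmoid(t):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + np.exp(-t))

y_hat = np.array([0.0, 0.1, 0.5, 0.9, 1.0])   # black box probabilities
targets = logit(y_hat)                         # regression targets
print(targets)
print(sigmoid(targets))
```

The transformed values can then be used as the response vector for the regression variant, and predictions of the local linear models can be mapped back to probabilities with the sigmoid.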

3.4 Algorithm

Pseudocode for slisemap is given in Algorithm 1. As the initial values for the embedding \(\mathbf{Z}\), we use the principal component projection of the data (PCA). Then, we optimise the values of \(\mathbf{B}\) and \(\mathbf{Z}\) by minimising the loss given by Eq. (6).

Algorithm 1

In our algorithm, we keep the radius of the embedding \(\mathbf{Z}\) constant by always dividing it by \({radius}(\mathbf{Z})\) during the optimisation. Due to this normalisation, the loss term \(\mathcal{L}()\) does not depend on the radius of \(\mathbf{Z}\). Thus, for numerical stability, we add a small penalty term \(({radius}(\mathbf{Z})-1)^2\) to the loss (line 8 of Algorithm 1).
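The reason for the penalty term can be seen in a small NumPy sketch: after normalisation, rescaling the raw embedding does not change the normalised embedding, so the scale of the raw parameters is a free direction unless something pins it down. The function names are hypothetical, and the sketch assumes a unit target radius, as suggested by the penalty term.

```python
import numpy as np

def radius(Z):
    """Radius of the embedding, as in Eq. (3)."""
    return np.sqrt((Z ** 2).sum() / Z.shape[0])

def normalised_embedding(Z):
    """Divide Z by its radius so that the result has unit radius."""
    return Z / radius(Z)

def radius_penalty(Z):
    """Penalty keeping the raw (unnormalised) parameters near unit radius."""
    return (radius(Z) - 1.0) ** 2

rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 2))

Z_a = normalised_embedding(Z)
Z_b = normalised_embedding(5.0 * Z)    # rescaled raw parameters
print(np.allclose(Z_a, Z_b))           # same normalised embedding
print(radius_penalty(Z_a), radius_penalty(5.0 * Z_a))
```

Any loss computed from the normalised embedding is identical for `Z` and `5.0 * Z`; only the penalty distinguishes them, which keeps the raw parameters from drifting in scale during optimisation.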

For the implementation of “\(\arg \min\)” in Algorithm 1, we use PyTorch (Paszke et al., 2019), which enables us to optionally take advantage of GPU acceleration. The optimisation of \(\mathbf{B}\) and \(\mathbf{Z}\) is performed using the L-BFGS (Nocedal, 1980) optimiser of PyTorch. As explained earlier, in this paper, we assume that the data are real valued and use the white box models and losses of Sect. 3.3 to study regression and classification problems.

In addition to the L-BFGS gradient search, we use an additional heuristic (function Escape in Algorithm 1) to help with escaping local optima. The heuristic consists of moving each item (embedding and local model) to the soft neighbourhood, given by \(\mathbf{W}\) in Eq. (2), that has the most suitable local models. This process is repeated until no further improvement is found. We empirically validate the advantage of using the escape heuristic in Appendix B.
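One possible reading of the escape heuristic, for linear models with quadratic loss, is sketched below in NumPy. For every item j it computes the loss of j averaged over the local models in each soft neighbourhood, and then copies the embedding and local model of the best neighbourhood. The function name `escape` and the toy data are assumptions, and the exact rule in Algorithm 1 may differ in its details.

```python
import numpy as np

def escape(X, y, B, Z):
    """Move each item to the soft neighbourhood whose models fit it best."""
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
    E = np.exp(-D)
    W = E / E.sum(axis=1, keepdims=True)
    L = (B @ X.T - y[None, :]) ** 2   # L[k, j]: loss of local model k on item j
    K = W @ L                         # K[i, j]: loss of item j averaged over neighbourhood i
    target = K.argmin(axis=0)         # best neighbourhood for every item
    return B[target].copy(), Z[target].copy()

# Two groups in the embedding with local models y = x and y = -x.
X = np.array([[1.0], [2.0], [3.0], [1.0], [2.0], [3.0]])
B = np.array([[1.0], [1.0], [1.0], [-1.0], [-1.0], [-1.0]])
Z = np.array([[0.0, 0.0]] * 3 + [[10.0, 0.0]] * 3)
# The last item follows y = x but is currently placed in the y = -x group.
y = np.array([1.0, 2.0, 3.0, -1.0, -2.0, 3.0])

B_new, Z_new = escape(X, y, B, Z)
print(Z_new)
print(B_new.ravel())
```

In this toy case, only the misplaced last item moves: it adopts the embedding position and the local model of the group that explains it.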

The pseudocode for Problem 2 (adding new data points to a slisemap solution) is also given in Algorithm 1 (function Slisemap-new). Here, we use the same escape heuristic to find a suitable neighbourhood as a starting point and then optimise the embedding and local model for the new data item(s) with PyTorch and L-BFGS.

The source code, published under an open source MIT license, as well as the code needed to replicate all of the experiments in this paper, is available via GitHub (Björklund et al., 2022b).

3.5 Computational complexity

Evaluating the loss function of Eq. (6) requires at least \(O(n^2m)\) operations for linear regression and \(O(n^2mp)\) for multinomial logistic regression, because for each of the n local models, the prediction and loss, which cost O(mp), must be calculated for each of the n data items. The calculation of the soft neighbourhoods requires \(O(n^2d)\) operations (from calculating the Euclidean distances), but \(d<mp\) in most circumstances.

However, this is an iterative algorithm, where Eq. (6) has to be evaluated multiple times. While it is difficult to provide strict running time limits for iterative optimisation algorithms such as L-BFGS (we study this experimentally in Sect. 4), it is obvious that the algorithm may not scale well to datasets with a very large number of data points n.

In practice, it is usually sufficient to subsample \(\min {(n,n_0)}\) data points, where \(n_0\) is a suitably chosen constant, optimise for the loss function (Problem 1), and then add the remaining points to the existing solution (Problem 2). By this procedure, the asymptotic complexity of slisemap is linear with respect to the number of data points n. Especially for visualisation purposes, it often makes no sense to compute an exact projection for a huge number of data points: visualisations cannot show more data points than there are pixels, so having an extremely accurate solution to the full optimisation problem instead of an approximate solution usually brings little additional benefit. Instead, finding a quick solution for sub-sampled data and adding the necessary number of data points to the embedding works well in practice, as shown in the experiments of Sect. 4.6.

4 Experiments

In the experiments, we usually embed the data into two dimensions (\(d=2\)) and normalise data attributes, columns of the data matrix \(\mathbf{X}\), to zero mean and unit variance as well as add an intercept term (column of ones) before running slisemap. Furthermore, unless otherwise mentioned, we subsample the large datasets into 1000 data items and run all experiments ten times.
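The preprocessing described above can be sketched as follows. The helper names `preprocess` and `subsample` are hypothetical, the guard for constant columns is an added safeguard, and subsampling before normalising is an assumption about the order of the steps.

```python
import numpy as np

def subsample(X, y, size=1000, seed=0):
    """Randomly keep at most `size` data items (without replacement)."""
    if X.shape[0] <= size:
        return X, y
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=size, replace=False)
    return X[idx], y[idx]

def preprocess(X):
    """Standardise columns to zero mean and unit variance, then add an intercept."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std > 0, std, 1.0)          # avoid dividing by zero
    X_std = (X - mean) / std
    ones = np.ones((X.shape[0], 1))            # intercept term (column of ones)
    return np.hstack([X_std, ones])

rng = np.random.default_rng(0)
X_raw = rng.normal(3.0, 2.0, size=(5000, 4))
y_raw = rng.normal(size=5000)

X_sub, y_sub = subsample(X_raw, y_raw)         # 1000 data items
X_pre = preprocess(X_sub)                      # 4 attributes + intercept
```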

Most datasets have been used in two scenarios, first as normal regression or classification using the definitions from Sect. 3.3, and second in an XAI-inspired scenario where the targets are predictions from black box models, using the alternative formulation from Sect. 3.3 in the case of classification. When the white box model is a linear regression, we use \(\lambda =10^{-4}\) as the regularisation coefficient and \(\lambda =10^{-2}\) for logistic regression. An overview of the datasets and black box models can be seen in Table 1.

As explained earlier, we used PyTorch version 1.11 (Paszke et al., 2019). The runtime experiments were run on a server having an AMD Epyc processor at 2.4 GHz with 4 cores and 16 GB of memory allocated and an NVIDIA Tesla V100 GPU. The code to run the experiments is available via GitHub (Björklund et al., 2022b).

4.1 Datasets

In this section, we describe the datasets used in the experiments. The datasets and the black box models are available from OpenML (https://openml.org) (Vanschoren et al., 2014). A quick summary can be seen in Table 1.

Table 1 An overview of the datasets and black box models used in the experiments

Synthetic data We create synthetic regression data (rsynth) as follows. The parameters are the dataset size n (number of data items) and m (number of data attributes), as well as \(k=3\) (number of clusters) and \(s=0.25\) (standard deviation of the clusters). For each cluster \(j \in [k]\), we first sample a coefficient vector \(\mathbf{\beta }_j\in {\mathbb {R}}^m\) from a normal distribution with zero mean and unit variance and a cluster centroid \(\mathbf{c}_j\in {\mathbb {R}}^m\) from a normal distribution with zero mean and standard deviation of s. We then create data items \(i\in [n]\) by sampling the cluster index \(j_i\in [k]\) uniformly and then generating a data vector \(\mathbf{x}_i\) by sampling from a normal distribution with a mean of \(\mathbf{c}_{j_i}\) and unit variance. The dependent variable is given by \(\mathbf{y}_i=\mathbf{x}_i^T\mathbf{\beta }_{j_i}+\epsilon _i\), where \(\epsilon _i\) is Gaussian noise with zero mean and standard deviation of 0.1.
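The generation procedure can be sketched in NumPy as follows. The function name `make_rsynth`, its argument names, and the seeding are illustrative choices and not necessarily those of the published code.

```python
import numpy as np

def make_rsynth(n, m, k=3, s=0.25, noise=0.1, seed=0):
    """Generate clustered regression data following the description above."""
    rng = np.random.default_rng(seed)
    betas = rng.normal(0.0, 1.0, size=(k, m))       # coefficient vector per cluster
    centroids = rng.normal(0.0, s, size=(k, m))     # centroid per cluster
    j = rng.integers(0, k, size=n)                  # uniform cluster index per item
    X = centroids[j] + rng.normal(0.0, 1.0, size=(n, m))
    # y_i = x_i^T beta_{j_i} + eps_i, computed row by row
    y = np.einsum("ij,ij->i", X, betas[j]) + rng.normal(0.0, noise, size=n)
    return X, y, j, betas

X, y, j, betas = make_rsynth(n=2000, m=5)
```

Returning the cluster indices `j` alongside the data keeps the ground truth available for metrics that need it.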

Air Quality data, cleaned and filtered as in Oikarinen et al. (2021), contains 7355 instances of 12 different air quality measurements, one of which is used as a dependent variable and the others as covariates.

Boston Housing is a dataset collected by the US Census Service from the Boston Standard Metropolitan Statistical Area in 1970. The dataset contains 506 items with 14 attributes, including the median value of owner-occupied homes, which is used as the dependent variable.

Spam (Cranor & LaMacchia, 1998) is a UCI dataset containing both spam, i.e., unsolicited commercial email, as well as professional and personal emails. There are 4601 instances with 57 attributes (mostly word frequencies) in the dataset.

Higgs (Baldi et al., 2014) is a UCI dataset containing 11 million simulated collision events for benchmarking classification algorithms. The dependent variable is whether a collision produces Higgs bosons. There are 28 attributes: the first 21 are kinematic properties measured by the particle detectors, and the last seven are functions of the first 21.

Covertype is a UCI dataset with over half a million instances, used to classify forest cover type (seven different types, but we only use the first two) from 54 attributes. The areas represent natural forests with minimal human-caused disturbances.

MNIST (Lecun et al., 1998) is the classic machine learning dataset of handwritten digits from 0 to 9. Each digit is represented by a 28 \(\times\) 28 greyscale image (784 pixels with integer pixel values between 0 and 255). Due to the large number of pixels, we create a binary classification task by limiting the available digits to 2 and 3 and subsample them to 5000 data items.

4.2 Metrics

To compare different slisemap solutions, we want to be able to objectively measure the performance. To accomplish that, we consider the following metrics.

Loss The most obvious thing to measure is the loss we are trying to minimise; see Eq. 6. However, the loss will change based on the parameters and the size of the dataset.

Cluster Purity For the synthetic dataset, we know the ground truth, which means that we can compare the original clusters to the embedding found by slisemap. If we denote the true cluster ids as \(c_1,\ldots ,c_n\), we can measure how well low-dimensional embeddings reconstruct the true clusters:

$$\begin{aligned} \frac{1}{n} \sum \nolimits _{i=1}^n |\text {k-NN}(i) \cap \{j \mid c_i = c_j \}| / k, \end{aligned}$$
(13)

where \(\text {k-NN}(i)\) is the set of k nearest neighbours (of item i) in the embedding space, using Euclidean distance, and \(j \in [n]\). A larger value (closer to one) indicates that the dimensionality reduction has found the true clusters.
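A minimal sketch of Eq. (13) is given below. Whether \(\text {k-NN}(i)\) contains the item i itself is not stated above; this sketch excludes it, which is an assumption, and the function names are hypothetical.

```python
import numpy as np

def knn_indices(Z, k):
    """Indices of the k nearest neighbours (Euclidean) of each row of Z.

    The item itself is excluded (an assumption of this sketch).
    """
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return np.argsort(D, axis=1)[:, :k]

def cluster_purity(Z, c, k):
    """Mean fraction of the k nearest neighbours sharing the item's true cluster."""
    nn = knn_indices(Z, k)               # (n, k)
    same = c[nn] == c[:, None]           # neighbour has the same cluster id
    return float(same.mean())            # equals (1/n) sum_i |...| / k

# Two well-separated groups in the embedding
Z = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [10.0, 10.0], [10.0, 10.1], [10.1, 10.0]])
c_true = np.array([0, 0, 0, 1, 1, 1])
c_mixed = np.array([0, 1, 0, 1, 0, 1])

purity_true = cluster_purity(Z, c_true, k=2)
purity_mixed = cluster_purity(Z, c_mixed, k=2)
```

With the matching labels every neighbour shares the cluster id, so the purity is one; with the interleaved labels it drops below one.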

Fidelity The fidelity of a local model (Guidotti et al., 2019) measures how well it can predict the correct outcome. Using the losses defined in Sect. 3.3, we obtain:

$$\begin{aligned} \frac{1}{n} \sum \nolimits _{i=1}^n l(g_i(\mathbf{{x}}_i), \mathbf{{y}}_i). \end{aligned}$$
(14)

We are interested not only in how the local models perform on the corresponding data items but also in how well they work for the neighbours in the embedding space, using, e.g., the k nearest neighbours:

$$\begin{aligned} \frac{1}{n} \sum \nolimits _{i=1}^n \frac{1}{k} \sum \nolimits _{j \in \text {k-NN}(i)} l(g_i(\mathbf{{x}}_j), \mathbf{{y}}_j). \end{aligned}$$
(15)

A smaller value indicates better fidelity.
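Both fidelity variants can be read off a matrix of pairwise losses, as in the sketch below. It assumes linear local models with a squared loss, stores the coefficients row-wise in a hypothetical matrix `B`, and excludes the item itself from its nearest neighbours; none of these choices is prescribed by the definitions above.

```python
import numpy as np

def loss_matrix(X, y, B):
    """L[i, j] = loss of local model i on data item j (squared loss assumed)."""
    return (B @ X.T - y[None, :]) ** 2

def fidelity(L):
    """Eq. (14): mean loss of each local model on its own data item."""
    return float(np.mean(np.diag(L)))

def knn_fidelity(L, Z, k):
    """Eq. (15): mean loss of each local model on its k nearest neighbours."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                    # exclude the item itself
    nn = np.argsort(D, axis=1)[:, :k]
    return float(np.take_along_axis(L, nn, axis=1).mean())

rng = np.random.default_rng(0)
n, m = 50, 3
X = rng.normal(size=(n, m))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta                                       # noiseless targets
B = np.tile(beta, (n, 1))                          # every local model is exact
Z = rng.normal(size=(n, 2))

L = loss_matrix(X, y, B)
fid = fidelity(L)
fid_knn = knn_fidelity(L, Z, k=5)
```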

Coverage We also want local models that generalise to other data points. Otherwise, it would be trivial to find solutions. The coverage (Guidotti et al., 2019) of a local model can be measured by counting the number of data items that have a loss less than a threshold \(l_0\):

$$\begin{aligned} \frac{1}{n} \sum \nolimits _{i=1}^n \frac{1}{n} \sum \nolimits _{j=1}^n (l(g_i(\mathbf{{x}}_j), \mathbf{{y}}_j) < l_0). \end{aligned}$$
(16)

This requires us to select the loss threshold \(l_0\). Unless otherwise mentioned, in this paper, we choose the threshold to be the 0.3 quantile of the losses of a global model (without the distance-based weights). Furthermore, we also want this behaviour to be reflected in the low-dimensional embedding. To verify this information, we limit the coverage testing to only the k nearest neighbours:

$$\begin{aligned} \frac{1}{n} \sum \nolimits _{i=1}^n \frac{1}{k} \sum \nolimits _{j \in \text {k-NN}(i)} (l(g_i(\mathbf{{x}}_j), \mathbf{{y}}_j) < l_0). \end{aligned}$$
(17)

A larger coverage value (closer to one) is better.
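The two coverage variants and the choice of threshold can be sketched as follows. The global model is fitted here with ordinary least squares via `np.linalg.lstsq`, which is a simplification (no regularisation), and the function names are hypothetical. As in the other sketches, the nearest neighbours exclude the item itself, which is an assumption.

```python
import numpy as np

def global_loss_threshold(X, y, q=0.3):
    """Threshold l_0: the q-quantile of the squared losses of a global linear model."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    losses = (X @ b - y) ** 2
    return float(np.quantile(losses, q)), b

def coverage(L, l0):
    """Eq. (16): fraction of (model, item) pairs with a loss below l_0."""
    return float((L < l0).mean())

def knn_coverage(L, Z, k, l0):
    """Eq. (17): the same fraction, restricted to the k nearest neighbours."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]
    return float((np.take_along_axis(L, nn, axis=1) < l0).mean())

rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=n)
Z = rng.normal(size=(n, 2))

l0, b_global = global_loss_threshold(X, y)
B = np.tile(b_global, (n, 1))                  # use the global model everywhere
L = (B @ X.T - y[None, :]) ** 2
cov = coverage(L, l0)
cov_knn = knn_coverage(L, Z, k=10, l0=l0)
```

When every local model equals the global model, the coverage is close to 0.3 by construction of the threshold, which gives a useful reference point when reading coverage values.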

4.3 Parameter selection

slisemap has one unusual parameter that needs to be selected: \(z_\mathrm{radius}\). If \(z_\mathrm{radius}\) is too small, then all data items are in the same cluster, resulting in underfitting local models that are almost identical to the global model. However, if \(z_\mathrm{radius}\) is too large, then the neighbourhoods become singular, which causes the local models to overfit.

Fig. 3

Fidelity of the local models versus the fraction of nearest neighbours (in the fidelity calculation) for different values of \(z_\mathrm{radius}\). Smaller fidelity is better, especially for the nearest neighbours. Here, \(3 \le z_\mathrm{radius} \le 4\) results in the best fidelity

In Fig. 3, we investigate how different values of \(z_\mathrm{radius}\) affect the fidelity of the local models. Unless the local model is underfitting, the fidelity for the corresponding data item should be close to zero. Then, as the number of nearest neighbours grows, the fidelity should stay as low as possible for as long as possible to avoid overfitting. Based on these results, \(z_\mathrm{radius}\) values from three to four seem to work well for all datasets.

We also consider how the coverage of the local models depends on the \(z_\mathrm{radius}\). The coverage plots can be seen in Appendix A and support the same conclusion as the fidelity results. Thus, we use \(z_\mathrm{radius} = 3.5\) as the default value for all the other experiments in this paper.

4.4 Visualisations of the datasets

While fidelity and coverage can be used for the quantitative analysis of \(z_\mathrm{radius}\), there is still room for a qualitative comparison to account for subjective preferences. In Fig. 4, we plot the low-dimensional embeddings for different \(z_\mathrm{radius}\) values. At small values of \(z_\mathrm{radius}\), all points converge to the same cluster, as expected. With large values, the points form smaller and smaller clusters, potentially leading to overfitting. Based on Fig. 4, \(z_\mathrm{radius}\) values between three and four seem optimal, which matches the conclusions from above.

Fig. 4

The low-dimensional embedding of the Boston dataset with different values for \(z_\mathrm{radius}\). Large values (bottom right) lead to sparse solutions with small clusters and potentially overfit local models. Small values of \(z_\mathrm{radius}\) (top left) create a single dense cluster in the centre, with all (non-)local models being almost identical

With slisemap, we obtain not only an embedding but also local models for the data items. Data items that are nearby in the embedding space should have similar local models. We can verify this by clustering the local models independently of the embeddings and comparing these clusters to the structure in the embedding. Furthermore, models far apart in the embedding should look different due to the different local weights.

Fig. 5

Clustering the local models (right) to see if they correspond to clusters in the embedding (left). This also shows that the clusters in the embedding (left) have distinct local models (right). The barplot (right) only shows the five most important attributes of the boston dataset

In Fig. 5, we cluster the coefficients of the local models using k-means clustering (on the boston dataset). Here, we see that the clusters in the local models clearly match clusters in the embedding and that the clusters have different local models. Looking at the local models, we notice something curious: the number of rooms (RM) has a positive coefficient for most of the data, but in one cluster it is negative. Investigating this cluster reveals that it represents industrial locations with a high property tax (see density plots in Appendix C), making large homes less desirable. This insight could, e.g., be used in city planning and construction decisions.

A plot of the MNIST data is shown in Fig. 2 in the introduction, where we can see that some local models focus heavily on the bottom curve of the 3s, while others compare the differences between the pixels in the centre and the pixels just below the centre.

4.5 Uniqueness

In slisemap, the embedding is influenced by the local models. Thus, if multiple local models are suitable for a particular data item, then the optimal embedding might be ambiguous. Some overlap between the local models is expected, and neighbouring (in the embedding) local models should be especially similar, due to the distance-based kernels in the loss function, Eq. 6. We also expect the hyperplanes of the local models to intersect, and any data items at these intersections will fit both models equally well.

In Fig. 6, we select seven data items from the boston dataset and plot scatterplots of the embedding, where the colour of each dot represents how suitable that local model is for the selected data item. We see that not all local models suit all data items, i.e., the local models are actually local. Furthermore, neighbouring points tend to have the most suitable local models, as expected. However, some data items fit well into multiple neighbourhoods.

Fig. 6

A slisemap embedding for the boston dataset. The embedding is plotted seven times with different data items selected. The points in the embedding are coloured based on how well the corresponding local model fits the selected data item. Some data items only fit local models that are nearby in the embedding (the same neighbourhood), while some data items are more general

These data items with multiple potential neighbourhoods make the solutions non-unique, since there are multiple local optima with almost equally good losses. However, as shown in Fig. 5, the local models in the different neighbourhoods are different, and this is important for the data items in Fig. 6 matching only a single neighbourhood.

4.6 Subset sampling

With large datasets, the quadratic scaling of slisemap, see Sect. 3.5, can become problematic. One solution is to run slisemap on a random subset of the data, and then, post hoc, add unseen data items whenever necessary (see Sect. 3.2). With larger subsets, we expect better results, but with diminishing returns after the dataset is sufficiently covered.

To investigate how much data are needed, we randomly select 1000 data items from the large datasets to be unseen test data and train slisemap solutions on increasing numbers of data items sampled from the remaining data. Then, we add the unseen data items using Slisemap-new from Algorithm 1. We repeat this process ten times for each dataset and compare the fidelity, Eq. (14), between the training data and the test data.

Fig. 7

Adding new data items to slisemap solutions trained on subsampled datasets. With a sufficiently large training dataset, the fidelity of unseen test data matches that of the training data. For most of these datasets, only a couple of hundred initial data items are required. Lower fidelity for the test data is better

The results can be seen in Fig. 7. If the training dataset is too small, then the local models tend to overfit, but for most datasets, only a couple of hundred data items are needed to stabilise the results. This also coincides with the fidelities of the unseen test data approaching the fidelities of the training data.

4.7 Higher dimensional embeddings

In most experiments discussed in this paper, we use a two-dimensional embedding (\(d=2\)). This is because a two-dimensional embedding is easy to visualise, which we consider to be an important use-case for the embedding. However, slisemap is not limited to only two dimensions, which we demonstrate in this section.

Fig. 8

Coverage of the local models versus the fraction of nearest neighbours (in the coverage calculation) for different values of \(z_\mathrm{radius}\) and different numbers of embedding dimensions d. As the threshold for the coverage, we use the 0.3 quantile of the losses from a global model. Larger coverage is better, especially for the nearest neighbours. Here, \(3 \le z_\mathrm{radius} \le 3.5\) results in the best coverage, even for higher dimensional embeddings. The full plot is available in Appendix D

Using the same fidelity and coverage metrics as in Sect. 4.3, we can find the best \(z_\mathrm{radius}\) value for higher dimensional embeddings. In Fig. 8 and Appendix D, we demonstrate that the same default parameter value of \(z_\mathrm{radius} = 3.5\), which works well for two-dimensional embeddings, is also suitable for higher dimensions.

Fig. 9

Comparing losses for different numbers of embedding dimensions d. With higher-dimensional embeddings, we expect either minor improvements to the loss due to more flexible distances between multiple clusters or that the loss stays roughly the same

With two-dimensional embeddings, the intercluster distances are only independent for up to three clusters. This means that we expect higher dimensional embeddings to produce slightly lower losses if there are more than three clusters. In Fig. 9, we compare the losses for different numbers of dimensions. For some datasets, we indeed see minor improvements in the loss with increasing dimensionality. However, for rsynth, for example, we know that there are only three clusters, so higher dimensional embeddings offer no advantage.

4.8 GPU acceleration

Since we implement slisemap using PyTorch, the calculations can be accelerated using a GPU. Running slisemap on a GPU should be faster than running on a CPU, especially for larger datasets. In Fig. 10, we apply slisemap on rsynth datasets with different sizes, both with and without GPU acceleration. The GPU implementation has some overhead, making it slower for small datasets (smaller than \(400 \times 10\)) but substantially faster for larger datasets.

Fig. 10

Runtimes for different dataset sizes (using the rsynth dataset). GPU acceleration (cuda) brings some overhead but offers better parallelisation on large datasets. Note the logarithmic scale of the axis

4.9 Comparison to dimensionality reduction methods

The feature that differentiates slisemap from other dimensionality reduction methods is that slisemap provides both a low-dimensional embedding and local models. To demonstrate that doing this optimisation simultaneously is necessary, we take the embeddings from other dimensionality reduction methods and fit local models post hoc (essentially running slisemap with a fixed \(\mathbf{Z}\) given by the dimensionality reduction methods).
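Fitting local models post hoc to a fixed embedding can be sketched as a set of weighted regressions. The sketch below makes several assumptions that are not taken from the text above: the kernel is a row-normalised exponential of the negative Euclidean distances, \(z_\mathrm{radius}\) enters by rescaling the centred embedding to a root-mean-square norm of `radius`, and an L2 penalty is used because it admits a closed-form solution. The function name `fit_local_models` is hypothetical.

```python
import numpy as np

def fit_local_models(X, y, Z, lam=1e-4, radius=3.5):
    """Weighted ridge regression per data item, given a fixed embedding Z."""
    n, m = X.shape
    Zc = Z - Z.mean(axis=0)
    Zc = Zc * radius / np.sqrt(np.mean(np.sum(Zc ** 2, axis=1)))  # assumed scaling
    D = np.linalg.norm(Zc[:, None, :] - Zc[None, :, :], axis=-1)
    W = np.exp(-D)
    W /= W.sum(axis=1, keepdims=True)           # soft neighbourhood of each item
    B = np.empty((n, m))
    for i in range(n):
        Xw = X * W[i][:, None]                  # rows of X scaled by the weights
        # Solve (X^T W_i X + lam I) b = X^T W_i y
        B[i] = np.linalg.solve(X.T @ Xw + lam * np.eye(m), Xw.T @ y)
    return B

# Two groups with different linear relationships, separated in the embedding
rng = np.random.default_rng(0)
n_half, m = 30, 3
X = rng.normal(size=(2 * n_half, m))
beta0 = np.array([1.0, -2.0, 0.5])
beta1 = np.array([-1.0, 1.0, 2.0])
labels = np.repeat([0, 1], n_half)
y = np.where(labels == 0, X @ beta0, X @ beta1)
Z = np.stack([np.where(labels == 0, -1.0, 1.0), np.zeros(2 * n_half)], axis=1)
Z = Z + 0.01 * rng.normal(size=Z.shape)

B = fit_local_models(X, y, Z)
```

In this toy example the embedding happens to separate the two groups, so the post hoc models recover the group coefficients. An embedding computed without regard to the targets gives no such guarantee, which is the point of the comparison.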

We use the following dimensionality reduction methods from the Scikit-learn package (Pedregosa et al., 2011) for the comparison: PCA, LLE (Roweis & Saul, 2000), MLLE (Zhang & Wang, 2006), MDS (Kruskal, 1964), ISOMAP (Tenenbaum et al., 2000), and t-SNE (van der Maaten, 2014). We also consider UMAP (McInnes et al., 2020).

A selection of the results can be seen in Table 2 (results for all datasets can be found in Appendix F). Since none of the other methods consider the relationship between \(\mathbf{X}\) and \(\mathbf{y}\) (most do not even use \(\mathbf{y}\)), their post hoc local models are, unsurprisingly, nonoptimal. However, the downside of using slisemap is the additional time required for convergence. An empirical scalability comparison of the dimensionality reduction methods can be seen in Appendix E.

Table 2 Comparing slisemap against other dimensionality reduction methods.

4.10 Comparison to local explanation methods

If we have access to a black box model, we can use slisemap to find local and interpretable approximations of that black box model. In this section, we investigate how good the approximations are by checking both how local and how general the local models are. We also compare against other model-agnostic, local explanation methods. Furthermore, slisemap finds all local models simultaneously, which could provide a speed benefit.

Of the local, model-agnostic, approximating explanation methods mentioned in Sect. 2, slisemap is most closely related to slise (Björklund et al., 2019). slise uses robust regression (Björklund et al., 2022) on data that have been centred on the selected data item to produce the local approximation. lime (Ribeiro et al., 2016) creates a neighbourhood of synthetic data by mutating the selected data item (and using the black box model to obtain predictions). To increase interpretability, lime normally discretises continuous variables into binary variables (e.g., based on quantiles). Then, lime fits a least squares linear model to the synthetic neighbourhood to form the local approximation. shap (Lundberg & Lee, 2017) tries to estimate the Shapley value of keeping a variable in the selected data item versus changing it. This is conceptually quite similar to the discretisation in lime. The model-agnostic variants of shap generally accomplish this by creating variants of the selected data item where some of the variables are sampled from the dataset. These Shapley values are then used as the local approximation.

Table 3 Comparison of the local white box models given by slisemap, slise, shap, lime, and lime with no discretisation

In addition to the methods outlined above, slisemap, slise, shap, and lime (with and without discretisation), we also consider a global model as a reference. The global models allow us to check that the local approximations are indeed local (better fidelity than the global model) and how general the approximations are (by comparing the coverage). As the threshold for measuring coverage as well as the error tolerance parameter in slise, we use the 0.3 quantile of the losses of the global model. The results can be seen in Table 3.

By definition, slise and shap have perfect fidelity for the data item corresponding to the local model, with slisemap not far behind. The global model is obviously not local and, thus, should have the worst fidelity. However, there is nothing in the lime procedure that ensures that the local approximation matches the selected data item. This results in the fidelity of lime being comparable to that of the global model.

One of the advantages of slise is specifically optimising the subset size, which results in outstanding coverage. The local models in slisemap are affected by the low-dimensional embedding. This reduced flexibility results in lower coverage than slise but better coverage than both lime and shap. Both shap and lime create synthetic neighbourhoods, which results in local models that are more difficult to generalise to real data items, reducing the coverage.

By computing all the local approximations at the same time, slisemap tends to be faster than the methods doing it one-by-one, the exception being lime with no discretisation. Furthermore, slisemap also finds a low-dimensional embedding that can be used to visualise and compare different data items, different local approximations, and how they relate to each other.

5 Conclusions

In this paper, we present a novel supervised manifold embedding method, slisemap, that embeds data items into a lower-dimensional space such that nearby data items are modelled by the same white box model. Therefore, in addition to reducing the dimensionality of the data, slisemap creates a visualisation that can be used to globally explore and explain black box classification and regression models.

We show that the state-of-the-art dimensionality reduction methods, unsurprisingly, cannot be used to explain classifiers or regression models. On the other hand, the state-of-the-art tools used to explain black box models typically only provide local explanations for single examples, whereas slisemap gives an overview of all local explanations.

Interesting future work would be to explore how slisemap visualisations can be used to better understand data, both with and without a black box model, and to help build better models. For example, a slisemap visualisation could show that some group of data items should be handled differently. Future work could also explore how to use slisemap to detect anomalous behaviours, such as outliers or concept drift. Finally, the scaling of slisemap could be improved by, e.g., using stochastic optimisation or prototypes.

The source code for slisemap, published under an open source MIT license, as well as the code needed to replicate all of the experiments in this paper, is available via GitHub (Björklund et al., 2022b).