1 Introduction

A key requirement for trustworthy artificial intelligence (AI) and machine learning (ML) is their transparency and explainability [1]. However, there seems to be no widely accepted measure for the explainability of an ML method [2, 3]. Informally, explaining a ML method amounts to communicating some information that is relevant to the understanding of its working principles [4]. In a supervised ML setting, such information might be obtained from (local) approximations of the hypothesis map learnt by ML methods [5].

Building on our recent work [6], we interpret the act of explaining as a communication problem with the goal of effectively conveying the operational principles of the ML method to the intended users. To quantify explainability, we propose entropy as a measure, which reflects the (lack of) uncertainty about the predictions delivered by the ML method.

A key challenge for explainable ML (XML) is the variation in the background knowledge of human end users [6, 7]. ML methods that are explainable for a domain expert might be opaque (“black-box”) for a lay user. For example, a deep neural network used for diagnosing skin cancer from images may be explained by quantifying the influence of individual pixels on the prediction, which could be visualized as a heat map [8, 9]. However, such an explanation may not be sufficient for a lay person without expertise in dermatology [9].

To ensure the user-specific (or personalized) explainability of ML, we introduce the concept of a user (feedback) signal. User signals are conceptually similar to labelled data in supervised ML. While labels represent quantities of interest associated with data points, user signals reflect how data points are comprehended by the human user. The main contribution of this paper is a XML method whose explainability is tailored to a specific user [1, 4].

Our approach is agnostic to the means of acquiring user signals, which could include social network behaviour, biophysical measurements, observation of facial expressions, or manually chosen interpretable features of data points. More generally, user signals might be obtained from interpretable representations of data points [5]. Section 4 considers hate speech detection in social media where data points represent short messages (“tweets”). Here, the user signal could be the presence of specific (key-) words that the user considers an indicator of hate speech [5].

This paper proposes explainable empirical risk minimization (EERM) as a novel XML method that can be applied to a wide range of models. We obtain different instances of EERM, such as explainable linear regression (see Sect. 3.1) and explainable decision tree (DT) classification (see Sect. 3.2). EERM requires a training set with known user signals which are used to estimate the subjective explainability of the hypothesis.

EERM learns a hypothesis whose predictions do not deviate too much across data points that are considered similar by a human user. In particular, if a user assigns identical user signals to data points, then predictions delivered by EERM should be close. This requirement is similar in spirit to the smoothness assumption of supervised ML, which requires similar predictions for data points with similar features [10].

EERM is an instance of structural risk minimization [11, Ch. 7] that uses the subjective explainability of a hypothesis as a regularization term [8, 12, 13]. Depending on the quality of the user signal, enforcing the subjective explainability of a learnt hypothesis might be beneficial or detrimental to the resulting prediction accuracy (see Sect. 3.1). If the user signal is correlated with the true label of a data point, our requirement for explainability helps to steer or regularize the learning task. Indeed, we might interpret the user feedback signal as a manifestation of domain expertise and, in turn, the explainability requirement as a means to incorporate domain expertise into a ML method.

1.1 Related work

Existing XML methods can be roughly divided into two main flavours: model-agnostic and intrinsically explainable (or interpretable) [2, 3, 14]. Model-agnostic methods construct explanations for any given ML method, while the latter category restricts the choice of ML models to “simple” models. This paper proposes EERM as a novel XML method that bridges these two flavours. EERM shares the flexibility of model-agnostic XML in allowing the use of arbitrary models. However, EERM does not construct explanations but learns a hypothesis that is intrinsically explainable to a specific user.

Model-agnostic methods can be combined with any ML method for which it is possible to efficiently compute predictions for data points. These methods do not require the details of the ML method, such as the optimization algorithms used for model training. Instead, they only need to be provided with the predictions for some probing points. These predictions are then used to construct a local approximation of the overall behaviour of the ML method [5, 6]. Perhaps the most basic example of a XML method from this category is the local approximation of a learnt hypothesis by a linear function [5]. Instead of constructing a local (linear) approximation of a ML method, EERM uses regularization to nudge ERM towards a more explainable hypothesis.

A second main category of XML methods is obtained by restricting the design choice for the ML model to a distinct set of intrinsically explainable (“simple”) models. Examples of intrinsically explainable models include linear models using few features and shallow DTs [15]. The explanation (or interpretation) of a trained linear model is typically constructed from the learned weights for the individual features. A large (in magnitude) weight indicates a high relevance of the corresponding feature for the resulting predictions. The prediction delivered by a DT might be explained by the path traversed from the root node to the decision node during the computation of the prediction [16].

In general, however, there is no widely accepted definition of which model is considered intrinsically explainable. Moreover, there is no consensus about how to measure the explainability of a “simple” model [3]. We close this gap by introducing a novel measure for the subjective explainability of a hypothesis. This measure is derived from the conditional entropy of the predictions obtained by applying the hypothesis to a random data point, given its user signal. Conditional entropy is closely related to the concept of mutual information [17], which has been used previously to quantify the effect of explanations [6, 18]. While the XML methods of [6, 18] construct explanations for a given ML method, we train a given model such that its predictions are intrinsically explainable (without the need for additional explanations) to a specific user.

1.2 Contributions

We next enumerate the main contributions of this paper.

  • We apply information-theoretic concepts to develop a novel measure for the subjective explainability of predictions delivered by a ML method.

  • Our main conceptual contribution is to identify the notion of subjective explainability with predictability: A hypothesis is considered explainable to a specific user if its predictions can be anticipated by the user. The extent of anticipation is measured by the conditional entropy of the predictions, conditioned on the user signal (see Sect. 2.2).

  • Our main methodological contribution is to propose the EERM principle, which is obtained from the ERM principle by adding subjective explainability as a regularizer (see Sect. 3).

  • We present an overall performance assessment measure \(E^{\star }\) and illustrate the usefulness of EERM via two numerical experiments on real datasets (see Sect. 4).

1.3 Notation

Throughout the paper, scalars are represented by normal lowercase letters, while vectors are denoted in bold. We represent (topological) spaces with calligraphic font and use the symbol \({\widehat{x}}\) to denote an estimate of a variable \(x\).

2 Problem setup

We consider a generic ML setup that involves data points characterized by a label (quantity of interest) \(y\in \mathcal {Y}\) and some features (attributes) \(\textbf{x}= \big (x_{1},\ldots ,x_{n} \big )^{T} \in \mathcal {X}= \mathbb {R}^{n}\) [11, 16, 19]. The XML method developed in Sect. 3 does not place any restrictions on the choice of feature space \(\mathcal {X}\) and label space \(\mathcal {Y}\). Each data point is also assigned a user signal \(u\in \mathcal {U}\), which we will use to construct measures for the subjective explainability of ML methods.

The proposed XML method (see Sect. 3) can be combined with different sources for user signals. User signals might be obtained from facial expressions or biophysical measurements [20]. Another example for a user signal is manually chosen features that are considered interpretable [5]. Section 4.2 studies hate speech detection in social media with data points being short messages. Here, the user signal is defined via the presence of certain keywords that are considered a strong indicator of hate speech.

Our key assumption is that a hypothesis is subjectively explainable for a user if it delivers similar predictions for data points with similar user signals. To summarize, we represent a data point by a triplet \(\big ( \textbf{x}, y, u\big )\) that consists of a feature vector \(\textbf{x}\), a label \(y\), and a user signal \(u\).

The goal of many important ML methods is to learn a hypothesis map [11]

$$\begin{aligned} h(\cdot ):\mathcal {X}\rightarrow \mathcal {Y}: \textbf{x}\mapsto \hat{y}=h(\textbf{x}) \end{aligned}$$
(1)

that is used to compute the predicted label \(\hat{y}=h(\textbf{x})\) solely from the features \(\textbf{x}\) of a data point. Given finite computational resources, a ML method can only use a subset of (computationally) feasible maps. We refer to this subset as the hypothesis space (model) \(\mathcal {H}\) of a ML method. The XML method presented in Sect. 3 allows for a wide range of choices for the model, including linear models, DTs, and artificial neural networks (ANNs) [16, 19, 21]. The main requirement on the model \(\mathcal {H}\) is merely that it allows for efficient training (optimization) algorithms; typical examples again include linear maps, DTs, and ANNs [21,22,23].

For a given data point with features \(\textbf{x}\) and label \(y\), we measure the quality of a hypothesis \(h\) using some loss function \(L\big ({(\textbf{x},y)},{h}\big )\). The quantity \(L\big ({(\textbf{x},y)},{h}\big )\) measures the error incurred by predicting the label \(y\) of a data point using the prediction \(\hat{y}= h(\textbf{x})\). Popular examples for loss functions are the squared error loss \(L\big ({(\textbf{x},y)},{h}\big ) = (h(\textbf{x}) - y)^{2}\) (for numeric labels \(y\in \mathbb {R}\)) or the logistic loss \(L\big ({(\textbf{x},y)},{h}\big ) = \log (1+\exp (-h(\textbf{x})y))\) (for label space \(\mathcal {Y}= \{-1,1\}\)).

Roughly speaking, we would like to learn a hypothesis \(h\) that incurs a small loss on any data point. To make this informal goal precise, we can use the notion of expected loss or risk

$$\begin{aligned} \overline{L}(h) :=\mathbb {E} \big \{ L\big ({(\textbf{x},y)},{h}\big ) \big \}. \end{aligned}$$
(2)

This definition tacitly uses an i.i.d. assumption where data points are interpreted as realizations of statistically independent random variables (RVs) with a common pdf \(p(\textbf{x},y,u)\) (which underlies the expectation in (2)). Ideally, we would like to learn a hypothesis \(\hat{h}\) with minimum risk

$$\begin{aligned} \mathbb {E} \big \{ L\big ({(\textbf{x},y)},{\hat{h}}\big ) \big \} = \min _{h\in \mathcal {H}} \mathbb {E} \big \{ L\big ({(\textbf{x},y)},{h}\big ) \big \}. \end{aligned}$$
(3)

The risk minimization principle (3) is impractical as we typically do not know the probability distribution \(p(\textbf{x},y)\) required for evaluating the risk (2).

Section 2.1 discusses how empirical risk minimization (ERM) is obtained by approximating the risk using an average loss over some training set [11, 16, 21].

In its basic form, ERM is prone to overfitting and requires some form of regularization [11, Ch. 6]. We regularize ERM by a measure for the subjective explainability of a hypothesis \(h(\textbf{x})\). Section 2.2 explains how this regularization term is obtained from the conditional entropy of the predictions \(h(\textbf{x})\), given the user signal \(u\). The resulting EERM principle is then discussed in Sect. 3.

2.1 Empirical risk minimization

ERM methods approximate the risk (2) by the average loss (or empirical risk)

$$\begin{aligned} {\widehat{L}}(h| \mathcal {D}) :=(1/m) \sum _{i=1}^{m} L\big ({\big (\textbf{x}^{(i)},y^{(i)} \big )},{h}\big ). \end{aligned}$$
(4)

The average loss \({\widehat{L}}(h| \mathcal {D})\) of the hypothesis \(h\) is measured on a set of labelled data points (the training set)

$$\begin{aligned} \mathcal {D}= \big \{ \big (\textbf{x}^{(1)},y^{(1)}, u^{(1)}\big ),\ldots ,\big (\textbf{x}^{(m)},y^{(m)}, u^{(m)} \big ) \big \}. \end{aligned}$$
(5)

The training set \(\mathcal {D}\) consists of data points characterized by features \(\textbf{x}^{(i)}\), label value \(y^{(i)}\), and user signal \(u^{(i)}\), for \(i=1,\ldots ,m\).
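
To make the loss functions and the empirical risk (4) concrete, the following minimal Python sketch (with purely illustrative toy data, not taken from the paper) evaluates the squared error and logistic losses and the average loss of a linear hypothesis over a small training set.

```python
import numpy as np

def squared_error_loss(y, y_hat):
    # squared error loss (h(x) - y)^2 for numeric labels y
    return (y_hat - y) ** 2

def logistic_loss(y, y_hat):
    # logistic loss log(1 + exp(-h(x) * y)) for labels y in {-1, +1}
    return np.log1p(np.exp(-y_hat * y))

def empirical_risk(w, X, y, loss=squared_error_loss):
    # average loss (4) of the linear hypothesis h(x) = w^T x over the training set
    return np.mean(loss(y, X @ w))

# toy training set D with m = 5 data points and n = 2 features
X = np.array([[1.0, 0.5], [0.2, 1.3], [0.7, 0.1], [1.5, 2.0], [0.3, 0.9]])
y = np.array([1.2, 1.1, 0.6, 3.1, 1.0])
w = np.array([1.0, 0.5])
print(empirical_risk(w, X, y))
```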

A plethora of ML methods are based on solving the ERM problem

$$\begin{aligned} \hat{h} \in \mathop {\rm{argmin}}\limits _{h\in \mathcal {H}}{\widehat{L}}(h| \mathcal {D}) \end{aligned}$$
(6)

using different choices for the model \(\mathcal {H}\), training data \(\mathcal {D}\) and loss \(L\). However, a direct implementation of ERM (6) is prone to overfitting if the hypothesis space \(\mathcal {H}\) is too large compared to the size \(m\) of the training set. Two examples of such a high-dimensional regime are linear regression with a large number of features and artificial neural networks (“deep learning”) using a large number of trainable parameters.

One of the most widely used techniques to avoid overfitting in this high-dimensional regime is regularization [24, 25]. There are many different regularization techniques, such as data augmentation or early stopping [11, 21]. In what follows, we regularize ERM by adding a regularization (or penalty) term \(\lambda \mathcal {R}(h)\) to the empirical risk in (6),

$$\begin{aligned} h^{(\lambda )} \in \mathop {\rm{argmin}}\limits _{h\in \mathcal {H}} {\widehat{L}}(h| \mathcal {D})+ \lambda \mathcal {R}(h). \end{aligned}$$
(7)

The choice of the regularization parameter \(\lambda \!\ge \!0\) in (7) can be guided either by probabilistic models for the data [11, Ch. 7] or validation techniques [16]. For linear models \(h(\textbf{x}) = \textbf{w}^{T} \textbf{x}\), two popular choices for the regularizer are \(\mathcal {R}(h) = \Vert \textbf{w}\Vert ^{2}_{2}\) (“ridge regression”) and \(\mathcal {R}(h) = \Vert \textbf{w}\Vert _{1}\) (“Lasso”) [11, 23].
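
For linear models, the regularized ERM problem (7) with the ridge or Lasso penalty can be solved with off-the-shelf tools. The sketch below uses scikit-learn (also used in the experiments of Sect. 4); note that scikit-learn's `alpha` plays the role of \(\lambda\), up to the \(1/m\) normalization of the empirical risk.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# toy training set
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# regularized ERM (7) for a linear model h(x) = w^T x with
# R(h) = ||w||_2^2 ("ridge regression") or R(h) = ||w||_1 ("Lasso")
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_, lasso.coef_)
```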

A dual form of regularized ERM (7) is obtained by replacing the regularization term with a constraint,

$$\begin{aligned} h^{(\eta )} \in \mathop {\rm{argmin}}\limits _{h\in \mathcal {H}} {\widehat{L}}(h| \mathcal {D}) \text{ such } \text{ that } \mathcal {R}(h)\le \eta . \end{aligned}$$
(8)

The solutions of (8) coincide with those of (7) for an appropriate choice of \(\eta\) [26]. Solving the primal formulation (7) might be computationally more convenient as it is an unconstrained optimization problem in contrast to the dual formulation (8) [27]. On the other hand, the dual form (8) allows explicitly specifying an upper bound \(\eta\) on the value \(\mathcal {R}(h^{(\eta )})\) for the learned hypothesis \(h^{(\eta )}\).

Regularization techniques are typically used to improve the statistical performance (risk) of the learned hypothesis. Instead, we use regularization to ensure the subjective explainability of the learned hypothesis. In particular, we use a regularization term that is not primarily meant to estimate the generalization error \(\mathbb {E} \big \{ L\big ({(\textbf{x},y)},{{\widehat{h}}}\big ) \big \}- {\widehat{L}}({\widehat{h}}| \mathcal {D})\), but to measure the subjective explainability of the predictions \(\hat{y}={\widehat{h}}(\textbf{x})\). The regularization parameter \(\lambda\) in (7) (or \(\eta\) in the dual formulation (8)) adjusts the level of subjective explainability of the learned hypothesis \(\hat{h}\). Larger values of \(\lambda\) (smaller values of \(\eta\)) favour a hypothesis with high explainability at the cost of incurring a higher risk. Section 3.1 studies the trade-off between subjective explainability and risk in linear regression using a simple probabilistic model for data.

2.2 Subjective explainability

There seems to be no widely accepted formal definition for the explainability (or interpretability) of a ML method. Some authors refer to ML methods as intrinsically interpretable if they use specific design choices for the model [4, 6, 14]. We believe that a useful concept of interpretability can only be subjective, i.e. depending on the specific human user of a ML method. Indeed, while linear regression methods might be considered interpretable for a user with formal training in statistics, the predictions obtained by applying a linear hypothesis to a huge number of features might be difficult to grasp for a layman.

The key idea of this paper is to construct a measure for the subjective explainability of a hypothesis \(h\in \mathcal {H}\) via the user signal \(u\) associated with each data point. We consider a hypothesis subjectively explainable if it delivers similar predictions \(\hat{y} = h(\textbf{x})\) for data points having similar user signals. Informally, the hypothesis is subjectively explainable to a user if

$$\begin{aligned} h\big (\textbf{x}^{(1)}\big ) \approx h\big (\textbf{x}^{(2)}\big ) \text{ for } \text{ data } \text{ points } \text{ with } u^{(1)} \approx u^{(2)}. \end{aligned}$$
(9)

Similar to [18], we use information-theoretic concepts to make the informal notion (9) of subjective explainability precise. This approach interprets data points as realizations of i.i.d. random variables. In particular, the features \(\textbf{x}\), label \(y\), and user signal \(u\) associated with a data point are realizations drawn from a joint probability density function (pdf) \(p(\textbf{x},y,u)\). In general, the joint pdf \(p(\textbf{x},y,u)\) is unknown and needs to be estimated from data using, e.g. maximum likelihood methods [19, 23].

Note that since we model the features of a data point as the realization of a RV, the prediction \(\hat{y} = h(\textbf{x})\) also becomes the realization of a RV. Figure 1 summarizes the overall probabilistic model for data points, the user signal, and the predictions delivered by (the hypothesis learned with) a ML method.

We measure the subjective explainability of the predictions \(\hat{y}\) delivered by a hypothesis \(h\) for a data point \(\big (\textbf{x},y,u\big )\) as

$$\begin{aligned} E(h|u) :=C- H(h|u). \end{aligned}$$
(10)

Here, we used the conditional (differential) entropy \(H( h |u)\) (see Ch. 2 and Ch. 8 of [17])

$$\begin{aligned} H(h|u)&:=- \mathbb {E} \bigg \{ \log p(\underbrace{ h(\textbf{x}) }_{=\hat{y}}|u) \bigg \}. \end{aligned}$$
(11)

We introduce the (“calibration”) constant C in (10) for notational convenience. The actual value of C is irrelevant for our approach (see Sect. 3) and serves only to ensure the convention that the subjective explainability \(E(h|u)\) is non-negative.

For regression problems, the predicted label \(\hat{y}\) might be modelled as a continuous random variable. In this case, the quantity \(H(\hat{y}|u)\) is a conditional differential entropy. With a slight abuse of notation, we refer to \(H(\hat{y}|u)\) simply as a conditional entropy and do not explicitly distinguish between the continuous case and the case where \(\hat{y}\) is discrete, as in the problems studied in Sects. 3.1, 3.2 and 4.

The conditional entropy \(H(h|u)\) in (10) quantifies the uncertainty (of a user that assigns the value \(u\) to a data point) about the prediction \(\hat{y} = h(\textbf{x})\) delivered by the hypothesis \(h\). Smaller values \(H(h|u)\) correspond to smaller levels of subjective uncertainty about the predictions \(\hat{y} = h(\textbf{x})\) for a data point with known user signal \(u\). This, in turn, corresponds to a larger value \(E(h|u)\) of subjective explainability.

Section 4 discusses explainable methods for detecting hate speech or the use of offensive language. A data point represents a short text message (a tweet). Here, the user signal \(u\) could be the presence of specific keywords that are considered a strong indicator of hate speech or offensive language. These keywords might be provided by the user via answering a survey or they might be determined by computing word histograms on public datasets that have been manually labelled [28].
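
When both the predictions \(\hat{y}\) and the user signal \(u\) take values in finite sets, as in such classification settings, the conditional entropy (11) can be estimated with a simple plug-in estimator based on empirical frequencies. The sketch below is a generic illustration only and is not the specific estimator used in Sects. 3.1 and 3.2.

```python
import numpy as np
from collections import Counter

def conditional_entropy(y_hat, u):
    """Plug-in estimate of H(y_hat | u) in bits, for discrete y_hat and u."""
    m = len(u)
    H = 0.0
    for u_val, m_u in Counter(u).items():
        # empirical distribution of predictions among data points with user signal u_val
        counts = np.array(list(Counter(y_hat[u == u_val]).values()), dtype=float)
        p = counts / m_u
        H -= (m_u / m) * np.sum(p * np.log2(p))
    return H

# toy example: a user who assigns u = 1 to "suspicious" data points
u = np.array([0, 0, 1, 1, 1, 0])
y_hat = np.array([1, 1, 0, 1, 0, 1])
print(conditional_entropy(y_hat, u))
```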

Fig. 1

The features \(\textbf{x}\), label \(y\) and user signal \(u\) of a data point are realizations drawn from a pdf \(p(\textbf{x},y,u)\). Our goal is to learn a hypothesis h such that its predictions \(\hat{y}\) have a small conditional entropy given the user signal \(u\)

3 Explainable empirical risk minimization

Section 2 has introduced all the components of EERM as a novel principle for XML. EERM learns a hypothesis h by using an estimate \({\widehat{H}}(h|u)\) for the conditional entropy in (10) as the regularization term \(\mathcal {R}(h)\) in (7),

$$\begin{aligned} h^{(\lambda )} \!:=\! \mathop {\rm{argmin}}\limits _{h \in \mathcal {H}} {\widehat{L}}(h| \mathcal {D})+ \lambda \underbrace{ {\widehat{H}}(h|u)}_{= \mathcal {R}(h)}. \end{aligned}$$
(12)

A dual form of (12) is obtained by specializing (8),

$$\begin{aligned} h^{(\eta )} \!:=\! \mathop {\rm{argmin}}\limits _{h\in \mathcal {H}} {\widehat{L}}( h| \mathcal {D}) \text{ such } \text{ that } {\widehat{H}}(h|u) \le \eta . \end{aligned}$$
(13)

The empirical risk \({\widehat{L}}(h| \mathcal {D})\) and the regularizer \({\widehat{H}}(h|u)\) are computed solely from the available training set (5). We will discuss specific choices for the estimator \({\widehat{H}}(h|u)\) in Sects. 3.1 and 3.2.

The idea of EERM is that the solution of (12) (or (13)) is a hypothesis that balances the requirement of a small loss (accuracy) with a sufficient level of subjective explainability \(E(h|u) \big (= C - H(h|u)\big )\). This balance is steered by the parameter \(\lambda\) in (12) and \(\eta\) in (13), respectively. Figure 2 illustrates the parametrized solutions of (12) in the plane spanned by risk and subjective explainability. The precise location of this curve depends on the training set (assumed to consist of i.i.d. data points) and the estimator \({\widehat{H}}\) of the conditional entropy \(H(h|u)\) in (10).

Choosing a large value for \(\lambda\) in (12) (a small value for \(\eta\) in (13)) penalizes any hypothesis resulting in a large estimate \({\widehat{H}}(h|u)\) for the conditional entropy \(H(h|u)\). Assuming \({\widehat{H}}(h|u) \approx H(h|u)\), using a large \(\lambda\) in (12) (small \(\eta\) in (13)) enforces a high subjective explainability (10) of the learned hypothesis \(h^{(\lambda )}\). Asymptotically (for \(\lambda \rightarrow \infty\)), the solutions \(h^{(\lambda )}\) of (12) maximize the subjective explainability \(E(h|u)\) at the cost of an increased risk.

For the specific choice \(\lambda =0\), EERM (12) reduces to plain ERM, which delivers a hypothesis \(h^{(\lambda =0)}\) with risk \(\overline{L}_{\rm{min}}\). This special case of EERM is obtained from the dual form (13) using a sufficiently large \(\eta\). The small risk of \(h^{(\lambda =0)}\) comes at the cost of a relatively small subjective explainability \(E(h^{(\lambda =0)}|u)\). We choose the constant C in (10) such that \(E(h^{(\lambda =0)}|u)=0\) for notational convenience.

Fig. 2

The solutions of EERM, either in the primal (12) or dual (13) form, trace out a curve in the plane spanned by the risk \(\overline{L}(h)\) and the subjective explainability \(E(h|u)\)

We emphasize that we do not assume to have any control over the choice of user signals. Much like the process of determining the features of a data point, the construction of user signals is beyond the scope of this paper. In particular, we do not have any control over the correlation between user signals and the labels of data points. If the user signal is nearly uncorrelated with the label, requiring subjective explainability will typically result in a hypothesis with a large risk. On the other hand, if the user signal is strongly correlated with the label of a data point (e.g. when the user is a domain expert), then EERM (12) might learn a hypothesis with a smaller risk compared to the hypothesis learnt by plain ERM (6).

3.1 Explainable linear regression

We now specialize EERM in its primal form (12) to linear regression [19, 23]. Linear regression methods learn the parameters \(\textbf{w}\) of a linear hypothesis \(h^{(\textbf{w})}(\textbf{x}) = \textbf{w}^{T} \textbf{x}\) that minimizes the squared error of the resulting predictions. The features \(\textbf{x}\) and user signal \(u\) of a data point are modelled as realizations of jointly Gaussian random variables with mean zero and covariance matrix \(\textbf{C}\),

$$\begin{aligned} \big (\textbf{x}^{T},u\big )^{T} \sim \mathcal {N}(\textbf{0},\textbf{C}). \end{aligned}$$
(14)

Note that (14) only specifies a marginal of the joint pdf \(p(\textbf{x},y,u)\) (see Fig. 1). Using the probabilistic model (14), we obtain, up to an additive constant that is irrelevant for our approach (see [17]),

$$\begin{aligned} H(h|u)&= (1/2) \log \sigma ^{2}_{\hat{y}|u}. \end{aligned}$$
(15)

Here, we use the conditional variance \(\sigma ^{2}_{\hat{y}|u}\) of the predicted label \(\hat{y} = h(\textbf{x})\) given the user signal \(u\) for a data point.

To develop an estimator \({\widehat{H}}(h|u)\) for (15), we use the identity [29, Sec. 4.6.]

$$\begin{aligned} \sigma ^{2}_{\hat{y}|u} = \min _{\alpha \in \mathbb {R}} \mathbb {E} \big \{ \big (h(\textbf{x}) - \alpha u\big )^{2}\big \}. \end{aligned}$$
(16)

The identity (16) relates the conditional variance \(\sigma ^{2}_{\hat{y}|u}\) to the minimum mean squared error that can be achieved by estimating \(\hat{y}\) using a linear estimator \(\alpha u\) with some \(\alpha \in \mathbb {R}\). We obtain an estimator for the conditional variance \(\sigma ^{2}_{\hat{y}|u}\) by replacing the expectation in (16) with a sample average over the training set \(\mathcal {D}\) (5),

$$\begin{aligned} \hat{\sigma }^{2}(\hat{y}|u) :=\min _{\alpha \in \mathbb {R}} (1/m) \sum _{i=1}^{m} \big ( \textbf{w}^{T} \textbf{x}^{(i)} - \alpha u^{(i)}\big )^{2}. \end{aligned}$$
(17)

It seems reasonable to estimate the conditional entropy \(H(h^{(\textbf{w})}|u)\) by plugging the estimated conditional variance (17) into (15), yielding the plug-in estimator \((1/2) \log \hat{\sigma }^{2}(\hat{y}|u)\). However, in view of the duality between (8) and (13), applying a monotonically increasing function to a given entropy estimator essentially amounts to a reparametrization \(\lambda \mapsto \lambda '\) and \(\eta \mapsto \eta '\). Since such a reparametrization is irrelevant as we choose \(\lambda\) in a data-driven fashion, we will use the estimated conditional variance (17) itself as an estimator

$$\begin{aligned} {\widehat{H}}(h^{(\textbf{w})}|u) :=\min _{\alpha \in \mathbb {R}} (1/m) \sum _{i=1}^{m} \big ( \textbf{w}^{T} \textbf{x}^{(i)} - \alpha u^{(i)}\big )^{2}. \end{aligned}$$
(18)

Note that we neither require the estimator (18) to be consistent nor to be unbiased [30]. Our main requirement is that, with high probability, the estimator (18) varies monotonically with the conditional entropy \(H(h^{(\textbf{w})}|u)\).

Inserting the estimator (18) into EERM (12) yields Algorithm 1 as an instance of EERM for linear regression in primal form. Algorithm 1 requires as input a choice for the regularization parameter \(\lambda > 0\) and a training set \(\mathcal {D}= \big \{ \big ( \textbf{x}^{(1)},y^{(1)},u^{(1)} \big ),\ldots ,\big ( \textbf{x}^{(m)},y^{(m)},u^{(m)} \big ) \big \}\). Algorithm 1 delivers a hypothesis \(h^{(\lambda )}\) that compromises between small risk \(\mathbb {E} \big \{ L\big ({(\textbf{x},y)},{h}\big ) \big \}\) and subjective explainability \(E(h|u)\). This compromise is controlled by the value of \(\lambda\).

Choosing a large \(\lambda\) for Algorithm 1 favours a hypothesis \(h^{(\lambda )}\) with small conditional entropy \(H(h^{(\lambda )}|u)\) and, in turn, high subjective explainability \(E(h^{(\lambda )}|u)\) (see (10)). On the contrary, choosing a small \(\lambda\) puts more emphasis on obtaining a small risk \(\mathbb {E} \big \{ L\big ({(\textbf{x},y)},{h^{(\lambda )}}\big ) \big \}\) at the expense of increased conditional entropy \(H(h^{(\lambda )}|u)\) and, in turn, reduced subjective explainability \(E(h^{(\lambda )}|u)\).

Algorithm 1

Explainable Linear Regression in primal form
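
Algorithm 1 is available only as a figure in the source; the following Python sketch (our own illustration, not the authors' reference implementation) solves the EERM objective (12) with the estimator (18) for a linear model. Since the objective is jointly quadratic in \((\textbf{w}, \alpha )\), the nested minimization can be carried out as a single linear least-squares problem.

```python
import numpy as np

def explainable_linear_regression_primal(X, y, u, lam):
    """Sketch of EERM (12) for a linear model, using the estimator (18).

    Solves  min_{w, alpha}  (1/m) ||X w - y||^2  +  (lam/m) ||X w - alpha * u||^2,
    which is jointly quadratic in (w, alpha) and hence a linear least-squares problem.
    """
    m, n = X.shape
    A = np.hstack([X, np.zeros((m, 1))])       # empirical-risk block, acting on z = (w, alpha)
    B = np.hstack([X, -u.reshape(-1, 1)])      # explainability regularizer (18)
    M = np.vstack([A, np.sqrt(lam) * B])
    b = np.concatenate([y, np.zeros(m)])
    z, *_ = np.linalg.lstsq(M, b, rcond=None)  # least-squares solution of the stacked system
    return z[:n], z[n]                         # weights w and auxiliary variable alpha

# toy data: the user signal u is a noisy copy of the label
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
u = y + 0.5 * rng.normal(size=50)
w, alpha = explainable_linear_regression_primal(X, y, u, lam=1.0)
print(w, alpha)
```

For large feature dimensions, the same stacked system could also be solved iteratively (e.g. by gradient descent) instead of via a direct least-squares solver.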

Trade-Off Between Subjective Explainability and Risk. Let us now study the fundamental trade-off between subjective explainability \(E(h|u)\) and risk of a linear hypothesis for data points characterized by a single feature \(x\). We consider data points \((x,y,u)^{T}\), characterized by a single feature \(x\in \mathbb {R}\), numeric label \(y\in \mathbb {R}\) and user feedback \(u\in \mathbb {R}\), as i.i.d. realizations of a Gaussian random vector

$$\begin{aligned} \big (x,y,u\big )^{T} \sim \mathcal {N}\big ( \varvec{\mu }, \textbf{C} \big ) \text{ with } \varvec{\mu } = \begin{pmatrix} \mu _{x} \\ \mu _{y} \\ \mu _{u} \end{pmatrix}, \textbf{C} = \begin{pmatrix} \sigma _{x}^{2} & \sigma _{x,y} & \sigma _{x,u} \\ \sigma _{y,x} & \sigma _{y}^{2} & \sigma _{y,u} \\ \sigma _{u,x} & \sigma _{u,y} & \sigma ^2_{u} \end{pmatrix}. \end{aligned}$$
(20)

Our goal is to learn a linear hypothesis \(h(\textbf{x}) = \textbf{x}^{T} \textbf{w}\) which is parametrized by a weight vector \(\textbf{w}= \begin{pmatrix} w_{1}&w_{0} \end{pmatrix}^{T}\), where \(\textbf{x}= \begin{pmatrix} x&1 \end{pmatrix}^{T}\).

Let us require a minimum prescribed subjective explainability \(E(h|u) \ge C- \eta\), which is equivalent to the constraint (see (10) and (15))

$$\begin{aligned} H(h|u)&= (1/2) \log \sigma ^{2}_{\hat{y}|u} \le \eta . \end{aligned}$$
(21)

We can further develop the constraint (21) using (20) and basic calculus for Gaussian processes [31],

$$\begin{aligned} \begin{aligned} \sigma ^{2}_{\hat{y}|u}&= \sigma ^{2}_{\hat{y}} - \sigma ^2_{\hat{y},u}/ \sigma ^{2}_{u} \\&{\mathop {=}\limits ^{\hat{y}= \textbf{x}^{T} \textbf{w}}} w_{1}^{2}(\sigma ^{2}_{x} - \sigma ^2_{x,u} / \sigma ^{2}_{u}) \\&= w_{1}^{2}\sigma ^{2}_{x|u}. \end{aligned} \end{aligned}$$
(22)

The constraint (21) is enforced by requiring

$$\begin{aligned} w_{1}^{2} \le \exp (2\eta ) \sigma ^{-2}_{x|u}. \end{aligned}$$
(23)

The goal is to find a linear hypothesis \(h(\textbf{x}) = \textbf{w}^{T} \textbf{x}\), whose weight vector \(\textbf{w}\) satisfies (23), that incurs a minimum risk

$$\begin{aligned}&\mathbb {E} \big \{ L\big ({(\textbf{x},y)},{h}\big ) \big \} = \mathbb {E} \big \{ \big ( y- h(x) \big )^{2} \big \} \nonumber \\&{\mathop {=}\limits ^{h(x)= \textbf{x}^{T} \textbf{w}}} \mathbb {E} \big \{ (y- w_{1} x- w_{0})^{2} \big \} \nonumber \\&= P_{x}w_1^2 + w_0^2 -2\mu _{y}w_0 + P_{y} + 2\mu _{x}w_1w_0 - 2P_{x,y}w_1\end{aligned}$$
(24)
$$\begin{aligned}&= \textbf{w}^{T} \begin{bmatrix} P_{x} & \mu _{x} \\ \mu _{x} & 1 \end{bmatrix} \textbf{w}- 2\textbf{w}^{T} \begin{bmatrix} P_{x,y} \\ \mu _{y} \end{bmatrix} + P_{y}, \end{aligned}$$
(25)

where \(P_{x} = \sigma _{x}^{2} + \mu _{x}^{2}\), \(P_{y} = \sigma _{y}^{2} + \mu _{y}^{2}\) and \(P_{x,y} = \sigma _{x,y} + \mu _{x}\mu _{y}\).

We minimize the risk (24) under the constraint (23), which is equivalent to enforcing subjective explainability of at least \(C - \eta\),

$$\begin{aligned}&\min _{w_{1}, w_{0} \in \mathbb {R} } P_{x}w_1^2 + w_0^2 -2\mu _{y}w_0 + P_{y} + 2\mu _{x}w_1w_0 - 2P_{x,y}w_1 \\&\text{ subject } \text{ to } w_{1}^{2} \le \exp (2\eta )\sigma ^{-2}_{x|u}. \nonumber \end{aligned}$$
(26)

A set of necessary and sufficient conditions for a weight vector \({\overline{\textbf{w}}} = \big ( \overline{w}_{1},\overline{w}_{0} \big )^{T}\) to solve (26) are the Karush–Kuhn–Tucker conditions [27, Sec. 5.5.3.]

$$\begin{aligned} 2 \begin{bmatrix} P_{x} + \rho & \mu _{x} \\ \mu _{x} & 1 \end{bmatrix} {\overline{\textbf{w}}} - 2 \begin{bmatrix} P_{x,y} \\ \mu _{y} \end{bmatrix}&= 0 \nonumber \\ {\overline{w}}_{1}^{2} - \exp (2\eta )\sigma ^{-2}_{x|u}&\le 0 \nonumber \\ \rho&\ge 0 \nonumber \\ \rho \big ( {\overline{w}}_{1}^{2} - \exp (2\eta )\sigma ^{-2}_{x|u} \big )&= 0. \end{aligned}$$
(27)

By inspection of (27), one can show that

$$\begin{aligned}&{\overline{w}}_{1} = {\left\{ \begin{array}{ll} \sigma _{y,x}/ \sigma _{x}^{2} &{} \text{ if } \sigma ^2_{y,x}/ \sigma _{x}^{4} \le \exp (2\eta )\sigma ^{-2}_{x|u} \\ \rm{sign} \{ \sigma _{y,x} \} \exp (\eta )/\sigma _{x|u} &{} \text{ if } \sigma ^2_{y,x}/ \sigma _{x}^{4} > \exp (2\eta )\sigma ^{-2}_{x|u}. \end{array}\right. } \end{aligned}$$
(28)
$$\begin{aligned}&{\overline{w}}_{0} = \mu _{y} - \mu _{x} {\overline{w}}_{1} \end{aligned}$$
(29)

By inserting (28), (29) into (24), we obtain that the minimum achievable risk \(\mathbb {E} \big \{ L\big ({(\textbf{x},y)},{h}\big ) \big \}\) of a linear hypothesis with required subjective explainability \(E(h|u) \ge C - \eta\) is

$$\begin{aligned} \mathbb {E} \big \{ L\big ({(\textbf{x},y)},{h}\big ) \big \} = {\left\{ \begin{array}{ll} \sigma ^2_{y|x} &{} \text{ if } \sigma ^2_{y,x}/ \sigma _{x}^{4} \le \exp (2\eta )\sigma ^{-2}_{x|u} \\ {\phi (\eta )} &{} \text{ if } \sigma ^2_{y,x}/ \sigma _{x}^{4} > \exp (2\eta )\sigma ^{-2}_{x|u}, \end{array}\right. } \end{aligned}$$
(30)

where \(\phi (\eta ) = \sigma _{x}^{2} \exp (2\eta )\sigma ^{-2}_{x|u} - 2| \sigma _{y,x}| \exp (\eta )/\sigma _{x|u} +\sigma ^2_{y}\).

Inserting the closed-form solution (28), (29) into EERM (13) yields Algorithm 2 as an instance of EERM for linear regression in dual form. Algorithm 2 requires as input a choice for the upper bound \(\eta\) on the conditional entropy and a training set \(\mathcal {D}= \big \{ \big ( \textbf{x}^{(1)},y^{(1)},u^{(1)} \big ),\ldots ,\big ( \textbf{x}^{(m)},y^{(m)},u^{(m)} \big ) \big \}\). Algorithm 2 delivers a hypothesis \(h^{(\eta )}\) that achieves a minimum risk \(\overline{L}(h)\) under the prescribed upper bound \(\eta\) on the conditional entropy, i.e. under a prescribed minimum subjective explainability \(E(h|u)\). The resulting hypothesis is controlled by the value of \(\eta\).

Choosing a large \(\eta\) for Algorithm 2 allows the hypothesis \(h^{(\eta )}\) a larger conditional entropy \(H(h^{(\eta )}|u)\), which in turn permits a lower risk \(\overline{L}(h)\) at the cost of a lower subjective explainability \(E(h^{(\eta )}|u)\) (see (21), (30)). On the contrary, choosing a small \(\eta\) enforces a lower conditional entropy \(H(h^{(\eta )}|u)\) and, in turn, a higher subjective explainability \(E(h^{(\eta )}|u)\), at the expense of an increased risk \(\overline{L}(h)\).

Fig. 3

The solutions of EERM, in the dual (13) form, trace out a curve in the plane spanned by the risk \(\overline{L}(h)\) and the upper bound of conditional entropy \(\eta\)

Algorithm 2

Explainable Linear Regression in dual form
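
Algorithm 2 is likewise shown only as a figure; a minimal sketch of its main computation, implementing the closed-form solution (28), (29) with the moments in (20) replaced by sample estimates (as in the experiment of Sect. 4.1), could look as follows.

```python
import numpy as np

def explainable_linear_regression_dual(x, y, u, eta):
    """Sketch of EERM (13) for a single-feature linear model via (28), (29).

    The moments of the probabilistic model (20) are replaced by sample estimates.
    """
    mu_x, mu_y = x.mean(), y.mean()
    C = np.cov(np.vstack([x, y, u]))               # sample covariance of (x, y, u)
    var_x, var_u = C[0, 0], C[2, 2]
    cov_yx, cov_xu = C[1, 0], C[0, 2]
    var_x_given_u = var_x - cov_xu ** 2 / var_u    # conditional variance sigma^2_{x|u}

    bound = np.exp(2 * eta) / var_x_given_u        # constraint (23): w_1^2 <= exp(2*eta) / sigma^2_{x|u}
    w1 = cov_yx / var_x                            # ordinary least-squares slope, first case of (28)
    if w1 ** 2 > bound:                            # constraint active: second case of (28)
        w1 = np.sign(cov_yx) * np.sqrt(bound)
    w0 = mu_y - mu_x * w1                          # intercept (29)
    return w1, w0

# usage on toy data: the user signal u is a noisy copy of the label
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 0.3 * rng.normal(size=200)
u = y + 0.5 * rng.normal(size=200)
print(explainable_linear_regression_dual(x, y, u, eta=-1.0))
```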

3.2 Explainable decision trees

We now apply EERM in its dual (constraint) form (13) to DT classifiers [19, 23]. Consider data points characterized by features \(\textbf{x}\), a binary label \(y\in \{0,1 \}\) and a binary user signal \(u\in \{0,1\}\). The restriction to binary labels and user signals is for ease of exposition. Our approach can be generalized easily to more than two label values (multi-class classification) and non-binary user signals.

The model \(\mathcal {H}\) in (13) is constituted by all DTs whose root node tests the user signal \(u\) and whose depth does not exceed a prescribed maximum depth \(d_{\rm{max}}\) [16]. The depth \(d\) of a specific DT \(h\) is the maximum number of test nodes that are encountered along any possible path from the root node to a leaf node [16].

Figure 4 illustrates a hypothesis \(h\) obtained from a DT with depth \(d=2\). We consider only DTs whose nodes implement a binary test, such as whether a specific feature \(x_{j}\) exceeds some threshold. Each such binary test can maximally contribute one bit to the entropy of the resulting prediction (at some leaf node).

Thus, for a given user signal \(u\), the conditional entropy of the prediction \(\hat{y} = h(\textbf{x})\) is upper bounded by \(d-1\) bits. Indeed, since the root node is reserved for testing the user signal \(u\), the number of binary tests carried out for computing the prediction is upper bounded by \(d-1\). We then obtain Algorithm 3 from (13) by using the estimator \({\widehat{H}}(h|u) :=d-1\).

Algorithm 3

Explainable DT Classification
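
Algorithm 3 also appears only as a figure. Under the construction described above (one subtree per value of the binary user signal, with the root-level test of \(u\) realized implicitly by splitting the training set), a sketch using scikit-learn's DecisionTreeClassifier, as in Sect. 4.2, could read as follows; the function and variable names are our own.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def explainable_dt_classifier(X, y, u, d_max):
    """Sketch of Algorithm 3 for a binary user signal u and d_max >= 2.

    One DT of depth at most d_max - 1 is learnt per user-signal value; the overall
    hypothesis implicitly "tests" u at the root, so H_hat(h|u) = d_max - 1 in (13).
    """
    trees = {}
    for u_val in (0, 1):
        mask = (u == u_val)
        trees[u_val] = DecisionTreeClassifier(max_depth=d_max - 1).fit(X[mask], y[mask])
    return trees

def predict(trees, X, u):
    y_hat = np.empty(len(u), dtype=int)
    for u_val, tree in trees.items():
        mask = (u == u_val)
        if mask.any():
            y_hat[mask] = tree.predict(X[mask])
    return y_hat

# toy usage with random features, labels, and a binary user signal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, 100)
u = rng.integers(0, 2, 100)
print(predict(explainable_dt_classifier(X, y, u, d_max=3), X, u)[:10])
```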

Fig. 4

EERM implementation for learning an explainable decision tree classifier. EERM learns a separate decision tree for all data points sharing a common user signal \(u\). The constraint in (13) can be enforced naturally by fixing a maximum tree depth \(d\)

4 Numerical experiments

This section reports the results of different numerical experiments that verify the performance of EERM. In addition, we present a new metric to evaluate the overall performance of EERM, which combines different preferences between objective accuracy and subjective explainability. The section details the datasets, the choices evaluated for different end users, the evaluation metrics, and the results obtained.

A. Dataset

In order to demonstrate the usefulness of EERM (12), we conduct illustrative numerical experiments that revolve around explainable weather forecasting (see Sect. 4.1) and explainable hate speech detection in social media (see Sect. 4.2). The weather prediction dataset consists of recordings at the observation station “Nuuksio” in Finland from 2020-01-01 to 2021-12-31, obtained from the Finnish Meteorological Institute (FMI). The hate speech detection dataset, obtained from Kaggle, consists of 25.3k tweets labelled as hate speech, offensive language, or neither.

Section 4.2 discusses the application of EERM to the detection of hate speech and offensive language in social networks [32]. Hate speech is a main obstacle towards embracing the Internet’s potential for deliberation and freedom of speech [33]. Moreover, the detrimental effect of hate speech seems to have been amplified during the period of the COVID-19 pandemic [34].

Hate speech is a contested term whose meaning ranges from concrete threats to individuals to venting anger against authority [35]. Hate speech is characterized by devaluing individuals based on group-defining characteristics such as their race, ethnicity, religion, and sexual orientation [36]. Detecting hate speech requires multi-disciplinary expertise from social sciences and computer science [37, 38]. Providing subjective explainability for ML users with different backgrounds is crucial for the diagnosis and improvement of hate speech detection systems [33, 34, 39].

B. Evaluation Metrics

Since the quality of a ML model with subjective explainability depends strongly on the domain knowledge of the end users, as well as on factors that are inherently contextual and subjective, there is a lack of consensus regarding unified metrics to effectively assess it.

To this end, we define a new comprehensive quality measure \(E^{\star }\) to evaluate the overall performance of a ML model with subjective explainability, i.e. the combination of interpretability (higher subjective explainability for various levels of end users) and fidelity (lower empirical risk). \(E^{\star }\) computes their harmonic mean and thus takes the effects of both empirical risk and conditional entropy into account simultaneously, representing both measures symmetrically in one metric.

$$\begin{aligned} {E^{\star } = \frac{2 ( 1 - \widetilde{L} )( 1 - \widetilde{H} )}{2 - \widetilde{L} - \widetilde{H}},} \end{aligned}$$
(32)

where \(\widetilde{L}\) and \(\widetilde{H}\) denote the normalized empirical risk \(\hat{L}(h)\) and the normalized conditional entropy \(\hat{H}( h |u)\) for a given human end user, respectively.

Analogous to the F1 score in statistics, \(E^{\star }\) is an indicator designed to combine two important evaluation criteria, empirical risk and conditional entropy, into a single measure that provides a balanced assessment of the model's subjective explainability, ranging between 0.0 and 1.0. The highest possible value of \(E^{\star }\) is 1.0, attained only if both the (normalized) empirical risk and conditional entropy are perfect, while the lowest value 0.0 indicates poor performance, i.e. the worst possible empirical risk or conditional entropy. Public policymakers should take the different background knowledge levels of end users into account: for lay users it is natural to give priority to high subjective explainability, while for domain experts it might be more important to ensure that the model has high accuracy. However, high accuracy might come at the price of low subjective explainability, i.e. low empirical risk but high conditional entropy. Since the two measures interact, \(E^{\star }\) can be expected to be reasonably large under a suitable compromise but will rarely reach the ideal value of 1.0 on real datasets.
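
Since (32) is a harmonic mean analogous to the F1 score, it is straightforward to compute once \(\widetilde{L}\) and \(\widetilde{H}\) are available; a minimal sketch, assuming both inputs are already normalized to \([0, 1]\):

```python
def e_star(L_tilde, H_tilde):
    """E* measure (32): harmonic mean of (1 - normalized empirical risk) and
    (1 - normalized conditional entropy)."""
    num = 2 * (1 - L_tilde) * (1 - H_tilde)
    den = 2 - L_tilde - H_tilde
    return num / den if den > 0 else 0.0   # den == 0 only if both criteria are worst (= 1)

print(e_star(0.2, 0.3))   # good accuracy and explainability -> E* close to 1
print(e_star(0.9, 0.1))   # poor accuracy dominates -> small E*
```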

4.1 Explainable linear regression

This experiment applies EERM to learn an explainable linear predictor for the maximum daytime temperature at an observation station of the Finnish Meteorological Institute (FMI) (https://en.ilmatieteenlaitos.fi/download-observations). Data points represent the daily weather recordings, along with a time-stamp, at the weather station “Nuuksio” in Finland. The feature \(x\) of a data point is the minimum temperature during that day, while the maximum temperature of the same day is the label \(y\). Each data point is also characterized by a user signal \(u\in \mathbb {R}\). We compute the weights \({\widehat{\textbf{w}}} = \big ({\widehat{w}}_{1}, {\widehat{w}}_{0} \big )^{T}\) of a linear hypothesis by plugging sample estimates for the means and (co-)variances into (28)–(30). Thus, we construct a linear hypothesis whose subjective explainability is approximately lower bounded by a prescribed value of \(C-\eta\).

In order to simulate the variation in the background knowledge of users, we generate user signals by adding perturbations of different strength to the maximum temperatures. Specifically, the perturbations are drawn from a normal distribution with probability density function \(p(x)=(2\pi \xi ^2)^{-\frac{1}{2}}\exp (-\frac{(x-m)^2}{2\xi ^2})\), where \(m\) is the mean and \(\xi\) is the standard deviation. In Figs. 5 and 6, we set \(m=0\) and \(\xi \in \{0, 5, 7.5, 10\}\). As the standard deviation \(\xi\) increases, the quality of the user signal \(u\) deteriorates.
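
As a sketch of this construction (the precise preprocessing of the FMI recordings is not spelled out in the text, so the temperature array below is a hypothetical placeholder), the perturbed user signals could be generated as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical placeholder for the daily maximum temperatures (the labels y)
y_max_temp = rng.normal(15.0, 8.0, size=365)

user_signals = {}
for xi in (0.0, 5.0, 7.5, 10.0):
    # user signal = label corrupted by zero-mean Gaussian noise of standard deviation xi;
    # larger xi simulates a user with less reliable background knowledge
    user_signals[xi] = y_max_temp + rng.normal(0.0, xi, size=y_max_temp.shape)
```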

Figures 5 and 6 depict, respectively, the resulting empirical risk and weights for a varying upper bound \(\eta\) on the conditional entropy. Overall, for increasing subjective explainability \(E(h^{(\lambda )}|u)\) (decreasing \(\eta\)), the empirical risk \(\hat{L}(h)\) increases and the weight \({\widehat{w}}_{1}\) of the feature \(x\) becomes smaller (less relevant). On the other hand, for decreasing explainability \(E(h^{(\lambda )}|u)\) (increasing \(\eta\)), the empirical risk \(\hat{L}(h)\) decreases and the weight \({\widehat{w}}_{1}\) of the feature \(x\) becomes larger (more relevant). Moreover, the larger the value of \(\xi\), i.e. the worse the quality of the user signal, the larger the empirical risk incurred to achieve the same subjective explainability. Since such an end user has less domain-specific background knowledge, the model needs a larger upper bound \(\eta\) on the conditional entropy to attain the minimum empirical risk \(\hat{L}(h)\). Furthermore, all curves approach \(\overline{L} = \sigma ^2_{y}\) as \(\eta \rightarrow -\infty\), i.e. when the required explainability becomes arbitrarily large. For \(\eta \ge \frac{1}{2}\ln (\sigma ^2_{y,x} \sigma ^{2}_{x|u}/ \sigma _{x}^{4})\), the constraint becomes inactive and the minimum risk \(\bar{L}_{{\min }} = \sigma _{{y|x}}^{2}\) is attained (see (30)).

Fig. 5

The empirical risk \(\hat{L}({\widehat{h}})\) of a linear hypothesis learnt from weather recordings. Each curve is parametrized by the upper bound \(\eta\) on the conditional entropy \(H\big ({\widehat{h}} \big | u\big )\). We obtain the weights of the linear model using (28), (29) by inserting sample estimates for \(\sigma _{x| u}\), \(\sigma _{y, x}\), \(\mu _{x}\) and \(\mu _{y}\)

Fig. 6

The weights \(\textbf{w}\) of a linear hypothesis learned from weather data with prescribed subjective explainability of (approximately) \(C-\eta\)

Additionally, we compute the \(E^{\star }\) measure (32) as an overall performance assessment of EERM in the weather prediction experiment. The curves in Figs. 7 and 8 illustrate the relationship of \(E^{\star }\) with the normalized conditional entropy \(\widetilde{H}\) and the normalized empirical risk \(\widetilde{L}\), respectively. We can see that EERM underperforms if we pursue only perfect subjective explainability (low \(\widetilde{H}\)) or only satisfactory model accuracy (low \(\widetilde{L}\)). \(E^{\star }\) can be expected to be reasonably large under a suitable compromise but will rarely reach the ideal value of 1.0 on real datasets (see Fig. 9). Only by taking both criteria into account can \(E^{\star }\) increase and EERM achieve a better overall performance (see Fig. 10).

Fig. 7

The overall performance assessment \(E^{\star }\) of the EERM for the weather prediction with subjective explainability. Each curve is affected by the quality of user signals \(\xi\) and the desired subjective explainability represented by the conditional entropy under normalization.

Fig. 8

The overall performance assessment \(E^{\star }\) of the EERM for the weather prediction with subjective explainability. Each curve is affected by the quality of user signals \(\xi\) and the expected accuracy represented by the empirical risk under normalization.

Fig. 9

An overview of the relationship among \(E^{\star }\), empirical risk, and the conditional entropy of the EERM for the weather prediction with subjective explainability. High accuracy causes low subjective explainability, while better subjective explainability is accompanied by worse accuracy. The \(E^{\star }\) is supposed to be large enough under a certain compromise but hard to reach the ideal value of 1.0 in real datasets.

Fig. 10

An overview of the complete relationship among \(E^{\star }\), empirical risk, and the conditional entropy under normalization. Only if both accuracy and subjective explainability are the best can the overall performance assessment \(E^{\star }\) reach the ideal value of 1.0.

4.2 Explainable hate speech detection

We now discuss a numerical experiment that uses EERM to learn an explainable hate speech detector for social networks. This experiment uses a public dataset that contains curated short messages (tweets) from a social network [28]. Each tweet has been manually rated by a varying number of users as either “hate speech”, “offensive language”, or “neither”. For each tweet, we define its binary label as \(y=1\) (“inappropriate tweet”) if the majority of users rated the tweet either as “hate speech” or “offensive language”. If the majority of users rated the tweet as “neither”, we define its label value as \(y=0\) (“appropriate tweet”).

The feature vector \(\textbf{x}\) of a tweet is constructed using the normalized frequencies (“tf-idf”) of individual words (stop words removed) [40]. Each tweet is also characterized by a binary user signal \(u\in \{0,1\}\). The user signal is defined to be \(u=1\) if the tweet contains at least one of the five most frequent words appearing in tweets with \(y=1\).
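
A sketch of this feature and user-signal construction with scikit-learn (the toy tweets below are invented placeholders; the frequency ranking via summed tf-idf scores is only a proxy for the word counts described above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-in for the curated tweets and their binary labels (1 = "inappropriate tweet")
tweets = ["you are awful and stupid", "have a nice day",
          "stupid awful nonsense", "lovely weather today"]
y = np.array([1, 0, 1, 0])

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)                   # tf-idf feature vectors (stop words removed)

# proxy for the five most frequent words among tweets labelled y = 1 (here: summed tf-idf scores)
scores = np.asarray(X[y == 1].sum(axis=0)).ravel()
top_words = np.array(vectorizer.get_feature_names_out())[np.argsort(scores)[-5:]]

# binary user signal: u = 1 if the tweet contains at least one of these keywords
u = np.array([int(any(w in t.lower() for w in top_words)) for t in tweets])
print(top_words, u)
```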

We use Algorithm 3 to learn an explainable decision tree classifier whose conditional entropy is upper bounded by \(\eta =2\) bits. The training set \(\mathcal {D}\) used for Algorithm 3 is obtained by randomly selecting a fraction of around \(90 \%\) of the entire dataset. The remaining \(10 \%\) of tweets are used as a test set.

To learn the decision tree classifiers in steps 3 and 4 of Algorithm 3, we used the implementations provided by the current version of the Python package scikit-learn [22]. The resulting explainable DT classifier \(h^{(\eta =2)}(\textbf{x})\) achieved a test-set accuracy of 0.929.

5 Conclusion

The explainability of predictions provided by ML methods becomes increasingly relevant for their use in automated decision-making [41, 42]. Given the different levels of expertise and knowledge of lay and expert users, providing subjective (tailored) explainability is instrumental for achieving trustworthy AI [41, 43]. Our main contribution is EERM as a new design principle for subjective XML. EERM is obtained by using the conditional entropy of predictions, given a user signal, as a regularizer. The hypothesis learned by EERM balances a small risk against subjective explainability for a specific user (explainee) of the ML method.

5.1 Future works

Though our method is flexible and agnostic to the precise means of obtaining user signals, interesting avenues for future work include user studies that allow measuring different forms of user signals, using feedback on explanations, and combining our approach with counterfactual explanations. In addition, we plan to develop practical implementations of the EERM principle for more complex ML models such as artificial neural networks (ANNs).