1 Introduction

Many business processes in industry are still based on manual human execution steps, checks, and assessments. These manual processes have often been in place for years, if not decades. Hence, a lot of historical transactional data slumbers in IT systems that could be used to design data-driven decision agents using supervised machine learning (SML). Trained machine learning (ML) models can then be used either to fully automate business processes through automated decision making (ADM) or at least to assist during the process in the form of a decision support system (DSS), where, unlike in ADM, the human expert retains control over the final decision. Automating business processes is beneficial since it reduces process costs and potentially also increases standardization. Consequently, there is a noticeable shift from the usage of ML for predictive modeling towards prescriptive modeling, where appropriate actions are supposed to be triggered in real-world scenarios. This trend has recently been coined prescriptive machine learning [17].

Nevertheless, the usage of data-induced decision agents is not free of risk. The decisions of an ML model cannot be considered correct all the time. For instance, there might be issues related to the (training) data, such as data and concept drift or shift, inadequate or wrong supervision (human decisions cannot always be considered as ground truth), or even inherent non-determinism in the dependency between input and output. This last uncertainty is often referred to as aleatoric uncertainty. Uncertainty with regard to the quality and amount of training data is known as approximation uncertainty, and uncertainty about the right type of model for a particular problem is referred to as model uncertainty. Both of the latter can be attributed to epistemic uncertainty, which, unlike aleatoric uncertainty, is reducible [18].

Given the aforementioned uncertainties, it is hardly conceivable that high-stakes business domains will go from a purely manual human decision process to a fully ML-automated process at once, since this would entail a lot of (financial) risk. A practical approach could be to first automate rather clear or certain cases and still leave the more complex or uncertain cases to a human expert. In recent years, the topic of uncertainty quantification in machine learning has gained a lot of attention [12, 22, 33]. The capability of a machine learning model to quantify its uncertainty related to a certain query can be utilized to quantify the risk of a wrong decision. Knowing the potential risk of a wrong decision for a particular query can then serve as a means to distinguish between fully automated decision making and decision support. Roughly speaking, when the risk of a wrong decision is high, the machine learning model is (at most) supposed to be used as decision support and the final decision must be left to a human expert. In contrast, if the risk of a wrong decision is considered low, the process can be fully automated through automated decision making.

A versatile method for quantifying uncertainty, which is also widely used in practice, is conformal prediction [8, 9, 20, 23, 36]. As a foundation, conformal prediction only requires a model that is capable of outputting heuristic probabilities, which makes it almost model-agnostic and broadly applicable. Consequently, in this paper we evaluate how uncertainty quantification with conformal prediction can be used to draw an uncertainty-aware decision boundary between automated decision making and decision support, where the final decision is still up to a human expert. We do this by means of a case study using a goodwill data set of a BMW National Sales Company (NSC) containing customer goodwill requests and manual contribution decisions made by human experts.

2 Machine learning for automated decision making

In many business domains there is a demand for automating repetitive tasks through machine learning, with the main goal of freeing up workforce and thereby saving costs. One such exemplary business process is goodwill assessment, where a (car) manufacturer compensates customers in cases of product-related issues outside of the warranty window (usually after 3–5 years). The aim of granting goodwill is to keep customers satisfied and loyal to the brand. To a large extent, these goodwill assessments are still carried out manually at BMW. Business experts check the goodwill requests, which contain extensive information regarding the vehicle and the problem at hand, and subsequently grant a certain repair cost contribution percentage (binned into ten-percent steps, i.e., elements of \(\mathcal {Y} = \{0,10,20, \ldots ,100\}\)) separately for labor and parts.

Since this manual process has been in place for years, there is plenty of data that can be used for machine learning. This data comes in the form

$$\begin{aligned} \mathcal {D}=\big \{ (\varvec{x}_1,y_1), \ldots ,(\varvec{x}_n,y_n) \big \} \,, \end{aligned}$$

with goodwill requests represented as feature vectors \(\varvec{x}_i \in \mathcal {X} \subseteq \mathbb {R}^m\) and observed human goodwill decisions as labels \(y_i \in \mathcal {Y}\). This is exactly the type of data commonly assumed in the setting of supervised machine learning, where the goal is to learn an optimal predictor \(h^* \in \mathcal {H}\) maximizing predictive accuracy, or, more generally, minimizing the expected loss (risk)

$$\begin{aligned} \mathcal {R}(h) := \mathbb {E}_{(\varvec{x},y) \sim P} \, l(y, h(\varvec{x})) \, , \end{aligned}$$
(1)

where \(l: \, \mathcal {Y} \times \mathcal {Y} \rightarrow \mathbb {R}\) is a loss function and the expectation is taken with respect to the data generating process P (a joint probability measure on \(\mathcal {X} \times \mathcal {Y}\)). Moreover, \(\mathcal {H} \subset \mathcal {Y}^\mathcal {X}\) is the set of predictors (mappings \(\mathcal {X} \rightarrow \mathcal {Y}\)) the learner can choose from; this set is also called the hypothesis space in machine learning.

As already said, the goodwill use case qualifies as what has recently been coined prescriptive machine learning [17]. In contrast to the common setting of predictive machine learning, the goal is not to predict some underlying ground truth, but rather to learn models that stipulate appropriate decisions or actions to be taken in order to achieve a certain goal. In fact, in the case of goodwill, one may argue that there is nothing like a “right” or “true” monetary contribution, nor is a decision either right or wrong. Instead, a decision is more or less appropriate, fair for the customer, and strategically opportune for the company. From this point of view, one may also question the idea of learning a model that seeks to mimic the human expert, taking her decisions as a target for prediction [34], all the more since these decisions appear to be biased. For example, we found that a decision of 50% contribution is somewhat overrepresented in the data, which suggests that it is often taken as a default choice for a partial cost coverage, even if it might not necessarily be the most appropriate percentage. In the following, we will nevertheless assume that mimicking the expert is a reasonable strategy, at least as a first step toward a data-driven goodwill assessment, leaving more elaborate approaches for future work.

Under this premise, the problem can essentially be tackled by methods for supervised learning, which, in one way or the other, replace the true risk (1) as a target of optimization by the empirical risk

$$\begin{aligned} \mathcal {R}_{emp}(h) := \frac{1}{n} \sum _{i=1}^{n} l(y_i, h(\varvec{x}_i)) \, . \end{aligned}$$

As opposed to the true risk, which requires knowledge of P, the latter can be computed on the training data.
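
As a small illustration, the empirical risk is a plain average over the training sample. The following minimal Python sketch assumes a generic predictor h and loss function l (all names are illustrative):

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    """Average loss of predictor h over the training sample (X, y)."""
    return float(np.mean([loss(yi, h(xi)) for xi, yi in zip(X, y)]))

# Example losses for contribution labels in {0, 10, ..., 100}:
zero_one = lambda y_true, y_pred: float(y_true != y_pred)  # misclassification
absolute = lambda y_true, y_pred: abs(y_true - y_pred)     # error in percentage points
```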

3 Uncertainty in automated decision making

Since \(\mathcal {R}_{emp}(h)\) is only an estimation of the true risk \(\mathcal {R}(h)\), the empirical risk minimizer

$$\begin{aligned} \hat{h}:= \underset{h \in \mathcal {H}}{\arg \min } \ \mathcal {R}_{emp}(h) \end{aligned}$$

(or the minimizer of any variant of the empirical risk) will at best approximate but not equal the true risk minimizing hypothesis

$$\begin{aligned} h^{*}:= \underset{h \in \mathcal {H}}{\arg \min } \ \mathcal {R}(h) \,. \end{aligned}$$

Consequently, there is uncertainty related to a presumably sub-optimal model \(\hat{h}\), the prescriptions of which might not always be appropriate. Hence, in business processes like goodwill assessment, where wrong decisions might heavily impact customer satisfaction and also have a financial impact on the manufacturer, deploying prescriptive models without any safety mechanisms is hard to conceive.

From a risk minimizing perspective it is reasonable to equip the model with a reject option and to abstain from an automatic decision in case the uncertainty related to a query \(\varvec{x}\) is too high. Abstaining from decisions and trading off coverage for higher classification accuracy is also known as selective classification [11]. A standard selective classifier consists of a classifier function f and a binary selection function \(g: \mathcal {X} \rightarrow \{0,1\}\) which controls whether the classifier f abstains from a prediction or not:

$$\begin{aligned} (f,g)(\varvec{x}) := {\left\{ \begin{array}{ll} f(\varvec{x}) &{} \text { if } \, g(\varvec{x}) = 1 \\ \varnothing &{} \text { if } \, g(\varvec{x}) = 0 \end{array}\right. } \, . \end{aligned}$$

In our specific assessment use case, since there is already a manual human assessment process in place, rejection means forwarding the query to a human expert for manual assessment. The whole assessment process can thus be considered a piecewise function \(a(\varvec{x})\), with the sub-functions \(\hat{h} (\varvec{x})\) and \(m(\varvec{x})\) for automatic prescriptive machine learning and manual human assessment, respectively:

$$\begin{aligned} a(\varvec{x}) = (\hat{h}, m, g) (\varvec{x}) := {\left\{ \begin{array}{ll} \hat{h} (\varvec{x}) &{} \text { if } \, g(\varvec{x}) = 1 \\ m(\varvec{x}) &{} \text { if } \, g(\varvec{x}) = 0 \end{array}\right. } \, . \end{aligned}$$

Whether the input \(\varvec{x}\) is selected for prescription or not depends on a risk assessment with regard to \(\varvec{x}\) and \(\hat{h} (\varvec{x})\). In case the risk \(\mathcal {R}_{\hat{h}}(\varvec{x})\) associated with a query \(\varvec{x}\) exceeds a predefined risk threshold \(\delta \), the query is not supposed to be processed automatically and the selection function will make the system abstain:

$$\begin{aligned} g_{\delta }(\varvec{x}) := {\left\{ \begin{array}{ll} 1 &{} \text { if } \mathcal {R}_{\hat{h}}(\varvec{x}) \le \delta \\ 0 &{} \text { otherwise} \end{array}\right. } \, . \end{aligned}$$
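
Once a per-query risk estimate is available, such a system is straightforward to implement. The sketch below mirrors the piecewise function \(a(\varvec{x})\) and is purely illustrative; risk stands for the (yet to be concretized) estimate of \(\mathcal {R}_{\hat{h}}(\varvec{x})\):

```python
def assess_with_reject(f, m, risk, delta, x):
    """Assessment a(x): automate if the estimated risk of query x stays
    below the threshold delta, otherwise defer to the human expert m."""
    if risk(x) <= delta:  # selection function g_delta(x) = 1
        return f(x)       # automated decision h_hat(x)
    return m(x)           # reject: manual human assessment
```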

A trade-off between reliability and degree of automation is inherent in an ML-enhanced assessment process \(a(\varvec{x})\). Since the results produced by \(\hat{h} (\varvec{x})\) will most likely not be perfect all the time, there is a serious risk of wrong (and potentially costly) ML decisions that might significantly impact the overall reliability of the process. This loss in reliability can be avoided by shifting high-risk requests to human experts \(m(\varvec{x})\), which in turn comes at a loss of automation. In order to maximize the degree of automation while still maintaining sufficient reliability in the decision process, accurately quantifying the risk related to a request \(\varvec{x}\) is crucial. For business domains it is of great interest to find an optimal balance between the degree of automation and the risk of inappropriate decisions, depending on the criticality of the business process and its associated costs. This trade-off between risk and degree of automation is also known as the risk-coverage (RC) trade-off [11].

4 Reliable decision making using conformal prediction

In the following, we outline our selective uncertainty-aware approach to automated decision making. We start by enhancing our existing hierarchical model with conformal prediction, which allows us to quantify the uncertainty associated with queries. In the next step, we turn these uncertainties into risk values. Finally, we discuss how the trade-off between risk and degree of automation can be optimized on the system level using multi-objective optimization. In the end, it is up to a business decision maker (DM) to select a Pareto-optimal solution that best suits the use case at hand.

4.1 Conformal prediction for uncertainty quantification

One method that is widely used to quantify uncertainty is conformal prediction [3, 32, 36]. Unlike in a standard classification scenario, where a predictor outputs a single class (point prediction), conformal prediction outputs a prediction set \(\Gamma ^{\epsilon }(\varvec{x})\) which is guaranteed to contain the correct label y with a probability of \(1-\epsilon \), where \(\epsilon > 0\) is a user-defined significance level or error rate. For instance, \(\epsilon =0.05\) means that the algorithm is allowed to make at most 5% invalid predictions on average. More formally, prediction sets \(\Gamma ^{\epsilon }(\varvec{x})\) are guaranteed to fulfill the following property, which is also referred to as marginal coverage:

$$\begin{aligned} 1-\epsilon \le P(y \in \Gamma ^{\epsilon }(\varvec{x})) \le 1 - \epsilon + \frac{1}{n+1} \, , \end{aligned}$$

where n is the number of training examples seen by the learning algorithm so far.

Fig. 1 Distribution of the past contributions before and after the hierarchical restructuring

The construction of prediction sets relies on so-called non-conformity scores \(s(\varvec{x},y) \in \mathbb {R}\), which can be interpreted as a measure of plausibility of the input/output pair \((\varvec{x},y)\) in light of the data \(\mathcal {D}\) seen so far: the higher the value \(s(\varvec{x},y)\), the less the (hypothetical) data point \((\varvec{x},y)\) “fits” the (truly observed) training data. The standard inductive conformal prediction (ICP) algorithm consists of the following steps [1, 29, 30]:

  1. Split the available data into a training, a calibration, and a test data set.

  2. Induce a predictive model h on the training data.

  3. Define a score function \(\alpha = s(\varvec{x},y) \in \mathbb {R}\), where larger scores mean higher non-conformity of \((\varvec{x},y)\); for example, if h is a scoring classifier, \(s(\varvec{x},y)\) could be given by the score assigned to y by \(h(\varvec{x})\).

  4. Compute the critical value \(\hat{q}\) as the \(\frac{\lceil (n+1)(1-\epsilon ) \rceil }{n}\) empirical quantile (essentially \(1-\epsilon \) with a small correction) of the calibration scores \(\alpha _{1}=s(\varvec{x}_{1},y_{1}), \ldots ,\alpha _{n}=s(\varvec{x}_{n},y_{n})\).

  5. Use the critical value \(\hat{q}\) to calculate the prediction sets for new, previously unseen examples (a code sketch of this procedure follows the listing):

     $$\begin{aligned} \Gamma ^{\epsilon }(\varvec{x}) = \{y: \alpha =s(\varvec{x},y) \le \hat{q}\} \end{aligned}$$
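
Condensed into code, steps 3 to 5 amount to a quantile computation on the calibration scores followed by a filter over the candidate labels. A minimal sketch, assuming a generic score function s(x, y):

```python
import numpy as np

def calibrate(s, X_cal, y_cal, epsilon):
    """Critical value q_hat: the ceil((n+1)(1-eps))/n empirical quantile
    of the calibration nonconformity scores (steps 3 and 4)."""
    scores = np.array([s(x, y) for x, y in zip(X_cal, y_cal)])
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - epsilon)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(s, x, labels, q_hat):
    """Step 5: all candidate labels that are sufficiently conforming."""
    return [y for y in labels if s(x, y) <= q_hat]
```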

The value \(\hat{q}\) plays the role of a critical value as known from statistical hypothesis testing. Moreover, a p-value can be associated with every candidate outcome:

$$\begin{aligned} p(\varvec{x}, y) = \frac{\# \big \{ i \in \{1, \ldots , n+1\} \, \vert \, \alpha _i \ge \alpha _{n+1} = s(\varvec{x},y) \big \}}{n+1} \,. \end{aligned}$$

Thus, \(p(\varvec{x}, y)\) corresponds to the percentage of (real) data points that are at least as nonconforming as \((\varvec{x}, y)\). Consequently, the smaller \(p(\varvec{x}, y)\), the less plausible y is as an outcome for \(\varvec{x}\), and the p-values of all candidate outcomes \(y \in \mathcal {Y}\) allow one to sort them from most plausible to least plausible.

The prediction set \(\Gamma ^{\epsilon }(\varvec{x})\) is obtained by thresholding the nonconformity scores at \(\hat{q}\) (equivalently, the p-values at \(\epsilon \)), thereby dichotomising \(\mathcal {Y}\) into plausible and implausible candidates. Ideally, \(\Gamma ^{\epsilon }(\varvec{x})\) is a singleton set, suggesting that there is exactly one plausible outcome while all others can be excluded. This is a case in which the learner can decide in an unequivocal way. More generally, the larger \(|\Gamma ^{\epsilon }(\varvec{x})|\), the more uncertain the learner is. Obviously, the size of \(\Gamma ^{\epsilon }(\varvec{x})\) is also influenced by the error probability \(\epsilon \): the smaller \(\epsilon \), the larger \(\Gamma ^{\epsilon }(\varvec{x})\) tends to be.

Interestingly, the prediction set can also be empty (\(\Gamma ^{\epsilon }(\varvec{x})=\emptyset \)). This happens in cases where a query \(\varvec{x}\) cannot be combined into a sufficiently conforming tuple \((\varvec{x},y)\) with any of the candidates y, e.g., because \(\varvec{x}\) itself is an atypical case. Obviously, just like overly large prediction sets \(\Gamma ^{\epsilon }(\varvec{x})\), empty predictions indicate a high level of uncertainty, suggesting to the learner that it might be better to abstain.

Let us finally make a remark on the error probability \(\epsilon \), which, as already mentioned, has a direct influence on the size of the prediction sets — and hence the probability that a learner may abstain from taking action. In conformal prediction, this value is normally quite small, with 0.1 and 0.05 being typical choices. Such values are also common in statistical hypothesis testing, so as to guarantee a low type-I error probability. While keeping the error probability low is reasonable in general, and indeed important in many applications, larger values of \(\epsilon \) might be quite meaningful in applications such as goodwill assessment. Here, \(\epsilon \) can also be seen as a parameter controlling the degree of automation and hence the workload of the human expert to whom ambiguous cases are transferred. In principle, \(\epsilon \) can then also be tuned to the availability of human resources. Starting with a very small \(\epsilon \) close to 0, all prediction sets will be full (\(\Gamma ^{\epsilon }(\varvec{x}) = \mathcal {Y}\)) and hence all cases rejected. By increasing \(\epsilon \) step by step, the learner will become less cautious and exclude outcomes in a more aggressive way, thereby increasing the number of cases that can be decided automatically (and decreasing the workload of the human expert). If human resources are limited, this might be the only way to achieve the necessary level of automation.

4.2 The hierarchical assessment model

For the model training step, we will re-use the hierarchical approach already outlined in [13]. It uses a qualitative ranking layer to predict the three main goodwill contribution ranks \(\mathcal {Y}_\textrm{rank} = \{1, 2, 3\} = \{ \text {NO}, \text {PARTIAL}, \text {FULL}\}\) and a subsequent quantitative regression layer for an exact prediction of the PARTIAL goodwill contributions (\(\mathcal {Y}_{\textrm{partial}} =\{10,20, \ldots ,90\}\)).

This hierarchical approach to goodwill assessment was chosen because the data is heavily imbalanced, with many 0 and 100% contributions on the one side and fewer, more widely distributed partial contributions on the other side [13] (cf. Fig. 1a). Combining the partial contribution data in the first layer counteracts this imbalance (cf. Fig. 1b).

Structuring the model hierarchically also makes sense from a risk assessment perspective, because errors in the qualitative ranking layer (e.g., NO vs. FULL contribution) potentially have a greater impact than errors in the quantitative regression layer (e.g., 50% vs. 80% contribution), both financially and on customer satisfaction.

In the hierarchical model, ranking is reduced to binary classifications using the framework presented in [25]:

$$\begin{aligned} r(\varvec{x}) = 1 + \sum _{k=1}^{K-1} f(\varvec{x} , k) \, . \end{aligned}$$
(2)

Here, f is a binary predictor trained to answer the question whether the true rank of \(\varvec{x}\) exceeds k (in which case \(f(\varvec{x}, k) = 1\), otherwise 0). Data for training f is constructed from the original training data. To this end, \(K-1\) new training examples are produced for each original training example \((\varvec{x}, y)\), one for every k:

$$\begin{aligned} \varvec{x}^{k} = (\varvec{x},k), \quad y^{k}= \llbracket k < y \rrbracket , \quad w_{y,k} = |\mathcal {C}_{y,k} - \mathcal {C}_{y,k+1} |\,. \end{aligned}$$

Here, \(w_{y,k}\) is the weight of the training example, which is derived from the original cost matrix: \(\mathcal {C}_{y,k}\) is the cost of predicting k when the ground truth is y (see the Implementation section for an example of a neutral cost matrix). Using this cost-sensitive approach for training the models, different strategies can be implemented, e.g., customer-friendly vs. cost-oriented.
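
The data construction can be sketched in a few lines of Python. Names are illustrative; the cost matrix C is assumed to be indexed directly by ranks 1..K (row: ground truth, column: prediction):

```python
import numpy as np

def expand_for_ranking(X, y, K, C):
    """Create K-1 weighted binary examples (x, k) per original example:
    target [k < y], weight |C[y, k] - C[y, k+1]|."""
    X_bin, y_bin, w = [], [], []
    for x_i, y_i in zip(X, y):
        for k in range(1, K):
            X_bin.append(np.append(x_i, k))           # augmented input (x, k)
            y_bin.append(int(k < y_i))                # does the true rank exceed k?
            w.append(abs(C[y_i, k] - C[y_i, k + 1]))  # cost-derived weight
    return np.array(X_bin), np.array(y_bin), np.array(w)
```

The resulting weights can then be passed to the binary learner, e.g., via the sample_weight argument of XGBoost's scikit-learn interface.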

Fig. 2 Overview of uncertainty-aware goodwill assessments with reject option

Figure 2 summarizes the architecture of our uncertainty-aware approach, with each model layer being equipped with an additional risk assessment and reject option. The model can abstain from a decision when the risk assessment step indicates too high a risk of a wrong assessment. Rejecting a decision in our case means forwarding the query to a human expert for manual assessment. Nonetheless, the model output can be used to assist the expert in the form of a decision support system (DSS). In this case, the human expert is in full control of the final decision but also gets the model's output presented to support her in the decision process.

4.3 Conformalizing the hierarchical model

A core engineering task, which has a major influence on the quality of conformal prediction, is to build a good nonconformity function that incorporates all available information about the data and the model. Based on the outputs of the nonconformity function, the critical value \(\hat{q}\), which controls the outcomes to be put into the final prediction set, is determined.

4.3.1 Conformalizing the ranking layer

Recall the binary predictor that we use to define the ranking function (2). We realize this predictor by training a probabilistic classifier, i.e., by setting \(f(\varvec{x},k) = \llbracket p(y=1 \, |\, \varvec{x},k) > 1/2 \rrbracket \), where \(p(y=1 \, |\, \varvec{x},k)\) is the (predicted) probability that the rank of \(\varvec{x}\) exceeds k. To define a nonconformity score for the ranking layer, we refer to these probabilistic predictions:

$$\begin{aligned} s_\textrm{rank}(\varvec{x},y):= \left|\left( 1 + \sum _{k=1}^{K-1} \hat{p}(y=1 \, |\, \varvec{x},k) \right) - y \right|\in [0,K-1] \,. \end{aligned}$$

The sum over probabilities yields a “soft” rank expressed in terms of a real (instead of an integer) number in [1, K], and \(s_\textrm{rank}(\varvec{x},y)\) is a measure of distance of that number to the rank y.

The prediction set for the ranking layer is given by

$$\begin{aligned} \Gamma ^{\epsilon }_\textrm{RA}(\varvec{x}) = \{y \, |\, s_\textrm{rank}(\varvec{x},y) \le \hat{q} \} \subseteq \{ 1, 2, 3 \} \, , \end{aligned}$$

where \(\hat{q}\) is the critical value obtained on the calibration data for the significance level \(\epsilon \).
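
Given the \(K-1\) probabilistic outputs for a query, the score and the resulting prediction set can be computed as follows. This is a sketch; probs holds \(\hat{p}(y=1 \, |\, \varvec{x},k)\) for \(k = 1, \ldots , K-1\):

```python
import numpy as np

RANKS = (1, 2, 3)  # NO, PARTIAL, FULL

def s_rank(probs, y):
    """Distance of the 'soft rank' 1 + sum_k p(rank > k) to candidate y."""
    return abs(1.0 + float(np.sum(probs)) - y)

def ranking_prediction_set(probs, q_hat):
    """Prediction set of the conformalized ranking layer."""
    return [y for y in RANKS if s_rank(probs, y) <= q_hat]
```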

4.3.2 Conformalizing the regression layer

Nonconformity scores for the regression layer can be obtained using quantile regression (QR), which is the standard approach to create a notion of uncertainty for real-valued problems [1, 31]. Depending on the significance level \(\epsilon \), a lower (\(\epsilon /2\)) and an upper quantile (\(1-\epsilon /2\)) need to be determined. QR yields prediction intervals of the form \([ \hat{t}_{\epsilon /2}(\varvec{x}), \hat{t}_{1-\epsilon /2}(\varvec{x})]\), and the width of these intervals serves as a heuristic notion of uncertainty. The score function can be defined as the projective distance of a candidate outcome y to the interval:

$$\begin{aligned} s_\textrm{reg}(\varvec{x},y) := \max \left\{ \hat{t}_{\epsilon /2}(\varvec{x})-y,y-\hat{t}_{1-\epsilon /2}(\varvec{x}) \right\} \end{aligned}$$

Note that \(s_\textrm{reg}(\varvec{x},y)\) is negative for values y inside the interval and positive outside; the minimal value is obtained for the midpoint of the interval.

Using conformal prediction, the scores can then be calibrated as usual. The prediction interval for conformalized quantile regression is then given by

$$\begin{aligned} \Gamma ^{\epsilon }_\textrm{RE}(\varvec{x}) = \left[ \hat{t}_{\epsilon /2}(\varvec{x}) - \hat{q} ,\hat{t}_{1-\epsilon /2}(\varvec{x}) + \hat{q} \right] . \end{aligned}$$
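
In code, conformalized quantile regression reduces to calibrating the raw quantile band with a single scalar correction. A sketch, assuming fitted lower and upper quantile models t_lo and t_hi that map feature matrices to arrays of predictions:

```python
import numpy as np

def cqr_q_hat(t_lo, t_hi, X_cal, y_cal, epsilon):
    """Calibrate s_reg = max(t_lo(x) - y, y - t_hi(x)) on held-out data."""
    scores = np.maximum(t_lo(X_cal) - y_cal, y_cal - t_hi(X_cal))
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - epsilon)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def cqr_interval(t_lo, t_hi, x, q_hat):
    """Conformalized prediction interval for a single query x."""
    return t_lo(x) - q_hat, t_hi(x) + q_hat
```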

4.4 Risk quantification using conformal prediction

As already mentioned, in conformal prediction the uncertainty of the conformal predictor is quantified by the size of the prediction set. The higher the cardinality of the prediction set, or the width of the prediction interval in the case of regression, the higher the uncertainty. In the following, we make use of this notion of uncertainty to quantify the risk associated with a certain goodwill request being processed in an automated fashion by the prescriptive models.

4.4.1 Quantifying risk

To quantify the risk of wrong assessments in ranking (WARA), we make use of the conformal predictor’s prediction set size \(|\Gamma ^{\epsilon }(\varvec{x})|\), which is either 1, 2 or 3 (or 0 in the case of the empty set):

$$\begin{aligned} \mathcal {R}_{\textrm{WARA}}(\varvec{x})=\dfrac{|\Gamma ^{\epsilon }_\textrm{RA}(\varvec{x})|}{3} \end{aligned}$$

Note that, if the conformal predictor for ranking outputs an empty set \(\Gamma ^{\epsilon }_\textrm{RA}(\varvec{x}) = \emptyset \), we consider this a low-risk query with \(\mathcal {R}_{\textrm{WARA}}(\varvec{x}) = 0\), since the model must abstain from a decision anyway.

The risk of wrong assessments in regression (WARE) is based on the conformal predictor's interval size, normalized by the overall regression range (in our use case from 10% to 90%):

$$\begin{aligned} \mathcal {R}_{\textrm{WARE}}(\varvec{x}) = \min \left( \frac{\max \Gamma ^{\epsilon }_\textrm{RE}(\varvec{x}) - \min \Gamma ^{\epsilon }_\textrm{RE}(\varvec{x})}{80} , 1 \right) \end{aligned}$$

The interval cannot be empty in the same sense, but it can become arbitrarily small.
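
Both risk measures are simple normalizations of the conformal outputs; in code:

```python
def risk_wara(pred_set):
    """Ranking risk: normalized prediction set size. The empty set is
    treated as risk 0, since the model abstains on it anyway."""
    return len(pred_set) / 3.0

def risk_ware(interval):
    """Regression risk: interval width relative to the partial
    contribution range 10..90 (width 80), capped at 1."""
    lo, hi = interval
    return min((hi - lo) / 80.0, 1.0)
```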

4.5 Selective uncertainty-aware automated decision making

To abstain from decisions in cases where the risk is too high, we need to define selection functions for the ranking and regression layer, respectively, as well as corresponding risk thresholds \(\delta _\textrm{rank}\) and \(\delta _\textrm{reg}\). The empty prediction set is treated as an exception and also leads to abstention:

$$\begin{aligned} g_{\delta _\textrm{rank}}(\varvec{x})&:= {\left\{ \begin{array}{ll} 1 &{} \text { if } \, \Gamma ^{\epsilon }_\textrm{RA}(\varvec{x}) \ne \emptyset \\ &{} \wedge \mathcal {R}_{\textrm{WARA}}(\varvec{x}) \le \delta _\textrm{rank} \\ 0 &{} \text { otherwise} \end{array}\right. } \\ g_{\delta _\textrm{reg}}(\varvec{x})&:= {\left\{ \begin{array}{ll} 1 &{} \text { if } \, \mathcal {R}_{\textrm{WARE}}(\varvec{x}) \le \delta _\textrm{reg} \\ 0 &{} \text { otherwise}. \end{array}\right. } \end{aligned}$$

We can now outline the complete uncertainty-aware assessment system \(a(\varvec{x})\) as follows. First, the query \(\varvec{x}\) is processed by the ranking layer \(\hat{h}_\textrm{rank}\). If the selection function \(g_{\delta _\textrm{rank}}(\varvec{x})\) selects the input for decision, the result of \(\hat{h}_\textrm{rank} (\varvec{x})\) is considered valid. In the case of a PARTIAL contribution (\(\hat{h}_\textrm{rank} (\varvec{x})= 2\)), the query is passed on to the regression layer and further processed by the regression model \(\hat{h}_\textrm{reg}\). In any case, if the ranking selection function \(g_{\delta _\textrm{rank}}\) or the regression selection function \(g_{\delta _\textrm{reg}}\) abstains from a decision, the query is forwarded to a manual assessment \(m(\varvec{x})\) by a human expert:

$$\begin{aligned} \begin{aligned}&a(\varvec{x}) = (\hat{h}_\textrm{rank}, g_{\delta _\textrm{rank}}, \hat{h}_\textrm{reg},g_{\delta _\textrm{reg}},m) (\varvec{x}) := \\&\qquad {\left\{ \begin{array}{ll} \hat{h}_\textrm{rank}(\varvec{x}) &{} \text { if} \, g_{\delta _\textrm{rank}}(\varvec{x}) = 1 \\ &{} \wedge (\hat{h}_\textrm{rank}(\varvec{x}) = 1 \vee \hat{h}_\textrm{rank}(\varvec{x}) = 3 ) \\ \hat{h}_\textrm{reg}(\varvec{x}) &{} \text { if} \, g_{\delta _\textrm{rank}}(\varvec{x}) = 1 \wedge \hat{h}_\textrm{rank} (\varvec{x})= 2 \\ &{} \wedge g_{\delta _\textrm{reg}}(\varvec{x}) = 1 \\ m(\varvec{x}) &{} \text { otherwise}. \end{array}\right. } \end{aligned} \end{aligned}$$
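
Putting the pieces together, the assessment system can be sketched as below. Mapping the ranks NO and FULL to contributions of 0% and 100% is our (natural) assumption here, and all function names are illustrative:

```python
def assess(x, h_rank, h_reg, g_rank, g_reg, manual):
    """Uncertainty-aware hierarchical assessment a(x) with reject option."""
    if g_rank(x) == 1:
        rank = h_rank(x)
        if rank == 1:                    # NO contribution
            return 0
        if rank == 3:                    # FULL contribution
            return 100
        if rank == 2 and g_reg(x) == 1:  # PARTIAL: exact percentage
            return h_reg(x)
    return manual(x)                     # abstain: human expert m(x)
```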

4.6 The risk vs. degree of automation trade-off

Given a proper uncertainty quantification, there is an obvious trade-off between risk and degree of automation in decision support systems. The more risk of possibly suboptimal or inappropriate decisions one is willing to take, the higher the degree of automation of the system can be. This trade-off can be formalized in terms of a multi-objective optimization (MO) problem. Essentially, in our use case we seek to maximize the degree of automation while simultaneously minimizing the overall risk of wrong assessments.

In general, an MO problem can be mathematically formulated as follows [15]:

$$\begin{aligned} \begin{aligned} \min \quad&f(x) = \{f_{1}(x),\ldots ,f_{k}(x)\} \\ \text {s.t.} \quad&x \in \Omega \end{aligned} \end{aligned}$$

Usually, the goal is to find a Pareto-optimal solution. A solution \(x^*\in \Omega \) is called Pareto-optimal if there is no other solution \(x\in \Omega \), \(x^* \ne x \), such that \(f_{i}(x) \le f_{i}(x^*)\) for all i and \(f_{j}(x) < f_{j}(x^*)\) for at least one j [15].
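
For a finite set of candidate solutions (as produced, e.g., by the random search in Section 5.4), the non-dominated points can be extracted directly from this definition. A sketch for element-wise minimization:

```python
import numpy as np

def pareto_front(F):
    """Indices of non-dominated rows of F (n_solutions x k objectives,
    all to be minimized): row q dominates row p if q <= p everywhere
    and q < p in at least one objective."""
    F = np.asarray(F, dtype=float)
    front = []
    for i, p in enumerate(F):
        dominated = np.any(np.all(F <= p, axis=1) & np.any(F < p, axis=1))
        if not dominated:
            front.append(i)
    return front
```

Objectives to be maximized, such as the degree of automation, can simply be negated before applying the filter.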

When a Pareto-optimal solution is found, a decision maker (DM) can select the best solution from the Pareto set or front. The DM is supposed to be a domain expert and must be able to select the solution representing the best trade-off for the problem at hand.

Methods for solving MO problems are categorized according to when in the optimization process the DM contributes her expertise in finding the best trade-off. In a priori methods, the DM is asked for her preferences in advance. Her preferences are then taken into account during the optimization process to find a Pareto-optimal solution as close as possible to the specified preferences. In a posteriori methods, an approximation of the whole Pareto set is determined and presented to the DM. The DM can then select the best trade-off. In interactive methods, the DM’s expertise and preferences are integrated into the optimization process and she can iteratively provide feedback.

When looking at our use case, we have four parameters that control the risk and the degree of automation of our assessment system: the threshold risk values (\(\delta _\textrm{rank}\), \(\delta _\textrm{reg}\)) and the conformal predictors' significance levels (\(\epsilon _\textrm{rank}\), \(\epsilon _\textrm{reg}\)):

$$\begin{aligned} \varvec{u} = \begin{pmatrix} \epsilon _\textrm{rank} \\ \delta _\textrm{rank} \\ \epsilon _\textrm{reg}\\ \delta _\textrm{reg} \end{pmatrix} \end{aligned}$$

The three objectives we seek to optimize are the risk for ranking \(\mathcal {R}_{\textrm{WARA}}\) and regression \(\mathcal {R}_{\textrm{WARE}}\), as well as the overall degree of automation (DoA):

$$\begin{aligned} \varvec{v} = \begin{pmatrix} \bar{\mathcal {R}}_{\textrm{WARA}}(\varvec{u}) \\ \bar{\mathcal {R}}_{\textrm{WARE}}(\varvec{u}) \\ \hbox {DoA}(\varvec{u}) \end{pmatrix} \, , \end{aligned}$$

where

$$\begin{aligned} \bar{\mathcal {R}}_{\textrm{WARA}}(\varvec{u})&= \frac{1}{n} \sum _{i=1}^n \mathcal {R}_{\textrm{WARA}}(\varvec{x}_i \, |\, \varvec{u}) \, , \\ \bar{\mathcal {R}}_{\textrm{WARE}}(\varvec{u})&= \frac{1}{n} \sum _{i=1}^n \mathcal {R}_{\textrm{WARE}}(\varvec{x}_i \, |\, \varvec{u}) \, . \end{aligned}$$

Moreover, the DoA is defined as follows:

$$\begin{aligned} \begin{aligned} DoA(\varvec{u})&= {\frac{1}{n}} \sum _{i=1}^{n} \, \llbracket \, g_{\delta _\textrm{rank}}(\varvec{x}_i) = 1 \wedge \hat{h}_\textrm{rank}(\varvec{x}_i) \in \{ 1, 3 \} \, \rrbracket \\&\quad + \llbracket \, g_{\delta _\textrm{rank}}(\varvec{x}_i) = 1 \wedge \hat{h}_\textrm{rank} (\varvec{x}_i)= 2 \wedge g_{\delta _\textrm{reg}}(\varvec{x}_i) = 1 \, \rrbracket \end{aligned} \end{aligned}$$

Formally, our optimization problem can be formulated according to the equation below. The risk values (\(\mathcal {R}_{\textrm{WARA}}\), \(\mathcal {R}_{\textrm{WARE}}\)) are supposed to be minimized, whereas the degree of automation (DoA) is supposed to be maximized. Moreover, all optimization parameters \(\varvec{u}\) are restricted to the interval [0, 1].

$$\begin{aligned} \begin{aligned} \min _{\varvec{u}} \quad&\bar{\mathcal {R}}_{\textrm{WARA}}(\varvec{u}) \\ \min _{\varvec{u}} \quad&\bar{\mathcal {R}}_{\textrm{WARE}}(\varvec{u}) \\ \max _{\varvec{u}} \quad&\hbox {DoA}(\varvec{u}) \\ \text {s.t.} \quad&0 \le \epsilon _\textrm{rank}, \, \delta _\textrm{rank}, \, \epsilon _\textrm{reg}, \, \delta _\textrm{reg} \le 1 \end{aligned} \end{aligned}$$

In the end, our overall goal is to offer the business DM a Pareto set of solutions from which she can choose the best trade-off in terms of risk and degree of automation. Explicating and clearly explaining this trade-off with a set of Pareto-optimal solutions makes the ML system more transparent to business DMs. This may also help to increase trust in the ML system, as the trade-off is known and can be controlled.

5 Evaluation

In the following, we conduct several experiments using our approach as outlined in the previous section and the goodwill data set. We begin with a short description of the data set and some implementation details. Next, we evaluate the coverage and set sizes of our conformal predictors for different significance levels. Subsequently, we identify Pareto-optimal solutions in our objective space (risk, degree of automation, accuracy) using random search. From these Pareto-optimal solutions, a decision maker can then choose a suitable trade-off.

5.1 The goodwill data set

The data set we will use to evaluate our approach is a goodwill data set of a BMW NSC. The features are the data contained in a goodwill request and the labels are the contributions assessed for labor and parts by the human experts. We will not treat the problem as a multi-label classification task, but instead build separate prescriptive conformal predictors for labor and part contributions, respectively. Table 1 summarizes the characteristics of the data set.

Table 1 Characteristics of the goodwill data set

5.2 Implementation

To implement the ranking part of the hierarchical model according to [25], we make use of XGBoost [7] with the cost matrix shown in Table 2. Essentially, this is a neutral cost matrix that does not implement a certain strategy (e.g., customer friendly vs. cost oriented). In the case of partial ranks, the costs equal the absolute error of the regression layer and lie in the interval [0, 80].

Table 2 Cost matrix for the ranking layer
Fig. 3 Overview of the inference architecture

Fig. 4 Coverage and set size plots for several significance levels \(\epsilon \)

To implement the regression layer, as well as the quantile regression models for conformal prediction, we make use of a feed-forward neural network with two dense hidden layers and 512 neurons each. The model is trained for 200 epochs with batch size 32. For quantile regression, we use the pinball loss function and for the regular regression layer the mean absolute error (mae) loss function.
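
The text above does not tie the implementation to a particular framework or activation function; the following sketch assumes Keras/TensorFlow and ReLU activations, and shows the pinball loss for a target quantile \(\tau \):

```python
import tensorflow as tf

def pinball_loss(tau):
    """Pinball (quantile) loss for the target quantile tau."""
    def loss(y_true, y_pred):
        y_true = tf.cast(tf.reshape(y_true, tf.shape(y_pred)), y_pred.dtype)
        err = y_true - y_pred
        return tf.reduce_mean(tf.maximum(tau * err, (tau - 1.0) * err))
    return loss

def build_quantile_net(n_features, tau):
    """Feed-forward net with two dense hidden layers of 512 neurons each."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss=pinball_loss(tau))
    return model

# For significance level eps, fit two nets with tau = eps / 2 and
# tau = 1 - eps / 2, each via model.fit(X, y, epochs=200, batch_size=32).
```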

Figure 3 depicts our conformal inference architecture in detail. It consists of three layers:

  1. The point prediction layer contains the hierarchical goodwill assessment model already outlined in [13]. It outputs point predictions for goodwill requests without any uncertainty awareness.

  2. The conformal prediction layer enhances the point prediction layer with inductive conformal predictors for the ranking and regression layers.

  3. The risk assessment layer utilizes the prediction set and interval sizes output by the conformal prediction layer to quantify the risk associated with a request and either forwards the request to a human assessment or takes over the point prediction result as the result of the assessment.

5.3 Evaluation of conformal prediction

First, we evaluate our conformal prediction implementation on the goodwill data set of the NSC using ten-fold cross-validation for several significance levels \(\epsilon \in \{0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1,0.05,0.03,0.02,0.01\}\). During each iteration, we use approximately 690 examples (5% of the training data) for calibration. The following plots display the mean and the 95% confidence interval over the 10 folds.

Figure 4 shows the prediction set and interval sizes as well as the coverage of the ranking and regression layers for parts and labor contributions. As expected, smaller significance levels \(\epsilon \) lead to higher coverage and also to larger prediction sets and intervals. The coverage of a conformal predictor's prediction set (or prediction interval in the case of regression) can be calculated as follows:

$$\begin{aligned} C = \frac{1}{n} \sum _{i=1}^{n} \mathbb {1} \{y_{i} \in \Gamma ^{\epsilon }(\varvec{x}_i)\} \end{aligned}$$

The mean coverage \(\bar{C}\) over the ten folds should center around \(1-\epsilon \), which is the case for ranking, e.g., \(\epsilon =0.2, \bar{C}=0.78\) or \(\epsilon =0.7, \bar{C}=0.275\). This is a good indicator of a correct implementation of conformal prediction. For regression, the coverage plot is not as accurate as for ranking, but it also displays a consistent coverage increase for smaller significance levels. In addition, the prediction set and interval sizes stay small for a long time and only increase steeply for very small significance levels \(\epsilon \le 0.1\), which also underlines the accuracy of the conformal predictor and the quality of the score functions. The average prediction set size for ranking is calculated as follows:

$$\begin{aligned} S = \frac{1}{n} \sum _{i=1}^{n} |\Gamma ^{\epsilon }(\varvec{x}_i)|. \end{aligned}$$

In the case of regression, the spread of the interval is taken as the interval size:

$$\begin{aligned} S = \frac{1}{n} \sum _{i=1}^{n} \max \Gamma ^{\epsilon }(\varvec{x}_i) - \min \Gamma ^{\epsilon }(\varvec{x}_i) \, . \end{aligned}$$
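
Both evaluation metrics are straightforward to compute on the test set; a minimal sketch:

```python
import numpy as np

def empirical_coverage(pred_sets, y_test):
    """Fraction of test labels contained in their prediction set; for
    intervals, replace the membership test by lo <= y <= hi."""
    return float(np.mean([y in s for s, y in zip(pred_sets, y_test)]))

def mean_set_size(pred_sets):
    """Average cardinality of the prediction sets."""
    return float(np.mean([len(s) for s in pred_sets]))
```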

5.4 Evaluation of selective uncertainty-aware Pareto optimization

In order to identify good a posteriori trade-offs for our objectives, we perform a simple random search limited to 1000 iterations. Table 3 shows the design space used for randomly exploring the objective space. The values are drawn from a uniform distribution.
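
The random search itself is a few lines of code; evaluate is a hypothetical function that trains and calibrates the pipeline for a given parameter vector \(\varvec{u}\) and returns the objective values:

```python
import numpy as np

def random_search(evaluate, n_trials=1000, seed=0):
    """Sample u = (eps_rank, delta_rank, eps_reg, delta_reg) uniformly
    from [0, 1]^4 and record the resulting objective vectors."""
    rng = np.random.default_rng(seed)
    trials = []
    for _ in range(n_trials):
        u = rng.uniform(0.0, 1.0, size=4)
        trials.append((u, evaluate(u)))  # e.g., (R_WARA, R_WARE, -DoA)
    return trials
```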

Table 3 Design space for randomly exploring the objective space (risk, accuracy, degree of automation)
Fig. 5 Trade-offs between risk and degree of automation (DoA)

Fig. 6 Trade-off between accuracy and degree of automation (DoA) for the ranking layer

Fig. 7 Trade-off between mean absolute error (MAE) and degree of automation (DoA) for the regression layer

Fig. 8 Overall trade-off between accuracy and degree of automation (DoA)

In each random search trial, we train the hierarchical model using the training data set (13,164 examples), then calibrate our conformal predictors with the calibration data set (693 examples), and evaluate our model's conformal and point predictions on the test set (1540 examples). Next, we determine the non-dominated points in the explored objective space, forming the Pareto front of our multi-objective optimization problem. We first look at the risk vs. degree of automation trade-off for ranking and regression in Fig. 5, also known as the risk-coverage trade-off. The degree of automation achievable in the ranking layer increases linearly with increasing risk; requests whose risk values exceed the given risk thresholds are rejected and not considered for automatic processing. A similar behavior is visible for the regression layer when looking at the Pareto set for the regression risk vs. degree of automation trade-off (cf. Fig. 5). However, the regression risk does not increase at a constant rate: it first increases moderately and then shoots up for higher degrees of automation. Nevertheless, higher risk goes hand in hand with a higher degree of automation for both layers.

Since our calculated risk values based on conformal prediction outputs are rather abstract, we also look at the accuracy vs. degree of automation trade-off for the ranking layer in Fig. 6. As a baseline, we also show the overall accuracy of our ranking layer, which is 92.7% for labor and 90.97% for parts contributions, respectively. The shown plots are very similar to accuracy-rejection curves [27], but instead of plotting the percentage of rejected queries, we plot the share of selected (i.e., processed) queries, which constitutes the degree of automation of the system. The accuracy of the ranking layer is monotonically decreasing for increasing degrees of automation, which indicates that, by virtue of our conformal ranking predictor, the ranking layer is capable of quantifying its uncertainty well.

A similar behavior can be seen for the mean absolute error (MAE) vs. degree of automation trade-off in the regression layer (cf. Fig. 7). For increasing degrees of automation, the mean absolute error is monotonically increasing, which also underpins the capability of the regression layer to quantify its uncertainty well; abstaining randomly would, in contrast, lead to a flat curve. The overall MAE of 5.49 for labor and 6.67 for parts in the regression layer can easily be undercut by reducing the degree of automation.

Figure 8 shows the overall accuracy vs. degree of automation trade-off of the hierarchical model as a whole, including the ranking layer as well as the regression layer. An accuracy of 100% is achievable with a degree of automation of 20%, which is, however, not a practically useful scenario. A degree of automation of 70% might be a good trade-off and leads to an accuracy of 98% for labor and parts, respectively, on the test data. In general, we can also see a clear monotonic decrease of the overall accuracy with increasing degree of automation, which confirms the uncertainty quantification capability of the overall hierarchical model as well. Looking at this trade-off, a business decision maker can select a practically reasonable solution. Whether the degree of automation outweighs high accuracy requirements very much depends on the use case. As goodwill assessment is a process entailing financial risk, very high accuracy is definitely an important requirement. Since there is a human assessment process in place anyway, the degree of automation is presumably a less important criterion than accuracy.

Table 4 Some selected accuracy vs. degree of automation trade-off values including the corresponding design space values for labor
Table 5 Some selected accuracy vs. degree of automation trade-off values including the corresponding design space values for parts

Tables 4 and 5 show some selected accuracy vs. degree of automation trade-off values for labor and parts, respectively, including the corresponding design space values.

Fig. 9 Accuracy and degree of automation plots for \(\delta _\textrm{rank}=\frac{1}{3}\) and \(\delta _\textrm{reg}=\frac{10}{80}\) depending on \(\epsilon \)

Fig. 10 Abstention types of the conformal hierarchical predictor depending on \(\epsilon \)

Fig. 11 Abstention ranks of the conformal hierarchical predictor depending on \(\epsilon \)

Fig. 12 Processed ranks of the conformal hierarchical predictor depending on \(\epsilon \)

5.5 The effect of the significance level \(\epsilon \)

In the following, we study the effect of the significance level \(\epsilon \) on the achievable prescription accuracy and degree of automation. This can be done by fixing the risk thresholds for the ranking as well as the regression layer. A reasonable threshold for ranking might be \(\delta _\textrm{rank}=\frac{1}{3}\), which essentially means that we only automate decisions for which the conformal ranking predictor outputs a singleton prediction set, i.e., is certain about the result. For regression, we might want to tolerate a risk of \(\delta _\textrm{reg}=\frac{10}{80}\), corresponding to an interval spread of 10 contribution percentage points; otherwise we do not trust the result and want the case to be processed manually. Please note that these thresholds are exemplary and not universally applicable. They are specific to the problem of goodwill assessment and the proposed hierarchical model structure. In general, defining an optimal risk threshold is a task of its own which must also take the context of the application into account [38], as even an optimal risk-averse threshold does not reliably go in a particular direction [19]. In the case of goodwill assessment, the risk-averse decision maker [37] may also not want to miss out on reduced costs through automation and may take these into account when defining risk thresholds.

Figure 9 shows 10-fold cross-validated mean plots for the conformal predictor's accuracy and the overall degree of automation depending on the significance level \(\epsilon \in \{0.9,0.8, \ldots ,0.1,0.05,0.03,0.02,0.01\}\). As a baseline, we again show the overall accuracy (ACC) of the hierarchical model as a whole over all test data. One can see that with decreasing \(\epsilon \) the degree of automation (DoA) increases, whereas the accuracy decreases (ACC CP). So if accuracy is important, \(\epsilon \) values need to be rather large; if the degree of automation is important, \(\epsilon \) values need to be rather low. At a certain \(\epsilon \) value, accuracy and degree of automation drop off steeply, since the prediction sets and intervals become too large and exceed the predefined risk thresholds, which makes the model abstain from deciding requests completely.

Figure 10 displays the corresponding reasons for abstention depending on the significance level \(\epsilon \). For larger \(\epsilon \) values, abstentions are exclusively caused by empty sets. In that case, few candidate outcomes fall below the critical value \(\hat{q}\): for \(\epsilon =0.9\), for instance, \(\hat{q}\) is the empirical \(1-\epsilon =0.1\) quantile of the calibration scores, so only candidates among the 10% lowest scores are considered valid results. With decreasing \(\epsilon \) there are fewer and fewer empty prediction sets, until the sets grow so large that abstentions are solely due to risk assessments. In the end, for \(\epsilon \le 0.03\), the conformal predictors only output non-singleton prediction sets, which leads to complete abstention in our case due to our strict thresholds.

Figure 11 breaks down the abstentions by contribution type (no, partial, or full contribution). Abstentions for all types of contributions strictly decrease for decreasing \(\epsilon \) values until the sets become too large, leading to complete abstention due to violation of the risk threshold. It is noticeable that, for labor as well as for parts, the NO abstentions drop off more steeply in the beginning. One may speculate that the NO contributions have the smallest scores and are therefore overrepresented below the score quantiles \(\hat{q}\) for large significance levels, e.g., \(\epsilon > 0.6\). Moreover, given this observation, at least a part of the NO contribution assessments seems to be quite certain or obvious.

Figure 12 breaks down the processed contributions by contribution type (no, partial, or full contribution). Processed here means that the predictor did not abstain from answering the particular request. As with abstentions, it is visible that NO contributions are processed preferentially; the NO contribution scores seem to be overrepresented in the lower quantiles. Nevertheless, the number of processed requests strictly increases for all contribution types with decreasing \(\epsilon \) until complete risk abstention sets in.

Since the abstentions and processed contributions are not balanced, one could argue for using class-balanced conformal prediction [1] instead, where scores and quantiles are determined per class. However, given the use case at hand, there is not necessarily a need for class-balanced coverage. If manual work reduction is the main goal of introducing ML into the process, this coverage imbalance might have no negative impact at all, since there is no known difference in effort between the assessments of the different contribution types. It could even be considered beneficial that NO contributions are the most certain ones to be assessed automatically, as they also entail the least financial risk.

6 Conclusion and future work

We developed and evaluated an uncertainty-aware approach for automated decision making, in which conformal prediction is used to quantify the risk associated with ML prescriptions. As a use case, we looked at automated decision making for goodwill assessments in the automotive domain, using a goodwill data set of a car manufacturer. Instead of providing mathematical guarantees for limited risk, we emphasized the trade-off between risk and degree of automation, and showed how an a posteriori set of Pareto-optimal solutions can be explored by a business decision maker to select the best trade-off for the particular business use case at hand.

To underpin the capability of conformal predictors to properly quantify uncertainty, we presented risk-coverage plots and accuracy-rejection curves. We also analyzed CP's significance level parameter \(\epsilon \) and how it affects the number of empty prediction sets as well as the achievable accuracy and degree of automation of the system. Concretely, by abstaining from answering the 30% most risky or uncertain queries, our hierarchical predictor is capable of increasing its overall accuracy from 92% to 98% for labor and from 90% to 98% for parts contributions, respectively.

Achieving even higher accuracies is presumably not very reasonable, as this comes at a significant loss in the degree of automation. Additionally, human decisions cannot be considered a consistent gold standard and might be biased in one direction or another. A certain amount of aleatoric uncertainty is presumably irreducible in a human decision process and will remain. Nevertheless, the number of wrongly prescribed contributions can be significantly reduced with our selective uncertainty-aware approach, which makes the introduction of ML in high-stakes environments more feasible.

Proceeding from this well-working uncertainty-aware approach to automated decision making, we plan to address three major challenges in the future:

  1. Explainability: Making machine learning based goodwill prescriptions more accessible and transparent to IT and business decision makers is, in our eyes, of utmost importance to foster trust in the system, but also to fulfill internal revision audit requirements. We consider decision explanations equally important for both scenarios in which the machine learning models are supposed to be used (automated decision making (ADM) and decision support systems (DSS)). Therefore, we plan to investigate and satisfy the different explanation needs of our stakeholders using Explainable Artificial Intelligence (XAI) methods [5, 16, 26].

  2. Human-AI interaction: How human experts are influenced by AI assisting their work or taking over some of their workload is another interesting and important aspect that needs to be followed up [4]. Overconfidence in the decision model by human experts and decision makers, also known as automation bias [24], as well as undue reluctance, also known as algorithm aversion [10], are issues to be evaluated and calibrated properly. Whether XAI can help in this trust calibration process, by making the reasoning process of machine learning models more transparent, is still an active area of research [21, 28, 35]. Moreover, there is also a recent line of research particularly focusing on the effect of providing set-valued predictions to human-AI teams instead of single predictions [2, 6].

  3. Weak supervision: As already mentioned, human goodwill decisions cannot necessarily be taken as a gold standard. The data may contain concept drift and shift due to strategy changes in the assessment process over time, or other human-induced biases leading to noisy labels. Hence, past decisions should be considered and modeled as weak information about the target rather than an incontestable ground truth, suggesting the use of methods for weakly supervised learning [14, 39].