1 Introduction

Predictive analytics supports decision-making by exploiting the patterns present in historical data to obtain insights about future states. Machine learning techniques play a crucial role, as they facilitate the estimation of the likelihood of an outcome of interest. However, a key concern in real-world applications lies in foreseeing the effects of different actions on an outcome variable. This task is performed by uplift modeling techniques and allows decision-makers to prescribe the course of action that maximizes a given objective at the individual level. Hence, uplift modeling is a type of prescriptive analytics (Bertsimas and Kallus 2019).

The identification of the most favorable action (hereafter referred to as treatment) for an individual corresponds to estimating the effect that a decision variable (e.g., treatment) has on an outcome variable (e.g., response). This effect is known in the causal literature as the individual treatment effect (ITE) and frames uplift modeling as a causal inference task. The potential outcomes framework (Rubin 1974) defines the ITE as the difference between potential outcomes of distinct treatment alternatives. From a machine learning perspective, this consists of contrasting the predicted values of the outcome variable for each of the treatments at the individual level.

Since making causal inferences is tied to a treatment applied to an individual, uplift modeling is functional in cases where a decision-maker has control over a variable whose manipulation is expected to cause a behavioral change. For instance, marketers launch campaigns that maximize the intentions of customers to buy particular products (Gubela et al. 2017).

Uplift modeling can be implemented in different domains. However, the most common applications are found in the fields of marketing (Lo 2002; Hansotia and Rukstales 2002; Guelman et al. 2012, 2014a, b; Kane et al. 2014; Guelman et al. 2015; Gross and Tibshirani 2016; Michel et al. 2017; Gubela et al. 2017) and personalized medicine (Alemi et al. 2009; Jaskowski and Jaroszewicz 2012). In particular, uplift modeling has helped marketers to increase the return on marketing investment by segmenting the customer base into four categories according to the recommendations of the model. Customers who respond favorably because of the campaign are categorized as \({\textit{persuadables}}\). On the other hand, the \({\textit{do}}\)-\({\textit{not}}\)-\({\textit{disturb}}\) segment includes customers adversely affected by the campaign: they do not respond when contacted, whereas they would have responded had they not been contacted. The customers in the third and fourth categories either never respond to any offer—the \({\textit{lost}} \ {\textit{causes}}\)—or always respond regardless of the offer—the \({\textit{sure}} \ {\textit{things}}\). The interest lies in targeting the \({\textit{persuadables}}\) and avoiding the other segments.

The literature on uplift modeling is primarily focused on the estimation of a single treatment effect. Studies that generalize the binary treatment effect framework to applications where the effects of multiple treatment alternatives are estimated are scattered and limited in number. Hence, there is at best a vague understanding of which multitreatment uplift techniques are available and limited empirical evidence regarding the uses and performances of these methods.

This study contributes to the state-of-the-art in the field of uplift modeling in three ways: (1) it provides an exhaustive survey of the literature on multitreatment uplift modeling and introduces a framework to classify multitreatment uplift modeling methods, (2) it proposes two novel multitreatment uplift modeling methods, and (3) it presents the results of an extensive benchmarking study, which provides ample empirical evidence of the performances of thirteen multitreatment uplift modeling methods across eight multitreatment uplift data sets. The experiments are performed on data sets from diverse domains such as marketing, politics, personalized medicine and human resources. The Qini metric and the expected response are used to evaluate the performances of the models.

Additionally, uplift studies where selection bias is tested and controlled are uncommon. Therefore, we verify and, if needed, correct for the imbalance among the pretreatment characteristics of the treatment groups to ensure a correct interpretation of the estimated uplift.

The remainder of this paper is structured as follows. Section 2 introduces the fundamentals of uplift modeling, provides an overview of current approaches for estimating uplift in a multitreatment scenario, and presents two novel methods. Section 3 discusses the evaluation of multitreatment uplift models. Next, the experimental design is described in Sect. 4, and empirical results are discussed in Sect. 5. Finally, Sect. 6 concludes and provides directions for future research.

2 Uplift modeling

This section starts with a general definition of uplift modeling and a description of the single treatment and multitreatment scenarios. Next, we provide an overview of current uplift modeling techniques and propose two novel methods.

2.1 Definition

Uplift modeling is a machine learning approach that employs Rubin’s causal inference framework (Rubin 1974) to estimate the ITE of (a) treatment(s) on an outcome of interest. The ITE estimation requires three elements to be present in the data: a set of variables representing the pretreatment characteristics of individuals, X, a decision variable indicating the exposure to a treatment, T, and the corresponding outcome, Y.

In a binary treatment assignment, \(Y_i(T=1)\) and \(Y_i(T=0)\) correspond to the potential outcomes (i.e., the future state of the outcome) of an individual when she/he receives treatment and nontreatment, respectively. Then, the ITE of treatment against nontreatment on Y for individual i, \(\tau _i\), is \(Y_i(T=1) - Y_i(T=0)\). If the result of the subtraction is a nonzero value, it can be inferred that the treatment exerts an impact on the outcome for that particular individual. In uplift modeling, the potential outcomes are estimated by machine learning algorithms as conditional probabilities whose difference is used to determine the effect of the treatments. The multitreatment scenario is a generalization of Rubin’s framework to applications where the decision variable can assume more than two values. Examples include the situations in which policy makers have to decide among various assistance programs or when marketers have to choose among different channels to reach out to customers.

Causal discovery infers causal structures from data with respect to interventions (Peters et al. 2017). The focus of uplift modeling, on the other hand, lies in customizing the treatment assignments. The aim is to target individuals on whom the treatment will have the largest positive effect according to the predictions of the model. An analogous approach to uplift modeling is the estimation of heterogeneous treatment effects (Zhao and Harinen 2019). A large portion of this literature employs machine learning methods to estimate the conditional average treatment effect (CATE). The motivation behind the understanding of treatment effect heterogeneity is that the CATE can be used to select the optimal treatment rule, since it considers the treatment effectiveness to vary with the characteristics of individuals. Applications in the binary treatment case include those of Kallus (2017), Athey and Wager (2017), Kallus and Zhou (2018) and Athey and Imbens (2019).

To the best of our knowledge, the multitreatment setting has only been addressed by Imai et al. (2013) and Zhou et al. (2018). In contrast to uplift modeling, these methods serve to formulate treatment rules conditioned on individual characteristics, thus prioritizing the estimation of causal effects and statistical inference rather than predictive power.

2.1.1 Binary model

Binary treatment uplift modeling is formally introduced by Radcliffe and Surry (1999) as a technique to predict the incremental effects of marketing activities. The difference between uplift modeling and response modeling is that the latter uses predictive models to estimate the likelihood of a favorable outcome. The former, however, predicts how much the outcome will vary when the individual is exposed to a treatment.

$$\begin{aligned} {\hat{\tau }}_{i,1}(x_i,T) := {\hat{P}}\big (Y_i = 1 | x_i, \ do(T=1)\big ) - {\hat{P}}\big (Y_i = 1 | x_i, \ do(T=0)\big ) \end{aligned}$$
(1)

Assuming a binary outcome variable \(Y \in \{0,1\}\), Eq. 1 defines the predicted individual uplift for \(T=1\) (\({\hat{\tau }}_{i,1}\)) as a function of the individual’s pretreatment characteristics X and the two treatment alternatives \(T = \{0,1\}\). This definition integrates the \(do(\cdot )\) operator to indicate that the observed change in the probability of the outcome is due to the treatment itself and not to the presence of confounders (Pearl 2009). A fundamental assumption is that individuals are somehow sensitive to the given treatment (Guelman et al. 2014b). The contrast between the two groups allows the identification of the individuals who are most likely to have a favorable outcome when treated. This makes uplift modeling an appropriate tool for customizing treatment assignment and prescribing the course of action that maximizes a given objective.

2.1.2 Multitreatment model

A binary uplift model can be extended to applications where the interest lies in evaluating the ITE of a diverse set of treatments. This corresponds to real-world scenarios where decision-makers must choose between multiple treatment alternatives in order to optimize the performance of treatments and to personalize the experience of users. Examples of such decisions include identifying the product design, communication channel, or promotion that is most appealing to a customer, selecting the most favorable medical treatment option for a patient, or choosing the assistance program with the largest benefit for a vulnerable individual.

Multitreatment uplift modeling (MTUM) requires a set \(T = \{0, 1, \ldots , k\}\) of mutually exclusive treatments, a collection of observed pretreatment characteristics X, and a binary outcome variable \(Y \in \{0,1\}\). Similarly to binary treatment uplift models, the aim of MTUM is to find the treatment whose effect on the outcome is the most favorable from a larger set of treatment alternatives. The machine learning task consists in estimating the conditional probabilities of a positive outcome for each individual, given the pretreatment characteristics and the exposure to the treatments. Later, these estimates are contrasted to identify the treatment whose ITE is the largest.

MTUM takes into account two different contrasts: multiple treatment groups without a control group and multiple treatment groups with a control group. The former consists of \(\binom{k}{2}\) simultaneous pairwise comparisons and seeks to identify the best rank order for each individual. The latter compares each treatment alternative against a control group and aims to determine the optimal action for each individual (Zhao and Harinen 2019). To maintain similarity with the current MTUM literature, this study applies to scenarios with multiple treatment groups, including a control group. For example, a government agency wants to send personalized letters to motivate individuals to vote by: (1) sending a letter with the message “Do your civic duty” (treatment 1), (2) sending a letter with the message “You are being studied” (treatment 2), or (3) not sending a letter at all (control group). The goal of MTUM is then to identify whether a letter should be sent to each individual and, if so, which type of message it should contain.

Formally, the optimal treatment (\(\pi _{i,k}^{*}\)) for individual i is the treatment for which the uplift \({\hat{\tau }}_{i,k}\) is the largest,

$$\begin{aligned} \pi _{i,k}^{*} = argmax({\hat{\tau }}_{i,1},\ldots ,{\hat{\tau }}_{i,k}). \end{aligned}$$
(2)

This is obtained after estimating the differences in the probabilities of a positive outcome between the treatments under evaluation and the control group (\(T=0\)) at the individual level, as shown in Eq. 1.
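To make Eqs. 1 and 2 concrete, the following sketch (all names are illustrative and not part of the methods described in this paper) selects the optimal treatment from a matrix of previously estimated uplift scores, one column per non-control treatment:

```python
import numpy as np

def optimal_treatment(tau_hat, treatment_ids):
    """Column-wise argmax of Eq. 2: the treatment with the largest
    estimated uplift is selected for each individual."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    return np.asarray(treatment_ids)[tau_hat.argmax(axis=1)]

# Two individuals scored against treatments T=1 and T=2 (toy values)
tau = [[0.05, 0.12],    # treatment 2 has the larger uplift
       [0.03, -0.01]]   # treatment 1 has the larger uplift
print(optimal_treatment(tau, [1, 2]))   # -> [2 1]
```

In practice, a decision-maker may additionally withhold treatment whenever all estimated uplifts for an individual are negative.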

2.2 Survey of multitreatment uplift modeling approaches

The MTUM literature is still limited. This study categorizes the different MTUM approaches according to the classification proposed by Devriendt et al. (2018) for binary uplift models. The authors distinguish two main methods to obtain uplift estimates: the data preprocessing approach and the data processing approach. The former learns an uplift model by means of conventional machine learning algorithms by redefining the original outcome variable or by modifying the input space before training. The data processing approach comprises methods wherein standard machine learning algorithms are trained separately, or their internal structures are adapted to the multitreatment case. Table 1 provides an overview of the modeling strategies that are surveyed in this study. In particular, the naive uplift approach and the multitreatment modified outcome approach are our contributions to the current uplift literature. These methods are introduced in Sect. 2.3.

Table 1 A summary of MTUM approaches

The dummy and interactions approach (DIA) is the only data preprocessing approach that has been proposed within the multitreatment uplift literature. This method extends the input space by adding treatment indicators encoded as dummies \(D = \{0,\ldots ,k \}\) and interaction terms. The latter capture the interplay between the dummies and the pretreatment characteristics. Uplift is then modeled by means of any machine learning algorithm that receives as input the pretreatment characteristics X, the dummy variables D, and the interaction terms \(D \times X\), so that \(P(Y=1 | X, do(T)) = f(X,D,D \times X)\).
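A minimal sketch of this construction is given below, assuming pandas is available and that X is a numeric DataFrame of pretreatment variables; the helper name and column-naming scheme are purely illustrative:

```python
import numpy as np
import pandas as pd

def dia_design_matrix(X, t, n_treatments):
    """Extend the input space with treatment dummies and interaction terms.

    X: DataFrame of pretreatment characteristics.
    t: assigned treatments, coded 0 (control), 1, ..., n_treatments.
    Returns the design matrix [X | D | D x X] used by the DIA.
    """
    X = X.reset_index(drop=True)
    parts = [X]
    for k in range(1, n_treatments + 1):
        d = (np.asarray(t) == k).astype(int)
        parts.append(pd.Series(d, name=f"T{k}"))   # treatment dummy
        inter = X.mul(d, axis=0)                   # D x X interaction terms
        inter.columns = [f"T{k}:{c}" for c in X.columns]
        parts.append(inter)
    return pd.concat(parts, axis=1)
```

At scoring time, the same matrix is rebuilt with the dummies fixed to each candidate treatment in turn, and the resulting predictions are contrasted as in Eq. 2.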

Lo (2002) and Tian et al. (2014) implement the DIA for binary treatment uplift models, and Chen et al. (2015) for the MTUM case. The Personalized Revenue Maximization (PRM) algorithm proposed by the latter authors is particularly discussed in the context of customized pricing and personalized assortment optimization. The inputs the algorithm uses are the vector of individual characteristics, the assigned treatment (e.g., price offered), the interaction terms and the outcomes. The optimization problem lies in minimizing the gap between the predicted expected revenue according to the optimal treatment assignment and the expected revenue obtained with complete knowledge of the parameters that specify customer behavior. The results of the customized pricing for airline priority seating show that the separate model approach (SMA, discussed below) using a random forest algorithm slightly outperforms the PRM method for all data sizes.

The DIA is a simple approach, since conventional algorithms do not need to be modified and the outcome variable does not necessarily have to be binary. However, the enlargement of the input space can cause overfitting and multicollinearity problems when the number of interaction terms is considerably large (Kane et al. 2014).

Most studies addressing the MTUM case can be categorized within the data processing approach. This implies that the uplift is modeled in either an indirect or a direct way. Modeling uplift indirectly corresponds to a strategy in which training cases are grouped according to the treatment that they received. Later, a model is trained for each group. By contrast, a direct uplift estimation trains a single model by employing multitreatment uplift algorithms.

Estimating uplift indirectly is also known as the separate model approach (SMA). This is the baseline technique and was initially proposed to train binary uplift models. Later, it was extended to multitreatment applications due to its simplicity. It employs standard machine learning algorithms to train separate predictive models for each treatment group. Afterwards, the models are used to compute the \({\hat{P}}(Y = 1 | X, do(T=k))\) for each test case, so that the optimal treatment is the one for which the largest difference is obtained (see Eq. 2).
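A minimal sketch of the SMA is given below; the use of scikit-learn random forests as base learners and the function names are illustrative assumptions rather than the exact setup of any of the surveyed studies:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_sma(X, y, t):
    """Separate model approach: one standard classifier per treatment group."""
    t = np.asarray(t)
    return {k: RandomForestClassifier(n_estimators=500, random_state=0)
                 .fit(X[t == k], np.asarray(y)[t == k])
            for k in np.unique(t)}

def sma_uplift(models, X_new):
    """Uplift of each non-control treatment against the control group (T = 0)."""
    p0 = models[0].predict_proba(X_new)[:, 1]
    return {k: m.predict_proba(X_new)[:, 1] - p0
            for k, m in models.items() if k != 0}
```

The per-treatment uplift estimates returned by the second helper are then contrasted as in Eq. 2 to select the optimal treatment for each test case.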

Lo and Pachamanova (2015) demonstrate the estimation of the ITE in a multitreatment scenario by applying the SMA, due to its simplicity and general acceptance as a baseline method. The authors present a framework that formulates the MTUM task as an optimization problem and considers the level of risk aversion of the modeler. An application is presented in which separate logistic regressions are trained to estimate the \({\hat{\tau }}_{i,k}\). Later, these estimates are used as input variables to determine the cluster level uplift of each treatment. Treatments are then allocated by considering the estimated uplift scores and the variability among estimates.

There are two main disadvantages in applying the SMA. First, training several models increases computational costs. Second, the modeling objective of the different predictive models does not correspond to estimating the uplift. Each model learns the likelihood of a positive outcome, rather than the what-if difference in behavior (Radcliffe and Surry 2011). Nonetheless, Rudaś and Jaroszewicz (2018) demonstrate that the SMA performs competitively for uplift regression when the sample size is sufficiently large and highly correlated variables are removed.

Modified machine learning algorithms are proposed in the MTUM literature to improve the accuracy of the uplift estimate and offset the main drawbacks of the methods mentioned above. In this regard, Alemi et al. (2009) and Guelman (2015) proposed to adapt the K-nearest neighbor classifier (Cover and Hart 1967) to infer the optimal treatment based on the treatment that has worked the best for individuals who are similar to the test case. In personalized medicine, the Sequential K-Nearest Neighbor Analysis (SKNN) (Alemi et al. 2009) sequentially examines the K most similar individuals until the success or failure of the treatment is determined to be statistically significant. Likewise, the Causal K-Nearest-Neighbor (CKNN) approach (Guelman 2015) predicts the optimal treatment for a given individual by weighting the evidence of similar individuals more strongly. This approach is computationally expensive, since all of the training data must be stored to score test cases.

The splitting criterion and pruning method of the most common decision tree classifiers, such as the classification and regression trees (CART) (Breiman et al. 1984), chi-square automatic interaction detection (CHAID) (Kass 1980), and C4.5 (Quinlan 1993), can be adjusted for MTUM. Rzepakowski and Jaroszewicz (2012) propose a splitting criterion that compares the probability distributions of treatment groups by using divergence measures from the information theory literature: the Kullback–Leibler (KL) divergence, the squared Euclidean distance (ED) and the chi-squared divergence. Pruning is based on the maximum class probability approach. The measure of divergence for multiple distributions allows the modeler to determine the relative importance assigned to the dissimilarity between all of the treatments and the control, and to the dissimilarity between the treatments themselves.

Adjustments to the splitting criterion and termination rules of the random forest algorithm (Breiman 2001) are suggested to counteract the instability of a single decision tree. The Contextual Treatment Selection (CTS) algorithm (Zhao et al. 2017b) is a forest of randomized trees whose splitting criterion directly maximizes a measure of performance: the expected response. This ensures that the split with the largest increase in expectation is performed at each step. The Unbiased Contextual Treatment Selection (UCTS) algorithm (Zhao et al. 2017a) eliminates the estimation bias present in the CTS by randomly splitting the training set into an approximation set that generates the tree structure and an estimation set that estimates the leaf response. According to the authors’ findings, the UCTS proves to be more competitive in terms of performance for some data sets compared to the CTS.

Li et al. (2018) propose a reinforcement learning application that relates MTUM with an offline contextual bandit problem. Since the objective of offline contextual bandits is to maximize the expected response to an action instead of maximizing the expected uplift, the authors formulate the uplift modeling task as a Markov Decision Process (MDP). This is solved by using Sutton et al. (2000)’s neuralized policy gradient method. In addition, Sawant et al. (2018) use counterfactual matching as part of the data collection and incorporate contextual Bayesian multiarmed bandits to optimize causal treatment effects.

Last, the cost difference of applying treatments in MTUM is incorporated by Zhao and Harinen (2019). The authors adapt the X-Learner (Künzel et al. 2019) and the R-Learner (Nie and Wager 2017) meta-learners to the multitreatment uplift setting and propose a net-value optimization framework to consider the cost of each treatment.

2.3 Proposed methods

This section presents the two methods proposed in this article to estimate uplift in multitreatment applications. First, the MTUM task is transformed into a multiclass prediction problem that can be solved by conventional machine learning algorithms. This is a generalization of the Modified Outcome Variable Approach (MOVA), a conventional method in the binary uplift modeling setting. It considers the information in the data set about the treatment allocated to individuals and their corresponding observed outcome in order to create a new outcome variable. The second method builds separate uplift models employing modified binary uplift algorithms. Each model contrasts the \(T=k\) treatment group against the control group (\(T=0\)).

2.3.1 Multitreatment modified outcome approach (MMOA)

The MOVA is proposed by Kane et al. (2014) and Lai (2006) for the binary treatment case. The aim is to use any standard multiclass classification algorithm to obtain the required predictions to compute the ITE from a single model. Since a data set suitable for uplift modeling contains information regarding the treatments received by individuals and their observed outcomes, we can segment cases into different categories. These will be the labels of the new outcome variable. For example, the new outcome variable consists of four segments of individuals in a binary treatment case: treated responders (\(R_{T=1}\)), control nonresponders (\(NR_{T=0}\)), treated nonresponders (\(NR_{T=1}\)) and control responders (\(R_{T=0}\)). The multiclass algorithm outputs the likelihood of each test case belonging to each of these categories. The intuition behind this approach is that the ITE (\({\hat{\tau }}_{i,1}\)) can be computed as follows:

$$\begin{aligned} {\hat{\tau }}_{i,1} = \Bigg (\frac{{\hat{P}}(R_{T=1}|x_i)}{P_{T=1}} + \frac{{\hat{P}}(NR_{T=0}|x_i)}{P_{T=0}} \Bigg )- \Bigg (\frac{{\hat{P}}(NR_{T=1}|x_i)}{P_{T=1}}+ \frac{{\hat{P}}(R_{T=0}|x_i)}{P_{T=0}}\Bigg ). \end{aligned}$$
(3)

Equation 3 is analogous to Eq. 1, since the left side indicates the individual’s likelihood to have a favorable outcome due to the treatment. Depending on its magnitude, it determines whether an individual should be targeted. Additionally, the prior probabilities of the treatments (\(P_{T=k}\)) are incorporated as weights to counteract the imbalance of treatment groups.

The extension to the multitreatment case is straightforward, since Eq. 3 can be generalized to calculate the \({\hat{\tau }}_{i,k}\) for any number of treatments. In the case of two treatment groups and one control group \(T = \{0,1,2\}\), the new labels of the outcome variable are shown in the third column of Table 2.

Table 2 Modified outcome variable for three treatment groups

A multiclass probabilistic model is trained to later predict for each individual the probabilities of responding positively (\(Y = 1\)) and negatively (\(Y=0\)) to every treatment alternative. The predicted optimal treatment for individual i is \(\pi _{i,k}^{*} =argmax({\hat{\tau }}_{i,1},{\hat{\tau }}_{i,2})\). The \({\hat{\tau }}_{i,1}\) and \({\hat{\tau }}_{i,2}\) are calculated as follows:

$$\begin{aligned} {\hat{\tau }}_{i,1}= & {} \Bigg (\frac{{\hat{P}}(R_{T=1}|x_i)}{P_{T=1}} + \frac{{\hat{P}}(NR_{T=0}|x_i)}{P_{T=0}} \Bigg )- \Bigg (\frac{{\hat{P}}(NR_{T=1}|x_i)}{P_{T=1}}+ \frac{{\hat{P}}(R_{T=0}|x_i)}{P_{T=0}}\Bigg ),\\ {\hat{\tau }}_{i,2}= & {} \Bigg (\frac{{\hat{P}}(R_{T=2}|x_i)}{P_{T=2}} + \frac{{\hat{P}}(NR_{T=0}|x_i)}{P_{T=0}} \Bigg )- \Bigg (\frac{{\hat{P}}(NR_{T=2}|x_i)}{P_{T=2}}+ \frac{{\hat{P}}(R_{T=0}|x_i)}{P_{T=0}}\Bigg ). \end{aligned}$$

The advantage of the MMOA is that the uplift estimation is reduced to a multiclass classification problem, where a wide variety of classifiers can be used. Additionally, this setting allows the implementation of models that are easier to interpret. For instance, favoring simple models facilitates the observation of the influence that the pretreatment characteristics exert on the uplift estimation. Nevertheless, the MMOA can become inefficient as the number of treatments grows, since the number of outcome classes increases accordingly.
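A minimal sketch of the MMOA with a multinomial logistic regression as the multiclass learner (the same base learner used in the experiments of Sect. 4.3); the relabeling follows Table 2, while the helper names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_mmoa(X, y, t):
    """Relabel the outcome as in Table 2 and fit one multiclass model."""
    y, t = np.asarray(y), np.asarray(t)
    z = np.array([("R" if yi == 1 else "NR") + f"_T{ti}" for yi, ti in zip(y, t)])
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    priors = {k: float(np.mean(t == k)) for k in np.unique(t)}   # P(T = k)
    return clf, priors

def mmoa_uplift(clf, priors, X_new, k):
    """Estimate tau_{i,k} of treatment k against the control group (Eq. 3)."""
    proba = clf.predict_proba(X_new)
    col = {c: i for i, c in enumerate(clf.classes_)}
    pk, p0 = priors[k], priors[0]
    return (proba[:, col[f"R_T{k}"]] / pk + proba[:, col["NR_T0"]] / p0) \
         - (proba[:, col[f"NR_T{k}"]] / pk + proba[:, col["R_T0"]] / p0)
```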

2.3.2 Naive uplift approach (NUA)

The binary treatment uplift models presented in the survey by Devriendt et al. (2018) can be extended to indirectly predict the optimal treatments in the MTUM scenario. The NUA is a data processing method in which uplift is estimated indirectly. It trains different binary treatment uplift models separately. Each binary treatment model contrasts a treatment group against the control group and outputs the probabilities that are needed to predict the best treatment for individual i (\(\pi _{i,k}^{*}\)).

In the example with two treatment groups and a control group, we build two separate binary uplift models. One model directly estimates the individual-level probabilities of a positive outcome by contrasting \(T=1\) (treatment 1 group) and \(T=0\) (control group), whereas a second model does the same by comparing \(T=2\) (treatment 2 group) and \(T=0\) (control group). Then, test cases are scored using the two models, and the best treatment is predicted as specified in Eq. 2.

The difference between the NUA and the SMA lies in the number of models to train and the algorithms that can be used. Under the SMA, the individual-level uplift is calculated from the predictions of the models built on each treatment group, a task that can be performed by any standard classifier. In contrast, the NUA takes advantage of existing binary uplift modeling algorithms to train \(k-1\) models, which directly compare the treatments with the control group.
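A minimal sketch of the NUA training scheme is given below. The binary uplift learner interface (a factory whose models expose fit with a treatment indicator and predict_uplift) is an assumption that stands in for any concrete implementation, such as an uplift random forest:

```python
import numpy as np

def fit_nua(X, y, t, make_uplift_model):
    """Naive uplift approach: one binary uplift model per treatment vs. control.

    make_uplift_model is a factory for an (assumed) binary uplift learner
    exposing fit(X, y, treatment) and predict_uplift(X).
    """
    t = np.asarray(t)
    models = {}
    for k in sorted(set(t.tolist()) - {0}):
        mask = (t == 0) | (t == k)              # treatment-k group plus control
        m = make_uplift_model()
        m.fit(X[mask], np.asarray(y)[mask], (t[mask] == k).astype(int))
        models[k] = m
    return models

def nua_optimal_treatment(models, X_new):
    """Score every pairwise uplift model and apply Eq. 2."""
    ks = sorted(models)
    tau = np.column_stack([models[k].predict_uplift(X_new) for k in ks])
    return np.asarray(ks)[tau.argmax(axis=1)]
```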

Fig. 1 Comparison of the training schemes of the SMA and the NUA when three treatment groups are considered. Whereas three separate conventional classifiers are trained under the SMA, the NUA estimates the uplift by employing two binary uplift models

Figure 1 illustrates the difference between the two methodologies in the case of three treatment groups and a binary outcome variable, where \(Y = 1\) represents a positive outcome.

3 Evaluation metrics

In predictive analytics, the model with the lowest prediction error (e.g., error rate or loss function) is typically considered to be the best performing model. In this regard, error refers to the lack of fit between the predicted outcome value and the true outcome value for an individual in the holdout set. However, in the uplift modeling case, such an approach is infeasible because the true effect of the treatments is not observed, as a consequence of the fundamental problem of causal inference (Holland 1986). This makes direct test set evaluation impossible: hence, an error cannot be computed. Suggestions to tackle this problem are proposed in the uplift literature, but none have proven to be optimal. One such suggested approach creates groups of test set individuals similarly ranked by the model and extracts the uplift estimate from their respective true outcomes and observed treatments. A second method computes the expected response given the optimal treatment suggested by the uplift model.

Lo and Pachamanova (2015) and Chen et al. (2015) evaluate uplift models in accordance with the optimization objective of their study. In the risk-aware optimization framework of the former (see Sect. 2.2), the \({\hat{\tau }}_{i,k}\) estimates are used as input variables for cluster analysis to determine the cluster-level uplift for each treatment. The risk/return trade-off is summarized using an \({\textit{efficient}} \ {\textit{frontier}}\) graph, so that the modeler selects the treatment assignment according to her/his risk aversion profile. The second study determines model performance based on the expected revenue that can be achieved by targeting individuals with the suggested optimal treatment. Li et al. (2018) propose the Uplift Modeling General Metric (UMG) and the Self-Normalized Uplift Modeling General Metric (SN-UMG). Their objective is to find a treatment rule that maximizes the expected uplift response under a specific treatment policy by comparing the expected treatment responses with the expected natural responses. The difference between the UMG and the SN-UMG is that the latter reduces the variance by adding standardized weights to the UMG.

In the remainder of this section, we further discuss the uplift evaluation techniques used to compare the results of our experiments.

3.1 Conventional uplift metrics

A conventional uplift methodology assumes that test cases which are similarly scored by a model behave in a similar manner. The performance of a model is then assessed at the level of groups of individuals. First, the estimated uplift score \({\hat{\tau }}_{i,k}\) of the optimal treatment \(\pi _{i,k}^{*}\) is used to rank each individual in the test set in descending order. Later, groups of test cases are formed from the resulting split of the test set in various bins (e.g., deciles). Given that we observe the assigned treatment and the corresponding outcome for each individual, the response rate for each treatment can be calculated within the group. The uplift is then estimated within each group as the difference in response rates. The intuition behind this approach is that a model with an outstanding performance places potential responders at the top of the ranking. Therefore, larger uplifts are expected in the first groups than in the bottom groups (Hansotia and Rukstales 2002). The advantage of this method is that it provides a comprehensive view of model performance and facilitates decision-making (Moro et al. 2011). However, this technique can be misleading when there are large differences in the pretreatment characteristics of test individuals or large imbalances in the size of the treatment groups. In Sect. 4, propensity score matching is proposed to offset these concerns.

Although the evaluation of MTUM following this approach poses some challenges, examples of its implementation are found in Sawant et al. (2018) and Zhao and Harinen (2019). The main difficulty lies in the fact that individuals in the test set are exposed to treatments at random and their predicted optimal treatment does not necessarily match the treatment that is observed. Imai et al. (2013) propose as an alternative the assignment of a pay-off to the test cases whose responses are favorable to the treatments recommended by the model. However, such practice can generate biases in the uplift estimation. For this reason, we adopt the solution suggested by Chen et al. (2015) and Rzepakowski and Jaroszewicz (2012), in which the mismatched test cases are not considered for the evaluation. This naturally leads to a considerable loss of data but assures an unbiased assessment.

The performance of an uplift model can be visualized by means of an uplift curve (Rzepakowski and Jaroszewicz 2012). Given the ranking of individuals, this curve illustrates the cumulative difference in response rate obtained by applying the optimal treatment to p percent of test cases relative to the control group. The x-axis displays the percentage of targeted individuals, whereas the y-axis shows the cumulative difference between the response rates of the predicted optimal treatments and the response rates of the control group. The overall effect of the treatments when all individuals are targeted (i.e., \(p = 100\) percent) is implicitly observed on the plot. A highly right-skewed uplift curve is desirable, since it indicates that the likely responders are primarily grouped in the top segments. An uplift curve is comparable to the lift curves of standard classification models, since it results from subtracting the estimated lift curve of the standard treatment group from the estimated lift curve of the group with the optimal treatments. In addition, a straight line is drawn between the two extremes of the uplift curve to represent the net incremental gains of targeting individuals at random. This line serves as a baseline to graphically observe how well a model outperforms random targeting.

Since the uplift curve is a subtraction of lift curves, this facilitates the estimation of a modified metric which is conceptually similar to the Gini coefficient (Kuusisto et al. 2014). The Qini metric (Radcliffe 2007), also known as the Area Under the Uplift Curve (AUUC) (Rzepakowski and Jaroszewicz 2010), is a standard tool to compare the performance among uplift models. This is calculated as the area between the uplift curve and the random model line. The greater this metric, the larger the incremental effects of the predicted optimal treatments.
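One possible implementation sketch of this evaluation is shown below; the binning and scaling conventions vary across studies, and all names are illustrative:

```python
import numpy as np

def uplift_curve_and_qini(tau_best, pred_t, obs_t, y, n_bins=10):
    """Cumulative uplift curve and Qini-style area for an MTUM model.

    tau_best: predicted uplift of the predicted optimal treatment;
    pred_t / obs_t: predicted optimal and observed treatments (0 = control).
    Treated cases whose observed treatment differs from the predicted
    optimal one are discarded, as discussed in the text.
    """
    order = np.argsort(-np.asarray(tau_best, dtype=float))
    pred_t, obs_t, y = (np.asarray(a)[order] for a in (pred_t, obs_t, y))
    fractions = np.linspace(0.0, 1.0, n_bins + 1)
    curve = [0.0]
    for p in fractions[1:]:
        top = slice(0, int(np.ceil(p * len(y))))
        matched = (obs_t[top] == pred_t[top]) & (obs_t[top] != 0)
        control = obs_t[top] == 0
        if matched.sum() == 0 or control.sum() == 0:
            curve.append(curve[-1])
            continue
        diff = y[top][matched].mean() - y[top][control].mean()
        curve.append(diff * top.stop)         # cumulative incremental responses
    curve = np.array(curve)
    random_line = fractions * curve[-1]       # straight baseline of random targeting
    return curve, float(np.trapz(curve - random_line, fractions))
```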

3.2 Expected response

The expected response is proposed by Zhao et al. (2017b) as an alternative to evaluate the performance of uplift models. This method is generalized for applications where multiple treatments are considered, as well as for different types of outcome variables. In addition, it addresses potential biases that may result when the sizes of the treatment groups are highly imbalanced.

The expected response method calculates a new variable Z that depends on the observed treatment in the test set, the predicted optimal treatment by the uplift model (denoted \(h(x_i)\) for individual i), the prior probabilities of the treatments \(P_{T=k}\), and the observed outcome Y. The computation also considers the Iverson bracket \({\mathbb {I}} (\cdot )\), which is equal to one if the predicted optimal treatment matches the observed treatment, and zero otherwise. Formally, the individual expected response is as follows:

$$\begin{aligned} z_i = \sum _{k = 0}^{K} \frac{y_i}{P_{T=k}} {\mathbb {I}}\{h(x_i) = k\} {\mathbb {I}}\{t_i = k\}. \end{aligned}$$

When the predicted optimal treatment equals the observed treatment, \(z_i\) represents the observed outcome scaled by the prior probability of being exposed to the treatment. The expected response of a multitreatment uplift model is then calculated as follows:

$$\begin{aligned} {\bar{z}}= \frac{1}{N}\sum _{i=1}^{N}z_i. \end{aligned}$$
(4)

The modified uplift curve illustrates the performance of a multitreatment uplift model in terms of the expected response (Zhao et al. 2017b). This curve is a plot of the cumulative expected response as a function of the percentage of test set cases that are targeted according to model suggestions. Similarly to the conventional uplift evaluation, test individuals are ranked in descending order according to their predicted uplift scores, and \({\bar{z}}\) is calculated for a given p percent.
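A minimal sketch of this computation (names are illustrative):

```python
def expected_response(y, obs_t, pred_t, priors):
    """Average expected response z-bar of Eq. 4.

    y: observed outcomes; obs_t: observed treatments; pred_t: treatments
    predicted as optimal by the model; priors: dict mapping each treatment
    k to its prior probability P(T = k).
    """
    z = [yi / priors[ti] if ti == hi else 0.0
         for yi, ti, hi in zip(y, obs_t, pred_t)]
    return sum(z) / len(z)

# Toy example with two treatments plus control and equal priors of 1/3
print(expected_response(y=[1, 0, 1, 1],
                        obs_t=[1, 2, 0, 2],
                        pred_t=[1, 1, 1, 2],
                        priors={0: 1/3, 1: 1/3, 2: 1/3}))   # -> 1.5
```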

4 Experimental setup

The experimental evaluation contrasts the performances of a subset of the above presented MTUM approaches with respect to eight data sets. First, we provide an overview of the main characteristics of the data sets. Later, we describe the data preprocessing and partitioning strategy, along with the MTUM techniques considered for the experiments. At the end, the statistical tests and their implementation are discussed.

4.1 Data sets

Customizing treatment allocation is a main concern among decision-makers in different domains. This study evaluates uplift models with respect to eight multidisciplinary data sets. Table 3 summarizes the most relevant information in relation to the data sets. Because some data sets are not specifically designed to estimate individual treatment effects, the treatment groups are formed according to the observed values of a specific decision variable (see Rzepakowski and Jaroszewicz (2012)). To assure that the uplift estimate is unbiased, we assess the balance of pretreatment characteristics among treatment groups before training. When imbalance is detected, we implement propensity score matching as proposed by Guelman (2015). This technique is further discussed in the next subsection. Overall, aside from profile, sociodemographic or transactional information, each data set also contains a treatment indicator encoded as a categorical variable with K possible treatments, along with a binary outcome variable. The following data sets are included in the experiments:

  • The \({\textit{Hillstrom}}\) direct marketing campaign data set (Hillstrom 2018) comprises a sample of 64,000 individuals. Three treatment groups are identified. Some customers receive an e-mail with men’s merchandise, a second group is targeted with an e-mail corresponding to women’s merchandise, and a last segment is not contacted. Success is considered when a customer visits the website within two weeks after receiving the e-mail.

  • The \({\textit{Gerber}}\) data set (Gerber et al. 2008) relates to the study of the political behavior of voters. The aim is to analyze whether social pressure increases turnout from a sample of 180,002 households. Direct mailings were randomly sent 11 days before the August 2006 primary election. The households that received either the “Self” message or the “Neighbors” message are the treated groups to evaluate, whereas those who were targeted with the “Civic duty” message represent the control group. The outcome variable is positive if a vote was cast in the election.

  • The \({\textit{Bladder}}\) data set (Therneau 2015) contains information regarding recurrence of bladder cancer for three treatment groups: 1) pyridoxine, 2) thiotepa, and 3) placebo. As in Sołtys et al. (2015), patients who had remaining cancer, or at least one recurrence, are classified as negative cases.

  • The \({\textit{Colon}}\) data set (Therneau 2015) includes data of chemotherapy trials against colon cancer. A low-toxicity medication, Levamisole, was administered to some patients, whereas a combination of Levamisole with the moderately toxic 5-FU chemotherapy agent was received by another subsample. The control treatment group corresponds to the nontreated patients. Following the setup proposed by Sołtys et al. (2015), two outcome variables can be extracted: 1) recurrence or death (Colon1) and 2) death (Colon2). The two data sets slightly differ in the way that the predictor variable \({\textit{time}}\) is processed. For the Colon1 data set, this variable is split into two factors: 1) the number of days until the recurrence event and 2) the number of days until the death event. In the Colon2 data set, \({\textit{time}}\) refers only to the number of days until death, since there is no recurrence.

  • The \({\textit{AOD}}\) data set corresponds to alcohol and drug usage (McCaffrey et al. 2013). In this subset of 600 observations, three treatment groups are identified: “community,” “metcbt5” and “scy.” Individuals in the “community” category are assigned to the control group. Given that the outcome variable is continuous, we apply binary encoding by assuming that a positive case is an individual whose substance use frequency declines by the 12th month after the treatment is applied. An important observation is that only 5 out of the 23 original pretreatment variables are available in this subset. Therefore, information on demography, substance use, criminal activities, mental health function and environmental risk is mostly absent.

  • The \({\textit{Bank}} \ {\textit{Marketing}}\) data set (Moro et al. 2014) is publicly available in the UCI repository. This set contains information regarding a direct marketing campaign conducted by a commercial bank. To obtain a multitreatment set, the categorical variable “contact” is chosen as the decision variable to determine the different treatment groups. Depending on the type of contact communication, individuals are assigned to either the “cellular” group or “telephone” group. The “unknowns” are the control group. The outcome variable is positive if a customer decides to open a term deposit with the institution.

  • The \({\textit{Turnover}}\) data set provided by a private Belgian organization comprises information regarding retention strategies aiming to reduce voluntary turnover. A subset of the 1,951 white-collar employees is targeted with two retention campaigns: “recognition” and “flexibility.” The remaining group is not treated, and hence is classified as control. A positive case is represented by an employee who does not voluntarily leave the company the year after the strategies are deployed.

Table 3 Multitreatment data sets

4.2 Data preprocessing and partitioning

Estimating the ITE of multiple treatments conveys some degree of uncertainty because an individual can only be assigned to one treatment group. Hence, the outcomes under the remaining alternatives are never observed in reality. If K represents the total number of treatment states, there are \(K-1\) unknown outcomes that correspond to the different counterfactual scenarios.

In a randomized control trial, the counterfactuals can be imputed from the observed outcomes of “similar” individuals who were exposed to the alternative treatments (Rosenbaum and Rubin 1983). In this context, similarity indicates that there are no considerable differences in the pretreatment characteristics among the treatment groups. This assures that the only cause of the behavioral change of an individual is the exposure to a particular treatment, all other factors being equal. However, in situations where the selection rule to allocate treatments is unknown or is not random (i.e., observational study), the estimation of treatment effects can be biased due to the heterogeneity of the treatment groups.

The majority of uplift models implicitly assume that treatments are allocated to individuals at random: as a result, the balance of pretreatment characteristics is often not validated. We believe that an unbiased estimation of treatment effects at the individual level demands verification and, if needed, corrective action. As such, the influence of the pretreatment characteristics in the assignment of treatments is removed. In this study, we implement the propensity score matching (PSM) (Rosenbaum and Rubin 1983) method as an attempt to form a quasi-randomized experiment to control for any selection bias that may affect the uplift estimate (Lopez et al. 2017).

Matching is convenient for estimation of treatment effects, since, in principle, it guarantees a homogeneous sample of individuals in terms of their observed pretreatment characteristics. Diverse matching strategies are proposed in the literature (see Morgan and Winship (2015) for an overview of matching techniques). The main differences among the techniques lie in how the sets of “similar” individuals are formed. For example, exact matching consists in grouping individuals whose only difference is the allocated treatment, whereas PSM joins the information of all pretreatment variables into a “score” that is later used to perform the matching.

PSM consists of estimating the probability for each individual of being treated as a function of the pretreatment characteristics, which is known as the propensity score (PS), \(PS_{i,k} = {\hat{P}}(T= k|x_i)\). Later, individuals with similar PS values are matched, and an estimate of the treatment effect is computed based on the differences in their observed outcomes. In sum, this technique aims to balance the observed pretreatment variables X among the treated groups T to obtain an unbiased estimate of the causal effect of T on Y. This assures that the remaining differences in the observed outcomes among treatment groups can be attributed solely to the effects of the treatments (Morgan and Winship 2015).

The PSM approach creates sets of individuals who are “similar” to some degree with respect to the observable pretreatment variables. As such, it provides transparency with respect to the mechanism of treatment assignment. The most common techniques for PSM employ nearest neighbor algorithms, kernel matching or one-to-one matching. They differ in the selection of the distance measure, as well as in the number of cases to group. Diamond and Sekhon (2013) proposed a genetic search algorithm to assure that optimal balance is achieved. Optimal matching has proven to be efficient, since it minimizes the distance between matched individuals and works well when the control group is smaller than the other treatment groups.
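The experiments in this study rely on optimal matching via the MatchIt package (see Sect. 4.5). Purely as an illustration of the underlying idea, the sketch below performs one-to-one nearest-neighbor matching on an estimated propensity score for a single treatment-control pair; all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_pairs(X, t, treated_label, control_label=0):
    """One-to-one nearest-neighbor matching on the propensity score.

    Returns (treated_index, matched_control_index) pairs; indices refer
    to the subset containing only the two groups being compared.
    """
    t = np.asarray(t)
    keep = np.isin(t, [treated_label, control_label])
    Xp, tp = np.asarray(X)[keep], t[keep]
    # Propensity score: P(T = treated | x), estimated with a logistic model
    ps_model = LogisticRegression(max_iter=1000).fit(Xp, (tp == treated_label).astype(int))
    scores = ps_model.predict_proba(Xp)[:, 1]
    treated_idx = np.where(tp == treated_label)[0]
    control_idx = np.where(tp == control_label)[0]
    nn = NearestNeighbors(n_neighbors=1).fit(scores[control_idx].reshape(-1, 1))
    _, nbr = nn.kneighbors(scores[treated_idx].reshape(-1, 1))
    return list(zip(treated_idx, control_idx[nbr.ravel()]))
```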

One advantage of PSM is that it achieves good performance for small data sets and counteracts the difficulties of applying matching to high-dimensional data sets. One limitation, however, is that unmatched cases are excluded from the analysis, leading to an important loss of information that can hamper the generalization of the findings. In addition, given the differences in pretreatment characteristics among treated groups, biases may not be completely solved, since the estimation of the PS is highly dependent on the correct specification of the set of observable characteristics that account for the systematic differences between these groups.

Regarding the partitioning of the data sets, a cross-validation strategy is used for evaluating model performance. This decreases the risk of overfitting and assures the generalization of model estimates. We partition each data set into five folds of approximately equal size, without overlap. In addition, stratification is applied with respect to the treatment groups to ensure that the observed treatment effects remain as similar as possible across the folds. Models are then fitted by performing multiple rounds, in which one fold is left out for testing and the remaining folds are considered as the training set. Later, the final models are applied to individuals in the test set, and performance is evaluated. The results of each round are averaged to obtain the overall performance.
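A minimal sketch of this partitioning, assuming scikit-learn is available (the helper name is illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def treatment_stratified_folds(X, t, n_splits=5, seed=0):
    """Five-fold partition stratified on the treatment indicator, so that
    the share of each treatment group stays comparable across folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf.split(np.asarray(X), np.asarray(t)))

# folds = treatment_stratified_folds(X, t)
# for train_idx, test_idx in folds:  fit on train_idx, evaluate on test_idx
```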

4.3 Uplift modeling techniques

Table 4 provides an overview of the MTUM techniques whose performances are evaluated in this study. These methods are a selection of data preprocessing and data processing approaches.

Among the techniques are well-established standard algorithms, such as the Logistic regression and the Random forest. Moreover, five modified algorithms that estimate the uplift directly in multitreatment applications are considered: Causal K-nearest neighbor (Guelman 2015), CTS random forest (Zhao et al. 2017b), ED random forest (Rzepakowski and Jaroszewicz 2012), X-Learner random forest and R-Learner random forest (Zhao and Harinen 2019). For the NUA and the MMOA, we use the binary Uplift random forest developed by Guelman (2014) and the Multinomial logistic regression, respectively. These latter approaches complement the existing set of methods in the MTUM literature.

When a variable selection procedure is not embedded within an algorithm, the generalized linear model with a stepwise variable selection procedure is deployed. This wrapper method removes some of the pretreatment variables until it finds the optimal combination that maximizes the performance of the model as guided by the Akaike Information Criterion (AIC). At the end of the iterations, a vector with the final variables is returned. Later, these variables are used to fit the models. An optimal sample of pretreatment characteristics decreases not only the computational time but also the complexity of the models. A more parsimonious and interpretable model can then be achieved, with potential gains in model performance and stability (Kuhn and Johnson 2013).

Table 4 MTUM techniques considered in the benchmarking study

4.4 Statistical test

The results of the benchmarking experiments are contrasted using a statistical test in order to detect whether the observed differences in performance are significantly different. We adopt the procedure documented by Demšar (2006), which performs a nonparametric Friedman test (Friedman 1940) with a corresponding post hoc test. First, we calculate the ranking for each model \(j \in J = \{j_1, \ldots , j_{13} \}\) within each data set. In the case that some results are identical, the final ranking is an average of the ranks that were initially assigned. Second, we calculate the average ranking \({\bar{r}}_j\) for model j over the n data sets and estimate the test statistic as follows:

$$\begin{aligned} \chi ^2_F = \frac{12n}{k(k+1)}\sum _{j=1}^k \Bigg ({\bar{r}}_j - \frac{k+1}{2}\Bigg )^2. \end{aligned}$$

At a level of \(\alpha = 0.05\), we are interested in rejecting the null hypothesis stating that there are no significant differences in performance across the data sets. The probability distribution of \(\chi ^2_F\) is accurately approximated by a chi-squared distribution with \(k-1\) degrees of freedom only when both n and k are sufficiently large, which is fulfilled in this study (i.e., \(n = 8\) and \(k = 13\)). If the p-value \(P(\chi _{k-1}^2 \ge \chi ^2_F)\) indicates that there are statistically significant differences, a post hoc Nemenyi (Nemenyi 1963) test is suggested to compare all of the models to each other.
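A minimal sketch of the statistic above, assuming NumPy and SciPy are available (the helper name is illustrative):

```python
import numpy as np
from scipy.stats import chi2, rankdata

def friedman_test(perf):
    """Friedman statistic on a (data sets x models) performance matrix.

    Higher performance receives rank 1; ties get average ranks, as
    described above. Returns the chi-square statistic and its p-value.
    """
    perf = np.asarray(perf, dtype=float)
    n, k = perf.shape
    ranks = np.vstack([rankdata(-row) for row in perf])   # rank models within each data set
    r_bar = ranks.mean(axis=0)                            # average rank per model
    stat = 12 * n / (k * (k + 1)) * np.sum((r_bar - (k + 1) / 2) ** 2)
    return stat, chi2.sf(stat, df=k - 1)
```

SciPy also ships friedmanchisquare, which accepts one vector of results per model and provides a comparable built-in version of this test.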

4.5 Implementation

The prescribed experiments are implemented in R (R Core Team 2017) and Python (Van Rossum and Drake Jr 1995). For the analysis of selection bias, the RItools package is used to check the imbalance in pretreatment characteristics among treatment groups. In the case of detecting any imbalance, the MatchIt package applies optimal matching based on the propensity scores.

In R, the caret package includes the standard Logistic regression and the Random forest algorithms. Furthermore, in this programming language, the uplift package (Guelman 2014) incorporates the CKNN (upliftKNN), the Uplift random forest (upliftRF) and the Uplift causal conditional inference forest (ccif). For the setup of the modified outcome techniques, the randomForest algorithm (Liaw and Wiener 2002) and the Multinomial log-linear model algorithm (multinom) (Ripley and Venables 2011) are chosen. Recent implementations of the CTS, ED, X-Learner and R-Learner algorithms are available in the causalML Python package.

Upon publication of this article, we will make the implementation of our experiments publicly available via Github. Our intention is to make the presented results reproducible and verifiable, as well as to stimulate and facilitate further MTUM research.

5 Empirical results

This section presents the assessment of balance of the pretreatment characteristics among treatment groups, and the respective correction by means of PSM. Later, the results of the benchmarking experiments are reported and discussed. The Qini metric and the expected response are used to evaluate model performance. The Friedman test is applied to determine whether the observed differences in performance are significantly different. In the end, the models’ average rankings are calculated and visualized.

5.1 Identifying and correcting selection bias

We perform a PSM preprocessing step in the case of detecting any imbalance in the pretreatment characteristics among the treatment groups. The purpose of this corrective action is to decrease the possibility of obtaining a biased uplift estimate. Considering that all data sets in this study consist of two treatments and a control group, balance is assessed in a pairwise fashion, as shown in Table 5. To verify whether there is at least one pretreatment variable for which the two groups differ, we perform the omnibus \(\chi ^2\) test proposed by Hansen and Bowers (2009).

Table 5 Balance assessment and indication of matching

Table 5 illustrates the results of the balance assessments and indicates whether matching is performed. The resulting p-values of the initial chi-square tests do not provide evidence of imbalance among the pairs of treatment groups that are part of the Hillstrom, Gerber and AOD data sets. Therefore, matching is not required. However, the test suggests that at least one of the pretreatment variables in the Bladder, Colon1, Colon2 and Turnover data sets is creating an imbalance between the treatment pairs. Given this result, we apply matching. The p-values of the postmatching chi-square test (i.e., final p-value) indicate that the imbalance is considerably reduced. An important remark is that the prior imbalances among the groups of the Bank data set are not successfully corrected by the chosen matching strategy. Therefore, in this specific case, the uplift estimates can be biased, given the differences in the pretreatment characteristics of the individuals.

5.2 Assessing model performance: the Qini metric

Table 6 reports the results of the benchmarking study for the Qini metric. We consider two scenarios in which the MTUM technique is used to target the full sample of test cases (Panel A) and the top 10 percent of individuals most likely to respond favorably to the treatments (Panel B). The CKNN algorithm is not implemented for the Hillstrom, Gerber and Bank data sets, given its operational inefficiency for large data sets. The Qini metrics of the best performing models are in bold, and the corresponding standard deviations are within brackets. Overall, Panel A shows that no single MTUM approach consistently outperforms the others. Most of the techniques perform well for some data sets, but poorly for others. Nonetheless, in five out of the eight data sets, our proposed approaches perform better than current methods. Among the recent algorithms such as CTS, XLearner and RLearner, only CTS slightly outperforms our proposed approaches, and only with respect to the Colon2 data set. Moreover, the proposed approaches generally exhibit reduced variability among folds. Their predictions are more stable, and therefore more reliable. The Friedman test is applied to the results of Panel A to verify whether there are statistically significant differences among the performances of the different models. The estimated p-value for this test is 0.24, which indicates that there is no evidence of a statistically significant difference in performance among techniques.

The Qini metric for the top 10 percent of targeted test cases indicates how well an MTUM technique prioritizes treatment allocation. In practical settings, campaigns have budgetary constraints that limit their scope. Therefore, model performance is assessed within a smaller proportion of test cases. Panel B of Table 6 shows that the Qini metric varies when the targeted population is reduced to the top 10 percent of responders. Under this restriction, MTUM techniques with outstanding performance when targeting the whole population are no longer necessarily suitable. For instance, the best performing approach remains the same at both the 100 percent and the 10 percent targeting levels in only three out of the eight data sets. Generally, there are no significant differences in performance between MTUM techniques (Friedman test p-value of 0.37). Every model, without distinction, performs well for some data sets and poorly for others. Furthermore, their predictions become more unstable, as shown by their standard deviations. The bias-variance trade-off is more evident, since performance improvements are made at the expense of decreased reliability. This is especially observed for small data sets such as Bladder, Colon1, Colon2 and AOD, whose Qini metrics and standard deviations are larger for 10 percent targeting than for 100 percent targeting.

On the other hand, in data sets where the observed overall effect of treatments is negative (e.g., Bladder), MTUM techniques prove to be valuable instruments to improve treatment effectiveness. For example, the treatments considered in the Bladder, Colon1, Colon2 and AOD data sets would exhibit unfavorable effects if treatment assignment were not customized according to the predictions of the MTUM techniques.

Table 6 Qini metric

A Qini curve is a useful visualization tool for assessing model performance: it graphically displays the performance of a model compared to random targeting. Figure 2 shows the Qini curves of the MTUM techniques evaluated on the Hillstrom data set; the results for this data set are representative of those for most data sets. The diagonal line represents a random assignment of treatments, whereas the lines in different colors correspond to the different MTUM techniques. The Qini curve of a model with outstanding performance lies as far above the random targeting line (in black) as possible. Overall, each of the MTUM techniques boosts the effect of the treatments for a particular proportion of targeted individuals. Nonetheless, DIALR, SMALR and ED appear to be more suitable for achieving superior treatment effects when targeting small samples, for instance, when launching campaigns with a tight constraint on the number of participants. On the other hand, MMOALR or RLearner can be more appropriate when considering larger exposure groups.

One important remark is that there are slight differences between the Qini plots of binary uplift models and the Qini plots in MTUM. In the latter case, the Qini curves of the different models, including the random targeting line, do not converge at the end (when targeting 100 percent of test cases). As explained in Sect. 3.1, the treatment given to an individual in the test set does not necessarily match her/his predicted optimal treatment, due to the random assignment of treatments. For this reason, the mismatched test cases are not considered in the evaluation, and the MTUM techniques therefore reach distinct uplift levels when the full population is targeted.
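As a minimal sketch of this matched-case evaluation, the snippet below ranks test cases by predicted uplift, retains only cases whose observed treatment equals the predicted optimal treatment, and accumulates the incremental response relative to the control response rate. The column names (`pred_uplift`, `pred_treatment`, `obs_treatment`, `response`) and the baseline choice are assumptions for illustration, not the exact Qini implementation used in the experiments.

```python
import numpy as np
import pandas as pd

def multitreatment_qini_curve(test: pd.DataFrame, control_label: str = "control"):
    """Cumulative incremental response over matched test cases, ordered by predicted uplift."""
    ranked = test.sort_values("pred_uplift", ascending=False).reset_index(drop=True)

    # Baseline: the average response observed in the control group.
    control_rate = ranked.loc[ranked["obs_treatment"] == control_label, "response"].mean()

    # Keep only cases whose observed treatment matches the predicted optimal treatment.
    matched = ranked[ranked["obs_treatment"] == ranked["pred_treatment"]]

    cum_response = matched["response"].cumsum().to_numpy()
    n_targeted = np.arange(1, len(matched) + 1)

    # Incremental gain relative to what control-level responding would have yielded.
    qini = cum_response - control_rate * n_targeted
    return n_targeted / len(matched), qini
```

Plotting the returned gain against the targeted fraction, together with the diagonal of random targeting, yields a curve analogous to those in Fig. 2.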

Fig. 2

Qini curves as a function of the targeted population for the Hillstrom data set. The curves correspond to the 12 different experimentally evaluated MTUM approaches, and the straight line is the baseline indicating random targeting

5.3 Assessing model performance: the expected response

As an alternative to the Qini metric, Table 7 reports the expected responses of optimal targeting as predicted by the MTUM techniques. Panel A and Panel B show the expected responses of targeting the full test sample and only the top 10 percent of test cases, respectively. The largest expected responses are in bold. In large data sets such as Hillstrom, Gerber and Bank, the expected responses do not differ significantly across models for either targeting level. However, there are slight differences in the expected responses among the MTUM approaches for the small data sets. The CTS, ED, XLearner and RLearner methods are as competitive as the approaches proposed in this study, whereas conventional techniques are clearly suboptimal. The p-values of the Friedman test, 0.19 and 0.13, indicate that none of the applied approaches differ significantly in terms of performance when targeting 100 percent or 10 percent of the test cases, respectively. This is consistent with the results obtained for the Qini metric.
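A sketch of how such an expected response can be approximated on the matched test cases is given below; it assumes randomized treatment assignment with known assignment probabilities and the same illustrative column names as before, and it is not claimed to be the exact estimator of Zhao et al. (2017a).

```python
import pandas as pd

def expected_response(test: pd.DataFrame, assignment_probs: dict) -> float:
    """Inverse-probability-weighted mean outcome of test cases whose observed treatment
    matches the treatment recommended by the model (illustrative sketch)."""
    matched = test[test["obs_treatment"] == test["pred_treatment"]]
    weights = matched["obs_treatment"].map(assignment_probs)
    return float((matched["response"] / weights).sum() / len(test))
```

For instance, with three arms one might call `expected_response(test_df, {"control": 0.5, "treatment_a": 0.25, "treatment_b": 0.25})`, where the hypothetical probabilities reflect the randomization design of the experiment.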

Table 7 Expected response

Figure 3 plots the expected responses of the MTUM approaches at different targeting levels for the Hillstrom data set. The horizontal axis indicates the percentage of the population targeted with the predicted optimal treatments, whereas the vertical axis shows the expected response. As expected, optimal targeting positively influences the effect of the treatments. The advantage of this visualization tool is that it supports decision-making: when confronted with resource constraints, one can select the model that yields the largest expected response for a given percentage of the targeted population. We observe that the ED and RLearner methods generally achieve the highest expected response, regardless of the proportion of targeted test cases.

Fig. 3

Expected response as a function of the population targeted for the Hillstrom data set. The curves correspond to the 12 different experimentally evaluated MTUM approaches

5.4 Matched test cases and overall ranking of MTUM approaches

A final analysis of the results consists of assessing the matched test cases and contrasting the performance of each model according to the different evaluation metrics.

We emphasize in Sect. 3 that evaluating the performance of MTUM approaches can be challenging. In particular, test set cases receive treatments at random, and hence their predicted optimal treatments do not necessarily match their observed treatments. To ensure a correct interpretation of the findings, the Qini metric and the expected response only consider test cases whose predicted and observed treatments coincide. The major drawback of this method is that it discards a considerable number of data points. Figure 4 shows the cumulative proportion of matched test cases as a function of the percentage of the population targeted for each MTUM approach. As expected, due to the random allocation of treatments, the performance metrics use approximately half of the total test sample for evaluation.
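A minimal sketch of the curve in Fig. 4, under the same illustrative column names as before, simply tracks the running share of matched cases as the targeted population grows.

```python
import numpy as np
import pandas as pd

def matched_proportion_curve(test: pd.DataFrame):
    """Cumulative share of matched test cases while expanding the targeted population."""
    ranked = test.sort_values("pred_uplift", ascending=False)
    is_match = (ranked["obs_treatment"] == ranked["pred_treatment"]).to_numpy()
    n_targeted = np.arange(1, len(ranked) + 1)
    return n_targeted / len(ranked), is_match.cumsum() / n_targeted
```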

Fig. 4

Percentage of matched test samples as a function of the population targeted for the Hillstrom data set. The different curves correspond to the 12 different experimentally evaluated MTUM approaches

On the other hand, we also rank the MTUM approaches according to the different performance metrics, as illustrated in Fig. 5. The horizontal axis displays the different models, and the vertical axis shows their average ranking according to the performance metrics (Qini and expected response with 10 percent and 100 percent targeting). The shapes represent the evaluation metrics, and the lengths of the vertical lines represent the dispersion of the ranks. The rank of a model is determined per data set (i.e., the model with the best performance is ranked first), and the ranks are then averaged across the eight data sets for each evaluation metric.
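This ranking can be reproduced along the following lines, assuming a score table with one row per data set and one column per model and taking higher scores as better; the numbers here are placeholders, not the benchmark results.

```python
import numpy as np
import pandas as pd

# Placeholder scores for one evaluation metric: rows = data sets, columns = models.
scores = pd.DataFrame(np.random.default_rng(1).normal(size=(8, 13)),
                      columns=[f"model_{i}" for i in range(13)])

# Rank models within each data set (rank 1 = best, i.e., highest score) ...
ranks = scores.rank(axis=1, ascending=False)
# ... then average the ranks across the eight data sets for this metric.
avg_rank = ranks.mean(axis=0).sort_values()
print(avg_rank)
```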

It is observed that most of the MTUM approaches do not consistently outperform the others. MMOARF generally achieves satisfactory results for all data sets and is therefore ranked similarly by the evaluation metrics. Remarkably, the CKNN algorithm performs poorly for all data sets and occupies the worst position in the ranking. Moreover, recent algorithms such as RLearner and XLearner perform competitively when the evaluation metric is the expected response, whereas for the Qini metric, the methods that employ decision trees, such as SMARF, ED and CTS, exhibit better results.

Fig. 5

Overall ranking of the different MTUM approaches by performance metrics. The shapes indicate the performance metric, whereas the lines show the ranking dispersion of each model given the performance metrics

The methods proposed in this study are competitive in terms of performance compared to current MTUM techniques. Irrespective of the size of the data sets, they achieve the best results with respect to the Qini metric and the expected response (at 100 percent targeting) in five and seven out of the eight data sets, respectively. Their estimates are also more stable, as they exhibit smaller variation across folds. For example, MMOALR is consistently among the best performers for the Hillstrom data set, as observed in the plots of the Qini curves and the expected responses, while being a simple, easily interpretable and computationally inexpensive approach.

In summary, the primary advantage of our methods is their ease of implementation, since they are based on existing algorithms that are readily available and widely known. Moreover, they build upon well-established binary uplift modeling approaches that have been evaluated in several previous studies.

6 Conclusion

Predicting treatment effects at the individual level supports decision-makers in the allocation of scarce resources, since it facilitates the identification of the individuals most likely to respond to particular actions. In this regard, uplift modeling serves as a tool to investigate and anticipate the effects of treatments in diverse contexts. Conventional uplift techniques are mostly limited to questions involving the effect of a single treatment; situations in which more than one treatment alternative is at hand are rarely considered. Therefore, there exists only a vague understanding of which MTUM techniques are available, as well as limited evidence on the contexts in which they have been applied.

We contribute to the state-of-the-art in the field of uplift modeling by: (1) providing an exhaustive survey of the literature on MTUM and applying a framework to classify these methods; (2) proposing two new MTUM techniques; and (3) presenting the results of an extensive benchmarking study, and thus providing ample empirical evidence with respect to the performances of 13 MTUM methods for eight multitreatment uplift data sets. The experiments are performed on data sets from diverse domains such as marketing, political behavior, personalized medicine and human resources. The performances of the models are evaluated by means of the Qini metric and the expected responses in order to facilitate their comparison.

Current multitreatment uplift approaches are classified into two main categories: data preprocessing and data processing approaches. The former learn an uplift model by means of conventional machine learning algorithms, after redefining the original outcome variable or extending the input space with dummies and interaction terms before training. In contrast, data processing approaches separately train standard predictive algorithms or adapt their internal functioning. As a result, the uplift can be computed indirectly or directly: indirect estimation separately processes the information contained in each treatment group, whereas direct estimation uses a multitreatment uplift algorithm that includes all treatments during training.

This paper extends the modified outcome method originally proposed for binary uplift modeling to the MTUM case. The MMOA directly estimates the uplift by means of any standard multiclass probabilistic classification algorithm. Moreover, the NUA takes advantage of existing binary uplift modeling machine learning algorithms. As opposed to the SMA, fewer models are trained, and each treatment is directly contrasted with the control group.

Evaluating the performance of MTUM techniques is challenging due to the fundamental problem of causal inference. Estimating the true uplift is impossible in practice, since an individual cannot be simultaneously exposed to all treatments; the different counterfactual scenarios are therefore unobservable. In this article, conventional uplift evaluation methods (i.e., the uplift curve and the Qini metric) are adapted to the multitreatment case and contrasted with the expected response approach recently proposed by Zhao et al. (2017a) and Zhao et al. (2017b). Given that treatments are randomly assigned to test cases, the predicted optimal treatments do not necessarily match the observed treatments. As such, only matched test cases are considered in evaluating the performances of the models. Although it is expected and observed that such a strategy implies a considerable loss of data (approximately 50 percent), it ensures a correct evaluation of the performance of MTUM techniques.

The experimental setup includes an inventory of eight data sets from various domains. This facilitates testing uplift techniques in diverse multitreatment scenarios. In addition, studies where selection bias is tested and controlled are rare in the uplift literature. Therefore, we verify and, if needed, correct the imbalance among the pretreatment characteristics of the treatment groups by applying matching. We apply PSM to four data sets where the chi-square test detected imbalance. However, this does not necessarily eliminate the risk of selection bias, nor does it aim to improve the performances of the models.

Different MTUM approaches are considered in the experimental evaluation. The Friedman test confirms that none of the evaluated techniques consistently outperforms the others in terms of the Qini metric and the expected response. Therefore, we conclude that the two techniques proposed in this study are competitive: they achieve performances similar to those of current MTUM techniques. In addition, the proposed approaches can be easily implemented, since the required algorithms are readily available in standard software packages. Generally, the study shows that the performance of a multitreatment uplift technique is highly context-dependent.

On the other hand, we observe that the size of the uplift data set has implications for the capacity of a model to compute reliable estimates. Small data sets such as Bladder, Colon1, Colon2 and AOD present high volatility in the uplift predictions among different folds in the cross-validation evaluation.

This study has certain limitations, which can serve as motivation for future research. First, optimal matching with propensity scores leads to an important loss of information when treatment groups are not of equal size. In addition, this technique is highly dependent on the correct specification of the set of observable characteristics. Other methods for correcting selection bias could offer more reliable uplift estimates. Second, to ensure a correct evaluation, our study does not consider test cases for which the predicted and the observed treatments do not match. Consequently, a significant amount of data is discarded. Other solutions may consider all test cases, wherein mismatches are penalized rather than removed from the analysis. Finally, the level of the analysis can be enriched by discriminating among types of treatments and individuals. Inexpensive and effective treatments should be privileged over less effective and costly treatments. Analogously, some customers are more valuable than others.