1 Introduction

This study examines the problem of determining whether to treat individuals based on observed covariates. A standard approach is to employ a plug-in rule that selects the treatment according to the sign of an estimate of the conditional average treatment effect (CATE). Kernel regression is a prevalent technique for estimating the CATE, and a crucial aspect of implementing this method is bandwidth selection. Many studies have proposed bandwidth selection approaches for estimation problems (see Li and Racine (2007)). However, to the best of our knowledge, there are few studies that investigate bandwidth selection for the treatment choice problem. We propose a novel method for bandwidth selection in the treatment choice problem when the outcome is a binary variable.

In this study, we consider a planner who wants to use experimental data to determine whether to treat individuals with a particular covariate value. Following Manski (2004, 2007), Stoye (2009, 2012), Tetenov (2012), Ishihara and Kitagawa (2021), and Yata (2021), we focus on the minimax regret criterion to solve the decision problem. When the outcome variable is binary, its conditional distribution is characterized by a conditional mean function. We assume that this conditional mean function is a Lipschitz function, and show that the maximum regret can be calculated by maximizing over two parameters. Based on these results, we propose a computationally tractable algorithm for obtaining the optimal bandwidth.

Ishihara and Kitagawa (2021) and Yata (2021) derive the minimax regret rule in a similar setting when the outcome variable is normally distributed. Using the results of Ishihara and Kitagawa (2021), the maximum regret of a nonrandomized statistical treatment rule that plugs in fitted values from a nonparametric kernel regression can be calculated easily. However, this approach relies on the normality of the outcome variable and therefore cannot be applied to binary outcomes.

Stoye (2012) considers the statistical decision problem with a binary outcome. The setting of Stoye (2012) is similar to ours, but our framework differs in two respects. First, we focus on the treatment choice at a particular covariate value, whereas Stoye (2012) considers treatment assignment functions that map from the covariate support into a binary treatment status. Second, our restriction on the conditional mean function differs from that of Stoye (2012): Stoye (2012) assumes that the variation of the conditional mean function is bounded, whereas we assume that the conditional mean function is a Lipschitz function.

The remainder of this paper is organized as follows. Section 2 describes the study setting. Section 3 defines the statistical decision problem and provides a computationally tractable algorithm to obtain the optimal bandwidth. Section 4 presents a numerical exercise where bandwidth selections for binary and normally distributed outcomes are compared. Section 5 concludes the study.

2 Setting

Suppose that we have experimental data \(\{Y_i,D_i,X_i\}_{i=1}^n\), where \(X_i \in \mathbb {R}^{d_x}\) is a vector of observable pre-treatment covariates, \(D_i \in \{0,1\}\) is a binary indicator for the treatment, and \(Y_i \in \{0,1\}\) is a binary outcome. Then \(Y_i\) satisfies

$$\begin{aligned} Y_i \ = \ \mu (D_i,X_i) + U_i, \ \ \ \ E[U_i|D_i,X_i]=0. \end{aligned}$$
(1)

This implies \(Y_i|D_i,X_i \sim Ber \left( \mu (D_i,X_i) \right)\). Under the unconfoundedness assumption, we can view \(\mu (1,x) - \mu (0,x)\) as the CATE.

For simplicity, we assume that \(D_i\) and \(X_i\) are deterministic, \(D_i=1\) for \(i = 1, \ldots , n_1\), and \(D_i = 0\) for \(i = n_1+1, \ldots , n_1+n_0\), where \(n_1+n_0 = n\). We define \(Y_{1,i} \equiv Y_i\) for \(i = 1, \ldots , n_1\), \(Y_{0,i} \equiv Y_{n_1 + i}\) for \(i = 1, \ldots , n_0\), \(X_{1,i} \equiv X_i\) for \(i = 1, \ldots , n_1\), and \(X_{0,i} \equiv X_{n_1 + i}\) for \(i = 1, \ldots , n_0\). Letting \(p_{1,i} \equiv \mu (1,X_{1,i})\) for \(i=1,\ldots ,n_1\) and \(p_{0,i} \equiv \mu (0,X_{0,i})\) for \(i=1,\ldots ,n_0\), we have \(Y_{1,i} \sim Ber(p_{1,i})\) for \(i=1,\ldots ,n_1\) and \(Y_{0,i} \sim Ber(p_{0,i})\) for \(i=1,\ldots ,n_0\). Additionally, we define \(p_{1,0} \equiv \mu (1,0)\) and \(p_{0,0} \equiv \mu (0,0)\). Then, the distributions of \({\varvec{Y}}_1 \equiv (Y_{1,1},\ldots ,Y_{1,n_1})'\) and \({\varvec{Y}}_0 \equiv (Y_{0,1},\ldots ,Y_{0,n_0})'\) are determined by the parameters \({\varvec{p}}_1 \equiv (p_{1,0}, \ldots , p_{1,n_1})'\) and \({\varvec{p}}_0 \equiv (p_{0,0}, \ldots , p_{0,n_0})'\).

Throughout this paper, we consider a planner seeking to determine whether to treat individuals with \(X_i = x_0\) based on the data \(\textbf{D} \equiv ({\varvec{Y}}_1, {\varvec{Y}}_0)\), with the parameters \({\varvec{p}}_1 \equiv (p_{1,0}, \ldots , p_{1,n_1})'\) and \({\varvec{p}}_0 \equiv (p_{0,0}, \ldots , p_{0,n_0})'\) unknown. Here, \(x_0\) is a specific value of the covariate vector; it need not belong to the support of the covariates observed in the data. Without loss of generality, we assume that \(x_0 = 0\).

Following Manski (2004, 2007), Stoye (2009, 2012), Tetenov (2012), Ishihara and Kitagawa (2021), and Yata (2021), we focus on the minimax regret criterion to solve the decision problem. To this end, we assume that the true parameters \({\varvec{p}} \equiv ({\varvec{p}}_1, {\varvec{p}}_0)\) are known to belong to the parameter space \(\mathcal {P} \equiv \mathcal {P}_1 \times \mathcal {P}_0\). We impose the following restrictions on this parameter space:

$$\begin{aligned} \mathcal {P}_1\equiv & {} \{{\varvec{p}}_1 \in [0,1]^{n_1+1}: |p_{1,i} - p_{1,j}| \le C \Vert X_{1,i} - X_{1,j} \Vert \ \text {for all} \ i,j \}, \\ \mathcal {P}_0\equiv & {} \{{\varvec{p}}_0 \in [0,1]^{n_0+1}: |p_{0,i} - p_{0,j}| \le C \Vert X_{0,i} - X_{0,j} \Vert \ \text {for all} \ i,j \}, \end{aligned}$$

where \(X_{1,0} = X_{0,0} \equiv 0\). These restrictions correspond to the assumption that \(\mu (1,x)\) and \(\mu (0,x)\) are Lipschitz functions with Lipschitz constant \(C>0\).
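To fix ideas, the following sketch simulates data from this setting, assuming a scalar covariate (\(d_x = 1\)) and hypothetical Lipschitz conditional mean functions; the choices of \(n_1\), \(n_0\), C, and \(\mu\) below are illustrative only and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n1, n0, C = 20, 20, 0.5                     # illustrative sample sizes and Lipschitz constant
X1 = np.linspace(-1.0, 1.0, n1)             # deterministic covariates, treated arm
X0 = np.linspace(-1.0, 1.0, n0)             # deterministic covariates, control arm

# Hypothetical conditional means mu(1, x) and mu(0, x); both are Lipschitz with constant <= C.
mu1 = lambda x: np.clip(0.55 + 0.3 * x, 0.0, 1.0)
mu0 = lambda x: np.clip(0.50 - 0.2 * x, 0.0, 1.0)

p1, p0 = mu1(X1), mu0(X0)                   # p_{1,i} and p_{0,i}
Y1 = rng.binomial(1, p1)                    # Y_{1,i} ~ Ber(p_{1,i})
Y0 = rng.binomial(1, p0)                    # Y_{0,i} ~ Ber(p_{0,i})
```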

3 Main results

3.1 Welfare and regret

Given a treatment choice action \(\delta \in [0,1]\), we define the welfare attained by \(\delta\) as

$$\begin{aligned} W(\delta ) \ \equiv \ (p_{1,0} - p_{0,0}) \cdot \delta + p_{0,0}. \end{aligned}$$
(2)

The optimal treatment choice action given knowledge of \({\varvec{p}}\) is

$$\begin{aligned} \delta ^{*} \ \equiv \ 1 \left\{ p_{1,0} \ge p_{0,0} \right\} . \end{aligned}$$
(3)

\(W(\delta ^{*})\) is then the welfare that would be attained if we knew \({\varvec{p}}\).

Let \(\hat{\delta }(\textbf{D}) \in \{0,1\}\) be a statistical treatment rule that maps data \(\textbf{D}\) to the binary treatment choice decision. The welfare regret of \(\hat{\delta }(\textbf{D})\) is defined as

$$\begin{aligned} R({\varvec{p}}, \hat{\delta })\equiv & {} E_{{\varvec{p}}} \left[ W(\delta ^{*}) - W(\hat{\delta }(\textbf{D})) \right] \nonumber \\= & {} (p_{1,0} - p_{0,0}) \cdot \left( \delta ^{*} - E_{{\varvec{p}}} [ \hat{\delta }(\textbf{D}) ] \right) , \end{aligned}$$
(4)

where \(E_{{\varvec{p}}}(\cdot )\) is the expectation with respect to the sampling distribution of \(\textbf{D}\) given the parameters \({\varvec{p}}\).
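For illustration, take the hypothetical values \(p_{1,0} = 0.6\) and \(p_{0,0} = 0.4\). Then \(\delta ^{*} = 1\) and (4) reduces to

$$\begin{aligned} R({\varvec{p}}, \hat{\delta }) \ = \ 0.2 \cdot \left( 1 - E_{{\varvec{p}}} \left[ \hat{\delta }(\textbf{D}) \right] \right) , \end{aligned}$$

that is, the regret equals the foregone treatment effect multiplied by the probability that the rule chooses not to treat.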

This study focuses on the following class of statistical treatment rules.

Assumption 1

We consider the following class of statistical treatment rules \(\mathcal {D}\):

$$\begin{aligned} \mathcal {D} \equiv \left\{ \hat{\delta }_{\theta }(\textbf{D}) \equiv 1 \left\{ \frac{\sum _{i=1}^{n_1} K(X_{1,i}/\theta ) Y_{1,i}}{\sum _{i=1}^{n_1} K(X_{1,i}/\theta )} - \frac{\sum _{i=1}^{n_0} K(X_{0,i}/\theta ) Y_{0,i}}{\sum _{i=1}^{n_0} K(X_{0,i}/\theta )} \ge 0 \right\} \,: \, \theta > 0 \right\} , \end{aligned}$$
(5)

where \(K: \mathbb {R}^{d_x} \rightarrow \mathbb {R}_{+}\) denotes a kernel function.

Assumption 1 states that we focus on nonrandomized statistical treatment rules that plug in fitted values from a nonparametric kernel regression evaluated at \(x_0 = 0\). In addition, we assume that the kernel function takes non-negative values; our results rely on this condition. In (5), \(\theta\) is the bandwidth of the kernel regression estimator.
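As a concrete illustration, the rule (5) can be sketched as follows, assuming a scalar covariate and the Gaussian kernel used later in Sect. 4; the function names are ours and purely illustrative.

```python
import numpy as np

def gauss_kernel(u):
    """Gaussian kernel; Assumption 1 only requires a non-negative kernel."""
    return np.exp(-0.5 * u ** 2)

def delta_theta(Y1, Y0, X1, X0, theta):
    """Rule (5): treat iff the kernel-regression fitted value at x0 = 0
    for the treated arm is at least that for the control arm."""
    w1 = gauss_kernel(X1 / theta)
    w0 = gauss_kernel(X0 / theta)
    m1 = np.sum(w1 * Y1) / np.sum(w1)       # fitted value of mu(1, 0)
    m0 = np.sum(w0 * Y0) / np.sum(w0)       # fitted value of mu(0, 0)
    return 1 if m1 - m0 >= 0 else 0
```

For example, with the simulated data from the sketch in Sect. 2, `delta_theta(Y1, Y0, X1, X0, theta=0.5)` returns the rule's decision at \(x_0 = 0\).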

The minimax regret criterion selects a statistical treatment rule that minimizes the maximum regret:

$$\begin{aligned} \hat{\delta }_{\mathcal {D}}^{*} \ = \ \text {arg} \min _{\hat{\delta } \in \mathcal {D}} \max _{{\varvec{p}} \in \mathcal {P}} R({\varvec{p}}, \hat{\delta }). \end{aligned}$$

Under Assumption 1, a statistical treatment rule \(\hat{\delta } \in \mathcal {D}\) is characterized by its bandwidth \(\theta\), and the optimal bandwidth can be calculated as follows:

$$\begin{aligned} \theta ^{*} \ = \ \text {arg} \min _{\theta } \max _{{\varvec{p}} \in \mathcal {P}} R({\varvec{p}}, \hat{\delta }_{\theta }), \end{aligned}$$
(6)

where \(\hat{\delta }_{\theta }\) is as defined in (5). In the next subsection, we describe the calculation of the optimal bandwidth \(\theta ^{*}\).

3.2 Minimax regret rule

The following theorem implies that we can calculate \(\max _{{\varvec{p}} \in \mathcal {P}} R({\varvec{p}},\hat{\delta }_{\theta })\) by maximizing over two parameters.

Theorem 1

Under Assumption 1, for any \(\hat{\delta }_{\theta } \in \mathcal {D}\) we obtain

$$\begin{aligned} \max _{{\varvec{p}} \in \mathcal {P}} R({\varvec{p}},\hat{\delta }_{\theta })= & {} \max \left[ \max _{p_{1,0} > p_{0,0}} \left\{ (p_{1,0}-p_{0,0}) \cdot \left( 1 - E_{\tilde{{\varvec{p}}}^{-}(p_{1,0}, p_{0,0})} [\hat{\delta }_{\theta }(\textbf{D}) ] \right) \right\} , \right. \nonumber \\{} & {} \left. \max _{p_{1,0} < p_{0,0}} \left\{ (p_{0,0}-p_{1,0}) \cdot E_{\tilde{{\varvec{p}}}^{+}(p_{1,0}, p_{0,0})} [ \hat{\delta }_{\theta }(\textbf{D}) ] \right\} \right] , \end{aligned}$$
(7)

where \(\tilde{{\varvec{p}}}^{-}(p_{1,0}, p_{0,0}) \equiv \left( \tilde{{\varvec{p}}}_1^{-}(p_{1,0}), \tilde{{\varvec{p}}}_0^{+}(p_{0,0}) \right)\), \(\tilde{{\varvec{p}}}^{+}(p_{1,0}, p_{0,0}) \equiv \left( \tilde{{\varvec{p}}}_1^{+}(p_{1,0}), \tilde{{\varvec{p}}}_0^{-}(p_{0,0}) \right)\),

$$\begin{aligned} \tilde{{\varvec{p}}}_1^{-}(p)\equiv & {} \left( \tilde{p}_{1,0}^{-}(p), \tilde{p}_{1,1}^{-}(p), \ldots , \tilde{p}_{1,n_1}^{-}(p) \right) ' \\\equiv & {} \left( p, \max \{ p-C\Vert X_{1,1} \Vert , 0 \}, \ldots , \max \{ p-C\Vert X_{1,n_1} \Vert , 0 \} \right) ', \\ \tilde{{\varvec{p}}}_1^{+}(p)\equiv & {} \left( \tilde{p}_{1,0}^{+}(p), \tilde{p}_{1,1}^{+}(p), \ldots , \tilde{p}_{1,n_1}^{+}(p) \right) ' \\\equiv & {} \left( p, \min \{ p+C\Vert X_{1,1} \Vert , 1 \}, \ldots , \min \{ p+C\Vert X_{1,n_1} \Vert , 1 \} \right) ', \\ \tilde{{\varvec{p}}}_0^{-}(p)\equiv & {} \left( \tilde{p}_{0,0}^{-}(p), \tilde{p}_{0,1}^{-}(p), \ldots , \tilde{p}_{0,n_0}^{-}(p) \right) ' \\\equiv & {} \ \left( p, \max \{ p-C\Vert X_{0,1} \Vert , 0 \}, \ldots , \max \{ p-C\Vert X_{0,n_0} \Vert , 0 \} \right) ', \\ \tilde{{\varvec{p}}}_0^{+}(p)\equiv & {} \left( \tilde{p}_{0,0}^{+}(p), \tilde{p}_{0,1}^{+}(p), \ldots , \tilde{p}_{0,n_0}^{+}(p) \right) ' \\\equiv & {} \left( p, \min \{ p+C\Vert X_{0,1} \Vert , 1 \}, \ldots , \min \{ p+C\Vert X_{0,n_0} \Vert , 1 \} \right) '. \end{aligned}$$

Theorem 1 provides a way of finding the parameters that maximize the regret of \(\hat{\delta }_{\theta } \in \mathcal {D}\). If \(p_{1,0} > p_{0,0}\), then the regret of \(\hat{\delta }_{\theta }\) is maximized at

$$\begin{aligned} {\varvec{p}}_1= & {} \left( p_{1,0}, \max \{ p_{1,0}-C\Vert X_{1,1} \Vert , 0 \}, \ldots , \max \{ p_{1,0}-C\Vert X_{1,n_1} \Vert , 0 \} \right) ' \ \text {and} \\ {\varvec{p}}_0= & {} \left( p_{0,0}, \min \{ p_{0,0}+C\Vert X_{0,1} \Vert , 1 \}, \ldots , \min \{ p_{0,0}+C\Vert X_{0,n_0} \Vert , 1 \} \right) '. \end{aligned}$$

Similarly, if \(p_{1,0} < p_{0,0}\), then the regret of \(\hat{\delta }_{\theta }\) is maximized at

$$\begin{aligned} {\varvec{p}}_1= & {} \left( p_{1,0}, \min \{ p_{1,0}+C\Vert X_{1,1} \Vert , 1 \}, \ldots , \min \{ p_{1,0}+C\Vert X_{1, n_1} \Vert , 1 \} \right) ' \ \text {and} \\ {\varvec{p}}_0= & {} \left( p_{0,0}, \max \{ p_{0,0}-C\Vert X_{0,1} \Vert , 0 \}, \ldots , \max \{ p_{0,0}-C\Vert X_{0,n_0} \Vert , 0 \} \right) '. \end{aligned}$$

These results allow us to calculate the maximum regret for a given \(p_{1,0}\) and \(p_{0,0}\), and we can then compute \(\max _{{\varvec{p}} \in \mathcal {P}} R({\varvec{p}},\hat{\delta }_{\theta })\) by maximizing over \(p_{1,0}\) and \(p_{0,0}\).
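A minimal sketch of these least favorable parameter vectors, again assuming a scalar covariate so that \(\Vert X_{i} \Vert = |X_{i}|\):

```python
import numpy as np

def p_minus(p, X, C):
    """(p, max{p - C||X_1||, 0}, ..., max{p - C||X_n||, 0})': means pulled down."""
    return np.concatenate(([p], np.maximum(p - C * np.abs(X), 0.0)))

def p_plus(p, X, C):
    """(p, min{p + C||X_1||, 1}, ..., min{p + C||X_n||, 1})': means pushed up."""
    return np.concatenate(([p], np.minimum(p + C * np.abs(X), 1.0)))
```

For instance, when \(p_{1,0} > p_{0,0}\), the least favorable configuration above corresponds to `(p_minus(p10, X1, C), p_plus(p00, X0, C))`.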

In view of Theorem 1, we can compute the optimal bandwidth \(\theta ^{*}\) using the following algorithm.

1. Fix \(\hat{\delta }_{\theta } \in \mathcal {D}\) and generate i.i.d. random variables \(\{U_{1,i,s}\}_{i=1, \ldots , n_1, \, s = 1, \ldots , S}\) and \(\{U_{0,i,s}\}_{i=1, \ldots , n_0, \, s = 1, \ldots , S}\), each uniformly distributed on \([0,1]\), where S is a large number. Define the following variables:

$$\begin{aligned} \tilde{Y}_{1,i,s}^{-}(p)\equiv & {} 1 \left\{ U_{1,i,s} \le \tilde{p}_{1,i}^{-}(p) \right\} , \ \ \tilde{Y}_{1,i,s}^{+}(p) \equiv 1 \left\{ U_{1,i,s} \le \tilde{p}_{1,i}^{+}(p) \right\} , \\ \tilde{Y}_{0,i,s}^{-}(p)\equiv & {} 1 \left\{ U_{0,i,s} \le \tilde{p}_{0,i}^{-}(p) \right\} , \ \ \tilde{Y}_{0,i,s}^{+}(p) \equiv 1 \left\{ U_{0,i,s} \le \tilde{p}_{0,i}^{+}(p) \right\} , \\ \tilde{{\varvec{Y}}}^{-}_{1,s}(p)\equiv & {} \left( \tilde{Y}_{1,1,s}^{-}(p), \ldots , \tilde{Y}_{1,n_1,s}^{-}(p) \right) , \ \ \tilde{{\varvec{Y}}}^{+}_{1,s}(p) \equiv \left( \tilde{Y}_{1,1,s}^{+}(p), \ldots , \tilde{Y}_{1,n_1,s}^{+}(p) \right) , \\ \tilde{{\varvec{Y}}}^{-}_{0,s}(p)\equiv & {} \left( \tilde{Y}_{0,1,s}^{-}(p), \ldots , \tilde{Y}_{0,n_0,s}^{-}(p) \right) , \ \ \tilde{{\varvec{Y}}}^{+}_{0,s}(p) \equiv \left( \tilde{Y}_{0,1,s}^{+}(p), \ldots , \tilde{Y}_{0,n_0,s}^{+}(p) \right) . \end{aligned}$$
2. Approximate \(E_{\tilde{{\varvec{p}}}^{-}(p_{1,0}, p_{0,0})} [ \hat{\delta }_{\theta }(\textbf{D}) ]\) and \(E_{\tilde{{\varvec{p}}}^{+}(p_{1,0}, p_{0,0})} [ \hat{\delta }_{\theta }(\textbf{D}) ]\) by \(\pi ^{-}_{\theta }(p_{1,0},p_{0,0})\) and \(\pi ^{+}_{\theta }(p_{1,0},p_{0,0})\), where

$$\begin{aligned} \pi ^{-}_{\theta }(p_{1,0},p_{0,0})\equiv & {} \frac{1}{S} \sum _{s=1}^S \hat{\delta }_{\theta }( \tilde{{\varvec{Y}}}^{-}_{1,s}(p_{1,0}), \tilde{{\varvec{Y}}}^{+}_{0,s}(p_{0,0})), \\ \pi ^{+}_{\theta }(p_{1,0},p_{0,0})\equiv & {} \frac{1}{S} \sum _{s=1}^S \hat{\delta }_{\theta }( \tilde{{\varvec{Y}}}^{+}_{1,s}(p_{1,0}), \tilde{{\varvec{Y}}}^{-}_{0,s}(p_{0,0})). \end{aligned}$$
3. Calculate the maximum regret \(\overline{R}(\theta )\) by maximizing over \(p_{1,0}, p_{0,0} \in [0,1]\), where

$$\begin{aligned} \overline{R}(\theta )\equiv & {} \max \left[ \max _{p_{1,0}>p_{0,0}} \left\{ (p_{1,0}-p_{0,0}) \left( 1 - \pi ^{-}_{\theta }(p_{1,0},p_{0,0}) \right) \right\} , \right. \\{} & {} \left. \max _{p_{1,0}<p_{0,0}} \left\{ (p_{0,0}-p_{1,0}) \pi ^{+}_{\theta }(p_{1,0},p_{0,0}) \right\} \right] . \end{aligned}$$
4. Minimize \(\overline{R}(\theta )\) over \(\theta > 0\) to obtain the optimal bandwidth \(\theta ^{*}\).

This algorithm is attractive from a computational perspective. Computing the exact minimax regret rule is often challenging in the context of statistical treatment choice: when the sample size is large, calculating the maximum regret directly requires solving an optimization problem of high dimension. Theorem 1 reduces this to a maximization over only two parameters, so the maximum regret can be calculated with ease. In the next section, we compute \(\theta ^{*}\) using this algorithm.
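For concreteness, the following is a minimal Monte Carlo sketch of Steps 1-4, reusing `delta_theta`, `p_minus`, `p_plus`, and the illustrative `X1`, `X0`, `C` from the sketches above; the grid sizes and the value of \(S\) are arbitrary illustrative choices, and the loops are left unvectorized for readability.

```python
import numpy as np

def max_regret_binary(theta, X1, X0, C, U1, U0, p_grid):
    """Approximate max_p R(p, delta_theta) via Theorem 1 and simulation."""
    S = U1.shape[0]
    worst = 0.0
    for p10 in p_grid:
        for p00 in p_grid:
            if p10 == p00:
                continue
            if p10 > p00:
                # least favorable case: treated means pulled down, control means pushed up
                q1, q0 = p_minus(p10, X1, C)[1:], p_plus(p00, X0, C)[1:]
            else:
                # least favorable case: treated means pushed up, control means pulled down
                q1, q0 = p_plus(p10, X1, C)[1:], p_minus(p00, X0, C)[1:]
            # Step 2: simulated probability of treating under the least favorable parameters
            pi = np.mean([delta_theta((U1[s] <= q1).astype(int),
                                      (U0[s] <= q0).astype(int),
                                      X1, X0, theta) for s in range(S)])
            regret = (p10 - p00) * (1.0 - pi) if p10 > p00 else (p00 - p10) * pi
            worst = max(worst, regret)
    return worst

# Step 1: common uniform draws, shared across all values of (p10, p00, theta)
rng = np.random.default_rng(0)
S = 200                                     # illustrative; use a larger S in practice
U1 = rng.uniform(size=(S, len(X1)))
U0 = rng.uniform(size=(S, len(X0)))

# Steps 3 and 4: evaluate the simulated maximum regret on a grid of theta and minimize
p_grid = np.linspace(0.0, 1.0, 21)
theta_grid = np.linspace(0.1, 2.0, 20)
R_bar = [max_regret_binary(t, X1, X0, C, U1, U0, p_grid) for t in theta_grid]
theta_star = theta_grid[int(np.argmin(R_bar))]
```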

Remark 1

As in Ishihara and Kitagawa (2021), when the Lipschitz constant C is unknown, it is unclear how to select it in a theoretically justified, data-driven manner. However, Ishihara and Kitagawa (2021) and Yata (2021) propose practical choices of C. In the empirical application of Ishihara and Kitagawa (2021), C is selected by leave-one-out cross-validation; Yata (2021) estimates a lower bound on C using the derivative of the conditional mean function. Both methods can also be applied in our setting.

Remark 2

When C is sufficiently large, the maximum regret does not depend on the bandwidth. If the covariate values \(X_{1,0}, X_{1,1}, \ldots , X_{1,n_1}\) are pairwise distinct and the covariate values \(X_{0,0}, X_{0,1}, \ldots , X_{0,n_0}\) are pairwise distinct, then for large enough C, the parameter space \(\mathcal {P}\) becomes \([0,1]^{n_1+1} \times [0,1]^{n_0+1}\). From Theorem 1, if \(p_{1,0} > p_{0,0}\), then the regret of \(\hat{\delta }_{\theta }\) is maximized at

$$\begin{aligned} {\varvec{p}}_1 = (p_{1,0},0,\ldots ,0)' \ \ \text {and} \ \ {\varvec{p}}_0 = (p_{0,0},1,\ldots ,1)'. \end{aligned}$$

Similarly, if \(p_{1,0} < p_{0,0}\), then the regret of \(\hat{\delta }_{\theta }\) is maximized at \({\varvec{p}}_1 = (p_{1,0},1,\ldots ,1)'\) and \({\varvec{p}}_0 = (p_{0,0},0,\ldots ,0)'\). Under these parameter values, every observed outcome in one arm equals 0 and every observed outcome in the other arm equals 1 with probability one, so \(\hat{\delta }_{\theta }(\textbf{D})\) makes the wrong choice with probability one for any bandwidth. These results imply that

$$\begin{aligned} \max _{{\varvec{p}} \in \mathcal {P}} R({\varvec{p}},\hat{\delta }_{\theta }) \ = \ \max \left\{ \max _{p_{1,0} > p_{0,0}} (p_{1,0} - p_{0,0}), \max _{p_{1,0} < p_{0,0}} (p_{0,0} - p_{1,0}) \right\} \ = \ 1. \end{aligned}$$

Hence, the maximum regret does not depend on the bandwidth \(\theta\).

4 Numerical exercise

In this section, we perform a numerical exercise to compare the optimal bandwidth described in the previous section with the optimal bandwidth for a normally distributed outcome. Throughout this section, we set the pre-treatment covariate values to equidistant grid points on \([-1,1]\):

$$\begin{aligned} X_{1,i}= & {} -1 + \frac{2(i-1)}{n_1-1} \ \ \ \text {for}\, i = 1, \ldots , n_1, \\ X_{0,i}= & {} -1 + \frac{2(i-1)}{n_0-1} \ \ \ \text {for}\, i = 1, \ldots , n_0, \end{aligned}$$

with \(n_1=n_0=n/2\). We consider the kernel regression class \(\mathcal {D}\) defined in (5), and use the Gaussian kernel as the kernel function. We calculate the optimal bandwidth \(\theta ^{*}\) using our proposed method.
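The design in this section can be reproduced with the following sketch, assuming a scalar covariate; the values of n and C are illustrative, and \(\theta ^{*}\) is then obtained exactly as in the sketch of Sect. 3.2 using `max_regret_binary`.

```python
import numpy as np

n, C = 20, 0.5                              # illustrative design parameters
n1 = n0 = n // 2
X1 = -1.0 + 2.0 * np.arange(n1) / (n1 - 1)  # X_{1,i} = -1 + 2(i-1)/(n1-1)
X0 = -1.0 + 2.0 * np.arange(n0) / (n0 - 1)  # X_{0,i} = -1 + 2(i-1)/(n0-1)
```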

We compare our method with the bandwidth selection that minimizes the maximum regret when the outcome variable is normally distributed. Suppose that \(Y_{1,i} \sim N(p_{1,i}, 0.5^2)\) and \(Y_{0,i} \sim N(p_{0,i}, 0.5^2)\). Then, we set the parameter space to be \(\mathcal {P}^{N} \equiv \mathcal {P}^N_1 \times \mathcal {P}_0^N\), where

$$\begin{aligned} \mathcal {P}_1^N\equiv & {} \{{\varvec{p}}_1 \in \mathbb {R}^{n_1+1}: |p_{1,i} - p_{1,j}| \le C \Vert X_{1,i} - X_{1,j} \Vert \ \text {for all} \ i,j \}, \\ \mathcal {P}_0^N\equiv & {} \{{\varvec{p}}_0 \in \mathbb {R}^{n_0+1}: |p_{0,i} - p_{0,j}| \le C \Vert X_{0,i} - X_{0,j} \Vert \ \text {for all} \ i,j \}. \end{aligned}$$

Define \(w_{1,i,\theta } \equiv K(X_{1,i}/\theta ) / \sum _{j=1}^{n_1} K(X_{1,j}/\theta )\) and \(w_{0,i,\theta } \equiv K(X_{0,i}/\theta ) / \sum _{j=1}^{n_0} K(X_{0,j}/\theta )\). Using the results of Ishihara and Kitagawa (2021), the maximum regret can be expressed as follows:

$$\begin{aligned} \max _{{\varvec{p}} \in \mathcal {P}^N} R({\varvec{p}},\hat{\delta }_{\theta }) \ = \ s(\theta ) \eta \left( \frac{b(\theta )}{s(\theta )} \right) , \end{aligned}$$
(8)

where \(\eta (a) \equiv \max _{t>0} \{ t \cdot \Phi (-t+a) \}\) with \(\Phi (\cdot )\) being the N(0, 1) distribution function, \(s(\theta ) \equiv 0.5 \sqrt{\sum _{i=1}^{n_1} w_{1,i,\theta }^2 + \sum _{i=1}^{n_0} w_{0,i,\theta }^2}\), and

$$\begin{aligned} b(\theta ) \equiv C \left( \sum _{i=1}^{n_1} w_{1,i,\theta } \Vert X_{1,i}\Vert + \sum _{i=1}^{n_0} w_{0,i,\theta } \Vert X_{0,i}\Vert \right) . \end{aligned}$$

Appendix 2 provides the details of the derivation. Using this result, we calculate the bandwidth that minimizes the maximum regret when the outcome is normally distributed.
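A sketch of this normal benchmark, reusing `gauss_kernel` and the grid `X1`, `X0`, `C` from the sketches above; evaluating \(\eta\) by a one-dimensional grid search is our illustrative choice, not the procedure of Ishihara and Kitagawa (2021).

```python
import numpy as np
from scipy.stats import norm

def eta(a, t_grid=np.linspace(1e-3, 10.0, 2000)):
    """eta(a) = max_{t > 0} t * Phi(-t + a), approximated on a grid of t."""
    return np.max(t_grid * norm.cdf(-t_grid + a))

def max_regret_normal(theta, X1, X0, C, sigma=0.5):
    """Maximum regret (8) for N(p, sigma^2) outcomes under rule (5)."""
    w1 = gauss_kernel(X1 / theta); w1 = w1 / w1.sum()   # w_{1,i,theta}
    w0 = gauss_kernel(X0 / theta); w0 = w0 / w0.sum()   # w_{0,i,theta}
    s = sigma * np.sqrt(np.sum(w1 ** 2) + np.sum(w0 ** 2))
    b = C * (np.sum(w1 * np.abs(X1)) + np.sum(w0 * np.abs(X0)))
    return s * eta(b / s)

theta_grid = np.linspace(0.1, 2.0, 20)
theta_star_normal = theta_grid[int(np.argmin(
    [max_regret_normal(t, X1, X0, C) for t in theta_grid]))]
```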

Table 1 Bandwidth choices

Table 1 lists the optimal bandwidth choices in the binary and normal cases. In the binary case, we calculate the optimal bandwidth using the algorithm of Sect. 3.2. In the normal case, we calculate the bandwidth that minimizes (8), that is, the bandwidth choice proposed by Ishihara and Kitagawa (2021).

In many cases, the optimal bandwidth decreases as C or n increases. When n is large, the optimal bandwidth with a binary outcome is close to that for a normally distributed outcome; this reflects the asymptotic normality of the kernel regression estimator. In the binary case, the maximum regret (7) is a step function of \(\theta\), so its minimizer is a range of bandwidths rather than a single point; Table 1 therefore reports the optimal bandwidth range when \(n=10\). As n increases, this step function approaches a continuous function of \(\theta\). When n is small, the optimal bandwidth with a binary outcome can be significantly different from that for a normally distributed outcome.

5 Conclusion

This study investigated the problem of determining whether to treat individuals based on observed covariates. The standard approach to this problem is to use a plug-in rule that determines the treatment based on the sign of an estimate of the CATE. We focused on statistical treatment rules based on nonparametric kernel regression. When the outcome variable is normally distributed, Ishihara and Kitagawa (2021) showed that the maximum regret can be calculated easily. This study demonstrated that, when the outcome variable is binary, the maximum regret can be calculated by maximizing over two parameters. Using these results, we proposed a method for selecting the optimal bandwidth in the case of a binary outcome. In addition, we performed a numerical exercise to compare the optimal bandwidth choices for binary and normally distributed outcomes.