## Abstract

The statistical decision theory pioneered by Wald (Statistical decision functions, Wiley, 1950) has used state-dependent mean loss (risk) to measure the performance of statistical decision functions across potential samples. We think it evident that evaluation of performance should respect stochastic dominance, but we do not see a compelling reason to focus exclusively on mean loss. We think it instructive to also measure performance by other functionals that respect stochastic dominance, such as quantiles of the distribution of loss. This paper develops general principles and illustrative applications for statistical decision theory respecting stochastic dominance. We modify the Wald definition of admissibility to an analogous concept of stochastic dominance (SD) admissibility, which uses stochastic dominance rather than mean sampling performance to compare alternative decision rules. We study SD admissibility in two relatively simple classes of decision problems that arise in treatment choice. We reevaluate the relationship between the MLE, James–Stein, and James–Stein positive part estimators from the perspective of SD admissibility. We consider alternative criteria for choice among SD-admissible rules. We juxtapose traditional criteria based on risk, regret, or Bayes risk with analogous ones based on quantiles of state-dependent sampling distributions or the Bayes distribution of loss.

## 1 Introduction

Wald (1950) considered the broad problem of using sample data to make decisions under uncertainty. He posed the task as choice of a statistical decision function (a rule, for short), which maps potentially available data into a choice among the feasible actions. He recommended ex ante evaluation of statistical decision functions as procedures, chosen prior to realization of the data, specifying how a decision maker would use whatever data may be realized. Expressing the objective as minimization of loss, he proposed that the decision maker evaluate a rule by its mean performance across potential samples, which he termed risk.

In the presence of uncertainty about the loss function and the sampling process yielding the data, Wald prescribed a three-step decision process. The first stage specifies the state space (parameter space), which indexes the loss functions and sampling distributions that the decision maker deems possible. The second stage eliminates inadmissible rules. A rule is inadmissible (weakly dominated) if there exists another one that yields at least as good mean sampling performance in every possible state of nature and strictly better mean performance in some state. The third stage uses some criterion to choose an admissible rule. Wald studied the minimax criterion when the decision maker places no subjective probability distribution on the state space and minimization of Bayes risk (the subjective mean of risk across states) when such a distribution is present.

In many respects, the Wald framework has breathtaking generality. It enables comparison of all statistical decision functions whose risk is well-defined in each possible state. It applies whatever the sampling process and sample size may be. It applies whatever information the decision maker may have about the loss function and the sampling process. The state space may be finite dimensional (parametric) or larger (nonparametric). The true state of nature may be point or partially identified.

A striking exception to the generality of the Wald framework is its use of mean loss to measure the probabilistic performance of alternative rules. Risk is state-dependent mean loss across potential samples and Bayes risk is overall mean loss across samples and states when a subjective distribution is placed on the state space. The literature on statistical decision theory has followed Wald in measuring sampling and overall performance by risk and Bayes risk. See, for example, the texts of Ferguson (1967) and Berger (1985).

We cannot be sure why statistical decision theory has exclusively used mean loss to measure the performance of statistical decision functions, but we can conjecture. One reason may have been the predisposition of statisticians in the mid-twentieth century to use the mean to express the central tendency of probability distributions rather than the median or other location parameters; see Huber (1981) for an interesting discussion. Another reason may have been the influence of the von Neumann and Morgenstern (1944) and Savage (1954) axiomatic derivations of maximization of expected utility, which have often been interpreted as providing rationales to favor this decision criterion over others. Yet subsequent developments in axiomatic decision theory have called into question whether the axioms that yield expected utility maximization are as compelling as they once seemed. See, for example, Binmore (2009).

Considering the matter afresh, we think it evident that evaluation of the probabilistic performance of statistical decision functions should respect stochastic dominance. However, we do not see a compelling reason to focus exclusively on mean loss. We think it instructive to measure probabilistic performance by various functionals that respect stochastic dominance. These include the means of increasing functions of loss and quantiles of the distribution of loss.

This paper develops general principles and illustrative applications for statistical decision theory respecting stochastic dominance. The general principles are introduced in Sect. 2. We modify the Wald definition of admissibility to an analogous concept of stochastic dominance (SD) admissibility, which uses stochastic dominance rather than mean sampling performance to compare alternative statistical decision functions. We cite representation theorems that characterize stochastic dominance in terms of inequalities ordering the means of increasing functions and the quantiles of two probability distributions. These theorems yield alternative characterizations of SD-admissibility.

Sections 3 and 4 apply the general principles to particular classes of decision problems. Section 3 considers the special case of state-dependent binary loss, where the loss function takes only two values in each state. We show that, when the loss function has this form, state-dependent error probabilities are a sufficient statistic for sampling performance and SD admissibility is equivalent to mean admissibility.

An important application occurs in decision problems where a planner uses sample data to inform choice of one of two treatments to assign to a population of persons. It has been common in medical and other settings to use experimental or observational data on treatment response to test the superiority of one treatment relative to the other and to use the test result to make a treatment choice. In this setting, every rule assigning all members of the population to a single treatment is characterizable as performance of a hypothesis test. We show that the use of error probabilities to determine the admissibility of test rules differs from its traditional use in the theory of hypothesis testing.

Section 4 studies a class of decision problems in which SD and mean admissibility do not coincide. These are problems in which the set of feasible actions is ordered and the sampling process, which generates real-valued data, satisfies the monotone likelihood ratio property. Analysis of mean admissibility in this setting dates back to Karlin and Rubin (1956). Here, we study SD admissibility. Possible applications occur when choosing the dose of a real-valued treatment for a population given real-valued sample data that are informative about dose response.

Section 5 revisits the Stein phenomenon of mean inadmissibility of the maximum likelihood estimator (MLE) of a multivariate normal mean of dimension greater than or equal to three when the loss function is the component-wise sum of squared losses. We re-evaluate the relationship between the MLE, James–Stein, and James–Stein positive part estimators from the perspective of SD admissibility.

Section 6 considers alternative criteria for choice among SD-admissible actions. We juxtapose traditional criteria based on risk, regret, or Bayes risk with analogous ones based on quantiles of state-dependent sampling distributions or the Bayes distribution of loss. We show how mean and quantile criteria differ when applied to choice of a test rule.

## 2 General principles

Section 2.1 reviews the concepts of Wald’s statistical decision theory. Section 2.2 generalizes these concepts to make stochastic dominance rather than risk the basic quantity used to evaluate the performance of statistical decision functions. Section 2.3 uses two representation theorems for stochastic dominance to characterize SD-admissibility by classes of inequalities that order the means of increasing functions and the quantiles of loss.

### 2.1 Concepts of the Wald theory

Wald’s statistical decision theory begins with specification of a state space \(S\), a set of feasible decisions (or actions) \(D\), and a loss function \(L\left(\cdot ,\cdot \right):S\times D\to \left[0,\infty \right)\) specifying the loss incurred by each feasible action in each possible state. The ideal objective is to minimize loss in the true state. Given that the true state is unknown, the ideal objective is sure to be achievable only if there exists an action that uniformly minimizes loss in all states of \(S\). Wald’s practical objective is to prescribe reasonable decision rules when no such action exists.

The adjective “statistical” describes statistical decision theory because Wald assumes that a state-dependent sampling distribution \({Q}_{s}\) generates data whose value, say \(\psi\), lies in a known sample space \(\Psi\). He supposes that the decision maker observes \(\psi\) and knows the vector \(\left({Q}_{s}, s \in S\right)\) of state-dependent sampling distributions. In this setting, a statistical decision function \(\delta \left(\cdot \right):\Psi \to D\) is any \(\Psi\)-measurable function that maps the data into an action. Let \(\Delta\) denote the space of feasible rules.

Research in statistical decision theory often finds it useful to consider randomized rules that map the data into a specified probability distribution on \(D\) rather than into a specific action. Consideration of randomized rules does not require alteration of the definition of \(\delta\). One may define the sample space and the state-dependent sampling distributions to include a white-noise component used to randomly choose an action.

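To measure the performance of a candidate rule \(\delta\), Wald focuses on the state-dependent mean loss (risk) that it generates across potential samples; that is,

\[R\left(s,\delta \right)\equiv \int L\left[s,\delta \left(\psi \right)\right]d{Q}_{s}\left(\psi \right). \tag{1}\]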

Risk is computable in principle, although computation may be difficult in practice. Supposing that computation of risk is tractable, Wald recommends use of the vector \([R(s, \delta ), s\in S]\) of state-dependent risks to measure the performance of \(\delta\) across potential samples and to compare \(\delta\) with other rules.

To begin, rule \(\delta\) is deemed better than rule \(\delta ^{\prime}\) if \(R(s, \delta ) \le R(s, \delta ^{\prime})\) for all \(s\in S\) and \(R(s, \delta ) < R(s, \delta ^{\prime})\) for some \(s\). If there exists a \(\delta\) that is better than \(\delta ^{\prime}\) in this sense, then \(\delta ^{\prime}\) is said to be inadmissible and should be eliminated from further consideration. A rule that is not inadmissible is called admissible.

Going a bit further, a decision maker can eliminate an admissible rule when there exists a risk-equivalent rule that is retained for consideration. Rules \(\delta\) and \(\delta ^{\prime}\) are risk-equivalent if \(R(s, \delta ) = R(s, \delta ^{\prime})\) for all \(s\in S\). When multiple admissible rules are risk-equivalent, a decision maker who uses risk to evaluate sampling performance can eliminate all but one of them without consequence.
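The elimination step can be sketched in a few lines for the special case of finitely many rules and states. The risk vectors and rule names below are hypothetical numbers chosen for illustration; the sketch does not handle risk-equivalent rules, which it simply retains.

```python
from typing import Dict, List, Sequence


def mean_dominates(r: Sequence[float], r_prime: Sequence[float]) -> bool:
    """True if risk vector r is weakly below r_prime in every state
    and strictly below it in at least one state."""
    pairs = list(zip(r, r_prime))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def admissible_rules(risks: Dict[str, Sequence[float]]) -> List[str]:
    """Names of rules whose risk vector is not dominated by any other rule."""
    return [
        name
        for name, r in risks.items()
        if not any(
            mean_dominates(other, r)
            for other_name, other in risks.items()
            if other_name != name
        )
    ]


# Hypothetical risk vectors over two states (illustrative numbers only).
risks = {
    "rule_1": [1.0, 2.0],
    "rule_2": [1.0, 3.0],  # dominated by rule_1
    "rule_3": [2.0, 1.0],  # unordered relative to rule_1
}
print(admissible_rules(risks))
```

Here `rule_2` is eliminated, while `rule_1` and `rule_3` both survive because neither risk vector lies weakly below the other.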

Having eliminated all inadmissible rules and perhaps some admissible rules within risk-equivalent groups of rules, the decision maker’s problem is to choose among the subset of rules that remain, say \({\Delta }_{a}\). It is possible in principle that \({\Delta }_{a}\) may be empty, but applications of the Wald theory typically have enough regularity to ensure not only that \({\Delta }_{a}\) is non-empty but that every inadmissible decision function is dominated by an admissible one.

Whereas elimination of inadmissible and risk-equivalent admissible rules is uncontroversial, there is no consensus on choice within \({\Delta }_{a}\), which requires comparison of rules whose risk vectors are unordered. Wald studied minimization of Bayes risk when the decision maker places a subjective probability distribution, say \(\uppi\), on the state space. This criterion solves the problem

\[\min_{\delta \in \Delta }\int R\left(s,\delta \right)d\uppi \left(s\right). \tag{2}\]

Wald suggested the minimax criterion in the absence of a subjective distribution on \(S\), stating (p. 18) “a minimax solution seems, in general, to be a reasonable solution of the decision problem when an a priori distribution.... does not exist or is unknown to the experimenter.” This criterion solves

\[\min_{\delta \in \Delta }\ \max_{s\in S}\ R\left(s,\delta \right). \tag{3}\]

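Viewing the minimax criterion as unduly conservative, Savage (1951) suggested the minimax-regret criterion in his review essay on Wald (1950). This criterion solves

\[\min_{\delta \in \Delta }\ \max_{s\in S}\left[R\left(s,\delta \right)-\min_{d\in D}L\left(s,d\right)\right]. \tag{4}\]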

Subsequent research applying statistical decision theory has generally used criterion (2), (3), or (4).
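To see how the three criteria can disagree, the following sketch applies each of them to a small table of risks. All numbers, the rule names, and the prior are hypothetical. Regret is taken to be risk minus the smallest loss attainable in the state, which is the usual definition and is assumed here; the vector `best_loss` supplies those state-specific minima directly.

```python
from typing import Dict, Sequence


def bayes_choice(risks: Dict[str, Sequence[float]], prior: Sequence[float]) -> str:
    """Rule minimizing the prior-weighted mean of state-dependent risk."""
    return min(risks, key=lambda d: sum(p * r for p, r in zip(prior, risks[d])))


def minimax_choice(risks: Dict[str, Sequence[float]]) -> str:
    """Rule minimizing maximum risk across states."""
    return min(risks, key=lambda d: max(risks[d]))


def minimax_regret_choice(
    risks: Dict[str, Sequence[float]], best_loss: Sequence[float]
) -> str:
    """Rule minimizing maximum regret, where regret in a state is
    risk minus the smallest loss attainable in that state."""
    return min(risks, key=lambda d: max(r - b for r, b in zip(risks[d], best_loss)))


# Hypothetical risks of three rules in two states.
risks = {"A": [0.0, 4.0], "B": [3.0, 3.0], "C": [1.0, 3.5]}
prior = [0.5, 0.5]       # assumed subjective distribution on the two states
best_loss = [0.0, 2.0]   # assumed smallest attainable loss in each state

print(bayes_choice(risks, prior))               # lowest Bayes risk
print(minimax_choice(risks))                    # lowest maximum risk
print(minimax_regret_choice(risks, best_loss))  # lowest maximum regret
```

In this constructed example each criterion selects a different rule, and all three rules are admissible because their risk vectors are unordered.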

It often is difficult to determine the set of admissible rules. Given this, researchers applying the Wald theory commonly skip the step of determining admissibility and use a decision criterion to choose among all feasible options, not just those that are admissible. When any of the criteria listed above yields a unique choice, it necessarily is admissible. When a criterion yields a set of equally good choices, the set may include inadmissible options that are strictly dominated only in states that do not affect the value of the optimum. Bayes risk is unaffected by values of risk that occur off the \(\pi\)-support of \(S\). Maximum risk and regret are unaffected by dominance in states that do not determine the maximum.

### 2.2 Respect for stochastic dominance

The new work of this paper begins with the observation that the basic probabilistic quantity underlying statistical decision theory is not risk but rather the state-dependent distribution of loss that a decision function generates across potential samples; that is, \({Q}_{s}\left\{L\left[s,\delta \left(\psi \right)\right]\right\}\). The expectation (risk) is but one of many potentially relevant features of this distribution.

State-dependent distributions of loss are computable in principle. Supposing that computation is tractable, we think it natural to generalize the Wald theory by recommending use of the vector \(\left({Q}_{s}\left\{L\left[s,\delta \left(\psi \right)\right]\right\},s \in S\right)\) to measure the performance of \(\delta\) across potential samples. It is also natural to recommend that evaluation of the performance of alternative statistical decision functions should respect stochastic dominance. This recommendation has many precedents in studies of decision making that are not explicitly concerned with use of sample data. See, for example, Quirk and Saposnik (1962), Hadar and Russell (1969), Hanoch and Levy (1969), and Manski (1988).

Let \(P{\ge }_{sd}{P}^{\prime}\) denote that distribution \(P\) either equals or stochastically dominates \(P^{\prime}\), and let \(P{>}_{sd}{P}^{\prime}\) denote that \(P\) stochastically dominates \(P^{\prime}\). When considering state-dependent sampling performance, respect for stochastic dominance means that one should prefer rule \(\delta\) to \(\delta ^{\prime}\) if \({Q}_{s}\left\{L\left[s,{\delta }^{\prime}\left(\psi \right)\right]\right\}{\ge }_{sd}{Q}_{s}\left\{L\left[s,\delta \left(\psi \right)\right]\right\}\) for all \(s \in S\) and \({Q}_{s}\left\{L\left[s,{\delta }^{\prime}\left(\psi \right)\right]\right\}{>}_{sd}{Q}_{s}\left\{L\left[s,\delta \left(\psi \right)\right]\right\}\) for some \(s\). In this case we say that \(\delta\) is *SD-preferred* to \(\delta ^{\prime}\). We will adapt Wald’s definition of (mean) admissibility and say that \(\delta ^{\prime}\) is stochastic-dominance inadmissible (*SD-inadmissible*) if these conditions hold.
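For loss distributions with finite support, the comparison just described can be checked by comparing distribution functions point by point. The sketch below assumes first-order stochastic dominance, represents each state-dependent loss distribution as a dictionary from loss values to probabilities, and uses hypothetical numbers throughout.

```python
from typing import Dict, List

Dist = Dict[float, float]  # loss value -> probability


def cdf(dist: Dist, x: float) -> float:
    """Probability that loss is at most x."""
    return sum(p for v, p in dist.items() if v <= x)


def weakly_dominates(hi: Dist, lo: Dist, tol: float = 1e-12) -> bool:
    """hi equals or stochastically dominates lo:
    the CDF of hi is nowhere above the CDF of lo."""
    points = sorted(set(hi) | set(lo))
    return all(cdf(hi, x) <= cdf(lo, x) + tol for x in points)


def strictly_dominates(hi: Dist, lo: Dist, tol: float = 1e-12) -> bool:
    """hi weakly dominates lo and the two distributions differ."""
    points = sorted(set(hi) | set(lo))
    return weakly_dominates(hi, lo, tol) and any(
        cdf(hi, x) < cdf(lo, x) - tol for x in points
    )


def sd_preferred(loss_d: List[Dist], loss_d_prime: List[Dist]) -> bool:
    """delta is SD-preferred to delta' when the loss distribution of delta'
    weakly dominates that of delta in every state and strictly in some state."""
    pairs = list(zip(loss_d_prime, loss_d))
    return all(weakly_dominates(hi, lo) for hi, lo in pairs) and any(
        strictly_dominates(hi, lo) for hi, lo in pairs
    )


# Hypothetical loss distributions in two states (one dict per state).
loss_delta = [{0.0: 0.7, 1.0: 0.3}, {0.0: 0.4, 2.0: 0.6}]
loss_delta_prime = [{0.0: 0.5, 1.0: 0.5}, {0.0: 0.4, 2.0: 0.6}]

print(sd_preferred(loss_delta, loss_delta_prime))
print(sd_preferred(loss_delta_prime, loss_delta))
```

In the first state the alternative rule shifts probability toward the larger loss, and in the second state the two rules generate the same loss distribution, so the first rule is SD-preferred and the reverse comparison fails.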

When a decision maker places a subjective distribution on the state space, respect for stochastic dominance means that one should Bayes-SD prefer \(\delta\) to \(\delta ^{\prime}\) if the distribution of loss across samples and states generated by \(\delta ^{\prime}\) stochastically dominates that generated by \(\delta\). The distribution of loss under \(\delta\) is

\[{\Phi }_{\uppi }\left\{L\left[s,\delta \left(\psi \right)\right]\right\}\equiv \int {Q}_{s}\left\{L\left[s,\delta \left(\psi \right)\right]\right\}d\uppi \left(s\right). \tag{5}\]

We will adapt Wald’s definition of Bayes risk and say that \({\Phi }_{\uppi }\) is the Bayes loss distribution.

The fact that \({\Phi }_{\uppi }\) is the mean over \(S\) of the state-dependent loss distributions implies a connection between SD-preference and Bayes-SD preference. The following lemma follows immediately from (5):

### Lemma 1

*If rule* \(\delta\) *is SD-preferred to rule* \(\delta ^{\prime}\), *then* \({\Phi }_{\uppi }\left\{L\left[s,{\delta^ {\prime}}\left(\psi \right)\right]\right\}{{\ge }_{sd}\Phi }_{\uppi }\{L\left[s,\delta \left(\psi \right)\right]\}\). \(\square\)
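The step behind the lemma can be written out as follows. This is a sketch that takes \({\Phi }_{\uppi }\) to be the \(\uppi\)-mixture of the state-dependent loss distributions, as stated in the paragraph preceding the lemma, and that introduces the notation \({F}_{s}^{\delta }\left(t\right)\) (not used elsewhere in the paper) for the probability under \({Q}_{s}\) that the loss of \(\delta\) is at most \(t\). If \(\delta\) is SD-preferred to \(\delta ^{\prime}\), then \({F}_{s}^{\delta }\left(t\right)\ge {F}_{s}^{{\delta }^{\prime}}\left(t\right)\) for all \(s\in S\) and all real \(t\). Integrating both sides with respect to \(\uppi\) preserves the inequality:

\[\int {F}_{s}^{\delta }\left(t\right)d\uppi \left(s\right)\ \ge \ \int {F}_{s}^{{\delta }^{\prime}}\left(t\right)d\uppi \left(s\right)\quad \text{for all real } t.\]

The two sides are the distribution functions of the Bayes loss distributions of \(\delta\) and \(\delta ^{\prime}\), so the Bayes loss distribution of \(\delta ^{\prime}\) equals or stochastically dominates that of \(\delta\). Strict dominance need not carry over, because the states in which dominance is strict may have \(\uppi\)-probability zero; this is consistent with the lemma being stated with \({\ge }_{sd}\).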

### 2.3 Representation theorems relating SD-admissibility to mean and quantile loss

Respect for stochastic dominance does not require the decision maker to use any particular real functional of loss distributions to measure the performance of a decision function. Nevertheless, there exist useful representation theorems that characterize stochastic dominance in terms of two alternative classes of functionals, these being means of increasing functions of loss and quantiles of the distribution of loss.

#### 2.3.1 Means of increasing functions of loss

Let \(P\) and \(P^{\prime}\) denote two probability distributions on the real line. It has long been known that \(P = P^{\prime}\) if and only if \(\int f\left(y\right)dP\left(y\right)=\int f\left(y\right)dP^{\prime}\left(y\right)\) for every integrable increasing function \(f\left(\cdot \right)\). Several articles studying expected utility maximization when utility is an increasing function of income show that \(P\) stochastically dominates \(P^{\prime}\) if and only if \(\int f\left(y\right) dP\left(y\right)\ge \int f\left(y\right) dP^{\prime}\left(y\right)\) for every integrable increasing function \(f\left(\cdot \right)\) and \(\int f\left(y\right) dP\left(y\right)>\int f\left(y\right) dP^{\prime}\left(y\right)\) for some increasing \(f\left(\cdot \right)\). See Quirk and Saposnik (1962), Hadar and Russell (1969), and Hanoch and Levy (1969). This representation theorem immediately yields a characterization of SD-inadmissibility:

### Lemma 2

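*Rule* \(\delta ^{\prime}\) *is SD-inadmissible if and only if there exists a rule* \(\delta\) *such that*

\[\int f\left\{L\left[s,\delta \left(\psi \right)\right]\right\}d{Q}_{s}\left(\psi \right)\le \int f\left\{L\left[s,{\delta }^{\prime}\left(\psi \right)\right]\right\}d{Q}_{s}\left(\psi \right) \tag{6}\]

*for all* \(s\in S\) *and all integrable increasing functions* \(f\left(\cdot \right)\), *with strict inequality for some* \(s\) *and some* \(f\left(\cdot \right)\).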

\(\square\)

#### 2.3.2 Quantiles of loss

Let \(P\) denote a probability distribution on the real line. For \(\lambda \in (0, 1)\), we will denote by \(V_{\lambda }\left(P\right)\) the \(\lambda\)-quantile of \(P\). It has long been known that \(P=P^{\prime}\) if and only if \(V_{\lambda }\left(P\right)=V_{\lambda }\left(P^{\prime}\right)\) for all \(\lambda \in (0, 1)\). Levy and Kroll (1978) show that \(P^{\prime}\) stochastically dominates \(P\) if and only if \(V_{\lambda }\left(P^{\prime}\right)\ge V_{\lambda }\left(P\right)\) for all \(\lambda \in (0, 1)\) and \(V_{\lambda }\left(P^{\prime}\right)>V_{\lambda }\left(P\right)\) for some \(\lambda \in (0, 1)\). This representation theorem immediately yields another characterization of SD-inadmissibility.

### Lemma 3

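*Rule* \(\delta ^{\prime}\) *is SD-inadmissible if and only if there exists a rule* \(\delta\) *such that*

\[V_{\lambda }\left({Q}_{s}\left\{L\left[s,\delta \left(\psi \right)\right]\right\}\right)\le V_{\lambda }\left({Q}_{s}\left\{L\left[s,{\delta }^{\prime}\left(\psi \right)\right]\right\}\right) \tag{7}\]

*for all* \(s\in S\) *and all* \(\lambda \in (0, 1)\), *with strict inequality for some* \(s\) *and some* \(\lambda\).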

\(\square\)
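The quantile characterization can be checked exactly when loss distributions have finite support. The sketch below assumes that the \(\lambda\)-quantile is the smallest value whose cumulative probability is at least \(\lambda\). Under that convention each quantile function is a step function, so it suffices to compare quantiles at one interior point of every interval of \((0, 1)\) on which both step functions are constant. The distributions are hypothetical and use exact fractions to avoid rounding.

```python
from fractions import Fraction as Fr
from typing import Dict, List

Dist = Dict[int, Fr]  # loss value -> probability


def quantile(dist: Dist, lam: Fr) -> int:
    """Smallest x with P(loss <= x) >= lam (assumed quantile convention)."""
    acc = Fr(0)
    for x in sorted(dist):
        acc += dist[x]
        if acc >= lam:
            return x
    return max(dist)


def check_points(p: Dist, q: Dist) -> List[Fr]:
    """Midpoints of the intervals of (0, 1) on which both quantile
    functions are constant; comparing there covers every lambda."""
    cuts = {Fr(0), Fr(1)}
    for dist in (p, q):
        acc = Fr(0)
        for x in sorted(dist):
            acc += dist[x]
            cuts.add(acc)
    ordered = sorted(cuts)
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


def quantile_dominates(hi: Dist, lo: Dist) -> bool:
    """Every quantile of hi is at least that of lo, and some quantile is larger."""
    pts = check_points(hi, lo)
    weak = all(quantile(hi, t) >= quantile(lo, t) for t in pts)
    strict = any(quantile(hi, t) > quantile(lo, t) for t in pts)
    return weak and strict


# Hypothetical loss distributions of two rules in a single state.
loss_delta = {0: Fr(7, 10), 1: Fr(3, 10)}
loss_delta_prime = {0: Fr(1, 2), 1: Fr(1, 2)}

print(quantile_dominates(loss_delta_prime, loss_delta))
print(quantile_dominates(loss_delta, loss_delta_prime))
```

In this example the two rules have the same median loss, yet the quantiles of the alternative rule are strictly larger for \(\lambda\) between 1/2 and 7/10, so no single quantile suffices to detect the dominance.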

## 3 State-dependent binary loss

SD-inadmissibility and mean inadmissibility are equivalent to one another when the loss function has a special form that occurs in some important applications. Partition the set \(D\) of feasible actions into two mutually exclusive subsets, say \({D}_{sa}\) and \({D}_{sb}\), in each state \(s \in S\). Let \({L}_{sa}\) and \({L}_{sb}\) be a specified pair of state-dependent real numbers. The loss function has the state-dependent binary form if

\[L\left(s,d\right)=\begin{cases}{L}_{sa} & \text{if } d\in {D}_{sa},\\ {L}_{sb} & \text{if } d\in {D}_{sb}.\end{cases} \tag{8}\]

Section 3.1 develops the basic finding. Section 3.2 applies it to choice between two treatments.

### 3.1 Using error probabilities to characterize SD and mean admissibility

When loss has form (8), its state-dependent sampling distribution is determined by the state-dependent choice probabilities \({Q}_{s}\left[\delta \left(\psi \right)\in {D}_{sa}\right]\) and \({Q}_{s}\left[\delta \left(\psi \right)\in {D}_{sb}\right]\), which sum to one. Rule \(\delta\) has a state-dependent distribution placing mass \({Q}_{s}\left[\delta \left(\psi \right)\in {D}_{sa}\right]\) at \({L}_{sa}\) and mass \({Q}_{s}\left[\delta \left(\psi \right)\in {D}_{sb}\right]\) at \({L}_{sb}\). Hence, rule \(\delta\) is SD-preferred to \(\delta ^{\prime}\) if and only if \(\delta\) places weakly more mass than \(\delta ^{\prime}\) at \(\mathrm{min}\left({L}_{sa},{L}_{sb}\right)\) in every state and strictly more mass in some state where \({L}_{sa}\ne {L}_{sb}\).

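A succinct way to express SD-preference is to define the state-dependent probability \({\rho }_{s}\left(\delta \right)\) that \(\delta\) yields an error, choosing the action with larger loss rather than the one with smaller loss. An error is logically impossible when \({L}_{sa}={L}_{sb}\), so \({\rho }_{s}\left(\delta \right)=0\) in these states. In states with \({L}_{sa}\ne {L}_{sb}\),

\[{\rho }_{s}\left(\delta \right)=\begin{cases}{Q}_{s}\left[\delta \left(\psi \right)\in {D}_{sa}\right] & \text{if } {L}_{sa}>{L}_{sb},\\ {Q}_{s}\left[\delta \left(\psi \right)\in {D}_{sb}\right] & \text{if } {L}_{sa}<{L}_{sb}.\end{cases} \tag{9}\]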

With this definition of error probabilities, we obtain a simple characterization of SD-inadmissibility.

### Lemma 4

*Let the loss function have form* (8). *Then rule* \(\delta ^{\prime}\) *is SD-inadmissible if and only if there exists another rule* \(\delta\) *such that* \({\rho }_{s}\left(\delta \right)\le {\rho }_{s}\left({\delta }^{{\prime}}\right)\) *for all* \(s \in S\) *and* \({\rho }_{s}\left(\delta \right)<{\rho }_{s}\left({\delta }^{{\prime}}\right)\) *for some* \(s\). \(\square\)

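Error probabilities also characterize mean admissibility. Given a loss function of form (8), mean loss in state \(s\) (risk) is

\[R\left(s,\delta \right)=\mathrm{min}\left({L}_{sa},{L}_{sb}\right)+{\rho }_{s}\left(\delta \right)\cdot \left|{L}_{sa}-{L}_{sb}\right|. \tag{10}\]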

This yields a parallel characterization of mean inadmissibility.

### Lemma 5

*Let the loss function have form* (8). *Then rule* \(\delta ^{\prime}\) *is mean inadmissible if and only if there exists another rule* \(\delta\) *such that* \({\rho }_{s}\left(\delta \right)\le {\rho }_{s}\left({\delta }^{{\prime}}\right)\) *for all* \(s \in S\) *and* \({\rho }_{s}\left(\delta \right)<{\rho }_{s}\left({\delta }^{{\prime}}\right)\) *for some* \(s\). \(\square\)
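The reason Lemmas 4 and 5 share the same condition is that, with two-valued loss, risk in each state is an increasing affine function of the error probability. The sketch below checks this numerically by computing risk directly from the choice probability and again from the error probability. The loss values and choice probabilities are hypothetical, and `q_a` stands for the probability that the rule chooses an action in the first subset of the partition.

```python
from math import isclose


def error_prob(q_a: float, loss_a: float, loss_b: float) -> float:
    """Probability of choosing the subset of actions with the larger loss.
    q_a is the probability that the rule chooses an action in the first subset."""
    if loss_a == loss_b:
        return 0.0  # no error is possible when the two losses coincide
    return q_a if loss_a > loss_b else 1.0 - q_a


def risk(q_a: float, loss_a: float, loss_b: float) -> float:
    """Mean loss computed directly from the two-point loss distribution."""
    return q_a * loss_a + (1.0 - q_a) * loss_b


def risk_from_error(rho: float, loss_a: float, loss_b: float) -> float:
    """Mean loss written as smaller loss plus error probability times the loss gap."""
    return min(loss_a, loss_b) + rho * abs(loss_a - loss_b)


# Hypothetical states: (loss_a, loss_b, q_a) in each state.
states = [(0.0, 1.0, 0.8), (2.0, 0.5, 0.3), (1.0, 1.0, 0.6)]

for loss_a, loss_b, q_a in states:
    rho = error_prob(q_a, loss_a, loss_b)
    direct = risk(q_a, loss_a, loss_b)
    via_error = risk_from_error(rho, loss_a, loss_b)
    print(round(rho, 4), round(direct, 4), isclose(direct, via_error))
```

Because the loss gap is positive whenever the two losses differ, lowering the error probability in a state lowers risk in that state, and states with equal losses contribute nothing to either comparison.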

### 3.2 Choice between two treatments

An important class of applications of statistical decision theory considers use of sample data on treatment response to inform a planner who must choose treatments for a population. Past work by Manski (2004, 2005, 2021), Manski and Tetenov (2007), Hirano and Porter (2009), Stoye (2009, 2012), Tetenov (2012), Manski and Tetenov (2016), and Kitagawa and Tetenov (2018) has used the Wald framework to study this decision problem. A statistical decision function uses the data to choose a treatment allocation, so such a function has been called a *statistical treatment rule* (STR). The planner’s objective has been expressed as maximization of a social welfare function that sums treatment outcomes across the population. The mean sampling performance of an STR has been called *expected welfare*. Maximization of social welfare is equivalent to minimization of loss. Expected welfare is negative risk.

We consider here the relatively simple case in which the planner must assign one of two treatments to each member of a treatment population, denoted \(J\). The feasible treatments are \(T=\{a, b\}\). Each \(j \in J\) has a response function \({u}_{j}\left(\cdot \right):T\to Y\) mapping treatments \(t\in T\) into real-valued individual welfare outcomes \({u}_{j}\left(t\right)\). Treatment is individualistic; that is, a person’s outcome may depend on the treatment he is assigned but not on the treatments assigned to others. The population is a probability space \((J, \Omega , P)\), and the probability distribution \(P\left[u\left(\cdot \right)\right]\) of the random function \(u\left(\cdot \right):T\to {\mathbb{R}}\) describes treatment response across the population. The population is large in the sense that \(J\) is uncountable and \(P(j)=0, j \in J\).

While treatment response may be heterogeneous, we suppose here that the members of the population are observationally identical to the planner. That is, the planner does not observe person-specific covariates that would enable systematic differentiation of treatment of different persons. In principle, the planner can randomly allocate persons to the two treatments with specified allocation probabilities. The notation introduced below allows for this possibility. However, when applying the findings of Sect. 3.1, we will consider only *test rules*, which assign all members of the population to one treatment or the other.

#### 3.2.1 The mean sampling performance of STRs

A statistical treatment rule maps sample data into a treatment allocation. Let \(\Delta\) denote the space of functions that map \(T \times\Psi\) into the unit interval and that satisfy the adding-up conditions: \(\delta \in\Delta \Rightarrow\delta \left(a,\psi \right)+\delta \left(b,\psi \right)=1,\forall\psi \in\Psi .\) Each function \(\delta \in\Delta\) defines a statistical treatment rule, \(\delta (a, \psi )\) and \(\delta (b, \psi )\) being the fractions of the population assigned to treatments \(a\) and \(b\) when the data are \(\psi\). Observe that this definition of an STR does not specify which persons receive each treatment, only the assignment shares. Designation of the particular persons receiving each treatment is immaterial because assignment is random, the population is large, and the planner has an additive welfare function. As \(\delta (a, \psi )+\delta (b, \psi ) = 1\), we use the shorthand \(\delta (\psi )\) to denote the fraction assigned to treatment \(b\). The fraction assigned to treatment \(a\) is \(1-\delta (\psi )\).

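The planner wants to maximize population welfare, which adds welfare outcomes across persons. Given data \(\psi\), the population welfare (negative loss) realized if the planner were to choose rule \(\delta\) is

\[U\left(\delta ,P,\psi \right)\equiv \alpha \left[1-\delta \left(\psi \right)\right]+\beta \delta \left(\psi \right)=\alpha +\left(\beta -\alpha \right)\delta \left(\psi \right), \tag{11}\]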

where \(\alpha \equiv E\left[u\left(a\right)\right]={\int }_{J}{u}_{j}\left(a\right)dP\left(j\right)\) and \(\beta \equiv E\left[u\left(b\right)\right]={\int }_{J}{u}_{j}\left(b\right)dP\left(j\right)\) are assumed to be finite. Inspection of (11) shows that, whatever value \(\psi\) may take, it is optimal to set \(\delta (\psi )=0\) if \(\alpha \ge \beta\) and \(\delta (\psi )=1\) if \(\alpha \le \beta\).

The problem of interest is treatment choice when knowledge of \(P\) and \(Q\) does not suffice to determine the ordering of \(\alpha\) and \(\beta\). Hence, the planner does not know the optimal treatment. Let \(\left\{\left({P}_{s},{Q}_{s}\right), s\in S\right\}\) be the set of \((P,Q)\) pairs that the planner deems possible. The planner does not know the optimal treatment if \(S\) contains at least one state such that \({{\alpha }}_{s}>{\beta }_{s}\) and another such that \({{\alpha }}_{s}<{\beta }_{s}\). We assume this throughout.

Considered as a function of \(\psi\), \(U\left(\delta ,{P}_{s},\psi \right)\) is a random variable with state-dependent sampling distribution \({Q}_{s}\left[U\left(\delta ,{P}_{s},\psi \right)\right]\). Following Wald’s view of statistical decision functions as procedures, we use the vector \(\left\{{Q}_{s}\left[U\left(\delta ,{P}_{s},\psi \right)\right], s\in S\right\}\) of state-dependent welfare distributions to evaluate rule \(\delta\). In principle this vector is computable, whatever the state space and sampling process may be. Hence, in principle, a planner can compare the vectors of state-dependent welfare distributions yielded by different STRs and base treatment choice on this comparison.

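Respect for stochastic dominance means that the planner prefers rule \(\delta\) to an alternative rule \(\delta ^{\prime}\) if \({Q}_{s}\left[U\left(\delta ,{P}_{s},\psi \right)\right]{\ge }_{sd}{Q}_{s}\left[U\left({\delta }^{\prime},{P}_{s},\psi \right)\right]\) for all \(s\in S\) and \({Q}_{s}\left[U\left(\delta ,{P}_{s},\psi \right)\right]{>}_{sd}{Q}_{s}\left[U\left({\delta }^{\prime},{P}_{s},\psi \right)\right]\) for some \(s\). The expected welfare (negative risk) of rule \(\delta\) in state \(s\), denoted \(W\left(\delta ,{P}_{s},{Q}_{s}\right),\) is

\[W\left(\delta ,{P}_{s},{Q}_{s}\right)\equiv {\alpha }_{s}\left\{1-{E}_{s}\left[\delta \left(\psi \right)\right]\right\}+{\beta }_{s}{E}_{s}\left[\delta \left(\psi \right)\right]={\alpha }_{s}+\left({\beta }_{s}-{\alpha }_{s}\right){E}_{s}\left[\delta \left(\psi \right)\right], \tag{12}\]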

where \({E}_{s}\left[\delta \left(\psi \right)\right]\equiv {\int }_{\Psi }\delta \left(\psi \right)d{Q}_{s}\left(\psi \right)\) is the mean (across potential samples) fraction of persons who are assigned to treatment \(b\).
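As an illustration, expected welfare can be computed for a simple threshold rule once a sampling process is specified. Everything about the sampling process below is an assumption made for the example: the data are a single real number distributed normally with mean \({\beta }_{s}-{\alpha }_{s}\) and a known standard deviation `sigma`, and the rule assigns the whole population to treatment \(b\) when the data exceed a threshold. Expected welfare is then the average of the two mean outcomes, weighted by the mean fraction assigned to each treatment.

```python
from statistics import NormalDist


def mean_share_b(alpha: float, beta: float, sigma: float, threshold: float = 0.0) -> float:
    """Mean fraction assigned to b under the assumed sampling process:
    psi ~ N(beta - alpha, sigma^2), and the rule sets delta(psi) = 1 iff psi > threshold."""
    return 1.0 - NormalDist(mu=beta - alpha, sigma=sigma).cdf(threshold)


def expected_welfare(alpha: float, beta: float, share_b: float) -> float:
    """Mean outcome of a weighted by the share assigned to a,
    plus mean outcome of b weighted by the share assigned to b."""
    return alpha * (1.0 - share_b) + beta * share_b


# Hypothetical state in which treatment b is better.
share = mean_share_b(alpha=0.5, beta=0.6, sigma=0.1)
print(round(share, 4), round(expected_welfare(0.5, 0.6, share), 4))

# Hypothetical state in which treatment a is better.
share = mean_share_b(alpha=0.6, beta=0.5, sigma=0.1)
print(round(share, 4), round(expected_welfare(0.6, 0.5, share), 4))
```

With a zero threshold and this symmetric setup, the rule errs with the same probability in both states, so the two states yield the same expected welfare even though the preferred treatment differs.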

#### 3.2.2 SD and mean admissibility of test rules

An important class of STRs are the *uniformly singleton* rules. Given a treatment set of any size, a rule is uniformly singleton if, for every possible data realization, it assigns the entire population to one treatment. The treatment to which the entire population is assigned may vary with the data realization.

Uniformly singleton rules are particularly simple when there are two treatments. In this case, a rule is uniformly singleton if, for each \(\psi \in \Psi\), \(\delta (\psi ) = 1\) or \(\delta (\psi ) = 0\). The class of uniformly singleton STRs is the same as the class of rules that use the outcome of a hypothesis test to choose between the treatments.

Construction of a *test rule* begins by partitioning the state space into disjoint subsets \({S}_{a}\) and \({S}_{b}\), where \({S}_{a}\) contains all states in which treatment \(a\) is uniquely optimal and \({S}_{b}\) contains all states in which \(b\) is uniquely optimal. Thus, \({{\alpha }}_{s}>{\beta }_{s}\Rightarrow s\in {S}_{a},\) \({{\alpha }}_{s}<{\beta }_{s}\Rightarrow s\in {S}_{b},\) and the states with \({{\alpha }}_{s}={\beta }_{s}\) are somehow split between the two sets. Let \({s}^{*}\) denote the unknown true state. The two hypotheses are \(\left[{s}^{*}\in {S}_{a}\right]\) and \(\left[{s}^{*}\in {S}_{b}\right]\).

A test rule \(\delta\) partitions the sample space \(\Psi\) into disjoint acceptance regions \({\Psi }_{\delta a}\) and \({\Psi }_{\delta b}\). When the data \(\psi\) lie in \({\Psi }_{\delta a}\), the rule accepts hypothesis \(\left[{s}^{*}\in {S}_{a}\right]\) by setting \(\delta (\psi )=0\). When \(\psi\) lies in \({\Psi }_{\delta b}\), the rule accepts \(\left[{s}^{*}\in {S}_{b}\right]\) by setting \(\delta (\psi )=1\). We use the word “accepts” rather than the traditional term “does not reject” because treatment choice is an affirmative action.

The above shows that test rules are uniformly singleton. The converse holds as well. If \(\delta\) is uniformly singleton, one can collect all of the data values for which the rule assigns everyone to treatment \(a\), call this subset of the sample space the acceptance region \({\Psi }_{\delta a}\), and do likewise for \({\Psi }_{\delta b}\). In what follows, we use the term *test* rule rather than *uniformly singleton* rule.

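Test rules have the state-dependent binary form (8), with \({L}_{sa}=-{{\alpha }}_{s},\) \({L}_{sb}=-{\beta }_{s},\) \({D}_{sa}=\{0\}\) and \({D}_{sb}=\{1\}.\) An error is impossible when \({{\alpha }}_{s}={\beta }_{s}\), so \({\rho }_{s}\left(\delta \right)=0\) in these states. In states with \({{\alpha }}_{s}\ne {\beta }_{s}\), the error probability is

\[{\rho }_{s}\left(\delta \right)=\begin{cases}{Q}_{s}\left[\delta \left(\psi \right)=1\right] & \text{if } {\alpha }_{s}>{\beta }_{s},\\ {Q}_{s}\left[\delta \left(\psi \right)=0\right] & \text{if } {\alpha }_{s}<{\beta }_{s}.\end{cases} \tag{13}\]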

Lemmas 4 and 5 show that a test rule \({\delta }^{\prime}\) is both SD and mean inadmissible if there exists another test rule \(\delta\) such that \({\rho }_{s}\left(\delta \right)\le {\rho }_{s}\left({\delta }^{\prime}\right)\) for all \(s\in S\) and \({\rho }_{s}\left(\delta \right)<{\rho }_{s}\left({\delta }^{\prime}\right)\) for some \(s\).

A special but important class of hypothesis tests juxtaposes two simple hypotheses. Then the Neyman–Pearson Lemma shows that, among all tests with a specified probability of a Type I error, the likelihood-ratio test minimizes the probability of a Type II error, and vice versa. In the context of treatment choice, having two simple hypotheses means that \(S\) contains two states, with treatment \(a\) better in one state and \(b\) better in the other. Then the Neyman–Pearson Lemma implies that a planner considering use of a test rule need not look beyond the class of likelihood-ratio tests. Applying Lemmas 4 and 5 to likelihood-ratio tests yields the following result, which makes explicit the form of error probabilities for likelihood-ratio tests.

### Lemma 6

*Let* \(S = \{\mathrm{0,1}\}\), *with* \({\alpha }_{0}>{\beta }_{0}\) *and* \({\alpha }_{1}<{\beta }_{1}\). *Let the data have distinct state-dependent sampling distributions* \({Q}_{0}\) *and* \({Q}_{1}\) *with either Lebesgue density or probability mass functions* \({q}_{0}\left(\cdot \right)\) *and* \({q}_{1}\left(\cdot \right)\). *Let* \(\delta\) *be a test rule. For* \(\eta \ge 0\), *let* \(\delta (\eta )\) *be the likelihood-ratio rule with threshold* \(\eta\); *thus*, \({\Psi }_{\delta \left(\eta \right)a}=\left[\psi \in\Psi :{q}_{1}\left(\psi \right)\le \eta {q}_{0}\left(\psi \right)\right]\) *and* \({\Psi }_{\delta \left(\eta \right)b}=\left[\psi \in\Psi :{q}_{1}\left(\psi \right)>\eta {q}_{0}\left(\psi \right)\right]\). *Rule* \(\delta\) *is both SD and mean inadmissible if there exists an* \(\eta \ge 0\) *such that* \({\rho }_{0}\left(\delta \right)\ge {Q}_{0}\left[{q}_{1}\left(\psi \right)>\eta {q}_{0}\left(\psi \right)\right],\) \({\rho }_{1}\left(\delta \right)\ge {Q}_{1}\left[{q}_{1}\left(\psi \right)\le\eta {q}_{0}\left(\psi \right)\right],\) *and at least one inequality is strict*. \(\square\)

### Proof

Rule \(\delta (\eta )\) has error probabilities \({\rho }_{0}\left(\delta \left(\eta \right)\right)={Q}_{0}\left[{q}_{1}\left(\psi \right)>\eta {q}_{0}\left(\psi \right)\right]\) and \({\rho }_{1}\left(\delta \left(\eta \right)\right)={Q}_{1}\left[{q}_{1}\left(\psi \right)\le\eta {q}_{0}\left(\psi \right)\right]\). Hence, the result is an immediate application of Lemmas 4 and 5. \(\square\)
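As a numerical illustration of Lemma 6, the sketch below assumes, purely for illustration, that \({Q}_{0}\) is Normal(0, 1) and \({Q}_{1}\) is Normal(1, 1). Neither these distributions nor the function names come from the paper. Under this assumption \({q}_{1}(\psi )>\eta {q}_{0}(\psi )\) is equivalent to \(\psi >\mathrm{ln}\,\eta +1/2\), so the error probabilities of \(\delta (\eta )\) have closed forms. The hypothetical comparison rule assigns treatment \(b\) only when \(\psi\) falls in the band (0.5, 2).

```python
import math
from statistics import NormalDist

Q0 = NormalDist(mu=0.0, sigma=1.0)  # assumed sampling distribution in state 0
Q1 = NormalDist(mu=1.0, sigma=1.0)  # assumed sampling distribution in state 1


def lr_error_probs(eta):
    """Error probabilities (rho_0, rho_1) of the likelihood-ratio rule, eta > 0."""
    cutoff = math.log(eta) + 0.5      # q1(psi) > eta * q0(psi)  <=>  psi > cutoff
    rho0 = 1.0 - Q0.cdf(cutoff)       # state 0: the error is choosing b
    rho1 = Q1.cdf(cutoff)             # state 1: the error is choosing a
    return rho0, rho1


def band_error_probs(lo=0.5, hi=2.0):
    """Error probabilities of a hypothetical rule choosing b iff lo < psi < hi."""
    rho0 = Q0.cdf(hi) - Q0.cdf(lo)
    rho1 = 1.0 - (Q1.cdf(hi) - Q1.cdf(lo))
    return rho0, rho1


rho_lr = lr_error_probs(math.exp(0.2))  # likelihood-ratio rule with cutoff psi > 0.7
rho_band = band_error_probs()
dominated = rho_lr[0] <= rho_band[0] and rho_lr[1] <= rho_band[1]
print("LR rule:", rho_lr, "band rule:", rho_band, "band rule dominated:", dominated)
```

In this example both error probabilities of the band rule exceed those of the likelihood-ratio rule with \(\eta ={e}^{0.2}\), so the condition of Lemma 6 holds and the band rule is inadmissible.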

A fundamental feature of the above analysis is that all error probabilities symmetrically determine the result. In contrast, the classical theory of hypothesis testing differentiates between null and alternative hypotheses, and correspondingly between Type I and Type II errors. It restricts attention to tests that yield a predetermined probability of a Type I error and seeks a test of this type that yields an adequately small probability of a Type II error.

For example, a document of the U. S. Food and Drug Administration providing guidance for the design of randomized clinical trials (RCTs) evaluating new medical devices states that the probability of a Type I error is conventionally set to 0.05 and that the probability of a Type II error depends on the claim for the device but should not exceed 0.20 (U. S. Food and Drug Administration, 2014). The International Conference on Harmonisation (ICH) has provided similar guidance for the design of RCTs evaluating pharmaceuticals. The ICH document states the following (International Conference on Harmonization, 1999, p. 1923):

“Conventionally the probability of type I error is set at 5% or less or as dictated by any adjustments made necessary for multiplicity considerations; the precise choice may be influenced by the prior plausibility of the hypothesis under test and the desired impact of the results. The probability of type II error is conventionally set at 10–20%; it is in the sponsor’s interest to keep this figure as low as feasible especially in the case of trials that are difficult or impossible to repeat. Alternative values to the conventional levels of type I and type II error may be acceptable or even preferable in some cases.”

Such asymmetric treatment of the two hypotheses is illogical from the perspective of statistical decision theory.

## 4 Ordered actions and continuous real data satisfying the monotone likelihood ratio property

We now study a class of decision problems in which SD and mean admissibility do not coincide. These are problems in which the set of feasible actions is ordered and the sampling process generating the data has a continuous distribution that satisfies the monotone likelihood ratio property. Analysis of mean admissibility in this setting dates back to Karlin and Rubin (1956), with continuation by Manski and Tetenov (2007). Here we study SD admissibility. Section 4.1 develops the basic finding. Section 4.2 applies it to treatment choice.

### 4.1 Basic finding

Proposition 7 shows that the fractional monotone treatment rules form an essentially complete class with respect to stochastic dominance when the data satisfy the maintained assumptions. A *fractional monotone* rule is one in which \(\delta \left(\psi \right)\) is weakly increasing in \(\psi\). *Essential completeness* means that any randomized decision rule \(\delta \left(\psi ,\upsilon \right)\) can be replaced by a fractional monotone rule \(\delta ^{\prime}(\psi )\) that weakly stochastically dominates \(\delta (\psi , \upsilon )\) in each state \(s\). The planner then does not need to consider any other types of STRs. Manski and Tetenov (2007, Proposition 1) show that fractional monotone rules form an essentially complete class when the planner wants to maximize the expectation \({E}_{s}\left[f\left(U\left(\delta ,{P}_{s},\psi ,\upsilon \right)\right)\right]\) of a concave-monotone function \(f\left(\cdot \right)\) of the population welfare and \(\psi\) is binomial. Here, we establish that a planner with any decision criterion that respects stochastic dominance can restrict attention to fractional monotone rules when \(\psi\) has a continuous distribution satisfying the MLR property.

### Proposition 7

*Let* \(u\left(d,s\right)\)* be the payoff function from action *\(d\in D\subset {\mathbb{R}}\)* in state *\(s \in S\)*. Assume that *\(D\)* is a closed set. Assume that *\(u\left(d,s\right)\)* is weakly monotonic in *\(d\)* for each *\(s\)*. For *\(s \in S\)*, let the data *\(\psi \in {\mathbb{R}}\)* have a continuous distribution *\({Q}_{s}\left(\psi \right)\)* and density *\({q}_{s}\left(\psi \right)\)* with respect to Lebesgue measure. Let *\(\upsilon \sim \mathrm{Uniform}[0,1]\)* be a randomization variable independent of *\(\psi\)*. Assume that there exists a state *\({s}_{0}\)* for which *\(u\left(d,{s}_{0}\right)\)* is constant in *\(d\)*.*

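*Let *\({Q}_{s}\left(\psi \right)\)* possess the monotone likelihood ratio property for all pairs *\(\left(s,{s}_{0}\right)\)* such that *\(u\left(d,s\right)\)* is not constant in *\(d\)*. That is, either*

\[
u(d,s)\ \text{is non-increasing in}\ d\ \text{and}\ q_{s}(\psi)\,q_{s_{0}}(\psi') \ge q_{s}(\psi')\,q_{s_{0}}(\psi)\ \text{for all}\ \psi<\psi', \tag{14a}
\]

*or*

\[
u(d,s)\ \text{is non-decreasing in}\ d\ \text{and}\ q_{s}(\psi)\,q_{s_{0}}(\psi') \le q_{s}(\psi')\,q_{s_{0}}(\psi)\ \text{for all}\ \psi<\psi'. \tag{14b}
\]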

*Then for any randomized strategy* \(\delta \left(\psi ,\upsilon\right):\Psi \times \left[\mathrm{0,1}\right]\to D,\) *there exists a monotone non-randomized strategy* \({\delta }^{\prime}\left(\psi \right):\Psi \to D\) *whose distribution of payoffs* \({Q}_{s}\left[u\left({\delta }^{\prime}\left(\psi \right),s\right)\right]\) *weakly stochastically dominates the distribution of payoffs* \({Q}_{s}\left[u\left(\delta \left(\psi ,\upsilon\right),s\right)\right]\) *of* \(\delta\) *in each state* \(s\). \({\delta }^{\prime}\) *could be constructed by monotonically rearranging the values taken by* \(\delta \left(\psi ,\upsilon\right)\) *in state* \({s}_{0}\):

\[
\delta'(\psi) \equiv G_{\delta,s_{0}}^{-1}\left(F_{0}(\psi)\right), \tag{15}
\]

*where* \({G}_{\delta ,{s}_{0}}^{-1}\left(\cdot \right)\) *is the quantile function of the distribution* \({Q}_{{s}_{0}}\left[\delta \left(\psi ,\upsilon\right)\right]\) *of the action* \(\delta (\psi , \upsilon )\) *in state* \({s}_{0}\) *and* \({F}_{0}\left(t\right)\equiv {Q}_{{s}_{0}}\left(\psi \le t\right)\) *is the c.d.f. of* \(\psi\) *in state* \({s}_{0}\). \(\square\)

### Proof

First, we show that the non-randomized strategy \({\delta }^{\prime}\left(\psi \right)\) defined in (15) re-arranges the values of \(\delta \left(\psi ,\upsilon\right)\) to be weakly increasing in \(\psi\) and has the same probability distribution of actions (and hence payoffs) as \(\delta \left(\psi ,\upsilon\right)\) in state \({s}_{0}\). Then, we show that \({Q}_{s}\left[u\left({\delta }^{\prime}\left(\psi \right),s\right)\right]\) weakly stochastically dominates \({Q}_{s}\left[u\left(\delta \left(\psi ,\upsilon\right),s\right)\right]\) in all other states of nature.

Given that \(\psi\) has a continuous distribution, random variable \({F}_{0}\left(\psi \right)\) has a Uniform(0, 1) distribution in state \({s}_{0}\). Hence, random variable \({\delta }^{\prime}\left(\psi \right)\equiv {G}_{\delta ,{s}_{0}}^{-1}\left({F}_{0}\left(\psi \right)\right)\) has c.d.f. \({G}_{\delta ,{s}_{0}}\) in state \({s}_{0}\).^{Footnote 1} Given that both \({G}_{\delta ,{s}_{0}}^{-1}\) and \({F}_{0}\left(\cdot \right)\) are non-decreasing, \({\delta }^{\prime}\left(\psi \right)\) is also non-decreasing in \(\psi\). Given that \({F}_{0}\) is continuous and \({G}_{\delta ,{s}_{0}}^{-1}\) is left-continuous, \({\delta }^{\prime}\left(\psi \right)\) is also left-continuous. Since \({G}_{\delta ,{s}_{0}}^{-1}\) takes values on the support of \({Q}_{{s}_{0}}\left[\delta \left(\psi ,\upsilon\right)\right]\), \(\delta \left(\psi ,\upsilon\right)\in D\) and \(D\) is a closed set, \({\delta }^{\prime}\left(\psi \right)\in D.\)

In states where \(u\left(d,s\right)\) is constant in \(d\), the distributions of payoffs are identical for all strategies. Hence, weak stochastic dominance holds. Now suppose that state \(s\) satisfies (14a), so \(u\left(d,s\right)\) is non-increasing in \(d\). (The proof is analogous for states in which \(u\left(d,s\right)\) is non-decreasing in \(d\).)

We want to show that the distribution of \({\delta }^{\prime}\left(\psi \right)\) is weakly stochastically dominated by the distribution of \(\delta \left(\psi ,\upsilon\right)\). Denote the c.d.f. of action \(\delta \left(\psi ,\upsilon\right)\) in state \(s\) by

\[
G_{\delta,s}(t) \equiv \int \int_{0}^{1} q_{s}(\psi)\cdot 1\left[\delta(\psi,\upsilon)\le t\right] d\upsilon\, d\psi .
\]

Given any \(t\in {\mathbb{R}}\), consider the indicator functions \(1\left[\delta \left(\psi ,\upsilon\right)\le t\right]\) and \(1\left[{\delta }^{\prime}\left(\psi \right)\le t\right]\). These indicator functions generate rejection regions for classical hypothesis tests with null hypothesis \({s}_{0}\) and alternative hypothesis \(s\). A randomized test with rejection region \(\Omega \equiv \left\{\left(\psi ,\upsilon\right):\delta \left(\psi ,\upsilon\right)\le t\right\}\) has power function \({G}_{\delta ,s}\left(t\right)\) as a function of \(s\). Similarly, \({G}_{{\delta }^{\prime},s}\left(t\right)\equiv \int {q}_{s}\left(\psi \right)\cdot 1\left[{\delta }^{\prime}\left(\psi \right)\le t\right]d\psi\) is the power function (as a function of \(s\)) of a non-randomized test with rejection region \({\Omega }^{\prime}\equiv \left\{\psi :{\delta }^{\prime}\left(\psi \right)\le t\right\}.\) We have shown above that the two tests have equal power in state \({s}_{0}\): \({G}_{{\delta }^{\prime},{s}_{0}}\left(t\right)={G}_{\delta ,{s}_{0}}\left(t\right).\) Given that \({\delta }^{\prime}\left(\psi \right)\) is non-decreasing in \(\psi\), \(1\left[{\delta }^{\prime}\left(\psi \right)\le t\right]\) is non-increasing in \(\psi\) and there exists \({\psi }_{t}\) such that

\[
1\left[\delta'(\psi)\le t\right] = 1\left[\psi\le\psi_{t}\right].
\]

Given that state *s* satisfies (14a), the test with rejection region \({\Omega }^{\prime}=\left\{\psi :\psi \le {\psi }_{t}\right\}\) is a likelihood-ratio test. The tests with rejection regions \(\Omega \text{ and } {\Omega }^{\prime}\) have the same size. It follows from the Neyman–Pearson lemma that test \({\Omega }^{\prime}\) must be at least as powerful as \({\Omega }\) in state *s*.^{Footnote 2} That is, \({G}_{{\delta }^{\prime},s}\left(t\right)\ge {G}_{\delta ,s}\left(t\right).\)

We can thus establish that \({G}_{{\delta }^{\prime},s}\left(t\right)\ge {G}_{\delta ,s}\left(t\right)\) for all *t*. Hence, the distribution of \(\delta \left(\psi ,\upsilon\right)\) weakly stochastically dominates the distribution of \({\delta }^{\prime}\left(\psi \right)\). Given that \(u\left(d,s\right)\) is a weakly decreasing function of *d*, \({Q}_{s}\left[u\left({\delta }^{\prime}\left(\psi \right),s\right)\right]\) weakly stochastically dominates \({Q}_{s}\left[u\left(\delta \left(\psi ,\upsilon\right),s\right)\right]\). \(\square\)
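The rearrangement used in the proof can be illustrated with a small two-action example. The sketch below is not from the paper: it assumes \(\psi \sim\) Normal(*s*, 1) with \({s}_{0}=0\), actions \(D=\{0,1\}\), and a hypothetical randomized rule that picks action 1 when \(\psi >1\) or \(\upsilon <0.3\). With two actions, the left-continuous quantile function of the action distribution in state \({s}_{0}\) is a step function, so \({\delta }^{\prime}\left(\psi \right)={G}_{\delta ,{s}_{0}}^{-1}\left({F}_{0}\left(\psi \right)\right)\) reduces to a threshold rule.

```python
from statistics import NormalDist


def Q(s):
    """Assumed sampling distribution of psi in state s: Normal(s, 1)."""
    return NormalDist(mu=s, sigma=1.0)


def prob_one_randomized(s):
    """P_s[delta(psi, upsilon) = 1] for the hypothetical rule
    delta = 1 if (psi > 1 or upsilon < 0.3) else 0, with upsilon ~ Uniform[0, 1]."""
    return 1.0 - Q(s).cdf(1.0) * 0.7


# c.d.f. of the action in state s0 = 0, evaluated at action 0.
G_at_zero = 1.0 - prob_one_randomized(0.0)

# delta'(psi) = G^{-1}(F0(psi)) equals 1 exactly when F0(psi) > G(0),
# that is, when psi exceeds the G(0)-quantile of the Normal(0, 1) distribution.
threshold = Q(0.0).inv_cdf(G_at_zero)


def prob_one_monotone(s):
    """P_s[delta'(psi) = 1] for the rearranged threshold rule."""
    return 1.0 - Q(s).cdf(threshold)


for s in (-1.0, 0.0, 1.0):
    print(s, prob_one_randomized(s), prob_one_monotone(s))
```

By construction the two rules choose action 1 with the same probability in state \({s}_{0}\). In the state with a higher mean, where a payoff increasing in \(d\) would make action 1 preferable, the monotone rule chooses action 1 more often; in the state with a lower mean it chooses action 1 less often. With binary actions, comparing these probabilities is all that stochastic dominance requires.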

### 4.2 Applications to treatment choice

#### 4.2.1 Choice between two treatments

Proposition 7 applies to the treatment-choice problem of Sect. 3.2, with action \(d\in \left[\mathrm{0,1}\right]\) denoting the fraction of the population assigned to treatment \(b\). Payoff function (12) is decreasing in \(d\) when \({\beta }_{s}-{{\alpha }}_{s}<0,\) increasing in \(d\) when \({\beta }_{s}-{{\alpha }}_{s}>0,\) and constant when \({\beta }_{s}-{{\alpha }}_{s}=0.\) Hence, the payoff function satisfies the assumptions of the proposition. Suppose that \({Q}_{s}\left(\psi \right)\) is continuous and possesses the monotone likelihood ratio property in \((\beta - \alpha )\). Then the proposition shows that the class of fractional monotone STRs is essentially complete under any decision criterion that respects stochastic dominance.

#### 4.2.2 Choice of treatment dose

Let action *d* be a dose level for a real-valued treatment; for example, it may be the dose of a medical drug treatment. Suppose that administering a higher dose is beneficial but costly. In particular, let the payoff function have the linear form \(u\left(d,s\right)=b\left(s\right)d-cd,\) which gives the benefit minus cost of different dose levels. Here \(c\) is the known marginal cost of increasing the dose and \(b(s)\) is the unknown marginal dose response. Then \(u(d, s)\) is increasing if \(b(s) > c\) and decreasing if \(b(s) < c\). Welfare does not vary with the dose level when \(b(s) = c\). Thus, \(u\left(\cdot ,\cdot \right)\) satisfies the assumptions of Proposition 7.

Suppose that one obtains real-valued data \(\psi\) drawn from a continuous distribution and that \(\psi\) provides an informative but imperfect signal about \(b(s)\); for example, \(\psi\) may be the result of an informative but imperfect diagnostic test. It is relatively easy to imagine signal generation processes in which \(\psi\) has the MLR property. For example, it may be that \(\psi\) equals \(b(s)\) plus a white-noise error. Then Proposition 7 implies that the treatment rule should be monotone in the realization of \(\psi\) and should be non-randomized.

## 5 The Stein phenomenon from the perspective of stochastic dominance

This section studies another class of decision problems in which SD admissibility and mean admissibility do not coincide. We reappraise the Stein phenomenon (Stein, 1956) from the perspective of SD admissibility. We consider the simplest setting in which the Stein phenomenon occurs. We are interested in estimating a three-dimensional parameter \(\theta ={\left({\theta }_{1}\, {\theta }_{2}\, {\theta }_{3}\right)}^{\prime}\in {\mathbb{R}}^{3}\). We observe a three-dimensional outcome \(x={\left({x}_{1}\, {x}_{2}\, {x}_{3}\right)}^{\prime}\) whose sampling distribution is multivariate normal with mean vector \(\theta\) and a known identity variance matrix, \(x\sim \mathcal{N}\left(\theta ,{I}_{3}\right)\). The statistical decision rule maps the outcome into a three-dimensional estimator \(\delta \left(x\right)={\left({\delta }_{1}\left(x\right)\, {\delta }_{2}\left(x\right)\, {\delta }_{3}\left(x\right)\right)}^{\prime}\) and the loss function is the component-wise sum of squared losses \(L\left(\theta ,\delta \right)\equiv {\Vert \delta -\theta \Vert }^{2}={\sum }_{i=1}^{3}{\left({\delta }_{i}-{\theta }_{i}\right)}^{2}\).

The MLE estimator in this problem is \({\delta }_{MLE}\left(x\right)\equiv x\). Stein (1956) has shown that the MLE estimator is mean-inadmissible. It is mean-dominated, for example, by the James–Stein (1961) estimator \({\delta }_{JS}\equiv \left(1-\frac{1}{{\Vert x\Vert }^{2}}\right)x\). For any \(\theta \in {\mathbb{R}}^{3}\), \({E}_{\theta }L\left(\theta ,{\delta }_{JS}\right)<{E}_{\theta }L\left(\theta ,{\delta }_{MLE}\right)\). Other estimators mean-dominating the MLE estimator have been subsequently proposed. Among those we will consider the James–Stein positive part estimator \({\delta }_{JSPP}\equiv \mathrm{max}\left(\mathrm{0,1}-\frac{1}{{\Vert x\Vert }^{2}}\right)x\) from Efron and Morris (1973).

Does the loss distribution of the James–Stein estimator stochastically dominate the loss distribution of the MLE estimator? To show that the statement is false it suffices to find one value of \(\theta\) for which \({Q}_{s}\{L\left[\theta ,{\delta }_{JS}\right]\}{\ngeq }_{sd}{Q}_{s}\{L\left[\theta ,{\delta }_{MLE}\right]\}\). While we cannot offer a mathematical proof of this statement, we have convincing evidence from simulations that stochastic dominance does not hold for \(\theta ={\left[0\,0\,0 \right]}^{\prime}\).

Figure 1 plots the cumulative distribution of loss for the MLE, James–Stein, and James–Stein positive part estimators at \(\theta ={\left[0\,0\,0 \right]}^{\prime}\) computed from 100 million draws. The left panel, which plots the loss distribution in the loss range [0, 15], suggests that the James–Stein estimator stochastically dominates the MLE. However, close scrutiny of upper quantiles in the right panel, which plots the distribution in the loss range [5, 20], shows otherwise. The CDFs of the two estimators cross at a loss value of approximately 12. Thus, the loss distribution of the James–Stein estimator does not stochastically dominate that of the MLE.

On the other hand, the loss distribution of the James–Stein positive part estimator does seem to stochastically dominate that of the MLE. In simulations for other parameter values, we have not been able to find any \(\theta\) for which the loss distribution of the positive part estimator would not seem to stochastically dominate that of the MLE. Based on the simulation evidence, we cannot rule out that the MLE is also SD inadmissible (dominated by the positive part estimator). While we are unable to offer a mathematical proof, the simulation evidence suggests interesting avenues for further research on the Stein phenomenon from the perspective of SD admissibility.
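A small-scale version of this simulation can be sketched as follows. This is not the authors' code: it uses only the Python standard library, a far smaller number of draws than the 100 million reported above, and hypothetical function names. At \(\theta =0\) all three losses are functions of \({\Vert x\Vert }^{2}\), which keeps the computation short.

```python
import random


def simulate_losses(n_draws, seed=0):
    """Draw x ~ N(0, I_3) and return the losses of the MLE, James-Stein,
    and James-Stein positive part estimators at theta = 0."""
    rng = random.Random(seed)
    mle, js, jspp = [], [], []
    for _ in range(n_draws):
        r2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))  # ||x||^2
        shrink = 1.0 - 1.0 / r2                                # James-Stein factor
        mle.append(r2)                                         # ||x - 0||^2
        js.append(shrink ** 2 * r2)                            # ||shrink * x||^2
        jspp.append(max(0.0, shrink) ** 2 * r2)                # positive part version
    return mle, js, jspp


def ecdf(losses, t):
    """Empirical c.d.f. of the simulated losses at t."""
    return sum(1 for v in losses if v <= t) / len(losses)


mle, js, jspp = simulate_losses(200_000, seed=1)
for t in (5.0, 12.0, 20.0):
    print(t, ecdf(mle, t), ecdf(js, t), ecdf(jspp, t))
```

Under these assumptions the empirical CDF of the James–Stein loss lies above that of the MLE at a loss of 5 but below it at a loss of 20, which is the crossing described above. The positive part estimator never has a larger loss than the MLE at \(\theta =0\), since its shrinkage factor lies in [0, 1).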

## 6 Mean and quantile decision criteria

We now turn attention from SD-admissibility to choice of a decision function. Section 2.1 posed three leading criteria that use mean performance to evaluate alternative rules: minimax, minimax-regret, and minimization of Bayes risk. Here, we juxtapose these criteria with analogous ones that use quantile performance. Section 6.1 presents the quantile criteria in abstraction. Section 6.2 applies them to selection of a test rule for choice between two treatments.

Decision making using a quantile-utility criterion was proposed by Manski (1988) in a setting without sample data. It was observed there that maximization of expected and quantile utility differ in important respects. Whereas the ranking of actions by expected utility is invariant only to cardinal transformations of the objective function, the ranking by quantile utility is invariant to ordinal transformations. Whereas expected utility conveys risk preferences through the shape of the utility function, quantile utility does so through the specified quantile, with higher values conveying more risk preference. Whereas expected utility is not well-defined when the distribution of utility has unbounded support with fat tails, quantile utility is always well-defined. An axiomatic derivation of quantile-utility maximization has been provided by Rostek (2010).

There is reason to think that quantiles of welfare distributions matter to decision makers. For example, recent writings on finance have shown explicit concern with low quantiles of earnings distributions, using the term *value-at-risk*. See, for example, Jorion (2006).

### 6.1 Quantile criteria

Let \(\lambda \in \left(\mathrm{0,1}\right)\) be a specified quantile. The \(\lambda\)-quantile minimax and minimax-regret criteria are analogous to criteria (3) and (4) that use risk to measure state-dependent sampling performance. The corresponding quantile criteria are

There are at least two ways that one might define a quantile analog to minimization of Bayes risk. Replacement of risk with \(\lambda\)-quantile loss in criterion (2) yields

This hybrid criterion measures the sampling performance of rule \(\delta\) by the \(\lambda\)-quantile of the sampling distribution of loss and evaluates performance of the rule across states by the mean of these \(\lambda\)-quantiles.

An alternative begins not with definition (2) of the standard Bayes criterion but with the equivalent definition

where \(E\left\{L\left[s,\delta \left(\psi \right)\right]\right\}=\int R\left(s,\delta \right)d\uppi \left(s\right)=\int \int L\left[s,\delta \left(\psi \right)\right]d\Phi \left(s,\psi \right)\) is the Bayes expectation of loss across samples and states. Replacement of the Bayes expectation of loss with its \(\lambda\)-quantile yields

Although the mean-based criteria (2) and (2′) are equivalent, the quantile-based criteria (18) and (19) generally differ from one another.

It is well-known that minimization of Bayes risk is also equivalent to solution of the collection of conditional Bayes decision criteria

See, for example, Berger (1985, pp. 159–160). That is, minimization of Bayes risk is equivalent to minimization of the posterior expected value of Bayes loss at every point in the sample space. This result, which follows from Fubini’s Theorem, does not hold for quantile-based criteria. The quantile analog of (2′′) would be to minimize the posterior \(\lambda\)-quantile of Bayes loss at every point in the sample space. This posterior quantile criterion generally differs from both (18) and (19). We do not know whether it has an interpretation from the perspective of ex ante statistical decision theory. Hence, we do not consider it further.

### 6.2 Criteria for selection of a test rule

The mean and quantile-based decision criteria of Sects. 2.1 and 6.1 offer a menu of procedures for choice of a statistical decision function. To illustrate their application, we continue the analysis of Sect. 3.2 and consider choice between two treatments, focusing on test rules. Following common practice, we skip the step of determining admissibility and use the criteria to choose among all feasible test rules, not just those that are admissible.

The state-dependent expected and quantile welfare of test rules have simple expressions, namely

Observe that mean and quantile sampling performance are both monotonically decreasing in the error probability, falling from \(\mathrm{max}\left({{\alpha }}_{s},{\beta }_{s}\right)\) to \(\mathrm{min}\left({{\alpha }}_{s},{\beta }_{s}\right)\) as \({\rho }_{s}\left(\delta \right)\) increases from 0 to 1. However, they differ in the pattern of decrease. Whereas mean performance varies linearly with the error probability, quantile performance is a step function. This difference in the pattern of decrease implies differences between decision criteria based on mean and quantile performance, described below.
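The contrast between the linear and the step pattern can be sketched in a few lines. The functions below are a reading of the description above rather than a transcription of the paper's expressions: a test rule yields welfare \(\mathrm{max}\left({{\alpha }}_{s},{\beta }_{s}\right)\) with probability \(1-{\rho }_{s}\left(\delta \right)\) and \(\mathrm{min}\left({{\alpha }}_{s},{\beta }_{s}\right)\) with probability \({\rho }_{s}\left(\delta \right)\), and the \(\lambda\)-quantile is taken as the smallest welfare value whose cumulative probability is at least \(\lambda\). The function names and the numerical values are hypothetical.

```python
def mean_welfare(alpha, beta, rho):
    """Expected welfare of a test rule in a state with mean outcomes
    (alpha, beta) and error probability rho: linear in rho."""
    return max(alpha, beta) - rho * abs(beta - alpha)


def quantile_welfare(alpha, beta, rho, lam):
    """lam-quantile of welfare: drops to the worse outcome once the
    error probability reaches lam, so it is a step function of rho."""
    return min(alpha, beta) if rho >= lam else max(alpha, beta)


alpha, beta, lam = 0.5, 0.9, 0.3  # hypothetical state and quantile
for rho in (0.0, 0.1, 0.29, 0.3, 0.6, 1.0):
    print(rho, mean_welfare(alpha, beta, rho), quantile_welfare(alpha, beta, rho, lam))
```

Both measures equal 0.9 at \({\rho }_{s}\left(\delta \right)=0\) and 0.5 at \({\rho }_{s}\left(\delta \right)=1\), but the quantile measure stays at 0.9 for every error probability below \(\lambda\).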

#### 6.2.1 Mean and quantile maximin criteria

Let \(\Delta\) denote the space of all test rules. The maximin criteria based on mean and \(\lambda\)-quantile sampling performance are

Research on treatment choice has dismissed as unpalatable the maximin criterion based on mean sampling performance. The reason is that criterion (20) is typically solved by a data-invariant rule. Savage (1951) mentions this in passing. Manski (2004) proves it in the special case in which one treatment is a status quo option whose mean treatment response is known.

The reasoning is simple. Let \({\alpha }_{L}\equiv \underset{s\in S}{\text{min}}\,{\alpha }_{s},\) \({\beta }_{L}\equiv \underset{s\in S}{\text{min}}\,{\beta }_{s},\) and suppose that \({\alpha }_{L}>{\beta }_{L}.\) There are two data-invariant rules. One always chooses treatment \(a\), yielding error probabilities \({\rho }_{s}\left(\delta \right)=0\) in states where \({\alpha }_{s}\ge {\beta }_{s}\) and \({\rho }_{s}\left(\delta \right)=1\) when \({\alpha }_{s}<{\beta }_{s}.\) Minimum expected welfare for this rule is \({\alpha }_{L}.\) The other always chooses treatment \(b\), yielding \({\rho }_{s}\left(\delta \right)=0\) when \({\alpha }_{s}\le {\beta }_{s}\) and \({\rho }_{s}\left(\delta \right)=1\) when \({\alpha }_{s}>{\beta }_{s}.\) Minimum expected welfare for this rule is \({\beta }_{L}\). The maximin criterion prefers the former rule to the latter when \({{\alpha }}_{L}>{\beta }_{L}.\)

Now consider a data-varying rule, which chooses treatment \(a\) for some data realizations and \(b\) for others. Suppose, as is typically the case in practice, that \({\rho }_{s}\left(\delta \right)>0\) in every state where \({\alpha }_{s}\ne {\beta }_{s}.\) Minimum expected welfare for this rule is less than \({{\alpha }}_{L}\) because there exist states in which \({\beta }_{s}<{{\alpha }}_{L}\) and there is positive sampling probability that the rule chooses treatment \(b\). Hence, the data-invariant rule that always chooses treatment \(a\) uniquely solves the maximin problem.

Maximin choice based on quantile performance does not yield such an extreme result and, hence, may be more palatable. The minimum \(\lambda\)-quantile welfare values of the two data-invariant rules are \({{\alpha }}_{L}\) and \({\beta }_{L}\), respectively. The minimum \(\lambda\)-quantile welfare of a data-varying rule is less than \({{\alpha }}_{L}\) if \({\rho }_{s}\left(\delta \right)\ge\lambda\) in some state where \({\beta }_{s}<{{\alpha }}_{L}\). However, the minimum \(\lambda\)-quantile welfare of such a rule is greater than \({{\alpha }}_{L}\) if \({\rho }_{s}\left(\delta \right)<\lambda\) in all states where \({\beta }_{s}<{{\alpha }}_{L}\) and in some state where \({\beta }_{s}>{{\alpha }}_{L}\). Thus, a data-varying rule solves the maximin problem if its error probabilities are positive but not too large.

#### 6.2.2 Mean and quantile minimax-regret criteria

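The minimax-regret criteria based on mean and \(\lambda\)-quantile sampling performance are

\[
\min_{\delta\in\Delta}\ \max_{s\in S}\ \rho_{s}(\delta)\left|\beta_{s}-\alpha_{s}\right|, \tag{24}
\]

\[
\min_{\delta\in\Delta}\ \max_{s\in S}\ 1\left[\rho_{s}(\delta)\ge\lambda\right]\left|\beta_{s}-\alpha_{s}\right|. \tag{25}
\]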

Criterion (24) has been studied by Manski (2004, 2005), Stoye (2009, 2012), Tetenov (2012), and Manski and Tetenov (2016). Criterion (25), which is new, differs because it multiplies the state-dependent loss \(\left|{\beta }_{s}-{\alpha }_{s}\right|\) by a step function of \({\rho }_{s}\left(\delta \right)\) rather than by \({\rho }_{s}\left(\delta \right)\) itself.

The difference is consequential. The minimax value of mean regret is generically positive. It is zero only in degenerate settings where there exists a rule with \({\rho }_{s}\left(\delta \right)=0\) in all states of nature. On the other hand, minimax \(\lambda\)-quantile regret is zero in some settings with positive error probabilities. Maximum \(\lambda\)-quantile regret is zero if \({\rho }_{s}\left(\delta \right)<\lambda\) in all states.

First observe that a rule with \({\rho }_{s}\left(\delta \right)<\lambda\) in all states trivially exists when \(\lambda > 1/2\). While one ordinarily thinks of \(\psi\) as data that are informative about treatment response, statistical decision theory also encompasses study of STRs that make treatment choice vary with uninformative data. That is, \(\delta\) may make the treatment allocation depend on data generated by a randomizing device. Suppose in particular that \(\Psi =\left\{\text{0,1}\right\},\) \({Q}_{s}\left(\psi =0\right)={Q}_{s}\left(\psi =1\right)=1/2\) for all \(s \in S\), and \(\delta\) is the rule that lets \({\Psi }_{\delta a}=\left\{0\right\}\) and \({\Psi }_{\delta b}=\{1\}\). The error probabilities for this test rule are \({\rho }_{s}\left(\delta \right)=1/2\) for all \(s \in S.\) Hence, the \(\lambda\)-quantile maximum regret of rule \(\delta\) is zero for all \(\lambda > 1/2\).
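The coin-flip example can be checked numerically. The sketch below assumes that mean regret in state \(s\) is \({\rho }_{s}\left(\delta \right)\left|{\beta }_{s}-{\alpha }_{s}\right|\) and that \(\lambda\)-quantile regret equals \(\left|{\beta }_{s}-{\alpha }_{s}\right|\) when \({\rho }_{s}\left(\delta \right)\ge \lambda\) and zero otherwise, which is the step-function form described in the text. The list of states and the function names are hypothetical.

```python
# Hypothetical finite state space: each state is a pair (alpha_s, beta_s).
STATES = [(0.5, 0.2), (0.5, 0.45), (0.5, 0.8)]


def max_mean_regret(states, rho):
    """Maximum over states of rho * |beta - alpha| for a rule whose
    error probability is rho in every state with alpha != beta."""
    return max(rho * abs(b - a) for a, b in states)


def max_quantile_regret(states, rho, lam):
    """Maximum over states of the lam-quantile regret: |beta - alpha|
    if the error probability reaches lam, zero otherwise."""
    return max(abs(b - a) if rho >= lam else 0.0 for a, b in states)


coin_flip_rho = 0.5  # error probability of the rule driven by a fair coin
print(max_mean_regret(STATES, coin_flip_rho))
for lam in (0.4, 0.5, 0.6, 0.9):
    print(lam, max_quantile_regret(STATES, coin_flip_rho, lam))
```

The coin-flip rule has positive maximum mean regret, yet its maximum \(\lambda\)-quantile regret is zero for each \(\lambda >1/2\) and positive for each \(\lambda \le 1/2\).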

To the best of our knowledge, there exists no similarly obvious way to form a rule with zero \(\lambda\)-quantile maximum regret when \(\lambda \le 1/2.\) In this domain, achievement of zero maximum regret becomes a more stringent condition as \(\lambda\) decreases. It appears infeasible to perform an elementary general analysis, but we can make progress by examining particular contexts.

Proposition 8 demonstrates that test rules with zero maximum regret exist if \(S\) is a metric space with positive distance between the sets \({S}_{a}\) and \({S}_{b}\) (for example, if \(S\) is finite), and the data enable sufficiently precise estimation of the true state. In contrast, Proposition 9 shows that for \(\lambda < 1/2\), no such test rule exists if the set \(S\) is connected and other regularity conditions hold. In combination, the two propositions show that zero \(\lambda\)-quantile maximum regret is neither an empty concept nor ubiquitous. It is attainable by a test rule in some settings but not in others.

### Proposition 8

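*Let *\(S\)* be a subset of a metric space *\(\left(\Theta ,d\right)\)* with distance *\(d\left(\cdot ,\cdot \right).\)* Let*

\[
\varepsilon \equiv \frac{1}{2}\inf_{s\in S_{a},\, s'\in S_{b}} d\left(s,s'\right) > 0. \tag{26}
\]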

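*Suppose that an estimator *\(\check{s}\left(\cdot \right):\Psi \to \Theta\)* is *\(\varepsilon\)*-far from the true state *\(s\)* with probability below *\(\lambda\):

\[
Q_{s}\left[d\left(\check{s}(\psi),s\right)\ge\varepsilon\right] < \lambda \quad \text{for all } s\in S. \tag{27}
\]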

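*Then the minimum-distance test rule*

\[
\delta_{md}(\psi) = \begin{cases} 0 & \text{if } \inf_{s\in S_{a}} d\left(\check{s}(\psi),s\right) \le \inf_{s\in S_{b}} d\left(\check{s}(\psi),s\right), \\ 1 & \text{otherwise} \end{cases} \tag{28}
\]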

*has zero *\(\lambda\)*-quantile maximum regret*. \(\square\)

### Proof

It follows from the definition of \(\varepsilon\) that in every state \(s\in {S}_{a},\)

\[
d\left(s,s'\right)\ge 2\varepsilon \quad \text{for all } s'\in S_{b}.
\]

The same is true for \(s\in {S}_{b}\) and \({s}^{\prime}\in {S}_{a}.\) Hence, a necessary condition for rule (28) to yield an error when \(s\) is the true state is that \(d\left(\check{s}\left(\psi \right),s\right)\ge\varepsilon .\) Hence,

\[
\rho_{s}\left(\delta_{md}\right) \le Q_{s}\left[d\left(\check{s}(\psi),s\right)\ge\varepsilon\right].
\]

It follows from this and (27) that \({\rho }_{s}\left({\delta }_{md}\right)<\lambda\) for every \(s \in S.\) Hence, the rule has zero maximum regret. \(\square\)

### Remark

A sufficient condition for (26) to hold is that the average treatment effect \({\beta }_{s}-{{\alpha }}_{s}\) be uniformly continuous in \(s\) and bounded away from zero. If the state space \(S\) is finite, condition (27) is satisfied whenever \(\check{s}\left(\cdot \right)\) is a weakly consistent estimator of the true state and the sample size is sufficiently large.
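A finite-state example may help fix ideas. The sketch below is an illustration under stated assumptions, not the paper's construction: treatment \(a\) is a status quo with known mean 0.5, the state \(s\) is the success probability of treatment \(b\), \(S=\{0.3, 0.7\}\), the data are the number of successes in \(n=20\) Bernoulli trials, the estimator is the sample frequency, and the distance is the absolute difference. It takes \(\varepsilon\) to be half the distance between \({S}_{a}\) and \({S}_{b}\), which is an assumption consistent with the proof above, and breaks ties in favor of \(a\).

```python
from fractions import Fraction
from math import comb

S_A = [Fraction(3, 10)]  # hypothetical states in which a is optimal (s < 0.5)
S_B = [Fraction(7, 10)]  # hypothetical states in which b is optimal (s > 0.5)
N = 20                   # assumed sample size
LAM = 0.1                # assumed quantile

# Assumed definition: half the minimum distance between S_a and S_b.
EPS = min(abs(s - t) for s in S_A for t in S_B) / 2


def pmf(k, s):
    """Binomial(N, s) probability of k successes."""
    p = float(s)
    return comb(N, k) * p ** k * (1.0 - p) ** (N - k)


def delta_md(k):
    """Minimum-distance test rule: 0 (choose a) if the estimate k/N is at
    least as close to S_a as to S_b, 1 (choose b) otherwise."""
    est = Fraction(k, N)
    dist_a = min(abs(est - s) for s in S_A)
    dist_b = min(abs(est - s) for s in S_B)
    return 0 if dist_a <= dist_b else 1


def error_prob(s):
    """Probability that delta_md picks the inferior treatment in state s."""
    wrong_action = 1 if s in S_A else 0
    return sum(pmf(k, s) for k in range(N + 1) if delta_md(k) == wrong_action)


def far_prob(s):
    """Probability that the estimate is at least EPS away from s."""
    return sum(pmf(k, s) for k in range(N + 1) if abs(Fraction(k, N) - s) >= EPS)


for s in S_A + S_B:
    print(float(s), error_prob(s), far_prob(s), error_prob(s) < LAM)
```

In this example the estimator is \(\varepsilon\)-far from the truth with probability below \(\lambda =0.1\) in both states, and the error probability of the minimum-distance rule is bounded by that probability, so the rule has zero \(\lambda\)-quantile maximum regret.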

### Proposition 9

*Let* \(S\) *be a connected subset of a metric space* \(\left(\Theta ,d\right)\) *with distance* \(d\left(\cdot ,\cdot \right)\)*. Let* \({S}_{a>}\equiv \left\{s\in S:{{\alpha }}_{s}>{\beta }_{s}\right\}\) *and* \({S}_{b>}\equiv \left\{s\in S:{\alpha }_{s}<{\beta }_{s}\right\}.\) *Assume that the closure of the set* \({S}_{a>}\cup {S}_{b>}\) *is* \(S\)*; that is, for any* \(s \in S\) *and any* \(r > 0\)*, there exists* \({s}^{\prime}\in {S}_{a>}\cup {S}_{b>}\) *such that* \(d\left(s,{s}^{\prime}\right)<r.\) *Let the probability* \({Q}_{s}\left({\Psi }_{0}\right)\) *be continuous in* \(s\) *for every measurable subset of the sample space* \({\Psi }_{0}\subset\Psi .\) *Then no test rule with zero* \(\lambda\)*-quantile maximum regret exists for* \(\lambda < 1/2\). \(\square\)

### Proof

Let \(\lambda < 1/2\) and suppose that test rule \(\delta\) has zero maximum regret. Let \({s}_{a}\in {S}_{a>}\) and \({s}_{b}\in {S}_{b>}.\) Then \({Q}_{{s}_{a}}\left({\Psi }_{\delta b}\right)<\lambda\) and \({Q}_{{s}_{b}}\left({\Psi }_{\delta b}\right)>1-\lambda .\)

Given that \({Q}_{s}\left({\Psi }_{\delta b}\right)\) is continuous in \(s\) and that \(S\) is connected, there exists \({s}^{*}\) such that \({Q}_{{s}^{*}}\left({\Psi }_{\delta b}\right)=1/2.\) (See Rudin, 1976, Theorem 4.22). Continuity of \({Q}_{s}\left({\Psi }_{\delta b}\right)\) in \(s\) implies that there exists \(r > 0\) such that \(d\left(s,{s}^{*}\right)<r\Rightarrow \left|{Q}_{s}\left({\Psi }_{\delta b}\right)-{Q}_{{s}^{*}}\left({\Psi }_{\delta b}\right)\right|=\left|{Q}_{s}\left({\Psi }_{\delta b}\right)-1/2\right|<1/2-\lambda .\) Given that \(S=\text{cl}\left({S}_{a>}\cup {S}_{b>}\right),\) there exists either \({s}^{\prime}\in {S}_{a>}\) or \({s}^{\prime}\in {S}_{b>}\) such that \(d\left({s}^{\prime},{s}^{*}\right)<r.\) Hence, \(\left|{Q}_{{s}^{\prime}}\left({\Psi }_{\delta b}\right)-1/2\right|<1/2-\lambda .\)

If \({s}^{\prime}\in {S}_{a>}\), the condition \({Q}_{{s}^{\prime}}\left({\Psi }_{\delta b}\right)<\lambda\) implies that \(\left|{Q}_{{s}^{\prime}}\left({\Psi }_{\delta b}\right)-1/2\right|=1/2-{Q}_{{s}^{\prime}}\left({\Psi }_{\delta b}\right)>1/2-\lambda ,\) which contradicts the conclusion reached above. If \({s}^{\prime}\in {S}_{b>}\), the condition \({Q}_{{s}^{\prime}}\left({\Psi }_{\delta b}\right)>1-\lambda\) implies that \(\left|{Q}_{{s}^{\prime}}\left({\Psi }_{\delta b}\right)-1/2\right|={Q}_{{s}^{\prime}}\left({\Psi }_{\delta b}\right)-1/2>1/2-\lambda ,\) which again contradicts the conclusion reached above. Hence, \(\delta\) has positive \(\lambda\)-quantile maximum regret. \(\square\)

### Remark

The state space has the required structure if \(a\) is a status-quo treatment with known mean outcome \({{\alpha }}^{*}\in \left(0,1\right)\) and \(b\) is an innovation with mean outcome known to lie in the interval \(\left(0,1\right).\) Then \(S=\left(0,1\right),\) with \({{\alpha }}_{s}={{\alpha }}^{*}\) and \({\beta }_{s}=s\) for \(s \in S.\) Here \({S}_{a>}=\left(0,{{\alpha }}^{*}\right),\) \({S}_{b>}=\left({{\alpha }}^{*},1\right),\) and the closure of \({S}_{a>}\cup {S}_{b>}\) in \(S\) is \(\text{cl}\left({S}_{a>}\cup {S}_{b>}\right)=\left(0,1\right).\) The sampling distribution has the required continuity if, for example, \({Q}_{s}\) is Normal(*s*, *k*) for some fixed \(k > 0\) or if \({Q}_{s}\) is Binomial(*n*, *s*) for some integer \(n\).
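As a hedged numerical illustration of Proposition 9 in the setting of this Remark, the Python sketch below takes \({Q}_{s}\) to be Binomial(*n*, *s*) and considers only threshold rules that choose \(b\) when the number of successes is at least a cutoff `c`. The threshold form, the function names, and the values of `n`, `alpha_star`, `lam`, and the grid size are illustrative assumptions, not part of the paper. For each cutoff, the sketch searches a grid of states for one that violates the two inequalities used in the proof, \({Q}_{s}\left({\Psi }_{\delta b}\right)<\lambda\) on \({S}_{a>}\) and \({Q}_{s}\left({\Psi }_{\delta b}\right)>1-\lambda\) on \({S}_{b>}\).

```python
from math import comb

def prob_choose_b(s, n, c):
    """Q_s(Psi_delta_b): probability that a Binomial(n, s) count is at least c."""
    return sum(comb(n, k) * s**k * (1 - s)**(n - k) for k in range(c, n + 1))

def find_violation(alpha_star, n, c, lam, grid_size=2000):
    """Return (s, Q_s(Psi_delta_b)) for the first grid state that violates the
    proof's inequalities, or None if no grid state does."""
    for i in range(1, grid_size):
        s = i / grid_size
        p = prob_choose_b(s, n, c)
        if s < alpha_star and p >= lam:        # s in S_a>, but b is chosen too often
            return s, p
        if s > alpha_star and p <= 1 - lam:    # s in S_b>, but b is chosen too rarely
            return s, p
    return None

# Illustrative values (assumptions): n = 20, alpha* = 0.5, lambda = 0.4 < 1/2.
n, alpha_star, lam = 20, 0.5, 0.4
# Cutoffs 0 and n + 1 correspond to always choosing b and always choosing a.
violations = {c: find_violation(alpha_star, n, c, lam) for c in range(n + 2)}
print(all(v is not None for v in violations.values()))
```

With these values the search finds a violating state for every cutoff, which is consistent with the proposition. Threshold rules are only a subclass of test rules, so this is an illustration rather than a verification of the general result.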

## Notes

Let \(Q\) and \(F\) be the quantile and distribution functions of a random variable. For all \(u\in \left(0,1\right)\) and all real \(t\), \(Q\left(u\right)\le t\iff u\le F\left(t\right)\) (see, for example, Pfeiffer, 1990, p. 266). If \(u\) is itself random with distribution \(P\), it follows that \(P\left[Q\left(u\right)\le t\right]=P\left[u\le F\left(t\right)\right].\) If \(u\) is uniform on \(\left(0,1\right)\), \(P\left[u\le F\left(t\right)\right]=F\left(t\right).\)
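The equivalence in this note can be checked exactly on a small example. The Python sketch below uses a hypothetical three-point distribution and assumes that the quantile function is defined as \(Q\left(u\right)=\min \left\{x:F\left(x\right)\ge u\right\}\), a common convention that the note itself does not state. The names `support`, `probs`, `F`, and `Q` are illustrative.

```python
from fractions import Fraction

# Hypothetical discrete distribution: support points and their probabilities.
support = [0, 1, 3]
probs = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]

def F(t):
    """Distribution function: P(X <= t)."""
    return sum(p for x, p in zip(support, probs) if x <= t)

def Q(u):
    """Quantile function, assumed here to be Q(u) = min{x : F(x) >= u}."""
    return min(x for x in support if F(x) >= u)

def equivalence_holds(u, t):
    """Check Q(u) <= t  <=>  u <= F(t) for one pair (u, t)."""
    return (Q(u) <= t) == (u <= F(t))

# Check on a grid of u in (0, 1) and of t values around the support.
us = [Fraction(k, 20) for k in range(1, 20)]
ts = [Fraction(k, 2) for k in range(-2, 9)]
all_ok = all(equivalence_holds(u, t) for u in us for t in ts)
print(all_ok)
```

Exact rational arithmetic is used so that boundary cases such as \(u=F\left(t\right)\) are not affected by floating-point rounding.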

For a version that covers randomized tests, see, for example, Lehmann and Romano (2005), Theorem 3.2.1.

## References

Berger, J. (1985). *Statistical decision theory and Bayesian analysis* (2nd ed.). Springer-Verlag.

Binmore, K. (2009). *Rational decisions*. Princeton University Press.

Efron, B., & Morris, C. (1973). Stein’s estimation rule and its competitors–An empirical Bayes approach. *Journal of the American Statistical Association, 68*, 117–130.

Ferguson, T. (1967). *Mathematical statistics: A decision theoretic approach*. Academic Press.

Hadar, J., & Russell, W. (1969). Rules for ordering uncertain prospects. *American Economic Review, 59*, 2–34.

Hanoch, G., & Levy, H. (1969). The efficiency analysis of choices involving risk. *Review of Economic Studies, 36*, 335–346.

Hirano, K., & Porter, J. (2009). Asymptotics for statistical treatment rules. *Econometrica, 77*, 1683–1701.

Huber, P. (1981). *Robust statistics*. Wiley.

International Conference on Harmonisation. (1999). ICH E9 expert working group statistical principles for clinical trials ICH harmonized tripartite guideline. *Statistics in Medicine, 18*, 1905–1942.

James, W., & Stein, C. (1961). Estimation with quadratic loss. *Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1*, 361–379.

Jorion, P. (2006). *Value at risk: The new benchmark for managing financial risk* (3rd ed.). McGraw-Hill.

Karlin, S., & Rubin, H. (1956). The theory of decision procedures for distributions with monotone likelihood ratio. *Annals of Mathematical Statistics, 27*, 272–299.

Kitagawa, T., & Tetenov, A. (2018). Who should be treated? Empirical welfare maximization methods for treatment choice. *Econometrica, 86*, 591–616.

Lehmann, E., & Romano, J. (2005). *Testing statistical hypotheses*. Springer.

Levy, H., & Kroll, Y. (1978). Ordering uncertain options with borrowing and lending. *Journal of Finance, 33*, 553–574.

Manski, C. (1988). Ordinal utility models of decision making under uncertainty. *Theory and Decision, 25*, 79–104.

Manski, C. (2004). Statistical treatment rules for heterogeneous populations. *Econometrica, 72*, 221–246.

Manski, C. (2005). *Social choice with partial knowledge of treatment response*. Princeton University Press.

Manski, C. (2021). Econometrics for decision making: building foundations sketched by Haavelmo and Wald. *Econometrica, 89*, 2827–2853.

Manski, C., & Tetenov, A. (2007). Admissible treatment rules for a risk-averse planner with experimental data on an innovation. *Journal of Statistical Planning and Inference, 137*, 1998–2010.

Manski, C., & Tetenov, A. (2016). Sufficient trial size to inform clinical practice. *Proceedings of the National Academy of Sciences, 113*, 10518–10523.

Pfeiffer, P. (1990). *Probability for applications*. Springer.

Quirk, J., & Saposnik, R. (1962). Admissibility and measurable utility functions. *Review of Economic Studies, 29*, 140–146.

Rostek, M. (2010). Quantile maximization in decision theory. *Review of Economic Studies, 77*, 339–371.

Rudin, W. (1976). *Principles of mathematical analysis*. McGraw-Hill.

Savage, L. (1951). The theory of statistical decision. *Journal of the American Statistical Association, 46*, 55–67.

Savage, L. (1954). *The foundations of statistics*. Wiley.

Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. *Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1*, 197–206.

Stoye, J. (2009). Minimax regret treatment choice with finite samples. *Journal of Econometrics, 151*, 70–81.

Stoye, J. (2012). Minimax regret treatment choice with covariates or with limited validity of experiments. *Journal of Econometrics, 166*, 138–156.

Tetenov, A. (2012). Statistical treatment choice based on asymmetric minimax regret criteria. *Journal of Econometrics, 166*, 157–165.

U. S. Food and Drug Administration. (2014). Statistical guidance for clinical trials of nondiagnostic medical devices. www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm106757.htm, accessed October 11, 2014.

von Neumann, J., & Morgenstern, O. (1944). *Theory of games and economic behavior*. Princeton University Press.

Wald, A. (1950). *Statistical decision functions*. Wiley.

## Funding

Open access funding provided by University of Geneva.

## Additional information

This paper supersedes an early draft titled “The Quantile Performance of Statistical Treatment Rules: Using Hypothesis Tests to Allocate a Population to Two Treatments.” We have benefitted from the opportunity to present parts of this work at the January 2014 Workshop on Likelihood and Simplicity at Bar-Ilan University, the November 2014 Cemmap Conference on Microdata Methods and Practice, and in seminars at the Center for the Study of Rationality of the Hebrew University of Jerusalem, Northwestern University, the University of California at Berkeley, University College London, and University of Southern Denmark.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Manski, C.F., Tetenov, A. Statistical decision theory respecting stochastic dominance. *JER* **74**, 447–469 (2023). https://doi.org/10.1007/s42973-023-00145-2
