Empirical Bayes and Selective Inference

We review the empirical Bayes approach to large-scale inference. In the context of inference for a high-dimensional normal mean, empirical Bayes methods are advocated because they exhibit risk-reducing shrinkage, while establishing appropriate control of frequentist properties of the inference. We elucidate these frequentist properties and evaluate the protection that empirical Bayes provides against selection bias.


REVIEW ARTICLE
Neither purely frequentist nor Bayesian, empirical Bayes methodology appears to provide a satisfactory compromise between these two main philosophies of inference. Sun 21 noted the main features: empirical Bayes methods exhibit Bayesian features, such as risk-reducing shrinkage and selection adaptivity, while establishing appropriate control of frequentist properties of the inference. We consider here the empirical Bayes approach and illustrate these properties, while issuing the warning that it does not entirely solve the problems generated by data snooping.

Many Normal Means
J. Indian Inst. Sci. | VOL xxx:x | xxx-xxx 2022 | journal.iisc.ernet.in

The many normal means problem serves as a template for the analysis of contemporary large-scale inference 5 . In the problem, we model data x as the outcome of a random variable X = (X_1, . . . , X_p)^T, p ≥ 3, with a p-dimensional normal distribution with mean θ = (θ_1, . . . , θ_p)^T and identity covariance matrix I_p, so that X_1, . . . , X_p are independent and

(1) X_i | θ_i ∼ N(θ_i, 1), i = 1, . . . , p.

Inference is required for the unknown θ assumed to have generated the data. In the frequentist perspective, θ is considered as fixed, but having some true, unknown value. In the Bayesian formulation, θ is itself considered as a random quantity, with some assumed prior distribution θ ∼ g(θ). Then, given the specified prior, Bayesian inference is extracted from the posterior distribution g(θ|x), obtained through Bayes' Theorem 5, p.2 . We might, for instance, assume as prior a multivariate normal distribution in which θ_1, . . . , θ_p are independent, identically distributed N(0, A), in which case the posterior distribution is multivariate normal with mean Bx and variance matrix BI_p, where B = A/(A + 1).

Bayesian inference: in Bayesian inference, (X, θ) are assumed both to be random, and inference about the value of θ which gave rise to x is derived from the posterior distribution of θ, given X = x.
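The conjugate calculation above can be checked by a short Monte Carlo sketch (assuming the N(0, A) prior, with A = 1 purely for illustration): among replications whose observed x_i falls near a chosen value, the mean of the underlying θ_i should be close to the posterior mean Bx_i.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 1.0                      # prior variance, so B = A / (A + 1) = 0.5
B = A / (A + 1.0)
n = 1_000_000

theta = rng.normal(0.0, np.sqrt(A), n)   # theta_i ~ N(0, A)
x = theta + rng.normal(0.0, 1.0, n)      # x_i | theta_i ~ N(theta_i, 1)

# Condition on x falling in a narrow window around x0: the mean of theta
# in that window approximates the posterior mean E(theta | x0) = B * x0.
x0 = 1.5
window = np.abs(x - x0) < 0.05
posterior_mean_mc = theta[window].mean()

print(posterior_mean_mc, B * x0)   # both close to 0.75
```
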

Frequentist Analysis, Many Normal Means

We start our discussion with a frequentist analysis, described by Efron 3 as 'the most striking theorem of post-war mathematical statistics'. Given an estimator δ(X) of θ, define its risk function as

R(θ, δ(X)) = E_θ ‖δ(X) − θ‖²,

where ‖ · ‖ is the Euclidean norm, ‖X‖² = X_1² + · · · + X_p², and E_θ means expectation with respect to repeated sampling of X from the model, for fixed parameter value θ. In the inference problem, the intuitively obvious estimator is δ(X) ≡ X. This has constant risk: R(θ, X) ≡ p, whatever the value of θ. The James-Stein estimator is

(2) δ_JS(X) = (1 − (p − 2)/‖X‖²) X.

It turns out, though (see, for instance, 25, Chapter 3 ), that the James-Stein estimator (2) has a risk which is strictly smaller: R(θ, δ_JS(X)) < p, whatever the value of θ. We speak of X as 'inadmissible' as an estimator of θ. This inadmissibility result was discussed by Stein 19 and James and Stein 16 , though a simple proof was not provided until Stein 20 . Note the restriction p ≥ 3 here: if p = 1 or p = 2 the estimator δ(X) = X is actually admissible. The James-Stein estimator incorporates shrinkage: the individual X_i are shrunk, towards 0 in this formulation, in providing estimators of θ_1, . . . , θ_p, though the shrinkage factor 1 − (p − 2)/‖X‖² involves all components of X, which are assumed independent. Practical applications in data analysis of the James-Stein and related estimators are described, for example, by Efron and Morris 11,12 , and, famously, in the general interest article Efron and Morris 13 .
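The risk dominance can be seen numerically. The sketch below (with an arbitrarily chosen fixed θ, purely for illustration) estimates R(θ, X) and R(θ, δ_JS(X)) by simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 10
theta = np.linspace(-1.0, 1.0, p)   # an arbitrary fixed parameter value
reps = 200_000

X = theta + rng.normal(size=(reps, p))           # X ~ N_p(theta, I_p)
norm2 = (X ** 2).sum(axis=1)                     # ||X||^2 for each replication
delta_js = (1.0 - (p - 2) / norm2)[:, None] * X  # James-Stein estimator (2)

risk_mle = ((X - theta) ** 2).sum(axis=1).mean()        # approx p
risk_js = ((delta_js - theta) ** 2).sum(axis=1).mean()  # strictly below p

print(risk_mle, risk_js)
```

The gap between the two risks narrows as ‖θ‖ grows, but the James-Stein risk stays below p for every θ.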

Empirical Bayes Interpretation
The counterintuitive use in the James-Stein estimator of what may be termed indirect evidence, of the X_j, j ≠ i, in the estimation of the mean of X_i, has 9, page 282 always aroused controversy, but may be rationalized in empirical Bayes terms 10 .

Bayes' Theorem: Bayes' Theorem is the rule for the manipulation of conditional probabilities, P(A|B) = P(B|A)P(A)/P(B). In Bayesian inference, its application tells us that the posterior distribution g(θ|X = x) is proportional to the product of the prior density assumed on θ, g(θ), and the likelihood function, the density of X evaluated at the observed data value x, f(x|θ).

Multivariate normal distribution: a random vector has a multivariate normal distribution if any linear combination of its components has the univariate normal distribution, with the density of values around a central point being determined by the 'bell-shaped' Gaussian curve. The N(µ, σ²) distribution has density f(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)}.

Admissibility: An estimator δ(X) is said to dominate an estimator δ*(X) if R(θ, δ(X)) ≤ R(θ, δ*(X)) for all θ, and the inequality is strict for some θ. If an estimator can be dominated by some other estimator it is said to be inadmissible; otherwise it is admissible.

In the Bayes formulation suggested above, in which θ_1, . . . , θ_p are independent, identically distributed N(0, A), and under the assumed measure of loss ‖θ − δ(X)‖², the appropriate estimator of θ is the mean of the posterior distribution, δ_B(X) = BX, in terms of the random variable underlying the data sample. This minimises the Bayes risk

r(g, δ(X)) = ∫ R(θ, δ(X)) g(θ) dθ,

the risk function averaged over the assumed prior g(θ) on θ. If the prior variance A is specified, this Bayes estimator can be immediately applied, and its Bayes risk is readily calculated (Young and Smith 25, Chapter 3 ) as r(g, δ_B(X)) = pB. Under the model assumed, the X_i are marginally independent, identically distributed as N(0, A + 1), and a simple calculation shows that the shrinkage factor 1 − (p − 2)/‖X‖² has expectation B under this marginal distribution. So, in the case when A is unspecified in the model formulation, we may replace the unknown B in the expression for the Bayes estimator by the estimator B̂ = 1 − (p − 2)/‖X‖², giving precisely the James-Stein estimator. We then note 25, Section 3.5 that

r(g, δ_JS(X)) = r(g, δ_B(X)) + 2(1 − B) = r(g, δ_B(X)) + 2/(A + 1),

so that the increase in Bayes risk due to using the James-Stein estimator rather than the Bayes estimator δ_B(X) tends to zero as the prior variance A → ∞. So, the James-Stein estimator has desirable risk properties. Its frequentist risk is uniformly smaller than that of the obvious estimator, and its Bayes risk will often be close to that of the Bayes estimator, a desirable situation, as the Bayes estimator has important theoretical properties, such as admissibility: there is no estimator δ(X) with risk R(θ, δ(X)) uniformly smaller than R(θ, δ_B(X)): see, for instance, Young and Smith 25, Chapter 3 . Under our model assumptions, that we have X_i|θ_i, i = 1, . . . , p, independently distributed as N(θ_i, 1) and θ_1, . . .
, θ_p are independent, identically distributed N(0, A), and so with B = A/(A + 1), we have, given data outcome x, that

(3) θ_i ∈ x_i ± 1.96

is a 95% confidence interval (in a frequentist sense) for θ_i. Since the posterior distribution for θ_i|x_i is N(Bx_i, B), we have that

(4) θ_i ∈ Bx_i ± 1.96 √B

is a 95% posterior credible set for θ_i. Estimating B by B̂ gives an empirical Bayes posterior interval B̂x_i ± 1.96 √B̂. Efron 5, Section 1.5 noted a result, which he attributes to Carl Morris, that taking into account the variability of B̂ as an estimate of B leads to a refined empirical Bayes posterior interval

(5) θ_i ∈ B̂x_i ± 1.96 {B̂ + (2/(p − 2)) x_i²(1 − B̂)²}^{1/2}.

The above calculations are presented assuming the prior g(θ) under which θ_1, . . . , θ_p are independent, identically distributed N(0, A). Such assumptions can be generalised. Suppose still the model (1), but consider now the prior assumption that θ_1, . . . , θ_p are independent, identically distributed with common density g(θ). An elegant characterisation of the posterior distribution is given by 'Tweedie's formula' 6,9,Section 20.3 .

Suppose that X is distributed as N(θ, 1) and θ has prior g(θ). The marginal density of X is

f(x) = ∫ φ(x − θ) g(θ) dθ,

in terms of the density φ(·) of N(0, 1). Tweedie's formula provides an expression for the posterior expectation of θ having observed x:

E(θ|x) = x + l′(x), where l(x) = log f(x).

The key point here is that the posterior expectation E(θ|x) is expressed directly in terms of the marginal density f(x), the context for empirical Bayes. We do not know the prior g(θ), but in large-scale situations we can construct an estimate f̂(x) of f(x) from the data x = (x_1, . . . , x_p), the realised value of X, using techniques such as Poisson regression.

In general, empirical Bayes analysis is characterised by the estimation of prior parameter values from marginal distributions of data. With the prior parameter values fixed at these estimates, we proceed as in a regular Bayes analysis, as if the values had been specified, without consideration of the data, at the beginning.

Confidence interval: in frequentist inference a confidence set is a random set S(X) which, under the assumed sampling distribution for X, contains the true fixed value of the parameter θ determining this distribution a specified proportion of the time.

Posterior credible set: a posterior credible set S(x) is a set which contains a specified proportion of the probability mass of the posterior distribution g(θ|x), for the given data x.
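Tweedie's formula can be sketched numerically. Under the N(0, A) prior the marginal is N(0, A + 1) and E(θ|x) = x + l′(x) reduces exactly to Bx, so a density estimate f̂ plugged into the formula should approximately recover the linear shrinkage rule. The sketch below (assuming A = 1 for illustration) uses a simple kernel estimate of the marginal, in place of the Poisson regression fit mentioned above:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
A = 1.0
B = A / (A + 1.0)
p = 50_000

theta = rng.normal(0.0, np.sqrt(A), p)
x = theta + rng.normal(size=p)          # marginally N(0, A + 1)

f_hat = gaussian_kde(x)                 # kernel estimate of the marginal density f
grid = np.linspace(-2.0, 2.0, 81)
l_hat = np.log(f_hat(grid))             # l(x) = log f(x)
tweedie = grid + np.gradient(l_hat, grid)   # E(theta | x) = x + l'(x)

max_err = np.abs(tweedie - B * grid).max()  # compare with the exact answer Bx
print(max_err)
```
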

Properties of Empirical Bayes and Their Relevance
Empirical Bayes methods are advocated for contemporary large-scale problems of statistical inference on the basis that: they provide a synthesis between frequentist and Bayesian approaches; they ensure some degree of protection against selection bias.
Stressed throughout contemporary discussions of empirical Bayes is the notion that such methods yield, for the context of large scale simultaneous inference, procedures with interpretable frequentist properties. The desirability of this is supported by Cox 1, Appendix B who comments that 'from a general perspective one view of Bayesian procedures is that, formulated carefully, they may provide a convenient algorithm for producing procedures that may have very good frequentist properties' . We will demonstrate this in empirical illustrations below, examining the frequentist coverage properties of empirical Bayes intervals (5). Efron 8 provides ingenious methods by which the frequentist properties of Bayesian procedures may be estimated directly from given data.
It is often asserted (see, for instance, 9, Chapter 3 ) that Bayesian inference is immune to selection bias. Taking the assertion as justified offers 9, Section 20.3 some hope that empirical Bayes estimators, such as the James-Stein estimator and those constructed via Tweedie's formula, provide realistic protection against selection bias, and will provide some cure for data snooping. Convincing evidence is given by Efron 6 . However, as we discuss below, some care must be taken in trusting this assertion. In essence, the immunity only holds if selection is assumed to operate both on θ and X, rather than only on X (for fixed θ generated from its prior g(θ)): see Sect. 4.2.

Testing Versus Estimation
Focus of the above is on estimation of θ in the model (1). The empirical Bayes analysis that we have sketched utilises what may be termed 9, Section 15.5 an effect size model: θ_i ∼ h(θ) and, given θ_i, X_i ∼ N(θ_i, 1), with the assumed prior h(θ) not having an atom at θ = 0. A major focus of large-scale inference is the application of empirical Bayes ideas to provide an effective untangling of the interpretation of simultaneous test results: see, for instance, Efron and Hastie 9, Section 15.3 .

A simple Bayesian framework for simultaneous testing 9, Section 15.3 is provided by a two-groups model: each of the p 'cases' (x_1, . . . , x_p) is either null, with prior probability π_0, or non-null, with probability π_1 = 1 − π_0. The observation x then has density either f_0(x) or f_1(x). If π_0 = Pr(null), the density underlying an observation is the mixture f(x) = π_0 f_0(x) + π_1 f_1(x). We may reasonably assume f_0(x) to be the density of the standard normal distribution N(0, 1), while the non-null density f_1(x) is typically unknown. Suppose an observation x_i is seen to exceed some threshold x_0, and define the Bayes false discovery rate Fdr(x_0) to be the probability that the observation x_i is null, given that it exceeds x_0. Then Fdr(x_0) = π_0 S_0(x_0)/S(x_0), where S_0(·) and S(·) are the survival functions corresponding to f_0 and f respectively. We suppose that S_0(x_0) is known, and π_0 may reasonably in typical applications be assumed to be close to 1, while S(x_0) can be estimated by the empirical survival function S̄(x_0), the proportion of the observations x_1, . . . , x_p exceeding x_0. Then we immediately have an empirical Bayes estimate of the Bayes false discovery rate: Fdr̄(x_0) = π_0 S_0(x_0)/S̄(x_0). Efron and Hastie 9, Chapter 15 discuss the relationship of this empirical Bayes posterior probability of nullness with frequentist procedures of simultaneous hypothesis testing based around control of the false discovery rate: see also Efron 5, Chapter 4 . Efron and Hastie 9, page 282 note how, in contrast with James-Stein estimation, such methods of simultaneous hypothesis testing arouse little conceptual controversy.
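The empirical tail estimate π_0 S_0(x_0)/S̄(x_0) can be sketched in a small simulation (with an assumed non-null distribution N(0, 9), chosen purely for illustration), where it can be compared with the exact Bayes false discovery rate under the same generating model:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
p, pi0 = 100_000, 0.9
null = rng.random(p) < pi0
# Two-groups model: null cases N(0, 1), non-null cases N(0, 9) (illustrative).
x = np.where(null, rng.normal(size=p), rng.normal(0.0, 3.0, p))

x0 = 2.5
S0 = norm.sf(x0)                 # known null survival function S_0(x0)
S_bar = (x >= x0).mean()         # empirical estimate of S(x0)
Fdr_hat = pi0 * S0 / S_bar       # empirical Bayes estimate of Fdr(x0)

# True Bayes Fdr under this generating model, for comparison.
S_true = pi0 * norm.sf(x0) + (1 - pi0) * norm.sf(x0 / 3.0)
Fdr_true = pi0 * norm.sf(x0) / S_true
print(Fdr_hat, Fdr_true)
```
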
Having observed x_i equal to some value x_0, we would be more interested in the probability of nullness given x_i = x_0, rather than given x_i ≥ x_0. We can therefore define the local false-discovery rate as fdr(x_0) = Pr(null | x_i = x_0). We have that fdr(x_0) = π_0 f_0(x_0)/f(x_0), so a local false-discovery estimate fdr̂(x) = π_0 f_0(x)/f̂(x) can be constructed using a curve f̂(x) which smooths a histogram of the values {x_1, . . . , x_p}. The R package locfdr implements construction of the local false-discovery estimate, which in a data analysis can be used as a selection mechanism to identify parameters for formal inference. The null proportion π_0 can be estimated, or approximated as 1. Similarly, the theoretical standard normal null density f_0(x) can, in practice, be estimated: see Efron and Hastie 9, Section 15.5 for a summary and discussion. In a data analysis we might define an observation as being 'interesting' if, say, fdr̂(x_i) ≤ 0.2, and flag such cases for follow-up investigation, or as cases where we wish to do a formal inference. This, of course, has to be done in a way that accounts for the selection.
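In the same spirit as the locfdr construction, a minimal Python sketch of the local estimate π_0 f_0(x)/f̂(x) is below, with f̂ a kernel density estimate standing in for the package's spline fit, and the same illustrative N(0, 9) non-null distribution as before:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(4)
p, pi0 = 10_000, 0.9
null = rng.random(p) < pi0
x = np.where(null, rng.normal(size=p), rng.normal(0.0, 3.0, p))  # illustrative mixture

f_hat = gaussian_kde(x)                    # smooth estimate of the mixture density f
fdr_hat = pi0 * norm.pdf(x) / f_hat(x)     # local fdr estimate: pi0 f0(x) / f_hat(x)

interesting = fdr_hat <= 0.2               # flag 'interesting' cases for follow-up
print(interesting.sum(), np.abs(x[interesting]).min())
```

Only observations well out in the tails are flagged; values near zero have local fdr close to 1.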

Selective Inference
Classical statistical methods are designed to give error guarantees in situations where the objectives of the inference are specified before collecting the data. In contemporary problems, though, such idealised settings are the exception rather than the norm. More realistically, an exploratory analysis of the data is performed before selecting the relevant inferential questions to examine, often, as in a regression setting, in the form of a selected model. Failing to acknowledge this adaptivity in the subsequent inferences can render the reported error assessments invalid: for instance, frequentist Type I error guarantees of testing procedures are lost. This problem of selection bias has received considerable attention in recent years, particularly from a frequentist perspective. Efron 7 describes methods for error assessment in inference on parameters which account for model selection effects. Though 9, Chapter 20 there is no overarching general theory for inference after data snooping, prominent among approaches to remedying the effects of selection bias is the conditional approach, which says that inference after selection should be based on hypothetical data samples which would lead to the same inference problem being tackled. This provides a broad framework for inference, which encapsulates the large-scale inference problem, expressed by (1), which is our focus here.
Suppose our data x represents the realisation of a random variable X ∈ X, whose sampling distribution we model by some parametric family {F(x; θ) : θ ∈ Θ}, with F(x; θ) denoting the distribution function of X under θ. In the example that is our focus here, F(x; θ) denotes the multivariate Gaussian distribution N_p(θ, I_p). The density function of X we denote by f(x; θ). We assume that there is a set of m potential parameters of interest, {ψ_1(θ), . . . , ψ_m(θ)}, from which at most one is to be selected for inference after observing the data. This selection may, as we will discuss, be made according to a randomised procedure. In the many normal means problem, the set of potential parameters of interest would contain all subsets of the p means {θ_1, . . . , θ_p}. We assume that we know the forms of the functions p_i : X → [0, 1], where p_i(x) is the probability that the parameter ψ_i(θ) is selected for inference when X = x is observed. In our illustrations later, we will specify selection to entail choice of a one-dimensional parameter for inference, specifically the mean θ_I corresponding to the largest element of X, X_I = max{X_1, . . . , X_p}. We therefore simplify notation by writing the selected parameter simply as ψ, and the corresponding selection probability as p(x).
The conditional approach to frequentist inference 14 advocates that inference for the selected parameter ψ should be based on the conditional distribution of the data given selection. This distribution has density

(6) f_S(x; θ) = f(x; θ) p(x)/ϕ(θ), where ϕ(θ) = ∫ f(x; θ) p(x) dx,

so that the normalising constant ϕ(θ) is the probability that ψ gets selected when θ is the true parameter. In general, inference based on this selective density f_S(x; θ) may be awkward: ϕ(θ) may be intractable, and inference on ψ may be complicated by the presence of nuisance parameters. In the normal means example, inference on θ_I depends on the unknown nuisance parameters θ_j, j ≠ I. Simpler forms of inference, which achieve the same protection against selection bias, are desirable: an attractive idea is discussed below.
We can interpret the conditional approach as a form of information splitting. For a given ψ, let R be the Bernoulli random variable which takes the value 1 if ψ gets selected for inference, and 0 otherwise, so that R|X ∼ Bernoulli{p(X)}. Following Fithian et al. 14 , the data generating process of X may be thought of as consisting of two stages. In the first, the value, r say, of R is sampled from its marginal distribution, and in the second stage X is sampled from the conditional distribution X|r. Since it is R which determines whether inference is provided for ψ or not, inference based on information revealed at the second stage is necessarily free of any selection bias, since it eliminates the information about the parameter provided by R. So, the information provided by the data is divided into two portions, one of which (R) is used for selection and the other (X|R) is used for the actual inference.
There is a trade-off between the power of the selection mechanism (the ability to identify a parameter when it is truly interesting, such as a significant effect) and the power of the subsequent inferential method. If powerful inference is required and obtaining new data after selection is infeasible, we need to utilise the available information efficiently. The amount of information used for selection can be limited by applying the selection mechanism to a randomised version of the original data: see Tian and Taylor 22 , Garcia Rasines and Young 15 . Formally, we can generate a random variable W, with known distribution and independent of the data, and apply the selection mechanism to U = u(X, W), where u is some function of the data and the noise: a convenient case for the context of the model (1) is U = X + W. Note that, if p_U(u) denotes the selection function in terms of U, the selection function of the data X would be computed as p(x) = E_W{p_U(u(x, W))}. Here W ∼ N_p(0, γ I_p) is a noise vector independent of X, and selection occurs when U ∈ E, where E ⊆ R^p is some selection event, defining when the quantity ψ is chosen for inference. Then we note that U = X + W and V = X − (1/γ)W are independent, from properties of the normal distribution. Now, selection is defined only in terms of U, so a simple inference can be based on V, which is unaffected by the selection of ψ as our focus of inferential interest. Such inference is trivial, as V is distributed as N_p(θ, {1 + γ^{-1}}I_p). Note that U is distributed as N_p(θ, {1 + γ}I_p), so the noise parameter γ has the role of balancing how much information about θ we have in the selection and inferential stages. Garcia Rasines and Young 15 consider methods of inference based on V in regression models.
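The information split can be verified directly: for Gaussian noise, U = X + W and V = X − γ^{-1}W are uncorrelated (hence, being jointly normal, independent), with the stated variances. A quick numerical sketch in one dimension (with illustrative values θ = 1, γ = 0.5):

```python
import numpy as np

rng = np.random.default_rng(5)
gamma = 0.5
n = 400_000
theta = 1.0                                # a fixed scalar mean, for illustration

X = theta + rng.normal(size=n)             # X ~ N(theta, 1)
W = rng.normal(0.0, np.sqrt(gamma), n)     # W ~ N(0, gamma), independent of X
U = X + W                                  # used only for selection
V = X - W / gamma                          # used only for inference

corr = np.corrcoef(U, V)[0, 1]
print(corr, U.var(), V.var())   # corr near 0; variances near 1 + gamma, 1 + 1/gamma
```
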

A Simple Univariate Model
To illustrate some of these ideas, consider the simple univariate (p = 1) normal model in which X ∼ N(θ, 1), but suppose a selection, or truncation, condition X > 0 is imposed: any data provided for analysis satisfies x > 0. Under the condition-on-selection paradigm, inference on θ is based on the conditional distribution of X|X > 0, with selective density f_S(x; θ) = φ(x − θ)/Φ(θ), in terms of the distribution function Φ(·) of the standard normal distribution. Let F(x; θ) be the corresponding distribution function. For given observed data outcome x_o we can construct the appropriate selective confidence interval, of exact coverage 1 − α under repeated sampling of X subject to the selection event X > 0. Unfortunately, such inference is inappropriate. Consider the case θ ≪ 0: in such a situation the selection probability P(X > 0) is vanishingly small, and the data outcome X = x_o contains little information about the value of θ. Indeed, Kivaranovic and Leeb 17 show that such confidence intervals have infinite expected length under repeated sampling. Suppose, instead, we apply the randomisation idea, and provide inference on θ if and only if U = X + W > 0, where W is random noise, independent of X, with distribution N(0, γ). Then the selective density becomes

(7) f_S(x; θ) = φ(x − θ) Φ(x/√γ) / Φ(θ/√(1 + γ)).

Now, with this randomisation, the confidence interval constructed from the selective density is known to have finite expected length 18 . In fact, the length of the confidence interval is bounded above by the length of the confidence interval based on the normal distribution of V = X − γ^{-1}W alone. There is loss, in terms of the size of the confidence set, in providing inference here using V alone, rather than from the conditional distribution of X|U > 0. In general, however, the full conditional model X|{U ∈ E} may be complicated or intractable. The cost incurred in using V alone for inference will depend on how informative the distribution of U|{U ∈ E} is about the parameter of interest.
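For the truncation case above, the selective confidence interval can be computed by inverting the distribution function of X|X > 0 (a normal truncated to the positive axis). A sketch using equal-tailed inversion, at an illustrative observation x_o = 2 and 95% coverage:

```python
from scipy.stats import norm
from scipy.optimize import brentq

def F_S(x, theta):
    # Distribution function of X | X > 0, where X ~ N(theta, 1):
    # P(X <= x | X > 0) = 1 - Phi_bar(x - theta) / Phi_bar(-theta),
    # with Phi_bar the normal survival function (numerically stable in the tails).
    return 1.0 - norm.sf(x - theta) / norm.sf(-theta)

def selective_ci(x_obs, alpha=0.05):
    # Equal-tailed interval: endpoints solve F_S(x_obs; theta) = 1 - alpha/2, alpha/2.
    # F_S is decreasing in theta, so each equation has a unique root.
    lo = brentq(lambda t: F_S(x_obs, t) - (1 - alpha / 2), -10.0, x_obs + 10.0)
    hi = brentq(lambda t: F_S(x_obs, t) - alpha / 2, -10.0, x_obs + 10.0)
    return lo, hi

ci = selective_ci(2.0)
print(ci)   # a 95% selective interval; here it contains 0
```
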
Figure 1 considers the length of confidence sets of coverage 90% constructed from the selective density (7), as a function of the true mean θ , in comparison with the length of the confidence set constructed from the normal distribution of V, which also has repeated sampling coverage 90%, and the length of the 'face value' interval constructed from the N (θ, 1) distribution of X, ignoring selection. The latter does not have repeated sampling coverage close to the nominal 90%, unless θ ≫ 0 . In this simple univariate normal model, the cost in terms of the length of the confidence set might be judged as very slight if the true value of θ is less than, say, about − 1.

Selection Bias and Bayesian Inference
Why is selection bias a problem? Frequentist methods evaluate the accuracy of inferential procedures with respect to the sampling distribution of X at a fixed value of the parameter. Since selection modifies the sampling distribution, by favouring data values with higher selection probability, it is clear that inferential correctness requires that the reported accuracy be appropriately modified by accounting for the selection, through use of f S (x; θ) as the basis for inference. The Bayesian viewpoint, as we have seen, is, instead, that once the data has been observed, the recognition that a different data realisation could have resulted in a different inferential problem being posed, or none at all, should have no effect on the inference 2 . This position has been challenged (see, for instance, 24 ). Our central thesis here is that we must reassess the view that Bayesian and empirical Bayes methods necessarily provide the protection from selection effects that has been crucial to valid inference in large-scale problems.
According to Yekutieli 24 , the correct Bayesian inference for a selected parameter depends on how the selection mechanism acts on the parameter space. Consider the joint sampling distribution of (θ, X), and a selection function p(x). We say that θ is random if the joint sampling scheme for the parameter and data is such that the pairs (θ, X) are sampled from their joint distribution until ψ ≡ ψ(θ) gets selected, and say that θ is fixed if θ is sampled from its marginal distribution, held fixed, and X sampled from its conditional distribution X|θ until ψ is selected for inference. Woody et al. 23 refer to these two scenarios as 'joint selection' and 'conditional selection', respectively. As above, let R be the binary random variable that indicates if selection of the parameter ψ under consideration has happened. If θ is random, its density given selection and a prior density π(θ) is π(θ|R = 1) ∝ π(θ)P_θ(R = 1) = π(θ)ϕ(θ). On the other hand, if θ is fixed, its conditional density is unchanged, π(θ|R = 1) = π(θ). The conditional density of the data x given θ and selection is f_S(x; θ) = f(x; θ)p(x)/ϕ(θ) in both cases. Then, the posterior distribution for a random parameter is

(8) π(θ|x, R = 1) ∝ π(θ)ϕ(θ) × f(x; θ)p(x)/ϕ(θ) ∝ π(θ)f(x; θ),

the usual Bayesian posterior, constructed without consideration of the selection. Hence Bayesian inference about ψ = ψ(θ), which is extracted from π(θ|x), is unaffected by selection in this case. In the case of a fixed parameter, the posterior is given by

(9) π(θ|x, R = 1) ∝ π(θ)f(x; θ)p(x)/ϕ(θ) ∝ π(θ)f_S(x; θ).

So, for a fixed parameter the posterior needs to be adjusted, and would formally be obtained by attaching the prior density π(θ) to the selective likelihood f_S(x; θ). The viewpoint that Bayesian inference does not require an adjustment for selection, and that protection against selection bias might be expected to be afforded by the empirical Bayes approaches to inference sketched above, follows from the implicit assumption that the parameter is random.
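The random-parameter case can be checked by simulation: sampling (θ, X) pairs jointly and retaining only those meeting a selection rule (here X > 1, with an assumed N(0, 1) prior, both choices purely illustrative), the unadjusted posterior credible interval retains its nominal coverage among the selected pairs, exactly as the unadjusted posterior predicts.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
theta = rng.normal(size=n)            # theta ~ N(0, 1)
x = theta + rng.normal(size=n)        # X | theta ~ N(theta, 1)

keep = x > 1.0                        # joint selection: keep pairs with X > 1
t, xs = theta[keep], x[keep]

# Unadjusted posterior: theta | x ~ N(x/2, 1/2); 95% credible interval.
half = 1.96 * np.sqrt(0.5)
coverage = np.mean(np.abs(t - xs / 2.0) < half)
print(coverage)   # close to 0.95 despite the selection
```

Under the fixed-parameter scheme, by contrast, the same unadjusted interval would need the selective-likelihood correction to retain its coverage.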
While the posterior densities (8) and (9) are formally correct given the respective sampling mechanisms, it might be argued that it is not clear that a parameter can be labelled as random or fixed without explicit consideration of the sampling mechanism. In the context of the normal means problem, it may be reasonable to consider the parameter θ as random, but the sampling process might not be well-defined, and caution is appropriate in any assumption that Bayesian inference (or the empirical Bayes inference we have described) does provide protection against selection bias. In the context, say, of a genetic study where the quantity X_i is a measurement relating to gene i, with θ_i being some 'true effect' due to that gene, it might be reasonable to consider θ_i as an intrinsic quantity associated with that gene, i.e. to consider, in terms of our discussion, θ_i as a fixed parameter.

Random and Fixed Parameter Models
We describe here a variant of the analysis carried out by Efron and Hastie 9, Section 20.3 to examine the idea that empirical Bayes estimates are a realistic approach to the problem of selection bias introduced by data snooping. We consider the many normal means model (1), with p = 1000. We specify the distribution of θ = (θ_1, . . . , θ_p) to be such that the components are independent N(0, 1), so that the posterior distribution of θ_i|x_i is N(Bx_i, B), with B = 1/2, and the Bayes estimator is E(θ_i|x_i) = Bx_i. The empirical Bayes estimator, that is, the James-Stein estimator, is θ̂_i = B̂x_i, with B̂ = 1 − (p − 2)/‖x‖². We generate 50,000 replications from the specified joint distribution of (θ, X). For each we determine the index I corresponding to the largest observed data point, I = argmax{X_i}, and construct the face value interval (3), the Bayes interval (4) and empirical Bayes interval (5) for θ_I. Figure 2 shows a histogram for the first 1000 replications of X_I − θ_I, together with the corresponding histogram for θ̂_I − θ_I. Selection bias is obvious: the fact that we have chosen to examine the parameter value θ_I corresponding to the largest observation means that the uncorrected, face value differences are not centred on zero, but shifted to the right. By contrast, the empirical Bayes differences θ̂_I − θ_I are centred at zero. The coverages of the true θ_I over the 50,000 replications of the face value, Bayes and empirical Bayes intervals (3), (4) and (5) were 0.330, 0.950 and 0.949 respectively. The empirical Bayes interval delivers the desired frequentist property, of containing the parameter of interest θ_I in very close to 95% of the replications. Selection of the parameter of interest as θ_I from the data means that the face value interval has frequentist coverage very far from the nominal desired 95%.
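A reduced version of this experiment (fewer replications, and only the face value and Bayes intervals, purely as a sketch) reproduces the qualitative conclusion: the face value interval (3) badly under-covers the selected θ_I, while the Bayes interval (4) retains close to 95% coverage.

```python
import numpy as np

rng = np.random.default_rng(7)
p, reps, B = 1000, 2000, 0.5

theta = rng.normal(size=(reps, p))          # theta_i ~ N(0, 1)
x = theta + rng.normal(size=(reps, p))      # x_i | theta_i ~ N(theta_i, 1)

I = x.argmax(axis=1)                        # select the largest observation
rows = np.arange(reps)
xI, thI = x[rows, I], theta[rows, I]

# Coverage of the selected theta_I by the face value interval (3)
# and the Bayes interval (4).
cov_face = np.mean(np.abs(xI - thI) < 1.96)
cov_bayes = np.mean(np.abs(B * xI - thI) < 1.96 * np.sqrt(B))
print(cov_face, cov_bayes)
```
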
Note that the face value interval (3) has, for this context, constant width 3.92, while the Bayes interval (4) has constant width 2.77, and over the 50,000 replications the empirical Bayes interval had average width 2.80. The Bayes and empirical Bayes estimators of θ_I were virtually unbiased over the replications, while the face value estimator X_I displays substantial positive bias in this situation, demonstrating the need to correct the inference for selection. As we have argued, to mitigate the selection bias we can utilise the idea of randomisation. For each specified noise level γ, we define the parameter of interest from a particular dataset X = (X_1, . . . , X_p), with the X_i independent, X_i ∼ N(θ_i, 1), and independent noise {W_1, . . . , W_p}, with the W_i independent, identically distributed N(0, γ).
The above analysis reflects what we described in Sect. 4.2 as a random parameter context: on each of the replications (θ, X) was simulated from the specified joint distribution. Instead, we consider now repeating the simulation for a fixed parameter context. Here, a fixed value θ = (θ 1 , . . . , θ p ) , again with p = 1000 , was generated from the assumed prior, in which the elements of θ are independent N(0, 1). Figure 3 shows a histogram of the 1000 values θ 1 , . . . , θ 1000 , as a probability distribution, with the N(0, 1) density from which they were generated superimposed.
Two different analyses are then carried out. In the first simulation, for each of 20,000 replications we make inference for θ_I, I = argmax{X_i + W_i}. Therefore, as before, we are considering inference for a different target parameter on each replication. Note that we are not therefore considering coverages of confidence sets in any conventional frequentist sense, as the parameter for which inference is made is not held fixed over the replications. As before, we consider the repeated sampling coverages of the face value, Bayes and empirical Bayes intervals (3), (4) and (5), which we recall are all of nominal 95% coverage. Now also included in the analysis are coverages of confidence intervals for θ_I obtained from the normal distribution of V_I = X_I − (1/γ)W_I, which we argued is unaffected by selection. Table 1 shows that, indeed, intervals based on this latter quantity yield the nominal desired repeated sampling properties in this fixed parameter sampling model, even under selection. The empirical Bayes intervals do not fully mitigate against data snooping in this sampling model, while inference based on V_I does. Table 1 shows coverages of the Bayes and empirical Bayes intervals to be some way off the desired 95%. A histogram of differences θ̂_I − θ_I for the 'no noise' case, γ = 0, is still centred on zero, but is not symmetric, with positive skewness. In a further simulation, a particular dataset X = (X_1, . . . , X_p), with the X_i independent, X_i ∼ N(θ_i, 1), was generated, and I = argmax{X_i} ≡ 269 identified. It is worth noting that the selected parameter of interest is not the largest θ_i: in fact 22 values exceed θ_269, as shown by the vertical line in Fig. 3. Then we reconsider the repeated sampling coverages of the face value, Bayes and empirical Bayes intervals (3), (4) and (5), but now conditional on the fixed parameter value θ_269 (actually equal to 2.120) being chosen as the parameter of interest on each replication.
So, each of the 20,000 replications in this case had X_269 = max{X_j}. On each of the replications, the parameter of interest is the same, so in this analysis we are actually examining the coverages of the confidence sets in a strict frequentist sense. The empirical Bayes method does not protect against selection bias in this fixed parameter context. The whole analysis was repeated based on randomised data (X_1 + W_1, . . . , X_1000 + W_1000), for different noise levels γ. When γ = 1.0, for instance, the target parameter turned out to be defined as θ_I ≡ θ_115 = 2.719. The coverages of the Bayes and empirical Bayes intervals, shown in Table 2, are now very far from the nominal desired 95%. Inference based on V_I does ensure strict frequentist control of confidence set coverage. Figure 4 provides the analogue of Fig. 2 for this fixed parameter model, when the analysis is based on the randomised data (X_1 + W_1, . . . , X_1000 + W_1000), for the case γ = 1. The same bias due to selection of the interest parameter in a 'face value' inference as seen in Fig. 2 is evident. By contrast with Fig. 2, in this fixed parameter model, the differences θ̂_I − θ_I are no longer centred around zero. Figure 5 shows that the distribution over the replications of V_I − θ_I is centred at zero. The corresponding figures for the face value and empirical Bayes estimators X_I and θ̂_I are very similar for other noise levels γ, including the case γ = 0, when no randomisation is employed.

A Two-Groups Model
We consider now the two-groups model considered in Sect. 3.3. In this situation, as discussed, it is reasonable to suppose that the proportion π_0 of the elements of θ that are null, θ_i = 0, is large. We take, as before, p = 1000, and set θ_1 = ⋯ = θ_900 = 0, with the remaining components of θ as a set of independent realisations of N(0, 1), held fixed over a series of 20,000 replications of the model (1). We report here results for the situation where inference is made on θ_I, this chosen parameter of interest being selected on the basis of randomised data: X_I + W_I = max{X_1 + W_1, ..., X_1000 + W_1000}, with the noise variables W_1, ..., W_1000 independent N(0, γ). So, again, the target parameter varies over the replications.

In this context, since we might primarily be interested in identifying true, non-zero, effects, rather than just examining the overall coverage properties of the empirical Bayes interval and the interval based on V_I, we examine P(θ_I ∈ Interval | θ_I = 0) and P(θ_I ∉ Interval | θ_I ≠ 0). If an interval contains zero, we might conclude that there is no evidence to suggest that the corresponding effect is non-null, while if the interval does not include zero, we might infer evidence of a non-null effect.

Results are given in Table 3. The empirical Bayes intervals for θ_I contain zero too high a proportion of times, while intervals based on V_I correctly contain θ_I, when the true value of this selected parameter is θ_I = 0, on the specified proportion 95% of replications. The inference based on the randomisation quantity V_I is more powerful, in the sense that the intervals for non-zero selected θ_I do not include zero in a higher proportion of replications. Of course, as the noise level γ increases, the proportion of replications for which the selected parameter of interest is actually null increases.
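A sketch of this two-groups simulation, under illustrative assumptions (specific θ realisation, seed and a reduced replication count; only the interval based on V_I is computed here, not the paper's empirical Bayes interval):

```python
import numpy as np

rng = np.random.default_rng(2)

p, reps, gamma = 1000, 5000, 1.0
theta = np.concatenate([np.zeros(900), rng.normal(0.0, 1.0, 100)])  # 90% nulls

half = 1.96 * np.sqrt(1 + 1 / gamma)      # half-width of the V_I interval
null_sel = null_cover = alt_sel = alt_flag = 0
for _ in range(reps):
    x = rng.normal(theta, 1.0)
    w = rng.normal(0.0, np.sqrt(gamma), p)
    i = np.argmax(x + w)                  # selection on the randomised data
    v = x[i] - w[i] / gamma
    if theta[i] == 0:                     # a null effect was selected
        null_sel += 1
        null_cover += abs(v) <= half      # interval contains the true value 0
    else:                                 # a genuine effect was selected
        alt_sel += 1
        alt_flag += abs(v) > half         # interval excludes zero: effect flagged

print(null_cover / null_sel)   # close to 0.95, as for Table 3's V_I column
print(alt_flag / alt_sel)      # proportion of selected non-null effects flagged
```

The conditional coverage P(θ_I ∈ Interval | θ_I = 0) is controlled for the V-based interval because the event that a null component is selected is a function of X + W alone, which is independent of V.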

Data Analysis
Efron and Hastie 9, Section 13.3 and elsewhere, discuss analysis of data from a prostate cancer study. The data consist of a set of p = 6033 observations (X_1, ..., X_p), each measuring the effect of one gene. Efron and Hastie 9, Section 15.1 describe how these observations are extracted from raw gene expression data comparing a set of prostate cancer patients and a set of control patients. The objective is to identify non-null genes, for which the patients and the controls respond differently: a reasonable model for both null and non-null genes is the normal means model (1).

Suppose we use f̂dr(x_i) < 0.2 as a selection rule, based on the data on all p = 6033 genes, jiggled by injection of small levels of random N(0, γ) noise, with γ = 0.25. This identifies 15 'interesting' cases. The plot produced by locfdr with default settings is shown as Fig. 6. Note that the estimated null distributions, by both maximum likelihood and the central matching estimation method 4, are normal distributions with standard deviation close to √(1 + 0.25), as we expect, since locfdr is applied to {X_1 + W_1, ..., X_p + W_p}, with the X_i assumed to have variance 1 and the independent noise variables W_i specified to have variance 0.25.
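The variance inflation noted at the end can be checked directly: adding independent noise of variance 0.25 to unit-variance observations yields standard deviation √(1 + 0.25) = √1.25 ≈ 1.118. A minimal sketch with simulated stand-in data, not the prostate study observations:

```python
import numpy as np

rng = np.random.default_rng(3)

gamma = 0.25
x = rng.normal(0.0, 1.0, 6033)                    # stand-in for 6033 null z-values
jiggled = x + rng.normal(0.0, np.sqrt(gamma), x.size)  # inject N(0, 0.25) noise

print(np.sqrt(1 + gamma))    # theoretical sd of the jiggled values
print(jiggled.std())         # empirical sd, close to sqrt(1.25)
```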
Of the p = 6033 cases, 478 of the face value intervals (3), 130 of the Bayes intervals (4) and 9 of the empirical Bayes intervals (5) exclude zero. Among the latter is the gene with label 914, which might temper any willingness to read too much into the fact that 9 of the empirical Bayes intervals suggest non-null effects among the 15 cases selected from the full set of 6033 genes for detailed inspection.

Discussion
Many contemporary problems of large-scale inference may, perhaps after transformation and data scaling, be expressed in terms of the many normal means model (1), with interest typically being in some subset of θ = (θ_1, ..., θ_p)^T chosen after examination of the data, such as the element of θ corresponding to the largest observed data point. The empirical Bayes approach to inference in this model provides a framework with attractive properties. If estimation is required for the whole parameter vector θ, empirical Bayes estimators incorporate shrinkage, through the indirect evidence provided by all of the elements of X = (X_1, ..., X_p)^T in estimation of each individual component of θ: the result is desirable frequentist and Bayes risk properties.

The empirical Bayes inference can be seen to be adaptive to the data-driven specification of the parameter chosen for inference, maintaining appropriate control of frequentist properties of the inference, at least under a random parameter or joint selection assumption. Under a random parameter assumption, for instance, an empirical Bayes 95% confidence set will contain the target parameter of interest on close to 95% of instances. This is not necessarily true under a fixed parameter or conditional selection model. Such frequentist properties do not, of course, relate formally to those demanded by the 'condition on selection' paradigm of selective inference, which requires a 95% confidence set to contain a specified target parameter on 95% of the instances for which that fixed target parameter is chosen by the selection mechanism. Some care is therefore required in attributing such strong frequentist control to empirical Bayes methods. If it is demanded, methods based on selection of the target parameters from randomised versions of the sample data offer a simple alternative.
A further key context where there is data-driven choice of the parameters selected for formal inference is high-dimensional regression, where inference is carried out after model selection using the same data. Formal examination of the ability of empirical Bayes methods to account for selection bias in that context is unexplored, but would add to the conclusions reached here for the many normal means problem.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.