E-values for k-Sample Tests With Exponential Families

We develop and compare e-variables for testing whether $k$ samples of data are drawn from the same distribution, the alternative being that they come from different elements of an exponential family. We consider the GRO (growth-rate optimal) e-variables for (1) a `small' null inside the same exponential family, and (2) a `large' nonparametric null, as well as (3) an e-variable arrived at by conditioning on the sum of the sufficient statistics. (2) and (3) are efficiently computable, and extend ideas from Turner et al. [2021] and Wald [1947] respectively from Bernoulli to general exponential families. We provide theoretical and simulation-based comparisons of these e-variables in terms of their logarithmic growth rate, and find that for small effects all of these e-variables behave surprisingly similarly; for the Gaussian location and Poisson families, e-variables (1) and (3) coincide; for Bernoulli, (1) and (2) coincide; but in general, whether (2) or (3) grows faster under the alternative is family-dependent. We furthermore discuss algorithms for numerically approximating (1).

1 Introduction

E-variables (and the values they take, e-values) provide an alternative to p-values that is inherently more suitable for testing under optional stopping and continuation, and that lies at the basis of anytime-valid confidence intervals that can be monitored continuously [Grünwald et al., 2023, Vovk and Wang, 2021, Shafer, 2021, Ramdas et al., 2022, Henzi and Ziegel, 2022, Grünwald, 2023]. While they have their roots in the work on anytime-valid testing by H. Robbins and his students (e.g. [Darling and Robbins, 1967]), they have begun to be investigated in detail for composite null hypotheses only very recently. E-variables can be associated with a natural notion of optimality, called GRO (growth-rate optimality), introduced and studied in detail by Grünwald et al. [2023]. GRO may be viewed as an analogue of the uniformly most powerful test in an optional stopping context. In this paper, we develop GRO and near-GRO e-variables for a classical statistical problem: parametric k-sample tests. Pioneering work in this direction appears already in Wald [1947]: as we explain in Example 1, his SPRT for a sequential test of two proportions can be re-interpreted in terms of e-values for Bernoulli streams. Wald's e-values are not optimal in the GRO sense; GRO versions were derived only very recently by Turner et al. [2021], Turner and Grünwald [2022a], but again only for Bernoulli streams. Here we develop e-variables for the case that the alternative is associated with an arbitrary but fixed exponential family M, with data in each of the k groups sequentially sampled from a different distribution in that family. We mostly consider tests against the null hypothesis, denoted by H 0 (M), that states that outcomes in all groups are i.i.d. according to a single member of M.
We develop the GRO e-variable S gro(M) for this null hypothesis, but it is not efficiently computable in general. Therefore, we introduce two more tractable e-variables: S gro(iid) and S cond . The former is defined as the GRO e-variable for the much larger null hypothesis that the k groups are i.i.d. from an arbitrary distribution, denoted by H 0 (iid): since an e-variable relative to a null hypothesis H 0 is automatically an e-variable relative to any null that is a subset of H 0 , S gro(iid) is automatically also an e-variable relative to H 0 (M). Whenever below we refer to 'the null', we mean the smaller H 0 (M). The use of S gro(iid) rather than S gro(M) for this null, for which it is not GRO, is justifiable by ease of computation and robustness against misspecification of the model M. However, exactly this robustness might also cause it to be too conservative when M is well-specified. The third e-variable we consider, S cond , does not have any GRO status, but is specifically tailored to H 0 (M), so that it might still be better than S gro(iid) in practice. Finally, we introduce a pseudo-e-variable S pseudo(M) , which coincides with S gro(M) whenever the latter is easy to compute; in other cases it is not a real e-variable, but it is still highly useful for our theoretical analysis.

Results
Besides defining S gro(M) , S gro(iid) and S cond and proving that they achieve what they purport to, we analyze their behaviour both theoretically and by simulations. Our main theoretical results, Theorems 2 and 3, reveal some surprising facts: for any exponential family, the four types of (pseudo-) e-variables achieve almost the same growth rate under the alternative, hence are almost equally good, whenever the 'distance' between null and alternative is sufficiently small. That is, suppose that the (shortest) ℓ 2 -distance between the k-dimensional parameter of the alternative and the parameter space of the null is given by δ. Then for any two of the aforementioned e-variables S, S ′ , we have E[log S − log S ′ ] = O(δ^4), where the expectation is taken under the alternative. Here, E[log S] can be interpreted as the growth rate of S, as explained in Section 1.1.
While S gro(iid) and S cond are efficiently computable for the families we consider, this is generally not the case for S gro(M) , since to compute it we need access to the reverse information projection (RIPr; [Li, 1999, Grünwald et al., 2023]) of a fixed simple alternative onto the set H 0 (M). In general, this is a convex combination of elements of H 0 (M), which can only be found by numerical means. Interestingly, we find that for three families, Gaussian with fixed variance, Bernoulli and Poisson, the RIPr is attained at a single point (i.e. a mixture putting all its mass on that point) that can be efficiently computed. Furthermore, in these cases S gro(M) coincides with one of the other e-variables (S gro(iid) for Bernoulli, S cond for Gaussian and Poisson). For other exponential families, for k = 2, we approximate the RIPr and hence S gro(M) using both an algorithm proposed by Li [1999] and a brute-force approach. We find that we can already get an extremely good approximation of the RIPr with a mixture of just two components. This leads us to conjecture that the remaining deviation from the RIPr is just due to numerical imprecision and that the actual RIPr really can be expressed with just two components. The theoretical interest of such a development notwithstanding, we advise using S cond or S gro(iid) rather than S gro(M) for practical purposes whenever more than one component is needed for the RIPr, as their growth rates are not much worse, and they are much easier to compute. If, furthermore, robustness against misspecification of the null is required, then S gro(iid) is the most sensible choice.
Method: Restriction to Single Blocks and Simple Alternatives

The main interest of e-variables is in analyzing sequential, anytime-valid settings: the data arrives in k streams corresponding to k groups, and we may want to stop or continue sampling at will (optional stopping); for example, we only stop when the data looks sufficiently good, or we stop unexpectedly because we run out of money to collect new data. Nevertheless, in this paper we focus on what happens in a single block, i.e. a vector X k = (X 1 , . . ., X k ), where each X j denotes a single outcome in the j-th stream. By now, there are a variety of papers (see e.g. Grünwald et al. [2023], Ramdas et al. [2022], Turner et al. [2021]) that explain how e-variables defined for such a single block can be combined by multiplication to yield e-processes (in our context, coinciding with nonnegative supermartingales) that can be used for testing the null with optional stopping if blocks arrive sequentially, that is, if one observes one outcome of each sample at a time. Briefly, one multiplies the e-variables and, at any time one intends to stop, one rejects the null if the product of e-values observed so far exceeds 1/α for a pre-specified significance level α. This gives an anytime-valid test at level α: irrespective of the stopping rule employed, the Type-I error is guaranteed to be below α. Similarly, one can design anytime-valid confidence intervals by inverting such tests, as described in detail by Ramdas et al. [2022]. This is done for the 2-sample test with Bernoulli data by Turner and Grünwald [2022a]; their inversion methods are extendable to the general exponential family case we discuss here. Thus, we refer to the aforementioned papers for further details and restrict ourselves here to the 1-block case. Also, Turner et al.
[2021], Turner and Grünwald [2022b] describe how one can adapt an e-process for data arriving in blocks to general streams in which the k streams do not produce data points at the same rate; we briefly extend their explanation to the present setting in Appendix A. Finally, we mainly restrict attention to the case of a simple alternative, i.e. a single member of the exponential family under consideration. While this may seem like a huge restriction, the extension from simple to composite alternatives (e.g. the full family under consideration) is straightforward using the method of mixtures (i.e. Bayesian learning of the alternative over time) and/or the plug-in method. We again refer to Grünwald et al. [2023], Ramdas et al. [2022] for detailed explanations, and to Turner et al. [2021] for an explanation in the 2-sample Bernoulli case, and restrict here to the simple alternative case: all the 'real' difficulty lies in dealing with composite null hypotheses, and that we do explicitly and exhaustively in this paper.
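As a concrete illustration of the multiply-and-threshold scheme described above, the following sketch runs an anytime-valid test at level α. The e-variable used here is an assumption of this sketch: a likelihood ratio for a simple N(1, 1) alternative against a simple N(0, 1) null, chosen only because its e-variable property is immediate; the paper's e-variables for composite nulls would be plugged into `anytime_test` in exactly the same way.

```python
import math

def e_value(x):
    # p1(x) / p0(x) = exp(x - 1/2) for N(1,1) versus N(0,1);
    # E_{N(0,1)}[e_value(X)] = 1, so this is an e-variable.
    return math.exp(x - 0.5)

def anytime_test(stream, alpha=0.05):
    """Multiply per-block e-values; reject as soon as the running product
    exceeds 1/alpha.  Returns the 1-based stopping time, or None if the
    null is never rejected on this stream."""
    prod = 1.0
    for n, x in enumerate(stream, start=1):
        prod *= e_value(x)
        if prod >= 1.0 / alpha:
            return n  # anytime-valid rejection at level alpha
    return None
```

For instance, on the constant stream 1.0, 1.0, ... (strong evidence for the alternative) the product e^{0.5n} first exceeds 1/α = 20 at n = 6, while on a stream of zeros the product shrinks and the test never rejects; the Type-I error guarantee holds regardless of the rule used to decide when to look.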

Related Work and Practical Relevance
As indicated, this paper is a direct (but far-reaching) extension of the papers Turner et al. [2021], Turner and Grünwald [2022a] on 2-sample testing for Bernoulli streams, as well as of Wald's [1947] sequential two-sample test for proportions, to streams coming from an exponential family. There are also nonparametric sequential [Lhéritier and Cazals, 2018] and anytime-valid 2-sample tests [Balsubramani and Ramdas, 2016, Pandeva et al., 2022] that tackle a somewhat different problem. They work under much weaker assumptions on the alternative (in some versions the samples could be arbitrary high-dimensional objects such as pictures and the like). The price to pay is that they need a much larger sample size before a difference can be detected. Indeed, while our main interest is theoretical (how do different e-variables compare? in what sense are they optimal?), in settings where data are expensive, such as randomized clinical trials, the methods we describe here can be practically very useful: they are exact (existing methods are often based on chi-squared tests, which do not give exact Type-I error guarantees at small sample sizes), they allow for optional stopping, and they need small amounts of data due to the strong parametric assumptions on the alternative. As a simple illustration of the practical importance of these properties, we refer to the recent SWEPIS study [Wennerholm et al., 2019], which was stopped early for harm. As demonstrated by Turner et al. [2021], if an anytime-valid two-sample test had been used in that study, substantially stronger conclusions could have been drawn.
We also mention that k-sample tests can be viewed as independence tests (is the outcome independent of the group it belongs to?) and as such this paper is also related to recent papers on e-values and anytime-valid tests for conditional independence testing [Grünwald et al., 2022, Shaer et al., 2022, Duan et al., 2022]. Yet, the setting studied in those papers is quite different in that they assume the covariates (i.e. the indicator of which of the k groups the data belongs to) to be i.i.d.

Contents
In the remainder of this introduction, we fix the general framework and notation and briefly recall how e-variables are used in an anytime-valid/optional stopping setting. In Section 2 we describe our four (pseudo-) e-variables in detail, and we provide preliminary results that characterize their behaviour in terms of growth rate. In Section 3 we provide our main theoretical results, which show that, for all regular exponential families, the differences in expected growth between the four types of e-variables are of surprisingly small order δ^4 if the parameter of the alternative is at ℓ 2 -distance δ from the parameter space of the null. In Section 4 we give more detailed comparisons for a large number of standard exponential families (Gaussian, Bernoulli, Poisson, exponential, geometric, beta), including simulations that show what happens if δ gets larger. Section 5 provides some additional simulations concerning the RIPr. All proofs, and some additional simulations, are in the appendix.

Formal Setting
Consider a regular one-dimensional exponential family M = {P µ : µ ∈ M} given in its mean-value parameterization (see e.g. [Barndorff-Nielsen, 1978] for definitions and for proofs of all standard results about exponential families used below). Each member of the family is a distribution for some random variable U, taking values in some set U, with density p µ;[U] relative to some underlying measure ρ [U] which, without loss of generality, can be taken to be a probability measure. For regular exponential families, M is an open interval in R and p µ;[U] can be written as

$$p_{\mu;[U]}(u) = e^{\lambda(\mu)\, t(u) - A(\lambda(\mu))}, \qquad (1.1)$$

where λ(µ) maps the mean-value µ to the canonical parameter β, t(U) is a measurable function of U (the sufficient statistic) and A(β) is the log-normalizing factor. We then have µ = E Pµ [t(U)]. The measure ρ [U] induces a corresponding (marginal) measure ρ := ρ [X] on the sufficient statistic X := t(U), and similarly the density (1.1) induces a corresponding density p µ := p µ;[X] on X, i.e. we have

$$p_\mu(x) = e^{\lambda(\mu)\, x - A(\lambda(\mu))}. \qquad (1.2)$$

All e-variables that we will define can be written in terms of the induced measure and density of the sufficient statistic X; in other words, we can without loss of generality act as if our family is natural. Therefore, from now on we simply assume that we observe data in terms of the sufficient statistics X rather than the potentially more fine-grained U, and will be silent about U; for simplicity we thus abbreviate p µ;[X] to p µ and ρ [X] to ρ. Note that exponential families are more usually defined with a carrier function h(X) and ρ set to Lebesgue or counting measure; we cover this case by absorbing h into ρ, which we do not require to be Lebesgue or counting. The data comes in as a block X k = (X 1 , . . ., X k ) ∈ X k , where X is the support of ρ. To calculate our e-values we only need to know X k ∈ X k , and under the alternative hypothesis, each X j , j = 1, . . ., k, is distributed according to some element P µ j of M. In our main results we take the alternative hypothesis to be simple, i.e.
we assume that µ = (µ 1 , . . ., µ k ) ∈ M k is fixed in advance. The alternative hypothesis is thus given by

simple H 1 : X 1 ∼ P µ 1 , X 2 ∼ P µ 2 , . . ., X k ∼ P µ k , independently.

Note that we will keep µ fixed throughout the rest of this section and Section 2. This is without loss of generality, as µ is an arbitrary element of M k , so that all results stated for µ hold for any element of M k . The extension to composite alternatives by means of the method of mixtures or the plug-in method is straightforward, and done in a manner that has become standard for e-value based testing [Ramdas et al., 2022].
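To make the parameterization above concrete, here is a small sketch (our own illustration, not from the paper) instantiating the natural form (1.1) for the Bernoulli family, where t(u) = u, the canonical parameter λ(µ) is the log-odds, and A is the log-normalizer.

```python
import math

def lam(mu):
    # mean-value parameter -> canonical parameter (log-odds)
    return math.log(mu / (1.0 - mu))

def A(beta):
    # log-normalizing factor: log(1 + e^beta)
    return math.log(1.0 + math.exp(beta))

def p_natural(x, mu):
    # density in natural form: exp(lambda(mu) * x - A(lambda(mu)))
    return math.exp(lam(mu) * x - A(lam(mu)))

def p_bernoulli(x, mu):
    # the usual Bernoulli pmf, for comparison
    return mu ** x * (1.0 - mu) ** (1 - x)
```

One checks that `p_natural` agrees with the usual pmf and that the mean-value property µ = E[t(U)] holds: summing x · p_natural(x, µ) over x ∈ {0, 1} returns µ.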
Our null hypothesis is directly taken to be composite, for as regards the null, the composite case is inherently very different from the simple case [Ramdas et al., 2022, Grünwald et al., 2023]. It expresses that the outcomes X 1 , . . ., X k are identically distributed. We shall consider various variants of this null hypothesis, all composite: let P be a set of distributions on X ; then the null hypothesis relative to P, denoted H 0 (P), is defined as

composite H 0 (P) : X 1 ∼ P, X 2 ∼ P, . . ., X k ∼ P i.i.d. for some P ∈ P.
Our most important instantiation of the null hypothesis will be H 0 = H 0 (M) for the same exponential family M from which the alternative was taken; then H 0 (M) is a one-dimensional parametric family expressing that the X i are i.i.d. from P µ 0 for some µ 0 ∈ M. Still, we will also consider H 0 = H 0 (P) where P is the much larger set of all distributions on X . Then the null simply expresses that the outcomes in X k are i.i.d.; we shall abbreviate this null to H 0 (iid). Finally, we sometimes consider H 0 = H 0 (M ′ ), where M ′ = {P µ ∈ M : µ ∈ M ′ } for some sub-interval M ′ ⊂ M. The statistics that we use to gain evidence against these null hypotheses are e-variables.
Definition 1. We call any nonnegative random variable S on a sample space Ω (which in this paper will always be Ω = X k ) an e-variable relative to H 0 if it satisfies, for all P ∈ H 0 :

$$\mathbf{E}_P[S] \leq 1. \qquad (1.3)$$

The GRO E-variable for General H 0

In general, there exist many e-variables for testing any of the null hypotheses introduced above. Each e-variable S can in turn be associated with a growth rate, defined by E Pµ [log S].
Roughly, this can be interpreted as the (asymptotic) exponential growth rate one would achieve by using S in consecutive independent experiments and multiplying the outcomes, if the (simple) alternative were true (see e.g. [Grünwald et al., 2023, Section 2.1] or [Kelly, 1956]). The Growth Rate Optimal (GRO) e-variable is then the e-variable with the greatest growth rate among all e-variables. The central result (Theorem 1) of Grünwald et al. [2023] states that, under very weak conditions, GRO e-variables take the form of likelihood ratios between the alternative and the reverse information projection [Li, 1999] of the alternative onto the null. We instantiate their Theorem 1 to our setting by providing Lemmas 1 and 2, both special cases of their Theorem 1. Before stating these, we need to introduce some more notation and definitions. For µ = (µ 1 , . . ., µ k ) we write P µ for the product distribution P µ 1 × · · · × P µ k on X k , with density p µ (x k ) = ∏_{j=1}^k p µ j (x j ). Whenever in this text we refer to a KL divergence D(Q∥R), we refer to measures Q and R on X k . Here Q is required to be a probability measure, while R is allowed to be a sub-probability measure, as in [Grünwald et al., 2023]. A sub-probability measure R on X k is a measure that integrates to 1 or less, i.e. ∫_{X^k} dR(x) ≤ 1.
The following lemma follows as a very special case of Theorem 1 (simplest version) of Grünwald et al. [2023], when instantiated to our k-sample testing set-up:

Lemma 1. Let P be a set of probability distributions on X k and let conv(P) be its convex hull. Then there exists a sub-probability measure P * 0 with density p * 0 such that

$$D(P_\mu \| P^*_0) = \inf_{P \in \mathrm{conv}(\mathcal{P})} D(P_\mu \| P). \qquad (1.4)$$

P * 0 is called the reverse information projection (RIPr) of P µ onto conv(P).
Clearly, if P * 0 ∈ conv(P) (the minimum is achieved), then P * 0 is a probability measure, i.e. it integrates to exactly one. We show that this happens for certain specific exponential families in Section 4. However, in general we can neither expect the minimum to be achieved, nor the RIPr to integrate to one. Lemma 2 below, again a special case of [Grünwald et al., 2023, Theorem 1], shows that the RIPr characterizes the GRO e-variable, and explains the use of the term gro in the definition below.
Definition 2. S gro(P) is defined as

$$S_{\mathrm{gro}(\mathcal{P})} := \frac{p_\mu(X^k)}{p^*_0(X^k)},$$

where p * 0 is the density of the RIPr of P µ onto conv(P).

Lemma 2. For every set of distributions P on X , S gro(P) is an e-variable for H 0 (P). Moreover, it is the GRO (Growth-Rate-Optimal) e-variable for H 0 (P), i.e. it essentially uniquely achieves

$$\sup_S \mathbf{E}_{P_\mu}[\log S],$$

where the supremum ranges over all e-variables for H 0 (P).
Here, essential uniqueness means that any other GRO e-variable must be equal to S gro(P) with probability 1 under P µ . This in turn implies that the measure P * 0 is in fact unique, as members of regular exponential families have full support. Thus, once we have fixed our alternative and defined our null as H 0 (P) for some set of distributions P on X , the optimal (in the GRO sense) e-variable to use is S gro(P) as defined above.

2 The Four Types of E-variables
In this section, we define our four types of e-variables; the definitions can be instantiated to any underlying 1-parameter exponential family.More precisely, we define three 'real' e-variables S gro(M) , S cond , S gro(iid) and one 'pseudo-e-variable' S pseudo(M) , a variation of S gro(M) which for some exponential families is an e-variable, and for others is not.

2.1 The GRO E-variable for H 0 (M) and the Pseudo E-variable

We now consider the GRO e-variable for our main null of interest, H 0 (M). In practice, for some exponential families M, the infimum over conv(M) in (1.4) is actually achieved for some P ⟨µ * 0 ⟩ with µ * 0 ∈ M. In this easy case we can determine S gro(M) analytically (this happens iff S gro(M) = S pseudo(M) , see below). For all other M, i.e. whenever the infimum is not achieved at all or is achieved in conv(M) \ M, we do not know if S gro(M) can be determined analytically. In this hard case we will numerically approximate it by S ′ gro(M) as defined below. First, for a fixed parameter µ 0 ∈ M we define the vector ⟨µ 0 ⟩ indicating the distribution on X k with all parameters equal to µ 0 :

$$\langle \mu_0 \rangle := (\mu_0, \ldots, \mu_0). \qquad (2.1)$$

Next, with W a distribution on M, we define

$$p_W(x^k) := \int p_{\langle \mu_0 \rangle}(x^k) \, dW(\mu_0) \qquad (2.2)$$

to be the Bayesian marginal density obtained by marginalizing over distributions in H 0 (M) according to W. Clearly, if W has finite support then the corresponding distribution P W satisfies P W ∈ conv(M). Further, we let

$$\mu^*_0 := \arg\min_{\mu_0 \in M} D(P_\mu \| P_{\langle \mu_0 \rangle}) \qquad (2.3)$$

be the parameter of the element of H 0 (M) closest to the alternative in KL divergence, and define the pseudo-e-variable

$$S_{\mathrm{pseudo}(M)} := \frac{p_\mu(X^k)}{p_{\langle \mu^*_0 \rangle}(X^k)}.$$

We now set

$$S'_{\mathrm{gro}(M)} := \frac{p_\mu(X^k)}{p_{W'_0}(X^k)},$$

where W ′ 0 is chosen so that p W ′ 0 is within a small ϵ ′ of achieving the minimum in (1.4), i.e.

$$D(P_\mu \| P_{W'_0}) \leq \inf_{P \in \mathrm{conv}(\mathcal{M})} D(P_\mu \| P) + \epsilon'.$$
Then, by Corollary 2 of Grünwald et al. [2023], S ′ gro(M) will not be an e-variable unless ϵ ′ = 0, but in each case (i.e. for each choice of M) we verify numerically that sup µ 0 ∈M E P ⟨µ 0 ⟩ [S ′ gro(M) ] = 1 + δ for negligibly small δ, i.e. δ goes to 0 quickly as ϵ ′ goes to 0. We return to the details of the calculations in Section 5.
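To give a flavour of the brute-force approach to the RIPr mentioned in the introduction, the sketch below approximates the RIPr of a simple alternative onto conv(M) by a two-component mixture for a geometric family. Everything concrete here (the family, the truncation of the support at N, the renormalization, the parameter grids, and the alternative (1.0, 3.0)) is an assumption of this toy illustration, not the paper's actual procedure.

```python
import numpy as np

N = 100  # truncate the support to {0, ..., N}; the discarded tail is tiny

def geom_pmf(mu):
    # Geometric pmf in mean parameterization, truncated and renormalized.
    p = (1.0 / (1.0 + mu)) * (mu / (1.0 + mu)) ** np.arange(N + 1)
    return p / p.sum()

mu1, mu2 = 1.0, 3.0                               # simple alternative, k = 2
P = np.outer(geom_pmf(mu1), geom_pmf(mu2))        # alternative density on X^2

grid = np.linspace(0.5, 4.0, 11)                  # candidate null means mu0
prods = [np.outer(geom_pmf(m), geom_pmf(m)) for m in grid]  # null products

def kl(Q):
    # D(P_mu || Q) on the discretized sample space
    return float(np.sum(P * np.log(P / Q)))

# Best single null element versus best two-component mixture of null elements.
best_single = min(kl(Q) for Q in prods)
best_mix = best_single                            # the single-point case is included
for Qa in prods:
    for Qb in prods:
        for w in np.linspace(0.1, 0.9, 9):
            best_mix = min(best_mix, kl(w * Qa + (1.0 - w) * Qb))
```

By construction `best_mix <= best_single`; the size of the gap between the two indicates how far the RIPr is from being attained at a single element of the null.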
S pseudo(M) is not always a real e-variable relative to H 0 (M), which explains the name 'pseudo'. Still, it will be very useful as an auxiliary tool in Theorem 2 and various derivations. Note that, if it is an e-variable, then we know that it is equal to S gro(M) :

Proposition 1. S pseudo(M) is an e-variable for H 0 (M) iff S pseudo(M) = S gro(M) .
The proposition above does not give any easily verifiable condition to check whether S pseudo(M) is an e-variable or not. The following proposition does provide a condition which is sometimes easy to check (and which we will heavily employ below), stated in terms of µ * 0 as in (2.3).

2.2 The GRO E-variable for H 0 (iid)

Recall that we defined H 0 (iid) as the set of distributions under which X j , j = 1, . . ., k, are i.i.d. from some arbitrary distribution on X . By the defining property of e-variables, i.e. expected value bounded by one under the null (1.3), it should be clear that any e-variable for H 0 (iid) is also an e-variable for H 0 (M), since H 0 (M) ⊂ H 0 (iid). In particular, we can also use the GRO e-variable for H 0 (iid) in our setting with exponential families. It turns out that this e-variable, which we will denote as S gro(iid) , has a simple form that is generically easy to compute. We now show this:

Theorem 1. The minimum KL divergence inf P ∈conv(H 0 (iid)) D(P µ ∥P) as in Lemma 1 is achieved by the distribution P * 0 on X k with density

$$p^*_0(x^k) = \prod_{j=1}^k \bar{p}(x_j), \quad \text{where } \bar{p}(x) := \frac{1}{k} \sum_{i=1}^k p_{\mu_i}(x).$$

Hence S gro(iid) , defined as

$$S_{\mathrm{gro(iid)}} := \frac{\prod_{j=1}^k p_{\mu_j}(X_j)}{\prod_{j=1}^k \bar{p}(X_j)},$$

is the GRO e-variable for H 0 (iid).
The proof of Theorem 1 extends an argument of Turner et al. [2021] for the 2-sample Bernoulli case to the general k-sample case. The argument used in the proof does not actually require the alternative to equal the product distribution of k independent elements of an exponential family; it could be given by the product of k arbitrary distributions. However, we state the result only for the former case, as that is the setting we are interested in here.
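A minimal sketch (our own illustration; the alternative (0.3, 0.7) is an arbitrary choice) of S gro(iid) in the product-of-mixtures form of Theorem 1 for the 2-sample Bernoulli case, together with a numerical check of the e-variable property E P [S] ≤ 1 under every null element on a grid:

```python
from itertools import product

mu1, mu2 = 0.3, 0.7  # simple alternative

def bern(x, mu):
    # Bernoulli pmf
    return mu if x == 1 else 1.0 - mu

def pbar(x):
    # mixture density (1/k) * sum_i p_{mu_i}(x), here with k = 2
    return 0.5 * (bern(x, mu1) + bern(x, mu2))

def S_gro_iid(x1, x2):
    # likelihood ratio of the alternative against the product of mixtures
    return (bern(x1, mu1) * bern(x2, mu2)) / (pbar(x1) * pbar(x2))

def expected_S(mu0):
    # E[S_gro_iid] under the null element with X1, X2 i.i.d. Bern(mu0)
    return sum(bern(x1, mu0) * bern(x2, mu0) * S_gro_iid(x1, x2)
               for x1, x2 in product([0, 1], repeat=2))
```

Note that for Bernoulli, `pbar` is itself the Bernoulli((µ 1 + µ 2 )/2) pmf, so here S gro(iid) coincides with a likelihood ratio against a single null point, in line with the finding below that S pseudo(M) = S gro(M) = S gro(iid) for this family.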

2.3 The Conditional E-variable S cond
So far, we have defined e-variables as likelihood ratios between P µ and cleverly chosen elements of either H 0 (M) or H 0 (iid). We now do things differently: instead of considering the full original data X 1 , . . ., X k , we condition on the sum of the sufficient statistics, i.e. Z = ∑_{i=1}^k X i . It turns out that doing so actually collapses H 0 (M) to a single distribution, so that the null becomes simple. That is, the distribution of X k | Z is the same under all elements of H 0 (M), as we will prove in Proposition 3. This means that instead of using a likelihood ratio of the original data, we can use a likelihood ratio conditional on Z, which 'automatically' gives an e-variable.
Definition 5. Setting Z to be the random variable Z := ∑_{i=1}^k X i , S cond is defined as

$$S_{\mathrm{cond}} := \frac{p_\mu(X^k \mid Z)}{p_{\langle \mu_0 \rangle}(X^k \mid Z)},$$

with µ 0 ∈ M and X the sufficient statistic as in (1.2).
The conditional density can be written as a function of (λ 1 , . . ., λ k−1 ) alone. As a special case, for all µ 0 , µ ′ 0 ∈ M, it holds that p ⟨µ 0 ⟩ (x k | z) = p ⟨µ ′ 0 ⟩ (x k | z), so that S cond does not depend on the choice of µ 0 . For the Bernoulli family, conditioning on Z discards part of the evidence in favour of the alternative: data with the same outcome in both groups is effectively ignored. A non-sequential version of S cond for the Bernoulli model was analyzed earlier in great detail by Adams [2020].
Furthermore, for any c ∈ R, the set

$$\{(\mu_1, \mu_2) \in M^2 : \log(\mu_1(1 - \mu_2)/((1 - \mu_1)\mu_2)) = c\}$$

is the line of distributions within M 2 with the same log odds ratio c. The sequential probability ratio test of two proportions from Wald [1947] was based on fixing a c for the alternative (viewing it as a notion of 'effect size') and analyzing sequences of paired data X (1) , X (2) , . . . with X (i) = (X i,1 , X i,2 ) by the product of conditional probabilities

$$\prod_i p_c(x^{(i)} \mid z_i),$$

thus effectively using S cond (here, we abuse notation slightly, writing p c (x | z) when we mean p µ (x | z) for any µ with log odds ratio c). It is, however, important to note that this product was not used for an anytime-valid test but rather for a classical sequential test with a fixed stopping rule especially designed to optimize power.
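The 2-sample Bernoulli case can serve as a small worked sketch of S cond (our own illustration; the alternative (0.3, 0.7) is an arbitrary choice): conditionally on Z = X 1 + X 2 , the null collapses to a single distribution, and same-outcome blocks give S cond = 1, reflecting the remark above that such data is effectively ignored.

```python
from itertools import product

mu1, mu2 = 0.3, 0.7  # simple alternative

def bern(x, mu):
    # Bernoulli pmf
    return mu if x == 1 else 1.0 - mu

def cond_density(x1, x2, m1, m2):
    # p(x1, x2 | z) when X1 ~ Bern(m1), X2 ~ Bern(m2) independently
    z = x1 + x2
    num = bern(x1, m1) * bern(x2, m2)
    den = sum(bern(y1, m1) * bern(y2, m2)
              for y1, y2 in product([0, 1], repeat=2) if y1 + y2 == z)
    return num / den

def cond_null(x1, x2, mu0=0.5):
    # p_<mu0>(x1, x2 | z); its value does not depend on mu0
    return cond_density(x1, x2, mu0, mu0)

def S_cond(x1, x2):
    # conditional likelihood ratio: alternative over (collapsed) null
    return cond_density(x1, x2, mu1, mu2) / cond_null(x1, x2)
```

One can check directly that S_cond(0, 0) = S_cond(1, 1) = 1, that the conditional null density is the same for every µ 0 , and that E[S cond ] = 1 under every null element, so S cond is indeed an e-variable for H 0 (M).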

3 Growth Rate Comparison of Our E-variables
Above we provided several recipes for constructing e-variables S = S µ whose definition implicitly depends on the chosen alternative µ. To compare these, we define, for any nonnegative random variables S µ and S ′ µ , the relation S µ ⪰ S ′ µ to mean that

$$\mathbf{E}_{P_\mu}[\log S_\mu] \geq \mathbf{E}_{P_\mu}[\log S'_\mu] \text{ for all } \mu \in M^k,$$

and there exists µ ∈ M k for which equality does not hold. From now on we suppress the dependency on µ again, i.e. we write S instead of S µ . We trivially have, for every underlying exponential family M,

$$S_{\mathrm{pseudo}(M)} \succeq S_{\mathrm{gro}(M)} \succeq S_{\mathrm{gro(iid)}} \quad \text{and} \quad S_{\mathrm{gro}(M)} \succeq S_{\mathrm{cond}}. \qquad (3.1)$$

We proceed with Theorems 2 and 3 below (proofs in the Appendix). These results go beyond the qualitative assessment above by quantitatively bounding the difference in growth rate between S pseudo(M) and S gro(iid) (and, because S gro(M) must lie in between them, also between these two and S gro(M) ), and between S pseudo(M) and S cond , respectively. Theorems 2 and 3 are asymptotic (in terms of the difference between mean-value parameters) in nature. To give more precise statements rather than asymptotics we need to distinguish between individual exponential families; this is done in the next section.
To state the theorems, we need a notion of effect size, or discrepancy, between the null and the alternative. So far, we have taken the alternative to be fixed and given by µ, but effect sizes are usually defined with the null hypothesis as starting point. To this end, note that each P ⟨µ 0 ⟩ ∈ H 0 (M) corresponds to a whole set of alternatives for which P ⟨µ 0 ⟩ is the closest point, in KL divergence, within the null; we denote this set of alternatives by M (k) (µ 0 ). We can parameterize this set as follows, using the special notation ⟨µ 0 ⟩ as given by (2.1). Let A be the set of unit vectors in R k whose entries sum to 0, i.e. α ∈ A iff ∑_{j=1}^k α_j^2 = 1 and ∑_{j=1}^k α_j = 0. Clearly µ ∈ M (k) (µ 0 ) if and only if µ 1 , . . ., µ k ∈ M and µ = ⟨µ 0 ⟩ + δα for some scalar δ ≥ 0 and α ∈ A. We can think of δ as expressing the magnitude of an effect and α as its direction. Note that, if k = 2, then there are only two directions, α 1 = (1/√2, −1/√2) and α −1 = −α 1 , corresponding to positive and negative effects, as illustrated later on in Figure 1. Also note that, for general k, in the theorem below, we can simply interpret δ as the Euclidean distance between µ and ⟨µ 0 ⟩.

Theorem 2.
Fix some µ 0 ∈ M, some α ∈ A and let µ = ⟨µ 0 ⟩ + δα for δ ≥ 0 such that µ ∈ M (k) (µ 0 ). The difference in growth rate between S pseudo(M) and S gro(iid) satisfies

$$\mathbf{E}_{P_\mu}\bigl[\log S_{\mathrm{pseudo}(M)} - \log S_{\mathrm{gro(iid)}}\bigr] = O(\delta^4),$$

where the leading term can be expressed in terms of f x (δ) := ∑_{j=1}^k p µ 0 +δα j (x). As is implicit in the O(·)-notation, the expectation on the left is well-defined and finite. The theorem implies that for general exponential families, S gro(iid) is surprisingly close (O(δ^4)) to the optimal S gro(M) in the GRO sense, whenever the distance δ between H 1 and H 0 (M) is small. This means that, whenever S gro(M) ̸= S pseudo(M) (so that S gro(M) is hard to compute and S pseudo(M) is not an e-variable), we might consider using S gro(iid) instead: it will be more robust (since it is an e-variable for the much larger hypothesis H 0 (iid)) and it will only be slightly worse in terms of growth rate.
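The O(δ^4) behaviour can be checked numerically. The sketch below (our own illustration; the quadrature range and step are ad hoc assumptions) computes the growth-rate gap between S pseudo(M) and S gro(iid) for the Gaussian location family with k = 2 and unit variance, where µ * 0 = 0 and the gap reduces to E[∑ j log(p̄(X j )/p µ * 0 (X j ))]; halving δ should shrink the gap by roughly a factor 2^4 = 16.

```python
import math

def phi(x, m):
    # N(m, 1) density
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

def growth_gap(delta):
    # Alternative mu = <0> + delta * (1/sqrt(2), -1/sqrt(2)), so mu0* = 0.
    d = delta / math.sqrt(2.0)
    mu1, mu2 = d, -d

    def pbar(x):
        # mixture density (1/2)(p_mu1(x) + p_mu2(x))
        return 0.5 * (phi(x, mu1) + phi(x, mu2))

    # log S_pseudo - log S_gro(iid) = sum_j log(pbar(X_j) / phi(X_j, 0));
    # take expectations under X_j ~ N(mu_j, 1) by Riemann quadrature on [-8, 8].
    n = 3200
    h = 16.0 / n
    grid = [-8.0 + i * h for i in range(n + 1)]
    return sum(h * phi(x, m) * math.log(pbar(x) / phi(x, 0.0))
               for m in (mu1, mu2) for x in grid)
```

With these settings, the ratio growth_gap(0.2) / growth_gap(0.1) comes out close to 16, consistent with a leading δ^4 term.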
Theorem 2 is remarkably similar to the next theorem, which involves S cond rather than S gro(iid) . To state it, we first set X k (x k−1 , z) := z − ∑_{i=1}^{k−1} x i , and we denote the marginal distribution of Z = ∑_{i=1}^k X i under P µ as P µ;[Z] , with density p µ;[Z] relative to the measure ρ [Z] on Z induced by the product measure of ρ on X k .

Theorem 3. Fix some µ 0 ∈ M, some α ∈ A and let µ = ⟨µ 0 ⟩ + δα for δ ≥ 0 such that µ ∈ M (k) (µ 0 ). The difference in growth rate between S pseudo(M) and S cond satisfies

$$\mathbf{E}_{P_\mu}\bigl[\log S_{\mathrm{pseudo}(M)} - \log S_{\mathrm{cond}}\bigr] = O(\delta^4),$$

where the leading term can be expressed in terms of g z (δ) := p ⟨µ 0 ⟩+αδ;[Z] (z); an explicit expression for g ′′ z (0) can be given in terms of the Fisher information I(µ) and its first derivative I ′ (µ).
Again, the expectation on the left is well-defined and finite. Comparing Theorem 3 to Theorem 2, we see that f x (0), the sum of k identical densities evaluated at x, is replaced by g z (0), the density of the sum of k i.i.d. random variables evaluated at z.

Corollary 1. With the definitions as in the two theorems above, the growth-rate difference E Pµ [log S cond − log S gro(iid) ] is likewise of order O(δ^4).

4 Growth Rate Comparison for Specific Exponential Families

We will now establish more precise relations between the four (pseudo-) e-variables in k-sample tests for several standard exponential families, namely those listed in Table 1 and a few related ones, as listed at the end of this section. For each family M under consideration, we prove which of the e-variables coincide, i.e. S = S ′ with S, S ′ ∈ {S gro(M) , S cond , S gro(iid) , S pseudo(M) }. Whenever we can prove that S gro(M) ̸= S for another e-variable S ∈ {S cond , S gro(iid) }, we can infer that S gro(M) ≻ S, because S gro(M) is the GRO e-variable for H 0 (M). Whenever both S cond and S gro(iid) differ from S gro(M) , we investigate via simulation whether S gro(iid) ≻ S cond or vice versa; our theoretical results do not extend to this case. All simulations in the paper are carried out for the case k = 2.
Theorem 2 and Theorem 3 show that in the neighborhood of δ = 0 (µ 1 , . . ., µ k all close together), the difference E Pµ [log S − log S ′ ] is of order δ^4 when S, S ′ ∈ {S gro(M) , S pseudo(M) , S gro(iid) , S cond }. Hence in the figures we show (E Pµ [log S − log S ′ ])^{1/4} , since then we expect the distances to increase linearly as we move away from the diagonal, making the figures more informative. Our findings, proofs as well as simulations, are summarised in Table 1. For each exponential family, we list the rank of the (pseudo-) e-variables when compared with the order '≻'. The ranks written in black are proven in Appendix D, while the ranks in blue are merely conjectures based on our simulations, as stated above. The results of the simulations on which these conjectures are based are given in Figure 1. Furthermore, the rank of S pseudo(M) is colored red whenever it is not an e-variable for that model, as shown in the Appendix. Note that whenever any of the e-variables have the same rank, they must be equal ρ-almost everywhere, by strict concavity of the logarithm together with full support of the distributions in the exponential family. For example, the results in the table reflect that for the Bernoulli family, we have shown that S pseudo(M) = S gro(M) = S gro(iid) and that S pseudo(M) ≻ S cond . Also, for the geometric family and beta with free β and fixed α, we have proved that S pseudo(M) is not an e-variable, that S gro(M) ̸= S gro(iid) and that S gro(M) ̸= S cond , so that it follows from (3.1) that S pseudo(M) ≻ S gro(M) , S gro(M) ≻ S gro(iid) and S gro(M) ≻ S cond . The findings of the simulations shown in Figure 1a then suggest that S gro(iid) ≻ S cond for beta with free β and fixed α, and those in Figure 1b suggest that S cond ≻ S gro(iid) for the geometric family, but these orderings are not proven. Figure 1c shows that S gro(iid) ≻ S cond for Gaussians with free variance and fixed mean. Finally, Figure 1d shows that for the exponential family, there is no clear
relation between S gro(iid) and S cond . That is, S gro(iid) grows faster than S cond for some µ 1 , . . ., µ k ∈ M, and slower for others, which is indicated by rank (3)-(4) in the table.

Table 1: The ranks of the four different e-variables when compared with the relation '≻'. The ranks in black are proved in Appendix D, while the ranks in blue are conjectures based on the simulations in Figure 1. The rank of S pseudo(M) is denoted in red whenever it is not an e-variable, as shown in Appendix D.

Figure 1: A comparison of S gro(iid) and S cond for four exponential families. We evaluated the expected growth difference on a grid of 50 × 50 alternatives (µ 1 , µ 2 ), equally spaced in the standard parameterization (explaining the nonlinear scaling on the depicted mean-value parameterization). On the left are the corresponding heatmaps. On the right are diagonal 'slices' of these heatmaps: the red curve corresponds to the main diagonal (top left to bottom right), the blue curve corresponds to the diagonal starting from the second tick mark (10th discretization point) top left until the second tick mark bottom right. These slices are symmetric around 0, their value only depending on , where µ * 0 = (µ 1 + µ 2 )/2 and δ is as in Theorem 2.

Finally, we note that for each family listed in the table, the results must extend to any other family that becomes identical to it if we reduce it to the natural form (1.2). For example, the family of Pareto distributions with fixed minimum parameter v can be reduced to that of the exponential distributions: if U ∼ Pareto(v, α), then we can apply the transformation X = t(U ) with t(U ) = log(U/v), and then X ∼ Exp(α). Thus, the k-sample problem for U with the Pareto(v, α) distributions, with α as free parameter, is equivalent to the k-sample problem for X with the exponential distributions; the e-value S gro(M) obtained with a particular alternative in the Pareto setting for observation U coincides with S gro(M) for the corresponding alternative in the exponential
setting for observation X = t(U ), and the same holds for S gro(iid) and S cond . Therefore, the ordering for Pareto must be the same as the ordering for the exponential family in Table 1. Similarly, the e-variables for the log-normal distributions (with free mean or variance) can be reduced to the two corresponding normal-distribution e-variables.
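The Pareto-to-exponential reduction is easy to check empirically. The following sketch (with illustrative parameters v = 2, α = 1.5 chosen by us) samples from Pareto(v, α) by inverse-CDF sampling and verifies that X = log(U/v) has the Exp(α) mean 1/α.

```python
import math
import random

def sample_pareto(v, alpha, rng):
    # inverse-CDF sampling: if W ~ Uniform(0, 1], then v * W**(-1/alpha)
    # has the Pareto(v, alpha) distribution (survival (v/u)**alpha)
    return v * (1.0 - rng.random()) ** (-1.0 / alpha)

rng = random.Random(0)
v, alpha, n = 2.0, 1.5, 100_000
xs = [math.log(sample_pareto(v, alpha, rng) / v) for _ in range(n)]
mean = sum(xs) / n
print(mean)  # close to 1/alpha = 0.667, the Exp(alpha) mean
```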

5 Simulations to Approximate the RIPr
Because of its growth optimality property, we may sometimes still want to use the GRO e-variable S gro(M) , even in cases where it is not equal to the easily calculable S pseudo(M) .
To this end we need to approximate it numerically. The goal of this section is twofold: first, we illustrate that this is feasible in principle; second, we show that it raises interesting additional questions for future work. Thus, below we consider in more detail simulations to approximate S gro(M) for the exponential families with S gro(M) ≠ S pseudo(M) that we considered before, i.e. beta, geometric, exponential and Gaussian with free variance; for simplicity we only consider the case k = 2. In Appendix E we provide some graphs illustrating the RIPr probability densities for particular choices of µ 1 , µ 2 ; here, we focus on how to approximate them, taking our findings for k = 2 as suggestive of what happens for larger k.

5.1 Approximating the RIPr via Li's Algorithm
Li [1999] provides an algorithm for approximating the RIPr of a distribution Q with density q onto the convex hull conv(P) of a set of distributions P (where each P ∈ P has density p) arbitrarily well in terms of KL divergence. At the m-th step, this algorithm outputs a finite mixture P (m) ∈ conv(P) of at most m elements of P. For m > 1, these mixtures are determined by iteratively setting P (m) := αP (m−1) + (1 − α)P ′ , where α ∈ [0, 1] and P ′ ∈ P are chosen so as to minimize the KL divergence D(Q∥αP (m−1) + (1 − α)P ′ ); P (1) is defined as the single element of P that minimizes D(Q∥P (1) ). It is thus a greedy algorithm, but Li shows that, under some regularity conditions on P, it holds that D(Q∥P (m) ) → inf P ∈conv(P) D(Q∥P ). That is, P (m) approximates the RIPr in terms of KL divergence. This suggests, but is not in itself sufficient to prove, that sup P ∈P E P [q(X)/p (m) (X)] → 1, i.e. that the likelihood ratio actually tends to an e-variable.
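The greedy scheme can be sketched on a toy discrete instance. Here we project a bimodal target (itself a two-component binomial mixture, so that the infimum KL divergence is 0) onto the convex hull of binomial distributions; the binomial family and all grid sizes are illustrative choices of ours, not the families used in the paper.

```python
import math

def kl(q, p):
    # D(q || p) for discrete distributions on a common finite support
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def li_approximation(q, components, n_iter=12, alpha_grid=51):
    # Greedy mixture approximation in the style of Li [1999]: start from
    # the best single component, then repeatedly mix the current
    # approximation with the component/weight pair that most reduces KL.
    mix = list(min(components, key=lambda p: kl(q, p)))
    for _ in range(1, n_iter):
        best_kl, best_mix = kl(q, mix), mix
        for p in components:
            for j in range(alpha_grid):
                a = j / (alpha_grid - 1)
                cand = [a * m + (1 - a) * pi for m, pi in zip(mix, p)]
                d = kl(q, cand)
                if d < best_kl:
                    best_kl, best_mix = d, cand
        mix = best_mix
    return mix

def binom(n, theta):
    return [math.comb(n, x) * theta**x * (1 - theta) ** (n - x)
            for x in range(n + 1)]

n = 10
target = [0.5 * a + 0.5 * b for a, b in zip(binom(n, 0.2), binom(n, 0.8))]
components = [binom(n, t / 100) for t in range(2, 100, 2)]
approx = li_approximation(target, components)
print(kl(target, approx))  # small: the greedy mixture nearly recovers the target
```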
We numerically investigated whether this holds in our familiar setting with k = 2, where Q is equal to P µ for some µ = (µ 1 , µ 2 ) ∈ M 2 and P = H 0 (M). To this end, we applied Li's algorithm to a wide variety of values (µ 1 , µ 2 ) for the beta, exponential, geometric and Gaussian with free variance. In all these cases, after at most m = 15 iterations, we found that sup µ 0 ∈M E Pµ 0 ,µ 0 [p µ 1 ,µ 2 (X 1 , X 2 )/q (m) (X 1 , X 2 )] was bounded by 1.005: Li's algorithm converges quite fast; see Appendix E for a graphical depiction of the convergence and the design choices in the simulation.
(Note that, since we have proved that S gro(M) = S pseudo(M) for Bernoulli, Poisson and Gaussian with free mean, there is no need to approximate S gro(M) for those families.)

5.2 Approximating the RIPr via Brute Force
While Li's algorithm converges quite fast, it is still highly suboptimal at iteration m = 2, due to its being greedy. This motivated us to investigate how 'close' we can get to an e-variable by using a mixture of just two components. Thus, we set p A (x k ) := αp ⟨µ 01 ⟩ (x k ) + (1 − α)p ⟨µ 02 ⟩ (x k ) and, for various choices of µ = (µ 1 , µ 2 ), considered S appr as an approximate e-variable, for the specific values of α ∈ [0, 1] and µ 01 , µ 02 that minimize sup µ 0 ∈M E P ⟨µ 0 ⟩ [S appr ] (in practice, we maximize over µ 0 on a discretization of M with 1000 equally spaced grid points and minimize over α, µ 01 , µ 02 on grids with 100 equally spaced points, with the left and right endpoints of the grids over M determined by trial and error). The simulation results, for k = 2 and particular values of µ 1 , µ 2 and the exponential families for which approximation makes sense (i.e. S gro(M) ≠ S pseudo(M) ), are presented in Table 2. We tried, and obtained similar results for, many more parameter values; one more parameter pair for each family is given in Table 3 in Appendix E. The term sup µ 0 ∈M E P ⟨µ 0 ⟩ [S appr ] is remarkably close to 1 for all of these families. Corollary 2 of Grünwald et al. [2023] implies that if the supremum is exactly 1, i.e. S appr is an e-variable, then S appr must also be the GRO e-variable relative to P µ . This leads us to speculate that perhaps all the exceedance beyond 1 is due to discretization and numerical error, and the following might (or might not - we found no way of either proving or disproving the claim) be the case:

Conjecture. For k = 2, the RIPr, i.e. the distribution achieving min Q∈conv(H 0 (M)) D(P µ ∥Q), can be written as a mixture of just two elements of H 0 (M).
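The brute-force search can be sketched for the geometric family (success-probability parameterization), with the support truncated so that all expectations become finite sums. The grids below are much coarser than the 1000- and 100-point grids described above (an illustrative choice of ours, to keep the sketch fast), so the resulting worst-case expectation is only a rough approximation.

```python
def geom_pmf(theta, n_max):
    # truncated geometric pmf p(x) = (1 - theta)**x * theta, x = 0..n_max
    return [(1 - theta) ** x * theta for x in range(n_max + 1)]

N = 40                      # truncation point; remaining tail mass is negligible
th1, th2 = 0.3, 0.6         # illustrative alternative (mu_1, mu_2)
null_grid = [0.15 + 0.05 * i for i in range(15)]  # null parameters checked

# numerator p_th0(x1) p_th0(x2) p_th1(x1) p_th2(x2), precomputed per theta0
p1, p2 = geom_pmf(th1, N), geom_pmf(th2, N)
numerators = []
for th0 in null_grid:
    p0 = geom_pmf(th0, N)
    numerators.append([[p0[a] * p0[b] * p1[a] * p2[b]
                        for b in range(N + 1)] for a in range(N + 1)])

def worst_case(alpha, th01, th02):
    # sup over the null grid of E[S_appr] for the 2-component mixture null
    q1, q2 = geom_pmf(th01, N), geom_pmf(th02, N)
    worst = 0.0
    for num in numerators:
        e = sum(num[a][b] / (alpha * q1[a] * q1[b] +
                             (1 - alpha) * q2[a] * q2[b])
                for a in range(N + 1) for b in range(N + 1))
        worst = max(worst, e)
    return worst

# coarse brute-force search over (alpha, mu_01, mu_02)
best = min((worst_case(a / 5, t1, t2), a / 5, t1, t2)
           for a in range(6)
           for t1 in (0.25, 0.40, 0.55, 0.70)
           for t2 in (0.25, 0.40, 0.55, 0.70))
print(best)  # (worst-case null expectation, alpha, mu_01, mu_02)
```

On finer grids the worst-case expectation of the best two-component mixture should move toward 1, in line with the conjecture above.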

Conclusion and Future Work
In this paper, we introduced and analysed four types of e-variables for testing whether k groups of data are distributed according to the same element of an exponential family: the GRO e-variable (S gro(M) ), a conditional e-variable (S cond ), a mixture e-variable (S gro(iid) ), and a pseudo-e-variable (S pseudo(M) ). We compared the growth rates of the e-variables under a simple alternative where each of the k groups has a different, but fixed, distribution in the same exponential family. We have shown that for any two of the e-variables S, S ′ ∈ {S gro(M) , S cond , S gro(iid) , S pseudo(M) }, we have E[log S − log S ′ ] = O(δ 4 ) if the ℓ 2 distance between the parameters of this alternative distribution and the parameter space of the null is given by δ. This shows that when the effect size is small, all the e-variables behave surprisingly similarly. For more general effect sizes, we know that S gro(M) has the highest growth rate by definition. Calculating S gro(M) involves computing the reverse information projection of the alternative on the null, which is generally a hard problem.

Table 2: Approximations arrived at by brute-force minimization of the KL divergence as in Section 5.2; we show how close the corresponding likelihood ratio S appr is to being an e-variable.

However, we proved that there are exponential families for which one of the following holds: S pseudo(M) = S gro(M) , S cond = S gro(M) or S gro(iid) = S gro(M) , which considerably simplifies the problem. If one is interested in testing an exponential family for which this is not the case, there are algorithms to estimate the reverse information projection. We have numerically verified that approximations of the reverse information projection also lead to approximations of S gro(M) . However, the use of S cond or S gro(iid) might still be preferred over S gro(M) due to their computational advantage. Our simulations show that it depends on the specific exponential family which of them is preferable over the
other, and that sometimes there is even no clear order.

A Application in Practice: k Separate I.I.D. Data Streams

In the simplest practical applications, we observe one block at a time, i.e. at time n, we have observed X (1) , . . ., X (n) , where each X (i) is a block, i.e. a vector with one outcome for each of the k groups. This is a rather restrictive setup, but we can easily extend it to blocks of data in which each group has a different number of outcomes. For example, if data comes in blocks with m j outcomes in group j, for j = 1, . . ., k, we can re-organize this as having k ′ = m 1 + . . . + m k groups, with 1 outcome in each group, and an alternative in which the first m 1 entries of the outcome vector share the same mean µ ′ 1 = . . . = µ ′ m 1 = µ 1 ; the next m 2 entries share the same mean µ ′ m 1 +1 = . . . = µ ′ m 1 +m 2 = µ 2 , and so on. Even more generally, we will be confronted with k separate i.i.d. streams, and data in each stream may arrive at a different rate. We can still handle this case by pre-determining a multiplicity m 1 , . . ., m k for each stream. As data comes in, we fill virtual 'blocks' with m j outcomes for group j, j = 1, . . ., k. Once a (number of) virtual block(s) has been filled entirely, the analysis can be performed as usual, restricted to the filled blocks. That is, if for some integer B we have observed Bm j outcomes in stream j for all j = 1, . . ., k, but for some j we have not yet observed (B + 1)m j outcomes, and we decide to stop the analysis and calculate the evidence against the null, then we output the product of e-variables for the first B blocks and ignore any additional data for the time being. Importantly, if we find out, while analyzing the streams, that some streams are providing data at a much faster rate than others, we may adapt m 1 , . . ., m k dynamically: whenever a virtual block has been finished, we may decide on alternative multiplicities for the next block; see Turner et al.
[2021] for a detailed description for the case that k = 2.
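The virtual-block bookkeeping described above can be sketched as follows; the class name and interface are ours, purely for illustration.

```python
from collections import deque

class BlockFiller:
    # Virtual 'blocks' for k streams arriving at different rates:
    # stream j contributes m_j outcomes per block, and a block is
    # complete once every stream has supplied its quota.
    def __init__(self, multiplicities):
        self.m = list(multiplicities)
        self.buffers = [deque() for _ in self.m]

    def observe(self, stream, outcome):
        self.buffers[stream].append(outcome)

    def pop_block(self):
        # return one filled block (a list of per-stream outcome lists),
        # or None if some stream has not yet filled its quota
        if any(len(b) < mj for b, mj in zip(self.buffers, self.m)):
            return None
        return [[b.popleft() for _ in range(mj)]
                for b, mj in zip(self.buffers, self.m)]

bf = BlockFiller([2, 1])        # stream 0 arrives twice as fast as stream 1
for x in (0.4, 1.1):
    bf.observe(0, x)
bf.observe(1, 2.5)
blk1 = bf.pop_block()
blk2 = bf.pop_block()
print(blk1)  # [[0.4, 1.1], [2.5]]
print(blk2)  # None: the next block is not yet filled
```

One would then compute a block e-variable from each completed block and multiply them; adapting the multiplicities between blocks, as described above, does not affect this bookkeeping.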

B Proofs for Section 2
In the proofs we freely use, without specific mention, basic facts about derivatives of (log-)densities of exponential families. These can all be found in, for example, Barndorff-Nielsen [1978].

B.1 Proof of Proposition 1
Proof. Since S gro(M) was already shown to be an e-variable in Lemma 2, the 'if' part of the statement holds. The 'only if' part follows directly from Corollary 2 to Theorem 1 in Grünwald et al. [2023], which states that there can be at most one e-variable of the form p µ (X k )/r(X k ) where r is a probability density for X k .

B.2 Proof of Proposition 2
Proof. Define g(µ 0 ) := E P ⟨µ 0 ⟩ [S pseudo(M) ] and B(µ ). Taking first and second derivatives with respect to µ 0 , we find

B.3 Proof of Theorem 1
To prepare for the proof of Theorem 1, let us first recall Young's inequality.

Lemma 3 (Young's inequality). Let p, q be positive real numbers satisfying 1/p + 1/q = 1. Then for nonnegative real numbers a, b, we have ab ≤ a^p/p + b^q/q.

The proof of Theorem 1 follows exactly the same argument as the one used by Turner et al. [2021] to prove this statement in the special case that M is the Bernoulli model.
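Lemma 3 is easy to sanity-check numerically. The following sketch verifies the inequality on random nonnegative inputs; the small relative tolerance guards against floating-point rounding near the equality case a^p = b^q.

```python
import random

def young_holds(a, b, p):
    # Young's inequality: for p, q > 1 with 1/p + 1/q = 1 and a, b >= 0,
    # a*b <= a**p / p + b**q / q (equality iff a**p == b**q)
    q = p / (p - 1)
    lhs, rhs = a * b, a ** p / p + b ** q / q
    return lhs <= rhs + 1e-9 * (1.0 + rhs)

rng = random.Random(0)
checks = [young_holds(rng.uniform(0, 10), rng.uniform(0, 10),
                      rng.uniform(1.01, 5)) for _ in range(10_000)]
print(all(checks))  # True
```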
Proof. We first show that S gro(iid) as defined in the theorem statement is an e-variable. For this, we set p * 0 (X) = (1/k) Σ_{i=1}^k p µ i (X). We have: We also have We need to show that (B.4) ≤ 1, for which we can use (B.5). Stated more simply, it is sufficient to prove where the first inequality holds because of Young's inequality, by setting 1 This shows that S gro(iid) is an e-variable. It remains to show that S gro(iid) is indeed the GRO e-variable relative to H 0 (iid); once we have shown this, it follows by Lemma 2 that it is the unique such e-variable, and therefore by Lemma 1 that P * 0 achieves the minimum in Lemma 1. Since we already know that S gro(iid) is an e-variable, the fact that it is the GRO e-variable relative to H 0 (iid) follows immediately from Corollary 2 of Theorem 1 in Grünwald et al. [2023], which states that there can be at most one e-variable of the form p µ (X k )/r(X k ) where r is a probability density. Since S gro(iid) is such an e-variable, Lemma 1 gives that it must be the GRO e-variable.
C Proofs for Section 3

C.1 Proof of Theorem 2

Proof. We prove the theorem using an elaborate Taylor expansion of F (δ), defined below, around δ = 0. We first calculate the first four derivatives of F (δ). Thus we define and derive, with µ i = µ 0 + α i δ and f y (δ) = Σ_{i=1}^k p µ i (y) defined as in the theorem statement, where we define F 1 (δ) to be equal to the leftmost term in (C.1) and F 2 (δ) to be equal to the second, and (a) and (b) both hold provided that for all j ∈ {1, . . ., k}: is finite. In the online supplementary material we verify that this condition, as well as a plethora of related finiteness-of-expectation-of-absolute-value conditions, holds for all δ sufficiently close to 0. Together these not only imply (a) and (b), but also (c) that we can freely exchange integration over y and differentiation over δ for all such δ when computing the first k derivatives of F 1 (δ) and F 2 (δ), for any finite k, and (d) that all these derivatives are finite for δ in a compact interval including 0 (since the details are straightforward but quite tedious and long-winded, we deferred these to the supplementary material). Thus, using (c), we will freely differentiate under the integral sign in the remainder of the proof below, and using (d), we will be able to conclude that the final result is finite.
For each derivative, we first compute the derivative of F 1 (δ) and then that of F 2 (δ).
where the above formulas hold since f ′ x (0) = 0 for all x ∈ X , which can be obtained by where we used that all µ j are equal to µ 0 at δ = 0. We turn to the second derivatives: where ∫ f ′′ y (δ) dρ(y) = 0 because ∫ f y (δ) dρ(y) = k, in which k is a constant that does not depend on δ. Then F ′′ 2 (δ) is given by around δ = 0, which gives: where f y (0) = Σ_{i=1}^k p µ 0 (y) = k p µ 0 (y) and f ′′ y (0) = and, with x k set to X k (x k−1 , z) and recalling that µ = ⟨µ 0 ⟩ + αδ and µ where I(µ j ) is the Fisher information. The final equality follows because, with λ(µ j ) the canonical parameter corresponding to µ j , we have dλ(µ j )/dµ j = I(µ j ) and dA(β)/dβ | β=λ(µ j ) = µ j ; see e.g. [Grünwald, 2007, Chapter 18]. Now where the second equality follows from Σ_{j=1}^k α j = 0. Because X k is i.i.d. ∼ P µ 0 under P ⟨µ 0 ⟩ and the integral in (C.10) is over a set of exchangeable sequences (to see this in the simple case k = 2: X 1 and X 2 are exchangeable because they are 'symmetric' given C(z)), we must have that (C.10) remains valid if we re-order the α j 's in round-robin fashion, i.e. for all i = 1, . . ., k, we have, with α j,i = α (j+i−1) mod k , ∫ x j α j,i dρ(x k−1 ).
to the proof for the beta distributions given below. In all of these cases, one easily shows by simulation that in general S gro(M) ≠ S gro(iid) and S gro(M) ≠ S cond , so that S gro(M) ≻ S gro(iid) and S gro(M) ≻ S cond follow.
Proof. First, let Q α,β represent a beta distribution in its standard parameterization, so that its density is given by q α,β (u) = (Γ(α + β)/(Γ(α)Γ(β))) u^{α−1} (1 − u)^{β−1} . To simplify the proof, we assume α = 1 here. Then where β i corresponds to µ i , i.e. E Q 1,β i [X] = µ i . We also have:

We illustrate RIPr approximation and the convergence of Li's algorithm with four distributions: exponential, beta with free β and fixed α, geometric, and Gaussian with free variance and fixed mean, each with one particular (randomly chosen) setting of the parameters. The pictures on the left in Figures 2-5 give the probability density functions (for the geometric distributions, discrete probability mass functions) after n = 100 iterations of Li's algorithm. The pictures on the right illustrate the speed of convergence of Li's algorithm; they do not show the first (or, for geometric and Gaussian with free variance, the first two) iteration(s), since the worst-case expectation sup µ 0 ∈M E P ⟨µ 0 ⟩ [S gro(M) ] is invariably incomparably larger in these initial steps. We empirically find that Li's algorithm converges quite fast for computing the true S gro(M) . In each step of Li's algorithm, we searched for the best mixture weight α in P (m) over a uniformly spaced grid of 100 points in [0, 1], and for the novel component P ′ = P µ ′ ,µ ′ by searching for µ ′ in a grid of 100 equally spaced points inside the parameter space M, where the left and right endpoints of the grid were determined by trial and error. While with this ad hoc discretization strategy we obviously cannot guarantee any formal approximation results, in practice it invariably worked well: in all cases, we found that max
Figure 1 panels: (a) beta with free β and fixed α; (b) geometric; (c) Gaussian with free variance and fixed mean; (d) exponential.

Lemma 3. The other inequalities are established in the same way.

Figure 2: Exponential distribution. On the right, n represents the number of iterations of Li's algorithm, starting at iteration 2.

Figure 3: Beta with free β and fixed α. On the right, n represents the number of iterations of Li's algorithm, starting at iteration 2.

Figure 4: Geometric distribution. On the right, n represents the number of iterations of Li's algorithm, starting at iteration 3.

Figure 5: Gaussian with free variance and fixed mean. On the right, n represents the number of iterations of Li's algorithm, starting at iteration 3.