Generalized Moment Estimators based on Stein Identities



Introduction
For many parametric distributions, so-called Stein identities are available, which rely on moments of functional expressions of a corresponding random variable. These identities are named after Charles Stein, who developed the idea of uniquely characterizing a certain distribution family by such a moment identity (see Stein, 1972, 1986). Many examples of both continuous and discrete distributions together with their Stein characterizations can be found in Stein et al. (2004); Sudheesh (2009); Sudheesh & Tibiletti (2012); Landsman & Valdez (2016); Weiß & Aleksandrov (2022); Anastasiou et al. (2023) and the references therein. Stein identities are not a mere tool of probability theory: during the last years, there has also been a lot of research activity on statistical applications of Stein identities, for example to goodness-of-fit (GoF) tests (Betsch et al., 2022; Weiß et al., 2023) and control charts (Weiß, 2023), among others. In the present article, however, another application of Stein identities is investigated and exemplified, namely the parameter estimation of continuous or discrete distributions. The idea of constructing generalized types of method-of-moments (MM) estimators based on an appropriate type of Stein identity plus weighting function, referred to as Stein-MM estimators, was first explored in some applications by Arnold et al. (2001) and Wang & Weiß (2023). Recently, Ebner et al. (2023) discussed the Stein-MM approach in a much broader way, and the present article likewise provides a comprehensive treatment of Stein-MM estimators for various distributions. The main motivation for considering Stein-MM estimation is that the weighting function might be chosen in such a way that the resulting estimator shows better properties (e.g., a reduced bias or mean squared error (MSE)) than the default MM estimator or other existing estimators. Despite the additional flexibility offered by the weighting function, the Stein-MM estimators are computed from simple closed-form expressions, and consistency and asymptotic normality are easily established; see also Ebner et al. (2023).
In what follows, we apply the proposed Stein-MM estimation to three different distribution families. We start with the illustrative example of the exponential (Exp) distribution in Section 2. This simple one-parameter distribution mainly serves to demonstrate the general approach for deriving the Stein-MM estimator and its asymptotics, and it also indicates the potential benefits of using the Stein-MM approach for parameter estimation. Afterward, in Section 3, we examine a more sophisticated type of continuous distribution, namely the two-parameter inverse Gaussian (IG) distribution. In Section 4, we then turn to a discrete distribution family, namely the two-parameter negative-binomial (NB) distribution. Illustrative real-world data examples are also presented in Sections 3-4. Note that neither the exponential distribution nor any discrete distribution has been considered by Ebner et al. (2023), and their approach to the Stein-MM estimation of the IG-distribution differs from the one proposed here; see the details below. Also Arnold et al. (2001) and Wang & Weiß (2023) did not discuss any of the aforementioned distributions. Finally, we conclude in Section 5 and outline topics for future research.

Stein Estimation of Exponential Distribution
The exponential distribution is the most well-known lifetime distribution, characterized by the memory-less property. It has positive support and depends on the parameter λ > 0, where its probability density function (pdf) is given by ϕ(x) = λ e^{−λx} for x > 0 and zero otherwise. A detailed survey about the properties of and estimators for the Exp(λ)-distribution can be found in Johnson et al. (1995, Chapter 19). Given the independent and identically distributed (i.i.d.) sample X_1, ..., X_n with X_i ∼ Exp(λ) for i = 1, ..., n, the default estimator of λ > 0, which is an MM estimator and the maximum likelihood (ML) estimator at the same time, is given by λ̂ = 1/X̄, where X̄ = (1/n) Σ_{i=1}^n X_i denotes the sample mean. This estimator is known neither to be unbiased nor to be optimal in terms of the MSE, see Elfessi & Reineke (2001). To derive a generalized MM estimator with perhaps improved bias or MSE properties, we consider the exponential Stein identity according to Stein et al. (2004, Example 1.6), which states that

λ E[f(X)] = E[f′(X)]   (2.1)

for any piecewise differentiable function f with f(0) = 0 such that E|f′(X)|, E|f(X)| exist. Solving (2.1) in λ and using the sample moments \overline{f′(X)} = (1/n) Σ_{i=1}^n f′(X_i) and \overline{f(X)} = (1/n) Σ_{i=1}^n f(X_i) instead of the population moments, the class of Stein-MM estimators for λ is obtained as λ̂_{f,Exp} = \overline{f′(X)} / \overline{f(X)}. Note that the choice f(x) = x leads to the default estimator λ̂ = 1/X̄. Generally, f(x) ≠ x might be interpreted as a kind of weighting function, which assigns different weights to large or low values of X than the identity function does. For deriving the asymptotic distribution of the general Stein-MM estimator λ̂_{f,Exp}, we first define the vectors Z_i with i = 1, ..., n as

Z_i := ( f′(X_i), f(X_i) )^⊤.   (2.3)

Their mean equals µ_Z = ( E[f′(X)], E[f(X)] )^⊤, where we define the mixed moments µ_f(k, l, m) := E[X^k f(X)^l f′(X)^m]. Then, the following central limit theorem (CLT) holds.
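The estimator class above is straightforward to implement. The following is a minimal sketch in Python; the function name and the particular choice a = 0.9 are ours, for illustration only:

```python
import numpy as np

def stein_mm_exp(x, f, fprime):
    """Stein-MM estimator of the rate of an Exp(lambda) sample:
    lambda_hat = mean(f'(X)) / mean(f(X)), from lambda*E[f(X)] = E[f'(X)]."""
    return np.mean(fprime(x)) / np.mean(f(x))

rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2.0, size=100_000)  # Exp(lambda = 2)

# f(x) = x recovers the default estimator 1/mean(x)
lam_default = stein_mm_exp(x, lambda t: t, lambda t: np.ones_like(t))
# a sublinear weighting f(x) = x**0.9
lam_sub = stein_mm_exp(x, lambda t: t**0.9, lambda t: 0.9 * t**(-0.1))
```

For f(x) = x, the estimator coincides exactly with 1/X̄, as stated above; both variants are consistent for λ = 2 here.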
Theorem 2.1 If X_1, ..., X_n are i.i.d. according to Exp(λ), then the sample mean Z̄ of Z_1, ..., Z_n according to (2.3) is asymptotically normally distributed, √n (Z̄ − µ_Z) → N(0, Σ), where N(0, Σ) denotes the multivariate normal distribution, and where the covariances σ_11 = V[f′(X)], σ_12 = Cov[f′(X), f(X)], and σ_22 = V[f(X)] can be expressed in terms of the moments µ_f(k, l, m).

The proof of Theorem 2.1 is provided by Appendix A.1. In the second step of deriving the asymptotics of λ̂_{f,Exp}, we define the function g(u, v) := u/v. Then, λ̂_{f,Exp} = g(Z̄) and λ = g(µ_Z). Applying the Delta method (Serfling, 1980) to Theorem 2.1, the following result follows.
Theorem 2.2 If X_1, ..., X_n are i.i.d. according to Exp(λ), then λ̂_{f,Exp} is asymptotically normally distributed, where the asymptotic variance and bias are given in terms of the moments µ_f(k, l, m).
The proof of Theorem 2.2 is provided by Appendix A.2. Note that the moments µ_f(k, l, m) involved in Theorems 2.1 and 2.2 can sometimes be derived explicitly, see the subsequent examples, while they can be computed by using numerical integration otherwise.
After having derived the asymptotic variance and bias without explicitly specifying the function f, let us now consider the special case f_a(x) = x^a as our first illustrative example (where a = 1 leads to the default estimator λ̂ = 1/X̄). Here, we have to restrict to a > 0 to ensure that the condition f(0) = 0 in (2.1) holds. Using the moment formula

E[X^a] = Γ(a + 1)/λ^a for a > −1,   (2.5)

the following corollary to Theorem 2.2 is derived.

Corollary 2.3 Let X_1, ..., X_n be i.i.d. according to Exp(λ), and let f_a(x) = x^a with a > 1/2. Then, λ̂_{f_a,Exp} is asymptotically normally distributed, with asymptotic variance, bias, and MSE following from Theorem 2.2 and (2.5).

The proof of (2.5) and Corollary 2.3 is provided by Appendix A.3. Note that in Corollary 2.3, the generalized binomial coefficient "r choose s" is given by Γ(r + 1)/Γ(s + 1)/Γ(r − s + 1).
In Figure 1 (a), the asymptotic variance and bias of λ̂_{f_a,Exp} according to Corollary 2.3 are presented. While the variance is minimal for a = 1 (i.e., for the ordinary MM and ML estimator), the bias decreases with decreasing a (i.e., bias reductions are achieved for sublinear choices of f_a(x) = x^a). Hence, an MSE-optimal choice of f_a(x) is obtained for some a ∈ (0.5; 1). This is illustrated by Figure 1 (b), where the MSE of Corollary 2.3 is presented for different sample sizes n ∈ {10, 25, 50, 100}. The corresponding optimal values of a are determined by numerical optimization as 0.952, 0.978, 0.988, and 0.994, respectively. As a result, especially for small n, we have a reduction of the MSE (and of the bias as well) if using a "true" Stein-MM estimator (i.e., with a ≠ 1). Certainly, if the focus is mainly on bias reduction, then an even smaller choice of a would be beneficial. As a second illustrative example, let us consider the functions f_u(x) = 1 − u^x with u ∈ (0; 1), which are again sublinear, but this time also bounded from above by one. Again, we can derive a corollary to Theorem 2.2, this time by using the moment formula

E[u^X] = λ/(λ − ln u).   (2.6)

Corollary 2.4 Let X_1, ..., X_n be i.i.d. according to Exp(λ), and let f_u(x) = 1 − u^x with u ∈ (0; 1). Then, λ̂_{f_u,Exp} is asymptotically normally distributed, with asymptotic variance, bias, and MSE following from Theorem 2.2 and (2.6).

The proof of (2.6) and Corollary 2.4 is provided by Appendix A.4. This time, the variance decreases for increasing u, whereas the bias decreases for decreasing u. As a consequence, an MSE-optimal choice is expected for some u inside the interval (0; 1). This is illustrated by Figure 2 (a), where the minima for n ∈ {10, 25, 50, 100}, given that λ = 1, are attained for u ≈ 0.918, 0.963, 0.981, and 0.990, respectively. The major difference between the two types of weighting functions in Corollaries 2.3 and 2.4 is given by the role of λ within the expression for the MSE. For f_a(x) in Corollary 2.3, λ occurs as a simple factor such that the optimal choice for a is the same across different λ.
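For the bounded weighting f_u(x) = 1 − u^x, we have f_u′(x) = −ln(u) u^x, so the Stein-MM estimator again has a simple closed form. A small sketch (the helper name is ours; u = 0.95 is chosen arbitrarily):

```python
import numpy as np

def stein_mm_exp_u(x, u):
    """Stein-MM estimator with weighting f_u(x) = 1 - u**x for 0 < u < 1,
    using f_u'(x) = -log(u) * u**x; numerator and denominator are bounded."""
    return np.mean(-np.log(u) * u**x) / np.mean(1 - u**x)

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=100_000)  # Exp(lambda = 1)
est = stein_mm_exp_u(x, u=0.95)
```

Consistency can be checked against (2.6): at the population level, the ratio equals −ln(u)·λ/(λ − ln u) divided by 1 − λ/(λ − ln u), which simplifies to λ for every u ∈ (0; 1).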
Hence, the optimal a is simply a function of the sample size n, which is very convenient for practice. Let us finally investigate the sensitivity of the Stein-MM estimators with respect to outliers in the data, see Table 1, which refers to a simulation experiment with 10^5 replications per scenario. For simulated i.i.d. Exp(1)-samples of sizes n ∈ {10, 25, 50, 100}, about 10 % of the observations were randomly selected and contaminated by an additive outlier, namely by adding 5 to the selected observations. Note that the topic of outliers in exponential data has received considerable interest in the literature (Johnson et al., 1995, pp. 528-530). Then, different estimators λ̂_{f,Exp} are computed from the contaminated data, where the first three choices of the weighting function f are characterized by a sublinear increase, whereas the fourth function, f(x) = x, corresponds to the default estimator λ̂ = 1/X̄. Table 1 shows that all MM estimators are affected by the outliers, e.g., in terms of the strong negative bias. But comparing the four columns of bias and MSE values, respectively, it becomes clear that the novel Stein-MM estimators are more robust against the outliers, having both lower bias and MSE than λ̂. Especially the choice f(x) = ln(1 + x), a logarithmic weighting scheme, leads to a rather robust estimator. The relatively good performance of the Stein-MM estimators can be explained by the fact that the weighting functions increase sublinearly (which is also beneficial for bias reduction in non-contaminated data, recall the above discussion), so the effect of large observations is damped. To sum up, by choosing an appropriate weighting function f within the Stein-MM estimator λ̂_{f,Exp}, one can not only achieve a reduced bias and MSE, but also a reduced sensitivity towards outlying observations.
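The contamination experiment can be mimicked as follows. This is a sketch of our own (sample size and seed are arbitrary, not the paper's exact design), comparing the default estimator with the logarithmic weighting f(x) = ln(1 + x), whose derivative is 1/(1 + x):

```python
import numpy as np

rng = np.random.default_rng(42)
n, lam = 100_000, 1.0
x = rng.exponential(scale=1 / lam, size=n)

# contaminate roughly 10% of the observations with an additive outlier of size 5
idx = rng.random(n) < 0.10
x[idx] += 5.0

lam_default = 1 / np.mean(x)                            # f(x) = x
lam_log = np.mean(1 / (1 + x)) / np.mean(np.log1p(x))   # f(x) = ln(1 + x)
```

Both estimators are biased downwards by the contamination, but the logarithmic weighting damps large observations and stays markedly closer to the true value λ = 1.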
Stein Estimation of Inverse Gaussian Distribution

The IG-distribution is commonly used as a lifetime model (as it can be related to the first-passage time in random walks), but it may also simply serve as a distribution with positive skewness and, thus, as an alternative to, e.g., the lognormal distribution (see Folks & Chhikara, 1978). Detailed surveys about the properties and applications of IG(µ, λ), and many further references, can be found in Folks & Chhikara (1978); Seshadri (1999) as well as in Johnson et al. (1995, Chapter 15). In what follows, the moment properties of X ∼ IG(µ, λ) are particularly relevant. We have E[X] = µ and V[X] = µ³/λ. In particular, positive and negative moments are related to each other by E[X^{−r}] = E[X^{r+1}]/µ^{2r+1}, see Tweedie (1957a, p. 372) as well as the aforementioned surveys.
Remark 3.1 At this point, let us briefly recall the discussion in Section 2, where we highlighted the property that for i.i.d. exponential samples, the quotient \overline{f′(X)}/\overline{f(X)} has an (asymptotically) unique mean for different f. From counterexamples, it is easily seen that this property does not hold for IG(µ, λ)-data: the Delta method implies that the mean of \overline{f′(X)}/\overline{f(X)} is asymptotically equal to E[f′(X)]/E[f(X)], which generally varies with f.

From now on, let X_1, ..., X_n be an i.i.d. sample from IG(µ, λ), which shall be used for parameter estimation. Here, one obviously estimates µ by the sample mean X̄, but the estimation of λ is more demanding. In the literature, the MM and ML estimation of λ have been discussed (see the details below), while our aim is to derive a generalized MM estimator with improved bias and MSE properties based on a Stein identity. In fact, as we shall see, our proposed approach can be understood as a unifying framework that covers the ordinary MM and ML estimator as special cases. A Stein identity for IG(µ, λ) has been derived by Koudou & Ley (2014, p. 172), which states that

E[X² f′(X)] = E[ f(X) ( λX²/(2µ²) − X/2 − λ/2 ) ]   (3.2)

holds for all differentiable functions f: (0, ∞) → ℝ such that the involved moments exist. Solving (3.2) in λ and using the sample moments \overline{h(X)} = (1/n) Σ_{i=1}^n h(X_i) (where h might be any of the functions involved in (3.2)), with µ estimated by X̄, the class of Stein-MM estimators for λ is obtained as

λ̂_{f,IG} = ( 2 \overline{X² f′(X)} + \overline{X f(X)} ) / ( \overline{X² f(X)}/X̄² − \overline{f(X)} ).   (3.3)

Here, the ordinary MM estimator of λ > 0, i.e., λ̂_MM = X̄³/S² with S² denoting the empirical variance (Tweedie, 1957b), is included as the special case f ≡ 1, whereas the ML estimator λ̂_ML = ( \overline{1/X} − 1/X̄ )^{−1} (Tweedie, 1957a) follows for f(x) = 1/x. Hence, (3.3) provides a unifying estimation approach that covers the established estimators as special cases.
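The two special cases, λ̂_MM = X̄³/S² (f ≡ 1) and the ML estimator with 1/λ̂_ML = \overline{1/X} − 1/X̄ (f(x) = 1/x), can be checked numerically. In the sketch below, the general estimator form is our reconstruction from these special cases (function names ours); NumPy's `wald` sampler draws from the IG-distribution with mean µ and shape λ:

```python
import numpy as np

def stein_mm_ig(x, f, fprime):
    """Stein-MM estimator of lambda for IG(mu, lambda) data, with mu estimated
    by the sample mean (form reconstructed from the text; f == 1 recovers the
    MM estimator, f(x) = 1/x the ML estimator)."""
    xbar = np.mean(x)
    num = 2 * np.mean(x**2 * fprime(x)) + np.mean(x * f(x))
    den = np.mean(x**2 * f(x)) / xbar**2 - np.mean(f(x))
    return num / den

rng = np.random.default_rng(3)
x = rng.wald(mean=1.0, scale=3.0, size=100_000)  # IG(mu = 1, lambda = 3)

lam_mm = stein_mm_ig(x, lambda t: np.ones_like(t), lambda t: np.zeros_like(t))
lam_ml = stein_mm_ig(x, lambda t: 1 / t, lambda t: -1 / t**2)
```

Algebraically, f ≡ 1 collapses the ratio to X̄³/S², and f(x) = 1/x to (\overline{1/X} − 1/X̄)^{−1}, so the two calls reproduce the classical estimators exactly.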
Remark 3.2 At this point, a reference to Example 2.9 in Ebner et al. (2023) is necessary. As already mentioned in Section 1, Ebner et al. (2023) also proposed a Stein-MM estimator for the IG-distribution, which, however, differs from the one developed here. The crucial difference is that Ebner et al. (2023) attempted a joint estimation of (µ, λ) based on (3.2), namely by jointly solving two equations that are implied by (3.2) if using two different weight functions. The resulting class of estimators, however, does not cover the existing MM and ML estimators, so Ebner et al. (2023) did not pursue the Stein-MM estimation of the IG-distribution further. By contrast, as we did not see notable potential for improving the estimation of µ by X̄ (recall the diverse optimality properties of the sample mean as an estimator of the population mean, e.g., Shuster, 1982), we used (3.2) to only derive an estimator for λ. In this way, we were able to recover both the MM and ML estimator of λ within (3.3).
For deriving the asymptotic distribution of our general Stein-MM estimator λ̂_{f,IG} from (3.3), we first define the vectors Z_i, i = 1, ..., n, which collect the moment terms involved in (3.3); their mean µ_Z follows accordingly. Then, the following CLT holds.
Theorem 3.3 If X_1, ..., X_n are i.i.d. according to IG(µ, λ), then the sample mean Z̄ of Z_1, ..., Z_n according to (3.4) is asymptotically normally distributed, √n (Z̄ − µ_Z) → N(0, Σ), where N(0, Σ) denotes the multivariate normal distribution, and where the covariances can be expressed in terms of the moments µ_f(k, l, m).

The proof of Theorem 3.3 is provided by Appendix A.5.
Theorem 3.4 Let X_1, ..., X_n be i.i.d. according to IG(µ, λ), and define ϑ_{f,IG} := µ_f(2, 1, 0) − µ² µ_f(0, 1, 0). Then, λ̂_{f,IG} is asymptotically normally distributed, where the asymptotic variance and bias can be expressed in terms of ϑ_{f,IG} and the moments µ_f(k, l, m).

The proof of Theorem 3.4 is provided by Appendix A.6. Before we discuss the effect of f on bias and MSE of λ̂_{f,IG}, let us first consider the special cases of the ordinary MM and ML estimator. Their asymptotics are immediate consequences of Theorem 3.4. For the MM estimator λ̂_MM, we have to choose f ≡ 1 such that f′ ≡ 0. As a consequence, the joint moments µ_f(k, l, m) reduce to plain moments of X. This leads to a considerable simplification of Theorem 3.4, see Appendix A.7, which is summarized in the following corollary.
Next, we consider the special case of the ML estimator λ̂_ML, which follows by choosing f(x) = 1/x such that f′(x) = −1/x². Again, the joint moments µ_f(k, l, m) simplify a lot, see (3.7). Together with Theorem 3.4, see Appendix A.8, we get the following corollary.
Corollary 3.6 Let X_1, ..., X_n be i.i.d. according to IG(µ, λ). Then λ̂_ML = ( \overline{1/X} − 1/X̄ )^{−1} is asymptotically normally distributed with asymptotic variance σ²_ML = (2/n) λ² and bias B_ML = (3/n) λ.

Comparing Corollaries 3.5 and 3.6, it is interesting to note that the MM estimator has larger asymptotic bias and variance than the ML estimator. To verify the asymptotics of Corollary 3.6, note that the ML estimator λ̂_ML has been shown to follow an inverted-χ² distribution, namely nλ/λ̂_ML ∼ χ²_{n−1} (Tweedie, 1957a, p. 368). Using the formulae for mean and variance of Inv-χ²_{n−1} (see Bernardo & Smith, 1994, p. 431), we get E[λ̂_ML] ≈ λ(1 + 3/n) and V[λ̂_ML] ≈ (2/n) λ² for large n, which agrees with Corollary 3.6.
Remark 3.7 To analyze the performance of the asymptotics provided by Theorem 3.4 (and of the special cases discussed in Corollaries 3.5 and 3.6), if used as approximations to the true distribution of λ̂_{f,IG} for finite sample size n, we did a simulation experiment with 10^5 replications. The obtained results for various choices of (µ, λ) and f(x) are summarized in Table 2. It can be recognized that the asymptotic approximations for mean and standard deviation generally agree quite well with their simulated counterparts. Only for the case f(x) ≡ 1 (the default MM estimator) and sample size n = 100, we sometimes observe stronger deviations. But in the large majority of estimation scenarios, we have a close agreement such that the conclusions derived from the asymptotic expressions are meaningful for finite sample sizes as well.
In analogy to our discussion in Section 2, let us now analyze the performance of the Stein-MM estimator λ̂_{f,IG} for the weight functions f_a(x) = x^a. Recall that this class of weight functions covers the default MM estimator λ̂_MM for a = 0 and the ML estimator λ̂_ML for a = −1. The choice a = −1/2 (right in the middle between these two special cases) has to be excluded, as it leads to a degenerate estimator λ̂_{f_a,IG} according to (3.3). For this reason, the subsequent analyses are done separately for a < −1/2 and a > −1/2 (see Figures 3 and 4). The upper and lower panels consider two different example situations, namely (µ, λ) = (1, 3) and (µ, λ) = (3, 1), respectively, while the left-hand and right-hand sides are separated by the pole at a = −1/2. The right-hand side shows that the default MM estimator is neither (locally) optimal in terms of asymptotic bias nor in terms of variance. In fact, the optimal a for a > −1/2 is around −0.1 for (µ, λ) = (1, 3), and around −0.2 for (µ, λ) = (3, 1). However, comparing the actual values on the Y-axis to those of the plots on the left-hand side, we recognize that the asymptotic bias and variance get considerably smaller in some region with a < −1/2. In particular, the ML estimator is clearly superior to the MM estimator, and as shown by Figures 3 (a) and (c), the ML estimator is even optimal in terms of the asymptotic variance. It has to be noted, however, that the curve corresponding to the asymptotic variance is rather flat around a = −1, so moderate deviations from a = −1 do not have a notable effect on the variance. Thus, it is important to also consider the optimum bias, which is reached for some a around −0.65 in both (a) and (c). So it appears advisable to choose an a > −1 for optimal overall estimation performance. This is confirmed by Figure 4, where the asymptotic MSE is shown for various sample sizes n and the same scenarios as in Figure 3. While the ML estimator approaches the MSE-optimum for increasing n, we get an improved performance for smaller sample sizes if choosing a ∈ (−1; −0.5) appropriately (e.g., a ≈ −0.8 if n ≤ 50). Generally, an analogous recommendation holds for a > −1/2 in parts (b) and (d), with MSE-optima at a around −0.1 and −0.2, respectively, but much smaller MSE values can be reached for a < −1/2. To sum up, the default MM estimator (and more generally, Stein-MM estimators λ̂_{f_a,IG} with a > −1/2) is not recommended for practice due to its rather large bias, variance, and thus MSE, while the ML estimator constitutes at least a good initial choice for estimating λ, being optimal in terms of asymptotic variance. However, unless the sample size n is very large, an improved MSE performance can be achieved by reducing a to an appropriate value in (−1; −0.5) in a second step and by computing the corresponding Stein-MM estimate λ̂_{f_a,IG}.
Example 3.8 As an illustrative data example, let us consider the n = 25 runoff amounts at Jug Bridge in Maryland (see Folks & Chhikara, 1978, p. 272), which are "very well described by the inverse Gaussian distribution". The parameter µ is estimated by the sample mean as ≈ 0.803, and using the ML estimator λ̂_{f_{−1},IG} as an initial estimator for λ, we get the value ≈ 1.440. As outlined before, this initial model fit might now be used for searching estimators with improved performance. Some examples (together with further estimates for comparative purposes) are summarized in Table 3. The ML estimator (a = −1) is also optimal in asymptotic variance, whereas the bias-optimal choice is obtained for a somewhat larger value of a, namely a ≈ −0.668. The corresponding estimate is slightly lower than the ML estimate, similar to the value for a = −1.5, and can thus be seen as a fine-tuning of the initial estimate. By contrast, a notable change in the estimate happens if we turn to a > −0.5. The "constrained-optimal" choices (optimal given that a > −0.5) as well as the MM estimate lead to nearly the same values (around 1.51) and are thus visibly larger than the actually preferable estimates for a < −0.5. Also their variance and bias are about 2.5 times larger than those of the estimates for a < −0.5.
Stein Estimation of Negative-Binomial Distribution

The most common count distributions are the Poisson and binomial distributions, both depending on the (normalized) mean as their only model parameter. But as already discussed in Section 3, there is hardly any potential for finding a better estimator of the mean than the sample mean, so we do not further discuss these distributions. Instead, we focus on another popular count distribution, namely the NB-distribution with parameters ν > 0 and π ∈ (0; 1), abbreviated as NB(ν, π). Such X ∼ NB(ν, π) has the range N_0 and the probability mass function (pmf) P(X = x) = Γ(ν + x)/(x! Γ(ν)) · π^ν (1 − π)^x. By contrast to the equidispersed Poisson distribution, its variance is always larger than the mean (overdispersion), which is an important property for applications in practice. A detailed survey about the properties of and estimators for the NB(ν, π)-distribution can be found in Johnson et al. (2005, Chapter 5). Instead of the original parametrization by (ν, π), it is often advantageous to consider either (µ, ν) or (µ, π), where ν or π, respectively, serves as an additional dispersion parameter once the mean µ has been fixed. In case of the (µ, ν)-parametrization, it holds that π = ν/(ν + µ) and V[X] = (ν + µ)/ν · µ, whereas we get ν = πµ/(1 − π) and V[X] = µ/π for the (µ, π)-parametrization. Besides the ease of interpretation, these parametrizations are advantageous in terms of parameter estimation. While MM estimation is rather obvious, namely µ by X̄ and ν̂_MM = X̄²/(S² − X̄), π̂_MM = X̄/S², ML estimation is generally demanding as there does not exist a closed-form solution, see the discussion by Kemp & Kemp (1987), i.e., numerical optimization is necessary. However, there is an important exception: the NB's ML estimator of the mean µ is given by X̄ (Kemp & Kemp, 1987, p. 867), i.e., X̄ is both the MM and ML estimator with its known appealing performance. So it suffices to find an adequate estimator for ν or π, respectively, the ML estimators of which do not have a closed-form expression. These difficulties in estimating ν or π serve as our motivation for deriving a generalized MM estimator. For this purpose, we consider the NB's Stein identity according to Brown & Phillips (1999, Lemma 1), which can be expressed as either

E[X f(X)] = µ/(ν + µ) · E[(ν + X) f(X + 1)]   (4.1)

or

E[X f(X)] = E[(πµ + (1 − π) X) f(X + 1)]   (4.2)

for any function f such that E|f(X)|, E|f(X + 1)| exist. Note that the discrete difference ∆f(x) := f(x + 1) − f(x) in (4.1) and (4.2) plays a similar role as the continuous derivative f′(x) in the previous Stein identities (2.1) and (3.2).
Stein-MM estimators are now derived by solving (4.1) in ν or (4.2) in π, respectively, and by using again sample moments \overline{h(X)} = (1/n) Σ_{i=1}^n h(X_i) instead of the involved population moments E[h(X)] (with µ being estimated by X̄). As a result, the (closed-form) classes of Stein-MM estimators for ν and π are obtained as

ν̂_{f,NB} = X̄ ( \overline{X f(X+1)} − \overline{X f(X)} ) / ( \overline{X f(X)} − X̄ · \overline{f(X+1)} ),
π̂_{f,NB} = ( \overline{X f(X+1)} − \overline{X f(X)} ) / ( \overline{X f(X+1)} − X̄ · \overline{f(X+1)} ).   (4.3)

Note that the choice f(x) = x (hence ∆f(x) = 1) leads to the default MM estimators given above. The ML estimators are not covered by (4.3) this time, because they do not have a closed-form expression at all. Note, however, that the so-called "weighted-mean estimator" for ν in (2.6) of Kemp & Kemp (1987), which was motivated as a kind of approximate ML estimator, is covered by (4.3), namely by choosing f_α(x) = α^x with α ∈ (0; 1). It is also worth pointing to Savani & Zhigljavsky (2006), who define an estimator of ν based on the moment \overline{f(X)} for some specified f; their approach, however, usually does not lead to a closed-form estimator.
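Solving the Brown & Phillips identity E[X f(X)] = (1 − π) E[(ν + X) f(X + 1)] in ν, after substituting π = ν/(ν + µ) and replacing population by sample moments, gives a closed-form estimator. The following sketch is our own rearrangement (it may differ in form from the paper's (4.3)), but it reproduces the default MM estimator exactly for f(x) = x:

```python
import numpy as np

def stein_mm_nu(x, f):
    """Stein-MM estimator of nu, obtained by solving the NB Stein identity
    E[X f(X)] = (1-pi) E[(nu+X) f(X+1)] in nu after substituting
    pi = nu/(nu+mu) and estimating mu by the sample mean."""
    xbar = np.mean(x)
    num = np.mean(x * f(x + 1)) - np.mean(x * f(x))
    den = np.mean(x * f(x)) / xbar - np.mean(f(x + 1))
    return num / den

rng = np.random.default_rng(11)
x = rng.negative_binomial(n=4, p=0.4, size=200_000)  # NB(nu=4, pi=0.4), mean 6

nu_mm = stein_mm_nu(x, lambda t: t)            # f(x) = x: default MM estimator
nu_alpha = stein_mm_nu(x, lambda t: 0.8**t)    # f_alpha(x) = alpha**x
```

For f(x) = x, the numerator reduces to X̄ and the denominator to (S² − X̄)/X̄, so the ratio equals X̄²/(S² − X̄), i.e., the ordinary MM estimator; both calls are consistent for ν = 4.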
For deriving the asymptotic distribution of the general Stein-MM estimator ν̂_{f,NB} or π̂_{f,NB}, respectively, we first define the vectors Z_i, i = 1, ..., n, which collect the sample moment terms involved in (4.3); their mean is expressed in terms of the moments μ_f(k, l, m) := E[X^k f(X)^l f(X + 1)^m]. Then, the following CLT holds.
Theorem 4.1 If X_1, ..., X_n are i.i.d. according to a negative-binomial distribution, then the sample mean Z̄ of Z_1, ..., Z_n according to (4.4) is asymptotically normally distributed, √n (Z̄ − µ_Z) → N(0, Σ), where N(0, Σ) denotes the multivariate normal distribution, and where the covariances can be expressed in terms of the moments μ_f(k, l, m).

The proof of Theorem 4.1 is provided by Appendix A.9.
Our first special case shall be the function f_α(x) = α^x with α ∈ (0; 1), which is inspired by Kemp & Kemp (1987). For evaluating the asymptotics in Theorems 4.1-4.3, we need to compute the moments μ_f(k, l, m). As shown in the following, this can be done by explicit closed-form expressions. The idea is to utilize the probability generating function (pgf) of the NB-distribution, pgf(s) = E[s^X] = (π/(1 − (1 − π)s))^ν, together with the following property: the rth derivative of the pgf satisfies pgf^(r)(s) = E[X^(r) s^{X−r}], where x^(r) := x(x − 1)···(x − r + 1) for r ∈ N_0 denote the falling factorials. The main result is summarized by the following lemma.
Lemma 4.4 Let X ∼ NB(ν, π). Then the mixed factorial moments can be computed by explicit series expressions, where the upper summation limit M is chosen sufficiently large, e.g., such that the omitted summands fall below a specified tolerance limit.

In this way, we generated the illustrative graphs in Figures 5 (estimator ν̂_{f,NB}) and 6 (estimator π̂_{f,NB}). There, parts (a)-(b) always refer to the above choice f_α(x) = α^x, and clear minima for variance and bias can be recognized. To be able to compare with the respective default MM estimator, we did analogous computations for f_a(x) = x^a (where a = 1 gives the default MM estimator), which, however, is only defined for a > 0 as X becomes zero with positive probability. As can be seen from the gray curves in parts (c)-(d), variance and bias usually do not attain a local minimum for a > 0. Therefore, parts (c)-(d) mainly focus on a slight modification of the weight function, namely f_{a,1}(x) = (x + 1)^a, which is also well-defined for a < 0.
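The closed-form moment computations rest on the NB pgf, E[s^X] = (π/(1 − (1 − π)s))^ν. A quick Monte Carlo sanity check of this formula (parameter values arbitrary; NumPy's `negative_binomial(n, p)` matches NB(ν, π) here, with mean ν(1 − π)/π):

```python
import numpy as np

def nb_pgf(s, nu, pi):
    """Probability generating function of NB(nu, pi): E[s**X]."""
    return (pi / (1 - (1 - pi) * s)) ** nu

rng = np.random.default_rng(5)
nu, pi, alpha = 2.5, 0.3, 0.7
x = rng.negative_binomial(n=nu, p=pi, size=200_000)

mc = np.mean(alpha ** x)        # Monte Carlo estimate of E[alpha**X]
exact = nb_pgf(alpha, nu, pi)   # closed-form pgf value
```

Evaluating the pgf and its derivatives at s = α is exactly what delivers the closed-form moments of f_α(X) = α^X used in the lemma.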
The optimal choices for α and a, respectively, lead to very similar variance and bias values, see Table 4. While α^x leads to a slightly larger variance than (x + 1)^a, its optimal bias is visibly lower. For both choices of f, however, the optimal Stein-MM estimators perform clearly better than the default MM estimator, see the dotted line at a = 1 in parts (c)-(d) of Figures 5 and 6. Altogether, also in view of the fact that explicit closed-form expressions are possible for α^x (although being rather complex), we prefer to use f_α(x) = α^x as the weighting function, in accordance with Kemp & Kemp (1987). For this choice, we also did a simulation experiment with 10^5 replications (in analogy to Remark 3.7) in order to check the finite-sample performance of the estimators ν̂_{f_α,NB} and π̂_{f_α,NB}. As an illustrative data example (Example 4.5), we consider the counts of red mites on apple leaves, starting from an initial estimate for ν and searching for choices of α having an improved performance. The resulting estimates (together with further estimates for comparative purposes) are summarized in the upper part of Table 5. It can be seen that the initial estimate (last column) is corrected downwards to a value close to 1 (i.e., we essentially end up with the special case of a geometric distribution). Here, it is interesting to note that the numerically computed ML estimate as reported in Rueda & O'Reilly (1999) also leads to such a result, namely the value 1.025. In this context, we also recall Kemp & Kemp (1987), who proposed the choice f_α(x) = α^x to get a closed-form approximate ML estimator for ν.
We repeated the aforementioned estimation procedure also for the (µ, π)-parameterization, starting with the initial MM-estimate ≈ 0.504 for π, see the lower part of Table 5.Again, the initial estimate is corrected downwards to a value around 0.46.

Conclusions
In this article, we demonstrated how Stein characterizations of (continuous or discrete) distributions can be utilized to derive improved moment estimators of model parameters. The main idea is to choose an appropriate type of weighting function such that the resulting Stein-MM estimator has lower variance and bias than existing estimators. Here, the choice of the weighting function can be done based on asymptotic distributions: as the Stein-MM estimators are given by closed-form expressions with asymptotic normality, one can easily derive an optimal choice from a given family of weighting functions. This procedure was exemplified for three types of distribution: the exponential distribution, the inverse Gaussian distribution, and the negative-binomial distribution. For all these distribution families, we observed an appealing performance in various aspects, and we also demonstrated the application of our findings to real-world data examples.
Our research also gives rise to several directions for future research. While our main focus was on selecting the weight function with respect to minimal bias or variance, we also briefly pointed out in Section 2 that such a choice could be motivated by robustness to outliers as well. In fact, there are some analogies to "M-estimation" as introduced by Huber (1964).

A Derivations
A.1 Proof of Theorem 2.1 The asymptotic normality immediately follows from the Lindeberg-Lévy CLT (Serfling, 1980, p. 28). The covariances follow by direct calculation; in the last step, we applied the Stein identity (2.1). We conclude the proof by noting that the second expression for σ_12 in Theorem 2.1 immediately follows by using the expression for σ_22.
A.2 Proof of Theorem 2.2 First, we evaluate the gradient and Hessian of g in µ_Z, which leads to D and H. Applying the Delta method to Theorem 2.1, the asymptotic normality follows. The 2nd-order Taylor approximation λ̂_{f,Exp} − λ ≈ D(Z̄ − µ_Z) + (1/2)(Z̄ − µ_Z)^⊤ H (Z̄ − µ_Z) allows to conclude the asymptotic bias. The explicit expression for the asymptotic variance σ²_{f,Exp} = σ²/n follows by applying D and Theorem 2.1; similarly, using H, the asymptotic bias is obtained. This completes the proof of Theorem 2.2.

A.3 Proof of Corollary 2.3
We start by proving (2.5). As the pdf of X is given by ϕ(x) = λe^{−λx} for x > 0 and zero otherwise, we obtain the stated expression, where the integrand of the last integral is the pdf of a two-parameter gamma distribution, namely Gamma(a + 1, 1/λ) with a > −1 and λ > 0, according to Johnson et al. (1995, p. 343). Note that E[X^a] corresponds to µ_{f_a}(0, 1, 0) in our notation. In view of Theorem 2.2, we also need further moments, which are implied by (2.5). Inserting these expressions into Theorem 2.2 completes the proof of Corollary 2.3.

A.4 Proof of Corollary 2.4
Let us start by deriving (2.6); the required moments then follow immediately. Insertion into Theorem 2.2 yields the stated variance and bias, and the MSE follows. This completes the proof of Corollary 2.4.
A.5 Proof of Theorem 3.3 The asymptotic normality immediately follows from the Lindeberg-Lévy CLT (Serfling, 1980, p. 28), and the covariances follow by direct calculation.

A.6 Proof of Theorem 3.4 First, we evaluate the gradient and Hessian of g in µ_Z, which leads to D and H. Here, for the sake of readability, the upper triangle of the symmetric matrix H was replaced by stars '*'. Applying the Delta method to Theorem 3.3, the asymptotic normality follows. The explicit expression for the asymptotic variance σ²_{f,IG} = σ²/n follows by applying D and Theorem 3.3; after tedious calculations, we obtain the expression for σ²_{f,IG} stated in Theorem 3.4. Similarly, using H, the expression for the asymptotic bias is derived. This completes the proof of Theorem 3.4.
A.7 Proof of Corollary 3.5 (Sketch) For f ≡ 1, the joint moments reduce to simple expressions (with δ_{•,•} denoting the Kronecker delta), so Theorem 3.4 simplifies considerably. Using the moments of the IG-distribution (see Tweedie, 1957a, p. 366) and after tedious calculations, this leads to the stated expressions. So the proof of Corollary 3.5 is complete.

A.10 Proof of Theorem 4.2 (Sketch)
In what follows, we sketch the derivations for the variance and bias, respectively. Recall the abbreviation η_1 := μ_f(1, 1, 0) − µ · μ_f(0, 0, 1). Furthermore, in order to denote the results more compactly, let us abbreviate μ_{klm} := μ_f(k, l, m). First, we evaluate the gradient and Hessian of g_ν in µ_Z, which leads to D and H.

A.11 Proof of Theorem 4.3 (Sketch)

In what follows, we sketch the derivations for the variance and bias, respectively. Recall the abbreviation η_2 := μ_f(1, 0, 1) − µ · μ_f(0, 0, 1), and let us again use the abbreviations μ_{klm} := μ_f(k, l, m). First, we evaluate the gradient and Hessian of g_π in µ_Z, which leads to D and H. Here, for the sake of readability, the upper triangle of the symmetric matrix H was replaced by stars '*'. Then, the remaining steps are as in Appendix A.10. This completes the proof of Theorem 4.3.
Figures 3 and 4: Analyses are done separately for a < −1/2 (plots on the left-hand side, covering the ML estimator) and a > −1/2 (plots on the right-hand side, covering the MM estimator); Figure 3 shows the asymptotic bias and variance.

Table 1: Empirical bias and MSE of λ̂_{f,Exp} from 10^5 simulated i.i.d. samples.

Table 3: Runoff data from Example 3.8: Stein-MM estimates λ̂_{f_a,IG} for different choices of function f_a(x) = x^a.

Table 5: Counts of red mites on apple leaves from Example 4.5: Stein-MM estimates ν̂_{f_α,NB} (upper part) and π̂_{f_α,NB} (lower part) for different choices of function f_α(x) = α^x.
It appears to be promising to analyze whether robustified MM estimators can be achieved by suitable classes of weighting functions. As another direction for future research (briefly sketched in Section 2), the performance of GoF tests based on Stein-MM estimators should be investigated. Finally, one should analyze Stein-MM estimators in a regression or time-series context.