Given a parameter of interest, such as a population mean μ or population proportion p, the objective of point estimation is to use a sample to compute a number that represents in some sense a good guess for the true value of the parameter. The resulting number is called a point estimate. In Section 7.1, we present some general concepts of point estimation. In Section 7.2, we describe and illustrate two important methods for obtaining point estimates: the method of moments and the method of maximum likelihood.
Obtaining a point estimate entails calculating the value of a statistic such as the sample mean \( \overline X\) or sample standard deviation S. We should therefore be concerned that the chosen statistic contains all the relevant information about the parameter of interest. The idea of no information loss is made precise by the concept of sufficiency, which is developed in Section 7.3. Finally, Section 7.4 further explores the meaning of efficient estimation and properties of maximum likelihood.
7.1 General Concepts and Criteria
Statistical inference is frequently directed toward drawing some type of conclusion about one or more parameters (population characteristics). To do so requires that an investigator obtain sample data from each of the populations under study. Conclusions can then be based on the computed values of various sample quantities. For example, let μ (a parameter) denote the average duration of anesthesia for a short-acting anesthetic. A random sample of n = 10 patients might be chosen, and the duration for each one determined, resulting in observed durations x 1, x 2,..., x 10. The sample mean duration \( \bar x \) could then be used to draw a conclusion about the value of μ. Similarly, if σ 2 is the variance of the duration distribution (population variance, another parameter), the value of the sample variance s 2 can be used to infer something about σ 2.
When discussing general concepts and methods of inference, it is convenient to have a generic symbol for the parameter of interest. We will use the Greek letter θ for this purpose. The objective of point estimation is to select a single number, based on sample data, that represents a sensible value for θ. Suppose, for example, that the parameter of interest is μ, the true average lifetime of batteries of a certain type. A random sample of n = 3 batteries might yield observed lifetimes (hours) x 1 = 5.0, x 2 = 6.4, x 3 = 5.9. The computed value of the sample mean lifetime is \( \bar x = 5.77 \), and it is reasonable to regard 5.77 as a very plausible value of μ, our “best guess” for the value of μ based on the available sample information.
Suppose we want to estimate a parameter of a single population (e.g., μ or σ) based on a random sample of size n. Recall from the previous chapter that before data is available, the sample observations must be considered random variables (rv’s) X 1, X 2,..., X n . It follows that any function of the X i ’s—that is, any statistic—such as the sample mean \( \overline{X} \) or sample standard deviation S is also a random variable. The same is true if available data consists of more than one sample. For example, we can represent duration of anesthesia of m patients on anesthetic A and n patients on anesthetic B by X 1,..., X m and Y 1,..., Y n , respectively. The difference between the two sample mean durations is \( \overline{X} - \overline{Y} \), the natural statistic for making inferences about μ 1 – μ 2, the difference between the population mean durations.
DEFINITION
A point estimate of a parameter θ is a single number that can be regarded as a sensible value for θ. A point estimate is obtained by selecting a suitable statistic and computing its value from the given sample data. The selected statistic is called the point estimator of θ.
In the battery example just given, the estimator used to obtain the point estimate of μ was \( \overline{X} \), and the point estimate of μ was 5.77. If the three observed lifetimes had instead been x 1 = 5.6, x 2 = 4.5, and x 3 = 6.1, use of the estimator \( \overline{X} \) would have resulted in the estimate \( \bar{x} = (5.6 + 4.5 + 6.1)/3 = 5.40 \). The symbol \( \hat{\theta } \) (“theta hat”) is customarily used to denote both the estimator of θ and the point estimate resulting from a given sample. Thus \( \hat{\mu } = \overline{X} \) is read as “the point estimator of μ is the sample mean \( \overline{X} \).” The statement “the point estimate of μ is 5.77” can be written concisely as \( \hat{\mu } = 5.77 \). Notice that in writing \( \hat{\theta } = 72.5 \), there is no indication of how this point estimate was obtained (what statistic was used). It is recommended that both the estimator and the resulting estimate be reported.
Example 7.1
An automobile manufacturer has developed a new type of bumper, which is supposed to absorb impacts with less damage than previous bumpers. The manufacturer has used this bumper in a sequence of 25 controlled crashes against a wall, each at 10 mph, using one of its compact car models. Let X = the number of crashes that result in no visible damage to the automobile. The parameter to be estimated is p = the proportion of all such crashes that result in no damage [alternatively, p = P(no damage in a single crash)]. If X is observed to be x = 15, the most reasonable estimator and estimate are
$$ {\hbox{estimator}}\;\hat{p} = \frac{X}{n}\qquad {\hbox{estimate}} = \frac{x}{n} = \frac{15}{25} =.60 $$
If for each parameter of interest there were only one reasonable point estimator, there would not be much to point estimation. In most problems, though, there will be more than one reasonable estimator.
Example 7.2
Reconsider the accompanying 20 observations on dielectric breakdown voltage for pieces of epoxy resin introduced in Example 4.36 (Section 4.6).
24.46 | 25.61 | 26.25 | 26.42 | 26.66 | 27.15 | 27.31 | 27.54 | 27.74 | 27.94 |
27.98 | 28.04 | 28.28 | 28.49 | 28.50 | 28.87 | 29.11 | 29.13 | 29.50 | 30.88 |
The pattern in the normal probability plot given there is quite straight, so we now assume that the distribution of breakdown voltage is normal with mean value μ. Because normal distributions are symmetric, μ is also the median of the breakdown voltage distribution. The given observations are then assumed to be the result of a random sample X 1, X 2,..., X 20 from this normal distribution. Consider the following estimators and resulting estimates for μ:
-
a.
\( {\hbox{Estimator}} = \overline{X} \), \( {\hbox{estimate}} = \bar{x} = {{{\sum {{x_i}} }} \left/ {n} \right.} = 555.86/20 = 27.793 \)
-
b.
\( {\hbox{Estimator}} = \widetilde{X} \), \( {\hbox{estimate}} = \widetilde{x} = (27.94 + 27.98)/2=27.960 \)
-
c.
\( {\hbox{Estimator}} = {\overline{X}_e} = [\min ({X_i}) + \max ({X_i})]/2 = {\hbox{the midrange}} \) (average of the two extreme observations), estimate = [min(x i ) + max(x i )]/2 = (24.46 + 30.88)/2 = 27.670
-
d.
\( {\hbox{Estimator}} = {\overline{X}_{{{\rm{tr}}(10)}}} \), the 10% trimmed mean (discard the smallest and largest 10% of the sample and then average)
Each one of the estimators (a)–(d) uses a different measure of the center of the sample to estimate μ. Which of the estimates is closest to the true value? We cannot answer this without knowing the true value. A question that can be answered is, “Which estimator, when used on other samples of X i ’s, will tend to produce estimates closest to the true value?” We will shortly consider this type of question.
Example 7.3
Studies have shown that a calorie-restricted diet can prolong life. Of course, controlled studies are much easier to do with lab animals. Here is a random sample of eight lifetimes (days) taken from a population of 106 rats that were fed a restricted diet (from “Tests and Confidence Sets for Comparing Two Mean Residual Life Functions,” Biometrics, 1988: 103–115)
716 | 1144 | 1017 | 1138 | 389 | 1221 | 530 | 958 |
Let X 1 ,..., X 8 denote the lifetimes as random variables, before the observed values are available. We want to estimate the population variance σ 2. A natural estimator is the sample variance:
$$ {S^2} = \frac{\sum {{{({X_i} - \overline{X})}^2}} }{n - 1} $$
The corresponding estimate is
$$ {s^2} = \frac{\sum {x_i^2} - {{\left( \sum {{x_i}} \right)}^2}/8}{7} = \frac{6,991,551 - {(7113)^2}/8}{7} = 95,\!315 $$
The estimate of σ would then be \( \hat{\sigma } = s = \sqrt {{95,\!315}} = 309 \)
An alternative estimator would result from using divisor n instead of n – 1 (i.e., the average squared deviation):
$$ {\hat{\sigma }^2} = \frac{\sum {{{({X_i} - \overline{X})}^2}} }{n} $$
We will indicate shortly why many statisticians prefer S 2 to the estimator with divisor n.
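For readers who wish to reproduce these numbers, here is a short Python sketch (the array and variable names are ours, not part of the example):

```python
import numpy as np

lifetimes = np.array([716, 1144, 1017, 1138, 389, 1221, 530, 958])
s2 = lifetimes.var(ddof=1)    # divisor n - 1: about 95,315
s2_n = lifetimes.var(ddof=0)  # divisor n: about 83,401, a smaller value
print(round(s2), round(s2_n), round(np.sqrt(s2)))   # the last value is s = 309
```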
In the best of all possible worlds, we could find an estimator \( \hat{\theta } \) for which \( \hat{\theta } = \theta \) always. However, \( \hat{\theta } \) is a function of the sample X i ’s, so it is a random variable. For some samples, \( \hat{\theta } \) will yield a value larger than θ, whereas for other samples \( \hat{\theta } \) will underestimate θ. If we write
$$ \hat{\theta } = \theta + {\hbox{error of estimation}} $$
then an accurate estimator would be one resulting in small estimation errors, so that estimated values will be near the true value.
7.1.1 Mean Squared Error
A popular way to quantify the idea of \( \hat{\theta } \) being close to θ is to consider the squared error \( {(\hat{\theta } - \theta )^2} \). Another possibility is the absolute error \( |\hat{\theta } - \theta | \), but this is more difficult to work with mathematically. For some samples, \( \hat{\theta } \) will be quite close to θ and the resulting squared error will be very small, whereas the squared error will be quite large whenever a sample produces an estimate \( \hat{\theta } \) that is far from the target. An omnibus measure of accuracy is the mean squared error (expected squared error), which entails averaging the squared error over all possible samples and resulting estimates.
DEFINITION
The mean squared error of an estimator \( \hat{\theta } \) is \( E[{(\hat{\theta } - \theta )^2}]. \)
A useful result when evaluating mean squared error is a consequence of the following rearrangement of the shortcut for evaluating a variance V(Y):
$$ V(Y) = E({Y^2}) - {[E(Y)]^2}\quad \Rightarrow \quad E({Y^2}) = V(Y) + {[E(Y)]^2} $$
That is, the expected value of the square of Y is the variance plus the square of the mean value. Letting \( Y = \hat {\theta } - \theta \), the estimation error, the left-hand side is just the mean squared error. The first term on the right-hand side is \( V(\,\hat {\theta } - \theta ) = V(\,\hat {\theta }) \) since θ is just a constant. The second term involves \( E(\,\hat {\theta } - \theta ) = E(\,\hat{\theta } ) - \theta \), the difference between the expected value of the estimator and the value of the parameter. This difference is called the bias of the estimator. Thus
$$ E[{(\hat{\theta } - \theta )^2}] = V(\hat{\theta }) + {[E(\hat{\theta }) - \theta ]^2} = {\hbox{variance of estimator}} + {({\hbox{bias}})^2} $$
Example 7.4 (Example 7.1 continued)
Consider once again estimating a population proportion of “successes” p. The natural estimator of p is the sample proportion of successes \( \hat{p} = X/n \). The number of successes X in the sample has a binomial distribution with parameters n and p, so E(X) = np and V(X) = np(1 − p). The expected value of the estimator is
$$ E(\hat{p}) = E\left( \frac{X}{n} \right) = \frac{E(X)}{n} = \frac{np}{n} = p $$
Thus the bias of \( \hat{p} \) is p − p = 0, giving the mean squared error as
$$ {\hbox{MSE}} = V(\hat{p}) + {0^2} = \frac{V(X)}{n^2} = \frac{np(1 - p)}{n^2} = \frac{p(1 - p)}{n} $$
Now consider the alternative estimator \( \hat{p} = (X + 2)/(n + 4) \). That is, add two successes and two failures to the sample and then calculate the sample proportion of successes. One intuitive justification for this estimator is that
$$ \frac{X + 2}{n + 4} = \frac{n}{n + 4}\cdot\frac{X}{n} + \frac{4}{n + 4}\cdot\frac{1}{2} $$
from which we see that the alternative estimator is always somewhat closer to .5 than is the usual estimator. It seems particularly reasonable to move the estimate toward .5 when the number of successes in the sample is close to 0 or n. For example, if there are no successes at all in the sample, is it sensible to estimate the population proportion of successes as zero, especially if n is small?
The bias of the alternative estimator is
$$ E\left( \frac{X + 2}{n + 4} \right) - p = \frac{np + 2}{n + 4} - p = \frac{2 - 4p}{n + 4} = \frac{2/n - 4p/n}{1 + 4/n} $$
This bias is not zero unless p = .5. However, as n increases the numerator approaches zero and the denominator approaches 1, so the bias approaches zero. The variance of the estimator is
$$ V\left( \frac{X + 2}{n + 4} \right) = \frac{V(X)}{{(n + 4)}^2} = \frac{np(1 - p)}{{(n + 4)}^2} = \frac{p(1 - p)/n}{{(1 + 4/n)}^2} $$
This variance approaches zero as the sample size increases. The mean squared error of the alternative estimator is
$$ {\hbox{MSE}} = \frac{np(1 - p)}{{(n + 4)}^2} + {\left( \frac{2 - 4p}{n + 4} \right)^2} $$
So how does the mean squared error of the usual estimator, the sample proportion, compare to that of the alternative estimator? If one MSE were smaller than the other for all values of p, then we could say that one estimator is always preferred to the other (using MSE as our criterion). But as Figure 7.1 shows, neither MSE is uniformly smaller for the sample sizes n = 10 and n = 100, and the same is true for every other sample size.
According to Figure 7.1, the two MSE’s are quite different when n is small. In this case the alternative estimator is better for values of p near .5 (since it moves the sample proportion toward .5) but not for extreme values of p. For large n the two MSE’s are quite similar, but again neither dominates the other.
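The comparison behind Figure 7.1 is easy to reproduce numerically. The following Python sketch (function and variable names are ours, not part of the text) evaluates the two MSE expressions over a grid of p values and reports where the alternative estimator wins:

```python
import numpy as np

def mse_usual(p, n):
    # MSE of the sample proportion X/n: p(1 - p)/n
    return p * (1 - p) / n

def mse_alternative(p, n):
    # MSE of (X + 2)/(n + 4): variance plus squared bias
    var = n * p * (1 - p) / (n + 4) ** 2
    bias = (2 - 4 * p) / (n + 4)
    return var + bias ** 2

p = np.linspace(0, 1, 201)
for n in (10, 100):
    better = mse_alternative(p, n) < mse_usual(p, n)
    # The alternative estimator wins for p near .5 but not near 0 or 1
    print(n, round(p[better].min(), 2), round(p[better].max(), 2))
```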
Seeking an estimator whose mean squared error is smaller than that of every other estimator for all values of the parameter is generally too ambitious a goal. One common approach is to restrict the class of estimators under consideration in some way, and then seek the estimator that is best in that restricted class. A very popular restriction is to impose the condition of unbiasedness.
7.1.2 Unbiased Estimators
Suppose we have two measuring instruments; one instrument has been accurately calibrated, but the other systematically gives readings smaller than the true value being measured. When each instrument is used repeatedly on the same object, because of measurement error, the observed measurements will not be identical. However, the measurements produced by the first instrument will be distributed about the true value in such a way that on average this instrument measures what it purports to measure, so it is called an unbiased instrument. The second instrument yields observations that have a systematic error component or bias.
DEFINITION
A point estimator \( \hat{\theta } \) is said to be an unbiased estimator of θ if E(\( \hat{\theta } \)) = θ for every possible value of θ. If \( \hat{\theta } \) is not unbiased, the difference \( E(\hat{\theta }) - { }\theta \) is called the bias of \( \hat{\theta } \).
That is, \( \hat{\theta } \) is unbiased if its probability (i.e., sampling) distribution is always “centered” at the true value of the parameter. Suppose \( \hat{\theta } \) is an unbiased estimator; then if θ = 100, the \( \hat{\theta } \) sampling distribution is centered at 100; if θ = 27.5, then the \( \hat{\theta } \) sampling distribution is centered at 27.5, and so on. Figure 7.2 pictures the distributions of several biased and unbiased estimators. Note that “centered” here means that the expected value, not the median, of the distribution of \( \hat{\theta } \) is equal to θ.
It may seem as though it is necessary to know the value of θ (in which case estimation is unnecessary) to see whether \( \hat{\theta } \) is unbiased. This is usually not the case, however, because unbiasedness is a general property of the estimator’s sampling distribution—where it is centered—which is typically not dependent on any particular parameter value. For example, in Example 7.4 we showed that \( E( \hat{p }) = p \) when \( \hat{p } \) is the sample proportion of successes. Thus if p = .25, the sampling distribution of \( \hat{p } \) is centered at .25 (centered in the sense of mean value), when p = .9 the sampling distribution is centered at .9, and so on. It is not necessary to know the value of p to know that \( \hat{p } \) is unbiased.
PROPOSITION
When X is a binomial rv with parameters n and p, the sample proportion \( \hat{p} = X/n \) is an unbiased estimator of p.
Example 7.5
Suppose that X, the reaction time to a stimulus, has a uniform distribution on the interval from 0 to an unknown upper limit θ (so the density function of X is rectangular in shape with height 1/θ for 0 ≤ x ≤ θ). An investigator wants to estimate θ on the basis of a random sample X 1 , X 2 ,..., X n of reaction times. Since θ is the largest possible time in the entire population of reaction times, consider as a first estimator the largest sample reaction time: \( {\hat{\theta }_b} = \max ({X_1},\; \ldots, \;{X_n}) \). If n = 5 and x 1 = 4.2, x 2 = 1.7, x 3 = 2.4, x 4 = 3.9, x 5 = 1.3, the point estimate of θ is \( {\hat{\theta }_b} = \max (4.2,\;1.7,\;2.4,\;3.9,\;1.3) = 4.2.\; \)
Unbiasedness implies that some samples will yield estimates that exceed θ and other samples will yield estimates smaller than θ — otherwise θ could not possibly be the center (balance point) of \( {\hat{\theta }_b} \)’s distribution. However, our proposed estimator will never overestimate θ (the largest sample value cannot exceed the largest population value) and will underestimate θ unless the largest sample value equals θ. This intuitive argument shows that \( {\hat{\theta }_b} \) is a biased estimator. More precisely, using our earlier results on order statistics, it can be shown (see Exercise 50) that
$$ E({\hat{\theta }_b}) = E[\max ({X_1},\; \ldots, \;{X_n})] = \frac{n}{n + 1}\theta $$
The bias of \( {\hat{\theta }_b} \) is given by nθ/(n + 1) – θ = −θ/(n + 1), which approaches 0 as n gets large.
It is easy to modify \( {\hat{\theta }_b} \) to obtain an unbiased estimator of θ. Consider the estimator
$$ {\hat{\theta }_u} = \frac{n + 1}{n}\max ({X_1},\; \ldots, \;{X_n}) $$
Using this estimator on the data gives the estimate (6/5)(4.2) = 5.04. The fact that (n + 1)/n > 1 implies that \( {\hat{\theta }_u} \) will overestimate θ for some samples and underestimate it for others. The mean value of this estimator is
$$ E({\hat{\theta }_u}) = E\left[ \frac{n + 1}{n}\max ({X_1},\; \ldots, \;{X_n}) \right] = \frac{n + 1}{n}E[\max ({X_1},\; \ldots, \;{X_n})] = \frac{n + 1}{n}\cdot\frac{n}{n + 1}\theta = \theta $$
If \( {\hat{\theta }_u} \) is used repeatedly on different samples to estimate θ, some estimates will be too large and others will be too small, but in the long run there will be no systematic tendency to underestimate or overestimate θ.
Statistical practitioners who buy into the Principle of Unbiased Estimation would employ an unbiased estimator in preference to a biased estimator. On this basis, the sample proportion of successes should be preferred to the alternative estimator of p, and the unbiased estimator \( {\hat{\theta }_u} \) should be preferred to the biased estimator \( \hat{\theta }_b\) in the uniform distribution scenario of the previous example.
Example 7.6
Let’s turn now to the problem of estimating σ 2 based on a random sample X 1, …, X n . First consider the estimator \( {S^2} = \sum {{{({X_i} - \overline{X})}^2}} /(n - 1) \), the sample variance as we have defined it. Applying the result E(Y 2) = V(Y) + [E(Y)]2 to
$$ {S^2} = \frac{1}{n - 1}\left[ \sum {X_i^2} - \frac{{{\left( \sum {{X_i}} \right)}^2}}{n} \right] $$
from Section 1.4 gives
$$ E({S^2}) = \frac{1}{n - 1}\left\{ \sum {E(X_i^2)} - \frac{1}{n}E\left[ {{\left( \sum {{X_i}} \right)}^2} \right] \right\} = \frac{1}{n - 1}\left\{ \sum {({\sigma^2} + {\mu^2})} - \frac{1}{n}\left[ n{\sigma^2} + {(n\mu )^2} \right] \right\} = \frac{1}{n - 1}(n{\sigma^2} - {\sigma^2}) = {\sigma^2} $$
Thus we have shown that the sample variance S 2 is an unbiased estimator of σ 2.
The estimator that uses divisor n can be expressed as (n – 1)S 2/n, so
$$ E\left[ \frac{(n - 1){S^2}}{n} \right] = \frac{n - 1}{n}E({S^2}) = \frac{n - 1}{n}{\sigma^2} $$
This estimator is therefore biased. The bias is (n – 1)σ 2/n – σ 2 = −σ 2/n. Because the bias is negative, the estimator with divisor n tends to underestimate σ 2, and this is why the divisor n – 1 is preferred by many statisticians (although when n is large, the bias is small and there is little difference between the two).
This is not quite the whole story, however. Suppose the random sample has come from a normal distribution. Then from Section 6.4, we know that the rv (n – 1)S 2/σ 2 has a chi-squared distribution with n – 1 degrees of freedom. The mean and variance of a chi-squared variable are df and 2 df, respectively. Let’s now consider estimators of the form
$$ {\hat{\sigma }^2} = c\sum {{{({X_i} - \overline{X})}^2}} = c(n - 1){S^2} $$
The expected value of the estimator is
$$ E[c(n - 1){S^2}] = c(n - 1)E({S^2}) = c(n - 1){\sigma^2} $$
so the bias is \( c(n - 1){\sigma^2} - {\sigma^2} \). The only unbiased estimator of this type is the sample variance, with c = 1/(n – 1).
Similarly, the variance of the estimator is
$$ V[c(n - 1){S^2}] = {c^2}{\sigma^4}V\left[ \frac{(n - 1){S^2}}{{\sigma^2}} \right] = {c^2}{\sigma^4} \cdot 2(n - 1) = 2{c^2}(n - 1){\sigma^4} $$
Substituting these expressions into the relationship MSE = variance + (bias)2, the value of c for which MSE is minimized can be found by taking the derivative with respect to c, equating the resulting expression to zero, and solving for c. The result is c = 1/(n + 1). So in this situation, the principle of unbiasedness and the principle of minimum MSE are at loggerheads.
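Carrying out the calculation just described, the mean squared error as a function of c is
$$ {\hbox{MSE}}(c) = 2{c^2}(n - 1){\sigma^4} + {\left[ c(n - 1){\sigma^2} - {\sigma^2} \right]^2} = {\sigma^4}\left\{ 2{c^2}(n - 1) + {[c(n - 1) - 1]^2} \right\} $$
and setting its derivative equal to zero,
$$ {\sigma^4}\left\{ 4c(n - 1) + 2(n - 1)[c(n - 1) - 1] \right\} = 0\quad \Rightarrow \quad 2c + c(n - 1) - 1 = 0\quad \Rightarrow \quad c = \frac{1}{n + 1} $$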
As a final blow, even though S 2 is unbiased for estimating σ 2, it is not true that the sample standard deviation S is unbiased for estimating σ. This is because the square root function is not linear, so the expected value of the square root is not the square root of the expected value. Well, if S is biased, why not find an unbiased estimator for σ and use it rather than S? Unfortunately there is no estimator of σ that is unbiased irrespective of the nature of the population distribution (although in special cases, e.g., a normal distribution, an unbiased estimator does exist). Fortunately the bias of S is not serious unless n is quite small. So we shall generally employ it as an estimator.
In Example 7.2, we proposed several different estimators for the mean μ of a normal distribution. If there were a unique unbiased estimator for μ, the estimation dilemma could be resolved by using that estimator. Unfortunately, this is not the case.
PROPOSITION
If X 1, X 2,..., X n is a random sample from a distribution with mean μ, then \( \overline{X} \) is an unbiased estimator of μ. If in addition the distribution is continuous and symmetric, then \( \widetilde{X} \) and any trimmed mean are also unbiased estimators of μ.
The fact that \( \overline{X} \) is unbiased is just a restatement of one of our rules of expected value: \( E(\overline{X}) = \mu \) for every possible value of μ (for discrete as well as continuous distributions). The unbiasedness of the other estimators is more difficult to verify; the argument requires invoking results on distributions of order statistics from Section 5.5.
According to this proposition, the principle of unbiasedness by itself does not always allow us to select a single estimator. When the underlying population is normal, even the third estimator in Example 7.2 is unbiased, and there are many other unbiased estimators. What we now need is a way of selecting among unbiased estimators.
7.1.3 Estimators with Minimum Variance
Suppose \( {\hat{\theta }_1} \) and \( {\hat{\theta }_2} \) are two estimators of θ that are both unbiased. Then, although the distribution of each estimator is centered at the true value of θ, the spreads of the distributions about the true value may be different.
PRINCIPLE OF MINIMUM VARIANCE UNBIASED ESTIMATION
Among all estimators of θ that are unbiased, choose the one that has minimum variance. The resulting \( \hat{\theta } \) is called the minimum variance unbiased estimator (MVUE) of θ. Since MSE = variance + (bias)2, seeking an unbiased estimator with minimum variance is the same as seeking an unbiased estimator that has minimum mean squared error.
Figure 7.3 pictures the pdf’s of two unbiased estimators, with the first \( \hat{\theta } \) having smaller variance than the second estimator. Then the first \( \hat{\theta } \) is more likely than the second one to produce an estimate close to the true θ. The MVUE is, in a certain sense, the most likely among all unbiased estimators to produce an estimate close to the true θ.
Example 7.7
We argued in Example 7.5 that when X 1,..., X n is a random sample from a uniform distribution on [0, θ], the estimator
$$ {\hat{\theta }_1} = \frac{n + 1}{n}\max ({X_1},\; \ldots, \;{X_n}) $$
is unbiased for θ (we previously denoted this estimator by \( {\hat{\theta }_u} \)). This is not the only unbiased estimator of θ. The expected value of a uniformly distributed rv is just the midpoint of the interval of positive density, so E(X i ) = θ/2. This implies that \( E(\overline{X}) = \theta /2 \), from which \( E({2}\overline{X}) = \theta \). That is, the estimator \( {\hat{\theta }_2} = 2\overline{X} \) is unbiased for θ.
If X is uniformly distributed on the interval [A, B], then V(X) = σ 2 = (B – A)2/12 (Exercise 23 in Chapter 4). Thus, in our situation, V(X i ) = θ 2/12, \( V(\overline{X}) = {\sigma^2}/n = {\theta^2}/(12n) \), and \( V({\hat{\theta }_2}) = V(2\overline{X}) = 4V(\overline{X}) = {\theta^2}/(3n) \). The results of Exercise 50 can be used to show that \( V({\hat{\theta }_1}) = {\theta^{{2}}}/[n(n + {2})] \). The estimator \( {\hat{\theta }_1} \) has smaller variance than does \( {\hat{\theta }_2} \) if 3n < n(n + 2)—that is, if 0 < n 2 – n = n(n – 1). As long as n > 1, V(\( {\hat{\theta }_1} \)) < V(\( {\hat{\theta }_2} \)), so \( {\hat{\theta }_1} \) is a better estimator than \( {\hat{\theta }_2} \). More advanced methods can be used to show that \( {\hat{\theta }_1} \) is the MVUE of θ—every other unbiased estimator of θ has variance that exceeds θ 2/[n(n + 2)].
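A quick simulation makes the variance comparison concrete. The sketch below (in Python, with names of our own choosing) uses n = 5 and θ = 1, for which the theoretical variances are θ 2/[n(n + 2)] ≈ .029 and θ 2/(3n) ≈ .067:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, reps = 5, 1.0, 100_000
x = rng.uniform(0, theta, size=(reps, n))

est1 = (n + 1) / n * x.max(axis=1)   # (n + 1)/n * max(X_i)
est2 = 2 * x.mean(axis=1)            # 2 * Xbar

# Both estimators average close to theta = 1 (unbiasedness), but est1
# has the much smaller variance, as the formulas above predict.
print(round(est1.mean(), 3), round(est2.mean(), 3))
print(round(est1.var(), 4), round(est2.var(), 4))
```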
One of the triumphs of mathematical statistics has been the development of methodology for identifying the MVUE in a wide variety of situations. The most important result of this type for our purposes concerns estimating the mean μ of a normal distribution. For a proof in the special case that σ is known, see Exercise 45.
THEOREM
Let X 1,..., X n be a random sample from a normal distribution with parameters μ and σ. Then the estimator \( \hat{\mu } = \overline{X} \) is the MVUE for μ.
Whenever we are convinced that the population being sampled is normal, the result says that \( \overline{X} \) should be used to estimate μ. In Example 7.2, then, our estimate would be \( \bar{x} = 27.793 \).
Once again, in some situations such as the one in Example 7.6, it is possible to obtain an estimator with small bias that would be preferred to the best unbiased estimator. This is illustrated in Figure 7.4. However, MVUEs are often easier to obtain than the type of biased estimator whose distribution is pictured.
7.1.4 More Complications
The last theorem does not say that in estimating a population mean μ, the estimator \( \overline{X} \) should be used irrespective of the distribution being sampled.
Example 7.8
Suppose we wish to estimate the number of calories θ in a certain food. Using standard measurement techniques, we will obtain a random sample X 1 ,..., X n of n calorie measurements. Let’s assume that the population distribution is a member of one of the following three families:
The pdf (7.1) is the normal distribution, (7.2) is called the Cauchy distribution, and (7.3) is a uniform distribution. All three distributions are symmetric about θ, which is therefore the median of each distribution. The value θ is also the mean for the normal and uniform distributions, but the mean of the Cauchy distribution fails to exist. This happens because, even though the Cauchy distribution is bell-shaped like the normal distribution, it has much heavier tails (more probability far out) than the normal curve. The uniform distribution has no tails. The four estimators for μ considered earlier are \( \overline{X} \), \( \widetilde{X} \), \( {\overline{X}_e} \) (the average of the two extreme observations), and \( {\overline{X}_{{{\rm{tr}}(10)}}} \), a trimmed mean.
The very important moral here is that the best estimator for μ depends crucially on which distribution is being sampled. In particular,
-
1.
If the random sample comes from a normal distribution, then \( \overline{X} \) is the best of the four estimators, since it has minimum variance among all unbiased estimators.
-
2.
If the random sample comes from a Cauchy distribution, then \( \overline{X} \) and \( {\overline{X}_e} \) are terrible estimators for μ, whereas \( \widetilde{X} \) is quite good (the MVUE is not known); \( \overline{X} \) is bad because it is very sensitive to outlying observations, and the heavy tails of the Cauchy distribution make a few such observations likely to appear in any sample.
-
3.
If the underlying distribution is the particular uniform distribution in (7.3), then the best estimator is \( {\overline{X}_e} \); in general, this estimator is greatly influenced by outlying observations, but here the lack of tails makes such observations impossible.
-
4.
The trimmed mean is best in none of these three situations but works reasonably well in all three. That is, \( {\overline{X}_{{tr(10)}}} \) does not suffer too much in comparison with the best procedure in any of the three situations.
More generally, recent research in statistics has established that when estimating a point of symmetry μ of a continuous probability distribution, a trimmed mean with trimming proportion 10% or 20% (from each end of the sample) produces reasonably behaved estimates over a very wide range of possible models. For this reason, a trimmed mean with small trimming percentage is said to be a robust estimator.
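A small simulation can make points 1–4 concrete. Since the exact forms of (7.1)–(7.3) are not reproduced above, the sketch below simply assumes a standard normal, a standard Cauchy, and a uniform distribution of width 2, each centered at θ = 0; with that caveat, it compares the spread of the four estimators across repeated samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 25, 20_000

samplers = {
    "normal": lambda: rng.standard_normal((reps, n)),
    "cauchy": lambda: rng.standard_cauchy((reps, n)),
    "uniform": lambda: rng.uniform(-1, 1, (reps, n)),
}

for name, draw in samplers.items():
    x = np.sort(draw(), axis=1)
    estimates = {
        "mean": x.mean(axis=1),
        "median": np.median(x, axis=1),
        "midrange": (x[:, 0] + x[:, -1]) / 2,
        "trimmed": stats.trim_mean(x, 0.10, axis=1),
    }
    # Smaller spread about theta = 0 indicates a better estimator here.
    print(name, {k: round(float(v.std()), 3) for k, v in estimates.items()})
```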
Until now, we have focused on comparing several estimators based on the same data, such as \( \overline{X} \) and \( \widetilde{X} \) for estimating μ when a sample of size n is selected from a normal population distribution. Sometimes an investigator is faced with a choice between alternative ways of gathering data; the form of an appropriate estimator then may well depend on how the experiment was carried out.
Example 7.9
Suppose a type of component has a lifetime distribution that is exponential with parameter λ so that expected lifetime is μ = 1/λ. A sample of n such components is selected, and each is put into operation. If the experiment is continued until all n lifetimes, X 1,..., X n , have been observed, then \( \overline{X} \) is an unbiased estimator of μ.
In some experiments, though, the components are left in operation only until the time of the rth failure, where r < n. This procedure is referred to as censoring. Let Y 1 denote the time of the first failure (the minimum lifetime among the n components), Y 2 denote the time at which the second failure occurs (the second smallest lifetime), and so on. Since the experiment terminates at time Y r , the total accumulated lifetime at termination is
$$ {T_r} = {Y_1} + {Y_2} + \cdots + {Y_r} + (n - r){Y_r} $$
We now demonstrate that \( \hat{\mu } = {T_r}/r \) is an unbiased estimator for μ. To do so, we need two properties of exponential variables:
-
1.
The memoryless property (see Section 4.4) says that at any time point, remaining lifetime has the same exponential distribution as original lifetime.
-
2.
If X 1,..., X k are independent, each exponentially distributed with parameter λ, then min (X 1,..., X k ) is exponential with parameter kλ and has expected value 1/(kλ). See Example 5.28.
Since all n components last until Y 1, n – 1 last an additional Y 2 – Y 1, n – 2 an additional Y 3 – Y 2 amount of time, and so on, another expression for T r is
$$ {T_r} = n{Y_1} + (n - 1)({Y_2} - {Y_1}) + (n - 2)({Y_3} - {Y_2}) + \cdots + (n - r + 1)({Y_r} - {Y_{r - 1}}) $$
But Y 1 is the minimum of n exponential variables, so E(Y 1) = 1/(nλ). Similarly, Y 2 – Y 1 is the smallest of the n – 1 remaining lifetimes, each exponential with parameter λ (by the memoryless property), so E(Y 2 – Y 1) = 1/[(n – 1)λ]. Continuing, E(Y i+1 – Y i ) = 1/[(n – i)λ], so
$$ E({T_r}) = n\cdot\frac{1}{n\lambda } + (n - 1)\cdot\frac{1}{(n - 1)\lambda } + \cdots + (n - r + 1)\cdot\frac{1}{(n - r + 1)\lambda } = \frac{r}{\lambda } $$
Therefore, E(T r /r) = (1/r)E(T r ) = (1/r) · (r/λ) = 1/λ = μ as claimed.
As an example, suppose 20 components are put on test and r = 10. Then if the first ten failure times are 11, 15, 29, 33, 35, 40, 47, 55, 58, and 72, the estimate of μ is
$$ \hat{\mu } = \frac{{t_{10}}}{10} = \frac{11 + 15 + \cdots + 72 + (20 - 10)(72)}{10} = \frac{1115}{10} = 111.5 $$
The advantage of the experiment with censoring is that it terminates more quickly than the uncensored experiment. However, it can be shown that V(T r /r) = 1/(λ 2 r), which is larger than 1/(λ 2 n), the variance of \( \overline{X} \) in the uncensored experiment.
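The two claims in this example are easy to check by simulation. In the Python sketch below (names are illustrative), λ = .01 so μ = 100; with n = 20 and r = 10, the theory gives V(T r /r) = 1/(λ 2 r) = 1000 versus V(\( \overline{X} \)) = 1/(λ 2 n) = 500:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, r, reps = 0.01, 20, 10, 50_000
x = rng.exponential(scale=1 / lam, size=(reps, n))

y = np.sort(x, axis=1)                              # ordered failure times
t_r = y[:, :r].sum(axis=1) + (n - r) * y[:, r - 1]  # accumulated lifetime T_r
est_censored = t_r / r
est_uncensored = x.mean(axis=1)                     # Xbar from the full sample

# Both estimators average near mu = 100, but the censored one is more variable.
print(round(est_censored.mean(), 1), round(est_censored.var(), 0))
print(round(est_uncensored.mean(), 1), round(est_uncensored.var(), 0))
```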
7.1.5 Reporting a Point Estimate: The Standard Error
Besides reporting the value of a point estimate, some indication of its precision should be given. The usual measure of precision is the standard error of the estimator used.
DEFINITION
The standard error of an estimator \( \hat{\theta } \) is its standard deviation \( {\sigma_{{\hat{\theta }}}} = \textstyle\sqrt {{V(\hat{\theta })}} \). If the standard error itself involves unknown parameters whose values can be estimated, substitution of these estimates into \( {\sigma_{{\hat{\theta }}}} \) yields the estimated standard error (estimated standard deviation) of the estimator. The estimated standard error can be denoted either by \( {\hat{\sigma }_{{\hat{\theta }}}} \) (the ^ over σ emphasizes that \( {\sigma_{{\hat{\theta }}}} \) is being estimated) or by \( {s_{{\hat{\theta }}}} \).
Example 7.10 (Example 7.2 continued)
Assuming that breakdown voltage is normally distributed, \( \hat{\mu } = \overline{X} \) is the best estimator of μ. If the value of σ is known to be 1.5, the standard error of \( \overline{X} \) is \( {\sigma_{{\overline{X}}}} = \sigma /\sqrt {n} = 1.5/\sqrt {{20}} =.335 \). If, as is usually the case, the value of σ is unknown, the estimate \( \hat{\sigma } = s = 1.462 \) is substituted into \( {\sigma_{{\overline{X}}}} \) to obtain the estimated standard error \( {\hat{\sigma }_{{\overline{X}}}} = {s_{{\overline{X}}}} = s/\sqrt {n} = 1.462/\sqrt {{20}} =.327 \)
Example 7.11 (Example 7.1 continued)
The standard error of \( \hat{p} = X/n \) is
$$ {\sigma_{{\hat{p}}}} = \sqrt {V(X/n)} = \sqrt {\frac{V(X)}{n^2}} = \sqrt {\frac{np(1 - p)}{n^2}} = \sqrt {\frac{p(1 - p)}{n}} = \sqrt {\frac{pq}{n}} $$
Since p and q = 1 – p are unknown (else why estimate?), we substitute \( \hat{p} = x/n \) and \( \hat{q} = 1 - x/n \) into \( {\sigma_{{\hat{p}}}} \), yielding the estimated standard error \( {\hat{\sigma }_{{\hat{p}}}} = \sqrt {{\hat{p}\hat{q}/n}} = \sqrt {{(.6)(.4)/25}} =.098 \). Alternatively, since the largest value of pq is attained when p = q = .5, an upper bound on the standard error is \( \sqrt {{1/(4n)}} =.10 \).
When the point estimator \( \hat{\theta } \) has approximately a normal distribution, which will often be the case when n is large, then we can be reasonably confident that the true value of θ lies within approximately 2 standard errors (standard deviations) of \( \hat{\theta } \). Thus if measurement of prothrombin (a blood-clotting protein) in 36 individuals gives \( \hat{\mu } = \bar{x} = {2}0.{5} \) and s = 3.6 mg/100 ml, then \( s/\sqrt {n} =.60 \), so “within 2 estimated standard errors of \( \hat{\mu } \)” translates to the interval 20.50 ± (2)(.60) = (19.30, 21.70).
If \( \hat{\theta } \) is not necessarily approximately normal but is unbiased, then it can be shown (using Chebyshev’s inequality, introduced in Exercises 43, 77, and 135 of Chapter 3) that the estimate will deviate from θ by as much as 4 standard errors at most 6% of the time. We would then expect the true value to lie within 4 standard errors of \( \hat{\theta } \) (and this is a very conservative statement, since it applies to any unbiased \( \hat{\theta } \)). Summarizing, the standard error tells us roughly within what distance of \( \hat{\theta } \) we can expect the true value of θ to lie.
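The figure quoted here is just Chebyshev’s inequality applied with k = 4 standard errors:
$$ P\left( |\hat{\theta } - \theta | \geq 4{\sigma_{{\hat{\theta }}}} \right) \leq \frac{1}{4^2} =.0625 \approx 6\% $$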
7.1.6 The Bootstrap
The form of the estimator \( \hat{\theta } \) may be sufficiently complicated so that standard statistical theory cannot be applied to obtain an expression for \( {\sigma_{{\hat{\theta }}}} \). This is true, for example, in the case θ = σ, \( \hat{\theta } = S \); the standard deviation of the statistic S, σ S , cannot in general be determined. In recent years, a new computer-intensive method called the bootstrap has been introduced to address this problem. Suppose that the population pdf is f(x; θ), a member of a particular parametric family, and that data x 1, x 2,..., x n gives \( \hat{\theta } = 21.7 \). We now use the computer to obtain “bootstrap samples” from the pdf f(x; 21.7), and for each sample we calculate a “bootstrap estimate” \( {\hat{\theta }^{*}} \):
-
First bootstrap sample: \( x_1^{*},x_2^{*}, \ldots, x_n^{*}; \) \( {\hbox{estimate}} = \hat{\theta }_1^{*} \)
-
Second bootstrap sample: \( x_1^{*},x_2^{*}, \ldots, x_n^{*}; \) \( {\hbox{estimate}} = \hat{\theta }_2^{*} \)
-
Bth bootstrap sample: \( x_1^{*},x_2^{*}, \ldots, x_n^{*}; \) \( {\hbox{estimate}} = \hat{\theta }_B^{*} \)
B = 100 or 200 is often used. Now let \( \bar {{\theta}}^{*} = \sum {\hat{\theta }_i^{*}/B} \), the sample mean of the bootstrap estimates. The bootstrap estimate of \( \hat{\theta } \)’s standard error is now just the sample standard deviation of the \( \hat{\theta }_i^{*} \)’s:
$$ {S_{{\hat{\theta }}}} = \sqrt {\frac{1}{B - 1}\sum {{{\left( \hat{\theta }_i^{*} - {\bar{\theta }^{*}} \right)}^2}} } $$
(In the bootstrap literature, B is often used in place of B – 1; for typical values of B, there is usually little difference between the resulting estimates.)
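The procedure is straightforward to program. Here is a minimal parametric-bootstrap sketch in Python; the function name and arguments are our own invention, and the sampler/estimator pair must be supplied to match whatever model f(x; θ) is assumed:

```python
import numpy as np

def bootstrap_se(sampler, estimator, theta_hat, n, B=200, seed=0):
    """Parametric bootstrap estimate of an estimator's standard error.

    sampler(theta, n, rng) must draw one sample of size n from f(x; theta).
    """
    rng = np.random.default_rng(seed)
    boot = np.array([estimator(sampler(theta_hat, n, rng)) for _ in range(B)])
    return boot.std(ddof=1)   # sample standard deviation, divisor B - 1

# For instance, with an exponential model and the estimator of Example 7.12:
# bootstrap_se(lambda lam, n, rng: rng.exponential(1 / lam, n),
#              lambda x: 1 / x.mean(), theta_hat=0.018153, n=10)
```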
Example 7.12
A theoretical model suggests that X, the time to breakdown of an insulating fluid between electrodes at a particular voltage, has f(x; λ) = λe –λx, an exponential distribution. A random sample of n = 10 breakdown times (min) gives the following data:
41.53 | 18.73 | 2.99 | 30.34 | 12.33 | 117.52 | 73.02 | 223.63 | 4.00 | 26.78 |
Since E(X) = 1/λ, \( E(\overline{X}) = {1}/\lambda \), so a reasonable estimate of λ is \( \hat{\lambda } = 1/\bar{x} = 1/55.087 =.018153 \). We then used a statistical computer package to obtain B = 100 bootstrap samples, each of size 10, from f(x;.018153). The first such sample was 41.00, 109.70, 16.78, 6.31, 6.76, 5.62, 60.96, 78.81, 192.25, 27.61, from which \( \sum {x_i^{*}} = 545.8 \) and \( \hat{\lambda }_1^{*} = 1/54.58 =.01832 \). The average of the 100 bootstrap estimates is \( {\bar{\lambda }^{*}} =.02153 \), and the sample standard deviation of these 100 estimates is \( {s_{{\hat{\lambda }}}} =.0091 \), the bootstrap estimate of \( \hat{\lambda } \)’s standard error. A histogram of the 100 \( \hat{\lambda }_i^{*} \)’s was somewhat positively skewed, suggesting that the sampling distribution of \( \hat{\lambda } \) also has this property.
Sometimes an investigator wishes to estimate a population characteristic without assuming that the population distribution belongs to a particular parametric family. An instance of this occurred in Example 7.8, where a 10% trimmed mean was proposed for estimating a symmetric population distribution’s center θ. The data of Example 7.2 gave \( \hat{\theta } = {\overline{X}_{{{\rm{tr}}(10)}}} = 27.838 \), but now there is no assumed f(x; θ), so how can we obtain a bootstrap sample? The answer is to regard the sample itself as constituting the population (the n = 20 observations in Example 7.2) and take B different samples, each of size n, with replacement from this population. We expand on this idea in Section 8.5.
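A sketch of this nonparametric (resampling) version, again with names that are purely illustrative; here `voltages` would hold the 20 observations of Example 7.2:

```python
import numpy as np
from scipy import stats

def nonparametric_bootstrap_se(data, estimator, B=200, seed=0):
    # Treat the sample itself as the population: resample it with
    # replacement B times and recompute the estimator each time.
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    boot = np.array([estimator(rng.choice(data, size=data.size, replace=True))
                     for _ in range(B)])
    return boot.std(ddof=1)

# e.g., nonparametric_bootstrap_se(voltages, lambda x: stats.trim_mean(x, 0.10))
```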
7.1.7 Exercises: Section 7.1 (1–20)
-
1.
The accompanying data on IQ for first-graders at a university lab school was introduced in Example 1.2.
82
96
99
102
103
103
106
107
108
108
108
108
109
110
110
111
113
113
113
113
115
115
118
118
119
121
122
122
127
132
136
140
146
-
a.
Calculate a point estimate of the mean value of IQ for the conceptual population of all first graders in this school, and state which estimator you used. [Hint: Σ x i = 3753]
-
b.
Calculate a point estimate of the IQ value that separates the lowest 50% of all such students from the highest 50%, and state which estimator you used.
-
c.
Calculate and interpret a point estimate of the population standard deviation σ. Which estimator did you use? [Hint: \( \Sigma x_i^2 = 432,015 \)]
-
d.
Calculate a point estimate of the proportion of all such students whose IQ exceeds 100. [Hint: Think of an observation as a “success” if it exceeds 100.]
-
e.
Calculate a point estimate of the population coefficient of variation σ/μ and state which estimator you used.
-
2.
A sample of 20 students who had recently taken elementary statistics yielded the following information on brand of calculator owned (T = Texas Instruments, H = Hewlett-Packard, C = Casio, S = Sharp):
T
T
H
T
C
T
T
S
C
H
S
S
T
H
C
T
T
T
H
T
-
a.
Estimate the true proportion of all such students who own a Texas Instruments calculator.
-
b.
Of the ten students who owned a TI calculator, 4 had graphing calculators. Estimate the proportion of students who do not own a TI graphing calculator.
-
3.
Consider the following sample of observations on coating thickness for low-viscosity paint (“Achieving a Target Value for a Manufacturing Process: A Case Study,” J. Qual. Technol. 1992: 22–26):
.83
.88
.88
1.04
1.09
1.12
1.29
1.31
1.48
1.49
1.59
1.62
1.65
1.71
1.76
1.83
Assume that the distribution of coating thickness is normal (a normal probability plot strongly supports this assumption).
-
a.
Calculate a point estimate of the mean value of coating thickness, and state which estimator you used.
-
b.
Calculate a point estimate of the median of the coating thickness distribution, and state which estimator you used.
-
c.
Calculate a point estimate of the value that separates the largest 10% of all values in the thickness distribution from the remaining 90%, and state which estimator you used. [Hint: Express what you are trying to estimate in terms of μ and σ]
-
d.
Estimate P(X < 1.5), i.e., the proportion of all thickness values less than 1.5. [Hint: If you knew the values of μ and σ, you could calculate this probability. These values are not available, but they can be estimated.]
-
e.
What is the estimated standard error of the estimator that you used in part (b)?
-
4.
The data set mentioned in Exercise 1 also includes these third grade verbal IQ observations for males:
117
103
121
112
120
132
113
117
132
149
125
131
136
107
108
113
136
114
and females
114
102
113
131
124
117
120
90
114
109
102
114
127
127
103
Prior to obtaining data, denote the male values by X 1 ,..., X m and the female values by Y 1 ,..., Y n . Suppose that the X i ’s constitute a random sample from a distribution with mean μ 1 and standard deviation σ 1 and that the Y i ’s form a random sample (independent of the X i ’s) from another distribution with mean μ 2 and standard deviation σ 2.
-
a.
Use rules of expected value to show that \( \overline{X} - \overline{Y} \) is an unbiased estimator of μ 1 – μ 2. Calculate the estimate for the given data.
-
b.
Use rules of variance from Chapter 6 to obtain an expression for the variance and standard deviation (standard error) of the estimator in part (a), and then compute the estimated standard error.
-
c.
Calculate a point estimate of the ratio σ 1/σ 2 of the two standard deviations.
-
d.
Suppose one male third-grader and one female third-grader are randomly selected. Calculate a point estimate of the variance of the difference X – Y between male and female IQ.
-
5.
As an example of a situation in which several different statistics could reasonably be used to calculate a point estimate, consider a population of N invoices. Associated with each invoice is its “book value,” the recorded amount of that invoice. Let T denote the total book value, a known amount. Some of these book values are erroneous. An audit will be carried out by randomly selecting n invoices and determining the audited (correct) value for each one. Suppose that the sample gives the following results (in dollars).
Invoice
1
2
3
4
5
Book value
300
720
526
200
127
Audited value
300
520
526
200
157
Error
0
200
0
0
−30
Let \( \overline X \) = the sample mean audited value, \( \overline Y \)= the sample mean book value, and \( \overline D \)= the sample mean error. Propose three different statistics for estimating the total audited (i.e., correct) value θ — one involving just N and \( \overline X \), another involving N, T, and \( \overline D \), and the last involving T and \( \overline X /\overline Y \). Then calculate the resulting estimates when N = 5,000 and T = 1,761,300. (The article “Statistical Models and Analysis in Auditing,” Statistical Science, 1989: 2–33 discusses properties of these estimators.)
-
6.
Consider the accompanying observations on stream flow (1000’s of acre-feet) recorded at a station in Colorado for the period April 1–August 31 over a 31-year span (from an article in the 1974 volume of Water Resources Res.).
127.96
210.07
203.24
108.91
178.21
285.37
100.85
89.59
185.36
126.94
200.19
66.24
247.11
299.87
109.64
125.86
114.79
109.11
330.33
85.54
117.64
302.74
280.55
145.11
95.36
204.91
311.13
150.58
262.09
477.08
94.33
An appropriate probability plot supports the use of the lognormal distribution (see Section 4.5) as a reasonable model for stream flow.
-
a.
Estimate the parameters of the distribution. [Hint: Remember that X has a lognormal distribution with parameters μ and σ 2 if ln(X) is normally distributed with mean μ and variance σ 2.]
-
b.
Use the estimates of part (a) to calculate an estimate of the expected value of stream flow. [Hint: What is E(X)?]
-
7.
-
a.
A random sample of 10 houses in a particular area, each of which is heated with natural gas, is selected and the amount of gas (therms) used during the month of January is determined for each house. The resulting observations are 103, 156, 118, 89, 125, 147, 122, 109, 138, 99. Let μ denote the average gas usage during January by all houses in this area. Compute a point estimate of μ.
-
b.
Suppose there are 10,000 houses in this area that use natural gas for heating. Let τ denote the total amount of gas used by all of these houses during January. Estimate τ using the data of part (a). What estimator did you use in computing your estimate?
-
c.
Use the data in part (a) to estimate p, the proportion of all houses that used at least 100 therms.
-
d.
Give a point estimate of the population median usage (the middle value in the population of all houses) based on the sample of part (a). What estimator did you use?
-
8.
In a random sample of 80 components of a certain type, 12 are found to be defective.
-
a.
Give a point estimate of the proportion of all such components that are not defective.
-
b.
A system is to be constructed by randomly selecting two of these components and connecting them in series, as shown here.
The series connection implies that the system will function if and only if neither component is defective (i.e., both components work properly). Estimate the proportion of all such systems that work properly. [Hint: If p denotes the probability that a component works properly, how can P(system works) be expressed in terms of p?]
-
c.
Let \( \hat{p} \) be the sample proportion of successes. Is \( {\hat{p}^2} \) an unbiased estimator for p 2? [Hint: For any rv Y, E(Y 2) = V(Y) + [E(Y)]2.]
-
9.
Each of 150 newly manufactured items is examined and the number of scratches per item is recorded (the items are supposed to be free of scratches), yielding the following data:
Number of scratches per item
0
1
2
3
4
5
6
7
Observed frequency
18
37
42
30
13
7
2
1
Let X = the number of scratches on a randomly chosen item, and assume that X has a Poisson distribution with parameter λ.
-
a.
Find an unbiased estimator of λ and compute the estimate for the data. [Hint: E(X) = λ for X Poisson, so E(\( \overline{X} \)) = ?]
-
b.
What is the standard deviation (standard error) of your estimator? Compute the estimated standard error. [Hint: \( \sigma_X^2 = \lambda \) for X Poisson.]
-
10.
Using a long rod that has length μ, you are going to lay out a square plot in which the length of each side is μ. Thus the area of the plot will be μ 2. However, you do not know the value of μ, so you decide to make n independent measurements X 1, X 2,... X n of the length. Assume that each X i has mean μ (unbiased measurements) and variance σ 2.
-
a.
Show that \( {\overline{X}^2} \) is not an unbiased estimator for μ 2. [Hint: For any rv Y, E(Y 2) = V(Y) + [E(Y)]2. Apply this with \( Y = \overline{X} \).]
-
b.
For what value of k is the estimator \( {\overline{X}^2} - k{S^2} \) unbiased for μ 2? [Hint: Compute E(\( {\overline{X}^2} - k{S^2} \)).]
-
11.
Of n 1 randomly selected male smokers, X 1 smoked filter cigarettes, whereas of n 2 randomly selected female smokers, X 2 smoked filter cigarettes. Let p 1 and p 2 denote the probabilities that a randomly selected male and female, respectively, smoke filter cigarettes.
-
a.
Show that (X 1/n 1) – (X 2/n 2) is an unbiased estimator for p 1 – p 2. [Hint: E(X i ) = n i p i for i = 1, 2.]
-
b.
What is the standard error of the estimator in part (a)?
-
c.
How would you use the observed values x 1 and x 2 to estimate the standard error of your estimator?
-
d.
If n 1 = n 2 = 200, x 1 = 127, and x 2 = 176, use the estimator of part (a) to obtain an estimate of p 1 – p 2.
-
e.
Use the result of part (c) and the data of part (d) to estimate the standard error of the estimator.
-
12.
Suppose a certain type of fertilizer has an expected yield per acre of μ 1 with variance σ 2, whereas the expected yield for a second type of fertilizer is μ 2 with the same variance σ 2. Let \( S_1^2 \) and \( S_2^2 \) denote the sample variances of yields based on sample sizes n 1 and n 2, respectively, of the two fertilizers. Show that the pooled (combined) estimator
$$ {\hat{\sigma }^2} = \frac{{({n_1} - 1)S_1^2 + ({n_2} - 1)S_2^2}}{{{n_1} + {n_2} - 2}} $$is an unbiased estimator of σ 2.
-
13.
Consider a random sample X 1,..., X n from the pdf
$$ f(x;\theta ) =.5(1 + \theta x)\quad \quad - 1 \leq x \leq 1 $$where −1 ≤ θ ≤ 1 (this distribution arises in particle physics). Show that \( \hat{\theta } = 3\overline{X} \) is an unbiased estimator of θ. [Hint: First determine \( \mu = E(X) = E(\overline{X}) \).]
-
14.
A sample of n captured Pandemonium jet fighters results in serial numbers x 1, x 2, x 3,..., x n . The CIA knows that the aircraft were numbered consecutively at the factory starting with α and ending with β, so that the total number of planes manufactured is β – α + 1 (e.g., if α = 17 and β = 29, then 29−17 + 1 = 13 planes having serial numbers 17, 18, 19,..., 28, 29 were manufactured). However, the CIA does not know the values of α or β. A CIA statistician suggests using the estimator max(X i ) – min(X i ) + 1 to estimate the total number of planes manufactured.
-
a.
If n = 5, x 1 = 237, x 2 = 375, x 3 = 202, x 4 = 525, and x 5 = 418, what is the corresponding estimate?
-
b.
Under what conditions on the sample will the value of the estimate be exactly equal to the true total number of planes? Will the estimate ever be larger than the true total? Do you think the estimator is unbiased for estimating β – α + 1? Explain in one or two sentences.
(A similar method was used to estimate German tank production in World War II.)
-
15.
Let X 1, X 2,..., X n represent a random sample from a Rayleigh distribution with pdf
$$ f(x;\theta ) = \frac{x}{\theta }{e^{{ - {x^2}/(2\theta )}}}\quad \quad x > 0 $$-
a.
It can be shown that E(X 2) = 2θ. Use this fact to construct an unbiased estimator of θ based on \( \sum {X_i^2} \) (and use rules of expected value to show that it is unbiased).
-
b.
Estimate θ from the following measurements of blood plasma beta concentration (in pmol/L) for n = 10 men.
16.88
10.23
4.59
6.66
13.68
14.23
19.87
9.40
6.51
10.95
-
16.
Suppose the true average growth μ of one type of plant during a 1-year period is identical to that of a second type, but the variance of growth for the first type is σ 2, whereas for the second type, the variance is 4σ 2. Let X 1,..., X m be m independent growth observations on the first type [so E(X i ) = μ, V(X i ) = σ 2], and let Y 1,..., Y n be n independent growth observations on the second type [E(Y i ) = μ, V(Y i ) = 4σ 2]. Let c be a numerical constant and consider the estimator \( \hat{\mu} = c\overline X + (1 - c)\overline Y \). For any c between 0 and 1 this is a weighted average of the two sample means, e.g., \(.7\overline X +.3\overline Y \)
-
a.
Show that for any c the estimator is unbiased.
-
b.
For fixed m and n, what value c minimizes \( V(\hat{\mu}) \)? [Hint: The estimator is a linear combination of the two sample means and these means are independent. Once you have an expression for the variance, differentiate with respect to c.]
-
17.
In Chapter 3, we defined a negative binomial rv as the number of failures that occur before the rth success in a sequence of independent and identical success/failure trials. The probability mass function (pmf) of X is
$$ nb(x;r,p) = \left\{ \begin{array}{ll} \left( \begin{array}{c} x + r - 1 \\ x \end{array} \right){p^r}{{(1 - p)}^x} & x = 0,1,2, \ldots \\ 0 & {\hbox{otherwise}} \end{array} \right. $$
a.
Suppose that r ≥ 2. Show that
$$ \hat{p} = (r - 1)/(X + r - 1) $$is an unbiased estimator for p. [Hint: Write out \( E(\,\hat{p}) \) and cancel x + r – 1 inside the sum.]
-
b.
A reporter wishing to interview five individuals who support a certain candidate begins asking people whether (S) or not (F) they support the candidate. If the sequence of responses is SFFSFFFSSS, estimate p = the true proportion who support the candidate.
-
18.
Let X 1, X 2,..., X n be a random sample from a pdf f(x) that is symmetric about μ, so that \( \widetilde{X} \) is an unbiased estimator of μ. If n is large, it can be shown that \( V(\widetilde{X}) \approx {1}/\{ {4}n{[\,f(\mu )]^{{2}}} \} \). When the underlying pdf is Cauchy (see Example 7.8), \( V(\overline{X}) = \infty \), so \( \overline{X} \) is a terrible estimator. What is \( V(\widetilde{X}) \) in this case when n is large?
-
19.
An investigator wishes to estimate the proportion of students at a certain university who have violated the honor code. Having obtained a random sample of n students, she realizes that asking each, “Have you violated the honor code?” will probably result in some untruthful responses. Consider the following scheme, called a randomized response technique. The investigator makes up a deck of 100 cards, of which 50 are of type I and 50 are of type II.
Type I: Have you violated the honor code (yes or no)?
Type II: Is the last digit of your telephone number a 0, 1, or 2 (yes or no)?
Each student in the random sample is asked to mix the deck, draw a card, and answer the resulting question truthfully. Because of the irrelevant question on type II cards, a yes response no longer stigmatizes the respondent, so we assume that responses are truthful. Let p denote the proportion of honor-code violators (i.e., the probability of a randomly selected student being a violator), and let λ = P(yes response). Then λ and p are related by λ = .5p + (.5)(.3).
-
a.
Let Y denote the number of yes responses, so Y ~ Bin(n, λ). Thus Y/n is an unbiased estimator of λ. Derive an estimator for p based on Y. If n = 80 and y = 20, what is your estimate? [Hint: Solve λ = .5p + .15 for p and then substitute Y/n for λ.]
-
b.
Use the fact that E(Y/n) = λ to show that your estimator \( \hat{p} \) is unbiased.
-
c.
If there were 70 type I and 30 type II cards, what would be your estimator for p?
-
20.
Return to the problem of estimating the population proportion p and consider another adjusted estimator, namely
$$ \hat{p} = \frac{{X + \sqrt {{n/4}} }}{{n + \sqrt {n} }} $$The justification for this estimator comes from the Bayesian approach to point estimation to be introduced in Section 14.4.
-
a.
Determine the mean squared error of this estimator. What do you find interesting about this MSE?
-
b.
Compare the MSE of this estimator to the MSE of the usual estimator (the sample proportion).
7.2 Methods of Point Estimation
So far the point estimators we have introduced were obtained via intuition and/or educated guesswork. We now discuss two “constructive” methods for obtaining point estimators: the method of moments and the method of maximum likelihood. By constructive we mean that the general definition of each type of estimator suggests explicitly how to obtain the estimator in any specific problem. Although maximum likelihood estimators are generally preferable to moment estimators because of certain efficiency properties, they often require significantly more computation than do moment estimators. It is sometimes the case that these methods yield unbiased estimators.
7.2.1 The Method of Moments
The basic idea of this method is to equate certain sample characteristics, such as the mean, to the corresponding population expected values. Then solving these equations for unknown parameter values yields the estimators.
DEFINITION
Let X 1,..., X n be a random sample from a pmf or pdf f(x). For k = 1, 2, 3,..., the k th population moment, or k th moment of the distribution f(x), is E(X k). The k th sample moment is \( (1/n)\sum\nolimits_{{i = 1}}^n {X_i^k}. \)
Thus the first population moment is E(X) = μ and the first sample moment is \( \sum {{X_i}} /n = \overline{X}. \) The second population and sample moments are E(X 2) and \( \sum {X_i^2} /n \), respectively. The population moments will be functions of any unknown parameters θ 1, θ 2,....
DEFINITION
Let X 1, X 2,..., X n be a random sample from a distribution with pmf or pdf f(x; θ 1,..., θ m ), where θ 1,..., θ m are parameters whose values are unknown. Then the moment estimators \( {\hat{\theta }_1}, \ldots, {\hat{\theta }_m} \) are obtained by equating the first m sample moments to the corresponding first m population moments and solving for θ 1,..., θ m .
If, for example, m = 2, E(X) and E(X 2) will be functions of θ 1 and θ 2. Setting \( E(X) = (1/n)\sum {{X_i}} \;\;( = \overline{X}) \) and \( E\left( {{X^{{2}}}} \right) = (1/n)\sum {X_i^2} \) gives two equations in θ 1 and θ 2. The solution then defines the estimators. For estimating a population mean μ, the method gives \( \mu = \overline{X} \), so the estimator is the sample mean.
Example 7.13
Let X 1,..., X n represent a random sample of service times of n customers at a certain facility, where the underlying distribution is assumed exponential with parameter λ. Since there is only one parameter to be estimated, the estimator is obtained by equating E(X) to \( \overline{X} \). Since E(X) = 1/λ for an exponential distribution, this gives \( {1}/\lambda = \overline{X} \) or \( \lambda = {1}/\overline{X} \). The moment estimator of λ is then \( \hat{\lambda } = 1/\overline{X} \).
Example 7.14
Let X 1,..., X n be a random sample from a gamma distribution with parameters α and β. From Section 4.4, E(X) = αβ and E(X 2) = β 2Γ(α + 2)/Γ(α) = β 2(α + 1)α. The moment estimators of α and β are obtained by solving
$$ \overline{X} = \alpha \beta \qquad \frac{1}{n}\sum X_i^2 = \alpha (\alpha + 1)\beta^2 $$
Since \( \alpha (\alpha + 1){\beta^2} = {\alpha^2}{\beta^2} + \alpha {\beta^2} \) and the first equation implies \( {\alpha^2}{\beta^2} = {\left( {\overline{X}} \right)^2} \), the second equation becomes
$$ \frac{1}{n}\sum X_i^2 = \overline{X}^2 + \alpha \beta^2 $$
Now dividing each side of this second equation by the corresponding side of the first equation and substituting back gives the estimators
$$ \hat{\beta} = \frac{(1/n)\sum X_i^2 - \overline{X}^2}{\overline{X}} \qquad \hat{\alpha} = \frac{\overline{X}}{\hat{\beta}} = \frac{\overline{X}^2}{(1/n)\sum X_i^2 - \overline{X}^2} $$
To illustrate, the survival time data mentioned in Example 4.28 is
152 | 115 | 109 | 94 | 88 | 137 | 152 | 77 | 160 | 165 |
125 | 40 | 128 | 123 | 136 | 101 | 62 | 153 | 83 | 69 |
with \( \bar{x} = 113.5 \) and \( (1/20)\sum {x_i^2} = 14,087.8 \). The estimates are
$$ \hat{\alpha} = \frac{(113.5)^2}{14{,}087.8 - (113.5)^2} = 10.7 \qquad \hat{\beta} = \frac{14{,}087.8 - (113.5)^2}{113.5} = 10.6 $$
These estimates of α and β differ from the values suggested by Gross and Clark because they used a different estimation technique.
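To make the calculation concrete, here is a minimal Python sketch (ours, not part of the original example) that recomputes the moment estimates directly from the 20 survival times; only the standard library is used.

```python
# Method-of-moments estimates for the gamma parameters (Example 7.14),
# using the 20 survival times listed above.
x = [152, 115, 109, 94, 88, 137, 152, 77, 160, 165,
     125, 40, 128, 123, 136, 101, 62, 153, 83, 69]

n = len(x)
xbar = sum(x) / n                      # first sample moment
m2 = sum(xi**2 for xi in x) / n        # second sample moment

# Solving xbar = alpha*beta and m2 = alpha*(alpha + 1)*beta**2 gives:
alpha_hat = xbar**2 / (m2 - xbar**2)
beta_hat = (m2 - xbar**2) / xbar

print(xbar, m2)             # about 113.5 and 14,087.8
print(alpha_hat, beta_hat)  # about 10.6 and 10.7; the text's 10.7 and 10.6
                            # use the moments rounded as quoted above
```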
Example 7.15
Let X 1,..., X n be a random sample from a generalized negative binomial distribution with parameters r and p (Section 3.6). Since E(X) = r(1 – p)/p and V(X) = r(1 – p)/p 2, E(X 2) = V(X) + [E(X)]2 = r(1 – p) (r – rp + 1)/p 2. Equating E(X) to \( \overline{X} \) and E(X 2) to \( (1/n)\sum {X_i^2} \) eventually gives
As an illustration, Reep, Pollard, and Benjamin (“Skill and Chance in Ball Games,” J. Roy. Statist. Soc. Ser. A, 1971: 623–629) consider the negative binomial distribution as a model for the number of goals per game scored by National Hockey League teams. The data for 1966–1967 follows (420 games):
Goals | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Frequency | 29 | 71 | 82 | 89 | 65 | 45 | 24 | 7 | 4 | 1 | 3 |
Then,
and
Thus,
Although r by definition must be positive, the denominator of \( \hat{r} \) could be negative, indicating that the negative binomial distribution is not appropriate (or that the moment estimator is flawed).
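Solving the two moment equations for p and r gives \( \hat{p} = \overline{X}/[(1/n)\sum X_i^2 - \overline{X}^2] \) and \( \hat{r} = \overline{X}^2/[(1/n)\sum X_i^2 - \overline{X}^2 - \overline{X}] \). The following Python sketch (ours, not from the original text) evaluates these estimators for the frequency table above; the denominator of \( \hat{r} \) is exactly the quantity that could turn out negative.

```python
# Method-of-moments estimates for the negative binomial parameters
# (Example 7.15), computed from the goals-per-game frequency table.
goals = list(range(11))
freq = [29, 71, 82, 89, 65, 45, 24, 7, 4, 1, 3]

n = sum(freq)                                      # 420 games
xbar = sum(g * f for g, f in zip(goals, freq)) / n
m2 = sum(g**2 * f for g, f in zip(goals, freq)) / n

# Solving xbar = r(1-p)/p and m2 - xbar**2 = r(1-p)/p**2 for p and r:
p_hat = xbar / (m2 - xbar**2)
r_hat = xbar**2 / (m2 - xbar**2 - xbar)   # denominator < 0 would signal a poor model

print(xbar, m2 - xbar**2)   # sample mean and second central moment
print(p_hat, r_hat)
```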
7.2.2 Maximum Likelihood Estimation
The method of maximum likelihood was first introduced by R. A. Fisher, a geneticist and statistician, in the 1920s. Most statisticians recommend this method, at least when the sample size is large, since the resulting estimators have certain desirable efficiency properties (see the proposition on large sample behavior toward the end of this section).
Example 7.16
A sample of ten new bike helmets manufactured by a company is obtained. Upon testing, it is found that the first, third, and tenth helmets are flawed, whereas the others are not. Let p = P(flawed helmet) and define X 1,..., X 10 by X i = 1 if the ith helmet is flawed and zero otherwise. Then the observed x i ’s are 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, so the joint pmf of the sample is
$$ f(x_1, \ldots, x_{10}; p) = p(1 - p)p \cdots p = p^3 (1 - p)^7 \qquad (7.4) $$
We now ask, “For what value of p is the observed sample most likely to have occurred?” That is, we wish to find the value of p that maximizes the pmf (7.4) or, equivalently, maximizes the natural log of (7.4). Since
$$ \ln [f(x_1, \ldots, x_{10}; p)] = 3\ln (p) + 7\ln (1 - p) \qquad (7.5) $$
and this is a differentiable function of p, equating the derivative of (7.5) to zero gives the maximizing value:
$$ \hat{p} = \frac{3}{10} = \frac{x}{10} $$
where x is the observed number of successes (flawed helmets). The estimate of p is now \( \hat{p} = \frac{3}{{10}} \). It is called the maximum likelihood estimate because for fixed x 1,..., x 10, it is the parameter value that maximizes the likelihood (joint pmf) of the observed sample. The likelihood and log likelihood are graphed in Figure 7.5. Of course, the maximum on both graphs occurs at the same value, p = .3.
Note that if we had been told only that among the ten helmets there were three that were flawed, Equation (7.4) would be replaced by the binomial pmf \( \binom{10}{3}{p^3}{(1 - p)^7} \), which is also maximized for \( \hat{p} = 3/10 \).
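A quick numerical check (ours, not part of the text) confirms that the likelihood is maximized at p = .3; a crude grid search over (0, 1) suffices here because the log likelihood has a single peak.

```python
import math

# Log likelihood of the helmet sample from Example 7.16: three flawed
# helmets out of ten, so ln L(p) = 3 ln(p) + 7 ln(1 - p).
def log_lik(p):
    return 3 * math.log(p) + 7 * math.log(1 - p)

# Grid search over (0, 1); the calculus argument gives p_hat = 3/10.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)
print(p_hat)   # 0.3
```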
DEFINITION
Let X 1,..., X n have joint pmf or pdf
where the parameters θ 1,..., θ m have unknown values. When x 1,..., x n are the observed sample values and (7.6) is regarded as a function of θ 1,..., θ m , it is called the likelihood function. The maximum likelihood estimates \( {\hat{\theta }_1}, \ldots, {\hat{\theta }_m} \) are those values of the θ i ’s that maximize the likelihood function, so that
f(x 1, x 2,..., x n ; \( {\hat{\theta }_1}, \ldots, {\hat{\theta }_m} \)) ≥ f(x 1, x 2,..., x n ; θ 1,..., θ m ) for all θ 1,..., θ m
When the X i ’s are substituted in place of the x i ’s, the maximum likelihood estimators (mle’s) result.
The likelihood function tells us how likely the observed sample is as a function of the possible parameter values. Maximizing the likelihood gives the parameter values for which the observed sample is most likely to have been generated, that is, the parameter values that “agree most closely” with the observed data.
Example 7.17
Suppose X 1,..., X n is a random sample from an exponential distribution with parameter λ. Because of independence, the likelihood function is a product of the individual pdf’s:
$$ f(x_1, \ldots, x_n; \lambda) = (\lambda e^{-\lambda x_1}) \cdots (\lambda e^{-\lambda x_n}) = \lambda^n e^{-\lambda \sum x_i} $$
The ln(likelihood) is
$$ \ln [f(x_1, \ldots, x_n; \lambda)] = n\ln (\lambda) - \lambda \sum x_i $$
Equating (d/dλ)[ln(likelihood)] to zero results in n/λ – Σx i = 0, or \( \lambda = n/\Sigma {{x_i}} = 1/\bar{x} \). Thus the mle is \( \hat{\lambda } = 1/\overline{X} \); it is identical to the method of moments estimator but it is not an unbiased estimator, since \( E(1/\overline{X}) \ne 1/E(\overline{X}) \).
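A small simulation sketch (ours, not from the text; it assumes nothing beyond the Python standard library) illustrates the bias: for small n the average of \( 1/\overline{X} \) over many samples noticeably exceeds λ.

```python
import random

# Simulation sketch for Example 7.17: the mle of an exponential rate is
# 1/xbar, which is biased upward because E(1/Xbar) != 1/E(Xbar).
random.seed(1)
lam, n, reps = 2.0, 5, 100_000

est = []
for _ in range(reps):
    sample = [random.expovariate(lam) for _ in range(n)]
    est.append(n / sum(sample))          # mle = 1/xbar

print(sum(est) / reps)   # noticeably larger than lam = 2.0 (about n/(n-1) times lam)
```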
Example 7.18
Let X 1,..., X n be a random sample from a normal distribution. The likelihood function is
$$ f(x_1, \ldots, x_n; \mu, \sigma^2) = \left( \frac{1}{2\pi \sigma^2} \right)^{n/2} \exp \left( - \frac{\sum (x_i - \mu)^2}{2\sigma^2} \right) $$
so
$$ \ln [f(x_1, \ldots, x_n; \mu, \sigma^2)] = - \frac{n}{2}\ln (2\pi \sigma^2) - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2 $$
To find the maximizing values of μ and σ 2, we must take the partial derivatives of ln(f) with respect to μ and σ 2, equate them to zero, and solve the resulting two equations. Omitting the details, the resulting mle’s are
$$ \hat{\mu} = \overline{X} \qquad \hat{\sigma}^2 = \frac{\sum (X_i - \overline{X})^2}{n} $$
The mle of σ 2 is not the unbiased estimator, so two different principles of estimation (unbiasedness and maximum likelihood) yield two different estimators.
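As an illustration (ours, not part of the original example), the following Python sketch computes the normal mle’s for the six bagel weights listed in Exercise 26 below; note the divisor n rather than n − 1 in the variance estimate, and the square root giving the mle of σ (an instance of the invariance principle of Section 7.2.3).

```python
import math

# MLEs for a normal sample (Example 7.18): mu_hat = xbar and
# sigma2_hat = sum((x - xbar)^2)/n   (divide by n, not n - 1).
x = [117.6, 109.5, 111.6, 109.2, 119.1, 110.8]   # bagel weights from Exercise 26
n = len(x)

mu_hat = sum(x) / n
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n
sigma_hat = math.sqrt(sigma2_hat)                # mle of sigma

print(mu_hat, sigma2_hat, sigma_hat)
```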
Example 7.19
In Chapter 3, we discussed the use of the Poisson distribution for modeling the number of “events” that occur in a two-dimensional region. Assume that when the region R being sampled has area a(R), the number X of events occurring in R has a Poisson distribution with parameter λa(R) (where λ is the expected number of events per unit area) and that nonoverlapping regions yield independent X’s.
Suppose an ecologist selects n nonoverlapping regions R 1,..., R n and counts the number of plants of a certain species found in each region. The joint pmf (likelihood) is then
The ln(likelihood) is
Taking d/dλ ln(p) and equating it to zero yields
so
The mle is then \( \hat{\lambda } = \sum {{X_i}/\sum {a({R_i})} } \). This is intuitively reasonable because λ is the true density (plants per unit area), whereas \( \hat{\lambda } \) is the sample density since \( \sum {a({R_i})} \) is just the total area sampled. Because E(X i ) = λ · a(R i ), the estimator is unbiased.
Sometimes an alternative sampling procedure is used. Instead of fixing regions to be sampled, the ecologist will select n points in the entire region of interest and let y i = the distance from the ith point to the nearest plant. The cumulative distribution function (cdf) of Y = distance to the nearest plant is
Taking the derivative of F Y (y) with respect to y yields
If we now form the likelihood f Y (y 1; λ) · ··· · f Y (y n ; λ), differentiate ln(likelihood), and so on, the resulting mle is
which is also a sample density. It can be shown that in a sparse environment (small λ), the distance method is in a certain sense better, whereas in a dense environment, the first sampling method is better.
Example 7.20
Let X 1,..., X n be a random sample from a Weibull pdf
Writing the likelihood and ln(likelihood), then setting both \( (\partial /\partial \alpha )[\ln (f)] = 0 \) and \( (\partial /\partial \beta )[\ln (\,f)] = 0 \) yields the equations
These two equations cannot be solved explicitly to give general formulas for the mle’s \( \hat{\alpha } \) and \( \hat{\beta } \). Instead, for each sample x 1,..., x n , the equations must be solved using an iterative numerical procedure. Even moment estimators of α and β are somewhat complicated (see Exercise 22).
The iterative mle computations can be done on a computer, and they are available in some statistical packages. MINITAB gives maximum likelihood estimates for both the Weibull and the gamma distributions (under “Quality Tools”). Stata has a general procedure that can be used for these and other distributions. For the data of Example 7.14 the maximum likelihood estimates for the Weibull distribution are \( \hat{\alpha } = {3}.{799} \) and \( \hat{\beta }= {125}.{88} \). (The mle’s for the gamma distribution are \( \hat{\alpha } = {8}.{799} \) and \( \hat{\beta } = {12}.{893} \), a little different from the moment estimates in Example 7.14). Figure 7.6 shows the Weibull log likelihood as a function of α and β. The surface near the top has a rounded shape, allowing the maximum to be found easily, but for some distributions the surface can be much more irregular, and the maximum may be hard to find.
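As a sketch of how such an iterative computation might be set up (ours, not the package procedures described above; it assumes NumPy and SciPy are available), the negative log likelihood can be minimized numerically for the survival-time data of Example 7.14. Any general-purpose optimizer, or a Newton iteration on the two likelihood equations, would serve equally well; the resulting values should be close to the \( \hat{\alpha } = 3.799 \) and \( \hat{\beta } = 125.88 \) quoted above.

```python
import numpy as np
from scipy.optimize import minimize

# Numerical mle's for the Weibull parameters (Example 7.20), using the
# survival-time data of Example 7.14.
x = np.array([152, 115, 109, 94, 88, 137, 152, 77, 160, 165,
              125, 40, 128, 123, 136, 101, 62, 153, 83, 69], dtype=float)

def neg_log_lik(params):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    # Weibull pdf: (a / b**a) * x**(a - 1) * exp(-(x / b)**a)
    return -np.sum(np.log(a) - a * np.log(b) + (a - 1) * np.log(x) - (x / b) ** a)

res = minimize(neg_log_lik, x0=[1.0, np.mean(x)], method="Nelder-Mead")
alpha_hat, beta_hat = res.x
print(alpha_hat, beta_hat)   # roughly 3.8 and 126
```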
7.2.3 Some Properties of MLEs
In Example 7.18, we obtained the mle of σ 2 when the underlying distribution is normal. The mle of \( \sigma = \sqrt {{{\sigma^2}}} \), as well as many other mle’s, can be easily derived using the following proposition.
PROPOSITION
The Invariance Principle
Let \( {\hat{\theta }_1},{\hat{\theta }_2}, \ldots, {\hat{\theta }_m} \) be the mle’s of the parameters θ 1, θ 2,..., θ m . Then the mle of any function h(θ 1, θ 2,..., θ m ) of these parameters is the function \( h({\hat{\theta }_1},{\hat{\theta }_2}, \ldots, {\hat{\theta }_m}) \), of the mle’s.
7.2.4 Proof
For an intuitive idea of the proof, consider the special case m = 1, with θ 1 = θ, and assume that h(·) is a one-to-one function. On the graph of the likelihood as a function of the parameter θ, the highest point occurs where \( \theta = \hat{\theta } \). Now consider the graph of the likelihood as a function of h(θ). In the new graph the same heights occur, but the height that was previously plotted at θ = a is now plotted at \( h(\theta ) = h(a) \), and the highest point is now plotted at \( h(\theta ) = h(\hat{\theta }) \). Thus, the maximum remains the same, but it now occurs at \( h(\hat{\theta }) \).
Example 7.21 (Example 7.18 continued)
In the normal case, the mle’s of μ and σ 2 are \( \hat{\mu } = \overline{X} \) and \( {\hat{\sigma }^2} = \sum {{{({X_i} - \overline{X})}^2}/n} \). To obtain the mle of the function \( h(\mu, {\sigma^2}) = \sqrt {{{\sigma^2}}} = \sigma \), substitute the mle’s into the function:
$$ \hat{\sigma} = \sqrt{\hat{\sigma}^2} = \sqrt{\frac{1}{n}\sum (X_i - \overline{X})^2} $$
The mle of σ is not the sample standard deviation S, although they are close unless n is quite small. Similarly, the mle of the population coefficient of variation 100σ/μ is \( 100\hat{\sigma }/\hat{\mu } \).
Example 7.22 (Example 7.20 continued)
The mean value of an rv X that has a Weibull distribution is
$$ \mu = \beta \cdot \Gamma (1 + 1/\alpha ) $$
The mle of μ is therefore \( \hat{\mu } = \hat{\beta } \cdot \Gamma (1 + 1/\hat{\alpha }) \), where \( \hat{\alpha }{\hbox{ and }}\hat{\beta } \) are the mle’s of α and β. In particular, \( \overline{X} \) is not the mle of μ, although it is an unbiased estimator. At least for large n, \( \hat{\mu } \) is a better estimator than \( \overline{X} \).
7.2.5 Large-Sample Behavior of the MLE
Although the principle of maximum likelihood estimation has considerable intuitive appeal, the following proposition provides additional rationale for the use of mle’s. (See Section 7.4 for more details.)
PROPOSITION
Under very general conditions on the joint distribution of the sample, when the sample size is large, the maximum likelihood estimator of any parameter θ is close to θ (consistency), is approximately unbiased [\( E(\hat{\theta }) \approx \theta \)], and has variance that is nearly as small as can be achieved by any unbiased estimator. Stated another way, the mle \( \hat {\theta } \) is approximately the MVUE of θ.
Because of this result and the fact that calculus-based techniques can usually be used to derive the mle’s (although often numerical methods, such as Newton’s method, are necessary), maximum likelihood estimation is the most widely used estimation technique among statisticians. Many of the estimators used in the remainder of the book are mle’s. Obtaining an mle, however, does require that the underlying distribution be specified.
Note that there is no similar result for method of moments estimators. In general, if there is a choice between maximum likelihood and moment estimators, the mle is preferable. For example, the maximum likelihood method applied to estimating gamma distribution parameters tends to give better estimates (closer to the parameter values) than does the method of moments, so the extra computation is worth the price.
7.2.6 Some Complications
Sometimes calculus cannot be used to obtain mle’s.
Example 7.23
Suppose the waiting time for a bus is uniformly distributed on [0, θ] and the results x 1,..., x n of a random sample from this distribution have been observed. Since f(x; θ) = 1/θ for 0 ≤ x ≤ θ and 0 otherwise,
As long as max(x i ) ≤ θ, the likelihood is 1/θ n, which is positive, but as soon as θ < max(x i ), the likelihood drops to 0. This is illustrated in Figure 7.7. Calculus will not work because the maximum of the likelihood occurs at a point of discontinuity, but the figure shows that \( \hat{\theta } = \max \left( {{x_i}} \right) \). Thus if my waiting times are 2.3, 3.7, 1.5, .4, and 3.2, then the mle is \( \hat{\theta } = 3.7 \). Note that the mle is biased (see Example 7.5).
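A two-line check (ours, not from the text) makes the shape of Figure 7.7 concrete: the likelihood is zero for any θ smaller than the sample maximum and decreases like 1/θ n beyond it, so it peaks exactly at max(x i ).

```python
# Likelihood for the uniform(0, theta) waiting times of Example 7.23.
x = [2.3, 3.7, 1.5, 0.4, 3.2]

def likelihood(theta):
    # 1/theta^n as long as theta covers every observation, otherwise 0
    return 0.0 if theta < max(x) else theta ** (-len(x))

for theta in [3.0, 3.7, 4.0, 5.0]:
    print(theta, likelihood(theta))
# The value is largest at theta = max(x) = 3.7, the mle.
```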
Example 7.24
A method that is often used to estimate the size of a wildlife population involves performing a capture/recapture experiment. In this experiment, an initial sample of M animals is captured, each of these animals is tagged, and the animals are then returned to the population. After allowing enough time for the tagged individuals to mix into the population, another sample of size n is captured. With X = the number of tagged animals in the second sample, the objective is to use the observed x to estimate the population size N.
The parameter of interest is θ = N, which can assume only integer values, so even after determining the likelihood function (pmf of X here), using calculus to obtain N would present difficulties. If we think of a success as a previously tagged animal being recaptured, then sampling is without replacement from a population containing M successes and N – M failures, so that X is a hypergeometric rv and the likelihood function is
The integer-valued nature of N notwithstanding, it would be difficult to take the derivative of p(x; N). However, let’s consider the ratio of p(x; N) to p(x; N – 1):
This ratio is larger than 1 if and only if (iff) N < Mn/x. The value of N for which p(x; N) is maximized is therefore the largest integer less than Mn/x. If we use standard mathematical notation [r] for the largest integer less than or equal to r, the mle of N is \( \hat{N} = [Mn/x] \). As an illustration, if M = 200 fish are taken from a lake and tagged, subsequently n = 100 fish are recaptured, and among the 100 there are x = 11 tagged fish, then \( \hat{N} = [(200)(100)/11] = [1818.18] = 1818 \). The estimate is actually rather intuitive; x/n is the proportion of the recaptured sample that is tagged, whereas M/N is the proportion of the entire population that is tagged. The estimate is obtained by equating these two proportions (estimating a population proportion by a sample proportion).
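The same computation in Python (our sketch, not part of the original example) both applies the formula and verifies directly that the hypergeometric likelihood peaks at \( \hat{N} \); the search range in the check is an arbitrary choice wide enough to contain the peak.

```python
from math import comb, floor

# Capture/recapture mle from Example 7.24: N_hat = floor(M*n/x).
M, n, x = 200, 100, 11
N_hat = floor(M * n / x)
print(N_hat)                             # 1818

# Check directly that the hypergeometric likelihood peaks at N_hat.
def p(N):
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

print(max(range(M + n, 5000), key=p))    # also 1818
```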
Suppose X 1, X 2,..., X n is a random sample from a pdf f(x; θ) that is symmetric about θ, but the investigator is unsure of the form of the f function. It is then desirable to use an estimator \( \hat{\theta } \) that is robust, that is, one that performs well for a wide variety of underlying pdf’s. One such estimator is a trimmed mean. In recent years, statisticians have proposed another type of estimator, called an M-estimator, based on a generalization of maximum likelihood estimation. Instead of maximizing the log likelihood Σln[f(x; θ)] for a specified f, one seeks to maximize Σρ(x i ; θ). The “objective function” ρ is selected to yield an estimator with good robustness properties. The book by David Hoaglin et al. (see the bibliography) contains a good exposition on this subject.
7.2.7 Exercises: Section 7.2 (21–31)
-
21.
A random sample of n bike helmets manufactured by a company is selected. Let X = the number among the n that are flawed, and let p = P(flawed). Assume that only X is observed, rather than the sequence of S’s and F’s.
-
a.
Derive the maximum likelihood estimator of p. If n = 20 and x = 3, what is the estimate?
-
b.
Is the estimator of part (a) unbiased?
-
c.
If n = 20 and x = 3, what is the mle of the probability (1 – p)^5 that none of the next five helmets examined is flawed?
-
22.
Let X have a Weibull distribution with parameters α and β, so
$$ E(X) = \beta \cdot \Gamma (1 + 1/\alpha ) \qquad V(X) = \beta^2 \{ \Gamma (1 + 2/\alpha ) - [\Gamma (1 + 1/\alpha )]^2 \} $$
-
a.
Based on a random sample X 1,..., X n , write equations for the method of moments estimators of β and α. Show that, once the estimate of α has been obtained, the estimate of β can be found from a table of the gamma function and that the estimate of α is the solution to a complicated equation involving the gamma function.
-
b.
If n = 20, \( \bar{x} = 28.0 \), and \( \sum {x_i^2} = 16,500, \) compute the estimates. [Hint: [Γ(1.2)]2/Γ(1.4) =.95.]
-
23.
Let X denote the proportion of allotted time that a randomly selected student spends working on a certain aptitude test. Suppose the pdf of X is
$$ f(x;\theta ) = \begin{cases} (\theta + 1)x^{\theta} & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases} $$
where −1 < θ. A random sample of ten students yields data x 1 = .92, x 2 = .79, x 3 = .90, x 4 = .65, x 5 = .86, x 6 = .47, x 7 = .73, x 8 = .97, x 9 = .94, x 10 = .77.
-
a.
Use the method of moments to obtain an estimator of θ, and then compute the estimate for this data.
-
b.
Obtain the maximum likelihood estimator of θ, and then compute the estimate for the given data.
-
24.
Two different computer systems are monitored for a total of n weeks. Let X i denote the number of breakdowns of the first system during the ith week, and suppose the X i ’s are independent and drawn from a Poisson distribution with parameter λ 1. Similarly, let Y i denote the number of breakdowns of the second system during the ith week, and assume independence with each Y i Poisson with parameter λ 2. Derive the mle’s of λ 1, λ 2, and λ 1 – λ 2. [Hint: Using independence, write the joint pmf (likelihood) of the X i ’s and Y i ’s together.]
-
25.
Refer to Exercise 21. Instead of selecting n = 20 helmets to examine, suppose we examine helmets in succession until we have found r = 3 flawed ones. If the 20th helmet is the third flawed one (so that the number of helmets examined that were not flawed is x = 17), what is the mle of p? Is this the same as the estimate in Exercise 21? Why or why not? Is it the same as the estimate computed from the unbiased estimator of Exercise 17?
-
26.
Six Pepperidge Farm bagels were weighed, yielding the following data (grams):
117.6
109.5
111.6
109.2
119.1
110.8
(Note: 4 oz = 113.4 g)
-
a.
Assuming that the six bagels are a random sample and the weight is normally distributed, estimate the true average weight and standard deviation of the weight using maximum likelihood.
-
b.
Again assuming a normal distribution, estimate the weight below which 95% of all bagels will have their weights. [Hint: What is the 95th percentile in terms of μ and σ? Now use the invariance principle.]
-
c.
Suppose we choose another bagel and weigh it. Let X = weight of the bagel. Use the given data to obtain the mle of P(X ≤ 113.4). (Hint: P(X ≤ 113.4) = Φ[(113.4 – μ)/σ)].)
-
27.
Suppose a measurement is made on some physical characteristic whose value is known, and let X denote the resulting measurement error. For an unbiased measuring instrument or technique, the mean value of X is 0. Assume that any particular measurement error is normally distributed with variance σ 2. Let X 1,... X n be a random sample of measurement errors.
-
a.
Obtain the method of moments estimator of σ 2.
-
b.
Obtain the maximum likelihood estimator of σ 2.
-
28.
Let X 1,..., X n be a random sample from a gamma distribution with parameters α and β.
-
a.
Derive the equations whose solution yields the maximum likelihood estimators of α and β. Do you think they can be solved explicitly?
-
b.
Show that the mle of μ = αβ is \( \hat{\mu } = \overline{X} \).
-
29.
Let X 1, X 2,..., X n represent a random sample from the Rayleigh distribution with density function given in Exercise 15. Determine
-
a.
The maximum likelihood estimator of θ and then calculate the estimate for the vibratory stress data given in that exercise. Is this estimator the same as the unbiased estimator suggested in Exercise 15?
-
b.
The mle of the median of the vibratory stress distribution. [Hint: First express the median in terms of θ.]
-
30.
Consider a random sample X 1, X 2,..., X n from the shifted exponential pdf
$$ f(x;\lambda, \theta ) = \begin{cases} \lambda e^{-\lambda (x - \theta )} & x \geq \theta \\ 0 & \text{otherwise} \end{cases} $$
Taking θ = 0 gives the pdf of the exponential distribution considered previously (with positive density to the right of zero). An example of the shifted exponential distribution appeared in Example 4.5, in which the variable of interest was time headway in traffic flow and θ = .5 was the minimum possible time headway.
-
a.
Obtain the maximum likelihood estimators of θ and λ.
-
b.
If n = 10 time headway observations are made, resulting in the values 3.11, .64, 2.55, 2.20, 5.44, 3.42, 10.39, 8.93, 17.82, and 1.30, calculate the estimates of θ and λ.
-
31.
At time t = 0, 20 identical components are put on test. The lifetime distribution of each is exponential with parameter λ. The experimenter then leaves the test facility unmonitored. On his return 24 h later, the experimenter immediately terminates the test after noticing that y = 15 of the 20 components are still in operation (so 5 have failed). Derive the mle of λ. [Hint: Let Y = the number that survive 24 h. Then Y ~ Bin(n, p). What is the mle of p? Now notice that p = P(X i ≥ 24), where X i is exponentially distributed. This relates λ to p, so the former can be estimated once the latter has been.]
7.3 Sufficiency
An investigator who wishes to make an inference about some parameter θ will base conclusions on the value of one or more statistics – the sample mean \( \overline{X} \), the sample variance S 2, the sample range Y n − Y 1, and so on. Intuitively, some statistics will contain more information about θ than will others. Sufficiency, the topic of this section, will help us decide which functions of the data are most informative for making inferences.
As a first point, we note that a statistic T = t(X 1,..., X n ) will not be useful for drawing conclusions about θ unless the distribution of T depends on θ. Consider, for example, a random sample of size n = 2 from a normal distribution with mean μ and variance σ 2, and let T = X 1 − X 2. Then T has a normal distribution with mean 0 and variance 2σ 2, which does not depend on μ. Thus this statistic cannot be used as a basis for drawing any conclusions about μ, although it certainly does carry information about the variance σ2.
The relevance of this observation to sufficiency is as follows. Suppose an investigator is given the value of some statistic T, and then examines the conditional distribution of the sample X 1, X 2,..., X n given the value of the statistic – for example, the conditional distribution given that \( \overline{X} = 28.7 \). If this conditional distribution does not depend upon θ, then it can be concluded that there is no additional information about θ in the data over and above what is provided by T. In this sense, for purposes of making inferences about θ, it is sufficient to know the value of T, which contains all the information in the data relevant to θ.
Example 7.25
An investigation of major defects on new vehicles of a certain type involved selecting a random sample of n = 3 vehicles and determining for each one the value of X = the number of major defects. This resulted in observations x 1 = 1, x 2 = 0, and x 3 = 3. You, as a consulting statistician, have been provided with a description of the experiment, from which it is reasonable to assume that X has a Poisson distribution, and told only that the total number of defects for the three sampled vehicles was four.
Knowing that T = ∑X i = 4, would there be any additional advantage in having the observed values of the individual X i ’s when making an inference about the Poisson parameter λ? Or rather is it the case that the statistic T contains all relevant information about λ in the data? To address this issue, consider the conditional distribution of X 1, X 2, X 3 given that ∑X i = 4. First of all, there are only a few possible (x 1, x 2, x 3) triples for which x 1 + x 2 + x 3 = 4. For example, (0, 4, 0) is a possibility, as are (2, 2, 0) and (1, 0, 3), but not (1, 2, 3) or (5, 0, 2). That is,
Now consider the triple (2, 1, 1), which is consistent with ∑X i = 4. If we let A denote the event that X 1 = 2, X 2 = 1, and X 3 = 1 and B denote the event that ∑X i = 4, then the event A implies the event B (i.e., A is contained in B), so the intersection of the two events is just the smaller event A. Thus
A moment generating function argument shows that ∑X i has a Poisson distribution with parameter 3λ. Thus the desired conditional probability is
Similarly,
The complete conditional distribution is as follows:
This conditional distribution does not involve λ. Thus once the value of the statistic ∑X i has been provided, there is no additional information about λ in the individual observations.
To put this another way, think of obtaining the data from the experiment in two stages:
-
1.
Observe the value of T = X 1 + X 2 + X 3 from a Poisson distribution with parameter 3λ.
-
2.
Having observed T = 4, now obtain the individual x i’s from the conditional distribution
$$ P({X_1} = {x_1},{X_2} = {x_2},{X_3} = {x_3}|\textstyle\sum\limits_{{i = 1}}^3 {{X_i} = 4)} $$
Since the conditional distribution in step 2 does not involve λ, there is no additional information about λ resulting from the second stage of the data generation process. This argument holds more generally for any sample size n and any value t other than 4 (e.g., the total number of defects among ten randomly selected vehicles might be ∑X i = 16). Once the value of ∑X i is known, there is no further information in the data about the Poisson parameter.
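A short computational check (ours, not from the text) makes the two-stage argument tangible: computing the conditional distribution of (X 1, X 2, X 3) given ∑X i = 4 for two very different values of λ produces identical probabilities.

```python
from math import exp, factorial
from itertools import product

# Conditional distribution of (X1, X2, X3) given X1 + X2 + X3 = 4 for a
# Poisson(lam) sample (Example 7.25): it should not depend on lam.
def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def conditional(lam, total=4):
    triples = [t for t in product(range(total + 1), repeat=3) if sum(t) == total]
    joint = {t: poisson_pmf(t[0], lam) * poisson_pmf(t[1], lam) * poisson_pmf(t[2], lam)
             for t in triples}
    denom = sum(joint.values())
    return {t: joint[t] / denom for t in triples}

d1, d2 = conditional(0.5), conditional(7.3)
print(max(abs(d1[t] - d2[t]) for t in d1))   # essentially 0: no dependence on lambda
print(d1[(2, 1, 1)])                          # 12/81, the multinomial(4; 1/3,1/3,1/3) value
```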
DEFINITION
A statistic T = t(X 1,..., X n ) is said to be sufficient for making inferences about a parameter θ if the joint distribution of X 1, X 2 ,..., X n given that T = t does not depend upon θ for every possible value t of the statistic T.
The notion of sufficiency formalizes the idea that a statistic T contains all relevant information about θ. Once the value of T for the given data is available, it is of no benefit to know anything else about the sample.
7.3.1 The Factorization Theorem
How can a sufficient statistic be identified? It may seem as though one would have to select a statistic, determine the conditional distribution of the X i’s given any particular value of the statistic, and keep doing this until hitting paydirt by finding one that satisfies the defining condition. This would be terribly time-consuming, and when the X i’s are continuous there are additional technical difficulties in obtaining the relevant conditional distribution. Fortunately, the next result provides a relatively straightforward way of proceeding.
THE NEYMAN FACTORIZATION THEOREM
Let f(x 1, x 2,..., x n ; θ) denote the joint pmf or pdf of X 1, X 2 ,..., X n . Then T = t(X 1,..., X n ) is a sufficient statistic for θ if and only if the joint pmf or pdf can be represented as a product of two factors in which the first factor involves θ and the data only through t(x 1,..., x n ) whereas the second factor involves x 1,..., x n but does not depend on θ:
Before sketching a proof of this theorem, we consider several examples.
Example 7.26
Let’s generalize the previous example by considering a random sample X 1, X 2 ,..., X n from a Poisson distribution with parameter λ, for example, the numbers of blemishes on n independently selected DVD’s or the numbers of errors in n batches of invoices where each batch consists of 200 invoices. The joint pmf of these variables is
The factor inside the first set of parentheses involves the parameter λ and the data only through ∑x i , whereas the factor inside the second set of parentheses involves the data but not λ. So we have the desired factorization, and the sufficient statistic is T = ∑X i as we previously ascertained directly from the definition of sufficiency.
A sufficient statistic is not unique; any one-to-one function of a sufficient statistic is itself sufficient. In the Poisson example, the sample mean \( \overline{X} = (1/n)\sum {{X_i}} \) is a one-to-one function of ∑X i (knowing the value of the sum of the n observations is equivalent to knowing their mean), so the sample mean is also a sufficient statistic.
Example 7.27
Suppose that the waiting time for a bus on a weekday morning is uniformly distributed on the interval from 0 to θ, and consider a random sample X 1,..., X n of waiting times (i.e., times on n independently selected mornings). The joint pdf of these times is
To obtain the desired factorization, we introduce notation for an indicator function of an event A: I(A) = 1 if (x 1, x 2,..., x n ) lies in A and I(A) = 0 otherwise. Now let
That is, A is the indicator for the event that all x i ’s are between 0 and θ. But all n of the x i ’s will be between 0 and θ if and only if the smallest of the x i ’s is at least 0 and the largest is at most θ. Thus
We can now use this indicator function notation to write a one-line expression for the joint pdf:
The factor inside the square brackets involves θ and the x i ’s only through the function t(x 1,..., x n ) = max{x 1,..., x n }. Voila, we have our desired factorization, and the sufficient statistic for the uniform parameter θ is T = max{X 1,..., X n }, the largest order statistic. All the information about θ in this uniform random sample is contained in the largest of the n observations. This result is much more difficult to obtain directly from the definition of sufficiency.
7.3.2 Proof of the Factorization Theorem
A general proof when the X i’s constitute a random sample from a continuous distribution is fraught with technical details that are beyond the level of our text. So we content ourselves with a proof in the discrete case. For the sake of concise notation, denote X 1, X 2,..., X n by X and x 1, x 2,..., x n by x.
Suppose first that T = t(x) is sufficient, so that P(X = x | T = t) does not depend upon θ. Focus on a value t for which t(x) = t (e.g., x = 3, 0, 1, t(x) = ∑x i , so t = 4). The event that X = x is then identical to the event that both X = x and T = t because the former equality implies the latter one. Thus
Since the first factor in this latter product does not involve θ and the second one involves the data only through t, we have our desired factorization.
Now let’s go the other way: assume a factorization, and show that T is sufficient, i.e., that the conditional probability that X = x given that T = t does not involve θ.
Sure enough, this latter ratio does not involve θ.
7.3.3 Jointly Sufficient Statistics
When the joint pmf or pdf of the data involves a single unknown parameter θ, there is frequently a single statistic (single function of the data) that is sufficient. However, when there are several unknown parameters—for example, the mean μ and standard deviation σ of a normal distribution, or the shape parameter α and scale parameter β of a gamma distribution—we must expand our notion of sufficiency.
DEFINITION
Suppose the joint pmf or pdf of the data involves k unknown parameters θ 1, θ 2,..., θ k . The m statistics T 1 = t 1(X 1,..., X n ), T 2 = t 2(X 1,..., X n ),..., T m = t m (X 1,..., X n ) are said to be jointly sufficient for the parameters if the conditional distribution of the X i ’s given that T 1 = t 1, T 2 = t 2,..., T m = t m does not depend on any of the unknown parameters, and this is true for all possible values t 1, t 2,..., t m of the statistics.
Example 7.28
Consider a random sample of size n = 3 from a continuous distribution, and let T 1, T 2, and T 3 be the three order statistics – that is, T 1 = the smallest of the three X i ’s, T 2 = the second smallest X i , and T 3 = the largest X i (these order statistics were previously denoted by Y 1, Y 2, and Y 3.). Then for any values t 1, t 2, and t 3 satisfying t 1 < t 2 < t 3,
For example, if the three ordered values are 21.4, 23.8, and 26.0, then the conditional probability distribution of the three X i ’s places probability \( \frac{1}{6} \) on each of the 6 permutations of these three numbers (23.8, 21.4, 26.0, and so on). This conditional distribution clearly does not involve any unknown parameters.
Generalizing this argument to a sample of size n, we see that for a random sample from a continuous distribution, the order statistics are jointly sufficient for θ 1, θ 2,..., θ k regardless of whether k = 1 (e.g., the exponential distribution has a single parameter) or 2 (the normal distribution) or even k > 2.
The factorization theorem extends to the case of jointly sufficient statistics: T 1, T 2,..., T m are jointly sufficient for θ 1, θ 2,..., θ k if and only if the joint pmf or pdf of the X i ’s can be represented as a product of two factors, where the first involves the θ i ’s and the data only through t 1, t 2,..., t m and the second does not involve the θ i ’s.
Example 7.29
Let X 1,..., X n be a random sample from a normal distribution with mean μ and variance σ 2. The joint pdf is
This factorization shows that the two statistics \( \Sigma {{X_i}} \) and \( \Sigma {X_i^2} \) are jointly sufficient for the two parameters μ and σ 2. Since \( \Sigma {{{({X_i} - \overline{X})}^2}} = \Sigma {X_i^2} - n{\left( {\overline{X}} \right)^2} \) there is a one-to-one correspondence between the two sufficient statistics and the statistics \( \overline{X} \) and \( \Sigma {{{({X_i} - \overline{X})}^2}} \); that is, values of the two original sufficient statistics uniquely determine values of the latter two statistics, and vice-versa. This implies that the latter two statistics are also jointly sufficient, which in turn implies that the sample mean and sample variance (or sample standard deviation) are jointly sufficient statistics. The sample mean and sample variance encapsulate all the information about μ and σ 2 that is contained in the sample data.
7.3.4 Minimal Sufficiency
When X 1,..., X n constitute a random sample from a normal distribution, the n order statistics Y 1,..., Y n are jointly sufficient for μ and σ 2, and the sample mean and sample variance are also jointly sufficient. Both the order statistics and the pair \( (\overline{X},{S^2}) \) reduce the data without any information loss, but the sample mean and variance represent a greater reduction. In general, we would like the greatest possible reduction without information loss. A minimal (possibly jointly) sufficient statistic is a function of every other sufficient statistic. That is, given the value(s) of any other sufficient statistic(s), the value(s) of the minimal sufficient statistic(s) can be calculated. The minimal sufficient statistic is the sufficient statistic having the smallest dimensionality, and thus represents the greatest possible reduction of the data without any information loss.
A general discussion of minimal sufficiency is beyond the scope of our text. In the case of a normal distribution with values of both μ and σ 2 unknown, it can be shown that the sample mean and sample variance are jointly minimal sufficient (so the same is true of \( \sum {{X_i}} \) and \( \sum {X_i^2} \)). It is intuitively reasonable that because there are two unknown parameters, there should be a pair of sufficient statistics. It is indeed often the case that the number of (jointly) sufficient statistics matches the number of unknown parameters. But this is not always true. Consider a random sample X 1,..., X n from the pdf f(x; θ) = 1/{π[1 + (x − θ)^2]} for − ∞ < x < ∞, i.e., from a Cauchy distribution with location parameter θ. The graph of this pdf is bell shaped and centered at θ, but its tails decrease much more slowly than those of a normal density curve. Because the Cauchy distribution is continuous, the order statistics are jointly sufficient for θ. It would seem, though, that a single sufficient statistic (one-dimensional) could be found for the single parameter. Unfortunately this is not the case; it can be shown that the order statistics are minimal sufficient! So going beyond the order statistics to any single function of the X i ’s as a point estimator of θ entails a loss of information from the original data.
7.3.5 Improving an Estimator
Because a sufficient statistic contains all the information the data has to offer about the value of θ, it is reasonable that an estimator of θ or any function of θ should depend on the data only through the sufficient statistic. A general result due to Rao and Blackwell shows how to start with an unbiased statistic that is not a function of sufficient statistics and create an improved estimator that is sufficient.
THEOREM
Suppose that the joint distribution of X 1,..., X n depends on some unknown parameter θ and that T is sufficient for θ. Consider estimating h(θ), a specified function of θ. If U is an unbiased statistic for estimating h(θ) that does not involve T, then the estimator U* = E(U | T) is also unbiased for h(θ) and has variance no greater than the original unbiased estimator U.
7.3.6 Proof
First of all, we must show that U* is indeed an estimator—that it is a function of the X i’s which does not depend on θ. This follows because, given that T is sufficient, the distribution of U conditional on T does not involve θ, so the expected value calculated from the conditional distribution will of course not involve θ. The fact that U* has smaller variance than U is a consequence of a conditional expectation–conditional variance formula for V(U) introduced in Section 5.3:
$$ V(U) = V[E(U\mid T)] + E[V(U\mid T)] = V(U^*) + E[V(U\mid T)] $$
Because V(U | T), being a variance, is nonnegative, so is its expected value, and it follows that V(U) ≥ V(U*) as desired.
Example 7.30
Suppose that the number of major defects on a randomly selected new vehicle of a certain type has a Poisson distribution with parameter λ. Consider estimating e −λ, the probability that a vehicle has no such defects, based on a random sample of n vehicles. Let’s start with the estimator U = I(X 1 = 0), the indicator function of the event that the first vehicle in the sample has no defects. That is, U = 1 if X 1 = 0 and U = 0 otherwise.
Then
$$ E(U) = 1 \cdot P(X_1 = 0) + 0 \cdot P(X_1 > 0) = e^{-\lambda} $$
Our estimator is therefore unbiased for estimating the probability of no defects. The sufficient statistic here is T = ∑X i , so of course the estimator U is not a function of T. The improved estimator is U* = E(U | ∑X i ) = P(X 1 = 0 | ∑X i ). Let’s consider P(X 1 = 0 | ∑X i = t) where t is some non-negative integer. The event that X 1 = 0 and ∑X i = t is identical to the event that the first vehicle has no defects and the total number of defects on the last n−1 vehicles is t. Thus
A moment generating function argument shows that the sum of all n X i ’s has a Poisson distribution with parameter nλ and the sum of the last n − 1 X i ’s has a Poisson distribution with parameter (n − 1)λ. Furthermore, X 1 is independent of the other n − 1 X i ’s so it is independent of their sum, from which
The improved unbiased estimator is then U* = (1 − 1/n)^T. If, for example, there are a total of 15 defects among 10 randomly selected vehicles, then the estimate is \( (1 - \frac{1}{10})^{15} = .206 \). For this sample, \( \hat{\lambda } = \bar{x} = 1.5 \), so the maximum likelihood estimate of e −λ is e −1.5 = .223. Here as in some other situations the principles of unbiasedness and maximum likelihood are in conflict. However, if n is large, the improved estimate is \( {(1 - 1/n)^t} = {[{(1 - 1/n)^n}]^{{\bar{x}}}} \approx {e^{{ - \bar{x}}}} \), which is the mle. That is, the unbiased and maximum likelihood estimators are “asymptotically equivalent.”
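The following sketch (ours, not part of the original example) evaluates both estimates for the numbers above and then for a much larger sample with the same defect rate, illustrating the asymptotic agreement.

```python
import math

# Example 7.30: estimating exp(-lambda), the probability of zero defects.
def unbiased(t, n):       # Rao-Blackwell improved estimator (1 - 1/n)^t
    return (1 - 1 / n) ** t

def mle(t, n):            # maximum likelihood estimate exp(-xbar)
    return math.exp(-t / n)

print(unbiased(15, 10), mle(15, 10))        # about .206 and .223, as in the text

# With many more vehicles and the same defect rate the two nearly agree.
print(unbiased(1500, 1000), mle(1500, 1000))
```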
We have emphasized that in general there will not be a unique sufficient statistic. Suppose there are two different sufficient statistics T 1 and T 2 such that the first one is not a one-to-one function of the second (e.g., we are not considering T 1 = ∑X i and T 2 = \( \overline{X} \)). Then it would be distressing if we started with an unbiased estimator U and found that E(U | T 1) ≠ E(U | T 2), so our improved estimator depended on which sufficient statistic we used. Fortunately there are general conditions under which, starting with a minimal sufficient statistic T, the improved estimator is the MVUE (minimum variance unbiased estimator). That is, the new estimator is unbiased and has smaller variance than any other unbiased estimator. Please consult one of the chapter references for more detail.
7.3.7 Further Comments
Maximum likelihood is by far the most popular method for obtaining point estimates, so it would be disappointing if maximum likelihood estimators did not make full use of sample information. Fortunately the mle’s do not suffer from this defect. If T 1,..., T m are jointly sufficient statistics for parameters θ 1,..., θ k , then the joint pmf or pdf factors as follows:
The maximum likelihood estimates result from maximizing f(·) with respect to the θ i ’s. Because the h(·) factor does not involve the parameters, this is equivalent to maximizing the g(·) factor with respect to the θ i ’s. The resulting \( {\hat{\theta }_i} \)’s will involve the data only through the t i ’s. Thus it is always possible to find a maximum likelihood estimator that is a function of just the sufficient statistic(s). There are contrived examples of situations where the mle is not unique, in which case an mle that is not a function of the sufficient statistics can be constructed—but there is also one that is a function of the sufficient statistics.
The concept of sufficiency is very compelling when an investigator is sure the underlying distribution that generated the data is a member of some particular family (normal, exponential, etc.). However, two different families of distributions might each furnish plausible models for the data in a particular application, and yet the sufficient statistics for these two families might be different (an analogous comment applies to maximum likelihood estimation). For example, there are data sets for which a gamma probability plot suggests that a member of the gamma family would give a reasonable model and also a lognormal probability plot (normal probability plot of the logs of the observations) indicates that lognormality is plausible. Yet the jointly sufficient statistics for the parameters of the gamma family are not the same as those for the parameters of the lognormal family. When estimating some parameter θ in such situations (e.g., the mean μ or median \( \tilde{\mu } \)), one would look for a robust estimator that performs well for a wide variety of underlying distributions, as discussed in Section 7.1. Please consult a more advanced source for additional information.
7.3.8 Exercises: Section 7.3 (32–41)
-
32.
The long run proportion of vehicles that pass a certain emissions test is p. Suppose that three vehicles are independently selected for testing. Let X i = 1 if the ith vehicle passes the test and X i = 0 otherwise (i = 1, 2, 3), and let X = X 1 + X 2 + X 3. Use the definition of sufficiency to show that X is sufficient for p by obtaining the conditional distribution of the X i ’s given that X = x for each possible value x. Then generalize by giving an analogous argument for the case of n vehicles.
-
33.
Components of a certain type are shipped in batches of size k. Suppose that whether or not any particular component is satisfactory is independent of the condition of any other component, and that the long run proportion of satisfactory components is p. Consider n batches, and let X i denote the number of satisfactory components in the ith batch (i = 1, 2,..., n). Statistician A is provided with the values of all the X i ’s, whereas statistician B is given only the value of X = ∑X i . Use a conditional probability argument to decide whether statistician A has more information about p than does statistician B.
-
34.
Let X 1,..., X n be a random sample of component lifetimes from an exponential distribution with parameter λ. Use the factorization theorem to show that ∑X i is a sufficient statistic for λ.
-
35.
Identify a pair of jointly sufficient statistics for the two parameters of a gamma distribution based on a random sample of size n from that distribution.
-
36.
Suppose waiting time for delivery of an item is uniform on the interval from θ 1 to θ 2 (so f(x; θ 1, θ 2) = 1/(θ 2 − θ 1) for θ 1 < x < θ 2 and is 0 otherwise). Consider a random sample of n waiting times, and use the factorization theorem to show that min(X i ), max(X i ) is a pair of jointly sufficient statistics for θ 1 and θ 2. [Hint: Introduce an appropriate indicator function as we did in Example 7.27.]
-
37.
For θ > 0 consider a random sample from a uniform distribution on the interval from θ to 2θ (pdf 1/θ for θ < x < 2θ), and use the factorization theorem to determine a sufficient statistic for θ.
-
38.
Suppose that survival time X has a lognormal distribution with parameters μ and σ (which are the mean and standard deviation of ln(X), not of X itself). Are ∑X i and \( \sum {X_i^2} \) jointly sufficient for the two parameters? If not, what is a pair of jointly sufficient statistics?
-
39.
The probability that any particular component of a certain type works in a satisfactory manner is p. If n of these components are independently selected, then the statistic X, the number among the selected components that perform in a satisfactory manner, is sufficient for p. You must purchase two of these components for a particular system. Obtain an unbiased statistic for the probability that exactly one of your purchased components will perform in a satisfactory manner. [Hint: Start with the statistic U, the indicator function of the event that exactly one of the first two components in the sample of size n performs as desired, and improve on it by conditioning on the sufficient statistic.]
-
40.
In Example 7.30, we started with U = I(X 1 = 0) and used a conditional expectation argument to obtain an unbiased estimator of the zero-defect probability based on the sufficient statistic. Consider now starting with a different statistic: U = [∑I(X i = 0)]/n. Show that the improved estimator based on the sufficient statistic is identical to the one obtained in the cited example. [Hint: Use the general property E(Y + Z|T) = E(Y|T) + E(Z|T).]
-
41.
A particular quality characteristic of items produced using a certain process is known to be normally distributed with mean μ and standard deviation 1. Let X denote the value of the characteristic for a randomly selected item. An unbiased estimator for the parameter θ = P(X ≤ c), where c is a critical threshold, is desired. The estimator will be based on a random sample X 1,..., X n .
-
a.
Obtain a sufficient statistic for μ.
-
b.
Consider the estimator \( \hat{\theta } = I({X_1} \leq c) \). Obtain an improved unbiased estimator based on the sufficient statistic (it is actually the minimum variance unbiased estimator). [Hint: You may use the following facts: (1) The joint distribution of X 1 and \( \overline{X} \) is bivariate normal with means μ and μ, respectively, variances 1 and 1/n, respectively, and correlation ρ (which you should determine). (2) If Y 1 and Y 2 have a bivariate normal distribution, then the conditional distribution of Y 1 given that Y 2 = y 2 is normal with mean μ 1 + (ρσ 1/σ 2)(y 2 − μ 2) and variance \( \sigma_1^2(1 - \rho^2) \).]
7.4 Information and Efficiency
In this section we introduce the idea of Fisher information and two of its applications. The first application is to find the minimum possible variance for an unbiased estimator. The second application is to show that the maximum likelihood estimator is asymptotically unbiased and normal (that is, for large n it has expected value approximately θ and it has approximately a normal distribution) with the minimum possible variance.
Here the notation f(x; θ) will be used for a probability mass function or a probability density function with unknown parameter θ. The Fisher information is intended to measure the precision in a single observation. Consider the random variable U obtained by taking the partial derivative of ln[f(x; θ)] with respect to θ and then replacing x by X: U = ∂ ln[f(X; θ)]/∂θ. For example, if the pdf is θx^{θ−1} for 0 < x < 1 (θ > 0), then ∂[ln(θx^{θ−1})]/∂θ = ∂[ln(θ) + (θ − 1)ln(x)]/∂θ = 1/θ + ln(x), so U = ln(X) + 1/θ.
DEFINITION
The Fisher information I(θ) in a single observation from a pmf or pdf f(x; θ) is the variance of the random variable U = ∂ ln[f(X; θ)]/∂θ:
$$ I(\theta ) = V\left[ \frac{\partial}{\partial \theta}\ln f(X;\theta ) \right] $$
It may seem strange to differentiate the logarithm of the pmf or pdf, but this is exactly what is often done in maximum likelihood estimation. In what follows we will assume that f(x; θ) is a pmf, but everything that we do will apply also in the continuous case if appropriate assumptions are made. In particular, it is important to assume that the set of possible x’s does not depend on the value of the parameter.
When f(x; θ) is a pmf, we know that \( 1 = \sum\nolimits_x {f(x;\theta )} \). Therefore, differentiating both sides with respect to θ and using the fact that [ln(f)]′ = f′/f, we find that the mean of U is 0:
This involves interchanging the order of differentiation and summation, which requires certain technical assumptions if the set of possible x values is infinite. We will omit those assumptions here and elsewhere in this section, but we emphasize that switching differentiation and summation (or integration) is not allowed if the set of possible values depends on θ. For example, if the summation were from –θ to θ, the limits themselves would change with θ, and additional terms accounting for the limits of summation would be needed.
There is an alternative expression for I(θ) that is sometimes easier to compute than the variance in the definition:
$$ I(\theta ) = - E\left[ \frac{\partial^2}{\partial \theta^2}\ln f(X;\theta ) \right] \qquad (7.9) $$
This is a consequence of taking another derivative in (7.8):
To complete the derivation of (7.9), recall that U has mean 0, so its variance is
where Equation (7.10) is used in the last step.
Example 7.31
Let X be a Bernoulli rv, so f(x; p) = p^x(1 – p)^{1–x}, x = 0, 1. Then
$$ \frac{\partial}{\partial p}\ln [f(x;p)] = \frac{\partial}{\partial p}[x\ln (p) + (1 - x)\ln (1 - p)] = \frac{x}{p} - \frac{1 - x}{1 - p} \qquad (7.11) $$
This has mean 0, in accord with Equation (7.8), because E(X) = p. Computing the variance of the partial derivative, we get the Fisher information:
The alternative method uses Equation (7.9). Differentiating Equation (7.11) with respect to p gives
Taking the negative of the expected value in Equation (7.13) gives the information in an observation:
Both methods yield the answer I(p) = 1/[p(1 – p)], which says that the information is the reciprocal of V(X). It is reasonable that the information is greatest when the variance is smallest.
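As a check (ours, not from the text), the variance of the score can be computed directly by enumerating the two possible values of X; the result matches 1/[p(1 – p)] for any p.

```python
# Fisher information in one Bernoulli observation (Example 7.31).
# The score is U = d/dp ln[p^x (1-p)^(1-x)] = x/p - (1-x)/(1-p).
def fisher_info(p):
    scores = {x: x / p - (1 - x) / (1 - p) for x in (0, 1)}
    probs = {0: 1 - p, 1: p}
    mean_u = sum(probs[x] * scores[x] for x in (0, 1))          # should be 0
    var_u = sum(probs[x] * (scores[x] - mean_u) ** 2 for x in (0, 1))
    return var_u

for p in (0.1, 0.3, 0.5):
    print(p, fisher_info(p), 1 / (p * (1 - p)))   # the two columns agree
```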
7.4.1 Information in a Random Sample
Now assume a random sample X 1, X 2, …, X n from a distribution with pmf or pdf f(x; θ). Let f(X 1, X 2, …, X n ; θ) = f(X 1; θ) · f(X 2; θ) · ċċċ · f(X n ; θ) be the likelihood function. The Fisher information I n (θ) for the random sample is the variance of the score function
The log of a product is the sum of the logs, so the score function is a sum:
This is a sum of terms for which the mean is zero, by Equation (7.8), and therefore
The right-hand side of Equation (7.15) is a sum of independent identically distributed random variables, and each has variance I(θ). Taking the variance of both sides of Equation (7.15) gives the information I n (θ) in the random sample:
$$ I_n(\theta ) = nI(\theta ) $$
Therefore, the Fisher information in a random sample is just n times the information in a single observation. This should make sense intuitively, because it says that twice as many observations yield twice as much information.
Example 7.32
Continuing with Example 7.31, let X 1, X 2, …, X n be a random sample from the Bernoulli distribution with f(x; p) = p^x(1 – p)^{1–x}, x = 0, 1. Suppose the purpose is to estimate the proportion p of drivers who are wearing seat belts. We saw that the information in a single observation is I(p) = 1/[p(1 – p)], and therefore the Fisher information in the random sample is I n (p) = nI(p) = n/[p(1 – p)].
7.4.2 The Cramér-Rao Inequality
We will use the concept of Fisher information to show that if t(X 1, X 2, …, X n ) is an unbiased estimator of θ, then its minimum possible variance is the reciprocal of I n (θ). Harald Cramér in Sweden and C. R. Rao in India independently derived this inequality during World War II, but R. A. Fisher had some notion of it 20 years previously.
THEOREM (CRAMÉR-RAO INEQUALITY)
Assume a random sample X 1, X 2, …, X n from the distribution with pmf or pdf f(x; θ) such that the set of possible values does not depend on θ. If the statistic T = t(X 1, X 2, …, X n ) is an unbiased estimator for the parameter θ, then
$$ V(T) \geq \frac{1}{nI(\theta )} $$
7.4.3 Proof
The basic idea here is to consider the correlation ρ between T and the score function, and the desired inequality will result from −1 ≤ ρ ≤ 1. If T = t(X 1, X 2, …, X n ) is an unbiased estimator of θ, then
Differentiating this with respect to θ,
Multiplying and dividing the last term by the likelihood f(x 1, …, x n ;θ) gives
which is equivalent to
Therefore, because of Equation 7.16, the covariance of T with the score function is 1:
Recall from Section 5.2 that the correlation between two rv’s X and Y is ρ X,Y = Cov(X, Y)/(σ X σ Y ), and that −1 ≤ ρ X,Y ≤ 1. Therefore,
Apply this to Equation 7.18:
Dividing both sides by the variance of the score function and using the fact that this variance equals nI(θ), we obtain the desired result.
Because the variance of T must be at least 1/nI(θ), it is natural to call T an efficient estimator of θ if V(T) = 1/[nI(θ)].
DEFINITION
Let T be an unbiased estimator of θ. The ratio of the Cramér–Rao lower bound to the variance of T is its efficiency. T is said to be an efficient estimator if it achieves the Cramér–Rao lower bound (its efficiency is 1). An efficient estimator is a minimum variance unbiased estimator (MVUE), as discussed in Section 7.1.
Example 7.33
Continuing with Example 7.32, let X 1, X 2, …, X n be a random sample from the Bernoulli distribution, where the purpose is to estimate the proportion p of drivers who are wearing seat belts. We saw that the information in the sample is I n (p) = n/[p(1 – p)], and therefore the Cramér–Rao lower bound is 1/I n (p) = p(1 – p)/n. Let T(X 1, X 2, …, X n ) = \( {\hat p} = \overline X = \sum {{X_i}} /n \). Then \( E(T) = E(\sum {{X_i}} )/n = np/n = p \) so T is unbiased, and \( V(T) = V(\sum {{X_i}} )/{n^2} = np(1 - p)/{n^2} = p(1 - p)/n \). Because T is unbiased and V(T) is equal to the lower bound, T has efficiency 1 and therefore it is an efficient estimator.
7.4.4 Large Sample Properties of the MLE
As discussed in Section 7.2, the maximum likelihood estimator \( \hat{\theta } \) has some nice properties. First of all it is consistent, which means that it converges in probability to the parameter θ as the sample size increases. A verification of this is beyond the level of this book, but we can use it as a basis for showing that the mle is asymptotically normal with mean θ (asymptotic unbiasedness) and variance equal to the Cramér–Rao lower bound.
THEOREM
Given a random sample X 1, X 2, …, X n from a distribution with pmf or pdf f(x; θ), assume that the set of possible x values does not depend on θ. Then for large n the maximum likelihood estimator \( \hat {\theta } \) has approximately a normal distribution with mean θ and variance 1/[nI(θ)]. More precisely, the limiting distribution of \( \sqrt {n} (\,\hat{\theta } - \theta ) \) is normal with mean 0 and variance 1/I(θ).
7.4.5 Proof
Consider the score function
$$ S(\theta) = \frac{\partial}{\partial \theta} \ln f(X_1, \ldots, X_n; \theta) $$
Its derivative S′(θ) at the true θ is approximately equal to the difference quotient
$$ S'(\theta) \approx \frac{S(\,\hat{\theta}\,) - S(\theta)}{\hat{\theta} - \theta} \qquad (7.20) $$
and the error approaches zero asymptotically because \( \hat {\theta } \) approaches θ (consistency). Equation (7.20) connects the mle \( \hat {\theta } \) to the score function, so the asymptotic behavior of the score function can be applied to \( \hat {\theta } \). Because \( \hat {\theta } \) is the maximum likelihood estimate, \( S(\,\hat{\theta } ) = 0 \), so in the limit,
$$ \hat{\theta} - \theta \approx \frac{-S(\theta)}{S'(\theta)} $$
Multiplying both sides by \( \sqrt {n} \), then dividing numerator and denominator by \( n\sqrt {{I(\theta )}} \),
$$ \sqrt{n}\,(\hat{\theta} - \theta) \approx \frac{S(\theta)\big/\big[\sqrt{n}\,\sqrt{I(\theta)}\,\big]}{-S'(\theta)\big/\big[n\sqrt{I(\theta)}\,\big]} \qquad (7.21) $$
Now rewrite S(θ) and S′(θ) as sums using Equation 7.15:
$$ \sqrt{n}\,(\hat{\theta} - \theta) \approx \frac{\dfrac{1}{n}\left\{\displaystyle\sum_{i=1}^{n} \frac{\partial}{\partial \theta} \ln f(X_i; \theta)\right\}\Big/\sqrt{I(\theta)/n}}{\dfrac{1}{n}\left\{-\displaystyle\sum_{i=1}^{n} \frac{\partial^2}{\partial \theta^2} \ln f(X_i; \theta)\right\}\Big/\sqrt{I(\theta)}} $$
The denominator braces contain a sum of independent identically distributed rv’s, each with mean
$$ E\left[ -\frac{\partial^2}{\partial \theta^2} \ln f(X_i; \theta) \right] = I(\theta) $$
by Equation (7.9). Therefore, by the law of large numbers, the average in the denominator braces converges to I(θ), and thus the denominator converges to \( \sqrt {{I(\theta )}} \). The average in the numerator braces is the mean of independent identically distributed rv’s with mean 0 [by Equation (7.8)] and variance I(θ), so the numerator ratio is an average minus its expected value, divided by its standard deviation. Therefore, by the Central Limit Theorem it is approximately normal with mean 0 and standard deviation 1. Thus, the ratio in Equation (7.21) has a numerator that is approximately N(0, 1) and a denominator that is approximately \( \sqrt {{I(\theta )}} \), so the ratio is approximately N(0, \( 1/(\sqrt{I(\theta)}\,)^2 \)) = N(0, 1/I(θ)). That is, \( \sqrt {n} (\hat{\theta } - \theta ) \) is approximately N(0, 1/I(θ)), and it follows that \( \hat{\theta }\) is approximately normal with mean θ and variance 1/[nI(θ)], the Cramér–Rao lower bound.
Example 7.34
Continuing with the previous example, let X 1, X 2, …, X n be a random sample from the Bernoulli distribution. The objective is to estimate the proportion p of drivers who are wearing seat belts. The pmf is \( f(x; p) = p^x(1 - p)^{1-x} \), x = 0, 1, so the likelihood is
$$ f(x_1, \ldots, x_n; p) = p^{\sum x_i} (1 - p)^{\,n - \sum x_i} $$
Then the log likelihood is
$$ \ln f(x_1, \ldots, x_n; p) = \left( \sum x_i \right) \ln(p) + \left( n - \sum x_i \right) \ln(1 - p) $$
and therefore its derivative, the score function, is
$$ S(p) = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p} $$
Conclude that the maximum likelihood estimator is \( {\hat p} = \overline X = \sum {{X_i}} /n \). Recall from Example 7.33 that this is unbiased and efficient with the minimum variance of the Cramér–Rao inequality. It is also asymptotically normal by the Central Limit Theorem. These properties are in accord with the asymptotic distribution given by the theorem, \( \hat{p} \sim N(\,p,1/[nI(\,p)]) \).
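To illustrate the asymptotic normality result in this Bernoulli setting, here is a minimal simulation sketch (Python with NumPy assumed; the choices n = 200, p = 0.3, and the replication count are illustrative, not from the text). It checks that \( \sqrt{n}(\hat p - p) \) has mean near 0 and variance near 1/I(p) = p(1 – p).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 200, 0.3, 20_000

p_hats = rng.binomial(n, p, size=reps) / n
z = np.sqrt(n) * (p_hats - p)     # should be approximately N(0, p(1 - p))

print("mean of sqrt(n)(p-hat - p):", z.mean())
print("variance:", z.var(), " vs 1/I(p) = p(1 - p) =", p * (1 - p))
```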
Example 7.35
Let X 1, X 2, …, X n be a random sample from the distribution with pdf \( f(x; \theta) = \theta x^{\theta - 1} \) for 0 < x < 1, assuming θ > 0. Here X i , i = 1, 2, …, n, represents the fraction of a perfect score assigned to the ith applicant by a recruiting team. The Fisher information is the variance of
$$ U = \frac{\partial}{\partial \theta} \ln f(X; \theta) = \frac{\partial}{\partial \theta} \left[ \ln(\theta) + (\theta - 1)\ln(X) \right] = \frac{1}{\theta} + \ln(X) $$
However, it is easier to use the alternative method of Equation (7.9):
$$ I(\theta) = -E\left[ \frac{\partial^2}{\partial \theta^2} \ln f(X; \theta) \right] = -E\left[ -\frac{1}{\theta^2} \right] = \frac{1}{\theta^2} $$
To obtain the maximum likelihood estimator, we first find the log likelihood:
$$ \ln f(x_1, \ldots, x_n; \theta) = n \ln(\theta) + (\theta - 1) \sum \ln(x_i) $$
Its derivative, the score function, is
$$ S(\theta) = \frac{n}{\theta} + \sum \ln(x_i) $$
Setting this to 0, we find that the maximum likelihood estimate is
$$ \hat{\theta} = \frac{-n}{\sum \ln(X_i)} = \frac{-1}{\frac{1}{n}\sum \ln(X_i)} \qquad (7.22) $$
The expected value of ln(X) is −1/θ, because E(U) = 0, so the denominator of (7.22) converges in probability to −1/θ by the law of large numbers. Therefore \( \hat{\theta } \) converges in probability to θ, which means that \( \hat{\theta } \) is consistent. We knew this because the mle is always consistent, but it is also nice to show it directly. By the theorem, the asymptotic distribution of \( \hat{\theta } \) is normal with mean θ and variance 1/[nI(θ)] = θ 2/n.
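The conclusions of this example can also be checked by simulation. In the sketch below (Python with NumPy assumed; θ = 2, n = 100, and the replication count are illustrative choices, not from the text), observations are generated by the inverse-CDF method, since F(x) = x^θ on (0, 1) gives X = U^{1/θ} for U uniform; the mle \( \hat\theta = -n/\sum \ln(X_i) \) should then have mean close to θ and variance close to θ²/n.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 100, 10_000

# Inverse-CDF sampling: F(x) = x**theta on (0, 1), so X = U**(1/theta)
u = rng.uniform(size=(reps, n))
x = u ** (1.0 / theta)

# mle from the score equation n/theta + sum(ln x) = 0
theta_hats = -n / np.log(x).sum(axis=1)

print("mean of mle:", theta_hats.mean())
print("variance of mle:", theta_hats.var(), " vs theta^2/n =", theta**2 / n)
```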
7.4.6 Exercises: Section 7.4 (42–48)
42. Assume that the number of defects in a car has a Poisson distribution with parameter λ. To estimate λ we obtain the random sample X 1, X 2, …, X n .
a. Find the Fisher information in a single observation using two methods.
b. Find the Cramér–Rao lower bound for the variance of an unbiased estimator of λ.
c. Use the score function to find the mle of λ and show that the mle is an efficient estimator.
d. Is the asymptotic distribution of the mle in accord with the second theorem? Explain.
43. In Example 7.23, f(x; θ) = 1/θ for 0 ≤ x ≤ θ and 0 otherwise. Given a random sample, the maximum likelihood estimate \( \hat{\theta } \) is the largest observation.
a. Letting \( \tilde{\theta } = [(n + 1)/n]\hat{\theta } \), show that \( \tilde{\theta } \) is unbiased and find its variance.
b. Find the Cramér–Rao lower bound for the variance of an unbiased estimator of θ.
c. Compare the answers in parts (a) and (b) and explain why it is apparent that they disagree. What assumption is violated, causing the theorem not to apply here?
44. Survival times have the exponential distribution with pdf \( f(x; \lambda) = \lambda e^{-\lambda x} \), x ≥ 0, and f(x; λ) = 0 otherwise, where λ > 0. However, we wish to estimate the mean μ = 1/λ based on the random sample X 1, X 2, …, X n , so let’s re-express the pdf in the form \( (1/\mu)e^{-x/\mu} \).
a. Find the information in a single observation and the Cramér–Rao lower bound.
b. Use the score function to find the mle of μ.
c. Find the mean and variance of the mle.
d. Is the mle an efficient estimator? Explain.
45. Let X 1, X 2, …, X n be a random sample from the normal distribution with known standard deviation σ.
a. Find the mle of μ.
b. Find the distribution of the mle.
c. Is the mle an efficient estimator? Explain.
d. How does the answer to part (b) compare with the asymptotic distribution given by the second theorem?
46. Let X 1, X 2, …, X n be a random sample from the normal distribution with known mean μ but with the variance σ 2 as the unknown parameter.
a. Find the information in a single observation and the Cramér–Rao lower bound.
b. Find the mle of σ 2.
c. Find the distribution of the mle.
d. Is the mle an efficient estimator? Explain.
e. Is the answer to part (c) in conflict with the asymptotic distribution of the mle given by the second theorem? Explain.
47. Let X 1, X 2, …, X n be a random sample from the normal distribution with known mean μ but with the standard deviation σ as the unknown parameter.
a. Find the information in a single observation.
b. Compare the answer in part (a) to the answer in part (a) of Exercise 46. Does the information depend on the parameterization?
48. Let X 1, X 2, …, X n be a random sample from a continuous distribution with pdf f(x; θ). For large n, the variance of the sample median is approximately \( 1/\{4n[f(\tilde{\mu }; \theta)]^2\} \). If X 1, X 2, …, X n is a random sample from the normal distribution with known standard deviation σ and unknown μ, determine the efficiency of the sample median.
7.4.7 Supplementary Exercises (49–63)
49. At time t = 0, there is one individual alive in a certain population. A pure birth process then unfolds as follows. The time until the first birth is exponentially distributed with parameter λ. After the first birth, there are two individuals alive. The time until the first gives birth again is exponential with parameter λ, and similarly for the second individual. Therefore, the time until the next birth is the minimum of two exponential (λ) variables, which is exponential with parameter 2λ. Similarly, once the second birth has occurred, there are three individuals alive, so the time until the next birth is an exponential rv with parameter 3λ, and so on (the memoryless property of the exponential distribution is being used here). Suppose the process is observed until the sixth birth has occurred and the successive birth times are 25.2, 41.7, 51.2, 55.5, 59.5, 61.8 (from which you should calculate the times between successive births). Derive the mle of λ. [Hint: The likelihood is a product of exponential terms.]
50. Let X 1,…,X n be a random sample from a uniform distribution on the interval [−θ, θ].
a. Determine the mle of θ. [Hint: Look back at what we did in Example 7.23.]
b. Give an intuitive argument for whether the mle is biased or unbiased.
c. Determine a sufficient statistic for θ. [Hint: See Example 7.27.]
d. Determine the joint pdf of the smallest order statistic Y 1 (= min(X i )) and the largest order statistic Y n (= max(X i )). [Hint: In Section 5.5 we determined the joint pdf of two particular order statistics.] Then use it to obtain the expected value of the mle. [Hint: Draw the region of joint positive density for Y 1 and Y n , and identify what the mle is for each part of this region.]
e. What is an unbiased estimator for θ?
51. Carry out the details for minimizing MSE in Example 7.6: show that c = 1/(n + 1) minimizes the MSE of \( \hat{\sigma}^2 = c\sum (X_i - \overline{X})^2 \) when the population distribution is normal.
52. Let X 1,..., X n be a random sample from a pdf that is symmetric about μ. An estimator for μ that has been found to perform well for a variety of underlying distributions is the Hodges–Lehmann estimator. To define it, first compute for each i ≤ j and each j = 1, 2,..., n the pairwise average \( \overline{X}_{i,j} = (X_i + X_j)/2 \). Then the estimator is \( \hat{\mu } \) = the median of the \( \overline{X}_{i,j} \)’s. Compute the value of this estimate using the data of Exercise 41 of Chapter 1. [Hint: Construct a square table with the x i ’s listed on the left margin and on top. Then compute averages on and above the diagonal. A short computational sketch follows this exercise.]
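As a computational side note (not part of the exercise statement), the pairwise-average construction just described can be organized mechanically. This minimal Python sketch assumes NumPy and an array x holding the sample; the function name is only illustrative.

```python
import numpy as np

def hodges_lehmann(x):
    """Median of the pairwise averages (x_i + x_j)/2 over all i <= j."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Averages on and above the diagonal of the n-by-n table in the hint
    averages = [(x[i] + x[j]) / 2.0 for i in range(n) for j in range(i, n)]
    return np.median(averages)
```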
53. For a normal population distribution, the statistic median\( \{|X_1 - \widetilde{X}|, \ldots, |X_n - \widetilde{X}|\}/.6745 \) can be used to estimate σ. This estimator is more resistant to the effects of outliers (observations far from the bulk of the data) than is the sample standard deviation. Compute both the corresponding point estimate and s for the data of Example 7.2.
54. When the sample standard deviation S is based on a random sample from a normal population distribution, it can be shown that
$$ E(S) = \sqrt{2/(n - 1)}\; \Gamma(n/2)\,\sigma \,/\, \Gamma[(n - 1)/2] $$
Use this to obtain an unbiased estimator for σ of the form cS. What is c when n = 20?
55. Each of n specimens is to be weighed twice on the same scale. Let X i and Y i denote the two observed weights for the ith specimen. Suppose X i and Y i are independent of each other, each normally distributed with mean value μ i (the true weight of specimen i) and variance σ 2.
a. Show that the maximum likelihood estimator of σ 2 is \( \hat{\sigma}^2 = \sum (X_i - Y_i)^2/(4n) \). [Hint: If \( \bar{z} = (z_1 + z_2)/2 \), then \( \sum (z_i - \bar{z})^2 = (z_1 - z_2)^2/2 \).]
b. Is the mle \( \hat{\sigma}^2 \) an unbiased estimator of σ 2? Find an unbiased estimator of σ 2. [Hint: For any rv Z, \( E(Z^2) = V(Z) + [E(Z)]^2 \). Apply this to Z = X i – Y i .]
56. For 0 < θ < 1, consider a random sample from a uniform distribution on the interval from θ to 1/θ. Identify a sufficient statistic for θ.
57. Let p denote the proportion of all individuals who are allergic to a particular medication. An investigator tests individual after individual to obtain a group of r individuals who have the allergy. Let X i = 1 if the ith individual tested has the allergy and X i = 0 otherwise (i = 1, 2, 3,...). Recall that in this situation, X = the number of nonallergic individuals tested prior to obtaining the desired group has a negative binomial distribution. Use the definition of sufficiency to show that X is a sufficient statistic for p.
58. The fraction of a bottle that is filled with a particular liquid is a continuous random variable X with pdf \( f(x; \theta) = \theta x^{\theta - 1} \) for 0 < x < 1 (where θ > 0).
a. Obtain the method of moments estimator for θ.
b. Is the estimator of (a) a sufficient statistic? If not, what is a sufficient statistic, and what is an estimator of θ (not necessarily unbiased) based on a sufficient statistic?
59. Let X 1,..., X n be a random sample from a normal distribution with both μ and σ unknown. An unbiased estimator of θ = P(X ≤ c) based on the jointly sufficient statistics is desired. Let \( k = \sqrt{n/(n - 1)} \) and \( w = (c - \bar{x})/s \). Then it can be shown that the minimum variance unbiased estimator for θ is
$$ \hat{\theta} = \begin{cases} 0 & kw \leq -1 \\[4pt] P\!\left( T < \dfrac{kw\sqrt{n - 2}}{\sqrt{1 - k^2 w^2}} \right) & -1 < kw < 1 \\[4pt] 1 & kw \geq 1 \end{cases} $$
where T has a t distribution with n – 2 df. The article “Big and Bad: How the S.U.V. Ran over Automobile Safety” (The New Yorker, Jan. 24, 2004) reported that when an engineer with Consumers Union (the product testing and rating organization that publishes Consumer Reports) performed three different trials in which a Chevrolet Blazer was accelerated to 60 mph and then suddenly braked, the stopping distances (ft) were 146.2, 151.6, and 153.4, respectively. Assuming that braking distance is normally distributed, obtain the minimum variance unbiased estimate for the probability that distance is at most 150 ft, and compare to the maximum likelihood estimate of this probability.
60. Here is a result that allows for easy identification of a minimal sufficient statistic: Suppose there is a function t(x 1,..., x n ) such that for any two sets of observations x 1,..., x n and y 1,..., y n , the likelihood ratio f(x 1,..., x n ; θ)/f(y 1,..., y n ; θ) doesn’t depend on θ if and only if t(x 1,..., x n ) = t(y 1,..., y n ). Then T = t(X 1,..., X n ) is a minimal sufficient statistic. The result is also valid if θ is replaced by θ 1,..., θ k , in which case there will typically be several jointly minimal sufficient statistics. For example, if the underlying pdf is exponential with parameter λ, then the likelihood ratio is \( e^{-\lambda(\sum x_i - \sum y_i)} \), which will not depend on λ if and only if \( \sum x_i = \sum y_i \), so T = \( \sum x_i \) is a minimal sufficient statistic for λ (and so is the sample mean).
a. Identify a minimal sufficient statistic when the X i ’s are a random sample from a Poisson distribution.
b. Identify a minimal sufficient statistic or jointly minimal sufficient statistics when the X i ’s are a random sample from a normal distribution with mean θ and variance θ.
c. Identify a minimal sufficient statistic or jointly minimal sufficient statistics when the X i ’s are a random sample from a normal distribution with mean θ and standard deviation θ.
61. The principle of unbiasedness (prefer an unbiased estimator to any other) has been criticized on the grounds that in some situations the only unbiased estimator is patently ridiculous. Here is one such example. Suppose that the number of major defects X on a randomly selected vehicle has a Poisson distribution with parameter λ. You are going to purchase two such vehicles and wish to estimate θ = P(X 1 = 0, X 2 = 0) = \( e^{-2\lambda} \), the probability that neither of these vehicles has any major defects. Your estimate is based on observing the value of X for a single vehicle. Denote this estimator by \( \hat{\theta } = \delta (X) \). Write the equation implied by the condition of unbiasedness, E[δ(X)] = \( e^{-2\lambda} \), cancel \( e^{-\lambda} \) from both sides, then expand what remains on the right-hand side in an infinite series, and compare the two sides to determine δ(X). If X = 200, what is the estimate? Does this seem reasonable? What is the estimate if X = 199? Is this reasonable?
62. Let X, the payoff from playing a certain game, have pmf
$$ f(x; \theta) = \begin{cases} \theta & x = -1 \\ (1 - \theta)^2 \theta^x & x = 0, 1, 2, \ldots \end{cases} $$
a. Verify that f(x; θ) is a legitimate pmf, and determine the expected payoff. [Hint: Look back at the properties of a geometric random variable discussed in Chapter 3.]
b. Let X 1,..., X n be the payoffs from n independent games of this type. Determine the mle of θ. [Hint: Let Y denote the number of observations among the n that equal −1 (that is, Y = ΣI(X i = −1), where I(A) = 1 if the event A occurs and 0 otherwise), and write the likelihood as a single expression in terms of \( \sum x_i \) and y.]
c. What is the approximate variance of the mle when n is large?
63. Let x denote the number of items in an order and y denote time (min) necessary to process the order. Processing time may be determined by various factors other than order size. So for any particular value of x, we now regard the value of total production time as a random variable Y. Consider the following data obtained by specifying various values of x and determining total production time for each one.
x: 10, 15, 18, 20, 25, 27, 30, 35, 36, 40
y: 301, 455, 533, 599, 750, 810, 903, 1054, 1088, 1196
a. Plot each observed (x, y) pair as a point on a two-dimensional coordinate system with a horizontal axis labeled x and vertical axis labeled y. Do all points fall exactly on a line passing through (0, 0)? Do the points tend to fall close to such a line?
b. Consider the following probability model for the data. Values x 1, x 2,..., x n are specified, and at each x i we observe a value of the dependent variable y. Prior to observation, denote the y values by Y 1, Y 2,..., Y n , where the use of uppercase letters here is appropriate because we are regarding the y values as random variables. Assume that the Y i ’s are independent and normally distributed, with Y i having mean value βx i and variance σ 2. That is, rather than assume that y = βx, a linear function of x passing through the origin, we are assuming that the mean value of Y is a linear function of x and that the variance of Y is the same for any particular x value. Obtain formulas for the maximum likelihood estimates of β and σ 2, and then calculate the estimates for the given data. How would you interpret the estimate of β? What value of processing time would you predict when x = 25? [Hint: The likelihood is a product of individual normal likelihoods with different mean values and the same variance. Proceed as in the estimation via maximum likelihood of the parameters μ and σ 2 based on a random sample from a normal population distribution (but here the data does not constitute a random sample as we have previously defined it, since the Y i ’s have different mean values and therefore don’t have the same distribution).] [Note: This model is referred to as regression through the origin.]
Notes
1. Following earlier notation, we could use \( \hat{\Theta } \) (an uppercase theta) for the estimator, but this is cumbersome to write.
2. Since ln[g(x)] is a monotonic function of g(x), finding x to maximize ln[g(x)] is equivalent to maximizing g(x) itself. In statistics, taking the logarithm frequently changes a product to a sum, which is easier to work with.
3. This conclusion requires checking the second derivative, but the details are omitted.
Bibliography
DeGroot, Morris, and Mark Schervish, Probability and Statistics (3rd ed.), Addison-Wesley, Boston, MA, 2002. Includes an excellent discussion of both general properties and methods of point estimation; of particular interest are examples showing how general principles and methods can yield unsatisfactory estimators in particular situations.
Efron, Bradley, and Robert Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, New York, 1993. The bible of the bootstrap.
Hoaglin, David, Frederick Mosteller, and John Tukey, Understanding Robust and Exploratory Data Analysis, Wiley, New York, 1983. Contains several good chapters on robust point estimation, including one on M-estimation.
Hogg, Robert, Allen Craig, and Joseph McKean, Introduction to Mathematical Statistics (6th ed.), Prentice Hall, Englewood Cliffs, NJ, 2005. A good discussion of unbiasedness.
Larsen, Richard, and Morris Marx, Introduction to Mathematical Statistics (4th ed.), Prentice Hall, Englewood Cliffs, NJ, 2005. A very good discussion of point estimation from a slightly more mathematical perspective than the present text.
Rice, John, Mathematical Statistics and Data Analysis (3rd ed.), Duxbury Press, Belmont, CA, 2007. A nice blending of statistical theory and data.