Maximum likelihood estimation for survey data with informative interval censoring

Interval-censored data may arise in questionnaire surveys when, instead of being asked to provide an exact value, respondents are free to answer with any interval without having pre-specified ranges. In this context, the assumption of noninformative censoring is violated, and thus, the standard methods for interval-censored data are not appropriate. This paper explores two schemes for data collection and deals with the problem of estimation of the underlying distribution function, assuming that it belongs to a parametric family. The consistency and asymptotic normality of a proposed maximum likelihood estimator are proven. A bootstrap procedure that can be used for constructing confidence intervals is considered, and its asymptotic validity is shown. A simulation study investigates the performance of the suggested methods.


Introduction
In questionnaire surveys respondents are often allowed to give an answer in the form of an interval. For example, the respondent can be asked to select from several prespecified intervals; this question format is known as range card. Another approach is called unfolding brackets, where the respondent is asked a sequence of yes-no B Angel G. Angelov agangelov@gmail.com Magnus Ekström magnus.ekstrom@umu.se questions that narrow down the range in which the respondent's true value is. These formats are suitable when asking questions that are difficult to answer with an exact value (e.g., recall questions) or when asking sensitive questions (e.g., asking about income) because they allow partial information to be elicited from respondents who are unable or unwilling to provide exact amounts. However, studies have found that the pre-specified intervals given to the respondents in a range-card question are likely to influence their answers. Such bias is known as bracketing effect (see, e.g., McFadden et al. 2005). Similarly, the unfolding brackets format is prone to the so-called anchoring effect, i.e., answers can be biased toward the starting value in the sequence of yes-no questions (see, e.g., Furnham and Boo 2011;Van Exel et al. 2006).
A format that does not involve any pre-specified values is the respondent-generated intervals approach, suggested by Press and Tanur (2004a, b), where the respondent is asked to provide both a point value (a best guess for the true value) and an interval. They employed Bayesian methods for estimating the parameters of the underlying distribution. Similar format, in which the respondent is free to answer with any interval containing his/her true value, was considered by Belyaev and Kriström (2010). They use the term self-selected interval (SSI). Estimating the underlying distribution using SSI data, however, requires some generally untestable assumptions related to how the respondent chooses the interval. To avoid such assumptions, Kriström (2012, 2015) introduced a novel two-stage approach. The idea is to ask the respondent first to provide an SSI and then to select from several sub-intervals of the SSI the one that most likely contains his/her true value. Data collected in a pilot stage are used for generating the sub-intervals in the second question. Kriström (2012, 2015) proposed a nonparametric maximum likelihood estimator of the underlying distribution for two-stage SSI data. Angelov and Ekström (2017) extended their work by exploring a sampling scheme where the number of sub-intervals in the second question is limited to two or three, which is motivated by the fact that a question with a large number of sub-intervals might be difficult to implement in practice, e.g., in a telephone interview.
Data consisting of self-selected intervals are a special case of interval-censored data. Let X be a random variable of interest. An observation on X is interval-censored if, instead of observing X exactly, only an interval (L , R ] is observed, where L < X ≤ R (see, e.g., Zhang and Sun 2010). Interval-censored data arise most commonly when the observed variable is the time to some event (known as survival data, failure time data, lifetime data, duration data, or time-to-event data). The problem of estimating the underlying distribution for interval-censored data has been approached through nonparametric methods by Peto (1973), Turnbull (1976), and Gentleman and Geyer (1994), among others. These estimators rely on the assumption of noninformative censoring, i.e., the observation process that generates the censoring is independent of the variable of interest (see, e.g., Sun 2006, p. 244). In the sampling schemes considered by Belyaev and Kriström (2010, 2015 and Angelov and Ekström (2017) this is not a reasonable assumption as it is the respondent who chooses the interval; thus, the standard methods are not appropriate. The existing methods for data with informative interval censoring (see Finkelstein et al. 2002;Shardell et al. 2007) are specific for time-to-event data and are not directly applicable in the context that we are discussing.
In this paper, we focus on parametric estimation of the underlying distribution function, i.e., we assume a particular functional form of the distribution. Compared to nonparametric methods, this approach usually leads to more efficient estimators, provided that the distributional assumption is true (see, e.g., Collett 1994, p. 107). The problem of choosing the right parametric model can be sidestepped by using a wide parametric family like the generalized gamma distribution (see, e.g., Cox et al. 2007) that includes most of the commonly used distributions as special cases (exponential, gamma, Weibull, and log-normal).
We suggest two modifications of the sampling scheme for SSI data studied in Angelov and Ekström (2017) and propose a parametric maximum likelihood estimator. In Sect. 2, we introduce the sampling schemes. In Sect. 3, the statistical model is defined and the corresponding likelihood function is derived. Asymptotic properties of the maximum likelihood estimator are established in Sect. 4. The results of a simulation study are presented in Sect. 5, and the paper is concluded in Sect. 6. In "Appendix" are given proofs and auxiliary results.

Sampling schemes 2.1 Scheme A
The rationale behind this scheme is that we need to have more information than just the self-selected intervals in order to estimate the underlying distribution. Therefore, we ask the respondent to select a sub-interval of the interval that he/she stated. The problem of deciding where to split the stated interval into sub-intervals can be resolved using some previously collected data (in a pilot stage) or based on other knowledge about the quantity of interest.
We consider the following two-stage scheme for collecting data. In the pilot stage, a random sample of n 0 individuals is selected and each individual is requested to give an answer in the form of an interval containing his/her value of the quantity of interest. It is assumed that the endpoints of the intervals are rounded, for example, to the nearest integer or to the nearest multiple of 10. Thus, instead of (50.2, 78.7] respondents will answer with (50, 79] or (50, 80]. Let d 0 < d 1 < · · · < d k −1 < d k be the endpoints of all observed intervals. The set {d j } = {d 0 , . . . , d k } can be seen as a set of typical endpoints. The data collected in the pilot stage are used only for constructing the set {d j }, which is needed for the main stage. The set {d j } may also be constructed using data from a previous survey, or it can be determined by the researcher based on prior knowledge about the quantity of interest or other reasonable arguments. For instance, if it is known that the variable of interest ranges between 0 and 200 and that the respondents are rounding their endpoints to a multiple of 10, then a reasonable set of endpoints will be {0, 10, 20, . . . , 200}. In the main stage, a new random sample of n individuals is selected and each individual is asked to state an interval containing his/her value of the quantity of interest. We refer to this question as Qu1. The stated interval is then split into two or three sub-intervals, and the respondent is asked to select one of these sub-intervals (the points of split are chosen in some random fashion among the points d j that are within the stated interval, e.g., equally likely or according to some other pre-specified probabilities). We refer to this question as Qu2. The respondent may refuse to answer Qu2, and this will be allowed for. If there are no points d j within the stated interval, the second question is not asked.

Remark 1
The main difference between this scheme and the one explored in Angelov and Ekström (2017) is that with scheme A there is no exclusion of respondents, while with the former scheme respondents are excluded if they stated an interval with endpoints not belonging to {d j }.

Scheme B
This scheme is a modification of scheme A with two follow-up questions after Qu1 aiming to extract more refined information from the respondents. The pilot stage is the same as in scheme A. The sets {d 0 , . . . , d k }, V, U, and J h are also defined in the same way. In the main stage, a new random sample of n individuals is selected and each individual is asked to state an interval containing his/her value of the quantity of interest. We refer to this question as Qu1. The stated interval is then split into two sub-intervals, and the respondent is asked to select one of these sub-intervals. The point of split is the d j that is the closest to the middle of the interval; if there are two points that are equally close to the middle, one of them is taken at random. This way of splitting the interval yields two sub-interval of similar length, which would be more natural for the respondent. We refer to this question as Qu2a. The interval selected at Qu2a is thereafter split similarly into two sub-intervals, and the respondent is asked to select one of them. We refer to this question as Qu2b. The respondent may refuse to answer the follow-up questions Qu2a and Qu2b. If there are no points d j within the interval stated at Qu1 or Qu2a, the respective follow-up question is not asked. We assume that if a respondent has answered Qu2a, he/she has chosen the interval containing his/her true value, independent of how the interval stated at Qu1 was split. An analogous assumption is made about the response to Qu2b.
If we know the intervals stated at Qu1 and Qu2b, we can find out the answer to Qu2a. For this reason, if Qu2b is answered, the data from Qu2a can be omitted. Let Qu2Δ denote the last follow-up question that was answered by the respondent. If the respondent did not answer both Qu2a and Qu2b, we say that there is no answer at Qu2Δ. We will distinguish three types of answers in the main stage: Type 1. (u h ; NA), when the respondent stated interval u h at Qu1 and did not answer Qu2Δ; Type 2. (u h ; v j ), when the respondent stated interval u h at Qu1 and v j at Qu2Δ, where v j ⊆ u h ; Type 3. (u h ; u s ), when the respondent stated interval u h at Qu1 and u s at Qu2Δ, where u s is a union of at least two intervals from V and u s ⊂ u h .
Similar types of answers can be considered for scheme A, as well. In what follows we will use these three types for both schemes (for scheme A, Qu2Δ will denote Qu2).

Model and estimation
We consider the unobserved (interval-censored) values x 1 , . . . , x n of the quantity of interest to be values of independent and identically distributed (i.i.d.) random variables Because only intervals with endpoints from {d 0 , . . . , d k } are observed, the likelihood function will depend on F(x) through the probabilities q j . In order to avoid complicated notation, we assume that q j > 0 for all j = 1, . . . , k. The case when q j = 0 for some j can be treated similarly (cf. Rao 1973, p. 356). Let Hereafter we will need the following frequencies: n h,NA is the number of respondents who stated u h at Qu1 and NA (no answer) at Qu2Δ; n h j is the number of respondents who stated u h at Qu1 and v j at Qu2Δ, where v j ⊆ u h ; n h * s is the number of respondents who stated u h at Qu1 and u s at Qu2Δ, where u s is a union of at least two intervals from V and u s ⊂ u h ; n h• is the number of respondents who stated u h at Qu1 and any sub-interval at Qu2Δ. Now we will derive the likelihood for scheme B. If respondent i has given an answer of type 1, i.e., u h at Qu1 and NA at Qu2Δ, then the contribution to the likelihood can be expressed using the law of total probability: P (H i = h) = j∈J h w h| j q j . If an answer of type 2 is observed, i.e., u h at Qu1 and v j at Qu2Δ, then the contribution to the likelihood is w h| j q j . And if a respondent has given an answer of type 3, i.e., u h at Qu1 and u s at Qu2Δ, then the contribution to the likelihood is j∈J s w h| j q j . Thus, the log-likelihood function corresponding to the main-stage data is where c 1 does not depend on q = (q 1 , . . . , q k ). By similar arguments it can be shown that the log-likelihood for scheme A has essentially the same form as the log-likelihood (1), it differs by an additive constant (the pre-specified probabilities of choosing the points of split of the stated interval are incorporated in c 1 ).
If we want to estimate F(x) without making any distributional assumptions, we can maximize the log-likelihood (1) with respect to q (for details see Angelov and Ekström 2017). Here we will assume that F(x) belongs to a parametric family, i.e., F(x) is a known function of some unknown parameter θ = (θ 1 , . . . , θ d ), and thus the probabilities q j are functions of θ . Therefore, the log-likelihood will be a function of θ , i.e., log L(θ ) = log L q(θ ) , and in order to estimate F(x) we need to estimate θ. For emphasizing that F(x) depends on θ , we will sometimes write F θ (x).
The conditional probabilities w h| j are nuisance parameters. If w h| j does not depend on j, the assumption of noninformative censoring will be satisfied. In our case, there are no grounds for making such assumptions about w h| j , and therefore, we need the data from Qu2Δ in order to estimate w h| j . For this task we employ the procedure suggested in Angelov and Ekström (2017), which we outline here. The idea is first to estimate the probabilities p j|h = P (X i ∈ v j | H i = h), j ∈ J h . For a given h, a strongly consistent estimator p j|h of p j|h , j ∈ J h is obtained by maximizing the log-likelihood: where c 0 does not depend on p j|h . Then, an estimator of w h| j is derived using the Bayes formula: To find the maximum likelihood estimate of the parameter θ , we insert the estimates of the probabilities w h| j into log L(θ) and maximize with respect to θ . Alternatively, one may maximize the log-likelihood with respect to both θ and the nuisance parameters w h| j using standard numerical optimization methods. This is, however, a high-dimensional and computationally time-consuming optimization problem, which we avoid by simply plugging in the estimated nuisance parameters w h| j into the loglikelihood.
Remark 2 The proposed methodology for estimating F θ (x) assumes that the respondents are selected according to simple random sampling. If this is not the case, extrapolating the results to the target population may be incorrect. For surveys with a complex design, parameter estimates can be obtained, for example, by using the pseudo-likelihood approach, in which the individual contribution to the log-likelihood is weighted by the reciprocal of the corresponding sample inclusion probability (see, e.g., Chambers et al. 2012, p. 60).

Asymptotic results
Let us consider q j as a function of θ = (θ 1 , . . . , θ d ), a multidimensional parameter belonging to a set ⊆ R d , and let the true value θ 0 be an interior point of . In this section we prove the consistency and asymptotic normality of the proposed maximum likelihood estimator of θ . We also show the asymptotic validity of a bootstrap procedure which can be used for constructing confidence intervals.
Let θ 1 , θ 2 ∈ and θ 1 − θ 2 denote the Euclidean distance between θ 1 and θ 2 . Let the contribution of the i-th respondent to the log-likelihood be denoted llik i (θ), whose precise definition is given by (13). We will consider the following assumptions: A1 If θ 1 = θ 2 , then q(θ 1 ) = q(θ 2 ). A2 For every δ > 0, there exists ε > 0 such that A3 The functions q j (θ ) are continuous. A4 The functions q j (θ ) have first-order partial derivatives that are continuous. A5 The set is compact, and the functions q j (θ) have first-and second-order partial derivatives that are continuous on . Furthermore, q j (θ) > 0 on . A6 For each θ ∈ , the Fisher information matrix I(θ ) with elements We say that θ is an approximate maximum likelihood estimator (cf. Rao 1973, p. 353) of θ if for some c ∈ (0, 1), Let γ t be the probability that a respondent gives an answer of type t, for t = 1, 2, 3.
Theorem 1 Let θ be an approximate maximum likelihood estimator of θ .
Theorem 2 If assumptions A2 and A3 are satisfied, γ 2 > 0, and the conditional probabilities w h| j are known (or strongly consistently estimated), then the maximum likelihood estimator of θ exists and is strongly consistent.
Theorem 3 If assumptions A1 and A4 are satisfied and the conditional probabilities w h| j are known (or strongly consistently estimated), then there exists a rootθ of the system of likelihood equations such thatθ a.s.
In what follows, θ will denote the maximum likelihood estimator of θ, unless we state that it denotes an approximate maximum likelihood estimator.
For obtaining asymptotic distributional results about √ n( θ − θ 0 ) we will use the notion of weakly approaching sequences of distributions (Belyaev and Sjöstedt-de Luna 2000), which is a generalization of the well-known concept of weak convergence of distributions but without the need to have a limiting distribution. Two sequences of random variables, {X n } n≥1 and {Y n } n≥1 , are said to have weakly approaching distribution laws, {L(X n )} n≥1 and {L(Y n )} n≥1 , if for every bounded continuous function ϕ(·), E ϕ(X n )−E ϕ(Y n ) −→ 0 as n −→ ∞. Further, we say that the sequence of conditional distribution laws {L(X n | Z n )} n≥1 weakly approaches {L(Y n )} n≥1 in probability (along Z n ) if for every bounded continuous function ϕ(·), E (ϕ(X n ) | Z n ) − E ϕ(Y n ) −→ 0 in probability as n −→ ∞.
The claim of Theorem 4 implies weak convergence, i.e., the limiting distribution of √ n( θ − θ 0 ) is multivariate normal with zero mean vector and covariance matrix I −1 (θ 0 ).
Let y 1 , . . . , y n be the observed main-stage data. Each data point y i is a vector of size four, where the first two elements represent the endpoints of the interval stated at Qu1 and the last two elements represent the endpoints of the interval stated at Qu2Δ. We consider y 1 , . . . , y n to be values of i.i.d. random variables Y 1 , . . . , Y n . We denote Y 1:n = (Y 1 , . . . , Y n ). Let Y 1 , . . . , Y n be i.i.d. random variables taking on the values y 1 , . . . , y n with probability 1/n, i.e., Y 1 , . . . , Y n is a random sample with replacement from the original data set {y 1 , . . . , y n }. We say that Y 1 , . . . , Y n is a bootstrap sample. Let θ be the maximum likelihood estimator of θ from the bootstrap sample Y 1 , . . . , Y n .

Simulation study
We have conducted a simulation study to examine the performance of the proposed methods. The data for the pilot stage and for Qu1 at the main stage are generated in the same way. We describe it for Qu1 to avoid unnecessary notation. In all simulations, the random variables X 1 , . . . , X n are independent and have a Weibull distribution: where ν = 1.5 and σ = 80. The Weibull distribution has a flexible shape and is used in various contexts, for example, in contingent valuation studies where people are asked how much they would be willing to pay for a certain nonmarket good (see, e.g., Alberini et al. 2005). Contingent valuation is a natural application area for the sampling schemes considered here because they account for respondent uncertainty. Let U L 1 , . . . , U L n and U R 1 , . . . , U R n be sequences of i.i.d. random variables defined below: rounded downwards to the nearest multiple of 10. The right endpoints are generated as R 1i = X i + U R i rounded upwards to the nearest multiple of 10. The data for the follow-up questions Qu2a and Qu2b are generated according to scheme B. The probability that a respondent gives no answer to Qu2Δ is 1/6. All computations were performed in R (R Core Team 2016). The R code can be obtained from the first author upon request.
It is of interest to investigate to what extent the set of endpoints {d j } influences the properties of the estimator of θ = (ν, σ ). For this purpose, we explore three different ways of obtaining the set {d j }, i.e., three variations of scheme B, specified below: (i) pilot stage with sample size n 0 = 20; (ii) pilot stage with sample size which is the same as in the main stage, n 0 = n; (iii) skipping the pilot stage and using instead a predetermined set of endpoints {d j } = {0, 10, 20, . . . , 300, 320, 340, . . . , 400}, which is a reasonable set given the rounding to a multiple of 10 and the likely values of X i .
Under the settings of our simulations, the set {d j } will on average be smallest in scenario (i) and largest in scenario (iii). First, we compare the suggested estimator of θ under the three variations of scheme B and the maximum likelihood estimator when X 1 , . . . , X n are observed without censoring (uncensored observation scheme). For each scheme, 40000 samples of different sizes are generated. Table 1 presents the relative bias and the root mean square error over the simulations. If ν is an estimator of ν, the relative bias of ν is defined as rb( ν) = 100 bias( ν)/ν. The root mean square error is of more or less the same magnitude in each of the three scenarios for obtaining the set of endpoints {d j }. However, if we look at the results for n = 1000, the bias is smallest when the set of endpoints is largest. This indicates that the set {d j } should not be too small (ideally one would like the set to contain all endpoints that future respondents will give). As we can expect, the error with the uncensored scheme is lower; however, the difference is pretty small. The bias is fairly close to zero with all schemes. Analogous simulations for scheme A displayed comparable results with a slightly higher root mean square error. We also conducted similar simulations with the scheme suggested in Angelov and Ekström (2017) which showed a larger bias, e.g., for n 0 = n = 100, rb( ν) = 6.7, for n 0 = n = 1000, rb( ν) = 1.7, while with schemes A and B of the current paper, rb( ν) < 1 in each of the cases studied. This bias can be attributed to the exclusion of respondents in the former scheme. For the sake of brevity, the detailed simulation results for scheme A and the scheme of Angelov and Ekström (2017) are omitted.
In addition, we have performed simulations to examine potential bias due to wrongly assuming that w h| j does not depend on j. This assumption implies noninformative censoring, and in this case the likelihood will be proportional to where (a i , b i ] is the last interval stated by respondent i at the series of questions Qu1, Qu2Δ (cf. Sun 2006, p. 28). We compare the estimator suggested in this paper with an estimator assuming noninformative censoring, obtained by maximizing the likelihood (6). For generating data we use the model stated above with M i ∼ Bernoulli(1/100) in (5). This model corresponds to a specific behavior of the respondents, that is, at Qu1 they tend to choose an interval in which the true value is located in the right half of the interval. The estimator assuming noninformative censoring has been applied both to the full data (Qu1 and Qu2Δ) and to the data only from Qu1. Table 2 displays the relative bias and the root mean square error of the estimators based on 40000 simulated samples of sizes n = 100 and n = 1000, with scheme variation B(ii). For n = 1000 when using the full data, the bias of the estimator assuming noninformative censoring is substantially greater than the bias of our estimator. Similar thing is observed for the root mean square error. The results with n = 100 indicate that when using the  full data, for ν the bias of our estimator is a bit greater than the bias of the other estimator, while for σ the bias of our estimator is smaller. Yet, with our estimator the estimated distribution function more closely resembles the true distribution. If only the data from Qu1 are used, the bias under the assumption of noninformative censoring is considerably larger for both sample sizes. Finally, we compare the performance of the bootstrap confidence intervals (4) and the confidence intervals constructed using normal approximation (see Theorem 4). Table 3 shows results based on 1500 simulated samples of sizes n = 100 and n = 1000 using scheme variation B(iii); the confidence level is 0.95. One bootstrap confidence interval is calculated using 1000 bootstrap samples. For both sample sizes we see that the bootstrap confidence intervals have similar coverage and length as the confidence intervals based on normal approximation.

Conclusion
We considered two schemes (A and B) for collecting self-selected interval data that extend sampling schemes studied before in the literature. Under general assumptions, we proved the existence, consistency, and asymptotic normality of a proposed parametric maximum likelihood estimator. In comparison with the scheme used in a previous paper (Angelov and Ekström 2017), the new schemes do not involve exclusion of respondents and this leads to a smaller bias of the estimator as indicated by our simulation study. Furthermore, the simulations showed a good performance of the estimator compared to the maximum likelihood estimator for uncensored observations. It should be noted that the censoring in this case is imposed by the design of the question. A design allowing uncensored observations might introduce bias in the estimation if respondents are asked a question that is difficult to answer with an exact amount (e.g., number of hours spent on the internet) and they give a rough best guess. We also demonstrated via simulations that ignoring the informative censoring can lead to bias. We presented a bootstrap procedure for constructing confidence intervals that is easier to apply compared to the confidence intervals based on asymptotic normality where, e.g., the derivatives of the log-likelihood need to be calculated. According to our simulations, the two approaches yield similar results in terms of coverage and length of the confidence intervals. Finally, it would be of interest in future research to develop a test for assessing the goodness of fit of a parametric model.
where c 2 = (1/n)(c 1 + h, j n h j log w h| j ). Using the notations we can write the log-likelihood (7) in a more compact way: Taking into account that q j = q j (θ ), the log-likelihood (8) may also be written as follows where w h (θ ) = j∈J h w h| j q j (θ ) and w h * s (θ ) = j∈J s w h| j q j (θ ).
Lemma 1 (Information inequality) Let i a i and i b i be convergent series of positive numbers such that i a i ≥ i b i . Then with strict inequality if a i = b i for at least one i.
A proof of Lemma 1 can be found in Rao (1973, p. 58).
Proof of Theorem 1 Let us consider the case when the conditional probabilities w h| j are known. By convention, we define 0 log 0 = 0 and 0 log(a/0) = 0 on the basis that lim x↓0 x log x = 0 and lim x↓0 x log(a/x) = 0 for a > 0. Using (2) and Lemma 1, we get From the strong law of large numbers (SLLN) it follows that as n −→ ∞. Combining (10) and (11) yields as n −→ ∞. From this and Lemma 1 it follows that Using (11) and the assumption γ 2 > 0, we get This, together with assumption A2, implies that θ a.s.
−→ θ 0 , which is what we had to prove.
In the case, when a strongly consistent estimator of w h| j is inserted into the loglikelihood, the proof follows the same lines.
where π a is a relative frequency, π a (θ ) is an expression of the form j∈J (a) w h| j q j (θ ), and J (a) is an index set. The continuity of the derivatives of q j (θ ) implies the continuity of the derivatives of π a (θ). Thus, the proof of asymptotic normality follows the same lines as that of proposition (iv) in Rao (1973, p. 361).

Lemma 2
If assumption A5 is satisfied and the conditional probabilities w h| j are known (or strongly consistently estimated), then assumptions B1-B4 hold true.
Proof of Lemma 2 Assumption B1. From the continuity of the first-and second-order partial derivatives of q j (θ ), it follows that ∂ llik i (θ) ∂θ r and ∂ 2 llik i (θ) ∂θ r ∂θ are also continuous. Assumption B2. Because we have an experiment with a finite number of outcomes, the respective expectations can be expressed as finite sums and are therefore finite.
Using Lebesgue's dominated convergence theorem (see, e.g., Roussas 2014, p. 75), we get which is what we had to prove. Assumption B4. The identities in this assumption follow from the fact that we have an experiment with a finite number of outcomes and thus the respective expectations can be expressed as finite sums. Indeed, if Y is a random variable that can take a finite number of possible values and P (Y = y) = p θ (y), then ∂θ r p θ (y) = ∂ ∂θ r y p θ (y) = 0.
The same argument leads to the second identity.
Lemma 3 If assumptions A2 and A3 are satisfied, γ 2 > 0, and the conditional probabilities w h| j are known (or strongly consistently estimated), then θ exists and θ a.s.

Proof of Lemma 3
The proof follows the same arguments as that of Theorem 2 but, instead of the classical SLLN, the strong law of large numbers for the bootstrapped mean (see, e.g., Athreya et al. 1984) is used.
We will present a general result about bootstrapping maximum likelihood estimators that is used in the proof of Theorem 5. Let z 1 , . . . , z n be observed values of i.i.d. random variables Z 1 , . . . , Z n whose distribution depends on some unknown parameter θ = (θ 1 , . . . , θ d ) ∈ ⊆ R d . The contribution of the i-th observation to the loglikelihood is denoted llik i (θ ). Let Z 1:n = (Z 1 , . . . , Z n ) and θ be the maximum likelihood estimator of θ. We define Z 1 , . . . , Z n to be i.i.d. random variables taking on the values z 1 , . . . , z n with probability 1/n, i.e., Z 1 , . . . , Z n is a bootstrap sample. Let θ be the maximum likelihood estimator of θ from the bootstrap sample Z 1 , . . . , Z n . Then the distribution of √ n( θ − θ) | Z 1:n weakly approaches the distribution of √ n( θ − θ 0 ) in probability as n −→ ∞.
Proof of Theorem 5 The idea of the proof is to show that the conditions of Lemma 4 are fulfilled. By using the fact that assumption A5 implies assumption A3 and combining the results obtained in Theorem 2, Lemma 2, and Lemma 3, we see that these conditions are satisfied. Thus, the assertion of Theorem 5 follows directly.