Nonparametric estimation for self-selected interval data collected through a two-stage approach

Self-selected interval data arise in questionnaire surveys when respondents are free to answer with any interval without having pre-specified ranges. This type of data is a special case of interval-censored data in which the assumption of noninformative censoring is violated, and thus the standard methods for interval-censored data (e.g. Turnbull’s estimator) are not appropriate because they can produce biased results. Based on a certain sampling scheme, this paper suggests a nonparametric maximum likelihood estimator of the underlying distribution function. The consistency of the estimator is proven under general assumptions, and an iterative procedure for finding the estimate is proposed. The performance of themethod is investigated in a simulation study.


Introduction
When being asked about a quantity, people often answer with an interval if they are not certain. For example, when asked about the distance to a given town, we would say "it is about 60-70 km". This is one of the reasons why in questionnaire surveys respondents are often allowed to give an answer in the form of an interval to a quantitative question. One common question format is the so-called range card, where the respondent is asked to select from several pre-specified intervals (called "brackets"). Another approach is known as unfolding brackets. In this case the respondent is asked a sequence of yes-no questions that narrow down the range in which the respondent's true value is. For example, the respondent is first asked "In the past year, did your household spend less than 500 EUR on electrical items?". If the answer is "yes", the next question asks if they spent more than 400 EUR. If the response to the first question is "no", the next question asks if they spent less than 600 EUR and so on. Unfolding brackets can be designed such that they elicit the same information as in a range-card question. These formats are often used for asking sensitive questions, e.g. asking about income, because they allow partial information to be obtained from respondents who are unwilling to provide exact amounts.
However, there are some issues associated with these approaches. Studies have found that the choice of bracket values in range-card questions is likely to influence responses. This is known as the bracketing effect or range bias (see, e.g., McFadden et al. 2005;Whynes et al. 2004). In questions about usage frequency (e.g. "How many hours per day do you spend on the internet?"), respondents might assume that the range of response alternatives represents a range of "expected" behaviors. Thus, they seem reluctant to report behaviors that are "extreme", i.e. the bottom and top brackets (see Schwarz et al. 1985). The unfolding brackets format is susceptible to the so-called anchoring effect (see, e.g., Furnham and Boo 2011;Van Exel et al. 2006), i.e. answers are biased toward the starting value (500 EUR in the example above). Respondents might perceive the initial value as representing a reasonable value of the quantity in question. It serves as an "anchor" or reference point, and respondents adjust their answer to be closer to the anchor than the estimate they had before seeing the question.
It is intuitively plausible that bracketing and anchoring effects would be avoided if the respondent is free to state any interval without having any hints like pre-specified values, in other words, if the question is open-ended. One such format is called respondent-generated intervals, proposed and investigated by Press and Tanur (see, e.g., Press and Tanur 2004a, b and the references therein). In this approach the respondent is asked to provide both a point value (a best guess for the true value) and an interval (a lower and an upper bound) to a question. They used hierarchical Bayesian methods to obtain point estimates and credibility intervals that are based on both the point values and the intervals.
Related to the respondent-generated intervals approach is the self-selected interval (SSI) approach suggested by Belyaev and Kriström (2010), where the respondent is free to provide any interval containing his/her true value. They proposed a maximum likelihood estimator of the underlying distribution based on SSI data. However, this estimator relies on certain restrictive assumptions on some nuisance parameters. To avoid such assumptions, Kriström (2012, 2015) introduced a novel twostage approach. In the first stage of data collection (we will call it the pilot stage), respondents are asked to state single self-selected intervals. In the second stage (the main stage), each respondent from a new sample is asked two questions: (i) to provide a SSI and then (ii) to select from several sub-intervals of the SSI the one that most likely contains his/her true value. The sub-intervals in the second question of the main stage are generated from the SSIs collected in the pilot stage. Kriström (2012, 2015) developed a nonparametric maximum likelihood estimator of the underlying distribution for two-stage SSI data.
Data consisting of self-selected intervals or respondent-generated intervals (without the point values) are a special case of interval-censored data. Let X be a random variable of interest. An observation on X is interval-censored if, instead of observing X exactly, only an interval (L , R ] is observed, where L < X ≤ R. Interval censoring also contains right censoring and left censoring as special cases, and if R = ∞, the observation is right-censored, while if L = −∞ the observation is left-censored (see, e.g., Zhang and Sun 2010). Interval-censored data are encountered most commonly when the observed variable is the time to some event (known as time-to-event data, failure time data, survival data, or lifetime data). The problem of analyzing timeto-event data appears in many areas such as medicine, epidemiology, engineering, economics, and demography.
With regard to statistical analysis of interval-censored data, Peto (1973) considered nonparametric maximum likelihood estimation and employed a constrained Newton-Raphson algorithm. Turnbull (1976) extended the work of Peto to allow for truncation and suggested a self-consistency algorithm. Considering the case of no truncation, Gentleman and Geyer (1994) provided conditions under which Turnbull's estimator is indeed a maximum likelihood estimator and is unique. All these methods rely on the assumption of noninformative censoring, which implies that the joint distribution of L and R contains no parameters that are involved in the distribution function of X and therefore does not contribute to the likelihood function (see, e.g., Sun 2006). In the sampling schemes considered by Belyaev and Kriström (2010, 2012, 2015 this is not a reasonable assumption, thus the standard methods are not appropriate. The existing methods for analysis of time-to-event data in the presence of informative interval censoring require modeling the censoring process and estimating nuisance parameters (see Finkelstein et al. 2002) or making additional assumptions about the censoring process (see Shardell et al. 2007). These estimators are specific for time-to-event data and are not directly applicable in the context that we are discussing.
In this paper, we extend the work of Kriström (2012, 2015) by considering a sampling scheme where the number of sub-intervals in the second question of the main stage is limited to two or three, which is motivated by the fact that a question with a large number of sub-intervals might be difficult to implement in practice (e.g., in a telephone interview). In Sect. 2, we describe the sampling scheme. Section 3 introduces the statistical model. In Sect. 4, a nonparametric maximum likelihood estimator of the underlying distribution is proposed, and some of its properties are established. In Sect. 5, the results of a simulation study are presented, and Sect. 6 concludes the paper. Proofs and auxiliary results are given in the Appendix.

Sampling scheme
We consider the following two-stage scheme for collecting data. In the pilot stage, a random sample of n 0 individuals is selected and each individual is asked to state an interval containing his/her value of the quantity of interest. It is assumed that the endpoints of the intervals are rounded, for example, to the nearest integer or to the nearest multiple of 10. Thus, instead of (21.3, 47.8] respondents will answer with (21,48] or (20,50].
Let d 0 < d 1 < . . . < d k−1 < d k be the endpoints of all observed intervals. The set {d 0 , . . . , d k } can be seen as a set of typical endpoints. The data, collected in the pilot stage are used only for constructing the set {d 0 , . . . , d k }, which is then needed for the the main stage. In the case that a similar survey is conducted again, a new pilot stage is not necessary-the data from the previous survey can be used for constructing {d 0 , . . . , d k }.
In the main stage, a new random sample of individuals is selected and each individual is asked to state an interval containing his/her value of the quantity of interest. We refer to this first question as Qu1. If the interval has endpoints that do not belong to {d 0 , . . . , d k }, we exclude the respondent from the collected data. If the endpoints of the stated interval belong to {d 0 , . . . , d k }, then the interval is split into two or three sub-intervals with endpoints from {d 0 , . . . , d k } and the respondent is asked to select one of these sub-intervals (the points of split are chosen in some random fashion; for details see Sect. 3). We refer to this second question as Qu2. The respondent may refuse to answer Qu2, and this will be allowed for.
Let us define a set of intervals In the example with V = {(0, 5], (5, 10], (10, 20]}, u 5 = (5, 20] = v 2 ∪ v 3 , hence J 5 = {2, 3}. Similarly, the interval v 3 = (10, 20] is contained in u 3 , u 5 and u 6 , thus H 3 = {3, 5, 6}. We can distinguish three types of answers in the main stage: type 1. (u h ; NA), when the respondent stated interval u h at Qu1 and refused to answer Qu2; type 2. (u h ; v j ), when the respondent stated interval u h at Qu1 and v j at Qu2, where v j ⊆ u h ; type 3. (u h ; u s ), when the respondent stated interval u h at Qu1 and u s at Qu2, where u s is a union of at least two intervals from V and u s ⊂ u h .
In the case when u h ∈ V, Qu2 is not asked, but we input the answer from Qu1, and we consider this as an answer of type 2 : (u h ; v j = u h ). The number of respondents in the main stage is denoted by n (not counting those who were excluded).
Remark 1 This sampling scheme has two essential differences from the one introduced by Kriström (2012, 2015), namely (i) they include in the data for the main stage only respondents who stated at Qu1 an interval that was observed at the pilot stage, while we allow any interval with endpoints from {d 0 , . . . , d k }, and (ii) in their scheme the interval stated at Qu1 is split into all the sub-intervals v j that it contains, while in our scheme it is split into two or three sub-intervals with endpoints from {d 0 , . . . , d k }.
Remark 2 A question that arises naturally is: How large should the sample in the pilot stage be so that the proportion of excluded respondents in the main stage is sufficiently small? As noticed by Belyaev and Kriström (2015), this question is related to the problem of estimating the number of species in a population, which dates back to a work by Good (1953) and has been extensively treated in the literature since then. Belyaev and Kriström (2015) suggested a rule for determining the sample size for the pilot stage (stopping the sampling process) based on results by Good (1953). A similar stopping rule can be utilized for our sampling scheme.

Statistical model
The unobserved (interval-censored) values x 1 , . . . , x n of the quantity of interest are considered to be values of independent and identically distributed Thereby, the estimated distribution function will be a step function with jumps only at the points d 1 , . . . , d k . To avoid complicated notation, we assume that q j > 0 for all j = 1, . . . , k. The case when q j = 0 for some j can be treated similarly. Actually, if we have observed at Qu1 an interval u h containing v j , it is plausible to assume that q j > 0. If for some j 0 we have not observed any u h containing v j 0 , then we can assume that q j 0 = 0 and proceed by estimating the remaining q j 's. Let The probabilities q j are the main parameters of interest, while the conditional probabilities w h| j are nuisance parameters. If w h| j does not depend on j, the assumption of noninformative censoring will be satisfied. In our case, there are no grounds for making such assumptions about w h| j , and therefore we need the data on Qu2 in order to estimate w h| j .
We are considering a sampling scheme where, for the purpose of asking Qu2, the interval stated at Qu1 is split into two or three sub-intervals (we refer to these as 2-split design and 3-split design, respectively). We will now discuss how the points of split are determined. Let J • h be the set of indices of points from {d 0 , . . . , We denote by γ t the probability that a respondent gives an answer of type t, for t = 1, 2, 3, and similarly γ ht denotes the probability that a respondent, who stated u h at Qu1, gives an answer of type t for t = 1, 2, 3. Later on, we will need to assume that γ 2 > 0 and γ h2 > 0. Sufficient conditions for this are given by the following proposition.
Let δ h, j be the probability that u h is split so that one of the resulting sub-intervals is v j , and let δ h * s be the probability that u h is split so that one of the resulting subintervals is u s . It is easy to see that the probabilities δ h, j and δ h * s can be expressed in terms of δ h,d j in case of a 2-split design, and in terms of δ h,d i ,d j in case of a 3-split design.

Estimation
In this section we discuss the estimation of the distribution function F(x). We prove the consistency of a proposed nonparametric maximum likelihood estimator of the probabilities q j given that the conditional probabilities w h| j are known. We then show that if we plug in a consistent estimator of w h| j , the estimator of q j is still consistent. Thereafter, we suggest an estimator of w h| j and show its consistency. Iterative procedures are proposed for finding the estimates of q j and w h| j .

Estimating the probabilities q j
Henceforth we will need the following frequencies: We denote by n , n , and n the number of respondents who gave an answer of type 1, 2, and 3, respectively. The following are satisfied: n h * s , n + n + n = n.
If respondent i has given an answer of type 1, i.e. u h at Qu1 and NA at Qu2, then the contribution to the likelihood is P (H i = h) = j∈J h w h| j q j , where the equality follows from the law of total probability. If an answer of type 2 is observed, i.e. u h at Qu1 and v j at Qu2, then the contribution to the likelihood is δ h, j w h| j q j . And in the case that we observe an answer of type 3, i.e. u h at Qu1 and u s at Qu2, the contribution to the likelihood is δ h * s j∈J s w h| j q j . Thus, the log-likelihood function (normed by n) corresponding to the main-stage data is where c 1 does not depend on q = (q 1 , . . . , q k ) and Remark 3 If n = 0, the log-likelihood (1) has essentially the same form as the one in Belyaev and Kriström (2012).
We say that q is an approximate maximum likelihood estimator (see, e.g., Rao 1973 p. 353) where L(q) is the likelihood function and A is an admissible set of values of q. In our case the admissible set is A = {q : 0 < q j < 1, k j=1 q j = 1}.
Theorem 1 Let q be an approximate maximum likelihood estimator of q and q 0 be the vector of true probabilities. If the conditional probabilities w h| j are known and γ 2 > 0, then q a.s.
In order to find the maximizer of the log-likelihood log L(q), we will consider the Lagrange function: L(q, λ) = log L(q) n + λ(q 1 + · · · + q k ).
From the concavity of the log-likelihood function (see Proposition 2 in the Appendix), it follows that it can have no more than one stationary point. It is easy to see that the same is true for L(q, λ). Therefore, if we find a stationary point of L(q, λ), it corresponds to the unique stationary point of the log-likelihood, which will be the maximum likelihood estimate. By taking the derivative of L(q, λ) with respect to q j , we can write equations ( By multiplying (4) by q j , then taking the sum over j = 1, . . . , k and using the identities For finding the solution of (5), we suggest the following iterative process, which is similar to the one proposed by Belyaev and Kriström (2012): When q (r +1) is close enough to q (r ) , the process is stopped. Our simulation experiments showed a very fast convergence of this iterative procedure to the true solution.

Corollary 1 If we insert a strongly consistent estimator of w h| j into the log-likelihood
(1) and γ 2 > 0, then the approximate maximum likelihood estimator q is strongly consistent.

Estimating the conditional probabilities w h| j
We propose an estimator of the probabilities p j|h , j ∈ J h . Then, an estimator of w h| j can be obtained using the Bayes formula: where p j|h is an estimator of p j|h and We will consider the estimation of p j|h for a given h. For simplicity, we assume that p j|h > 0 for all j ∈ J h ; the case when some of them are zero can be treated similarly. Let p h be the vector of p j|h for j ∈ J h . The log-likelihood function (normed by n h• ), based on the respondents who stated the interval u h at Qu1 and any sub-interval at Qu2, will be: −→ p 0 j|h as n −→ ∞.
Remark 4 From the strong law of large numbers, it follows that w h is a strongly consistent estimator of w h . This, together with Theorem 2, implies that the estimator w h| j is strongly consistent.
The maximizer of the log-likelihood function log L h (p h ) can be found by employing the same method we used for log L(q). The concavity of log L h (p h ) is shown in Proposition 3 (see the Appendix). The unique stationary point is the solution of: Again, we suggest an iterative process for finding the solution:

Remark 5
If n h• = 0, i.e. if the interval u h has not been observed in type 2 or in type 3 answers, we do not have any observations in order to estimate the probabilities p j|h , j ∈ J h . In that presumably rare case, we need to make assumptions about those probabilities. In our simulation experiments, we have assumed that all sub-intervals v j , j ∈ J h , are equally likely, i.e. p j|h = 1/|J h |

Simulation study
We have conducted a simulation study in order to investigate the behavior of the proposed estimator. The data for the pilot stage and for Qu1 at the main stage are generated in the same way. Here we describe it for Qu1 in order to avoid unnecessary notations. In all simulations, the random variables X 1 , . . . , X n are independent and have a Weibull distribution: where a = 1.5 and σ = 80. Let U L 1 , . . . , U L n and U R 1 , . . . , U R n be sequences of i.i.d. random variables defined below:  (2) i ∼ Uniform(20, 50). Let (L 1i , R 1i ] be the interval stated by the i-th respondent at Qu1. The left endpoints are generated as L 1i = (X i −U L i ) 1{X i −U L i > 0} rounded downwards to the nearest multiple of 10. The right endpoints are generated as R 1i = X i + U R i rounded upwards to the nearest multiple of 10. For the second question (Qu2) we have considered three different designs: splitting the interval stated at Qu1 into two sub-intervals, into three sub-intervals, and into all sub-intervals v j that it contains. The latter corresponds to the sampling scheme explored by Belyaev and Kriström (2012). In case of a 2-split design, the point of split is chosen equally likely from all the possible points d j that are within the interval. Similarly, in case of a 3-split design, both points of split are chosen equally likely. The probability that a respondent gives no answer to Qu2 is 1/6, and the sample size for the pilot stage is equal to 200 unless stated otherwise. The computations were performed in R (R Core Team 2015).
Some descriptive statistics about the length of the interval at Qu1 for a simulated sample of size 2000 are shown in Table 1. Figures 1 and 2 illustrate the results of simulations with the 2-split design for sample sizes n = 400 and n = 2000. The estimated distribution function F(x) = j: d j ≤x q j is plotted together with the true distribution function F(x) and the empirical cumulative We can see that the estimate F(d j ) is very close to true probability F(d j ) for most j, and when F(d j ) deviates from F(d j ) a similar deviation is observed for F n (d j ).
It is of interest to compare the mean square error of different estimators of the probabilities q j , j = 1, . . . , k, based on different sampling schemes. We have generated 5000 samples (only the main stage is repeated 5000 times) according to the three designs described above and calculated the root mean square error (RootMSE) and the root relative mean square error (RootRelMSE). These are compared with the corresponding error when q j is estimated from the empirical c.d.f. F n (x) of the uncensored observations. Figure 3 shows the results for sample size n = 400 and Fig. 4 shows the results for n = 2000. The design, corresponding to the sampling scheme in Belyaev and Kriström (2012), is denoted as "all-split". The error when using the all-split design is fairly close to the error when q j is estimated using the uncensored observations x 1 , . . . , x n . As we can expect, when using the 2-split or 3-split designs, the errors are a bit larger. We observe similar patterns for n = 400 and n = 2000, the main difference is that the error decreases with increasing sample size.
In relation to Remark 2, we have performed simulations in order to see what proportion of respondents will be accepted at the main stage when the data are generated according to the model described above. The results are given in Table 2, where n 0 is the number of respondents at the pilot stage and n +n rej is the number of respondents at the main stage (accepted and rejected). In the third column are the proportions when  using the sampling scheme of Belyaev and Kriström (2012), and in the fourth column are the proportions when using the sampling scheme suggested in this paper (the average proportion over 3000 replications is reported). As expected, the proportion of accepted is larger for our scheme. For both schemes, the proportion gets close to one with increasing values of n 0 . We have carried out simulations to examine potential bias due to wrongly assuming that w h| j does not depend on j. This assumption implies noninformative censoring and in this case our method is essentially equivalent to the estimator proposed by Turnbull (1976). We compare the estimator suggested in this paper (i.e. estimating both w h| j and q j from the data) with Turnbull's estimator (i.e. assuming that w h| j does not depend    (8). This model corresponds to a specific behavior of the respondents, that is, at Qu1 they tend to choose an interval in which the true value is located in the right half of the interval. Figure 5 presents the bias and the root mean square error of the two estimators based on 5000 simulated samples (only the main stage is repeated) of size n = 2000 for both the 2-split and 3-split designs. The bias of our estimator is negligible, while the bias of Turnbull's estimator is substantially larger. The RootMSE of Turnbull's estimator is larger, as well. We see that Turnbull's method on average overestimates the mass in the left tail because it puts mass uniformly over the observed interval when in fact it should put more mass to the right. It is also of interest to compare Turnbull's estimator applied to Qu1 data with Turnbull's estimator applied to 2-split data. The results, based on 5000 simulated samples of size n = 2000, are shown in Fig. 6. As we might expect, the bias is much larger if only the data from Qu1 are used.

Concluding comments
In this paper, we considered a two-stage scheme for collecting self-selected interval data in which the number of sub-intervals in the second question of the main stage is limited to two or three. We suggested a nonparametric maximum likelihood estimator of the underlying distribution function and showed its strong consistency under easily verifiable conditions. Our simulations indicated a good performance of the proposed estimator-its error is comparable with the error of the empirical c.d.f. of the uncensored observations. It is important to note that the censoring in this context is imposed by the design of the question. A design allowing uncensored values might introduce bias in the estimation if respondents are forced to give an exact value of a quantity that is hard to evaluate exactly (e.g., number of hours spent on the internet), and consequently they give a rough "best guess". We also showed via simulations that ignoring the informative censoring and thus applying a standard method (Turnbull's estimator) can lead to serious bias. It would be of interest to investigate the accuracy of the estimator theoretically, but we leave that as a future work. and dividing by n, we get In (15) and (16), instead of w 0