Quantile regression with interval‑censored data in questionnaire‑based studies

Interval-censored data can arise in questionnaire-based studies when the respond-ent gives an answer in the form of an interval without having pre-specified ranges. Such data are called self-selected interval data. In this case, the assumption of independent censoring is not fulfilled, and therefore the ordinary methods for interval-censored data are not suitable. This paper explores a quantile regression model for self-selected interval data and suggests an estimator based on estimating equations. The consistency of the estimator is shown. Bootstrap procedures for constructing confidence intervals are considered. A simulation study indicates satisfactory performance of the proposed methods. An application to data concerning price estimates is presented.


Introduction
Quantile regression is a flexible approach to analyzing relationships between a response variable and a set of covariates.While the classical least-squares regression methods capture the central tendency of the data, quantile regression methods allow estimating the full range of conditional quantile functions and thus can provide a more complete analysis.Other attractive properties of quantile regression are equivariance to monotone transformations, robustness to outlying observations, and flexibility to distributional assumptions (Koenker 2005).
In many studies, the response variable of interest is observed to lie within an interval instead of being observed exactly.Such observations are called interval-censored and they often arise when the variable of interest is the time to some event (Kalbfleisch and Prentice 2002;Sun 2006;Bogaerts et al. 2017).Interval-censored data may also occur in questionnaire-based studies when the respondent is requested to give an answer in the form of an interval without having a list of ranges to choose from.This type of data is referred to as self-selected interval data (Belyaev and Kriström 2010, 2012, 2015).Similar question formats have been explored by Press andTanur (2004a, 2004b), Håkansson (2008), and Mahieu et al. (2017).Such formats are appropriate for asking questions which are hard to answer with an exact amount and for sensitive questions because they allow partial information to be elicited from respondents who are unable or unwilling to provide exact values.
Estimation procedures for quantile regression with interval-censored data have been suggested by Kim et al. (2010), Shen (2013), Zhou et al. (2017), Li et al. (2020), andFrumento (2022).These methods rely on the assumption of independent censoring, i.e., the observation process that generates the censoring is independent of the variable of interest, conditional on the covariates included in the model (Sun 2006).However, for self-selected interval data this is not a reasonable assumption because the respondent is the one who chooses the interval.Not accounting for the dependent censoring in selfselected interval data can lead to bias in the estimation (Angelov andEkström 2017, 2019).
Building upon the ideas of McKeague et al. (2001), Shen (2013), and Angelov and Ekström (2017), we suggest an estimator for quantile regression where the response variable is of self-selected interval data type and the covariates are discrete.In questionnaire-based studies, most often the covariates are discrete, such as gender, level of education, employment status, and answers to Likert-scale questions, or ones that are discretized such as age, personal income, and monthly expenses.In Sect.2, we outline the sampling scheme for self-selected interval data.Section 3 describes the model and the suggested estimation procedure.A simulation study is reported in Sect. 4. In Sect.5, the methods are applied to data from a study where the respondents provided estimates of the prices of rice and two types of fish.In the Appendix are given proofs and auxiliary results.

Data collection scheme
We consider a two-stage scheme for collecting data.The motivation behind this scheme is that more information is needed than a single interval from each respondent in order to consistently estimate the underlying distribution function or related parameters.Therefore the respondent is asked to select a sub-interval of the interval that he/she stated.The problem of deciding where to split the stated interval into sub-intervals can be resolved using some previously collected data (in a pilot stage or an earlier survey) or based on other knowledge about the quantity of interest.Another possibility is to include a predetermined degree of rounding in the instruction for the respondents, e.g., to state intervals with endpoints rounded to a multiple of 10, and then the points of split will be chosen among the multiples of 10.
In the pilot stage, a random sample of individuals is selected and each individual is requested to give an answer in the form of an interval containing his/her value of the quantity of interest.It is assumed that the endpoints of the intervals are rounded (e.g., to the nearest multiple of 10) and that they are bounded from above by some large number.Let {d ⋆ j } be the set of endpoints of all observed intervals.The pilot- stage data are used only for obtaining the set {d ⋆ j }.In the main stage, a new random sample of n individuals is selected and each individual is asked to state an interval containing his/her value of the quantity of interest.We refer to this question as Qu1.Then, follow-up questions are asked according to one of the following designs.
Design A. The interval stated at Qu1 is split into two or three sub-intervals and the respondent is asked to select one of these sub-intervals.The points of split are chosen in some random fashion among the points d ⋆ j that are within the stated interval, e.g., equally likely.We refer to this question as Qu2.
Design B. The interval stated at Qu1 is split into two sub-intervals and the respondent is asked to select one of these sub-intervals.The point of split is the d ⋆ j that is the closest to the middle of the interval; if there are two points that are equally close to the middle, one of them is taken at random.We refer to this question as Qu2a.The interval selected at Qu2a is thereafter split similarly into two sub-intervals and the respondent is asked to select one of them.We refer to this question as Qu2b.
The respondent may refuse to answer Qu2 (Qu2a and Qu2b); we assume that the respondent chooses not to answer independently of his/her true value.If there are no points d ⋆ j within the interval stated at Qu1 or Qu2a, the respective follow-up question is not asked.We assume that if a respondent has answered Qu2 (Qu2a), he/she has chosen the interval containing his/her true value, independent of how the interval stated at Qu1 was split.An analogous assumption is made about the response to Qu2b.
In Design B, if we know the intervals stated at Qu1 and Qu2b, we can find out the answer to Qu2a.Thus, if Qu2b is answered, the data from Qu2a can be omitted.Let Qu2Δ denote the last follow-up question that was answered by the respondent.If the respondent did not answer Qu2a (Qu2 in Design A), we say that there is no answer at Qu2Δ .Designs A and B are studied in Angelov and Ekström (2019), where they are referred to as schemes A and B. Let be the endpoints of all intervals observed at the main stage.The assumptions that the endpoints are rounded and bounded from above imply that J remains fixed for large sample sizes.Let us define a set of intervals V = { j } , where j = (d j−1 , d j ], j = 1, … , J , and let U = { h } be the set of all intervals that can be expressed as a union of intervals from V , i.e., U = {(d l , d r ] ∶ d l < d r , l, r = 0, … , J} .We denote J h to be the set of indices of intervals from V contained in h , i.e., J h = {j ∶ j ⊆ h } .For example, if V = {(0, 2], (2, 5], (5, 10]} , then U = {(0, 2], (2, 5], (5, 10], (0, 5], (2, 10], (0, 10]} .Also, 4 = (0, 5] = 1 ∪ 2 , hence J 4 = {1, 2}.

Let us denote the observations
, where (l 1i , r 1i ] is the interval stated at Qu1, (l 2i , r 2i ] is the interval stated at Qu2Δ , and i = (1, The distribution of Y i depends on the value of i .It is assumed that i takes finitely many values. Let Q ( i ) be the -th quantile of Y i conditional on i = i , We assume that where  ∈ ⊆ ℝ d+1 is a parameter vector (a vector of regression coefficients).For uncensored data, an estimate of can be obtained by solving the estimating equation Following the ideas of McKeague et al. (2001) and Shen (2013), we replace the unobservable 1{y i ≥ ⊺ i } in (1) by an estimate of the conditional probability that Y i ≥ ⊺ i given i .Thus we arrive at the following estimating equation: We define ̂ to be the root of estimating equation ( 2).
Unless otherwise stated, hereafter we focus on the case = 0.5 which cor- responds to a median regression model and we omit the subscript in and .However, the suggested estimation procedure is applicable to an arbitrary ∈ (0, 1).The set of combinations of possible values of i is denoted by Let us define where s ⊂ h .The following relation between p j|h,k and p j|h * s,k is fulfilled: We need to estimate p j|h,k and p j|h * s,k in order to find an estimate Gi , which is needed in (2).The conditional probabilities p j|h,k reflect the relative position of Y i within the stated interval (L 1i , R 1i ] .These probabilities are estimated using the data from Qu2Δ , where the respondent selects a sub-interval of (L 1i , R 1i ] .The estimate pj|h,k is obtained by applying the procedure proposed in Angelov and Ekström (2017) to the subset of data corresponding to i = k , namely, pj|h,k , j ∈ J h , is the maximizer of the log-likelihood where n hjk is the number of respondents who stated h at Qu1, j at Qu2Δ ( j ⊆ h ) and have covariate value k , while n h * s,k is the number of respondents who stated h at Qu1, s at Qu2Δ ( s is a union of at least two intervals from V , s ⊂ h ) and have covariate value k .
The estimate pj|h * s,k is computed using the relation (3), i.e., If independent censoring is assumed and the survival function of Y i is close to linear over (L 1i , R 1i ] , then the distribution of the relative position of Y i within the interval (L 1i , R 1i ] will be close to uniform.This will not be realistic if the respond- ents exhibit some specific behavior when choosing the intervals, e.g., if they tend .
to choose an interval such that the true value is located in the right half of the interval.Therefore, assuming independent censoring in such cases may lead to bias in the estimation of .
Thus, G i is a step function with jumps at the points d j 1(h) , … , d j c(h) .However, it will be more convenient to use a smoothed version of G i and we employ spline interpolation for that purpose.The procedure for obtaining the smooth version Gi is described below.Figure 1 visualizes the functions G i and Gi in an artificial example.Let be a positive constant.
Case 1 Suppose that (L 1i , R 1i ] = h , (L 2i , R 2i ] = NA , and i = k .Then Gi is the monotone cubic spline (see Fritsch and Carlson 1980) through the points: Fig. 1 An illustration of G i and Gi for some i, where By adding the points The constant can be chosen, e.g., as = min j |d j − d j+1 | ; although any positive constant should work.
Case 2 Suppose that (L 1i , R 1i ] = h , (L 2i , R 2i ] = s , and i = k .Then Gi is the monotone cubic spline through the points: Case 3 Suppose that (L 2i , R 2i ] = j .Then Gi is the monotone cubic spline through the points: Let • ( ) be an estimating function based on the true G i rather than on Gi , i.e., Let D( ) = n −1 • ( ) .Let 0 be the true value of , i.e., the median of Y i condi- tional on i = i is given by 0 ⊺ i .
⟶ A , where A is negative definite.
First coordinate Second coordinate are known for all possibly observed points d j , then the survival function Assumption 3 ∑ j n hjk ∕( We can regard Assumption 2 as a sensible approximation of the true underlying survival function.The very nature of a distributional model is a simplified and idealized representation of the underlying survival function, and thus there is no 'true' model that perfectly describes the survival function and how it depends on the covariates. Assumption 3 ensures the strong consistency of pj|h,k , see Angelov and Ekström  (2017).
The almost sure convergence of ̂ is established in the following theorem.
We will explore the following confidence intervals for r with nominal level 1 − : • Bootstrap percentile confidence interval • Wald-type confidence interval with bootstrap standard error For monotone cubic spline interpolation, we use the R function splinefun with the option method="monoH.FC", which corresponds to the method of Fritsch and Carlson (1980).The estimate ̂ is obtained as a minimizer of ‖ ( )‖ , where ‖ ⋅ ‖ is the Euclidean norm.For this task, the Nelder-Mead (NM) algorithm is used (the R function optim with the option method="Nelder-Mead").The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method can also be used (the R function optim with method="BFGS"); however, our experiments suggested that it is much slower than the Nelder-Mead algorithm for this particular optimization problem.Table 1 displays the average computation time for the suggested estimation procedure (using the NM algorithm and the BFGS algorithm) under different settings on a laptop computer with Intel(R) Pentium(R) CPU 2117U 1.8 GHz, RAM 4.0 GB.

Setup
Let Y 1 , … , Y n be independent random variables that have a Weibull distribution, Then, the -th quantile of Y i is ⊺ i .We generate Y 1 , … , Y n according to the above definition with = 1.5 and con- sider two cases for the covariates: (i) one covariate x 1i taking values 1, 2, or 3; (ii) two covariates x 1i and x 2i , where x 1i takes values 2 or 3 and x 2i takes values 0 or 1. Let where i and U (2) i are random variables defined later.Let (L 1i , R 1i ] be the interval stated by the i-th respondent at question Qu1.The left end- points are generated as rounded downwards to the nearest multiple of 10.The right endpoints are generated as (6) upwards to the nearest multiple of 10.We consider two settings for the random variables U (1) i and U (2) i in ( 6), see Table 2.In setting S11, the median length of the interval at Qu1 is 50, while in settings S21 and S22 the median length is 30.The data for the follow-up question are generated according to Design A; the interval (L 1i , R 1i ] is split into two sub-intervals, the point of split is chosen equally likely from all the possible points d ⋆ j that are within the interval.The probability that a respondent gives no answer to Qu2Δ is p NA = 1∕4 .The parameter p M of the Bernoulli random variables M i is considered to be a function of the covariates (see Table 2).For exam- ple, in setting S11, p M = 0.2x 1i − 0.1 , which leads to tree possible values, p M = 0.1, 0.3, 0.5 .Figure 2 illustrates the relative position of Y i in the interval ) , for the different values of p M under setting S11.Instead of simulating pilot-stage data, a pre-determined set of points {d ⋆ j } = {0, 10, 20, … , 450} is used (cf.Angelov and Ekström 2019).All computations were performed with R (see R Core Team 2019).The R code can be obtained from the corresponding author upon request.

Results
We conducted simulations for a range of sample sizes where we compare the proposed estimator with the estimator of Shen (2013), which assumes independent censoring.Our estimator can be seen as an extension of Shen's estimator to the case of dependent censoring.With such comparison we can see the benefit of using an estimator that accounts for dependent censoring.Shen's estimator is applied to the dataset where each data point includes only the last interval stated by the respondent.Relative bias is defined as the bias divided by the true value of the parameter.Tables 3, 4, and 5 display the results based on 10000 simulated datasets (replications).We see that in most cases the root mean square error is smaller for our estimator.The bias of our estimator is considerably lower than the bias of Shen's estimator (with some exceptions for n = 100 under setting S22).Moreover, the bias of our estimator gets closer to zero as the sample size increases, while the bias of the other estimator does not change noticeably when increasing the sample size.The bias of our estimator for smaller sample sizes might be explained by the not large number of observations for each combination of h and k which may lead to poor estimates of some of the probabilities p j|h,k .
Simulations concerning the bootstrap confidence intervals (4) and ( 5) are reported in Table 6.The results are based on 1000 simulated samples of sizes n = 100 and n = 1500 .One bootstrap confidence interval is calculated using 1000 bootstrap sam- ples.For the bootstrap percentile confidence intervals, the coverage is fairly close to   error (Wald with BootSE) are on average longer and their coverage is in some cases too low.Therefore, the bootstrap percentile confidence intervals are recommended.

Application
We apply the proposed methods to data concerning price estimates from a study conducted in Aklan, a province in the Philippines.The focus of the sampling process was the capital city, Kalibo.The administrative divisions, barangays, of Kalibo were classified into either coastal or inland communities.Two coastal barangays (Pook and Old Buswang) and two inland barangays (Tigayon and Estancia) were randomly selected.In each barangay, a number of households were randomly chosen.With their consent, a member of a sampled household (preferably, the head) was asked to participate in a survey.They were told to answer as honest as possible, and that their identity and personal data gathered will be kept confidential.The questionnaire was written in English, but trained enumerators explained questions in the local language Tagalog.The participants were asked to provide estimates of the prices of rice and two types of fish (galunggong and bangus).They answered by means of self-selected intervals.As a follow-up question, the respondents were asked whether the price is more likely to be in the left or in the right half of the interval.Price estimates were given for two time periods: April 2019 (summer/fishing season) and September 2019 (typhoon/non-fishing season); thus the dataset contains six price estimates: (RA) Price of 1 kg of rice in April 2019; (RS) Price of 1 kg of rice in September 2019; (GA) Price of 1 kg of galunggong in April 2019; (GS) Price of 1 kg of galunggong in September 2019; (BA) Price of 1 kg of bangus in April 2019; (BS) Price of 1 kg of bangus in September 2019.below the lower bound of the confidence intervals for the medians).For bangus (luxury fish), respondents underestimated the price in April (observed market price is above the upper bound of the confidence intervals for the medians and the 0.75-quantiles).However, they gave more accurate estimates for the price of bangus in September (observed market price is within the confidence intervals for the medians).
Respondents expected prices to be higher in the typhoon season compared to the non-typhoon season, which in reality happened only with the price of galunggong, while the prices of rice and bangus remained stable.
We also considered models with two covariates: where HouseholdHead is a variable which takes value 1, if the respondent is head of the household, and 0 otherwise.Point estimates and confidence intervals for the parameters 1 and 2 are presented in Figs. 5 and 6.The results indicate that people with college education tend to give higher price estimates compared to those without college education.Heads (10) Fig. 4 Estimates and bootstrap percentile confidence intervals for the 0.25-quantile, the median, and the 0.75-quantile of the prices using the models with one covariate (7, 8 and 9).The confidence intervals are based on 50000 bootstrap samples.The confidence level is 0.95.In each plot, the observed market price (see Table 7) is displayed with a horizontal dashed line of households tend to give higher price estimates for galunggong and bangus compared to people who are not heads of households.However, all the confidence intervals for the parameters 1 and 2 contain zero.Therefore, in each case the hypotheses 1 = 0 and 2 = 0 can not be rejected at the 5% significance level.
Fig. 5 Estimates and bootstrap percentile confidence intervals for the parameter 1 in the models with two covariates (10, 11 and 12).The confidence intervals are based on 50000 bootstrap samples.The confidence level is 0.95 Fig. 6 Estimates and bootstrap percentile confidence intervals for the parameter 2 in the models with two covariates (10, 11 and 12).The confidence intervals are based on 50000 bootstrap samples.The confidence level is 0.95

Concluding remarks
We suggested an estimator for quantile regression for self-selected interval data with discrete covariates.We proved the strong consistency of the estimator.Our simulation study indicated that the proposed estimator performs better than an existing estimator which assumes independent censoring.A simple bootstrap procedure for constructing confidence intervals (the bootstrap percentile) showed satisfactory performance in the simulations.

A.1 Continuity of splines
Here we show the continuity of monotone cubic splines (see Fritsch and Carlson 1980) with respect to the data points.The notation in this section is independent of that in the rest of the paper.Suppose that we have data points (x i , y i ), i = 1, … , m , where x 1 < x 2 < … < x m and y 1 ≥ y 2 ≥ … ≥ y m .Let g(x) be a monotone piecewise cubic function such that g(x i ) = y i , i = 1, … , m .In each interval [x i , x i+1 ] , g(x) is a cubic polynomial: where We use the following procedure for calculating a i , i = 1, … , m.
⟶ y i as n ⟶ ∞ , where n is the size of the sample used for obtaining ŷi .All quantities with a hat (e.g., âi ) imply that y i is substituted with ŷi .Let Assumption 3 guarantees that pj|h,k is a strongly consistent estimator of p j|h,k (see Angelov and Ekström 2017).Therefore, the data points used for Gi converge almost surely to the data points used for G i .Then, the claim follows from Lemma 1. ◻ Proof of Theorem 1 Using Lemma 2, we get .

Fig. 2
Fig. 2 Relative position ofY i in the interval (L 1i , R 1i ] , i.e., (Y i − L 1i )∕(R 1i − L 1i )for three different values of p M corresponding to x i = 1, 2, 3 .The histograms are based on a generated dataset of size n = 50000 under setting S11

Table 1
Average computation time (in seconds)

Table 2
Simulation settings the nominal level of 0.95.The bootstrap percentile method has previously shown good performance in the context of quantile regression (see, e.g.,Wang and Wang 2009;  De Backer et al. 2019).The Wald-type confidence intervals with bootstrap standard

Table 7
Observed market prices per kilogramThe data for rice are from the Philippine Statistics Authority.The data for galunggong and bangus are from the Bureau of Fisheries and Aquatic Resources and the Municipal Economic Enterprise Development Office, Municipality of Kalibo