Estimation of Covariate Effects With Current Status Data and Differential Mortality
- Cite this article as:
- Palloni, A. & Thomas, J.R. Demography (2013) 50: 521. doi:10.1007/s13524-012-0160-6
The assessment of the impact that socioeconomic determinants have on the prevalence of certain chronic conditions reported by respondents in population surveys must confront two problems. First, the self-reports could be in error (false positives and false negatives). Second, those reporting are a selected sample of those who ever experience the problem, and this selection is heavily influenced by excess mortality attributable to the condition being reported. In this article, we use a combination of empirical data and microsimulation to (a) assess the magnitude of the bias attributable to the selection problem, and (b) suggest an adjustment procedure that corrects for this bias. We find that the proposed adjustment procedure considerably reduces the bias arising from differential mortality.
Keywords: Current status data; Selection bias; Mortality; Health inequality
Accurate inferences about incidence of phenomena are generally made from data collection plans that follow observations over time and allow precise measurement of the timing of occurrence of relevant events. However, longitudinal designs are expensive enterprises, and sometimes researchers replace them with single-wave cross-sectional surveys with retrospective recall. For example, a significant number of phenomena—such as onset of illnesses, recovery from treatment, menopause, weaning, leaving home, and first marriage—rely on information collected retrospectively in cross-sectional surveys. But retrospective recall of events and their timing is often inaccurate, and statistical inferences from this information stand on shaky ground.1
An alternative to retrospective recalls is current-status data, that is, information about the occurrence of a relevant event prior to a time marker, such as the date of a survey. This information is less sensitive to recall problems and can be retrieved easily in conventional interviews. Under some conditions that we examine in this article, this information and associated statistical tools are a good basis for inferences about the underlying incidence of a phenomenon and for the identification of the determinants of its intensity and duration profile (Diamond and McDonald 1991; Keiding 1991, 2006; Keiding et al. 1989, 1996; Sun and Kalbfleisch 1993). The information is sometimes aggregated and represented as prevalence data: namely, the fraction of the observations that experiences the event by a time marker.
Current-status data have an important drawback: they rest on the assumption that attrition of individuals who experience the event of interest is the same as the attrition of those who do not. This assumption is violated when the event under study (an illness or disability, for example) is associated with at least one source of attrition (e.g., mortality). Although this weakness of current-status data is well known (Keiding 1991), it is normally trivialized, dismissed, or altogether ignored in empirical applications. Keiding (1991) considered the case of differential mortality in the analysis of current-status data but assumed that the mortality hazards (or their difference) are known. Little work, however, has studied current-status data when the mortality difference is unknown (Jewell and Van der Laan 2004), the problem that occupies us here.2
In this article, we assess the magnitude of the bias that arises when the assumption of homogeneous risks (i.e., identical pre-survey attrition among those who do and those who do not experience the event of interest) is violated, and we develop an adjustment procedure that corrects the bias. The article is organized as follows: the next section reviews an example of the application of current-status techniques in population studies. Then we introduce a maximum-likelihood approach for analyzing current-status data in the context of differential pre-survey attrition and discuss how this situation relates to the broader literature on selection biases. This is followed by an assessment of our adjustment procedure with Monte Carlo simulations. We then apply the adjustment procedure to a concrete case, summarize the results, and conclude.
An Example: Timing of Marriage and Proportions Single3
Almost 60 years ago, John Hajnal proposed the use of the singulate mean age at marriage, better known as SMAM, to estimate the mean age at marriage (Hajnal 1953).4 This was the first demographic application of current status techniques. SMAM has been widely used to assess the timing of first marriage from census or survey information on the age-specific proportion single. Under some conditions, the age-specific prevalence of singlehood is a good indicator of the (single-decrement) probability of remaining single during the interval elapsed between the age at which the population begins to marry and the age of the individual at the time of a census or survey. The first condition is that first-marriage rates be time-invariant (stationarity). The second is that the risks of attrition, mostly induced by mortality or migration, be identical among single and married people (risk homogeneity).5 A very large literature and influential theories on historical demography rest on observed trends of SMAM or age-specific proportion single in some target age group, usually 45 or 50. In his landmark studies of the so-called Western European marriage pattern, Hajnal (1965) made extensive use of these quantities to characterize two different continental marriage regimes, a distinction that became influential on subsequent research on fertility and family formation.
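The SMAM calculation described above can be sketched in a few lines. This is a minimal illustration of the standard textbook form of Hajnal's estimator for five-year age groups, not code from the article; the function name `smam` and the input layout are assumptions made for the example.

```python
def smam(prop_single):
    """Singulate mean age at marriage from proportions single.

    prop_single: eight proportions single (between 0 and 1) for the
    five-year age groups 15-19, 20-24, ..., 45-49, 50-54.
    Assumes nobody marries before age 15 or after age 50.
    """
    if len(prop_single) != 8:
        raise ValueError("expected 8 age groups (15-19 through 50-54)")
    # Years lived single up to age 50: 15 years before marriage starts,
    # plus 5 years times the proportion single in each group 15-19 ... 45-49.
    years_single = 15.0 + 5.0 * sum(prop_single[:7])
    # Proportion single at exact age 50: mean of groups 45-49 and 50-54.
    single_at_50 = (prop_single[6] + prop_single[7]) / 2.0
    if single_at_50 >= 1.0:
        raise ValueError("nobody marries by age 50; SMAM is undefined")
    # Subtract the years lived by those who never marry, then average
    # over those who do marry by age 50.
    return (years_single - 50.0 * single_at_50) / (1.0 - single_at_50)
```

Because every term is an observed proportion single, anything that distorts those proportions, such as excess mortality among the single, feeds directly into the estimate.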
While most analysts are aware of the need to invoke assumptions about these two conditions, we know of no study since Hajnal (1953) that has evaluated the pitfalls when the second assumption is inappropriate.6 To begin with, inferences from SMAM can be correct only when there are no time trends in first marriage. When the stationarity assumption is violated, current status procedures should be avoided. By the same token, we know that in violation of the assumption of risk homogeneity, mortality risks among single individuals are higher than among the married (Hu and Goldman 1990; Kisker and Goldman 1987; Livi-Bacci 1985). If stationarity holds but mortality risks of single individuals are higher than those of the married population, the observed proportions single will be too small and SMAM will be too large (Hajnal 1953).7 Simulations with a wide variety of marriage and mortality patterns indicate that the sensitivity of SMAM to mortality differentials is not trivial (see the appendix for details on the simulations): a 1 % mortality differential can produce proportionate errors in SMAM that are as small as 0.1 % and as large as 0.8 %, depending on the proportion that eventually marries. If the probability of ever marrying hovers round .75–.85, as was the case in Northern and Western Europe in the middle of the nineteenth century, a mortality differential equivalent to 10 % will bias SMAM upward by approximately 7 %. With a true mean age at marriage of 28 and a probability of ever marrying of about .80, fairly typical parameters in preindustrial Europe, and mortality differentials of the order of 10 %, the upward bias could be as high as 2.1 years. If the differential is 20 %, the bias will be 4 years.8 Errors of higher magnitude distort the proportion single at older ages, say 50 or 55, beyond which first marriage is negligible. 
When singles’ mortality is higher than overall mortality, the observed proportion single is an underestimate of the true probability of being single.
But errors can be worse: if mortality differences between single and married decline over time, the observed trend of SMAM will be downward and will yield the appearance of a decline in the mean age at marriage even in the absence of any real time trend in first marriage rates. If, as happened in Europe, a decline in the mean age at first marriage is accompanied by a reduction of mortality differences, the observed trajectory of SMAM will exaggerate the magnitude of the rate of decline in the age at first marriage. Similar errors endanger inferences about regional or national patterns. Indeed, the divide between the so-called Western and Eastern marriage pattern was established on the observation of higher SMAM and higher proportion single in Northern and Western Europe. This can be interpreted as a result of different marriage patterns only in the unlikely situation that the magnitude of married-single mortality differences was identical in Eastern, Northern, and Western Europe.
Estimation From Current Status Information
A Likelihood Approach9
For those who do not contract the disease (Y = 0), Eq. (2) is equal to the probability of surviving to age xi multiplied by the probability of not experiencing the disease by age xi; otherwise (Y = 1), the expression is equal to the probability of surviving to age xi multiplied by the probability of contracting the disease by age xi (Keiding 1991, 2006).10
The adjustment factor is the integrated hazard of mortality for those who do not contract the disease; its lower limit of integration is the expected age at disease onset for those who are aged xi and who contract the disease by this age. Its value for individual i can be calculated exactly only if one knows (a) the incidence curve (of disease) and its determinants, (b) the force of mortality at age xi, and (c) the parameter of excess mortality. First, the incidence curve can be approximated from retrospective (but noisy) information about the timing of the onset of disease or from known incidence curves in populations similar to the one under study. We show that the adjustment we propose is largely insensitive to even relatively large variability in the resulting values. Second, throughout our discussion, the force of mortality refers to the mortality risk at age x among those who do not have the disease. This quantity is unlikely to be known with any precision. However, to implement our adjustment procedure, it suffices to identify a standard (baseline) age pattern of mortality that applies to both those who experience and those who do not experience the disease.
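A short derivation may help show why an integrated hazard appears as the adjustment factor. The sketch below is not a reproduction of the article's Eqs. (1) and (2). It assumes proportional excess mortality and uses illustrative notation: \mu(s) is the mortality hazard of those without the disease, \theta\mu(s) the hazard after onset, F and f the distribution and density of age at onset, and \bar{x}_i the conditional mean age at onset given onset by age x_i.

```latex
\begin{align*}
\frac{\Pr(Y_i = 1 \mid \text{alive at } x_i)}{\Pr(Y_i = 0 \mid \text{alive at } x_i)}
  &= \frac{\int_0^{x_i} f(a)\,
        \exp\!\left\{-\int_0^{a}\mu(s)\,ds - \theta\int_a^{x_i}\mu(s)\,ds\right\} da}
          {\{1 - F(x_i)\}\,\exp\!\left\{-\int_0^{x_i}\mu(s)\,ds\right\}} \\[4pt]
  &= \frac{\int_0^{x_i} f(a)\,
        \exp\!\left\{-(\theta - 1)\int_a^{x_i}\mu(s)\,ds\right\} da}{1 - F(x_i)} \\[4pt]
  &\approx \frac{F(x_i)}{1 - F(x_i)}\,
     \exp\!\left\{-(\theta - 1)\int_{\bar{x}_i}^{x_i}\mu(s)\,ds\right\}.
\end{align*}
```

Taking logs, the observed log odds are approximately the true log odds minus (\theta - 1) times the integrated hazard between \bar{x}_i and x_i. Under these assumptions, entering the integrated hazard as a covariate lets its coefficient absorb the excess-mortality term. The last step replaces each onset age by its conditional mean, so it is a first-order approximation rather than an identity.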
Estimation in the Presence of Covariates
Assume two subgroups defined by a binary covariate Z. Assume also that each of them experiences risk heterogeneity that can be parameterized using a unique standard pattern of mortality μs. This is equivalent to defining, within each subgroup (Z = 0 and Z = 1), the mortality hazards of those who do and those who do not experience the disease as constant multiples of μs, where the multipliers, including θz, are positive constants. Thus, the adjustment factor can be computed using the same function μs for all observations. With this parameterization, we define a logistic model that includes the following independent variables: the log of age, a dummy variable Z, the integrated hazard of mortality, and the interaction between Z and the integrated hazard. The estimated coefficient of Z corresponds to the effect of subgroup Z = 1 on the incidence of disease, the coefficient of the integrated hazard is an estimate of the mortality differential between those who experience the disease and those who do not in the reference subgroup, and the coefficient of the interaction term is an estimate of how that differential differs in the other subgroup. Under these assumptions, we can always retrieve the effects of Z as well as estimates of the mortality differential between those who experience the disease and those who do not, but we cannot identify the parameters of mortality risks for each subgroup.
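The covariates of this logistic model can be assembled as in the sketch below. It is only an illustration: the Gompertz form for the standard mortality pattern, its parameter values, and the names `gompertz_cumhaz` and `design_row` are assumptions for the example, and the conditional mean age at onset is taken as given.

```python
import math


def gompertz_cumhaz(lower, upper, a, b):
    """Integral of the hazard a * exp(b * s) between ages lower and upper."""
    return (a / b) * (math.exp(b * upper) - math.exp(b * lower))


def design_row(age, z, mean_onset_age, a=5e-5, b=0.09):
    """Covariates for one respondent in the adjusted logistic model.

    age            : age at the survey
    z              : subgroup dummy (0 or 1)
    mean_onset_age : conditional mean age at onset given onset by `age`
    a, b           : hypothetical Gompertz parameters of the standard pattern
    """
    # Integrated standard hazard from the mean onset age to the survey age.
    h = gompertz_cumhaz(mean_onset_age, age, a, b)
    # Constant, log of age, subgroup dummy, adjustment factor, interaction.
    return [1.0, math.log(age), float(z), h, z * h]
```

The same standard pattern is used for every respondent, so only the two limits of integration vary across rows.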
Current Status, Unmeasured Heterogeneity, and Sample Selection Bias
The problem formalized above is a member of a more general class of problems characterized by two features: (1) the phenomenon of interest is only partially observed (e.g., observed only in a subset of individuals who experienced it); and (2) those who experience the event but are unobserved are removed from observation because of events whose risks have increased after the occurrence of the phenomenon of interest.
A well-known case belonging to this class is the classic sample selection problem, in which an outcome of interest (discrete or continuous) is observed only among a subset of the sample members who differ systematically (from the rest of the sample) on observed and unobserved characteristics (Berk 1983; Fligstein and Wolf 1978; Greene 1981; Heckman 1979; Little 1995; Wooldridge 1995). Some of these characteristics increase the risk of not observing the outcome of interest. Ignorance about the current status of a nonrandom subsample of observations is akin to ignorance about the nature of the outcome of interest in the classic sample selection case. However, in the classic case, the researcher can deploy adjustments using partial information available for all individuals, including those whose outcome the researcher knows nothing about. In contrast, in the current-status problem we address in this article, no such adjustments are possible because the researcher does not have any information about those individuals among whom we cannot observe the event of interest. The current-status problem can be, and indeed has been, formulated using modeling tools characteristic of the classic sample selection problem. But such tools invariably require the rigid formulation of unverified and unverifiable distributional assumptions (usually normality of latent traits), most of which are unsuitable for dealing with disease incidence and mortality (Bloom and Killingsworth 1985; Maddala 1983).
A second class of problems tightly related to current status is, unsurprisingly, the so-called unmeasured heterogeneity problem (Heckman and Singer 1984; Hougaard 2000; Manton and Stallard 1981; Trussell and Richards 1985; Trussell and Rodríguez 1990; Vaupel et al. 1979; Vaupel and Yashin 1985). This arises in the estimation of hazard models from longitudinal data whenever the occurrence of the event of interest is a function of variables that are unmeasured or ignored. The central problem of these models is the estimation of the rate of occurrence of a phenomenon within a sequence of time intervals. The estimated magnitude of the rate for any time interval depends on the composition of the sample of individuals who are exposed to the event at the beginning of the interval. Ordinarily, this subsample does not include individuals who experienced the event before the origin of the time interval. Like the case of current status, we observe the occurrence/non-occurrence of an event (within a time interval) partially (e.g., among only the subsample of individuals who did not experience the event before the time interval). Those who are excluded from the exposed subsample vanish from observation because they possess traits that increase the risk of experiencing the event throughout. But unlike the current-status problem and like the case of sample selection, the researcher may use available partial information about the unobserved individuals and/or invoke plausible assumptions about the makeup of the subset that is not observed and, armed with this information and/or assumptions, proceed to remove totally or partially the biases attributable to partial observations (Heckman and Singer 1984; Hougaard 2000; Trussell and Richards 1985). These adjustments, however, cannot be implemented in the current-status problem that occupies us here because the researcher knows nothing about individuals whose current status cannot be assessed. 
If the researcher is able to collect repeated current-status information over time on an initial sample of individuals, then the situation will resemble and indeed converge (as data collection times grow and become arbitrarily close to each other) to the standard unmeasured heterogeneity problem.
Monte Carlo Simulation
[Table: Design space for the education-specific mortality differences in the Monte Carlo simulation (mortality rate ratio by education group). Letters indicate panels of Fig. 3.]
Consider the population at some time t when members range in age from 31 to 100 years and have been exposed to the risk of both dying and becoming diabetic. We choose to start exposure at age 30 so that the youngest cohort of survivors to time t (observed at age 31) has been exposed to both risks for one year, whereas the oldest cohort of survivors at time t and observed at age 100 has been exposed to both risks for 70 (completed) years. We assume that the sizes of the birth cohorts are unequal and that the initial size of each grows at rates between .001 and .005 per year. With a radix of 1,000, this yields a total of roughly 60,000 and 70,000 individuals in the high- and low-education groups, respectively.13
We assume that the log of the waiting time to developing diabetes follows a logistic distribution with a constant variance and a mean that is higher among those with high education relative to those with low education. In the absence of mortality, the prevalence of diabetes at age 100 is expected to be roughly 20 % among the high-education group and just over 40 % among the low-education group. The resulting logistic regression of the log odds of being diabetic on the log of age and a dummy variable for the low-education group yields a (true) coefficient of 1.00 for the low-education covariate.
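The mechanics of such a simulation can be sketched as follows. This is a simplified illustration for a single age and a single education group, not the authors' simulation code. The Gompertz mortality parameters, the scale of the logistic distribution, and all function names are hypothetical; only the 20 % prevalence at age 100, the coefficient of 1.0, and the start of exposure at age 30 come from the text.

```python
import math
import random

# Hypothetical parameter values, chosen only for illustration.
START_AGE = 30.0                 # exposure to both risks begins here
GOMP_A, GOMP_B = 5e-5, 0.09      # Gompertz mortality: a * exp(b * age)
SCALE = 0.5                      # scale of the logistic law for log waiting time
# Location giving 20 % prevalence at age 100 (70 years of exposure).
LOC_HIGH = math.log(70.0) - SCALE * math.log(0.2 / 0.8)
# Shifting the location by SCALE raises the log odds by 1.0.
LOC_LOW = LOC_HIGH - SCALE * 1.0


def cumhaz(lower, upper):
    """Integrated Gompertz hazard between two ages."""
    return (GOMP_A / GOMP_B) * (math.exp(GOMP_B * upper) - math.exp(GOMP_B * lower))


def true_prevalence(age, loc):
    """Probability of onset by `age` in the absence of mortality."""
    z = (math.log(age - START_AGE) - loc) / SCALE
    return 1.0 / (1.0 + math.exp(-z))


def draw_onset_age(loc, rng):
    """Draw an onset age whose log waiting time is logistic(loc, SCALE)."""
    u = rng.random()
    while u == 0.0:              # avoid log(0)
        u = rng.random()
    return START_AGE + math.exp(loc + SCALE * math.log(u / (1.0 - u)))


def observed_prevalence(age, loc, theta, n, rng):
    """Share of diabetics among simulated survivors to `age`.

    theta is the mortality rate ratio of diabetics to nondiabetics.
    """
    alive = 0
    diabetic = 0
    for _ in range(n):
        onset = draw_onset_age(loc, rng)
        if onset < age:
            # Baseline hazard before onset, theta times baseline afterwards.
            h = cumhaz(START_AGE, onset) + theta * cumhaz(onset, age)
        else:
            h = cumhaz(START_AGE, age)
        if rng.random() < math.exp(-h):
            alive += 1
            if onset < age:
                diabetic += 1
    return diabetic / alive


if __name__ == "__main__":
    rng = random.Random(2013)
    for theta in (1.0, 2.0, 4.0):
        print(theta, observed_prevalence(70.0, LOC_LOW, theta, 100_000, rng))
```

With a rate ratio of 1 the observed prevalence among survivors should track the true incidence curve, and it falls below that curve as the ratio rises, which is the selection effect the article studies.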
Estimates, Biases, and Adjustments
A common approach to assessing the size of education differentials in diabetes is to fit a logistic model to the log odds of being diabetic regressed on a constant, a dummy variable for the low-education group, some control variables, and the log of age.15 In the absence of risk heterogeneity (as manifested through differential mortality) in both education groups, the estimated coefficient of the education dummy variable will reflect (on average) the difference between the two solid lines shown in Fig. 2. If there are mortality differences only among the low-education group, then the estimated effect of the education dummy variable will reflect the difference between the two sets of symbols plotted in Fig. 2. These estimates lead to the erroneous inference that education has no effect on the incidence of diabetes. The bias depends on the size of the mortality differential in the low-education group.
Figure 3 also shows results for cases in which there are mortality differentials in both education groups. As pointed out earlier, if the magnitudes of the mortality differentials are the same, then there will be no bias in the estimated effect. This is shown in panel (b) of Fig. 3, where there are only small, random fluctuations in the coefficients for each level of the mortality differentials shown. In panel (c) of Fig. 3, the mortality differential is constant across the different scenarios for the high-education group while it increases for the low-education group. When the mortality differential is larger in the high-education group, the estimated effect of education contains an upward bias, and when the differential is larger in the low-education group, bias is downward. The final panel in this figure displays the bias as the sizes of the differentials increase at constant rates for each education group.
The unadjusted estimates exhibit the same patterns described earlier: if the differential is the same in each education group, there is no bias, but if the differential is larger in the low-education group, the estimated coefficient for the education dummy variable is downwardly biased. Conversely, if the differential is smaller in the low-education group, then the estimated coefficient will be upwardly biased.
The adjusted estimates are obtained after controlling for the baseline (standard) integrated hazard. In all cases, the integrated hazard associated with individuals aged x is evaluated using the conditional mean of the age of onset of diabetes in the true incidence curve as the lower bound of integration. Figure 4 displays the bias from the adjusted logistic model under various scenarios defined by the magnitude of risk heterogeneity in each education group. The average bias from the adjusted model forms a flat plane close to zero, and the mean adjusted estimate is within a few percentage points of the true effect. Contrast this to the 40 % (downward) bias of the unadjusted estimate when diabetes-specific mortality in the low-education group is four times higher than nondiabetic-specific mortality and there is no mortality difference among the high-education group.
These simulation results suggest that the proposed adjustment procedure is effective under the conditions used to simulate the data. However, the adjusted estimates are obtained using the true incidence curve to calculate the lower bounds of the integrated hazards that enter as an adjustment factor in the logistic model. How sensitive is the adjustment procedure to misidentification of the distribution used to calculate the mean ages of onset?
In summary, there is bias in the estimated effect of education on the log odds of being diabetic when the force of diabetes-specific mortality varies across the education groups. The bias can be reduced by assuming that the log odds of being diabetic are a linear function of the covariates and by including the integrated hazard of nondiabetic mortality as a covariate in the logistic regression model along with an interaction term between the integrated hazard and the dummy variable for education. For individual i aged xi, the lower limit of integration for this additional covariate is the conditional mean age of diabetes onset given that the individual has diabetes by age xi; the upper limit of integration is xi. The adjustment procedure yields correct results when the conditional means that serve to calculate adjustment factors are drawn from a distribution function that is similar to the one that underlies the occurrence of the event of interest. And although the results of the adjustment are sensitive to the specification of this distribution, it still performs much better than a naïve approach that ignores mortality differentials and how they differ across education groups.
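The logic of the adjustment factor summarized above can be checked numerically without any simulation noise. The sketch below assumes proportional excess mortality, a Gompertz baseline, and a log-logistic onset distribution; all parameter values and function names are hypothetical. It compares the exact log odds of disease among survivors with the approximation that subtracts (theta - 1) times the integrated hazard, where the lower limit is the conditional mean age at onset.

```python
import math

# Hypothetical values for illustration only.
START_AGE = 30.0
GOMP_A, GOMP_B = 5e-5, 0.09      # Gompertz baseline mortality
LOC, SCALE = 4.44, 0.5           # logistic law for log waiting time to onset


def cumhaz(lower, upper):
    """Integrated Gompertz hazard between two ages."""
    return (GOMP_A / GOMP_B) * (math.exp(GOMP_B * upper) - math.exp(GOMP_B * lower))


def onset_cdf(t):
    """Probability of onset within t years of START_AGE."""
    return 1.0 / (1.0 + math.exp(-(math.log(t) - LOC) / SCALE))


def onset_pdf(t):
    """Density of the waiting time to onset."""
    f = onset_cdf(t)
    return f * (1.0 - f) / (SCALE * t)


def midpoint(func, lower, upper, steps=4000):
    """Midpoint-rule numerical integration."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (k + 0.5) * width) for k in range(steps))


def log_odds_terms(age, theta):
    """Return (true, exact observed, approximate observed) log odds and the
    adjustment factor for survivors aged `age`."""
    t_max = age - START_AGE
    big_f = onset_cdf(t_max)
    true_logit = math.log(big_f / (1.0 - big_f))
    # Exact: each onset age carries the extra survival penalty
    # exp(-(theta - 1) * H(onset, age)); the common survival term cancels.
    weighted = midpoint(
        lambda t: onset_pdf(t) * math.exp(-(theta - 1.0) * cumhaz(START_AGE + t, age)),
        0.0, t_max)
    exact_logit = math.log(weighted / (1.0 - big_f))
    # Adjustment factor: integrate from the conditional mean onset age to `age`.
    mean_wait = midpoint(lambda t: t * onset_pdf(t), 0.0, t_max) / big_f
    adj = cumhaz(START_AGE + mean_wait, age)
    approx_logit = true_logit - (theta - 1.0) * adj
    return true_logit, exact_logit, approx_logit, adj
```

Calling `log_odds_terms(70.0, 2.0)` returns the three log odds side by side. The approximation is not exact, because it replaces every onset age by the conditional mean, but in a regression the coefficient on the adjustment factor is estimated freely rather than fixed at -(theta - 1).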
We evaluate the adjustment procedure using two data sets on elderly people: one for Mexico, MHAS, and the other for Puerto Rico, PREHCO. Both are panel surveys of elderly populations (aged 50 or older in MHAS and aged 60 or older in PREHCO), and they both consist of two waves, separated by two years (MHAS) and four years (PREHCO). The first waves were fielded in 2001 and 2002 in MHAS and PREHCO, respectively. Both surveys elicited self-reports on diabetes in the first and second waves, and in both cases there is information on interwave mortality. Within the limitations in population panel data of this kind, MHAS and PREHCO provide us with enough information to estimate mortality differences between diabetics and nondiabetics but not enough to estimate the true incidence of diabetes at adult ages.18
In our application, we use observed prevalence data in the first wave (the current-status information on diabetes) to estimate the effect of the education covariate both with and without the adjustment procedure. In addition, we retrieve an estimate of the mortality differential between diabetics and nondiabetics. In the absence of information on the true incidence of diabetes, a comparison of the estimated mortality differential to the observed mortality differential in each education group is the only benchmark we have to judge the performance of the adjustment procedure.
[Table: Estimates of the effects of education on the probability of being diabetic: MHAS. Columns: model with no adjustment, model with adjustment. Rows include log of age and dummy variable for education.]
We apply two adjustment factors, one for each education group. Using the observed mortality from the interwave period, we fit Gompertz models for each education group separately and irrespective of diabetes status. We then calculate the integrated hazard using the parameters of the Gompertz model for each education group. Thus, the standard mortality pattern used to calculate integrated hazards is the same within each education group (for diabetics and nondiabetics) but different across education groups.20 Recall that the regression coefficients associated with the integrated hazard are estimates of the differences in mortality levels between diabetics and nondiabetics in each education group, where the parameters (such as θz) are measures of the mortality levels for subgroup Z. In our case, these parameters correspond to the logs of the Gompertz constants. Because of the panel nature of MHAS, we can actually calculate these mortality levels directly and compare them with those obtained from the adjusted model. Although it is not a perfect test, this contrast will provide an indication of performance of the adjustment. Among those with low education, the observed difference in mortality levels between diabetics and nondiabetics is .59, whereas the estimated difference is .43 (minus the value of the regression coefficient). Among those with high education, the observed difference is .60, while the estimated difference is .36. The lack of full concordance between estimated and observed values is probably due to departures from the assumption of identical mortality patterns and/or deviations from the Gompertz model. While not perfect, the rather close agreement between observed and estimated values of mortality differentials is reassuring, and we take it as an indication of the suitability of the adjustment.
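The two steps described above, fitting a Gompertz model and computing the integrated hazard from its parameters, can be sketched as follows. The article does not state how the Gompertz models were fitted; the least-squares fit to log death rates used here is an assumption made for illustration, as are the function names and the input of age-specific death rates.

```python
import math


def fit_gompertz(ages, death_rates):
    """Least-squares fit of log(rate) = log(a) + b * age.

    Returns (a, b) for the hazard a * exp(b * age). All rates must be positive.
    """
    logs = [math.log(r) for r in death_rates]
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def integrated_hazard(lower, upper, a, b):
    """Integral of a * exp(b * s) from age `lower` to age `upper`."""
    return (a / b) * (math.exp(b * upper) - math.exp(b * lower))
```

In an application along the lines described, `fit_gompertz` would be run once per education group, and `integrated_hazard` would then be evaluated for each respondent between the conditional mean age at onset and the age at the survey.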
In summary, the suggested adjustment leads to changes in the estimate of the covariate that go in the expected direction, and although this cannot be interpreted as a true effect, we find confirmatory evidence in the modest differences between estimated and observed mortality differences between diabetics and nondiabetics.
[Table: Estimates of the effects of education on the probability of being diabetic: PREHCO. Columns: model with no adjustment, model with adjustment. Rows include log of age and dummy variable for education.]
Summary and Conclusion
By and large, conventional current-status analysis in particular and analyses of prevalence data in general give short shrift to potential errors that arise under risk heterogeneity: for example, when the risk of attrition prior to the time at which individuals' status is assessed depends on the occurrence/non-occurrence of the event of interest. Through suitable approximations and simulations, we show that even under mild conditions defining the regime of risk heterogeneity, the biases can be substantial and could lead to misleading inferences about the time profile of the underlying risks and/or about the effects of covariates. The adjustment procedure we propose is simple and can be deployed with little effort and with minimal knowledge about the age pattern of the risk of attrition. We show that the adjustment performs much better than the naïve estimate (which assumes no risk heterogeneity). The adjusted estimates are sensitive to the precise function governing the incidence of the event of interest, but even large departures from it will produce estimates that are much closer to the true values than naïve, unadjusted estimates.
Future research should proceed along three routes. The first is to investigate the asymptotic properties of the estimator suggested here. While these are well understood in the case of a logistic function, they are not so for other equally plausible functional forms. The second is to assess the robustness of the adjustment to an inaccurate rendition of the baseline hazard (e.g., mortality) that censors those who experience the event of interest more than those who do not. The integrated hazard on which the adjustment factor rests cannot be calculated without knowledge of this baseline hazard. This may be unproblematic in the case of adult mortality because what matters in these cases is to identify correctly the curvature of the hazard over the span of ages of interest, not its level. But in other applications, it may not be so clear what the baseline hazard should look like, let alone what its approximate curvature may be within a particular range of ages or durations. The third route of research is to assess the performance of the adjustment in a broader array of empirical cases and to determine the extent to which resulting estimates lead to correct inferences.
Lin et al. (1998) studied the case of differential mortality and proposed a model for current-status data collected in an experimental setting in which all subjects are observed (and the monitoring time depends on the event of interest). We consider a different situation in which current-status data are randomly sampled from a population and differential mortality is more likely to prevent population members who experience the event of interest, relative to those who do not, from surviving to (and thus being observed at) the time of the survey (for individuals of a given age).
See Palloni and Thomas (2011) for an additional example concerning trends in the prevalence of disability in the United States.
In what follows, risk homogeneity refers to a situation in which the risk of attrition before (and hence not being observed by) the time of a census or survey is independent of the event being studied. Conversely, risk heterogeneity is a situation in which precensus (survey) attrition occurs differentially among those who do and those who do not experience the event of interest.
See Goldman (1993) for a simulation study of the roles of marital selection and marital protection in producing mortality differences between the married and single populations.
Results of simulated values of SMAM under different conditions are available on request.
To move from Eq. (1) to Eq. (2), combine the terms in the second factor and note that the hazard and survival terms for disease onset combine to form the density function, which yields the distribution function when integrated.
If functional forms other than the logistic are deemed appropriate, the same conclusions about biases and inferential difficulties apply and only the functional form of the adjustment factor changes.
We simulate growing birth cohorts simply to mimic real populations. All results apply if all rates of growth are set equal to zero, or if there are no calendar time effects on fertility, mortality, or the incidence of diabetes.
Although we started with a large number of simulations, a small number was enough to produce sufficient Monte Carlo variation. As a consequence, we settled on a total of 25 replicas.
Some researchers (e.g., Smith 2007) prefer to fit probit models to prevalence data to make inferences about incidence. We have not investigated the magnitude of the biases when the researcher estimates a probit rather than a logit model.
Recall that in the simulated data, this coefficient has a true value equal to 1.0.
In the sensitivity analysis, there is no mortality difference among the high-education group, but in the low-education group, mortality is 4.48 times higher among diabetics compared with nondiabetics. The mean errors are smaller than those presented in Fig. 5 when the mortality differential decreases.
Although these panel data can be used to obtain (noisy or error-ridden) estimates of diabetes incidence for any subgroup, we cannot do so before ages 50 (Mexico) or 60 (Puerto Rico). Because diabetes in these countries has a relatively early onset, the set of observed incidence rates is too incomplete to retrieve reliable effects of covariates. Additional information can be obtained online (http://prehco.rcm.upr.edu for PREHCO, and http://www.mhas.pop.upenn.edu/english/home.htm for MHAS).
Low education is defined as less than 6 years of schooling, and high education is defined as 6 years or more.
This is a refinement that we can introduce only due to the panel nature of the data.
An earlier version of this article was presented at the annual meeting of the Population Association of America, Dallas, TX, April 14–17, 2010. This study was supported by grants from the National Institute of Aging (R01 AG016209 and R37 AG025216) and the Fogarty International Center (FIC) training program (5D43TW001586) to the Center for Demography and Ecology (CDE) and the Center for Demography of Health and Aging (CDHA), University of Wisconsin–Madison. CDE is funded by the NICHD Center Grant 5R24HD04783; CDHA is funded by the NIA Center Grant 5P30AG017266.