Additive versus multiplicative models in ecologic regression
Thompson, W.D. & Wartenberg, D. Stoch Environ Res Risk Assess (2007) 21: 635. doi:10.1007/s00477-007-0141-2
Abstract
Much research in environmental epidemiology relies on aggregate-level information on exposure to potentially toxic substances and on relevant covariates. We compare the use of additive (linear) and multiplicative (log-linear) regression models for the analysis of such data. We illustrate how both additive and multiplicative models can be fit to aggregate-level data sets in which disease incidence is the dependent variable, and contrast these results with similar models fitted to individual-level data. We find (1) that for aggregate-level data, multiplicative models are more likely than additive models to introduce bias into the estimation of rates, an effect not found with individual-level data; and (2) that under many circumstances multiplicative models reduce the precision of the estimates, an effect also not found in individual-level models. For both additive and multiplicative models of aggregate-level data, we find that, in the presence of covariates, narrow confidence intervals are obtained only when two or more antecedent factors are strongly related to the measured covariate and/or the exposure of primary substantive interest. We conclude that the equivalency of fitting additive versus multiplicative models in studies with individual-level binary data does not carry over to studies that analyze aggregate-level information. For aggregate data, we strongly recommend use of additive models.
1 Introduction
In epidemiologic studies of the effects of environmental hazards on human health, obtaining each person’s exposure history is a major challenge, yet it is critical to obtaining reliable estimates of exposure-disease relationships. Often, we cannot measure individual exposures directly because of prohibitive costs or the lack of relevant methodologies. In such instances, researchers often assign to each study subject a value for exposure that is a composite of the exposure information available for all people in the vicinity of the subject’s residence, or use a regional measure of ambient environmental quality. These aggregate-level summaries are used to represent exposure in a region, time period, or population. Although less specific than individual data, these summaries have proven extremely valuable in understanding etiology and developing policy.
For example, in some of the most important air pollution and health effects studies, which have been used as part of the basis of the US EPA’s Clean Air Act, the exposure estimates are based on data collected from routine air monitors (Dockery et al. 1993; Pope et al. 1995). These air pollution measures are regional values that are used to represent averages in both space and time. While they are less accurate for each study subject than individual values, they are more stable because they are averages, and smooth out individual, short-term, and aberrant variation. The results of studies using such values have been shown to be reliable and replicable by subsequent analyses (Samet et al. 2000).
Similar approaches are used routinely in the study of workers, in which exposures often are assigned based on job title, a categorization that often averages exposures over time and across tasks within a job title (Monson 1990). This specification is sufficient to differentiate exposures among groups of workers, but is not able to distinguish variations among workers in the same group or over time. Yet, studies using job titles and related job exposure matrices have been very successful in identifying workplace hazards.
Another set of investigations that use aggregate-level data to characterize risk factors is migrant studies (Parkin and Khlat 1996). In these studies, the disease experience of groups of people who move their residence from one country/culture to another (i.e., migrate) is compared with those who remain in their country of origin, as well as their new neighbors, and also over time. These studies have revealed striking changes in cancer incidence patterns, with these changes attributed to dietary differences related to location and acculturation over time.
The relevance of these sets of studies to this paper is that, while individual data usually are available for disease outcomes, personal characteristics, and sometimes personal behavior, in many broad-based environmental studies data on the principal risk factor of interest are available only at the aggregate level (i.e., each individual in the group is assigned the same value). In this paper, we consider methodology for studies in which the primary risk factor (which we also refer to as the exposure) is assessed at the aggregate level rather than at the individual level (Greenland 2001, 2002; Greenland and Robins 1994; Koepsell and Weiss 2003; Morgenstern 1998). We limit our consideration to a health outcome that can be expressed as an incidence rate, and we assume that information on the numbers of events and the person-time at risk (not just the rate calculated by taking their ratio) are available within each geographic area. These studies are variously referred to as ecologic studies, partially ecologic studies, and semi-individual studies (Bjork and Stromberg 2002; Kunzli and Tager 1997; Webster 2002). We prefer to refer to these studies simply as studies employing aggregate-level information on some risk factors, because that is their distinguishing feature.
In the epidemiologic literature on studies of exposures measured at the individual level, considerable attention has been paid to the rationale for choosing between additive and multiplicative models for regression analysis of incidence rates (Greenland and Poole 1988; Rothman et al. 1980; Thompson 1991; Weed et al. 1988). Our primary purpose here is to examine the relative merits of additive and multiplicative formulations for studying disease incidence in the specific situation where an environmental exposure is measured at the aggregate level rather than at the individual level. We consider a single binary exposure, as well as a binary covariate that may confound the association between exposure and the incidence of disease.
2 Bias in estimating the effect of a single binary exposure
When a single binary exposure is measured at the individual level, there is no substantive difference in fitting an additive or a multiplicative model in that the same estimates of exposure-specific disease rates will be obtained, provided that possible interaction is addressed in the model specification. When a binary exposure is measured at the aggregate level, however, the choice between an additive versus multiplicative formulation generally produces different estimates of exposure-specific disease rates, even if interaction terms are included in the specified model (Morgenstern 1998).
Under the fitted log-linear model, the predicted disease rate for the exposed is e^{−4.59339 + 1.67054} = 0.05378, and the predicted rate for the unexposed is e^{−4.59339} = 0.01012. The estimated rate difference is 0.04366 instead of the correct value of 0.02. The estimated rate ratio is 5.32 instead of the correct value of 3.0.
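The arithmetic behind these figures can be checked in a few lines. The sketch below assumes, as the form of the exponents suggests, that −4.59339 is the fitted intercept and 1.67054 the fitted exposure coefficient of the log-linear model; the variable names are illustrative and do not come from the original analysis.

```python
import math

intercept = -4.59339      # assumed role: fitted log rate for the unexposed
beta_exposure = 1.67054   # assumed role: fitted log rate ratio for exposure

rate_unexposed = math.exp(intercept)
rate_exposed = math.exp(intercept + beta_exposure)
rate_difference = rate_exposed - rate_unexposed
rate_ratio = rate_exposed / rate_unexposed

print(f"exposed rate    {rate_exposed:.5f}")
print(f"unexposed rate  {rate_unexposed:.5f}")
print(f"rate difference {rate_difference:.5f}")
print(f"rate ratio      {rate_ratio:.2f}")
```

Note that the rate ratio depends only on the exposure coefficient, so any bias in that single coefficient carries over directly to the ratio.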
Based on this simplest of situations, it appears that additive regression models may be more appropriate than multiplicative models for studies using aggregate level measurement of exposure.
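One way to see why the two formulations diverge is that the expected rate in an area is a prevalence-weighted mixture of the exposed and unexposed rates, which is linear in the prevalence of exposure but not in its logarithm. The sketch below is a hypothetical illustration, not the data set analyzed above: the five area prevalences are invented, the individual rates of 0.01 and 0.03 are those implied by the correct rate difference of 0.02 and rate ratio of 3.0, and ordinary least squares on the rate and on the log rate stands in for the Poisson regression fits.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


RATE_UNEXPOSED, RATE_EXPOSED = 0.01, 0.03   # implied by difference 0.02, ratio 3.0
prevalence = [0.1, 0.2, 0.3, 0.4, 0.5]      # hypothetical area-level prevalences

# Expected area rate: mixture of the two individual-level rates.
area_rate = [RATE_UNEXPOSED + (RATE_EXPOSED - RATE_UNEXPOSED) * p for p in prevalence]

a_lin, b_lin = fit_line(prevalence, area_rate)
a_log, b_log = fit_line(prevalence, [math.log(r) for r in area_rate])

# Extrapolate each fitted line to prevalence 0 (unexposed) and 1 (exposed).
linear_unexposed, linear_exposed = a_lin, a_lin + b_lin
loglinear_unexposed, loglinear_exposed = math.exp(a_log), math.exp(a_log + b_log)

print("linear:    ", linear_unexposed, linear_exposed)
print("log-linear:", loglinear_unexposed, loglinear_exposed)
```

Under these assumptions the linear fit returns the individual-level rates, whereas the log-linear extrapolation overstates both of them, the same direction of error as in the example above.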
3 Formulation for aggregate-level information on exposure in the presence of a covariate
To explore the relative merits of additive (linear) versus multiplicative (log-linear) modeling in somewhat more complex analytic situations, we begin by postulating three unmeasured binary variables (A, B, and C) that are causally prior to both the exposure of interest and a measured binary covariate. We further assume that the three unmeasured variables are independent of each other and that the only variation among geographic areas is in the prevalence of these three unmeasured variables. The latter assumption can be stated as follows: conditional on the three unmeasured variables, area is independent of the covariate, of exposure, and of the incidence of disease, a set of circumstances that implies lack of confounding of the exposure-disease association by geographic area. This formulation provides for a wide range of patterns for variation in the distribution of the aggregate-level variables for both exposure and the covariate across areas, as well as for various values of the covariance across areas for the two aggregate measures.
One important limitation of the work reported here is that we consider only a binary exposure and a binary covariate. While our results may generalize reasonably well to more than two levels for categorical variables, treating the exposure and the covariate as continuous variables raises additional complexities beyond the scope of this paper.
Table 1 Hypothetical proportions positive on a binary covariate and on a binary exposure, according to values on three unmeasured binary variables (A, B, and C)
Measured variable | A+, B+, C+ | A+, B+, C− | A+, B−, C+ | A+, B−, C− | A−, B+, C+ | A−, B+, C− | A−, B−, C+ | A−, B−, C− |
---|---|---|---|---|---|---|---|---|
Covariate | 0.80 | 0.80 | 0.50 | 0.50 | 0.50 | 0.50 | 0.20 | 0.20 |
Exposure | 0.80 | 0.50 | 0.50 | 0.20 | 0.80 | 0.50 | 0.50 | 0.20 |
Table 2 Three hypothetical scenarios in terms of variation across geographical units in the prevalence of an unmeasured binary variable
Thirds of distribution | High variation across geographical units | Low variation across geographical units |
---|---|---|
Lowest | 0.10 | 0.48 |
Middle | 0.50 | 0.50 |
Highest | 0.90 | 0.52 |
Our formulation considers only situations in which an individual-level analysis yields unbiased estimates and standard errors without incorporation of any information on area into the analysis. Consequently, we do not address the issue of spatial autocorrelation, and we assume that there are no contextual effects for exposure, i.e., that the proportion of individuals in an area who are exposed does not affect an individual’s risk independently of one’s own exposure status.
Table 3 Patterns of incidence rates considered for numerical evaluations (incidence rates are expressed per person per year)
Pattern of rates | Covariate +, exposure + | Covariate +, exposure − | Covariate −, exposure + | Covariate −, exposure − |
---|---|---|---|---|
Sub-additive | 0.020 | 0.018 | 0.005 | 0.001 |
Additive | 0.020 | 0.016 | 0.005 | 0.001 |
Supra-additive/sub-multiplicative | 0.020 | 0.010 | 0.005 | 0.001 |
Multiplicative | 0.020 | 0.004 | 0.005 | 0.001 |
Supra-multiplicative | 0.020 | 0.002 | 0.005 | 0.001 |
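The labels in the table above can be reproduced by comparing the rate when both factors are present with what exact additivity and exact multiplicativity of the joint effects would predict. The following sketch is an illustration only; the function name and the ordering of the comparisons are assumptions, and the ordering relies on the additive expectation being smaller than the multiplicative one, which holds for these rates.

```python
import math

# Rates in the order (cov+, exp+), (cov+, exp-), (cov-, exp+), (cov-, exp-).
PATTERNS = {
    "Sub-additive": (0.020, 0.018, 0.005, 0.001),
    "Additive": (0.020, 0.016, 0.005, 0.001),
    "Supra-additive/sub-multiplicative": (0.020, 0.010, 0.005, 0.001),
    "Multiplicative": (0.020, 0.004, 0.005, 0.001),
    "Supra-multiplicative": (0.020, 0.002, 0.005, 0.001),
}


def classify(r_both, r_cov_only, r_exp_only, r_neither):
    """Label the joint-effect pattern of four covariate/exposure-specific rates."""
    additive_expectation = r_cov_only + r_exp_only - r_neither
    multiplicative_expectation = r_cov_only * r_exp_only / r_neither
    if math.isclose(r_both, additive_expectation, rel_tol=1e-9):
        return "Additive"
    if math.isclose(r_both, multiplicative_expectation, rel_tol=1e-9):
        return "Multiplicative"
    if r_both < additive_expectation:
        return "Sub-additive"
    if r_both < multiplicative_expectation:
        return "Supra-additive/sub-multiplicative"
    return "Supra-multiplicative"


for name, rates in PATTERNS.items():
    print(f"{name}: classified as {classify(*rates)}")
```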
Table 4 Individual-level data in each of 27 hypothetical geographic units of size 1,000, according to the prevalence of three unmeasured binary variables: example with high variability across units for all three unmeasured variables and an additive pattern of rates
Geographic unit | n (covariate +, exposure +) | Expected number of events (covariate +, exposure +) | n (covariate +, exposure −) | Expected number of events (covariate +, exposure −) | n (covariate −, exposure +) | Expected number of events (covariate −, exposure +) | n (covariate −, exposure −) | Expected number of events (covariate −, exposure −) |
---|---|---|---|---|---|---|---|---|
1 | 76 | 1.51 | 184 | 2.95 | 184 | 0.92 | 556 | 0.56 |
2 | 107 | 2.14 | 153 | 2.45 | 273 | 1.37 | 467 | 0.47 |
3 | 138 | 2.76 | 122 | 1.95 | 362 | 1.81 | 378 | 0.38 |
4 | 167 | 3.34 | 213 | 3.41 | 213 | 1.07 | 407 | 0.41 |
5 | 213 | 4.25 | 168 | 2.68 | 288 | 1.44 | 333 | 0.33 |
6 | 258 | 5.16 | 122 | 1.95 | 362 | 1.81 | 258 | 0.26 |
7 | 258 | 5.16 | 242 | 3.87 | 242 | 1.21 | 258 | 0.26 |
8 | 318 | 6.36 | 182 | 2.91 | 302 | 1.51 | 198 | 0.20 |
9 | 378 | 7.56 | 122 | 1.95 | 362 | 1.81 | 138 | 0.14 |
10 | 107 | 2.14 | 273 | 4.37 | 153 | 0.77 | 467 | 0.47 |
11 | 153 | 3.05 | 228 | 3.64 | 228 | 1.14 | 393 | 0.39 |
12 | 198 | 3.96 | 182 | 2.91 | 302 | 1.51 | 318 | 0.32 |
13 | 213 | 4.25 | 288 | 4.60 | 168 | 0.84 | 333 | 0.33 |
14 | 273 | 5.45 | 228 | 3.64 | 228 | 1.14 | 273 | 0.27 |
15 | 333 | 6.65 | 168 | 2.68 | 288 | 1.44 | 213 | 0.21 |
16 | 318 | 6.36 | 302 | 4.83 | 182 | 0.91 | 198 | 0.20 |
17 | 393 | 7.85 | 228 | 3.64 | 228 | 1.14 | 153 | 0.15 |
18 | 467 | 9.34 | 153 | 2.45 | 273 | 1.37 | 107 | 0.11 |
19 | 138 | 2.76 | 362 | 5.79 | 122 | 0.61 | 378 | 0.38 |
20 | 198 | 3.96 | 302 | 4.83 | 182 | 0.91 | 318 | 0.32 |
21 | 258 | 5.16 | 242 | 3.87 | 242 | 1.21 | 258 | 0.26 |
22 | 258 | 5.16 | 362 | 5.79 | 122 | 0.61 | 258 | 0.26 |
23 | 333 | 6.65 | 288 | 4.60 | 168 | 0.84 | 213 | 0.21 |
24 | 407 | 8.14 | 213 | 3.41 | 213 | 1.07 | 167 | 0.17 |
25 | 378 | 7.56 | 362 | 5.79 | 122 | 0.61 | 138 | 0.14 |
26 | 467 | 9.34 | 273 | 4.37 | 153 | 0.77 | 107 | 0.11 |
27 | 556 | 11.11 | 184 | 2.95 | 184 | 0.92 | 76 | 0.08 |
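The cell sizes and expected events for a single unit can be derived from the hypothetical proportions and the additive pattern of rates given earlier. The sketch below rests on one assumption that is not stated explicitly in the text: that the covariate and the exposure are independent of each other conditional on the three unmeasured variables. Under that assumption it reproduces the tabulated values for unit 1; the function and constant names are illustrative.

```python
from itertools import product

# Proportions positive (True = "+"). As the tabulated proportions show, the
# covariate depends only on (A, B) and the exposure only on (B, C).
P_COVARIATE = {(True, True): 0.80, (True, False): 0.50,
               (False, True): 0.50, (False, False): 0.20}
P_EXPOSURE = {(True, True): 0.80, (True, False): 0.50,
              (False, True): 0.50, (False, False): 0.20}
# Additive pattern of rates, keyed by (covariate, exposure).
RATES = {(True, True): 0.020, (True, False): 0.016,
         (False, True): 0.005, (False, False): 0.001}


def unit_cells(p_a, p_b, p_c, size=1000):
    """Expected number of persons in each (covariate, exposure) cell of a unit."""
    cells = dict.fromkeys(RATES, 0.0)
    for a, b, c in product((True, False), repeat=3):
        # Probability of this combination of independent unmeasured variables.
        weight = ((p_a if a else 1 - p_a)
                  * (p_b if b else 1 - p_b)
                  * (p_c if c else 1 - p_c))
        for covariate, exposure in RATES:
            p_cov = P_COVARIATE[(a, b)] if covariate else 1 - P_COVARIATE[(a, b)]
            p_exp = P_EXPOSURE[(b, c)] if exposure else 1 - P_EXPOSURE[(b, c)]
            # Assumption: covariate and exposure are independent given A, B, C.
            cells[(covariate, exposure)] += size * weight * p_cov * p_exp
    return cells


cells = unit_cells(0.10, 0.10, 0.10)   # unit 1 of the high-variability example
for key, n in cells.items():
    print(key, round(n), round(n * RATES[key], 2))
```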
Table 5 Prevalence of a binary covariate and a binary exposure and incidence of disease in each of 27 hypothetical geographic units of size 1,000, according to the prevalence of three unmeasured binary variables: example with high variability across units for all three unmeasured variables and an additive pattern of rates
Unit | Prevalence of A | Prevalence of B | Prevalence of C | Prevalence of covariate | Prevalence of exposure | Expected number of events | Incidence rate (per person per year) |
---|---|---|---|---|---|---|---|
1 | 0.10 | 0.10 | 0.10 | 0.26 | 0.26 | 5.94 | 0.0059 |
2 | 0.10 | 0.10 | 0.50 | 0.26 | 0.38 | 6.42 | 0.0064 |
3 | 0.10 | 0.10 | 0.90 | 0.26 | 0.50 | 6.90 | 0.0069 |
4 | 0.10 | 0.50 | 0.10 | 0.38 | 0.38 | 8.22 | 0.0082 |
5 | 0.10 | 0.50 | 0.50 | 0.38 | 0.50 | 8.70 | 0.0087 |
6 | 0.10 | 0.50 | 0.90 | 0.38 | 0.62 | 9.18 | 0.0092 |
7 | 0.10 | 0.90 | 0.10 | 0.50 | 0.50 | 10.50 | 0.0105 |
8 | 0.10 | 0.90 | 0.50 | 0.50 | 0.62 | 10.98 | 0.0110 |
9 | 0.10 | 0.90 | 0.90 | 0.50 | 0.74 | 11.46 | 0.0115 |
10 | 0.50 | 0.10 | 0.10 | 0.38 | 0.26 | 7.74 | 0.0077 |
11 | 0.50 | 0.10 | 0.50 | 0.38 | 0.38 | 8.22 | 0.0082 |
12 | 0.50 | 0.10 | 0.90 | 0.38 | 0.50 | 8.70 | 0.0087 |
13 | 0.50 | 0.50 | 0.10 | 0.50 | 0.38 | 10.02 | 0.0100 |
14 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 10.50 | 0.0105 |
15 | 0.50 | 0.50 | 0.90 | 0.50 | 0.62 | 10.98 | 0.0110 |
16 | 0.50 | 0.90 | 0.10 | 0.62 | 0.50 | 12.30 | 0.0123 |
17 | 0.50 | 0.90 | 0.50 | 0.62 | 0.62 | 12.78 | 0.0128 |
18 | 0.50 | 0.90 | 0.90 | 0.62 | 0.74 | 13.26 | 0.0133 |
19 | 0.90 | 0.10 | 0.10 | 0.50 | 0.26 | 9.54 | 0.0095 |
20 | 0.90 | 0.10 | 0.50 | 0.50 | 0.38 | 10.02 | 0.0100 |
21 | 0.90 | 0.10 | 0.90 | 0.50 | 0.50 | 10.50 | 0.0105 |
22 | 0.90 | 0.50 | 0.10 | 0.62 | 0.38 | 11.82 | 0.0118 |
23 | 0.90 | 0.50 | 0.50 | 0.62 | 0.50 | 12.30 | 0.0123 |
24 | 0.90 | 0.50 | 0.90 | 0.62 | 0.62 | 12.78 | 0.0128 |
25 | 0.90 | 0.90 | 0.10 | 0.74 | 0.50 | 14.10 | 0.0141 |
26 | 0.90 | 0.90 | 0.50 | 0.74 | 0.62 | 14.58 | 0.0146 |
27 | 0.90 | 0.90 | 0.90 | 0.74 | 0.74 | 15.06 | 0.0151 |
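The aggregate values in this table follow a simple linear structure, which helps explain why a linear model fits them exactly when the pattern of rates is additive. The sketch below is a reading of the tabulated numbers rather than a formula given in the text: the proportions positive can be written as 0.20 + 0.30·P(A) + 0.30·P(B) for the covariate and 0.20 + 0.30·P(B) + 0.30·P(C) for the exposure, and the additive rates as 0.001 + 0.015·covariate + 0.004·exposure, so the area rate is the same linear function of the two prevalences.

```python
from itertools import product

LEVELS = (0.10, 0.50, 0.90)   # prevalences in the high-variation scenario


def unit_summary(p_a, p_b, p_c):
    """Prevalence of covariate and exposure, and incidence rate, for one unit."""
    prev_covariate = 0.20 + 0.30 * p_a + 0.30 * p_b
    prev_exposure = 0.20 + 0.30 * p_b + 0.30 * p_c
    # Additive rates: baseline 0.001, +0.015 for covariate, +0.004 for exposure.
    rate = 0.001 + 0.015 * prev_covariate + 0.004 * prev_exposure
    return prev_covariate, prev_exposure, rate


# product() varies C fastest, matching the unit numbering in the table.
summaries = [unit_summary(*p) for p in product(LEVELS, repeat=3)]
for unit, (cov, exp, rate) in enumerate(summaries, start=1):
    print(f"{unit:2d}  {cov:.2f}  {exp:.2f}  {1000 * rate:6.2f}  {rate:.4f}")
```

Because no product term is needed to describe the additive pattern, the area-level rate involves only the marginal prevalences, not the joint distribution of covariate and exposure within each unit.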
4 Implementation of regression models
To clarify the nature of the data used for both individual-level and aggregate-level analyses, we illustrate the set-up and processing of the relevant data using the SAS® software. This template may be readily adapted to the syntax of other statistical packages.
We fit linear and log-linear regression models to the expected sample outcomes based on exposure measured at both the individual and aggregate levels using the SAS® statistical procedure PROC GENMOD for generalized linear models (SAS Institute 2004). For linear modeling the dependent variable was the number of events, which is assumed to follow a Poisson distribution (SAS® GENMOD option d = poisson), and the denominator information on person-time was incorporated using a SAS® weight statement. The link function was specified as the identity function (SAS® GENMOD option link = id).
For log-linear modeling, the number of events was likewise assumed to follow a Poisson distribution, but the link function was logarithmic and the denominator information was incorporated by using a SAS® GENMOD offset variable, here the logarithm of the person-time.
The SAS® estimate statement within PROC GENMOD provides a straightforward method for calculating point and interval estimates of the four covariate- and exposure-specific rates. In the case of log-linear models, the estimates of the log rates are exponentiated using the “exp” option.
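To make the last step concrete, the following Python sketch shows how four covariate- and exposure-specific rates follow from a set of regression coefficients under each link function. It is an analogue written for illustration, not the SAS code used in the analysis; the 0/1 coding, the inclusion of a product term, and the function name are assumptions, and the coefficients are chosen to match the additive and multiplicative patterns of rates considered in this paper.

```python
import math


def predicted_rates(b0, b_cov, b_exp, b_product, link="identity"):
    """Rates for the four (covariate, exposure) combinations, coded 1/0."""
    rates = {}
    for covariate in (1, 0):
        for exposure in (1, 0):
            linear_predictor = (b0 + b_cov * covariate + b_exp * exposure
                                + b_product * covariate * exposure)
            # A log link models the log rate, so the predictor is exponentiated.
            rates[(covariate, exposure)] = (
                math.exp(linear_predictor) if link == "log" else linear_predictor
            )
    return rates


# Additive pattern: baseline 0.001, +0.015 for covariate, +0.004 for exposure.
additive = predicted_rates(0.001, 0.015, 0.004, 0.0, link="identity")

# Multiplicative pattern: baseline 0.001, rate ratio 4 for covariate, 5 for exposure.
multiplicative = predicted_rates(math.log(0.001), math.log(4), math.log(5), 0.0,
                                 link="log")

print(additive)
print(multiplicative)
```

With individual-level data and a product term, either link can represent any set of four rates; the coefficients simply take different values on the two scales.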
5 Differences among regression models when the pattern of disease rates is additive
Inspection of the numerical results in Fig. 6, in which the joint effects of the covariate and exposure are assumed to be additive in the population, indicates that log-linear regression models can produce biased estimates when aggregate-level exposure is employed in the analysis. Note specifically that, although the correct values for the rates are obtained in Fig. 4 when the individual-level information on the covariate and exposure is used, the same is not true in Fig. 6, where the aggregate-level information has been used instead. In the log-linear regression analysis, the estimates for the four covariate- and exposure-specific rates are 0.0187, 0.0246, 0.0087, and 0.0029 rather than the correct values of 0.0200, 0.0160, 0.0050, and 0.0010. The rate ratio calculated from these biased estimates of the rates is 0.76 for the effect of exposure when the covariate factor is present and 3.00 when the covariate factor is absent. The correct values are 1.25 and 5.00, respectively. The rate difference calculated from these biased estimates is −0.0059 for the effect of exposure when the covariate factor is present and 0.0058 when the covariate factor is absent. The correct value is 0.0040, regardless of whether the covariate factor is present or absent. The difference in the signs of the rate differences for the log-linear analysis gives the mistaken impression that on the additive scale there is a qualitative (cross-over) interaction (Gail and Simon 1985; Peto 1982; Thompson 1991).
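The rate ratios and rate differences quoted in this paragraph follow directly from the four estimated rates. A short check, using only the numbers given above (the dictionary layout and function name are illustrative):

```python
# Rates keyed by (covariate, exposure); "+" = present, "-" = absent.
estimated = {("+", "+"): 0.0187, ("+", "-"): 0.0246,
             ("-", "+"): 0.0087, ("-", "-"): 0.0029}   # aggregate, log-linear
correct = {("+", "+"): 0.0200, ("+", "-"): 0.0160,
           ("-", "+"): 0.0050, ("-", "-"): 0.0010}     # additive pattern


def exposure_contrasts(rates):
    """Rate ratio and rate difference for exposure within each covariate stratum."""
    contrasts = {}
    for covariate in "+-":
        exposed = rates[(covariate, "+")]
        unexposed = rates[(covariate, "-")]
        contrasts[covariate] = (exposed / unexposed, exposed - unexposed)
    return contrasts


for label, rates in (("estimated", estimated), ("correct", correct)):
    for covariate, (ratio, difference) in exposure_contrasts(rates).items():
        print(f"{label:9s} covariate {covariate}: "
              f"ratio {ratio:.2f}, difference {difference:+.4f}")
```

The opposite signs of the two estimated rate differences are what produce the apparent cross-over interaction on the additive scale.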
Comparison of the confidence intervals for the estimated rates in Figs. 4 and 6 clearly indicates that use of the aggregate-level information on exposure results in inferior precision of estimation, regardless of whether a linear or a log-linear regression model is employed.
6 Differences among regression models when the pattern of disease rates is non-additive
Table 6 Predicted rates and 95% confidence intervals for linear and log-linear analyses based on individual-level and aggregate-level measurement of exposure, according to various patterns of incidence of disease: example with high variability across units for all three unmeasured variables
Level of exposure variable | Model | Exposure | Rate (covariate +) | 95% CI (covariate +) | Rate (covariate −) | 95% CI (covariate −) |
---|---|---|---|---|---|---|
Sub-additive | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0180 | 0.0146 to 0.0214 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0180 | 0.0149 to 0.0217 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0200 | 0.0019 to 0.0382 | 0.0049 | −0.0142 to 0.0240 |
Aggregate | Linear | − | 0.0179 | −0.0021 to 0.0379 | 0.0010 | −0.0146 to 0.0167 |
Aggregate | Log-linear | + | 0.0187 | 0.0040 to 0.0868 | 0.0085 | 0.0013 to 0.0577 |
Aggregate | Log-linear | − | 0.0276 | 0.0044 to 0.1742 | 0.0031 | 0.0005 to 0.0181 |
Additive | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0200 | 0.0022 to 0.0378 | 0.0050 | −0.0137 to 0.0237 |
Aggregate | Linear | − | 0.0160 | −0.0035 to 0.0355 | 0.0010 | −0.0143 to 0.0163 |
Aggregate | Log-linear | + | 0.0187 | 0.0039 to 0.0893 | 0.0087 | 0.0012 to 0.0611 |
Aggregate | Log-linear | − | 0.0246 | 0.0037 to 0.1619 | 0.0029 | 0.0005 to 0.0175 |
Supra-additive/sub-multiplicative | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0100 | 0.0075 to 0.0125 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0100 | 0.0078 to 0.0128 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0199 | 0.0032 to 0.0366 | 0.0053 | −0.0123 to 0.0228 |
Aggregate | Linear | − | 0.0103 | −0.0076 to 0.0282 | 0.0009 | −0.0132 to 0.0149 |
Aggregate | Log-linear | + | 0.0188 | 0.0036 to 0.0991 | 0.0095 | 0.0012 to 0.0757 |
Aggregate | Log-linear | − | 0.0164 | 0.0021 to 0.1258 | 0.0022 | 0.0003 to 0.0155 |
Multiplicative | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0040 | 0.0024 to 0.0056 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0040 | 0.0027 to 0.0059 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0198 | 0.0043 to 0.0353 | 0.0056 | −0.0107 to 0.0218 |
Aggregate | Linear | − | 0.0046 | −0.0116 to 0.0207 | 0.0008 | −0.0119 to 0.0134 |
Aggregate | Log-linear | + | 0.0192 | 0.0032 to 0.1142 | 0.0111 | 0.0012 to 0.1037 |
Aggregate | Log-linear | − | 0.0098 | 0.0010 to 0.0921 | 0.0015 | 0.0002 to 0.0133 |
Supra-multiplicative | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0020 | 0.0009 to 0.0031 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0020 | 0.0011 to 0.0035 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0197 | 0.0047 to 0.0348 | 0.0057 | −0.0101 to 0.0214 |
Aggregate | Linear | − | 0.0027 | −0.0128 to 0.0182 | 0.0007 | −0.0114 to 0.0129 |
Aggregate | Log-linear | + | 0.0195 | 0.0031 to 0.1212 | 0.0120 | 0.0012 to 0.1191 |
Aggregate | Log-linear | − | 0.0080 | 0.0008 to 0.0816 | 0.0013 | 0.0001 to 0.0124 |
The numerical results in Table 6 indicate that, when aggregate-level information on the covariate and on exposure is employed for the analysis, fitting a log-linear regression model leads to biased estimates of rates, even when the underlying pattern of joint effects conforms to multiplicativity. For the section of the table based on multiplicativity for joint effects, it has been assumed that the presence of exposure increases the rate of disease fivefold, regardless of whether the binary covariate is present or absent. However, fitting a log-linear model to the aggregate-level data for this situation yields biased estimates of 0.0192/0.0098 = 2.0 for the rate ratio when the covariate is present and 0.0111/0.0015 = 7.4 when the covariate is absent.
Note that fitting of a linear model to the aggregate-level exposure data also leads to bias in the estimates when the pattern of rates for the combined effects of the covariate and exposure is something other than additive. However, this bias is of considerably smaller magnitude than is the bias when a log-linear model is fit to the aggregate-level data.
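Table 6 also indicates that the expected confidence intervals around the estimated rates are substantially wider for analyses based on aggregate-level data than for those based on individual-level data. Of the two formulations for analysis of the aggregate-level data, the log-linear regression model yields the wider interval for three of the four covariate- and exposure-specific rates under every pattern considered; the exception is the rate for unexposed persons without the covariate, for which the linear interval is wider and extends below zero.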
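The comparison of interval widths can be made explicit from the tabulated limits. The sketch below uses the aggregate-level limits reported for the additive pattern of rates; the data layout and names are illustrative.

```python
# 95% confidence limits for aggregate-level analyses, additive pattern of rates,
# keyed by (covariate, exposure).
LINEAR = {("+", "+"): (0.0022, 0.0378), ("+", "-"): (-0.0035, 0.0355),
          ("-", "+"): (-0.0137, 0.0237), ("-", "-"): (-0.0143, 0.0163)}
LOG_LINEAR = {("+", "+"): (0.0039, 0.0893), ("+", "-"): (0.0037, 0.1619),
              ("-", "+"): (0.0012, 0.0611), ("-", "-"): (0.0005, 0.0175)}


def width(limits):
    """Width of a confidence interval given (lower, upper)."""
    lower, upper = limits
    return upper - lower


# Cells in which the log-linear interval is wider than the linear one.
wider_loglinear = [cell for cell in LINEAR
                   if width(LOG_LINEAR[cell]) > width(LINEAR[cell])]
# Cells in which the linear interval includes inadmissible negative rates.
negative_lower = [cell for cell, (lower, _) in LINEAR.items() if lower < 0]

for cell in LINEAR:
    print(cell, f"linear {width(LINEAR[cell]):.4f}",
          f"log-linear {width(LOG_LINEAR[cell]):.4f}")
print("log-linear wider in:", wider_loglinear)
print("linear lower limit below zero in:", negative_lower)
```

For this pattern the log-linear interval is wider in three of the four cells; in the cell for unexposed persons without the covariate the linear interval is wider, and three of the four linear intervals have negative lower limits.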
7 Effects of variation across geographic areas in the prevalence of unmeasured antecedent variables
Table 7 Predicted rates and 95% confidence intervals for linear and log-linear analyses based on individual-level and aggregate-level measurement of exposure, according to variation across geographic area in three unmeasured antecedent variables: additive pattern of rates
Level of exposure variable | Model | Exposure | Rate (covariate +) | 95% CI (covariate +) | Rate (covariate −) | 95% CI (covariate −) |
---|---|---|---|---|---|---|
Low variation across geographic areas in an unmeasured antecedent variable that is associated with only the exposure | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0200 | −0.0064 to 0.0464 | 0.0050 | −0.0253 to 0.0353 |
Aggregate | Linear | − | 0.0160 | −0.0132 to 0.0452 | 0.0010 | −0.0230 to 0.0250 |
Aggregate | Log-linear | + | 0.0148 | 0.0013 to 0.1637 | 0.0110 | 0.0006 to 0.1996 |
Aggregate | Log-linear | − | 0.0324 | 0.0016 to 0.6501 | 0.0023 | 0.0002 to 0.0318 |
Low variation across geographic areas in an unmeasured antecedent variable that is associated with both the covariate and exposure | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0200 | −0.0139 to 0.0539 | 0.0050 | −0.0272 to 0.0372 |
Aggregate | Linear | − | 0.0160 | −0.0174 to 0.0494 | 0.0010 | −0.0307 to 0.0327 |
Aggregate | Log-linear | + | 0.0225 | 0.0011 to 0.4795 | 0.0071 | 0.0003 to 0.1777 |
Aggregate | Log-linear | − | 0.0202 | 0.0009 to 0.4491 | 0.0036 | 0.0001 to 0.0957 |
Low variation across geographic areas in an unmeasured antecedent variable that is associated with only the covariate | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0200 | −0.0062 to 0.0462 | 0.0050 | −0.0240 to 0.0340 |
Aggregate | Linear | − | 0.0160 | −0.0151 to 0.0471 | 0.0010 | −0.0239 to 0.0259 |
Aggregate | Log-linear | + | 0.0202 | 0.0018 to 0.2283 | 0.0080 | 0.0004 to 0.1658 |
Aggregate | Log-linear | − | 0.0227 | 0.0013 to 0.3834 | 0.0032 | 0.0002 to 0.0416 |
Low variation across geographic areas in unmeasured antecedent variables that are associated with only the covariate and with only exposure | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0200 | −0.0243 to 0.0643 | 0.0050 | −0.1747 to 0.1847 |
Aggregate | Linear | − | 0.0160 | −0.1637 to 0.1957 | 0.0010 | −0.0417 to 0.0437 |
Aggregate | Log-linear | + | 0.0173 | 0.0003 to 1.0489 | 0.0095 | 0.0000 to ∞ |
Aggregate | Log-linear | − | 0.0270 | 0.0000 to ∞ | 0.0028 | 0.0000 to 0.1957 |
Low variation across geographic areas in unmeasured antecedent variables that are associated with only the covariate and with both the covariate and exposure | | | | | | |
Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
Aggregate | Linear | + | 0.0200 | −0.4414 to 0.4814 | 0.0050 | −0.4564 to 0.4664 |
Aggregate | Linear | − | 0.0160 | −0.4390 to 0.4710 | 0.0010 | −0.4537 to 0.4557 |
Aggregate | Log-linear | + | 0.0222 | 0.0000 to ∞ | 0.0072 | 0.0000 to ∞ |
Aggregate | Log-linear | − | 0.0207 | 0.0000 to ∞ | 0.0036 | 0.0000 to ∞ |
Also of note in Table 7 is the clear indication that if variability across geographic units is low for two of the three antecedent unmeasured variables, then the expected confidence intervals are extremely wide for analysis of aggregate-level information, especially when a log-linear model is fit to the data.
8 Discussion
The numerical results presented in this paper provide strong support for the use of linear rather than log-linear regression modeling of disease incidence in environmental epidemiology when the available measure of exposure is at the aggregate level rather than at the individual level. Use of log-linear regression formulations (i.e., fitting multiplicative models) to analyze incidence rates can result in appreciable bias in (1) quantitative comparison of disease in exposed versus unexposed groups in the absence of covariates; and (2) the predicted covariate- and exposure-specific rates when a binary covariate is taken into account. These results are consistent with the work of Bjork and Stromberg (2005), who used a rather different formulation and found that, for the analysis of proportions rather than rates, a linear odds regression model performed better than the logistic model.
In traditional individual-level epidemiologic research, discussions of additive versus multiplicative models have centered around the issue of the appropriate definition of the absence of interaction when multiple causal factors are considered simultaneously (Rothman et al. 1980). However, in that realm, the pattern of rates for the four combinations of a binary exposure of primary interest and a binary covariate can be estimated in an unbiased manner by inclusion of a product variable in either a linear or log-linear regression analysis. For analysis of studies based on aggregate-level information, however, log-linear analysis results in biased estimates, even if an interaction term is included.
A further limitation of a log-linear formulation for analysis of aggregate-level exposure is that it may lead to a more complex characterization of the data than is really necessary. For example, in Fig. 6 we illustrated an instance in which there appears to be a qualitative (i.e., cross-over) interaction when in fact the underlying rates follow a perfect additive pattern.
In terms of precision of estimation, studies that use aggregate information on exposure are shown to have much wider expected confidence intervals than studies in which individual-level information is available. This problem is exacerbated when log-linear models are fit to the data.
Consideration of patterns of incidence ranging from sub-additive to supra-multiplicative confirms what others have noted regarding bias in the analysis of aggregate-level data whenever the pattern of rates is non-additive. However, for the range of situations considered, we found that this bias was considerably smaller when additive models were fit than when multiplicative ones were fit. Furthermore, the absolute magnitude of the bias for additive models was numerically small. This latter finding is especially important in environmental epidemiology because the only way to completely remove the bias is to reparameterize the problem so that, rather than including a product variable, three indicator variables are introduced into the regression model to compare each of three combinations of covariate status and exposure status to the fourth (referent) category (Greenland and Robins 1994). Such a parameterization requires information on the joint distribution of the covariate and exposure within each geographic unit, information that often is not available. For example, one might be able to obtain information on both the prevalence of exposure to high concentrations of particulate air pollution and the prevalence of smoking in each area. But typically these would be obtained from different sources (e.g., environmental monitoring systems versus behavioral surveys such as the Behavioral Risk Factor Survey of the Centers for Disease Control and Prevention), so that the joint distribution would be unknown.
One potential disadvantage of fitting additive regression models to aggregate-level data is that a negative point estimate or confidence limit for the intercept term may sometimes be obtained, precluding estimation of the rate ratio (Koepsell and Weiss 2003; Morgenstern 1998). In such circumstances, however, estimation of the rate difference is not generally precluded. Due to problems of bias and imprecision, our results suggest that the avoidance of an inadmissible intercept value may not be sufficient basis for adopting log-linear formulations.
Our provision for three unmeasured antecedent factors that are causally prior to exposure and a binary covariate indicates that two or more of these factors must be highly variable across geographic units in order for an aggregate-level analysis to provide precise estimates of effects at the individual level. This finding implies that large variability in exposure across units is not in itself sufficient to ensure an informative aggregate-level analysis. There must also be substantial variation in exposure conditional on any confounding factor included in the analysis.
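One way to see why two or more antecedent factors must vary is to look at how closely the aggregate covariate and exposure prevalences track each other across units. The sketch below is an illustration added here, not part of the reported analysis: it assumes the prevalences depend on the three unmeasured variables as 0.20 + 0.30·P(A) + 0.30·P(B) for the covariate and 0.20 + 0.30·P(B) + 0.30·P(C) for the exposure, consistent with the hypothetical proportions used in this paper, and it uses the correlation between the two aggregate measures as a rough indicator of how well their effects can be separated.

```python
import math
from itertools import product

HIGH = (0.10, 0.50, 0.90)   # high variation across units
LOW = (0.48, 0.50, 0.52)    # low variation across units


def prevalence_correlation(levels_a, levels_b, levels_c):
    """Correlation across units between covariate and exposure prevalence."""
    covariate, exposure = [], []
    for p_a, p_b, p_c in product(levels_a, levels_b, levels_c):
        covariate.append(0.20 + 0.30 * p_a + 0.30 * p_b)
        exposure.append(0.20 + 0.30 * p_b + 0.30 * p_c)
    n = len(covariate)
    mean_cov = sum(covariate) / n
    mean_exp = sum(exposure) / n
    s_ce = sum((c - mean_cov) * (e - mean_exp) for c, e in zip(covariate, exposure))
    s_cc = sum((c - mean_cov) ** 2 for c in covariate)
    s_ee = sum((e - mean_exp) ** 2 for e in exposure)
    return s_ce / math.sqrt(s_cc * s_ee)


all_high = prevalence_correlation(HIGH, HIGH, HIGH)
only_b_high = prevalence_correlation(LOW, HIGH, LOW)   # A and C vary little

print(f"all three vary:      r = {all_high:.3f}")
print(f"only B varies much:  r = {only_b_high:.4f}")
```

When only the factor shared by the covariate and the exposure varies appreciably, the two aggregate measures are almost perfectly correlated, so exposure varies widely across units yet hardly at all conditional on the covariate.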