Stochastic Environmental Research and Risk Assessment, Volume 21, Issue 5, pp 635–646

Additive versus multiplicative models in ecologic regression

Authors

  • W. Douglas Thompson
    • Department of Applied Medical Sciences, University of Southern Maine
  • Daniel Wartenberg
    • Department of Environmental and Occupational Medicine, UMDNJ–Robert Wood Johnson Medical School
Original Paper

DOI: 10.1007/s00477-007-0141-2

Cite this article as:
Thompson, W.D. & Wartenberg, D. Stoch Environ Res Risk Assess (2007) 21: 635. doi:10.1007/s00477-007-0141-2

Abstract

Much research in environmental epidemiology relies on aggregate-level information on exposure to potentially toxic substances and on relevant covariates. We compare the use of additive (linear) and multiplicative (log-linear) regression models for the analysis of such data. We illustrate how both additive and multiplicative models can be fit to aggregate-level data sets in which disease incidence is the dependent variable, and contrast these results with similar models fitted to individual-level data. We find (1) that for aggregate-level data, multiplicative models are more likely than additive models to introduce bias into the estimation of rates, an effect not found with individual-level data; and (2) that under many circumstances multiplicative models reduce the precision of the estimates, an effect also not found in individual-level models. For both additive and multiplicative models of aggregate-level data, we find that, in the presence of covariates, narrow confidence intervals are obtained only when two or more antecedent factors are strongly related to the measured covariate and/or the exposure of primary substantive interest. We conclude that the equivalency of fitting additive versus multiplicative models in studies with individual-level binary data does not carry over to studies that analyze aggregate-level information. For aggregate data, we strongly recommend use of additive models.

1 Introduction

In epidemiologic studies of the effects of environmental hazards on human health, obtaining each person’s exposure history is a major challenge, yet it is critical to obtaining reliable estimates of exposure-disease relationships. Often, we cannot directly measure individual exposures because of prohibitive costs or the lack of relevant methodologies. In such instances, researchers often assign to each study subject a value for exposure that is a composite of the exposure information available for all people in the vicinity of the subject’s residence, or use a regional measure of ambient environmental quality. These aggregate-level summaries are used to represent exposure in a region, time period, or population. Although less specific than individual data, these summaries have proven extremely valuable in understanding etiology and developing policy.

For example, in some of the most important air pollution and health effects studies, which have been used as part of the basis of the US EPA’s Clean Air Act, the exposure estimates are based on data collected from routine air monitors (Dockery et al. 1993; Pope et al. 1995). These air pollution measures are regional values that are used to represent averages in both space and time. While they are less accurate for each study subject than individual values, they are more stable because they are averages, and smooth out individual, short-term, and aberrant variation. The results of studies using such values have been shown to be reliable and replicable by subsequent analyses (Samet et al. 2000).

Similar approaches are used routinely in the study of workers, in which exposures often are assigned based on job title, a categorization that often averages exposures over time and across tasks within a job title (Monson 1990). This specification is sufficient to differentiate exposures among groups of workers, but is not able to distinguish variations among workers in the same group or over time. Yet, studies using job titles and related job exposure matrices have been very successful in identifying workplace hazards.

Another set of investigations that use aggregate-level data to characterize risk factors is migrant studies (Parkin and Khlat 1996). In these studies, the disease experience of groups of people who move their residence from one country/culture to another (i.e., migrate) is compared with those who remain in their country of origin, as well as their new neighbors, and also over time. These studies have revealed striking changes in cancer incidence patterns, with these changes attributed to dietary differences related to location and acculturation over time.

The relevance of these sets of studies to this paper is that, while individual data usually are available for disease outcomes, personal characteristics, and sometimes personal behavior, in many broad-based environmental studies data on the principal risk factor of interest are available only at the aggregate level (i.e., each individual in the group is assigned the same value). In this paper, we consider methodology for studies in which the primary risk factor (which we also refer to as the exposure) is assessed at the aggregate level rather than at the individual level (Greenland 2001, 2002; Greenland and Robins 1994; Koepsell and Weiss 2003; Morgenstern 1998). We limit our consideration to a health outcome that can be expressed as an incidence rate, and we assume that information on the numbers of events and the person-time at risk (not just the rate calculated by taking their ratio) is available within each geographic area. These studies are variously referred to as ecologic studies, partially ecologic studies, and semi-individual studies (Bjork and Stromberg 2002; Kunzli and Tager 1997; Webster 2002). We prefer to refer to these studies simply as studies employing aggregate-level information on some risk factors, because that is their distinguishing feature.

In the epidemiologic literature on studies of exposures measured at the individual level, considerable attention has been paid to the rationale for choosing between additive and multiplicative models for regression analysis of incidence rates (Greenland and Poole 1988; Rothman et al. 1980; Thompson 1991; Weed et al. 1988). Our primary purpose here is to examine the relative merits of additive and multiplicative formulations for studying disease incidence in the specific situation where an environmental exposure is measured at the aggregate level rather than at the individual level. We consider a single binary exposure, as well as a binary covariate that may confound the association between exposure and the incidence of disease.

2 Bias in estimating the effect of a single binary exposure

When a single binary exposure is measured at the individual level, there is no substantive difference in fitting an additive or a multiplicative model in that the same estimates of exposure-specific disease rates will be obtained, provided that possible interaction is addressed in the model specification. When a binary exposure is measured at the aggregate level, however, the choice between an additive versus multiplicative formulation generally produces different estimates of exposure-specific disease rates, even if interaction terms are included in the specified model (Morgenstern 1998).

Consider, for example, the simple situation of just two geographic areas. Suppose that in exposed individuals the rate of disease is 0.03 per person per year, whereas in unexposed individuals it is 0.01 per person per year. If the percentages exposed in the two areas are 5 and 15%, then, in the absence of confounding by area, the incidence rates in the two geographical areas will be
$$ 0.05 \times 0.03 + 0.95 \times 0.01 = 0.011 $$
and
$$ 0.15 \times 0.03 + 0.85 \times 0.01 = 0.013 $$
respectively.
A linear fit to the data using aggregate-level information on exposure gives
$$ \text{rate} = 0.01 + 0.02 \times \text{EXPOSED} $$
where EXPOSED is a binary 1/0 independent variable for the presence/absence of exposure. The predicted disease rates from this analysis are the correct values of 0.03 and 0.01 for those with and without the exposure, giving a rate difference of 0.03 − 0.01 = 0.02, and a rate ratio of 0.03/0.01 = 3.0.
On the other hand, a log-linear fit to the data gives
$$ \log(\text{rate}) = -4.59339 + 1.67054 \times \text{EXPOSED} $$

The predicted disease rate for the exposed is $e^{-4.59339 + 1.67054} = 0.05378$, and the rate for the unexposed is $e^{-4.59339} = 0.01012$. The estimated rate difference is 0.04366 instead of the correct value of 0.02. The estimated rate ratio is 5.32 instead of the correct value of 3.0.
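The arithmetic of this two-area example can be reproduced with a short script. This is a minimal sketch rather than the authors' code: with only two areas, each model is simply the line through two points, and the helper names (`area_rate`, `two_point_fit`) are introduced here for illustration.

```python
import math

# Individual-level rates from the text (per person per year)
RATE_EXPOSED = 0.03
RATE_UNEXPOSED = 0.01
PREVALENCES = (0.05, 0.15)  # proportion exposed in each of the two areas


def area_rate(p):
    """Incidence rate in an area where a proportion p of people is exposed."""
    return p * RATE_EXPOSED + (1 - p) * RATE_UNEXPOSED


def two_point_fit(x, y):
    """Intercept and slope of the straight line through two points."""
    slope = (y[1] - y[0]) / (x[1] - x[0])
    intercept = y[0] - slope * x[0]
    return intercept, slope


rates = [area_rate(p) for p in PREVALENCES]

# Additive model: rate = a + b * EXPOSED, predicted at EXPOSED = 0 and 1
a_lin, b_lin = two_point_fit(PREVALENCES, rates)
lin_unexposed, lin_exposed = a_lin, a_lin + b_lin

# Multiplicative model: log(rate) = a + b * EXPOSED
a_log, b_log = two_point_fit(PREVALENCES, [math.log(r) for r in rates])
log_unexposed, log_exposed = math.exp(a_log), math.exp(a_log + b_log)

print(f"linear:     RD = {lin_exposed - lin_unexposed:.5f}, "
      f"RR = {lin_exposed / lin_unexposed:.2f}")
print(f"log-linear: RD = {log_exposed - log_unexposed:.5f}, "
      f"RR = {log_exposed / log_unexposed:.2f}")
```

Changing `PREVALENCES` shows how the log-linear estimates move while the linear estimates stay at the individual-level rates.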

Figure 1 shows the bias stemming from a log-linear analysis in estimating the rate ratio for various combinations of values for the prevalence of exposure in two areas. Especially when both of the prevalences are small, the bias can be substantial. A corresponding pattern of bias occurs if the rate difference is taken as the measure of association rather than the rate ratio (results not shown).
https://static-content.springer.com/image/art%3A10.1007%2Fs00477-007-0141-2/MediaObjects/477_2007_141_Fig1_HTML.gif
Fig. 1

Biased estimation of the rate ratio in a log-linear analysis that is based on an aggregate-level measure of a binary exposure in two geographic areas. For these calculations the rate in the exposed is 0.03 per person per unit time and the rate in the unexposed is 0.01 per person per unit time
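The two-area calculation behind Fig. 1 has a closed form, sketched here under the same assumptions as the worked example (no confounding by area, a log-linear fit through exactly two areas). The symbols $p_1$, $p_2$ (prevalence of exposure in the two areas), $r_0$, $r_1$ (rates in the unexposed and the exposed), and $R_1$, $R_2$ (area-level rates) are notation introduced for this illustration only.

$$ R_i = r_0 + (r_1 - r_0)\, p_i, \qquad i = 1, 2 $$
$$ \hat{b} = \frac{\log R_2 - \log R_1}{p_2 - p_1}, \qquad \widehat{\text{RR}} = e^{\hat{b}} = \left( \frac{R_2}{R_1} \right)^{1/(p_2 - p_1)} $$

With $p_1 = 0.05$, $p_2 = 0.15$, $r_0 = 0.01$, and $r_1 = 0.03$, this gives $(0.013/0.011)^{10} \approx 5.32$, the biased rate ratio reported above, whereas the true value is $r_1 / r_0 = 3.0$.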

Based on this simplest of situations, it appears that additive regression models may be more appropriate than multiplicative models for studies using aggregate level measurement of exposure.

3 Formulation for aggregate-level information on exposure in the presence of a covariate

To explore the relative merits of additive (linear) versus multiplicative (log-linear) modeling in somewhat more complex analytic situations, we begin by postulating three unmeasured binary variables (A, B, and C) that are causally prior to both the exposure of interest and a measured binary covariate. We further assume that the three unmeasured variables are independent of each other and that the only variation among geographic areas is in the prevalence of these three unmeasured variables. The latter assumption can be stated as follows: conditional on the three unmeasured variables, area is independent of the covariate, of exposure, and of the incidence of disease, a set of circumstances that implies lack of confounding of the exposure-disease association by geographic area. This formulation provides for a wide range of patterns for variation in the distribution of the aggregate-level variables for both exposure and the covariate across areas, as well as for various values of the covariance across areas for the two aggregate measures.

One important limitation of the work reported here is that we consider only a binary exposure and a binary covariate. While our results may generalize reasonably well to more than two levels for categorical variables, treating the exposure and the covariate as continuous variables raises additional complexities beyond the scope of this paper.

Figure 2 shows the causal effects of the three unmeasured variables on exposure and the measured covariate. Variable A has a causal effect on the covariate only; variable B has a causal effect on both the measured covariate and exposure; variable C has a causal effect on exposure only. For all numerical evaluations reported here, we incorporate the specific pattern of prevalences for the covariate and exposure that is shown in Table 1.
https://static-content.springer.com/image/art%3A10.1007%2Fs00477-007-0141-2/MediaObjects/477_2007_141_Fig2_HTML.gif
Fig. 2

Three unmeasured variables used in the parameterization of the relationship of individual-level and aggregate-level measures of exposure to the incidence rate of disease in the presence of a covariate

Table 1

Hypothetical proportions positive on a binary covariate and on a binary exposure, according to values on three unmeasured binary variables (A, B, and C; + present, − absent)

| Measured variable | A+, B+, C+ | A+, B+, C− | A+, B−, C+ | A+, B−, C− | A−, B+, C+ | A−, B+, C− | A−, B−, C+ | A−, B−, C− |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Covariate | 0.80 | 0.80 | 0.50 | 0.50 | 0.50 | 0.50 | 0.20 | 0.20 |
| Exposure | 0.80 | 0.50 | 0.50 | 0.20 | 0.80 | 0.50 | 0.50 | 0.20 |

For each of the three unmeasured variables, we evaluate situations involving high variability and low variability among geographic areas. The definitions we used for high and low variability among areas were the same for each of the three unmeasured variables and are given in Table 2. When considering high or low variability for a particular variable, one third of the areas were assigned each of the three values in the appropriate column of the table. We considered eight different scenarios representing all combinations of high and low variability among areas for the three unmeasured variables. For each scenario, $3^3 = 27$ geographic areas were included, with the chosen prevalences for the three variables assigned systematically in all possible combinations. Situations with fewer than 27 areas or more than 27 areas are not addressed in this report. The size of each of the 27 areas was set at 1,000 person-years of observation for all of the numerical evaluations. Analysis of the trade-off between number of areas and the size of the population within areas entails complex additional issues beyond the scope of this paper.
Table 2

Three hypothetical scenarios in terms of variation across geographical units in the prevalence of an unmeasured binary variable

| Thirds of distribution | High variation across geographical units | Low variation across geographical units |
| --- | --- | --- |
| Lowest | 0.10 | 0.48 |
| Middle | 0.50 | 0.50 |
| Highest | 0.90 | 0.52 |
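The construction of the 27 areas can be sketched as follows. This is an illustrative reading of Tables 1 and 2, not the authors' code: the names `marginal` and `build_areas` are hypothetical, and the loop order (A slowest, C fastest) is an assumption chosen to match the unit numbering of Table 5.

```python
from itertools import product

# Proportion positive on each measured variable, from Table 1.
# Keys are (status on first cause, status on second cause), True = "+".
# The covariate depends on (A, B); the exposure depends on (B, C).
P_COVARIATE = {(True, True): 0.80, (True, False): 0.50,
               (False, True): 0.50, (False, False): 0.20}
P_EXPOSURE = dict(P_COVARIATE)  # same numeric pattern, keyed by (B, C)

# Table 2: prevalence of an unmeasured variable in each third of the areas.
HIGH = (0.10, 0.50, 0.90)
LOW = (0.48, 0.50, 0.52)


def marginal(table, p_first, p_second):
    """Area-level prevalence of a measured variable, given the prevalences
    of its two causes, which are assumed independent of each other."""
    total = 0.0
    for first, second in product((True, False), repeat=2):
        weight = ((p_first if first else 1 - p_first)
                  * (p_second if second else 1 - p_second))
        total += weight * table[(first, second)]
    return total


def build_areas(levels_a=HIGH, levels_b=HIGH, levels_c=HIGH):
    """All 3 x 3 x 3 combinations of prevalences for A, B, and C."""
    areas = []
    for p_a, p_b, p_c in product(levels_a, levels_b, levels_c):
        areas.append({
            "A": p_a, "B": p_b, "C": p_c,
            "covariate": marginal(P_COVARIATE, p_a, p_b),
            "exposure": marginal(P_EXPOSURE, p_b, p_c),
        })
    return areas


areas = build_areas()
print(len(areas))
print(round(areas[0]["covariate"], 2), round(areas[0]["exposure"], 2))
```

Passing `LOW` for one or more of the three arguments yields the other seven variability scenarios.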

Our formulation considers only situations in which an individual-level analysis yields unbiased estimates and standard errors without incorporation of any information on area into the analysis. Consequently, we do not address the issue of spatial autocorrelation, and we assume that there are no contextual effects for exposure, i.e., that the proportion of individuals in an area who are exposed does not affect an individual’s risk independently of one’s own exposure status.

We evaluated five different patterns of incidence of the disease for the joint effect of the covariate and exposure (i.e., interaction or effect modification). These patterns conform to joint effects that are sub-additive, additive, supra-additive but sub-multiplicative, multiplicative, or supra-multiplicative. The specific values incorporated into the calculations for each pattern of joint effects are shown in Table 3. In terms of potential causal effects, these five situations encompass the epidemiologic concepts of antagonism, independence, and synergy on both the additive and multiplicative scales (Thompson 1991).
Table 3

Patterns of incidence rates considered for numerical evaluations (incidence rates are expressed per person per year)

| Pattern of rates | Covariate +, Exposure + | Covariate +, Exposure − | Covariate −, Exposure + | Covariate −, Exposure − |
| --- | --- | --- | --- | --- |
| Sub-additive | 0.020 | 0.018 | 0.005 | 0.001 |
| Additive | 0.020 | 0.016 | 0.005 | 0.001 |
| Supra-additive/sub-multiplicative | 0.020 | 0.010 | 0.005 | 0.001 |
| Multiplicative | 0.020 | 0.004 | 0.005 | 0.001 |
| Supra-multiplicative | 0.020 | 0.002 | 0.005 | 0.001 |
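The labels in Table 3 can be checked by comparing the rate with both factors present against what exact additivity and exact multiplicativity would predict. The sketch below is illustrative: the function `classify` is hypothetical, and it assumes that each factor on its own raises the rate, so that the multiplicative expectation is at least as large as the additive one.

```python
# Rates from Table 3, in the order: (covariate +, exposure +),
# (covariate +, exposure -), (covariate -, exposure +), (covariate -, exposure -).
PATTERNS = {
    "Sub-additive": (0.020, 0.018, 0.005, 0.001),
    "Additive": (0.020, 0.016, 0.005, 0.001),
    "Supra-additive/sub-multiplicative": (0.020, 0.010, 0.005, 0.001),
    "Multiplicative": (0.020, 0.004, 0.005, 0.001),
    "Supra-multiplicative": (0.020, 0.002, 0.005, 0.001),
}


def classify(both, cov_only, exp_only, neither, tol=1e-9):
    """Label a joint effect by comparing the doubly exposed rate with the
    values implied by exact additivity and exact multiplicativity."""
    additive = cov_only + exp_only - neither
    multiplicative = cov_only * exp_only / neither
    if abs(both - additive) < tol:
        return "Additive"
    if both < additive:
        return "Sub-additive"
    if abs(both - multiplicative) < tol:
        return "Multiplicative"
    if both < multiplicative:
        return "Supra-additive/sub-multiplicative"
    return "Supra-multiplicative"


for name, rates in PATTERNS.items():
    print(f"{name}: classified as {classify(*rates)}")
```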

Table 4 shows the expected sample outcome at the individual level when there is high variability across areas for all three of the unmeasured variables and when the pattern of the joint effects of the covariate and exposure on disease rates in the population is additive. Table 5 shows the corresponding information with only aggregate-level data on exposure.
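Table 4

Individual-level data in each of 27 hypothetical geographic units of size 1,000, according to the prevalence of three unmeasured binary variables: example with high variability across units for all three unmeasured variables and an additive pattern of rates

| Geographic unit | Covariate +, Exposure +: n | Covariate +, Exposure +: expected number of events | Covariate +, Exposure −: n | Covariate +, Exposure −: expected number of events | Covariate −, Exposure +: n | Covariate −, Exposure +: expected number of events | Covariate −, Exposure −: n | Covariate −, Exposure −: expected number of events |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 76 | 1.51 | 184 | 2.95 | 184 | 0.92 | 556 | 0.56 |
| 2 | 107 | 2.14 | 153 | 2.45 | 273 | 1.37 | 467 | 0.47 |
| 3 | 138 | 2.76 | 122 | 1.95 | 362 | 1.81 | 378 | 0.38 |
| 4 | 167 | 3.34 | 213 | 3.41 | 213 | 1.07 | 407 | 0.41 |
| 5 | 213 | 4.25 | 168 | 2.68 | 288 | 1.44 | 333 | 0.33 |
| 6 | 258 | 5.16 | 122 | 1.95 | 362 | 1.81 | 258 | 0.26 |
| 7 | 258 | 5.16 | 242 | 3.87 | 242 | 1.21 | 258 | 0.26 |
| 8 | 318 | 6.36 | 182 | 2.91 | 302 | 1.51 | 198 | 0.20 |
| 9 | 378 | 7.56 | 122 | 1.95 | 362 | 1.81 | 138 | 0.14 |
| 10 | 107 | 2.14 | 273 | 4.37 | 153 | 0.77 | 467 | 0.47 |
| 11 | 153 | 3.05 | 228 | 3.64 | 228 | 1.14 | 393 | 0.39 |
| 12 | 198 | 3.96 | 182 | 2.91 | 302 | 1.51 | 318 | 0.32 |
| 13 | 213 | 4.25 | 288 | 4.60 | 168 | 0.84 | 333 | 0.33 |
| 14 | 273 | 5.45 | 228 | 3.64 | 228 | 1.14 | 273 | 0.27 |
| 15 | 333 | 6.65 | 168 | 2.68 | 288 | 1.44 | 213 | 0.21 |
| 16 | 318 | 6.36 | 302 | 4.83 | 182 | 0.91 | 198 | 0.20 |
| 17 | 393 | 7.85 | 228 | 3.64 | 228 | 1.14 | 153 | 0.15 |
| 18 | 467 | 9.34 | 153 | 2.45 | 273 | 1.37 | 107 | 0.11 |
| 19 | 138 | 2.76 | 362 | 5.79 | 122 | 0.61 | 378 | 0.38 |
| 20 | 198 | 3.96 | 302 | 4.83 | 182 | 0.91 | 318 | 0.32 |
| 21 | 258 | 5.16 | 242 | 3.87 | 242 | 1.21 | 258 | 0.26 |
| 22 | 258 | 5.16 | 362 | 5.79 | 122 | 0.61 | 258 | 0.26 |
| 23 | 333 | 6.65 | 288 | 4.60 | 168 | 0.84 | 213 | 0.21 |
| 24 | 407 | 8.14 | 213 | 3.41 | 213 | 1.07 | 167 | 0.17 |
| 25 | 378 | 7.56 | 362 | 5.79 | 122 | 0.61 | 138 | 0.14 |
| 26 | 467 | 9.34 | 273 | 4.37 | 153 | 0.77 | 107 | 0.11 |
| 27 | 556 | 11.11 | 184 | 2.95 | 184 | 0.92 | 76 | 0.08 |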
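One row of Table 4 can be derived from Tables 1 and 3 by summing over the eight combinations of the unmeasured variables. The sketch below is an illustration under the stated independence assumptions, not the authors' code; `bern` and `expected_cells` are hypothetical helper names, and only the additive pattern of rates is included.

```python
from itertools import product

# Table 1: proportion positive, keyed by the status (1 = +, 0 = -) of the two causes.
P_COV = {(1, 1): 0.80, (1, 0): 0.50, (0, 1): 0.50, (0, 0): 0.20}  # keyed by (A, B)
P_EXP = {(1, 1): 0.80, (1, 0): 0.50, (0, 1): 0.50, (0, 0): 0.20}  # keyed by (B, C)

# Table 3, additive pattern: rate per person per year, keyed by (covariate, exposure).
RATES = {(1, 1): 0.020, (1, 0): 0.016, (0, 1): 0.005, (0, 0): 0.001}


def bern(p, value):
    """Probability that a binary variable with P(+) = p takes the given value."""
    return p if value else 1 - p


def expected_cells(p_a, p_b, p_c, size=1000):
    """Expected (n, events) in each covariate-by-exposure cell of one area.

    The covariate and the exposure are independent given A, B, and C,
    and A, B, and C are independent of one another.
    """
    cells = {}
    for cov, exp in product((1, 0), repeat=2):
        prob = 0.0
        for a, b, c in product((1, 0), repeat=3):
            weight = bern(p_a, a) * bern(p_b, b) * bern(p_c, c)
            prob += weight * bern(P_COV[(a, b)], cov) * bern(P_EXP[(b, c)], exp)
        n = size * prob
        cells[(cov, exp)] = (n, n * RATES[(cov, exp)])
    return cells


unit_1 = expected_cells(0.10, 0.10, 0.10)
for cell, (n, events) in unit_1.items():
    print(cell, round(n), round(events, 2))
```

The events are computed from the unrounded cell sizes, which is why 1.51 rather than 76 × 0.020 = 1.52 appears in the first row of the table.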

Table 5

Prevalence of a binary covariate and a binary exposure and incidence of disease in each of 27 hypothetical geographic units of size 1,000, according to the prevalence of three unmeasured binary variables: example with high variability across units for all three unmeasured variables and an additive pattern of rates

| Unit | Prevalence of A | Prevalence of B | Prevalence of C | Prevalence of covariate | Prevalence of exposure | Expected number of events | Incidence rate (a) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.10 | 0.10 | 0.10 | 0.26 | 0.26 | 5.94 | 0.0059 |
| 2 | 0.10 | 0.10 | 0.50 | 0.26 | 0.38 | 6.42 | 0.0064 |
| 3 | 0.10 | 0.10 | 0.90 | 0.26 | 0.50 | 6.90 | 0.0069 |
| 4 | 0.10 | 0.50 | 0.10 | 0.38 | 0.38 | 8.22 | 0.0082 |
| 5 | 0.10 | 0.50 | 0.50 | 0.38 | 0.50 | 8.70 | 0.0087 |
| 6 | 0.10 | 0.50 | 0.90 | 0.38 | 0.62 | 9.18 | 0.0092 |
| 7 | 0.10 | 0.90 | 0.10 | 0.50 | 0.50 | 10.50 | 0.0105 |
| 8 | 0.10 | 0.90 | 0.50 | 0.50 | 0.62 | 10.98 | 0.0110 |
| 9 | 0.10 | 0.90 | 0.90 | 0.50 | 0.74 | 11.46 | 0.0115 |
| 10 | 0.50 | 0.10 | 0.10 | 0.38 | 0.26 | 7.74 | 0.0077 |
| 11 | 0.50 | 0.10 | 0.50 | 0.38 | 0.38 | 8.22 | 0.0082 |
| 12 | 0.50 | 0.10 | 0.90 | 0.38 | 0.50 | 8.70 | 0.0087 |
| 13 | 0.50 | 0.50 | 0.10 | 0.50 | 0.38 | 10.02 | 0.0100 |
| 14 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 10.50 | 0.0105 |
| 15 | 0.50 | 0.50 | 0.90 | 0.50 | 0.62 | 10.98 | 0.0110 |
| 16 | 0.50 | 0.90 | 0.10 | 0.62 | 0.50 | 12.30 | 0.0123 |
| 17 | 0.50 | 0.90 | 0.50 | 0.62 | 0.62 | 12.78 | 0.0128 |
| 18 | 0.50 | 0.90 | 0.90 | 0.62 | 0.74 | 13.26 | 0.0133 |
| 19 | 0.90 | 0.10 | 0.10 | 0.50 | 0.26 | 9.54 | 0.0095 |
| 20 | 0.90 | 0.10 | 0.50 | 0.50 | 0.38 | 10.02 | 0.0100 |
| 21 | 0.90 | 0.10 | 0.90 | 0.50 | 0.50 | 10.50 | 0.0105 |
| 22 | 0.90 | 0.50 | 0.10 | 0.62 | 0.38 | 11.82 | 0.0118 |
| 23 | 0.90 | 0.50 | 0.50 | 0.62 | 0.50 | 12.30 | 0.0123 |
| 24 | 0.90 | 0.50 | 0.90 | 0.62 | 0.62 | 12.78 | 0.0128 |
| 25 | 0.90 | 0.90 | 0.10 | 0.74 | 0.50 | 14.10 | 0.0141 |
| 26 | 0.90 | 0.90 | 0.50 | 0.74 | 0.62 | 14.58 | 0.0146 |
| 27 | 0.90 | 0.90 | 0.90 | 0.74 | 0.74 | 15.06 | 0.0151 |

(a) Expressed per person per year

4 Implementation of regression models

To clarify the nature of the data used for both individual-level and aggregate-level analyses, we illustrate the set-up and processing of the relevant data using the SAS® software. This template may be readily adapted to the syntax of other statistical packages.

We fit linear and log-linear regression models to the expected sample outcomes based on exposure measured at both the individual and aggregate levels using the SAS® statistical procedure PROC GENMOD for generalized linear models (SAS Institute 2004). For linear modeling the dependent variable was the number of events, which is assumed to follow a Poisson distribution (SAS® GENMOD option d = poisson), and the denominator information on person-time was incorporated using a SAS® weight statement. The link function was specified as the identity function (SAS® GENMOD option link = id).

For log-linear modeling, the number of events was likewise assumed to follow a Poisson distribution, but the link function was logarithmic and the denominator information was incorporated by using a SAS® GENMOD offset variable, here the logarithm of the person-time.
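The log-linear fit described here can be sketched without SAS. The block below is not PROC GENMOD and not the authors' program: it is a plain Newton iteration for Poisson regression with a log link and a log(person-time) offset, written in pure Python, applied to aggregate data regenerated from Tables 1 and 3 (additive pattern, high variability). All function names are hypothetical, and the linear expressions for the two prevalences are derived from Table 1 under the independence assumptions rather than quoted from the paper.

```python
import math
from itertools import product


def build_rows(levels=(0.10, 0.50, 0.90), person_years=1000.0):
    """Aggregate data: (covariate prev., exposure prev., events, person-time)."""
    rows = []
    for p_a, p_b, p_c in product(levels, repeat=3):
        cov = 0.2 + 0.3 * p_a + 0.3 * p_b           # implied by Table 1
        exp_ = 0.2 + 0.3 * p_b + 0.3 * p_c          # implied by Table 1
        rate = 0.001 + 0.015 * cov + 0.004 * exp_   # additive pattern, Table 3
        rows.append((cov, exp_, person_years * rate, person_years))
    return rows


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def design(rows):
    """Columns: intercept, exposure, covariate, exposure x covariate."""
    return [[1.0, e, c, e * c] for c, e, _, _ in rows]


def fitted_events(beta, rows):
    """Expected events under the log-linear model with offset log(person-time)."""
    return [t * math.exp(sum(b * x for b, x in zip(beta, xi)))
            for xi, (_, _, _, t) in zip(design(rows), rows)]


def fit_loglinear(rows, max_iter=100, tol=1e-10):
    """Maximum-likelihood Poisson fit by Newton's method."""
    X = design(rows)
    y = [r[2] for r in rows]
    beta = [math.log(sum(y) / sum(r[3] for r in rows)), 0.0, 0.0, 0.0]
    for _ in range(max_iter):
        mu = fitted_events(beta, rows)
        grad = [sum(xi[j] * (yi - mi) for xi, yi, mi in zip(X, y, mu))
                for j in range(4)]
        hess = [[sum(mi * xi[j] * xi[k] for xi, mi in zip(X, mu))
                 for k in range(4)] for j in range(4)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


rows = build_rows()
beta = fit_loglinear(rows)
# Predicted rates at exposure = 0/1 and covariate = 0/1, keyed by (exposure, covariate).
predicted = {(e, c): math.exp(beta[0] + beta[1] * e + beta[2] * c + beta[3] * e * c)
             for e, c in product((1, 0), repeat=2)}
for (e, c), rate in predicted.items():
    print(f"exposure={e} covariate={c}: predicted rate {rate:.4f}")
```

For the additive pattern no iteration is needed for the identity-link model: the regenerated area rates are exactly linear in the two prevalences, so a linear fit returns the individual-level rates at the four corner points.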

Figures 3 and 4 give SAS® code and computer output from PROC GENMOD for analysis of the individual-level exposure data in Table 4. Figures 5 and 6 give the corresponding code and output for analysis of the aggregate-level exposure data in Table 5. To provide for possible interactive effects between the covariate and exposure in terms of their impact on the rate of disease, a product variable is included in each of the regression analyses.
https://static-content.springer.com/image/art%3A10.1007%2Fs00477-007-0141-2/MediaObjects/477_2007_141_Fig3_HTML.gif
Fig. 3

SAS® code for fitting linear and log-linear Poisson models to individual-level data

https://static-content.springer.com/image/art%3A10.1007%2Fs00477-007-0141-2/MediaObjects/477_2007_141_Fig4_HTML.gif
Fig. 4

SAS® output for the program in Fig. 3 

https://static-content.springer.com/image/art%3A10.1007%2Fs00477-007-0141-2/MediaObjects/477_2007_141_Fig5_HTML.gif
Fig. 5

SAS® code for fitting linear and log-linear Poisson models to aggregate-level data

https://static-content.springer.com/image/art%3A10.1007%2Fs00477-007-0141-2/MediaObjects/477_2007_141_Fig6_HTML.gif
Fig. 6

SAS® output for the program in Fig. 5 

Thus, the general forms for the two models considered are
$$ \text{rate} = \text{intercept} + b_1 \times \text{exposure} + b_2 \times \text{covariate} + b_3 \times \text{exposure} \times \text{covariate} $$
and
$$ \log(\text{rate}) = \text{intercept} + b_1 \times \text{exposure} + b_2 \times \text{covariate} + b_3 \times \text{exposure} \times \text{covariate} $$

The SAS® estimate statement within PROC GENMOD provides a straightforward method for calculating point and interval estimates of the four covariate- and exposure-specific rates. In the case of log-linear models, the estimates of the log rates are exponentiated using the “exp” option.
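The four rates follow from the fitted coefficients by setting exposure and covariate to 0 or 1. The display below is a sketch of that mapping for the linear model; the shorthand $\text{rate}(e, c)$, with $e$ for exposure and $c$ for the covariate, is introduced here for illustration.

$$ \text{rate}(1, 1) = \text{intercept} + b_1 + b_2 + b_3, \qquad \text{rate}(1, 0) = \text{intercept} + b_1 $$
$$ \text{rate}(0, 1) = \text{intercept} + b_2, \qquad \text{rate}(0, 0) = \text{intercept} $$

For the log-linear model the same four sums give log rates, so each is exponentiated, for example $\text{rate}(1, 1) = e^{\text{intercept} + b_1 + b_2 + b_3}$.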

5 Differences among regression models when the pattern of disease rates is additive

Inspection of the numerical results in Fig. 6, in which the joint effects of the covariate and exposure are assumed to be additive in the population, indicates that log-linear regression models can produce biased estimates when aggregate-level exposure is employed in the analysis. Note specifically that, although the correct values for the rates are obtained in Fig. 4 when the individual-level information on the covariate and exposure is used, the same is not true in Fig. 6, where the aggregate-level information has been used instead. In the log-linear regression analysis, the estimates for the four covariate- and exposure-specific rates are 0.0187, 0.0246, 0.0087, and 0.0029 rather than the correct values of 0.0200, 0.0160, 0.0050, and 0.0010. The rate ratio calculated from these biased estimates of the rates is 0.76 for the effect of exposure when the covariate factor is present and 3.00 when the covariate factor is absent. The correct values are 1.25 and 5.00, respectively. The rate difference calculated from these biased estimates is −0.0059 for the effect of exposure when the covariate factor is present and 0.0058 when the covariate factor is absent. The correct value is 0.0040, regardless of whether the covariate factor is present or absent. The difference in the signs of the rate differences for the log-linear analysis gives the mistaken impression that on the additive scale there is a qualitative (cross-over) interaction (Gail and Simon 1985; Peto 1982; Thompson 1991).
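The rate ratios and rate differences quoted in this paragraph can be recomputed from the four rates. The snippet below is a small check using only the values given in the text; the dictionary layout and the function name `exposure_effects` are illustrative choices.

```python
# Rates per person per year, keyed by (covariate, exposure); 1 = present, 0 = absent.
TRUE_RATES = {(1, 1): 0.0200, (1, 0): 0.0160, (0, 1): 0.0050, (0, 0): 0.0010}
LOGLINEAR_AGGREGATE = {(1, 1): 0.0187, (1, 0): 0.0246, (0, 1): 0.0087, (0, 0): 0.0029}


def exposure_effects(rates):
    """Rate ratio and rate difference for exposure within each covariate level."""
    effects = {}
    for cov in (1, 0):
        exposed, unexposed = rates[(cov, 1)], rates[(cov, 0)]
        effects[cov] = {"ratio": exposed / unexposed,
                        "difference": exposed - unexposed}
    return effects


true_effects = exposure_effects(TRUE_RATES)
biased_effects = exposure_effects(LOGLINEAR_AGGREGATE)

for label, effects in (("true", true_effects), ("log-linear", biased_effects)):
    for cov in (1, 0):
        print(f"{label}, covariate={cov}: "
              f"RR = {effects[cov]['ratio']:.2f}, "
              f"RD = {effects[cov]['difference']:.4f}")
```

The opposite signs of the two log-linear rate differences are what produce the apparent cross-over interaction.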

Comparison of the confidence intervals for the estimated rates in Figs. 4 and 6 clearly indicates that use of the aggregate-level information on exposure results in inferior precision of estimation, regardless of whether a linear or a log-linear regression model is employed.

6 Differences among regression models when the pattern of disease rates is non-additive

Table 6 gives results for numerical examples such as those presented in Figs. 4 and 6, but based on a variety of non-additive patterns for the joint effects of exposure and the covariate on disease. The results for the additive pattern of joint effects in Figs. 4 and 6 are also repeated in Table 6 to facilitate comparisons.
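Table 6

Predicted rates and 95% confidence intervals for linear and log-linear analyses based on individual-level and aggregate-level measurement of exposure, according to various patterns of incidence of disease: example with high variability across units for all three unmeasured variables

Sub-additive

| Level of exposure variable | Model | Exposure | Covariate +: rate | Covariate +: 95% CI | Covariate −: rate | Covariate −: 95% CI |
| --- | --- | --- | --- | --- | --- | --- |
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0180 | 0.0146 to 0.0214 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0180 | 0.0149 to 0.0217 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0200 | 0.0019 to 0.0382 | 0.0049 | −0.0142 to 0.0240 |
| Aggregate | Linear | − | 0.0179 | −0.0021 to 0.0379 | 0.0010 | −0.0146 to 0.0167 |
| Aggregate | Log-linear | + | 0.0187 | 0.0040 to 0.0868 | 0.0085 | 0.0013 to 0.0577 |
| Aggregate | Log-linear | − | 0.0276 | 0.0044 to 0.1742 | 0.0031 | 0.0005 to 0.0181 |

Additive

| Level of exposure variable | Model | Exposure | Covariate +: rate | Covariate +: 95% CI | Covariate −: rate | Covariate −: 95% CI |
| --- | --- | --- | --- | --- | --- | --- |
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0200 | 0.0022 to 0.0378 | 0.0050 | −0.0137 to 0.0237 |
| Aggregate | Linear | − | 0.0160 | −0.0035 to 0.0355 | 0.0010 | −0.0143 to 0.0163 |
| Aggregate | Log-linear | + | 0.0187 | 0.0039 to 0.0893 | 0.0087 | 0.0012 to 0.0611 |
| Aggregate | Log-linear | − | 0.0246 | 0.0037 to 0.1619 | 0.0029 | 0.0005 to 0.0175 |

Supra-additive/sub-multiplicative

| Level of exposure variable | Model | Exposure | Covariate +: rate | Covariate +: 95% CI | Covariate −: rate | Covariate −: 95% CI |
| --- | --- | --- | --- | --- | --- | --- |
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0100 | 0.0075 to 0.0125 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0100 | 0.0078 to 0.0128 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0199 | 0.0032 to 0.0366 | 0.0053 | −0.0123 to 0.0228 |
| Aggregate | Linear | − | 0.0103 | −0.0076 to 0.0282 | 0.0009 | −0.0132 to 0.0149 |
| Aggregate | Log-linear | + | 0.0188 | 0.0036 to 0.0991 | 0.0095 | 0.0012 to 0.0757 |
| Aggregate | Log-linear | − | 0.0164 | 0.0021 to 0.1258 | 0.0022 | 0.0003 to 0.0155 |

Multiplicative

| Level of exposure variable | Model | Exposure | Covariate +: rate | Covariate +: 95% CI | Covariate −: rate | Covariate −: 95% CI |
| --- | --- | --- | --- | --- | --- | --- |
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0040 | 0.0024 to 0.0056 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0040 | 0.0027 to 0.0059 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0198 | 0.0043 to 0.0353 | 0.0056 | −0.0107 to 0.0218 |
| Aggregate | Linear | − | 0.0046 | −0.0116 to 0.0207 | 0.0008 | −0.0119 to 0.0134 |
| Aggregate | Log-linear | + | 0.0192 | 0.0032 to 0.1142 | 0.0111 | 0.0012 to 0.1037 |
| Aggregate | Log-linear | − | 0.0098 | 0.0010 to 0.0921 | 0.0015 | 0.0002 to 0.0133 |

Supra-multiplicative

| Level of exposure variable | Model | Exposure | Covariate +: rate | Covariate +: 95% CI | Covariate −: rate | Covariate −: 95% CI |
| --- | --- | --- | --- | --- | --- | --- |
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0020 | 0.0009 to 0.0031 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0020 | 0.0011 to 0.0035 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0197 | 0.0047 to 0.0348 | 0.0057 | −0.0101 to 0.0214 |
| Aggregate | Linear | − | 0.0027 | −0.0128 to 0.0182 | 0.0007 | −0.0114 to 0.0129 |
| Aggregate | Log-linear | + | 0.0195 | 0.0031 to 0.1212 | 0.0120 | 0.0012 to 0.1191 |
| Aggregate | Log-linear | − | 0.0080 | 0.0008 to 0.0816 | 0.0013 | 0.0001 to 0.0124 |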

The numerical results in Table 6 indicate that, when aggregate-level information on the covariate and on exposure is employed for the analysis, fitting a log-linear regression model leads to biased estimates of rates, even when the underlying pattern of joint effects conforms to multiplicativity. For the section of the table based on multiplicativity for joint effects, it has been assumed that the presence of exposure increases the rate of disease fivefold, regardless of whether the binary covariate is present or absent. However, fitting a log-linear model to the aggregate-level data for this situation yields biased estimates of 0.0192/0.0098 = 2.0 for the rate ratio when the covariate is present and 0.0111/0.0015 = 7.4 when the covariate is absent.

Note that fitting of a linear model to the aggregate-level exposure data also leads to bias in the estimates when the pattern of rates for the combined effects of the covariate and exposure is something other than additive. However, this bias is of considerably smaller magnitude than is the bias when a log-linear model is fit to the aggregate-level data.

Table 6 also indicates that the width of the expected confidence intervals around the estimated rates is substantially greater for analyses based on aggregate-level data than for those based on individual-level data. Of the two formulations for analysis of the aggregate-level data, a log-linear regression model consistently yields wider confidence intervals.
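The loss of precision from aggregate-level measurement can be quantified directly from the confidence limits in Table 6. The snippet below is an illustrative check for the additive pattern only; the dictionary names and the `width` helper are choices made here, and the limits are copied from the table.

```python
# 95% confidence limits for the additive pattern in Table 6,
# keyed by model and then by (exposure, covariate); 1 = present, 0 = absent.
INDIVIDUAL = {
    "linear": {(1, 1): (0.0168, 0.0232), (1, 0): (0.0032, 0.0068),
               (0, 1): (0.0128, 0.0192), (0, 0): (0.0003, 0.0017)},
    "log-linear": {(1, 1): (0.0170, 0.0235), (1, 0): (0.0035, 0.0071),
                   (0, 1): (0.0131, 0.0195), (0, 0): (0.0005, 0.0021)},
}
AGGREGATE = {
    "linear": {(1, 1): (0.0022, 0.0378), (1, 0): (-0.0137, 0.0237),
               (0, 1): (-0.0035, 0.0355), (0, 0): (-0.0143, 0.0163)},
    "log-linear": {(1, 1): (0.0039, 0.0893), (1, 0): (0.0012, 0.0611),
                   (0, 1): (0.0037, 0.1619), (0, 0): (0.0005, 0.0175)},
}


def width(interval):
    """Width of a confidence interval given as (lower, upper)."""
    low, high = interval
    return high - low


# Ratio of aggregate-level to individual-level interval width, cell by cell.
ratios = {
    model: {cell: width(AGGREGATE[model][cell]) / width(INDIVIDUAL[model][cell])
            for cell in INDIVIDUAL[model]}
    for model in INDIVIDUAL
}

for model, cells in ratios.items():
    for cell, ratio in cells.items():
        print(f"{model}, (exposure, covariate)={cell}: "
              f"aggregate interval is {ratio:.1f} times as wide")
```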

7 Effects of variation across geographic areas in the prevalence of unmeasured antecedent variables

Table 7 gives numeric results for various scenarios in which we change the variability among geographic areas in the prevalence of antecedent factors associated with the covariate and/or the exposure. All previous evaluations have addressed situations in which there was high variability among geographic areas in the prevalence of all unmeasured antecedent variables. For the evaluations in Table 7, we change the variability in prevalence for some of the antecedent variables from high to low (using values from Table 2). Since all of the scenarios in Table 7 assume additivity for the joint effects of exposure and the covariate in the population, a linear regression analysis of aggregate-level data entails no bias in the estimated rates. However, a log-linear analysis does introduce bias, and the magnitude of that bias differs according to the specific pattern of variation across geographic areas. Additionally, the expected confidence intervals are widest when a log-linear analysis of the aggregate-level data is conducted.
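How the eight variability scenarios alter the information available in the aggregate data can be sketched by summarizing the spread of the two area-level prevalences and their correlation across the 27 areas. This is an illustration only: the linear expressions for the prevalences are derived here from Table 1 under the independence assumptions, and the function names are hypothetical.

```python
from itertools import product
from statistics import mean, pstdev

# Table 2: prevalence of an unmeasured variable in each third of the areas.
LEVELS = {"high": (0.10, 0.50, 0.90), "low": (0.48, 0.50, 0.52)}


def prevalences(levels_a, levels_b, levels_c):
    """Area-level prevalence of the covariate and the exposure in 27 areas."""
    cov, exp = [], []
    for p_a, p_b, p_c in product(levels_a, levels_b, levels_c):
        cov.append(0.2 + 0.3 * p_a + 0.3 * p_b)  # covariate depends on A and B
        exp.append(0.2 + 0.3 * p_b + 0.3 * p_c)  # exposure depends on B and C
    return cov, exp


def correlation(x, y):
    """Pearson correlation across areas (population form)."""
    mx, my = mean(x), mean(y)
    cross = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cross / (pstdev(x) * pstdev(y))


# Keys are the variability settings for (A, B, C).
summary = {}
for scenario in product(("high", "low"), repeat=3):
    cov, exp = prevalences(*(LEVELS[name] for name in scenario))
    summary[scenario] = (pstdev(cov), pstdev(exp), correlation(cov, exp))

for scenario, (sd_cov, sd_exp, corr) in summary.items():
    print(scenario, f"SD covariate={sd_cov:.3f}",
          f"SD exposure={sd_exp:.3f}", f"correlation={corr:.3f}")
```

Scenarios in which a prevalence barely varies across areas leave little contrast from which to estimate the corresponding effect, which is consistent with the wide intervals reported in Table 7.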
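Table 7

Predicted rates and 95% confidence intervals for linear and log-linear analyses based on individual-level and aggregate-level measurement of exposure, according to variation across geographic area in three unmeasured antecedent variables: additive pattern of rates

Low variation across geographic areas in an unmeasured antecedent variable that is associated with only the exposure

| Level of exposure variable | Model | Exposure | Covariate +: Rate | Covariate +: 95% CI | Covariate −: Rate | Covariate −: 95% CI |
|---|---|---|---|---|---|---|
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0200 | −0.0064 to 0.0464 | 0.0050 | −0.0253 to 0.0353 |
| Aggregate | Linear | − | 0.0160 | −0.0132 to 0.0452 | 0.0010 | −0.0230 to 0.0250 |
| Aggregate | Log-linear | + | 0.0148 | 0.0013 to 0.1637 | 0.0110 | 0.0006 to 0.1996 |
| Aggregate | Log-linear | − | 0.0324 | 0.0016 to 0.6501 | 0.0023 | 0.0002 to 0.0318 |

Low variation across geographic areas in an unmeasured antecedent variable that is associated with both the covariate and exposure

| Level of exposure variable | Model | Exposure | Covariate +: Rate | Covariate +: 95% CI | Covariate −: Rate | Covariate −: 95% CI |
|---|---|---|---|---|---|---|
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0200 | −0.0139 to 0.0539 | 0.0050 | −0.0272 to 0.0372 |
| Aggregate | Linear | − | 0.0160 | −0.0174 to 0.0494 | 0.0010 | −0.0307 to 0.0327 |
| Aggregate | Log-linear | + | 0.0225 | 0.0011 to 0.4795 | 0.0071 | 0.0003 to 0.1777 |
| Aggregate | Log-linear | − | 0.0202 | 0.0009 to 0.4491 | 0.0036 | 0.0001 to 0.0957 |

Low variation across geographic areas in an unmeasured antecedent variable that is associated with only the covariate

| Level of exposure variable | Model | Exposure | Covariate +: Rate | Covariate +: 95% CI | Covariate −: Rate | Covariate −: 95% CI |
|---|---|---|---|---|---|---|
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0200 | −0.0062 to 0.0462 | 0.0050 | −0.0240 to 0.0340 |
| Aggregate | Linear | − | 0.0160 | −0.0151 to 0.0471 | 0.0010 | −0.0239 to 0.0259 |
| Aggregate | Log-linear | + | 0.0202 | 0.0018 to 0.2283 | 0.0080 | 0.0004 to 0.1658 |
| Aggregate | Log-linear | − | 0.0227 | 0.0013 to 0.3834 | 0.0032 | 0.0002 to 0.0416 |

Low variation across geographic areas in unmeasured antecedent variables that are associated with only the covariate and with only exposure

| Level of exposure variable | Model | Exposure | Covariate +: Rate | Covariate +: 95% CI | Covariate −: Rate | Covariate −: 95% CI |
|---|---|---|---|---|---|---|
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0200 | −0.0243 to 0.0643 | 0.0050 | −0.1747 to 0.1847 |
| Aggregate | Linear | − | 0.0160 | −0.1637 to 0.1957 | 0.0010 | −0.0417 to 0.0437 |
| Aggregate | Log-linear | + | 0.0173 | 0.0003 to 1.0489 | 0.0095 | 0.0000 to ∞ |
| Aggregate | Log-linear | − | 0.0270 | 0.0000 to ∞ | 0.0028 | 0.0000 to 0.1957 |

Low variation across geographic areas in unmeasured antecedent variables that are associated with only the covariate and with both the covariate and exposure

| Level of exposure variable | Model | Exposure | Covariate +: Rate | Covariate +: 95% CI | Covariate −: Rate | Covariate −: 95% CI |
|---|---|---|---|---|---|---|
| Individual | Linear | + | 0.0200 | 0.0168 to 0.0232 | 0.0050 | 0.0032 to 0.0068 |
| Individual | Linear | − | 0.0160 | 0.0128 to 0.0192 | 0.0010 | 0.0003 to 0.0017 |
| Individual | Log-linear | + | 0.0200 | 0.0170 to 0.0235 | 0.0050 | 0.0035 to 0.0071 |
| Individual | Log-linear | − | 0.0160 | 0.0131 to 0.0195 | 0.0010 | 0.0005 to 0.0021 |
| Aggregate | Linear | + | 0.0200 | −0.4414 to 0.4814 | 0.0050 | −0.4564 to 0.4664 |
| Aggregate | Linear | − | 0.0160 | −0.4390 to 0.4710 | 0.0010 | −0.4537 to 0.4557 |
| Aggregate | Log-linear | + | 0.0222 | 0.0000 to ∞ | 0.0072 | 0.0000 to ∞ |
| Aggregate | Log-linear | − | 0.0207 | 0.0000 to ∞ | 0.0036 | 0.0000 to ∞ |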

Table 7 also clearly indicates that when variability across geographic units is low for two of the three unmeasured antecedent variables, the expected confidence intervals for analysis of aggregate-level information are extremely wide, especially when a log-linear model is fit to the data.

8 Discussion

The numerical results presented in this paper provide strong support for the use of linear rather than log-linear regression modeling of disease incidence in environmental epidemiology when the available measure of exposure is at the aggregate level rather than at the individual level. Use of log-linear regression formulations (i.e., fitting multiplicative models) to analyze incidence rates can result in appreciable bias in (1) quantitative comparison of disease in exposed versus unexposed groups in the absence of covariates; and (2) the predicted covariate- and exposure-specific rates when a binary covariate is taken into account. These results are consistent with the work of Bjork and Stromberg (2005), who used a rather different formulation and found that, for the analysis of proportions rather than rates, a linear odds regression model performed better than the logistic model.
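
The source of the bias in the covariate-free case can be seen in a small numeric sketch. It is not the procedure used to produce the tables in this paper: the rates, the prevalences, and the helper `fit_line` are illustrative assumptions, and ordinary least squares on the log of the area rate stands in for a log-linear fit.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical individual-level rates (illustrative values only).
RATE_UNEXPOSED = 0.0010
RATE_EXPOSED = 0.0050

# Hypothetical exposure prevalences in five geographic areas.
prevalence = [0.1, 0.2, 0.3, 0.4, 0.5]

# Each area's rate is the prevalence-weighted mix of the two individual rates,
# so it is exactly linear in prevalence.
area_rate = [(1 - p) * RATE_UNEXPOSED + p * RATE_EXPOSED for p in prevalence]

# Additive (linear) model: rate = a + b * prevalence.
a_lin, b_lin = fit_line(prevalence, area_rate)
linear_unexposed = a_lin            # predicted rate at prevalence 0
linear_exposed = a_lin + b_lin      # predicted rate at prevalence 1

# Multiplicative (log-linear) model: log(rate) = a + b * prevalence.
a_log, b_log = fit_line(prevalence, [math.log(r) for r in area_rate])
loglinear_unexposed = math.exp(a_log)
loglinear_exposed = math.exp(a_log + b_log)

print("linear:    ", linear_unexposed, linear_exposed)
print("log-linear:", loglinear_unexposed, loglinear_exposed)
```

Under these assumptions the linear fit returns the two individual-level rates exactly, because the area rate is a straight-line function of prevalence. The log-linear fit must extrapolate a curve to prevalences of 0 and 1, and in this example it overstates the exposed rate (roughly 0.008 rather than 0.005).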

In traditional individual-level epidemiologic research, discussions of additive versus multiplicative models have centered on the appropriate definition of the absence of interaction when multiple causal factors are considered simultaneously (Rothman et al. 1980). In that realm, the pattern of rates for the four combinations of a binary exposure of primary interest and a binary covariate can be estimated without bias by including a product variable in either a linear or a log-linear regression analysis. For studies based on aggregate-level information, however, log-linear analysis yields biased estimates even if an interaction term is included.

A further limitation of a log-linear formulation for analysis of aggregate-level exposure is that it may lead to a more complex characterization of the data than is really necessary. For example, in Fig. 6 we illustrated an instance in which there appears to be a qualitative (i.e., cross-over) interaction when in fact the underlying rates follow a perfect additive pattern.

In terms of precision of estimation, studies that use aggregate information on exposure are shown to have much wider expected confidence intervals than studies in which individual-level information is available. This problem is exacerbated when log-linear models are fit to the data.

Consideration of patterns of incidence ranging from sub-additive to supra-multiplicative confirms what others have noted regarding bias in the analysis of aggregate-level data whenever the pattern of rates is non-additive. However, for the range of situations considered, we found that this bias was considerably smaller when additive models were fit than when multiplicative ones were fit. Furthermore, the absolute magnitude of the bias for additive models was numerically small. This latter finding is especially important in environmental epidemiology because the only way to remove the bias completely is to reparameterize the problem so that, rather than including a product variable, three indicator variables are introduced into the regression model to compare each of three combinations of covariate status and exposure status with the fourth (referent) category (Greenland and Robins 1994). Such a parameterization requires information on the joint distribution of the covariate and exposure within each geographic unit, information that often is not available. For example, one might be able to obtain information on both the prevalence of exposure to high concentrations of particulate air pollution and the prevalence of smoking in each area. But typically these would be obtained from different sources (e.g., environmental monitoring systems versus behavioral surveys such as the Behavioral Risk Factor Survey of the Centers for Disease Control and Prevention), so the joint distribution would be unknown.
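
The difference between the two parameterizations can be illustrated with a toy calculation. Everything in it is an assumption made for illustration: the four cell rates (a multiplicative pattern), the within-area joint proportions, and the helper `solve`. Four areas are used so that each model is exactly identified, which stands in for a regression over many areas.

```python
def solve(a, b):
    """Solve the square linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical non-additive cell rates: neither factor, exposure only,
# covariate only, both factors.
R00, R10, R01, R11 = 0.001, 0.002, 0.004, 0.008
TRUE_RATES = [R00, R10, R01, R11]

# Hypothetical within-area joint proportions:
# (exposure only, covariate only, both).
areas = [(0.10, 0.20, 0.10), (0.35, 0.15, 0.05),
         (0.20, 0.10, 0.30), (0.10, 0.40, 0.20)]

area_rate = [(1 - p10 - p01 - p11) * R00 + p10 * R10 + p01 * R01 + p11 * R11
             for p10, p01, p11 in areas]

# (1) Three indicator variables: requires the joint distribution in each area.
design_joint = [[1.0, p10, p01, p11] for p10, p01, p11 in areas]
a, b10, b01, b11 = solve(design_joint, area_rate)
rates_joint = [a, a + b10, a + b01, a + b11]

# (2) Product variable: uses only the two marginal prevalences.
design_product = []
for p10, p01, p11 in areas:
    p_exp, p_cov = p10 + p11, p01 + p11
    design_product.append([1.0, p_exp, p_cov, p_exp * p_cov])
a, b, c, d = solve(design_product, area_rate)
rates_product = [a, a + b, a + c, a + b + c + d]

print("true rates:        ", TRUE_RATES)
print("indicator model:   ", rates_joint)
print("product-term model:", rates_product)
```

In this sketch the indicator parameterization recovers the four cell rates, whereas the product-term model does not, because the product of the marginal prevalences differs from the true proportion with both factors in each area.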

One potential disadvantage of fitting additive regression models to aggregate-level data is that a negative point estimate or confidence limit for the intercept term may sometimes be obtained, precluding estimation of the rate ratio (Koepsell and Weiss 2003; Morgenstern 1998). In such circumstances, however, estimation of the rate difference is not generally precluded. Given the problems of bias and imprecision that accompany log-linear formulations, our results suggest that avoiding an inadmissible intercept value may not be a sufficient basis for adopting them.
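
A short sketch makes the distinction concrete. The function name `effect_measures` and the coefficient values are hypothetical; the function assumes a fitted additive model of the form rate = intercept + slope × (prevalence of exposure).

```python
def effect_measures(intercept, slope):
    """Return (rate_difference, rate_ratio) implied by a fitted additive model.

    The rate difference is the slope. The rate ratio is the predicted rate at
    prevalence 1 divided by the predicted rate at prevalence 0; it is reported
    as None when the estimated baseline rate is not positive.
    """
    rate_difference = slope
    rate_ratio = (intercept + slope) / intercept if intercept > 0 else None
    return rate_difference, rate_ratio


# Admissible intercept: both measures are available.
print(effect_measures(0.0010, 0.0040))

# Negative intercept: the rate ratio is unavailable, the rate difference is not.
print(effect_measures(-0.0005, 0.0040))
```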

Our analysis, which allows for three unmeasured antecedent factors that are causally prior to exposure and a binary covariate, indicates that two or more of these factors must be highly variable across geographic units in order for an aggregate-level analysis to provide precise estimates of effects at the individual level. This finding implies that large variability in exposure across units is not in itself sufficient to ensure an informative aggregate-level analysis. There must also be substantial variation in exposure conditional on any confounding factor included in the analysis.
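
The role of conditional variation can be sketched with the standard least-squares result that the variance of the exposure coefficient is proportional to the reciprocal of the residual sum of squares obtained when exposure prevalence is regressed on covariate prevalence. The function name and the two sets of eight area prevalences below are hypothetical, and the calculation is a simplified stand-in for the expected confidence intervals reported in this paper.

```python
def exposure_variance_factor(p_exposure, p_covariate):
    """Multiplier of the error variance in Var(exposure coefficient).

    Applies to the least-squares model
        rate = a + b * p_exposure + c * p_covariate
    and equals 1 / RSS, where RSS is the residual sum of squares from
    regressing p_exposure on p_covariate (with an intercept).
    """
    n = len(p_exposure)
    mean_e = sum(p_exposure) / n
    mean_c = sum(p_covariate) / n
    scc = sum((c - mean_c) ** 2 for c in p_covariate)
    sec = sum((e - mean_e) * (c - mean_c)
              for e, c in zip(p_exposure, p_covariate))
    slope = sec / scc
    rss = sum(((e - mean_e) - slope * (c - mean_c)) ** 2
              for e, c in zip(p_exposure, p_covariate))
    return 1.0 / rss


# Hypothetical covariate prevalences in eight areas.
p_cov = [0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8]

# Exposure varies widely and is uncorrelated with the covariate.
p_exp_independent = [0.2, 0.8, 0.4, 0.6, 0.6, 0.4, 0.8, 0.2]

# Exposure varies just as widely but tracks the covariate closely.
p_exp_collinear = [0.22, 0.18, 0.38, 0.42, 0.62, 0.58, 0.78, 0.82]

factor_independent = exposure_variance_factor(p_exp_independent, p_cov)
factor_collinear = exposure_variance_factor(p_exp_collinear, p_cov)

print(factor_independent, factor_collinear)
```

Both exposure patterns have nearly the same marginal spread across areas, yet the variance multiplier is more than a hundred times larger in the second, where little variation in exposure remains once the covariate is taken into account.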

Copyright information

© Springer-Verlag 2007