Spurious relationships arising from aggregate variables in linear regression


Linear regressions that use aggregated values from a group variable such as a school or a neighborhood are commonplace in the social sciences. This paper uses Monte Carlo methods to demonstrate that aggregated variables produce spurious relationships with other dependent and independent variables in a model even when there are no underlying relationships among those variables. The size of the spurious relationships (or postulated effects) increases as the number of observations per group decreases. Although this problem is remedied by including the individual-level variable in the regression, the problem has not been discussed in the methodological literature. Accordingly, studies using aggregate variables must be interpreted with caution if the individual-level measurements are not available.
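The mechanism described above can be illustrated with a small Monte Carlo sketch. It is not the authors' simulation: the data-generating process, the function name `spurious_aggregate_demo`, and the parameter values (`rho = 0.45`, 300 replications) are assumptions made for illustration. Students are assigned to schools at random, achievement A depends only on the student's own SES I, and the aggregate S is the school mean of I. Under these assumptions the population correlation between A and S is rho / sqrt(n), where n is the number of students per school, so it grows as groups get smaller, while the coefficient on S is zero in expectation once I is included in the regression.

```python
import numpy as np

def spurious_aggregate_demo(n_schools=50, n_per_school=30, rho=0.45,
                            reps=300, seed=0):
    """Return (mean corr(A, S), mean coefficient on S when I is controlled).

    Hypothetical data-generating process, not taken from the paper:
    there is no true school-level effect on achievement.
    """
    rng = np.random.default_rng(seed)
    corr_a_s = np.empty(reps)
    coef_s_given_i = np.empty(reps)
    for r in range(reps):
        # Individual SES; students are assigned to schools at random.
        ses = rng.standard_normal((n_schools, n_per_school))
        noise = rng.standard_normal(ses.shape)
        # Achievement depends only on the student's own SES.
        ach = rho * ses + np.sqrt(1.0 - rho**2) * noise
        # Aggregate variable: school mean SES, attached to each student.
        school_ses = np.repeat(ses.mean(axis=1), n_per_school)
        i, a = ses.ravel(), ach.ravel()
        # Relationship between A and the aggregate alone.
        corr_a_s[r] = np.corrcoef(a, school_ses)[0, 1]
        # OLS of A on an intercept, S, and I; keep the coefficient on S.
        X = np.column_stack([np.ones_like(a), school_ses, i])
        beta, *_ = np.linalg.lstsq(X, a, rcond=None)
        coef_s_given_i[r] = beta[1]
    return corr_a_s.mean(), coef_s_given_i.mean()

if __name__ == "__main__":
    for n in (5, 30, 100):
        r_as, b_s = spurious_aggregate_demo(n_per_school=n)
        print(f"students/school={n:3d}  corr(A,S)={r_as:.3f}  coef_S|I={b_s:.3f}")
```

Calling the function with different values of `n_per_school` shows the pattern stated in the abstract for this assumed setup: the apparent relationship between A and S shrinks as the number of observations per group rises, and it vanishes on average when the individual-level variable enters the model.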



  1. The shape of the actual distributions is unknown, but assuming they are normal, the 100,000 samples would generate an extremely small standard error, so that a correlation as small as .01 would be significant at the .05 level.

  2. Simulations were also run for 60 and 80 schools, but results differed only slightly from the 50-school case.

  3. In the dataset, the variable was named "metasum".

  4. It is understood that the coefficients for S and P in model (4) can differ from those in model (3), even though the same β symbols are used.

  5. The full set of simulation correlations that go into model (4) is available from the authors.
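The arithmetic behind the first note can be checked with a short sketch. It rests on one reading of the note, which is an assumption: that the quantity being tested is the mean of a simulated correlation across 100,000 Monte Carlo samples, with an approximately normal sampling distribution for that mean. Because a correlation lies in [-1, 1], its standard deviation across samples cannot exceed 1, which gives an upper bound on the standard error of the mean and a lower bound on the z statistic for a mean correlation of .01.

```python
import math

n_samples = 100_000          # number of Monte Carlo samples (from the note)
max_sd = 1.0                 # a value bounded in [-1, 1] has SD of at most 1
mean_corr = 0.01             # the small correlation mentioned in the note

# Upper bound on the standard error of the mean correlation.
se_upper = max_sd / math.sqrt(n_samples)

# Lower bound on the z statistic, and the matching upper bound on the
# two-sided p-value under a normal approximation.
z_lower = mean_corr / se_upper
p_upper = math.erfc(z_lower / math.sqrt(2))

print(f"SE <= {se_upper:.5f}, z >= {z_lower:.3f}, p <= {p_upper:.4f}")
```

Even in this worst case the p-value falls below .05, which is consistent with the note's claim; any realistic standard deviation would be far smaller than 1 and would make the result stronger.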



Author information



Corresponding author

Correspondence to David J. Armor.

Technical appendix: on multilevel modeling


Multilevel modeling does not affect the analyses presented here because of a lack of within-school (or within-group) variation. Table 8 demonstrates this using Model 1(b) for the case of 10 schools and 30 students per school (N = 300) from the USA PISA data. The table reports the within-school expected values of the standard deviations of school SES (S), achievement (A), and individual SES (I), as well as the covariance between A and I. The correlation was computed from the corresponding covariance and standard deviations. The within-school estimates of these values show almost no variation across schools, where the sample size for each school is 30 students. These within-cell values were estimated at a later time than the original simulations for Table 2, so the correlation for the full sample of 300 cases (.409) is slightly smaller than the original correlation of .456.

Table 8 Within-school estimates of parameter variation for Model 1(b) (case of 10 schools and 30 students per school)
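The kind of within-school summary described in this appendix can be sketched as follows. This is not the authors' procedure and does not use the PISA data: the function name `within_school_expected_values`, the simulated data-generating process, `rho = 0.45`, and the 2,000 replications are all assumptions for illustration. The sketch covers only A and I (the summary for S is omitted), averages the within-school standard deviations and covariance over replications, and then computes each school's correlation from those averaged values, as the appendix describes.

```python
import numpy as np

def within_school_expected_values(n_schools=10, n_students=30, rho=0.45,
                                  reps=2000, seed=0):
    """Expected within-school SDs, covariance, and correlation of A and I.

    Hypothetical setup: students are assigned to schools at random and
    achievement (A) depends only on individual SES (I).
    """
    rng = np.random.default_rng(seed)
    shape = (reps, n_schools, n_students)
    ses = rng.standard_normal(shape)
    ach = rho * ses + np.sqrt(1.0 - rho**2) * rng.standard_normal(shape)

    # Within-school sample statistics for every replication and school.
    sd_i = ses.std(axis=2, ddof=1)
    sd_a = ach.std(axis=2, ddof=1)
    dev_i = ses - ses.mean(axis=2, keepdims=True)
    dev_a = ach - ach.mean(axis=2, keepdims=True)
    cov_ai = (dev_i * dev_a).sum(axis=2) / (n_students - 1)

    # Average over replications: one expected value per school.
    e_sd_i = sd_i.mean(axis=0)
    e_sd_a = sd_a.mean(axis=0)
    e_cov = cov_ai.mean(axis=0)

    # Correlation computed from the averaged covariance and SDs.
    corr = e_cov / (e_sd_a * e_sd_i)
    return {"sd_A": e_sd_a, "sd_I": e_sd_i, "cov_AI": e_cov, "corr_AI": corr}

if __name__ == "__main__":
    out = within_school_expected_values()
    for name, values in out.items():
        print(f"{name:8s} min={values.min():.3f} max={values.max():.3f}")
```

In this assumed setup the minimum and maximum of each statistic across the 10 schools are close together, which is the pattern the appendix reports for the within-school estimates.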


Cite this article

Armor, D.J., Cotla, C.R. & Stratmann, T. Spurious relationships arising from aggregate variables in linear regression. Qual Quant 51, 1359–1379 (2017). https://doi.org/10.1007/s11135-016-0335-0



  • Aggregated variables
  • Contextual effects
  • Monte Carlo
  • Linear regression
  • Spurious correlation