A latent variable model approach to estimating systematic bias in the oversampling method

Abstract

The method of oversampling data from a preselected range of a variable’s distribution is often applied by researchers who wish to study rare outcomes without substantially increasing sample size. Despite its frequent use, however, it is not known whether this method introduces statistical bias through disproportionate representation of a particular range of data. The present study used simulated data sets to examine the extent to which oversampling introduces systematic bias in effect size estimates (of the relationship between oversampled predictor variables and the outcome variable), as compared with estimates based on a random sample. In general, results indicated that increased oversampling was associated with a decrease in the absolute value of effect size estimates. Critically, however, the magnitude of this decrease was nominal. This finding thus provides the first evidence that the oversampling method does not systematically bias results to a degree that would typically affect conclusions in behavioral research. Examining the effect of sample size on oversampling yielded an additional important finding: For smaller samples, oversampling may be necessary to avoid spuriously inflated effect sizes, which can arise when the number of predictor variables is comparable to the number of rare outcomes.
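The comparison described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' latent variable model or their simulation design; it is a minimal illustration under assumptions chosen here: a single standard-normal predictor, a rare binary outcome defined by thresholding a noisy liability (roughly a 5% base rate), oversampling from the top quartile of the predictor, and the point-biserial correlation as the effect size. All function names, parameter values, and the seed are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation; with binary ys this is the point-biserial r."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined, e.g. a sample with no rare outcomes
    return sxy / math.sqrt(sxx * syy)


def simulate_population(size, slope, cutoff, rng):
    """Assumed data model: outcome is 1 when slope * x + noise exceeds cutoff."""
    pop = []
    for _ in range(size):
        x = rng.gauss(0.0, 1.0)
        liability = slope * x + rng.gauss(0.0, 1.0)
        pop.append((x, 1 if liability > cutoff else 0))
    return pop


def mean_effect_size(pop, upper_pool, n, oversample_frac, reps, rng):
    """Average r over repeated samples in which a fraction of the n cases
    is drawn from the upper range of x and the rest at random."""
    estimates = []
    for _ in range(reps):
        n_over = round(n * oversample_frac)
        # The two draws are independent, so a case can occasionally appear
        # twice; with a large population this is ignored for simplicity.
        sample = rng.sample(upper_pool, n_over) + rng.sample(pop, n - n_over)
        r = pearson_r([x for x, _ in sample], [y for _, y in sample])
        if not math.isnan(r):
            estimates.append(r)
    return sum(estimates) / len(estimates)


def compare_oversampling(fractions, seed=0):
    rng = random.Random(seed)
    pop = simulate_population(20000, 0.5, 1.84, rng)
    # Preselected range (an assumption here): the top quartile of x.
    upper_pool = sorted(pop)[int(0.75 * len(pop)):]
    return {f: mean_effect_size(pop, upper_pool, 500, f, 50, rng)
            for f in fractions}


results = compare_oversampling([0.0, 0.25, 0.5])
for frac, r in results.items():
    print(f"oversampled fraction {frac:.2f}: mean r = {r:.3f}")
```

An oversampled fraction of 0.0 corresponds to a purely random sample and serves as the baseline. The direction and size of any difference depend on the effect size measure and data model assumed, so this sketch should not be read as reproducing the reported results.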

Notes

  1. Additional results not shown in Table 1 or in the figures are available upon request. Supplementary Table 1 presents a summary of regression analyses for oversampling (predictor variables) and effect size estimates (criterion variable) based on reliability set at ρ_{X,X} = 1.000.

Author information

Corresponding author

Correspondence to Katherina K. Hauner.

Electronic supplementary material

ESM 1 (DOC 171 kb)

ESM 2 (DOC 2384 kb)

About this article

Cite this article

Hauner, K.K., Zinbarg, R.E. & Revelle, W. A latent variable model approach to estimating systematic bias in the oversampling method. Behav Res 46, 786–797 (2014). https://doi.org/10.3758/s13428-013-0402-6

  • DOI: https://doi.org/10.3758/s13428-013-0402-6

Keywords

  • Sampling
  • Statistical bias
  • Latent variable modeling