Computing and Interpreting Effect Sizes

  • Crystal Reneé Hill
  • Bruce Thompson
Part of the Higher Education: Handbook of Theory and Research book series (HATR, volume 19)


Effect sizes will be routinely reported only once editors promulgate policies that make these practices normatively expected. As Sedlmeier and Gigerenzer (1989) argued, “there is only one force that can effect a change, and that is the same force that helped institutionalize null hypothesis testing as the sine qua non for publication, namely, the editors of the major journals” (p. 315). Glantz (1980) agreed, noting that “The journals are the major force for quality control in scientific work” (p. 3).

The fact that 23 journals, including two major journals of large professional associations, now require effect size reporting bodes well for improved practices. As Fidler (2002) recently observed in her penetrating essay, “Of the major American associations, only all the journals of the American Educational Research Association have remained silent on all these issues” (p. 754).

As Thompson (1999a) noted, “It is doubtful that the field will ever settle on a single index to be used in all studies, given that so many choices exist and because the statistics can be translated into approximations across the two major classes” (p. 171). But three practices should be expected:

  1. report effect sizes for all primary outcomes, even if they are not statistically significant (see Thompson, 2002c);
  2. interpret effect sizes by explicit and direct comparison with the effects in related prior studies (Thompson, 2002c; Wilkinson and APA Task Force on Statistical Inference, 1999); and
  3. compute CIs for results, including effect sizes (cf. Cumming and Finch, 2001), and when a reasonably large number of effects for current and prior studies are available, consider using graphics to facilitate comparisons (Wilkinson and APA Task Force on Statistical Inference, 1999).
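
As one illustration of the first and third practices, the sketch below computes a standardized mean difference (Cohen's d) for two groups and attaches an interval to it. It is only a minimal sketch under stated assumptions: the scores are invented for illustration, the function names are hypothetical, and the interval uses a simple large-sample normal approximation rather than the noncentral-distribution intervals that Cumming and Finch (2001) discuss, which are generally preferable and require specialized software.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def approx_ci_for_d(d, n_a, n_b, level=0.95):
    """Large-sample normal-approximation CI for d.

    Assumption: this is NOT the noncentral-t interval; it is a rough
    approximation that is symmetric around d.
    """
    se = sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


# Hypothetical scores, invented purely for illustration.
treatment = [12, 15, 14, 16, 13, 17, 15, 14]
control = [11, 12, 13, 12, 10, 14, 12, 11]

d = cohens_d(treatment, control)
low, high = approx_ci_for_d(d, len(treatment), len(control))
print(f"d = {d:.2f}, approximate 95% CI [{low:.2f}, {high:.2f}]")
```

With samples this small the approximate interval is wide, which is itself informative: reporting it alongside the point estimate makes the imprecision of the effect size visible, and the same interval can then be set beside the effects reported in related prior studies.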







References

  1. Aaron, B., Kromrey, J.D., and Ferron, J.M. (1998, November). Equating r-based and d-based effect size indices: Problems with a commonly recommended formula. Paper presented at the annual meeting of the Florida Educational Research Association, Orlando, FL (ERIC Document Reproduction Service No. ED 433 353).
  2. Abelson, R.P. (1997). A retrospective on the significance test ban of 1999 (If there were no significance tests, they would be invented). In L.L. Harlow, S.A. Mulaik, and J.H. Steiger (eds.), What if There Were no Significance Tests? (pp. 117–141). Mahwah, NJ: Erlbaum.
  3. American Psychological Association. (1994). Publication Manual of the American Psychological Association (4th edn.). Washington, DC: Author.
  4. American Psychological Association. (2001). Publication Manual of the American Psychological Association (5th edn.). Washington, DC: Author.
  5. Bagozzi, R.P., Fornell, C., and Larcker, D.F. (1981). Canonical correlation analysis as a special case of a structural relations model. Multivariate Behavioral Research 16: 437–454.
  6. Baugh, F. (2002). Correcting effect sizes for score reliability: A reminder that measurement and substantive issues are linked inextricably. Educational and Psychological Measurement 62: 254–263.
  7. Baugh, F., and Thompson, B. (2001). Using effect sizes in social science research: New APA and journal mandates for improved methodology practices. Journal of Research in Education 11(1): 120–129.
  8. Boring, E.G. (1919). Mathematical vs. scientific importance. Psychological Bulletin 16: 335–338.
  9. Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review 48: 378–399.
  10. Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin 70: 426–443.
  11. Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press.
  12. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd edn.). Hillsdale, NJ: Erlbaum.
  13. Cohen, J. (1994). The earth is round (p < .05). American Psychologist 49: 997–1003.
  14. Cortina, J.M., and Dunlap, W.P. (1997). Logic and purpose of significance testing. Psychological Methods 2: 161–172.
  15. Cumming, G., and Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement 61: 532–574.
  16. Elmore, P., and Rotou, O. (2001, April). A primer on basic effect size concepts. Paper presented at the annual meeting of the American Educational Research Association, Seattle (ERIC Document Reproduction Service No. ED 453 260).
  17. Ezekiel, M. (1930). Methods of Correlational Analysis. New York: Wiley.
  18. Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement 62: 749–770.
  19. Finch, S., Cumming, G., and Thomason, N. (2001). Reporting of statistical inference in the Journal of Applied Psychology: Little evidence of reform. Educational and Psychological Measurement 61: 181–210.
  20. Fleishman, A.I. (1980). Confidence intervals for correlation ratios. Educational and Psychological Measurement 40: 659–670.
  21. Friedman, H. (1968). Magnitude of experimental effect and a table for its rapid estimation. Psychological Bulletin 70: 245–251.
  22. Glantz, S.A. (1980). Biostatistics: How to detect, correct and prevent errors in the medical literature. Circulation 61: 1–7.
  23. Glass, G. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher 5(10): 3–8.
  24. Gregg, M., and Leinhardt, G. (2002). Learning from the Birmingham Civil Rights Institute: Documenting teacher development. American Educational Research Journal 39: 553–587.
  25. Harris, M.J. (1991). Significance tests are not enough: The role of effect-size estimation in theory corroboration. Theory & Psychology 1: 375–382.
  26. Herzberg, P.A. (1969). The parameters of cross-validation. Psychometrika Monograph Supplement 16: 1–67.
  27. Hess, B., Olejnik, S., and Huberty, C.J. (2001). The efficacy of two improvement-over-chance effect sizes for two-group univariate comparisons under variance heterogeneity and non-normality. Educational and Psychological Measurement 61: 909–936.
  28. Huberty, C.J. (1999). On some history regarding statistical testing. In B. Thompson (ed.), Advances in Social Science Methodology (Vol. 5, pp. 1–23). Stamford, CT: JAI Press.
  29. Huberty, C.J. (2002). A history of effect size indices. Educational and Psychological Measurement 62: 227–240.
  30. Huberty, C.J., and Holmes, S.E. (1983). Two-group comparisons and univariate classification. Educational and Psychological Measurement 43: 15–26.
  31. Huberty, C.J., and Lowman, L.L. (2000). Group overlap as a basis for effect size. Educational and Psychological Measurement 60: 543–563.
  32. Huberty, C.J., and Morris, J.D. (1988). A single contrast test procedure. Educational and Psychological Measurement 48: 567–578.
  33. Hunter, J.E. (1997). Needed: A ban on the significance test. Psychological Science 8(1): 3–7.
  34. Jacobson, N.S., Roberts, L.J., Berns, S.B., and McGlinchey, J.B. (1999). Methods for defining and determining the clinical significance of treatment effects: Description, application, and alternatives. Journal of Consulting and Clinical Psychology 67: 300–307.
  35. Kazdin, A.E. (1999). The meanings and measurement of clinical significance. Journal of Consulting and Clinical Psychology 67: 332–339.
  36. Kendall, P.C. (1999). Clinical significance. Journal of Consulting and Clinical Psychology 67: 283–284.
  37. Kieffer, K.M., Reese, R.J., and Thompson, B. (2001). Statistical techniques employed in AERJ and JCP articles from 1988 to 1997: A methodological review. Journal of Experimental Education 69: 280–309.
  38. Kirk, R.E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement 56: 746–759.
  39. Knapp, T.R. (1978). Canonical correlation analysis: A general parametric significance testing system. Psychological Bulletin 85: 410–416.
  40. Kromrey, J.D., and Hines, C.V. (1996). Estimating the coefficient of cross-validity in multiple regression: A comparison of analytical and empirical methods. Journal of Experimental Education 64: 240–266.
  41. Kupersmid, J. (1988). Improving what is published: A model in search of an editor. American Psychologist 43: 635–642.
  42. Loftus, G.R. (1994, August). Why psychology will never be a real science until we change the way we analyze data. Paper presented at the annual meeting of the American Psychological Association, Los Angeles.
  43. Lord, F.M. (1950). Efficiency of Prediction when a Regression Equation from One Sample is Used in a New Sample (Research Bulletin 50-110). Princeton, NJ: Educational Testing Service.
  44. Mittag, K.C., and Thompson, B. (2000). A national survey of AERA members’ perceptions of statistical significance tests and other statistical issues. Educational Researcher 29(4): 14–20.
  45. Murray, L.W., and Dosser, D.A. (1987). How significant is a significant difference? Problems with the measurement of magnitude of effect. Journal of Counseling Psychology 34: 68–72.
  46. Nelson, N., Rosenthal, R., and Rosnow, R.L. (1986). Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist 41: 1299–1301.
  47. Oakes, M. (1986). Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: Wiley.
  48. O’Grady, K.E. (1982). Measures of explained variance: Cautions and limitations. Psychological Bulletin 92: 766–777.
  49. Olejnik, S., and Algina, J. (2000). Measures of effect size for comparative studies: Applications, interpretations, and limitations. Contemporary Educational Psychology 25: 241–286.
  50. Roberts, J.K., and Henson, R.K. (2002). Correction for bias in estimating effect sizes. Educational and Psychological Measurement 62: 241–253.
  51. Robinson, D.H., and Wainer, H. (2002). On the past and future of null hypothesis significance testing. Journal of Wildlife Management 66: 263–271.
  52. Rosenthal, R., and Gaito, J. (1963). The interpretation of level of significance by psychological researchers. Journal of Psychology 55: 33–38.
  53. Rosnow, R.L., and Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist 44: 1276–1284.
  54. Saunders, S.M., Howard, K.I., and Newman, F.L. (1988). Evaluating the clinical significance of treatment effects: Norms and normality. Behavioral Assessment 10: 207–218.
  55. Schmidt, F. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods 1: 115–129.
  56. Sedlmeier, P., and Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin 105: 309–316.
  57. Shaver, J. (1985). Chance and nonsense. Phi Delta Kappan 67(1): 57–60.
  58. Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement 61: 605–632.
  59. Snyder, P. (2000). Guidelines for reporting results of group quantitative investigations. Journal of Early Intervention 23: 145–150.
  60. Snyder, P., and Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education 61: 334–349.
  61. Steiger, J.H., and Fouladi, R.T. (1992). R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation. Behavior Research Methods, Instruments, and Computers 4: 581–582.
  62. Stevens, J. (1992). Applied Multivariate Statistics for the Social Sciences (2nd edn.). Hillsdale, NJ: Erlbaum.
  63. Thompson, B. (1992). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development 70: 434–438.
  64. Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education 61: 361–377.
  65. Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher 25(2): 26–30.
  66. Thompson, B. (1998a). In praise of brilliance: Where that praise really belongs. American Psychologist 53: 799–800.
  67. Thompson, B. (1998b). Review of What if there were no significance tests? Educational and Psychological Measurement 58: 332–344.
  68. Thompson, B. (1999a). If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology 9: 167–183.
  69. Thompson, B. (1999b). Journal editorial policies regarding statistical significance tests: Heat is to fire as p is to importance. Educational Psychology Review 11: 157–169.
  70. Thompson, B. (2000a). Canonical correlation analysis. In L. Grimm, and P. Yarnold (eds.), Reading and Understanding More Multivariate Statistics (pp. 285–316). Washington, DC: American Psychological Association.
  71. Thompson, B. (2000b). Ten commandments of structural equation modeling. In L. Grimm, and P. Yarnold (eds.), Reading and Understanding More Multivariate Statistics (pp. 261–284). Washington, DC: American Psychological Association.
  72. Thompson, B. (2001). Significance, effect sizes, stepwise methods, and other issues: Strong arguments move the field. Journal of Experimental Education 70: 80–93.
  73. Thompson, B. (ed.) (2002a). Score Reliability: Contemporary Thinking on Reliability Issues. Newbury Park, CA: Sage.
  74. Thompson, B. (2002b). “Statistical,” “practical,” and “clinical”: How many kinds of significance do counselors need to consider? Journal of Counseling and Development 80: 64–71.
  75. Thompson, B. (2002c). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher 31(3): 24–31.
  76. Thompson, B., and Kieffer, K.M. (2000). Interpreting statistical significance test results: A proposed new “What if” method. Research in the Schools 7(2): 3–10.
  77. Thompson, B., and Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement 60: 174–195.
  78. Trusty, J., Thompson, B., and Petrocelli, J.V. (2004). Practical guide to implementing the requirement of reporting effect size in quantitative research in the Journal of Counseling & Development. Journal of Counseling and Development.
  79. Tryon, W.W. (1998). The inscrutable null hypothesis. American Psychologist 53: 796.
  80. Vacha-Haase, T., Nilsson, J.E., Reetz, D.R., Lance, T.S., and Thompson, B. (2000). Reporting practices and APA editorial policies regarding statistical significance and effect size. Theory & Psychology 10: 413–425.
  81. Wilkinson, L., and APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist 54: 594–604 (reprint available through the APA Home Page).
  82. Zuckerman, M., Hodgins, H.S., Zuckerman, A., and Rosenthal, R. (1993). Contemporary issues in the analysis of data: A survey of 551 psychologists. Psychological Science 4: 49–53.

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Crystal Reneé Hill (1)
  • Bruce Thompson (2)

  1. Texas A&M University, USA
  2. LSU Health Science Center, USA
