Null Hypothesis Testing

  • Gideon J. Mellenbergh


Null hypothesis testing applies to confirmatory research, where substantive hypotheses are tested. The preferred approach is to construct a confidence interval (CI), because a CI simultaneously assesses the precision of a parameter estimate and tests a null hypothesis on that parameter. Two- and one-sided CIs and two- and one-tailed tests are considered. The CI approach is demonstrated for conventional tests of the null hypothesis of equal means of paired (Student’s t test) and independent (Student’s t and Welch tests) variables. Bootstrap methods make weaker assumptions than the conventional tests. The bootstrap t method for the means of paired and independent variables and the modified percentile bootstrap method for the product moment correlation are described.

Null hypothesis testing is often incorrectly understood and applied. Several common flaws, and methods to correct them, are discussed. First, overlap of the CIs of two means does not imply that the difference between the two means is nonsignificant. Second, a two-step procedure, where the choice of a test is based on the results of tests of that test’s assumptions, inflates the Type I error. Third, standardized effect sizes can be computed in different ways, which hampers the comparability of effect sizes in meta-analysis. Fourth, an observed power analysis, where the effect size is estimated from sample data, cannot explain nonsignificant results. Fifth, testing multiple null hypotheses increases the probability of rejecting at least one true null hypothesis, which is prevented by applying multiple null hypothesis testing methods (e.g., Hochberg’s method). Sixth, data exploration may yield interesting substantive hypotheses, but these have to be confirmed with new data from a cross-validation or replication study. Seventh, adding participants to the sample until the null hypothesis is rejected inflates the Type I error, which is prevented by using sequential testing methods (e.g., the group sequential testing procedure). Finally, if researchers do not want to reject a null hypothesis, they have to apply equivalence testing.
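The duality between a two-tailed test and a two-sided CI can be made concrete. The sketch below (the function name `welch_ci` is illustrative, not the author's) computes the Welch test for two independent means together with the matching CI for the mean difference, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import stats

def welch_ci(x, y, alpha=0.05):
    """Welch's t test for two independent means, together with the
    matching two-sided (1 - alpha) CI for the mean difference."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    diff = x.mean() - y.mean()
    se = np.sqrt(vx / nx + vy / ny)          # SE of the mean difference
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    p = 2 * stats.t.sf(abs(diff / se), df)
    crit = stats.t.ppf(1 - alpha / 2, df)
    return diff, (diff - crit * se, diff + crit * se), p

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, 40)                 # smaller group, shifted mean
y = rng.normal(0.0, 2.0, 60)                 # larger group, larger variance
diff, (lo, hi), p = welch_ci(x, y)
# H0 (equal means) is rejected at alpha = .05 exactly when the CI excludes 0
```

Because the test and the CI share the same standard error and critical value, the CI reports the test's decision and the estimate's precision at once.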
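The bootstrap t method for independent means can be sketched as follows. The studentized-resampling scheme below is a minimal illustration under common textbook conventions, not necessarily the chapter's exact algorithm, and `bootstrap_t_ci` is a hypothetical helper name:

```python
import numpy as np

def bootstrap_t_ci(x, y, alpha=0.05, n_boot=2000, seed=0):
    """Bootstrap-t (studentized bootstrap) CI for the difference of two
    independent means; it does not assume normality, unlike Student's t."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)

    def se(a, b):  # standard error of the mean difference
        return np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

    diff, se0 = x.mean() - y.mean(), se(x, y)
    t_star = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        t_star[b] = ((xb.mean() - yb.mean()) - diff) / se(xb, yb)
    lo_q, hi_q = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    # note the reversed roles: the upper quantile of t* sets the lower bound
    return diff - hi_q * se0, diff - lo_q * se0

rng = np.random.default_rng(2)
x = rng.exponential(1.0, 50) + 0.5   # skewed data, where Student's t is shaky
y = rng.exponential(1.0, 50)
lo, hi = bootstrap_t_ci(x, y)
```

The resampled t statistics substitute for the Student's t reference distribution, which is why the method tolerates skewed data better than the conventional tests.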
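Hochberg's step-up procedure for multiple null hypotheses is simple to state in code. A minimal sketch (the function name is illustrative): with the m p values sorted in increasing order, reject the hypotheses with the k smallest p values, where k is the largest rank satisfying p(k) <= alpha / (m - k + 1).

```python
def hochberg(pvalues, alpha=0.05):
    """Hochberg's step-up procedure; controls the familywise Type I
    error rate when testing m null hypotheses simultaneously."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank in range(m, 0, -1):               # step up from the largest p
        if pvalues[order[rank - 1]] <= alpha / (m - rank + 1):
            k = rank
            break
    reject = [False] * m
    for rank in range(k):
        reject[order[rank]] = True
    return reject

decisions = hochberg([0.009, 0.013, 0.040, 0.320])
# the two smallest p values survive: [True, True, False, False]
```

Unlike a plain Bonferroni correction, only the smallest p value is held to the full alpha/m threshold, so the procedure rejects at least as many hypotheses.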
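Equivalence testing can be illustrated with the two one-sided tests (TOST) procedure for paired means; the equivalence margins and the function name below are illustrative assumptions, not the chapter's own example:

```python
import numpy as np
from scipy import stats

def tost_paired(x, y, low, high, alpha=0.05):
    """Two one-sided tests (TOST) for the equivalence of two paired means:
    equivalence is concluded when the mean difference is significantly
    above `low` and significantly below `high` (the equivalence margins)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    n = len(d)
    se = d.std(ddof=1) / np.sqrt(n)
    df = n - 1
    p_low = stats.t.sf((d.mean() - low) / se, df)     # H0: difference <= low
    p_high = stats.t.cdf((d.mean() - high) / se, df)  # H0: difference >= high
    return d.mean(), max(p_low, p_high)               # equivalence if p <= alpha

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 200)
y = x + rng.normal(0.0, 0.5, 200)            # same population mean as x
mean_diff, p_eq = tost_paired(x, y, -0.3, 0.3)
```

A nonsignificant conventional test does not establish equivalence; TOST reverses the roles of the hypotheses, so a small p value here supports the claim that the difference lies inside the margins.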


Keywords: Bootstrap methods · Conditional null hypothesis testing · Confidence intervals · Cross-validation design · Equivalence testing · Group sequential testing method · Hochberg’s multiple null hypotheses testing method · Overlapping confidence intervals · Power analysis · Replication design · Standardized effect size · Student’s t test · Welch test


References

  1. American Educational Research Association. (2006). Standards for reporting on empirical social science research in AERA publications. Educational Researcher, 35, 33–40.
  2. APA. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: American Psychological Association.
  3. Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423–437.
  4. Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389–396.
  5. Brewer, J. K. (1972). On the power of statistical tests in the American Educational Research Journal. American Educational Research Journal, 9, 391–401.
  6. Brewer, J. K., & Owen, P. W. (1973). A note on the power of statistical tests in the Journal of Educational Measurement. Journal of Educational Measurement, 10, 71–74.
  7. Chase, L. J., & Chase, R. B. (1976). A statistical power analysis of applied psychological research. Journal of Applied Psychology, 61, 234–237.
  8. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.
  9. Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York, NY: Academic Press.
  10. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
  11. Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. London, England: Routledge.
  12. de Groot, A. D. (1956/2014). De betekenis van ‘significant’ bij verschillende typen onderzoek [The meaning of ‘significance’ for different types of research]. Nederlands Tijdschrift voor de Psychologie en haar Grensgebieden, 11, 398–409. (Translated and annotated by E.-J. Wagenmakers et al. in Acta Psychologica, 148, 188–194.)
  13. Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman and Hall.
  14. Elstrodt, M., & Mellenbergh, G. J. (1978). Eén minus de vergeten fout [One minus the forgotten Type II error]. Nederlands Tijdschrift voor de Psychologie, 33, 33–47.
  15. Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175–191.
  16. Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the analysis of variance and analysis of covariance. Review of Educational Research, 42, 237–288.
  17. Goldstein, H., & Healy, M. J. R. (1995). The graphical presentation of a collection of means. Journal of the Royal Statistical Society, Series A, 158, 175–177.
  18. Hayes, A. F., & Cai, L. (2007). Further evaluating the conditional decision rule for comparing two independent means. British Journal of Mathematical and Statistical Psychology, 60, 217–244.
  19. Hedges, L. V., & Pigott, T. D. (2001). The power of statistical tests in meta-analysis. Psychological Methods, 6, 203–217.
  20. Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–802.
  21. Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21, 1157–1164.
  22. Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. American Statistician, 55, 19–24.
  23. Kelley, K., & Preacher, K. J. (2012). On effect size. Psychological Methods, 17, 137–152.
  24. Keppel, G., & Wickens, T. D. (2004). Design and analysis: A researcher’s handbook (4th ed.). Upper Saddle River, NJ: Pearson.
  25. Keselman, H. J., Algina, J., Lix, L. M., Wilcox, R. R., & Deering, K. N. (2008). A generally robust approach for testing hypotheses and setting confidence intervals for effect sizes. Psychological Methods, 13, 110–129.
  26. Lakens, D. (2017). Equivalence tests: A practical primer for t tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8, 355–362.
  27. Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian cognitive modeling: A practical guide. Cambridge, UK: Cambridge University Press.
  28. Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166.
  29. Morrison, D. F. (1990). Multivariate statistical methods (3rd ed.). New York, NY: McGraw-Hill.
  30. Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
  31. O’Keefe, D. J. (2007). Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: Sorting out appropriate uses of statistical power analysis. Communication Methods and Measures, 1, 291–299.
  32. Onwuegbuzie, A. J., & Leech, N. L. (2004). Post hoc power: A concept whose time has come. Understanding Statistics, 3, 201–230.
  33. Peng, C.-Y. J., & Chen, L.-T. (2014). Beyond Cohen’s d: Alternative effect size measures for between-subject designs. Journal of Experimental Education, 82, 22–50.
  34. Piantadosi, S. (2005). Clinical trials: A methodologic perspective (2nd ed.). Hoboken, NJ: Wiley.
  35. Rogers, J. L., Howard, K. I., & Vessey, J. T. (1993). Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin, 113, 553–565.
  36. Rom, D. M. (2013). An improved Hochberg procedure for multiple tests of significance. British Journal of Mathematical and Statistical Psychology, 66, 189–196.
  37. Ruscio, J., & Roche, B. (2012). Variance heterogeneity in published psychological research: A review and a new index. Methodology, 8, 1–11.
  38. Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.
  39. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.
  40. Sun, S., Pan, W., & Wang, L. L. (2010). Rethinking observed power: Concept, practice, and implications. Methodology, 7, 81–87.
  41. van Belle, G. (2002). Statistical rules of thumb. New York, NY: Wiley.
  42. Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 627–633.
  43. Walker, E., & Nowacki, A. S. (2010). Understanding equivalence and noninferiority testing. Journal of General Internal Medicine, 26, 192–196.
  44. Westlake, W. J. (1981). Bioequivalence testing: A need to rethink. Biometrics, 37, 591–593.
  45. Wilcox, R. R. (1998). How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53, 300–314.
  46. Wilcox, R. R. (2010). Fundamentals of modern statistical methods. New York, NY: Springer.
  47. Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
  48. Yuan, K.-H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 30, 141–167.
  49. Zimmerman, D. W. (1996). Some properties of preliminary tests of equality of variances in the two-sample location problem. The Journal of General Psychology, 123, 217–231.
  50. Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of Mathematical and Statistical Psychology, 57, 173–181.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Emeritus Professor of Psychological Methods, Department of Psychology, University of Amsterdam, Amsterdam, The Netherlands
