
Statistical significance testing should be discontinued in mathematics education research


Abstract

It is claimed here that the confidence mathematics education researchers have in statistical significance testing (SST) as an inference tool par excellence for experimental research is misplaced. Five common myths about SST are discussed, namely that SST: (a) is a controversy-free, recipe-like method that allows decision making; (b) answers the question of whether there is a low probability that the research results were due to chance; (c) has a logic that parallels the logic of mathematical proof by contradiction; (d) addresses the reliability/replicability question; and (e) is a necessary but not sufficient condition for the credibility of results. It is argued that SST’s contribution to educational research in general, and to mathematics education research in particular, is not beneficial, and that SST should be discontinued as a tool for such research. Some alternatives to SST are suggested, and a call is made for mathematics education researchers to take the lead in using these alternatives.
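As a rough illustration of the kinds of alternatives commonly proposed in this literature (the article offers its own suggestions), the Python sketch below reports an effect size together with a percentile bootstrap confidence interval instead of a p-value. The group scores, sample sizes, and the choice of Cohen's d are invented assumptions for this example and are not taken from the article.

```python
# Illustrative sketch only: effect size + bootstrap CI as an alternative to
# reporting a bare p-value. All data below are invented for demonstration.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical post-test scores for two instructional groups (assumed data).
treatment = np.array([72, 68, 75, 80, 66, 77, 71, 74, 69, 78], dtype=float)
control = np.array([65, 70, 62, 73, 60, 68, 64, 66, 71, 63], dtype=float)

def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def bootstrap_ci(a, b, stat, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for a two-sample statistic."""
    boot = np.empty(n_boot)
    for i in range(n_boot):
        boot[i] = stat(rng.choice(a, size=len(a), replace=True),
                       rng.choice(b, size=len(b), replace=True))
    return np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])

d = cohens_d(treatment, control)
low, high = bootstrap_ci(treatment, control, cohens_d)
print(f"Cohen's d = {d:.2f}, 95% bootstrap CI [{low:.2f}, {high:.2f}]")
```

Reporting the magnitude of an effect with an interval of plausible values conveys what a bare "p < .05" cannot: how large the difference is and how precisely it has been estimated.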




About this article

Cite this article

Menon, R. Statistical significance testing should be discontinued in mathematics education research. Math Ed Res J 5, 4–18 (1993). https://doi.org/10.1007/BF03217248
