
Assessing environmentally significant effects: a better strength-of-evidence than a single P value?

Environmental Monitoring and Assessment

Abstract

Interpreting a P value from a traditional nil hypothesis test as a strength-of-evidence for the existence of an environmentally important difference between two populations of continuous variables (e.g. a chemical concentration) has become commonplace. Yet, there is substantial literature, in many disciplines, that faults this practice. In particular, the hypothesis tested is virtually guaranteed to be false, with the result that P depends far too heavily on the number of samples collected (the ‘sample size’). The end result is a swinging burden-of-proof (permissive at low sample size but precautionary at large sample size). We propose that these tests be reinterpreted as direction detectors (as has been proposed by others, starting from 1960) and that the test’s procedure be performed simultaneously with two types of equivalence tests (one testing that the difference that does exist is contained within an interval of indifference, the other testing that it is beyond that interval—also known as bioequivalence testing). This gives rise to a strength-of-evidence procedure that lends itself to a simple confidence interval interpretation. It is accompanied by a strength-of-evidence matrix that has many desirable features: not only a strong/moderate/dubious/weak categorisation of the results, but also recommendations about the desirability of collecting further data to strengthen findings.


Notes

  1. In a precautionary approach, it is assumed that important differences exist; that assumption is abandoned only if the data are sufficiently convincing. The reverse holds in the permissive approach.

  2. As we have noted, if the nil hypothesis cannot be true, this error cannot occur.

References

  • Anderson, P. D., & Meleason, M. A. (2009). Discerning responses of down wood and understory vegetation abundance to riparian buffer width and thinning treatments: an equivalence-inequivalence approach. Canadian Journal of Forest Research, 39, 2470–2485.

  • Beninger, P. G., Boldina, I., & Katsanevakis, S. (2012). Strengthening statistical usage in marine ecology. Journal of Experimental Marine Biology and Ecology, 426–427, 97–108.

  • Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24, 295–300.

  • Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses (rejoinder to Cox 1987). Statistical Science, 2(3), 348.

  • Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection–union tests and equivalence confidence sets (with discussion). Statistical Science, 11(4), 283–319.

  • Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37, 325–335.

  • Bohrer, R. (1979). Multiple three-decision rules for parametric signs. Journal of the American Statistical Association, 74, 432–437.

  • Brosi, B. J., & Biber, E. G. (2009). Statistical inference, type II error, and decision making under the US Endangered Species Act. Frontiers in Ecology and the Environment, 7(9), 487–494.

  • Bross, I. D. (1985). Why proof of safety is much more difficult than proof of hazard. Biometrics, 41, 785–793.

  • Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: a practical information-theoretic approach (2nd ed.). New York: Springer-Verlag.

  • Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.

  • Chow, S. L. (1996). Statistical significance: rationale, validity and utility. London: Sage.

  • Chow, S.-C., & Shao, J. (1990). An alternative approach for the assessment of bioequivalence between two formulations of a drug. Biometrical Journal, 32, 969–976.

  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale: Lawrence Erlbaum.

  • Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.

  • Cole, R. G., & McBride, G. B. (2004). Assessing impacts of dredge spoil disposal using equivalence tests: implications of a precautionary (proof of safety) approach. Marine Ecology Progress Series, 279, 63–72.

  • Cox, D. R. (1987). Comment on Berger, J. O., & Delampady, M., Testing precise hypotheses. Statistical Science, 2(3), 335–336.

  • Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals based on central and noncentral distributions. Educational and Psychological Measurement, 61, 530–572.

  • DeGroot, M. H. (1973). Doing what comes naturally: interpreting a tail area as a posterior probability or as a likelihood ratio. Journal of the American Statistical Association, 68, 966–969.

  • Dixon, P. M., & Pechmann, J. H. K. (2005). A statistical test to show negligible trend. Ecology, 86(7), 1751–1756.

  • Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242.

  • Fleiss, J. L. (1986). Significance tests have a role in epidemiologic research: reactions to A. M. Walker (different views). American Journal of Public Health, 76, 559–560.

  • Freund, J. E. (1992). Mathematical statistics (5th ed.). Upper Saddle River: Prentice-Hall.

  • Frick, R. W. (1995). Accepting the null hypothesis. Memory and Cognition, 23(1), 132–138.

  • Germano, J. D. (1999). Ecology, statistics, and the art of misdiagnosis: the need for a paradigm shift. Environmental Reviews, 7, 167–190.

  • Gerrodette, T. (2011). Inference without significance: measuring support for hypotheses rather than rejecting them. Marine Ecology, 32(3), 404–418.

  • Gibbons, J. D., & Pratt, J. W. (1975). P-values: interpretation and methodology. American Statistician, 29, 20–25.

  • Goudey, R. (2007). Do statistical inferences allowing three alternative decisions give better feedback for environmentally precautionary decision-making? Journal of Environmental Management, 85, 338–344.

  • Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52(1), 15–24.

  • Harris, R. J. (1997a). Significance tests have their place. Psychological Science, 8(1), 8–11.

  • Harris, R. J. (1997b). Reforming significance testing via three-valued logic. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 145–174). Mahwah: Lawrence Erlbaum.

  • Harris, R. J. (2001). A primer of multivariate statistics (3rd ed.). Mahwah: Lawrence Erlbaum.

  • Hodges, J. L., & Lehmann, E. L. (1954). Testing the approximate validity of statistical hypotheses. Journal of the Royal Statistical Society, Series B, 16, 261–268.

  • Jeffreys, H. S. (1961). Theory of probability. Oxford: Oxford University Press.

  • Johnson, D. H. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management, 63(3), 763–772.

  • Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5(4), 411–414.

  • Kaiser, H. F. (1960). Directional statistical decisions. Psychological Review, 67(3), 160–167.

  • Läärä, E. (2009). Statistics: reasoning on uncertainty, and the insignificance of testing null. Annales Zoologici Fennici, 46(2), 138–157.

  • Lee, P. M. (1997). Bayesian statistics: an introduction (2nd ed.). London: Arnold.

  • Lehmann, E. L. (1986). Testing statistical hypotheses (2nd ed.). New York: Wiley.

  • McBride, G. B. (1999). Equivalence tests can enhance environmental science and management. Australian and New Zealand Journal of Statistics, 41(1), 19–29.

  • McBride, G. B. (2002). Statistical methods helping and hindering environmental science and management. Journal of Agricultural, Biological, and Environmental Statistics, 7, 300–305.

  • McBride, G. B. (2005). Using statistical methods for water quality management: issues, options and solutions. New York: Wiley.

  • McBride, G. B., Loftis, J. C., & Adkins, N. C. (1993). What do significance tests really tell us about the environment? Environmental Management, 17(4), 423–432 (errata: 18, 317).

  • Newman, M. C. (2008). “What exactly are you inferring?” A closer look at hypothesis testing. Environmental Toxicology and Chemistry, 27(5), 1013–1019.

  • Platt, J. R. (1964). Strong inference. Science, 146(3642), 347–353.

  • Quinn, J. M., Davies-Colley, R. J., Hickey, C. W., Vickers, M. L., & Ryan, P. A. (1992). Effects of clay discharges in streams: 2. Benthic invertebrates. Hydrobiologia, 248, 235–247.

  • Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57(5), 416–428.

  • Schervish, M. J. (1996). P values: what they are and what they are not. American Statistician, 50(3), 203–206.

  • Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15, 657–680.

  • Schuirmann, D. J. (1989). Confidence intervals for the ratio of two means from a crossover study. In Proceedings of the Biopharmaceutical Section (pp. 121–126). Alexandria: American Statistical Association.

  • Schuirmann, D. J. (1996). Comment on ‘Bioequivalence trials, intersection–union tests and equivalence confidence sets’ by R. L. Berger & J. C. Hsu. Statistical Science, 11(4), 312–313.

  • Smithson, M. (2000). Statistics with confidence. London: Sage.

  • Tukey, J. W. (1960). Conclusions vs. decisions. Technometrics, 2, 423–433.

  • Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.

  • Wellek, S. (2003). Testing statistical hypotheses of equivalence. Boca Raton: Chapman and Hall/CRC.

  • Westlake, W. J. (1976). Symmetric confidence intervals for bioequivalence trials. Biometrics, 32, 741–744.

  • Westlake, W. J. (1981). Response to T. B. L. Kirkwood: bioequivalence testing—a need to rethink. Biometrics, 37, 589–594.

  • Zar, J. H. (1984). Biostatistical analysis (2nd ed.). Englewood Cliffs: Prentice-Hall.

Acknowledgments

We thank our colleagues Jen Drummond, James Sukias, Rob Goudey and Mark Meleason for constructive comments. John Quinn provided the data used in the example applications. This work was funded by the New Zealand Ministry of Science and Innovation (contract C09X1003: Integrated Valuation and Monitoring Framework for Improved Freshwater Outcomes). Sadly, our second author, Dr. Russell Cole, passed away during the processing of our submission.

Author information

Correspondence to Graham McBride.

Electronic supplementary material

Below are the links to the electronic supplementary material.

ESM 1

(DOCX 37.8 kb)

ESM 2

(XLSX 18.7 kb)

Appendices

Appendix 1. Nil hypothesis test result depends on sample size

Consider a survey designed to examine the effect of a marine protected area (MPA) on populations of a fish, using normal-theory statistical methods (for simplicity, ignoring any influence of autocorrelation in the data). After collecting n = 50 samples from each area, the mean fish weight per transect inside the MPA is found to be \( \overline{X}_1 = 6.1\ \mathrm{kg} \), outside it is \( \overline{X}_2 = 5.1\ \mathrm{kg} \), and the pooled standard deviation of fish weights inside and outside the MPA is \( S_p = 4.0\ \mathrm{kg} \). To test the nil hypothesis that there is no difference whatsoever between the means of these two populations (i.e. within the MPA and beyond it), we calculate the test statistic \( T = \left|\frac{6.1 - 5.1}{4.0}\right|\sqrt{\frac{50}{2}} = 0.250 \times 5 = 1.250 \). Now consider what T might be for a larger dataset. Because the sample means and variances are unbiased and consistent estimators of their true population values, their values for larger sample sizes would, on average, differ only slightly. Say we have 500 samples at each site and obtain \( \overline{X}_1 = 6.2\ \mathrm{kg} \), \( \overline{X}_2 = 5.3\ \mathrm{kg} \) and \( S_p = 4.2\ \mathrm{kg} \). Then \( T = \left|\frac{6.2 - 5.3}{4.2}\right|\sqrt{\frac{500}{2}} = 0.214 \times \sqrt{250} = 3.388 \). Finally, consider taking 700 samples, obtaining \( \overline{X}_1 = 6.0\ \mathrm{kg} \), \( \overline{X}_2 = 5.1\ \mathrm{kg} \) and \( S_p = 4.3\ \mathrm{kg} \), in which case \( T = \left|\frac{6.0 - 5.1}{4.3}\right|\sqrt{\frac{700}{2}} = 0.209 \times \sqrt{350} = 3.916 \). The P values for these three situations (obtained, for example, using Excel’s ‘T.DIST.2T’ function) are 0.214, 0.00073 and 0.000094, respectively. At the 5% significance level, only the last two results would be statistically significant. Collecting more samples has thus led to smaller P values, even though the measured effect size happens to have decreased slightly (from 0.250 to 0.214 to 0.209). Moreover, this pattern will usually occur when a nil hypothesis is tested: P will decrease as n increases (rare exceptions arise only when the estimated effect size decreases with larger sample size by a factor greater than the square root of the proportional increase in n).
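
These calculations are easily reproduced in code. The following is a minimal Python sketch, not part of the original analysis: it assumes pooled two-sample t statistics with equal sample sizes per group and SciPy, and the function name nil_test is ours, for illustration only.

```python
# Minimal sketch of the Appendix 1 calculations (illustrative only).
from math import sqrt

from scipy import stats


def nil_test(mean1, mean2, s_pooled, n):
    """Two-sided nil hypothesis test, n samples per group (df = 2n - 2)."""
    t = abs(mean1 - mean2) / s_pooled * sqrt(n / 2.0)
    p = 2.0 * stats.t.sf(t, df=2 * n - 2)  # two-tailed, like Excel's T.DIST.2T
    return t, p


for mean1, mean2, s_p, n in [(6.1, 5.1, 4.0, 50),
                             (6.2, 5.3, 4.2, 500),
                             (6.0, 5.1, 4.3, 700)]:
    t, p = nil_test(mean1, mean2, s_p, n)
    print(f"n = {n:3d}: T = {t:.3f}, P = {p:.2g}")
# P falls from about 0.21 to 0.00073 to 0.000094 as n grows,
# even though the standardised effect size slightly decreases.
```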

We think a more meaningful hypothesis to test would concern whether a difference of, say, 0.5 kg in mean fish weight is large enough to be of environmental significance, so that the equivalence interval limits are ±0.5 kg.

Appendix 2. Procedures are all performed at level α (not α/2)

Under the three-valued logic procedure, the error risks concern erroneously inferring the direction-of-change in one direction or the other, were such an inference to be made. (The third possible inference—that there are too few data to detect the direction-of-change—is not an error; Jones and Tukey 2000.) In considering the probability of one of these two errors occurring, a critical departure arises from the two-sided nil hypothesis test’s decision rule, which is derived by minimising the risk of committing a type I error (falsely rejecting a true hypothesis; see Note 2). That rule states that the hypothesis should be rejected if the test statistic cuts off an area of no more than α/2 in either tail of the t distribution—equivalent to examining whether the measured difference in sample means is contained within a 100(1 − α)% confidence interval. In the three-valued logic of the two one-sided tests (TOST) approach, the decision rule for each of the two one-sided tests rests on whether the test statistic cuts off an area of no more than α (not α/2) in the appropriate tail of the t distribution. This is equivalent to couching the rule in terms of a 100(1 − 2α)% confidence interval. At first, this may be surprising. Indeed, in the context of bioequivalence tests, Berger and Hsu (1996) noted: ‘The fact that the TOST seemingly corresponds to a 100(1 − 2α)%, not 100(1 − α)%, confidence interval procedure initially caused some concern (Westlake 1976, 1981) … but many authors (e.g. Chow and Shao 1990, and Schuirmann 1989) have defined bioequivalence tests in terms of 100(1 − α)% confidence sets’. Moreover, Berger and Hsu (1996), building on earlier material by Berger (1982), present a theorem showing that if each of the two individual TOST tests is performed at level α, the overall test has that same level. Its proof rests on ‘intersection–union’ test (IUT) theory, of which TOST is a simple example. Importantly, Schuirmann (1996) noted that this result rests on the requirement that the 100(1 − 2α)% confidence interval be equi-tailed, i.e. symmetrical about the estimated difference in means, which is the approach adopted herein.
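
To make the decision rule concrete, here is a minimal Python sketch of the TOST procedure. It is our own illustration, not the authors’ implementation: it assumes normal data, equal sample sizes per group, a pooled standard deviation and SciPy, and the names tost_equivalence and delta are ours.

```python
# Minimal sketch of the two one-sided tests (TOST) at level alpha.
from math import sqrt

from scipy import stats


def tost_equivalence(mean1, mean2, s_pooled, n, delta, alpha=0.05):
    """Return (equivalent?, CI): is the difference in means inside
    the equivalence interval (-delta, +delta)?"""
    se = s_pooled * sqrt(2.0 / n)   # standard error of the difference
    df = 2 * n - 2                  # pooled-variance degrees of freedom
    d = mean1 - mean2

    # One-sided test 1: H0: d <= -delta, rejected for large (d + delta)/se.
    p_lower = stats.t.sf((d + delta) / se, df)
    # One-sided test 2: H0: d >= +delta, rejected for small (d - delta)/se.
    p_upper = stats.t.cdf((d - delta) / se, df)

    # IUT: equivalence is declared only if BOTH one-sided tests reject at
    # level alpha -- equivalent to the equi-tailed 100(1 - 2*alpha)%
    # confidence interval lying entirely inside (-delta, +delta).
    half_width = stats.t.ppf(1.0 - alpha, df) * se
    ci = (d - half_width, d + half_width)
    return max(p_lower, p_upper) < alpha, ci


# Fish example from Appendix 1 with equivalence limits of +/-0.5 kg.
print(tost_equivalence(6.0, 5.1, 4.3, 700, delta=0.5))
```

Applied to the n = 700 fish data with the ±0.5 kg equivalence limits, the 100(1 − 2α)% = 90% confidence interval is roughly (0.52, 1.28) kg, lying wholly beyond +0.5 kg, so equivalence is (rightly) not declared.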


About this article

Cite this article

McBride, G., Cole, R. G., Westbrooke, I. et al. Assessing environmentally significant effects: a better strength-of-evidence than a single P value? Environ Monit Assess 186, 2729–2740 (2014). https://doi.org/10.1007/s10661-013-3574-8
