Abstract
Scientists often adjust their significance threshold (alpha level) during null hypothesis significance testing in order to take into account multiple testing and multiple comparisons. This alpha adjustment has become particularly relevant in the context of the replication crisis in science. The present article considers the conditions in which this alpha adjustment is appropriate and the conditions in which it is inappropriate. A distinction is drawn between three types of multiple testing: disjunction testing, conjunction testing, and individual testing. It is argued that alpha adjustment is only appropriate in the case of disjunction testing, in which at least one test result must be significant in order to reject the associated joint null hypothesis. Alpha adjustment is inappropriate in the case of conjunction testing, in which all relevant results must be significant in order to reject the joint null hypothesis. Alpha adjustment is also inappropriate in the case of individual testing, in which each individual result must be significant in order to reject each associated individual null hypothesis. The conditions under which each of these three types of multiple testing is warranted are examined. It is concluded that researchers should not automatically (mindlessly) assume that alpha adjustment is necessary during multiple testing. Illustrations are provided in relation to joint studywise hypotheses and joint multiway ANOVAwise hypotheses.
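To make the logic of alpha adjustment during disjunction testing concrete, the familywise error rate can be written out explicitly. This is the standard calculation (cf. Šidák, 1967) rather than a formula from the abstract itself, and it assumes k independent constituent tests, each conducted at the αConstituent level:

αJoint = 1 − (1 − αConstituent)^k

For example, five independent tests at αConstituent = .05 yield αJoint = 1 − .95^5 ≈ .226. Holding αJoint at .05 instead requires the adjusted constituent level αConstituent = 1 − (1 − αJoint)^(1/k) ≈ .0102 for k = 5, of which the Bonferroni adjustment αConstituent = αJoint/k = .01 is a close, conservative approximation.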
Availability of data and materials
There is no data associated with this article.
Code availability
There is no code associated with this article.
Notes
In the Neyman–Pearson approach, some researchers may consider alpha size tests rather than alpha level tests (Casella & Berger, 2002). However, alpha size tests are difficult to construct in the case of disjunction and conjunction testing (Casella & Berger, 2002, p. 385). Consequently, I refer to alpha level tests here.
The researchers could also collapse the green and red jelly bean conditions together and compare jelly beans versus the control (sugar pill) group, this time on two measures of acne (e.g., inflammatory and noninflammatory). In this case, the researchers would be undertaking two tests of the same null hypothesis using two different outcome variables or endpoints. To keep things simple, I refer to the multiple comparisons example throughout this article. However, my arguments are equally applicable to the multiple endpoints situation.
The familywise error rate assumes that test results are independent. As Greenland (2020, p. 17) explained, the term independence is used to refer to several different concepts. In particular, he distinguished between logical and statistical independence. Logical independence refers to the mathematical independence of parameter values such that variation in one value is not logically dependent on variation in another. Logical independence may be demonstrated via the mathematics of a model. Statistical independence refers to independence among variables, estimators, standard errors, and tests, and it may be achieved via study design (e.g., randomisation). A weak form of statistical independence is uncorrelatedness, which assumes that there is no linear association between the variables (e.g., no positive correlation). As Greenland noted, “uncorrelatedness and hence statistical independence are rarely satisfied in nonexperimental studies.” Although this may be the case, two points allow a qualified interpretation of the familywise error rate under the assumption of independence. First, when interpreting the results of a disjunction test, researchers may adopt a counterfactual interpretation that (a) the joint null hypothesis is true and (b) all of the associated test assumptions are true, including the assumption of independence. Second, researchers may complement this qualified interpretation with an acknowledgment that, if the constituent test results were positively dependent, then the actual familywise error rate would be less than the nominal familywise error rate, because a family of dependent tests provides less opportunity to incorrectly reject the joint null hypothesis than a family of independent tests (e.g., Weber, 2007, p. 284). Hence, although the assumption of independence may not be met in reality, researchers may nonetheless interpret the familywise error rate as indicating a worst-case scenario that assumes that the constituent test results are independent.
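The second point above can be checked directly with a short simulation. The following Python sketch is my illustration rather than anything from the original note; the choice of k = 5 tests, a .05 alpha level, and a .6 pairwise correlation is arbitrary. It estimates the familywise error rate under a true joint null for an independent family and for a positively dependent family, and the dependent family's rate comes out below the nominal 1 − .95^5 ≈ .226.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, alpha, n_sims = 5, 0.05, 200_000
z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value

def familywise_error_rate(rho):
    # Estimated probability of at least one significant result among k tests
    # whose z statistics share pairwise correlation rho, with all nulls true.
    cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)
    z = rng.multivariate_normal(np.zeros(k), cov, size=n_sims)
    return np.mean((np.abs(z) > z_crit).any(axis=1))

print(familywise_error_rate(0.0))  # close to the nominal .226
print(familywise_error_rate(0.6))  # smaller: dependence lowers the actual rate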
Some commentators have argued that conjunction testing decreases the Type I error rate and therefore warrants a corresponding increase in the αConstituent level above the αJoint level (e.g., Capizzi & Zhang, 1996; Massaro, 2009; Weber, 2007). This argument is based on the assumption that the Type I error rate for k independent tests is the product of the Type I error rate for each test (i.e., α^k). Hence, for example, the probability of obtaining two independent false positive results at the .05 alpha level is only .0025. However, during conjunction testing, all of the tests are required to be significant in order to reject the joint null hypothesis. Consequently, when undertaking conjunction testing, the alpha level for each of the constituent null hypotheses (αConstituent) cannot be higher than the alpha level for the joint null hypothesis (αJoint; Berger, 1982; Julious & McIntyre, 2012; Kordzakhia et al., 2010).
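To spell out the intersection–union logic behind this note (a standard argument, e.g., Berger, 1982, rather than text from the article itself): if all k constituent nulls are true and the tests are independent, then the probability that all k tests are falsely significant is (αConstituent)^k, which is where the .05^2 = .0025 figure comes from. However, the joint null hypothesis in conjunction testing is that at least one constituent null is true. In the least favorable configuration, one constituent null is true and the remaining effects are so large that their tests are almost certainly significant, so the probability of rejecting the joint null approaches αConstituent. The size of the conjunction test is therefore αConstituent itself, not (αConstituent)^k, which is why αConstituent cannot be raised above αJoint.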
Tukey (1953), who was a pioneer in the area of multiple testing, described this individual testing error rate as the per determination error rate (i.e., αIndividual). This error rate should not be confused with the per comparison error rate (i.e., αConstituent). Both error rates use unadjusted alpha levels. However, the per determination error rate is used in the context of the individual testing of an individual null hypothesis, whereas the per comparison error rate is used in the context of the disjunction testing of a joint null hypothesis. Tukey (p. 90) was firmly against the use of the per comparison error rate. However, he believed that the per determination error rate was “entirely appropriate” (p. 82) for some research questions (i.e., individual testing; see also Hochberg & Tamhane, 1987, p. 6). For example, he argued that a per determination rate was suitable when diagnosing potentially diabetic patients based on their blood sugar levels. As Tukey (1953, p. 82) explained:
the doctor’s action on John Jones would not depend on the other 19 determinations made at the same time by the same technician or on the other 47 determinations on samples from patients in Smithville. Each determination is an individual matter, and it is appropriate to set error rates accordingly.
Selection bias remains problematic during individual testing because it involves the suppression of hypotheses after the results are known, or SHARKing (Rubin, 2017d). SHARKing is problematic when suppressed falsifications are theoretically (as opposed to statistically) relevant to the research conclusions. For example, in the jelly bean study, it is theoretically informative to know not only that green jelly beans appear to cause acne but also that non-green jelly beans do not appear to cause acne.
Studywise and multiway ANOVAwise error rates are not the only types of error rates that have caused confusion in the area of multiple testing. Other examples include datasetwise error rates (in which the family includes all hypotheses that are tested using a specific dataset; Bennett et al., 2009, p. 417; Thompson et al., 2020), careerwise error rates (in which the family includes all hypotheses that are tested by a specific researcher during their career; O’Keefe, 2003; Stewart-Oaten, 1995), and fieldwise error rates (in which the family includes all hypotheses that are tested in a specific field). A key argument in the current article is that researchers do not usually make decisions about data sets, researchers, and fields. Instead, they make decisions about hypotheses.
Multiple testing corrections may be necessary in multiway ANOVAs when a factor contains more than two levels and multiple comparisons are conducted between those levels in order to test a joint intersection null hypothesis (Benjamini & Bogomolov, 2011; Yekutieli et al., 2006). However, in this case, familywise error rates are limited to the comparisons that are made within factors. Familywise error is not computed across all factors in the ANOVA.
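As a concrete sketch of this within-factor scope (my illustration; the data, group labels, and the Bonferroni choice are hypothetical stand-ins for whatever design and adjustment method a researcher actually uses), the three pairwise comparisons within a single three-level factor form the family, regardless of how many other factors the ANOVA contains:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: one three-level "colour" factor drawn from a larger
# multiway design. Per the note above, the family for the familywise
# adjustment is the set of pairwise comparisons within this factor only.
groups = {
    "green": rng.normal(10.0, 2.0, 30),
    "red": rng.normal(10.5, 2.0, 30),
    "purple": rng.normal(9.8, 2.0, 30),
}

names = list(groups)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
alpha_joint = 0.05
alpha_constituent = alpha_joint / len(pairs)  # Bonferroni across 3 comparisons

for a, b in pairs:
    t_stat, p = stats.ttest_ind(groups[a], groups[b])
    print(f"{a} vs {b}: p = {p:.3f}, significant = {p < alpha_constituent}")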
References
An, Q., Xu, D., & Brooks, G. P. (2013). Type I error rates and power of multiple hypothesis testing procedures in factorial ANOVA. Multiple Linear Regression Viewpoints, 39, 1–16.
Armstrong, R. A. (2014). When to use the Bonferroni correction. Ophthalmic and Physiological Optics, 34, 502–508. https://doi.org/10.1111/opo.12131
Bender, R., & Lange, S. (2001). Adjusting for multiple testing—When and how? Journal of Clinical Epidemiology, 54, 343–349. https://doi.org/10.1016/S0895-4356(00)00314-0
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., & Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z
Benjamini, Y., & Bogomolov, M. (2011). Adjusting for selection bias in testing multiple families of hypotheses. https://arxiv.org/abs/1106.3670
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: An argument for proper multiple comparisons correction. Journal of Serendipitous and Unexpected Results, 1(1), 1–5. https://teenspecies.github.io/pdfs/NeuralCorrelates.pdf
Bennett, C. M., Wolford, G. L., & Miller, M. B. (2009). The principled control of false positives in neuroimaging. Social Cognitive and Affective Neuroscience, 4, 417–422. https://doi.org/10.1093/scan/nsp053
Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24, 295–300. https://doi.org/10.2307/1267823
Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests, and equivalence confidence sets. Statistical Science, 11, 283–319. https://doi.org/10.1214/ss/1032280304
Bretz, F., Hothorn, T., & Westfall, P. (2011). Multiple comparisons using R. CRC Press.
Capizzi, T., & Zhang, J. I. (1996). Testing the hypothesis that matters for multiple primary endpoints. Drug Information Journal, 30, 949–956. https://doi.org/10.1177/009286159603000410
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Duxbury.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312. https://doi.org/10.1037/0003-066X.45.12.1304
Cook, R. J., & Farewell, V. T. (1996). Multiplicity considerations in the design and analysis of clinical trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159, 93–110. https://doi.org/10.2307/2983471
Cox, D. R. (1965). A remark on multiple comparison methods. Technometrics, 7, 223–224. https://doi.org/10.1080/00401706.1965.10490250
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., Waldorp, L. J., & Wagenmakers, E. J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23, 640–647. https://doi.org/10.3758/s13423-015-0913-5
De Groot, A. D. (2014). The meaning of “significance” for different types of research. Translated and annotated by Wagenmakers, E. J., Borsboom, D., Verhagen, J., Kievit, R., Bakker, M., Cramer, A., … van der Maas, H. L. J. Acta Psychologica, 148, 188–194. https://doi.org/10.1016/j.actpsy.2014.02.001
Dennis, B., Ponciano, J. M., Taper, M. L., & Lele, S. R. (2019). Errors in statistical inference under model misspecification: Evidence, hypothesis testing, and AIC. Frontiers in Ecology and Evolution, 7, 372. https://doi.org/10.3389/fevo.2019.00372
Dmitrienko, A., Bretz, F., Westfall, P. H., Troendle, J., Wiens, B. L., Tamhane, A. C., & Hsu, J. C. (2009). Multiple testing methodology. In A. Dmitrienko, A. C. Tamhane, & F. Bretz (Eds.), Multiple testing problems in pharmaceutical statistics (pp. 35–98). Chapman & Hall.
Dmitrienko, A., & D’Agostino, R. (2013). Traditional multiplicity adjustment methods in clinical trials. Statistics in Medicine, 32, 5172–5218. https://doi.org/10.1002/sim.5990
Drachman, D. (2012). Adjusting for multiple comparisons. Journal of Clinical Research Best Practice, 8, 1–3.
Dudoit, S., & Van Der Laan, M. J. (2008). Multiple testing procedures with applications to genomics. Springer.
Efron, B. (2008). Simultaneous inference: When should hypothesis testing problems be combined? The Annals of Applied Statistics, 2, 197–223. https://doi.org/10.1214/07-AOAS141
Feise, R. J. (2002). Do multiple outcome measures require p-value adjustment? BMC Medical Research Methodology, 2, 8. https://doi.org/10.1186/1471-2288-2-8
Fisher, R. A. (1971). The design of experiments (9th ed.). Hafner Press.
Forstmeier, W., Wagenmakers, E. J., & Parker, T. H. (2017). Detecting and avoiding likely false-positive findings—A practical guide. Biological Reviews, 92, 1941–1968. https://doi.org/10.1111/brv.12315
Francis, G., & Thunell, E. (2021). Reversing Bonferroni. Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-020-01855-z
Frane, A. V. (2015). Planned hypothesis tests are not necessarily exempt from multiplicity adjustment. Journal of Research Practice, 11(1), Article P2.
Glickman, M. E., Rao, S. R., & Schultz, M. R. (2014). False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies. Journal of Clinical Epidemiology, 67, 850–857. https://doi.org/10.1016/j.jclinepi.2014.03.012
Goeman, J. J., & Solari, A. (2014). Multiple hypothesis testing in genomics. Statistics in Medicine, 33, 1946–1978. https://doi.org/10.1002/sim.6082
Goodman, S. N., Fanelli, D., & Ioannidis, J. P. (2016). What does research reproducibility mean? Science Translational Medicine, 8, 341ps12. https://doi.org/10.1126/scitranslmed.aaf5027
Greenland, S. (2020). Analysis goals, error-cost sensitivity, and analysis hacking: Essential considerations in hypothesis testing and multiple comparisons. Paediatric and Perinatal Epidemiology, 35, 8–23. https://doi.org/10.1111/ppe.12711
Haig, B. D. (2009). Inference to the best explanation: A neglected approach to theory appraisal in psychology. The American Journal of Psychology, 122(2), 219–234. http://www.jstor.org/stable/27784393
Hewes, D. E. (2003). Methods as tools. Human Communication Research, 29, 448–454. https://doi.org/10.1111/j.1468-2958.2003.tb00847.x
Hochberg, Y., & Tamhane, A. C. (1987). Multiple comparison procedures. Wiley.
Hsu, J. (1996). Multiple comparisons: Theory and methods. CRC Press.
Huberty, C. J., & Morris, J. D. (1988). A single contrast test procedure. Educational and Psychological Measurement, 48, 567–578. https://doi.org/10.1177/0013164488483001
Hung, H. M. J., & Wang, S. J. (2010). Challenges to multiple testing in clinical trials. Biometrical Journal, 52, 747–756. https://doi.org/10.1002/bimj.200900206
Hurlbert, S. H., & Lombardi, C. M. (2012). Lopsided reasoning on lopsided tests and multiple comparisons. Australian & New Zealand Journal of Statistics, 54, 23–42. https://doi.org/10.1111/j.1467-842X.2012.00652.x
Jannot, A. S., Ehret, G., & Perneger, T. (2015). P < 5 × 10^−8 has emerged as a standard of statistical significance for genome-wide association studies. Journal of Clinical Epidemiology, 68, 460–465. https://doi.org/10.1016/j.jclinepi.2015.01.001
Julious, S. A., & McIntyre, N. E. (2012). Sample sizes for trials involving multiple correlated must-win comparisons. Pharmaceutical Statistics, 11, 177–185. https://doi.org/10.1002/pst.515
Kim, K., Zakharkin, S. O., Loraine, A., & Allison, D. B. (2004). Picking the most likely candidates for further development: Novel intersection-union tests for addressing multi-component hypotheses in comparative genomics. In Proceedings of the American Statistical Association, ASA Section on ENAR Spring Meeting (pp. 1396–1402). http://www.uab.edu/cngi/pdf/2004/JSM%202004%20-IUTs%20Kim%20et%20al.pdf
Klockars, A. J. (2003). Multiple comparisons texts: Their utility in guiding research practice. Journal of Clinical Child and Adolescent Psychology, 32, 613–621. https://doi.org/10.1207/S15374424JCCP3204_15
Kordzakhia, G., Siddiqui, O., & Huque, M. F. (2010). Method of balanced adjustment in testing co-primary endpoints. Statistics in Medicine, 29, 2055–2066. https://doi.org/10.1002/sim.3950
Kotzen, M. (2013). Multiple studies and evidential defeat. Noûs, 47(1), 154–180. http://www.jstor.org/stable/43828821
Kozak, M., & Powers, S. J. (2017). If not multiple comparisons, then what? Annals of Applied Biology, 171, 277–280. https://doi.org/10.1111/aab.12379
Kromrey, J. D., & Dickinson, W. B. (1995). The use of an overall F test to control Type I error rates in factorial analyses of variance: Limitations and better strategies. Journal of Applied Behavioral Science, 31, 51–64. https://doi.org/10.1177/0021886395311006
Lew, M. J. (2019). A reckless guide to p-values: Local evidence, global errors. In A. Bespalov, M. C. Michel, & T. Steckler (Eds.), Good research practice in experimental pharmacology. Springer. https://arxiv.org/abs/1910.02042
Luck, S. J., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn’t). Psychophysiology, 54, 146–157. https://doi.org/10.1111/psyp.12639
Mascha, E. J., & Turan, A. (2012). Joint hypothesis testing and gatekeeping procedures for studies with multiple endpoints. Anesthesia and Analgesia, 114, 1304–1317. https://doi.org/10.1213/ANE.0b013e3182504435
Massaro, J. (2009). Experimental design. In D. Robertson & G. H. Williams (Eds.) Clinical and translational science: Principles of human research (pp. 41–57). Academic Press. https://doi.org/10.1016/B978-0-12-373639-0.00003-0
Matsunaga, M. (2007). Familywise error in multiple comparisons: Disentangling a knot through a critique of O’Keefe’s arguments against alpha adjustment. Communication Methods and Measures, 1, 243–265. https://doi.org/10.1080/19312450701641409
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (Vol. 1, 2nd edn.). Psychology Press.
Mead, R. (1988). The design of experiments. Cambridge University Press.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834. https://doi.org/10.1037/0022-006X.46.4.806
Mei, S., Karimnezhad, A., Forest, M., Bickel, D. R., & Greenwood, C. M. (2017). The performance of a new local false discovery rate method on tests of association between coronary artery disease (CAD) and genome-wide genetic variants. PLoS ONE, 12, e0185174. https://doi.org/10.1371/journal.pone.0185174
Miller, R. G., Jr. (1981). Simultaneous statistical inference (2nd ed.). Springer.
Morgan, J. F. (2007). p value fetishism and use of the Bonferroni adjustment. Evidence-Based Mental Health, 10(2), 34–35. https://doi.org/10.1136/ebmh.10.2.34
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS ONE, 7, e32734. https://doi.org/10.1371/journal.pone.0032734
Mudge, J. F., Martyniuk, C. J., & Houlahan, J. E. (2017). Optimal alpha reduces error rates in gene expression studies: A meta-analysis approach. BMC Bioinformatics, 18, 312. https://doi.org/10.1186/s12859-017-1728-3
Munroe, R. (2011). Significant. https://xkcd.com/882/
Neuhäuser, M. (2006). How to deal with multiple endpoints in clinical trials. Fundamental & Clinical Pharmacology, 20, 515–523. https://doi.org/10.1111/j.1472-8206.2006.00437.x
Nichols, T., Brett, M., Andersson, J., Wager, T., & Poline, J. B. (2005). Valid conjunction inference with the minimum statistic. NeuroImage, 25, 653–660. https://doi.org/10.1016/j.neuroimage.2004.12.005
Nosek, B. A., Beck, E. D., Campbell, L., Flake, J. K., Hardwicke, T. E., Mellor, D. T., van’t Veer, A. E., & Vazire, S. (2019). Preregistration is hard, and worthwhile. Trends in Cognitive Sciences, 23(10), 815–818. https://doi.org/10.1016/j.tics.2019.07.009
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115, 2600–2606. https://doi.org/10.1073/pnas.1708274114
Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45, 137–141. https://doi.org/10.1027/1864-9335/a000192
O’Keefe, D. J. (2003). Colloquy: Should familywise alpha be adjusted? Human Communication Research, 29, 431–447. https://doi.org/10.1111/j.1468-2958.2003.tb00846.x
Otani, T., Noma, H., Nishino, J., & Matsui, S. (2018). Re-assessment of multiple testing strategies for more efficient genome-wide association studies. European Journal of Human Genetics, 26, 1038–1048. https://doi.org/10.1038/s41431-018-0125-3
Pan, Q. (2013). Multiple hypotheses testing procedures in clinical trials and genomic studies. Frontiers in Public Health, 1, 63. https://doi.org/10.3389/fpubh.2013.00063
Panagiotou, O. A., Ioannidis, J. P., & Genome-Wide Significance Project. (2011). What should the genome-wide significance threshold be? Empirical replication of borderline genetic associations. International Journal of Epidemiology, 41, 273–286. https://doi.org/10.1093/ije/dyr178
Parker, R. A., & Weir, C. J. (2020). Non-adjustment for multiple testing in multi-arm trials of distinct treatments: Rationale and justification. Clinical Trials, 17(5), 562–566. https://doi.org/10.1177/1740774520941419
Perneger, T. V. (1998). What’s wrong with Bonferroni adjustments. British Medical Journal, 316, 1236–1238. https://doi.org/10.1136/bmj.316.7139.1236
Proschan, M. A., & Waclawiw, M. A. (2000). Practical guidelines for multiplicity adjustment in clinical trials. Controlled Clinical Trials, 21, 527–539. https://doi.org/10.1016/S0197-2456(00)00106-9
Rodriguez, M. (1997). Non-factorial ANOVA: Test only substantive and interpretable hypotheses. Paper presented at the Annual Meeting of the Southwest Educational Research Association, Austin, Texas, USA. http://files.eric.ed.gov/fulltext/ED406444.pdf
Rosset, S., Heller, R., Painsky, A., & Aharoni, E. (2018). Optimal procedures for multiple testing problems. https://arxiv.org/abs/1804.10256
Rothman, K. J. (1990). No adjustments are needed for multiple comparisons. Epidemiology, 1, 43–46. https://www.jstor.org/stable/20065622
Rothman, K. J., Greenland, S., & Lash, T. L. (2008). Modern epidemiology (3rd ed.). Lippincott Williams & Wilkins.
Roy, S. N. (1953). On a heuristic method of test construction and its use in multivariate analysis. The Annals of Mathematical Statistics, 24, 220–238. https://doi.org/10.1214/aoms/1177729029
Rubin, M. (2017a). An evaluation of four solutions to the forking paths problem: Adjusted alpha, preregistration, sensitivity analyses, and abandoning the Neyman–Pearson approach. Review of General Psychology, 21, 321–329. https://doi.org/10.1037/gpr0000135
Rubin, M. (2017b). Do p values lose their meaning in exploratory analyses? It depends how you define the familywise error rate. Review of General Psychology, 21, 269–275. https://doi.org/10.1037/gpr0000123
Rubin, M. (2017c). The implications of significance testing based on hypothesiswise and studywise error. PsycArXiv. https://doi.org/10.17605/OSF.IO/7YFRV
Rubin, M. (2017d). When does HARKing hurt? Identifying when different types of undisclosed post hoc hypothesizing harm scientific progress. Review of General Psychology, 21, 308–320. https://doi.org/10.1037/gpr0000128
Rubin, M. (2020). Does preregistration improve the credibility of research findings? The Quantitative Methods for Psychology, 16(4), 376–390. https://doi.org/10.20982/tqmp.16.4.p376
Rubin, M. (2021). What type of Type I error? Contrasting the Neyman–Pearson and Fisherian approaches in the context of exact and direct replications. Synthese, 198, 5809–5834. https://doi.org/10.1007/s11229-019-02433-0
Rubin, M. (2022). The costs of HARKing. British Journal for the Philosophy of Science. https://doi.org/10.1093/bjps/axz050
Ryan, T. A. (1962). The experiment as the unit for computing rates of error. Psychological Bulletin, 59, 301–305. https://doi.org/10.1037/h0040562
Sainani, K. L. (2009). The problem of multiple testing. PM&R, 1, 1098–1103. https://doi.org/10.1016/j.pmrj.2009.10.004
Savitz, D. A., & Olshan, A. F. (1995). Multiple comparisons and related issues in the interpretation of epidemiologic data. American Journal of Epidemiology, 142, 904–908. https://doi.org/10.1093/oxfordjournals.aje.a117737
Schochet, P. Z. (2009). An approach for addressing the multiple testing problem in social policy impact evaluations. Evaluation Review, 33, 539–567. https://doi.org/10.1177/0193841X09350590
Schulz, K. F., & Grimes, D. A. (2005). Multiplicity in randomised trials I: Endpoints and treatments. The Lancet, 365, 1591–1595. https://doi.org/10.1016/S0140-6736(05)66461-6
Senn, S. (2007). Statistical issues in drug development (2nd ed.). Wiley.
Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46, 561–584. https://doi.org/10.1146/annurev.ps.46.020195.003021
Shaffer, J. P. (2006). Simultaneous testing. Encyclopedia of Statistical Sciences. https://doi.org/10.1002/0471667196.ess2452.pub2
Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62, 626–633. https://doi.org/10.1080/01621459.1967.10482935
Sinclair, J., Taylor, P. J., & Hobbs, S. J. (2013). Alpha level adjustments for multiple dependent variable analyses and their applicability—A review. International Journal of Sports Science and Engineering, 7, 17–20.
Stacey, A. W., Pouly, S., & Czyz, C. N. (2012). An analysis of the use of multiple comparison corrections in ophthalmology research. Investigative Ophthalmology & Visual Science, 53, 1830–1834. https://doi.org/10.1167/iovs.11-8730
Stewart-Oaten, A. (1995). Rules and judgments in statistics: Three examples. Ecology, 76, 2001–2009. https://doi.org/10.2307/1940736
Streiner, D. L. (2015). Best (but oft-forgotten) practices: The multiple problems of multiplicity—Whether and how to correct for many statistical tests. The American Journal of Clinical Nutrition, 102, 721–728. https://doi.org/10.3945/ajcn.115.113548
Thompson, W. H., Wright, J., Bissett, P. G., & Poldrack, R. A. (2020). Dataset decay and the problem of sequential analyses on open datasets. eLife, 9, e53498. https://doi.org/10.7554/eLife.53498
Tsai, J., Kasprow, W. J., & Rosenheck, R. A. (2014). Alcohol and drug use disorders among homeless veterans: Prevalence and association with supported housing outcomes. Addictive Behaviors, 39, 455–460. https://doi.org/10.1016/j.addbeh.2013.02.002
Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript, Princeton University.
Turkheimer, F. E., Aston, J. A., & Cunningham, V. J. (2004). On the logic of hypothesis testing in functional imaging. European Journal of Nuclear Medicine and Molecular Imaging, 31, 725–732. https://doi.org/10.1007/s00259-003-1387-7
Tutzauer, F. (2003). On the sensible application of familywise alpha adjustment. Human Communication Research, 29, 455–463. https://doi.org/10.1111/j.1468-2958.2003.tb00848.x
van der Zee, T. (2017). What are long-term error rates and how do you control them? The Skeptical Scientist. http://www.timvanderzee.com/long-term-error-rates-control/
Veazie, P. J. (2006). When to combine hypotheses and adjust for multiple tests. Health Services Research, 41(3), 804–818. https://doi.org/10.1111/j.1475-6773.2006.00512.x
Wang, S. J., Bretz, F., Dmitrienko, A., Hsu, J., Hung, H. J., Koch, G., Maurer, W., Offen, W., & O’Neill, R. (2015). Multiplicity in confirmatory clinical trials: A case study with discussion from a JSM panel. Statistics in Medicine, 34, 3461–3480. https://doi.org/10.1002/sim.6561
Wason, J. M., Stecher, L., & Mander, A. P. (2014). Correcting for multiple-testing in multi-arm trials: Is it necessary and is it done? Trials, 15, 364. https://doi.org/10.1186/1745-6215-15-364
Weber, R. (2007). Responses to Matsunaga: To adjust or not to adjust alpha in multiple testing: That is the question. Guidelines for alpha adjustment as response to O’Keefe’s and Matsunaga’s critiques. Communication Methods and Measures, 1, 281–289. https://doi.org/10.1080/19312450701641391
Westfall, P. H., Ho, S. Y., & Prillaman, B. A. (2001). Properties of multiple intersection-union tests for multiple endpoints in combination therapy trials. Journal of Biopharmaceutical Statistics, 11, 125–138. https://doi.org/10.1081/BIP-100107653
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. Wiley.
Wilson, W. (1962). A note on the inconsistency inherent in the necessity to perform multiple comparisons. Psychological Bulletin, 59, 296–300. https://doi.org/10.1037/h0040447
Winkler, A. M., Webster, M. A., Brooks, J. C., Tracey, I., Smith, S. M., & Nichols, T. E. (2016). Non-parametric combination and related permutation tests for neuroimaging. Human Brain Mapping, 37, 1486–1511. https://doi.org/10.1002/hbm.23115
Wu, P., Yang, Q., Wang, K., Zhou, J., Ma, J., Tang, Q., Jin, L., Xiao, W., Jiang, A., Jiang, Y., & Zhu, L. (2018). Single step genome-wide association studies based on genotyping by sequence data reveals novel loci for the litter traits of domestic pigs. Genomics, 110, 171–179. https://doi.org/10.1016/j.ygeno.2017.09.009
Yekutieli, D., Reiner-Benaim, A., Benjamini, Y., Elmer, G. I., Kafkafi, N., Letwin, N. E., & Lee, N. H. (2006). Approaches to multiplicity issues in complex research in microarray analysis. Statistica Neerlandica, 60, 414–437. https://doi.org/10.1111/j.1467-9574.2006.00343.x
Funding
No funding was received in relation to this article.
Ethics declarations
Conflict of interest
The author declares no conflict of interest.
Additional information
This article belongs to the topical collection “Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications”, edited by Sorin Bangu, Emiliano Ippoliti, and Marianna Antonutti.
Cite this article
Rubin, M. When to adjust alpha during multiple testing: a consideration of disjunction, conjunction, and individual testing. Synthese 199, 10969–11000 (2021). https://doi.org/10.1007/s11229-021-03276-4