Introduction

Owing to the dedicated work of its editorial office, the diligent work of its academic editors and peer reviewers, and the contributions of authors from around the world, the Journal of Forestry Research (JFR) has become a prominent forestry journal. With a 2020 CiteScore of 2.8, JFR ranks 40th among 142 journals in forestry, agricultural and biological sciences, while the updated 2021 tracker value increased to 3.8 (www.scopus.com; last updated 6 March 2022; accessed 17 March 2022). As the journal raises its profile among the world’s forestry journals, more submissions are expected, resulting in a decreasing percentage of manuscripts that can be accepted and published.

JFR is published by non-profit, China-based academic societies and institutions and is not subject to publishing policies that aim at maximizing economic profit (Agathokleous 2022). Hence, the journal publishes a specific maximum number of manuscripts annually, which means that no additional papers may be published, even if all are excellent, report cutting-edge scientific findings, or are game changing. For example, there were over 1600 submissions in 2021, of which only 7% were accepted for publication. With limited space, the number of papers that are desk rejected (rejected by editors without being assigned to peer reviewers) is increasing. A desk rejection decision does not always have to do with the science itself or the manuscript quality; it may simply be that the paper is not considered competitive enough among other submissions or that the journal has different publishing priorities at a given time. However, in a journal with competition for space, there are always reasons that can lead to a desk rejection, and statistics-related issues in scientific writing are among the top ones. The following is a list of issues I have encountered frequently as an associate editor and then as an associate editor-in-chief of JFR, as well as in the course of my editorial and review work for other scientific journals (Fig. 1). When such issues exist in an original manuscript, several of them are commonly observed together. However, as mentioned, these observations are based on our own experience (L.Y. is Deputy Editor-in-Chief of JFR) and the fields of expertise in which we actively engage in peer review, and they do not cover all statistical areas, such as mathematical modeling, computing systems like artificial neural networks, and machine learning. Moreover, we focus on statistics-related issues in scientific writing, not on technical aspects of the statistical procedures themselves, such as the nature of the dataset and data distribution, or code availability, i.e., the issue of making data and the programming code of data analyses publicly available.

Fig. 1 Statistics-related issues in scientific writing

Results claimed are not in line with statistical results (the issue of p values)

Note: statistical significance does not imply biological importance. Statistically non-significant results can be biologically or practically important, and vice versa.

Inference is often made regarding differences between experimental conditions when either there is no statistical support for the comparison or the statistical result is not in agreement with the conclusion. The latter is more prevalent. In this case, authors often claim ‘marginal’ differences when the p value approaches or exceeds 0.1; the worst I have observed was a p value higher than 0.2 that was considered significant. Conversely, other authors have asserted that there was no difference when the p value was approximately 0.05. The former case is the more severe. Regarding the latter, “surely God loves the 0.06 as much as the 0.05” (Rosnow and Rosenthal 1989). However, it is my view that if p values are to be used, there should be some acceptable range as a reference point. For example, numbers are commonly rounded up when the next digit is ≥ 5. I see no reason to treat a p value in the range of 0.051 − 0.054 as statistically different from a p value of 0.045 − 0.050. A p value of approximately 0.05 is enlightening in that it suggests the findings warrant further investigation. But these are my views, and journals rarely have specific guidelines regarding the use of p values. Therefore, it remains highly subjective, resting with the editor’s understanding, knowledge and, ultimately, opinion regarding what he or she finds acceptable. Nevertheless, I believe most, if not all, editors would find unacceptable a claim of significance when p values approach or exceed 0.1. If we intend to say whatever we like regardless of the statistics, why produce statistics at all? As an independent editor, I cannot force authors to replace p values with other measures or to complement them with more informative metrics, but I expect authors to reach conclusions based on p values in a logical and justified manner. Above all, we should remember that how p values are used defines what results are published and thus directs science and the progress of social and environmental development. Considering the widespread, subjective and highly personalized interpretation and use of p values, and how this can affect scientific progress (Dorey 2011; Masicampo and Lalande 2012), all biology journals, including JFR, should set precise guidelines for the interpretation and use of p values in consultation with editorial board members and statisticians.

It should be added that the use of p values in biology has long been criticized by numerous statisticians. There is a famous quote: “scientists the world over use them, but scarcely a statistician can be found to defend them. Bayesians in particular find them ridiculous, but even the modern frequentist has little time for them” (Senn 2001). Some scientists believe that the bar for statistical significance should be raised to 0.005 or 0.001 (Johnson 2013), while others call for the retirement of statistical significance and the use of confidence intervals instead (Amrhein et al. 2019). In fact, p values can be replaced by, or used together with, other more integrated indexes such as effect size estimates and their intervals (e.g., Agathokleous et al. 2016), which can lead to better informed decisions (Connor 2004; Nakagawa 2004; Muff et al. 2022a, b), while Bayesian counterparts (e.g., the Bayes factor) perform better (Goodman 2008; Johnson 2013; Wiens and Nilsson 2017). p values were never meant to be the sole criterion for attributing differences and comparing magnitudes (Lew 2012; Nuzzo 2014; Agathokleous 2022; Alexander and Davis 2022). However, I do not believe that p values will be replaced soon and, since they have become the backbone of biology, the so-called ‘gold standard’ of validity (Nuzzo 2014), they should be used correctly. More details about statistical inference and bad practices, including the problematic hybrid interpretation of statistical results between Fisher’s p values and the strict Neyman–Pearson approach, can be found in the literature (Connor 2004; Goodman 2008; Lew 2012; Nuzzo 2014; Muff et al. 2022b).

To conclude this section, we draw attention to some final points regarding the reporting of p values, should they be used. First, no p value should be reported as being equal to zero; it can be < 0.001 or < 0.0001 but never = 0.000. Second, reporting only p values without other information is of little practical use; a minimum requirement is the simultaneous reporting of the value of the test statistic (e.g., F, t, or U). For p values over 0.05, the exact value should be stated instead of writing p > 0.05. As a point of reference, a widely used publication such as the Publication Manual of the American Psychological Association (APA 2019) is enlightening and helpful.
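To make these reporting conventions concrete, the minimal sketch below (in Python, chosen only for illustration; the helper name format_p is ours and not part of any journal guideline) formats hypothetical p values so that values below 0.001 are reported as a bound and all other values are reported exactly, never as zero.

```python
# Minimal sketch of p value formatting consistent with the points above:
# never report p = 0.000, and give exact values rather than "p > 0.05".
def format_p(p: float) -> str:
    if p < 0.001:
        return "p < 0.001"          # never "p = 0.000"
    return f"p = {p:.3f}"           # exact value, including values above 0.05

for p in (0.00004, 0.0042, 0.049, 0.13):
    print(format_p(p))
```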

Issues with multiple tests or comparisons

Scientific research has become more demanding in the twenty-first century due to the increased need for multi-factorial experimental designs in some disciplines (Rillig et al. 2019, 2021). This translates into a considerable increase in statistical testing within a study. For example, ecological research is often multidimensional, including numerous variables. If one examines the associations of 15 soil quality parameters with the alpha diversity of microbial communities, the probability of detecting one or more p values smaller than 0.05 by chance alone increases from 5% to approximately 54%! And, if more than one index of alpha diversity is considered, this probability increases further. This leads to the question of how much uncertainty lies behind the results and conclusions of an array of studies. As the number of statistical tests and comparisons increases, the probability of falsely rejecting (or accepting) null hypotheses can increase, depending on how this is accounted for. But this is another issue that remains largely subjective and personalized, as journals rarely have specific guidelines on it.
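The 54% figure follows from the family-wise error rate for independent tests, 1 − (1 − α)^k. A minimal sketch in Python (assuming k independent tests with all null hypotheses true) reproduces it:

```python
# Family-wise false-positive probability for k independent tests at level alpha,
# illustrating the ~54% figure quoted above for 15 soil-quality parameters.
alpha = 0.05

def familywise_error(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one p < alpha among k independent true-null tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 15, 30):
    print(f"{k:2d} tests -> P(>=1 false positive) = {familywise_error(k, alpha):.2f}")
# 15 tests -> 0.54, consistent with the example in the text.
```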

Modifications of traditional statistical testing procedures are widely applied in a range of research fields to decrease Type I errors (i.e., rejection of a null hypothesis that is true). Perhaps the most widely used modification is the Bonferroni correction, which adjusts alpha (α) by dividing it by the number of statistical tests or comparisons, k. That is, for a study with 10 tests, the corrected α would be 0.005 (α = 0.05/10), if α was set at 0.05. The application of Bonferroni corrections reduces statistical power, greatly increasing Type II errors (i.e., acceptance of a null hypothesis that is false) and potentially contributing to a publication bias which can eventually thwart scientific advancement (Nakagawa 2004). For example, researchers who find many variables to be non-significant might simply choose to omit them from their paper, so that they are never covered by future meta-analyses, thereby contributing to a ‘file-drawer effect’ and publication bias (Nakagawa 2004; Fanelli 2010). If the accumulation of knowledge is thwarted, an entire scientific field may be suppressed (Nakagawa 2004). A further type of correction is the sequential Holm–Bonferroni method (Holm 1979), which controls the family-wise error rate while reducing statistical power to a lesser extent than the standard Bonferroni correction; however, the probability of Type II errors remains considerably high (Nakagawa 2004). Including less relevant or biologically irrelevant variables in a study unnecessarily increases the probability of Type I errors, which often results in reviewers pointing to the need for corrections such as Bonferroni (Nakagawa 2004). Based on these issues, Nakagawa proposed that “the practice of reviewers demanding Bonferroni procedures should be discouraged, (and also, researchers should play their part in carefully selecting relevant variables in their study)” (Nakagawa 2004). These are not new issues and have long been known. For example, ending the use of Bonferroni procedures and instead reporting effect sizes and/or confidence intervals for effect sizes, or alternatives, was proposed two decades ago in animal behavior and behavioral ecology research (Nakagawa 2004).
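For readers who prefer to see the two corrections side by side, the following sketch (Python, hypothetical p values, helper names of our own choosing) applies the standard Bonferroni and the sequential Holm–Bonferroni adjustments to the same set of tests:

```python
# Illustrative sketch (not a prescribed workflow): Bonferroni and Holm-Bonferroni
# adjustments applied to a set of hypothetical p values.
from typing import List

def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject H0 when p < alpha / k (k = number of tests)."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]

def holm(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Step down: compare the i-th smallest p value with alpha / (k - i)."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (k - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p values are retained
    return reject

p = [0.001, 0.009, 0.019, 0.03, 0.041, 0.2]  # hypothetical p values
print(bonferroni(p))  # rejects only p = 0.001 (threshold 0.05/6 ≈ 0.0083)
print(holm(p))        # also rejects p = 0.009; Holm is never less powerful than Bonferroni
```

In this hypothetical set, the Holm–Bonferroni procedure rejects one more null hypothesis than the standard Bonferroni correction at the same nominal α, illustrating its somewhat higher power.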

Some journals have specific guidelines about multiple testing or comparisons. An example is the Annals of Applied Biology, the journal of the Association of Applied Biologists, which has developed and put into effect more specific author guidelines regarding statistics and whose statistics editors evaluate relevant submissions (Kozak and Powers 2017; Powers and Kozak 2019; Butler 2021). This example can serve as a reference point for further development in JFR as well as in other journals. The author guidelines of the Annals of Applied Biology discourage comparisons that are not based on a biological hypothesis, stating: “In particular, the use of multiple comparison adjustments such as Duncan's or Tukey's is not acceptable, nor is the use of letters to denote treatments which are 'not significantly different from each other'.” (https://onlinelibrary.wiley.com/page/journal/17447348/homepage/forauthors.html; Accessed 19 February 2022). Instead, it has been suggested to conduct only the post hoc comparisons that are of most interest, using the value of the least significant difference (LSD) based on the relevant standard error of the difference (SED) from the analysis of variance (ANOVA) (Kozak and Powers 2017). Similarly, in unbalanced studies with an unequal number of experimental units (replicates) among experimental conditions or treatments, SED values may differ among comparisons; here too, only the post hoc comparisons of most interest should be made, and LSDs and SEDs should be reported for each comparison (Kozak and Powers 2017). Another suggestion is that, where a large number of variables exists, controlling the ‘false discovery rate’, the expected fraction of rejected null hypotheses that are actually true, may be more appropriate than controlling the probability of even one false rejection of a null hypothesis (Nakagawa 2004).
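As an illustration only (the text above does not prescribe a particular method), the Benjamini–Hochberg procedure is one common way of controlling the false discovery rate; the sketch below applies it to the same hypothetical p values used earlier:

```python
# Illustration only: Benjamini-Hochberg procedure for false discovery rate control,
# applied to hypothetical p values.
def benjamini_hochberg(p_values, q=0.05):
    """Return a reject/retain decision per p value at FDR level q."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / k:
            max_rank = rank            # largest rank satisfying the BH condition
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject

p = [0.001, 0.009, 0.019, 0.03, 0.041, 0.2]
print(benjamini_hochberg(p))  # rejects five of the six hypothetical tests,
                              # more than either correction shown above
```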

There are further options that can help with the trade-off between Type I and Type II errors. For example, the use of orthogonal or non-orthogonal linear contrasts is a good alternative, although their application, interpretation, and presentation are often complicated, difficult, or even impractical, especially in light of the current publishing policies of many journals. In fact, based on my experience as a reviewer, editor, referee, and author of literature reviews of numerous scientific papers, the use of post hoc comparisons is in most cases incorrect and problematic, while planned (a priori) comparisons should often be made instead. In highly multi-factorial studies, the number of biologically irrelevant comparisons is also high, and many of them provide little or no useful information. This may be illustrated by a hypothetical example. A researcher studies the effect of various doses of the antibiotic tetracycline (0, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10,000 μmol L−1) on saplings of a poplar clone grown in either charcoal-filtered air (i.e., air pollutants are eliminated; clean atmosphere) or ozone-enriched air (polluted atmosphere). Many of the possible comparisons are biologically irrelevant. For example, it is irrational to compare the effects of 0.001 μmol tetracycline L−1 on plants raised in charcoal-filtered air with the effects of concentrations of 0.01 − 10,000 μmol tetracycline L−1 on plants in ozone-enriched air. Researchers should strive for planned comparisons wherever possible (Ruxton and Beauchamp 2008; see also Wiens and Nilsson 2017). If reviewers criticize the use of correctly applied a priori comparisons, it is important to address their comments and justify why the a priori comparisons are correct and should be retained. In a paper my colleagues and I published six years ago, contrasts were used to examine the most biologically relevant questions/comparisons (Agathokleous et al. 2016). There were three reviewers and, while all endorsed the work, each had some comments on the statistics and/or the way the results were presented; had post hoc comparisons among all means been done, the reviewers would have been satisfied. In fact, one of the issues raised was that the use of different specific questions, and thus contrasts, made the interpretation of figures and results more difficult and dictated repeated returns to the questions/contrasts. The reviewers’ comments were helpful in thoroughly revising the manuscript, including completely changing the presentation of the results and the display elements. However, this is an example where a major revision would have been a minor one had post hoc comparisons been used. It could also have been a rejection if there had been other critical deficiencies in the paper or if one or more reviewers had recommended rejection and the handling editor had been unqualified.
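To make the scale of the problem concrete, the back-of-the-envelope sketch below counts all possible pairwise comparisons in this hypothetical 9-dose × 2-atmosphere design; the numbers are purely illustrative:

```python
# Back-of-the-envelope arithmetic for the hypothetical tetracycline x ozone example:
# how many pairwise comparisons exist versus how few planned contrasts may be needed.
from math import comb

doses = 9          # 0 to 10,000 umol/L tetracycline
air_types = 2      # charcoal-filtered vs ozone-enriched air
groups = doses * air_types

all_pairwise = comb(groups, 2)
print(f"All pairwise comparisons among {groups} groups: {all_pairwise}")  # 153

# A handful of planned contrasts (e.g., dose response within each atmosphere and
# filtered vs ozone-enriched air at matched doses) might cover the questions of
# real biological interest -- on the order of 10-20 comparisons, not 153.
```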

Incorrect claims of sizes of differences

As noted previously, p values alone do not indicate the size of differences among experimental conditions (Agathokleous 2022). For example, if the p values of the effects of treatments A and B compared with control C were 0.011 and 0.002, no inference should be made that treatment B had a larger effect than treatment A, yet such claims frequently occur in manuscripts submitted to journals. An inference that may be made in this case is that, if treatments A and B had no real effect, a difference from the controls of equal or larger magnitude would be observed in 1.1% and 0.2% of study repetitions, respectively, due to random error. In another example, the null hypothesis is rejected for the effects of liquid chemical treatments D and E on the mycorrhizal colonization of roots of pine seedlings grown in a cambisol soil, and the arithmetic means of treatments D and E were 50% and 10% greater than the arithmetic mean of the water-treated control. The speculation that “chemical treatments D and E significantly increased mycorrhizal colonization, and chemical D had a more pronounced effect” is inappropriate and misleading. The point is that p values say nothing about the magnitude of the effects or differences among experimental conditions. They only indicate the probability of a finding similar to or more extreme than the one obtained in the study, given that the null hypothesis is true and the assumptions underlying the analysis hold to some extent (Butler 2021). A practice I often observe in manuscript submissions is drawing conclusions about the size of effects based only on p values or even differences in arithmetic means, such as when denoting differences in treatment effects or ranking the susceptibility/tolerance of different organisms or groups of organisms (Agathokleous and Saitanis 2020). Such a practice is not only harmful for the progress of science but also misleading and thus has societal implications (Agathokleous and Saitanis 2020). Whenever inference about the size of differences between experimental conditions is needed, p values are insufficient. In fact, statistical significance or non-significance does not translate into biological importance (Ziliak and McCloskey 2008; Butler 2021), but effect sizes and their improving indexes can be used to assess biological or practical importance (Agathokleous et al. 2016). There is a variety of effect size indexes, each with its own characteristics (Sullivan and Feinn 2012; Solla et al. 2018). Analysis of these indexes is beyond the scope of this paper, but various user-friendly software packages, operating online or offline, exist for the estimation of effect sizes as well as their improving indexes (e.g., Lenhard and Lenhard 2016; Agathokleous and Saitanis 2020; https://lbecker.uccs.edu/; https://goodcalculators.com/effect-size-calculator/; https://effect-size-calculator.herokuapp.com/). The availability of such computational tools makes calculation easier, even for those who might dislike making such calculations; the only task the user must do is input the required data.
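As one concrete example among the many indexes mentioned above, the sketch below computes Cohen's d (a standardized mean difference) for two hypothetical groups; the data are invented purely for illustration, and the online calculators cited above remain a convenient alternative:

```python
# Minimal sketch of one widely used effect size index (Cohen's d for two
# independent groups); hypothetical data, not a substitute for dedicated tools.
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_sd = (((na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2)
                 / (na + nb - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

control   = [42.0, 45.1, 39.8, 44.3, 41.7]   # hypothetical colonization values (%)
treatment = [50.2, 53.4, 48.9, 55.0, 51.6]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```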

Redundant statistics

A problematic practice is conducting redundant statistics. Although it might seem surprising, this problem still appears in manuscript submissions today. For example, in one study of the single and combined effects of two factors, each with two levels, the researchers carried out a two-way analysis of variance but also conducted independent t tests between experimental conditions within each factor. As a researcher, you may want to ensure that your manuscript contains no redundant statistics. Ask yourself whether a statistical test you have already conducted can answer the questions your new statistical test is intended to answer. If the answer is yes, then you should not conduct the additional test.

A situation I have observed many times is reviewers asking authors to conduct different tests to trace more significant results (and editors passively transferring such reviewer comments onto authors). Such a practice amounts to fishing for significant results: the more statistical tests or comparisons one runs, the more significant results are likely to be found. As a basic principle, no changes to the statistics (by adding further analyses) should be made without a clear purpose, such as correcting problematic or incorrect methodology. Conducting different statistical tests also constitutes redundancy, even if not all the results are reported. As mentioned before, as long as you can justify why you did what you did, the chances that you will be asked to change your statistics are lower. Even if you are asked, it does not mean you must make changes, but doing so might enhance the chances of having your paper accepted.

Mixing up association with causation

It might be difficult to believe, but mixing up association with causation occurs frequently. Association is a relationship between two variables: a variation in the values of one variable is associated with a variation in the values of another. Association can represent causation, but in many cases it does not. If your study does not account for causation, no inference should be made to claim or imply causation. For example, you could state that “factor A was negatively associated with factor B”, but you should not state that “factor B decreased due to factor A”. If you want to claim causation based on association, you must be able to distinguish between causal and non-causal associations (Stovitz et al. 2019; Kukull 2020). Otherwise, if your study does not support causation, be careful not to state or imply causation.

Lack of sufficient information

Insufficient statistical information is among the most important factors that may determine the fate of a manuscript submission, yet it appears widely in the literature (Kramer et al. 2016). As noted, it is often about justifying what one did in the scientific process. If what you did is correct, it cannot be rejected. Even where alternatives might be advantageous, the question for an editor is whether a potential change in the statistical procedures would be beneficial (beneficial does not mean more ‘statistical significances’). What would such a change add to the scientific content of the paper? Is such a change really needed? Would such a change instead be harmful, for example by violating basic principles of statistics, fishing for significance, or favoring Type I errors over Type II errors or vice versa? These are some of the questions an editor must answer when performing evaluations or re-assessments following peer review, and they are only examples among many. The point is that if you explain adequately why you acted as you did, and perhaps why you did not do something else, you facilitate the work of editors and can prevent possibly unfair or incorrect criticism by reviewers, thus enhancing the chances of a smooth peer review process. However, if the information about the experimental design and/or data analysis is insufficient to evaluate the robustness and validity of the study and does not permit its replication, a desk rejection is very likely. Here, I draw attention to some issues I encounter frequently; those with a keen interest in more detailed explanations can refer to the guidelines of the Annals of Applied Biology (Kozak and Powers 2017; Powers and Kozak 2019; Butler 2021) or Science (https://www.science.org/content/page/science-journals-editorial-policies#statistical-analysis).

The first issues that immediately come to mind are the lack of clarification of sample sizes, experimental and statistical units, and measures of dispersion around the mean, which should be given for each type of analysis. Without this information, the validity of the study cannot be assessed and the study cannot be replicated, which are minimum requirements of scientific research. The meaning of ‘replicate’ is often unclear, or what is claimed to be a replicate is not valid. The author guidelines of the Annals of Applied Biology state that “Particular care should be taken to explain what is meant by a replicate; only biological replication from independent units can be used to assess variation within and between treatments. Authors should consult a statistician if they require assistance in making inferences from designed experiments” (https://onlinelibrary.wiley.com/page/journal/17447348/homepage/forauthors.html; Accessed 19 February 2022). Special attention should be given to the correct experimental unit, and thus to the real replicates. Real replicates and the issue of pseudoreplication have been discussed extensively in the literature (Hurlbert 1984, 2004, 2013; Hawkins 1986; Potvin and Tardif 1988; Heffner et al. 1996; Oksanen 2001; Cottenie and De Meester 2003). Numerous reviewers recommend that a paper be rejected because the study was based on pseudoreplicates and not real replicates. In some cases, authors do not identify what the replicates were. In other instances, however, a study may be acceptable and equally important even without real replicates, provided there is still statistical support. For these reasons, the experimental and statistical units should be properly identified and, where real replicates did not exist in a study, it should be clarified why the study is still valid and important. Finally, reporting arithmetic means without any measure of dispersion around the mean is unscientific; arithmetic means by themselves are of little, if any, value either biologically or statistically. Hence, these are the first issues we suggest explaining explicitly and with care.
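As a reminder of the minimum descriptive information that should accompany every reported mean, the sketch below uses hypothetical values (one per independent experimental unit) to compute the sample size, standard deviation, and standard error; which dispersion measure to report is the authors' choice, but it must be named:

```python
# Minimal sketch of the descriptive information that should accompany every mean:
# sample size (true biological replicates), a dispersion measure, and its type.
from statistics import mean, stdev

replicates = [12.4, 11.8, 13.1, 12.9, 12.2]   # hypothetical values, one per experimental unit
n = len(replicates)
m = mean(replicates)
sd = stdev(replicates)           # sample standard deviation
se = sd / n ** 0.5               # standard error of the mean

print(f"n = {n}, mean = {m:.2f}, SD = {sd:.2f}, SE = {se:.2f}")
# Report, e.g., "12.5 ± 0.2 SE (n = 5)" and state explicitly what the ± denotes.
```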

A frequently occurring issue is the lack of clarification of whether data transformation was applied. This is important information and should be made clear, especially for statistical tools that can lead to false conclusions if the data are not transformed, as is often the case with multivariate statistics.

Another recurring issue is the lack of specification of the type of statistical model applied and/or the type of effects/factors, and care should be taken to specify these. Failure to conduct a dependent-samples analysis when the experimental design requires one also occurs, and it is sometimes unclear whether a study was based on a dependent-samples design; hence, this should be clarified.

The failure to clarify which post hoc test was conducted is another well-known issue (Ruxton and Beauchamp 2008); therefore, if a post hoc test is applied, it is important to identify it. As noted in Sect. 2, specific guidelines regarding p values, α values, and multiple testing and comparisons are difficult to find. In the absence of specific guidelines, the peer review process and the acceptance of a manuscript for publication depend on academic editors. Independent academic editors, however, should remain objective and not let their personal opinion of what is correct or appropriate dominate. To help the editor and enhance your publication chances, it is important to justify why you did or did not apply an α correction, especially because the selection and use of α corrections is multi-dimensional and depends on a series of factors (Armstrong 2014).

Repeated measures analyses can provide more biological information in several cases (Powers and Kozak 2019), which is often true of research papers submitted to JFR. However, I frequently encounter (across journals) papers where repeated measures (or other dependent-samples analyses) could have been applied to provide more comprehensive biological information but were not, or where it is unclear whether they were applied.

Finally, there is no harm in clarifying whether your hypothesis testing was one- or two-tailed. Although most journals rarely request this clarification, some do (e.g., Science; https://www.science.org/content/page/science-journals-editorial-policies#statistical-analysis). A two-tailed hypothesis test is the common case; however, if a one-tailed test was used, it is important to ensure that any reported p values are the correct ones. In many cases the p values should be divided by two, because most traditional data analysis software reports results for two-tailed hypothesis testing.
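As a simple illustration of that conversion, the sketch below runs a two-sample t test on hypothetical data with SciPy and halves the two-tailed p value when the observed effect lies in the hypothesized direction; newer SciPy versions can also compute one-tailed p values directly, so this is only one way to do it:

```python
# Minimal sketch of the two-tailed -> one-tailed conversion mentioned above,
# for the one-sided alternative "treatment greater than control".
from scipy import stats

control   = [42.0, 45.1, 39.8, 44.3, 41.7]   # hypothetical data
treatment = [50.2, 53.4, 48.9, 55.0, 51.6]

t_stat, p_two_tailed = stats.ttest_ind(treatment, control)  # two-tailed by default
# Halve the p value only if the effect is in the hypothesized direction (t > 0 here).
p_one_tailed = p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2

print(f"t = {t_stat:.2f}, two-tailed p = {p_two_tailed:.4f}, one-tailed p = {p_one_tailed:.4f}")
```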

Conclusion

The purpose of this paper is not to create more questions than answers. As academic editors, we can raise authors’ awareness of these issues and thus help them select appropriate statistical tools from the earliest research stages. Authors cannot be forced to follow specific protocols, but we can provide a basis for them to consider when selecting statistical procedures. No editor would reject a paper in which the procedures used are justified simply because his or her opinion differs. Justifying the procedures followed shows awareness of the issues and permits a proper evaluation of the study and the paper itself. We believe any editor would appreciate a careful selection of tests or comparisons that considers how Type I and Type II errors are affected.

It should be mentioned that this editorial should not be interpreted as suggesting that authors simply satisfy the requirements of editors and journals, although publishing is often about compromise. Authors obtain funding, conduct research, and write up their results. This effort is often an outcome of government support (i.e., taxpayer funding), and authors should always bear in mind that the best choice is the one that contributes to cumulative knowledge and society overall, not one that facilitates the profit agenda of a publisher. You are free to follow or not to follow any editor’s or journal’s guidelines. The ultimate decision should be based on what is ethically correct and fairest with respect to cumulative science and society, not on what would simply get a paper through a specific journal. If reviewers require the exclusion of specific data because they are not ‘statistically significant’, or for any other reason, ask yourself whether this is honest and ethically correct and what the implications for cumulative science and society overall might be. If you disagree with a particular guideline and can provide robust scientific justification for your position, you can always attempt a rebuttal, even if rebuttals are rarely successful. We hope you find this information useful. Editors look for good reasons to accept papers (Binkley et al. 2020) rather than searching for reasons to reject them, and the methodology behind the statistics, beginning with the experimental design, is often a core determinant. Therefore, give editors and reviewers reasons to eventually accept, rather than reject, your paper.