Background

One of the main functions of the editorial process (peer review and editorial handling), as employed by almost all serious scientific journals, is to ensure that published research articles are accurate, transparent, and complete reports of the research conducted. Spin is a term used to describe reporting practices that distort the interpretation of a study’s results [1]. Failing to mention (all important) study limitations is one way in which readers can be misled into believing, for example, that the beneficial effect of an experimental treatment is greater than the trial’s results warrant.

In a survey among scientists, insufficient reporting of study limitations ranked high in a list of detrimental research practices [2]. In a masked before-after study at the editorial offices of Annals of Internal Medicine, Goodman et al. found that the reporting of study limitations was fairly poor in manuscripts but improved after peer review and editing [3]. Ter Riet et al. demonstrated that more than a quarter of biomedical research articles do not mention any limitations [4]. Finally, Horton, in a survey among the authors of ten Lancet papers, found that “Important weaknesses were often admitted on direct questioning, but were not included in the published article” [5]. Other forms of spin are inappropriate extrapolation of results and the inference of causal relationships that the study’s design does not support [1].

Peer reviewers should spot and suggest changes to overstatements and overly strong claims, and point out non-trivial study weaknesses that go unmentioned. The peer review process may therefore be seen as “a negotiation between authors and journal about the scope of the knowledge claims that will ultimately appear in print” [6]. Specific words that can be used to add nuance to statements and forestall potential overstatement are so-called “hedges”: words like “might,” “could,” “suggest,” and “appear” [7]. Authors of an article are arguably in the best position to point out their study’s weaknesses, but they may feel that naming too many, or discussing them too extensively, could hurt their chances of publication. In this contribution, we hypothesized that, compared to the subsequent publications, the discussion sections of the submitted manuscripts contain fewer acknowledgments of limitations and are less strongly hedged.

Methods

In this study, we considered the discussion sections of randomized clinical trial (RCT) reports published in 27 BioMed Central (BMC) journals and BMJ Open. Using two software tools, we determined the number of sentences dedicated to the acknowledgment of specific study limitations and the use of linguistic hedges, before (manuscripts) and after peer review (publications). The limitation detection tool relies on the structure of the discussion sections and on linguistic cues to identify limitation sentences [8]. In a formal evaluation, its accuracy was 91.5% (95% CI 90.1–92.9). The hedging detection tool uses a lexicon containing 190 weighted hedges and computes an overall hedging score based on the number and strength of the hedges in a text. Hedge weights range from 1 (low hedging strength, e.g., “largely”) to 5 (high hedging strength, e.g., “may”). The overall hedging score is then divided by the word count of the discussion section (normalization). We also calculated “unweighted” scores, in which every hedge receives a weight of 1. In a formal evaluation, the tool identified hedged sentences with 93% accuracy [9]. The manuscripts were downloaded from the journals’ websites and then manually pre-processed to restore sentence and paragraph structure. Our software automatically extracted the discussion sections of the publications from PubMed Central.
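To make the scoring concrete, the following is a minimal sketch of how a weighted and an unweighted hedging score, both normalized by word count, might be computed. The five-term lexicon is a hypothetical fragment, not the tool’s actual 190-term lexicon, and the function name is ours:

```python
import re

# Hypothetical fragment of the weighted hedge lexicon; the actual tool
# uses 190 hedges with weights from 1 (weak) to 5 (strong) [9].
HEDGE_WEIGHTS = {"largely": 1, "appear": 3, "suggest": 3, "could": 4, "may": 5}

def hedging_scores(discussion: str) -> tuple[float, float]:
    """Return (weighted, unweighted) hedging scores, normalized by word count."""
    words = re.findall(r"[a-z']+", discussion.lower())
    weighted = sum(HEDGE_WEIGHTS.get(w, 0) for w in words)
    unweighted = sum(1 for w in words if w in HEDGE_WEIGHTS)  # every hedge counts as 1
    return weighted / len(words), unweighted / len(words)

w, u = hedging_scores("These results suggest that the treatment may be effective.")
```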

We also carried out a qualitative analysis of the two publications with the largest increase and the largest decrease in hedging score. For these two papers, KK compared the before and after discussion sections to identify the actual changes. The reviewer reports, consisting of the reviewers’ comments and the authors’ responses, were also analyzed.

We performed mixed linear regression analyses, across the manuscript-publication pairs, of the mean changes in the number of limitation sentences and in the normalized hedging scores, with journal as a random intercept. We repeated these analyses adjusting for the journal’s impact factor (continuous), editorial team size (continuous), and the composition of the author team in terms of English proficiency (three dummy variables representing four categories). English proficiency was derived from the classification of majority native English-speaking countries used by the United Kingdom (UK) government for British citizenship applications [10]. It was categorized as follows: (i) all authors are residents of a majority native English-speaking country; (ii) the first author is a native English speaker, but at least one co-author is not; (iii) the first author is not a native English speaker, but at least one co-author is; and (iv) none of the authors are native English speakers. We performed a sensitivity analysis in which we excluded the manuscript-publication pairs of BMJ Open (n = 69) and BMC Medicine (n = 14) because of their exceptionally large editorial teams (84 and 182 members, respectively). Finally, using scatterplots and fractional polynomial functions, we visually explored whether the change in the number of limitation-acknowledging sentences depended on the number of such sentences in the manuscript, controlling for regression to the mean using a median split as suggested by Goodman et al. [3]. We present the results of the crude and adjusted analyses in Table 2 and those of the sensitivity analyses in Appendix 1.
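As an illustration of the crude and adjusted models (the figure caption indicates the original analyses were run in Stata; the file and column names below are hypothetical stand-ins), the mixed linear regression with journal as a random intercept might be sketched in Python as:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data set: one row per manuscript-publication pair, with the
# change in limitation sentences, journal covariates, and journal identifier.
pairs = pd.read_csv("manuscript_publication_pairs.csv")

# Crude model: mean change with a random intercept per journal.
crude = smf.mixedlm("change_limitations ~ 1", pairs, groups=pairs["journal"]).fit()

# Adjusted model: impact factor, editorial team size, and English proficiency
# (a four-category factor, yielding three dummy variables).
adjusted = smf.mixedlm(
    "change_limitations ~ impact_factor + team_size + C(proficiency)",
    pairs, groups=pairs["journal"],
).fit()
print(adjusted.summary())
```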

We used mixed-effects logistic regression analysis to assess the impact of the abovementioned factors on the likelihood of mentioning at least one limitation in the publication among the manuscripts that had mentioned none. Sensitivity analyses restricted the data set to journals with fewer than 20 editorial team members, to journals with at least 10 manuscript-publication pairs, and to journals meeting both restrictions simultaneously.
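A comparable sketch for the logistic counterpart, again with hypothetical column names and restricted to the pairs whose manuscripts mentioned no limitations (statsmodels fits such mixed-effects logistic models via variational Bayes):

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

pairs = pd.read_csv("manuscript_publication_pairs.csv")  # hypothetical file
no_limits = pairs[pairs["limitations_ms"] == 0]  # no limitations on submission

# 0/1 outcome: at least one limitation in the publication; the variance
# component formula gives a random intercept per journal.
model = BinomialBayesMixedGLM.from_formula(
    "mentions_limitation_pub ~ impact_factor + team_size + C(proficiency)",
    {"journal": "0 + C(journal)"},
    no_limits,
)
result = model.fit_vb()  # variational Bayes estimation
print(result.summary())
```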

Results

Four hundred forty-six research articles were selected. Table 1 shows key journal characteristics. The median number of manuscripts per journal was 10.5 (interquartile range (IQR) 6.5–18.5; range 2–69). Table 2 shows the results. The average number of distinct limitation sentences increased by 1.39, from 2.48 (manuscripts) to 3.87 (publications). Two hundred two manuscripts (45.3%) did not mention any limitations; sixty-three of these (31%, 95% CI 25–38) mentioned at least one after peer review. Of the 244 manuscripts that mentioned at least one limitation, eight (3%, 95% CI 2–6) mentioned none in the publication. Across the (sensitivity) analyses performed, the probability of mentioning at least one limitation in the publication among the manuscripts that had none was not consistently associated with any of the three covariables assessed, although higher impact factors tended to be weakly associated with lower probabilities and larger editorial teams weakly with higher probabilities (data not shown). The visual assessment of how the change in the number of limitation-acknowledging sentences depended on the number of such sentences in the manuscript showed an inverse relation; that is, larger changes were seen in manuscripts with low numbers of limitation-acknowledging sentences (Fig. 1).

Table 1 Journal characteristics
Table 2 The results of the crude and adjusted analyses
Fig. 1

Changes in the number of limitation-acknowledging sentences between manuscripts and publications as a function of the number of such sentences in the manuscript. Left panel: manuscript-publication pairs below the median split. Right panel: manuscript-publication pairs above the median split. The median split was calculated as the average of the number of limitation-acknowledging sentences in the manuscript (Lm) and in the publication (Lp): (Lm + Lp)/2. These averages were ranked and the median (value = 2; interquartile range 0–5) determined. The lines were fitted using fractional polynomials with 95% confidence intervals (Stata 13.1, twoway fpfitci command). Note that, in both panels, the changes tend to increase with decreasing numbers of limitation-acknowledging sentences in the manuscript. In particular, the effect of peer review and editorial handling is large in those manuscripts above the median split (right panel) with zero limitation-acknowledging sentences in the manuscript. The vertical line in the right panel is the line x = 1. To prevent clutter, data points were jittered; points for x = 0 are therefore not placed exactly above the tick mark for x = 0 but scattered somewhat to the left and right, and the same holds for all other data points and for the vertical placement of the points
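For illustration, the median split described in the caption could be computed as follows; this is a sketch with hypothetical file and column names, and the handling of ties at the median is our choice, not necessarily the study’s:

```python
import pandas as pd

pairs = pd.read_csv("manuscript_publication_pairs.csv")  # hypothetical file

# Average the limitation counts of manuscript (Lm) and publication (Lp),
# then split the pairs at the median of these averages.
avg = (pairs["limitations_ms"] + pairs["limitations_pub"]) / 2
median = avg.median()  # the caption reports a median of 2 (IQR 0-5)
below = pairs[avg < median]   # left panel of Fig. 1
above = pairs[avg >= median]  # right panel of Fig. 1
```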

The hedging-related differences were all very close to zero. A post hoc analysis, inspired by the hypothesis that the limitation-acknowledging sentences themselves might affect the average hedging scores, confirmed the main analysis.

The largest increase in hedging score was + 1.67 (from 3.33 to 5.00): the weighted hedging scores were 50 across the 15 detected sentences in the manuscript and 145 across the 29 detected sentences in the published paper. The largest decrease was − 2.55 (from 6.85 to 4.30): the weighted hedging scores were 192 across 28 sentences in the manuscript and 142 across 33 sentences in the published paper (see Appendix 3 for the textual changes).

Discussion

In a sample of 446 randomized trial reports published in 28 open access journals, we found a 56% increase in the number of sentences dedicated to study limitations after peer review, although one may argue that in absolute terms the gain was modest (1.39 additional sentences). Our automated approach showed that 33% of research reports do not contain limitation sentences after peer review. This is comparable with the 27% that Ter Riet et al. found using a manual approach [4]. Goodman et al. found that mentioning study limitations is one of the poorest scoring items before and one of the most improved after peer review [3]. Like Goodman et al., we found evidence that peer review and editorial handling had greater impact on manuscripts with zero or very few limitation-acknowledging sentences. In Appendix 2, we highlight the attention given to mentioning study limitations in seven major reporting guidelines.

Our findings do not support the hypothesis that the editorial process increases the qualification of claims through more nuanced language. The small-scale qualitative analysis of two manuscript-publication pairs indicated that authors are asked both to tone down statements, that is, to hedge more strongly, and to make statements less speculative, that is, to hedge less. These phenomena may offset each other, resulting in minimal changes in the overall use of hedges (see Appendix 3 for the actual text changes). While the hedging terms and their strength scores were selected based on a careful analysis of the linguistic literature on this topic, it is possible that authors use terms indicating different degrees of certainty (e.g., could vs. may) somewhat interchangeably. This may explain our finding that the net change in hedging scores was very small.

To better understand the influence of peer review on changes made to manuscripts before publication, it may be interesting to conduct more extensive qualitative analyses of the peer review reports and correspondence available in the files of editorial boards or publishers. Another interesting research avenue may be the comparison of rejected manuscripts with accepted ones, to assess whether acknowledgment of limitations and degree of hedging affect acceptance rates. It may be useful to restrict such analyses to sentences in which particular claims on, for example, generalizability are made.

Arguably, our software tools might be used by editorial boards (or submitting authors) to flag particular paragraphs that deserve more (editorial) attention. The limitation sentence recognizing software could, for example, alert editors to manuscripts with zero self-acknowledged limitations so they can check whether such an omission is justified. If reference values existed that represented the range of hedging scores across a large body of papers, the hedge-detection software could inform reviewers (or even authors) that a manuscript has an unusual (weighted) hedging score and prompt them to revisit some of the formulations in the paper. We think that, currently, no direct conclusions should be drawn from the numbers alone. Human interpretation will remain critical for some time to come, but a signposting role for the software already seems feasible.
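A minimal sketch of such a signposting step might look as follows; the detect_limitation_sentences() function is a hypothetical stand-in for the limitation-detection tool [8], implemented here as a crude keyword heuristic only so that the sketch runs end to end:

```python
from typing import List

def detect_limitation_sentences(discussion: str) -> List[str]:
    """Hypothetical stand-in for the limitation-detection tool [8]:
    a crude keyword heuristic, not the tool's actual method."""
    cues = ("limitation", "caveat", "weakness", "shortcoming")
    return [s for s in discussion.split(".") if any(c in s.lower() for c in cues)]

def flag_for_editor(discussion: str) -> bool:
    """Flag manuscripts whose discussion acknowledges zero limitations."""
    return len(detect_limitation_sentences(discussion)) == 0

# False: the discussion acknowledges a limitation, so no flag is raised.
print(flag_for_editor("The effect was large. A limitation is the small sample."))
```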

A limitation of our study is that we only included reports of randomized trials that made it to publication. Acknowledgment of limitations among all submissions, including observational studies, may differ from what we report here. Another limitation is that we only included journals with open peer review and above-average editorial team quality. Blinded peer review may lead to different results, as may be the case for journals with lower-quality editorial teams. Note also that the weights assigned to the hedges are somewhat subjective; however, our results were stable across weighted and unweighted hedges. Finally, one may argue that there is a discrepancy between our interest in overstated claims and what we actually measured, namely, hedging scores in all sentences of the discussion sections. A stricter operationalization of our objective would have required that we detect “claim sentences” first and then measure hedging levels in those sentences only. On the other hand, focusing on discussion sections is preferable to analyzing complete papers, because claims are usually made in the discussion sections. A strength of our study is the automated assessment of limitation sentences and hedges, which limits the likelihood of analytical or observational bias. Such automated assessment could also assist journal editors and peer reviewers in their review tasks. Our results suggest that reviewers and/or editors demand discussion of study limitations that authors were unaware of or unwilling to discuss. Since good science implies full disclosure of issues that may (partially) invalidate the findings of a study, this increase in the number of limitation sentences is a positive effect of the peer review and editorial process.

Conclusion

Our findings support the idea that editorial handling and peer review, on average, cause a modest increase in the number of self-acknowledged study limitations and that these effects are larger in manuscripts reporting zero or very few limitations. This finding is important for the debates about the value of peer review and about detrimental research practices. Software tools such as the ones used in this study may be employed by authors, reviewers, and editors to flag potentially problematic manuscripts or sections thereof. More research is needed to assess more precisely the effects, if any, of peer review and editorial handling on the linguistic nuance of claims.