Worsening file-drawer problem in the abstracts of natural, medical and social science databases
The file-drawer problem is the tendency of journals to preferentially publish studies with statistically significant results. The problem is an old one and has been documented in various fields, but to the best of my knowledge no attention has been paid to how the issue is developing quantitatively through time. In the abstracts of various major scholarly databases (Science and Social Science Citation Index, 1991–2008; CAB Abstracts and Medline, 1970s–2008), the file-drawer problem is gradually getting worse, in spite of an increase in (1) the total number of publications and (2) the proportion of publications reporting both the presence and the absence of significant differences. The trend is confirmed for particular natural science topics such as biology, energy and the environment, but not for papers retrieved with the keywords biodiversity, chemistry, computer, engineering, genetics, psychology and quantum (physics). A worsening file-drawer problem can be detected in various medical fields (infection, immunology, malaria, obesity, oncology and pharmacology), but not for papers indexed with strings such as AIDS/HIV, epidemiology, health and neurology. An increase in the selective publication of some results over others is worrying because it can lead to enhanced bias in meta-analysis and hence to a distorted picture of the evidence for or against a certain hypothesis. Long-term monitoring of the file-drawer problem is needed to ensure a sustainable and reliable production of (peer-reviewed) scientific knowledge.
Keywords: History of science · Meta-analysis · Publication explosion · Scientific knowledge · Significant differences · STM publishing
The file-drawer problem is the relative lack of non-significant published results (e.g., Sterling 1959; Begg and Berlin 1988; Kennedy 2004). It follows from the tendency of journals to preferentially publish studies with statistically significant results. Rosenthal famously argued in 1979 that journals are filled with the 5% of studies with significant differences, whilst the drawers (or computer folders, now) are full of the other 95% of studies. Since then, the problem has been documented in several fields, from medical research (Begg and Berlin 1988) to biology (Csada et al. 1996) and psychiatry (Gilbody et al. 2000), from oncology (Krzyzanowska et al. 2003) to sociology (Gerber and Malhotra 2008), from systematic reviews in medicine (Tricco et al. 2009) to research on drug addiction (Vecchi et al. 2009) and orthodontics (Koletsi et al. 2009).
[Figure] Ratio of the proportion of papers reporting non-significant differences to the proportion of papers reporting significant differences in the title/abstract, as a function of the year to which the ratio refers (1991–2008), for 10 selected topics in the natural sciences in Web of Science (all Citation Indexes, as of April 2009).
[Figure] Ratio of the proportion of papers reporting non-significant differences to the proportion of papers reporting significant differences in the title/abstract, as a function of the year to which the ratio refers (1991–2008), for 10 selected topics in the medical sciences (AIDS, cancer, epidemics, health, infections, immunity, malaria, neurology, obesity, pharmacology) in Web of Science (all Citation Indexes, as of April 2009).
Materials and methods
I investigated temporal trends in the file-drawer problem in the abstracts of four databases of the peer-reviewed scientific literature. For the Science and Social Science Citation Indexes, the period studied was 1991–2008 (before 1991 only titles are searchable, whereas from 1991 onwards abstracts are also included in these databases). Because abstracts are searchable before 1991 in Medline and CAB Abstracts, it was possible to analyze data for these two databases from the 1970s onwards.
For each year, I searched for papers explicitly mentioning “no significant difference/s” or “no statistically significant difference/s” in the title and/or abstract. This is a conservative estimate of the number of publications with no statistical support for the hypothesis tested, as other papers may report non-significant differences with other wordings or without explicitly mentioning them in the title and/or abstract. There is currently no way to search the full text of papers in ISI Web of Knowledge, but it is reasonable to expect that most abstracts are representative of the paper as a whole. It is true that authors tend to highlight the most important findings in the title and abstract, but this (1) makes the analysis more conservative (it avoids investigating trivial results reported in the full text and concentrates on the major results of the studies) and (2) holds throughout the period studied (so that temporal comparability is guaranteed).
I then searched for papers which referred in the abstract to “significant difference/s”, subtracted the first result from the second (searching for “significant difference/s” also retrieves papers reporting “no (statistically) significant difference/s”), and used the result as a similarly conservative estimate of the number of publications reporting significant differences. These two yearly estimates were standardized by dividing them by the total number of papers indexed in the same database for the same year. Taking into account the number of papers reporting in the abstract the “lack (or absence) of significant difference/s” did not affect the results, as such papers were orders of magnitude less frequent than those using the wording “no significant difference/s”.
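The counting logic above can be sketched as follows. This is a minimal illustration on a hypothetical local list of abstracts (the original analysis used yearly hit counts from the database search interface, not full-text matching); the function names and sample abstracts are inventions for the example.

```python
import re

def count_matching(abstracts, pattern):
    """Count abstracts containing the given regex pattern (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for a in abstracts if rx.search(a))

def yearly_proportions(abstracts):
    """Return conservative proportions of non-significant and significant reports.

    Mirrors the paper's subtraction step: a search for "significant difference*"
    also matches "no (statistically) significant difference*", so the count of
    the negated phrase is subtracted to isolate positive reports.
    """
    total = len(abstracts)
    nonsig = count_matching(abstracts, r"no (statistically )?significant difference")
    sig_all = count_matching(abstracts, r"significant difference")
    sig = sig_all - nonsig  # remove the overlap with the negated phrase
    return nonsig / total, sig / total

# Toy example with three hypothetical abstracts
abstracts = [
    "We found a significant difference between the two groups.",
    "There was no significant difference in survival.",
    "Treatment effects are discussed qualitatively.",
]
p_nonsig, p_sig = yearly_proportions(abstracts)
print(p_nonsig, p_sig)
```

Standardizing by the total number of indexed papers (here, the list length) keeps the two yearly estimates comparable across years with very different publication volumes.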
Similarly, papers retrieved with the strings “(no) significant difference*” are much more frequent than those retrieved with other wordings (e.g. “(no) significant effect”). For example, in ISI Web of Science (Science Citation Index, all years, as of January 2010) and the topic “biolog*”, searching for “no significant difference*” gives about 4400 results, whilst “no significant increase*” provides only ~170. Analogously, for the Science Citation Index (all years) and the topic “health”, searching for “no significant difference*” gives over 20,000 results, whilst “no significant association*” provides only ~1700. Similarly, for all Citation Indexes (all years) and the topic “cancer”, searching for “significant difference*” returns about 24,000 results, whilst “significantly different” returns about 6400. Apart from papers retrieved with keywords other than “(no) significant difference*” being fewer in number, there is no reason to expect that the trends reported for “significant differences” behave differently for papers in which researchers express the absence or presence of significant results without the exact words “significant difference*”. A comparative analysis of such trends for different wordings may well be the object of a later investigation, but is not the aim of the present study.
The regression of the ratio of the number of publications reporting in the abstract the absence versus the presence of significant differences as a function of year of publication was analyzed in SAS 9.1. The same analysis was performed for papers retrieved with 20 general keywords in Web of Science (all Citation Indexes) for the natural (biodiversity, biology, chemistry, computer, engineering, energy, environment, genetics, psychology, quantum physics) and the medical (AIDS/HIV, cancer, epidemiology, health, immunology, infection, malaria, neurology, obesity, pharmacology) sciences.
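The regression step can be sketched as follows. This is a minimal illustration with made-up yearly proportions (the original analysis was run in SAS 9.1 on the real database counts); the declining series and function name are assumptions for the example.

```python
import numpy as np

def ratio_trend(years, nonsig_props, sig_props):
    """Fit ratio = intercept + slope * year by ordinary least squares.

    The ratio is the yearly proportion of papers reporting non-significant
    differences divided by the proportion reporting significant differences.
    """
    ratio = np.asarray(nonsig_props) / np.asarray(sig_props)
    slope, intercept = np.polyfit(years, ratio, 1)  # degree-1 fit, highest power first
    return intercept, slope

# Hypothetical yearly proportions over 1991-2008
years = np.arange(1991, 2009)
nonsig = np.linspace(0.020, 0.015, len(years))  # "no significant difference" papers
sig = np.full(len(years), 0.030)                # "significant difference" papers
intercept, slope = ratio_trend(years, nonsig, sig)
print(slope < 0)  # True: a declining ratio, i.e. a worsening file-drawer problem
```

A negative slope, as in the paper's reported equations, means the proportion of papers reporting significant differences is growing faster than the proportion reporting their absence.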
In spite of these results, for all databases there was a tendency for studies reporting significant differences in the abstract to increase faster than studies which failed to do so. Hence, there was a decrease through time in the ratio of non-significant to significant results, which was consistent across the databases analyzed (Fig. 2). This decreasing ratio also applied to other databases with very little reporting of statistical differences in titles and/or abstracts, such as the ISI Conference Proceedings database (n = 18, R2 = 0.50, y = 55 − 0.027x, slope s.e. = 0.007, p < 0.001) and the Arts and Humanities Citation Index (n = 17, R2 = 0.24, y = 57 − 0.028x, slope s.e. = 0.013, p = 0.04).
These results were confirmed when searching in Web of Science (all Citation Indexes) for some fields/topics in the natural sciences such as biology, energy and environment (Table 1). There was instead no significant variation in the ratio of the proportion of papers reporting in the abstract the absence versus the presence of significant differences for papers dealing with biodiversity, chemistry, computers, engineering, genetics, psychology and quantum (physics).
In the medical sciences, the worsening of the file-drawer problem was more generalized, with a significant decrease in the ratio of the proportion of papers reporting in the abstract the absence versus the presence of significant differences for fields such as immunology, infection, research on malaria and obesity, as well as oncology and pharmacology (Table 2). Exceptions to this trend were papers on AIDS/HIV, epidemiology and neurology, and those retrieved with the generic keyword ‘health’. For the medical sciences studied, there was a generally higher proportion of papers reporting the absence of significant differences than in the non-medical fields investigated (Tables 1, 2). Similarly, the ratio of papers reporting the absence versus the presence of significant differences was generally higher for the medical topics studied than for those in the natural sciences (Tables 1, 2).
Bias in science can (but does not need to) occur in several ways, from the selective funding of some proposals against some others to the preferential attention to and citation of some articles in comparison to others (Cicchetti 1991; Garfield 1997; Paris et al. 1998; Bensman 2007; Nieminen et al. 2007; Greenberg 2009; Marsh et al. 2009; Reinhart 2009; Taborsky 2009). In between funding and citation bias is publication bias, i.e. the tendency to accept for publication submissions with certain features and to dismiss manuscripts with some other features. Although it is possible that funding and citation bias are driving publication bias, this study does not address the issue of the temporal development in a potential bias of funding bodies towards proposals which are more likely to end up in obtaining positive results. Nor does this study deal with whether the citation likelihood of studies reporting the absence versus the presence of significant differences has been changing through time.
This study shows that studies reporting the presence of statistical differences in the title and/or abstract have recently been catching up with studies reporting their absence. This result is evidence that, in the abstracts of four (six, counting ISI Conference Proceedings and the Arts and Humanities Citation Index) scientific databases, the file-drawer problem has been getting worse during the last decades. This has happened in spite of the progressive, generalized and unrelenting increase in the number of new publications indexed per year in ISI Web of Knowledge. The trend towards a worsening file-drawer problem could have different causes: (1) an increase through time in rejection rates (forcing editors to more often decline publication to submissions with negative results), and (2) a gradually increased general emphasis on the impact factor (pushing editors to try and publish more citable papers, e.g. those reporting in the abstract the presence of significant differences).
These two mechanisms (increased rejection rates and increased emphasis on the impact factor) are not mutually exclusive and could also work from the perspective of authors and funding bodies. If there are increases in the peer review selectivity and in the importance of the standing of the journals where manuscripts appear, authors might tend to preferentially perform analyses which are likely to result in big effect sizes and thus in the presence of significant differences, or to preferentially write up and submit manuscripts about analyses which resulted in the presence of significant differences.
The increase in the number of indexed publications which report the presence or absence of significant differences in the title or abstract shows that publications are becoming more quantitative. This could imply that it is becoming more difficult to publish, as authors feel an increasing need to stress in the abstract that they found the presence or absence of significant differences. That peer review may have a role in the results reported here, and that these may not be entirely due to funding and authors’ practices, is suggested by independent studies of the likelihood of abstracts to be accepted to conferences in e.g. oncology and research on drug addiction (Krzyzanowska et al. 2003; Vecchi et al. 2009). Moreover, the downward trend of the ratio of reported non-significant versus significant results appears to be general, as it is happening across the databases studied, which span the natural, medical and social sciences (Fig. 2).
The trend is not universal, as there are topics which do not show variation in the ratio of papers reporting in the abstract the presence versus absence of significant differences over the time-span analyzed (e.g., chemistry, engineering, and genetics). This result confirms that the standards of peer review may well differ from scientific field to field (e.g., Abt 1992; Guetzkow et al. 2004; Klein 2006). However, the absence of a worsening of the file-drawer problem in some fields does not rule out that subtopics within these fields might show a negative trend in the proportion of papers reporting in the abstract the absence versus the presence of significant differences.
Conversely, for the general fields where a negative trend is manifest (e.g. biology), it is possible that sub-fields may not have experienced the same development in the file-drawer problem (e.g. biodiversity). For biodiversity, there has been a remarkable increase in the number of papers with that keyword over the period studied (about 40 times more papers in 2008 than in 1991), so that this explosive growth could have masked any trends in the reporting of significant differences. In computer science, reporting significant differences might not be considered as important as in other, more experimental sciences, although there is also no trend in genetics, where there is much more emphasis on reporting significant differences (Table 1). A further example where there is no significant trend, in spite of a relatively high proportion of studies reporting significant differences in the abstract, is psychology. This is interesting given that Rosenthal originally pointed out the file-drawer problem for the psychological literature. Psychology also stands out from the other topics studied in that it shows a high proportion of papers reporting the absence of significant differences (comparable to the one generally observed in the medical sciences), but at the same time a low ratio of papers reporting the absence versus the presence of significant differences (common among the non-medical topics studied).
In spite of the negative temporal trend reported, the generally higher proportion of studies reporting in the abstract the absence rather than the presence of significant differences goes against previous speculations and reports (e.g. Rosenthal 1979; Csada et al. 1996; Gilbody et al. 2000) of very little publishing of negative results in the peer-reviewed literature. This difference could stem from the present analysis having been conducted on a large dataset (several million papers are indexed in Web of Knowledge), whereas previous studies tended to concentrate on selected issues of a few journals, which may or may not have been representative of the scientific literature in general. Although in the databases analyzed there does not seem to have been an overall strong bias against publishing negative results during the last decades, this may still be the case for specific topics and fields.
The selective publication of some results over others is worrying because it can lead to bias in meta-analysis and hence to a distorted picture of the evidence for or against a certain hypothesis (Begg and Berlin 1988; Khoury et al. 2009; Levine et al. 2009; Song et al. 2009). Some scholars have expressed the feeling that the peer review system is in need of reform because of the worsening delays in obtaining constructive reports (Lawrence 2003; Hauser and Fehr 2007; Primack and Marrs 2008; Hochberg et al. 2009; Schwartz and Zamboanga 2009; de Mesnard 2010; Pautasso and Schäfer 2010). The trend documented here away from publishing studies that report the absence of significant differences is a further symptom of poor health and should be counteracted. One constructive suggestion is to focus on effect size rather than mere p values (Killeen 2005; Nakagawa and Cuthill 2007). In addition, it is important that guidelines for peer reviewers (e.g. Smith 1990; Provenzale and Stanley 2005; Bourne and Korngreen 2006; Pautasso and Pautasso 2010) explicitly discuss the file-drawer problem in the context of how to peer review scientific manuscripts, as the cumulative action and bias of the millions of individual peer reports provided each year on submissions certainly has the power to shape in a non-random way what gets through the sieve of peer review.
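The suggestion to report effect sizes rather than mere p values can be illustrated with a standardized mean difference. This is a generic sketch on hypothetical data, not an analysis from the study; Cohen's d with a pooled standard deviation is one common choice among several effect-size measures.

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: standardized mean difference using the pooled sample SD (ddof=1)."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical measurements from two treatment groups
a = [5.1, 5.4, 4.9, 5.6, 5.2]
b = [4.6, 4.8, 4.5, 5.0, 4.7]
d = cohens_d(a, b)
print(round(d, 2))
```

Unlike a bare p value, the magnitude of d is informative even when a test is non-significant, which makes non-significant results more publishable and more usable in later meta-analyses.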
Further work is needed to assess the generality and pinpoint the causes of the negative trend reported. There is a need to check the results obtained with the wording “(no) significant difference*” with other less frequently used wordings: the present analysis only studies trends in the file drawer problem with a specific search string, not with all possible wordings conveying the presence or the absence of significant differences. Given that there are scientific fields where the proportion of studies reporting in the abstract the absence versus presence of significant differences has not changed during the last decades, these fields could provide information about factors which facilitate the publication of negative results. It would be interesting to know whether variation in institutional factors (e.g. types of journals (for profit, scientific society, open-access) and peer review (anonymous, double-blind, open)) can have an influence on the development of the file-drawer problem across the various scientific fields. Similarly, a fascinating question would be whether there are differences in the trend investigated here for cross-, mono-, multi-, inter- and trans-disciplinary fields.
Many thanks to L. Ambrosino, R. Brown, T. Hirsch, O. Holdenrieder, M. Jeger, C. Pautasso, R. Russo and H. Schäfer for insight, discussion or support and to I. Cuthill, O. Holdenrieder, T. Matoni, P. Vineis, K. West and anonymous reviewers for helpful comments on a previous draft.
- Cicchetti, D. V. (1991). The reliability of peer-review for manuscript and grant submissions: A cross-disciplinary investigation. Behavioral and Brain Sciences, 14, 119–134.
- de Mesnard, L. (2010). On Hochberg et al.’s “The tragedy of the reviewer commons”. Scientometrics, in press. doi:10.1007/s11192-009-0141-8.
- Garfield, E. (1997). A statistically valid definition of bias is needed to determine whether the Science Citation Index(R) discriminates against third world journals. Current Science, 73, 639–641.
- Khoury, M. J., Bertram, L., Boffetta, P., Butterworth, A. S., Chanock, S. J., Dolan, S. M., et al. (2009). Genome-wide association studies, field synopses, and the development of the knowledge base on genetic variation and human diseases. American Journal of Epidemiology, 170, 269–279.
- Koletsi, D., Karagianni, A., Pandis, N., Makou, M., Polychronopolou, A., & Eliades, T. (2009). Are studies reporting significant results more likely to be published? American Journal of Orthodontics and Dentofacial Orthopedics, 136, 632e1.
- Pautasso, M., & Schäfer, H. (2010). Peer review delay and selectivity in ecology journals. Scientometrics, in press. doi:10.1007/s11192-009-0105-z.
- Provenzale, J. M., & Stanley, R. J. (2005). A systematic guide to reviewing a manuscript. American Journal of Radiology, 185, 848–854.
- Smith, A. J. (1990). The task of the referee. IEEE Computer, 23, 46–51.