1 Introduction

Publication bias in scientific journals is widespread (Fanelli 2012). It leads to an incomplete view of scientific inquiry and results and presents an obstacle to evidence-based decision-making and to public acceptance of valid scientific discoveries and theories. A growing trend in scientific inquiry, practiced in this article, is the meta-analysis of large bodies of literature, a practice that is particularly susceptible to misleading and inaccurate results when the literature is systematically biased (e.g., Michaels 2008; Fanelli 2012, 2013).

The role of publication bias in scientific consensus has been described in a variety of scientific disciplines, including but not limited to medicine (Kicinski 2013; Kicinski et al. 2015), social science (Fanelli 2012), ecology (Palmer 1999), and global climate change research (Michaels 2008; Reckova and Irsova 2015).

Despite widespread consensus among climate scientists that global warming is real and has anthropogenic roots (e.g., Holland 2007; Idso and Singer 2009; Anderegg et al. 2010), many end users of science, such as the popular media, politicians, industrialists, and citizen scientists, continue to treat the facts of climate change as fodder for debate and denial. For example, Carlsson-Kanyama and Hörnsten Friberg (2012) found that only 30% of politicians and directors from 63 Swedish municipalities believed humans contribute to global warming; 61% of respondents were uncertain about the causes of warming, and as many as 9% denied it was real.

Much of this skepticism stems from an event that has been termed Climategate, when emails and files from the Climatic Research Unit (CRU) at the University of East Anglia were copied and later exposed for public scrutiny and interpretation. Climate change skeptics claimed that the IPCC 2007 report—the Intergovernmental Panel on Climate Change Fourth Assessment Report (IPCC 2007), which uses scientific evidence to argue that humans are causing climate change—was the product of a bias for positive results among editors and peer reviewers of scientific journals; editors and scientists were accused of suppressing research that did not support the paradigm of carbon dioxide-induced global warming. In 2010, the CRU was cleared of any scientific misconduct or dishonesty by the Muir Russell Committee (Adams 2010; but see Michaels 2010).

Although numerous reviews have examined the credibility of climate researchers (Anderegg et al. 2010), the scientific consensus on climate change (Doran and Kendall Zimmerman 2009), and the complexity of media reporting (Corner et al. 2012), few studies have undertaken an empirical review of the publication record to evaluate the existence of publication biases in climate change science. However, Michaels (2008) scrutinized Nature and Science, the two most prestigious journals publishing global warming research, and, using vote-counting meta-analysis, confirmed a skewed publication record. Reckova and Irsova (2015) also detected a publication bias after analyzing 16 studies of carbon dioxide concentrations in the atmosphere and changes in global temperature. Although publication biases were reported by Michaels (2008) and Reckova and Irsova (2015), the former used a small set of pre-defined journals to test the prediction, while the latter lacked statistical power given a sample size of 16 studies. In contrast, here we conducted a meta-analysis of results from 120 reports published in 31 scientific journals. Our approach expands upon the conventional definition of publication bias to include publication trends over time and in relation to seminal events in the climate change community, stylistic choices made by authors who may selectively report some results in abstracts and others in the main body of articles (Fanelli 2012), and patterns of effect size and reporting style in journals representing a broad cross-section of impact factors.

We tested the hypothesis of bias in climate change publications stemming from the under-reporting of non-significant results (Rosenthal 1979) using fail-safe sample sizes, funnel plots, and diagnostic patterns of variability in effect sizes (Begg and Mazumdar 1994; Palmer 1999, 2000; Rosenberg 2005). More specifically, we examined (a) whether non-significant results were omitted disproportionately in the climate change literature, (b) whether there were unexpected and abrupt changes in the number of published studies and reported effects in relation to IPCC 2007 and Climategate, (c) whether effects presented in abstracts were significantly larger than those reported in the main body of reports, and (d) how findings from these first three tests related to journal impact factor.

Meta-analysis is a powerful statistical tool used to synthesize results from numerous studies and to identify general trends in a field of research. Unfortunately, not all articles within a given field of science contain the statistical estimates required for meta-analysis (e.g., estimates of effect size, error, and sample size). Therefore, the literature used in meta-analysis is often a sample of all available articles, analogous to the analytical framework used in ecology, in which a sub-sample of a population is used to estimate the parameters of the true population. For the purpose of our meta-analysis, we sampled articles from the body of literature that explores the effects of climate change on marine organisms. Marine species are exposed to a large array of abiotic factors that are linked directly to atmospheric climate change. For instance, oceans absorb heat from the atmosphere and mix with freshwater run-off from melting glaciers and ice caps, which changes ocean chemistry and puts stress on ocean ecosystems. The resulting changes in ocean salinity and pH can, for example, inhibit calcification in shell-bearing organisms that are either habitat-forming (e.g., coral reefs, oyster reefs) or the foundation of food webs (e.g., plankton) (The Copenhagen Diagnosis 2009).

Our meta-analysis found no evidence of publication bias, in contrast to prior studies that were based on smaller sample sizes than used here (e.g., Michaels 2008; Reckova and Irsova 2015). We did, however, discover some interesting patterns in the number of climate change articles published over time and, within journal articles, stylistic biases by authors with respect to reporting large, statistically significant effects. Finally, results are discussed in the context of the social responsibility borne by climate scientists and the challenges of communicating science to stakeholders and end users.

2 Materials and methods

Meta-analysis is a suite of data analysis tools that allows for the quantitative synthesis of results from numerous scientific studies and is now widely used in fields from medicine to ecology (Adams et al. 1997). Here, we randomly sampled articles from a broader body of literature about climate change in marine systems and extracted statistics summarizing the magnitude of effects, error, and experimental sample size for meta-analysis.

2.1 Data collection

We surveyed the scientific literature using ISI Web of Science, Scopus, and Biological Abstracts, and searched the reference sections of identified articles, for experimental results pertaining to climate change in ocean ecosystems. The search was performed with no restrictions on publication year, using different combinations of the terms: (acidification* AND ocean*) OR (acidification* AND marine*) OR (global warming* AND marine*) OR (global warming* AND ocean*) OR (climate change* AND marine* AND experiment*) OR (climate change* AND ocean* AND experiment*). The search was restricted to scientific journals with an impact factor of at least 3 (Journal Citation Reports science edition 2010).

We restricted our analysis to a sample of articles reporting (a) an empirical effect size between experimental and control groups, (b) a measure of statistical error, and (c) sample sizes for specific control and experimental groups (see Supplementary Material S1–S3 for identification of studies). We identified 120 articles from 31 scientific journals published between 1997 and 2013, with impact factors ranging from 3.04 to 36.104 (mean 6.58). Experimental results (n = 1154) were extracted from the main body of articles; 362 results were also retrieved from the articles’ abstracts or summary paragraphs.

Data from the main body of articles and from abstracts were analyzed separately to test for potential stylistic biases in how authors report key findings. The two datasets, hereafter designated the “main” dataset and the “abstract” dataset, were also divided into three time periods based on each article’s date of acceptance: pre-IPCC 2007 (through November 2007), after IPCC 2007/pre-Climategate (December 2007–November 2009), and after Climategate (December 2009–December 2012). We used November 2007 as the publication date of the IPCC Fourth Assessment Report, which was an updated version of the original February 2007 release.

We extracted graphical data using the software GraphClick v. 3.0 (2011). Each study could include several experimental results, which risks non-independence bias driven by studies contributing relatively large numbers of results. We therefore assessed the robustness of our meta-analysis by re-running the analysis multiple times with data subsets consisting of one randomly selected result per article (Hollander 2008), as sketched below.
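A minimal sketch of this robustness check follows (our own illustration in Python, not the software used in the study; the DataFrame columns `article_id` and `hedges_d` are hypothetical names for the extracted data):

```python
import numpy as np
import pandas as pd

def one_result_per_article(df, rng):
    """Keep one randomly chosen experimental result per article."""
    shuffled = df.sample(frac=1, random_state=rng)  # shuffle all rows
    return shuffled.groupby("article_id").head(1)   # first row per article = a random pick

def resampled_means(df, n_runs=100, seed=1):
    """Re-run the summary on many one-result-per-article subsets to check
    that conclusions are not driven by articles contributing many results."""
    rng = np.random.RandomState(seed)
    return [one_result_per_article(df, rng)["hedges_d"].mean()
            for _ in range(n_runs)]
```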

Experimental results reported in articles can be either negative or positive. To prevent composite mean values from equalling zero, we reversed the sign of negative effects so that all results were analyzed as positive effects (Hollander 2008). The reversed effect sizes do not generally follow a standard normal distribution, because negative effects are reflected around zero. Statistical significance was therefore assessed using bias-corrected 95% bootstrap confidence intervals produced by re-sampling tests with 9999 iterations, with a two-tailed critical value from Student’s t distribution. If the mean of one sample lay outside the 95% confidence interval of another, the null hypothesis that the subcategories did not differ was rejected (Adams et al. 1997; Borenstein et al. 2010). Hedges’ d was used to quantify the weighted effect size of climate change effects (Gurevitch and Hedges 1993):

$$ d=\frac{{\overline{X}}^{\mathrm{E}}-{\overline{X}}^{\mathrm{C}}}{s}J $$

Hedges’ d is the mean of the control group ($\overline{X}^{\mathrm{C}}$) subtracted from the mean of the experimental group ($\overline{X}^{\mathrm{E}}$), divided by the pooled standard deviation (s) and multiplied by a correction factor for small sample sizes (J). However, because sample sizes vary among studies, and variance is a function of sample size, some form of weighting was necessary. In other words, studies with larger sample sizes are expected to have lower variances and accordingly provide more precise estimates of the true population effect size (Hedges and Olkin 1985; Shadish and Haddock 1994; Cooper 1998). Therefore, a weighted average was used in the meta-analysis to estimate the cumulative effect size (weighted mean) for the sample of studies (see Rosenberg et al. 2000 for details).
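The calculations described above can be sketched as follows (a minimal Python illustration of the standard formulas in Gurevitch and Hedges 1993 and Rosenberg et al. 2000, not the MetaWin code actually used; function names are our own):

```python
import numpy as np
from scipy.stats import norm

def hedges_d(mean_e, mean_c, sd_e, sd_c, n_e, n_c):
    """Hedges' d: difference of means over the pooled SD, times the
    small-sample correction J (Hedges and Olkin 1985)."""
    s = np.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / (n_e + n_c - 2))
    j = 1.0 - 3.0 / (4.0 * (n_e + n_c - 2) - 1.0)
    return (mean_e - mean_c) / s * j

def var_d(d, n_e, n_c):
    """Sampling variance of d; larger studies have smaller variances."""
    return (n_e + n_c) / (n_e * n_c) + d**2 / (2.0 * (n_e + n_c))

def weighted_mean_d(d, v):
    """Cumulative (weighted mean) effect size with inverse-variance weights."""
    w = 1.0 / v
    return np.sum(w * d) / np.sum(w)

def bc_bootstrap_ci(d, v, n_boot=9999, alpha=0.05, seed=0):
    """Bias-corrected bootstrap CI for the weighted mean effect size."""
    rng = np.random.default_rng(seed)
    obs = weighted_mean_d(d, v)
    idx = rng.integers(0, d.size, size=(n_boot, d.size))
    boot = np.array([weighted_mean_d(d[i], v[i]) for i in idx])
    z0 = norm.ppf((boot < obs).mean())          # bias-correction constant
    lo, hi = norm.cdf(2 * z0 + norm.ppf([alpha / 2, 1 - alpha / 2]))
    return np.quantile(boot, [lo, hi])
```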

2.2 Funnel plot and fail-safe sample sizes

Funnel plots are sensitive to heterogeneity, which makes them effective for visually detecting systematic heterogeneity in the publication record. For example, funnel plot asymmetry has long been equated with publication bias (Light and Pillemer 1984; Begg and Mazumdar 1994), whereas a symmetric inverted funnel is diagnostic of a “well-behaved” data set in which publication bias is unlikely. Initial funnel plots showed large amounts of variation along the y-axis (Hedges’ d) across the length of the x-axis (sample size), which could obscure diagnostic features of the funnel around the mean. To improve the detectability of publication bias, should one truly exist, we transformed Hedges’ d to the Pearson correlation coefficient (r), which condensed extreme values on the y-axis and bounded the measure of effect size between 0 and ±1 (Palmer 2000; Borenstein et al. 2009). The transformation thus converted the measure of effect size from a standardized mean difference (d) to a correlation (r):

$$ r=\frac{d}{\sqrt{d^2+a}} $$

where a is the correction factor for cases where $n_1 \neq n_2$ (Borenstein et al. 2009).
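A one-function sketch of this transformation (our illustration; following Borenstein et al. 2009, the correction factor is $a = (n_1 + n_2)^2 / (n_1 n_2)$, which reduces to 4 when group sizes are equal):

```python
import numpy as np

def d_to_r(d, n1, n2):
    """Convert a standardized mean difference (Hedges' d) to a correlation r
    (Borenstein et al. 2009).  a corrects for unequal group sizes."""
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / np.sqrt(d ** 2 + a)
```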

Both funnel plots and fail-safe sample sizes were inspected to test for under-reporting of non-significant effects, following Palmer (2000). Extreme publication bias (caused by under-reporting of non-significant results) would appear as a hole or data gap in a funnel plot. In the absence of bias, the density of points should be greatest around the mean value and normally distributed around the mean at all sample sizes. To help visualize the threshold between significant and non-significant studies, 95% significance lines were calculated for the funnel plots.
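For the fail-safe sample size, a minimal sketch of Rosenthal’s (1979) classic formulation follows (our illustration; MetaWin’s fail-safe calculations, e.g., the weighted method of Rosenberg 2005, differ in detail):

```python
import numpy as np
from scipy.stats import norm

def rosenthal_failsafe_n(p_values, alpha=0.05):
    """Number of unpublished null results needed to raise the combined
    one-tailed p-value above alpha (Rosenthal 1979)."""
    z = norm.isf(np.asarray(p_values))   # one-tailed Z-score for each result
    k = z.size
    z_alpha = norm.isf(alpha)            # 1.645 for alpha = 0.05
    # Solve sum(Z)^2 / (k + N) = z_alpha^2 for N
    return max(0.0, (z.sum() ** 2) / z_alpha ** 2 - k)
```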

Robust Z-scores were used to identify possible outliers in the dataset, as such values could distort the mean and make the conclusions of a study less accurate or even incorrect. Instead of the dataset mean, robust Z-scores use the median, which has a higher breakdown point and is therefore more resistant to outliers than regular Z-scores (Rousseeuw and Hubert 2011). The cause of each identified outlier was carefully investigated before any value was excluded from the dataset (Table S1).
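A sketch of a common median-based formulation (our illustration; it scales by the median absolute deviation, MAD, a robust alternative to the standard deviation discussed by Rousseeuw and Hubert 2011):

```python
import numpy as np

def robust_z(x):
    """Robust Z-scores: centre on the median and scale by the MAD.
    The 0.6745 factor makes the MAD comparable to the standard
    deviation under normality."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad
```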

All data were analyzed with MetaWin v. 2.1.5 (Sinauer Associates Inc. 2000), and graphs were created using DataGraph v. 3.1.2 (VisualDataTools Inc. 2013).

3 Results

3.1 Publication bias

For each of the three time periods considered in this study (prior to IPCC-AR4 2007, after IPCC-AR4 2007 and before Climategate 2009, and after Climategate 2009), the funnel plots showed no evidence of statistically non-significant results being under-represented (Fig. 1); there were no holes around Y = 0, nor were there conspicuous gaps or holes in other parts of the funnels (Fig. 1). Large fail-safe sample sizes confirmed that the effect sizes were robust and that no publication bias was detected (Rosenthal 1979). We further tested the robustness of the results by re-sampling single results from articles and reproducing the funnel plots (see Supplementary Material Fig. S4 a–j).

Fig. 1

Funnel plots representing effect size (r) as a function of sample size (N). The shaded areas represent results that were not statistically significant in the main dataset. a Pre-IPCC 2007 (n = 265). b After IPCC 2007/pre-Climategate (n = 345). c After Climategate (n = 544). n denotes the number of experiments

3.2 Number of studies, effect size, and abstract versus main

The number of articles about climate change in ocean ecosystems increased annually from 1997, peaked within 2 years after IPCC 2007, and subsided after Climategate 2009 (Fig. 2). Before Climategate, reported effect sizes were significantly larger in article abstracts than in the main body of articles, suggesting a systematic bias in how authors communicate results in scientific articles: Large, significant effects were emphasized where readers are most likely to see them (in abstracts), whereas small or non-significant effects were more often found in the technical results sections, where we presume they are less likely to be seen by the majority of readers, especially non-scientists. Moreover, between IPCC 2007 and Climategate, when publication rates on ocean climate change were greatest, the difference between effect sizes reported in abstracts and in the body of reports was also greatest (Fig. 2). After Climategate, publication rates on ocean climate change fell, the magnitude of reported effect sizes in abstracts diminished, and the difference in effect sizes between abstracts and the body of reports returned to a level comparable to pre-IPCC 2007 (Fig. 2).

Fig. 2

Publication rate. a Number of published reports for each year. The two vertical grey bars illustrate the timing of IPCC 2007 and Climategate 2009. b Cumulative effect sizes of Hedges’ d and bias-corrected 95% bootstrap confidence intervals for the magnitude of climate-change effects. Mean effect sizes are computed separately for results presented in abstracts and in the main body of articles. N sample size. Pre-IPCC 2007 main dataset: d = 1.46; CI = 1.30–1.63; df = 264, FSN = 36,299. Abstract dataset: d = 2.08; CI = 1.73–2.51; df = 62, FSN = 3475: P < 0.05. After IPCC 2007/pre-Climategate main dataset: d = 1.87; CI = 1.69–2.06; df = 344, FSN = 79,576. Abstract dataset: d = 2.82; CI = 2.41–3.31; df = 118, FSN = 11,557: P < 0.05. After Climategate main dataset: d = 1.72; CI = 1.59–1.88; df = 543, FSN = 214,674. Abstract dataset: d = 2.14; CI = 1.85–2.46; df = 176, FSN = 26,480: P = n.s. d Hedges’ d, CI bias-corrected 95% confidence intervals, df degrees of freedom (one less than the total sample), FSN fail-safe numbers, P probability that abstract results and main text results differ, n.s. not significant

3.3 Impact factor

Journals with an impact factor greater than 9 published significantly larger effect sizes than journals with an impact factor below 9 (Fig. 3). Regardless of impact factor, journals reported significantly larger effect sizes in abstracts than in the main body of articles; however, the difference between mean effects in abstracts and in the body of articles was greater for journals with higher impact factors. We also detected a small, yet statistically significant, negative relationship between reported sample size and journal impact factor, which was largely driven by the large effects reported in high impact factor journals (Fig. 4). Despite the larger effect sizes, journals with high impact factors published results based on generally smaller sample sizes.

Fig. 3

Cumulative effect sizes of Hedges’ d and bias-corrected 95% bootstrap confidence intervals for the magnitude of climate change effects for journals with impact factor above or below 9. Results are computed separately for data from abstracts and the main body of reports. N denotes the sample size. IF < 9 main dataset: d = 1.60; CI = 1.51–1.69; df = 1042, FSN = 696,107, P < 0.05. Abstract dataset: d = 2.04; CI = 1.86–2.24; df = 316, FSN = 83,671, P < 0.05. IF > 9 main dataset: d = 2.65; CI = 2.20–3.23; df = 111, FSN = 10,131, P < 0.05. Abstract dataset: d = 5.27; CI = 3.66–7.50; df = 44, FSN = 2298, P < 0.05. Abbreviations as in Fig. 2 legend

Fig. 4

Relationship between journal impact factor and sample size for experimental results (R² = 0.004; P < 0.05)

4 Discussion

Our meta-analysis did not find evidence of small, statistically non-significant results being under-reported in our sample of climate change articles. This result contrasts with the findings of Michaels (2008) and Reckova and Irsova (2015), who both reported publication bias in the global climate change literature, albeit with smaller meta-analytic sample sizes and in other sub-disciplines of climate change science. Michaels (2008) examined articles from Nature and Science exclusively, and therefore his results were influenced strongly by the editorial positions of these high impact factor journals with respect to reporting climate change issues. We believe that the results presented here have added value because we sampled a broader range of journals, including some with relatively low impact factors, which is probably a better representation of potential biases across the entire field of study. Moreover, many end users and stakeholders of science, including other scientists and public officials, base their research and opinions on a much broader suite of journals than Nature and Science.

However, our meta-analysis did find multiple lines of evidence of biases within our sample of articles, which were perpetuated in journals of all impact factors and related largely to how science is communicated: The large, statistically significant effects were typically showcased in abstracts and summary paragraphs, whereas the lesser effects, especially those that were not statistically significant, were often buried in the main body of reports. Although the tendency to isolate large, significant results in abstracts has been noted elsewhere (Fanelli 2012), here we provide the first empirical evidence of such a trend across a large sample of literature.

We also discovered a temporal pattern in reporting biases, which appeared to be related to seminal events in the climate change community and may reflect a socio-economic driver of the publication record. First, there was a conspicuous rise in the number of climate change publications in the 2 years following IPCC 2007, which likely reflects the rise in popularity (among the public and funding agencies) of this field of research and the increased appetite among journal editors to publish these articles. Concurrent with increased publication rates was an increase in reported effect sizes in abstracts. Perhaps coincidentally, the apparent popularity of climate change articles (i.e., the number of published articles and the reported effect sizes) plummeted shortly after Climategate, when the world media focused its scrutiny on this field of research (Fig. 2). After Climategate, reported effect sizes also dropped, as did the difference between effects reported in abstracts and in the main body of articles. The positive effect we see after IPCC 2007, and the negative effect after Climategate, may illustrate a combined effect of editors’ or referees’ publication choices and researchers’ propensity to submit articles or not. However, since meta-analysis is correlative, it does not elucidate the mechanisms underlying the observed patterns.

Similar stylistic biases were found when comparing articles from journals with high impact factors to those with low impact factors. High impact factors were associated with significantly larger reported effect sizes (and smaller sample sizes; see Fig. 4); these articles also had a significantly larger difference between effects reported in abstracts and in the main body of their reports (Fig. 3). This trend appears to be driven by a small number of journals with large impact factors; however, the result is consistent with previous studies. For example, our results corroborate others in showing that high impact journals typically report large effects based on small sample sizes (Fraley and Vazire 2014) and that high impact journals have shown publication bias in climate change research (Michaels 2008; further discussed in Radetzki 2010).

Stylistic biases are less concerning than a systematic tendency to under-report non-significant effects, provided researchers read entire reports before formulating theories. However, most audiences, especially non-scientific ones, are more likely to read only article abstracts or summary paragraphs, without perusing technical results. The onus of effectively communicating science does not fall entirely on the reader; rather, it is the responsibility of scientists and editors to remain vigilant, to understand how biases may pervade their work, and to be proactive about communicating science to non-technical audiences in transparent and unbiased ways. Ironically, articles in high impact journals are those most cited by other scientists; therefore, the practice of sensationalizing abstracts may bias scientific consensus too, if many scientists also rely too heavily on abstracts during literature reviews and do not spend sufficient time delving into the lesser effects reported elsewhere in articles.

Despite our sincerest aim of using science as an objective and unbiased tool to record natural history, we are reminded that science is a human construct, often driven by human needs to tell a compelling story, to reinforce the positive, and to compete for limited resources; the publication trends and communication biases documented here are proof of that.