Why do scientists fabricate and falsify data? A matched-control analysis of papers containing problematic image duplications

It is commonly hypothesized that scientists are more likely to engage in data falsification and fabrication when they are subject to pressures to publish, when they are not restrained by forms of social control, when they work in countries lacking policies to tackle scientific misconduct, and when they are male. Evidence to test these hypotheses, however, is inconclusive due to the difficulties of obtaining unbiased data. Here we report a pre-registered test of these four hypotheses, conducted on papers that were identified in a previous study as containing problematic image duplications through a systematic screening of the journal PLoS ONE. Image duplications were classified into three categories based on their complexity, with category 1 being most likely to reflect unintentional error and category 3 being most likely to reflect intentional fabrication. Multiple parameters connected to the hypotheses above were tested with a matched-control paradigm, by collecting two controls for each paper containing duplications. Category 1 duplications were mostly not associated with any of the parameters tested, in accordance with the assumption that these duplications were mostly not due to misconduct. Category 2 and 3, however, exhibited numerous statistically significant associations. Results of univariable and multivariable analyses support the hypotheses that academic culture, peer control, cash-based publication incentives and national misconduct policies might affect scientific integrity. Significant correlations between the risk of image duplication and individual publication rates or gender, however, were only observed in secondary and exploratory analyses. Country-level parameters generally exhibited effects of larger magnitude than individual-level parameters, because a subset of countries was significantly more likely to produce problematic image duplications. 
Promoting good research practices in all countries should be a priority for the international research integrity agenda.


INTRODUCTION
Office of Research Integrity [20]. However, other interpretations of these data have been proposed [21] and gender did not significantly predict the likelihood to produce a retracted or corrected paper, once various confounders were adjusted for in a matched-control analysis [12].

Progress in assessing the validity of these hypotheses in explaining the prevalence of misconduct has been hampered by difficulties in obtaining reliable data. A primary source of evidence about scientific misconduct is represented by anonymous surveys. These, however, are very sensitive to methodological choices and, by definition, report what a sample of voluntary respondents think and are willing to declare in surveys, not necessarily what the average scientist actually thinks and does [1,3,22]. Retractions of scientific papers, most of which are due to scientific misconduct [23], offer a pool of actual cases whose analyses have given important insights (e.g. [12,23-25]). Results obtained on retractions, however, may not be generalizable, because retractions still constitute a very small fraction of the literature and by definition are the result of a complex process that can be influenced by multiple contingent factors, such as level of scrutiny of a literature, presence of retraction policies in journals, and the scientific community's willingness to act [26].

An unprecedented opportunity to probe further into the nature of scientific misconduct is offered by a recent dataset of papers that contain image duplications of a questionable or manifestly fraudulent nature, i.e. Bik et al. 2016 [5]. These papers were identified by direct visual inspection of 20,621 papers that contained images of Western blots, nucleic acid gels, flow cytometry plots, histopathology or other forms of image that were published between the years 1995 and 2015 in 40 journals. Having been obtained by a systematic screening of the literature, this sample is free from most limitations and biases that affect survey and retraction data, and therefore offers a representative picture of errors and/or misconduct in the literature, at least with regard to image duplications in biological research. Descriptive analyses of these data have yielded new insights into the rate of scientific misconduct and its relative prevalence amongst different countries.

We conducted a pre-registered analysis (osf.io/w53yu) of data from [5] [...] to be most relevant in category 2 and 3 duplications and to have little or no effect on category 1 errors.

For each paper containing duplicated images, we identified two controls that had been published in the same journal and time period, and that contained images of Western blots without detectable signs of duplication. We then measured a set of variables that were relevant to each of the hypotheses listed above, and used logistic regression to test whether and how these variables were associated with the risk of committing scientific misconduct.

RESULTS
Figure 1 reports the effects in each category of duplication of each tested parameter (i.e. odds ratio and 95% confidence interval), grouped by each composite hypothesis, with an indication of the direction of effect predicted by that hypothesis. In line with our overall predictions, Category 1 duplications yielded a null association with nearly all of the parameters tested (Figure 1, green error bars), and/or yielded markedly different effects from Category 2 and Category 3 papers (Figure 1, orange and red bars, respectively). Sharp and highly significant differences between effects measured on the latter and the former duplication categories were observed for authors' citation scores and journal scores (Fig 1a), and for several country-level and team-level parameters (i.e. Fig 1b-e). No significant difference was observed amongst gender effects, except for a tendency of Category 3 duplications to be more common amongst female authors (Fig 1f).

Differences between effects measured on Category 2 and 3 duplications were not always consistent with our prediction that Category 3 duplications should exhibit the largest effect sizes. For example, the number of years of activity of the author was only significantly associated with Category 2 duplications (Fig 1a). In most cases, however, the confidence intervals of effects measured for Categories 2 and 3 were largely overlapping, suggesting that differences between Category 2 and 3 might be due to the smaller sample size (lower statistical power) achieved for the latter category. Overall, therefore, results of univariable analyses are consistent with our predictions and confirm the original assessment of the status of these categories suggested by Bik et al. (2016) [...]

Results of univariable tests combining Category 2 and 3 papers together are in good agreement with the social control hypothesis (Fig 2c) and partial agreement with the misconduct policy hypothesis (Fig 2e). The gender hypothesis was not supported (Fig 2f). The pressures to publish hypothesis was either not supported or contradicted by most analyses. In agreement with some predictions, the risk of misconduct was higher in countries in which publications are rewarded by cash incentives (Fig 2b) and was lower for researchers with a shorter publication time-span (i.e. presumably early-career researchers, Fig 2a). Contrary to predictions, however, the risk of misconduct was lower for authors with higher journal score (Fig 1a) and in countries with publication incentive policies that are career-based and institutional-based, despite the fact that the latter are those where pressures to publish are said to be highest [10].

Overall, country-level parameters produced effects of larger magnitude (Fig 2). Indeed, we observed sharp differences between countries with regard to the risk of duplication (Fig 3). Compared to the United States, the risk was significantly higher in China, India, Argentina and other developing countries (i.e. all those included in the "other" category, Fig 3). Multiple other countries (e.g. Belgium, Austria, Brazil, Israel, etc.) also appeared to have higher average risk than the United States, but the very small number of studies from these countries hampered statistical power and thus our ability to draw any conclusion. Germany and Australia tended to have lower risk than the United States, but only Japan had a statistically significant lower risk (Fig 3).

To reduce the possible confounding effect of country, we performed secondary analyses on subsamples of countries with relatively homogeneous cultural and economic characteristics (Fig S1). Such sub-setting appeared to improve the detection of individual-level variables. In particular, the risk of duplication appeared to be positively associated with authors' publication rate, citation score, journal score and female gender (Fig S1a-h, and see SM for all numerical results). These effects, however, were never formally statistically significant in such univariable analyses.

Secondary multivariable analyses, however, corroborated all of our main results (Fig 4). A model that included individual parameters, as well as an interaction term between number of authors and number of countries (in place of the country-to-author ratio, which is not independent from the number of authors) and country-level parameters of publication and misconduct policies suggested that the risk of misconduct was predominantly predicted by country and team characteristics (Fig 4a). The risk was significantly higher in countries with cash-based publication incentives, lower in those with national misconduct policies, and grew with team size as well as with number of authors, with the latter two factors modulating each other: for a given distance, larger teams were less at risk from misconduct, as the social control hypothesis predicted (Fig 4a).

When limited to English-speaking and EU15 countries, multivariable analyses of individual and team characteristics supported most theoretical predictions, suggesting that misconduct was more likely in long-distance collaborations and amongst early-career, highly productive and high-impact first-authors (Fig 4b). Female first authors were significantly more at risk of being associated with Category 2 and 3 problems, a finding that is inconsistent with the gender hypothesis. Analyses on the remaining subset of countries yielded similar results (Fig 4c).

Almost identical results were obtained with a non-conditional logistic regression model, consistent with the fact that our sample was homogeneous with regard to important characteristics such as journal, methodology and year of publication. Results obtained combining all three categories of duplications were largely overlapping with those presented in the main text and would have led to similar conclusions (see all numerical results in SI).

To the best of our knowledge, this is the first direct test of hypotheses about the causes of scientific misconduct that was conducted on an unbiased sample of papers containing flawed or fabricated data. Our sample represented papers containing honest errors and intentional fabrications of various degrees in unknown relative proportions. However, we correctly predicted that Category 1 duplications would exhibit smaller or null effects, whilst most significant effects, if observed at all, would be observed in Categories 2 and 3 (Fig 1). Support of this prediction retrospectively confirms that, as suggested by a previous analysis of these data [5], Category 1 duplications are most likely the result of unintentional errors or flawed methodologies, whilst Category 2 and 3 duplications are likely to contain a significant proportion of intentional fabrications.

Results obtained on Category 2 and 3 papers, corroborated by multiple secondary analyses (see SI), supported some predictions of the hypotheses tested, but did not support or openly contradicted others:

- Pressure to publish hypothesis: partially supported. Early-career researchers, and researchers working in countries where publications are rewarded with cash incentives, were at higher risk of image duplication, as predicted. However, countries having other publication incentive policies had a null or even negative association with risk (Fig 1b). In further refutation of predictions, individual publication rate and impact of authors were either not associated or negatively associated with image duplication, although in secondary multivariable analyses we observed a positive association between publication rate of first authors and risk of duplication. The latter finding might represent the first direct support of this prediction, but should be verified in future confirmatory tests. The correlation with cash incentives should not be taken to imply that such incentives were directly involved in the problematic image duplications, but simply that such incentives may reflect the value system in certain research communities that might incentivize questionable research practices.

- Social control hypothesis: supported. In univariable analyses, only predictions based on socio-cultural conditions of different countries were in large agreement with observations (Fig 1c). However, when country characteristics were controlled and/or adjusted for, we observed a consistent negative interaction between number of authors and number of countries per author in a paper, which is in good agreement with the hypothesis (Fig 4).

- Misconduct policy hypothesis: partially supported. Countries with national and legally enforceable policies against scientific misconduct were significantly less likely to produce image duplications (Fig 1e, Fig 4a). However, other misconduct policy categories were not associated with a reduced risk of image duplication, and tended if anything to have a higher risk. As noted above for publication incentive policies, we cannot prove a cause-effect relationship. The presence of national misconduct policies may simply reflect the greater attention that a country's scientific community pays to research integrity.

- Gender hypothesis: not supported. In none of the main and secondary analyses did we observe the predicted higher risk for males. Some of the secondary analyses might have found an association between female authors and the risk of image duplication (Fig 4b). This latter finding, however, needs to be validated in future confirmatory studies.

A previous, analogous analysis conducted on retracted and corrected papers had led to largely similar conclusions [12]. The likelihood to correct papers for unintentional errors was not associated with most parameters, similarly to what this study observed for category 1 duplications. The likelihood to retract papers, instead, was also found to be significantly associated with misconduct policies, academic culture, as well as early-career status and average impact score of first or last author. Differently from what this study observed on image duplications, however, individual publication rate was negatively associated with the risk of retraction and positively with that of corrections [12]. We hypothesize that at least two factors may underlie this difference in results. First, analyses on retractions included every possible error and form of misconduct, including plagiarism, whereas the present analysis is dedicated to a very specific form of error or manipulation. Second, analyses on retractions are intrinsically biased and subject to many confounding factors, because retractions are the end results of a complex chain of events (e.g. a reader signals a possible problem to the journal, the journal editor contacts the author, the author's institution starts an investigation, etc.) which can be subjected to many sources of noise and distortion. Therefore, whilst on the one hand our results may be less generalizable, on the other hand they are more accurate and less biased than results obtained on retractions.

A remarkable agreement was also observed between these results and those of a recent assessment of the causes of bias in science, authored by two of us [14]. This latter study tested similar hypotheses using identical independent variables on a completely different outcome (the likelihood to over-estimate results in meta-analysis) and using a completely different study design. Therefore, the convergence of results with this latter study is even more striking and strongly suggests that all these separate analyses are detecting genuine underlying patterns that reflect a connection between research integrity and characteristics of authors, team and country.

The present study has avoided many of the confounding factors that limit studies on retractions, but could not avoid other limitations. An overall limitation concerns the kind of image duplication analyzed in this study, which is only one of the many possible forms of data falsification and fabrication that may occur in the literature. This restriction limits in principle broad generalizations. However, as noted above, our results are in large agreement with previous analyses that encompassed all forms of bias and misconduct [12,14], which suggests that our findings are consistent with general patterns linked to these phenomena.

Two other possible limitations of our study design made results very conservative. Firstly, we could not ascertain which of the duplications were actually due to scientific misconduct and which ones derived from honest error, systematic error or negligence. Secondly, our individual-level analyses focused on characteristics of the first and the last author, under the assumption that authors in these positions are most likely to be responsible for any flaws in a publication. However, we do not know who, amongst the co-authors of included studies, was actually behind the problematic duplication. Both these limitations ought to increase confidence in our results, because they are likely to have reduced the magnitude of measurable effect sizes. As our univariable analyses confirmed, image duplications that are due not to scientific misconduct but to unintentional error are unlikely to be associated with any factor (Fig 1). Similarly, if an image duplication was not caused by its study's first or last author, then we simply would not expect the characteristics of first and last author to be associated with the likelihood of that error. Therefore, to any extent that they affected the study, these two limitations have introduced random noise in our data, reducing the magnitude of any measurable effect and thus making our results more conservative.

Any random noise in our data might have reduced the statistical power of our analyses, for the reasons discussed above. However, our statistical power was relatively large. Even when restricted to the smallest subset (e.g. category 3 duplications) our analyses had over 89% power to detect an effect of small magnitude. We can therefore conclude that, despite the limitations discussed above, all of our tests had sufficient power to reject null hypotheses for at least large and medium effect sizes.

A further possible limitation in our analysis pertains to the accuracy with which we could measure individual-level parameters. Our ability to correctly classify the gender and to reconstruct the publication profile of each author was subject to standard disambiguation errors [27], which may be higher for authors in certain subsets of countries. In particular, authors from South- and East-Asian countries have names that are difficult to classify, and often publish in local journals that are not indexed in the Web of Science and were therefore not captured by our algorithms. Any systematic bias or error in quantifying parameters for authors from these countries would significantly skew our results because country-level factors were found in this study, as well as in previous studies on retractions, to have significant effects [12]. However, all our main conclusions are based on effects that were measured consistently in subsets of authors based on countries at lower risk of disambiguation error. Moreover, this limitation is only likely to affect the subset of tests that focused on author characteristics.

Indeed, this study suggests that significant individual-level effects might not be detectable unless country-level effects are removed or adjusted for. This prominence of country-level effects in determining the risk of problematic image duplications might be one of the most important findings of this study. We observed clear and indisputable evidence that problematic image duplications are overwhelmingly more likely to come from China, India and other developing countries, consistent with the original interpretation of these data [5]. Regardless of whether the image duplications that we have examined in this study were due to misconduct or unintentional error, country-level effects suggest that particular efforts might be needed to improve the reliability of studies from developing countries.

Previous analyses on retractions, corrections and bias [12,14] as well as the present analysis of image duplications cannot demonstrate causality. However, all these analyses consistently suggest that developing national misconduct policies and fostering an academic culture of mutual criticism might be effective preventive measures to ensure the integrity of future research.

MATERIALS AND METHODS
Methods of this study very closely followed the protocol of a previous analysis of risk factors for retractions and corrections [12]. To guarantee the confirmatory and unbiased nature of our analyses, all main and secondary analyses as well as sampling and analytical methodology were pre-specified and registered at the Center for Open Science (osf.io/w53yu) [28]. The main goal of the analysis was to produce a matched-control retrospective analysis aimed at identifying which characteristics of papers and their authors were significantly predictive of the likelihood to fall into the "treatment" as opposed to "control" category (papers with or without problematic image duplications, respectively).

Sampling of papers
Papers had been identified by the independent assessment of three of the present paper's authors (EB, AC, FF). Control papers were retrieved from the set of papers that had been examined by the authors and in which no evidence of data duplication had been recognized. For each treatment paper, two controls were retrieved for inclusion, i.e. one published immediately before and one immediately after the treatment paper. Order of publication was determined based on Web of Science's unique identifier code. When the candidate control paper of one treatment paper coincided with the candidate control of another treatment paper, the next available control paper was selected instead.
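The control-selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `pick_controls` and its arguments are hypothetical, the input list is assumed to be already sorted by Web of Science identifier, and "next available" is assumed to mean the next unused eligible paper further away in the same direction, which the text does not specify.

```python
from typing import Dict, List, Sequence, Set


def pick_controls(ordered_ids: Sequence[str],
                  treatment_ids: Set[str],
                  eligible_ids: Set[str]) -> Dict[str, List[str]]:
    """For each treatment paper, pick the nearest unused eligible paper
    published before it and the nearest one published after it.

    ordered_ids   -- all screened papers, sorted by publication order
    treatment_ids -- papers with problematic image duplications
    eligible_ids  -- screened papers with no detected duplication
    """
    position = {pid: i for i, pid in enumerate(ordered_ids)}
    used: Set[str] = set()          # controls already assigned to a treatment paper
    controls: Dict[str, List[str]] = {}

    def nearest(start: int, step: int) -> List[str]:
        # Walk away from the treatment paper (step = -1: earlier, +1: later)
        # until an eligible, not-yet-used paper is found.
        i = start + step
        while 0 <= i < len(ordered_ids):
            pid = ordered_ids[i]
            if pid in eligible_ids and pid not in treatment_ids and pid not in used:
                used.add(pid)
                return [pid]
            i += step
        return []                   # no control left in this direction

    for tid in sorted(treatment_ids, key=position.__getitem__):
        at = position[tid]
        controls[tid] = nearest(at, -1) + nearest(at, +1)
    return controls


if __name__ == "__main__":
    papers = ["p1", "p2", "p3", "p4", "p5", "p6"]
    print(pick_controls(papers, {"p3", "p4"}, {"p1", "p2", "p5", "p6"}))
```

In this toy run the two adjacent treatment papers would otherwise compete for the same neighbours, so the second one is given the next unused papers on each side.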

Data collection
Following previous protocols [12,14], we collected a set of relevant characteristics of all included papers and of all of their authors. More specifically, for each paper we recorded:

- Number of authors of each paper.

- Number of countries listed in the authors' addresses.

- Average distance between author addresses, expressed in thousands of kilometers. Geographic distance was calculated based on a geocoding of affiliations covered in the Web of Science [29].
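As a rough illustration of the three paper-level variables listed above, the sketch below computes them from a list of author records. It rests on assumptions that are not in the original text: each author is represented by a hypothetical `(country, (latitude, longitude))` pair, "average distance" is taken to be the mean great-circle distance over all pairs of authors, and the haversine formula with a mean Earth radius of 6371 km stands in for whatever the geocoding in [29] actually used.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float]          # (latitude, longitude) in degrees
EARTH_RADIUS_KM = 6371.0             # assumed mean Earth radius


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def mean_pairwise_distance_thousand_km(coords: Sequence[Coord]) -> float:
    """Mean distance over all author pairs, in thousands of kilometers."""
    pairs = list(combinations(coords, 2))
    if not pairs:                    # single-author paper: no pair to measure
        return 0.0
    return sum(haversine_km(a, b) for a, b in pairs) / len(pairs) / 1000.0


def team_parameters(authors: List[Tuple[str, Coord]]) -> Dict[str, float]:
    """Paper-level variables: team size, country count, mean distance."""
    return {
        "n_authors": len(authors),
        "n_countries": len({country for country, _ in authors}),
        "mean_distance_thousand_km":
            mean_pairwise_distance_thousand_km([c for _, c in authors]),
    }


if __name__ == "__main__":
    team = [("US", (38.9, -77.0)), ("CN", (39.9, 116.4)), ("US", (37.8, -122.4))]
    print(team_parameters(team))
```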

For each author of each paper in the sample we retrieved the following data from the Web of Science:

- Year of first and last paper recorded in the Web of Science.

- Total number of articles, letters and review papers authored or co-authored.

- Total number of citations received by all papers authored or co-authored.

- Proportion of papers authored or co-authored that appeared in the top-10 journals of that author's field.

- Author's main country of activity, based on the address most commonly indicated.

- Author's first name. The combination of first name and country was used to assign gender. The majority of gender assignments were made by a commercial service (genderapi.com) but an attempt was made to identify the gender of unassigned names. When neither approach could attribute an author's gender reliably, gender was assigned to the "unknown" category.

Country information was used to assign each author to the corresponding country-level variable, using the following scheme: [...] BD), data based on references in [12].

Although we collected information for all authors of the papers, we only tested individual predictors measured on the first and last authors, positions that in biomedical papers tend to be attributed to the authors that most contributed to the research, often in the role of junior and senior author, respectively [31,32].

Analyses
All variables were included in the analysis untransformed, although a few variables were re-scaled linearly: author publication rate data was divided by 10, geographic distance data was divided by 1000, and countries-to-author ratio was multiplied by 100.
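A short derivation, added here as a standard property of regression models rather than something stated in the original text, shows what a linear re-scaling of a predictor does. If a predictor $x$ is replaced by $x' = x/c$ for a constant $c > 0$ (for instance $c = 10$ for publication rate), the fitted coefficient, its standard error, the test statistic and the odds ratio transform as:

```latex
\beta' = c\,\beta, \qquad
\mathrm{SE}(\beta') = c\,\mathrm{SE}(\beta), \qquad
z' = \frac{\beta'}{\mathrm{SE}(\beta')} = \frac{\beta}{\mathrm{SE}(\beta)} = z, \qquad
\mathrm{OR}' = e^{\beta'} = \mathrm{OR}^{\,c}.
```

The odds ratio is then simply expressed per $c$ units of the original variable (e.g. per 10 papers), while the test statistic and P-value stay the same.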

This re-scaling of some variables served the purpose of improving the visibility of effect sizes in figures and had no impact on the results.

All hypotheses were tested using standard conditional logistic regression analysis, i.e. a logistic regression model with an added "stratum" term that identifies each subgroup of treatment and matched controls. The conditional logistic regression approach is most useful when papers differ widely in important characteristics, such as year and journal of publication (see [12]). Analyses were also repeated with a non-conditional logistic regression to assess the robustness of the results. Analyses were conducted with all three categories of duplication combined, separately on category 1, category 2 and category 3 papers, and combining categories 2 and 3.

Since the sample size was pre-determined, we did not conduct a prospective power analysis. [...]

[...] authors on the likelihood to publish a paper containing a Category 1 (green), Category 2 (yellow) or Category 3 (red) problematic image duplication. When six error bars are associated with one test, the first three error bars correspond to data from the first author and the last three are for data from the last author. Panels are subdivided according to overall hypothesis tested, and signs in parentheses indicate direction of expected effect (">": OR>1; "<": OR<1; "0": intermediate effect predicted).

[...] authors on the likelihood to publish a paper containing a Category 2 or 3 problematic image duplication. For each individual-level parameter, first and second error bars correspond to data from first and last authors, respectively. Panels are subdivided according to overall hypothesis tested, and signs in parentheses indicate direction of expected effect (">": OR>1; "<": OR<1; "0": intermediate effect predicted). Formal thresholds of statistical significance are added above each error bar to facilitate effect estimation ("+": P<0.1; "*": P<0.05; "**": P<0.01; "***": P<0.001).

[...] the likelihood to publish a paper containing a Category 2 or 3 problematic image duplication, compared to authors working in the United States. The data were produced with a multivariable logistic regression model, in which dummy variables are attributed to countries that were associated with the first or last author of at least one treatment and one control paper. All other countries were included in the "other" category. Numeric data are raw numbers of treatment and control papers for first and last author (upper and lower row, respectively). Formal thresholds of statistical significance are added above each error bar to facilitate effect estimation ("+": P<0.1; "*": P<0.05; "**": P<0.01; "***": P<0.001).

[...] author on the probability of publishing a paper containing a Category 2 or 3 problematic image duplication. Each subpanel illustrates results of a single multivariable model, partitioned by country subsets (see text for further details). First and second error bars correspond to data from first and last authors, respectively. Signs in parentheses indicate direction of expected effect (">": OR>1; "<": OR<1). Formal thresholds of statistical significance are added above each error bar to facilitate effect estimation ("+": P<0.1; "*": P<0.05; "**": P<0.01; "***": P<0.001).

[...] probability of publishing a paper containing a Category 2 or 3 image duplication. Each subpanel shows results of univariable analyses on subsets of countries (see text for further details). First and second error bars correspond to data from first and last authors, respectively. Panels are subdivided according to overall hypothesis tested, and signs in parentheses indicate direction of expected effect (">": OR>1; "<": OR<1). Formal thresholds of statistical significance are added above each error bar to facilitate effect estimation ("+": P<0.1; "*": P<0.05; "**": P<0.01; "***": P<0.001).
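The conditional logistic regression described under Analyses, and the odds ratios with 95% confidence intervals shown in the figures, can be illustrated with a minimal single-predictor sketch. This is not the authors' code: the function names `fit_conditional_logit` and `odds_ratio_ci` are hypothetical, the model is restricted to one covariate for brevity, and the fit uses a plain Newton-Raphson iteration on the conditional likelihood, in which each matched set contributes the probability that the treatment paper, rather than one of its controls, is the case.

```python
import math
from typing import Sequence, Tuple

Stratum = Tuple[float, Sequence[float]]   # (x of treatment paper, x of its controls)


def fit_conditional_logit(strata: Sequence[Stratum],
                          max_iter: int = 50,
                          tol: float = 1e-10) -> Tuple[float, float]:
    """Fit a one-predictor conditional logistic model.

    Returns (beta, standard error). Within each stratum the likelihood is
    exp(beta * x_case) / sum_j exp(beta * x_j) over case and controls.
    """
    beta = 0.0
    info = 0.0
    for _ in range(max_iter):
        score = 0.0                        # first derivative of log-likelihood
        info = 0.0                         # minus the second derivative
        for x_case, x_controls in strata:
            xs = [x_case, *x_controls]
            weights = [math.exp(beta * x) for x in xs]
            total = sum(weights)
            mean = sum(w * x for w, x in zip(weights, xs)) / total
            second = sum(w * x * x for w, x in zip(weights, xs)) / total
            score += x_case - mean
            info += second - mean * mean   # weighted variance of x in the stratum
        if info <= 0.0:
            raise ValueError("no within-stratum variation in the predictor")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1.0 / math.sqrt(info)


def odds_ratio_ci(beta: float, se: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio with a Wald-type 95% confidence interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


if __name__ == "__main__":
    # Toy matched sets with two controls each and a binary predictor.
    toy = [(1.0, [0.0, 0.0])] * 5 + [(0.0, [1.0, 0.0])] * 3 + [(1.0, [1.0, 0.0])] * 4
    b, s = fit_conditional_logit(toy)
    print(odds_ratio_ci(b, s))
```

Matched sets in which the treatment paper and all its controls share the same predictor value contribute nothing to the estimate, which is why the stratum term absorbs characteristics such as journal and year of publication.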