This study found that only a quarter of preprint articles on COVID-19 posted on bioRxiv and medRxiv had a data/code sharing statement within the manuscript. Furthermore, among the preprint articles that reported that data were available somewhere (i.e., in the manuscript, online in a repository, etc.), we were able to find the raw data for fewer than half. Overall, 15% of the analyzed preprint articles publicly shared raw data and/or code.
These results are comparable to those of Lucas-Dominguez et al., who found that 13.6% of articles retrieved from PubMed Central early in the COVID-19 pandemic made their research data available (Lucas-Dominguez et al., 2021). Even though these data sharing rates appear low, even lower rates have been reported for other fields. In 2021, Towse et al. reported that 4% of 1900 articles from 15 psychology journals adhered to open research data practices (Towse et al., 2021). Gorman analyzed data sharing in 13 high-impact addiction journals and found that only one (0.8%) of 130 analyzed articles contained a direct link to the analyzed data (Gorman, 2020).
Another issue is the quality and completeness of the shared datasets. Roche et al. analyzed 100 datasets from journals publishing ecological and evolutionary research that have a strong public data archiving policy. They reported that 56% of the analyzed datasets were incomplete, and 64% were archived in a manner that partially or entirely prevented their reuse (Roche et al., 2015).
Interest in raw data collected in studies devoted to a public health emergency is not a purely academic exercise. During the COVID-19 pandemic, there were multiple high-profile retractions of research articles; some of them happened when the data analytics company refused to share the raw data. Subsequently, it was suggested that journals should institute mandatory requests to authors to share the primary data, as a measure likely to ensure data integrity and transparency of the research findings and to help prevent publication fraud (Krishan & Kanchan, 2020).
While writing a data sharing statement and sharing raw data are not synonymous, our study provides relevant insight into what happens when something is mandatory. Namely, bioRxiv and medRxiv had different approaches to requiring statements regarding data/code availability. medRxiv requires the following from authors [quote]: “Please include a statement regarding the availability of all data referred to in the manuscript and note links below.”, and there is a separate field for Data availability links, indicating [quote] “Please provide any URLs for external datasets or supplementary material online at other repositories that pertain to this manuscript. These links will be provided online for readers once this submission is posted online. (Example: https://www.example.com).” Since the Data/Code field is obligatory on medRxiv, this explains why virtually all articles posted on medRxiv had something written in the Data/Code field, compared to less than 10% of articles posted on bioRxiv. Evidently, when authors are not required to disclose anything related to their data or code, few do so voluntarily.
It has been reported that journals could leverage compulsory open data to develop their reputation and amplify their journal impact factor (Zhang & Ma, 2021). While preprint servers are not journals, a mandatory requirement to provide raw research data or code could help enhance their reputation in the field.
Authors should also be required to provide their data sharing statement within the manuscript, as it is unclear how many readers will look for the Data/Code field on the web page of the preprint article. Presumably, readers interested in the study will mostly rely on information provided within the manuscript. bioRxiv and medRxiv should therefore request that authors include their data sharing statements there as well.
Li et al. analyzed data sharing intentions of COVID-19 clinical trials of interventions, as declared by authors in trial registrations and publications. They included 924 trial registrations in the analysis; authors of 15.7% of registrations were willing to share data, 38.6% were willing to share immediately after publishing results, and 47.6% reported they were unwilling to share their study data. The authors found 28 published COVID-19 clinical trials; of those, only 7 had a data sharing statement, of which six reported that the authors were willing to share data and one reported that data were not available (Li et al., 2021).
However, we need to be aware that the presence of a data sharing statement and the authors’ self-reported intention to share data may not translate into raw data sharing upon request. We have shown that even authors who indicated in their data availability statement that data would be available on request mostly did not even respond to the data request; few authors of clinical trials were willing to share their data (Gabelica et al., 2019).
Some researchers may need education regarding data sharing issues, as we found multiple statements in the Data/Code field that had nothing to do with data sharing. Despite a very clear description of what is expected in the Data availability field, some authors entered unrelated information in that field, for example, information about competing interests, or information that is difficult to interpret, such as “All authors agree that all data submitted here are publicly available.” Furthermore, many authors wrote that “all data” are in the manuscript or accompanying files, but neither the manuscript nor the associated files contained raw data; this suggests that authors may not be aware of the meaning of the “data sharing” concept, namely that data sharing implies sharing the raw data collected within the study.
We even found one case where the authors expected payment for the data (DeCapprio et al., 2020a). Curiously, in the version of the article that was published in a scholarly journal, the authors did not write that a payment is needed to access the data. Instead, they simply wrote that the data are proprietary and not shareable (DeCapprio et al., 2020b).
Studies such as this one are relevant because they may help reshape biomedicine and biomedical research (Puljak, 2020). Ideas for future studies include repeating the same analysis on published articles about COVID-19. This study focused on preprint articles due to the spike in preprint publications at the beginning of the COVID-19 pandemic (Fidahic et al., 2020). Furthermore, it would be worthwhile to attempt to re-analyze raw data that the authors made available. Due to the heterogeneity of the studies in our sample, we did not attempt to do so; a large team of experts would be needed to re-analyze data from the studies in our sample.
It would be interesting to analyze the inclusion of data/code in non-COVID-19 preprint articles in future studies. We searched the literature, but we were unable to find any such reports for comparison.
A limitation of the study is the low response rate of authors contacted in the survey (14%); this rate was low, but not surprising, as this was an unsolicited email survey. Furthermore, we did not analyze factors associated with data sharing. It has been shown that some factors, for example, a later career stage of the researchers, are associated with more prevalent data sharing (Dorta-González et al., 2021). We also analyzed publication rates of the included articles as of January 2022. Almost half of the analyzed articles had been published in a scholarly journal by that date. It is possible that more scholarly articles based on those preprints will be published subsequently.
In conclusion, we found that only a quarter of analyzed preprint articles on COVID-19 included a data sharing statement within their manuscript, and 15% shared their raw data or code publicly, either in the manuscript or elsewhere online, at the time of publication. All preprint servers should require authors to provide data sharing statements that will be included both on the website and in the manuscript. In addition, researchers need education about the meaning of data sharing.