This study has revealed the availability of research data published as supplementary material or deposited in repositories for the articles on COVID-19 indexed in PMC, as well as their evolution, document types, subject matter and funding during the first months of the pandemic. The visualization by fortnights showed an increase in the scientific production of COVID-19 related studies. To demonstrate this increase statistically, we performed a bimonthly comparison since, as expected after the declaration of this crisis as an official pandemic, March 2020 marked a critical moment, which demonstrated empirically the high speed of response of research worldwide. Nevertheless, despite the extraordinary number of articles published and their continuous increase, only 13.6% contained supplementary material or data deposited in repositories. Data sharing as supplementary material attached to the article has grown in parallel with article output, while the deposit of material in repositories remained constant over time. While the number of articles has increased exponentially since the 2nd fortnight of March 2020, coinciding with the increase in the number of confirmed cases of COVID-19 as the pandemic progressed, funding has grown only modestly, although it should be kept in mind that the results of works whose funding began in early 2020 may not yet have been published. In addition to the low percentage of deposited data, it has been reported that patient-level COVID-19 data are not publicly available. In the current era of global interaction via the Internet, it would be desirable for electronic patient records, suitably anonymized, also to be available to researchers (Rios et al., 2020).
In parallel with the growth in the number of articles, numerous journals have published related articles (n = 335). High-impact generalist journals (such as N Engl J Med, Lancet and Science) have published more articles than journals on infectious diseases, public health, critical care medicine and respiratory systems. In total, 20.40% of the published documents are letters, 6.47% are reviews and 2.99% are editorials, with no appreciable change across the fortnights in the percentage of letters (which ranged between 20 and 30%) relative to journal articles (68.78%). In a previous study, only 46% of papers were found to be articles or reviews (Aleixandre-Benavent et al., 2020), and in another study that analyzed the open data in 140 articles from five high-impact journals, most of the published papers were opinion papers, case reports, and reviews (Gkiouras et al., 2020). It seems that all journals have something to convey to their readers regarding COVID-19. However, the information disseminated through peer-reviewed journals and the data sets published as supplementary material or deposited in online repositories are both vital for researchers and decision-makers (Dye et al., 2016; Modjarrad et al., 2016; Whitty et al., 2015).
The frequency and evolution of the keywords collected by fortnights show the chronology of the main events that have marked this health crisis. On December 31, the WHO was alerted to a cluster of pneumonia cases of unknown aetiology in Wuhan. In the 1st fortnight of January, MERS, SARS, and influenza viruses were ruled out as causative pathogens of this emerging outbreak, and bats were investigated as the reservoir in which the zoonosis originated. China publicly shared the gene sequence of the novel coronavirus SARS-CoV-2, which allowed polymerase chain reaction diagnostic testing to be established. During the 2nd fortnight of January, the National Genomics Data Center of China launched the 2019 novel coronavirus database to release the genome of SARS-CoV-2, and the NIH started working on vaccines. On January 30, the WHO declared the COVID-19 outbreak a Public Health Emergency of International Concern. In the 1st fortnight of February, the disease was officially named COVID-19, and researchers discovered that the SARS-CoV-2 spike protein binds to its human cell receptor protein, ACE2 (Scudellari, 2020). In the 1st fortnight of March, the WHO declared a pandemic and announced that no pharmaceutical therapies had yet been shown to be safe and effective for the treatment of COVID-19 (Li et al., 2020; WHO, 2020). In line with these events, some keywords are related to the specific topic of each journal, but more general terms stand out in almost all journals: epidemiological concepts (such as pandemics and outbreak), respiratory diseases (such as pneumonia and SARS-CoV, since the lung is a target organ in this infection), biological markers (such as spike proteins, ACE2, polymerase chain reaction), and virology and genetics (genome of the virus).
Since the articles analysed were published at the beginning of the outbreak, it is possible that a large part of them are notifications and reports on the place where the outbreak occurred, laboratory data, information obtained from previous outbreaks with similar organisms, mechanisms of transmission, natural history of infection, populations at risk, treatments being used to control the disease, diagnostic tests and genetic sequence information of the virus. Although early case studies of COVID-19 usually include few patients, they are very important because they contain critical information about the contacts that transmitted the infection and those that the patient had subsequently, allow estimation of incubation periods, describe clinical manifestations, provide key laboratory and radiological information and facilitate decision-making in concomitant diseases (Heymann, 2020).
A significant feature of the modern response to epidemics is the ability to efficiently exploit all available data, which can facilitate evidence-based research and decision-making (Campos et al., 2015; Cori, 2017; WHO Ebola Response Team, 2014). For example, analysis of data on epidemics can be used to predict outbreaks in other regions and to identify the most significant factors associated with the disease, such as biological, environmental and climatic factors (humidity, temperature and rainfall), housing quality, transport conditions and population density, among others (Wu et al., 2018). However, experiences with data storage and use in epidemics have not always been conclusive. It has been reported that data sharing was important during the influenza virus A subtype H1N1 epidemic of 2009; however, it was not as significant in the Middle East Respiratory Syndrome (MERS) epidemic, first reported in Saudi Arabia in 2012, or in the Ebola epidemic of 2014–2016, which revealed gaps in the open availability of the genetic sequences of the virus (Yozwiak et al., 2015). The Ebola, dengue and Zika epidemics have demonstrated the need to develop infrastructures for the proper management of data of interest to public health (D’Agostino et al., 2018), as well as to establish codes of conduct to govern the exchange of data on new biological threats (Capua, 2016).
The geographical distribution of papers has closely followed the spread of the disease: China led publication in the first fortnight of January and its output grew until March, before being overtaken by the EU and the USA in April, coinciding with the spread of the virus in these areas.
The analysis of supplementary material showed that three-quarters of the documents were PDF and DOC files, containing mostly textual or graphic materials complementary to the research, and that barely 10% (73 papers) included files in reusable data formats (xls and csv), which is equivalent to 1.2% of the 5,905 records published and analysed in this work. Furthermore, 68.78% of the documents retrieved in this work were original articles and 6.47% were reviews, both of which could include underlying research data. It is not unreasonable to say that this is a very low percentage that does not respond to recommendations and calls for sharing research data. The previous study by Gkiouras et al. (2020) also showed a low percentage, since only one of the 140 articles on COVID-19 published in high-impact journals (0.7%) provided complete open data. This percentage may be reduced by classic impediments to data sharing, such as restrictions due to intellectual property, confidentiality problems, limited technical and human resources, and the lack of incentives for researchers who deposit their data. Co-authoring an article and being cited frequently are often the only rewards for sharing information that can take years to collect and months of hard work to select. In short, we are in an era in which the scientific community is still debating the pros and cons of data sharing (Aleixandre-Benavent et al., 2018; Chretien et al., 2016; Sixto-Costoya et al., 2020; Vidal-Infer et al., 2019; Walport & Brest, 2011). This low percentage does not extend to gene sequence repositories, where scientists often share sequences as soon as they are discovered (Chretien et al., 2016; Pham-Kanter et al., 2014).
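The share of papers with reusable data can be checked from the two counts reported above (73 papers, 5,905 records). The following minimal sketch reproduces that arithmetic and illustrates how supplementary files might be grouped by extension. The grouping into "reusable" (xls, csv) and "document" (pdf, doc) follows the formats named in the text, but the function names and the example file names are hypothetical and are not the procedure used in this study.

```python
from collections import Counter
from pathlib import PurePosixPath

# Formats named in the text; the grouping itself is an illustrative assumption.
REUSABLE_FORMATS = {".xls", ".csv"}
DOCUMENT_FORMATS = {".pdf", ".doc"}


def classify(filename: str) -> str:
    """Assign a supplementary file to a coarse category by its extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in REUSABLE_FORMATS:
        return "reusable"
    if suffix in DOCUMENT_FORMATS:
        return "document"
    return "other"


def share(part: int, total: int) -> float:
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Counts reported in the text.
papers_with_reusable_data = 73
records_analysed = 5905
print(share(papers_with_reusable_data, records_analysed))

# Hypothetical file names, used only to show the classification step.
example_files = ["S1_table.xls", "appendix.pdf", "methods.DOC", "counts.csv", "figure.tiff"]
print(Counter(classify(name) for name in example_files))
```

Matching on the lower-cased extension is a deliberately crude criterion: a PDF may contain a data table and a spreadsheet may contain only text, so any such classification would need manual verification.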
In global public health emergencies, it should be mandatory to disseminate any information that may be of value in fighting the crisis. For this to be done efficiently, there is a need to develop agreed global standards for sharing data and results among scientists, institutions and governments (Capua, 2016; McNutt, 2016; Modjarrad et al., 2016). Establishing data sharing as the gold standard for any published work may be crucial to contain current and possible future health emergencies arising from emerging biological threats. To overcome existing misgivings about data sharing, embargo periods could be established, provided that they do not delay data use, and codes of conduct for data on epidemics should be established (Capua, 2016; Research Data Alliance, 2020). The International Committee of Medical Journal Editors confirmed that the pre-publication dissemination of critical public health information in the context of WHO-declared health emergencies will not prejudice the publication of works (ICMJE, 2020). An example of the importance that some government institutions attach to data repositories is the NIH strategic plan for the decade 2017–2027, which involves improvements in data infrastructures, integration of individual data sets, new tools and resources for data analysis, and incorporation of the FAIR principles (findability, accessibility, interoperability and reusability) (Wilkinson et al., 2016; National Institutes of Health Office of Strategic Coordination, 2020).
Accelerated reporting and data deposit should cover clinical trials, epidemiological surveillance studies, observational studies, information on the virus and its genetic sequences, and disease control programmes (Moorthy et al., 2020). These emerging data, properly integrated, allow risk assessment to be refined and recommendations to be provided to countries for the management of the epidemic (Heymann, 2020; Yozwiak et al., 2015). To facilitate research on the disease, journals and publishers around the world issued a joint statement promising full cooperation in data exchange. These measures include, in addition to providing open access to all peer-reviewed research, sharing research findings prior to peer review, including protocols, results, and data (“Calling all coronavirus researchers”, 2020). However, the emergency initiative to share research findings prior to peer review by means of preprints implies a reduction in quality control that can lead to the dissemination of erroneous information. Some data series may contain errors, which could lead to misleading conclusions, with the consequences that this may have for the health of the population (Rinott et al., 2020). Therefore, researchers who reuse COVID-19 data should be cautious, because the fact that data are findable does not guarantee their quality. It has been reported that some of the first published articles on COVID-19 had to be withdrawn because of quality issues, and the Retraction Watch blog has created a special list for COVID-19 publications (Retraction Watch, 2020).