Introduction

There is substantial global interest in Open Science, which includes open data and methods in addition to open access (OA) publications [1, 2]. Several funding agencies in the United States and in Europe have mandates for open data generated in the research projects they support. In addition, an increasing number of scientific journals have policies encouraging or asking authors to provide data in open repositories [3]. In this commentary, we discuss key elements of data sharing in open repositories from an international and interdisciplinary perspective [4].

Main text

Open research data

It has been proposed that public availability of raw data increases their value and the possibility of confirming scientific findings, improving the reproducibility and replicability of results [5,6,7,8], and also helps to reduce research waste [9]. In this context, the Transparency and Openness Promotion (TOP) guidelines promote data transparency (https://www.cos.io/initiatives/top-guidelines) [7, 8]. There are several main types of research data repositories: institutional, disciplinary, multidisciplinary and project-specific [10]. Availability of raw data in open repositories facilitates meta-analyses, particularly individual patient data (IPD) meta-analyses [11], and the cumulative evaluation of evidence for specific topics [12], especially for high-dimensional data [13] (such as results from genomics, transcriptomics or epigenomics). Certain research fields, such as genomics, have developed standards that facilitate and promote deposition of raw data [14].

A recent study of a sample of 531,889 OA journal articles showed that a minor fraction of papers included a link to data repositories and that those articles had a higher citation impact [3]. Another recent work analyzed 487 papers describing clinical trials and found that, although many declared data availability, very few included data in repositories [15]. An analysis of 500 articles from 50 high-impact journals found that only a small fraction deposited their full raw data online [16]. In addition, in a sample of 49 published articles, reluctance to share data was associated with weaker evidence and a higher number of errors in the reporting of statistical results [17]. Ioannidis and coworkers found that raw data unavailability led to a low rate of repeatability of microarray results from published articles [18].

The FAIR Guiding Principles have been proposed for scientific data management and involve four main categories: Findable (unique and persistent identifiers, in addition to rich metadata), Accessible (retrievable by their identifier), Interoperable (a broadly applicable language for data representation) and Reusable (a clear and accessible usage license) [19]. Metadata, the information describing how data were organized, collected and preprocessed, are key for finding, using and citing files in open repositories [20]. Recently, Corpas et al. have provided several recommendations to comply with the FAIR principles, such as establishing an adequate consent framework, maximizing machine-readable data and selecting the most findable and accessible data repositories [21]. Broman et al. have proposed several valuable recommendations for the organization of data files, such as being consistent, choosing adequate names for variables, avoiding empty cells, creating data dictionaries and using standard file formats (such as comma-delimited files) [22]. In this context, it has been shown that the use of some commercial file formats, such as .xls files, has led to data storage issues, such as gene symbols being changed to dates [23].
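Some of these file-organization recommendations can be checked automatically before a dataset is deposited. The following is a minimal sketch in Python, assuming a comma-delimited file with a header row. The function name `check_table`, the column name `gene` and the date-like pattern are illustrative assumptions, not part of the cited works, and the pattern is not exhaustive.

```python
import csv
import io
import re

# Assumed pattern for values such as "1-Mar" or "2-Sep", which is how a
# spreadsheet program may display a gene symbol after converting it to a
# date. Illustrative only; it does not cover every possible conversion.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def check_table(csv_text, gene_column="gene"):
    """Return a list of (row_number, column, problem) tuples.

    Row numbers count data rows from 1 and exclude the header row.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            if column is None:
                # csv.DictReader stores surplus cells under the key None.
                problems.append((row_number, None, "more cells than headers"))
                continue
            # Missing trailing cells are returned as None; treat as empty.
            if value is None or value.strip() == "":
                problems.append((row_number, column, "empty cell"))
        gene = row.get(gene_column) or ""
        if DATE_LIKE.match(gene.strip()):
            problems.append((row_number, gene_column, "date-like gene symbol"))
    return problems


example = "gene,expression\nTP53,2.4\n1-Mar,\nBRCA1,0.8\n"
for problem in check_table(example):
    print(problem)
```

In this example the second data row is reported twice, once for its empty `expression` cell and once for the date-like value in the `gene` column. A check of this kind only flags candidate problems; deciding which original symbol a date-like value stood for still requires the source records.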

Open access licenses and ethical aspects

There are several available OA licenses, and the ones from Creative Commons (CC; https://creativecommons.org/about/cclicenses/) are frequently used [24]. CC BY is one of the least restrictive and requires attribution, CC BY-SA requires derivative works to be licensed under identical conditions, CC BY-ND does not allow derivative works, CC BY-NC does not allow commercial uses and CC BY-NC-ND allows neither derivative works nor commercial uses [24]. It has been recommended [25] that a CC0 license (a universal public domain dedication; https://creativecommons.org/share-your-work/public-domain/cc0) should be used for data sharing.
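The license conditions summarized above can be represented as a small lookup table, which may help when screening deposited datasets for a planned reuse. This is a minimal sketch in Python covering only the licenses named in this section. The names `LICENSES` and `licenses_permitting` are hypothetical, and the table is a simplification of the license texts, which remain the authoritative source.

```python
# Conditions for each license, as summarized in the text above.
# "CC BY-NC-ND" is the license that combines the non-commercial and
# no-derivatives conditions.
LICENSES = {
    "CC0":         {"attribution": False, "share_alike": False,
                    "derivatives": True,  "commercial": True},
    "CC BY":       {"attribution": True,  "share_alike": False,
                    "derivatives": True,  "commercial": True},
    "CC BY-SA":    {"attribution": True,  "share_alike": True,
                    "derivatives": True,  "commercial": True},
    "CC BY-ND":    {"attribution": True,  "share_alike": False,
                    "derivatives": False, "commercial": True},
    "CC BY-NC":    {"attribution": True,  "share_alike": False,
                    "derivatives": True,  "commercial": False},
    "CC BY-NC-ND": {"attribution": True,  "share_alike": False,
                    "derivatives": False, "commercial": False},
}


def licenses_permitting(derivatives=False, commercial=False):
    """List the licenses that allow the requested kinds of reuse."""
    permitted = []
    for name, terms in LICENSES.items():
        if derivatives and not terms["derivatives"]:
            continue
        if commercial and not terms["commercial"]:
            continue
        permitted.append(name)
    return permitted


# Licenses under which a derived dataset could be used commercially.
print(licenses_permitting(derivatives=True, commercial=True))
```

With both conditions requested, the function returns CC0, CC BY and CC BY-SA; for CC BY-SA the derived dataset would itself have to be released under identical conditions.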

There are several ethical aspects related to the sharing of data from human subjects, such as de-identification, appropriate informed consent and approval by institutional review boards [26,27,28,29]. In addition, in certain contexts, the use of controlled-access repositories, in which researchers need to apply for access to the data, is advisable. In specific cases of highly sensitive information, there is the option of submitting processed data, such as summary statistics [25, 28]. The International Committee of Medical Journal Editors (ICMJE) has required, since 2017, that articles reporting the results of clinical trials include a data sharing statement [30]. There are two major examples of international sharing of patient data that have led to important scientific findings and collaborations [28]: the Alzheimer’s Disease Neuroimaging Initiative (ADNI; adni.loni.usc.edu) has led to more than 2,100 international publications [31] and The Cancer Imaging Archive (TCIA; cancerimagingarchive.net) has facilitated the generation of more than 1,100 international publications [32]. In some regions of the world, members of research ethics committees need further training on the multiple advantages of sharing data for the advancement of health sciences research [27, 28].

Recommendations for researchers around the globe

In Table 1 we present a selection of major data repositories (some are for general use and others are oriented to specific applications or data types), in order to give readers options for submitting their raw results [25]. Among them, the databases at the National Center for Biotechnology Information (NCBI) contain several billion records; some of the largest NCBI databases are those for DNA and RNA sequences (more than 429 million records), gene expression profiles (more than 128 million records), single nucleotide polymorphisms (SNPs; more than 720 million records) and protein sequences (more than 874 million records) [33]. Among the databases from the European Bioinformatics Institute, the largest resources are the European Nucleotide Archive, the European Genome-Phenome Archive, the PRoteomics IDEntifications database and ArrayExpress [34]. The Protein Data Bank has more than 140,000 entries [35] and the Image Data Resource stores different types of imaging data [36]. DataMed (datamed.org) is a search engine for data deposited in repositories [37], the Registry of Research Data Repositories (re3data.org) catalogs available repositories [10], and the European Data Portal (https://data.europa.eu/en) facilitates consolidation and search of open datasets from that region of the world [38]. The Research Data Alliance (RDA) is an international initiative promoting multiple aspects related to open data sharing (https://www.rd-alliance.org) [39].

Table 1 Information about selected major open data repositories

There is a need for more training in open science and data science [25], particularly in emerging economies, and a larger number of open data repositories is much needed in these regions of the world [40, 41]. In this context, the adequate implementation of field-specific standards for reporting raw data, such as MIAME (Minimum Information About a Microarray Experiment) [14], is key to ensuring well-organized files and the inclusion of key metadata, such as descriptions of the individuals/samples, experimental conditions and analyses [20]. Funding agencies and academic institutions from multiple countries are invited to consider the importance of open data in their policies and incentives [41, 42]. Although it is already common practice in several journals, editors and peer reviewers of even more international publications should enforce guidelines asking authors to deposit raw data [12], and scientists from around the world are invited to deposit their data in open repositories [20, 25, 43]. These efforts could be particularly catalyzed by initiatives such as microattribution [44, 45], which gives researchers incentives to share their data openly in the public domain. This allows not only open data sharing but also the possibility of reaching new scientific conclusions that would not be possible if the data were not publicly available [44]. Such initiatives have already been implemented for data repositories, such as locus-specific databases [44], national/ethnic mutation databases [46], and clinical databases and consortia [47], as well as for scientific journals (https://www.nature.com/sdata).

Outlook

In times of COVID-19, it is critical to have good-quality data (including aspects of accessibility, timeliness and support for users, among others [48]) for proper decision-making. We need high-quality data that are reliable and trustworthy [49]. At the global level, initiatives like the Research Data Alliance COVID-19 Working Group involved 440 volunteer data experts to address several issues with data and software sharing to improve the response to the pandemic, and provided recommendations and guidelines on data sharing [49].

However, several challenges have to be solved, particularly in emerging economies: legal and policy issues, scarce coordination between research groups, lack of a culture of data sharing, ethical/privacy considerations, insufficient infrastructure (including high-speed Internet connectivity), limited interoperability of platforms, a shortage of data managers and data scientists, and a scarcity of open data repositories to facilitate data sharing [50]. Recently, an examination of open government data portals for 60 countries found that the USA, the Czech Republic and Canada have the largest numbers of available datasets (more than 291,000, 136,000 and 85,000, respectively) [48]. In some cases, governments do not see the value of implementing open data repositories, even though they are an excellent means of promoting transparency [48] and accountability, and even a strategy for dealing with corruption. We all play a role in this pandemic, and we need more collaboration among private and public agencies, universities, non-governmental organizations and civil society, together with interdisciplinary approaches, to promote an efficient use of open data repositories (as has been demonstrated recently in the pandemic [51]). In addition, investment in health information systems, interoperability and incentives is key. Governments should also monitor and evaluate the impact of sharing data in repositories. Finally, there is an important need to strengthen the capacities of biomedical personnel (particularly in emerging economies) in topics such as data science, open data repositories, data intelligence and data protection regulations, with multidisciplinary teams and collaboration between key stakeholders. Because a very high number of publications about Open Science are written by authors from the Global North [8], more international articles about Open Data from the Global South are needed [1, 4, 52].