In the era of big data, the formats and methods by which data are stored, accessed, and analysed have become increasingly important, and the concurrent rapid expansion of computational power is facilitating the synthesis and analysis of many types of data toward the goals of the 3Rs. These considerations apply not only to new high-dimensional data, e.g., from next-generation sequencing technologies, but also to existing legacy data such as in vivo toxicology studies representing decades of work, vast numbers of animals, and millions of dollars. Varied groups have strong interests in creating common spaces where data can be shared, and in defining the principles by which data sharing should occur. Scientific researchers want to publish their data and associated analyses, and to give others the opportunity to offer alternate interpretations. Journal editors and publishers are under both internal and external pressure to ensure that publications are supported by transparent, easily accessible data sources, and funding agencies have recently sharpened their focus on proper data stewardship to ensure that grants support valuable research. The Holdren memo, for example, mandates that work supported by US federal taxpayer dollars be delivered back to the public in an open manner. Similar requirements apply to research supported by various European states and the European Commission. Finally, the data science community, including software and tool builders, needs access to a broad, standardized swath of data to process, analyse, and integrate multiple information sources and thereby advance scientific discovery more efficiently and effectively. A number of recent initiatives and activities have brought these diverse stakeholders together to pool resources and experiences, discuss data-sharing challenges and opportunities, and arrive at a common set of ideals to govern such processes.
One such initiative led to the publication of the FAIR principles (Table 1) for scientific data management and stewardship (Wilkinson et al. 2016). These four principles have become key objectives for data practices within the National Institutes of Health (NIH) and across the broader scientific community. With respect to the NIH, federally funded data objects, both intramural and extramural, must be findable, e.g., via a digital object identifier (DOI), and accessible, meaning that they can be read and interpreted by both humans and machines. Datasets should be well described by metadata, using standardized ontologies, in an interoperable way that allows for proper cataloguing and storage, ensuring that they can be integrated with other data sources and are therefore reusable. Ongoing efforts within the National Institute of Environmental Health Sciences (NIEHS), and the associated National Toxicology Program (NTP), provide a snapshot of the larger research community as data scientists work to put these principles into practice. NIEHS, like other parts of the NIH, deals with a heterogeneous mix of data systems and technologies, data management practices, metadata capture and standards, funding mechanisms and resources for building and sustaining systems, and policies around data usage. NIEHS is building a Data Commons, which will serve as a common platform for the management of research data, and is also developing a metadata catalogue of terminologies/ontologies that can be used to curate new and existing datasets.
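To make the notion of machine-readable, ontology-annotated metadata concrete, the sketch below shows one possible shape for a dataset record and a minimal completeness check. The field names, the required-field list, and the example values are illustrative assumptions, not an NIEHS or NIH schema.

```python
# Illustrative sketch of a FAIR-style dataset metadata record.
# Field names and required fields are hypothetical, not an NIEHS/NIH schema.

REQUIRED_FIELDS = {"identifier", "title", "description", "license", "ontology_terms"}

def missing_fair_fields(record: dict) -> set:
    """Return the required metadata fields absent from a record."""
    return REQUIRED_FIELDS - record.keys()

record = {
    "identifier": "doi:10.1000/example.12345",   # hypothetical DOI -> findable
    "title": "Legacy in vivo toxicology study (example)",
    "description": "Illustrative placeholder record.",
    "license": "CC0-1.0",                        # explicit terms -> reusable
    "ontology_terms": ["CHEBI:15377"],           # standardized terms -> interoperable
}

print(missing_fair_fields(record))  # prints set() when the record is complete
```

A catalogue built on checks like this can reject or flag incomplete submissions automatically, which is one way the "machine-interpretable" requirement pays off in practice.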
Significant efforts are underway to improve the capture of data provenance, search functionality, and visual analytics, amenable to both user interfaces and web application programming interfaces (APIs). These will provide a bridge between the Chemical Effects in Biological Systems database (CEBS; https://www.niehs.nih.gov/research/resources/databases/cebs), which houses all of the NTP's data in a web-accessible format, and other resources such as the EPA's Chemistry Dashboard (https://comptox.epa.gov/dashboard), PubChem (https://pubchem.ncbi.nlm.nih.gov/), and NLM ToxNet (https://toxnet.nlm.nih.gov/).
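As a small illustration of the kind of programmatic access such web APIs enable, the sketch below builds a PubChem PUG REST query URL for looking up compound identifiers (CIDs) by name. The URL pattern follows PubChem's public REST interface; the helper function name is an assumption for illustration, and no network request is actually sent.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def cid_lookup_url(compound_name: str) -> str:
    """Build a PUG REST URL asking PubChem for the CIDs matching a name.

    The helper is illustrative; the URL pattern follows PubChem's
    /compound/name/<name>/cids/JSON endpoint. No request is made here.
    """
    return f"{PUG_REST}/compound/name/{quote(compound_name)}/cids/JSON"

print(cid_lookup_url("bisphenol A"))
# https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/bisphenol%20A/cids/JSON
```

Cross-resource bridges of the sort described above depend on exactly this kind of stable, documented endpoint: given a common chemical identifier, one service can resolve records held by another without manual intervention.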
Table 1 The FAIR guiding principles for scientific data management and stewardship [from (Wilkinson et al. 2016)]
Specific to data sharing with respect to the 3Rs, the NTP Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) has built the Integrated Chemical Environment (ICE: https://ice.ntp.niehs.nih.gov/) to apply FAIR principles to non-animal in vitro and in silico data as well as to legacy in vivo animal data (Bell et al. 2017). The data integrator portion of ICE is a portal through which users can compare alternative approaches and build predictive models using existing animal data as anchoring endpoints, helping to establish scientific confidence in new approaches. In coordination with the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM), and the 16 federal agencies represented on the committee, NICEATM is also helping to develop a U.S. Strategic Roadmap for Modernizing Safety Testing of Chemicals and Medical Products (https://ntp.niehs.nih.gov/go/natl-strategy). One of the strategic goals of the roadmap is to foster the use of efficient, flexible, and robust practices to establish confidence in new methods (Casey 2016). Specific objectives under this goal include identifying and collating sources of high-quality human toxicological and exposure data; creating centralized data access points that are publicly available and easily accessible; actively soliciting the submission and collation of parallel data from existing animal studies and new alternative methods; and leveraging partnerships and complementary initiatives, all of which necessitate FAIR data sharing practices.
Certain aspects of the FAIR principles are especially challenging, both for the NIH and for the broader scientific community. Interoperability and reusability depend largely upon the questions at hand and require agreement and coordination across many parties. Other, purely practical issues arise from the size of datasets and from policies governing sensitive information; in such situations the data may not be movable, and computation must instead be moved to the data. Finally, given finite resources, the scientific community must guide the prioritization of data storage, access, analysis, and maintenance.