1 Introduction

The application of high-throughput technologies for monitoring vital players in the cell (DNA, RNA, proteins, metabolites, etc.) produces more data today than ever before, and at lower cost, resulting in large and complex data sets [1]. The integrated analysis of these data sets provides opportunities to gain a deeper understanding of molecular phenomena [2,3,4]. Published bio-science research results must therefore be visible, accessible, and reproducible for subsequent reuse by the community [5]. Research Data Management (RDM) and FAIRness (i.e., data that are findable, accessible, interoperable, and reusable) have become two key focus areas for bio-science communities from diverse fields of study seeking to improve the reproducibility, visibility, and accessibility of their heterogeneous research data [6,7,8,9]. Understandably, larger research groups commonly deploy proprietary RDM solutions to mitigate these inherent challenges, while smaller groups rely on existing open solutions. Search and exploration applications play a pivotal role in integrating the standardized research data and facilitating interdisciplinary analysis within and across RDM eco-systems. They are expected to provide a near real-time consolidated view of the research data while hiding the heterogeneity of data models and data exchange methodologies among the participating groups.

Various research-specific RDM solutions exist, from web-based data management platforms and public repositories to standalone information systems suitable for managing data from specific fields of study. Public repositories such as MetaboLights [10, 11] and PRIDE [12] are popular open-source endpoints for the metabolomics and proteomics communities, respectively, to share study-specific data and curate knowledge. Similarly, stand-alone information systems such as openBIS [13] (Open Biological Information System) offer mechanisms for the storage and management of research data. Users can perform a direct free-text search or refine it using search facets that filter the results by specific studies or organisms. However, these types of data management are not conducive to identifying the inter-dependencies across data from different study disciplines.

To explore these inter-dependencies, it is prudent to have a generic, data-centric RDM solution that manages omics datasets in local or remote hubs or repositories while offering integrated search functionalities [14]. This paper presents our work on research data management and analysis of large-scale experimental metadata generated within the plant science research community. We introduce PLANTdataHUB, a fully featured, end-to-end solution for FAIR-RDM in plant research, and outline its data-centric features. We discuss in detail our search application, ARC Registry, which indexes the experimental metadata from multiple studies and disciplines in real time, thus offering a platform for multi-omics data analysis in plant science research.

2 Experimental Metadata

From a data perspective, a typical biological experiment can be broadly characterized in four stages, as shown in Fig. 1. The first stage covers the curation of high-level metadata: the minimum information required to describe an experiment so that researchers can understand the biological and technical origins of the data. It is essential to provide a comprehensive description of the project (say, the project title, people involved, publications, etc.), its goals, and contextual information. The sequence of events or steps should be recorded before or during the experimental procedure. This includes information about the source and characteristics of a study sample, the study protocols, the sample preparation techniques followed, and any other information required to prepare the instrument for the actual measurement.

Fig. 1 The stages of a typical bio-science experiment from a data perspective

The second stage involves conducting the experiment and collecting the raw output data produced by the instrument. The format of these files depends on the instrument used and the type of experiment performed. Generally, these data files are expected to be large and have project-specific storage requirements. In most cases, the instruments generate data in a proprietary format, which needs to be converted into standard formats for further analysis using specialized software for identification and quantification. This information is usually maintained at the level of each sample tested.

The third and fourth stages of the experiment involve creating computational pipelines consisting of workflows, analysis scripts, or tools, built so that the pipelines are reproducible. These pipelines contain code written in common workflow languages, together with the sets of processed data files they generate and a record of the pre-processing and data-cleaning steps and tools used. Researchers perform computational analysis using these sets of processed data as input to analysis steps from single or multiple studies, and the system enables them to carry out such analyses collaboratively using workflow engines. The output datasets and other artifacts from a computational analysis can form soft metadata for further integrated analyses.

2.1 FAIR and RDM

The FAIR data principles serve as a set of guidelines for improving the Findability, Accessibility, Interoperability, and Reusability of scientific data [15]. The goal is to encourage researchers to build detailed annotated studies that use qualified references to other studies, produce machine-readable datasets, identify the datasets through global unique identifiers, and make them accessible through standardized communication protocols [16].

Research Data Management lays out a set of methods and processes to collect, maintain, use, preserve, and share complex research data. Within the biological science community, there can be several distinct RDM approaches, each with multiple stages, where process-oriented management at each stage of the data life-cycle not only ensures the discoverability of research data but also aids the longevity of the research [17].

An RDM platform using FAIR principles provides an eco-system to progressively transform research data into FAIR Digital Objects (FDOs), which are technology-independent digital objects that unify the critical information about a research project [18]. An FDO bundles the data with the mechanisms for its creation and maintenance and with sufficient information to process it. Each FDO is identified by a unique Persistent Identifier (PID) resolving to a machine-readable PID information record that describes the relevance of its contents. A research project whose journal publications reference FDOs stands out, with research results that are visible, accessible, and reproducible.
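To make the PID information record concrete, the following is a minimal TypeScript sketch with hypothetical field names; the actual schema of an FDO's PID record is defined by the respective PID infrastructure.

```typescript
// Hypothetical sketch of a machine-readable PID information record for an FDO.
// Field names are illustrative and not taken from any published FDO specification.
interface PidInformationRecord {
  pid: string;               // globally unique persistent identifier, e.g. a DOI or handle
  digitalObjectType: string; // the kind of digital object the PID resolves to
  dataLocation: string;      // where the bundled data can be retrieved
  metadataLocation: string;  // where the descriptive, machine-readable metadata lives
  checksum?: string;         // integrity information for the referenced content
  license?: string;          // reuse conditions, if declared
}

// Example record for a hypothetical ARC published with a DOI.
const exampleRecord: PidInformationRecord = {
  pid: "10.1234/example-arc-v1",
  digitalObjectType: "AnnotatedResearchContext",
  dataLocation: "https://datahub.example.org/arcs/example-arc",
  metadataLocation: "https://datahub.example.org/arcs/example-arc/isa.json",
};
```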

2.2 Challenges

There are a few key challenges to consider before implementing any data policies to manage complex bio-science research data [9, 17, 19, 20].

2.2.1 Coverage

End-to-end data coverage is essential for any data management system. The curation of experimental metadata is crucial for the management of research data. It is equally important to capture the provenance of the metadata, the steps followed in the computational analysis, the tools used, workflows, scripts, etc., for the reproducibility of a bio-science experiment. Although capturing every bit of information related to the experiment is preferable, annotating and managing large volumes of measurement data can be challenging. Achieving end-to-end data coverage requires significant effort in integrating, annotating, and automating data capture at every stage of the experiment.

2.2.2 Standardization

The data collected from various experiments can be heterogeneous and diverse depending on the discipline, the type of experiment, the equipment used, and the procedures followed. Therefore, standardization of research data is essential to ensure that it can be stored, searched, and analyzed while maintaining interoperability and reusability. Several bio-science metadata specifications are available to structure, curate, and enable data sharing and integration [21,22,23,24]. However, as different research groups may use different specifications, conversion tools may be necessary to achieve metadata interoperability. Standardization is vital to maintaining reproducible results in public or private repositories, and using industry-standard data formats reduces the complexity of integrating with external systems.

2.2.3 Search & Exploration

Enabling search and exploration of integrated research data can be challenging due to the heterogeneous nature of data across different domains, the high data volume, and the technical difficulty of modeling data from various standardized formats into a generic schema. Additionally, maintaining the quality of the curated data at every stage of the experiment is not trivial. Given the expected size of the data, innovative storage solutions, efficient information management, and fast indexing mechanisms are necessary. Since most experiments are conducted collaboratively, the distributed nature of the stored data adds a new dimension not only to searching but also to integrating the systems with existing knowledge bases and public repositories. To speed up query response times, search applications must pre-compute results for the most frequently run queries.

2.2.4 Usability

Naturally, additional tools and applications evolve along with any proposed RDM solution for research data. These tools and applications (annotation tools for metadata curation, conversion tools to generate standardized export formats, etc.) should be user-friendly and efficient so that even scientists with limited computer programming experience can perform tasks like metadata curation and computational analyses. User interfaces should be interactive and easy to learn, requiring no additional training for data stewards, scientists, and analysts. Since the sharing of experimental metadata is a fundamental requirement for addressing reproducibility, the RDM solution is also expected to allow direct import or export of metadata in standardized formats.

Fig. 2 PLANTdataHUB model using ARCs as FAIR digital objects. Data stewards curate ARCs using support tools and maintain them in project-specific on-premise or remote DataHUBs. Also, ARCs can be exported into interoperable formats for external repositories or added as direct references in publications

3 PLANTdataHUB

PLANTdataHUB [25] is an end-to-end solution realizing research data management of plant science data to generate FAIR digital objects called ARCs (Annotated Research Contexts). The PLANTdataHUB eco-system, as shown in Fig. 2, provides tooling support for participating research groups to collaborate and generate ARCs that are self-contained, interoperable, and reproducible. We outline the RDM features and elaborate on how the main components and tools help overcome the FAIR-RDM challenges.

3.1 Annotated Research Context

The core idea behind Annotated Research Contexts (ARCs) is to follow a fixed folder and file layout to organize the experimental metadata, measurement data, workflows, software, external files, etc., in a generic and interoperable format that ensures the reproducibility of each experiment within a research project [26]. The ARC format is standardized yet flexible enough to accommodate simple to complex project scenarios across any omics study. As shown in Fig. 3, ARCs offer a single point of entry for both data management and computation. They provide end-to-end data coverage and easy-to-perform integrity checks to maintain ARC quality.

The ARC specification [27] mandates that the experimental metadata be represented in the ISA format, a well-known standard for plant research data. ISA is an extensible framework that categorizes metadata into Investigation (the project context), Study (a unit of research), and Assay (analytical measurement) [21]. The ISA abstract specification [28] helps in achieving a common format for representing experimental metadata, not only for project-specific requirements but also for exchanging information across project groups and public repositories. The abstract model is implemented in the hierarchical ISA-JSON file format [29]. The ARC specification also uses a common workflow language for the workflows and scripts that are used to build the computation pipelines. The artifacts from workflows and computational analyses go into the respective run folders.
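To illustrate the Investigation-Study-Assay hierarchy, the sketch below shows a heavily simplified ISA-like document as a typed TypeScript object; the field names are reduced to the essentials and do not reproduce the full ISA-JSON schema [29].

```typescript
// Simplified sketch of the ISA hierarchy; the real ISA-JSON schema defines many more fields.
interface Assay { measurementType: string; technologyType: string; dataFiles: string[]; }
interface Study { identifier: string; title: string; assays: Assay[]; }
interface Investigation { identifier: string; title: string; people: string[]; studies: Study[]; }

const isaExample: Investigation = {
  identifier: "INV-2024-001",                 // the project context
  title: "Heat stress response in a model plant",
  people: ["Data Steward A", "Scientist B"],
  studies: [
    {
      identifier: "STUDY-1",                  // a unit of research
      title: "Leaf sampling under elevated night temperature",
      assays: [
        {
          measurementType: "transcription profiling",  // the analytical measurement
          technologyType: "RNA sequencing",
          dataFiles: ["sample_01.fastq.gz"],
        },
      ],
    },
  ],
};
```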

Fig. 3 ARC folder specification packaging ISA-format metadata with workflows, scripts for computational pipelines, and result files/artifacts from workflow executions

The process of data management and digital object generation inherently yields additional Object-level metadata, which in turn can enrich data analyses. This may include the digital object identifier (DOI) assigned to a particular version of an ARC, details about the analyst or data steward who most recently modified the ARC in the repository (who may or may not be the scientist who conducted the experiments), the current status of the ARC, its visibility (public or private), etc.
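The Object-level metadata can be pictured as a small record per ARC version; the following sketch shows one possible shape, with field names that are ours rather than part of a published schema.

```typescript
// Hypothetical shape of ARC Object-level metadata; field names are illustrative.
interface ArcObjectMetadata {
  arcId: string;           // repository identifier within the hosting DataHUB
  doi?: string;            // DOI assigned to a particular published version, if any
  lastModifiedBy: string;  // analyst or data steward who last modified the ARC
  status: "draft" | "in-review" | "published";
  visibility: "public" | "private";
  updatedAt: string;       // ISO 8601 timestamp of the last change
}
```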

3.2 DataHUBs

Research data management typically begins at individual research institutes, where most data is curated and stored locally. However, for publications, it is crucial to maintain research results and data in centralized repositories, and hence collaboration and data sharing are pivotal. In the PLANTdataHUB model, the centralized repositories are known as DataHUBs. They offer versioning of ARCs and make them easy to search and discover through standardized API access. The ARC model and the associated tools make it easy to initialize and curate ARCs locally while keeping them private during ongoing experiments. Research groups can also maintain private ARCs directly in associated DataHUBs with strict access control restrictions. When the results are ready for publication, private ARCs can be made accessible to the public.

The PLANTdataHUB solution leverages GitLab for the storage and versioning of individual ARCs. The use of a distributed version control system like git for ARCs provides flexibility in maintaining each iteration of the design-test-repeat cycle of a laboratory experiment, thus avoiding the need for a new, customized versioning tool. DataHUBs can be deployed either on-premise or remotely, depending on the research groups' data sharing and security policies. Additionally, PLANTdataHUB provides a range of supporting tools to enable easy curation, collaboration, and maintenance of ARCs both on-premise and in remote DataHUBs.

3.3 User Identification

A robust data security framework within the PLANTdataHUB eco-system is expected to complement ARCs, DataHUBs, and the supporting tools and applications. An effective and user-friendly authentication and authorization infrastructure (AAI) is required to enforce data access policies seamlessly. The PLANTdataHUB solution uses a Keycloak instance [30] that acts as a gateway AAI to streamline user identification across the DataHUBs.

3.4 Tool Support

PLANTdataHUB provides a wide range of tooling support to facilitate continuous FAIR collaboration built on top of the ARC structure, such as

  • on-premise git-based tools to help data stewards initialize and curate ARC folder structure and initial ISA-formatted metadata files [31].

  • annotation tools to assist in curating ontology-driven, ISA-formatted metadata annotations by accessing the required external ontologies [32].

  • a range of converter tools to generate standardized export formats from the ARCs, pivoting on the ISA model.

  • a data management plan (DMP) generator to guide the writing of a DMP in the format desired by the respective funding agency [33].

4 Search Requirements

The hierarchical, layered structure of data within the ARC specification makes it non-trivial to develop integrated data analysis platforms that let users explore file contents across the hierarchies. Typical user queries include finding the list of ARCs where a particular data steward is on the study author list, ARCs that have used a specific material sample from a named plant species, or ARCs that have used a particular high-throughput technology type; a sketch of such queries is given below. When integrated with multiple on-premise/remote DataHUBs hosting ARCs from different omics domains, the results of these queries offer new facets for cross-domain research and analysis.
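Assuming the indexed metadata is held in a document store, such queries could be expressed roughly as the MongoDB-style filters sketched below; the field paths are assumptions about how the ISA-JSON documents might be persisted, not the registry's actual schema.

```typescript
// Hypothetical MongoDB-style filters for the example queries; all field paths are assumptions.
const byStudyAuthor = {
  "isa.studies.people.lastName": "Doe",                 // ARCs with a given person on a study author list
};

const bySampleOrganism = {
  "isa.studies.materials.sources.characteristics": {
    $elemMatch: { category: "Organism", value: "Arabidopsis thaliana" },  // a named plant species
  },
};

const byTechnologyType = {
  "isa.studies.assays.technologyType": "mass spectrometry",  // a particular high-throughput technology
};
```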

We broadly categorize the requirements from the plant science community into three different ARC search functions.

  • integrated metadata search across the top layers of the ARCs, i.e., ISA-formatted experimental metadata and ARC Object-level metadata.

  • exploration of the metadata, the associated measurement data (processed and annotated), and the artifacts from iterations of computational analysis within an individual ARC.

  • integrated exploration of data across a set of selected ARCs (same or from different omics domains) and from single/multiple DataHUBs.

The required design features of an application that integrates multiple DataHUBs and implements the above search functions include

  • Flexibility in integrating multiple DataHUBs supporting different integration strategies and data exchange protocols.

  • Robust data interchange mechanism, preferably with standardized data exchange formats.

  • Effective and comprehensive data access through APIs and web-based user interfaces.

  • Cloud-native application design with high availability.

  • A user identification system complementing external authentication and authorization infrastructure.

  • End-to-end logging for user support and troubleshooting.

5 ARC Registry

We introduce ARC Registry, a cloud-native application for integrated search and analysis of multi-omics metadata within the plant science community. The registry application is designed to integrate multiple DataHUBs at once and to provide a consolidated real-time view of the metadata from the top layers of the ARCs across DataHUBs. It includes functionalities to search both ARC Object-level and ISA-formatted metadata.

5.1 Micro-services

ARC Registry follows a multi-service architecture. The application modules are built as individual micro-services that are loosely coupled and communicate securely with each other. Fig. 4 shows the individual services of the ARC Registry. The API gateway service is the entry point that authenticates and authorizes incoming API requests. ARC Registry exposes a range of secure APIs to access both ARC Object-level and ISA-formatted metadata.
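As a rough illustration of the gateway's role, the sketch below shows a hand-rolled Express middleware performing OAuth2 token introspection against a Keycloak realm; in practice Express Gateway is configuration-driven, and the realm, client, and secret used here are assumptions.

```typescript
import express from "express";

// Illustrative authentication middleware; realm name, client ID, and secret are assumptions.
const INTROSPECT_URL =
  "https://keycloak.example.org/realms/datahub/protocol/openid-connect/token/introspect";

async function authenticate(req: express.Request, res: express.Response, next: express.NextFunction) {
  const token = req.headers.authorization?.replace(/^Bearer /, "");
  if (!token) return next();                          // no token: only public metadata is served downstream
  const body = new URLSearchParams({
    token,
    client_id: "arc-registry",                        // hypothetical confidential client
    client_secret: process.env.KEYCLOAK_CLIENT_SECRET ?? "",
  });
  const resp = await fetch(INTROSPECT_URL, { method: "POST", body });
  const info = await resp.json();
  if (!info.active) return res.status(401).json({ error: "invalid or expired token" });
  res.locals.user = info.preferred_username;          // pass the identity on to the registry services
  next();
}

const app = express();
app.use(authenticate);                                // all registry APIs sit behind this check
```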

Fig. 4 ARC Registry architecture. The core functionalities are implemented as micro-services

HUB Integration is an important service capable of connecting to several DataHUBs and instantly downloading experimental metadata from ARCs. Each participating DataHUB requires a one-time integration setup to establish a two-way data exchange with the ARC Registry. Once set up, the ARC Registry is expected to receive real-time push notifications from the DataHUBs upon every ARC update, and the HUB Integration service downloads and processes the latest changes.

The DB Connection service abstracts the underlying database and the persisted metadata by exposing a set of internal APIs for the other registry services. Since the experimental metadata is modeled as standardized JSON documents, any document database with decent text search functionality fits best.
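The internal API surface of the DB Connection service can be pictured along the following lines; the method names are ours and only indicate the kinds of operations the other registry services rely on.

```typescript
// Illustrative internal API of the DB Connection service; method and parameter names are assumptions.
interface DbConnectionService {
  // Append a new version of an ARC's metadata documents (no in-place updates or deletes).
  insertVersion(arcId: string, objectMeta: object, isaJson: object): Promise<number>;
  // Return the latest stored metadata version of a given ARC.
  findLatest(arcId: string): Promise<object | null>;
  // Structured or free-text search across the stored ISA-JSON documents;
  // optionally include the private ARCs owned by a given authenticated user.
  query(filter: object, opts?: { includePrivateOf?: string }): Promise<object[]>;
}
```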

Fig. 5 ARC Registry integrated with multiple on-premise/remote DataHUBs hosting ARCs

The Web UI is a standalone service that provides web pages for users to search and explore both ARC Object-level and ISA-formatted metadata. By default, the web pages only show public ARCs. However, users can authenticate themselves through the external AAI to access both public ARCs (owned by any user) and the private ARCs owned by the authenticated user.

The logging service keeps a log of all the interactions among the other services in the ARC Registry application. Specifically, the logging service maintains a stage-wise log of interactions between the HUB Integration service and the DataHUBs for each ARC.

5.2 DataHUB Integration

There are two essential aspects to integrating DataHUBs hosting ARCs. The first aspect is generating real-time notifications whenever a user updates an ARC. For instance, changes to an existing workflow in an ARC, adding sets of result files from executing a workflow, changes to the administrative metadata, etc. The second aspect is accessing ARCs from the DataHUBs through APIs while abstracting the mode of communication such that DataHUBs and the ARC Registry are agnostic to underlying data transfer protocols, say, HTTP/S, SFTP, etc.

PLANTdataHUB uses a customized GitLab setup for the DataHUBs to maintain ARCs, inheriting GitLab's built-in features. ARCs are maintained as individual git repositories within the GitLab servers, making it easy to set up system-level continuous integration and delivery (CI/CD) pipelines that invoke the ARC Registry APIs upon commits to individual ARC repositories. Fig. 5 shows how the registry can be integrated with multiple GitLab servers at the same time. After initial user authentication, the data exchange happens as follows

Fig. 6 Sequence diagram depicting the interactions among the RDM entities, i.e., users, AAI, DataHUB, and ARC Registry. 1) Automated data exchange using the CI/CD pipeline between the DataHUB and ARC Registry. 2) A query API that returns all public ARCs when the user is not authenticated with the AAI. 3) A query API that returns both public and user-specific private ARCs when authenticated

  1. Any fresh user commit to an individual ARC repository triggers a CI/CD pipeline that performs two tasks, as shown in the first part of Fig. 6. First, an internal script in the pipeline scans for updates in the ISA-formatted metadata sections of the ARC and generates a JSON artifact (in the ISA-JSON exchange format) within the repository.

  2. The pipeline then triggers the webhook API of the ARC Registry with a notification in the form of a JSON payload containing ARC Object-level metadata, i.e., the ARC repository ID, GitLab username, email ID, file commits, additions, etc.

  3. The incoming webhook API request is routed to the HUB Integration service, which parses the ARC Object-level metadata, connects back to the respective DataHUB, and downloads the ISA-JSON artifact from the respective ARC repository using generic GitLab APIs. It then parses the ISA-JSON artifact and invokes the DB Connection service to persist it in the database, as sketched below.
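To make this flow concrete, here is a hedged sketch of a webhook handler in the HUB Integration service; the payload fields follow GitLab's push-event format, the artifact file name and DB call are assumptions, and the download uses GitLab's generic repository files API.

```typescript
import express from "express";

// Minimal stand-in for the DB Connection service's internal API (names are assumptions).
const dbConnection = {
  insertVersion: async (arcId: string, objectMeta: object, isaJson: object): Promise<void> => {
    /* persist to the document database; omitted in this sketch */
  },
};

const app = express();
app.use(express.json());

// Illustrative webhook endpoint invoked by the DataHUB's CI/CD pipeline.
app.post("/webhook/datahub", async (req, res) => {
  const { project_id, ref, user_username } = req.body;      // ARC Object-level metadata from the push event
  const branch = (ref ?? "refs/heads/main").replace("refs/heads/", "");

  // Download the ISA-JSON artifact generated by the pipeline via GitLab's repository files API.
  const hubUrl = process.env.DATAHUB_URL;                   // e.g. https://gitlab.example.org
  const artifact = encodeURIComponent("isa.arc.json");      // hypothetical artifact file name
  const resp = await fetch(
    `${hubUrl}/api/v4/projects/${project_id}/repository/files/${artifact}/raw?ref=${branch}`,
    { headers: { "PRIVATE-TOKEN": process.env.DATAHUB_TOKEN ?? "" } }
  );
  if (!resp.ok) return res.status(502).json({ error: "artifact download failed" });
  const isaJson = await resp.json();

  // Persist both the Object-level metadata and the parsed ISA-JSON as a new version.
  await dbConnection.insertVersion(String(project_id), { updatedBy: user_username, branch }, isaJson);
  res.status(202).json({ status: "accepted" });
});
```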

5.3 Metadata Versioning

ARC Registry stores data as collections of JSON documents in two different formats: ARC Object-level metadata in a proprietary format and ISA-formatted metadata in the standard ISA-JSON format. It is important to note that the data is read-only; no direct updates or deletes are allowed through the user APIs. Furthermore, the ARC Registry versions the JSON documents in the order they are received, as shown in Fig. 7. This allows changes to the ARC data to be tracked easily over time. The visibility of the ARC repository is reflected in the JSON documents.
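A minimal sketch of the append-only versioning, assuming a MongoDB-like document store (the registry does not prescribe a specific database, and the collection and field names here are ours):

```typescript
import { MongoClient } from "mongodb";

// Append-only versioning sketch: every received commit becomes a new document;
// existing documents are never updated or deleted. Collection and field names are assumptions.
async function saveMetadataVersion(
  arcId: string,
  visibility: "public" | "private",
  isaJson: object
): Promise<number> {
  const client = await MongoClient.connect(process.env.MONGO_URL ?? "mongodb://localhost:27017");
  const col = client.db("arc_registry").collection("isa_metadata");

  // Determine the next version number for this ARC in arrival order.
  const latest = await col.find({ arcId }).sort({ version: -1 }).limit(1).next();
  const version = ((latest?.version as number) ?? 0) + 1;

  await col.insertOne({ arcId, version, visibility, receivedAt: new Date(), isa: isaJson });
  await client.close();
  return version;
}
```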

Fig. 7 Metadata versioning, where each commit to an ARC repository is saved as a new ARC metadata version, reflecting visibility

The ARC Registry provides a variety of query APIs that enable users to search and explore the versioned metadata. By default, the query APIs return the matching metadata of all public ARCs. For instance, searching for the contextual parameter-value combination night exposure temperature = 22 degrees returns all the relevant experiments from public ARCs, grouped by DataHUB name, ARC ID, ARC version, Investigation, Study, and Assay name. However, when a valid user token received from the AAI is attached to the request header, the ARC Registry performs token introspection and also returns the matching metadata of the private ARCs owned by the respective user.
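From a client's perspective, such a query could be issued as sketched below; the endpoint path, parameters, and response shape are assumptions for illustration, while the bearer-token behaviour matches the description above.

```typescript
// Hypothetical client-side use of an ARC Registry query API; endpoint and field names are assumptions.
async function searchByParameter(parameter: string, value: string, accessToken?: string) {
  const url =
    `https://registry.example.org/api/v1/isa/search` +
    `?parameter=${encodeURIComponent(parameter)}&value=${encodeURIComponent(value)}`;
  const headers: Record<string, string> = {};
  if (accessToken) headers["Authorization"] = `Bearer ${accessToken}`; // adds the user's private ARCs

  const resp = await fetch(url, { headers });
  // Results grouped as described above: DataHUB name, ARC ID, ARC version,
  // Investigation, Study, and Assay name.
  return (await resp.json()) as Array<{
    dataHub: string; arcId: string; arcVersion: number;
    investigation: string; study: string; assay: string;
  }>;
}

// Without a token only public ARCs are searched; with a valid AAI token the user's private ARCs are included.
searchByParameter("night exposure temperature", "22 degrees").then(console.log);
```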

5.4 Deployment

The individual services of ARC Registry, including the web interface, are developed in NodeJS with the ExpressJS framework, and for the API gateway we use Express Gateway. Each service is containerized using Docker, and the application runs on Docker Swarm, which monitors services and ensures high availability. For OAuth2.0 services and AAI identity management, we use Keycloak.

6 Lessons Learned

The iterative development of RDM solutions for the plant research community is challenging. It requires developing the associated tools and search applications in tandem with the ever-evolving functional requirements of the community. It is necessary to use modularization, abstraction, and a micro-service architectural style that complements iterative development. Additionally, leveraging existing products (GitLab, etc.) inherently introduces their shortcomings into the RDM solution, requiring us to implement workarounds.

It can be challenging to automate data exchange between different RDM solutions, often requiring customized tools to integrate experimental metadata from research groups using different solutions.

7 Conclusion

It is acknowledged across bio-science research communities that an effective Research Data Management (RDM) platform based on FAIR principles is a necessity for packaging study-specific research results into uniquely identifiable and accessible FAIR digital objects. The integrated analysis of these research results provides better opportunities for knowledge discovery, collaboration, and innovation.

We introduce PLANTdataHUB, an end-to-end, data-centric FAIR-RDM solution for plant research data. Our ARC Registry is a cloud-native application that seamlessly integrates the experimental metadata from the DataHUBs in real time, addressing the requirements of the plant research community. ARC Registry offers comprehensive data access through APIs and complements external authentication and authorization frameworks, making it an ideal platform for multi-omics data analysis across participating research groups.

In 2021, the DataPLANT consortium and its stakeholders deployed a production DataHUB for the plant science community. The repository has grown since then, with more than 200 users hosting over 400 ARCs (around 15TB of data) from various plant study disciplines.

We are working on expanding the ARC Registry to include both metadata and measurement data from multiple on-premise and remote DataHUBs. The expanded application utilizes a polystore architecture to accommodate different data models. Its standout feature is adaptive indexing of the measurement data based on the cross-model query workload.