1 Introduction

The application of high-throughput technologies for monitoring vital players in the cell (DNA, RNA, proteins, metabolites, etc.) produces more data today than ever before, and at lower cost, resulting in large and complex data sets [1]. The integrated analysis of these data sets provides opportunities to gain a deeper understanding of molecular phenomena [2,3,4]. Published bio-science research results must therefore be visible, accessible, and reproducible for subsequent reuse by the community [5]. Research Data Management (RDM) and FAIRness (i.e., data that are findable, accessible, interoperable, and reusable) have become two key focus areas for bio-science communities from diverse fields of study seeking to improve the reproducibility, visibility, and accessibility of their heterogeneous research data [6,7,8,9]. Understandably, larger research groups commonly deploy proprietary RDM solutions to mitigate these inherent challenges, while smaller groups rely on existing open solutions. Search and exploration applications play a pivotal role in integrating the standardized research data and facilitating interdisciplinary analysis within and across RDM eco-systems. They are expected to provide a near real-time consolidated view of the research data while hiding the heterogeneity of data models and data exchange methodologies among the participating groups.

Various research-specific RDM solutions exist, from web-based data management platforms and public repositories to standalone information systems suitable for managing data from specific fields of study. Public repositories such as MetaboLights [10, 11] and PRIDE [12] are popular open-source endpoints for the metabolomics and proteomics communities, respectively, to share study-specific data and curate knowledge. Similarly, stand-alone information systems such as openBIS [13] (Open Biological Information System) offer mechanisms for the storage and management of research data. Users can perform a direct free-text search or refine it using search facets that filter the results by specific studies or organisms. However, these types of data management are not conducive to identifying the inter-dependencies across data from different study disciplines.

To explore these inter-dependencies, it is prudent to have a generic, data-centric RDM solution that manages omics datasets in local or remote hubs or repositories while offering integrated search functionalities [14]. This paper presents our work on research data management and analysis of large-scale experimental metadata generated within the plant science research community. We introduce PLANTdataHUB, a fully featured, end-to-end solution for FAIR-RDM in plant research, and outline its data-centric features. We discuss in detail our search application, ARC Registry, which indexes the experimental metadata from multiple studies and disciplines in real time, thus offering a platform for multi-omics data analysis in plant science research.

2 Experimental Metadata

From a data perspective, a typical biological experiment can be broadly characterized in four stages, as shown in Fig. 1. The first stage covers the curation of high-level metadata: the minimum information required to describe an experiment so that researchers can understand the biological and technical origins of the data. It is essential to provide a comprehensive description of the project (say, the project title, people involved, publications, etc.), its goals, and contextual information. The sequence of events or steps should be recorded before or during the experimental procedure. This includes information about the source and characteristics of a study sample, the study protocols, the sample preparation techniques followed, and any other information required to prepare the instrument for the actual measurement.

Fig. 1 The stages of a typical bio-science experiment from a data perspective

The second stage involves conducting the experiment and collecting the raw output data produced by the instrument. The format of these files depends on the instrument used and the type of experiment performed. Generally, these data files are expected to be large and have project-specific storage requirements. In most cases, the instruments generate data in a proprietary format, which needs to be converted into standard formats for further analysis using specialized software for identification and quantification. This information is usually maintained at the level of each sample tested.

The third and fourth stages of the experiment involve creating computational pipelines consisting of workflows, analysis scripts, or tools, built so that the pipelines are reproducible. These pipelines contain code written in common workflow languages, together with the sets of processed data files they generate and a record of the pre-processing and data-cleaning steps and tools used. Researchers perform computational analysis using these sets of processed data as input to analysis steps from single or multiple studies, and the system enables them to carry out such analyses collaboratively using workflow engines. The output datasets and other artifacts from a computational analysis can form soft metadata for further integrated analyses.

2.1 FAIR and RDM

The FAIR data principles serve as a set of guidelines for improving the Findability, Accessibility, Interoperability, and Reusability of scientific data [15]. The goal is to encourage researchers to build detailed annotated studies that use qualified references to other studies, produce machine-readable datasets, identify the datasets through global unique identifiers, and make them accessible through standardized communication protocols [16].

Research Data Management lays out a set of methods and processes to collect, maintain, use, preserve, and share complex research data. Within the biological science community, there can be several distinct RDM approaches, each with multiple stages, where process-oriented management at each stage of the data life-cycle not only ensures the discoverability of research data but also aids the longevity of the research [17].

An RDM platform using FAIR principles provides an eco-system to progressively transform research data into FAIR Digital Objects (FDOs), which are technology-independent digital objects that unify the critical information about a research project [18]. An FDO bundles the data with the mechanisms for its creation and maintenance and with sufficient information to process it. Each FDO is identified by a unique Persistent Identifier (PID) resolving to a machine-readable PID information record that describes the relevance of its contents. A research project whose journal publications reference FDOs stands out, with research results that are visible, accessible, and reproducible.
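To make the PID information record concrete, the following is a minimal TypeScript sketch with hypothetical field names; the actual schema of an FDO's PID record is defined by the respective PID infrastructure.

```typescript
// Hypothetical sketch of a machine-readable PID information record for an FDO.
// Field names are illustrative and not taken from any published FDO specification.
interface PidInformationRecord {
  pid: string;               // globally unique persistent identifier, e.g. a DOI or handle
  digitalObjectType: string; // the kind of digital object the PID resolves to
  dataLocation: string;      // where the bundled data can be retrieved
  metadataLocation: string;  // where the descriptive, machine-readable metadata lives
  checksum?: string;         // integrity information for the referenced content
  license?: string;          // reuse conditions, if declared
}

// Example record for a hypothetical ARC published with a DOI.
const exampleRecord: PidInformationRecord = {
  pid: "10.1234/example-arc-v1",
  digitalObjectType: "AnnotatedResearchContext",
  dataLocation: "https://datahub.example.org/arcs/example-arc",
  metadataLocation: "https://datahub.example.org/arcs/example-arc/isa.json",
};
```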

2.2 Challenges

There are a few key challenges to consider before implementing any data policies to manage complex bio-science research data [9, 17, 19, 20].

2.2.1 Coverage

End-to-end data coverage is essential for any data management system. The curation of experimental metadata is crucial for the management of research data. It is equally important to capture the provenance of the metadata, the steps followed in the computational analysis, the tools used, workflows, scripts, etc., for the reproducibility of a bio-science experiment. Although capturing every bit of information related to the experiment is preferable, annotating and managing large volumes of measurement data can be challenging. Achieving end-to-end data coverage requires significant effort in integrating, annotating, and automating data capture at every stage of the experiment.

2.2.2 Standardization

The data collected from various experiments can be heterogeneous and diverse depending on the discipline, the type of experiment, the equipment used, and the procedures followed. Therefore, standardization of research data is essential to ensure that it can be stored, searched, and analyzed while maintaining interoperability and reusability. Several bio-science metadata specifications are available to structure, curate, and enable data sharing and integration [21,22,23,24]. However, as different research groups may use different specifications, conversion tools may be necessary to achieve metadata interoperability. Standardization is vital to maintaining reproducible results in public or private repositories, and using industry-standard data formats reduces the complexity of integrating with external systems.

2.2.3 Search & Exploration

Enabling search and exploration of integrated research data can be challenging due to the heterogeneous nature of data across different domains, the high data volume, and the technical difficulty of modeling data from various standardized formats into a generic schema. Additionally, maintaining the quality of the curated data at every stage of the experiment is not trivial. Given the expected size of the data, innovative storage solutions, efficient information management, and fast indexing mechanisms are necessary. Since most experiments are conducted collaboratively, the distributed nature of the stored data adds a new dimension not only to searching but also to integrating the systems with existing knowledge bases and public repositories. To speed up query response times, search applications must pre-compute results for the most frequently run queries.

2.2.4 Usability

Naturally, additional tools and applications evolve along with any proposed RDM solution for research data. These tools and applications (annotation tools for metadata curation, conversion tools to generate standardized export formats, etc.) should be user-friendly and efficient so that even scientists with limited computer programming experience can perform tasks like metadata curation and computational analyses. User interfaces should be interactive and easy to learn, requiring no additional training for data stewards, scientists, and analysts. Since the sharing of experimental metadata is a fundamental requirement for addressing reproducibility, the RDM solution is also expected to allow direct import or export of metadata in standardized formats.

Fig. 2 PLANTdataHUB model using ARCs as FAIR digital objects. Data stewards curate ARCs using support tools and maintain them in project-specific on-premise or remote DataHUBs. Also, ARCs can be exported into interoperable formats for external repositories or added as direct references in publications

3 PLANTdataHUB

PLANTdataHUB [25] is an end-to-end solution realizing research data management of plant science data to generate FAIR digital objects called ARCs (Annotated Research Contexts). The PLANTdataHUB eco-system, as shown in Fig. 2, provides tooling support for participating research groups to collaborate and generate ARCs that are self-contained, interoperable, and reproducible. We outline the RDM features and elaborate on how the main components and tools help overcome the FAIR-RDM challenges.

3.1 Annotated Research Context

The core idea behind Annotated Research Contexts (ARCs) is to follow a fixed folder and file layout to organize the experimental metadata, measurement data, workflows, software, external files, etc., in a generic and interoperable format that ensures the reproducibility of each experiment within a research project [26]. The ARC format is standardized yet flexible enough to accommodate simple to complex project scenarios across any omics study. As shown in Fig. 3, ARCs offer a single point of entry for both data management and computation. They provide end-to-end data coverage and easy-to-perform integrity checks to maintain ARC quality.

The ARC specification [27] mandates that the experimental metadata be represented in the ISA format, a well-known standard for plant research data. ISA is an extensible framework that categorizes metadata into Investigation (the project context), Study (a unit of research), and Assay (analytical measurement) [21]. The ISA abstract specification [28] helps in achieving a common format for representing experimental metadata, not only for project-specific requirements but also for exchanging information across project groups and public repositories. The abstract model is implemented in the hierarchical ISA-JSON file format [29]. The ARC specification also uses a common workflow language for the workflows and scripts that are used to build the computation pipelines. The artifacts from workflows and computational analyses go into the respective run folders.
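To illustrate the Investigation-Study-Assay hierarchy, the sketch below shows a heavily simplified ISA-like document as a typed TypeScript object; the field names are reduced to the essentials and do not reproduce the full ISA-JSON schema [29].

```typescript
// Simplified sketch of the ISA hierarchy; the real ISA-JSON schema defines many more fields.
interface Assay { measurementType: string; technologyType: string; dataFiles: string[]; }
interface Study { identifier: string; title: string; assays: Assay[]; }
interface Investigation { identifier: string; title: string; people: string[]; studies: Study[]; }

const isaExample: Investigation = {
  identifier: "INV-2024-001",                 // the project context
  title: "Heat stress response in a model plant",
  people: ["Data Steward A", "Scientist B"],
  studies: [
    {
      identifier: "STUDY-1",                  // a unit of research
      title: "Leaf sampling under elevated night temperature",
      assays: [
        {
          measurementType: "transcription profiling",  // the analytical measurement
          technologyType: "RNA sequencing",
          dataFiles: ["sample_01.fastq.gz"],
        },
      ],
    },
  ],
};
```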

Fig. 3 ARC folder specification packaging ISA-format metadata with workflows, scripts for computational pipelines, and result files/artifacts from workflow executions

The process of data management and digital object generation inherently yields additional Object-level metadata, which in turn can enrich data analyses. This may include the digital object identifier (DOI) assigned to a particular version of an ARC, details about the analyst or data steward who most recently modified the ARC in the repository (who may or may not be the scientist who conducted the experiments), the current status of the ARC, its visibility (public or private), etc.
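The Object-level metadata can be pictured as a small record per ARC version; the following sketch shows one possible shape, with field names that are ours rather than part of a published schema.

```typescript
// Hypothetical shape of ARC Object-level metadata; field names are illustrative.
interface ArcObjectMetadata {
  arcId: string;           // repository identifier within the hosting DataHUB
  doi?: string;            // DOI assigned to a particular published version, if any
  lastModifiedBy: string;  // analyst or data steward who last modified the ARC
  status: "draft" | "in-review" | "published";
  visibility: "public" | "private";
  updatedAt: string;       // ISO 8601 timestamp of the last change
}
```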

3.2 DataHUBs

Research data management typically begins at individual research institutes, where most data is curated and stored locally. However, for publications, it is crucial to maintain research results and data in centralized repositories, and hence collaboration and data sharing are pivotal. In the PLANTdataHUB model, the centralized repositories are known as DataHUBs. They offer versioning of ARCs and make them easy to search and discover through standardized API access. The ARC model and the associated tools make it easy to initialize and curate ARCs locally while keeping them private during ongoing experiments. Research groups can also maintain private ARCs directly in associated DataHUBs with strict access control restrictions. When the results are ready for publication, private ARCs can be made accessible to the public.

The PLANTdataHUB solution leverages GitLab for the storage and versioning of individual ARCs. The use of a distributed version control system like git for ARCs provides flexibility in maintaining each iteration of the design-test-repeat cycle of a laboratory experiment, thus avoiding the need for a new, customized versioning tool. DataHUBs can be deployed either on-premise or remotely, depending on the research groups' data sharing and security policies. Additionally, PLANTdataHUB provides a range of supporting tools to enable easy curation, collaboration, and maintenance of ARCs both on-premise and in remote DataHUBs.

3.3 User Identification

A robust data security framework within the PLANTdataHUB eco-system is expected to complement ARCs, DataHUBs, and the supporting tools and applications. An effective and user-friendly authentication and authorization infrastructure (AAI) is required to enforce data access policies seamlessly. The PLANTdataHUB solution uses a Keycloak instance [30] that acts as a gateway AAI to streamline user identification across the DataHUBs.

3.4 Tool Support

PLANTdataHUB provides a wide range of tooling support to facilitate continuous FAIR collaboration built on top of the ARC structure, such as

  • on-premise git-based tools to help data stewards initialize and curate ARC folder structure and initial ISA-formatted metadata files [31].

  • annotation tools to assist in curating ontology-driven, ISA-formatted metadata annotations by accessing the required external ontologies [32].

  • a range of converter tools to generate standardized export formats from the ARCs, pivoting on the ISA model.

  • a data management plan (DMP) generator to guide the writing of a DMP in the format desired by the respective funding agency [33].

4 Search Requirements

The hierarchical, layered structure of data within the ARC specification makes it non-trivial to develop integrated data analysis platforms that let users explore file contents across the hierarchies. Typical user queries include finding the list of ARCs where a particular data steward is on the study author list, ARCs that have used a specific material sample from a named plant species, or ARCs that have used a particular high-throughput technology type; a sketch of such queries is given below. When integrated with multiple on-premise/remote DataHUBs hosting ARCs from different omics domains, the results of these queries offer new facets for cross-domain research and analysis.
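Assuming the indexed metadata is held in a document store, such queries could be expressed roughly as the MongoDB-style filters sketched below; the field paths are assumptions about how the ISA-JSON documents might be persisted, not the registry's actual schema.

```typescript
// Hypothetical MongoDB-style filters for the example queries; all field paths are assumptions.
const byStudyAuthor = {
  "isa.studies.people.lastName": "Doe",                 // ARCs with a given person on a study author list
};

const bySampleOrganism = {
  "isa.studies.materials.sources.characteristics": {
    $elemMatch: { category: "Organism", value: "Arabidopsis thaliana" },  // a named plant species
  },
};

const byTechnologyType = {
  "isa.studies.assays.technologyType": "mass spectrometry",  // a particular high-throughput technology
};
```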

We broadly categorize the requirements from the plant science community into three different ARC search functions.

  • integrated metadata search across the top layers of the ARCs, i.e., ISA-formatted experimental metadata and ARC Object-level metadata.

  • exploration of the metadata, the associated measurement data (processed and annotated), and the artifacts from iterations of computational analysis within an individual ARC.

  • integrated exploration of data across a set of selected ARCs (same or from different omics domains) and from single/multiple DataHUBs.

The required design features of an application that integrates multiple DataHUBs and implements the above search functions include

  • Flexibility in integrating multiple DataHUBs supporting different integration strategies and data exchange protocols.

  • Robust data interchange mechanism, preferably with standardized data exchange formats.

  • Effective and comprehensive data access through APIs and web-based user interfaces.

  • Cloud-native application design with high availability.

  • A user identification system complementing external authentication and authorization infrastructure.

  • End-to-end logging for user support and troubleshooting.

5 ARC Registry

We introduce ARC Registry, a cloud-native application for integrated search and analysis of multi-omics metadata within the plant science community. The registry application is designed to integrate multiple DataHUBs at once and to provide a consolidated real-time view of the metadata from the top layers of the ARCs across DataHUBs. It includes functionalities to search both ARC Object-level and ISA-formatted metadata.

5.1 Micro-services

ARC Registry follows a multi-service architecture. The application modules are built as individual micro-services that are loosely coupled and communicate securely with each other. Fig. 4 shows the individual services of the ARC Registry. The API gateway service is the entry point that authenticates and authorizes incoming API requests. ARC Registry exposes a range of secure APIs to access both ARC Object-level and ISA-formatted metadata.
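As a rough illustration of the gateway's role, the sketch below shows a hand-rolled Express middleware performing OAuth2 token introspection against a Keycloak realm; in practice Express Gateway is configuration-driven, and the realm, client, and secret used here are assumptions.

```typescript
import express from "express";

// Illustrative authentication middleware; realm name, client ID, and secret are assumptions.
const INTROSPECT_URL =
  "https://keycloak.example.org/realms/datahub/protocol/openid-connect/token/introspect";

async function authenticate(req: express.Request, res: express.Response, next: express.NextFunction) {
  const token = req.headers.authorization?.replace(/^Bearer /, "");
  if (!token) return next();                          // no token: only public metadata is served downstream
  const body = new URLSearchParams({
    token,
    client_id: "arc-registry",                        // hypothetical confidential client
    client_secret: process.env.KEYCLOAK_CLIENT_SECRET ?? "",
  });
  const resp = await fetch(INTROSPECT_URL, { method: "POST", body });
  const info = await resp.json();
  if (!info.active) return res.status(401).json({ error: "invalid or expired token" });
  res.locals.user = info.preferred_username;          // pass the identity on to the registry services
  next();
}

const app = express();
app.use(authenticate);                                // all registry APIs sit behind this check
```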

Fig. 4 ARC Registry architecture. The core functionalities are implemented as micro-services

HUB Integration is an important service capable of connecting to several DataHUBs and instantly downloading experimental metadata from ARCs. Each participating DataHUB requires a one-time integration setup to establish a two-way data exchange with the ARC Registry. Once set up, the ARC Registry is expected to receive real-time push notifications from the DataHUBs upon every ARC update, and the HUB Integration service downloads and processes the latest changes.

The DB Connection service abstracts the underlying database and the persisted metadata by exposing a set of internal APIs for the other registry services. Since the experimental metadata is modeled as standardized JSON documents, any document database with decent text search functionality fits best.
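The internal API surface of the DB Connection service can be pictured along the following lines; the method names are ours and only indicate the kinds of operations the other registry services rely on.

```typescript
// Illustrative internal API of the DB Connection service; method and parameter names are assumptions.
interface DbConnectionService {
  // Append a new version of an ARC's metadata documents (no in-place updates or deletes).
  insertVersion(arcId: string, objectMeta: object, isaJson: object): Promise<number>;
  // Return the latest stored metadata version of a given ARC.
  findLatest(arcId: string): Promise<object | null>;
  // Structured or free-text search across the stored ISA-JSON documents;
  // optionally include the private ARCs owned by a given authenticated user.
  query(filter: object, opts?: { includePrivateOf?: string }): Promise<object[]>;
}
```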

Fig. 5 ARC Registry integrated with multiple on-premise/remote DataHUBs hosting ARCs

The Web UI is a standalone service that provides web pages for users to search and explore both ARC Object-level and ISA-formatted metadata. By default, the web pages only show public ARCs. However, users can authenticate themselves through the external AAI to access both public ARCs (owned by any user) and the private ARCs owned by the authenticated user.

The logging service keeps a log of all the interactions among the other services in the ARC Registry application. Specifically, the logging service maintains a stage-wise log of interactions between the HUB Integration service and the DataHUBs for each ARC.

5.2 DataHUB Integration

There are two essential aspects to integrating DataHUBs hosting ARCs. The first aspect is generating real-time notifications whenever a user updates an ARC. For instance, changes to an existing workflow in an ARC, adding sets of result files from executing a workflow, changes to the administrative metadata, etc. The second aspect is accessing ARCs from the DataHUBs through APIs while abstracting the mode of communication such that DataHUBs and the ARC Registry are agnostic to underlying data transfer protocols, say, HTTP/S, SFTP, etc.

PLANTdataHUB uses a customized GitLab setup for the DataHUBs to maintain ARCs, inheriting GitLab's built-in features. ARCs are maintained as individual git repositories within the GitLab servers, making it easy to set up system-level continuous integration and delivery (CI/CD) pipelines that invoke the ARC Registry APIs upon commits to individual ARC repositories. Fig. 5 shows how the registry can be integrated with multiple GitLab servers at the same time. After initial user authentication, the data exchange happens as follows

Fig. 6 Sequence diagram depicting the interactions among the RDM entities, i.e., users, AAI, DataHUB, and ARC Registry. 1) Automated data exchange using the CI/CD pipeline between the DataHUB and ARC Registry. 2) A query API that returns all public ARCs when the user is not authenticated with the AAI. 3) A query API that returns both public and user-specific private ARCs when authenticated

  1. Any fresh user commit to an individual ARC repository triggers a CI/CD pipeline that performs two tasks, as shown in the first part of Fig. 6. First, an internal script in the pipeline scans for updates in the ISA-formatted metadata sections of the ARC and generates a JSON artifact (in the ISA-JSON exchange format) within the repository.

  2. The pipeline then triggers the webhook API of the ARC Registry with a notification in the form of a JSON payload containing ARC Object-level metadata, i.e., the ARC repository ID, GitLab username, email ID, file commits, additions, etc.

  3. The incoming webhook API request is routed to the HUB Integration service, which parses the ARC Object-level metadata, connects back to the respective DataHUB, and downloads the ISA-JSON artifact from the respective ARC repository using generic GitLab APIs. It then parses the ISA-JSON artifact and invokes the DB Connection service to persist it in the database, as sketched below.
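To make this flow concrete, here is a hedged sketch of a webhook handler in the HUB Integration service; the payload fields follow GitLab's push-event format, the artifact file name and DB call are assumptions, and the download uses GitLab's generic repository files API.

```typescript
import express from "express";

// Minimal stand-in for the DB Connection service's internal API (names are assumptions).
const dbConnection = {
  insertVersion: async (arcId: string, objectMeta: object, isaJson: object): Promise<void> => {
    /* persist to the document database; omitted in this sketch */
  },
};

const app = express();
app.use(express.json());

// Illustrative webhook endpoint invoked by the DataHUB's CI/CD pipeline.
app.post("/webhook/datahub", async (req, res) => {
  const { project_id, ref, user_username } = req.body;      // ARC Object-level metadata from the push event
  const branch = (ref ?? "refs/heads/main").replace("refs/heads/", "");

  // Download the ISA-JSON artifact generated by the pipeline via GitLab's repository files API.
  const hubUrl = process.env.DATAHUB_URL;                   // e.g. https://gitlab.example.org
  const artifact = encodeURIComponent("isa.arc.json");      // hypothetical artifact file name
  const resp = await fetch(
    `${hubUrl}/api/v4/projects/${project_id}/repository/files/${artifact}/raw?ref=${branch}`,
    { headers: { "PRIVATE-TOKEN": process.env.DATAHUB_TOKEN ?? "" } }
  );
  if (!resp.ok) return res.status(502).json({ error: "artifact download failed" });
  const isaJson = await resp.json();

  // Persist both the Object-level metadata and the parsed ISA-JSON as a new version.
  await dbConnection.insertVersion(String(project_id), { updatedBy: user_username, branch }, isaJson);
  res.status(202).json({ status: "accepted" });
});
```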

5.3 Metadata Versioning

ARC Registry stores data as collections of JSON documents in two different formats: ARC Object-level metadata in a proprietary format and ISA-formatted metadata in the standard ISA-JSON format. It is important to note that the data is read-only; no direct updates or deletes are allowed through the user APIs. Furthermore, the ARC Registry versions the JSON documents in the order they are received, as shown in Fig. 7. This allows changes to the ARC data to be tracked easily over time. The visibility of the ARC repository is reflected in the JSON documents.
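A minimal sketch of the append-only versioning, assuming a MongoDB-like document store (the registry does not prescribe a specific database, and the collection and field names here are ours):

```typescript
import { MongoClient } from "mongodb";

// Append-only versioning sketch: every received commit becomes a new document;
// existing documents are never updated or deleted. Collection and field names are assumptions.
async function saveMetadataVersion(
  arcId: string,
  visibility: "public" | "private",
  isaJson: object
): Promise<number> {
  const client = await MongoClient.connect(process.env.MONGO_URL ?? "mongodb://localhost:27017");
  const col = client.db("arc_registry").collection("isa_metadata");

  // Determine the next version number for this ARC in arrival order.
  const latest = await col.find({ arcId }).sort({ version: -1 }).limit(1).next();
  const version = ((latest?.version as number) ?? 0) + 1;

  await col.insertOne({ arcId, version, visibility, receivedAt: new Date(), isa: isaJson });
  await client.close();
  return version;
}
```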

Fig. 7 Metadata versioning, where each commit to an ARC repository is saved as a new ARC metadata version, reflecting visibility

The ARC Registry provides a variety of query APIs that enable users to search and explore the versioned metadata. By default, the query APIs return the matching metadata of all public ARCs. For instance, searching for the contextual parameter-value combination night exposure temperature = 22 degrees returns all the relevant experiments from public ARCs, grouped by DataHUB name, ARC ID, ARC version, Investigation, Study, and Assay name. However, when a valid user token received from the AAI is attached to the request header, the ARC Registry performs token introspection and also returns the matching metadata of the private ARCs owned by the respective user.
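From a client's perspective, such a query could be issued as sketched below; the endpoint path, parameters, and response shape are assumptions for illustration, while the bearer-token behaviour matches the description above.

```typescript
// Hypothetical client-side use of an ARC Registry query API; endpoint and field names are assumptions.
async function searchByParameter(parameter: string, value: string, accessToken?: string) {
  const url =
    `https://registry.example.org/api/v1/isa/search` +
    `?parameter=${encodeURIComponent(parameter)}&value=${encodeURIComponent(value)}`;
  const headers: Record<string, string> = {};
  if (accessToken) headers["Authorization"] = `Bearer ${accessToken}`; // adds the user's private ARCs

  const resp = await fetch(url, { headers });
  // Results grouped as described above: DataHUB name, ARC ID, ARC version,
  // Investigation, Study, and Assay name.
  return (await resp.json()) as Array<{
    dataHub: string; arcId: string; arcVersion: number;
    investigation: string; study: string; assay: string;
  }>;
}

// Without a token only public ARCs are searched; with a valid AAI token the user's private ARCs are included.
searchByParameter("night exposure temperature", "22 degrees").then(console.log);
```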

5.4 Deployment

The individual services of ARC Registry, including the web interface, are developed in NodeJS with the ExpressJS framework, and for the API gateway we use Express Gateway. Each service is containerized using Docker, and the application runs on Docker Swarm, which monitors services and ensures high availability. For OAuth2.0 services and AAI identity management, we use Keycloak.

6 Lessons Learned

The iterative development of RDM solutions for the plant research community is challenging. It requires developing the associated tools and search applications in tandem with the ever-evolving functional requirements of the community. It is necessary to use modularization, abstraction, and a micro-service architectural style that complements iterative development. Additionally, leveraging existing products (GitLab, etc.) inherently introduces their shortcomings into the RDM solution, requiring us to implement workarounds.

It can be challenging to automate data exchange between different RDM solutions, often requiring customized tools to integrate experimental metadata from research groups using different solutions.

7 Conclusion

It is acknowledged across bio-science research communities that an effective Research Data Management (RDM) platform based on FAIR principles is a necessity for packaging study-specific research results into uniquely identifiable and accessible FAIR digital objects. The integrated analysis of these research results provides better opportunities for knowledge discovery, collaboration, and innovation.

We introduce PLANTdataHUB, an end-to-end, data-centric FAIR-RDM solution for plant research data. Our ARC Registry is a cloud-native application that seamlessly integrates the experimental metadata from the DataHUBs in real time, addressing the requirements of the plant research community. ARC Registry offers comprehensive data access through APIs and complements external authentication and authorization frameworks, making it an ideal platform for multi-omics data analysis across participating research groups.

In 2021, the DataPLANT consortium and its stakeholders deployed a production DataHUB for the plant science community. The repository has grown since then, with more than 200 users hosting over 400 ARCs (around 15TB of data) from various plant study disciplines.

We are working on expanding the ARC Registry to include both metadata and measurement data from multiple on-premise and remote DataHUBs. The expanded application utilizes a polystore architecture to accommodate different data models. Its standout feature is adaptive indexing of the measurement data based on the cross-model query workload.