Research Data Infrastructures and Engineering Metadata

This chapter introduces metadata models as a semantic technology for knowledge representation that describes selected aspects of a research asset. The process of building a hierarchical metadata model is reenacted in this chapter and illustrated by the example of EngMeta. Moreover, an overview of data infrastructures is given, their general architecture and functions are discussed, and multiple examples of data infrastructures in materials modelling are presented.


How to Engineer Metadata
The art of engineering a metadata model comprises several consecutive steps, which are described in this subsection. This process, or a single step of it, may have to be iterated several times to arrive at a fine-grained, purposeful description of the research asset. In short, the following steps are necessary to engineer a metadata model. First, a consensus must be reached about what purpose metadata actually serves in the given context. Then, an object model has to be carved out of the research process. Last, the object model has to be transferred to a formal representation and implemented, thereby becoming a metadata model.

Definitions of Metadata and Metadata Models
However, at the beginning of designing metadata for a certain purpose, it first has to be discussed how metadata is defined. Usually, metadata is defined as a structured form of knowledge representation or simply, as many authors put it, as "data about data" [2]. Edwards describes this as the holy grail of information science: "Extensive, highly structured metadata often are seen as a holy grail, a magic chalice both necessary and sufficient to render sharing and reusing data seamless, perhaps even automatic" [3, p. 672]. However, metadata is always strongly context dependent. To tackle this context dependence, metadata must serve as a mode of communication: "We propose an alternative view of metadata, focusing on its role in an ephemeral process of scientific communication, rather than as an enduring outcome or product" [3, p. 667]. Following this, metadata takes the role of a semantic technology: its task is to relieve data producers and data consumers of direct communication and negotiation and should thereby diminish the "science friction" [3] that occurs in every process where research data is exchanged. To illustrate science friction, imagine two researchers exchanging a dataset that is not properly described by metadata. The receiver might suppose that the variable t_i denotes a data point in a time series. To obtain clarification, the receiver would have to contact the sender of the data, and this clarification process itself can be defective. This example shows the importance of metadata as a semantic asset, and therefore as a mode of fixed, negotiated communication.
Additionally, as Jane Greenberg puts it, metadata should semantically support the specific workflow [4]. For example, metadata may describe a data point with an error bar and define the form of the error, thus supporting the interpretation of the data point.
Following this discussion of metadata, a metadata model can then be seen as the middle ground between a non-formal model and a complete formalization of metadata keys, according to [5]. Its task is to describe the research object, or parts of it, and its relations to other objects. Metadata models are still interpretations; however, they are constructed in a transparent and comprehensible way, derived from a common understanding of the research object, and lead to a fixed negotiation. The approach described in this chapter could also be called ontology-based metadata, since the metadata model is engineered from an object model. As depicted in Fig. 1.1, hierarchical models such as EngMeta range below an ontology; however, their task is also to balance the depth of domain knowledge representation and the depth of digitization. The question of in what terms a metadata model differs from an ontology has already been discussed in Sect. 1.2.

Object Model
The object model is the starting point for engineering a metadata model and marks the first phase of the creation process [5]. In this phase, an object model or, respectively, an ontology description is carved out in a non-formal or natural language (possibly containing graphical elements), describing and explicating all the relevant objects, terms, relations and rules. Every person potentially involved has to contribute to this process, since the metadata model will act as a semantic convention for a common understanding of the research data described.
The first part of engineering an object model is a clear and fixed understanding of what the object of research is and what data it is representing. This can only be achieved by analysing the research process with all the stakeholders included. In this step, the following information must be gathered:
• Entities: All relevant entities (or objects) of the research process must be identified. This includes finding classes of entities, grouping entities or merging them. In materials modelling, one relevant entity is, for example, the component, which represents a chemical species.
• Attributes: For each entity defined in the previous step, attributes describing the entity must be found. To stick with the example, the component is characterized by attributes like a name, the SMILES or IUPAC code and a unit.
• Relations: In this part, the relations between the entities must be clarified, i.e. how they are linked to each other to deliver a holistic description. The arguments must be reasonable, but are strongly specific to the research. For example, one could argue that the component is related to the simulated target system. Usually in metadata modelling, is-part-of relations are sufficient to model the vast majority of cases. However, relations are not limited to these hierarchical types and may give a semantically more advanced description, which will eventually lead towards ontologies.
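To make the entity/attribute/relation notions concrete, the object model sketched above can be mirrored in a few lines of code; the class and field names here are illustrative choices for this sketch, not the normative EngMeta vocabulary:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    """Entity: a chemical species, described by its attributes."""
    name: str
    smiles_code: str = ""
    iupac_code: str = ""
    unit: str = ""

@dataclass
class TargetSystem:
    """Entity: the simulated target system."""
    name: str
    # is-part-of relation: each component is part of the target system
    components: List[Component] = field(default_factory=list)

system = TargetSystem(name="water-methanol mixture")
system.components.append(Component(name="water", smiles_code="O"))
system.components.append(Component(name="methanol", smiles_code="CO"))
```

Nesting the components inside the system expresses the is-part-of relation directly in the data structure, which is exactly the hierarchical kind of relation the text describes as sufficient for most cases.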

Fig. 2.1 Example of the component entity, which has several attributes, such as the smilesCode, and is a part of the simulated target system, which is shown by a relation

Figure 2.1 shows how a component in materials modelling could be represented by an entity, some attributes and a relation, according to the example given above. All the entities can then be categorized according to the classes proposed in Sect. 1.3. The component entity would be categorized as discipline-specific metadata.
Also in this step, the question arises whether the description needs to be data centric or process centric. How to answer this question strongly depends on the research process. For example, in code development, one needs to continuously follow the changes made to the code, i.e. the process of programming. Hence, the appropriate description of programming can only be process centric. 1 In data science applications, it strongly depends on the workflow whether a data-centric or a process-centric description should be chosen. In general, if data is the main outcome, even in a chain of process steps, one might want to choose a data-centric approach. If the processes are central to the research endeavour, and each process has a discrete output, one might choose a process-centric description. Of course, both approaches are not mutually exclusive: a data-centric approach also includes process information, and a process-centric approach an elaborated description of the data. It is just a matter of hierarchical structuring and precedence. In Sect. 2.1.2, we will discuss why and how a data-centric model was chosen for computational engineering and realized in EngMeta.

The Metadata Model and Its Implementation
When the object model is converted to a formal language, special care has to be taken if parts of the object model already exist in some standard. With respect to the categorization of Sect. 1.3, the probability of finding existing, fitting standards for technical or descriptive metadata is high, whereas for process- and domain-specific metadata they are not likely to be found. Some of the relevant standards are described in the aforementioned section; however, a vast number of standards exists.
Another consideration when implementing the model is choosing the right formal language for representing the metadata model. Most likely, this will be XSD 2 or JSON Schema. 3 Both offer a strict structural definition of the entities, attributes and relations, and the decision is more or less based on the setting of the metadata model: What skills are available, and what are the technical requirements for the implementation? For example, the question of which standard is supported by the database or repository where the metadata will later be stored is crucial when deciding on an implementation language.
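For illustration, a minimal JSON Schema fragment for the component entity from the object model above might look as follows; the property names are assumptions made for this sketch and are not taken from an existing standard:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "component",
  "type": "object",
  "properties": {
    "name":       { "type": "string" },
    "smilesCode": { "type": "string" },
    "iupacCode":  { "type": "string" },
    "unit":       { "type": "string" }
  },
  "required": ["name"]
}
```

An equivalent XSD would express the same structural constraints; the choice between the two, as noted above, is largely driven by the tooling of the target repository.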

Metadata Processes
A metadata model alone is not sufficient. As Edwards puts it, metadata products such as models have to be accompanied by metadata processes: "Metadata products can be powerful resources, but very often – perhaps even usually – they work only when metadata processes are also available" [3, p. 668]. Otherwise, if processes are not available, "metadata friction" would occur and the semantic assets would become worthless. This friction denotes the additional effort of (manual) metadata annotation and management, which has to be reduced by corresponding processes. This view is backed by the FAIR principles [6] and the additional guidance from an EU report [7]. The FAIR principles state metadata description as the main concept, and the study [7] complemented this rather technical approach with processes surrounding these principles. In the case of materials modelling and computational engineering in general, these processes would include, but are not limited to, the following:
• Automated metadata extraction: One finding of [8] states that manual metadata annotation is a barrier to good research data management, especially in the engineering sciences. Hence, automated metadata extraction is a major supporting process.
• Data and metadata stewardship: Data and metadata need clear responsibilities and roles that define stewardship. Such a role carries the responsibility of supporting metadata annotation, building metadata models and checking the data inventory for unindexed data. An example is the Scientific Data Officer [9].
• Incentives: One main process to support metadata products is providing incentives to use models and tag the data with metadata. These incentives can be either intrinsic or extrinsic. Intrinsic incentives would include low barriers for metadata annotation. Extrinsic incentives would include making metadata annotation of the published research data mandatory for scientific publication.
• Culture: To support metadata annotation, cultural processes also have to be adapted. Metadata annotation and research data management have to be seen as an essential part of scientific practice. The process of science has to be adapted to 1. publishing the data Open Access and 2. applying the FAIR paradigm of data description to it. This cultural change may be linked to the above process of incentives: as of now, researchers only get recognition for publishing papers, not data.

Metadata for Engineering: The EngMeta Metadata Scheme
In this subsection, an example of a metadata model and its design is given. EngMeta [1, 8, 10] is a semantic metadata standard for computational engineering and was designed following the principles of the previous subsection. Following Staab et al. [5], EngMeta could be referred to as an ontology-based metadata model. A comparison to VIMMP as a genuine ontology is carried out in Sect. 4.5. EngMeta was designed as a joint effort of researchers from the computational engineering sciences (process engineering and aerodynamics), from the library sciences as well as from the computer sciences. This allowed the design of an integrated metadata model covering all the relevant research aspects in all four categories as described in Sect. 1.3.

The Object Model of EngMeta
For the design of EngMeta, the object of research had to be identified first. This seems to be an easy task, but the devil is in the detail. As aerodynamics and molecular dynamics served as use cases, it was clear that computational engineering and its outcome were the common ground, but not more. All four metadata categories defined in Sect. 1.3 had to be written out with representations, which could only be accomplished by analysing the research itself for common entities and attributes for process and domain. Both technical and descriptive metadata keys were quite straightforward, since their specificity is low (see Fig. 1.2). The process metadata and the domain-specific metadata were harder to carve out from both use cases and could only be gathered by a detailed analysis of the research process. The following entities were determined as process metadata for computational engineering:
• processingStep serves as the highest level of the description of the provenance of the data and describes one processing step in the research process.
• environment describes the computational environment on which the research was conducted, e.g. the hardware and compiler.
• software describes the software environment in which the research was conducted, e.g. the code and its version.
The following entities were determined as domain-specific metadata for computational engineering applications. They were seen as common ground, stemming from the use cases of aerodynamics and thermodynamics, but could also be applicable to use cases of materials modelling and beyond:
• system represents the simulated target system (or the observed system) and its characteristics, which are the metadata keys listed below.
• variable represents the used variables and parameters, which can be either controlled or measured variables. This is not bound to a specific field of research but holds more generally for most applications in computational science, as variables and parameters are the basis of every simulation.
• method holds the information on the simulation method, such as "simulation with umbrella sampling".
• component describes the names and SMILES/IUPAC codes of the molecules and solvents used within the simulation.
• force field describes the force field which is used for the simulation.
• boundaryCondition describes the properties on the boundaries of two components.
• spatial resolution defines the spatial resolution of a simulation.
• temporal resolution defines the temporal resolution of a simulation, for example, the number of timesteps, the interval and other characteristics.
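As an illustration of how such keys might appear in a concrete instance document, the following sketch assembles a small XML fragment with Python's standard library; the element names follow the keys listed above but are illustrative and not guaranteed to match the normative EngMeta schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical instance fragment for a simulated target system,
# using the domain-specific keys discussed in the text.
system = ET.Element("system")

component = ET.SubElement(system, "component")
ET.SubElement(component, "name").text = "methanol"
ET.SubElement(component, "smilesCode").text = "CO"

variable = ET.SubElement(system, "variable")
ET.SubElement(variable, "name").text = "temperature"
ET.SubElement(variable, "value").text = "298.15"
ET.SubElement(variable, "unit").text = "K"

ET.SubElement(system, "temporalResolution").text = "100000 timesteps"

xml_text = ET.tostring(system, encoding="unicode")
```

Such a fragment already carries the semantic information a receiver needs to interpret the dataset without contacting the sender, which is precisely the science-friction argument made earlier.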
It also became clear that the model would be data centric, since the research process in computational engineering reaches a steady state when a dataset is produced by a simulation or by post-processing of some data. However, it is crucial to document the processing steps as well for a good provenance description. This leads to an object model where the dataset is at the top of the hierarchy and can include several processing steps.
The complete object model of EngMeta, with all entities, their attributes and relations, is depicted in Fig. 2.2. The four metadata categories are coloured differently.

The Metadata Model of EngMeta and Its Implementation
After setting up the object model, research was conducted to determine whether there are metadata standards that serve the purpose of describing research assets in computational engineering as defined by the object model. None was found; however, it was identified that different metadata standards cover certain aspects of the EngMeta entities. This coverage is shown in Table 2.1 with respect to the four metadata categories. CodeMeta is a description of software tools and serves for the software part of EngMeta. DataCite is the standard for descriptive metadata and, moreover, enables the data to obtain a DOI; it was therefore integrated into EngMeta. PREMIS is a standard for technical metadata, and ExptML was integrated for experimental devices, which can also be modelled by EngMeta. As PROV is a standard for provenance, a crosswalk to this standard was developed in order to achieve semantic interoperability [1]. Moreover, this table shows a comparison to VIMMP, which is discussed in Chap. 4, regarding existing standards. The model has been implemented as an XML Schema Definition (XSD) and is available for open use and modification. 4

The Metadata Processes Supporting EngMeta
As discussed in Sect. 2.1.1.4, a metadata model needs to be complemented with metadata processes. Otherwise, it will not be fully effective in making research data FAIR. In the example of EngMeta, the model was complemented by an automated metadata extraction, the establishment of a research data management competence centre and an institutional repository. Details on the repository can be found in the following section on research data infrastructures, especially in Sect. 2.2.3.1. FOKUS was established as the main competence centre for questions and support regarding research data management at the University of Stuttgart. The automated metadata extraction tool ExtractIng was designed and implemented. It works in such a way that all the existing metadata, stemming from log, job and various other files in the HPC and simulation environment, are extracted and converted to the EngMeta metadata model. It can be integrated into the specific research process, and it was shown what an automated approach would look like for the simulation sciences: right after the simulation run, the ExtractIng tool is triggered, transforming all the scattered metadata into a standardized form according to EngMeta. Then, the metadata can be automatically uploaded to the repository, together with the data, forming a dataset within the repository that includes all relevant semantic information for FAIR interoperability.
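The principle behind such an extraction can be sketched as follows; the log format and the mapping onto EngMeta-style keys are invented for illustration and do not reflect the actual ExtractIng implementation:

```python
import re

# Toy simulation log; real HPC jobs scatter such metadata over
# log, job and input files (format assumed for this sketch).
log = """\
software  = GROMACS 2021.4
method    = umbrella sampling
timesteps = 100000
"""

KEY_MAP = {  # log key -> EngMeta-style key (illustrative names)
    "software": "software",
    "method": "method",
    "timesteps": "temporalResolution",
}

metadata = {}
for line in log.splitlines():
    m = re.match(r"(\w+)\s*=\s*(.+)", line)
    if m and m.group(1) in KEY_MAP:
        metadata[KEY_MAP[m.group(1)]] = m.group(2).strip()
```

The harvested dictionary can then be serialized against the metadata model and ingested into the repository together with the data, removing the manual annotation step identified as a barrier in [8].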

Research Data Infrastructures
Research data infrastructures enable the data to become findable and accessible (the FA in FAIR), whereas semantic standards enable interoperability and reusability (the IR in FAIR). Hence, research data infrastructures are the second crucial pillar of FAIR data technology, as both parts are inseparable for semantic interoperability in materials modelling. Research data infrastructures are similar to repositories in that they ensure the enrichment of data with metadata, long-term preservation and open-access availability for the scientific community. Moreover, data infrastructures serve as the link between the data and the community and therefore play a significant role in science. This section is organized as follows. First, the requirements and functions of data infrastructures are explained in detail in Sect. 2.2.1.

Requirements and Functions
Data infrastructures in materials modelling should, besides the typical data management tasks of storing, sharing and enabling FAIR data, support the specific research by integrating open simulation codes, analytics tools and the management of the scientific workflow [11]. This means that a data infrastructure goes beyond a mere archival repository. However, the core of all data infrastructures is an archive with repository functions. The OAIS Reference Model (ISO 14721) can give an orientation of how such a core may look [12], and the following functionality was derived from this framework:
• Data Ingest: Functionalities for how to ingest data have to be defined and implemented. This includes the design of an appropriate user interface and integration into the workflow.
• Data Preservation and Archiving: Originally split into two functionalities in the OAIS framework; for our purpose of defining functionalities for materials modelling, merging them into one is sufficient. This functionality should ensure permanent storage of the ingested data. Data preservation corresponds to bitstream preservation on this layer.
• Data Management: This functionality corresponds to metadata management and linking the data objects according to metadata information.
• Administration: This functionality includes not only administrative tasks, but also policy management and AAI.
• Data Access: This functionality must be designed and implemented via a user interface in order to ensure data access for users. Moreover, this includes capabilities to search and explore the data infrastructure.
As mentioned earlier, the above basic functions have to be accompanied by supportive functions for the scientific workflow. These should include the following:
• Workflow support: The above functionalities have to be integrated seamlessly into the scientific workflow of the field.
• Service tool integration: As moving data is expensive, the data infrastructure has to enable data analytics and processing tools close to the data repository. This can also include visualization services.

Architectures
Data infrastructures can be logically divided into three major layers, which are depicted in Fig. 2.3 [13]. The functions defined in the previous Sect. 2.2.1 have to be implemented in a specific layer or throughout all three layers. It is subject to the precise implementation of a data infrastructure which function resides in which layer. 5

The base layer of a data infrastructure is the storage layer (l1), where the data objects are physically stored and bitstream preservation is guaranteed. Technically, this layer can exist in a distributed and/or hierarchical setting and is often a combination of hard disk and tape storage. The intermediate layer is the object layer (l2), whose basic functionality is metadata management. Through this layer, data from the storage layer is enriched with metadata, and data objects become information objects with a persistent identifier, whose purpose is to make the data citable. The third layer is the service layer (l3); it includes the user interface and marks the visible part of the data infrastructure. Moreover, this layer includes additional services, such as automated metadata extraction.

Basically, data infrastructures implement all three layers; however, they can operate in distributed environments. Usually, the base layer (l1) is the hardware part of the data infrastructure, whereas the layers (l2) and (l3) are the software part. The functionalities of the layers (l2) and (l3) are usually covered by repository software. A repository is a store for data that organizes this data in some logical manner and makes the data available for usage to a specified group of persons. It is important to mention that a repository is not a filesystem, which means that its purpose is not to manage files in directory structures. Rather, a repository must be imagined as collections of files organized in sets (in some logical manner, for example, as datasets, as linked data, in a loose hierarchical structure, ...), which are described by metadata, are searchable and retrievable, and are provided with a persistent identifier.
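This notion of a repository can be illustrated with a toy in-memory model, in which sets of files are ingested together with metadata, receive a persistent identifier and become searchable over their metadata; the DOI-like prefix and the interface are purely hypothetical:

```python
import itertools

class Repository:
    """Toy repository: datasets, metadata, identifiers, search."""

    def __init__(self, prefix="10.99999"):  # hypothetical PID prefix
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._datasets = {}  # pid -> {"files": [...], "metadata": {...}}

    def ingest(self, files, metadata):
        """Store a set of files with metadata; return a persistent id."""
        pid = f"{self._prefix}/{next(self._counter)}"
        self._datasets[pid] = {"files": list(files), "metadata": dict(metadata)}
        return pid

    def retrieve(self, pid):
        return self._datasets[pid]

    def search(self, **query):
        """Return pids of datasets whose metadata matches the query."""
        return [pid for pid, ds in self._datasets.items()
                if all(ds["metadata"].get(k) == v for k, v in query.items())]

repo = Repository()
pid = repo.ingest(["run.log", "traj.xtc"], {"method": "umbrella sampling"})
```

Unlike a filesystem path, the identifier and the metadata-based search are what make the dataset citable and findable, independently of where the bytes physically reside in layer (l1).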
Out-of-the-box generic repository software packages are generally available and serve different purposes. Some of those packages stem from document management, whereas others have their origins in data/file management. Their origin has to be taken into account when evaluating repository software for a specific use case or domain. Table 2.2 gives an overview of typical data repository software. For example, Dataverse originates from the management of datasets, whereas DSpace stems from managing document files. 6 However, DSpace is also capable of managing datasets, and the Fraunhofer-Gesellschaft is using it to store research data in its institutional repository Fordatis 7 [16].
Research data infrastructures can be classified into institutional and domain-specific infrastructures. Institutional data infrastructures correspond to research data management at an institutional level and are not bound to a specific discipline. An example of this type is DaRUS, which will be discussed in Sect. 2.2.3.1 of this chapter. A domain-specific data infrastructure is bound to a specific discipline and can span multiple institutions. An example of a domain-specific data infrastructure for materials modelling is NOMAD, which will be discussed in Sect. 2.2.3.2.

DaRUS
Even though the Data Repository of the University of Stuttgart (DaRUS) 8 is an institutional repository and not limited to materials modelling, it is discussed here since its development was strongly driven by the EngMeta metadata model. Moreover, it is an example of a loosely coupled data infrastructure. Its development was driven by the need for a sustainable repository for the University of Stuttgart, and in particular for the materials modelling community at the university, as well as by the precursory design of EngMeta. Within the repository, EngMeta serves as the semantic core, and the repository is built around the metadata model, an approach also termed metadata-driven repository development. The requirements, such as handling large datasets, stemmed from aerodynamics and molecular dynamics [17].
DaRUS is based on Dataverse; the driving factors for choosing this repository software were its design for research data management, its integration with the DOI persistent identifier infrastructure, its adaptability to metadata standards and its monolithic design. In the Dataverse repository software package, all the data is organized in Dataverses (organizational structures), datasets and files [18]. A Dataverse is the highest element in the hierarchical data organization structure of the repository and typically represents an institute or a research project. A dataset in the Dataverse terminology corresponds to a directory or a collection of files. As of July 2020, DaRUS holds almost 600 files in 49 datasets, which are organized in 60 Dataverses, mainly from the fields of engineering, computer science and physics.
As DaRUS is an institutional repository, it is only loosely coupled to the research infrastructure since it is generic. This means that the service layer (l3) is basically the generic Dataverse web GUI. Additional services can be integrated by using one of the APIs that Dataverse offers, such as REST or SWORD. For example, an automated toolchain (as an external tool) was implemented using the Dataverse API for the specific use case of thermodynamics: after a simulation run, an automated metadata extraction is triggered. Then, the extracted metadata, together with the data, is automatically ingested into the DaRUS repository [19].
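A sketch of such an API integration, here using the dataset-creation endpoint of the Dataverse native REST API, may look as follows; host, Dataverse alias and API token are placeholders, and the full metadata block format should be taken from the Dataverse API documentation of the respective installation:

```python
import json
import urllib.request

def build_request(host, dataverse_alias, api_token, title):
    """Prepare (but do not send) a dataset-creation request."""
    payload = {
        "datasetVersion": {
            "metadataBlocks": {
                "citation": {
                    "fields": [{
                        "typeName": "title",
                        "typeClass": "primitive",
                        "multiple": False,
                        "value": title,
                    }]
                }
            }
        }
    }
    return urllib.request.Request(
        url=f"https://{host}/api/dataverses/{dataverse_alias}/datasets",
        data=json.dumps(payload).encode(),
        headers={"X-Dataverse-key": api_token,
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("demo.example.org", "my_institute", "TOKEN", "MD run 42")
# urllib.request.urlopen(req) would perform the actual ingest.
```

Triggering such a request from a post-simulation hook is essentially what the automated toolchain described above does for the thermodynamics use case.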

NOMAD
In contrast to DaRUS, the Novel Materials Discovery (NOMAD) laboratory 9 (or Novel Materials Discovery Center of Excellence (NOMAD CoE)) is a prime example of a domain-specific data infrastructure which is highly integrated [20] into a virtual research environment. The repository part is complemented with the NOMAD Archive, the NOMAD Encyclopedia, the NOMAD Visualization Tools and the NOMAD Analytics Toolkit. NOMAD is recommended by Nature 10 for depositing supplementary data when submitting a research article on materials modelling.
The NOMAD Repository is the central component of the laboratory and holds input and output data from materials simulations, with a retention period of 10 years free of charge. The NOMAD Archive holds the open-access data from the repository, converted into a code-independent format. To accomplish this, developing a metadata definition and a metadata component was crucial. It serves, just as proposed in Sect. 2.1.1.1, as a common understanding 11 and, in line with the overall outline of this book, for making data semantically interoperable. The metadata definition uses 168 aligned and 2,360 code-specific metadata keys. For example, the different terms for quantities had to be mapped to one aligned term. According to [20], the development of this component of the data infrastructure was a challenge. The NOMAD Encyclopedia is the part of the NOMAD data infrastructure which provides millions of calculations via a web GUI with a materials-oriented view and therefore serves as a knowledge base and a materials classification system. The NOMAD Visualization Tools are a centralized service for data visualization within the data infrastructure, allowing users interactive graphical analysis in materials modelling. Additionally, the NOMAD Analytics Toolkit is a big data analytics approach to support data evaluation, for example, scanning for specific thermoelectric materials or finding suitable materials for heterogeneous catalysis.
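The alignment of code-specific keys onto code-independent keys can be pictured as a simple mapping table; the key names below are invented examples for this sketch and not actual NOMAD metadata keys:

```python
# (code, code-specific key) -> aligned, code-independent key.
# All names here are illustrative placeholders.
ALIGNED = {
    ("VASP", "TOTEN"): "energy_total",
    ("FHI-aims", "total_energy"): "energy_total",
    ("GROMACS", "Potential"): "energy_total",
}

def normalize(code, record):
    """Translate a code-specific record into aligned keys; keys
    without an aligned counterpart keep a code-specific prefix."""
    return {ALIGNED.get((code, k), f"x_{code.lower()}_{k}"): v
            for k, v in record.items()}

out = normalize("VASP", {"TOTEN": -123.4})
```

The hard part in practice, as [20] notes, is not the lookup itself but agreeing on the aligned vocabulary across dozens of simulation codes.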
In the NOMAD laboratory, the archive and the repository components correspond to the storage layer (l1) and the object layer (l2), whereas the encyclopedia, the analytics toolkit and the visualization tools correspond to the service layer (l3), which is strongly coupled to the base layers.
As of February 2020, the NOMAD data infrastructure holds 49 TB of raw data in the repository and 19 TB in the archive in normalized, annotated form in 758 datasets. 12

Materials Cloud
The Materials Cloud 13 is another domain-specific data infrastructure; it includes all three aforementioned layers and implements them with specific technology supporting the data life cycle in materials modelling [11]. The Materials Cloud is, just like NOMAD, recommended by Nature for supplementary data for journal submissions in materials modelling. In the Materials Cloud, the ARCHIVE, DISCOVER, EXPLORE, WORK and LEARN components form the data infrastructure.
The ARCHIVE component represents the open-access research data repository with long-term storage, metadata protocols (including metadata harvesting for Google Dataset Search and B2FIND) and persistent identifiers (DOIs). The hardware backend of ARCHIVE is hosted at the Swiss National Supercomputing Centre, is free of charge, and data records are preserved for 10 years. For the software layer, Invenio will be used. ARCHIVE is moderated, which means all ingested data is first checked against certain criteria, just as on preprint document servers. The DISCOVER component corresponds to the browsing capabilities for curated datasets of ARCHIVE and offers interactive visualization. The EXPLORE part of the system is the component that tracks and displays provenance information of the datasets to ensure FAIR and reproducible data. All this information is recorded by the AiiDA system, which can be imagined as a Git-style methodology for data; the information is shown in a provenance graph. The WORK component is the part of the Materials Cloud data infrastructure that allows working with the available data, either via stand-alone tools that perform inexpensive calculations or via AiiDA lab. AiiDA lab is a tool for defining workflows and orchestrating them from the web interface, since it lets users connect to and use remote computational resources or other repositories which implement the OPTIMADE standard, 14 such as, for example, NOMAD. The LEARN part of the system features educational material, such as tutorials, video lectures and a downloadable image of a virtual machine for training purposes in materials modelling. This part is important since it covers metadata processes as discussed in Sect. 2.1.1.4.
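A query against an OPTIMADE-compliant endpoint, as mentioned for the WORK component, can be sketched as follows; the base URL is a placeholder, while the filter string follows the OPTIMADE filter grammar:

```python
from urllib.parse import urlencode

# Ask a (hypothetical) OPTIMADE provider for structures containing
# both silicon and oxygen; any compliant provider, e.g. Materials
# Cloud or NOMAD, exposes the same /v1/structures endpoint shape.
base = "https://example.org/optimade/v1/structures"
params = {"filter": 'elements HAS ALL "Si","O"', "page_limit": 10}
url = f"{base}?{urlencode(params)}"
```

Because the filter grammar and response format are standardized, the same query can be reused across providers, which is exactly the interoperability that makes federating repositories such as NOMAD from AiiDA lab feasible.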
Just like NOMAD, the Materials Cloud is a highly integrated data infrastructure, where the ARCHIVE component acts as the storage layer (l1) and the object layer (l2). The service layer (l3) is made up of the DISCOVER, EXPLORE, WORK and LEARN components.

Chemotion, MoMaF and NFDI
The Science Data Center for Molecular Materials Research (MoMaF) 15 is one of the four Science Data Center (SDC) projects of the state of Baden-Württemberg in Germany, started in late 2019. Its goal is to support the data life cycle and implement the FAIR principles through a domain-specific repository for molecular materials research, the digitalization of lab books and metadata standards.
MoMaF relies on preliminary work conducted in the Chemotion project, 16 whose aim was to build a data infrastructure for synthetic and analytic chemistry [21, 22]. The core of Chemotion is a repository that allows collecting, reusing and publishing data. It is complemented with discipline-specific data processing tools, incorporates DOI generation and supports publishing, for example through support for peer-reviewing submissions and comparing submissions with the PubChem database. The repository architecture consists of a private workspace and a publication area. Electronic laboratory notebooks play a crucial role here and can be imported into the private workspace. Research data 17 can, after the addition of metadata and a reviewing process, later be staged from the private workspace to the publication area, where it is provided with a DOI and made Open Data. Also within this approach, we can see how a repository on the object layer is complemented with additional tools in the service layer, such as data processing tools or electronic laboratory notebooks.
The work and the results from the MoMaF SDC will later be used in the National Research Data Infrastructure (NFDI) for Chemistry [23] as one of the NFDI projects in Germany. Another project within the NFDI, which will also have an impact on materials modelling, is NFDI for Catalysis. 18

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.