1.1 Digitalization and Data Management

Digitalization is one of the driving forces of technological and social progress today. In the engineering sciences, in combination with a great variety of quantitatively reliable modelling and simulation approaches, digitalization supports the development of what has become known as Industry 4.0 by contributing to virtual manufacturing through cyber-physical systems. To predict thermodynamic, mechanical and other physical properties of materials and processes, data-driven and physics-based models are combined [1], supported by massively parallel simulation methods that continue to become more scalable and performant [2]; model databases are developed [3, 4], and data with a heterogeneous provenance (i.e. origin), based on different methods and coming from different sources [5, 6], are integrated into shared data infrastructures [7]. A multitude of names have been proposed for the related lines of work in academic and industrial research and development, including Integrated Computational Materials Engineering (ICME), with a focus on solids [8, 9], Computational Molecular Engineering (CME), with a focus on fluids [4, 6], and process data technology or computer-aided process engineering (CAPE), with an orientation towards process technology and CAPE-OPEN-based simulation technology [10,11,12]. This book discusses data management in materials modelling, which is here understood to encompass all these fields.

Digitalization is achieved in two steps: First, data must be available in digital form. The process of making data available digitally is referred to as digitization; in the engineering sciences, with certain exceptions (e.g. data published in old volumes of journals that have not yet been digitized by scanning), this can usually be presupposed. However, the possible use of raw unannotated digital data, also known as dark data [13, 14], is very limited. Beyond digitization, a second step is therefore required for digitalization to ensure that the data are and remain findable, accessible, interoperable and reusable (FAIR): These are the FAIR principles of data management or data stewardship [15,16,17]. For some applications, such as mediation systems [18, 19] for Ontology-Based Data Access (OBDA) to distributed heterogeneous data sources [20, 21], these four principles are jointly fundamental and cannot be separated from each other. In other typical cases, e.g. for complex simulation workflows, interoperability is the main concern [22, 23]; however, even in these cases, it is reasonable to follow good practices concerning all the aspects of FAIR data management. Findability and accessibility are supported by systems of persistent identifiers, with Digital Object Identifiers (DOIs) now covering almost all scientific publications, and by platforms and legal solutions for open-access publishing.
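
As a minimal illustration of how persistent identifiers support findability and accessibility in practice, the following sketch retrieves machine-readable bibliographic metadata for a DOI via content negotiation against the doi.org resolver; it assumes the Python requests package is available, and the DOI shown is a placeholder rather than a real registered identifier.

```python
# Minimal sketch: resolving a persistent identifier (DOI) to machine-readable
# metadata via content negotiation at the doi.org resolver.
import requests

doi = "10.1000/xyz123"  # placeholder DOI; substitute a real, registered one

response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
response.raise_for_status()

metadata = response.json()  # CSL JSON record provided by the registration agency
print(metadata.get("title"), metadata.get("issued"))
```

Registration agencies such as Crossref and DataCite serve such records for the DOIs they administer, so that the identifier itself becomes the entry point to the metadata.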

The single aspect of greatest importance to the findability, interoperability and reusability of data is semantically characterized data annotation, i.e. the provision of metadata in a way that is widely agreed and understood on the basis of community-governed metadata standardization. This is the main topic of this book, where the focus will be on the interoperability aspects of FAIR data management and its practical realization by digital platforms and data infrastructures for materials modelling.

1.2 Semantic Interoperability

Interoperability is generally understood as being constituted by an agreement of multiple parties (platforms, code developers or similar) on a common standard, so that certain issues can be dealt with by all of them in the same way or, at least, in a sufficiently similar way. Ideally, this is the case when a whole community coherently adopts a single approach. This is often also called compatibility; in the strict sense, however, more recent use of the term compatibility restricts itself to the capability of exchanging data bilaterally, in the absence of a community standard. Theoretically, compatibility would then be more immediate than interoperability, since an intermediate third-party standard would not be required. However, it can be doubted whether this is a particularly useful distinction. Virtually every work on compatibility eventually aims at the widespread acceptance of a standard, protocol or file format. In this sense, interoperability is simply another, more modern word for all efforts at ensuring that heterogeneous software architectures, in the broadest sense, can function correctly.

Kerber and Schweitzer summarize that “interoperability has become a buzzword in European policy debates on the future of the digital economy” where “one of the difficulties of the interoperability discussion is the absence of a clear definition of interoperability” [24]. This is certainly not coincidental. A research and development landscape dominated by project-based funding from calls with priorities driven by political or cultural trends is a sure recipe for rendering the associated terminology vague to the point of complete dilution. “All stakeholders,” to use another buzzword, aim at securing their share. This is evidenced by the multitude of researchers who have only recently discovered that their traditional line of work is actually a subdiscipline of artificial intelligence, Industry 4.0 or data science (or, of course, all three). In the case of interoperability, this is particularly ironic given that one of its core elements consists in defining the precise meaning of concepts. But it is not a time for academic rigour.

As understood by the present work (necessarily in disagreement with others), there are three aspects of interoperability, corresponding to the major branches of theoretical linguistics: syntax, semantics and pragmatics. Syntactic interoperability is based on a common agreement on the grammar of a formal language, such as a file format or the arrangement of data items in a stream or in memory, while semantic interoperability refers to an agreement on the meaning or implications of the communicated content. In the context of digitalization and the design of digital infrastructures, the focus is typically on establishing a shared formalization of the semantics (rather than the syntax) for a particular application area, i.e. a domain of knowledge. Semantic interoperability can only be achieved if there are metadata standards by which the annotation of data is carried out and understood by all participants in an agreed way [16, 17]. This permits the integration of data communicated to a single platform from multiple sources or by multiple users; it further leads to interoperability between multiple platforms whenever the developers of these platforms agree on the same metadata standards or, where semantic heterogeneity remains, if an alignment can be constructed to harmonize the divergent standards [25,26,27,28,29]. Accordingly, the meaning of concepts and relations needs to be agreed upon, while the technical implementation and I/O are permitted to adhere to a variety of specifications and formats.
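
As a minimal sketch of what such agreed annotation can look like at the technical level, the following example (assuming the Python rdflib package; the dataset identifier and the domain namespace are illustrative placeholders) attaches metadata to a dataset as RDF triples, using the widely shared Dublin Core terms alongside a hypothetical domain-specific property.

```python
# Minimal sketch of semantic annotation: metadata expressed as RDF triples
# using a shared vocabulary (Dublin Core terms). Identifiers are illustrative.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

DOMAIN = Namespace("https://example.org/materials#")  # hypothetical namespace
g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("mat", DOMAIN)

dataset = URIRef("https://example.org/dataset/42")  # hypothetical identifier
g.add((dataset, DCTERMS.title, Literal("Shear viscosity of a model fluid")))
g.add((dataset, DCTERMS.creator, Literal("Jane Doe")))
g.add((dataset, DOMAIN.simulationMethod, Literal("molecular dynamics")))

print(g.serialize(format="turtle"))
```

Any platform that agrees on the same vocabulary terms can interpret such an annotation without bilateral coordination with the data provider.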

Semantic metadata standards are also known as semantic assets; in Fig. 1.1, the most common types of semantic assets are arranged by two main measures of their expressivity and richness in content: First, the depth of the provided representation of domain knowledge; second, the depth of digitalization, characterized by the extent to which processing of the represented knowledge can be automated. At the minimum with respect to both coordinates, only a list of concepts is compiled, i.e. a vocabulary (or lexicon). If explanations and definitions are added in a way that is understandable to human readers, this becomes a dictionary; in the field of materials modelling, this includes the molecular model database (MolMod DB) nomenclature [4]. A hierarchy of concepts is a taxonomy, where multiple narrower concepts are subsumed (symbol \( \sqsubseteq \)) under a broader concept, yielding a tree structure, e.g. in the scientific taxonomy of biological organisms

$$\begin{aligned} \textsf {{homo sapiens}} ~ \sqsubseteq ~ \textsf {{homo}} ~ \sqsubseteq ~ \textsf {{hominid}} ~ \sqsubseteq ~ \textsf {{primate}} ~ \sqsubseteq ~ \textsf {{mammal}} ~ \sqsubseteq ~ \textsf {{animal}}. \end{aligned}$$
(1.1)
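
A minimal sketch of how such a subsumption hierarchy becomes machine-actionable is given below; it encodes Eq. (1.1) as a chain of rdfs:subClassOf statements, again assuming the Python rdflib package and using an illustrative placeholder namespace.

```python
# Minimal sketch: the taxonomy of Eq. (1.1) as a chain of subclass relations.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

TAX = Namespace("https://example.org/taxonomy#")  # hypothetical namespace
g = Graph()
g.bind("tax", TAX)

chain = ["HomoSapiens", "Homo", "Hominid", "Primate", "Mammal", "Animal"]
for name in chain:
    g.add((TAX[name], RDF.type, OWL.Class))
for narrower, broader in zip(chain, chain[1:]):
    g.add((TAX[narrower], RDFS.subClassOf, TAX[broader]))

print(g.serialize(format="turtle"))
```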

A thesaurus extends a system of concepts by definitions of possible relations between individuals (objects) that instantiate them. For use on digital platforms, this is typically further formalized either as a hierarchical schema or as an ontology. In a hierarchical notation (e.g. XML or JSON), relations take the form of containment, e.g. in XML format, the tag representing one object can contain tags representing subordinate objects, in an arrangement that is well defined by an XML Schema Definition (XSD) and distinct from the taxonomic hierarchy of concepts. Applied to the structure of a document, such a hierarchy might, for instance, be given by

$$\begin{aligned} \textsf {{word}} ~\leftarrow ~ \textsf {{sentence}} ~\leftarrow ~ \textsf {{paragraph}} ~\leftarrow ~ \textsf {{section}} ~\leftarrow ~ \textsf {{chapter}} ~\leftarrow ~ \textsf {{book}}, \end{aligned}$$
(1.2)

where the symbol \(\leftarrow \) indicates that one entity is subordinate to another in terms of containment. In a hierarchical schema, this structural containment coexists with taxonomic subsumption. Ontologies, on the other hand, are non-hierarchical schemas, formalizing rules and definitions underlying knowledge graphs, where nodes representing individuals are connected to each other by edges that represent the relations. The Description Logic (DL) variants that are used to specify ontologies, mainly by means of the Web Ontology Language (OWL), are more expressive than the languages used for hierarchical schemas, and can therefore, in principle, encode more domain knowledge [30, 31]. Some extensions of DL even include modal logic or temporal logic [32].
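
Returning to the hierarchical case, the following sketch expresses the containment structure of Eq. (1.2) as nested XML using Python's standard library; the element names are illustrative, and in practice an XSD would define which elements may contain which others.

```python
# Minimal sketch: the containment hierarchy of Eq. (1.2) as nested XML.
import xml.etree.ElementTree as ET

book = ET.Element("book")
chapter = ET.SubElement(book, "chapter")
section = ET.SubElement(chapter, "section")
paragraph = ET.SubElement(section, "paragraph")
sentence = ET.SubElement(paragraph, "sentence")
word = ET.SubElement(sentence, "word")
word.text = "interoperability"

ET.indent(book)  # pretty-printing, available from Python 3.9
print(ET.tostring(book, encoding="unicode"))
```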

Fig. 1.1

Common categories of semantic assets. The horizontal axis indicates the amount of domain knowledge that is covered, while the vertical axis shows to what extent this information is machine-actionable. Labels adjacent to the boxes refer to formats or languages that are typically used (underlined) and to specific semantic assets that are particularly relevant to materials modelling (bold)

However, an agreement on syntax and semantics is insufficient without a general understanding of performative roles, the context in which a communication occurs and, generally, what the different participants in an exchange can reasonably expect from each other. The statement “the accused is guilty of high treason,” for instance, is syntactically correct in the English language. Its denotational meaning might be formalized by linking “the accused” to a formal representation of the specific person, and “is guilty of” and “high treason,” respectively, to a relation and an entity from an ontology representing the laws of the country. However, even assuming that we accept the statement to be true, its impact will vary greatly depending on who says it (e.g. a journalist, the prosecutor or the judge), at which point, and in which context. If multiple countries decide to set up a joint court, they need to agree on the legal framework and on the language to be used at its sessions, but also on the pragmatics, much of which relates to role definitions and standards for good and best practices: How is a person appointed to become a judge, what qualifications are needed and what code of conduct needs to be followed? Pragmatic interoperability concerns such requirements and recommendations pertaining to the practice of communicating and dealing with data [33,34,35]. If pragmatic interoperability is to be implemented in a machine-processable way, it becomes inseparable from semantic technology, and closely related techniques can be used to specify semantic and pragmatic interoperability standards [35,36,37].

1.3 Semantic Assets and Metadata Categories

The purpose of semantic assets and, in particular, metadata models is to describe a research object in all its relevant aspects. It is advisable to define categories for this description, since the aspects differ in their specificity. Some are general and apply to every discipline (such as file size or authorship), whereas others apply only to a single domain. Moreover, existing metadata standards and ontologies usually cover specific aspects of a description and can then be used as building blocks. Furthermore, in big data science, the automated extractability of semantic information becomes crucial, and the different aspects differ in how hard they are to extract. In the following, we categorize the semantic description into four main classes. These originally stem from computational engineering [38], but also hold for materials modelling:

  1. Technical metadata describe technical characteristics of the research asset, i.e. essentially the file attributes at the filesystem level and other syntactic information. These include not only file sizes, checksums, storage locations and access dates, but also file formats.

  2. Descriptive metadata provide general information about the research asset, such as the authors of the data, some keywords or a title. The data are thus described in terms of their content, from a higher-level logical standpoint.

  3. Process metadata describe the generation process and the provenance of the research asset, for example, the computational environment and software used to generate or process the data. This description may include several linked, consecutive steps.

  4. Domain-specific metadata describe the research objects from the domain-specific perspective. In computational engineering, this includes details about the simulated target system, the simulation method or the spatial and temporal resolution, for example.

These four dimensions form the core of every rich data description. The four classes hold not only for engineering but also for other fields of science. It is then up to the metadata engineer to fill the categories with content; we will see how to do this in Chap. 2, taking EngMeta as an example.
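
As a simple illustration of the four categories just introduced, a description of a single research asset could be grouped as follows; all field names and values are invented for this sketch and do not follow EngMeta or any other particular standard.

```python
# Minimal sketch: a metadata record for one research asset, grouped into the
# four categories discussed above. All field names and values are illustrative.
metadata_record = {
    "technical": {
        "file_name": "run_0042.trajectory",
        "file_size_bytes": 2_147_483_648,
        "checksum_md5": "d41d8cd98f00b204e9800998ecf8427e",
        "format": "binary trajectory",
    },
    "descriptive": {
        "title": "Shear viscosity of a model fluid at 300 K",
        "creators": ["Jane Doe", "John Smith"],
        "keywords": ["molecular dynamics", "viscosity"],
    },
    "process": {
        "software": "hypothetical MD code v1.2",
        "compute_environment": "Linux cluster, 512 cores",
        "workflow_step": "production run following equilibration",
    },
    "domain_specific": {
        "target_system": "Lennard-Jones fluid",
        "method": "equilibrium molecular dynamics",
        "timestep_fs": 1.0,
        "temperature_K": 300.0,
    },
}
```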

Fig. 1.2

The four metadata categories and their level of specificity

The specificity of the categories increases from (1) to (4), as shown in Fig. 1.2. Whereas the technical and descriptive categories and their metadata keys are generic and hold across different fields of science, the process category is closely bound to the research process and the domain-specific category to the research object. However, the content of the classes may overlap, and a metadata key might belong to two or more categories. The probability that suitable standards exist decreases with the categories' specificity, as also indicated in Fig. 1.2. Whereas many standards exist for technical and descriptive metadata, this does not hold for the latter two categories, which require a significant dedicated development effort. Regarding technical metadata, the semantic information is similar in all research fields as long as the data are organized in files; a typical standard here is PREMIS [39]. Descriptive metadata keys are likewise similar (or even the same) across all disciplines; here, DataCite is the de facto standard for a general description and for citable data objects [40]. In contrast, process metadata are strongly tied to the research process, and metadata standards only exist for specific processes, e.g. CodeMeta [41] and the Citation File Format [42] for the description of software and codes. For domain-specific metadata, only standards for specific research objects exist. Some relevant standards for the four categories are shown in Table 1.1. Knowledge of the categories and of existing standards enables the metadata designer to use certain parts as building blocks when compiling a standard for a given area.

Table 1.1 Examples of relevant metadata standards for the four categories
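
As an example of descriptive metadata, a minimal record along the lines of the mandatory DataCite properties might look as follows; the values are invented, and the field selection is a simplified subset that should be checked against the official DataCite Metadata Schema.

```python
# Minimal sketch: a simplified subset of DataCite-style descriptive metadata.
# All values are invented; consult the DataCite Metadata Schema for the
# authoritative property definitions.
datacite_like_record = {
    "identifier": {"identifier": "10.1000/xyz123", "identifierType": "DOI"},
    "creators": [{"name": "Doe, Jane"}],
    "titles": [{"title": "Shear viscosity of a model fluid at 300 K"}],
    "publisher": "Hypothetical University Repository",
    "publicationYear": 2021,
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}
```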

Moreover, the distinction between these four categories is crucial with respect to automated extractability. It has been shown for computational engineering that some categories are easier to extract automatically than others [38]: Technical information is easy to extract, since it mostly consists of file system attributes. Process and domain-specific information is relatively easy to extract automatically for computational engineering applications, since it is available in the output, job (input) and log files of simulation codes. Descriptive information is hard to extract automatically, since it describes the research from a higher level and makes human interaction necessary.
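
A minimal sketch of such automated extraction of technical metadata from file system attributes is given below; the field names are illustrative and not taken from PREMIS or EngMeta.

```python
# Minimal sketch: automated extraction of technical metadata from file-system
# attributes. Field names are illustrative.
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def technical_metadata(path: Path) -> dict:
    stat = path.stat()
    return {
        "file_name": path.name,
        "file_size_bytes": stat.st_size,
        "last_modified": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "checksum_md5": hashlib.md5(path.read_bytes()).hexdigest(),
        "format_hint": path.suffix.lstrip("."),
    }

if __name__ == "__main__":
    print(technical_metadata(Path(__file__)))
```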

1.4 Perspective and Outline of the Book

Fig. 1.3

Landscape of interoperable platforms and infrastructures in materials modelling funded under the Horizon 2020 research and innovation programme

This work presents two approaches to metadata standardization in materials modelling: first, hierarchical schemas, represented here by EngMeta, an XML schema for data management in the engineering sciences that is presently in use for the DaRUS data infrastructure at the University of Stuttgart; second, ontologies, represented by the metadata standards from the Virtual Materials Marketplace (VIMMP) project.

VIMMP belongs to the LEIT-NMBP line of the European Union’s Horizon 2020 research and innovation programme, where substantial efforts and funds have been concentrated on the single objective of creating what may be labelled an environment or ecosystem of interoperable CME/ICME platforms, cf. Fig. 1.3. This line of work is grouped around two coordination and support actions (i.e. networking-oriented projects), the first of which, EMMC-CSA, led to the creation of the European Materials Modelling Council (EMMC) as an interest group and community organization; the second, OntoCommons, supports the uptake of ontologies as a technology in materials modelling. The associated interoperability effort is based on the Review of Materials Modelling (RoMM), a compendium that aims at establishing a coherent understanding of all major modelling and simulation approaches from quantum mechanics up to continuum methods [43]. On this basis, first, MODA (Model Data) was introduced as a standardized description of simulation workflows together with their intended use cases [44]; subsequently, a variety of domain ontologies were developed and connected to the European Materials and Modelling Ontology (EMMO), a top-level ontology aiming at describing all that exists from a perspective that is advantageous for CME/ICME infrastructures and applications [45, 46]. EngMeta, on the other hand, was developed at the University of Stuttgart in an environment where digitalization of materials modelling is advanced through the Cluster of Excellence “Data-Integrated Simulation Science” (EXC 2075), the Stuttgart Center for Simulation Science, and work on repositories including ReSUS (Reusable Software University Stuttgart), DaRUS (Data Repository of the University of Stuttgart), and the programme for national research data infrastructures (NFDI) of the German Research Foundation (DFG).

Both approaches have their strengths. On the one hand, ontologies are experiencing a great surge in popularity; they are increasingly seen as a key component of state-of-the-art solutions in data technology. Nonetheless, they also have drawbacks. First, as a technology, ontologies are comparably heavy, requiring substantial resources for development and maintenance. While certain communities have succeeded in establishing agreed domain-specific semantic frameworks [47, 48], it is also commonly found that (occasionally quite complex) ontologies are developed within a project and then abandoned when the project is over. While it is undeniable that classification schemes in general are a prerequisite for interoperability, there is no consensus on whether a less expressive framework or a more expressive one is to be preferred. The advantage of less expressive languages is that they can be handled with a multitude of technologies and tools, and that they are typically lighter and faster. Richer languages make it possible to describe more complex relations, but at the price of being tied to newer, less widespread technologies and of typically being computationally more demanding. Moreover, all new technologies need to overcome a barrier to adoption before they are employed widely. Superficially, it may seem that ontologies have advanced relatively far along this path; however, comparing the uptake of ontology-based semantic technologies in research software engineering (simulation codes, etc.) with alternatives such as XML/XSD-based solutions makes it clear that the advance of ontology-based solutions is still at an early stage.

Additionally, ontology design can result in an overregulation of domain practices. This risk is inherent to the prescriptive (rather than descriptive) nature of ontologies and taxonomies: just as grammar regulates syntax, ontologies purport to regulate meaning. In reality, however, there are always many possible ways to ontologize any given domain of knowledge. The concurrent development of multiple incommensurable paradigms is one of the manifestations of progress in a scientific discipline and the major driving force for the emergence of new specializations [49]; in the words of Kuhn, scientific revolutions are characterized by a “change in several of the taxonomic categories prerequisite to scientific descriptions and generalizations. That change, furthermore, is an adjustment not only of criteria relevant to categorization, but also of the way in which given objects and situations are distributed among preexisting categories” [50]. Insisting on the adoption of a single ontology by a whole field of science makes that field perfectly interoperable, but at the price of scientific stagnation. In this view, any field that is developing fast, as modelling and simulation technology today certainly is, will consider a plurality of paradigms, i.e. semantic heterogeneity, as an indicator of its success, rather than as an obstacle, and pursue the frequent renegotiation of alignments between multiple ontologies, taxonomies and hierarchical schemas [26, 27, 51].

However, even if knowledge is formalized in a machine-readable format, there will still be steps beyond the design of the ontology where human intervention is needed, notably, to classify and annotate the individuals. Whenever this occurs, we rely on the fact that the semantics is really shared, i.e. that there is an actual common understanding of all the concepts by all the users. In practice, colloquial language and insufficient familiarity with concept definitions can give rise to common pitfalls and false friends in using ontologies. By stating that something is generally true, we could mean “all the time” or “most of the time”; in the EMMC context, the words translation and mesoscopic have very specific meanings that most domain experts in materials modelling would intuitively misconstrue. Hence, human error must be anticipated (in addition to genuine disagreements on how an ontology should be applied), and similarly, automated annotation tools based on natural language processing are not free of error. All of this highlights the need for a community to gather and share concepts in a continuous effort; it corroborates the conclusion that there is a trade-off between the expressive power of ontologies and the associated social and technological cost of development and maintenance.

The remainder of the book is structured as follows: Chapter 2 introduces the XSD-based engineering metadata schema EngMeta and the research data infrastructure DaRUS, situating them in the context of the emerging environment of databases and repositories for data from physics-based modelling and simulation. The system of domain ontologies developed within VIMMP is presented in Chap. 3, concerning data provenance, services and transactions at the marketplace level, and in Chap. 4, concerning the description of solvers, associated aspects such as licenses and software features, and the characterization of physical and other variables that occur in modelling and simulation. On this basis, Chap. 5 addresses issues related to the practical use of the metadata standards, including syntactic interoperability and concrete scenarios from molecular modelling and simulation; it also discusses challenges that arise from semantic heterogeneity, wherever multiple interoperability standards are concurrently employed for identical or overlapping domains of knowledge, or where domain ontologies need to be matched to top-level ontologies such as the EMMO.