1 Introduction

If we look back at the history of computing, we can see that information technologies were initially very much bound to the capabilities of the hardware but have become more intuitive and “human” over the past decades. In the very beginning of computing, programmers had to physically interact with the technology, for example by setting registers via switches or feeding punch cards. In the 1970s and 1980s, assembler programming became more prevalent; you still interacted relatively closely with the physical hardware, but at least it was no longer a physical exercise. Computer scientists then discovered that there are more intuitive ways to interact with computers and took inspiration from cooking recipes: procedural programming of the 1980s and 1990s (e.g., using programming languages such as Pascal or C) basically resembled what cookbooks have done for centuries: describing the ingredients and a sequence of steps to realize a certain outcome. Even more intuitive was object-oriented programming, which dominated the following two decades (the 1990s and 2000s), where large amounts of source code were no longer expressed as lengthy spaghetti code but organized intuitively into objects and methods. This programming paradigm can already be seen as being inspired by how our brain sees the real world—we learn concepts of abstract objects (abstract entities such as cars, trees, or buildings) and see their realization (or instantiation) in reality. Also, objects have certain characteristics (size, color, shape) and functions, which correspond to the data and methods associated with the objects.

This, however, is not the end of the development. The problem with object-oriented programming is that functions and methods dominate, while the data is often deeply hidden in the code or in data silos where the structure of the data is known only to a few experts. We currently see increasing attention to data (e.g., in the context of big data, smart data, and data science)—data is becoming more and more a first-class citizen of computing. Still, many challenges lie ahead of us to realize the vision of cognitive data. We need to find and use more intuitive representations of data that capture their structure and semantics in machine- and human-comprehensible ways, so that we develop a common understanding of the data across use cases, organizations, applications, value chains, or domains. Knowledge graphs, linked data, and semantic technologies (see, e.g., [1, 5, 9, 10, 11]) are good candidates in this regard and are discussed in this chapter as a basis for realizing data spaces.

2 The Neglected Variety Dimension

The three classic dimensions of Big Data are volume, velocity, and variety. While there has been much focus on addressing the volume and velocity dimensions, the variety dimension was rather neglected for some time (or tackled independently). Meanwhile, however, most use cases where large amounts of data are available in a single well-structured data format have already been exploited. The real opportunities now lie where we have to aggregate and integrate large amounts of heterogeneous data from different sources—this is exactly the variety dimension. The Linked Data principles, which emphasize holistic identification, representation, and linking, allow us to address the variety dimension. As a result, just as the Web constitutes a vast global information system, the Linked Data principles allow us to build a vast, globally distributed data space and to efficiently integrate enterprise data. This is not only a vision; it has already started and is gaining more and more traction, as can be seen with the schema.org initiative, Europeana, or the International Data Spaces.

2.1 From Big Data to Cognitive Data

While there has been much focus on addressing the volume and velocity dimensions of Big Data, e.g., with distributed data processing frameworks such as Hadoop, Spark, and Flink, the variety dimension was rather neglected for some time. We have not only a variety of data formats—e.g., XML, CSV, JSON, relational data, graph data—but also data distributed across large value chains and different departments inside a company, under different governance regimes, data models, etc. Often the data is distributed across dozens, hundreds, or in some use cases even thousands of information systems.

An analysis by SpaceMachineFootnote 1 demonstrates that the breakthroughs in AI are mainly related to the data—while many algorithms were devised early and are relatively old, only once suitable (training) datasets became available were we able to exploit these AI algorithms. Another important factor is, of course, computing power, which, thanks to Moore’s law, allows us every 4–5 years to efficiently process data an order of magnitude larger than before.

In order to deal with the variety dimension of Big Data and to establish a common understanding in data spaces, we need a lingua franca for data mediation, which allows us to:

  • Uniquely identify small data elements without a central identifier authority. This sounds like a small issue, but identifier clashes are probably the biggest challenge for data integration.

  • Map from and to a large variety of data models, since there are and always will be a vast number of different specialized data representation and storage mechanisms (relational, graph, XML, JSON, etc.).

  • Define data schemas in a distributed, modular way and refine them incrementally. The power of agility and collaboration is meanwhile widely acknowledged, but we need to apply it to data and schema creation and evolution.

  • Deal with schema and data in an integrated way, because what is a schema from one perspective turns out to be data from another one (think of a car product model—it’s an instance for the engineering department, but the schema for manufacturing).

  • Generate different perspectives on data, because data is often represented in a way suitable for a particular use case. If we want to exchange and aggregate data more widely, data needs to be represented more independently and flexibly, thus abstracting from a particular use case.

The Linked Data principlesFootnote 2 (coined by Tim Berners-Lee) allow us to deal with exactly these requirements:

  1. Use Uniform Resource Identifiers (URIs) to identify the “things” in your data—URIs are almost the same as the URLs we use to identify and locate Web pages and allow us to retrieve and link the global Web information space. We also do not need a central authority for coining the identifiers; everyone can create their own URIs simply by using a domain name or webspace under their control as a prefix. “Things” refers here to any physical entity or abstract concept (e.g., products, organizations, locations, and their properties/attributes).

  2. Use HTTP URIs so that people and machines can look them up on the Web (or an intra-/extranet)—an important aspect is that we can also use the identifiers for retrieving information about them. A nice side effect is that we can actually verify the provenance of information by retrieving the information about a particular resource from its original location. This helps to establish trust in the distributed global data space.

  3. When a URI is looked up, return a description of the thing in the W3C Resource Description Framework (RDF)—just as HTML gives us a unified information representation technique for the Web, we need a similar mechanism for data. RDF is relatively simple and allows us to represent data semantically and to mediate between many other data models.

  4. Include links to related things—just as we can link between web pages located on different servers or even on different ends of the world, we can reuse and link to data items. This is crucial for reusing data and definitions instead of recreating them over and over again, thus establishing a culture of data collaboration.

As a result, just as the Web constitutes a vast global information system, these principles allow us to build a vast, globally distributed data management system in which we can represent and link data across different information systems. This is not just a vision; it has already started to happen. Some large-scale examples include:

  • The schema.org initiativeFootnote 3 of the major search engines and Web commerce companies, which has defined a vast vocabulary for structuring data on the Web (already used on a large and growing share of Web pages) and uses GitHub for collaboration on the vocabulary.

  • Initiatives in the cultural heritage domain such as Europeana,Footnote 4 where many thousands of memory organizations (libraries, archives, museums) integrate and link data describing the artifacts.

  • The International Data Spaces Initiative,Footnote 5 aiming to facilitate the distributed data exchange in enterprise value networks, thus establishing data sovereignty for enterprises.

  • The National Research Data Infrastructure,Footnote 6 aiming to build a research data space comprising data repositories, ontologies, and exploration and visualization infrastructure for all major research areas.

Further similar initiatives have started in other areas such as open government data, the life sciences, or geospatial data.

3 Representing Knowledge in Semantic Graphs

For representing knowledge in graphs, we need two ingredients: unique identifiers and a mechanism to link and connect information from different sources. Uniform Resource Identifiers (URIs) and subject-predicate-object statements according to the W3C RDF standard provide exactly this. Just as we can build long texts out of small sentences, we can build large and complex knowledge graphs from relatively simple RDF statements. As a result, knowledge graphs can capture the semantics and meaning of data and thus lay the foundation for data spaces between different organizations or a data innovation architecture within an organization.

3.1 Representing Data Semantically

Let us look at how we can represent data semantically, so that it captures meaning, represents a common understanding between different stakeholders, and allows us to interlink data stored in different systems.

Identifying Things

The basis of semantic data representation is URIs—Uniform Resource Identifiers. Just as every Web page has its URL (which you can see in the location bar of your browser), URIs can identify any thing, concept, data item, or resource.
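
Here are a few examples of URIs; the concrete identifiers below are illustrative and either reappear later in this chapter or come from well-known public vocabularies:

    <http://infai.org>                      # an organization (the InfAI institute)
    <http://infai.org/LSWT2021>             # an event (Leipzig Semantic Web Day 2021)
    <http://dbpedia.org/resource/Leipzig>   # a city, as described in the DBpedia knowledge base
    <http://schema.org/Organization>        # a class from the schema.org vocabulary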

Everyone who has a domain name or webspace can coin their own URIs, so you do not rely on a central authority (e.g., GS1, ISBN) as with other identifier systems. Since every URI contains a domain name, information about the provenance and the authority coining the URI is built into the identifier. It is important to note that URI identifiers can point to any concept, thing, entity, or relationship, be it physical/real or abstract/conceptual.

Representing Knowledge

Once we have a way to identify things, we need a way to connect information. The W3C standard RDF follows simple linguistic principles: the key elements of natural language (e.g., English) sentences are subject, predicate, and object. The following subject-predicate-object triple, for example, encodes the sentence “InfAI institute organizes the Leipzig Semantic Web Day 2021”:

figure a
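
In RDF’s Turtle notation, this triple could be written as follows (a minimal sketch reconstructing the example; the three URIs are exactly those discussed in the next paragraph):

    <http://infai.org> <http://conf-vocab.org/organizes> <http://infai.org/LSWT2021> .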

As you can see, we use the identifier http://infai.org for “InfAI institute,” http://conf-vocab.org/organizes for the predicate “organizes,” and http://infai.org/LSWT2021 as the identifier for the object of the sentence, “Leipzig Semantic Web Day 2021.” Just as we connect sentences in natural language by using the object of one sentence as the subject of a further sentence, we can add more triples describing LSWT2021 in more detail, for example by adding the start date and the location:

figure b
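
A sketch of the extended example in Turtle (the conf: prefix, the predicate names startDate and location, and the concrete date value are illustrative assumptions; the Leipzig URI is the DBpedia resource mentioned below):

    @prefix conf: <http://conf-vocab.org/> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    <http://infai.org>          conf:organizes <http://infai.org/LSWT2021> .
    <http://infai.org/LSWT2021> conf:startDate "2021-06-14"^^xsd:date ;   # illustrative date value
                                conf:location  <http://dbpedia.org/resource/Leipzig> .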

As you can see in this example, a small knowledge graph starts to emerge, in which we describe and interlink entities represented as nodes. You can also see that we can mix and match identifiers from different knowledge bases and vocabularies, e.g., here the predicates from a conference vocabulary and the location referring to the DBpedia resource Leipzig. The start date of the event is not represented as a resource (having an identifier) but as a literal—the RDF term for a data value, which can have various datatypes (e.g., string, date, number).

Knowledge Graphs

Building on these simple ingredients, we can build arbitrarily large and complex knowledge graphs. Here is an example graph describing a company:

figure c

A knowledge graph [6] is a fabric of concept, class, property, relationship, and entity descriptions, which uses a knowledge representation formalism (typically the W3C standards RDF, RDF-Schema, and OWL) and comprises holistic knowledge covering multiple domains and sources at varying granularity. In particular, it comprises:

  • Instance data (ground truth), which can be open (e.g., DBpedia, Wikidata), private (e.g., supply chain data), or closed (e.g., product models), as well as derived and aggregated data.

  • Schema data (vocabularies, ontologies) and metadata (e.g., provenance, versioning, documentation, licensing) as well as comprehensive taxonomies to categorize entities.

  • Links between internal and external data and mappings to data stored in other systems and databases.

Meanwhile, a growing number of companies and organizations (including Google, Thomson Reuters, Uber, Airbnb, and the UK Parliament) are building knowledge graphs to connect the variety of their data and information sources and to build a data innovation ecosystem [7].

4 RDF: A Holistic Data Representation for Schema, Data, and Metadata

A major advantage of RDF-based knowledge graphs is that they can comprise data, schema, and metadata using the same triple paradigm. Data integration is thus already built into RDF, and knowledge graphs can capture information from heterogeneous distributed sources. As a result, RDF-based knowledge graphs are a perfect basis for establishing a data innovation layer in the enterprise for mastering digitalization and laying the foundation for new data-based business models.

RDF is a holistic data and knowledge representation technique that allows us to represent not only data but also its structure, the schema, and metadata in a unified way. This is important, since what is schema from one perspective is data from another and vice versa. Let’s look again at our company knowledge graph from the previous section. We can represent the entities DHL and PostTower in the graph as RDF triples consisting of subject, predicate, and object (the simple text syntax is called RDF Turtle):

figure d
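
A sketch of what these two entity descriptions might look like in Turtle (the ex: namespace and the literal and object values are illustrative assumptions; the class and property names are those used in the text):

    @prefix ex:  <http://example.org/> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

    ex:DHL        rdf:type       ex:Company ;
                  ex:fullName    "DHL International GmbH" ;   # illustrative value
                  ex:inIndustry  ex:Logistics ;
                  ex:headquarter ex:PostTower .

    ex:PostTower  rdf:type       ex:Building ;
                  ex:location    ex:Bonn .                    # illustrative property and value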

From the triple representation, you can already see why RDF is superior for data integration: integration is already built into the data model. In order to integrate RDF data from different sources, we just have to merge the respective triples (i.e., throw them into a larger graph). Imagine, in contrast, having a multitude of different XML or relational data schemata; integrating those requires dramatically more effort.

The entities are now also assigned to classes: DHL is a company and PostTower a building (indicated by the rdf:type property). We can define these classes by listing the properties that can be used in conjunction with them and the respective expected types:

figure e
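
A minimal sketch of such class and property definitions using RDF Schema (the ex: namespace is an assumption; the original figure may model this differently, e.g., with schema.org terms):

    @prefix ex:   <http://example.org/> .
    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    ex:Company     a rdfs:Class .
    ex:Industry    a rdfs:Class .
    ex:Building    a rdfs:Class .

    ex:inIndustry  a rdf:Property ;
                   rdfs:domain ex:Company ;
                   rdfs:range  ex:Industry .    # values must be instances of Industry

    ex:fullName    a rdf:Property ;
                   rdfs:domain ex:Company ;
                   rdfs:range  xsd:string .     # values are string literals

    ex:headquarter a rdf:Property ;
                   rdfs:domain ex:Company ;
                   rdfs:range  ex:Building .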

This means that the properties inIndustry, fullName, and headquarter are supposed to be used with instances of the class Company; inIndustry should point to an instance of another class, Industry, while the values assigned to the fullName property should be strings. The schema.org initiativeFootnote 7 of the major search engines defines a large number of classes and associated properties, and meanwhile a large percentage of Web pages are annotated using the schema.org vocabulary. The following figure illustrates how schema and data can both be represented in triples:

figure f
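
For instance, an instance-level triple and a schema-level triple about the same property can sit side by side in one graph (a sketch using the namespaces from the previous examples):

    ex:DHL        ex:inIndustry ex:Logistics .   # data:   an entity and its industry
    ex:inIndustry rdfs:range    ex:Industry .    # schema: the expected type of the property's values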

Although it sounds like a small thing, the integrated representation of schema and data in a simple, flexible data model is a key prerequisite for heterogeneous data integration. As such, RDF follows a schema-last paradigm—you do not have to define a comprehensive data model upfront but can easily add new classes, attributes, and properties simply by using them. Also, URIs as unique identifiers and the fine-grained elements (triples) allow us to represent data, information, and knowledge in an integrated way while keeping things flexible.

5 Establishing Interoperability by Linking and Mapping between Different Data and Knowledge Representations

After computer scientists searched for the holy grail of data representation for decades (remember the logic, ER/relational, XML, graph, and NoSQL waves), it is now widely accepted that no single data representation scheme fits all [8]. Instead, we have a vast variety of data models, structures, systems, and management techniques, all of which have their right to exist, since each serves one particular requirement well (e.g., data representation or query expressiveness, scalability, or simplicity). As a result of this plurality, a systematic approach for linking and integration is of paramount importance. RDF fulfills exactly this requirement: it can mediate between different data models and evolve incrementally as the original data sources and schemas change.

A unique and key feature of RDF is that it is perfectly suited to link and mediate between different data schemas, conceptual and data granularity levels, and data management systems. Let me illustrate this with three examples of how relational data, taxonomic/tree data, and logical/axiomatic information can be represented in RDF.

The following diagram shows schematically how a relational table can be transformed into RDF triples. The rationale is that a (primary) key is used for generating URI identifiers (here the Id column), columns are mapped to RDF properties (here “Title” to rdfs:label and “Screen” to hasScreenSize), and rows become RDF instances. The example shows how the first table row can be represented as RDF triples:

figure g
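
A sketch of such a transformation for an assumed product table with the columns Id, Title, and Screen (all concrete values and the ex: namespace are illustrative):

    @prefix ex:   <http://example.org/product/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    # relational row:  Id = 1 | Title = "Laptop X200" | Screen = 14.0
    <http://example.org/product/1>
          rdfs:label       "Laptop X200" ;
          ex:hasScreenSize "14.0"^^xsd:decimal .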

Of course, in large-scale applications, it is not required to actually materialize the RDF view by physically transforming the data to RDF. Instead, data can be transformed on demand when RDF links are accessed or RDF queries (e.g., in the SPARQL query language) are executed. The W3C R2RML standardFootnote 8 provides a comprehensive language for mapping relational data to RDF, which is meanwhile integrated into major DBMSs (e.g., Oracle) as well as vendor-independent stand-alone mapping systems (e.g., eccenca’s CMEMFootnote 9).
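
For illustration, such a mapping could be expressed in R2RML roughly as follows (a sketch, assuming a relational table PRODUCT with the columns ID, TITLE, and SCREEN; the ex: namespace is again an assumption):

    @prefix rr:   <http://www.w3.org/ns/r2rml#> .
    @prefix ex:   <http://example.org/product/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    <#ProductMapping>
        a rr:TriplesMap ;
        rr:logicalTable [ rr:tableName "PRODUCT" ] ;
        # one URI per row, generated from the primary key column
        rr:subjectMap   [ rr:template "http://example.org/product/{ID}" ] ;
        rr:predicateObjectMap [
            rr:predicate rdfs:label ;
            rr:objectMap [ rr:column "TITLE" ]
        ] ;
        rr:predicateObjectMap [
            rr:predicate ex:hasScreenSize ;
            rr:objectMap [ rr:column "SCREEN" ]
        ] .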

The next example illustrates how taxonomic or tree data can be represented in RDF. Here the idea is to simply express each sub-taxon relationship with a respective RDF triple.

figure h
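
A sketch of a small product taxonomy expressed as triples (the taxonomy itself and the use of SKOS broader relations are illustrative assumptions; rdfs:subClassOf would be an equally common choice for class hierarchies):

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix ex:   <http://example.org/taxonomy/> .

    ex:Laptop   skos:broader ex:Computer .      # Laptop is a sub-taxon of Computer
    ex:Tablet   skos:broader ex:Computer .
    ex:Computer skos:broader ex:Electronics .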

This example also demonstrates that RDF is perfectly suited as a data integration lingua franca, since its triples are fine-grained building blocks that can be combined in very different ways.

Another common type of information that needs to be represented is schema, constraint, or logical information. Here is an example of how this works and how it can even be augmented with logical axioms:

figure i
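
A sketch of this example in Turtle, combining RDFS subclass statements with an OWL disjointness axiom (the ex: namespace is an assumption):

    @prefix ex:   <http://example.org/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .

    ex:Male   rdfs:subClassOf  ex:Human .
    ex:Female rdfs:subClassOf  ex:Human .

    # OWL axiom: no individual can be both Male and Female
    ex:Male   owl:disjointWith ex:Female .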

First, “Male” and “Female” are defined as subclasses of “Human,” and a logical axiom from the OWL Web Ontology Language additionally states that the two subclasses must be disjoint. This example illustrates that we can start with relatively simple representations and then iteratively add more structure and semantics as we go. For example, we can start by expressing the data in RDF, then add schema (e.g., class and property definitions), then add more constraints and axioms, and so on—thus following the schema-last (or here, constraints-last) paradigm.

These were only three examples; there are meanwhile many more standardized ways to map, link, and integrate other data representation formalisms with RDF, such as:

  • JSON Linked Data (JSON-LD) for JSON.Footnote 10

  • RDF in Attributes (RDFa) for embedding/integrating RDF with HTML.Footnote 11

  • CSV on the Web for Tabular CSV data.Footnote 12

After the misfortune of its late birth (after XML) and the initial misconception of aligning RDF too closely with XML (encoding the triple data model in XML trees) and heavyweight ontologies, a much more pragmatic, developer- and application-friendly positioning of RDF is meanwhile gaining traction with Linked Data, JSON-LD, and knowledge graphs. This is illustrated in an adaptation of the original Semantic Web layer cake from 2001,Footnote 13 which shows that RDF integrates well with many different data models and technology ecosystems:

figure j

6 Exemplary Data Integration in Supply Chains with SCORVoc

Supply chain management aims at optimizing the flow of goods and services from the producer to the consumer. Closely interconnected enterprises that align their production, logistics, and procurement with one another thus enjoy a competitive advantage in the market. To achieve such close alignment, an instant, robust, and efficient information flow along the supply chain, both between and within enterprises, is required. However, because of the great diversity of enterprise information systems, data governance schemes, and data models, much less efficient human communication is still often used instead of automated systems.

Automatic communication and analysis among various enterprises requires a common process model as a basis. The industry-agnostic Supply Chain Operations Reference (SCOR) model [9], backed by many global players (including IBM, HP, and SAP), aims precisely at tackling this challenging task. By providing 201 different standardized processes and 286 metrics, it offers a well-defined basis for describing supply chains within and between enterprises. A metric represents a KPI that is used to measure processes. The applicability of SCOR, however, is still limited, since the standard stays on the conceptual and terminological level, and major effort is required to implement it in existing systems and processes. The following figure represents a typical supply chain workflow.

figure k

Each node represents an enterprise and each arrow a connection. The values beside the connections can have many dimensions: the reliability of a delivery, the costs involved, and the time it takes to deliver from one place to another.

A semantic, knowledge graph-based representation of the supply chain data exchanged in such a network enables the configuration of individual supply chains together with the computation of industry-accepted performance metrics. By employing a machine-processable supply chain data model such as the SCORVoc RDF vocabulary, which implements the SCOR standard, together with W3C-standardized protocols such as SPARQL, such an approach represents an alternative to closed software systems, which lack support for inter-organizational supply chain analysis.

SCORVoc [10] is an RDFS vocabulary that formalizes the SCOR reference model. It contains definitions for the processes and the KPIs (“metrics”), the latter in the form of SPARQLFootnote 14 queries. A process is a basic business activity. The following figure gives an overview of the SCORVoc vocabulary.

figure l

As an example, there are multiple sourcing processes: scor:SourceStockedProduct, scor:SourceMakeToOrderProduct, and scor:SourceEngineerToOrderProduct, depending on whether a delivery is unloaded into stock, used in the production lines, or used for specially engineered products. Each time such an activity takes place, all data needed to evaluate how well the process performed is captured, such as whether the delivery was on time, whether it was delivered in full, or whether all documents were included. Each process contains at least one metric definition. Due to its depth, we chose SCORVoc as our common data basis. The vocabulary is availableFootnote 15 including definitions and localizations for each concept.
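
A sketch of how a single sourcing activity and its captured data might be represented as triples (the scor: namespace URI, the ex: namespace, and the property names for the captured values are illustrative assumptions; only the process class name is taken from the text above):

    # Namespace URIs below are assumed for illustration only.
    @prefix scor: <http://example.org/scorvoc#> .
    @prefix ex:   <http://example.org/supply/> .

    ex:delivery42  a scor:SourceStockedProduct ;   # a delivery unloaded into stock
                   ex:deliveredOnTime   true ;
                   ex:deliveredInFull   true ;
                   ex:documentsIncluded false .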

Once partners in a supply network represent supply chain data according to the SCORVoc vocabulary, this data can be seamlessly exchanged and integrated along the supply chain. Due to the standardized information model, it is easy to integrate new suppliers into the network. The knowledge graph representation also makes it easy to integrate supply chain data with other data in the enterprise (e.g., engineering, product, or marketing data). Due to the flexibility and extensibility of the RDF data model and vocabularies, such auxiliary data can easily be added to the RDF-based data exchange in the supply network as well.

7 Conclusions

In this chapter, we have given an overview of the foundations for establishing semantic interoperability using established semantic technology standards of the World Wide Web Consortium, such as RDF, RDF-Schema, OWL, and SPARQL. These standards can be used to establish a data interoperability layer within organizations or between organizations, e.g., in supply networks. Using vocabularies and ontologies, participating departments (inside organizations) or companies (between organizations) can establish a common understanding of relevant data by defining concepts as well as their properties and relationships. Establishing such a semantic layer and a common understanding of the data is crucial for realizing the vision of data spaces. Meanwhile, a number of data spaces are emerging. In addition to enterprise data spaces, these include data spaces in the cultural domain (e.g., the Europeana data space), in the government and public administration area, and in the research domain, with the European Open Science Cloud and the National Research Data Initiatives.

A deeper and wider penetration of semantic technologies in enterprises is still required in order to fully realize the potential of digitalization and artificial intelligence. While the semantic technology standards were often developed more than a decade ago and the vision of leveraging them for enterprise data integration has long been discussed (cf., e.g., [11, 12]), mature, enterprise-grade software platforms (e.g., eccenca Corporate MemoryFootnote 16) have only started to emerge in recent years. More work needs to be done to broaden the interoperability of semantic technologies with other enterprise technology ecosystems, such as property graphs or big data lakes. Also, traditional industry standardization methodologies need to shift to more agile interoperability paradigms, following the example of Schema.org in leveraging GitHub as an agile collaboration infrastructure.