In this section we describe how TripleWave enables the publication and consumption of RDF streams on the Web, following the requirements listed in Sect. 2.
Figure 1 presents a high-level architectural view of our solution. As discussed previously, RDF stream consumers may have different requirements on how they ingest incoming data. Following [R1] and [R2], TripleWave considers two main types of data sources: (i) non-RDF live streams on the Web, and (ii) RDF datasets with time annotations. The former mainly requires converting an existing stream to RDF, while the latter focuses on streaming RDF data, provided it has timestamped data elements. When transforming input into RDF streams, TripleWave uses R2RML mappings so that the shape of the resulting stream can be customized. When streaming time-annotated RDF datasets, TripleWave also rearranges the data if necessary, so that it is structured as a sequence of timestamped RDF graphs, following the design principles of the W3C RSP Group.
As output, TripleWave produces a JSON stream in the JSON-LD format: each stream element is described by an RDF graph, and the time annotation is modeled as an annotation over the graph. Because this format is compliant with existing standards, TripleWave enables processing RDF streams not only with specialized RSP engines, but also with existing frameworks and techniques for standard RDF processing.
3.1 Running Modes
To address requirements [R1, R2, R3], TripleWave supports a flexible set of data sources: non-RDF streams from the Web, as well as timestamped RDF datasets. TripleWave also provides different usage modes for RDF stream generation, detailed below.
Converting Web streams. Existing streams can be consumed through the TripleWave JSON and CSV connectors, and extensions of these can easily be incorporated to support additional formats. Such feeds or streams (e.g. Twitter, earthquakes, live weather, Wikipedia updates) can be plugged directly into the TripleWave pipeline, which then uses R2RML mappings to construct the RDF triples that are output as part of an RDF stream. The mappings can be customized to produce RDF triples of arbitrary structure, using any ontology.
As an example of input, consider the following GeoJSON feed item from the USGS earthquake API. It contains information about the latest reported earthquakes around the world, including their magnitude, location, type, and other observed annotations.
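An abridged feature from this API looks roughly like the following sketch (the concrete values are illustrative, and several fields are omitted for brevity):

```json
{
  "type": "Feature",
  "id": "us10004u1y",
  "properties": {
    "mag": 4.7,
    "place": "141km E of Port-Olry, Vanuatu",
    "time": 1458772232240,
    "url": "http://earthquake.usgs.gov/earthquakes/eventpage/us10004u1y",
    "type": "earthquake",
    "title": "M 4.7 - 141km E of Port-Olry, Vanuatu"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [167.7676, -15.0053, 35.0]
  }
}
```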
Replaying RDF Datasets. RDF data is commonly available as archives and Linked Data endpoints, which may contain timestamp annotations and can be replayed as a stream. Examples include sensor data archives, event datasets, transportation logs, and update feeds. These datasets typically contain a time annotation within the data, with one or more triples connected to each timestamp. Replaying such a dataset means converting an otherwise static dataset into a continuous flow of RDF data, which can then be consumed by an RDF Stream Processing engine. Common use cases include evaluation, testing, and benchmarking applications, as well as simulation systems. As an example, consider the air temperature observation below, extracted from the Linked Sensor Data [14] dataset. Each observation is associated with a particular instant, represented as an XSD dateTime literal. Using TripleWave we can replay the contents of this dataset as a stream, tuning the original timestamps so that they meet any test or benchmarking requirements.
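A simplified sketch of such an observation in Turtle, in the style of the Linked Sensor Data vocabulary (identifiers and values here are illustrative, not taken from the dataset), could look as follows:

```turtle
@prefix om-owl:   <http://knoesis.wright.edu/ssw/ont/sensor-observation.owl#> .
@prefix weather:  <http://knoesis.wright.edu/ssw/ont/weather.owl#> .
@prefix time:     <http://www.w3.org/2006/time#> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .
@prefix sens-obs: <http://knoesis.wright.edu/ssw/> .

# An air temperature observation, annotated with the instant it was sampled at.
sens-obs:Observation_AirTemperature_1 a weather:TemperatureObservation ;
    om-owl:observedProperty weather:_AirTemperature ;
    om-owl:procedure        sens-obs:System_4UT01 ;
    om-owl:result [ om-owl:floatValue "37.0"^^xsd:float ] ;
    om-owl:samplingTime [
        a time:Instant ;
        # The XSD dateTime literal used by TripleWave when replaying the dataset.
        time:inXSDDateTime "2004-08-08T04:00:00-06:00"^^xsd:dateTime
    ] .
```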
Replay Loop. In certain cases, the replay of RDF datasets as streams can be set up so that the data is re-fed to the system after it has been entirely consumed. This is common in testing and benchmarking scenarios where data needs to be available until a break point, or in simulation use cases where an infinite data stream is required [15]. As in the previous scenario, the original RDF dataset is pre-processed to structure the stream as a sequence of annotated graphs, and is then continuously streamed through TripleWave as a JSON-LD RDF stream. The main difference is that the timestamps are cyclically incremented each time the dataset is replayed, giving the impression of an endless stream, as sketched below.
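The timestamp rewriting can be illustrated with the following minimal sketch (function and parameter names are ours, not TripleWave's actual implementation):

```javascript
// Cyclic timestamp rewriting during replay: each completed pass over the
// dataset shifts the timestamps by one full dataset span, so emitted time
// keeps increasing and the stream appears endless.
function rewriteTimestamp(originalMs, datasetStartMs, datasetSpanMs,
                          loopCount, replayStartMs) {
  // Offset of the element within the original dataset timeline.
  const offset = originalMs - datasetStartMs;
  return replayStartMs + loopCount * datasetSpanMs + offset;
}

// Example: an element recorded 5 s into a 60 s dataset, on the third pass
// (loopCount = 2), is emitted at replayStart + 2 * 60 s + 5 s.
```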
3.2 R2RML to Generate RDF Streams
Streams on the Web are available in a large variety of formats, so to adapt and transform them into RDF streams we use a generic transformation process specified as R2RML mappings. Although these mappings were originally conceived for relational database inputs, we use lightweight extensions that support other formats such as CSV or JSON (as in the RML extensions).
The example in Listing 1.3 specifies how earthquake stream data items can be mapped to a graph of an RDF stream. This mapping first defines a triple indicating that the generated subject is of type ex:Earthquake. The predicateObjectMap clauses add two more triples, one specifying the URL of the earthquake (e.g. the reference USGS page) and the other its description.
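A mapping in the spirit of Listing 1.3, written with the RML extensions for JSON input, could be sketched as follows (the ex: terms and source references are illustrative):

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.org/> .

<#EarthquakeMap>
  # Iterate over the features of the incoming GeoJSON stream.
  rml:logicalSource [
    rml:source               "earthquakes.json" ;
    rml:referenceFormulation ql:JSONPath ;
    rml:iterator             "$.features[*]"
  ] ;
  # Generated subject, typed as ex:Earthquake.
  rr:subjectMap [
    rr:template "http://example.org/earthquake/{id}" ;
    rr:class    ex:Earthquake
  ] ;
  # One triple for the reference USGS page, one for the description.
  rr:predicateObjectMap [
    rr:predicate ex:url ;
    rr:objectMap [ rml:reference "properties.url" ]
  ] ;
  rr:predicateObjectMap [
    rr:predicate ex:description ;
    rr:objectMap [ rml:reference "properties.title" ]
  ] .
```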
A snippet of the resulting RDF stream graph, serialized in JSON-LD, is shown in Listing 1.4. As can be observed, a stream element is contained in a timestamped graph, using the generatedAtTime property of the PROV ontology.
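A minimal sketch of such an element, with illustrative IRIs and values of our own, could look like this: a named graph carrying the mapped triples, annotated with prov:generatedAtTime.

```json
{
  "@context": {
    "prov": "http://www.w3.org/ns/prov#",
    "generatedAtTime": {
      "@id": "prov:generatedAtTime",
      "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
    }
  },
  "@id": "http://example.org/streams/earthquakes/1",
  "generatedAtTime": "2016-03-23T22:30:32Z",
  "@graph": [
    {
      "@id": "http://example.org/earthquake/us10004u1y",
      "@type": "http://example.org/Earthquake",
      "http://example.org/url": "http://earthquake.usgs.gov/earthquakes/eventpage/us10004u1y",
      "http://example.org/description": "M 4.7 - 141km E of Port-Olry, Vanuatu"
    }
  ]
}
```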
3.3 Consuming TripleWave RDF Streams
TripleWave is implemented in Node.js and produces the output RDF stream over HTTP with chunked transfer encoding by default, or alternatively through WebSockets. Consumers can register with a TripleWave endpoint and receive the data following a push paradigm. For consumers that prefer to pull the data, TripleWave also publishes it according to the Linked Data principles [5]. Given that the stream supplies data that changes very frequently, data is only temporarily available for consumption, under the assumption that recent stream elements are the most relevant. A minimal consumer for the default push channel is sketched below; we then describe both consumption modes in turn.
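The following Node.js sketch consumes the default chunked-transfer channel (the endpoint URL is hypothetical):

```javascript
const http = require('http');

// Minimal push consumer over HTTP chunked transfer encoding.
// The endpoint URL below is illustrative, not an actual TripleWave deployment.
http.get('http://triplewave.example.org/stream', (res) => {
  res.setEncoding('utf8');
  res.on('data', (chunk) => {
    // Each chunk carries (part of) a JSON-LD stream element; a real consumer
    // would buffer chunks until a complete JSON document can be parsed.
    process.stdout.write(chunk);
  });
});
```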
Publishing Stream Elements as Linked Data. TripleWave allows consuming RDF Streams following the Linked Data principles, extending the framework proposed in [4]. According to this scheme, for each RDF Stream TripleWave distinguishes between two kinds of Named Graphs: the Stream Graph (sGraph) and Instantaneous Graphs (iGraphs). Intuitively, an iGraph represents one stream element, while the sGraph contains the descriptions of the iGraphs, e.g. their timestamps.
As an example, the sGraph in Listing 1.5 describes the current content that can be retrieved from a TripleWave RDF stream. The ordered list of iGraphs is modeled as an rdf:List, with the most recent iGraph as the first element and each iGraph carrying its relative timestamp annotation. By accessing the sGraph, consumers discover which stream elements (identified by iGraphs) are available at the current time instant. The consumer can then access each iGraph by dereferencing its URL. The annotations on the sGraph use a dedicated vocabulary.
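An illustrative sGraph in this style could be sketched as follows (the tw: prefix stands in for the dedicated vocabulary and is hypothetical, as are the IRIs):

```turtle
@prefix tw:   <http://example.org/triplewave#> .  # hypothetical stand-in vocabulary
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# The sGraph lists the currently available iGraphs, most recent first.
<http://example.org/streams/earthquakes> tw:contains (
    <http://example.org/streams/earthquakes/3>
    <http://example.org/streams/earthquakes/2>
    <http://example.org/streams/earthquakes/1>
  ) .

# Each iGraph is annotated with its timestamp.
<http://example.org/streams/earthquakes/3>
    prov:generatedAtTime "2016-03-23T22:30:42Z"^^xsd:dateTime .
<http://example.org/streams/earthquakes/2>
    prov:generatedAtTime "2016-03-23T22:30:37Z"^^xsd:dateTime .
```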
RDF Stream Push. An RSP engine can consume an RDF stream from TripleWave by extending the rsp-services framework as follows (with C-SPARQL as a sample RSP): (1) the client identifies the stream by its IRI (the URL of the sGraph); (2) rsp-services registers the new stream in the C-SPARQL engine; (3) rsp-services dereferences the sGraph URL, parses it, and extracts the information regarding the TBox and the WebSocket; (4) the TBox is associated with the stream; (5) a WebSocket connection is established and the data flows into C-SPARQL; (6) the user registers a new query over the registered stream; (7) the TBox is loaded into the reasoner (if available) associated with the query; (8) the query is evaluated over the flowing data.
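From a generic client's perspective, steps (1), (3), and (5) amount to dereferencing the sGraph and opening the advertised WebSocket. A rough sketch follows; the property name used to locate the WebSocket is hypothetical, and this is not the rsp-services API:

```javascript
const WebSocket = require('ws'); // npm install ws; fetch requires Node.js 18+

async function subscribe(streamIri) {
  // (1) + (3): dereference the stream IRI (the sGraph URL) as JSON-LD and
  // read the advertised WebSocket location from it.
  const res = await fetch(streamIri, {
    headers: { Accept: 'application/ld+json' }
  });
  const sGraph = await res.json();
  const wsLocation =
    sGraph['http://example.org/triplewave#wsLocation']; // hypothetical property

  // (5): open the WebSocket; each message is one JSON-LD stream element.
  const socket = new WebSocket(wsLocation);
  socket.on('message', (data) => {
    const element = JSON.parse(data);
    // Hand 'element' to the RSP engine, e.g. C-SPARQL via rsp-services.
    console.log(element['@id']);
  });
}
```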