1 Introduction

Semantic streams represent flows of knowledge over time, in diverse domains including social networks, health monitoring, financial markets, and environmental monitoring, to name only a few. The Semantic Web community has studied the problems associated with processing and reasoning over these complex streams of data, leading to the emergence of RDF Stream Processing (RSP) and Stream Reasoning techniques.

The Web is a natural context for semantic streams, due to the quantity of dynamic data it contains, generated for example by social networks or the Web of Things [11]. RDF streams emerged as a model to realize semantic streams on the Web: they are (potentially infinite) sequences of time-annotated RDF items ordered chronologically.

While several definitions of RDF streams have been proposed in the past, these are converging, thanks to the efforts of the W3C RSP Community Group, towards a general formalization based on time-annotated RDF graphs. However, the model is not the only aspect of RDF streams that requires agreement in this community. Standard protocols and mechanisms for RDF stream exchange are still missing, which limits the adoption and spread of RSP technologies on the Web.

Existing systems such as C-SPARQL [3], CQELS [9] and EP-SPARQL [1] do not tackle this problem directly, but delegate the management of stream publication and ingestion to the developer. Other approaches have proposed to create RDF datasets fed from unstructured streams [8, 16], to lift streaming data as Linked Data [2, 10], or to provide virtual RDF streams [6]. To improve scalability, systems like Ztreamy [7] are designed for efficient transmission of compressed data streams, although they do not address the heterogeneity of data sources, declarative transformations, or different consumption modes. There is thus still a need for a generic and flexible solution for making RDF streams available on the Web. Such a solution needs to follow Semantic Web standards and best practices, and to allow different data source configurations and data access modes.

In this work, we propose TripleWave, an open-source framework for creating RDF streams and publishing them over the Web. TripleWave facilitates the dissemination and consumption of RDF streams, in a similar manner to what is already common for static RDF graphs and datasets. To do so, we first elicit a set of requirements (Sect. 2) identified from real scenarios and use cases reported in the literature and in the W3C RSP group. Then, we extend the prototype in [13] to address these requirements, which include a flexible configuration of different types of data sources and different stream generation modes, including transformation (via mappings) of Web streams and replay of existing RDF datasets and RDF sub-streams. Moreover, TripleWave offers a hybrid consumption mechanism that allows both pull-based consumption of RDF streams [2] and push-based communication through WebSockets.

2 Requirements

To elicit the requirements for TripleWave, we have taken into account a set of scenarios based on real-world use cases. In the following, we highlight the most important of these requirements, organized along four main axes: data sources, data models, data provisioning, and management of contextual data (schema and metadata).

Data Sources. The first set of requirements focuses on the data sources that TripleWave should support to ensure wide adoption and reuse of the tool.

  • [R1] TripleWave may use streams available on the Web as input. Examples of this kind of data can be found in Twitter, Wikipedia, etc.: Twitter supplies data through its streaming API and, similarly, Wikipedia publishes a change stream through an IRC-based or WebSocket API. Online streaming data is not the only kind of data that TripleWave may support. In the context of testing and benchmarking, system developers and designers need to feed engines in a reproducible and repeatable way. That is, they aim to stream previously generated data, one or multiple times, to assess the behavior of the system.

  • [R2] TripleWave shall be able to process existing time-aware (RDF) datasets, which may or may not already be formatted as streams.

  • [R3] TripleWave shall provide mechanisms to convert such input into RDF streams when it is not already formatted as a stream. A typical usage scenario where these features are needed is testing: sharing the test data alone may not be enough, as the time dimension plays a key role and streaming the same data in different ways may influence the behavior of the engines. An open, reusable tool to stream data is therefore needed to enable a fair and reproducible execution of tests.

Data Models. Although recommendations on data models and serialization formats for RDF streams are still under specification, it is important to identify and reuse formats that adhere as much as possible to existing recommendations and standards.

  • [R4] TripleWave should adopt a data format compatible with RDF, since RDF streams are heavily based on the RDF building blocks. In this way, both the data and the tool itself can be reused in a wider set of scenarios.

Data Provisioning. This category of requirements describes the different ways of consuming RDF streams by RSP client applications.

  • [R5] TripleWave shall be capable of pro-actively supplying streaming data to processing engines. Indeed, stream processing applications are usually designed to be fed with streaming data. For instance, the SLD framework [2] receives and analyzes real-time data from social networks or sensor networks, while Star-City [12] is fed with Dublin public transportation data to compute urban analyses.

  • [R6] TripleWave shall offer the data according to existing W3C recommendations. In particular, RDF streams should be accessible not only to stream processing and reasoning engines, but also to other applications based on Semantic Web technologies (e.g. SPARQL and Linked Data).

Requirements [R4] and [R6] ensure compatibility with the tools and frameworks that are already available to process RDF data.

Contextual Data Management. In the stream processing context, continuous execution models are often adopted.

  • [R7] TripleWave shall be able to publish the schema and metadata about the stream independently from the actual transmission of the stream itself. Indeed, in the case of continuous query evaluation, the steps of query registration, schema provisioning, and metadata provisioning can be performed separately from the streaming itself.

3 The TripleWave Approach

In this section we describe how TripleWave enables the publication and consumption of RDF streams on the Web, following the requirements listed in Sect. 2.

Figure 1 represents a high-level architectural view of our solution. As discussed above, RDF stream consumers may have different requirements on how to ingest the incoming data. According to [R1] and [R2], in TripleWave we consider two main types of data sources: (i) non-RDF live streams on the Web, and (ii) RDF datasets with time annotations. While the former mainly requires a conversion of existing streams to RDF, the latter focuses on streaming RDF data, provided that it has timestamped data elements. When performing the transformation to RDF streams, TripleWave uses R2RML mappings to let users customize the shape of the resulting stream. In the case of streaming time-annotated RDF datasets, TripleWave also re-arranges the data if necessary, so that it is structured as a sequence of timestamped RDF graphs, following the W3C RSP Group design principles.

As output, TripleWave produces a JSON stream in the JSON-LD format: each stream element is described by an RDF graph, and the time annotation is modeled as an annotation over that graph. By using this standards-compliant format, TripleWave enables processing RDF streams not only with specialized RSP engines, but also with existing frameworks and techniques for standard RDF processing.

Fig. 1. The architecture of TripleWave: generating RDF streams from non-RDF data sources and time-annotated RDF datasets. R2RML mappings allow customizing the transformation of non-RDF streams. The resulting RDF stream can be pushed to or pulled by clients as a JSON-LD dataset.

3.1 Running Modes

To address requirements [R1, R2, R3], TripleWave supports a flexible set of data sources, namely non-RDF streams from the Web and timestamped RDF datasets. It also provides different modes of RDF stream generation, detailed below.

Converting Web streams. Existing streams can be consumed through the TripleWave JSON and CSV connectors, and extensions of these can easily be incorporated to support additional formats. Such feeds or streams (e.g. Twitter, earthquakes, live weather, Wikipedia updates, etc.) can be plugged directly into the TripleWave pipeline, which then uses R2RML mappings to construct the RDF triples that are output as part of an RDF stream. The mappings can be customized to produce RDF triples of arbitrary structure, using any ontology.


As an example of input, consider the following GeoJSON feed item from the USGS earthquake API, which reports the latest earthquakes observed around the world, including their magnitude, location, type, and other observed annotations.
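
The original listing is not reproduced here; the snippet below is an illustrative sketch of such an item, following the general structure of the USGS GeoJSON feed (all identifiers and values are invented for illustration).

```json
{
  "type": "Feature",
  "id": "example001",
  "properties": {
    "mag": 4.7,
    "place": "42 km SSW of Example Town",
    "time": 1457086445510,
    "url": "https://earthquake.usgs.gov/earthquakes/eventpage/example001",
    "type": "earthquake",
    "title": "M 4.7 - 42 km SSW of Example Town"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [-122.33, 37.57, 8.2]
  }
}
```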

Replaying RDF Datasets. RDF data is commonly available as archives and Linked Data endpoints, which may contain timestamp annotations and can therefore be replayed as a stream. Examples include sensor data archives, event datasets, transportation logs, update feeds, etc. These datasets typically contain a time annotation within the data, with one or more other triples connected to that timestamp. Replaying such a dataset means converting an otherwise static dataset into a continuous flow of RDF data, which can then be used by an RDF Stream Processing engine. Common use cases include evaluation, testing, and benchmarking applications, as well as simulation systems. As an example, consider an air temperature observation extracted from the Linked Sensor Data [14] dataset: each observation is associated with a particular instant, represented as an XSD dateTime literal. Using TripleWave we can replay the contents of this dataset as a stream, and the original timestamps can be tuned to meet specific test or benchmarking requirements.
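
The original listing is likewise not reproduced; the following hedged sketch approximates such an observation in Turtle, in the style of the Knoesis sensor-observation vocabulary underlying Linked Sensor Data (the prefixes, property names, and observation URI are best-effort assumptions, not the actual dataset content).

```turtle
@prefix om-owl:  <http://knoesis.wright.edu/ssw/ont/sensor-observation.owl#> .
@prefix weather: <http://knoesis.wright.edu/ssw/ont/weather.owl#> .
@prefix time:    <http://www.w3.org/2006/time#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/observations/> .

# Hypothetical observation URI; identifiers in the actual dataset differ.
ex:AirTemperatureObservation_1
    a                       weather:TemperatureObservation ;
    om-owl:observedProperty weather:_AirTemperature ;
    om-owl:result           [ om-owl:floatValue "26.6"^^xsd:float ] ;
    # The time annotation used for replay: an instant with an XSD dateTime literal.
    om-owl:samplingTime     [ a time:Instant ;
                              time:inXSDDateTime "2004-08-10T01:00:00"^^xsd:dateTime ] .
```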


Replay Loop. In certain cases, the replay of RDF datasets as streams can be set up so that the data is re-fed to the system after it has been entirely consumed. This is common in testing and benchmarking scenarios where data needs to be available continuously until a break point, as well as in simulation use cases where an infinite data stream is required [15]. As in the previous scenario, the original RDF dataset is pre-processed so that the stream is structured as a sequence of annotated graphs, and it is then continuously streamed through TripleWave as a JSON-LD RDF stream. The main difference is that the timestamps are cyclically incremented each time the dataset is replayed, giving the impression of an endless stream.
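
To make the cyclic timestamp increment concrete, the following is a minimal sketch of the idea in JavaScript (not TripleWave's actual implementation; all function and field names are invented).

```javascript
// Illustrative sketch: replay a timestamped dataset endlessly, shifting
// timestamps by one full dataset span on every cycle.
function replayLoop(elements, emit) {
  const first = elements[0].timestamp;
  const last = elements[elements.length - 1].timestamp;
  const span = last - first; // duration covered by the original dataset (ms)

  let cycle = 0;
  function playCycle() {
    for (const el of elements) {
      const offset = el.timestamp - first;          // position within the cycle
      const shifted = el.timestamp + cycle * span;  // cyclically incremented timestamp
      setTimeout(() => emit({ ...el, timestamp: shifted }), offset);
    }
    cycle += 1;
    setTimeout(playCycle, span);                    // schedule the next cycle
  }
  playCycle();
}
```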

3.2 R2RML to Generate RDF Streams

Streams on the Web are available in a large variety of formats, so in order to adapt and transform them into RDF streams we use a generic transformation process specified as R2RML mappings. Although these mappings were originally conceived for relational database inputs, we use light extensions that support other formats such as CSV or JSON (as in the RML extensions).

The example in Listing 1.3 specifies how earthquake stream data items can be mapped to a graph of an RDF stream. The mapping first defines a triple indicating that the generated subject is of type ex:Earthquake. The predicateObjectMap clauses add two more triples, one specifying the URL of the earthquake (e.g. its reference USGS page) and the other its description.
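
Listing 1.3 itself is not reproduced here; the sketch below illustrates what such a mapping can look like, combining R2RML terms with an RML-style logical source (the ex: properties, the source configuration, and the JSONPath references are assumptions for illustration).

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ex:  <http://example.org/earthquake#> .

<#EarthquakeMap>
    # RML-style logical source: JSON items from the USGS feed (assumed configuration).
    rml:logicalSource [ rml:source "usgs-feed" ;
                        rml:referenceFormulation <http://semweb.mmlab.be/ns/ql#JSONPath> ] ;

    # One earthquake resource per stream item, typed as ex:Earthquake.
    rr:subjectMap [ rr:template "http://example.org/earthquake/{id}" ;
                    rr:class ex:Earthquake ] ;

    # Two more triples: the reference USGS page and the textual description.
    rr:predicateObjectMap [ rr:predicate ex:url ;
                            rr:objectMap [ rml:reference "properties.url" ] ] ;
    rr:predicateObjectMap [ rr:predicate ex:description ;
                            rr:objectMap [ rml:reference "properties.title" ] ] .
```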


A snippet of the resulting RDF stream graph, serialized in JSON-LD, is shown in Listing 1.4. As can be observed, a stream element is contained in a timestamped graph, using the generatedAtTime property of the PROV ontology.
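
Listing 1.4 is likewise not reproduced; the following hedged sketch shows how such a timestamped graph can be expressed in JSON-LD (the graph IRI, context, and all properties other than prov:generatedAtTime are invented for illustration).

```json
{
  "@context": {
    "prov": "http://www.w3.org/ns/prov#",
    "ex": "http://example.org/earthquake#"
  },
  "@id": "http://example.org/streams/earthquakes/graphs/1",
  "prov:generatedAtTime": {
    "@value": "2016-03-04T11:34:05Z",
    "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
  },
  "@graph": [
    {
      "@id": "http://example.org/earthquake/example001",
      "@type": "ex:Earthquake",
      "ex:url": "https://earthquake.usgs.gov/earthquakes/eventpage/example001",
      "ex:description": "M 4.7 - 42 km SSW of Example Town"
    }
  ]
}
```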


3.3 Consuming TripleWave RDF Streams

TripleWave is implemented in Node.js and produces the output RDF stream over HTTP with chunked transfer encoding by default, or alternatively through WebSockets. Consumers can register to a TripleWave endpoint and receive the data following a push paradigm. For consumers that prefer to pull the data, TripleWave can also publish it according to the Linked Data principles [5]. Given that the stream supplies data that changes very frequently, data is only temporarily available for consumption, under the assumption that recent stream elements are more relevant. We describe both cases below.

Publishing Stream Elements as Linked Data. TripleWave allows consuming RDF Streams following the Linked Data principles, extending the framework proposed in [4]. According to this scheme, for each RDF Stream TripleWave distinguishes between two kinds of Named Graphs: the Stream Graph (sGraph) and Instantaneous Graphs (iGraphs). Intuitively, an iGraph represents one stream element, while the sGraph contains the descriptions of the iGraphs, e.g. their timestamps.

As an example, the sGraph in Listing 1.5 describes the content that can currently be retrieved from a TripleWave RDF stream. The ordered list of iGraphs is modeled as an rdf:list, with the most recent iGraph as the first element and with each iGraph carrying its relative timestamp annotation. By accessing the sGraph, consumers discover which stream elements (identified by iGraphs) are available at the current time instant. The consumer can then access the iGraphs by dereferencing their URLs. The annotations on the sGraph use a dedicated vocabulary.
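
Listing 1.5 is not reproduced here; the hedged Turtle sketch below conveys the idea, using an invented stand-in prefix (sg:) for the dedicated vocabulary and absolute prov:generatedAtTime values in place of the relative timestamp annotations actually used.

```turtle
@prefix sg:   <http://example.org/vocab/stream#> .  # hypothetical stand-in vocabulary
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@base <http://example.org/streams/earthquakes/> .

# The sGraph lists the currently available iGraphs as an rdf:list, most recent first.
<sGraph> sg:streamElements ( <graphs/3> <graphs/2> <graphs/1> ) .

# Each iGraph carries its timestamp annotation.
<graphs/3> prov:generatedAtTime "2016-03-04T11:34:07Z"^^xsd:dateTime .
<graphs/2> prov:generatedAtTime "2016-03-04T11:34:06Z"^^xsd:dateTime .
<graphs/1> prov:generatedAtTime "2016-03-04T11:34:05Z"^^xsd:dateTime .
```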


RDF Stream Push. An RSP engine can consume an RDF stream from TripleWave by extending the rsp-services framework as follows (using C-SPARQL as a sample RSP engine):

  1. The client identifies the stream by its IRI (which is the URL of the sGraph).
  2. rsp-services registers the new stream in the C-SPARQL engine.
  3. rsp-services retrieves the sGraph from its URL, parses it, and obtains the information about the TBox and the WebSocket endpoint.
  4. The TBox is associated with the stream.
  5. A WebSocket connection is established and the data flows into C-SPARQL.
  6. The user registers a new query over the registered stream.
  7. The TBox is loaded into the reasoner (if available) associated with the query.
  8. The query is evaluated over the flowing data.
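
Beyond rsp-services, any WebSocket-capable client can subscribe to the push stream directly. The following is a minimal sketch (assuming the `ws` Node.js package and a hypothetical TripleWave endpoint URL; the actual address and payload shape depend on the deployment).

```javascript
// Minimal push-consumer sketch; the endpoint URL below is hypothetical.
const WebSocket = require('ws');

const socket = new WebSocket('ws://example.org:8080/streams/earthquakes');

socket.on('open', () => console.log('Subscribed to the RDF stream'));

socket.on('message', (data) => {
  // Each message carries one stream element: a JSON-LD document with a
  // timestamped named graph (see the sketch of Listing 1.4 above).
  const element = JSON.parse(data);
  console.log('Received graph', element['@id']);
});
```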

4 Conclusion

In this work we have described TripleWave, an open-source framework for publishing and sharing RDF streams on the Web. This work fills an important gap in RDF stream processing, as it provides flexible mechanisms for plugging in diverse Web data sources and for consuming streams in both push and pull mode. TripleWave covers a set of crucial requirements for the stream reasoning community and the Semantic Web community at large, including reusing available streams on the Web [R1] as well as time-annotated RDF datasets [R2], which are transformed to follow a homogeneous RDF stream structure [R3]. TripleWave adopts a stream format compatible with Semantic Web standards, including RDF for data modeling and the Linked Data principles for publishing [R4, R6]. The tool also provides pull- and push-based data access to client applications and RSP engines [R5], as well as contextual information about the stream [R7].

The inherent flexibility of TripleWave makes it suitable for reuse in a wide range of streaming data applications, and it has the potential to enable the integration of RSP query engines, stream reasoners, RDF stream filters, semantic complex event processors, benchmark platforms, and stored RDF data sources. This versatility, combined with a standards-driven design aligned with the requirements and design principles discussed in the W3C RSP Group, can help spread the adoption of RDF in streaming data scenarios and applications.

Show Cases. We developed two show cases to illustrate the capabilities of TripleWave. In the first, we set up TripleWave for converting Web streams and configured it to transform the stream of Wikipedia change events; for this, we developed both the component that listens to the Wikipedia endpoint and the corresponding R2RML mapping.

In the second, we started another instance of TripleWave and configured it to endlessly replay the Linked Sensor Data [14] dataset as a stream. For this scenario we also set up an instance of the C-SPARQL engine to consume the data produced by TripleWave. Links to both show cases are available on the project website.

Availability. TripleWave is available under the Apache 2.0 license, its code is accessible on GitHub, and it is accompanied by user and developer guides. It is maintained and supported by the Stream Reasoning initiative.