Abstract
Processing data streams is increasingly gaining momentum, given the need to process these flows of information in real-time and at Web scale. In this context, RDF Stream Processing (RSP) and Stream Reasoning (SR) have emerged as solutions to combine semantic technologies with stream and event processing techniques. Research in these areas has proposed an ecosystem of solutions to query, reason and perform real-time processing over heterogeneous and distributed data streams on the Web. However, so far one basic building block has been missing: a mechanism to disseminate and exchange RDF streams on the Web. In this work we close this gap, proposing TripleWave, a reusable and generic tool that enables the publication of RDF streams on the Web. The features of TripleWave were selected based on requirements of real use-cases, and support a diverse set of scenarios, independent of any specific RSP implementation. TripleWave can be fed with existing Web streams (e.g. Twitter and Wikipedia streams) or time-annotated RDF datasets (e.g. the Linked Sensor Data dataset). It can be invoked through both pull- and push-based mechanisms, thus enabling RSP engines to automatically register and receive data from TripleWave.
1 Introduction
Semantic streams represent flows of knowledge over time, in diverse domains including social networks, health monitoring, financial markets or environmental monitoring, to name only a few. The Semantic Web community has studied the problems associated with the processing of and reasoning over these complex streams of data, leading to the emergence of RDF Stream Processing and Stream Reasoning techniques.
The Web is a natural context for semantic streams, due to the quantity of dynamic data it contains, generated for example by social networks or the Web of Things [11]. RDF streams emerged as a model to realize semantic streams on the Web: they are (potentially infinite) sequences of time-annotated RDF items ordered chronologically.
While several definitions of RDF streams have been proposed in the past, thanks to the efforts of the W3C RSP Community Group (Note 1) these are converging towards a general formalization based on time-annotated RDF graphs. However, the model is not the only aspect of RDF streams that needs agreement in this community. Standard protocols and mechanisms for RDF stream exchange are currently missing, which limits the adoption and spread of RSP technologies on the Web.
Existing systems such as C-SPARQL [3], CQELS [9] and EP-SPARQL [1] do not tackle this problem directly, but delegate the task of managing the stream publication and ingestion to the developer. Other approaches have proposed to create RDF datasets fed from unstructured streams [8, 16], to lift streaming data as Linked Data [2, 10], or to provide virtual RDF Streams [6]. To improve scalability, systems like Ztreamy [7] are designed for efficient transmission of compressed data streams, although they do not address the heterogeneity of data sources, declarative transformation and consumption modes. So far, there is still a need for a generic and flexible solution for making RDF streams available on the Web. Such a solution needs to follow Semantic Web standards and best practices, and to allow different data source configurations and data access modes.
In this work, we propose TripleWave (Note 2), an open-source framework for creating RDF streams and publishing them over the Web. TripleWave facilitates the dissemination and consumption of RDF streams, in a manner similar to what is already common for static RDF graphs and datasets. In order to do so, we first elicit a set of requirements (Sect. 2) identified from real scenarios and use cases reported in the literature and the W3C RSP group. Then, we extend the prototype in [13] to address the new requirements. These include a flexible configuration of different types of data sources, and different stream generation modes, including transformation (via mappings) of Web streams and replay of existing RDF datasets and RDF sub-streams. Moreover, TripleWave offers a hybrid consumption mechanism that allows both pull-based consumption of RDF streams [2], and push communication through WebSockets.
2 Requirements
To elicit the requirements for TripleWave, we have taken into account a set of scenarios based on real-world use cases (Note 3). In the following, we highlight the most important of these requirements, organized along four main axes: data sources, data models, data provisioning, and management of contextual data (schema and metadata).
Data Sources. The first set of requirements focuses on the data sources that TripleWave should support to ensure wide adoption and reuse of the tool.
- [R1] TripleWave may use streams available on the Web as input. Examples of this kind of data may be found in Twitter, Wikipedia, etc. Twitter supplies data through its streaming API (Note 4); similarly, Wikipedia publishes a change stream through an IRC-based or WebSocket API (Note 5).
Online streaming data is not the only kind of data that TripleWave may support. In the context of testing and benchmarking, system developers and designers have an intrinsic need to feed the engines in a reproducible and repeatable way. That is, they aim at streaming previously generated data, one or multiple times, to assess the behavior of the system.
- [R2] TripleWave shall be able to process existing time-aware (RDF) datasets, which may or may not be formatted as streams.
- [R3] TripleWave shall provide format conversion mechanisms towards RDF streams, in case the input is not formatted as a stream.
A typical usage scenario where these features are needed is testing. In this context, sharing the test data may not be enough, as the time dimension plays a key role and streaming the same data in different ways may influence the behavior of the engines. An open, reusable tool to stream data is therefore needed to enable a fair and reproducible execution of the tests.
Data Models. Although recommendations on data models and serialization formats for RDF streams are still under specification (Note 6), it is important to identify and reuse formats that adhere as much as possible to existing recommendations and standards.
- [R4] TripleWave should adopt a data format compatible with RDF, since RDF streams are heavily based on the RDF building blocks. In this way, both the data and the tool itself can be reused in a wider set of scenarios.
Data Provisioning. This category of requirements describes the different ways of consuming RDF streams by RSP client applications.
- [R5] TripleWave shall be capable of pro-actively supplying streaming data to processing engines. Indeed, stream processing applications are usually designed to be fed with streaming data. For instance, the SLD framework [2] receives and analyzes real-time data from social networks or sensor networks, while Star-City [12] is fed with Dublin public transportation data to compute urban analyses.
- [R6] TripleWave shall offer the data according to existing W3C recommendations. In particular, RDF streams should be accessible not only for stream processing and reasoning engines, but also for other applications based on Semantic Web technologies (e.g. SPARQL and Linked Data).
Requirements [R4] and [R6] ensure the compatibility with tools and frameworks already developed and available to process RDF data.
Contextual Data Management. In the stream processing context, continuous execution models are often adopted.
- [R7] TripleWave shall be able to publish the schema and metadata about the stream independently from the actual transmission of the stream itself. Indeed, in the case of continuous query evaluation, the steps of query registration, schema provisioning, and metadata provisioning can be performed separately from the streaming itself.
3 The TripleWave Approach
In this section we describe how TripleWave enables the publication and consumption of RDF streams on the Web, following the requirements listed in Sect. 2.
Figure 1 represents a high-level architectural view of our solution. As we saw previously, RDF stream consumers may have different requirements on how to ingest the incoming data. According to [R1] and [R2], in TripleWave we consider two main types of data sources: (i) Non-RDF live streams on the Web, and (ii) RDF datasets with time-annotations. While the former mainly requires a conversion of existing streams to RDF, the latter is focused on streaming RDF data, provided that it has timestamped data elements. When performing the transformation to RDF streams, TripleWave makes use of R2RML mappings in order to allow customizing the shape of the resulting stream. In the case of streaming time-annotated RDF datasets, TripleWave also re-arranges the data if necessary, so that it is structured as a sequence of timestamped RDF graphs, following the W3C RSP Group design principles.
As output, TripleWave produces a JSON stream in the JSON-LD format: each stream element is described by an RDF graph and the time annotation is modeled as an annotation over the graph (Note 7). By using this format, which is compliant with existing standards, TripleWave enables processing RDF streams not only through specialized RSP engines, but also with existing frameworks and techniques for standard RDF processing.
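To make the shape of a stream element concrete, the following is a minimal sketch of a timestamped named graph in JSON-LD, built with plain Python dictionaries. It assumes the layout described in Note 7, where the time annotation sits next to `@graph` in the default graph. The function name `make_stream_element`, the graph IRI, and the node contents are hypothetical and only serve as illustration; they are not taken from the TripleWave code base.

```python
import json
from datetime import datetime, timezone

PROV = "http://www.w3.org/ns/prov#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def make_stream_element(graph_iri, timestamp, nodes):
    """Wrap JSON-LD node objects in a named graph annotated with a timestamp."""
    return {
        "@context": {"prov": PROV, "xsd": XSD},
        "@id": graph_iri,  # the name of the graph
        # The annotation is a statement about the graph itself,
        # so it lives outside "@graph" (in the default graph).
        "prov:generatedAtTime": {
            "@value": timestamp.isoformat(),
            "@type": "xsd:dateTime",
        },
        "@graph": nodes,  # the content of the stream element
    }


# Hypothetical IRIs, used for illustration only.
element = make_stream_element(
    "http://example.org/stream/graph/1",
    datetime(2016, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    [{"@id": "http://example.org/quake/1",
      "@type": "http://example.org/Earthquake"}],
)
print(json.dumps(element, indent=2))
```

Because the element is ordinary JSON-LD, any JSON-LD processor should be able to read it as an RDF dataset with one named graph plus one annotation triple.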
3.1 Running Modes
In order to address requirements [R1, R2, R3], TripleWave supports a flexible set of data sources, namely non-RDF streams from the Web, as well as timestamped RDF datasets. TripleWave also provides different use modes for RDF stream generation, detailed below.
Converting Web streams. Existing streams can be consumed through the TripleWave JSON and CSV connectors. Extensions of these can be easily incorporated in order to support additional formats. These feeds or streams (e.g. Twitter, earthquakes, live weather, Wikipedia updates, etc.) can be directly plugged to the TripleWave pipeline, which then uses R2RML mappings in order to construct RDF triples that will be output as part of an RDF stream. The mappings can be customized to produce RDF triples of arbitrary structure, and using any ontology.
As an example of input, consider a GeoJSON feed item from the USGS earthquake API (Note 8). The feed contains information about the last reported earthquakes around the world, including the magnitude, location, type and other observed annotations.
Replaying RDF Datasets. RDF data is commonly available as archives and Linked Data endpoints, which may contain timestamp annotations and can be replayed as a stream. Examples of these include sensor data archives, event datasets, transportation logs, update feeds, etc. These datasets typically contain a time annotation within the data triples, and one or more other triples are connected to this timestamp. Replaying such datasets means converting an otherwise static dataset into a continuous flow of RDF data, which can then be used by an RDF Stream Processing engine. Common use cases include evaluation, testing, and benchmarking applications, as well as simulation systems. As an example, consider an air temperature observation extracted from the Linked Sensor Data [14] dataset. Each observation is associated with a particular instant, represented as an XSD dateTime literal. Using TripleWave we can replay the contents of this dataset as a stream, and the original timestamps can be tuned so that they meet any test or benchmarking requirements.
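The re-arrangement of a static dataset into timestamped graphs can be sketched as follows. This is a simplified illustration, not TripleWave's implementation: it assumes that all triples connected to a timestamp share the same subject, represents triples as plain tuples of strings, and uses made-up IRIs. The helper `replay_delays` illustrates one possible way of tuning the timing, by dividing the original gaps by a speed-up factor.

```python
from collections import defaultdict
from datetime import datetime


def to_timestamped_graphs(triples, time_predicate):
    """Group triples by subject and order the groups by their timestamp."""
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((s, p, o))

    graphs = []
    for group in by_subject.values():
        stamps = [o for (_, p, o) in group if p == time_predicate]
        if not stamps:
            continue  # no time annotation: cannot be placed in the stream
        graphs.append((datetime.fromisoformat(stamps[0]), group))

    graphs.sort(key=lambda g: g[0])  # chronological order
    return graphs


def replay_delays(graphs, speedup=1.0):
    """Seconds to wait between consecutive graphs, scaled by `speedup`."""
    return [
        (later[0] - earlier[0]).total_seconds() / speedup
        for earlier, later in zip(graphs, graphs[1:])
    ]


# Hypothetical observations; IRIs and values are illustrative only.
TIME = "http://example.org/hasTime"
VALUE = "http://example.org/hasValue"
triples = [
    ("http://example.org/obs/2", TIME, "2004-08-08T16:10:00"),
    ("http://example.org/obs/2", VALUE, "37.4"),
    ("http://example.org/obs/1", TIME, "2004-08-08T16:05:00"),
    ("http://example.org/obs/1", VALUE, "37.0"),
]
graphs = to_timestamped_graphs(triples, TIME)
delays = replay_delays(graphs, speedup=60.0)
```

With a speed-up of 60, the five-minute gap between the two observations is replayed as a five-second gap.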
Replay Loop. In certain cases, the replay of RDF datasets as streams can be set up in a way that the data is re-fed to the system after it has been entirely consumed. This is common in testing and benchmarking scenarios where data needs to be endlessly available until a break point, or in simulation use-cases where an infinite data stream is required [15]. Similar to the previous scenario, the original RDF dataset is pre-processed in order to structure the stream as a sequence of annotated graphs, and then it is continuously streamed through TripleWave as a JSON-LD RDF stream. The main difference is that the timestamps are cyclically incremented, when the dataset is replayed, so that they provide the impression of an endless stream.
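The cyclic increment of timestamps can be illustrated with a small generator. This is a sketch under stated assumptions: the input is a chronologically ordered list of `(timestamp, graph)` pairs, and each new cycle is shifted by the time span of the dataset plus a configurable `gap`, so that timestamps keep increasing. The names `replay_loop` and `gap` are hypothetical.

```python
from datetime import datetime, timedelta
from itertools import islice


def replay_loop(graphs, gap=timedelta(seconds=1)):
    """Yield (timestamp, graph) pairs forever, shifting each new cycle."""
    if not graphs:
        return
    # Shift per cycle: dataset duration plus a gap between two cycles.
    span = (graphs[-1][0] - graphs[0][0]) + gap
    cycle = 0
    while True:
        offset = span * cycle
        for timestamp, graph in graphs:
            yield timestamp + offset, graph
        cycle += 1


start = datetime(2004, 8, 8, 16, 0, 0)
dataset = [(start, "g1"), (start + timedelta(seconds=10), "g2")]

# Take the first five elements of the otherwise endless stream.
first_five = list(islice(replay_loop(dataset), 5))
for timestamp, graph in first_five:
    print(timestamp.isoformat(), graph)
```

The same two graphs repeat, but their timestamps grow monotonically, which is what gives consumers the impression of an endless stream.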
3.2 R2RML to Generate RDF Streams
Streams on the Web are available in a large variety of formats, so in order to adapt and transform them into RDF streams we use a generic transformation process that is specified as R2RML (Note 9) mappings. Although these mappings were originally conceived for relational database inputs, we use light extensions that support other formats such as CSV or JSON (as in RML extensions, Note 10).
The example in Listing 1.3 specifies how earthquake stream data items can be mapped to a graph of an RDF stream (Note 11). This mapping first defines a triple that indicates that the generated subject is of type ex:Earthquake. The predicateObjectMap clauses add two more triples, one specifying the URL of the earthquake (e.g. the reference USGS page) and the other its description.
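The idea behind such a mapping can be approximated in a few lines of Python. The sketch below is not R2RML and not the mapping of Listing 1.3: it is a dictionary-based simplification with a subject template, a class, and two predicate-object pairs that reference fields of the input item. The input item is invented for illustration (its field names are modeled on GeoJSON features, its values are made up), and the choice of `properties.title` as the description is an assumption.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified, hypothetical stand-in for a mapping document.
MAPPING = {
    "subject_template": "http://example.org/earthquake/{id}",
    "class": "http://example.org/Earthquake",
    "predicate_object_maps": [
        ("http://schema.org/url", "properties.url"),
        ("http://schema.org/description", "properties.title"),
    ],
}


def resolve(item, path):
    """Follow a dotted path such as 'properties.url' inside nested dicts."""
    value = item
    for key in path.split("."):
        value = value[key]
    return value


def apply_mapping(item, mapping):
    """Produce (subject, predicate, object) triples for one input item."""
    subject = mapping["subject_template"].format(id=resolve(item, "id"))
    triples = [(subject, RDF_TYPE, mapping["class"])]
    for predicate, reference in mapping["predicate_object_maps"]:
        triples.append((subject, predicate, resolve(item, reference)))
    return triples


# Invented feed item; values are illustrative only.
item = {
    "type": "Feature",
    "id": "ex0001",
    "properties": {
        "mag": 4.5,
        "url": "http://example.org/quakes/ex0001",
        "title": "M 4.5 - Example Region",
    },
}
triples = apply_mapping(item, MAPPING)
for triple in triples:
    print(triple)
```

A declarative mapping keeps the transformation separate from the connector code, so the shape of the output graph can be changed without touching the component that reads the feed.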
A snippet of the resulting RDF Stream graph, serialized in JSON-LD, is shown in Listing 1.4. As can be observed, a stream element is contained in a timestamped graph, using the generatedAtTime property of the PROV ontology (Note 12).
3.3 Consuming TripleWave RDF Streams
TripleWave is implemented in Node.js and produces the output RDF stream using HTTP with chunked transfer encoding by default, or alternatively through WebSockets. Consumers can register to a TripleWave endpoint and receive the data following a push paradigm. In cases where consumers may want to pull the data, TripleWave allows publishing the data according to the Linked Data principles [5]. Given that the stream supplies data that changes very frequently, data is only temporarily available for consumption, assuming that recent stream elements are more relevant. We describe both cases below.
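On the consumer side, a client reading a chunked HTTP response receives the stream in pieces that do not necessarily align with stream elements. The sketch below shows one way to reassemble JSON objects from arbitrary chunks using only the Python standard library. The text does not specify how elements are framed on the wire, so this assumes that the response is a sequence of JSON objects separated by optional whitespace; `iter_json_objects` is a hypothetical helper, and the chunks here are hard-coded strings standing in for decoded network data.

```python
import json


def iter_json_objects(chunks):
    """Yield complete JSON values from an iterable of text chunks."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # incomplete object: wait for the next chunk
            yield obj
            buffer = buffer[end:]


# Simulated chunks; boundaries deliberately split an object in two.
chunks = ['{"@id": "g1"}{"@id"', ': "g2"}\n{"@id": "g3"}']
received = list(iter_json_objects(chunks))
print(received)
```

The same function can be fed from any source that yields text, for instance a loop that reads and decodes blocks from an open HTTP response.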
Publishing Stream Elements as Linked Data. TripleWave allows consuming RDF Streams following the Linked Data principles, extending the framework proposed in [4]. According to this scheme, for each RDF Stream TripleWave distinguishes between two kinds of Named Graphs: the Stream Graph (sGraph) and Instantaneous Graphs (iGraphs). Intuitively, an iGraph represents one stream element, while the sGraph contains the descriptions of the iGraphs, e.g. their timestamps.
As an example, the sGraph in Listing 1.5 describes the current content that can be retrieved from a TripleWave RDF stream. The ordered list of iGraphs is modeled as an rdf:list with the most recent iGraph as the first element, and with each iGraph having its relative timestamp annotation. By accessing the sGraph, consumers discover which stream elements (identified by iGraphs) are available at the current time instant. Next, the consumer can access an iGraph by dereferencing its URL. The annotations on the sGraph use a dedicated vocabulary (Note 13).
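The relation between the sGraph and the iGraphs can be pictured as a bounded window over the most recent stream elements. The following sketch models that idea in memory; it does not use the dedicated vocabulary mentioned above, and the class `StreamWindow`, its methods, the dictionary keys, and the URLs are all hypothetical.

```python
from collections import deque


class StreamWindow:
    """Keep the most recent iGraphs and describe them in an sGraph-like list."""

    def __init__(self, base_url, capacity=3):
        self.base_url = base_url
        # A full deque drops its oldest (rightmost) entry on appendleft.
        self.recent = deque(maxlen=capacity)
        self.counter = 0

    def publish(self, timestamp, graph):
        """Register a new stream element and return its iGraph URL."""
        self.counter += 1
        url = f"{self.base_url}/{self.counter}"
        self.recent.appendleft((url, timestamp, graph))
        return url

    def sgraph(self):
        """List the available iGraphs, most recent first."""
        return [{"igraph": url, "timestamp": ts} for url, ts, _ in self.recent]

    def igraph(self, url):
        """Return the content of an iGraph, or None if it has expired."""
        for known, _, graph in self.recent:
            if known == url:
                return graph
        return None


window = StreamWindow("http://example.org/stream", capacity=2)
first = window.publish("2016-01-01T12:00:00", ["triples of element 1"])
second = window.publish("2016-01-01T12:00:05", ["triples of element 2"])
third = window.publish("2016-01-01T12:00:10", ["triples of element 3"])
print(window.sgraph())
```

A pull-based consumer would first read the list returned by `sgraph()` and then fetch each listed URL; an element that has already left the window is no longer retrievable, which mirrors the temporary availability of stream data.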
RDF Stream Push. An RSP engine can consume an RDF stream from TripleWave, extending the rsp-services framework (Note 14) as follows (with C-SPARQL as a sample RSP): (1) The client identifies the stream by its IRI (which is the URL of the sGraph). (2) rsp-services registers the new stream in the C-SPARQL engine. (3) rsp-services looks at the sGraph URL, parses it and gets the information regarding the TBox and WebSocket. (4) The TBox is associated to the stream. (5) A WebSocket connection is established and the data flows into C-SPARQL. (6) The user registers a new query for the registered stream. (7) The TBox is loaded into the reasoner (if available) associated to the query. (8) The query is performed on the flowing data.
4 Conclusion
In this work we have described TripleWave, an open-source framework for publishing and sharing RDF streams on the Web. This work fills an important gap in RDF stream processing as it provides flexible mechanisms for plugging in diverse Web data sources, and for consuming streams in both push and pull mode. TripleWave covers a set of crucial requirements for the stream reasoning community and the semantic Web community at large, including: reusing available streams on the Web [R1], as well as time annotated RDF datasets [R2], which are transformed to follow a homogenized RDF stream structure [R3]. TripleWave adopts a stream format compatible with Semantic Web standards, including RDF for data modeling, and Linked Data principles for publishing [R4] [R6]. The proposed tool also provides pull and push data access to client applications and RSP engines [R5], as well as context information about the stream [R7].
The inherent flexibility of TripleWave makes it suitable for reuse in a wide range of streaming data applications, and it has the potential of enabling the integration of RSP query engines, stream reasoners, RDF stream filters, semantic complex event processors, benchmark platforms, and stored RDF data sources. This versatility, combined with a standards-driven design, and aligned with the requirements and design principles discussed in the W3C RSP Group, can help spread the adoption of RDF for streaming data scenarios and applications.
Show Cases. We developed two show cases in order to illustrate the capabilities of TripleWave. In the first case we set up TripleWave for converting Web streams and we configured it to transform the stream generated by the changes in Wikipedia. We developed the component to listen to the Wikipedia endpoint and the R2RML mapping.
In the second case we started another instance of TripleWave and we configured it to endlessly replay the Linked Sensor Data [14] dataset as a stream. Furthermore, for this scenario we also set up an instance of the C-SPARQL engine to consume the data produced by TripleWave. Links to both show cases are available on the project website (Note 15).
Availability. TripleWave is available under the Apache 2.0 license (Note 16), its code is accessible on GitHub (Note 17), and it is accompanied by user and developer guides. It is maintained and supported by the Stream Reasoning initiative (Note 18).
Notes
1.
2. TripleWave: http://streamreasoning.github.io/TripleWave/.
3.
4.
5.
6. W3C RSP Design Principles draft http://streamreasoning.github.io/RSP-QL/RSP_Requirements_Design_Document.
7. The time annotation is stored in the default graph, as in http://www.w3.org/TR/json-ld#named-graphs, Example 49.
8.
9. R2RML W3C Recommendation: http://www.w3.org/TR/r2rml/.
10.
11. We use schema.org as the vocabulary in the example.
12.
13.
14.
15.
16.
17.
18.
References
1. Anicic, D., Fodor, P., Rudolph, S., Stojanovic, N.: EP-SPARQL: a unified language for event processing and stream reasoning. In: WWW, pp. 635–644. ACM (2011)
2. Balduini, M., Della Valle, E., Dell’Aglio, D., Tsytsarau, M., Palpanas, T., Confalonieri, C.: Social listening of city scale events using the streaming linked data framework. In: Alani, H., et al. (eds.) The Semantic Web – ISWC 2013. LNCS, vol. 8219, pp. 1–16. Springer, Heidelberg (2013)
3. Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: C-SPARQL: a continuous query language for RDF data streams. Intl. J. Semant. Comput. 4(01), 3–25 (2010)
4. Barbieri, D.F., Della Valle, E.: A proposal for publishing data streams as linked data - A position paper. In: LDOW (2010)
5. Berners-Lee, T., Bizer, C., Heath, T.: Linked data-the story so far. IJSWIS 5(3), 1–22 (2009)
6. Calbimonte, J.-P., Jeung, H., Corcho, O., Aberer, K.: Enabling query technologies for the semantic sensor web. Int. J. Semant. Web Inf. Syst. 8, 43–63 (2012)
7. Fisteus, J.A., Garcia, N.F., Fernandez, L.S., Fuentes-Lorenzo, D.: Ztreamy: A middleware for publishing semantic streams on the web. J. Web Semant. 25, 16–23 (2014)
8. Gerber, D., Hellmann, S., Bühmann, L., Soru, T., Usbeck, R., Ngonga Ngomo, A.-C.: Real-Time RDF extraction from unstructured data streams. In: Alani, H., et al. (eds.) ISWC 2013. LNCS, vol. 8218, pp. 135–150. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41335-3_9
9. Le-Phuoc, D., Dao-Tran, M., Xavier Parreira, J., Hauswirth, M.: A native and adaptive approach for unified processing of linked streams and linked data. In: Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011. LNCS, vol. 7031, pp. 370–388. Springer, Heidelberg (2011). doi:10.1007/978-3-642-25073-6_24
10. Le-Phuoc, D., Nguyen-Mau, H.Q., Parreira, J.X., Hauswirth, M.: A middleware framework for scalable management of linked streams. J. Web Semant. 16, 42–51 (2012)
11. Le-Phuoc, D., Quoc, H.N.M., Quoc, H.N., Nhat, T.T., Hauswirth, M.: The graph of things: A step towards the live knowledge graph of connected things. J. Web Semant. 37, 25–35 (2016)
12. Lécué, F., Tallevi-Diotallevi, S., Hayes, J., Tucker, R., Bicer, V., Sbodio, M.L., Tommasi, P.: Star-city: semantic traffic analytics and reasoning for city. In: ACM IUI, pp. 179–188 (2014)
13. Mauri, A., Calbimonte, J.-P., Dell’Aglio, D., Balduini, M., Della Valle, E., Aberer, K.: Where are the RDF streams?: Deploying RDF streams on the web of data with TripleWave. In: Poster Proceedings of ISWC (2015)
14. Patni, H., Henson, C., Sheth, A.: Linked sensor data. In: IEEE CTS, pp. 362–370 (2010)
15. Scharrenbach, T., Urbani, J., Margara, A., Valle, E., Bernstein, A.: Seven commandments for benchmarking semantic flow processing systems. In: Cimiano, P., Corcho, O., Presutti, V., Hollink, L., Rudolph, S. (eds.) ESWC 2013. LNCS, vol. 7882, pp. 305–319. Springer, Heidelberg (2013). doi:10.1007/978-3-642-38288-8_21
16. Trinh, T.-D., Wetz, P., Do, B.-L., Anjomshoaa, A., Kiesling, E., Tjoa, A.M.: A web-based platform for dynamic integration of heterogeneous data. In: IIWAS, pp. 253–261 (2014)
© 2016 Springer International Publishing AG
Cite this paper
Mauri, A. et al. (2016). TripleWave: Spreading RDF Streams on the Web. In: Groth, P., et al. The Semantic Web – ISWC 2016. ISWC 2016. Lecture Notes in Computer Science(), vol 9982. Springer, Cham. https://doi.org/10.1007/978-3-319-46547-0_15
DOI: https://doi.org/10.1007/978-3-319-46547-0_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46546-3
Online ISBN: 978-3-319-46547-0
eBook Packages: Computer Science (R0)