Abstract
There is a multitude of tools for preparing Linked Data from data sources such as CSV and XML files. These tools usually perform as expected when processing examples or small real-world datasets. However, a majority of them become hard to use when faced with a larger dataset, such as a CSV file hundreds of megabytes in size. Tools which load the entire resulting RDF dataset into memory usually have memory requirements that commodity hardware cannot satisfy. This is the case with RDF-based ETL tools. Their limits can be avoided by running them on powerful and expensive hardware, which is, however, not an option for the majority of data publishers. Tools which process the data in a streamed way tend to have limited transformation options. This is the case with text-based transformations, such as XSLT, and per-item SPARQL transformations, such as the streamed version of TARQL. In this paper, we show how the power and transformation options of RDF-based ETL tools can be combined with the ability to transform large datasets on common consumer hardware for so-called chunkable data, i.e., data which can be split in a certain way. We demonstrate our approach in our RDF-based ETL tool, LinkedPipes ETL. We include experiments on selected real-world datasets and a comparison of the performance and memory consumption of available tools.
This work was supported in part by the Czech Science Foundation (GAČR), grant number 16-09713S and in part by the project SVV 260451.
© 2017 Springer International Publishing AG
Cite this paper
Klímek, J., Škoda, P. (2017). Speeding up Publication of Linked Data Using Data Chunking in LinkedPipes ETL. In: Panetto, H., et al. On the Move to Meaningful Internet Systems. OTM 2017 Conferences. OTM 2017. Lecture Notes in Computer Science, vol 10574. Springer, Cham. https://doi.org/10.1007/978-3-319-69459-7_10
DOI: https://doi.org/10.1007/978-3-319-69459-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69458-0
Online ISBN: 978-3-319-69459-7
eBook Packages: Computer Science (R0)