Data Integration for Open Data on the Web

  • Sebastian Neumaier
  • Axel Polleres
  • Simon Steyskal
  • Jürgen Umbrich
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10370)


In this lecture we introduce and discuss the challenges of integrating openly available Web data and how to address them. First, while we approach this topic from the viewpoint of Semantic Web research, not all data is readily available as RDF or Linked Data, so we give an introduction to the different data formats prevalent on the Web, namely standard formats for publishing and exchanging tabular, tree-shaped, and graph data. Second, not all Open Data is really completely open, so we discuss issues around licences and terms of usage associated with Open Data, as well as the documentation of data provenance. Third, we discuss (meta-)data quality issues associated with Open Data on the Web and how Semantic Web techniques and vocabularies can be used to describe and remedy them. Fourth, we address issues of searchability and integration of Open Data and discuss to what extent semantic search can help to overcome them. We close by briefly summarizing further issues not covered explicitly herein, such as multi-linguality, temporal aspects (archiving, evolution, temporal querying), and how or whether OWL and RDFS reasoning on top of integrated open data could help.


Keywords: Open Data; Link Data; Tabular Data; Link Open Data; Metadata Description



The work presented in this paper has been supported by the Austrian Research Promotion Agency (FFG) under the projects ADEQUATe (grant no. 849982) and DALICC (grant no. 855396).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Sebastian Neumaier (1)
  • Axel Polleres (1, 2)
  • Simon Steyskal (1)
  • Jürgen Umbrich (1)

  1. Vienna University of Economics and Business, Vienna, Austria
  2. Complexity Science Hub Vienna, Vienna, Austria
