Collecting, Integrating, Enriching and Republishing Open City Data as Linked Data

  • Stefan Bischof
  • Christoph Martin
  • Axel Polleres
  • Patrik Schneider
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9367)


Access to high-quality, recent data is crucial both for decision makers in cities and for the public. Likewise, infrastructure providers could offer more tailored solutions to cities based on such data. However, even though many data sets containing relevant indicators about cities are available as open data, integrating and analyzing them is cumbersome: collection is still a manual process, and the sources are not connected to each other upfront. Further, because the indicators and cities covered by the available data sources are largely disjoint, integrating these sources yields a large proportion of missing values. In this paper we present a platform for collecting, integrating, and enriching open data about cities in a reusable and comparable manner. We have integrated various open data sources and present approaches for predicting missing values, using standard regression methods in combination with principal component analysis (PCA) to improve the quality and number of predicted values. Since indicators and cities overlap only partially across data sets, we focus in particular on predicting indicator values across data sets, and we extend, adapt, and evaluate our prediction model for this purpose. As a “side product” we learn ontology mappings (simple equations and sub-properties) for pairs of indicators from different data sets. Finally, we republish the integrated and predicted values as linked open data.
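To make the idea of a "simple equation" between a pair of indicators concrete, the following minimal sketch fits a linear equation between one indicator as reported by two data sets and uses it to fill a value missing from one of them. It is an illustration only, not the authors' implementation: it uses plain least squares on a single indicator pair and omits the PCA step entirely. The function names `fit_linear_mapping` and `fill_missing`, the city labels, and all numbers are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple


def fit_linear_mapping(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fill_missing(
    source: Dict[str, float], target: Dict[str, Optional[float]]
) -> Dict[str, float]:
    """Fit on cities present in both data sets, then predict the target
    indicator for cities that only the source data set covers."""
    shared = sorted(c for c in source if target.get(c) is not None)
    slope, intercept = fit_linear_mapping(
        [source[c] for c in shared], [target[c] for c in shared]
    )
    filled = {c: v for c, v in target.items() if v is not None}
    for city, x in source.items():
        if city not in filled:
            filled[city] = slope * x + intercept
    return filled


# Hypothetical data: the same indicator reported in thousands by data set A
# and in absolute numbers by data set B, with one city missing from B.
dataset_a = {"city_a": 100.0, "city_b": 250.0, "city_c": 400.0, "city_d": 800.0}
dataset_b = {"city_a": 100000.0, "city_b": 250000.0, "city_c": 400000.0, "city_d": None}

filled_b = fill_missing(dataset_a, dataset_b)
print(filled_b["city_d"])
```

In this toy case the fitted equation recovers the unit conversion between the two data sets. With real data the fit would be noisy, and a prediction would normally be published together with an error estimate.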


Keywords: Link Open Data, Target Indicator, Triple Store, City Data, Open Data Source





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Siemens AG Österreich, Vienna, Austria
  2. Vienna University of Economics and Business, Vienna, Austria
  3. Vienna University of Technology, Vienna, Austria
