How Linked Data can Aid Machine Learning-Based Tasks

  • Michalis Mountantonakis
  • Yannis Tzitzikas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10450)


Discovering useful data for a given problem is of primary importance, since data scientists usually spend a lot of time discovering, collecting and preparing data before using them, e.g., for applying or testing machine learning algorithms. In this paper we propose a general method for easily discovering, creating and selecting valuable features that describe a set of entities, so that they can be leveraged in a machine learning context. We demonstrate the feasibility of this approach by introducing a tool (research prototype), called \(\mathtt{LODsyndesis}_\mathcal{ML}\), which is based on Linked Data technologies and which (a) automatically discovers datasets where the entities of interest occur, (b) shows the user a large number of useful features for these entities, and (c) automatically creates the selected features by sending SPARQL queries. We evaluate this approach by exploiting data from several sources, including the British National Library, to create datasets for predicting whether a book or a movie is popular or non-popular. Our evaluation uses 5-fold cross validation, and we present comparative results for a number of different features and models. The evaluation showed that the additional features improved the accuracy of prediction.
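Step (c) above, creating a selected feature by sending a SPARQL query, can be illustrated with a minimal sketch. This is not the tool's actual implementation: the function names, the example.org URIs, and the choice of a per-entity value count as the feature are all assumptions made for illustration. The sketch only builds the query text and parses a result in the standard SPARQL JSON results layout; sending the query to an endpoint is left out.

```python
from typing import Dict, List


def build_feature_query(entity_uris: List[str], property_uri: str) -> str:
    """Build a SPARQL 1.1 query that counts, per entity, the values of one property.

    Hypothetical helper: the count-based feature is an illustrative choice,
    not necessarily one of the feature types the paper's tool creates.
    """
    values = " ".join(f"<{uri}>" for uri in entity_uris)
    return (
        "SELECT ?entity (COUNT(?value) AS ?feature) WHERE {\n"
        f"  VALUES ?entity {{ {values} }}\n"
        # OPTIONAL keeps entities that have no value; COUNT then yields 0.
        f"  OPTIONAL {{ ?entity <{property_uri}> ?value }}\n"
        "}\n"
        "GROUP BY ?entity"
    )


def bindings_to_features(bindings: List[Dict[str, Dict[str, str]]]) -> Dict[str, int]:
    """Map SPARQL JSON result bindings to an entity URI -> feature value dict."""
    return {row["entity"]["value"]: int(row["feature"]["value"]) for row in bindings}


if __name__ == "__main__":
    # Placeholder URIs (assumed for illustration only).
    query = build_feature_query(
        ["http://example.org/book/1", "http://example.org/book/2"],
        "http://example.org/prop/award",
    )
    print(query)
```

The dictionary returned by `bindings_to_features` could then be joined, entity by entity, with the class label (popular or non-popular) to form one column of a training dataset.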


Linked Data, Machine Learning, Feature Discovery & Selection, Automatic classification, Prediction



This work has received funding from the European Union’s Horizon 2020 Research and Innovation programme under the BlueBRIDGE project (Grant agreement No: 675680).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Institute of Computer Science, FORTH-ICS, Heraklion, Greece
  2. Computer Science Department, University of Crete, Heraklion, Greece
