Retrieval, Crawling and Fusion of Entity-centric Data on the Web

  • Conference paper
Semantic Keyword-Based Search on Structured Data Sources (IKC 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10151)

Abstract

While the Web of (entity-centric) data has seen tremendous growth over the past years, take-up and re-use are still limited. Data vary heavily with respect to their scale, quality, coverage or dynamics, which poses challenges for tasks such as entity retrieval or search. This chapter provides an overview of approaches to deal with the increasing heterogeneity of Web data. On the one hand, recommendation, linking, profiling and retrieval can provide efficient means to enable discovery and search of entity-centric data, specifically when dealing with traditional knowledge graphs and linked data. On the other hand, embedded markup such as Microdata and RDFa has emerged as a novel, Web-scale source of entity-centric knowledge. Markup has seen increasing adoption over the last few years, driven by initiatives such as schema.org, and constitutes an increasingly important source of entity-centric data on the Web, being in the same order of magnitude as the Web itself with regard to dynamics and scale. Markup data therefore lends itself as a data source for aiding tasks such as knowledge base augmentation, where data fusion techniques are required to address the inherent characteristics of markup data, such as its redundancy, heterogeneity and lack of links. Future directions are concerned with the exploitation of the complementary nature of markup data and traditional knowledge graphs.

Notes

  1. RDFa W3C recommendation: http://www.w3.org/TR/xhtml-rdfa-primer/.
  2. http://www.w3.org/TR/microdata.
  3. http://microformats.org.
  4. http://www.datahub.io.
  5. http://vocab.deri.ie/void.
  6. http://data.linkededucation.org/vol/.
  7. http://data-observatory.org/lod-profiles/profile-explorer/.
  8. http://km.aifb.kit.edu/ws/semsearch10/.
  9. RDFa W3C recommendation: http://www.w3.org/TR/xhtml-rdfa-primer/.
  10. http://www.w3.org/TR/microdata.
  11. http://microformats.org.
  12. http://www.webdatacommons.org.
  13. http://webdatacommons.org/structureddata/.
  14. http://www.lrmi.net.
  15. http://glimmer.research.yahoo.com/.

Acknowledgements

All discussed works are joint research with numerous colleagues, friends and collaborators from a number of research institutions, and the author would like to thank all involved researchers for the inspiring and productive work throughout the previous years. In addition, the author expresses his gratitude to all funding bodies that enabled the presented research through a variety of funding programs.

Author information

Corresponding author

Correspondence to Stefan Dietze.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Dietze, S. (2017). Retrieval, Crawling and Fusion of Entity-centric Data on the Web. In: Calì, A., Gorgan, D., Ugarte, M. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2016. Lecture Notes in Computer Science, vol 10151. Springer, Cham. https://doi.org/10.1007/978-3-319-53640-8_1

  • DOI: https://doi.org/10.1007/978-3-319-53640-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-53639-2

  • Online ISBN: 978-3-319-53640-8

  • eBook Packages: Computer Science (R0)
