Large-scale linked data integration using probabilistic reasoning and crowdsourcing

Abstract

We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities in natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find candidate matches among the entities that have been indexed in our system. Our system then analyzes the candidate matches and, whenever deemed necessary, refines them using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks when the algorithmic components fail to produce convincing results. We integrate all results from the inverted indices, from the graph database, and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95% average accuracy on instance matching (as compared to the initial 88% average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14% over our best automatic approach.
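
The three-stage cascade described in the abstract can be illustrated with a minimal sketch. This is not the ZenCrowd implementation: the function names (`index_lookup`, `graph_refine`, `ask_crowd`), the 0.9 acceptance threshold, and the per-worker reliability values are all hypothetical, and the vote aggregation below is a simple naive-Bayes odds update under an independence assumption, standing in for the probabilistic framework that the full article describes in detail.

```python
from typing import Callable, Dict, List, Optional, Tuple

Candidates = Dict[str, float]   # candidate URI -> confidence in [0, 1]
Vote = Tuple[float, bool]       # (assumed worker reliability, worker says "match")


def _clamp(p: float, eps: float = 1e-6) -> float:
    """Keep probabilities strictly inside (0, 1) so the odds stay finite."""
    return min(max(p, eps), 1.0 - eps)


def posterior(prior: float, votes: List[Vote]) -> float:
    """Update an algorithmic prior with crowd votes, assuming independent workers."""
    p = _clamp(prior)
    odds = p / (1.0 - p)
    for reliability, says_match in votes:
        r = _clamp(reliability)
        ratio = r / (1.0 - r)
        # A "match" vote multiplies the odds; a "no match" vote divides them.
        odds *= ratio if says_match else 1.0 / ratio
    return odds / (1.0 + odds)


def _best(candidates: Candidates) -> Tuple[str, float]:
    """Return the highest-scoring (URI, confidence) pair."""
    return max(candidates.items(), key=lambda kv: kv[1])


def match(
    entity: str,
    index_lookup: Callable[[str], Candidates],
    graph_refine: Callable[[str, Candidates], Candidates],
    ask_crowd: Callable[[str, Candidates], Dict[str, List[Vote]]],
    accept: float = 0.9,  # hypothetical acceptance threshold
) -> Optional[str]:
    # Stage 1: cheap lookup in an inverted index.
    candidates = index_lookup(entity)
    if not candidates:
        return None
    uri, score = _best(candidates)
    if score >= accept:
        return uri

    # Stage 2: more expensive refinement, e.g. queries on a graph database.
    candidates = graph_refine(entity, candidates)
    if not candidates:
        return None
    uri, score = _best(candidates)
    if score >= accept:
        return uri

    # Stage 3: ask the crowd only when the automatic stages are inconclusive.
    votes = ask_crowd(entity, candidates)
    scored = {u: posterior(p, votes.get(u, [])) for u, p in candidates.items()}
    uri, score = _best(scored)
    return uri if score >= accept else None
```

Because each stage returns as soon as one candidate clears the threshold, the crowd is consulted only for the cases the automatic stages cannot settle, which is the cost-limiting behavior the abstract refers to. In this sketch the worker reliabilities are supplied by the caller; estimating them from the data is outside its scope.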

Notes

  1. http://linkeddata.org/.
  2. http://www.dbpedia.org.
  3. http://freebase.org.
  4. http://trec.nist.gov.
  5. https://inex.mmci.uni-saarland.de/.
  6. http://www.okkam.org.
  7. http://extractiv.com/.
  8. http://www.opencalais.com/.
  9. http://linkeddata.org/.
  10. http://km.aifb.kit.edu/ws/semsearch10/.
  11. https://km.aifb.kit.edu/ws/semsearch11/.
  12. http://challenge.semanticweb.org/.
  13. http://www.mturk.com.
  14. In our experiments, we use 100 ground truth matchings that are discarded later when evaluating the proposed matching approaches.
  15. We can already see the benefit of having better matchings across data sets for that matter.
  16. http://oaei.ontologymatching.org/2011/instance/.
  17. http://data.nytimes.com/.
  18. http://km.aifb.kit.edu/projects/btc-2009/.
  19. http://www.mturk.com.
  20. The test set we have created, together with the matching results from the crowd, is available for download at: http://exascale.info/ZenCrowd.
  21. This is the average accuracy over all entity types reported in Table 3.
  22. The improvement is statistically significant (t test \(p<0.05\)).
  23. The test collection we created is available for download at: http://exascale.info/zencrowd/.
  24. http://dbpedia.org/.
  25. http://www.freebase.com/.
  26. http://www.geonames.org/.
  27. http://data.nytimes.com/.
  28. http://www.mturk.com.
  29. Our approach is hence similar to Blanco et al. [7], though we do not use BM25F as a ranking function.
  30. Confidence scores have all been normalized to \([0,1]\) by manually defining a transformation function.

References

  1. Alonso, O., Baeza-Yates, R.A.: Design and implementation of relevance assessments using crowdsourcing. In: ECIR, pp. 153–164 (2011)
  2. Bailey, P., de Vries, A.P., Craswell, N., Soboroff, I.: Overview of the TREC 2007 enterprise track. In: TREC (2007)
  3. Balog, K., Serdyukov, P., de Vries, A.P.: Overview of the TREC 2010 entity track. In: TREC (2010)
  4. Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI, pp. 2670–2676 (2007)
  5. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03, pp. 39–48. ACM, New York (2003). doi:10.1145/956750.956759
  6. Blanco, R., Halpin, H., Herzig, D., Mika, P., Pound, J., Thompson, H.S., Tran, D.T.: Repeatable and reliable search system evaluation using crowdsourcing. In: SIGIR, pp. 923–932 (2011)
  7. Blanco, R., Mika, P., Vigna, S.: Effective and efficient entity search in RDF data. In: International Semantic Web Conference (ISWC), pp. 83–97 (2011)
  8. Bouquet, P., Stoermer, H., Niederee, C., Mana, A.: Entity name system: the backbone of an open and scalable web of data. In: Proceedings of the IEEE International Conference on Semantic Computing (ICSC), pp. 554–561 (2008)
  9. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, vol. 6 (2006)
  10. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL (2006)
  11. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24(9), 1537–1555 (2012). doi:10.1109/TKDE.2011.127
  12. Ciaramita, M., Altun, Y.: Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, pp. 594–602. ACL, Stroudsburg (2006). http://dl.acm.org/citation.cfm?id=1610075.1610158
  13. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL, vol. 2007, pp. 708–716 (2007)
  14. Cudré-Mauroux, P., Aberer, K., Feher, A.: Probabilistic message passing in peer data management systems. In: International Conference on Data Engineering (ICDE) (2006)
  15. Cudré-Mauroux, P., Haghani, P., Jost, M., Aberer, K., De Meer, H.: idMesh: graph-based disambiguation of linked data. In: WWW ’09, pp. 591–600. ACM, New York (2009). doi:10.1145/1526709.1526789
  16. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the ACL (2002)
  17. Demartini, G., Difallah, D.E., Cudré-Mauroux, P.: Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pp. 469–478. ACM, New York (2012). doi:10.1145/2187836.2187900
  18. Demartini, G., Iofciu, T., de Vries, A.P.: Overview of the INEX 2009 entity ranking track. In: INEX, pp. 254–264 (2009)
  19. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39 (1977)
  20. Difallah, D.E., Demartini, G., Cudré-Mauroux, P.: Pick-A-Crowd: Tell me what you like, and I’ll tell you what to do. In: WWW’13. ACM, New York (2013)
  21. Dong, X., Halevy, A., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD, pp. 85–96. ACM, New York (2005)
  22. Feng, A., Franklin, M.J., Kossmann, D., Kraska, T., Madden, S., Ramesh, S., Wang, A., Xin, R.: CrowdDB: Query Processing with the VLDB Crowd. PVLDB 4(11), 1387–1390 (2011)
  23. Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pp. 80–88 (2010)
  24. Getoor, L., Machanavajjhala, A.: Entity resolution: tutorial. In: VLDB (2012)
  25. Haas, K., Mika, P., Tarjan, P., Blanco, R.: Enhanced results for web search. In: SIGIR, pp. 725–734 (2011)
  26. Han, X., Zhao, J.: Named entity disambiguation by leveraging wikipedia semantic knowledge. In: Proceeding of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pp. 215–224. ACM, New York (2009). doi:10.1145/1645953.1645983
  27. Jaro, M.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)
  28. Kazai, G.: In search of quality in crowdsourcing for search engine evaluation. In: ECIR, pp. 165–176 (2011)
  29. Kazai, G., Kamps, J., Koolen, M., Milic-Frayling, N.: Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking. In: SIGIR, pp. 205–214 (2011)
  30. Klein, D., Manning, C.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430 (2003)
  31. Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow text features. In: WSDM, pp. 441–450 (2010)
  32. Kschischang, F., Frey, B., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory 47(2) (2001)
  33. Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady, vol. 10, pp. 707–710 (1966)
  34. Lim, E.P., Srivastava, J., Prabhakar, S., Richardson, J.: Entity identification in database integration. Inform. Sci. 89(12), 1–38 (1996). doi:10.1016/0020-0255(95)00185-9
  35. Liu, X., Lu, M., Ooi, B.C., Shen, Y., Wu, S., Zhang, M.: Cdas: a crowdsourcing data analytics system. Proc. VLDB Endow. 5(10), 1040–1051 (2012). http://dl.acm.org/citation.cfm?id=2336664.2336676
  36. Marcus, A., Wu, E., Karger, D.R., Madden, S., Miller, R.C.: Human-powered sorts and joins. PVLDB 5(1), 13–24 (2011)
  37. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics) (2011)
  38. Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM ’07, pp. 233–242. ACM, New York (2007). doi:10.1145/1321440.1321475
  39. Murphy, K.M., Weiss, Y., Jordan, M.I.: Loopy belief propagation for approximate inference: an empirical study. In: Uncertainty in Artificial Intelligence (UAI) (1999)
  40. On, B., Koudas, N., Lee, D., Srivastava, D.: Group linkage. In: Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), pp. 496–505 (2007)
  41. Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W.: Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM ’12, pp. 53–62. ACM, New York (2012). doi:10.1145/2124295.2124305
  42. Pound, J., Mika, P., Zaragoza, H.: Ad-hoc object retrieval in the web of data. In: WWW, pp. 771–780 (2010)
  43. Selke, J., Lofi, C., Balke, W.T.: Pushing the boundaries of crowd-enabled databases with query-driven schema expansion. Proc. VLDB Endow. 5(6), 538–549 (2012). http://dl.acm.org/citation.cfm?id=2168651.2168655
  44. Shen, W., Wang, J., Luo, P., Wang, M.: Liege: link entities in web lists with knowledge base. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’12, pp. 1424–1432. ACM, New York (2012). doi:10.1145/2339530.2339753
  45. Tonon, A., Demartini, G., Cudré-Mauroux, P.: Combining inverted indices and structured search for ad-hoc object retrieval. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pp. 125–134. ACM, New York (2012). doi:10.1145/2348283.2348304
  46. von Ahn, L., Dabbish, L.: Designing games with a purpose. Commun. ACM 51(8), 58–67 (2008). doi:10.1145/1378704.1378719
  47. von Ahn, L., Liu, R., Blum, M.: Peekaboom: a game for locating objects in images. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’06, pp. 55–64. ACM, New York (2006). doi:10.1145/1124772.1124782
  48. Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. PVLDB 5(11), 1483–1494 (2012)
  49. Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, SIGMOD ’09, pp. 219–232. ACM, New York (2009). doi:10.1145/1559845.1559870
  50. Winkler, W.: The state of record linkage and current research problems. US Census Bureau. In: Statistical Research Division (1999)
  51. Wylot, M., Pont, J., Wisniewski, M., Cudré-Mauroux, P.: dipLODocus[RDF]: short and long-tail rdf analytics for massive webs of data. In: International Semantic Web Conference (ISWC), pp. 778–793 (2011)

Author information

Corresponding author

Correspondence to Gianluca Demartini.

Additional information

This work was supported by the Swiss National Science Foundation under grant number PP00P2_128459.

About this article

Cite this article

Demartini, G., Difallah, D.E. & Cudré-Mauroux, P. Large-scale linked data integration using probabilistic reasoning and crowdsourcing. The VLDB Journal 22, 665–687 (2013). https://doi.org/10.1007/s00778-013-0324-z

Keywords

  • Instance matching
  • Entity linking
  • Data integration
  • Crowdsourcing
  • Probabilistic reasoning