Embracing Uncertainty in Entity Linking

Semantic Search over the Web

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

An important task in data integration and data cleaning is identifying data that describe the same real-world object, such as an event, a person, or a movie. Various techniques tackle this problem. The typical methodology collects matching evidence, such as similarities between entity strings, and uses it to generate linkage information between the entities. The entities are then merged using predefined thresholds or human intervention, and queries are executed over the resulting merged entities. In this chapter, we explain the limitations of this methodology on recently introduced kinds of data, for instance data from Web 2.0 applications, and the challenges that such data pose for entity linkage. We then propose an alternative, generic methodology that uses the entity linkage information during query processing. In particular, we define a generic data model suitable for representing the entity and linkage information generated by a number of existing entity linkage techniques. Entities are compiled on the fly by processing the incoming query over this representation model, so query answers reflect the most probable entity solution for the specific query. We also report the results of an extensive experimental evaluation, which verifies the efficiency and effectiveness of the proposed methodology.
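
To make the contrast in the abstract concrete, here is a minimal Python sketch. It is not the chapter's data model or algorithm: the function names, the use of `difflib.SequenceMatcher` as the matching evidence, and the sample names are all assumptions chosen for illustration. The first function merges entities once with a fixed threshold (the typical methodology); the second keeps the linkage scores so that a decision can be deferred to query time.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for any matching evidence."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def linkages(names):
    """Score every pair of entity descriptions: {(i, j): score} with i < j."""
    return {(i, j): similarity(names[i], names[j])
            for i, j in combinations(range(len(names)), 2)}


def merge_with_threshold(names, links, threshold):
    """Typical methodology: merge all pairs whose score reaches the threshold."""
    parent = list(range(len(names)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in links.items():
        if score >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(groups.values())


def linked_candidates(names, links, index):
    """Query-time view: keep every link to names[index] with its score, best first."""
    scored = []
    for (i, j), score in links.items():
        if index in (i, j):
            other = j if i == index else i
            scored.append((names[other], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    names = ["J. Smith", "John Smith", "Alice Jones"]  # hypothetical data
    links = linkages(names)
    print(merge_with_threshold(names, links, 0.7))
    print(merge_with_threshold(names, links, 0.8))
    print(linked_candidates(names, links, 1))
```

In this toy data, whether "J. Smith" and "John Smith" become one entity depends entirely on the chosen threshold, which is the kind of fixed, up-front decision the abstract argues against. The second function only hints at the alternative: it leaves the scores intact so a query processor could weigh them per query.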

Notes

  1. http://www.cs.umass.edu/~mccallum/data/cora-refs.tar.gz

Author information

Corresponding author

Correspondence to Ekaterini Ioannou.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Ioannou, E., Nejdl, W., Niederée, C., Velegrakis, Y. (2012). Embracing Uncertainty in Entity Linking. In: De Virgilio, R., Guerra, F., Velegrakis, Y. (eds) Semantic Search over the Web. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25008-8_9

  • DOI: https://doi.org/10.1007/978-3-642-25008-8_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25007-1

  • Online ISBN: 978-3-642-25008-8

  • eBook Packages: Computer Science, Computer Science (R0)
