Embracing Uncertainty in Entity Linking

  • Ekaterini Ioannou
  • Wolfgang Nejdl
  • Claudia Niederée
  • Yannis Velegrakis
Chapter
Part of the Data-Centric Systems and Applications book series (DCSA)

Abstract

An important task in data integration and data cleaning is identifying data that describe the same real-world object, such as an event, a person, or a movie. Various techniques tackle this problem. The typical methodology collects matching evidence, such as similarities between entity strings, and uses it to generate linkage information between entities. Then, using predefined thresholds or human intervention, the linked entities are merged, and queries are executed over the resulting merged entities. In this chapter, we explain the limitations of this methodology on recently introduced kinds of data, for instance data from Web 2.0 applications, and the challenges that such data pose for entity linkage. We then propose an alternative, generic methodology that uses the entity linkage information during query processing. In particular, we define a generic data model suitable for representing entities and linkage information as generated by a number of existing entity linkage techniques. Entities are compiled on the fly by processing the incoming query over this representation model, so that query answers reflect the most probable entity solution for the specific query. We also report the results of an extensive experimental evaluation, which verify the efficiency and effectiveness of the proposed methodology.
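
To make the contrast in the abstract concrete, the following is a minimal toy sketch and not the chapter's actual data model or algorithm. It assumes that linkages are independent pairwise probabilities, that merging an entity cluster means taking the union of its attribute values, and that a query is a set of attribute-value pairs. All names (`entities`, `linkages`, `merge_offline`, `answer_at_query_time`) and all data values are hypothetical. The first function merges once using a fixed threshold; the second enumerates the possible sets of accepted linkages and returns the most probable one in which some merged entity satisfies the query.

```python
from itertools import product

# Hypothetical toy data: three entity descriptions and two uncertain linkages.
entities = {
    "e1": {"name": "J. Smith", "city": "Hannover"},
    "e2": {"name": "John Smith", "title": "Dr."},
    "e3": {"name": "Jon Smyth", "city": "Trento"},
}
# Assumption: each linkage carries an independent probability of being correct.
linkages = {("e1", "e2"): 0.9, ("e2", "e3"): 0.4}


def clusters_for(entities, accepted_pairs):
    """Group entities connected by the accepted linkages (union-find)."""
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in accepted_pairs:
        parent[find(a)] = find(b)
    groups = {}
    for e in entities:
        groups.setdefault(find(e), []).append(e)
    return sorted(sorted(g) for g in groups.values())


def merge_offline(entities, linkages, threshold):
    """Threshold-based methodology: accept linkages once, before any query."""
    accepted = [pair for pair, p in linkages.items() if p >= threshold]
    return clusters_for(entities, accepted)


def answer_at_query_time(entities, linkages, query):
    """Return (probability, cluster) for the most probable set of accepted
    linkages under which some merged entity satisfies the query.
    Exhaustive enumeration: only suitable for tiny examples."""
    pairs = list(linkages)
    best_prob, best_cluster = 0.0, None
    for choice in product([True, False], repeat=len(pairs)):
        prob = 1.0
        accepted = []
        for pair, keep in zip(pairs, choice):
            prob *= linkages[pair] if keep else 1.0 - linkages[pair]
            if keep:
                accepted.append(pair)
        for cluster in clusters_for(entities, accepted):
            # Merge the cluster by taking the union of attribute values.
            merged = {}
            for e in cluster:
                for key, value in entities[e].items():
                    merged.setdefault(key, set()).add(value)
            satisfied = all(v in merged.get(k, set()) for k, v in query.items())
            if satisfied and prob > best_prob:
                best_prob, best_cluster = prob, cluster
    return best_prob, best_cluster


query = {"title": "Dr.", "city": "Hannover"}
print(merge_offline(entities, linkages, threshold=0.95))
print(answer_at_query_time(entities, linkages, query))
```

With a strict threshold of 0.95 the offline merge keeps all three entities separate, so no single entity has both the title and the city asked for. The query-time variant instead selects the interpretation in which only `e1` and `e2` are linked, since that is the most probable one that yields an answer for this particular query.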

Keywords

Query Processing; Query Result; Query Evaluation; Linkage Information; Query Answer
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Ekaterini Ioannou
    • Wolfgang Nejdl
      • 1
    • Claudia Niederée
      • 2
    • Yannis Velegrakis
      • 3
    1. Technical University of Crete, Chania, Greece
    2. L3S Research Center, Hannover, Germany
    3. University of Trento, Trento, Italy