Embracing Uncertainty in Entity Linking

Ioannou, Ekaterini; Nejdl, Wolfgang; Niederée, Claudia; Velegrakis, Yannis

doi:10.1007/978-3-642-25008-8_9

Chapter
First Online: 01 January 2012

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

An important task in data integration and data cleaning is identifying data that describe the same real-world object, such as an event, a person, or a movie. Various techniques tackle this problem. The typical methodology is to collect matching evidence, such as similarities between the entity strings, and to generate from it the information that links the entities. Then, using predefined thresholds or human intervention, the entities are merged, and queries are executed over the resulting merged entities. In this chapter, we explain the limitations of this methodology on recently introduced kinds of data, for instance data from Web 2.0 applications, and the challenges that such data pose for entity linkage. We then propose an alternative, generic methodology that uses the entity linkage information during query processing. In particular, we define a generic data model suitable for representing entities and linkage information as generated by a number of existing entity linkage techniques. Entities are compiled on the fly by processing the incoming query over the representation model, so query answers reflect the most probable entity solution for the specific query. We also report the results of our extensive experimental evaluation, which verify the efficiency and effectiveness of the suggested methodology.
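
To make the contrast between the two methodologies concrete, the following is a minimal sketch in Python. It is not the chapter's data model or algorithm. It assumes that each linkage is an independent probabilistic event and it enumerates every possible world exhaustively, which is feasible only for toy inputs. All names (`merge_entities`, `threshold_merge`, `query_time_answers`), the sample entities, and the probabilities are hypothetical.

```python
from itertools import product


def merge_entities(entities, links):
    """Merge entities connected by the given links (simple union-find)."""
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    merged = {}
    for e, attrs in entities.items():
        merged.setdefault(find(e), set()).update(attrs)
    return [frozenset(attrs) for attrs in merged.values()]


def threshold_merge(entities, linkages, threshold):
    """Typical methodology: keep linkages at or above a fixed threshold, then merge once."""
    kept = [pair for pair, p in linkages.items() if p >= threshold]
    return merge_entities(entities, kept)


def query_time_answers(entities, linkages, query):
    """Alternative (assumed independence): weigh every possible world at query time.

    Returns a dict mapping each merged entity that contains all query
    attributes to the total probability of the worlds in which it exists.
    """
    pairs = list(linkages)
    answers = {}
    for choice in product([True, False], repeat=len(pairs)):
        prob = 1.0
        active = []
        for pair, on in zip(pairs, choice):
            p = linkages[pair]
            prob *= p if on else 1.0 - p
            if on:
                active.append(pair)
        for ent in set(merge_entities(entities, active)):
            if query <= ent:
                answers[ent] = answers.get(ent, 0.0) + prob
    return answers


# Hypothetical toy data: three descriptions and two uncertain linkages.
entities = {
    "e1": {"name:J. Smith", "city:Trento"},
    "e2": {"name:John Smith", "topic:databases"},
    "e3": {"name:J. Smith", "topic:biology"},
}
linkages = {("e1", "e2"): 0.6, ("e1", "e3"): 0.3}
query = {"city:Trento", "topic:databases"}

static_result = [e for e in threshold_merge(entities, linkages, 0.7) if query <= e]
answers = query_time_answers(entities, linkages, query)
best = max(answers, key=answers.get)
print(static_result)
print(sorted(best), answers[best])
```

With a threshold of 0.7 neither linkage survives, so the threshold-based result contains no entity that matches the query. The query-time evaluation still returns the merge of `e1` and `e2` as the most probable matching entity, because the linkage evidence is kept until the query is known rather than discarded up front.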

Notes

1. http://www.cs.umass.edu/~mccallum/data/cora-refs.tar.gz

Author information

Authors and Affiliations

Technical University of Crete, University Campus – Kounoupidiana, 73100, Chania, Greece
Ekaterini Ioannou
L3S Research Center, Appelstr. 9a, 30167, Hannover, Germany
Wolfgang Nejdl & Claudia Niederée
University of Trento, Via Sommarive 14, 38123, Trento, Italy
Yannis Velegrakis

Corresponding author

Correspondence to Ekaterini Ioannou.

Editor information

Editors and Affiliations

Dipartimento di Informatica, Università degli Studi Roma Tre, Via della Vasca Navale 79, Roma, 00146, Italy
Roberto De Virgilio
Dipartimento di Economia Aziendale, Università degli Studi di Modena e Reggio Emilia, Viale Berengario 51, Modena, 41100, Italy
Francesco Guerra
Università degli Studi di Trento, Via Sommarive 14, Trento, 38123, Italy
Yannis Velegrakis

About this chapter

Cite this chapter

Ioannou, E., Nejdl, W., Niederée, C., Velegrakis, Y. (2012). Embracing Uncertainty in Entity Linking. In: De Virgilio, R., Guerra, F., Velegrakis, Y. (eds) Semantic Search over the Web. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25008-8_9

DOI: https://doi.org/10.1007/978-3-642-25008-8_9
Published: 28 January 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25007-1
Online ISBN: 978-3-642-25008-8
eBook Packages: Computer Science, Computer Science (R0)
