Abstract
An important task in data integration and data cleaning is identifying data that describe the same real-world object, such as an event, a person, or a movie. Various techniques address this problem. The typical methodology is to collect matching evidence, such as similarities between entity strings, and to use that evidence to generate linkage information between entities. The entities are then merged, using predefined thresholds or human intervention, and queries are executed over the resulting merged entities. In this chapter, we explain the limitations of this methodology on more recent kinds of data, for instance data from Web 2.0 applications, and the challenges that such data pose for entity linkage. We then propose an alternative, generic methodology that uses the entity linkage information at query-processing time. In particular, we define a generic data model suitable for representing entities together with the linkage information generated by a number of existing entity linkage techniques. Entities are compiled on the fly by processing the incoming query over this representation model, so query answers reflect the most probable entity solution for the specific query. We also report the results of an extensive experimental evaluation, which verify the efficiency and effectiveness of the proposed methodology.
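The typical methodology summarized in the abstract (collect string-similarity evidence, merge entities whose linkage score passes a predefined threshold, then query the merged entities) can be illustrated with a minimal sketch. It is not the chapter's own method or data model: the records, the names `linkages`, `merge`, and `query`, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; names and attributes are invented for illustration.
RECORDS = {
    "e1": {"name": "J. Smith", "city": "Hannover"},
    "e2": {"name": "John Smith", "city": "Trento"},
    "e3": {"name": "Jane Doe", "city": "Berlin"},
}


def similarity(a, b):
    """String similarity in [0, 1], used here as the matching evidence."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def linkages(records):
    """Score every pair of records by the similarity of their names."""
    return {
        (x, y): similarity(records[x]["name"], records[y]["name"])
        for x, y in combinations(sorted(records), 2)
    }


def merge(records, links, threshold):
    """Merge records connected by linkages scoring at or above the threshold."""
    parent = {r: r for r in records}

    def find(r):
        # Follow parent pointers to the group representative (union-find).
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for (x, y), score in links.items():
        if score >= threshold:
            parent[find(y)] = find(x)

    # Collect the attribute values of each group into one merged entity.
    merged = {}
    for r, attrs in records.items():
        group = merged.setdefault(find(r), {})
        for key, value in attrs.items():
            group.setdefault(key, set()).add(value)
    return list(merged.values())


def query(entities, attribute, value):
    """Return the merged entities that hold the given attribute value."""
    return [e for e in entities if value in e.get(attribute, set())]


links = linkages(RECORDS)
strict = merge(RECORDS, links, threshold=0.95)  # merges nothing here
loose = merge(RECORDS, links, threshold=0.5)    # merges e1 with e2
```

In this sketch the same query, for example `query(strict, "city", "Trento")` versus `query(loose, "city", "Trento")`, returns a different entity depending on a threshold fixed before any query is seen. This dependence on an up-front merge decision is the kind of limitation that motivates keeping the linkage information available at query time.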
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Ioannou, E., Nejdl, W., Niederée, C., Velegrakis, Y. (2012). Embracing Uncertainty in Entity Linking. In: De Virgilio, R., Guerra, F., Velegrakis, Y. (eds) Semantic Search over the Web. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25008-8_9
DOI: https://doi.org/10.1007/978-3-642-25008-8_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25007-1
Online ISBN: 978-3-642-25008-8
eBook Packages: Computer Science, Computer Science (R0)