Abstract
Entity resolution is the process of identifying groups of records in one or more data sources that represent the same real-world entity. It is an important tool for data de-duplication, for linking records across databases, and for matching query records against a database of existing entities. Most existing entity resolution techniques run offline on static databases. However, real-world databases are often dynamic, and organizations increasingly need to resolve entities in real time. New techniques are therefore needed that work with dynamic databases in real time. In this paper, we propose a dynamic similarity-aware inverted indexing technique (DySimII) that meets these requirements. We also propose a frequency-filtered indexing technique in which only the most frequent attribute values are indexed. We experimentally evaluate our techniques on a large real-world voter database. The results show that, as the index grows, neither the average record insertion time (around 0.1 msec) nor the average query time (less than 0.1 sec) increases appreciably. We also find that the frequency-filtered approach reduces the index size with only a slight drop in recall.
This research was funded by the Australian Research Council (ARC), Veda, and Funnelback Pty. Ltd., under Linkage Project LP100200079.
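The abstract names the two ideas (a similarity-aware inverted index that is updated as records arrive, and frequency filtering of indexed values) without giving the data layout. The Python sketch below is one possible reading, not the authors' implementation. Everything in it is an assumption made for illustration: the class and function names, the two-character prefix used as a blocking key, the use of `difflib.SequenceMatcher` as the similarity measure, and the single-attribute records.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def block_key(value: str) -> str:
    """Hypothetical blocking key: the first two characters, lower-cased."""
    return value[:2].lower()


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; the paper's measure may differ."""
    return SequenceMatcher(None, a, b).ratio()


def frequent_values(values, top_k):
    """Assumed form of frequency filtering: keep the top_k most common values."""
    return {value for value, _ in Counter(values).most_common(top_k)}


class DynamicSimilarityIndex:
    """Illustrative single-attribute index that supports inserts and queries."""

    def __init__(self):
        self.records = defaultdict(set)  # attribute value -> record ids
        self.blocks = defaultdict(set)   # block key -> attribute values
        self.sims = defaultdict(dict)    # value -> {other value: similarity}

    def insert(self, rec_id, value):
        """Add a record; similarities are computed once per new value."""
        if value not in self.records:
            key = block_key(value)
            for other in self.blocks[key]:
                score = similarity(value, other)
                self.sims[value][other] = score
                self.sims[other][value] = score
            self.blocks[key].add(value)
        self.records[value].add(rec_id)

    def query(self, value, min_sim=0.7):
        """Return (record id, similarity) pairs, best match first."""
        scores = {}
        if value in self.records:
            # Indexed value: exact matches plus cached similarities.
            for rec_id in self.records[value]:
                scores[rec_id] = 1.0
            candidates = self.sims.get(value, {})
        else:
            # Unseen value: compare against its block at query time.
            block = self.blocks.get(block_key(value), ())
            candidates = {other: similarity(value, other) for other in block}
        for other, score in candidates.items():
            if score >= min_sim:
                for rec_id in self.records[other]:
                    scores[rec_id] = max(score, scores.get(rec_id, 0.0))
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

In this sketch the similarity work is done at insertion time, so a query for an already indexed value is a pair of dictionary lookups. A query for an unseen value compares only against the values in its own block. A frequency-filtered variant could call `insert` only for values returned by `frequent_values`, trading some recall for a smaller index.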
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ramadan, B., Christen, P., Liang, H., Gayler, R.W., Hawking, D. (2013). Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution. In: Li, J., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7867. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40319-4_5
DOI: https://doi.org/10.1007/978-3-642-40319-4_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40318-7
Online ISBN: 978-3-642-40319-4
eBook Packages: Computer Science, Computer Science (R0)