Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution

  • Banda Ramadan
  • Peter Christen
  • Huizhi Liang
  • Ross W. Gayler
  • David Hawking
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7867)

Abstract

Entity resolution is the process of identifying groups of records in one or more data sources that represent the same real-world entity. It is an important tool in data de-duplication, in linking records across databases, and in matching query records against a database of existing entities. Most existing entity resolution techniques complete the resolution process offline and on static databases. However, real-world databases are often dynamic, and organizations increasingly need to resolve entities in real time. Thus, there is a need for new techniques that support real-time resolution on dynamic databases. In this paper, we propose a dynamic similarity-aware inverted indexing technique (DySimII) that meets these requirements. We also propose a frequency-filtered indexing technique in which only the most frequent attribute values are indexed. We experimentally evaluate our techniques on a large real-world voter database. The results show that, as the index grows, there is no appreciable increase in either the average record insertion time (around 0.1 msec) or the average query time (less than 0.1 sec). We also find that the frequency-filtered approach reduces the index size with only a slight drop in recall.
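
The abstract does not spell out the data structures behind DySimII, so the following is only a minimal sketch of the general idea of a dynamic, similarity-aware inverted index, not the authors' implementation. It assumes a single string attribute per record, a hypothetical blocking key (the first character of the value), and Python's difflib ratio as a stand-in similarity measure. The class name SimilarityAwareIndex and the functions block_key and similarity are illustrative names, not taken from the paper. Similarities between values in the same block are computed once at insertion time, so a query for an already indexed value only needs lookups.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(value: str) -> str:
    """Hypothetical blocking key (assumption): first character, lowercased."""
    return value[:1].lower()


def similarity(a: str, b: str) -> float:
    """Stand-in similarity (assumption): difflib ratio, not the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


class SimilarityAwareIndex:
    """Illustrative dynamic inverted index with cached value similarities."""

    def __init__(self):
        self.records = defaultdict(set)  # attribute value -> record ids
        self.blocks = defaultdict(set)   # block key -> attribute values
        self.sims = defaultdict(dict)    # value -> {other value: similarity}

    def insert(self, record_id, value):
        """Add a record; similarities are computed only for unseen values."""
        if value not in self.records:
            key = block_key(value)
            for other in self.blocks[key]:
                s = similarity(value, other)
                self.sims[value][other] = s
                self.sims[other][value] = s
            self.blocks[key].add(value)
        self.records[value].add(record_id)

    def query(self, value, threshold=0.8):
        """Return {record_id: similarity} for values at or above the threshold."""
        if value in self.records:
            # Known value: reuse the similarities cached at insertion time.
            candidates = dict(self.sims.get(value, {}))
        else:
            # Unseen value: compare against values in the same block only.
            block = self.blocks.get(block_key(value), set())
            candidates = {other: similarity(value, other) for other in block}
        candidates[value] = 1.0  # an exact match always has similarity 1.0
        results = {}
        for other, s in candidates.items():
            if s >= threshold:
                for rid in self.records.get(other, ()):
                    results[rid] = max(s, results.get(rid, 0.0))
        return results


if __name__ == "__main__":
    index = SimilarityAwareIndex()
    index.insert("r1", "smith")
    index.insert("r2", "smyth")
    index.insert("r3", "jones")
    print(index.query("smith", threshold=0.7))
```

In this sketch the cost of inserting a record with an unseen value depends on the size of its block rather than on the total index size, which is one plausible reading of why insertion time could stay flat as the index grows. The frequency-filtered variant mentioned in the abstract is not shown.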

Keywords

Dynamic indexing; real-time query; record linkage; data matching; duplicate detection; frequency-filtered indexing

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Banda Ramadan (1)
  • Peter Christen (1)
  • Huizhi Liang (1)
  • Ross W. Gayler (2)
  • David Hawking (1, 3)

  1. Research School of Computer Science, The Australian National University, Canberra, Australia
  2. Veda, Melbourne, Australia
  3. Funnelback Pty. Ltd., Canberra, Australia
