Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution

  • Conference paper
Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7867)

Abstract

Entity resolution is the process of identifying groups of records in one or more data sources that represent the same real-world entity. It is an important tool in data de-duplication, in linking records across databases, and in matching query records against a database of existing entities. Most existing entity resolution techniques complete the resolution process offline and on static databases. However, real-world databases are often dynamic, and organizations increasingly need to resolve entities in real time. Thus, there is a need for new techniques that support real-time resolution on dynamic databases. In this paper, we propose a dynamic similarity-aware inverted indexing technique (DySimII) that meets these requirements. We also propose a frequency-filtered indexing technique in which only the most frequent attribute values are indexed. We experimentally evaluate our techniques on a large real-world voter database. The results show that, as the index grows, neither the average record insertion time (around 0.1 msec) nor the average query time (less than 0.1 sec) increases appreciably. We also find that applying the frequency-filtered approach reduces the index size with only a slight drop in recall.
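
The abstract does not give implementation details, so the following is only a minimal Python sketch of the general idea of a similarity-aware inverted index that supports insertions and queries on a growing data set. It is not the authors' DySimII implementation. The class name `SimilarityAwareIndex`, the first-character `blocking_key`, the use of `difflib.SequenceMatcher` as the string comparator, the 0.7 threshold, and the `frequent_values` helper with its `min_count` parameter are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1]; a stand-in for any comparator."""
    return SequenceMatcher(None, a, b).ratio()


def blocking_key(value: str) -> str:
    """Hypothetical blocking function: group values by their first character."""
    return value[:1]


def frequent_values(values, min_count=2):
    """Assumed frequency filter: keep values seen at least `min_count` times."""
    counts = Counter(values)
    return {v for v, c in counts.items() if c >= min_count}


class SimilarityAwareIndex:
    """Illustrative index over a single attribute (for example, a surname)."""

    def __init__(self):
        self.record_ids = defaultdict(set)  # value -> ids of records holding it
        self.blocks = defaultdict(set)      # blocking key -> values in that block
        self.sims = defaultdict(dict)       # value -> {other value: similarity}

    def insert(self, rec_id, value):
        """Add a record; similarities are computed only for unseen values."""
        if value not in self.record_ids:
            key = blocking_key(value)
            for other in self.blocks[key]:
                s = similarity(value, other)
                self.sims[value][other] = s
                self.sims[other][value] = s
            self.blocks[key].add(value)
        self.record_ids[value].add(rec_id)

    def query(self, value, threshold=0.7):
        """Return {record id: similarity} for indexed values similar to `value`."""
        if value in self.record_ids:
            # Known value: reuse the similarities stored at insertion time.
            candidates = dict(self.sims.get(value, {}))
            candidates[value] = 1.0
        else:
            # Unknown value: compare against values in the same block only.
            block = self.blocks.get(blocking_key(value), set())
            candidates = {other: similarity(value, other) for other in block}
        results = {}
        for other, s in candidates.items():
            if s >= threshold:
                for rid in self.record_ids[other]:
                    results[rid] = max(s, results.get(rid, 0.0))
        return results


index = SimilarityAwareIndex()
for rec_id, surname in [("r1", "smith"), ("r2", "smyth"), ("r3", "jones"), ("r4", "smith")]:
    index.insert(rec_id, surname)

matches = index.query("smith")
```

In this sketch, a query for an already indexed value needs no string comparisons at all, because the similarities were stored when the value was first inserted. A frequency-filtered variant could, under the same assumptions, insert only the values returned by `frequent_values`, trading a smaller index for some loss of recall.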

This research was funded by the Australian Research Council (ARC), Veda, and Funnelback Pty. Ltd., under Linkage Project LP100200079.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ramadan, B., Christen, P., Liang, H., Gayler, R.W., Hawking, D. (2013). Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution. In: Li, J., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7867. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40319-4_5

  • DOI: https://doi.org/10.1007/978-3-642-40319-4_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40318-7

  • Online ISBN: 978-3-642-40319-4

  • eBook Packages: Computer Science, Computer Science (R0)
