Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution

  • Conference paper
Databases Theory and Applications (ADC 2014)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8506))

Abstract

Real-time entity resolution is the process of matching query records in sub-second time with records in a database that represent the same real-world entity. Indexing techniques are used to efficiently extract a set of candidate records from the database that are similar to a query record, and that are then compared with the query record in more detail. The sorted neighborhood indexing method, which sorts a database and compares records within a sliding window, has been used successfully for entity resolution of very large databases. However, because it is based on static sorted arrays, this technique is not suitable for dynamic databases. We propose a tree-based dynamic sorted neighborhood index that facilitates matching a stream of query records against a large and dynamic database in real time. We evaluate our approach on two large data sets. Our results show that the times for both inserting and querying records stay nearly constant as the index grows, and that our approach achieves indexing and querying times more than an order of magnitude faster than an earlier real-time entity resolution technique, with comparably high matching accuracy.
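The idea of a sorted neighborhood index that accepts insertions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `SortedNeighborhoodIndex`, the `window` parameter, and the use of a surname as sorting key are assumptions made for illustration, and a sorted Python list maintained with `bisect` stands in for the tree described in the abstract (list insertion is linear in the number of keys, which a balanced tree would avoid).

```python
import bisect
from collections import defaultdict


class SortedNeighborhoodIndex:
    """Toy dynamic sorted neighborhood index.

    Assumption: a sorted list of keys stands in for the tree-based
    structure; only the windowing idea is illustrated here.
    """

    def __init__(self, window=1):
        self.window = window              # neighbouring keys taken on each side
        self.keys = []                    # distinct sorting keys, kept in order
        self.records = defaultdict(list)  # sorting key -> record identifiers

    def insert(self, record_id, key):
        """Add a record under its sorting key, keeping the keys sorted."""
        if key not in self.records:
            bisect.insort(self.keys, key)
        self.records[key].append(record_id)

    def candidates(self, key):
        """Return record ids whose keys fall in the window around `key`."""
        pos = bisect.bisect_left(self.keys, key)
        lo = max(0, pos - self.window)
        hi = pos + self.window
        if pos < len(self.keys) and self.keys[pos] == key:
            hi += 1  # the query key itself is indexed, so include it too
        found = []
        for k in self.keys[lo:hi]:
            found.extend(self.records[k])
        return found


index = SortedNeighborhoodIndex(window=1)
for rid, surname in [("r1", "smith"), ("r2", "smyth"), ("r3", "jones"),
                     ("r4", "smith"), ("r5", "taylor")]:
    index.insert(rid, surname)

print(index.candidates("smith"))  # ['r3', 'r1', 'r4', 'r2']
print(index.candidates("smitt"))  # ['r1', 'r4', 'r2']
```

Each query returns only the records whose keys sit next to the query key in sorted order. Those candidates would then be compared with the query record in detail, a step this sketch leaves out.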

This research was funded by the Australian Research Council (ARC), Veda, and Funnelback Pty. Ltd., under Linkage Project LP100200079.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Ramadan, B., Christen, P., Liang, H. (2014). Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution. In: Wang, H., Sharaf, M.A. (eds) Databases Theory and Applications. ADC 2014. Lecture Notes in Computer Science, vol 8506. Springer, Cham. https://doi.org/10.1007/978-3-319-08608-8_1

  • DOI: https://doi.org/10.1007/978-3-319-08608-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08607-1

  • Online ISBN: 978-3-319-08608-8

  • eBook Packages: Computer Science, Computer Science (R0)