Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution

  • Banda Ramadan
  • Peter Christen
  • Huizhi Liang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8506)

Abstract

Real-time entity resolution is the process of matching query records in sub-second time with records in a database that represent the same real-world entity. Indexing techniques are used to efficiently extract a set of candidate records from the database that are similar to a query record, and that are then compared with the query record in more detail. The sorted neighborhood indexing method, which sorts a database and compares records within a sliding window, has been used successfully for entity resolution of very large databases. However, because it is based on static sorted arrays, this technique is not suitable for dynamic databases. We propose a tree-based dynamic sorted neighborhood index that facilitates matching a stream of query records against a large and dynamic database in real time. We evaluate our approach on two large data sets. Our results show that the times for both inserting and querying records stay nearly constant as the index grows, and that our approach achieves indexing and querying times more than one order of magnitude faster than an earlier real-time entity resolution technique, with comparably high matching accuracy.
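
As a rough illustration of the idea described in the abstract, the following Python sketch keeps sorting keys in order as records arrive and returns the records whose keys fall within a window around a query key. It is not the authors' implementation: the abstract describes a tree-based index, whereas this sketch substitutes a plain sorted list maintained with the standard `bisect` module, so insertion here costs O(n) because of list shifting. The class name `SortedNeighborhoodIndex`, its methods, the `window` parameter, and the sample keys and record identifiers are all hypothetical.

```python
import bisect
from collections import defaultdict


class SortedNeighborhoodIndex:
    """Toy dynamic sorted neighborhood index.

    Assumption: a sorted Python list stands in for the tree-based
    structure mentioned in the abstract.
    """

    def __init__(self, window=1):
        self.window = window              # number of neighboring keys on each side
        self.keys = []                    # distinct sorting keys, kept sorted
        self.records = defaultdict(list)  # sorting key -> ids of records with that key

    def insert(self, key, record_id):
        """Add a record under its sorting key, keeping the keys sorted."""
        if key not in self.records:
            bisect.insort(self.keys, key)
        self.records[key].append(record_id)

    def query(self, key):
        """Return ids of candidate records within the window around `key`."""
        pos = bisect.bisect_left(self.keys, key)
        found = pos < len(self.keys) and self.keys[pos] == key
        lo = max(0, pos - self.window)
        # If the key itself is indexed, include it plus `window` keys after it.
        hi = pos + self.window + (1 if found else 0)
        candidates = []
        for k in self.keys[lo:hi]:
            candidates.extend(self.records[k])
        return candidates


index = SortedNeighborhoodIndex(window=1)
for sorting_key, rec_id in [("smith", "r1"), ("smyth", "r2"), ("miller", "r3"),
                            ("smith", "r4"), ("jones", "r5")]:
    index.insert(sorting_key, rec_id)

print(index.query("smith"))   # candidates around an indexed key
print(index.query("smithe"))  # candidates around an unseen key
```

The candidates returned by `query` would then be compared with the query record in more detail, for example with approximate string comparisons, to decide which of them are matches.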

Keywords

Dynamic indexing, data matching, braided tree

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Banda Ramadan (1)
  • Peter Christen (1)
  • Huizhi Liang (1)
  1. Research School of Computer Science, College of Engineering and Computer Science, The Australian National University, Canberra, Australia
