Abstract
Real-time entity resolution is the process of matching query records in sub-second time with records in a database that represent the same real-world entity. Indexing techniques are used to efficiently extract a set of candidate records from the database that are similar to a query record, and that are then compared with the query record in more detail. The sorted neighborhood indexing method, which sorts a database and compares records within a sliding window, has successfully been used for entity resolution of very large databases. However, because it is based on static sorted arrays, this technique is not suitable for dynamic databases. We propose a tree-based dynamic sorted neighborhood index that facilitates matching a stream of query records against a large and dynamic database in real time. We evaluate our approach on two large data sets. Our results show that the times for both inserting and querying records stay nearly constant as the index grows, and that our approach achieves indexing and querying times more than an order of magnitude faster than an earlier real-time entity resolution technique, with comparably high matching accuracy.
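The idea of a sorted neighborhood index that accepts insertions between queries can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `SortedNeighborhoodIndex`, its methods, and the `window` parameter are hypothetical, and a sorted Python list maintained with `bisect` stands in for the tree described in the abstract. A list insertion shifts elements and so costs linear time in the worst case, whereas a balanced tree would give logarithmic inserts.

```python
import bisect
from collections import defaultdict


class SortedNeighborhoodIndex:
    """Toy dynamic sorted neighborhood index.

    Assumption: a sorted list of distinct sorting keys replaces the
    tree structure; names and parameters here are illustrative only.
    """

    def __init__(self, window=1):
        self.window = window              # neighboring keys taken on each side
        self.keys = []                    # distinct sorting keys, kept sorted
        self.records = defaultdict(list)  # sorting key -> list of record ids

    def insert(self, record_id, key):
        # Add the key to the sorted list only the first time it is seen.
        if key not in self.records:
            bisect.insort(self.keys, key)
        self.records[key].append(record_id)

    def query(self, key):
        # Locate where the query key sits (or would sit) in sorted order.
        pos = bisect.bisect_left(self.keys, key)
        present = pos < len(self.keys) and self.keys[pos] == key
        lo = max(0, pos - self.window)
        # If the key exists, include it plus `window` keys after it;
        # otherwise take `window` keys starting at the insertion point.
        hi = min(len(self.keys), pos + self.window + (1 if present else 0))
        candidates = []
        for k in self.keys[lo:hi]:
            candidates.extend(self.records[k])
        return candidates


index = SortedNeighborhoodIndex(window=1)
for rid, surname in [("r1", "smith"), ("r2", "smyth"), ("r3", "jones"),
                     ("r4", "smith"), ("r5", "taylor")]:
    index.insert(rid, surname)

# Candidates are the records whose keys fall inside the window around
# the query key; a full system would then compare them in detail.
print(index.query("smith"))
print(index.query("smithe"))
```

In this sketch the window is measured in distinct key values rather than individual records, so all records sharing a key are returned together. That is one possible design choice, stated here as an assumption rather than as the paper's definition.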
This research was funded by the Australian Research Council (ARC), Veda, and Funnelback Pty. Ltd., under Linkage Project LP100200079.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Ramadan, B., Christen, P., Liang, H. (2014). Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution. In: Wang, H., Sharaf, M.A. (eds) Databases Theory and Applications. ADC 2014. Lecture Notes in Computer Science, vol 8506. Springer, Cham. https://doi.org/10.1007/978-3-319-08608-8_1
DOI: https://doi.org/10.1007/978-3-319-08608-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08607-1
Online ISBN: 978-3-319-08608-8
eBook Packages: Computer Science, Computer Science (R0)