Abstract
This research describes how to obtain the best linking results when using the scoring matrix to perform entity resolution (ER) on unstandardized references. The accuracy of the linking results produced by the scoring matrix depends on three critical parameters: the blocking frequency threshold, the stop word frequency threshold, and the scoring (matching) threshold. This paper describes results from building a regression model for estimating the optimal parameter values, i.e. the values giving the best ER results in terms of F-measure. The experimental method used 20 fully annotated sets of unstandardized references of varying size and data quality. The reference sets were a mixture of synthetically created person references and real-world business references. For each reference set, a grid search was used to find the parameter settings giving the best results, and seven statistical values were collected. For each combination of statistics from the 20 training sets, three linear regression models were built, one to predict each of the critical scoring matrix parameters. The final result was that using the combination of reference set size and the standard deviation of the token frequency distribution as independent variables produced the best linear regression models for estimating the three critical scoring matrix parameters. The linear regression model developed in this research will help users generate more accurate estimates of the three critical scoring matrix parameters in practical applications. This research proposes a solution and opens the door to a number of new research questions for improving the performance of the scoring matrix approach to ER.
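The estimation idea summarized above can be illustrated with a minimal sketch. It assumes that the "token frequency distribution" means the raw occurrence counts of whitespace-delimited, case-folded tokens, and that each parameter is predicted by an ordinary least-squares model with an intercept; the abstract does not specify either detail. The function names (`reference_set_stats`, `fit_linear_model`, `predict`) are hypothetical, and the training rows are made-up numbers used only to show the mechanics, not results from the paper.

```python
from collections import Counter

import numpy as np


def reference_set_stats(references):
    """Return (number of references, population std dev of token counts).

    Assumption: tokens are whitespace-delimited and case-folded.
    """
    counts = Counter(tok for ref in references for tok in ref.upper().split())
    freqs = np.array(list(counts.values()), dtype=float)
    return len(references), float(freqs.std())


def fit_linear_model(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, b2]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, size, token_sd):
    """Estimate one parameter value from the two independent variables."""
    return float(coef[0] + coef[1] * size + coef[2] * token_sd)


# Statistics for a tiny, invented reference set.
stats = reference_set_stats(["JOHN SMITH 12 OAK ST", "john smith oak street"])

# Invented training rows: (set size, token-frequency std dev) for five sets.
X = np.array([[1000, 2.0], [2000, 3.5], [5000, 4.0], [8000, 6.5], [10000, 7.0]])
# Synthetic target standing in for the best threshold found by grid search;
# it is exactly linear here so the fit can be checked by hand.
y = 0.5 + 0.00001 * X[:, 0] + 0.02 * X[:, 1]

coef = fit_linear_model(X, y)
estimate = predict(coef, 4000, 5.0)
```

In practice one such model would be fitted for each of the three parameters, with the targets taken from the grid-search optimum of each annotated training set.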
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Al Sarkhi, A.K., Talburt, J.R. (2021). Model for Estimating the Optimal Parameter Values of the Scoring Matrix in the Entity Resolution of Unstandardized References. In: Arai, K. (eds) Advances in Information and Communication. FICC 2021. Advances in Intelligent Systems and Computing, vol 1364. Springer, Cham. https://doi.org/10.1007/978-3-030-73103-8_2
DOI: https://doi.org/10.1007/978-3-030-73103-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73102-1
Online ISBN: 978-3-030-73103-8
eBook Packages: Intelligent Technologies and Robotics (R0)