Model for Estimating the Optimal Parameter Values of the Scoring Matrix in the Entity Resolution of Unstandardized References

  • Conference paper
Advances in Information and Communication (FICC 2021)

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1364))

Abstract

This research describes how to obtain the best linking results when using the scoring matrix to perform entity resolution (ER) on unstandardized references. The accuracy of the linking results produced by the scoring matrix depends on three critical parameters: the blocking frequency threshold, the stop word frequency threshold, and the scoring (matching) threshold. This paper describes results from building a regression model for estimating the optimal parameter values, i.e., the parameter values giving the best ER results in terms of F-measure. The experimental method used 20 fully annotated sets of unstandardized references of varying size and data quality. The reference sets were a mixture of synthetically created person references and real-world business references. For each reference set, a grid search was used to find the parameter setting giving the best results, and seven statistical values were collected. For each combination of statistics from the 20 training sets, three linear regression models were built, one to predict each of the critical scoring matrix parameters. The final result was that using the combination of reference set size and the standard deviation of the token frequency distribution as independent variables produced the best linear regression models for estimating the three critical scoring matrix parameters. The linear regression models developed in this research will help users generate more accurate estimates of the three critical scoring matrix parameters in practical applications. This research proposes a solution and opens the door to a number of new research questions for improving the performance of the scoring matrix approach to ER.
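The regression step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, whitespace tokenization, the use of the population standard deviation, and every number below are assumptions made for illustration, and the training values are made up rather than taken from the paper. The sketch fits one ordinary least squares model per parameter, with reference set size and token frequency standard deviation as the two predictors.

```python
from collections import Counter
from statistics import pstdev

def token_frequency_stdev(references):
    """Standard deviation of token counts over a reference set.
    Assumption: whitespace tokens, upper-cased, population std dev."""
    counts = Counter(tok for ref in references for tok in ref.upper().split())
    return pstdev(list(counts.values()))

def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(xs, ys):
    """Least squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    rows = [[1.0, *x] for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(coef, x):
    """Evaluate the fitted model at one (size, stdev) point."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))

# Hypothetical training data: (reference set size, token frequency stdev)
# per reference set, with the "optimal" values a grid search might return.
# All numbers are invented placeholders, not results from the paper.
features = [(1000, 4.0), (2000, 6.5), (5000, 9.0), (8000, 15.0), (10000, 12.0)]
optimal = {
    "blocking_frequency_threshold": [5, 8, 14, 25, 22],
    "stop_word_frequency_threshold": [20, 35, 60, 110, 95],
    "scoring_threshold": [0.70, 0.72, 0.78, 0.80, 0.85],
}

# One model per critical parameter, as the abstract describes.
models = {name: fit_linear(features, ys) for name, ys in optimal.items()}
estimates = {name: predict(c, (4000, 8.0)) for name, c in models.items()}
print(estimates)
```

In practice the two predictors would be computed from a new, unannotated reference set (for example with `token_frequency_stdev`) and passed to `predict` to obtain starting values for the three thresholds.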

Author information

Corresponding author

Correspondence to Awaad K. Al Sarkhi.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Al Sarkhi, A.K., Talburt, J.R. (2021). Model for Estimating the Optimal Parameter Values of the Scoring Matrix in the Entity Resolution of Unstandardized References. In: Arai, K. (eds) Advances in Information and Communication. FICC 2021. Advances in Intelligent Systems and Computing, vol 1364. Springer, Cham. https://doi.org/10.1007/978-3-030-73103-8_2
