Unsupervised learning blocking keys technique for indexing Arabic entity resolution

Alian, Marwah; Awajan, Arafat; Ramadan, Bandan

doi:10.1007/s10772-018-9489-6

Published: 15 January 2018

Volume 22, pages 621–628 (2019)

International Journal of Speech Technology

Marwah Alian¹˒², Arafat Awajan² & Bandan Ramadan³

Abstract

Attribute values in textual datasets are subject to several types of data-entry errors, such as typographical errors, pronunciation errors, and dialect variations. These errors make entity resolution more challenging. Iterative blocking, an indexing technique in which a record may be stored in more than one block, can compensate for such errors, mainly during query access. A blocking technique selects only the record pairs that fall in the same block for detailed similarity computation and discards pairs that span different blocks as irrelevant. This work aims to solve these problems for Arabic text. It adapts a specific model for learning blocking keys and analyzes its performance on Arabic datasets. The learned blocking keys are passed to the Dynamic Similarity-Aware Inverted Index (DySimII), which has worked efficiently with Arabic datasets. The model is tested on a telephone book dataset that contains duplicates and attribute values corrupted by phonetic and typing errors. With learned keys, matching accuracy reaches 84% when a small number of attributes are corrupted, and performance declines as the number of corrupted attributes increases.
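
The blocking idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model or the DySimII index: the records, the two key functions (`key_surname_prefix`, `key_phone_suffix`), and the helper names are hypothetical and chosen only for illustration. Each record is placed in one block per key, so a record can sit in more than one block, and only records that share a block become candidate pairs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (id, first name, surname, phone).
# They are illustrative only and do not come from the paper's dataset.
RECORDS = [
    (1, "محمد", "العلي", "0791234567"),
    (2, "محمود", "العلي", "0791234567"),
    (3, "سارة", "الخطيب", "0785550000"),
    (4, "ساره", "الخطيب", "0785550000"),
    (5, "خالد", "النجار", "0771112222"),
]


def key_surname_prefix(rec):
    """Assumed blocking key: first three characters of the surname."""
    return rec[2][:3]


def key_phone_suffix(rec):
    """Assumed blocking key: last four digits of the phone number."""
    return rec[3][-4:]


def build_blocks(records, key_funcs):
    """Insert every record into one block per key function.

    Blocks are addressed by (key index, key value), so values produced by
    different key functions never collide.
    """
    blocks = defaultdict(set)
    for rec in records:
        for i, key_func in enumerate(key_funcs):
            blocks[(i, key_func(rec))].add(rec[0])
    return blocks


def candidate_pairs(blocks):
    """Return record-id pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


blocks = build_blocks(RECORDS, [key_surname_prefix, key_phone_suffix])
pairs = candidate_pairs(blocks)
total = len(RECORDS) * (len(RECORDS) - 1) // 2
print(f"{len(pairs)} candidate pairs out of {total} possible pairs")
```

In this sketch only the pairs inside a block would go on to detailed similarity comparison; all other pairs are discarded without being compared. Using a second key gives a record another chance to meet its duplicate when one attribute is corrupted, which is the motivation for storing records in more than one block.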

Author information

Authors and Affiliations

Hashemite University, Zarqa, Jordan
Marwah Alian
Princess Sumaya University for Technology, Amman, Jordan
Marwah Alian & Arafat Awajan
Prince Sultan University, Riyadh, Saudi Arabia
Bandan Ramadan

Corresponding author

Correspondence to Marwah Alian.

About this article

Cite this article

Alian, M., Awajan, A. & Ramadan, B. Unsupervised learning blocking keys technique for indexing Arabic entity resolution. Int J Speech Technol 22, 621–628 (2019). https://doi.org/10.1007/s10772-018-9489-6

Received: 12 October 2017
Accepted: 03 January 2018
Published: 15 January 2018
Issue Date: September 2019
DOI: https://doi.org/10.1007/s10772-018-9489-6
