Unsupervised learning blocking keys technique for indexing Arabic entity resolution
Attribute values in textual datasets are subject to different types of errors introduced during data entry, such as typographical errors, pronunciation errors, or dialect alterations. These errors make the entity resolution process more challenging. The iterative blocking indexing technique can be used to correct for this type of error, mainly in query access, where records are stored in more than one block. A blocking indexing technique selects the subset of object pairs saved in the same block for later detailed similarity computation and discards pairs that fall in different blocks as irrelevant. This work aims to solve such problems for Arabic texts. It proposes adapting a specific model for learning blocking keys and analyzes its performance on Arabic datasets. The resulting blocking keys are passed to the Dynamic Similarity-Aware Inverted Index (DySimII), which has worked efficiently with Arabic datasets. The model is tested on a telephone book dataset that contains duplicates and attribute-value errors caused by phonetic and typing mistakes. The learned keys reach a matching accuracy of 84% when a small number of attributes are corrupted, while performance declines as the number of corrupted attributes increases.
Keywords: Arabic entity resolution, Learning keys, Indexing, Arabic datasets
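The blocking step described in the abstract can be illustrated with a minimal sketch. It is not the learned-key model or DySimII itself. The field names (`surname`, `city`), the sample records, and the hand-written key function `blocking_key` are all assumptions made for illustration; in the proposed approach the key would be learned rather than fixed by hand.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical hand-written key: first two characters of the
    surname plus the city. A learned key would replace this function."""
    return record["surname"][:2] + "|" + record["city"]


def build_blocks(records, key_fn):
    """Group record ids by their blocking key value."""
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[key_fn(record)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Collect only the pairs of records that share a block;
    pairs that span different blocks are never generated."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative records only, not taken from the telephone book dataset.
records = {
    1: {"surname": "محمد", "city": "عمان"},
    2: {"surname": "محمود", "city": "عمان"},
    3: {"surname": "خالد", "city": "عمان"},
    4: {"surname": "محمد", "city": "اربد"},
}

blocks = build_blocks(records, blocking_key)
pairs = candidate_pairs(blocks)
```

With these four records, only records 1 and 2 share a block, so one candidate pair is passed on for detailed similarity comparison instead of all six possible pairs. A corrupted attribute value that changes the key would move a record to another block, which is one way to read the reported drop in accuracy as more attributes are corrupted.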