Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition

Abstract

In this paper, we conduct extensive experiments to study the impact of different feature categories, both in isolation and incrementally, on Arabic Person name recognition. We present a hybrid system that integrates a rule-based approach with a machine learning (ML)-based approach. The feature space comprises language-independent and language-specific features, grouped under six categories: Person named entity tags predicted by the rule-based component, word-level features, POS features, morphological features, gazetteer features, and other contextual features. Because the decision tree algorithm has proved comparatively more efficient as a classifier in current state-of-the-art hybrid Named Entity Recognition for Arabic, it is adopted as the ML technique of the hybrid system. The experiments therefore focus on two dimensions: the standard dataset used and the set of selected features. Several standard datasets are used for training and testing the hybrid system, including ACE (2003–2004) and ANERcorp. The experimental analysis indicates that both language-independent and language-specific features play an important role in overcoming the challenges posed by the Arabic language and have a critical impact on optimizing the performance of the hybrid system.
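
As a rough illustration of the experimental design described above, the following sketch enumerates feature configurations "in isolation" and "incrementally" over the six categories. It is not the authors' code: the group identifiers, the order in which groups are accumulated, and the example token values are all assumptions made for illustration.

```python
# The six feature categories named in the abstract. The short identifiers
# and their ordering are hypothetical choices for this sketch.
FEATURE_GROUPS = [
    "rule_based_tag",  # Person tag predicted by the rule-based component
    "word_level",
    "pos",
    "morphological",
    "gazetteer",
    "contextual",
]


def isolated_configs(groups):
    """One configuration per category: each group used on its own."""
    return [[g] for g in groups]


def incremental_configs(groups):
    """Cumulative configurations: the first group, then the first two, etc."""
    return [groups[:i] for i in range(1, len(groups) + 1)]


def select_features(token_features, config):
    """Keep only the features of one token that belong to the active groups."""
    return {g: token_features[g] for g in config if g in token_features}


# Made-up feature values for a single token, for demonstration only.
example_token = {
    "rule_based_tag": "B-PER",
    "word_level": {"length": 4},
    "pos": "noun_prop",
    "morphological": {"gender": "m", "number": "s"},
    "gazetteer": {"in_first_name_list": True},
    "contextual": {"prev_rule_based_tag": "O"},
}

for config in incremental_configs(FEATURE_GROUPS):
    row = select_features(example_token, config)
    print(len(config), sorted(row))
```

Each resulting row would then be passed to a classifier such as a decision tree; keeping the rule-based tag as an ordinary feature is what makes the system hybrid rather than a simple pipeline of two independent recognizers.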

Notes

  1. We used Habash–Soudi–Buckwalter transliteration scheme (Habash et al. 2007).

  2. GATE is freely available at the web link: http://gate.ac.uk/.

  3. WEKA is available from www.cs.waikato.ac.nz/ml/weka/.

  4. Available for our institution under license agreement from the Linguistic Data Consortium (LDC).

  5. Available for download from http://www1.ccls.columbia.edu/~ybenajiba/downloads.html.

References

  1. Abdallah, S., Shaalan, K., & Shoaib, M. (2012). Integrating rule-based system with classification for Arabic named entity recognition. In Proceedings of the 13th international conference on intelligent text processing and computational linguistics (CICLing) (pp. 311–322). Berlin: Springer.

  2. AbdelRahman, S., Elarnaoty, M., Magdy, M., & Fahmy, A. (2010). Integrated machine learning techniques for Arabic named entity recognition. International Journal of Computer Science Issues (IJCSI), 7(3), 27–36.

  3. Abdul-Hamid, A., & Darwish, K. (2010). Simplified feature set for Arabic named entity recognition. In Proceedings of the 2010 named entities workshop (ACL 2010) (pp. 110–115).

  4. Aboaoga, M., & Aziz, M. J. A. (2013). Arabic person names recognition by using a rule based approach. Journal of Computer Science, 9, 922–927.

  5. Abouenour, L., Bouzoubaa, K., & Rosso, P. (2013). On the evaluation and improvement of Arabic WordNet coverage and usability. Language Resources and Evaluation, 47(3), 891–917.

  6. Alias-I. (2008). LingPipe 4.1.0. http://alias-i.com/lingpipe. 1 Oct 2008.

  7. Al-Sughaiyer, I., & Al-Kharashi, A. (2004). Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55, 189–213.

  8. Babych, B., & Hartley, A. (2003). Improving machine translation quality with automatic named entity recognition. In Proceedings of the 7th international EAMT workshop on MT and other language technology tools, improving MT through other language technology tools: Resources and tools for building MT (EAMT 2003) (pp. 1–8).

  9. Benajiba, Y., Diab, M., & Rosso, P. (2008a). Arabic named entity recognition: An SVM-based approach. In Proceedings of Arab international conference on information technology (ACIT 2008) (pp. 16–18).

  10. Benajiba, Y., Diab, M., & Rosso, P. (2008b). Arabic named entity recognition using optimized feature sets. In Proceedings of the conference on empirical methods in natural language processing.

  11. Benajiba, Y., Diab, M., & Rosso, P. (2009a). Arabic named entity recognition: A feature-driven study. IEEE Transactions on Audio, Speech and Language Processing, 17(5), 926–934.

  12. Benajiba, Y., Diab, M., & Rosso, P. (2009b). Using language independent and language specific features to enhance Arabic named entity recognition. The International Arab Journal of Information Technology, 6(5), 464–473.

  13. Benajiba, Y., & Rosso, P. (2007). ANERsys 2.0: Conquering the NER task for the Arabic language by combining the Maximum Entropy with POS-tag information. In Proceedings of workshop on natural language-independent engineering, 3rd Indian international conference on artificial intelligence (IICAI-2007) (pp. 1814–1823).

  14. Benajiba, Y., & Rosso, P. (2008). Arabic named entity recognition using conditional random fields. In Proceedings of workshop on HLT & NLP within the Arabic World (LREC 2008).

  15. Benajiba, Y., Rosso, P., & Benedí, J. M. (2007). ANERsys: An Arabic named entity recognition system based on maximum entropy. In Proceedings of the 8th international conference on computational linguistics and intelligent text processing (CICLing-2007) (pp. 143–153). Berlin: Springer.

  16. Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing (pp. 1–8).

  17. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I. et al. (2011). Text processing with GATE (Version 6), University of Sheffield Department of Computer Science.

  18. Elsebai, A., Meziane, F., & BelKredim, F. Z. (2009). A rule based Persons names Arabic extraction system. Communications of the IBIMA, 11(6), 53–59.

  19. Farber, B., Freitag, D., Habash, N., & Rambow, O. (2008). Improving NER in Arabic using a morphological tagger. In Proceedings of workshop on HLT & NLP within the Arabic world (LREC 2008) (pp. 2509–2514).

  20. Farghaly, A., & Shaalan, K. (2009). Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP), 8, 1–22.

  21. Finkel, J., & Manning, C. (2009). Nested named entity recognition. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 141–150).

  22. Habash, N., Rambow, O., & Roth, R. (2009). MADA + TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt.

  23. Habash, N., & Rambow, O. (2005). Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05) (pp. 573–580).

  24. Habash, N., Rambow, O., & Roth, R. (2010). MADA + TOKAN Manual. Technical Report CCLS-10-01, Center for Computational Learning Systems (CCLS), Columbia University.

  25. Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic transliteration. Arabic Computational Morphology: Knowledge-based and Empirical Methods, 38, 15–22.

  26. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18.

  27. Hamadene, A., Shaheen, M., & Badawy, O. (2011). ARQA: An intelligent Arabic question answering system. In Proceedings of Arabic language technology international conference (ALTIC 2011).

  28. Küçük, D., & Yazıcı, A. (2012). A hybrid named entity recognizer for Turkish. Expert Systems with Applications, 39, 2733–2742.

  29. Maloney, J., & Niv, M. (1998). TAGARAB: A fast, accurate Arabic name recognizer using high-precision morphological analysis. In Proceedings of the workshop on computational approaches to Semitic languages (Semitic 1998) (pp. 8–15).

  30. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.

  31. Mayfield, J., McNamee, P., & Piatko, C. (2003). Named entity recognition using hundreds of thousands of features. In Proceedings of the 7th conference on natural language learning at HLT-NAACL 2003 (CONLL 2003) (pp. 184–187).

  32. Maynard, D., Tablan, V., Ursu, C., Cunningham, H., & Wilks, Y. (2001). Named entity recognition from diverse text types. In Proceedings of recent advances in natural language processing 2001 conference.

  33. Mesfar, S. (2007). Named entity recognition for Arabic using syntactic grammars. In Proceedings of the 12th international conference on application of natural language to information systems (pp. 305–316). Berlin: Springer.

  34. Mitchell, A., Strassel, S., Huang, S., & Zakhary, R. (2005). ACE 2004 Multilingual Training Corpus, LDC2005T09: Linguistic Data Consortium.

  35. Mitchell, A., Strassel, S., Przybocki, M., Davis, J., Doddington, G., Grishman, R. et al. (2003). TIDES extraction (ACE) 2003 Multilingual Training Data, LDC2004T09: Linguistic Data Consortium.

  36. Mohammed, N. F., & Omar, N. (2012). Arabic named entity recognition using artificial neural network. Journal of Computer Science, 8, 1285–1293.

  37. Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3–26.

  38. Oudah, M. M., & Shaalan, K. (2012). A pipeline Arabic named entity recognition using a hybrid approach. In Proceedings of the 24th international conference on computational linguistics (COLING 2012) (pp. 2159–2176).

  39. Oudah, M., & Shaalan, K. (2013). Person name recognition using the hybrid approach. In Lecture Notes in Computer Science, Natural language processing and information systems (Vol. 7934, pp. 237–248). Springer, Berlin.

  40. Petasis, G., Vichot, F., Wolinski, F., Paliouras, G., Karkaletsis, V., & Spyropoulos, C. D. (2001). Using machine learning to maintain rule-based named-entity recognition and classification systems. In Proceeding conference of association for computational linguistics (pp. 426–433).

  41. Riaz, K. (2010). Rule-based named entity recognition in Urdu. In Proceedings of the 2010 named entities workshop (ACL 2010) (pp. 126–135).

  42. Salloum, W., & Habash, N. (2012). Elissa: A dialectal to standard Arabic machine translation system. In Proceedings of the international conference on computational linguistics (pp. 385–392).

  43. Seon, C., Ko, Y., Kim, J., & Seo, J. (2001). Named entity recognition using machine learning methods and pattern-selection rules. In Proceedings of the 6th natural language processing Pacific Rim symposium (pp. 229–236).

  44. Shaalan, K. (2010). Rule-based approach in Arabic natural language processing. The International Journal on Information and Communication Technologies (IJICT), 3(3), 11–19.

  45. Shaalan, K. (2014). A survey of Arabic named entity recognition and classification. Computational Linguistics, 40(2), 469–510.

  46. Shaalan, K., & Oudah, M. (2014). A hybrid approach to Arabic named entity recognition. Journal of Information Science (JIS), 40, 67–87.

  47. Shaalan, K., Rafea, A., Abdel Monem, A., & Baraka, H. (2004). Machine translation of English noun phrases into Arabic. The International Journal of Computer Processing of Oriental Languages (IJCPOL), 17(2), 121–134.

  48. Shaalan, K., & Raza, H. (2007). Person name entity recognition for Arabic. In Proceedings of the 5th workshop on important unresolved matters (pp. 17–24).

  49. Shaalan, K., & Raza, H. (2008). Arabic named entity recognition from diverse text types. In Proceedings of the 6th international conference on natural language processing (GoTAL 2008) (pp. 440–451). Berlin: Springer.

  50. Shaalan, K., & Raza, H. (2009). NERA: Named entity recognition for Arabic. Journal of the American Society for Information Science and Technology, 60(8), 1652–1663.

  51. Srihari, R., Niu, C., & Li, W. (2000). A hybrid approach for named entity and sub-type tagging. In Proceedings of the 6th conference on applied natural language processing (ANLC 2000) (pp. 247–254).

  52. Toral, A., Noguera, E., Llopis, F., & Munoz, R. (2005). Improving question answering using named entity recognition. In Proceedings of the 10th international conference on Natural Language Processing and Information Systems (NLDB’05) (pp. 181–191). Berlin: Springer.

  53. Tsai, T., Wu, S., Lee, C., Shih, C., & Hsu, W. (2004). Mencius: A Chinese named entity recognizer using the maximum entropy-based hybrid model. Computational Linguistics and Chinese Language Processing, 9, 65–82.

  54. Zaghouani, W. (2012). RENAR: A rule-based Arabic named entity recognition system. ACM Transactions on Asian Language Information Processing, 11, 1–13.

  55. Zhou, G., & Su, J. (2002). Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL) (pp. 473–480).

  56. Zirikly, A., & Diab, M. (2015). Named entity recognition for Arabic social media. In Proceedings of NAACL-HLT 2015 (pp. 176–185).

Acknowledgements

This research was funded by the British University in Dubai (Grant No. INF004-Using machine learning to improve Arabic named entity recognition).

Author information

Corresponding author

Correspondence to Mai Oudah.

About this article

Cite this article

Oudah, M., Shaalan, K. Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition. Lang Resources & Evaluation 51, 351–378 (2017). https://doi.org/10.1007/s10579-016-9376-1

Keywords

  • Named entity recognition
  • Information extraction
  • Rule-based approach
  • Machine learning
  • Hybrid approach
  • Natural language processing