
Improving Phrase Chunking by using Contextualized Word Embeddings for a Morphologically Rich Language

  • Research Article-Computer Engineering and Computer Science
Arabian Journal for Science and Engineering

Abstract

Phrase chunking is an important task in many natural language processing (NLP) applications. This paper presents a neural phrase chunker for Urdu, built by training contextualized word representations. The work also produces an annotated corpus, labeled with IOB (inside-outside-begin) tags. Comprehensive annotation guidelines have been developed for four phrase types: noun phrase (NP), verb phrase (VP), post-positional phrase (PP) and prepositional phrase (PRP). The annotated text has been checked automatically for completeness and correctness, and inter-annotator agreement has been calculated for a ten percent reference corpus. A neural chunker based on long short-term memory (LSTM) networks has been developed and trained on the annotated corpus. Transfer learning has been employed to improve the chunking results; for that purpose, context-free (Word2Vec) and contextualized (ELMo) word representations have been trained. The chunker achieved an F-score of 94.9 when trained using the third layer of the ELMo embeddings.
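To make the IOB scheme and the F-score concrete, the following is a minimal Python sketch rather than the authors' code. It assumes tags of the form B-NP, I-NP and O, and it assumes that the F-score is computed over exactly matching chunks, as in the CoNLL-2000 convention; the abstract does not state how the reported score was calculated. The function names `iob_to_spans` and `chunk_f_score` and the example tag sequences are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# A chunk is (phrase type, start index, end index exclusive).
Span = Tuple[str, int, int]


def iob_to_spans(tags: List[str]) -> Set[Span]:
    """Collect chunk spans from IOB tags such as B-NP, I-NP and O."""
    spans: Set[Span] = set()
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last chunk
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and kind == label
        if label is not None and not continues:
            spans.add((label, start, i))
            label = None
        if prefix == "B" or (prefix == "I" and label is None):
            # Assumption: a stray I- tag with no open chunk starts a new chunk.
            label, start = kind, i
    return spans


def chunk_f_score(gold: List[str], predicted: List[str]) -> float:
    """Chunk-level F-score: a chunk counts only if type and boundaries match."""
    gold_spans = iob_to_spans(gold)
    pred_spans = iob_to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical six-token sentence: NP (2 tokens), PP (1), VP (2), then O.
gold = ["B-NP", "I-NP", "B-PP", "B-VP", "I-VP", "O"]
# The prediction cuts the verb phrase short, so only NP and PP match.
predicted = ["B-NP", "I-NP", "B-PP", "B-VP", "O", "O"]
score = chunk_f_score(gold, predicted)
```

In this example two of the three gold chunks are recovered exactly, so precision and recall are both 2/3. Scoring whole chunks rather than individual tags penalizes boundary errors, such as the truncated verb phrase, even when most token-level tags are correct.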

Notes

  1. https://cle.org.pk/clestore/urduphrasechunker.htm

Acknowledgements

We are grateful to Prof. Miriam Butt, University of Konstanz, Germany, for providing valuable feedback and hardware support for this work.

Author information

Corresponding author

Correspondence to Toqeer Ehsan.

About this article

Cite this article

Ehsan, T., Khalid, J., Ambreen, S. et al. Improving Phrase Chunking by using Contextualized Word Embeddings for a Morphologically Rich Language. Arab J Sci Eng 47, 9781–9799 (2022). https://doi.org/10.1007/s13369-021-06343-7

  • DOI: https://doi.org/10.1007/s13369-021-06343-7
