Abstract
Phrase chunking is an important task in many natural language processing (NLP) applications. This paper presents a neural phrase chunker for Urdu that uses trained contextualized word representations. The work also produces an annotated corpus, labeled with the IOB (inside-outside-begin) scheme. Comprehensive guidelines have been developed for four phrase types: noun phrase (NP), verb phrase (VP), post-positional phrase (PP) and prepositional phrase (PRP). The annotated text has been checked automatically for completeness and correctness, and inter-annotator agreement has been calculated on a ten-percent reference corpus. A neural chunker based on long short-term memory (LSTM) networks has been developed and trained on the annotated corpus. Transfer learning has been employed to improve the chunking results; for that purpose, context-free (Word2Vec) and contextualized (ELMo) word representations have been trained. The chunker achieved an F-score of 94.9 when trained using the third layer of ELMo embeddings.
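To make the IOB labeling mentioned above concrete, the following is a minimal sketch of how a sequence of IOB tags can be decoded into phrase spans. It is not the authors' code: the tag strings (`B-NP`, `I-NP`, `O`), the function name `iob_to_chunks`, and the example tag sequence are assumptions chosen for illustration, following the common CoNLL-style convention.

```python
from typing import List, Tuple


def iob_to_chunks(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode IOB tags into (label, start, end) spans; `end` is exclusive.

    Assumed tag format (hypothetical, CoNLL-style): "B-NP", "I-NP", "O".
    """
    chunks: List[Tuple[str, int, int]] = []
    label, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        # An open chunk continues only on an I- tag carrying the same label.
        continues = prefix == "I" and kind == label
        if label is not None and not continues:
            chunks.append((label, start, i))
            label, start = None, None
        # B- always opens a chunk; a stray I- with no open chunk is
        # treated leniently as the start of a new chunk.
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = kind, i
    if label is not None:
        chunks.append((label, start, len(tags)))
    return chunks


# Illustrative tag sequence only (not taken from the annotated corpus).
example_tags = ["B-NP", "I-NP", "B-PP", "B-NP", "O", "B-VP", "I-VP"]
example_chunks = iob_to_chunks(example_tags)
```

Chunk-level scores such as the F-score are typically computed by comparing spans of this kind between the predicted and the reference tag sequences.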
Acknowledgements
We are grateful to Prof. Miriam Butt, University of Konstanz, Germany, for providing valuable feedback and hardware support for this work.
Cite this article
Ehsan, T., Khalid, J., Ambreen, S. et al. Improving Phrase Chunking by using Contextualized Word Embeddings for a Morphologically Rich Language. Arab J Sci Eng 47, 9781–9799 (2022). https://doi.org/10.1007/s13369-021-06343-7
DOI: https://doi.org/10.1007/s13369-021-06343-7