Improving Phrase Chunking by using Contextualized Word Embeddings for a Morphologically Rich Language

Ehsan, Toqeer; Khalid, Javairia; Ambreen, Saadia; Mustafa, Asad; Hussain, Sarmad

doi:10.1007/s13369-021-06343-7

Research Article-Computer Engineering and Computer Science
Published: 02 December 2021

Volume 47, pages 9781–9799 (2022)

Arabian Journal for Science and Engineering

Toqeer Ehsan¹ (ORCID: orcid.org/0000-0002-6724-6705),
Javairia Khalid²,
Saadia Ambreen²,
Asad Mustafa² &
Sarmad Hussain²

Abstract

Phrase chunking is an important task in various natural language processing (NLP) applications. This paper presents a neural phrase chunker for Urdu that uses trained contextualized word representations. This work also produces an annotated corpus, labeled with IOB (inside-outside-begin) tags. Comprehensive guidelines have been developed for four phrase types: noun phrase (NP), verb phrase (VP), post-positional phrase (PP) and prepositional phrase (PRP). The annotated text has been checked automatically for completeness and correctness, and inter-annotator agreement has been calculated on a ten percent reference corpus. A neural chunker based on long short-term memory (LSTM) networks has been developed and trained on the annotated corpus. Transfer learning has been employed to improve the chunking results: for that purpose, context-free (Word2Vec) and contextualized (ELMo) word representations have been trained. The chunker achieved an F-score of 94.9 when trained using the third layer of the ELMo embeddings.
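To make the IOB scheme and the reported F-score concrete, the following is a minimal Python sketch of how IOB tags can be decoded into phrase spans and scored at the chunk level. It is an illustration only, not the authors' code: the tag strings (`B-NP`, `I-NP`, and so on), the function names `iob_to_spans` and `chunk_f1`, the lenient handling of stray `I-` tags, and the exact-match scoring convention are all assumptions, since the abstract does not specify the evaluation procedure.

```python
from typing import List, Tuple

# A chunk is (phrase type, start index, end index exclusive).
Span = Tuple[str, int, int]


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Decode a sequence of IOB tags into labeled chunk spans.

    Assumption: an I-X tag that does not continue an open X chunk
    starts a new chunk (a lenient convention, chosen for illustration).
    """
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        begins = tag.startswith("B-")
        continues = tag.startswith("I-") and label == tag[2:]
        # Close the open chunk whenever the current tag does not extend it.
        if not continues and label is not None:
            spans.append((label, start, i))
            start, label = None, None
        # Open a new chunk on B-X, or on a stray I-X.
        if begins or (tag.startswith("I-") and not continues):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


def chunk_f1(gold: List[str], pred: List[str]) -> float:
    """Chunk-level F1: a predicted chunk counts only if its type and
    both boundaries match a gold chunk exactly."""
    gold_spans = set(iob_to_spans(gold))
    pred_spans = set(iob_to_spans(pred))
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical five-token sentence: NP (2 tokens), PP, VP, then one
# token outside any chunk.
gold_tags = ["B-NP", "I-NP", "B-PP", "B-VP", "O"]
# A prediction that wrongly splits the noun phrase into two chunks.
pred_tags = ["B-NP", "B-NP", "B-PP", "B-VP", "O"]

gold_chunks = iob_to_spans(gold_tags)
score = chunk_f1(gold_tags, pred_tags)
```

In this example the prediction recovers the PP and VP chunks but splits the noun phrase, so two of four predicted chunks and two of three gold chunks match. Scoring whole chunks rather than individual tags penalizes boundary errors of this kind, which is why chunk-level F-scores are usually lower than per-token accuracy.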

Notes

https://cle.org.pk/clestore/urduphrasechunker.htm

Acknowledgements

We are grateful to Prof. Miriam Butt, University of Konstanz, Germany, for providing valuable feedback and hardware support for this work.

Author information

Toqeer Ehsan and Javairia Khalid contributed equally to this work.

Authors and Affiliations

Department of Computer Science, University of Gujrat, Gujrat, 50700, Pakistan
Toqeer Ehsan
Center for Language Engineering (CLE), Al-Khawarizmi Institute of Computer Science (KICS), University of Engineering and Technology (UET), Lahore, 54000, Pakistan
Javairia Khalid, Saadia Ambreen, Asad Mustafa & Sarmad Hussain

Corresponding author

Correspondence to Toqeer Ehsan.

About this article

Cite this article

Ehsan, T., Khalid, J., Ambreen, S. et al. Improving Phrase Chunking by using Contextualized Word Embeddings for a Morphologically Rich Language. Arab J Sci Eng 47, 9781–9799 (2022). https://doi.org/10.1007/s13369-021-06343-7

Received: 06 September 2021
Accepted: 07 October 2021
Published: 02 December 2021
Issue Date: August 2022
DOI: https://doi.org/10.1007/s13369-021-06343-7
