A Lazy Man’s Way to Part-of-Speech Tagging

Zamin, Norshuhani; Oxley, Alan; Abu Bakar, Zainab; Farhan, Syed Ahmad

doi:10.1007/978-3-642-32541-0_9

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7457)

Abstract

A statistical approach to word alignment that automatically projects part-of-speech (POS) tags across languages is presented. The approach is called the “lazy man’s way” because it improves POS assignment for a resource-poor language by exploiting its similarity to a resource-rich one. This unsupervised learning method combines the N-gram and Dice coefficient similarity functions to align English texts with Malay texts, thereby projecting the POS tags from English to Malay. It is a quick method that avoids the laborious effort of annotating the Malay dataset by hand. A case study on 25 terrorism news articles written in Malay shows that leveraging existing resources from a resource-rich language (English) to supplement a resource-poor language (Malay) is feasible and avoids building new text-processing tools from scratch. The system was tested on the Malay corpus, which consists of 5413 word tokens, and achieved 86.87% precision, 72.56% recall and an F1-score of 79.07%. These results indicate that the “lazy man’s way”, in which a resource-poor language exploits the rich linguistic information available in English, significantly increases bitext projection accuracy.
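The abstract names the two similarity functions but does not say how they are combined. As a minimal sketch, assuming character bigrams as the N-gram unit and the standard Dice coefficient over bigram multisets, the similarity between an English word and a Malay candidate could be scored as follows. The function names and the example word pairs are illustrative assumptions, not taken from the paper. The last function only checks that the reported F1-score follows from the reported precision and recall.

```python
from collections import Counter


def char_ngrams(word: str, n: int = 2) -> Counter:
    """Return the multiset of character n-grams of a lower-cased word."""
    w = word.lower()
    return Counter(w[i:i + n] for i in range(len(w) - n + 1))


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over character n-grams.

    Assumption: n-grams are counted as multisets, so repeated n-grams
    are matched one for one.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both words are shorter than n characters: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = sum((grams_a & grams_b).values())
    return 2 * shared / total


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical English/Malay cognate pairs, chosen for illustration only.
    for english, malay in [("police", "polis"), ("bomb", "bom")]:
        print(english, malay, round(dice_similarity(english, malay), 3))

    # Consistency check on the figures reported in the abstract.
    print(round(f1_score(0.8687, 0.7256), 4))  # 0.7907
```

A score near 1 suggests the two words are likely cognates or loanwords and therefore alignment candidates. An aligner built on this idea would also need a threshold and a fallback (for example a bilingual dictionary) for word pairs with no orthographic overlap.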

Author information

Authors and Affiliations

Faculty of Science and Information Technology, Universiti Teknologi Mara, 40000, Shah Alam, Selangor, Malaysia
Norshuhani Zamin & Alan Oxley
Faculty of Computer and Mathematical Sciences, Universiti Teknologi Mara, 40000, Shah Alam, Selangor, Malaysia
Zainab Abu Bakar
Faculty of Engineering, Universiti Teknologi PETRONAS, Bandar Seri Iskandar, 31750, Tronoh, Perak, Malaysia
Syed Ahmad Farhan

Editor information

Editors and Affiliations

Department of Computing, Macquarie University, 2109, North Ryde, NSW, Australia
Deborah Richards
School of Computing and Information Systems, University of Tasmania, 7000, Hobart, Tasmania, Australia
Byeong Ho Kang

Cite this paper

Zamin, N., Oxley, A., Abu Bakar, Z., Farhan, S.A. (2012). A Lazy Man’s Way to Part-of-Speech Tagging. In: Richards, D., Kang, B.H. (eds) Knowledge Management and Acquisition for Intelligent Systems. PKAW 2012. Lecture Notes in Computer Science, vol 7457. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32541-0_9

DOI: https://doi.org/10.1007/978-3-642-32541-0_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32540-3
Online ISBN: 978-3-642-32541-0
eBook Packages: Computer Science, Computer Science (R0)
