A Lazy Man’s Way to Part-of-Speech Tagging

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7457)

Abstract

A statistical approach to word alignment that automatically projects part-of-speech (POS) tags is presented. The approach is called the “lazy man’s way” because it improves POS assignment for a resource-poor language by exploiting its similarity to a resource-rich one. This unsupervised learning method combines the N-gram and Dice coefficient similarity functions to align English texts with Malay texts, thereby projecting the POS tags from English to Malay. It is a quick method that avoids the laborious effort of annotating the Malay dataset by hand. A case study on 25 terrorism news articles written in Malay shows that leveraging pre-existing resources from a resource-rich language (English) to supplement a resource-poor language (Malay) is feasible and avoids building new text-processing tools from scratch. The system was tested on the Malay corpus, which consists of 5413 word tokens, and achieved 86.87% precision, 72.56% recall and an F1-score of 79.07%. This shows that the “lazy man’s way”, in which a resource-poor language simply exploits the rich linguistic information available in English, increases bitext projection accuracy significantly.
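The abstract does not spell out how the two similarity functions are combined, so the following is only a minimal sketch of one common formulation: the Dice coefficient computed over character bigrams of two words. The function names and the English/Malay word pairs are illustrative assumptions, not the authors' implementation. The last function recomputes the F1-score as the harmonic mean of the reported precision and recall.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Multiset of character n-grams of a lower-cased word (hypothetical helper)."""
    w = word.lower()
    return Counter(w[i:i + n] for i in range(len(w) - n + 1))


def dice(a, b, n=2):
    """Dice coefficient 2*|X & Y| / (|X| + |Y|) over character n-gram multisets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        # Both words are shorter than n characters: no n-grams to compare.
        return 0.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative English/Malay loanword pairs (assumed examples).
    print(dice("police", "polis"))
    print(dice("bomb", "bom"))
    # Precision and recall as reported in the abstract.
    print(round(100 * f1_score(0.8687, 0.7256), 2))
```

With the reported precision of 86.87% and recall of 72.56%, the harmonic mean works out to about 79.07%, which agrees with the F1-score stated in the abstract.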

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zamin, N., Oxley, A., Abu Bakar, Z., Farhan, S.A. (2012). A Lazy Man’s Way to Part-of-Speech Tagging. In: Richards, D., Kang, B.H. (eds) Knowledge Management and Acquisition for Intelligent Systems. PKAW 2012. Lecture Notes in Computer Science, vol 7457. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32541-0_9

  • DOI: https://doi.org/10.1007/978-3-642-32541-0_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32540-3

  • Online ISBN: 978-3-642-32541-0

  • eBook Packages: Computer Science, Computer Science (R0)
