A Lazy Man’s Way to Part-of-Speech Tagging

  • Norshuhani Zamin
  • Alan Oxley
  • Zainab Abu Bakar
  • Syed Ahmad Farhan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7457)


A statistical approach to word alignment that automatically projects part-of-speech (POS) tags is presented. The approach is referred to as the “lazy man’s way” because it improves POS assignment for a resource-poor language by exploiting its similarity to a resource-rich one. This unsupervised learning method combines the N-gram and Dice Coefficient similarity functions to align English texts with Malay texts, thereby projecting the POS tags from English to Malay. It is a quick method that avoids the laborious effort of manually annotating the Malay dataset. A case study, an experiment on 25 terrorism news articles written in Malay, shows that leveraging pre-existing resources from a resource-rich language (English) to supplement a resource-poor language (Malay) is feasible and avoids building new text-processing tools from scratch. The system was tested on the Malay corpus, which consists of 5413 word tokens, and achieved 86.87% precision, 72.56% recall and an F1-score of 79.07%. This shows that the “lazy man’s way”, in which a resource-poor language simply exploits the rich linguistic information available in English, significantly increases bitext projection accuracy.
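
The abstract does not say how the N-gram and Dice Coefficient functions are combined, so the following is only a minimal sketch under a common assumption: the Dice coefficient is computed over character bigrams of an English word and a Malay word. The function names (`char_ngrams`, `dice_similarity`, `f1_score`) and the example word pair are illustrative and are not taken from the paper. The last function restates the standard F1 definition, which is consistent with the reported precision and recall.

```python
from collections import Counter


def char_ngrams(word: str, n: int = 2) -> Counter:
    """Return the multiset of character n-grams of a lowercased word."""
    w = word.lower()
    return Counter(w[i:i + n] for i in range(len(w) - n + 1))


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over character n-gram multisets.

    Assumption: character-level n-grams; the paper may define them differently.
    """
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both words are shorter than n characters, so no n-grams exist.
        return 0.0
    shared = sum((x & y).values())  # multiset intersection of n-grams
    return 2 * shared / total


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # "police" (English) vs "polis" (Malay): shared bigrams are po, ol, li.
    print(round(dice_similarity("police", "polis"), 3))
    # Recompute F1 from the precision and recall reported in the abstract.
    print(round(f1_score(86.87, 72.56), 2))
```

A high score for a pair such as "police" and "polis" is the kind of surface similarity an aligner could use to link the two tokens and copy the English tag across; how the paper thresholds or weights such scores is not stated in the abstract.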


Keywords: Natural Language Processing, Proper Noun, Common Noun, Parallel Corpus, Word Alignment
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Norshuhani Zamin (1)
  • Alan Oxley (1)
  • Zainab Abu Bakar (2)
  • Syed Ahmad Farhan (3)

  1. Faculty of Science and Information Technology, Universiti Teknologi Mara, Shah Alam, Malaysia
  2. Faculty of Computer and Mathematical Sciences, Universiti Teknologi Mara, Shah Alam, Malaysia
  3. Faculty of Engineering, Universiti Teknologi PETRONAS, Tronoh, Malaysia
