Using Wiktionary to Improve Lexical Disambiguation in Multiple Languages

  • Kiem-Hieu Nguyen
  • Cheol-Young Ock
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7181)


This paper proposes using linguistic knowledge from Wiktionary to improve lexical disambiguation in multiple languages. It focuses on part-of-speech tagging in three languages with differing characteristics: English, Vietnamese, and Korean. Dictionaries and subsumption networks are first extracted automatically from Wiktionary. These linguistic resources are then used to enrich the feature set of the training examples. A first-order discriminative model is learned from the training data using Hidden Markov Support Vector Machines. The proposed method is competitive with related contemporary work in the three languages. In English, our tagger achieves 96.37% token accuracy on the Brown corpus, an error reduction of 2.74% over the baseline.
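As a rough illustration of what enriching the feature set with dictionary knowledge can look like, the sketch below adds the part-of-speech tags that a lexicon lists for a word to that token's feature map. Everything in it is an assumption made for illustration: the toy lexicon, the function name `token_features`, and the feature templates are hypothetical and are not taken from the paper, and the subsumption networks and the sequence learner itself are not shown.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon standing in for a dictionary extracted from
# Wiktionary: each lowercased word maps to the tags the dictionary lists.
TOY_LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "can": {"NOUN", "VERB"},
    "rusts": {"VERB"},
}


def token_features(
    tokens: List[str], i: int, lexicon: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Build a sparse feature map for tokens[i], enriched with lexicon tags."""
    word = tokens[i]
    lower = word.lower()

    # Baseline surface features (illustrative templates, not the paper's).
    feats: Dict[str, float] = {
        "bias": 1.0,
        f"word={lower}": 1.0,
        f"suffix3={lower[-3:]}": 1.0,
        "is_capitalized": 1.0 if word[:1].isupper() else 0.0,
    }

    # Previous-word context; "<s>" marks the start of the sentence.
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    feats[f"prev_word={prev}"] = 1.0

    # Dictionary enrichment: one indicator per tag the lexicon allows,
    # or a single "unknown" indicator when the word is not listed.
    tags = lexicon.get(lower)
    if tags is None:
        feats["dict_unknown"] = 1.0
    else:
        for tag in sorted(tags):
            feats[f"dict_tag={tag}"] = 1.0
    return feats


if __name__ == "__main__":
    sentence = ["The", "can", "rusts"]
    for idx in range(len(sentence)):
        print(sentence[idx], token_features(sentence, idx, TOY_LEXICON))
```

In a full tagger, feature maps like these would be passed to a sequence learner such as the one named in the abstract. The dictionary indicators give the learner a prior over the plausible tags for ambiguous or rare words.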


Keywords: Wiktionary, collaborative dictionary, lexical disambiguation, part-of-speech tagging, supervised learning, discriminative model





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Kiem-Hieu Nguyen (1)
  • Cheol-Young Ock (1)
  1. School of Electrical Engineering, University of Ulsan, Ulsan, Korea
