Abstract
This paper proposes using linguistic knowledge from Wiktionary to improve lexical disambiguation in multiple languages, focusing on part-of-speech tagging in three languages with differing characteristics: English, Vietnamese, and Korean. Dictionaries and subsumption networks are first extracted automatically from Wiktionary. These linguistic resources are then used to enrich the feature set of the training examples. A first-order discriminative model is learned from the training data using Hidden Markov Support Vector Machines. The proposed method is competitive with contemporary related work in all three languages. In English, our tagger achieves 96.37% token accuracy on the Brown corpus, with an error reduction of 2.74% over the baseline.
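The feature-enrichment step described in the abstract can be pictured with a small sketch. It is not the authors' implementation: the toy lexicon, the feature names, and the helper functions `token_features` and `sentence_features` are all hypothetical, and a real system would load entries extracted from Wiktionary and pass the resulting features to a sequence learner.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: lowercase word form -> POS labels listed for it.
# The entries are made up for illustration; the paper extracts such a
# dictionary from Wiktionary.
TOY_LEXICON: Dict[str, Set[str]] = {
    "book": {"NOUN", "VERB"},
    "the": {"DET"},
    "flight": {"NOUN"},
}


def token_features(tokens: List[str], i: int,
                   lexicon: Dict[str, Set[str]]) -> List[str]:
    """Return surface features for tokens[i], enriched with dictionary features."""
    word = tokens[i]
    lower = word.lower()
    # Baseline surface features (an assumed, minimal set).
    feats = [
        f"word={lower}",
        f"suffix3={lower[-3:]}",
        f"is_cap={word[:1].isupper()}",
    ]
    # Dictionary features: one feature per POS label the lexicon allows.
    tags = lexicon.get(lower)
    if tags is None:
        feats.append("dict=OOV")  # word form not found in the lexicon
    else:
        feats.extend(f"dict_pos={t}" for t in sorted(tags))
        if len(tags) == 1:
            feats.append("dict_unambiguous")
    return feats


def sentence_features(tokens: List[str],
                      lexicon: Dict[str, Set[str]]) -> List[List[str]]:
    """Build one feature list per token of a sentence."""
    return [token_features(tokens, i, lexicon) for i in range(len(tokens))]


if __name__ == "__main__":
    for tok, feats in zip(["Book", "the", "flight"],
                          sentence_features(["Book", "the", "flight"], TOY_LEXICON)):
        print(tok, feats)
```

In this sketch an ambiguous form such as "book" receives one `dict_pos` feature per listed label, so the learner still has to choose among them from context, while the dictionary narrows the candidates.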
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nguyen, KH., Ock, CY. (2012). Using Wiktionary to Improve Lexical Disambiguation in Multiple Languages. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_20
DOI: https://doi.org/10.1007/978-3-642-28604-9_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28603-2
Online ISBN: 978-3-642-28604-9
eBook Packages: Computer Science (R0)