Training State-of-the-Art Portuguese POS Taggers without Handcrafted Features

dos Santos, Cícero Nogueira; Zadrozny, Bianca

doi:10.1007/978-3-319-09761-9_8

Training State-of-the-Art Portuguese POS Taggers without Handcrafted Features

Cícero Nogueira dos Santos²⁵ &
Bianca Zadrozny²⁵

Conference paper

661 Accesses
3 Citations

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8775))

Abstract

Part-of-speech (POS) tagging for morphologically rich languages normally requires the use of handcrafted features that encapsulate clues about the language’s morphology. In this work, we tackle Portuguese POS tagging using a deep neural network that employs a convolutional layer to learn character-level representation of words. We apply the network to three different corpora: the original Mac-Morpho corpus; a revised version of the Mac-Morpho corpus; and the Tycho Brahe corpus. Using the proposed approach, while avoiding the use of any handcrafted feature, we produce state-of-the-art POS taggers for the three corpora: 97.47% accuracy on the Mac-Morpho corpus; 97.31% accuracy on the revised Mac-Morpho corpus; and 97.17% accuracy on the Tycho Brahe corpus. These results represent an error reduction of 12.2%, 23.6% and 15.8%, respectively, on the best previous known result for each corpus.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Branco, A., Silva, J.: Evaluating solutions for the rapid development of state-of-the-art pos taggers for portuguese. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (2004)
Google Scholar
Nogueira dos Santos, C., Milidiú, R.L., Rentería, R.P.: Portuguese part-of-speech tagging using entropy guided transformation learning. In: Teixeira, A., de Lima, V.L.S., de Oliveira, L.C., Quaresma, P. (eds.) PROPOR 2008. LNCS (LNAI), vol. 5190, pp. 143–152. Springer, Heidelberg (2008)
Chapter Google Scholar
Milidiú, R.L., dos Santos, C.N., Duarte, J.C.: Portuguese corpus-based learning using etl. J. Braz. Comp. Soc. 14(4), 17–27 (2008)
Article Google Scholar
Fernandes, E.L.R.: Entropy Guided Feature Generation for Structure Learning. PhD thesis. Pontifícia Universidade Católica do Rio de Janeiro (2012)
Google Scholar
Collobert, R.: Deep learning for efficient discriminative parsing. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 224–232 (2011)
Google Scholar
Fonseca, E.R., Ao Luís, G., Rosa, J.: Mac-morpho revisited: Towards robust part-of-speech tagging. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology, pp. 98–107 (2013)
Google Scholar
Luong, M.T., Socher, R., Manning, C.D.: Better word representations with recursive neural networks for morphology. In: Proceedings of the Conference on Computational Natural Language Learning, Sofia, Bulgaria (2013)
Google Scholar
Chrupala, G.: Text segmentation with character-level text embeddings. In: Proceedings of the Workshop on Deep Learning for Audio, Speech and Language Processing, ICML (2013)
Google Scholar
dos Santos, C.N., Zadrozny, B.: Learning character-level representations for part-of-speech tagging. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China. JMLR: W&CP, vol. 32 (2014)
Google Scholar
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, 2493–2537 (2011)
MATH Google Scholar
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing 37(3), 328–339 (1989)
Article Google Scholar
Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE, 2278–2324 (1998)
Google Scholar
Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967)
Article MATH Google Scholar
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: A CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference, SciPy (2010)
Google Scholar
Alexandrescu, A., Kirchhoff, K.: Factored neural language models. In: Proceedings of the Human Language Technology Conference of the NAACL, New York City, USA, pp. 1–4 (June 2006)
Google Scholar
Lazaridou, A., Marelli, M., Zamparelli, R., Baroni, M.: Compositional–ly derived representations of morphologically complex words in distributional semantics. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1517–1526 (2013)
Google Scholar
Zheng, X., Chen, H., Xu, T.: Deep learning for chinese word segmentation and pos tagging. In: Proceedings of the Conference on Empirical Methods in NLP, pp. 647–657 (2013)
Google Scholar
Socher, R., Bauer, J., Manning, C.D., Ng, A.Y.: Parsing with compositional vector grammars. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2013)
Google Scholar
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of Workshop at International Conference on Learning Representations (2013)
Google Scholar
Aluísio, S.M., Pelizzoni, J.M., Marchi, A.R., de Oliveira, L., Manenti, R., Marquiafável, V.: An account of the challenge of tagging a reference corpus for brazilian portuguese. In: Mamede, N.J., Baptista, J., Trancoso, I., Nunes, M.d.G.V. (eds.) PROPOR 2003. LNCS, vol. 2721, pp. 110–117. Springer, Heidelberg (2003)
Chapter Google Scholar
Namiuti, C.: O corpus anotado do português histórico: um avanço para as pesquisas em lingüística histórica do português. Revista Virtual de Estudos da Linguagem 2(3) (2004)
Google Scholar

Download references

Author information

Authors and Affiliations

IBM Research - Brazil, Av. Pasteur, 138/146 Botafogo, Rio de Janeiro, 22296-903, Brazil
Cícero Nogueira dos Santos & Bianca Zadrozny

Authors

Cícero Nogueira dos Santos
View author publications
You can also search for this author in PubMed Google Scholar
Bianca Zadrozny
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

FCHS, Universidade do Algarve, Campus de Gambelas,, 8005-139, Faro, Portugal
Jorge Baptista
INESC-ID Lisboa, Lisbon, Portugal
Nuno Mamede
IT-University of Coimbra, Coimbra, Portugal
Sara Candeias
USP-EACH, São Paulo-SP, Brazil
Ivandré Paraboni
USP-ICMC, Universidade de São Paulo, São Carlos, SP, Brazil
Thiago A. S. Pardo
SCC-ICMC, University of São Paulo, São Carlos, SP, Brazil
Maria das Graças Volpe Nunes

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

dos Santos, C.N., Zadrozny, B. (2014). Training State-of-the-Art Portuguese POS Taggers without Handcrafted Features. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.d.G. (eds) Computational Processing of the Portuguese Language. PROPOR 2014. Lecture Notes in Computer Science(), vol 8775. Springer, Cham. https://doi.org/10.1007/978-3-319-09761-9_8

Download citation

DOI: https://doi.org/10.1007/978-3-319-09761-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09760-2
Online ISBN: 978-3-319-09761-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics