Training State-of-the-Art Portuguese POS Taggers without Handcrafted Features

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8775)

Abstract

Part-of-speech (POS) tagging for morphologically rich languages normally requires handcrafted features that encapsulate clues about the language’s morphology. In this work, we tackle Portuguese POS tagging using a deep neural network that employs a convolutional layer to learn character-level representations of words. We apply the network to three different corpora: the original Mac-Morpho corpus, a revised version of the Mac-Morpho corpus, and the Tycho Brahe corpus. Using the proposed approach, and without any handcrafted features, we produce state-of-the-art POS taggers for the three corpora: 97.47% accuracy on the Mac-Morpho corpus, 97.31% accuracy on the revised Mac-Morpho corpus, and 97.17% accuracy on the Tycho Brahe corpus. These results represent error reductions of 12.2%, 23.6% and 15.8%, respectively, relative to the best previously known result for each corpus.
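
The error-reduction figures in the abstract are relative to the previous best error rate, not absolute accuracy gains. The sketch below illustrates that arithmetic. It assumes the usual definition (error = 100 − accuracy, reduction = share of the baseline's errors removed); the function names are hypothetical and the abstract itself does not state the baseline accuracies, so the values it derives are only approximations implied by the rounded figures above.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Percentage of the baseline's errors removed by the new tagger.

    Both arguments are accuracies in percent (e.g. 97.47).
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


def implied_baseline_accuracy(new_acc: float, reduction_pct: float) -> float:
    """Invert the formula: baseline accuracy implied by a reported
    accuracy and a reported relative error reduction (both in percent)."""
    new_err = 100.0 - new_acc
    baseline_err = new_err / (1.0 - reduction_pct / 100.0)
    return 100.0 - baseline_err


# (accuracy %, error reduction %) as reported in the abstract.
REPORTED = {
    "Mac-Morpho": (97.47, 12.2),
    "Mac-Morpho (revised)": (97.31, 23.6),
    "Tycho Brahe": (97.17, 15.8),
}

if __name__ == "__main__":
    for corpus, (acc, reduction) in REPORTED.items():
        baseline = implied_baseline_accuracy(acc, reduction)
        print(f"{corpus}: implied previous best accuracy ~ {baseline:.2f}%")
```

Under these assumptions the implied previous best accuracies come out at roughly 97.12%, 96.48% and 96.64%. Because the abstract's numbers are rounded, these should be read as estimates rather than the figures reported in the earlier work.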




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

dos Santos, C.N., Zadrozny, B. (2014). Training State-of-the-Art Portuguese POS Taggers without Handcrafted Features. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.d.G. (eds) Computational Processing of the Portuguese Language. PROPOR 2014. Lecture Notes in Computer Science, vol 8775. Springer, Cham. https://doi.org/10.1007/978-3-319-09761-9_8

  • DOI: https://doi.org/10.1007/978-3-319-09761-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09760-2

  • Online ISBN: 978-3-319-09761-9

  • eBook Packages: Computer Science, Computer Science (R0)
