Unsupervised Approaches for Computing Word Similarity in Portuguese

  • Hugo Gonçalo Oliveira
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10423)

Abstract

This paper presents several approaches for computing word similarity in Portuguese. It is motivated by the recent availability of state-of-the-art distributional models of Portuguese words, which add to the several lexical knowledge bases (LKBs) that have been available for this language for longer. These resources were exploited to answer word similarity tests, which have also recently become available for Portuguese. We conclude that there are several valid approaches for this task, but none that outperforms all the others in every single test. For instance, distributional models seem to capture relatedness better, whereas LKBs are better suited for computing genuine similarity.
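As a rough illustration of the two families of approaches mentioned above, the sketch below contrasts a distributional score (cosine between word vectors) with a simple LKB-based score (Jaccard overlap of the words each entry is linked to). It is not the paper's method: the function names, the toy vectors, and the toy LKB entries are all invented for this example, and a real setup would load trained embeddings and an actual Portuguese LKB.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for zero vectors; treated as no similarity
    return dot / (norm_u * norm_v)


def jaccard(neighbours_a, neighbours_b):
    """Overlap between the sets of words two LKB entries are linked to."""
    a, b = set(neighbours_a), set(neighbours_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical 3-dimensional embeddings (made-up values, not from any model).
vectors = {
    "carro": [0.9, 0.1, 0.0],
    "automóvel": [0.8, 0.2, 0.1],
    "estrada": [0.5, 0.5, 0.1],
}

# Hypothetical LKB adjacency: each word maps to the words it is linked to.
lkb = {
    "carro": {"automóvel", "veículo", "viatura"},
    "automóvel": {"carro", "veículo", "viatura"},
    "estrada": {"via", "caminho"},
}

for w1, w2 in [("carro", "automóvel"), ("carro", "estrada")]:
    dist_score = cosine(vectors[w1], vectors[w2])
    lkb_score = jaccard(lkb[w1], lkb[w2])
    print(f"{w1} / {w2}: cosine={dist_score:.3f} jaccard={lkb_score:.3f}")
```

With this toy data, the vector-based score stays fairly high for the merely related pair (carro, estrada), while the LKB overlap drops to zero. That pattern is an artefact of the invented values, shown only to make the relatedness versus similarity distinction concrete.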

Keywords

Semantic similarity; Word similarity; Lexical knowledge bases; Lexical semantics; Word embeddings; Distributional semantics

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
