Gradually Improving the Computation of Semantic Textual Similarity in Portuguese

  • Hugo Gonçalo Oliveira
  • Ana Oliveira Alves
  • Ricardo Rodrigues
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10423)


There is much research on Semantic Textual Similarity (STS) in English, especially since its inclusion in the SemEval evaluations. For other languages, such research is less common, mostly because benchmarks are unavailable. Recently, the ASSIN shared task targeted STS in Portuguese and released training and test collections. This paper describes an incremental approach to ASSIN, in which the computed similarity is gradually improved by exploiting different features (e.g., token overlap, semantic relations, chunks, and negation) and approaches. The best reported results, obtained with a supervised approach, would have placed second overall in ASSIN.
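
As a rough illustration of the simplest feature named in the abstract, token overlap, the following Python sketch computes a Jaccard coefficient over lowercased word tokens. The function names, the tokenisation rule, and the choice of Jaccard are assumptions made for illustration only; the abstract does not say how the paper measures overlap.

```python
import re


def tokens(sentence: str) -> set:
    """Return the set of lowercased word tokens in a sentence.

    Hypothetical helper: a plain regex tokenizer, not the paper's pipeline.
    """
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    """Jaccard coefficient between the token sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        # Two empty sentences are treated as identical (an assumption).
        return 1.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Two Portuguese sentences sharing two of four distinct tokens.
    print(jaccard_overlap("O gato dorme", "O gato come"))
```

In a supervised setting such as the one the abstract mentions, a score like this would be one feature among several given to a regressor, rather than the final similarity value.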


Keywords: Natural language processing; Semantic Textual Similarity; Semantic relations; Supervised machine learning



This work was financed by the ERDF (European Regional Development Fund) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT, Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology), within project REMINDS – UTAP-ICDT/EEI-CTP/0022/2014.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. CISUC, DEI, University of Coimbra, Coimbra, Portugal
  2. ISEC, Polytechnic Institute of Coimbra, Coimbra, Portugal
  3. ESEC, Polytechnic Institute of Coimbra, Coimbra, Portugal
