Semantic textual similarity between sentences using bilingual word semantics

Abstract

Semantic textual similarity between sentences is indispensable for many information retrieval tasks. Traditional lexical similarity measures cannot compute similarity beyond a trivial level; they capture only textual similarity, not semantic similarity. In this paper, we propose a method for semantic textual similarity that leverages bilingual word-level semantics to compute the semantic similarity between sentences. To capture word-level semantics, we employ distributed representations of words in two different languages. A similarity function based on the concept-to-concept relationships corresponding to the words is also used for the same purpose. Multiple new semantic similarity measures are introduced based on word-embedding models trained on two different corpora in two different languages. In addition, another new semantic similarity measure is introduced using word sense comparison. The similarity score between the sentences is then computed by applying a linear ranking approach to all proposed measures, with their importance scores estimated by a supervised feature selection technique. We conducted experiments on the SemEval Semantic Textual Similarity (STS-2017) test collections. The experimental results demonstrate that our method is effective for measuring semantic textual similarity and outperforms some known related methods.
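
The general idea of combining embedding-based measures from two languages into a single score can be illustrated with a minimal sketch. This is not the authors' implementation: the toy two-dimensional embeddings, the averaged-vector sentence representation, the hand-supplied translation, the choice of Spanish as the second language, and the weights are all assumptions made for illustration. In the paper, the importance scores are estimated by a supervised feature selection technique rather than set by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens found in the embedding table."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def embedding_similarity(tokens_a, tokens_b, embeddings, dim):
    """One similarity measure: cosine of the two averaged sentence vectors."""
    return cosine(sentence_vector(tokens_a, embeddings, dim),
                  sentence_vector(tokens_b, embeddings, dim))

def combined_score(features, weights):
    """Weighted linear combination of similarity measures, normalised by the weight sum."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * f for w, f in zip(weights, features)) / total

# Hypothetical toy embeddings; real models would be trained on large corpora.
EN_EMB = {"cat": [1.0, 0.0], "feline": [0.9, 0.1], "car": [0.0, 1.0]}
ES_EMB = {"gato": [1.0, 0.0], "felino": [0.8, 0.2], "coche": [0.0, 1.0]}

# Sentence pair in the first language, plus an assumed translation of each sentence.
en_a, en_b = ["cat"], ["feline"]
es_a, es_b = ["gato"], ["felino"]

features = [
    embedding_similarity(en_a, en_b, EN_EMB, dim=2),
    embedding_similarity(es_a, es_b, ES_EMB, dim=2),
]
weights = [0.6, 0.4]  # hypothetical importance scores, not values from the paper
score = combined_score(features, weights)
print(round(score, 3))
```

Each measure contributes in proportion to its weight, so a measure judged unimportant by the feature selection step can be given a weight near zero without changing the rest of the pipeline.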

Notes

  1. SemEval: https://en.wikipedia.org/wiki/SemEval.

  2. Indri’s Stopwords: http://www.lemurproject.org/stopwords/stoplist.dft.

  3. STS2017: http://alt.qcri.org/semeval2017/task1/.

  4. https://en.wikipedia.org/wiki/Pearson_correlation_coefficient.

Author information

Correspondence to Md. Shajalal.

About this article

Cite this article

Shajalal, M., Aono, M. Semantic textual similarity between sentences using bilingual word semantics. Prog Artif Intell 8, 263–272 (2019). https://doi.org/10.1007/s13748-019-00180-4

  • DOI: https://doi.org/10.1007/s13748-019-00180-4

Keywords

  • Semantic similarity
  • Word semantics
  • Word-embedding
  • Textual similarity
  • Bilingual semantics