Evaluation of Similarity Measures in a Benchmark for Spanish Paraphrasing Detection

Gómez-Adorno, Helena; Bel-Enguix, Gemma; Sierra, Gerardo; Torres-Moreno, Juan-Manuel; Martinez, Renata; Serrano, Pedro

doi:10.1007/978-3-030-60887-3_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12469)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

In this paper, we present a similarity-based approach to paraphrase detection in Spanish. We evaluate various models for computing semantic similarity using a gold-standard paraphrase corpus. The corpus contains one original document, paraphrased documents at two levels (low and high), and reference documents on the same topic or with the same vocabulary. It makes it possible to assess the similarity between a pair of texts or between individual sentences. We found that some similarity metrics show a larger difference than others when comparing paraphrased sentences. Finally, we obtained a threshold for each similarity metric to serve as a classification boundary for deciding whether two sentences are paraphrases.
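As an illustration of the thresholding idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes token-level Jaccard similarity as a stand-in metric (the abstract does not name the measures evaluated), picks the threshold that maximizes accuracy on labeled pairs, and uses four invented Spanish sentence pairs; the function names `jaccard` and `best_threshold` are hypothetical.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def best_threshold(scores: List[float], labels: List[bool]) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the rule 'score >= threshold => paraphrase'.

    Candidate thresholds are the observed scores; the first one reaching
    the highest accuracy is kept.
    """
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(scores)):
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Invented toy pairs (not from the paper's corpus): (text A, text B, is_paraphrase)
pairs = [
    ("el gato duerme en el sofá", "el gato duerme sobre el sofá", True),
    ("la casa es grande y blanca", "la casa blanca es grande", True),
    ("el gato duerme en el sofá", "mañana lloverá en la ciudad", False),
    ("la casa es grande y blanca", "el perro corre por el parque", False),
]

scores = [jaccard(a, b) for a, b, _ in pairs]
labels = [y for _, _, y in pairs]
threshold, accuracy = best_threshold(scores, labels)
print(f"threshold={threshold:.3f} accuracy={accuracy:.2f}")
```

Any other similarity function that returns a score per sentence pair could be substituted for `jaccard`; the threshold search itself does not depend on the metric.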

Acknowledgments

This work has been partially supported by PAPIIT projects IA401219, TA100520, AG400119 and CONACYT project A1-S-27780.

Author information

Authors and Affiliations

Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Mexico City, Mexico
Helena Gómez-Adorno
Instituto de Ingeniería, Universidad Nacional Autónoma de México, Mexico City, Mexico
Gemma Bel-Enguix & Gerardo Sierra
LIA-Université d’Avignon, Avignon, France
Juan-Manuel Torres-Moreno
Polytechnique Montréal, Montreal, Canada
Juan-Manuel Torres-Moreno
Facultad de Ciencias, Universidad Nacional Autónoma de México, Mexico City, Mexico
Renata Martinez & Pedro Serrano

Corresponding author

Correspondence to Helena Gómez-Adorno.

Editor information

Editors and Affiliations

Facultad de Ingeniería, Universidad Panamericana, Mexico City, Mexico
Lourdes Martínez-Villaseñor
Universidad Autónoma Metropolitana, Mexico City, Mexico
Oscar Herrera-Alcántara
Facultad de Ingeniería, Universidad Panamericana, Mexico City, Mexico
Hiram Ponce
Universidad Autónoma del Estado de Hidalgo, Hidalgo, Mexico
Félix A. Castro-Espinoza

About this paper

Cite this paper

Gómez-Adorno, H., Bel-Enguix, G., Sierra, G., Torres-Moreno, JM., Martinez, R., Serrano, P. (2020). Evaluation of Similarity Measures in a Benchmark for Spanish Paraphrasing Detection. In: Martínez-Villaseñor, L., Herrera-Alcántara, O., Ponce, H., Castro-Espinoza, F.A. (eds) Advances in Computational Intelligence. MICAI 2020. Lecture Notes in Computer Science, vol 12469. Springer, Cham. https://doi.org/10.1007/978-3-030-60887-3_19

DOI: https://doi.org/10.1007/978-3-030-60887-3_19
Published: 07 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60886-6
Online ISBN: 978-3-030-60887-3
eBook Packages: Computer Science, Computer Science (R0)
