Abstract
In this paper, we present a similarity-based approach towards paraphrase detection in Spanish. We evaluate various models for semantic similarity computation using a gold-standard paraphrase corpus. It contains one original document and paraphrased documents on different levels (low and high), and reference documents on the same topic or same vocabulary. It allows to assess the similarity between a pair of texts or individual sentences. We found that some of the similarity metrics have a larger difference when comparing paraphrased sentences than others. Finally, we obtained a threshold for each of the similarity metrics with the aim of determining a classification boundary to decide if two sentences are paraphrased.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Bendersky, M., Croft, W.B.: Finding text reuse on the web. In: Proceedings of the Second ACM International Conference on Web Search and Data Mining, pp. 262–271 (2009)
Castro, B., Sierra, G., Torres-Moreno, J.M., Da Cunha, I.: El discurso y la semántica como recursos para la detección de similitud textual. In: Proceedings of the III RST Meeting (8th Brazilian Symposium in Information and Human Language Technology, STIL 2011). Brazilian Computer Society, Cuiabá (2011)
Clough, P., Gaizauskas, R., Piao, S.S., Wilks, Y.: METER: measuring text reuse. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 152–159. ACL (2002)
Das, D., Smith, N.A.: Paraphrase identification as probabilistic quasi-synchronous recognition. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP: Volume 1, pp. 468–476. ACL (2009)
Dey, K., Shrivastava, R., Kaushik, S.: A paraphrase and semantic similarity detection system for user generated short-text content on microblogs. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2880–2890 (2016)
Dolan, W., Quirk, C., Brockett, C., Dolan, B.: Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING) (2004)
El Desouki, M.I., Gomaa, W.H., Abdalhakim, H.: A hybrid model for paraphrase detection combines pros of text similarity with deep learning. Int. J. Comput. Appl. 975, 8887 (2019)
Fernando, S., Stevenson, M.: A semantic similarity approach to paraphrase detection. In: Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pp. 45–52 (2008)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality (2013)
Molina, A., Torres-Moreno, J.M., SanJuan, E., Sierra, G., Rojas-Mora, J.: Analysis and transformation of textual energy distribution. In: 2013 12th Mexican International Conference on Artificial Intelligence, pp. 203–208. IEEE (2013)
Potthast, M., Stein, B., Eiselt, A., Cedeño, A.B., Rosso, P.: Overview of the 1st international competition on plagiarism detection. In: 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse, p. 1 (2009)
Torres-Moreno, J.M., Sierra, G., Peinl, P.: A German corpus for similarity detection tasks. Int. J. Comput. Linguist. Appl. 5(2), 9–24 (2014)
Zhou, L., Lin, C.Y., Munteanu, D.S., Hovy, E.: ParaEval: using paraphrases to evaluate summaries automatically. In: Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pp. 447–454. Association for Computational Linguistics (2006)
Acknowledgments
This work has been partially supported by PAPIIT projects IA401219, TA100520, AG400119 and CONACYT project A1-S-27780.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Gómez-Adorno, H., Bel-Enguix, G., Sierra, G., Torres-Moreno, JM., Martinez, R., Serrano, P. (2020). Evaluation of Similarity Measures in a Benchmark for Spanish Paraphrasing Detection. In: Martínez-Villaseñor, L., Herrera-Alcántara, O., Ponce, H., Castro-Espinoza, F.A. (eds) Advances in Computational Intelligence. MICAI 2020. Lecture Notes in Computer Science(), vol 12469. Springer, Cham. https://doi.org/10.1007/978-3-030-60887-3_19
Download citation
DOI: https://doi.org/10.1007/978-3-030-60887-3_19
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60886-6
Online ISBN: 978-3-030-60887-3
eBook Packages: Computer ScienceComputer Science (R0)