Abstract
In this paper we study the contribution of semantic features to the detection of Russian paraphrases. The features were calculated on the Russian Thesaurus RuThes. First, we applied RuThes synonyms in clustering news articles, many of which had been created with rewriting (that is paraphrasing) of source news, and found significant improvement. Second, we applied several semantic similarity measures proposed for English thesaurus WordNet to RuThes thesaurus and utilized them for detecting Russian paraphrased sentences.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Fader, A., Zettlemoyer, L.S., Etzioni, O.: Paraphrase-driven learning for open question answering. In: Proceedings of ACL-2013, pp. 1608–1618 (2013)
Vossen, P., Rigau, G., Serafini, L., Stouten, P., Irving, F., van Hage, W.R.: NewsReader: recording history from daily news streams. In: Proceedings of LREC-2014, pp. 2000–2007 (2014)
Nenkova, A., McKeown, K.: A survey of text summarization techniques. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data Book, pp. 43–76. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-3223-4_3
Loukachevitch, N., Alekseev, A.: Summarizing news clusters on the basis of thematic chains. In: Proceedings of LREC-2012, pp. 1600–1607 (2012)
Clough, P., Gaizauskas, R., Piao, S., Wilks, Y.: METER: MEasuring TExt reuse. In: Proceedings of the 40th Anniversary Meeting for the Association for Computational Linguistics (ACL 2002), pp. 152–159 (2002)
Marton, Y., Callison-Burch, C., Resnik, P.: Improved statistical machine translation using monolingually-derived paraphrases. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP-2009, pp. 381–390 (2009)
Dolan, W.B., Quirk, C., Brockett, C.: Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics, Coling-2004, Geneva, Switzerland (2004)
Pavlick, E., Rastogi, P., Ganitkevitch, J., Durme, B., Callison-Burch, C.: PPDB 2.0: better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In: Proceedings of ACL-2015 and the 7th International Joint Conference on Natural Language Processing, vol. 2, pp. 425–430 (2015)
Agirre, E., Diab, M., Cer, D., Gonzalez-Agirre, A.: Semeval-2012 task 6: a pilot on semantic textual similarity. In: Proceedings of the Sixth International Workshop on Semantic Evaluation, pp. 385–393. Association for Computational Linguistics (2012)
Agirre, E., Banea, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Mihalcea, R., Wiebe, J.: Semeval-2016 task 1: semantic textual similarity, monolingual and cross-lingual evaluation. In: Proceedings of SemEval, pp. 497–511 (2016)
Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. The MIT Press, Cambridge (1998)
Han, L., Kashyap, A., Finin, T., Mayfield, J., Weese, J.: UMBC EBIQUITY-CORE: semantic textual similarity systems. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Atlanta, Georgia, USA, June, pp. 44–52. Association for Computational Linguistics (2013)
Loukachevitch, N., Dobrov, B.: RuThes linguistic ontology vs. Russian wordnets. In: Proceedings of Global WordNet Conference GWC-2014, pp. 154–162 (2014)
Pronoza, E., Yagunova, E., Pronoza, A.: Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction. In: Braslavski, P., Markov, I., Pardalos, P., Volkovich, Y., Ignatov, D.I., Koltsov, S., Koltsova, O. (eds.) RuSSIR 2015. CCIS, vol. 573, pp. 146–157. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41718-9_8
Pivovarova, L., Pronoza, E., Yagunova, E., Pronoza, A.: ParaPhraser: Russian paraphrase corpus and shared task. In: Filchenkov, A., et al. (eds.) AINL 2017. CCIS, vol. 789, pp. 211–225. Springer, Cham (2018)
Loukachevitch, N., Shevelev, A., Mozharova V.: Testing features and methods in Russian Paraphrasing Task. In: Proceedings of International Conference on Computational Linguistics and Intellectual Technologies Dialog 2017, vol. 1, pp. 135–145 (2017)
Kozareva, Z., Montoyo, A.: Paraphrase identification on the basis of supervised machine learning techniques. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds.) FinTAL 2006. LNCS (LNAI), vol. 4139, pp. 524–533. Springer, Heidelberg (2006). https://doi.org/10.1007/11816508_52
Pronoza, E., Yagunova, E.: Low-level features for paraphrase identification. In: Sidorov, G., Galicia-Haro, S.N. (eds.) MICAI 2015. LNCS (LNAI), vol. 9413, pp. 59–71. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27060-9_5
Brockett, C., Dolan, W.B.: Support vector machines for paraphrase identification and corpus construction. In: Proceedings of the 3rd International Workshop on Paraphrasing, pp. 1–8 (2005)
Mihalcea, R., Corley, C., Strapparava C.: Corpus-based and Knowledge-based measures of text semantic similarity. In: Proceedings of the American Association for Artificial Intelligence (2006)
Fernando, S., Stevenson, M.: A semantic similarity approach to paraphrase detection. In: Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pp. 45–52 (2008)
Bar, D., Biemann, C., Gurevych, I., Zesch, T.: UKP: computing semantic textual similarity by combining multiple content similarity measures. In: Proceedings of the 6th International Workshop on Semantic Evaluation, Held in Conjunction with the 1st Joint Conference on Lexical and Computational Semantics, pp. 435–440 (2012)
Rychalska, B., Pakulska, K., Chodorowska, K., Walczak, W., Andruszkiewicz, P.: Samsung Poland NLP team at SemEval-2016 Task 1: necessity for diversity; combining recursive autoencoders, wordnet and ensemble methods to measure semantic similarity. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, CA, USA (2016)
Gurevych, I., Niederlich, H.: Computing semantic relatedness in German with revised information content metrics. In: Proceedings of OntoLex 2005 - Ontologies and Lexical Resources, IJCNLP 2005 Workshop (2005)
Kunze, C., Lemnitzer, L.: GermaNet-representation, visualization, application. In: LREC-2002 (2002)
Muller, C., Gurevych, I., Muhlhauser, M.: Integrating semantic knowledge into text similarity and information retrieval. In: International Conference on Semantic Computing, ICSC 2007, pp. 257–264. IEEE (2007)
Loukachevitch, N.V., Dobrov, B.V., Chetviorkin, I.I.: Ruthes-lite, a publicly available version of thesaurus of Russian language ruthes. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference Dialogue-2014, Bekasovo, Russia, pp. 340–349 (2014)
Guarino, N.: The ontological level: revisiting 30 years of knowledge representation. In: Borgida, A.T., Chaudhri, V.K., Giorgini, P., Yu, E.S. (eds.) Conceptual Modeling: Foundations and Applications. LNCS, vol. 5600, pp. 52–67. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02463-4_4
Loukachevitch, N., Dobrov, B.: The Sociopolitical Thesaurus as a resource for automatic document processing in Russian. Terminology 21(2), 238–263 (2015). Special issue Terminology across languages and domains
Dobrov, B.V., Kuralenok, I., Loukachevitch, N.V., Nekrestyanov, I., Segalovich, I.: Russian information retrieval evaluation seminar. In: LREC-2004 (2004)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Rokach, L., Maimon, O.: Clustering Methods. Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer, New York (2005)
Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M., (eds.) Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD 1996), pp. 226–231. AAAI Press (1996)
Zagoruiko, N.G.: Intellectual data analysis based on a rival similarity function. Optoelectron. Instrum. Data Process. 44(3), 211–217 (2008)
Budanitsky, A., Hirst, G.: Evaluating wordnet-based measures of lexical semantic relatedness. Comput. Linguist. 32(1), 13–47 (2006)
Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007 (1995)
Acknowledgments
This work was partially supported by Russian National Foundation, grant N16-18-02074.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Loukachevitch, N., Shevelev, A., Mozharova, V., Dobrov, B., Pavlov, A. (2018). RuThes Thesaurus in Detecting Russian Paraphrases . In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2017. Communications in Computer and Information Science, vol 789. Springer, Cham. https://doi.org/10.1007/978-3-319-71746-3_20
Download citation
DOI: https://doi.org/10.1007/978-3-319-71746-3_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71745-6
Online ISBN: 978-3-319-71746-3
eBook Packages: Computer ScienceComputer Science (R0)