Construction of a Russian Paraphrase Corpus: Unsupervised Paraphrase Extraction

Pronoza, Ekaterina; Yagunova, Elena; Pronoza, Anton

doi:10.1007/978-3-319-41718-9_8

Part of the book series: Communications in Computer and Information Science (CCIS, volume 573)

Included in the following conference series:

Russian Summer School in Information Retrieval

Abstract

This paper presents a crowdsourcing project to create a publicly available corpus of sentential paraphrases for Russian. Collected from news headlines, such a corpus could be applied to information extraction and text summarization. We collect news headlines from different agencies in real time; paraphrase candidates are extracted from the headlines using an unsupervised matrix similarity metric. We provide a user-friendly online interface for crowdsourced annotation, which is available at paraphraser.ru. At the time of writing, 5181 sentence pairs have been annotated, 4758 of which are included in the corpus. Annotation is ongoing, and the current version of the corpus is freely available at http://paraphraser.ru.
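
The abstract does not define the matrix similarity metric. As a rough illustration only, the sketch below assumes a formulation in the style of Fernando and Stevenson (A semantic similarity approach to paraphrase detection, 2008), sim(a, b) = (a W bᵀ) / (|a| |b|), where a and b are binary bag-of-words vectors and W holds word-to-word similarities. The function names, the toy synonym table, and the English example headlines are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical word-to-word similarities; a real system would derive these
# from a synonym dictionary or another lexical resource.
TOY_SYNONYMS = {frozenset(("visits", "arrives")): 0.8}


def word_similarity(w1, w2):
    """Return 1.0 for identical words, a table value for known pairs, else 0.0."""
    if w1 == w2:
        return 1.0
    return TOY_SYNONYMS.get(frozenset((w1, w2)), 0.0)


def matrix_similarity(tokens_a, tokens_b, word_sim=word_similarity):
    """Compute (a W b^T) / (|a| |b|) over binary bag-of-words vectors."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0
    # With binary vectors, a W b^T reduces to summing W over word pairs
    # (one word from each sentence), and |a| is sqrt(number of distinct words).
    numerator = sum(word_sim(wa, wb) for wa in set_a for wb in set_b)
    return numerator / (math.sqrt(len(set_a)) * math.sqrt(len(set_b)))


headline_1 = ["president", "visits", "moscow"]
headline_2 = ["president", "arrives", "moscow"]
score = matrix_similarity(headline_1, headline_2)
```

Under these assumptions, headline pairs whose score exceeds some chosen threshold would be passed to annotators as paraphrase candidates; the threshold itself is not stated in the abstract. Note that with a non-identity W the score is not bounded by 1.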

Notes

1. This statement applies only to informative news texts (those intended to inform rather than persuade the reader), not to publicistic texts (those intended primarily to influence the reader). A publicistic headline is often designed to attract readers' attention. However, both publicistic and informative texts can be used as a source of paraphrases.
2. The latter might be of no importance for English, but they are essential for detecting Russian sentential paraphrases.

Author information

Authors and Affiliations

Saint-Petersburg State University, Saint-Petersburg, Russian Federation
Ekaterina Pronoza, Elena Yagunova & Anton Pronoza

Corresponding author

Correspondence to Ekaterina Pronoza.

Editor information

Editors and Affiliations

Ural Federal University, Yekaterinburg, Russia
Pavel Braslavski
University of Amsterdam, Amsterdam, The Netherlands
Ilya Markov
University of Florida, Gainesville, Florida, USA
Panos Pardalos
Eurecat, Barcelona, Spain
Yana Volkovich
National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov
National Research University Higher School of Economics, Saint Petersburg, Russia
Sergei Koltsov
National Research University Higher School of Economics, Saint Petersburg, Russia
Olessia Koltsova

About this chapter

Cite this chapter

Pronoza, E., Yagunova, E., Pronoza, A. (2016). Construction of a Russian Paraphrase Corpus: Unsupervised Paraphrase Extraction. In: Braslavski, P., et al. Information Retrieval. RuSSIR 2015. Communications in Computer and Information Science, vol 573. Springer, Cham. https://doi.org/10.1007/978-3-319-41718-9_8

DOI: https://doi.org/10.1007/978-3-319-41718-9_8
Published: 26 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41717-2
Online ISBN: 978-3-319-41718-9
eBook Packages: Computer Science, Computer Science (R0)