A New Corpus of the Russian Social Network News Feed Paraphrases: Corpus Construction and Linguistic Feature Analysis

  • Ekaterina Pronoza
  • Elena Yagunova
  • Anton Pronoza
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10633)

Abstract

In this paper we present a new Russian paraphrase corpus derived from the news feed of a social network and conduct an initial analysis of it. Most media agencies post their news reports on their social network pages, and the headlines of these posts are often identical to those of the corresponding articles on the agencies' official websites. Sometimes, however, the two headlines differ, and in such cases the social network headline can be considered a compression or a paraphrase of the original one. In other words, social network news feeds are a rich source of textual entailment and, as shown in this paper, of various linguistic phenomena, e.g., irony, presupposition and attention-attracting markers. We collect such pairs of headlines and construct the Russian social network news feed paraphrase corpus from them. We take a paraphrase detection model trained on another existing Russian paraphrase corpus, ParaPhraser.ru, which was collected from official news headlines only, and test it against the constructed dataset. We also explore the linguistic and pragmatic features of the new corpus.

Keywords

Paraphrase corpus; News headlines; Social network news feed; Text compression; Textual entailment; Linguistic phenomena; Loose paraphrase

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. St.-Petersburg State University, St.-Petersburg, Russian Federation
  2. Institute for Informatics and Automation of the Russian Academy of Sciences, St.-Petersburg, Russian Federation
