Abstract
As part of our project ParaPhraser on the identification and classification of Russian paraphrases, we have collected a corpus of more than 8000 sentence pairs annotated as precise paraphrases, loose paraphrases or non-paraphrases. The corpus is annotated via crowdsourcing by naive native Russian speakers; from an expert's point of view, however, our complex paraphrase detection model can predict the paraphrase class more successfully than a naive native speaker.
Our paraphrase corpus is collected from news headlines and therefore can be considered a summarized news stream describing the most important events. By building a graph of paraphrases, we can detect such events.
In this paper we construct two such graphs: one based on the current human annotation and one based on the complex model prediction. The structure of the graphs is compared and analyzed, and it is shown that the model graph has larger connected components, which give a more complete picture of the important events than the human annotation graph. The predictive model appears to be better at capturing full information about the important events from the news collection than human annotators.
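The graph comparison described in the abstract can be illustrated with a minimal sketch. It assumes that headlines are nodes and that an edge joins two headlines whenever the pair is labelled a precise or loose paraphrase; the paper's actual construction may differ. The function names, the label strings and the toy pairs (`h1` to `h5`) are hypothetical and are not taken from the corpus.

```python
def build_paraphrase_graph(pairs, paraphrase_labels=("precise", "loose")):
    """Build an undirected graph as an adjacency map (assumed construction).

    `pairs` is an iterable of (sentence_a, sentence_b, label) triples.
    """
    graph = {}
    for first, second, label in pairs:
        # Register both sentences so that non-paraphrases stay as isolated nodes.
        graph.setdefault(first, set())
        graph.setdefault(second, set())
        if label in paraphrase_labels:
            graph[first].add(second)
            graph[second].add(first)
    return graph


def connected_components(graph):
    """Return the connected components, largest first, using iterative DFS."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical toy data: the same sentence pairs with two sets of labels.
human_pairs = [
    ("h1", "h2", "precise"),
    ("h2", "h3", "non-paraphrase"),
    ("h4", "h5", "loose"),
]
model_pairs = [
    ("h1", "h2", "precise"),
    ("h2", "h3", "loose"),
    ("h4", "h5", "loose"),
]

human_sizes = [len(c) for c in connected_components(build_paraphrase_graph(human_pairs))]
model_sizes = [len(c) for c in connected_components(build_paraphrase_graph(model_pairs))]
print("human annotation graph:", human_sizes)
print("model prediction graph:", model_sizes)
```

In this toy case the model labels one extra pair as a loose paraphrase, so two components of the human annotation graph merge into a single larger component in the model graph. This is the kind of structural difference the abstract reports.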
Notes
- 1.
- 2.
Since the second half of the corpus is already annotated, we do not actually need any prediction here. However, to be able to compare the graphs we have to construct them on the same data, which is why we use the model prediction.
- 3.
Moreover, we only work with news headlines, and better results in detecting the same events could be achieved by also taking into account the bodies of the news reports. We believe that the current results (i.e., model performance) are acceptable for building an adequate paraphrase graph based on the corpus.
- 4.
Acknowledgements
The authors acknowledge St. Petersburg State University for research grant 30.38.305.2014.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Pronoza, E., Yagunova, E., Kochetkova, N. (2017). Sentence Paraphrase Graphs: Classification Based on Predictive Models or Annotators’ Decisions? In: Sidorov, G., Herrera-Alcántara, O. (eds) Advances in Computational Intelligence. MICAI 2016. Lecture Notes in Computer Science, vol 10061. Springer, Cham. https://doi.org/10.1007/978-3-319-62434-1_4
DOI: https://doi.org/10.1007/978-3-319-62434-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62433-4
Online ISBN: 978-3-319-62434-1
eBook Packages: Computer Science, Computer Science (R0)