Sentence Paraphrase Graphs: Classification Based on Predictive Models or Annotators’ Decisions?

Pronoza, Ekaterina; Yagunova, Elena; Kochetkova, Nataliya

doi:10.1007/978-3-319-62434-1_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10061)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

As part of our ParaPhraser project on the identification and classification of Russian paraphrases, we have collected a corpus of more than 8000 sentence pairs annotated as precise paraphrases, loose paraphrases or non-paraphrases. The corpus is annotated via crowdsourcing by naive native Russian speakers, but from an expert's point of view, our complex paraphrase detection model can be more successful at predicting the paraphrase class than a naive native speaker.

Our paraphrase corpus is collected from news headlines and therefore can be considered a summarized news stream describing the most important events. By building a graph of paraphrases, we can detect such events.

In this paper we construct two such graphs: one based on the current human annotation and one based on the predictions of the complex model. We compare and analyze the structure of the graphs and show that the model graph has larger connected components, which give a more complete picture of the important events than the human annotation graph. The predictive model thus appears to be better than human annotators at capturing full information about the important events in the news collection.
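The graph comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the label strings ("precise", "loose", "non"), the function names, and the four placeholder headlines are assumptions made purely for illustration. Sentences are nodes, a pair labeled as a precise or loose paraphrase becomes an edge, and connected components stand in for events.

```python
from collections import defaultdict


def build_graph(pairs, paraphrase_labels=("precise", "loose")):
    """Build an undirected graph: nodes are sentences, edges join
    pairs whose label counts as a paraphrase (assumed label names)."""
    graph = defaultdict(set)
    for left, right, label in pairs:
        # Touch both keys so sentences without any paraphrase stay as nodes.
        graph[left]
        graph[right]
        if label in paraphrase_labels:
            graph[left].add(right)
            graph[right].add(left)
    return dict(graph)


def connected_components(graph):
    """Return the connected components as sets, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited node.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical data: the same three sentence pairs labeled twice,
# once by annotators and once by a model (placeholders, not corpus data).
human_pairs = [("A", "B", "precise"), ("B", "C", "non"), ("C", "D", "loose")]
model_pairs = [("A", "B", "precise"), ("B", "C", "loose"), ("C", "D", "loose")]

human_components = connected_components(build_graph(human_pairs))
model_components = connected_components(build_graph(model_pairs))

print([len(c) for c in human_components])
print([len(c) for c in model_components])
```

In this toy input a single differing label (the pair B, C) merges two small components into one larger component, which is the kind of structural difference between the two graphs that the abstract reports.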

Notes

1. http://scikit-learn.org.
2. Since the second half of the corpus is already annotated, we do not actually need any prediction here, but to be able to compare the graphs we have to construct them on the same data, and that is why we use the model prediction.
3. Moreover, we only work with news headlines, and better results in detecting the same events could be achieved by also taking into account the bodies of the news reports. We believe that the current results (i.e., model performance) are acceptable for building an adequate paraphrase graph based on the corpus.
4. https://www.yworks.com/products/yed.

Acknowledgements

The authors acknowledge St. Petersburg State University for research grant 30.38.305.2014.

Author information

Authors and Affiliations

St. Petersburg State University, 7/9 Universitetskaya Nab., St. Petersburg, Russian Federation
Ekaterina Pronoza & Elena Yagunova
National Research University Higher School of Economics, 20 Myasnitskaya ul., Moscow, Russian Federation
Nataliya Kochetkova

Corresponding author

Correspondence to Elena Yagunova.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Centro de Investigación en Computación, Mexico City, Mexico
Grigori Sidorov
Universidad Autónoma Metropolitana, Mexico City, Mexico
Oscar Herrera-Alcántara

About this paper

Cite this paper

Pronoza, E., Yagunova, E., Kochetkova, N. (2017). Sentence Paraphrase Graphs: Classification Based on Predictive Models or Annotators’ Decisions? In: Sidorov, G., Herrera-Alcántara, O. (eds) Advances in Computational Intelligence. MICAI 2016. Lecture Notes in Computer Science, vol 10061. Springer, Cham. https://doi.org/10.1007/978-3-319-62434-1_4

DOI: https://doi.org/10.1007/978-3-319-62434-1_4
Published: 03 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62433-4
Online ISBN: 978-3-319-62434-1
eBook Packages: Computer Science, Computer Science (R0)
