
Sentence Paraphrase Graphs: Classification Based on Predictive Models or Annotators’ Decisions?

  • Conference paper
Advances in Computational Intelligence (MICAI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10061)

Abstract

As part of our ParaPhraser project on the identification and classification of Russian paraphrases, we have collected a corpus of more than 8000 sentence pairs annotated as precise paraphrases, loose paraphrases or non-paraphrases. The corpus is annotated via crowdsourcing by naive native speakers of Russian; from an expert's point of view, however, our complex paraphrase detection model can predict the paraphrase class more successfully than a naive native speaker.

Our paraphrase corpus is collected from news headlines and therefore can be considered a summarized news stream describing the most important events. By building a graph of paraphrases, we can detect such events.

In this paper we construct two such graphs: one based on the current human annotation and one based on the prediction of the complex model. We compare and analyze the structure of the graphs and show that the model graph has larger connected components, which give a more complete picture of the important events than the human annotation graph does. The predictive model thus appears to be better than human annotators at capturing full information about the important events in the news collection.
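The graph comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The headline identifiers, the label names and the choice to draw an edge for both precise and loose paraphrases are assumptions made for illustration, and the toy data is invented to show only the mechanics, not the paper's results.

```python
# Toy sentence pairs: (headline_a, headline_b, human_label, model_label).
# All identifiers and labels here are hypothetical illustration data.
PAIRS = [
    ("h1", "h2", "precise", "precise"),
    ("h2", "h3", "non", "loose"),
    ("h4", "h5", "loose", "loose"),
]

# Assumption: both precise and loose paraphrases count as graph edges.
EDGE_LABELS = {"precise", "loose"}


def build_graph(pairs, label_index):
    """Build an undirected graph: headlines are nodes, paraphrase pairs are edges."""
    graph = {}
    for pair in pairs:
        a, b, label = pair[0], pair[1], pair[label_index]
        # Register both headlines so isolated nodes are kept as well.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if label in EDGE_LABELS:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components as sets of nodes, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Iterative depth-first traversal from an unvisited node.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def component_sizes(graph):
    """Sizes of the connected components, in descending order."""
    return [len(c) for c in connected_components(graph)]


human_graph = build_graph(PAIRS, label_index=2)
model_graph = build_graph(PAIRS, label_index=3)

human_sizes = component_sizes(human_graph)
model_sizes = component_sizes(model_graph)
print("human annotation graph:", human_sizes)
print("model prediction graph:", model_sizes)
```

In this toy input the model labels one additional pair as a paraphrase, so two components of the human annotation graph merge into one larger component in the model graph. Each component can then be read as a candidate event described by several headlines.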

Notes

  1. http://scikit-learn.org.

  2. Since the second half of the corpus is already annotated, we do not actually need any prediction here; however, to compare the graphs we have to construct them on the same data, which is why we use the model prediction.

  3. Moreover, we work only with news headlines; better results in detecting the same events could be achieved by also taking the bodies of the news reports into account. We believe that the current results (i.e., the model performance) are acceptable for building an adequate paraphrase graph based on the corpus.

  4. https://www.yworks.com/products/yed.

Acknowledgements

The authors acknowledge St.-Petersburg State University for the research grant 30.38.305.2014.

Author information

Corresponding author

Correspondence to Elena Yagunova.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Pronoza, E., Yagunova, E., Kochetkova, N. (2017). Sentence Paraphrase Graphs: Classification Based on Predictive Models or Annotators’ Decisions?. In: Sidorov, G., Herrera-Alcántara, O. (eds) Advances in Computational Intelligence. MICAI 2016. Lecture Notes in Computer Science, vol 10061. Springer, Cham. https://doi.org/10.1007/978-3-319-62434-1_4

  • DOI: https://doi.org/10.1007/978-3-319-62434-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62433-4

  • Online ISBN: 978-3-319-62434-1

  • eBook Packages: Computer Science, Computer Science (R0)
