Low-Level Features for Paraphrase Identification

  • Conference paper
Advances in Artificial Intelligence and Soft Computing (MICAI 2015)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9413))

This paper addresses the task of sentential paraphrase identification. We work with Russian, but our approach can be applied to any other language with rich morphology and free word order. As part of our project, we construct a paraphrase corpus and then experiment with supervised methods of paraphrase identification. We focus on low-level string, lexical, and semantic features which, unlike complex deep features, do not introduce information noise and can serve as a solid basis for an effective paraphrase identification system. Experimental results show that adding the features introduced in this paper improves on a paraphrase identification model based solely on standard low-level features or on the optimized matrix metric used for corpus construction.
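The abstract does not list the individual features, so the following is only an illustrative sketch of what low-level string features for a sentence pair can look like. The choice of Dice overlap over word sets and character trigrams, the length ratio, and all function names are assumptions made for this example, not the authors' feature set.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Strings shorter than n contribute themselves as a single "n-gram".
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def dice(a, b):
    """Dice coefficient of two sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def low_level_features(s1, s2):
    """Compute a few surface-level similarity features for a sentence pair."""
    words1, words2 = set(s1.lower().split()), set(s2.lower().split())
    return {
        "word_dice": dice(words1, words2),
        "char3_dice": dice(char_ngrams(s1), char_ngrams(s2)),
        # Ratio of the shorter to the longer sentence length, in characters.
        "len_ratio": min(len(s1), len(s2)) / max(len(s1), len(s2), 1),
    }


features = low_level_features("кот спит", "кот ест")
```

Character n-grams are a common choice for morphologically rich languages because inflected forms of the same lemma still share most of their n-grams, whereas exact word overlap treats them as different tokens. A feature dictionary like this one would typically be fed to a supervised classifier.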

  1. In fact, our approach is not restricted to languages with these characteristics (e.g., it can also be applied to English), but the features we propose in this paper benefit substantially from them. We therefore recommend our method for morphologically rich languages with free word order.

  2. We follow a simplified approach and treat any title-cased notional word as a proper name.

  3. In this section we only show that the modified metric improves over our baseline. We do not address the task of selecting the optimal classifier; we simply choose SVM because it is well known and widely used in NLP. Sect. 5 presents the results of experiments with other classifiers.

  4. 4. .

  5. In this paper we do not attempt to select the optimal classifier; we leave a careful choice of classifier for future work.
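The simplified proper-name rule in note 2 can be sketched as follows. This is a hedged illustration only: the tokenization, the punctuation set, and the tiny `FUNCTION_WORDS` list (a stand-in for a real test of whether a word is notional, which would normally rely on part-of-speech information) are assumptions for this example and are not taken from the paper.

```python
# Hypothetical stand-in for a function-word filter; a real system would
# rely on part-of-speech tags to decide whether a word is notional.
FUNCTION_WORDS = {"в", "на", "и", "не", "с", "по", "the", "a", "in", "of"}

PUNCTUATION = ".,!?;:\"'()«»"


def proper_name_candidates(sentence):
    """Return title-cased tokens that are not in the function-word list."""
    tokens = [token.strip(PUNCTUATION) for token in sentence.split()]
    return [
        token
        for token in tokens
        if token and token.istitle() and token.lower() not in FUNCTION_WORDS
    ]


names = proper_name_candidates("В Москве прошла встреча с Путиным.")
```

Because the rule accepts any title-cased notional word, a sentence-initial content word is also marked as a proper name, which is the cost of the simplification.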




The authors acknowledge Saint-Petersburg State University for the research grant 30.38.305.2014.

Author information

Corresponding author

Correspondence to Ekaterina Pronoza.

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Pronoza, E., Yagunova, E. (2015). Low-Level Features for Paraphrase Identification. In: Sidorov, G., Galicia-Haro, S. (eds) Advances in Artificial Intelligence and Soft Computing. MICAI 2015. Lecture Notes in Computer Science(), vol 9413. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27059-3

  • Online ISBN: 978-3-319-27060-9

  • eBook Packages: Computer Science (R0)
