Paraphrase Identification on the Basis of Supervised Machine Learning Techniques

  • Zornitsa Kozareva
  • Andrés Montoyo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4139)


This paper presents a machine learning approach to paraphrase identification that uses lexical and semantic similarity information. In the experimental studies, we examine the limitations of the designed attributes and the behavior of three machine learning classifiers. To improve the final performance of the system, we examine the effect of combining lexical and semantic information, as well as techniques for classifier combination.
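
The abstract does not spell out the attributes themselves, so the following is only an illustrative sketch of what lexical similarity features for a sentence pair might look like. It assumes whitespace tokenization, a Jaccard word-overlap score, and a word-level longest common subsequence ratio. The function names `lcs_length` and `lexical_features` are hypothetical and are not taken from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lexical_features(s1, s2):
    """Two illustrative lexical similarity scores in [0, 1] for a sentence pair.

    Assumption: lowercased whitespace tokens; the paper's actual
    preprocessing and feature set may differ.
    """
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 or not t2:
        return {"overlap": 0.0, "lcs": 0.0}
    v1, v2 = set(t1), set(t2)
    overlap = len(v1 & v2) / len(v1 | v2)  # Jaccard word overlap
    lcs = lcs_length(t1, t2) / max(len(t1), len(t2))  # normalized LCS
    return {"overlap": overlap, "lcs": lcs}
```

Scores of this kind could serve as input attributes to a supervised classifier that labels a sentence pair as paraphrase or non-paraphrase. Semantic similarity attributes would need an additional lexical resource and are not sketched here.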


Support Vector Machine, Natural Language Processing, Semantic Similarity, Computational Linguistics, Longest Common Subsequence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Zornitsa Kozareva (1)
  • Andrés Montoyo (1)
  1. Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante
