Knowledge-lean Paraphrase Identification Using Character-Based Features

  • Asli Eyecioglu
  • Bill Keller
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 789)

Abstract

The paraphrase identification task has practical importance in the NLP community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods should help improve the performance of NLP applications, including machine translation, information retrieval, question answering, text summarization, document clustering and plagiarism detection, amongst others. We consider an approach to paraphrase identification that may be described as “knowledge-lean”. Our approach minimizes the need for data transformation and avoids the use of knowledge-based tools and resources. Candidate paraphrase pairs are represented using combinations of word- and character-based features. We show that SVM classifiers may be trained to distinguish paraphrase and non-paraphrase pairs across a number of different paraphrase corpora with good results. Analysis shows that features derived from character bigrams are particularly informative. We also describe recent experiments in identifying paraphrases in Russian, a language with rich morphology and free word order that presents a particularly interesting challenge for our knowledge-lean approach. We are able to report good results on a three-way paraphrase classification task.
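
As a rough illustration of what word- and character-based overlap features for a sentence pair might look like, the following minimal sketch computes set-overlap counts over character bigrams and word unigrams. The abstract does not specify the exact feature set, so the function names (`char_ngrams`, `word_unigrams`, `overlap_features`, `pair_features`), the lowercasing and whitespace normalization, and the four-count feature layout are all assumptions made for illustration, not the authors' implementation.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string.

    Runs of whitespace are collapsed to a single space first
    (an assumed normalization, not taken from the paper).
    """
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def word_unigrams(text):
    """Return the set of lowercased, whitespace-delimited tokens."""
    return set(text.lower().split())


def overlap_features(set_a, set_b):
    """Hypothetical feature layout: |A ∪ B|, |A ∩ B|, |A|, |B|."""
    return [len(set_a | set_b), len(set_a & set_b), len(set_a), len(set_b)]


def pair_features(sentence_a, sentence_b):
    """Concatenate character-bigram and word-unigram overlap counts."""
    char_feats = overlap_features(char_ngrams(sentence_a), char_ngrams(sentence_b))
    word_feats = overlap_features(word_unigrams(sentence_a), word_unigrams(sentence_b))
    return char_feats + word_feats
```

Vectors produced this way need no lemmatizer, parser, or lexical database, which is the sense in which such features are knowledge-lean. They could then be passed, together with paraphrase labels, to any SVM implementation for training.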

Keywords

Paraphrase identification; Paraphrase corpora; Character N-grams; Lexical overlap; Support vector machines

Acknowledgements

The authors gratefully acknowledge the comments of our reviewers on earlier drafts of this paper.

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Bartin University, Bartin, Turkey
  2. University of Sussex, Brighton, UK
