Abstract
The Paraphrase identification (PI) task has practical importance for work in Natural Language Processing (NLP) because of the problem of linguistic variation. Accurate methods should help improve performance of key NLP applications. Paraphrase corpora are important resources in developing and evaluating PI methods. This paper describes the construction of a paraphrase corpus for Turkish. The corpus comprises pairs of sentences with semantic similarity scores based on human judgments, permitting experimentation with both PI and semantic similarity. We believe this is the first such corpus for Turkish. The data collection and scoring methodology is described and initial PI experiments with the corpus are reported. Our approach to PI is novel in using ‘knowledge lean’ methods (i.e. no use of manually constructed knowledge bases or processing tools that rely on these). We have previously achieved excellent results using such techniques on the Microsoft Research Paraphrase Corpus, and close to state-of-the-art performance on the Twitter Paraphrase Corpus.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
TuPC can be downloaded from: http://aslieyecioglu.com/data/.
References
Agirre, E., et al.: Semeval-2012 task 6: a pilot on semantic textual similarity. In: Proceedings of the 6th International Workshop on Semantic Evaluation, in conjunction with the First Joint Conference on Lexical and Computational Semantics, pp. 385–393 (2012)
Bannard, C., Callison-Burch, C.: Paraphrasing with bilingual parallel corpora. In: Proceedings of the 43th Annual Meeting on Association for Computational Linguistics, pp. 597–604 (2005)
Bird, S., et al.: Natural Language Processing with Python. O’Reilly Media Inc., Newton (2009)
Blacoe, W., Lapata, M.: A comparison of vector-based representations for semantic composition. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), pp. 546–556 (2012)
Callison-Burch, C. et al.: Improved statistical machine translation using paraphrases. In: Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL 2006), pp. 17–24 (2006)
Chang, C., Lin, C.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)
Chen, D.L., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT 2011), pp. 190–200 (2011)
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960)
Das, D., Smith, N.A.: Paraphrase identification as probabilistic quasi-synchronous recognition. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: ACL-IJCNLP 2009, pp. 468–476 (2009)
Demir, S., et al.: Turkish Paraphrase Corpus. In: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC 2012), pp. 4087–4091 (2012)
Dolan, B., et al.: Unsupervised construction of large paraphrase corpora. In: Proceedings of the 20th international conference on Computational Linguistics - COLING 2004, NJ, USA, p. 350–es (2004)
Duclaye, F., et al.: Using the web as a linguistic resource for learning reformulations automatically. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, Spain, pp. 390–396 (2002)
Eyecioglu, A., Keller, B.: ASOBEK : Twitter Paraphrase Identification with Simple Overlap Features and SVMs. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, pp. 64–69 (2015)
Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
Fernando, S., Stevenson, M.: A semantic similarity approach to paraphrase detection. In: Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pp. 45–52 (2008)
Finch, A. et al.: Using Machine Translation Evaluation Techniques to Determine Sentence-level Semantic Equivalence. In: Proceedings of the Third International Workshop on Paraphrasing (IWP 2005), pp. 17–24 (2005)
Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382 (1971)
Ganitkevitch, J., et al.: PPDB : the paraphrase database. In: Proceedings of NAACL-HLT, Atlanta, Gerogia, pp. 758–764 (2013)
Gwet, K.L.: Handbook of Inter-rater Reliability. Advanced Analytics, Gaithersburg (2012)
Hsu, C.-W., et al.: A Practical Guide to Support Vector Classification. BJU Int. 101(1), 1396–1400 (2008)
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)
Lintean, M., Rus, V.: Dissimilarity kernels for paraphrase identification. In: Proceedings of the 24th International Florida Artificial Intelligence Research Society Conference, Palm Beach, FL, pp. 263–268 (2011)
Madnani, N., et al.: Re-examining machine translation metrics for paraphrase identification. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2012), PA, USA, pp. 182–190 (2012)
Madnani, N., et al.: Using paraphrases for parameter tuning in statistical machine translation. In: Proceedings of the Second Workshop on Statistical Machine Translation (WMT 2007), Prague, Czech Republic (2007)
Malakasiotis, P.: Paraphrase recognition using machine learning to combine similarity measures. In: Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, Suntec, Singapore, pp. 27–35 (2009)
Max, A., Wisniewski, G.: Mining naturally-occurring corrections and paraphrases from wikipedia’s revision history. In: Proceeding of LREC, pp. 3143–3148 (2010)
Mihalcea, R., et al.: Corpus-based and knowledge-based measures of text semantic similarity. In: Proceedings of the 21st national conference on Artificial intelligence- Volume 1, pp. 775–780. AAAI Press (2006)
Owczarzak, K., et al.: Contextual bitext-derived paraphrases in automatic MT evaluation. In: StatMT 2006, Stroudsburg, PA, USA, pp. 86–93 (2006)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., et al.: Scikit-learn: Machine Learning in Python. http://scikit-learn.org/stable/
Quirk, C., et al.: Monolingual machine translation for paraphrase generation. In: EMNLP-2014, pp. 142–149 (2004)
Socher, R., et al.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Advances in Neural Information Processing Systems, pp. 801–809 (2011)
Stanovsky, G.: A Study in Hebrew Paraphrase Identification. Ben-Gurion University of Negev (2012)
Wubben, S. et al.: Creating and using large monolingual parallel corpora for sentential paraphrase generation, pp. 4292–4299 (2010)
Xu, W.: Data-driven approaches for paraphrasing across language variations. New York University (2014)
Xu, W., et al.: SemEval-2015 Task 1: paraphrase and semantic similarity in Twitter (PIT). In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval) (2015)
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Eyecioglu, A., Keller, B. (2018). Constructing a Turkish Corpus for Paraphrase Identification and Semantic Similarity. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science(), vol 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_42
Download citation
DOI: https://doi.org/10.1007/978-3-319-75477-2_42
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75476-5
Online ISBN: 978-3-319-75477-2
eBook Packages: Computer ScienceComputer Science (R0)