Abstract
In this paper, we present a system that detects paraphrases in four Indian languages: Hindi, Punjabi, Malayalam and Tamil. Our paraphrase detection method uses machine learning algorithms, namely multinomial logistic regression and support vector machines, trained on features that capture lexical and semantic similarities between the two sentences of a pair. With this system, we participated in the shared task on Detecting Paraphrases in Indian Languages (DPIL) organized by the Forum for Information Retrieval Evaluation (FIRE) in 2016. The shared task consisted of two subtasks, Task1 and Task2, and we participated in both for all four languages. The system we submitted used a multinomial logistic regression model and was officially evaluated by the organizers against the test set released for the FIRE 2016 shared task on DPIL. After the conference, we enhanced the system with another machine learning algorithm, support vector machines, and compared its performance with that of our previous systems. This paper describes the system, its performance in the shared task, and its enhancement using support vector machines. Our evaluation, based on the overall average performance across Task1 and Task2 over all four languages, shows that our system is comparable to the best system that participated in the shared task.
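To make the idea of pairwise similarity features concrete, the following is a minimal sketch of how a sentence pair might be turned into a small feature vector. The three features shown (word-level Jaccard overlap, character trigram cosine similarity, and a length ratio) and all function names are illustrative assumptions, not the feature set used in the paper. Semantic features such as word-embedding similarity are omitted here. In a full pipeline, vectors like these would be passed to a classifier such as multinomial logistic regression or a support vector machine.

```python
import math
from collections import Counter


def word_jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the sets of whitespace-separated tokens."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0  # two empty sentences are treated as identical
    return len(sa & sb) / len(sa | sb)


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Bag of character n-grams; script-agnostic, so it applies to any language."""
    padded = f" {text.strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    sq1 = sum(v * v for v in c1.values())
    sq2 = sum(v * v for v in c2.values())
    norm = math.sqrt(sq1 * sq2)
    return dot / norm if norm else 0.0


def pair_features(s1: str, s2: str) -> list:
    """Hypothetical feature vector for one sentence pair (assumed features)."""
    len1, len2 = len(s1.split()), len(s2.split())
    longest = max(len1, len2)
    length_ratio = min(len1, len2) / longest if longest else 1.0
    return [
        word_jaccard(s1, s2),
        cosine(char_ngrams(s1), char_ngrams(s2)),
        length_ratio,
    ]
```

Calling `pair_features(sentence1, sentence2)` returns three values in the range 0 to 1, with higher values indicating greater surface similarity. Character n-grams are used in this sketch because they need no language-specific tokenizer or stemmer, which is convenient when the same code has to handle several scripts.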
Acknowledgments
This research work has received support from the project entitled “Design and Development of a System for Querying, Clustering and Summarization for Bengali” funded by the Department of Science and Technology, Government of India under the SERB scheme.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Sarkar, K. (2018). Learning to Detect Paraphrases in Indian Languages. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_12
DOI: https://doi.org/10.1007/978-3-319-73606-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science, Computer Science (R0)