Learning to Detect Paraphrases in Indian Languages

Sarkar, Kamal

doi:10.1007/978-3-319-73606-8_12

Kamal Sarkar ORCID: orcid.org/0000-0002-0689-3976

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10478)

Included in the following conference series:

Forum for Information Retrieval Evaluation

Abstract

In this paper, we present a system that detects paraphrases in four Indian languages: Hindi, Punjabi, Malayalam and Tamil. Our paraphrase detection method uses machine learning algorithms, namely multinomial logistic regression and support vector machines, trained on a variety of features that capture lexical and semantic similarities between the two sentences in a pair. With this system, we participated in the shared task on Detecting Paraphrases in Indian Languages (DPIL) organized by the Forum for Information Retrieval Evaluation (FIRE) in 2016. The shared task consisted of two subtasks, Task 1 and Task 2, and we participated in both for all four languages. Our submission used the multinomial logistic regression model and was officially evaluated by the organizers against the test set released for the FIRE 2016 shared task on DPIL. After the conference, we enhanced the system with a second machine learning algorithm, support vector machines, and compared its performance with that of our previous systems. This paper describes the system, its performance in the shared task, and its enhancement using support vector machines. Our evaluation, based on the overall average performance across Task 1 and Task 2 over all four languages, shows that our system performs comparably to the best system that participated in the shared task.
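
As a rough illustration of the feature-based approach described in the abstract, the following sketch computes two lexical similarity scores for a sentence pair and fits a logistic regression classifier on them. It is not the system evaluated in the paper: the two features, the function names, the toy English sentence pairs and the training settings are all assumptions made for illustration, semantic-level features are omitted, and a two-class setting is used, which is the binary special case of multinomial logistic regression.

```python
import math
from collections import Counter


def word_jaccard(a, b):
    """Lexical feature: Jaccard overlap of the two sentences' word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def char_ngram_cosine(a, b, n=3):
    """Lexical feature: cosine similarity of character n-gram counts."""
    def grams(s):
        s = s.lower()
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ca, cb = grams(a), grams(b)
    dot = sum(count * cb[g] for g, count in ca.items())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_features(a, b):
    """Feature vector for one sentence pair (assumed feature set)."""
    return [word_jaccard(a, b), char_ngram_cosine(a, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Binary logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias) - t
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        bias -= lr * grad_b / n
    return w, bias


def paraphrase_probability(w, bias, a, b):
    """Predicted probability that sentences a and b are paraphrases."""
    x = pair_features(a, b)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)


# Hypothetical toy data in English (1 = paraphrase, 0 = not a paraphrase).
train_pairs = [
    ("the cat sat on the mat", "the cat is sitting on the mat"),
    ("he bought a new car", "he purchased a new car"),
    ("the cat sat on the mat", "stock prices fell sharply today"),
    ("he bought a new car", "rain is expected tomorrow"),
]
labels = [1, 1, 0, 0]

X = [pair_features(a, b) for a, b in train_pairs]
w, bias = train_logistic(X, labels)
```

Calling `paraphrase_probability(w, bias, s1, s2)` on a new pair returns a score between 0 and 1. A realistic setup would add semantic similarity features and train on the shared task's annotated sentence pairs rather than on a handful of toy examples.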

Acknowledgments

This research work has received support from the project entitled “Design and Development of a System for Querying, Clustering and Summarization for Bengali” funded by the Department of Science and Technology, Government of India under the SERB scheme.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
Kamal Sarkar

Corresponding author

Correspondence to Kamal Sarkar.

Editor information

Editors and Affiliations

DAIICT, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
DAIICT, Gujarat, India
Parth Mehta
DAIICT, Gujarat, India
Jainisha Sankhavara

Cite this paper

Sarkar, K. (2018). Learning to Detect Paraphrases in Indian Languages. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_12

DOI: https://doi.org/10.1007/978-3-319-73606-8_12
Published: 04 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science, Computer Science (R0)
