Learning to Detect Paraphrases in Indian Languages

  • Conference paper
Text Processing (FIRE 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10478)

Abstract

In this paper, we present a system that detects paraphrases in four Indian languages: Hindi, Punjabi, Malayalam and Tamil. Our paraphrase detection method uses machine learning algorithms, namely multinomial logistic regression and support vector machines, trained on a variety of features that capture lexical and semantic similarities between the two sentences in a pair. With this system, we participated in the shared task on Detecting Paraphrases in Indian Languages (DPIL) organized by the Forum for Information Retrieval Evaluation (FIRE) in 2016. The shared task consisted of two subtasks, Task 1 and Task 2, and we participated in both for all four languages. Our submitted system used a multinomial logistic regression model and was officially evaluated by the organizers against the test set released for the FIRE 2016 shared task on DPIL. After the conference, we enhanced the system with another machine learning algorithm, support vector machines, and compared its performance with that of our previous systems. This paper describes our system, its performance in the shared task, and its enhancement using support vector machines. Evaluated on the overall average performance across Task 1 and Task 2 over all four languages, our system is comparable to the best system that participated in the shared task.
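The general approach outlined in the abstract, similarity features computed for a sentence pair and fed to a multinomial logistic regression classifier, can be sketched as follows. This is a minimal illustration and not the system evaluated in the paper: the abstract does not list the exact features or the training setup, so the three lexical similarities, the plain gradient-descent trainer, and all function names below are assumptions chosen for clarity. The semantic similarity features mentioned in the abstract are not covered here.

```python
import math
from collections import Counter


def tokens(sentence):
    """Whitespace tokenization; needs no language-specific tokenizer."""
    return sentence.lower().split()


def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def char_ngrams(sentence, n=3):
    """Character n-grams over the whitespace-normalized sentence."""
    text = " ".join(tokens(sentence))
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def pair_features(s1, s2):
    """Illustrative lexical similarity features for one sentence pair."""
    t1, t2 = tokens(s1), tokens(s2)
    return [
        jaccard(t1, t2),                            # word overlap
        cosine(t1, t2),                             # bag-of-words cosine
        jaccard(char_ngrams(s1), char_ngrams(s2)),  # character trigram overlap
    ]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(W, x):
    """Class probabilities for feature vector x; a bias input of 1.0 is appended."""
    xb = x + [1.0]
    return softmax([sum(w * v for w, v in zip(row, xb)) for row in W])


def train_softmax(X, y, n_classes, epochs=500, lr=0.5):
    """Multinomial logistic regression fitted by batch gradient descent."""
    n_feats = len(X[0])
    # One weight row per class; the last column holds the bias.
    W = [[0.0] * (n_feats + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_feats + 1) for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = class_probs(W, x)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j, v in enumerate(x + [1.0]):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_feats + 1):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W


def predict(W, x):
    """Index of the most probable class for feature vector x."""
    probs = class_probs(W, x)
    return max(range(len(probs)), key=probs.__getitem__)
```

To use the sketch, build one feature vector per labeled sentence pair with `pair_features`, pass the vectors and integer class labels to `train_softmax`, and call `predict` on the features of an unseen pair. The `n_classes` parameter is left open so the same code can cover a binary or a multi-class labeling scheme.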

Notes

  1. https://www.microsoft.com/en-us/download/confirmation.aspx?id=52398

  2. https://www.cs.york.ac.uk/semeval-2012/task6/index.html

  3. http://nlp.amrita.edu/dpil_cen/

Acknowledgments

This research work has received support from the project entitled "Design and Development of a System for Querying, Clustering and Summarization for Bengali", funded by the Department of Science and Technology, Government of India, under the SERB scheme.

Author information

Correspondence to Kamal Sarkar.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Sarkar, K. (2018). Learning to Detect Paraphrases in Indian Languages. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_12

  • DOI: https://doi.org/10.1007/978-3-319-73606-8_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73605-1

  • Online ISBN: 978-3-319-73606-8

  • eBook Packages: Computer Science, Computer Science (R0)
