English–Vietnamese cross-language paraphrase identification using hybrid feature classes

  • Dien Dinh
  • Nguyen Le Thanh


Paraphrase identification plays an important role in natural language processing applications such as machine translation, bilingual information retrieval, and plagiarism detection. With the growth of information technology and the Internet, texts need to be compared not only within a single language but also across many language pairs. In Vietnam, detecting paraphrases in English–Vietnamese sentence pairs is in high demand because English is one of the most popular foreign languages in the country. However, in-depth studies of cross-language paraphrase identification between English and Vietnamese remain limited. In this paper, we therefore propose a method for identifying English–Vietnamese cross-language paraphrases using hybrid feature classes. These classes are computed with a fuzzy-based method and a Siamese recurrent model, and are then combined through a mathematical formula to produce the final result. Experimental results show that our model achieves an F-measure of 87.4%.
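As a reading aid for the reported score, the F-measure of a binary paraphrase classifier can be sketched as below. This is a minimal illustration of the standard definition (the harmonic mean of precision and recall when beta = 1), not the authors' evaluation code; the function name `f_measure`, the label convention (1 = paraphrase, 0 = non-paraphrase), and the toy labels are assumptions made for the example.

```python
def f_measure(gold, predicted, beta=1.0):
    """Return (precision, recall, F_beta) for binary labels.

    Assumed convention for this sketch: 1 marks a paraphrase pair,
    0 marks a non-paraphrase pair.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0

    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not data from the paper).
    gold = [1, 1, 1, 0, 0, 1]
    predicted = [1, 0, 1, 0, 1, 1]
    p, r, f = f_measure(gold, predicted)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

With beta = 1 the measure weights precision and recall equally, which is the usual reading of an unqualified "F-measure"; the abstract does not state which beta was used.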


Keywords: Paraphrase identification, Semantic similarity, Cross-language, BabelNet, Vietnamese




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
