Abstract
Measuring the semantic similarity of sentence pairs is an important natural language processing (NLP) problem with applications in many NLP systems. Sentence similarity is used to improve the performance of systems such as machine translation, speech recognition, automatic question answering, and text summarization. However, accurately evaluating the semantic similarity between sentences remains a challenge. To date, no sentence similarity method that exploits Vietnamese-specific characteristics has been proposed, and no sentence similarity dataset for Vietnamese has been published. In this paper, we propose a new method to measure the semantic similarity of Vietnamese sentence pairs by combining the lexical similarity score and the distributional semantic similarity score of the two sentences. The experimental results show that the proposed model performs well on the Vietnamese semantic similarity problem.
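The idea of combining a lexical score with a distributional score can be illustrated with a minimal sketch. The abstract does not specify the actual formulas, so everything here is an assumption made for illustration: Jaccard overlap stands in for the lexical score, cosine similarity of averaged word vectors stands in for the distributional score, and the two are mixed with a hypothetical weight `alpha`. The toy two-dimensional embeddings are invented and carry no real meaning.

```python
import math

def lexical_similarity(tokens_a, tokens_b):
    """Jaccard overlap of the two token sets (an assumed lexical measure)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that appear in `embeddings`."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    if count == 0:
        return total
    return [t / count for t in total]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def combined_similarity(tokens_a, tokens_b, embeddings, dim, alpha=0.5):
    """Weighted mix of the lexical and distributional scores.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    lex = lexical_similarity(tokens_a, tokens_b)
    dist = cosine(sentence_vector(tokens_a, embeddings, dim),
                  sentence_vector(tokens_b, embeddings, dim))
    return alpha * lex + (1.0 - alpha) * dist

# Toy, invented 2-d embeddings for a few Vietnamese tokens (illustration only).
toy = {
    "tôi": [1.0, 0.2],
    "thích": [0.3, 0.9],
    "yêu": [0.4, 1.0],
    "mèo": [0.8, 0.5],
}
s1 = ["tôi", "thích", "mèo"]
s2 = ["tôi", "yêu", "mèo"]

print(combined_similarity(s1, s2, toy, dim=2))
```

In practice the toy dictionary would be replaced by vectors from a trained model, and `alpha` would be tuned on held-out data; both choices are outside what the abstract states.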
Notes
- 1.
- 2.
- 3.
In this study, we use the Google API to translate Chinese, Laotian, and Khmer sentences into Vietnamese.
Acknowledgments
This paper is part of project number KC-4.0-12/19-25, led by Dr. Nguyen Van Vinh and funded by the Science and Technology Program KC 4.0.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bui, VT., Nguyen, PT. (2022). Measuring Semantic Similarity of Vietnamese Sentences Based on Lexical and Distribution Similarity. In: Le Thi, H.A., Pham Dinh, T., Le, H.M. (eds) Modelling, Computation and Optimization in Information Systems and Management Sciences. MCO 2021. Lecture Notes in Networks and Systems, vol 363. Springer, Cham. https://doi.org/10.1007/978-3-030-92666-3_22
DOI: https://doi.org/10.1007/978-3-030-92666-3_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92665-6
Online ISBN: 978-3-030-92666-3
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)