Measuring Semantic Similarity of Vietnamese Sentences Based on Lexical and Distribution Similarity

Bui, Van-Tan; Nguyen, Phuong-Thai

doi:10.1007/978-3-030-92666-3_22


Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 363)

Included in the following conference series:

International Conference on Modelling, Computation and Optimization in Information Systems and Management Sciences


Abstract

Measuring the semantic similarity of sentence pairs is an important natural language processing (NLP) problem with applications in many NLP systems. Sentence similarity is used to improve the performance of systems such as machine translation, speech recognition, automatic question answering, and text summarization. However, accurately evaluating the semantic similarity between sentences remains a challenge. To date, no sentence similarity method that exploits Vietnamese-specific characteristics has been proposed, and no sentence similarity dataset for Vietnamese has been published. In this paper, we propose a new method to measure the semantic similarity of Vietnamese sentence pairs by combining the lexical similarity score and the distributional semantic similarity score of the two sentences. The experimental results show that the proposed model performs well on the Vietnamese semantic similarity problem.
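
The abstract does not give the formula used to combine the two scores, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that sentences are already word-segmented, that lexical similarity is approximated by Jaccard overlap of tokens, that distributional similarity is approximated by the cosine of averaged word vectors, and that the two are mixed with a hypothetical weight `alpha`. All function names and the toy embeddings are illustrative assumptions.

```python
import math


def jaccard(tokens_a, tokens_b):
    """Lexical overlap: shared token types divided by all token types."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def mean_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(tokens_a, tokens_b, embeddings, dim, alpha=0.5):
    """Weighted mix of a lexical score and a distributional score.

    alpha is a hypothetical mixing weight (assumption, not from the paper).
    """
    lexical = jaccard(tokens_a, tokens_b)
    distributional = cosine(
        mean_vector(tokens_a, embeddings, dim),
        mean_vector(tokens_b, embeddings, dim),
    )
    return alpha * lexical + (1.0 - alpha) * distributional


# Toy 2-dimensional vectors, invented purely for illustration.
TOY_EMBEDDINGS = {
    "tôi": [1.0, 0.0],
    "thích": [0.0, 1.0],
    "yêu": [0.0, 1.0],
    "mèo": [1.0, 1.0],
}

sentence_a = ["tôi", "thích", "mèo"]
sentence_b = ["tôi", "yêu", "mèo"]

print(combined_similarity(sentence_a, sentence_b, TOY_EMBEDDINGS, dim=2))
```

In this toy case the two sentences differ in one word, so the lexical score penalizes the mismatch while the distributional score does not, because the two differing words were given identical toy vectors. In practice the vectors could come from a pretrained model such as PhoBERT, which the chapter's notes point to, though how the authors use it is not stated in this excerpt.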


Notes

1. https://github.com/VinAIResearch/PhoBERT
2. https://github.com/BuiTan/ViSentSim-1000
3. In this study, we use the Google API to translate Chinese, Laotian, and Khmer sentences into Vietnamese.


Acknowledgments

This paper is part of project number KC-4.0-12/19-25, led by Dr. Nguyen Van Vinh and funded by the Science and Technology Program KC 4.0.

Author information

Authors and Affiliations

University of Economic and Technical Industries, Hanoi, Vietnam
Van-Tan Bui
University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
Phuong-Thai Nguyen


Corresponding author

Correspondence to Van-Tan Bui.

Editor information

Editors and Affiliations

Computer science and Applications Department, LGIPM, University of Lorraine, Metz Cedex, France
Hoai An Le Thi
Laboratory of Mathematics, National Institute for Applied Sciences - Rouen, Saint-Etienne-du-Rouvray Cedex, France
Tao Pham Dinh
Computer science and Applications Department, LGIPM, University of Lorraine, Metz Cedex, France
Hoai Minh Le


About this paper

Cite this paper

Bui, VT., Nguyen, PT. (2022). Measuring Semantic Similarity of Vietnamese Sentences Based on Lexical and Distribution Similarity. In: Le Thi, H.A., Pham Dinh, T., Le, H.M. (eds) Modelling, Computation and Optimization in Information Systems and Management Sciences. MCO 2021. Lecture Notes in Networks and Systems, vol 363. Springer, Cham. https://doi.org/10.1007/978-3-030-92666-3_22


DOI: https://doi.org/10.1007/978-3-030-92666-3_22
Published: 08 December 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92665-6
Online ISBN: 978-3-030-92666-3
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
