SubTST: a consolidation of sub-word latent topics and sentence transformer in semantic representation

Dang, Binh; Le, Tung; Nguyen, Le-Minh

doi:10.1007/s10489-022-04184-x

Published: 11 October 2022

Applied Intelligence, volume 53, pages 13470–13487 (2023)

Abstract

Text understanding and representation play an important role in most applications, especially in automatic processing. Alongside the surface features of words, topic information is essential for supplying contextual meaning to a text representation. Yet the integration of linguistic features with topic information has received little close attention in recent work. To take advantage of topic information, we propose a novel approach, the Sub-word Latent Topic and Sentence Transformer (SubTST), which integrates topic features into the most popular language models. Inspired by Sentence-BERT and tBERT, the proposed architecture is designed to learn topic information and incorporate it with linguistic features. Its main strength comes from a careful combination of latent topic information with the linguistic features of language models, rather than from the use of topic information alone as in previous work. Experiments and ablation studies against competitive baselines show the strength of the proposed approach on most benchmark datasets in both Semantic Textual Similarity and Semantic Similarity Detection.
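
The abstract does not spell out how topic and linguistic features are combined, so the following is only a minimal Python sketch of the general idea, not the authors' architecture. It assumes that each sub-word already has a contextual vector and a latent topic distribution, that the two are joined by simple concatenation, and that sentences are encoded separately, mean-pooled, and compared with cosine similarity in the siamese style of Sentence-BERT. All function names and toy values are hypothetical.

```python
import math


def concat_topic_features(token_vecs, topic_vecs):
    """Append each sub-word's topic distribution to its contextual vector.

    Concatenation is an assumed fusion step chosen for illustration only.
    """
    if len(token_vecs) != len(topic_vecs):
        raise ValueError("one topic vector is needed per sub-word token")
    return [tok + top for tok, top in zip(token_vecs, topic_vecs)]


def mean_pool(vectors):
    """Average a list of equal-length vectors into one sentence vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_vector(token_vecs, topic_vecs):
    """Encode one sentence: fuse features per sub-word, then mean-pool."""
    return mean_pool(concat_topic_features(token_vecs, topic_vecs))


# Toy inputs: two sub-words per sentence, 2-d contextual vectors and
# 2-topic distributions. Real values would come from a language model
# and a topic model; these numbers are made up.
sent_a_tokens = [[0.2, 0.1], [0.4, 0.3]]
sent_a_topics = [[0.9, 0.1], [0.7, 0.3]]
sent_b_tokens = [[0.1, 0.2], [0.5, 0.2]]
sent_b_topics = [[0.8, 0.2], [0.6, 0.4]]

vec_a = sentence_vector(sent_a_tokens, sent_a_topics)
vec_b = sentence_vector(sent_b_tokens, sent_b_topics)
similarity = cosine(vec_a, vec_b)
print(f"similarity = {similarity:.3f}")
```

Each sentence is encoded independently here, so sentence vectors can be computed once and reused across many comparisons, which is the practical appeal of siamese encoders for similarity tasks.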

References

Liu H, Feng Y, Zhou M, Qiang B (2021) Semantic ranking structure preserving for cross-modal retrieval. Appl Intell 51(3):1802–1812. https://doi.org/10.1007/s10489-020-01930-x
O’Shea K, Crockett KA, Bandar Z, O’Shea J (2014) Erratum to: an approach to conversational agent design using semantic sentence similarity. Appl Intell 40(1):199. https://doi.org/10.1007/s10489-013-0488-7
Amara A, Taieb MAH, Aouicha MB (2021) Multilingual topic modeling for tracking COVID-19 trends based on facebook data analysis. Appl Intell 51(5):3052–3073. https://doi.org/10.1007/s10489-020-02033-3
Du X, Zhu R, Zhao F, Zhao F, Han P, Zhu Z (2020) A deceptive detection model based on topic, sentiment, and sentence structure information. Appl Intell 50(11):3868–3881. https://doi.org/10.1007/s10489-020-01779-0
Gao C, Ren J (2019) A topic-driven language model for learning to generate diverse sentences. Neurocomputing 333:374–380. https://doi.org/10.1016/j.neucom.2019.01.002
Qin Z, Thint M, Huang Z (2009) Ranking answers by hierarchical topic models. In: Chien B, Hong T, Chen S, Ali M (eds) Next-generation applied intelligence, 22nd international conference on industrial, engineering and other applications of applied intelligent systems, IEA/AIE 2009, Tainan, Taiwan, 24-27 June 2009. Proceedings. Lecture notes in computer science. Springer, vol 5579, pp 103–112. https://doi.org/10.1007/978-3-642-02568-6_11
Ovsjanikov M, Chen Y (2010) Topic modeling for personalized recommendation of volatile items. In: Balcázar JL, Bonchi F, Gionis A, Sebag M (eds) Machine learning and knowledge discovery in databases, european conference, ECML PKDD 2010, Barcelona, Spain, 20-24 September 2010, Proceedings, Part II. Lecture notes in computer science. Springer, vol 6322, pp 483–498. https://doi.org/10.1007/978-3-642-15883-4_31
Tran QH, Tran VD, Vu T, Nguyen M, Pham SB (2015) JAIST: combining multiple features for answer selection in community question answering. In: Cer DM, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 9th international workshop on semantic evaluation, SemEval@NAACL-HLT 2015, Denver, Colorado, USA. The association for computer linguistics, 4-5 June 2015, pp 215–219. https://doi.org/10.18653/v1/s15-2038
Dang TB, Nguyen H, Nguyen L (2020) Latent topic refinement based on distance metric learning and semantics-assisted non-negative matrix factorization. In: Nguyen ML, Luong MC, Song S (eds) Proceedings of the 34th pacific asia conference on language, information and computation, PACLIC 2020, Hanoi, Vietnam. Association for Computational Linguistics, 24-26 October 2020, pp 70–75. https://aclanthology.org/2020.paclic-1.8/. Accessed 06 Aug 2021
Wu G, Sheng Y, Lan M, Wu Y (2017) ECNU at semeval-2017 task 3: using traditional and deep learning methods to address community question answering task. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for Computational Linguistics, pp 365–369. https://doi.org/10.18653/v1/S17-2060
Peinelt N, Nguyen D, Liakata M (2020) tbert: topic models and BERT joining forces for semantic similarity detection. In: Jurafsky D, Chai J, Schluter N, Tetreault JR (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, 5-10 July 2020. Association for Computational Linguistics, pp 7047–7055. https://doi.org/10.18653/v1/2020.acl-main.630
Reimers N, Gurevych I (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In: Inui K, Jiang J, Ng V, Wan X (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3-7 November 2019. Association for computational linguistics, pp 3980–3990. https://doi.org/10.18653/v1/D19-1410
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T (eds) Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2-7 June 2019, vol 1 (long and short papers). Association for computational linguistics, pp 4171–4186. https://doi.org/10.18653/v1/n19-1423
Li B, Zhou H, He J, Wang M, Yang Y, Li L (2020) On the sentence embeddings from pre-trained language models. In: Webber B, Cohn T, He Y, Liu Y (eds) Proceedings of the 2020 conference on empirical methods in natural language processing, EMNLP 2020, Online, 16-20 November 2020. Association for computational linguistics, pp 9119–9130. https://doi.org/10.18653/v1/2020.emnlp-main.733
Su J, Cao J, Liu W, Ou Y (2021) Whitening sentence representations for better semantics and faster retrieval. arXiv:2103.15316
Yan Y, Li R, Wang S, Zhang F, Wu W, Xu W (2021) Consert: a contrastive framework for self-supervised sentence representation transfer. In: Zong C, Xia F, Li W, Navigli R (eds) Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, ACL/IJCNLP 2021, (vol 1: long papers), virtual event, 1-6 August 2021. Association for computational linguistics, pp 5065–5075. https://doi.org/10.18653/v1/2021.acl-long.393
Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
Miao Y, Grefenstette E, Blunsom P (2017) Discovering discrete latent topics with neural variational inference. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. Proceedings of machine learning research. PMLR, vol 70, pp 2410–2419. http://proceedings.mlr.press/v70/miao17a.html
Wang R, Zhou D, He Y (2019) ATM: Adversarial-neural topic model. Inf Process Manag 56(6). https://doi.org/10.1016/j.ipm.2019.102098
Choo J, Lee C, Reddy CK, Park H (2013) UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Trans Vis Comput Graph 19(12):1992–2001. https://doi.org/10.1109/TVCG.2013.212
Choo J, Lee C, Reddy CK, Park H (2015) Weakly supervised nonnegative matrix factorization for user-driven clustering. Data Min Knowl Discov 29(6):1598–1621. https://doi.org/10.1007/s10618-014-0384-8
Cheng X, Guo J, Liu S, Wang Y, Yan X (2013) Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In: Proceedings of the 13th SIAM international conference on data mining, 2-4 May 2013. Austin, Texas, USA. SIAM, pp 749–757. https://doi.org/10.1137/1.9781611972832.83
Wang Z, Wang C, Zhang H, Duan Z, Zhou M, Chen B (2020) Learning dynamic hierarchical topic graph with graph convolutional network for document classification. In: Chiappa S, Calandra R (eds) The 23rd international conference on artificial intelligence and statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy]. Proceedings of machine learning research, PMLR, vol 108, pp 3959–3969. http://proceedings.mlr.press/v108/wang20l.html
Zhang J, Li L, Way A, Liu Q (2016) Topic-informed neural machine translation. In: Calzolari N, Matsumoto Y, Prasad R (eds) COLING 2016, 26th international conference on computational linguistics, proceedings of the conference: technical papers, 11-16 December 2016, Osaka, Japan. ACL, pp 1807–1817. https://aclanthology.org/C16-1170/
Fu X, Wang J, Zhang J, Wei J, Yang Z (2020) Document summarization with VHTM: variational hierarchical topic-aware mechanism. pp 7740–7747
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, Lecun Y (eds) 3rd International conference on learning representations, ICLR 2015, San Diego, CA, USA, 7-9 May 2015, conference track proceedings. arXiv:1412.6980
Wang Z, Hamza W, Florian R (2017) Bilateral multi-perspective matching for natural language sentences. In: Sierra C (ed) Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI 2017, Melbourne, Australia, 19-25 August 2017, pp 4144–4150. ijcai.org. https://doi.org/10.24963/ijcai.2017/579
Dolan WB, Brockett C (2005) Automatically constructing a corpus of sentential paraphrases. In: Proceedings of the third international workshop on paraphrasing, IWP@IJCNLP 2005, Jeju Island, Korea, October 2005, 2005. Asian Federation of natural language processing, https://aclanthology.org/I05-5002/. Accessed 06 Aug 2021
Nakov P, Màrquez L, Magdy W, Moschitti A, Glass JR, Randeree B (2015) Semeval-2015 task 3: answer selection in community question answering. In: Cer DM, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 9th international workshop on semantic evaluation, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, 4-5 June 2015. The Association for computer linguistics, pp 269–281. https://doi.org/10.18653/v1/s15-2047
Nakov P, Màrquez L, Moschitti A, Magdy W, Mubarak H, Freihat AA, Glass JR, Randeree B (2016) Semeval-2016 task 3: community question answering. In: Bethard S, Cer DM, Carpuat M, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 10th international workshop on semantic evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, 16-17 June 2016, pp 525–545. The association for computer linguistics. https://doi.org/10.18653/v1/s16-1083
Nakov P, Hoogeveen D, Màrquez L, Moschitti A, Mubarak H, Baldwin T, Verspoor K (2017) Semeval-2017 task 3: community question answering. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for computational linguistics, pp 27–48. https://doi.org/10.18653/v1/S17-2003
Deriu J, Cieliebak M (2017) Swissalps at semeval-2017 task 3: attention-based convolutional neural network for community question answering. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for computational linguistics, pp 334–338. https://doi.org/10.18653/v1/S17-2054
Filice S, Martino GDS, Moschitti A (2017) Kelp at semeval-2017 task 3: learning pairwise patterns in community question answering. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for computational linguistics, pp 326–333. https://doi.org/10.18653/v1/S17-2053
Wang W, Bi B, Yan M, Wu C, Xia J, Bao Z, Peng L, Si L (2020) Structbert: incorporating language structures into pre-training for deep language understanding. In: 8th International conference on learning representations, ICLR 2020, addis ababa, ethiopia, 26-30 April 2020. Openreview.net. https://openreview.net/forum?id=BJgQ4lSFPH. Accessed 07 May 2020
He R, Ravula A, Kanagal B, Ainslie J (2021) Realformer: transformer likes residual attention. In: Zong C, Xia F, Li W, Navigli R (eds) Findings of the association for computational linguistics: ACL/IJCNLP 2021, Online Event, 1-6 August 2021. Findings of ACL, vol. ACL/IJCNLP 2021. Association for computational linguistics, pp 929–943. https://doi.org/10.18653/v1/2021.findings-acl.81
Humeau S, Shuster K, Lachaux M, Weston J (2020) Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring. In: 8th International conference on learning representations, ICLR 2020, addis ababa, ethiopia, 26-30 April 2020. Openreview.net, https://openreview.net/forum?id=SkxgnnNFvH. Accessed 30 July 2020
Agirre E, Cer DM, Diab MT, Gonzalez-Agirre A (2012) Semeval-2012 task 6: a pilot on semantic textual similarity. In: Agirre E, Bos J, Diab MT (eds) Proceedings of the 6th international workshop on semantic evaluation, SemEval@NAACL-HLT 2012, Montréal, Canada, 7-8 June 2012. The association for computer linguistics, pp 385–393. https://doi.org/10.5555/2387636.2387697. https://aclanthology.org/S12-1051/
Agirre E, Cer DM, Diab MT, Gonzalez-Agirre A, Guo W (2013) *sem 2013 shared task: semantic textual similarity. In: Diab MT, Baldwin T, Baroni M (eds) Proceedings of the second joint conference on lexical and computational semantics, *SEM 2013, 13-14 June 2013, Atlanta, Georgia, USA. association for computational linguistics, pp 32–43. https://aclanthology.org/S13-1004/. Accessed 06 Aug 2021
Agirre E, Banea C, Cardie C, Cer DM, Diab MT, Gonzalez-Agirre A, Guo W, Mihalcea R, Rigau G, Wiebe J (2014) Semeval-2014 task 10: multilingual semantic textual similarity. In: Nakov P, Zesch T (eds) Proceedings of the 8th international workshop on semantic evaluation, SemEval@COLING 2014, Dublin, Ireland, 23-24 August 2014. The association for computer linguistics, pp 81–91. https://doi.org/10.3115/v1/s14-2010
Agirre E, Banea C, Cardie C, Cer DM, Diab MT, Gonzalez-Agirre A, Guo W, Lopez-Gazpio I, Maritxalar M, Mihalcea R, Rigau G, Uria L, Wiebe J (2015) Semeval-2015 task 2: semantic textual similarity, english, spanish and pilot on interpretability. In: Cer DM, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 9th international workshop on semantic evaluation, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, 4-5 June 2015. The association for computer linguistics, pp 252–263. https://doi.org/10.18653/v1/s15-2045
Agirre E, Banea C, Cer DM, Diab MT, Gonzalez-Agirre A, Mihalcea R, Rigau G, Wiebe J (2016) Semeval-2016 task 1: semantic textual similarity, monolingual and cross-lingual evaluation. In: Bethard S, Cer DM, Carpuat M, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 10th international workshop on semantic evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, 16-17 June 2016. The association for computer linguistics, pp 497–511. https://doi.org/10.18653/v1/s16-1081
Cer DM, Diab MT, Agirre E, Lopez-Gazpio I, Specia L (2017) Semeval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for computational linguistics, pp 1–14. https://doi.org/10.18653/v1/S17-2001
Marelli M, Menini S, Baroni M, Bentivogli L, Bernardi R, Zamparelli R (2014) A SICK cure for the evaluation of compositional distributional semantic models. In: Calzolari N, Choukri K, Declerck T, Loftsson H, Maegaard B, Mariani J, Moreno A, Odijk J, Piperidis S (eds) Proceedings of the ninth international conference on language resources and evaluation, LREC 2014, Reykjavik, Iceland, 26-31 May 2014. European language resources association (ELRA), pp 216–223. http://www.lrec-conf.org/proceedings/lrec2014/summaries/363.html. Accessed 19 Aug 2019
Bowman SR, Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. In: Màrquez L, Callison-Burch C, Su J, Pighin D, Marton Y (eds) Proceedings of the 2015 conference on empirical methods in natural language processing, EMNLP 2015, Lisbon, Portugal, 17-21 September 2015. The association for computational linguistics, pp 632–642. https://doi.org/10.18653/v1/d15-1075
Williams A, Nangia N, Bowman SR (2018) A broad-coverage challenge corpus for sentence understanding through inference. In: Walker MA, Ji H, Stent A (eds) Proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, 1-6 June 2018, vol 1 (long papers). Association for computational linguistics, pp 1112–1122. https://doi.org/10.18653/v1/n18-1101

Author information

Authors and Affiliations

Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Binh Dang, Tung Le & Le-Minh Nguyen
Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
Tung Le
Vietnam National University, Ho Chi Minh City, Vietnam
Tung Le

Contributions

All authors contributed to the study’s conception and design. Writing – original draft, methodology, and conceptualization were performed by Binh Dang. Supervision, writing – review, and conceptualization were performed by Tung Le and Le-Minh Nguyen. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Le-Minh Nguyen.

Ethics declarations

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dang, B., Le, T. & Nguyen, LM. SubTST: a consolidation of sub-word latent topics and sentence transformer in semantic representation. Appl Intell 53, 13470–13487 (2023). https://doi.org/10.1007/s10489-022-04184-x

Accepted: 14 September 2022
Published: 11 October 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s10489-022-04184-x
