SubTST: a consolidation of sub-word latent topics and sentence transformer in semantic representation

Applied Intelligence

Abstract

Text understanding and representation play an important role in most applications, especially in automatic processing. Alongside the surface features of words, topic information is essential for supplying contextual meaning in text representation. However, the integration of linguistic features and topic information has not yet received close critical attention. To take advantage of topic information, we propose a novel approach that integrates topic features into popular language models, called the Sub-word Latent Topic and Sentence Transformer (SubTST). Inspired by Sentence-BERT and tBERT, the proposed architecture is designed to learn and incorporate topic information together with linguistic features. Its main strength comes from the careful combination of latent topic information with the linguistic features of language models, rather than using topic information alone as in previous work. Comparisons and ablation studies against competitive baselines demonstrate the strength of the proposed approach on most benchmark datasets in both Semantic Textual Similarity and Semantic Similarity Detection.
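
As a rough illustration of what combining topic information with linguistic features can look like, the sketch below mean-pools hypothetical per-sub-word embeddings and per-sub-word topic distributions, concatenates the two pooled vectors, and compares two sentences by cosine similarity. It is not the SubTST architecture: the function names, the toy vectors, the pooling choice, and the use of plain concatenation are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(rows: Sequence[Sequence[float]]) -> Vector:
    """Average equal-length per-sub-word vectors into a single vector."""
    if not rows:
        raise ValueError("need at least one sub-word vector")
    width = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(width)]


def combine(token_embeddings: Sequence[Sequence[float]],
            subword_topics: Sequence[Sequence[float]]) -> Vector:
    """Concatenate pooled linguistic features with pooled topic features.

    Assumption: plain concatenation stands in for whatever fusion a
    trained model would actually learn.
    """
    return mean_pool(token_embeddings) + mean_pool(subword_topics)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy inputs (hypothetical values): two sub-words per sentence,
# 3-dimensional embeddings and 2-topic distributions per sub-word.
sent_a = combine([[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]],
                 [[0.8, 0.2], [0.6, 0.4]])
sent_b = combine([[2.0, 1.0, 1.0], [2.0, 1.0, 1.0]],
                 [[0.1, 0.9], [0.3, 0.7]])

print(round(cosine(sent_a, sent_b), 3))
```

In this toy setup the two sentences share identical pooled embeddings and differ only in their topic vectors, so any gap from a similarity of 1 comes entirely from the topic part of the representation.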

Notes

  1. https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout

  2. https://github.com/google-research/bert

  3. https://github.com/binhdt95/SubTST

  4. https://alt.qcri.org/semeval2017/task3/

  5. https://github.com/UKPLab/sentence-transformers

  6. https://github.com/wuningxi/tBERT

  7. https://spacy.io/usage/linguistic-features

  8. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

References

  1. Liu H, Feng Y, Zhou M, Qiang B (2021) Semantic ranking structure preserving for cross-modal retrieval. Appl Intell 51(3):1802–1812. https://doi.org/10.1007/s10489-020-01930-x

  2. O’Shea K, Crockett KA, Bandar Z, O’Shea J (2014) Erratum to: an approach to conversational agent design using semantic sentence similarity. Appl Intell 40(1):199. https://doi.org/10.1007/s10489-013-0488-7

  3. Amara A, Taieb MAH, Aouicha MB (2021) Multilingual topic modeling for tracking COVID-19 trends based on facebook data analysis. Appl Intell 51(5):3052–3073. https://doi.org/10.1007/s10489-020-02033-3

  4. Du X, Zhu R, Zhao F, Zhao F, Han P, Zhu Z (2020) A deceptive detection model based on topic, sentiment, and sentence structure information. Appl Intell 50(11):3868–3881. https://doi.org/10.1007/s10489-020-01779-0

  5. Gao C, Ren J (2019) A topic-driven language model for learning to generate diverse sentences. Neurocomputing 333:374–380. https://doi.org/10.1016/j.neucom.2019.01.002

  6. Qin Z, Thint M, Huang Z (2009) Ranking answers by hierarchical topic models. In: Chien B, Hong T, Chen S, Ali M (eds) Next-generation applied intelligence, 22nd international conference on industrial, engineering and other applications of applied intelligent systems, IEA/AIE 2009, Tainan, Taiwan, 24-27 June 2009. Proceedings. Lecture notes in computer science. Springer, vol 5579, pp 103–112. https://doi.org/10.1007/978-3-642-02568-6_11

  7. Ovsjanikov M, Chen Y (2010) Topic modeling for personalized recommendation of volatile items. In: Balcázar JL, Bonchi F, Gionis A, Sebag M (eds) Machine learning and knowledge discovery in databases, european conference, ECML PKDD 2010, Barcelona, Spain, 20-24 September 2010, Proceedings, Part II. Lecture notes in computer science. Springer, vol 6322, pp 483–498. https://doi.org/10.1007/978-3-642-15883-4_31

  8. Tran QH, Tran VD, Vu T, Nguyen M, Pham SB (2015) JAIST: combining multiple features for answer selection in community question answering. In: Cer DM, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 9th international workshop on semantic evaluation, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, 4-5 June 2015. The association for computer linguistics, pp 215–219. https://doi.org/10.18653/v1/s15-2038

  9. Dang TB, Nguyen H, Nguyen L (2020) Latent topic refinement based on distance metric learning and semantics-assisted non-negative matrix factorization. In: Nguyen ML, Luong MC, Song S (eds) Proceedings of the 34th pacific asia conference on language, information and computation, PACLIC 2020, Hanoi, Vietnam, 24-26 October 2020. Association for Computational Linguistics, pp 70–75. https://aclanthology.org/2020.paclic-1.8/. Accessed 06 Aug 2021

  10. Wu G, Sheng Y, Lan M, Wu Y (2017) ECNU at semeval-2017 task 3: using traditional and deep learning methods to address community question answering task. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for Computational Linguistics, pp 365–369. https://doi.org/10.18653/v1/S17-2060

  11. Peinelt N, Nguyen D, Liakata M (2020) tbert: topic models and BERT joining forces for semantic similarity detection. In: Jurafsky D, Chai J, Schluter N, Tetreault JR (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, 5-10 July 2020. Association for Computational Linguistics, pp 7047–7055. https://doi.org/10.18653/v1/2020.acl-main.630

  12. Reimers N, Gurevych I (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In: Inui K, Jiang J, Ng V, Wan X (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3-7 November 2019. Association for computational linguistics, pp 3980–3990. https://doi.org/10.18653/v1/D19-1410

  13. Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T (eds) Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2-7 June 2019, vol 1 (long and short papers). Association for computational linguistics, pp 4171–4186. https://doi.org/10.18653/v1/n19-1423

  14. Li B, Zhou H, He J, Wang M, Yang Y, Li L (2020) On the sentence embeddings from pre-trained language models. In: Webber B, Cohn T, He Y, Liu Y (eds) Proceedings of the 2020 conference on empirical methods in natural language processing, EMNLP 2020, Online, 16-20 November 2020. Association for computational linguistics, pp 9119–9130. https://doi.org/10.18653/v1/2020.emnlp-main.733

  15. Su J, Cao J, Liu W, Ou Y (2021) Whitening sentence representations for better semantics and faster retrieval. arXiv:2103.15316

  16. Yan Y, Li R, Wang S, Zhang F, Wu W, Xu W (2021) Consert: a contrastive framework for self-supervised sentence representation transfer. In: Zong C, Xia F, Li W, Navigli R (eds) Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, ACL/IJCNLP 2021, (vol 1: long papers), virtual event, 1-6 August 2021. Association for computational linguistics, pp 5065–5075. https://doi.org/10.18653/v1/2021.acl-long.393

  17. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022

  18. Miao Y, Grefenstette E, Blunsom P (2017) Discovering discrete latent topics with neural variational inference. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. Proceedings of machine learning research. PMLR, vol 70, pp 2410–2419. http://proceedings.mlr.press/v70/miao17a.html

  19. Wang R, Zhou D, He Y (2019) ATM: Adversarial-neural topic model. Inf Process Manag 56(6). https://doi.org/10.1016/j.ipm.2019.102098

  20. Choo J, Lee C, Reddy CK, Park H (2013) UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Trans Vis Comput Graph 19(12):1992–2001. https://doi.org/10.1109/TVCG.2013.212

  21. Choo J, Lee C, Reddy CK, Park H (2015) Weakly supervised nonnegative matrix factorization for user-driven clustering. Data Min Knowl Discov 29(6):1598–1621. https://doi.org/10.1007/s10618-014-0384-8

  22. Cheng X, Guo J, Liu S, Wang Y, Yan X (2013) Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In: Proceedings of the 13th SIAM international conference on data mining, 2-4 May 2013. Austin, Texas, USA. SIAM, pp 749–757. https://doi.org/10.1137/1.9781611972832.83

  23. Wang Z, Wang C, Zhang H, Duan Z, Zhou M, Chen B (2020) Learning dynamic hierarchical topic graph with graph convolutional network for document classification. In: Chiappa S, Calandra R (eds) The 23rd international conference on artificial intelligence and statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy]. Proceedings of machine learning research, PMLR, vol 108, pp 3959–3969. http://proceedings.mlr.press/v108/wang20l.html

  24. Zhang J, Li L, Way A, Liu Q (2016) Topic-informed neural machine translation. In: Calzolari N, Matsumoto Y, Prasad R (eds) COLING 2016, 26th international conference on computational linguistics, proceedings of the conference: technical papers, 11-16 December 2016, Osaka, Japan. ACL, pp 1807–1817. https://aclanthology.org/C16-1170/

  25. Fu X, Wang J, Zhang J, Wei J, Yang Z (2020) Document summarization with VHTM: variational hierarchical topic-aware mechanism. pp 7740–7747

  26. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, Lecun Y (eds) 3rd International conference on learning representations, ICLR 2015, San Diego, CA, USA, 7-9 May 2015, conference track proceedings. arXiv:1412.6980

  27. Wang Z, Hamza W, Florian R (2017) Bilateral multi-perspective matching for natural language sentences. In: Sierra C (ed) Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI 2017, Melbourne, Australia, 19-25 August 2017. ijcai.org, pp 4144–4150. https://doi.org/10.24963/ijcai.2017/579

  28. Dolan WB, Brockett C (2005) Automatically constructing a corpus of sentential paraphrases. In: Proceedings of the third international workshop on paraphrasing, IWP@IJCNLP 2005, Jeju Island, Korea, October 2005. Asian Federation of Natural Language Processing. https://aclanthology.org/I05-5002/. Accessed 06 Aug 2021

  29. Nakov P, Màrquez L, Magdy W, Moschitti A, Glass JR, Randeree B (2015) Semeval-2015 task 3: answer selection in community question answering. In: Cer DM, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 9th international workshop on semantic evaluation, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, 4-5 June 2015. The Association for computer linguistics, pp 269–281. https://doi.org/10.18653/v1/s15-2047

  30. Nakov P, Màrquez L, Moschitti A, Magdy W, Mubarak H, Freihat AA, Glass JR, Randeree B (2016) Semeval-2016 task 3: community question answering. In: Bethard S, Cer DM, Carpuat M, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 10th international workshop on semantic evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, 16-17 June 2016. The association for computer linguistics, pp 525–545. https://doi.org/10.18653/v1/s16-1083

  31. Nakov P, Hoogeveen D, Màrquez L, Moschitti A, Mubarak H, Baldwin T, Verspoor K (2017) Semeval-2017 task 3: community question answering. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for computational linguistics, pp 27–48. https://doi.org/10.18653/v1/S17-2003

  32. Deriu J, Cieliebak M (2017) Swissalps at semeval-2017 task 3: attention-based convolutional neural network for community question answering. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for computational linguistics, pp 334–338. https://doi.org/10.18653/v1/S17-2054

  33. Filice S, Martino GDS, Moschitti A (2017) Kelp at semeval-2017 task 3: learning pairwise patterns in community question answering. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for computational linguistics, pp 326–333. https://doi.org/10.18653/v1/S17-2053

  34. Wang W, Bi B, Yan M, Wu C, Xia J, Bao Z, Peng L, Si L (2020) Structbert: incorporating language structures into pre-training for deep language understanding. In: 8th International conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, 26-30 April 2020. OpenReview.net. https://openreview.net/forum?id=BJgQ4lSFPH. Accessed 07 May 2020

  35. He R, Ravula A, Kanagal B, Ainslie J (2021) Realformer: transformer likes residual attention. In: Zong C, Xia F, Li W, Navigli R (eds) Findings of the association for computational linguistics: ACL/IJCNLP 2021, Online Event, 1-6 August 2021. Findings of ACL, vol. ACL/IJCNLP 2021. Association for computational linguistics, pp 929–943. https://doi.org/10.18653/v1/2021.findings-acl.81

  36. Humeau S, Shuster K, Lachaux M, Weston J (2020) Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring. In: 8th International conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, 26-30 April 2020. OpenReview.net. https://openreview.net/forum?id=SkxgnnNFvH. Accessed 30 July 2020

  37. Agirre E, Cer DM, Diab MT, Gonzalez-Agirre A (2012) Semeval-2012 task 6: a pilot on semantic textual similarity. In: Agirre E, Bos J, Diab MT (eds) Proceedings of the 6th international workshop on semantic evaluation, SemEval@NAACL-HLT 2012, Montréal, Canada, 7-8 June 2012. The association for computer linguistics, pp 385–393. https://doi.org/10.5555/2387636.2387697. https://aclanthology.org/S12-1051/

  38. Agirre E, Cer DM, Diab MT, Gonzalez-Agirre A, Guo W (2013) *sem 2013 shared task: semantic textual similarity. In: Diab MT, Baldwin T, Baroni M (eds) Proceedings of the second joint conference on lexical and computational semantics, *SEM 2013, 13-14 June 2013, Atlanta, Georgia, USA. Association for computational linguistics, pp 32–43. https://aclanthology.org/S13-1004/. Accessed 06 Aug 2021

  39. Agirre E, Banea C, Cardie C, Cer DM, Diab MT, Gonzalez-Agirre A, Guo W, Mihalcea R, Rigau G, Wiebe J (2014) Semeval-2014 task 10: multilingual semantic textual similarity. In: Nakov P, Zesch T (eds) Proceedings of the 8th international workshop on semantic evaluation, SemEval@COLING 2014, Dublin, Ireland, 23-24 August 2014. The association for computer linguistics, pp 81–91. https://doi.org/10.3115/v1/s14-2010

  40. Agirre E, Banea C, Cardie C, Cer DM, Diab MT, Gonzalez-Agirre A, Guo W, Lopez-Gazpio I, Maritxalar M, Mihalcea R, Rigau G, Uria L, Wiebe J (2015) Semeval-2015 task 2: semantic textual similarity, english, spanish and pilot on interpretability. In: Cer DM, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 9th international workshop on semantic evaluation, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, 4-5 June 2015. The association for computer linguistics, pp 252–263. https://doi.org/10.18653/v1/s15-2045

  41. Agirre E, Banea C, Cer DM, Diab MT, Gonzalez-Agirre A, Mihalcea R, Rigau G, Wiebe J (2016) Semeval-2016 task 1: semantic textual similarity, monolingual and cross-lingual evaluation. In: Bethard S, Cer DM, Carpuat M, Jurgens D, Nakov P, Zesch T (eds) Proceedings of the 10th international workshop on semantic evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, 16-17 June 2016. The association for computer linguistics, pp 497–511. https://doi.org/10.18653/v1/s16-1081

  42. Cer DM, Diab MT, Agirre E, Lopez-Gazpio I, Specia L (2017) Semeval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In: Bethard S, Carpuat M, Apidianaki M, Mohammad SM, Cer DM, Jurgens D (eds) Proceedings of the 11th international workshop on semantic evaluation, SemEval@ACL 2017, Vancouver, Canada, 3-4 August 2017. Association for computational linguistics, pp 1–14. https://doi.org/10.18653/v1/S17-2001

  43. Marelli M, Menini S, Baroni M, Bentivogli L, Bernardi R, Zamparelli R (2014) A SICK cure for the evaluation of compositional distributional semantic models. In: Calzolari N, Choukri K, Declerck T, Loftsson H, Maegaard B, Mariani J, Moreno A, Odijk J, Piperidis S (eds) Proceedings of the ninth international conference on language resources and evaluation, LREC 2014, Reykjavik, Iceland, 26-31 May 2014. European language resources association (ELRA), pp 216–223. http://www.lrec-conf.org/proceedings/lrec2014/summaries/363.html. Accessed 19 Aug 2019

  44. Bowman SR, Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. In: Màrquez L, Callison-Burch C, Su J, Pighin D, Marton Y (eds) Proceedings of the 2015 conference on empirical methods in natural language processing, EMNLP 2015, Lisbon, Portugal, 17-21 September 2015. The association for computational linguistics, pp 632–642. https://doi.org/10.18653/v1/d15-1075

  45. Williams A, Nangia N, Bowman SR (2018) A broad-coverage challenge corpus for sentence understanding through inference. In: Walker MA, Ji H, Stent A (eds) Proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, 1-6 June 2018, vol 1 (long papers). Association for computational linguistics, pp 1112–1122. https://doi.org/10.18653/v1/n18-1101

Author information

Contributions

All authors contributed to the study’s conception and design. Writing – original draft, methodology, and conceptualization were performed by Binh Dang. Supervision, writing – review, and conceptualization were performed by Tung Le and Le-Minh Nguyen. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Le-Minh Nguyen.

Ethics declarations

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dang, B., Le, T. & Nguyen, LM. SubTST: a consolidation of sub-word latent topics and sentence transformer in semantic representation. Appl Intell 53, 13470–13487 (2023). https://doi.org/10.1007/s10489-022-04184-x

  • DOI: https://doi.org/10.1007/s10489-022-04184-x
