1 Introduction

The rapid advancement of communication and information technologies has produced a huge volume of data of many varieties from many sources. In recent years, internet media (e.g. blogs), information websites (e.g. Wikipedia), and the ubiquitous Social Media Platforms (SMP) such as Facebook, Weibo, and Twitter have become the major sources of this massive information (Ediger et al. 2010). Extracting and analyzing this information is a pivotal area of study for research and business entities alike (Ruths and Pfeffer 2014). The analysis of this data supports tasks such as product/service opinion mining, market analysis, trend analysis, and event detection (Malleson and Birkin 2012). Moreover, extracting topics can help detect events such as natural disasters so that responders can act rapidly and mitigate their impact (Kraft et al. 2013; Earle et al. 2011; Oh et al. 2010), support political parties (Tumasjan et al. 2010), and help companies and organizations understand customers' opinions about their brands and improve content marketing through a better understanding of customer requirements (Ren and Wu 2013). Topic Modeling (TM) is the process of automatically discovering the latent (hidden) thematic structure of a set of documents or short texts; it facilitates new ways to browse and summarize large text archives as topics (Nikolenko et al. 2017). It also aids in organizing, understanding, and summarizing large collections of data under specialized topic labels.

Many extensive research studies have focused on long text TM. Traditional long text TM models such as Latent Dirichlet Allocation (LDA) (Blei et al. 2003), Latent Semantic Analysis (LSA) (Deerwester et al. 1990), Probabilistic Latent Semantic Analysis (PLSA) (Hofmann 1999), and Non-negative Matrix Factorization (NMF) (Lee and Seung 2001) are popular for discovering latent semantic topic structures in long texts, and they do not require prior annotation or labelling. These traditional topic models treat each document as a mixture of themes (topics) that are distributed over the text, and they employ statistical approaches such as variational techniques and Gibbs sampling to infer the topics of each document from higher-order word co-occurrence patterns (Ostrowski 2015). Recently, Short Text (ST) on social media networks has been gaining interest because it captures people's raw and direct thoughts. However, modelling topics in such short texts is challenging, mainly because of their short length and limited number of words. The scarce content of short texts makes it harder to find co-occurring topic patterns, so applying long text TMs to short texts yields poor performance due to the absence of word co-occurrence (Abdel-Hafez and Xu 2013). Therefore, the research community has focused on addressing the data sparsity problem in short texts, such as Twitter data, to provide efficient topic modelling.

Earlier studies on short text TM employed the traditional topic models but added external metadata sources for knowledge extraction to formulate word co-occurrences. Some studies extracted latent topics from long documents and used them to infer short text topics. However, these models failed to achieve the expected results, both because of the limited availability of external metadata sources and because of their high deployment costs. Hence, the research community started to consider specially designed short text topic models instead of conventional topic models. This motivates us to aggregate such studies and provide a holistic taxonomy that can help the research community develop new, efficient TM approaches for short text on social media.

This article presents a holistic survey, comparative analysis, and expanded taxonomy of the most recent and ever-growing efficient TM approaches in social media. It focuses on most aspects of Traditional Short Text Topic Modeling (TSTTM) models (probabilistic models, matrix factorization (non-probabilistic) models, exemplar-based models, clustering techniques, dynamic-based categories, data-source-based, labelling-based, word-type, application-based, and Frequent Pattern Mining (FPM) techniques) and Advanced Short Text Topic Modeling (ASTTM) models (Dirichlet Multinomial Mixture (DMM) based models, global word co-occurrence based models, self-aggregation based models, and Deep Learning Topic Modeling (DLTM) based models). Although various surveys of short text topic modelling have already been proposed in the literature, our work provides a more comprehensive and holistic survey and taxonomy along with qualitative and quantitative analysis, as well as a comparative empirical analysis of the most efficient recently proposed TM approaches.

1.1 Paper contributions

In general, our contributions in this taxonomy and analysis article can be stated as follows:

  • A comprehensive review of the various existing STTM models.

  • A taxonomy of existing TSTTM and ASTTM models.

  • A qualitative analysis of the STTM models based on their strengths and weaknesses.

  • A review and quantitative analysis of the datasets used by STTM models.

  • A review and summary of useful software tools and open-source libraries for STTM.

  • A quantitative analysis of the STTM models based on category, publication year, and evaluation metrics.

  • A comparative study of ten TSTTM and ASTTM models to evaluate their performance through an experimental analysis based on two real-world datasets: the Real-World Pandemic Twitter dataset (RW-Pand-Twitter) and Real-World Cyberbullying Twitter dataset (RW-CB-Twitter).

  • An overview of the open challenges and the future research directions in this promising field.

This study extends the existing surveys by providing a detailed, up-to-date, and comprehensive review and taxonomy, which helps researchers understand and utilize the key elements of STTM. Moreover, it helps identify the limitations of currently available STTM techniques, open research issues, and challenges, from which researchers can decide their future research directions.

1.2 Paper organization

This paper is organized as follows. The recently proposed surveys on topic modelling and short text topic modelling are discussed in Sect. 2. The STTM problem is formulated in Sect. 3. The topic modelling process flow, the prominent traditional topic models applied to short texts, and the advanced STTM methods are described in Sect. 4. Then, Sect. 5 reviews the existing datasets used by STTM along with a qualitative analysis. Section 6 summarizes and reviews the tools and open-source libraries for topic modelling. A quantitative analysis of the literature is presented in Sect. 7, while Sect. 8 describes the datasets used for the experiments, the common evaluation metrics, and the experimental results of the prominent methods. The overall observations from the qualitative, quantitative, and comparative analyses are discussed in Sect. 9. Section 10 highlights the open challenges and suggestions for future directions of STTM. Finally, Sect. 11 concludes the article.

2 Existing surveys

Many surveys and review articles have been proposed for both long text and short text TM approaches. In this section, the recently proposed surveys on topic modelling for short text are reviewed, analyzed, and compared with our suggested taxonomy. To this end, Li and Lei (2021), Vayansky and Kumar (2020), and Kherwa and Bansal (2020) are the most recent articles that survey and analyze the prominent research studies on topic models. Alghamdi and Alfalqi (2015) conducted a survey of TM in text mining, but only the common TM methods are presented. Jelisavčić et al. (2012) reviewed the eminent probabilistic TMs to provide prospective inspiration for research directions. Xia et al. (2019) surveyed TM by classifying the models into traditional (probabilistic) and hybrid topic modelling. However, as recent research shows, short text topic modelling is only efficient through specialized strategies and techniques, which distinguish it from long text models.

Several studies have focused on the evolution of short text topic models, and most studies on sentiment analysis and event detection from social media data are related to topic modelling. Stieglitz et al. (2018) presented an elaborate analysis of social media analytics, including topic discovery models. The authors addressed the research gap in data discovery, collection, and processing, and also elaborated on analytics techniques for processing large volumes of social media data. Hasan et al. (2018) conducted a survey of real-time Twitter event detection techniques, most of which operate similarly to topic models.

Ibrahim et al. (2018) surveyed topic detection models for tweets and evaluated their performance. The authors divided the techniques into five classes based on their functioning properties and discussed the most prominent research techniques. Although the article covered all major techniques, it focused mainly on the topic models popular for tweet streams; the advanced STTM techniques and the data-source-based, labelling-based, word-type, and application-based models are not covered. Mottaghinia et al. (2020) explored various techniques for extracting and discovering topics from tweets and grouped the topic detection techniques into four categories. Kaur and Bansal (2019) surveyed topic extraction techniques on Twitter, reviewing the techniques suggested by experts for the reliable collection of information; however, they considered only techniques that use attribute-based extraction, which is efficient only in limited applications. Likhitha et al. (2019) also conducted a detailed survey of topic models for short texts in documents through semantic structure detection. Many different strategies were reviewed, but the survey adds little because most of the techniques covered are the same as those of the long text topic models.

Qiang et al. (2020) conducted a comparative survey of short text topic modelling techniques and analysed their performance and applications. The authors categorized topic models into three categories and evaluated them on benchmark datasets. They also developed a comprehensive open-source library for short text topic modelling in Java that combines all the surveyed methods and their evaluations, with ample scope for including future techniques. However, compared to our survey, that survey did not cover all the ASTTM techniques (it covered only 8 of 62 methods), nor did it perform a quantitative analysis of the literature.

Nugroho et al. (2020) provided yet another significant survey of topic models for Twitter data under different feature-based strategies and evaluated them on different datasets. The authors identified the main algorithms and provided an extensive analysis of Twitter streams, focusing mainly on content-, social-, and temporal-feature-based topics. Similarly, Dutta et al. (2020) conducted a survey exploiting spatiotemporal topic models for Twitter data. That survey gives brief descriptions of almost all the recent topic models for Twitter, focusing on spatiotemporal topic analysis; however, it contains only a limited number of reviews owing to the sparse number of relevant articles. Burkhardt and Kramer (2019a) surveyed multi-label topic modelling by grouping methods according to their variant dimensions, summarizing the most widely used multi-label topic models from a variety of studies conducted for this purpose. Albalawi et al. (2020) provided a survey covering topic modelling along with its tools, applications, and methods. The authors compared and tested five common TM methods, namely NMF, LSA, LDA, Principal Component Analysis (PCA), and random projection, and applied them to short-text data to assess how well they extract topics.

For nearly two decades, TM has been a successful text analysis technique. Integrating this technique with Deep Neural Networks (DNN) has opened a new and growing research area. Neural Topic Models (NTM) have emerged with various architectures and a wide range of NLP applications such as text generation and summarization. Zhao et al. (2021) presented a taxonomy and comprehensive survey of NTM; this taxonomy classifies NTMs by their backbone framework to help researchers explore this fast-growing research area. Recently, NTM has attracted more attention since it combines the benefits of probabilistic topic models and neural networks. Several works following this approach have been proposed in the literature and have shown their effectiveness compared to traditional probabilistic models. Doan and Hoang (2021) presented an empirical analysis of some state-of-the-art NTMs over large, diverse datasets in terms of a set of metrics. Specifically, they analyzed the performance of these models on three tasks: (1) uncovering cohesive topics, (2) modelling the input documents, and (3) representing documents for downstream classification. The evaluation results showed that neural topic models are effective for the first and third tasks, whereas in many cases the traditional probabilistic models are still effective for the second task. These findings and insights enable researchers to select off-the-shelf topic modelling toolboxes more easily in various contexts. Table 1 depicts a comparative analysis of the existing surveys on short text topic modelling.

Table 1 Comparative analysis on the existing surveys for STTM

As Table 1 indicates, there are numerous existing reviews and surveys of short text topic modelling, but they do not take into account all the elements of STTM covered in this taxonomy, which is distinct from the existing surveys in being holistic, extensive, and up-to-date. It comprehensively reviews and investigates both TSTTM and ASTTM models along with their different sub-categories, and analyzes them based on their performance and respective strengths and weaknesses. This taxonomy presents a survey and qualitative analysis of the datasets used in the literature to evaluate STTM models. Besides, it summarizes and reviews useful software tools and open-source libraries for STTM. Further, it provides a quantitative analysis of the STTM models by publication year, sub-category, and platform, as well as a quantitative analysis of the datasets used by STTM by language and data source. Moreover, it compares some of the prominent techniques and evaluates their performance on two real-world Twitter datasets (RW-Pand-Twitter and RW-CB-Twitter). Finally, it highlights the challenges and open issues related to STTM techniques, which helps researchers identify efficient and inefficient topic modelling techniques along with their merits and demerits. This survey is intended to make it easier to learn cutting-edge methods and to identify potential research gaps, allowing researchers to choose their research directions and paving the way for proposing new techniques in the future.

3 Problem formulation

Let \({\mathcal{T}}_{x}\) denote a topic detection algorithm, where the subscript \(x\) indicates a particular algorithm; i.e., when \(x = ALG\), \({\mathcal{T}}_{ALG}\) represents any STTM model. Let \({\mathcal{Q}}\) represent the optimal quality of the topics discovered by \({\mathcal{T}}_{ALG}\), \(A\) the optimal topic accuracy, \({\mathcal{P}}\) the optimal precision of the discovered topics, and \({\mathcal{R}}\) their optimal recall. Let us assume that algorithm \(ALG\) is available and always finds topics of optimal quality. The short text topic modelling problem is formulated as follows:

Given a set of \(m\) short text social media documents (posts), \(D = \left\{ {d_{1} ,d_{2} , \ldots ,d_{m} } \right\}\), a vocabulary \(W\) of size \(V\), and \(K\) pre-defined latent topics, where each short text \(d_{i} \in D, 1 \le i \le m\) is an unstructured post that may be labelled or unlabelled. The size of each short text \(d_{i} \in D\) is given as:

$$ S_{i} = \mathop \sum \limits_{k = 1}^{n} \mathop \sum \limits_{j = 1}^{z} c_{kj} $$
(1)

Note that each short text (post) \(d_{i} \in D\) is represented by a set of \(n\) words, \( d_{i} = \left\{ {w_{i1} ,w_{i2} , \cdots ,w_{in} } \right\} \) such that each word \( w_{ik} \in d_{i } ,1 \le k \le n\) is composed of \(z\) characters \( w_{ik} = \left\{ {c_{k1} ,c_{k2} , \ldots ,c_{kz} } \right\}\).

A topic \(\varphi\) in a specific collection \(D\) is represented by a multinomial distribution over the vocabulary \(W\), i.e., \(\left\{ {p\left( {w{|}\varphi } \right)} \right\}_{w \in W}\). The topic representation of a document or short text \(d\), denoted \(\theta_{d}\), is a multinomial distribution over the \(K\) topics, i.e., \(\left\{ {p\left( {\varphi_{k} {|}\theta_{d} } \right)} \right\}_{k = 1, \ldots ,K}\).

The STTM problem is to represent the short texts \(D\) as a set of \(K\) topics \(\left\{ {\varphi_{k} } \right\}_{k = 1, \ldots ,K}\) under the following constraints:

$$ S_{min} \le \left| {S_{i} } \right| \le S_{max} $$
(2)
$$ \left| d \right|_{{\mathcal{W}}} \approx {\mathcal{T}}_{{ALG,{\mathcal{W}}}} , {\mathcal{W}} \in \left\{ {{\mathcal{Q}}{ },A,{ }{\mathcal{P}},{ \mathcal{R}}} \right\} $$
(3)

The primary objectives of STTM for a given set of \(m\) short text posts \(D\) can be stated as follows: learn how the topics \(\varphi\) are represented over words, and learn the sparse topic representation \(\theta\) of the short texts (posts). Constraint (2) indicates that the size of each tweet or post should be no less than the minimum character limit (\(S_{min}\)) and no greater than the maximum character limit (\(S_{max}\)) of a short text (post). In our case, \(S_{max}\) is 280 characters, including blank spaces; the minimum size can be set depending on the quality of the short texts received. Constraint (2) is formulated specifically for Twitter social media data. Constraint (3) deals with the optimality of the four metrics, namely the quality, accuracy, precision, and recall of the discovered topics.

4 Topic modelling

This section presents the general topic modelling process flow, as depicted in Fig. 1, and the taxonomy of STTM models, in which STTM models are broadly classified into two main categories, Traditional Short Text Topic Modeling (TSTTM) and Advanced Short Text Topic Modeling (ASTTM) models, as depicted in Fig. 2. The TSTTM models are discussed in Sect. 4.2, and the ASTTM models are discussed in Sect. 4.3.

Fig.1
figure 1

Topic modelling process flow

Fig. 2
figure 2

Taxonomy on STTM models

4.1 Topic modelling process flow

Social media has become a significant information source; such information comes in the form of tweets or posts, which are short in nature. Discovering the potential topics in such tweets and posts is important for many natural language processing (NLP) tasks, such as automatic summarization, document classification, content analysis, emerging topic detection, question answering, sentiment analysis, recommendation systems, and information retrieval. Topic modelling (TM) is a key technique that has been used for extracting knowledge and latent topics from short texts in social media. Generally, TM can be defined as the process of automatically extracting and identifying topics from short texts.

A typical approach to deriving topics from short text involves three key sub-tasks: (i) data acquisition, (ii) pre-processing, and (iii) applying a topic modelling method. The initial phase, as shown in Fig. 1, is to collect unstructured and semi-structured data from data sources such as Twitter. The dataset consists of tweets collected on multiple topics of interest, for example via the Twitter streaming API using Python with the tweepy package. Next, baseline pre-processing is applied to clean the data using toolkits such as the NLTK Python package (Anil Phand and Chakkarwar 2018), which provides stop-word and punctuation removal, tokenization, lemmatization, stemming, n-gram identification, lowercase transformation, and other pre-processing and data cleansing steps (Murshed et al. 2021). Finally, the TM method is applied to extract a set of recurrent themes (topics) that run through the collection of posts and the extent to which each post reflects those themes. The output of TM is a set of topics which can be further used to explore, visualize, and summarize posts and tweets.
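To make these sub-tasks concrete, the sketch below shows a minimal version of the pre-processing stage using the NLTK toolkit mentioned above; the example tweet and the helper function name are hypothetical, and real pipelines typically add further steps such as emoji handling and n-gram detection.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

nltk.download("stopwords")
nltk.download("wordnet")

def preprocess_tweet(tweet):
    """Clean a raw tweet and return a list of informative tokens."""
    tweet = tweet.lower()                              # lowercase transformation
    tweet = re.sub(r"http\S+|@\w+|#", " ", tweet)      # drop URLs, mentions, and '#'
    tokens = TweetTokenizer().tokenize(tweet)          # tweet-aware tokenization
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop]          # stop-word/punctuation removal

raw_tweets = ["Flooding reported downtown, stay safe everyone! http://example.com #flood"]
corpus = [preprocess_tweet(t) for t in raw_tweets]
print(corpus)   # e.g. [['flooding', 'reported', 'downtown', ...]]
```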

4.2 TSTTM models

Generalized topic models were developed primarily for extracting latent topics from long text documents. Nevertheless, many studies applied them to STTM (Niyogi and Pal 2019; Shirolkar and Deshmukh 2019; Hidayatullah et al. 2019). Others aggregated short texts into long pseudo-documents and then applied traditional TM (Al-Sultany and Aleqabie 2019; López-Ramírez et al. 2019). Still others were designed by modifying the strategies of long text models to suit short texts, especially news and tweet data (Zhao et al. 2011; Quercia et al. 2012; Fang et al. 2017; Sharath et al. 2019; Wang et al. 2012; Han et al. 2020). Figure 3 shows the overall classification of TSTTM models that can be used for short texts. This part of the taxonomy classifies the TSTTM models into the following sub-categories: probabilistic models, matrix factorization (non-probabilistic) models, machine learning-based (unsupervised and supervised) techniques, dynamic (strategy) based categories, exemplar-based, data-source-based, word-type, application-based, Frequent Pattern Mining (FPM), and hybrid-based techniques. All the models in these categories can be used for both long and short text; however, this article focuses on short text models.

Fig. 3
figure 3

Taxonomy on TSTTM models

4.2.1 Probabilistic based models

Probabilistic TMs are a suite of models that apply statistical techniques to uncover the latent thematic structure in a massive collection of documents and to decompose those documents into topics. One of the most important assumptions of these probabilistic models is that the generative process follows a bag-of-words (BOW) assumption, meaning that each token (word) is independent of the token that came before it. This section presents the probabilistic topic models and their extensions, classified as follows:

4.2.1.1 Latent Dirichlet Allocation (LDA) based models

LDA is a probabilistic statistical approach based on de Finetti's theorem for extracting the significant statistical structure of a text document. It is one of the most popular techniques for topic discovery and extraction (Blei et al. 2003). The basic idea of LDA is that each document is represented as a probability distribution over hidden topics, while each topic is characterized as a probability distribution over a number of words. The generative process of the LDA model for each document or short text \(d_{i} \in D\) in a dataset \(D\) is as follows.

A. Draw each topic parameter \(\beta_{k} \sim {\text{Dirichlet }}\left( {\varphi } \right)\), for \(k \in \left\{ {1 \ldots K} \right\}\)

B. For each document:

   1. Choose the topic distribution \(\theta_{m} \sim {\text{Dirichlet }}\left( {\upalpha } \right)\)

   2. For each of the \(N\) words \( w_{n}\):

      i. Select a topic \(z_{mn} \sim {\text{Multinomial}}\left( {\theta_{m} } \right)\)

      ii. Select a word \(w_{n} \sim {\text{Multinomial}}\left( {\beta_{k} } \right)\) from \(p(w_{n} |z_{mn} ,\beta_{k} )\)

where the dataset-level parameters \(\beta\) and \(\alpha\) are sampled just once in the procedure of generating the dataset, \(K\) denotes the number of topics, and the word-level variables \(z_{mn}\) and \(w_{mn}\) are sampled once for every word in every document. The document-level variable \(\theta_{m}\) is sampled just once per short text (document). Finally, the word probability distribution of topic \(k\) is denoted by \(\varphi_{k}\). The time complexity of the LDA algorithm is \(O\left( {N_{itr} KN\overline{I}} \right)\), where \(N_{itr}\) denotes the number of iterations, \(K\) is the number of hidden topics, \(N\) denotes the number of documents in the dataset, and \(\overline{I}\) is the average length of each document in \(D\).
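As an illustration of how LDA is applied in practice, the following sketch fits the model to a small pre-processed short text corpus with the gensim library; the corpus, the number of topics, and the hyperparameter settings are illustrative assumptions rather than values prescribed by the cited works.

```python
from gensim import corpora
from gensim.models import LdaModel

# Tokenized short texts, e.g. the output of the pre-processing step in Sect. 4.1.
texts = [["flood", "downtown", "rescue"],
         ["election", "vote", "poll"],
         ["flood", "rain", "storm", "warning"],
         ["election", "debate", "candidates"]]

dictionary = corpora.Dictionary(texts)                 # vocabulary W
corpus = [dictionary.doc2bow(doc) for doc in texts]    # bag-of-words counts per document

# K latent topics; alpha and eta are the Dirichlet priors on theta_m and beta_k.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", passes=50, random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])
```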

Many studies apply LDA to topic modelling of both short and long texts. Hoffman et al. (2010) proposed a new model named Online LDA (OLDA), which is based on online stochastic optimization with a natural gradient step. Online LDA can handle and analyze an enormous number of documents, including streaming document collections. The time complexity of Online LDA is \(O\left( {N_{itr} KN^{\left( t \right)} \overline{I}^{\left( t \right)} } \right)\), where \(N_{itr}\) denotes the number of iterations, \(K\) is the number of hidden topics, \(N\) denotes the number of documents in the dataset, \(\overline{I}\) is the average length of each document in \(D\), and the superscript \(t\) represents the latest time slice or version. Al-Sultany and Aleqabie (2019) used LDA to enrich tweet topic modelling by merging tweets into long documents and linking them to Wikipedia for latent topic discovery: to merge tweets into long texts, Twitter Named Entity Recognition (TNER) was used to categorize the short text tweets, extract entities, and link the entities of each tweet to Wikipedia to construct a new Twitter dataset; the pre-processing tasks are then conducted, and finally topic modelling is applied with LDA. Similarly, Niyogi and Pal (2019) used LDA to discover conversational topics associated with India demonetization tweets. Further, López-Ramírez et al. (2019) used LDA for extracting geographical collections of topics, Shirolkar and Deshmukh (2019) used LDA for finding topic experts in a Twitter dataset, and Hidayatullah et al. (2019) used LDA for extracting weather and climate topics on Twitter. Chan (2020) developed real-time social big data analytics using the LDA topic model. LDA relies on the bag-of-words (BOW) representation for extracting the features of topics of interest but fails to combine short text features and interest attributes. Further, standard LDA does not reflect the dynamic and hierarchical trends of microblog users' interests.

As the limitations of LDA impact its performance on short texts, the research community's focus shifted towards modifications of the traditional LDA. Along this direction, Chen and Kao (2017) developed an improved topic model, Re-Organized LDA (RO-LDA), which addresses LDA's lack of local word co-occurrence; however, this model has limitations in handling redundant data. Fang et al. (2017) used Time-Sensitive Variational Bayesian inference LDA (TSVB-LDA) for extracting trending topics with higher accuracy, although this model has limitations in inference on news tweets. Ni et al. (2018) presented a hot event detection model using Background Removal LDA (BR-LDA) topic modelling, which removes background words from tweets.

Sharath et al. (2019) designed Corpus-based Topic Derivation (CTD), which combines LF-LDA (Latent Feature LDA) using an asymmetric topic model with Timestamp-based Popular Hashtag Prediction (TPHP) to discover Twitter topics based on corpus semantics. Korshunova et al. (2019) developed Logistic LDA for discriminative topic modelling of tweets, which can extract topics from unlabelled data through unsupervised exploitation of the data's group structure; however, Logistic LDA is close to LDA and has limited inference abilities. Tajbakhsh and Bagherzadeh (2019) designed a semantic knowledge LDA with topic vectors that extracts tweet topics based on word co-occurrence for a recommendation system. Zheng et al. (2019) developed the Three-layer Interest Network LDA (TIN-LDA) model to discover topics from tweet data through interest attributes; TIN-LDA captures the semantic correlation among keywords and thus enhances the coherence of the topics. Wang et al. (2012) presented Temporal LDA (TM-LDA) for latent temporal topic extraction from tweets; it has been applied to huge data volumes and is updated when new temporal stream data arrives. Although high accuracy was achieved, it uses time only for collecting tweets. Because some current topic representations pay no attention to choosing terms that better differentiate topics, Han et al. (2020) developed a topic representation model based on user behaviour analysis and LDA, namely MBA-LDA, for analyzing microblogging data; the topic-word distribution is obtained by the LDA model, and the model re-appraises the significance of words for topic representation by considering both the word distribution and user behaviour information. Topic models such as LDA can generate incoherent topics with noisy terms; such models suffer from a lack of semantic data, data sparsity, and binary weighting of words, so Akhtar et al. (2019b) developed a model based on fuzzy document representation with LDA, namely FBLDA, to handle these issues. Besides, Ozyurt and Akcayol (2021) suggested a new topic model, Sentence Segment LDA (SS-LDA), to extract product aspects from user reviews in an attempt to overcome data sparsity. Rahimi et al. (2022) proposed a novel model named LLDA, which concentrates on local word relationships and encodes word order using overlapping windows. The authors assume that a document consists of overlapping windows of fixed size and formulate a corresponding generative process. In the inference process, every word is sampled only once, in a single window, while influencing the sampling of its co-occurring counterparts in the other windows. The LLDA model alleviates the sparseness problem and generates more coherent topics.

4.2.1.2 Twitter LDA

Twitter LDA (also called Tweet-LDA) is an extended version of LDA specifically designed for tweet topic modelling, with additional data pre-processing techniques and data interpretation for sparse tweets. Zhao et al. (2011) designed Twitter-LDA for extracting topics from noisy tweets. Since general LDA assigns a topic label to each word, it may not work well on Twitter, because tweets are very short and noisy in nature and every single tweet is likely to concern a single topic. Twitter-LDA was therefore proposed to fill this gap. Its generative process is as follows:

A. Draw \(\varphi^{B} \sim {\text{Dirichlet }}\left( {\upbeta } \right)\),\(\pi \sim {\text{Dirichlet }}\left( {\upgamma } \right)\)

B. For each topic \( t = \left\{ {1, \ldots , T} \right\}\)

   (a) draw \(\varphi^{t} \sim {\text{Dirichlet }}\left( {\upbeta } \right)\)

C. For each user \(u = 1, \ldots , U\)

      (i) draw \(\theta^{u} \sim {\text{Dirichlet }}\left( {\upalpha } \right)\)

      (ii) for each tweet \(s = 1, \ldots , N_{u}\)

         (a) Draw \(z_{u,s} \sim {\text{Multinomial}}\left( {\theta^{u} } \right)\)

         (b) For each word \(n = 1, \ldots , N_{u,s}\)

            (i) Draw \(y_{u,s,n} \sim {\text{Multinomial}}\left( \pi \right)\)

            (ii) Draw \(w_{u,s,n} \sim {\text{Multinomial}}\left( {\varphi^{B} } \right)\) if \(y_{u,s,n} = 0\), or \(w_{u,s,n} \sim {\text{Multinomial}}\left( {\varphi^{{z_{u,s} }} } \right)\) if \(y_{u,s,n} = 1\)

Formally, let \(T\) denote the number of topics in the Twitter data, each defined by a word distribution. The word distributions for background words and for topic \(t\) are denoted by \(\varphi^{B}\) and \(\varphi^{t}\), respectively. A Bernoulli distribution \(\pi\) controls the choice between topic words and background words, and the topic distribution of user \(u\) is denoted by \(\theta^{u}\). When composing a tweet, a user first selects a topic based on his or her topic distribution and then selects a bag of words (BOW) individually, each drawn from either the chosen topic or the background model.
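To make the generative story concrete, the sketch below simulates it with NumPy for a toy vocabulary; the dimensions and Dirichlet hyperparameters are arbitrary illustrative choices, not values from Zhao et al. (2011).

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, U = 50, 3, 4                 # vocabulary size, number of topics, number of users
alpha, beta, gamma = 0.5, 0.1, 1.0

phi_B = rng.dirichlet([beta] * V)          # background word distribution
phi = rng.dirichlet([beta] * V, size=T)    # per-topic word distributions
pi = rng.dirichlet([gamma] * 2)            # Bernoulli: background vs topic word

tweets = []
for u in range(U):
    theta_u = rng.dirichlet([alpha] * T)   # user u's topic distribution
    for _ in range(2):                     # two tweets per user
        z = rng.choice(T, p=theta_u)       # a single topic for the whole tweet
        words = []
        for _ in range(8):                 # eight tokens per tweet
            y = rng.choice(2, p=pi)        # y = 0: background word, y = 1: topic word
            dist = phi_B if y == 0 else phi[z]
            words.append(rng.choice(V, p=dist))
        tweets.append((u, z, words))

print(tweets[0])                           # (user, tweet topic, word indices)
```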

Other studies use the LDA model with some modifications for Twitter data. Quercia et al. (2012) also designed a Tweet-LDA by modifying Labelled LDA into a supervised topic classification model for tweets with a Support Vector Machine (SVM). Both methods provided better topic coherence than the LDA model. Akhtar (2017) also used Twitter-LDA for hierarchical summarization of news tweets. Yu et al. (2019) designed Twitter Hierarchical LDA (TH-LDA) for discovering topics in tweets for On-Line Analytical Processing (OLAP); it mines hierarchical dimensions automatically and uses the word2vec model to analyse semantic relationships. However, it focuses only on direct relationships while ignoring indirect social relationships. Table 2 provides a comparative analysis of probabilistic based TSTTM models.

Table 2 Comparative analysis of probabilistic based TSTTM models
4.2.1.3 PLSI based models

Probabilistic Latent Semantic Indexing (PLSI) is another extensively used document model (Hofmann 1999). PLSI was introduced as an improvement over LSI, providing a more solid statistical foundation and a proper generative data model. Besides these merits over the conventional LSI model, because it rests on probability it can also use statistical methods for model fitting, immediate reduction of word perplexity, model combination, and overfitting control, and it offers a probabilistic interpretation. The main aim of the PLSA model is to use the co-occurrence matrix to discover the topics and to describe the documents as mixtures of topics. It extracts a latent statistical class model as a mixture decomposition of the co-occurrences between words and documents.

Formally, let \(D\) be a dataset consisting of a large collection of documents, represented as a document-term matrix that records how many times every word occurs in every document. The latent variable model is defined as follows: (1) documents are observed variables, \(d \in D = \left\{ {d_{1} ,d_{2} ,d_{3} , \ldots ,d_{m} } \right\}\), where \(m\) denotes the number of documents or short texts in the dataset; (2) words are observed variables, \(w \in W = \left\{ {w_{1} ,w_{2} ,w_{3} , \ldots ,w_{n} } \right\}\), where \(n\) is the number of words in the dataset; (3) topics are hidden (latent) variables, \(z \in Z = \left\{ {z_{1} ,z_{2} ,z_{3} , \ldots ,z_{K} } \right\}\), where \(K\) denotes the number of topics and must be specified a priori. The generative process of PLSI for a document in a dataset consists of the following steps.

  1. Select a document \(d_{m}\) with probability \(p\left( d \right)\).

  2. For each word \(w_{i}\), \(i \in \left\{ {1, \ldots , n} \right\}\), in the document \(d_{m}\):

     a. Choose a topic \(z_{i}\) from a multinomial conditioned on the selected document, with probability \(p(z|{ }d_{m} )\).

     b. Choose a word \(w_{i}\) from a multinomial conditioned on the selected topic, with probability \(p(w|{ }z_{i} )\)

The PLSI model has been used in several research works on TM; this sub-section presents recent studies of PLSI-based models. Hennig (2009) designed a topic model for multi-document summarization using PLSA by combining query and thematic features. Yirdaw and Ejigu (2012) used a similar PLSA topic model for Amharic text summarization. However, PLSA is considered weak for short texts because these methods require multiple features. PLSA has also been used in other major applications, such as sentiment analysis of Twitter data: Kumar and Vardhan (2019) used PLSA in combination with Independent Component Analysis (ICA) for aspect-based sentiment analysis of tweets, and many other studies have used PLSA for sentiment analysis and event detection from short texts. However, PLSA suffers from data sparsity when applied to short text topic modelling. Table 2 summarizes the probabilistic-based TMs by objective, weakness, dataset used, source, domain, and platform.
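The sketch below is a minimal NumPy implementation of PLSA inference by expectation-maximization on a toy document-term count matrix; it follows the generative description above, but the matrix, the topic count, and the iteration budget are illustrative assumptions rather than settings from the cited studies.

```python
import numpy as np

def plsa(X, K, n_iter=100, seed=0):
    """Fit PLSA with EM. X: (D x W) count matrix. Returns P(z|d) and P(w|z)."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)   # P(z|d)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)   # P(w|z)

    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) proportional to P(z|d) * P(w|z), shape (D, K, W).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both multinomials from the expected counts.
        counts = X[:, None, :] * post                                   # (D, K, W)
        p_w_z = counts.sum(axis=0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = counts.sum(axis=2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

X = np.array([[3, 1, 0, 0], [2, 2, 0, 1], [0, 0, 4, 2], [0, 1, 3, 3]])
p_z_d, p_w_z = plsa(X, K=2)
print(np.round(p_z_d, 2))    # per-document topic mixtures
```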

4.2.1.4 A Bayesian graphical model

In this sub-section, we review probabilistic topic modelling based on Bayesian graphical models. Probabilistic topic models that can detect latent topics or themes in documents, such as LDA and PLSI, have been investigated extensively, but they learn only from a single document dataset. Several real-world applications, however, require an in-depth understanding of the relationships between numerous document datasets. To fill this gap, Hua et al. (2020) suggested a novel model named Common and Distinctive Topic Modelling (CDTM), which detects the common and the distinctive topics of several datasets simultaneously: the common topics (global mixtures) are shared among all the datasets, while the distinctive (specific) topics represent the features or characteristics unique to each respective dataset. The proposed model is the first model based on a Bayesian graphical approach.

Formally, suppose a set of \(l\) datasets, represented by \(S = \left\{ {D_{1} ,D_{2} , \ldots ,D_{l} } \right\}\). Each dataset \(D\) in this set is a collection of \(m\) documents, \(D = \left\{ {d_{1} ,d_{2} , \ldots ,d_{m} } \right\}\), and each document or short text \(d_{i} \in D,{ }1 \le i \le m\), is represented by \(n\) words, \(d_{i} = \left\{ {w_{i1} ,w_{i2} , \ldots ,w_{in} } \right\}\). The vocabulary \(V\) of the set of datasets \(S\) consists of the words from all the considered datasets. The local (distinctive) topics are indicated by a \(K_{d}\)-dimensional Dirichlet variable \(\theta_{d}\), where \(K_{d}\) is the number of distinctive topics for each respective dataset. The global (common) topics are indicated by a \(K_{c}\)-dimensional Dirichlet variable \(\theta_{c}\), where \(K_{c}\) is the number of global (common) topics. Using these variables and definitions, the distinctive and common topics are learnt by estimating the posterior distributions of \(\theta_{d}\) and \(\theta_{c}\).

4.2.2 Matrix factorization based models

This section presents the matrix factorization based models used for TM and their extensions. Table 3 summarizes the matrix factorization based TSTTM models.

Table 3 Comparative analysis of matrix factorization based models
4.2.2.1 LSI based models

Latent Semantic Indexing (LSI), also known as Latent Semantic Analysis (LSA), is a traditional and popular text mining method that extracts the hidden semantic structure of words from a collection of documents or short texts (Deerwester et al. 1990). The text mining techniques that preceded LSI were unable to retrieve data based on concepts and queries; LSI was the first technique to do so. The demerits of the LSI model are that the detected topics are hidden and ambiguous, and that its decomposed matrices contain negative values that cannot be interpreted.

Assume that \(X\) is a data matrix (\(m\) documents \(\times\) \(n\) terms); the LSI model factorizes the matrix \(X\) into the product of three matrices \(U\varvec{\Sigma V}^{{\varvec{T}}}\). This process is called Singular Value Decomposition (SVD). Figure 4 shows the SVD process of LSA topic modelling, which can be formulated as given in Eq. (4).

$$ X = U\Sigma V^{T } $$
(4)
Fig. 4
figure 4

SVD of the Latent semantic indexing TM

where \(X\) denotes the data matrix of size \(\left( {m \times n} \right)\), with \(m\) documents and \(n\) terms; \(U\) denotes an \(\left( {m \times r} \right)\) matrix, where \(r\) is the number of concepts; \(V\) is an \(\left( {n \times r} \right)\) matrix, so \(V^{T}\) gives the coefficients of the terms in the new space; and \({\Sigma }\) is a diagonal \(\left( {r \times r} \right)\) matrix in which all values outside the diagonal are equal to zero.
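As a small illustration, the following sketch performs LSI-style topic extraction by applying truncated SVD to a TF-IDF document-term matrix with scikit-learn; the toy tweets and the choice of two concepts are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["flood warning heavy rain downtown",
        "election results vote count tonight",
        "storm flood rescue volunteers rain",
        "candidates debate election campaign vote"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # documents x terms matrix

svd = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = svd.fit_transform(X)           # U * Sigma: document-concept weights

terms = vectorizer.get_feature_names_out()
for k, component in enumerate(svd.components_):      # rows of V^T
    top = component.argsort()[::-1][:3]
    print(f"concept {k}:", [terms[i] for i in top])
```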

Many recent studies have employed LSI for topic modelling and other text mining applications. Valdez et al. (2018) presented a topic modelling framework for US election tweets using LSA and uncovered the thematic patterns embedded in a large tweet dataset with a high degree of accuracy. Qomariyah et al. (2019) compared LSA and LDA for Twitter topic discovery and concluded that LDA is better than LSA. Although LDA generally outperforms LSI, some studies have tried to extend LSI to overcome its limitations in handling larger semantic structures. Karami et al. (2018) developed Fuzzy LSA (FLSA) for topic discovery in health news tweets, which avoids the negative impact of redundant data. Kim et al. (2020) presented Word2Vec-based Latent Semantic Analysis (W2V-LSA) for trending topic discovery in expert tweets; however, this model depends entirely on the data for optimizing user-defined parameters, which reduces its efficiency in large-scale modelling. Overall, the LSA model is not well suited to short texts because its approximation produces negative matrix values, yet it remains suitable for applications such as document clustering (Magerman et al. 2010) and language modelling (Yeh et al. 2005).

4.2.2.2 NMF based models

Non-negative Matrix Factorization (NMF) is a statistical, linear-algebraic model that reduces the dimensionality of the input dataset. Internally, it uses a factor-analysis approach that assigns relatively little weight to less coherent words. NMF-based methods represent the input dataset as a term-document (bag-of-words) matrix and learn themes (topics) by splitting this matrix into two low-rank factor matrices from which the trending topics are extracted. It discovers the latent topical structure of the data by identifying the factor matrices. Although NMF is an efficient topic model, it is usually considered only after LDA, because LDA adds a Dirichlet prior on top of the data generation process, whereas NMF qualitatively leads to worse mixtures. As mentioned in the LSI section, LSI suffers from negative values in its decomposed matrices; NMF was suggested to alleviate this issue.

Formally, let \(X\) be a data matrix of size \(m \times n\) that represents the term-document matrix. This TM model decomposes \(X\) into the product of two lower-dimensional matrices, \(W\) of size \(m \times k\) and \(H\) of size \(k \times n\), spanned by a set of hidden (latent) topics. The interpretation of these factor matrices is that each column of \(W\) gives the weight of every word in one latent topic, and each row of \(H\) gives the weight of that topic in every document. All the elements of the factor matrices \(W\) and \(H\) are non-negative, given that all elements of \(X\) are non-negative. The general process of NMF is shown in Fig. 5 and formulated as Eq. (5).

$$ X \approx W \times H $$
(5)
Fig. 5
figure 5

NMF process
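For illustration, the sketch below factorizes a TF-IDF matrix with scikit-learn's NMF implementation; note that scikit-learn arranges the data as documents \(\times\) terms, so the roles of \(W\) and \(H\) are transposed relative to the term-document convention above, and the corpus and component count are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["flood warning heavy rain downtown",
        "election results vote count tonight",
        "storm flood rescue volunteers rain",
        "candidates debate election campaign vote"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # documents x terms, non-negative TF-IDF
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)                     # document-topic weights
H = nmf.components_                          # topic-term weights

terms = vectorizer.get_feature_names_out()
for k, row in enumerate(H):
    top = row.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
print(W.round(2))                            # how strongly each document loads each topic
```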

Many recent researchers have adapted NMF for topic modelling; this section presents the most recent NMF-based models. Belford et al. (2016) developed ensemble topic modelling using NMF for annotated tweet data and achieved high stability and accuracy; the model combines and integrates multiple unstable topic modelling runs into one ensemble topic model that produces a stable and informative solution. Sitorus et al. (2017) used NMF for sensing trending urban topics on Twitter near the Greater Jakarta area. However, as described above, the limitations of NMF have a negative impact on overall performance, so many studies have tried to modify and improve NMF. Yan et al. (2013b) presented an approach that applies NMF to the term correlation matrix, which improves both term correlation and stability. Yan et al. (2012) developed the Ncut-weighted NMF topic model, which assesses the discriminability of terms based on word co-occurrences. Murfi (2017) used separable NMF for topic extraction with higher accuracy than LDA. Iskandar (2017) developed a regular expression discovery (RED) algorithm based on NMF (RED-NMF) for disease outbreak topic extraction on Twitter. Shi et al. (2018) developed a semantics-assisted non-negative matrix factorization (SeaNMF) approach enriched with local word-context correlations for extracting latent topics, improving topic coherence and classification accuracy. Lahoti et al. (2018) used joint NMF for learning ideological topics on Twitter with 90% purity and high correlation. Casalino et al. (2018) designed an intelligent topic modelling framework using NMF with subtractive clustering to detect trending topics. Chen et al. (2020b) developed an affinity-regularized NMF, named NMF-LTM, for lifelong topic modelling of short text big data. Although this model is efficient, it demands many adjustments to support short texts because of the data sparsity problem.

4.2.2.3 Column Subset Selection (CSS) models

The Column Subset Selection (CSS) problem can be broadly characterized as selecting the most representative columns from a data matrix \(X\). Column subset selection generalizes the problems of feature selection and of selecting representative data instances. Farahat et al. (2015) suggested an accurate, novel greedy CSS method for tweet topic modelling, which chooses a subset of the columns of a matrix so as to reduce the reconstruction error; it is a fast and efficient approach to topic modelling with greedy selection. The primary goal of the method is to minimize the reconstruction error of the matrix \(X\) using the selected features \(S\). It relies on an efficient recursive formula for computing the reconstruction error of the data matrix given the columns chosen at each iteration. In that work, the concept was applied to topic extraction on a Twitter dataset: each selected data point is regarded as the representative of a particular topic. However, the greedy approach significantly reduces throughput and takes a long time to settle on optimal coherent topics.
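The following sketch illustrates the selection criterion with a naive greedy column subset selection on a small random matrix using NumPy; it recomputes the reconstruction error from scratch with a pseudo-inverse at every step rather than using the efficient recursive update of Farahat et al. (2015), so it is only a sketch of the idea.

```python
import numpy as np

def greedy_css(X, n_cols):
    """Greedily pick columns of X that minimize the Frobenius reconstruction error."""
    selected = []
    for _ in range(n_cols):
        best_j, best_err = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            C = X[:, selected + [j]]
            X_hat = C @ np.linalg.pinv(C) @ X      # project X onto span of chosen columns
            err = np.linalg.norm(X - X_hat)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.random((20, 10))            # e.g. a term-by-tweet matrix
print(greedy_css(X, n_cols=3))      # indices of representative columns (tweets)
```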

4.2.3 Machine learning-based TSTTM models

Another important category of TSTTM models is the Machine Learning (ML) based models, some of which are reviewed in this section. Generally, the ML-based TSTTM models fall into two main categories, unsupervised and supervised, with the supervised models further divided into single-label and multi-label. The comparative analysis of the machine learning-based TSTTM models is summarized in Table 4.

Table 4 Comparative analysis on machine learning based TSTTM models
4.2.3.1 Unsupervised models

This section presents the clustering techniques used for topic modelling. Clustering algorithms are often part of topic modelling pipelines: in many cases, the topics extracted by LDA and other topic models are clustered with a clustering algorithm as a post-processing step. In other cases, researchers have employed clustering algorithms directly as topic extraction models. Li et al. (2013) used incremental clustering for improved topic detection from Chinese microblogs. Yang et al. (2013) developed a hot topic detection method using the CURE hierarchical clustering algorithm. Along the same lines, Fang et al. (2014) proposed Multi-View Topic Detection (MVTD) to detect hot topics from Twitter with high coherence and accuracy; however, it does not consider the retweet-reply and geographical relations of tweets. Nur'aini et al. (2015) proposed a model that integrates K-means clustering and Singular Value Decomposition (SVD) for tweet topic extraction. Muliawati and Murfi (2017) designed a topic model using Eigenspace-based Fuzzy C-Means (EFCM) clustering, in which SVD is used to reduce the data dimensionality. Prakoso et al. (2018) extended this work by introducing Kernel Eigenspace-based Fuzzy C-Means (KEFCM) for detecting trending topics. Trupthi et al. (2018) used probabilistic fuzzy C-means topic modelling for analysing user sentiments. Lim et al. (2017) developed Clustop by merging clustering-based topic modelling with word networks of n-grams and part-of-speech tagging; however, it models only topic labels based on Wikipedia articles.

Capdevila et al. (2017) developed Tweet-SCAN, which is based on DBSCAN and uses a hierarchical Dirichlet process and the Jensen-Shannon distance for event discovery from geo-located tweets. Mustakim et al. (2019) employed the DBSCAN algorithm for clustering trending tweet topics about Pilkada Pekanbaru; however, DBSCAN has limitations in handling large numbers of sparse texts. Rashid et al. (2019b) designed the Fuzzy Topic Modelling (FTM) approach, in which global and local term frequencies are generated from the bag-of-words representation and high dimensions are removed by Principal Component Analysis (PCA). Indra and Pulungan (2019) developed a trending topic detection model using BN-grams and doc-pivot clustering. Abou-Of (2020) developed an incremental trending topic detection model named Incremental FCM (IFCM), which integrates incremental semantic metrics with Fuzzy C-Means clustering to increase the accuracy of trending topics; IFCM aims to discover semantic relatedness when different titles describe the same event and to merge semantically similar topics from various sources. Yang et al. (2019) proposed Topic Representative Terms Discovering (TRTD) for short text, which alleviates the noise and data sparsity problems. Although the analysis shows that clustering-based topic models are largely accurate and stable, they lack the effectiveness of dedicated STTM techniques because they depend on topic features for clustering even when the tweet texts are sparse.
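As a rough illustration of this family of approaches, the sketch below reduces a TF-IDF tweet matrix with SVD and clusters the result with K-means, in the spirit of Nur'aini et al. (2015); the tweets, the reduced dimensionality, and the cluster count are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

tweets = ["flood rescue teams downtown", "heavy rain flood warning",
          "vote count election night", "election debate candidates"]

reducer = make_pipeline(
    TfidfVectorizer(),                               # sparse tweet-term matrix
    TruncatedSVD(n_components=2, random_state=0),    # SVD dimensionality reduction
)
reduced = reducer.fit_transform(tweets)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(reduced)
print(kmeans.labels_)                                # cluster (topic) index of each tweet
```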

4.2.3.2 Supervised (Label) based topic models

In this section, we present single-label and multi-label topic modelling techniques. Single-label topic models are mostly used to classify tweets under a single topic. While most studies do not undertake such labelling, some researchers have already performed such tasks.

Supervised LDA (Mcauliffe and Blei 2008), SSHLDA (Mao et al. 2012), and DiscLDA (Lacoste-Julien et al. 2009) are some of the best-known single-label topic models. Most authors, however, prefer multi-label topic models such as Labeled LDA (Bhattacharya et al. 2014), LF-LDA (Zhang et al. 2018c), and DF-LDA (Li et al. 2015). Yu et al. (2017) developed Multi-layered Semantic LDA (MS-LDA) for mining topics from unstructured tweet data with high recognition accuracy; its major handicap, however, is its long running time. He et al. (2020b) designed Bi-Labeled LDA for automatic detection of interest tags from Twitter topics using the social relationships between popular and non-popular users. Topic discovery in this model is highly cohesive and therefore well suited to inference applications, but the model supports only two labels, limiting the multi-label relationships it can capture. Slutsky et al. (2014) designed Hash-Based Stream LDA, a multi-label topic model that has become one of the most commonly used topic models in recent years. Wilcox et al. (2021) introduced supervised LDA with covariates (SLDAX), which integrates a measurement (latent variable) model of text with a regression model so that the hidden themes and other manifest variables can be used as predictors of outcomes. Because short texts suffer from data sparsity in the feature vector, Pang et al. (2019) developed two supervised topic models, a Weighted Labeled Topic Model (WLTM) and an X-term Emotion-Topic Model (XETM), to address this issue and to discover emotions toward specific topics.

4.2.4 Exemplar based topic model

Conventional topic detection models concentrate on representing topics using words, which is hindered by Twitter's character limit and lack of context. Another limitation is the scalability of the processing required to handle the enormous volume of tweets created daily. Exemplar-based techniques were therefore suggested by Elbagoury et al. (2015) to fill this gap. An exemplar-based technique selects representative exemplar tweets, rather than sets of words, to represent the extracted topics, based on the variance of the similarity between the exemplar tweets and the other tweets. The proposed model mitigates the aforesaid issues and makes the meaning of the retrieved topics easy to interpret and understand.

Formally, let \(T\) be a collection of tweets of size \(m\); the main objective is to detect the hidden topics in the collection \(T\) and to represent every topic using just one tweet (an exemplar). The selection criterion should identify, for every topic, a tweet that describes that single topic while simultaneously distinguishing it from the other topics. The criterion used in the proposed model is as follows: a tweet \(t_{i}\) that is similar to one group of tweets and yet different from the remaining tweets is a good topic representative. This is expressed by forming a similarity matrix \(S_{m \times m}\), where \(S_{i,j}\) represents the similarity between tweets \(t_{i}\) and \(t_{j}\); for a good exemplar, the sample variance of its similarity distribution will be large. The sample variance for every tweet \(t_{i}\) is calculated as in Eq. (6).

$$ var\left( {S_{:i} } \right) = \frac{1}{m - 1}\mathop \sum \limits_{j = 1}^{m} \left( {S_{i,j} - \mu_{i} } \right)^{2} $$
(6)

where \(S_{i,j}\) is the similarity between tweets \(t_{i}\) and \(t_{j}\), and \(\mu_{i} = \frac{1}{m}\mathop \sum \limits_{j = 1}^{m} S_{i,j}\) is the average similarity of tweet \(t_{i}\).
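A minimal sketch of this criterion is given below: it builds a cosine similarity matrix over TF-IDF tweet vectors, computes the sample variance of each tweet's similarity row as in Eq. (6), and picks the highest-variance tweets as exemplars. The tweets and the number of exemplars are assumptions, and the original method additionally ensures that the chosen exemplars cover distinct topics.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = ["flood rescue teams downtown",
          "heavy rain flood warning tonight",
          "vote count continues on election night",
          "candidates debate before the election"]

X = TfidfVectorizer().fit_transform(tweets)
S = cosine_similarity(X)                        # m x m similarity matrix S_{i,j}

variances = S.var(axis=1, ddof=1)               # sample variance of each row, as in Eq. (6)
exemplars = np.argsort(variances)[::-1][:2]     # pick the 2 highest-variance tweets
for i in exemplars:
    print(f"exemplar (variance {variances[i]:.3f}):", tweets[i])
```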

Other recent research has similarly used exemplar-based techniques for detecting topics from tweets. To this end, Shi et al. (2019b) used the exemplar approach for event detection from tweet topics with high accuracy, and Liu et al. (2020a) employed an exemplar approach in an event evolution strategy with high stability and purity. Although this model is highly efficient and provides a good balance between term recall and term precision, it has limited control over the dynamically changing number of topic labels. The analysis of these models is given in Table 5.

Table 5 Comparative analysis for dynamic based, exemplar-based, application-based, word type based, and frequent pattern mining based models

4.2.5 Dynamic topic models

Static topic models have been the most widely used models in many applications, but they are limited to static vector representations of tweets and are thus unsuitable for representing the dynamic changes of Twitter streams. This category includes traditional topic models such as LDA, LSA, and PLSA. Dynamic Topic Modelling (DTM), by contrast, seeks to detect the hidden topics in sequentially arriving data blocks in order to capture the evolving trends of these topics. Blei and Lafferty (2006) proposed DTM under the assumption that sequentially ordered short texts or documents share the same concentrated themes with smooth variations. Continuous-Time Dynamic Topic Models (CDTM) were proposed by Wang et al. (2008), which model latent topics through a successive set of documents using Brownian motion; a key advantage of CDTM is the ability to use sparse variational inference for rapid model comparison. Bhadury et al. (2016) scaled DTM and CDTM to big data using a form of Gibbs sampling that combines the Metropolis-Hastings sampler with Stochastic Gradient Langevin Dynamics. Saha and Sindhwani (2012) and Vaca et al. (2014) presented dynamic NMF approaches with temporal regularisation for learning emerging and evolving topics in social media networks, so as to better capture newly emerging and fading topics. Cotelo et al. (2014) used a dynamic topic-related tweet retrieval approach to extract topics efficiently. Liang et al. (2016) introduced the Dynamic Clustering Topic (DCT) model, which tracks the time-varying distributions of topics over documents and of words over topics; it addresses the data sparsity of short texts and the dynamic nature of topics across time, but not scalability. Yao and Wang (2020) also used DTM to track urban geo-topics from user tweets. Similarly, many studies have developed dynamic models such as Biterm Topic Modelling (BTM) (Cheng et al. 2014) and Pseudo-document-based Topic Modelling (PTM) (Zuo et al. 2016a). Finally, Ghoorchian and Sahlgren (2020) developed Graph-based Dynamic Topic Modelling (GDTM), which combines a language representation technique and incremental dimensionality reduction with graph partitioning to address dynamicity and scalability, and uses a rich language model to address data sparsity.
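A hedged sketch of dynamic topic modelling is shown below using gensim's LdaSeqModel, an implementation in the spirit of Blei and Lafferty's DTM; the toy corpus, the two time slices, and the topic count are assumptions, and real applications need far more documents per slice for stable estimates.

```python
from gensim import corpora
from gensim.models.ldaseqmodel import LdaSeqModel

# Tokenized posts ordered in time: the first three belong to slice 1, the last three to slice 2.
texts = [["flood", "rain", "warning"], ["flood", "rescue", "downtown"],
         ["storm", "rain", "flood"],
         ["election", "vote", "poll"], ["election", "debate", "candidates"],
         ["vote", "count", "results"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# time_slice gives how many documents fall into each consecutive time block.
dtm = LdaSeqModel(corpus=corpus, id2word=dictionary,
                  time_slice=[3, 3], num_topics=2)

# Inspect how topic 0's word distribution evolves across the two slices.
print(dtm.print_topic_times(topic=0, top_terms=5))
```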

4.2.6 Single and multi-source topic models

Topic models are generally based on single-source documents. Even the STTM models are mostly based on single-source data such as Twitter, Facebook, Weibo, etc. On the other hand, multi-source data-based topic modelling has been gradually increasing in recent years, because it is highly beneficial to extract topics, sentiments, and events from multiple sources rather than from a single source. The main limitation in employing such multi-source data is the difficulty of handling complex semantic relations. This section describes the single and multi-source based topic models and analyzes them in Table 5. Hong et al. (2011) developed a time-dependent topic model for multiple text streams from Twitter and Yahoo news data and achieved high accuracy. Cao et al. (2017) employed a domain-aware LDA topic model to extract topics from multiple data sources, namely Twitter, news, and PubMed. Gupta et al. (2019) described multi-view and multi-source data transfer from PubMed, AG news, and Twitter using predefined topics and word embedding.

4.2.7 Application-based models

Researchers developed application-based models by modifying existing topic models for specialized applications of topic modelling. For example, Wang and Iwaihara (2015) designed a Bilingual LDA topic model for cross-lingual tweet recommendation by assuming that cross-lingual tweets are similar to linguistic Wikipedia articles. Pu et al. (2016) presented the Wiki-LDA model, in which tweets are merged into documents and combined with Wikipedia labels to extract the topics. Feng (2018) presented the Environmental Data LDA (ED-LDA), specifically designed for environmental tweet datasets through probabilistic learning.

4.2.8 Word type models

Most topic models, such as LDA, LSA, and PLSI and their extensions, use the Bag-Of-Words (BOW) approach for topic representation. The Sequence-of-Words (SOW) approach is simpler than BOW, but it is used less frequently because it is an unpopular representation for tweets. Table 5 presents the comparative analysis of word type models. Koike et al. (2013) developed an SOW based document representation model for time-series topic detection from correlated news and Twitter. Sasaki et al. (2014) proposed the Twitter Topic Tracking Model (Twitter-TTM), which used the sequence-of-words approach for online trending topic modelling on Twitter. Further, Dey et al. (2018) also used a sequence-of-words approach with LSTM for topical stance detection. Word2vec is yet another representation model that is widely used in Twitter topic modelling for detecting specific groups of user profiles. Vargas-Calderón and Camargo (2019) used word2vec and latent topic analysis for extracting topics and portraying citizens. Similarly, word2vec is also used for sentiment analysis based on user categories.

4.2.9 Frequent pattern mining based models

Frequent Pattern Mining (FPM) was initially proposed for mining transactions, with the aim of locating items that occur together in transactions. FPM can also be used for Twitter topic modelling, as suggested in (Aiello et al. 2013). In the context of social media, an item refers to any word \(w\) included in a post or tweet (except punctuation tokens and stop words). A transaction refers to a post, and all the posts or tweets appearing in a time slot \(T_{i}\) are denoted as a transaction set. The frequency with which a particular set of words occurs in a certain time slot is referred to as its support, and any combination of words (itemset) that meets a minimal support is called a pattern. This approach consists of two processing stages: Frequent Pattern detection (FP-detection) and Frequent Pattern ranking (FP-ranking), where FP-detection is utilized to discover the frequent patterns and FP-ranking is utilized to rank them. To identify frequent patterns, this method uses the FP-growth algorithm, which consists of the following phases.

  • Compute the frequency of every word and disregard those words that fall below a given threshold.

  • Arrange the patterns based on their frequencies and co-occurrences.

  • Construct association rules on the transaction set of the form \(\left\{ {w_{1} ,w_{2} } \right\} \to p_{i} = \left\{ {w_{3} ,w_{4} ,w_{5} , \ldots } \right\}\) with \(support\left( {p_{i} } \right)\).

Then, this method ranks the frequent patterns after identifying them and returns the top \(k\) frequent patterns as the discovered and extracted themes (topics). Guo et al. (2012) suggested an approach for extracting hot topics from Twitter streams with high stability and accuracy and lower overheads using the Frequent Pattern stream mining (FP-stream) algorithm. The FP-stream technique can obtain time-sensitive results; indeed, it has the ability to differentiate between new and old transactions. This differentiation between old and new transactions is an important part of discovering the hot trending topics on Twitter. Kim et al. (2012) presented a probabilistic topic model using FPM and improved the coherence of topics. Other versions of FPM have also been used in topic modelling. Peng et al. (2018a) developed the Emerging Topic detection based on Emerging Pattern Mining (ET-EPM) model using the High Utility Item-set Mining (HUIM) algorithm. Likewise, Choi and Park (2019) detected emerging topics in the Twitter stream using High Utility Pattern Mining (ET-HUPM). Since FPM algorithms require an extensive learning process to extract hot topics on Twitter, they are not preferred in STTM. The summary and comparison of FPM techniques are provided in Table 5.
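As a rough illustration of the FP-detection and FP-ranking stages described above, the following Python sketch counts itemset support by brute force and returns the top-k patterns as candidate topics; a production system would use FP-growth or FP-stream instead, and the thresholds, function name, and toy tweets are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(transactions, min_support=2, max_size=3, top_k=5):
    """Sketch of the two-stage FPM pipeline: FP-detection (count itemsets
    that reach minimal support) followed by FP-ranking (return the top-k
    patterns). Brute-force counting stands in for FP-growth here."""
    # FP-detection: prune rare words, then count candidate itemsets
    word_freq = Counter(w for t in transactions for w in set(t))
    keep = {w for w, c in word_freq.items() if c >= min_support}

    support = Counter()
    for t in transactions:
        items = sorted(set(t) & keep)
        for size in range(1, max_size + 1):
            support.update(combinations(items, size))

    frequent = {p: s for p, s in support.items() if s >= min_support}
    # FP-ranking: rank patterns by support and return the top-k as topics
    return sorted(frequent.items(), key=lambda x: -x[1])[:top_k]

# toy usage: each tweet in a time slot is a transaction of words
tweets = [["flood", "rescue", "river"], ["flood", "river", "warning"],
          ["flood", "rescue"], ["match", "score"]]
print(frequent_patterns(tweets))
```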

4.2.10 Hybrid topic modelling

Apart from the standard topic modelling techniques, many studies have designed hybrid models to exploit the benefits of multiple models. This section reviews the hybrid TSTTM models. To this end, Huang et al. (2017) and Ge et al. (2019) presented a hybrid topic model combining TF-IDF and LDA with improved coherence and perplexity. Zhang et al. (2019) developed a hot topic detection model using deep learning and LDA for limited-word, noisy tweets. It integrates image data (via deep learning) with short text information (via LDA) from Twitter and matches topic words using fuzzy matching. However, news tweets are not effectively classified in this hybrid model due to a lack of feature training.

Zhang and Eick (2019) developed a topic model by combining LDA and density-contour-based spatio-temporal clustering for event detection. Rashid et al. (2019a) designed the Fuzzy K-means Latent Semantic Analysis (FKLSA) model for trending medical tweet topics. The frequencies of the global and local terms are extracted by the BOW model, and PCA is utilized for dimension reduction. FKLSA handles the problem of redundancy effectively to extract medical topics from the tweet health dataset. Zhang and Zhang (2020) developed a model based on Long Short-Term Memory (LSTM) to discover new topics in incremental short text. The short text was first transformed into a word vector using word embedding (word2vec). Then, two models were designed based on LSTM. Lastly, hierarchical clustering was utilized to obtain the number of new topics. Pornwattanavichai et al. (2020) presented a tweet recommendation system using hybrid topic modelling with supervised and unsupervised strategies. It combined LDA and matrix factorization-based neural networks for discovering topics and providing recommendations. Though these hybrid models are highly efficient, they are also prone to complexity problems in most scenarios, which proves to be an obstacle. Yi et al. (2020) developed a novel regularized NMF topic model for short texts called TRNMF. It combines extended NMF and a clustering mechanism through topic regularization and document regularization, respectively, to mitigate the data sparsity issue in short text. Shahbazi and Byun (2021) proposed a model for topic prediction and knowledge discovery that integrates deep learning techniques such as Artificial Neural Networks (ANN) and LSTM with topic modelling and machine learning. This model overcomes data sparsity, data limitation, and word relationship issues. Ha et al. (2019) presented an approach that incorporates dropout into several learning models for LDA; the dropout helps prevent overfitting of probabilistic topic models on noisy short text. The hybrid topic models for short text are compared and summarized in Table 6, along with the respective merits and demerits of each method.

Table 6 Comparative analysis for hybrid TSTTM of short text

4.3 ASTTM models

This section presents the second part of the taxonomy, which categorizes ASTTM models into four categories, namely Dirichlet Multinomial Mixture (DMM) based models, Global word co-occurrence based models, Self-aggregation based models (Qiang et al. 2020), and Deep Learning Topic Modelling (DLTM) based models. Figure 6 shows the taxonomy of the ASTTM models.

Fig. 6 Taxonomy of the ASTTM models

4.3.1 DMM based models

The DMM model was initially proposed by Nigam et al. (2000) and has since been applied to infer the hidden topics of short texts (Qiang et al. 2020). Nigam et al. (2000) suggested an Expectation–Maximization (EM) based method for DMM. Apart from the fundamental EM, various inference models such as Gibbs sampling and variational inference have been utilized to estimate the parameters. DMM is based on the simple assumption that one short text or tweet is sampled from only one hidden topic, which suits short texts better than the more complicated assumption adopted by LDA, in which every document or text is formed from a mixture of topics (Quan et al. 2015; Yan et al. 2015). Some of these models are based on variational inference algorithms (Huang et al. 2013), such as the DMAFP model suggested by Yu et al. (2010), while other models were proposed using the collapsed Gibbs sampling algorithm for DMM.

This sub-section presents the DMM based models, which can discover and extract latent topics from short texts. Many studies incorporating DMM models for STTM followed. Yin and Wang (2014) proposed a Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture (GSDMM), which utilized DMM for short text topic clustering and achieved higher efficiency. Further, they introduced a Fast GSDMM (FGSDMM) (Yin and Wang 2016), which adopted an online clustering method for initialization. The time complexity of the GSDMM algorithm is \(O\left( {KN\overline{I}} \right)\), where \(N\) denotes the number of documents in the dataset, \(K\) is the number of pre-defined hidden topics, and \(\overline{I}\) is the average length of each document in \(D\). However, DMM has two key drawbacks when handling short texts. First, DMM supposes that each short text has only one topic, which is reasonable but not always true owing to the way users gather topics; this reduces its overall effectiveness. Second, DMM does not possess background knowledge of short texts. To address the first limitation, Li et al. (2017) designed an improved DMM model known as Poisson DMM (PDMM), which models the topic number as a Poisson distribution with auxiliary word embedding. To resolve the second limitation, Li et al. (2016a) applied the Generalized Pólya Urn (GPU) model for semantic relatedness in the sampling process of DMM to develop GPU-PDMM and GPU-DMM. However, the promotion weight of the topics is fixed in this model. These models seem to outperform both the DMM and individual PDMM methods but also involve high computation costs. The time complexities of the GPU-DMM and GPU-PDMM algorithms are \(O\left( {KN\overline{I} + NI\zeta + KV} \right)\) and \(O\left( {N\overline{I}\mathop \sum \nolimits_{i = 1}^{\varsigma - 1} C_{K}^{i} + NI\zeta + KV} \right)\), respectively, where \(N\) denotes the number of documents in the dataset, \(K\) is the number of pre-defined hidden topics, \(\overline{I}\) is the average length of each document in \(D\), \(V\) denotes the number of words in the vocabulary, \(\zeta\) denotes the time cost of the GPU mode, \(\varsigma\) is the maximum number of topics allowed in a short text, and \(c\) is the size of the sliding window. Zhang et al. (2018b) enhanced GPU-DMM by including context information and word embedding to obtain the semantic similarities of word pairs in order to improve topic coherence. However, this model also has limitations in handling large collections of data. Mazarura et al. (2020) designed the Gamma-Poisson Mixture (GPM) topic model using an improved DMM concept and collapsed Gibbs sampling. It provided high convergence and high coherence with better flexibility in topic extraction but offered limited performance on complex short texts. Nguyen et al. (2015) presented the Latent Feature vector based DMM, known as LF-DMM, by improving the feature word representations and incorporating word-topic mapping. However, LF-DMM increased noise interference due to its dependence only on external word expansions. The time complexity of the LF-DMM algorithm is \(O\left( {2KN\overline{I} + KVU} \right)\), where \(N\) denotes the number of documents in the dataset, \(K\) is the number of pre-defined hidden topics, \(\overline{I}\) is the average length of each document in \(D\), \(V\) denotes the number of words in the vocabulary, and \(U\) is the number of dimensions of the word embeddings. Yu and Qiu (2019) developed ULW-DMM as an extension to DMM by combining DMM with user-LDA for potential word representation using feature vectors. The ULW-DMM model increases topic coherence in short texts and reduces noise interference by considering both external and internal word expansion. Li et al. (2019c) developed the Laplacian DMM (Lap-DMM) topic model with variational manifold regularization to improve topic classification by 20%. However, it is based on document similarities and hence involves complexities.
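To make the one-topic-per-document assumption and the collapsed Gibbs sampling of GSDMM concrete, the following minimal Python sketch implements the sampling loop over tokenized tweets in the spirit of Yin and Wang (2014); the hyper-parameter values, function name, and toy data are illustrative assumptions, and a real implementation would add convergence checks and cluster pruning.

```python
import random
import math
from collections import Counter

def gsdmm(docs, K=20, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Minimal sketch of GSDMM-style collapsed Gibbs sampling for the
    Dirichlet Multinomial Mixture model (one topic/cluster per short text)."""
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    V, D = len(vocab), len(docs)

    z = [rng.randrange(K) for _ in docs]         # cluster label per document
    m = [0] * K                                  # documents per cluster
    n = [0] * K                                  # words per cluster
    nw = [Counter() for _ in range(K)]           # word counts per cluster
    for d, k in zip(docs, z):
        m[k] += 1
        n[k] += len(d)
        nw[k].update(d)

    for _ in range(iters):
        for i, d in enumerate(docs):
            k = z[i]
            # remove document i from its current cluster
            m[k] -= 1; n[k] -= len(d); nw[k].subtract(d)

            # log-probability of each candidate cluster (collapsed DMM conditional)
            freq = Counter(d)
            logp = []
            for c in range(K):
                lp = math.log(m[c] + alpha) - math.log(D - 1 + K * alpha)
                for w, cnt in freq.items():
                    for j in range(cnt):
                        lp += math.log(nw[c][w] + beta + j)
                for j in range(len(d)):
                    lp -= math.log(n[c] + V * beta + j)
                logp.append(lp)

            # sample a new cluster from the normalized probabilities
            mx = max(logp)
            probs = [math.exp(p - mx) for p in logp]
            r = rng.random() * sum(probs)
            k_new, acc = 0, probs[0]
            while acc < r and k_new < K - 1:
                k_new += 1
                acc += probs[k_new]

            z[i] = k_new
            m[k_new] += 1; n[k_new] += len(d); nw[k_new].update(d)
    return z

# toy usage: tokenized tweets
tweets = [["flood", "rescue", "river"], ["game", "score", "win"],
          ["river", "flood", "warning"], ["win", "match", "score"]]
print(gsdmm(tweets, K=4, iters=10))
```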

Li et al. (2019a) developed a highly efficient text clustering model known as X-DMM, which reduces the sampling complexity using the Metropolis–Hastings method and presents an uncollapsed Gibbs sampler to parallelize the training model for scalable topic clustering. However, this model produced incorrect mixtures in some cases due to contrasting sampling. Xiao et al. (2019) developed the Word Sense embedding based DMM (WS-DMM) for topic extraction and item recommendation through time-aware probabilistic modelling of the user profile presence score. But this model suffers from the problem of error propagation, which negatively impacts accuracy. Liu et al. (2020b) designed the Collaboratively Modelling and Embedding based DMM, known as CME-DMM, incorporating topic and word embeddings (WE) for extracting latent topics with high coherence. Although this model correlated different source data, the complexity of handling large data continues to be an issue. He et al. (2020a) developed Targeted Aspects-oriented Topic Modelling (TATM) using the Enhanced DMM process (E-DMM) and target aspect extraction from different angles. Though it overcomes the problems of standard DMM with efficient time management, it is incapable of handling complex-structured short texts. Garcia and Berton (2021) introduced an effective method that uses sentiment analysis and topic modelling to explore a huge number of posts or tweets from the USA and Brazil related to deaths and the spread of COVID-19. Li et al. (2021) suggested two models, namely Lab-DMM and OLabDMM, which can handle massive sets of short texts and try to alleviate the data sparsity problem. The time complexity of the LapDMM algorithm is \(O\left( {TDK\overline{N} + T\hat{T}DKR + D^{2} } \right)\), where \(D\) represents the number of short texts, \(K\) denotes the number of topics, \(\overline{N}\) represents the average document length of the corpus, \(T\) denotes the number of outer iterations, and \(\hat{T}\) denotes the number of inner iterations in LapDMM. Mai et al. (2021) suggested a TM model called TSSE-DMM over short texts that improves the interpretability and coherence of themes using topic subdivision and mitigates the data sparsity problem using a semantic improvement mechanism. Table 7 qualitatively summarizes and analyzes the existing DMM based ASTTM models.

Table 7 Comparative analysis of existing DMM based models

4.3.2 Global word co-occurrences based ASTTM models

Generally, there is insufficient word co-occurrence information in a short text. To cope with this issue, some models attempt to leverage the rich global word co-occurrence patterns in the original dataset to infer hidden topics (themes), such as (Cheng et al. 2014; Zuo et al. 2016b). Global word co-occurrences alleviate the short text sparsity problem to some extent because of their adequacy. These models require configuring a sliding window in order to extract word co-occurrences. Generally, if the length of a short text is greater than ten, they employ a sliding window and fix its size to 10; otherwise, they simply treat the whole short text as a single window. Such models can be classified into two categories based on their strategies for utilizing global word co-occurrences. The first category infers the latent topics directly from global word co-occurrences. For example, the BTM model (Cheng et al. 2014) assumes that the two words comprising a biterm share the same topic, which is drawn from the topics of the entire dataset. The second category, such as (Zuo, Zhao, et al. 2016b), creates a word co-occurrence network from the global word co-occurrences and then infers hidden topics from the constructed network, where each term or word is a node and the weight of each edge represents the empirical likelihood of co-occurrence between the two connected words.
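The sliding-window strategy and the co-occurrence network construction described above can be sketched in a few lines of Python; the window size, function names, and toy corpus below are illustrative assumptions, and published models differ in how they weight or normalize the resulting biterm counts.

```python
from collections import Counter
from itertools import combinations

def extract_biterms(doc_tokens, window=10):
    """Sliding-window biterm extraction: if a short text has more than
    `window` tokens, biterms are drawn from each window; otherwise the
    whole text is treated as one window."""
    if len(doc_tokens) <= window:
        windows = [doc_tokens]
    else:
        windows = [doc_tokens[i:i + window]
                   for i in range(len(doc_tokens) - window + 1)]
    biterms = []
    for w in windows:
        biterms.extend(tuple(sorted(pair)) for pair in combinations(w, 2))
    return biterms

def cooccurrence_network(corpus, window=10):
    """Edge weights = raw co-occurrence counts, as in network-based models
    such as WNTM (actual models may normalize these weights differently)."""
    edges = Counter()
    for doc in corpus:
        edges.update(extract_biterms(doc, window))
    return edges

corpus = [["storm", "power", "outage", "city"],
          ["city", "power", "restored"]]
print(cooccurrence_network(corpus).most_common(3))
```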

This sub-section presents the STTM models based on global word co-occurrences. The most important methods in this category are the Word-Network Topic Model (WNTM) and the Biterm Topic Model (BTM). BTM is one of the most effective STTM models. Cheng et al. (2014) first proposed BTM to extract topics from short texts by generating word co-occurrence patterns. BTM is easy to implement, can learn higher quality topics, and offers better topic structure extraction. However, BTM can lose several potentially prominent and coherent word co-occurrence patterns that are not observed in the corpus. It also suffers from noise and the extraction of many irrelevant biterms. The time complexity of the BTM algorithm is \(O\left( {KN\overline{I}c} \right)\), where \(N\) denotes the number of documents in the dataset, \(c\) is the size of the sliding window, \(\overline{I}\) is the average length of each document in \(D\), and \(K\) is the number of pre-defined hidden topics. Pang et al. (2016) developed Sentimental BTM (SBTM), which uses the resemblance between words and documents with sentimental relations to tackle the problem of irrelevance. Li et al. (2018a) developed Latent Semantic augmented BTM (LS-BTM) to include latent semantic details for topic extraction and improve the performance of BTM. However, this model has its limitations due to the usage of many topic-irrelevant biterms. Li, A. Zhang, et al. (2019) developed Relational BTM (R-BTM) to overcome the coherence loss of the BTM model through a similarity list of related words using word embedding. However, R-BTM does not consider other problems of BTM, such as meaningless biterm extraction and noise reduction.

Li et al. (2016b) designed a hybrid model of K-means clustering and BTM for micro-blog topic discovery with less noise. Xu et al. (2018) proposed Chinese Topic Modelling (CTM) based on LDA and BTM to extract the topic distribution of short text from the corpus and the document itself. Zhu et al. (2019b) developed a joint topic model using Incremental BTM (IBTM) and extended LDA over streaming Chinese short texts. The BTM model tends to overlook document-topic semantic information and lacks the exact intents of users, in addition to the sparsity problem. This joint model overcomes these limitations by using extended LDA to retain document-topic information and IBTM to alleviate the sparsity issue. However, in this model, some words lack semantic relevance and tend to reduce efficiency.

Wu and Li (2019) established the Multi-term Topic Model (MTM), which extracts variable-length and multiple correlated word patterns from short texts to infer the latent trending topics. This model overcomes limitations of BTM such as the extraction of many irrelevant and useless biterms. However, MTM has a minor complexity limitation due to the separate handling of multi-terms and topics. WNTM represents the second type of global word co-occurrence based topic model, which utilizes global word co-occurrences to build a word co-occurrence network and learns the distribution over topics using LDA. Zuo, Zhao, et al. (2016b) developed WNTM and used it for topic clustering from short and imbalanced texts. However, WNTM fails to express the deep meaning among words due to the lack of semantic distance measures. Further, WNTM retains a lot of highly irrelevant data in the word-word space. The time complexity of the WNTM algorithm is \(O\left( {KN\overline{I}c(c - 1)} \right)\), where \(K\) is the number of pre-defined hidden topics, \(N\) denotes the number of documents in the dataset, \(c\) is the size of the sliding window, and \(\overline{I}\) is the average length of each document in \(D\). Wang et al. (2016) developed the Robust WNTM (R-WNTM) as an extension for short texts. As the amount of irrelevant data in the word-word space building procedure of WNTM is high, R-WNTM filters the unrelated data during the sampling process. Jiang et al. (2018) presented WNTM with Word2Vector (WNTM-W2V) to extract deep meaning between words, which enhances topic coherence as well as the accuracy of the relationships among words.

Another word co-occurrence based model, namely the Couple-Word Topic Model (CWTM), was presented by Diao et al. (2017) to tackle the problems of data sparsity and incomplete description in topic extraction using couple-word co-occurrences. CWTM is the first model to incorporate word couples, but it also suffers from difficulties in handling complex-structured short texts and noisy microblog data. Akhtar and Beg (2019a) developed the User Graph Topic Model (UGTM) by extending the author topic model through semantic relationships of contextual data. This method is highly efficient for topic extraction in a dynamic manner. Liang et al. (2018) designed the Global and Local word embedding-based TM (GLTM), which trains global embeddings and uses a suitable encoding of the continuous Skip-Gram model with Negative Sampling (SGNS) to obtain local word embeddings. This process enhances semantic relatedness based topic discovery in short texts. Li, Wang, et al. (2018c) presented the Common Semantics Topic Model (CSTM), which uses unigrams for filtering noise in short text topic discovery. However, this model has limitations in terms of setting priors and determining the number of topic labels.

Chen et al. (2020a) developed two models: the Dirichlet Process Biterm-based Mixture Model (DP-BMM), which can alleviate the sparsity issue and handle topic drift in short text streams, and an enhanced version of DP-BMM with a forgetting property, named DP-BMM-FP, which efficiently removes the biterms of outdated documents by eliminating the clusters of outdated batches. Moreover, Singh and Singh (2020) proposed a novel algorithm, namely Significance-based Label Propagation Community Detection (NSLPCD), which can detect and identify topics promptly after they occur in the Twitter dataset without compromising accuracy. Due to the data sparsity associated with short text and Twitter data, traditional topic discovery typically faces the difficulty of unintelligible and incoherent topic representations. Thus, Liqing et al. (2019) presented a new fusion model named HVCH (hot topic detection based on the VSM combined with HMBTM) to resolve this problem. Also, Hadi and Fard (2020) proposed an approach called Adaptive Online BTM (AOBTM), which solves the data sparsity issue and takes into account the statistical data of an optimal number of previous time-slices. The time complexity of the AOBTM model is \(O\left( {N_{itr} K\left| {N^{\left( t \right)} } \right| + vKW} \right)\), where \(N\) denotes the number of documents in the dataset, \(K\) is the number of pre-defined hidden topics, \(v\) denotes the number of available time-slices, and \(W\) represents the total number of words. Moreover, a new model called Noise BTM Word Embedding (NBTMWE) was developed by Huang et al. (2020) to tackle data sparsity; NBTMWE combines a noise BTM with word embeddings from an external corpus to improve topic coherence. Wu et al. (2020a) devised a clustering algorithm for short texts based on the BG & SLF-Kmeans technique to discover hot topics from short microblog texts. The pre-processed short texts were modelled using the BTM and GloVe approaches. The similarity of the text based on the BTM vector was estimated using the JS divergence, and the similarity of the text based on the GloVe vector was estimated using the Improved Word Mover's Distance (IWMD). Lastly, K-means clustering was realized using the distance function obtained from the linear fusion of the two similarities. Yang and Wang (2021) presented a new TM called AOTM (Author co-Occurring Topic Model) for extracting topics from short user comments and normal text. By taking authorship into account, AOTM provides each author of short texts with a probability distribution over a collection of themes exemplified solely by short texts. It explores clean user preferences and alleviates data sparsity. Table 8 presents a qualitative summary and analysis of the existing Global word co-occurrence based ASTTM models based on their respective merits, limitations, datasets used, and data sources.

Table 8 Comparative analysis of Global word co-occurrences based models

4.3.3 Self-aggregation based ASTTM models

The self-aggregation based models were introduced to perform topic modelling for short texts while automatically aggregating the short texts during topic inference in the same iteration. Such models merge short texts into long pseudo-documents to extract the hidden topics, which enhances word co-occurrence information and, to some extent, addresses the problem of data sparsity. Earlier aggregation models, such as (Weng et al. 2010; Mehrotra et al. 2013; Qiang et al. 2017), aggregate the short texts first and then apply topic modelling, as sketched below. Newer strategies, such as Pseudo-document-based Topic Modelling (PTM) (Zuo et al. 2016a) and Self-aggregation based Topic Modelling (SATM) (Quan et al. 2015), differ from the previous models in that they integrate clustering and topic modelling simultaneously in one iteration. SATM and PTM are considered the most commonly used models in this category.
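To illustrate the simpler aggregate-then-model strategy (not the joint inference of SATM or PTM), the following Python sketch merges tweets that share a hashtag into pseudo-documents and then fits a standard topic model; gensim's LdaModel stands in for whatever topic model a given study applies, and the toy tweets and function name are assumptions.

```python
from collections import defaultdict
from gensim import corpora, models   # assumes gensim is installed

def aggregate_by_hashtag(tweets):
    """Merge tweets that share a hashtag into one pseudo-document;
    tweets without hashtags each remain their own document."""
    pseudo = defaultdict(list)
    for i, tokens in enumerate(tweets):
        tags = [t for t in tokens if t.startswith("#")] or [f"tweet_{i}"]
        for tag in tags:
            pseudo[tag].extend(t for t in tokens if not t.startswith("#"))
    return list(pseudo.values())

tweets = [["#flood", "river", "rising"], ["#flood", "rescue", "teams"],
          ["match", "tonight", "#football"], ["#football", "great", "score"]]
docs = aggregate_by_hashtag(tweets)

# fit a standard topic model on the longer pseudo-documents
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics(num_words=3))
```

In practice the aggregation key (author, hashtag, conversation, etc.) is chosen per study, which is precisely the heuristic commitment the joint SATM/PTM approaches were proposed to avoid.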

This subsection introduces the latest self-aggregation based models. To this end, SATM, first proposed by Quan et al. (2015), considers each short text as a sample from a hidden long pseudo-document and merges short texts so that Gibbs sampling can be used for topic extraction without relying on metadata or auxiliary information. However, SATM is prone to over-fitting and is also computationally expensive. The time complexity of the SATM model is \(O\left( {N\overline{I}PK} \right)\), where \(P\) denotes the long pseudo-document set generated by the SATM model, \(K\) is the number of pre-defined hidden topics, \(N\) denotes the number of documents in the dataset, and \(\overline{I}\) is the average length of each document in \(D\). L. Shi et al. (2019a) enhanced SATM to develop a dynamic topic model that improves the efficiency of topic extraction. Blair et al. (2020) also designed aggregated topic models using cosine similarity and Jensen-Shannon divergence to increase topic coherence. However, it does not seem to consider human perceptions and involves a trade-off between explicit and intrinsic features. The time complexity of the algorithm of Blair et al. (2020) is \(O\left( {N_{itr} DK\overline{I}} \right)\), where \(N_{itr}\) denotes the number of iterations, \(K\) is the number of topics, \(D\) is the number of documents, and \(\overline{I}\) is the average number of words in a document.

In order to improve the performance of topic extraction compared with SATM, PTM (Zuo et al. 2016a) introduced the concept of the pseudo-document to implicitly combine short texts and tackle data sparsity. In addition, Zuo et al. (2021) proposed a novel TM model named Word Embedding-enhanced PTM (WE-PTM) to leverage pre-trained word embeddings, which alleviates the data sparsity issue. The authors also introduced Sparsity-enhanced PTM (SPTM), which applies the Spike and Slab prior to remove unwanted correlations among the pseudo-documents. Although efficient, continuous research is needed to further improve its performance. The time complexity of the PTM algorithm is \(O\left( {N\overline{I}\left( {P + K} \right)} \right)\), where \(K\) is the number of pre-defined hidden topics, \(N\) denotes the number of documents in the dataset, \(P\) denotes the long pseudo-document set generated by the PTM model, and \(\overline{I}\) is the average length of each document in \(D\). Jiang et al. (2016) developed the Biterm Pseudo Document Topic Model (BPDTM), which extends BTM. Wandabwa et al. (2021) presented a method for learning the semantic relevance and importance of tweets; it determines a tweeter's degree of interest in a given topic based on the semantic relevance of the user's tweets. Bicalho et al. (2017) proposed Co-Frequency Expansion (CoFE) and Distributed Representation-based Expansion (DREx) to expand short texts into large pseudo-documents. Feng et al. (2020a) presented a User Group-based Topic-Emotion model (UGTE) for topic extraction and emotion detection, which alleviates the data sparsity problem by aggregating the short texts of a group into long pseudo-documents. Most of the previous works took the data sparsity problem into account but did not consider the sensitivity of word order in short texts. To address these issues, Lin et al. (2020a) developed a new topic model for short text named the Pseudo-document-based Topical N-Gram model (PTNG), which tackles data sparsity in short text and is sensitive to word order. Moreover, Lu et al. (2021) proposed a new model, namely the Sense Unit based Phrase Topic Model (SenU-PTM), which alleviates the data sparsity problem and enhances the readability of short-text topics. The different self-aggregation based ASTTM models, along with their respective advantages and disadvantages, are summarized in Table 9.

Table 9 Comparative analysis of self-aggregation based ASTTM models

4.3.4 Deep Learning Topic Modelling (DLTM) models

Short texts such as social media posts, product reviews, news headlines, etc., are becoming a more popular form of textual data. Unlike long texts from formal documents, messages on social media are generally short. Nevertheless, automatic extraction of semantic topics from such short textual forms is highly desirable in many NLP applications. Traditional TMs such as LDA and PLSA have limitations when handling social network data due to the limited word co-occurrence in each tweet or post. DL-based models appear to be viable for extracting valuable knowledge from such short texts in complex systems. To this end, various DLTM techniques for short texts are emerging to achieve flexibility and high performance. Recently, Neural Topic Modelling (NTM) has become a key DL tool for dealing with such short texts. Driven by the desire to learn more coherent and semantically meaningful topics, the NTM approach has attracted much attention as it benefits from both neural networks and probabilistic TMs. The literature on TM reports several models based on NTM; this section categorizes NTM into Neural Variational Inference (NVI), Variational Autoencoder (VAE), and graph-based DLTM models and then briefly reviews the works related to each category. Table 10 presents the comparative analysis of deep learning topic modelling based models.

Table 10 Comparative analysis of deep learning topic modelling based Models
4.3.4.1 Neural Variational Inference (NVI) based models

Traditional probabilistic TMs typically seek a closed-form solution for the model parameters and approximate the intractable posteriors using approximation methods. Consequently, these models result in inaccurate parameter inference and low efficiency when dealing with large-scale data. Recently, NVI has emerged to solve such issues; it provides scalable and powerful deep generative models for modelling latent topics based on neural networks. Surprisingly, most neural variational TMs make the assumption that topics are independent and irrelevant to one another. This assumption, however, is unreasonable in many real-world scenarios. Deep latent variable models have emerged as a result of recent advances in NVI. Miao et al. (2016) suggested a generic deep NVI approach for generative and conditional text models. Whereas traditional variational techniques yield an analytic approximation for the intractable distributions over latent variables, here an inference network conditioned on the discrete text input provides the variational distribution. This approach was evaluated over two variant text modelling applications: supervised question answering and generative document modelling. The neural variational document model combines a continuous stochastic document representation with a bag-of-words generative model and scored the lowest reported perplexities on two standard test corpora. The neural answer selection model utilizes a stochastic representation layer within an attention mechanism for semantics extraction.

Text analysis methods such as visualization and TM are widely used. Typically, traditional visualization methods search the visualization space for low-dimensional representations of documents. In contrast, TM aims at detecting topics from text, but for visualization, a post-hoc embedding utilizing dimensionality reduction techniques is required. Some NTM models employ a generative model for jointly discovering topics and visualizations, including semantics in the visualization space for better analysis. The scalability of their inference algorithms is a major barrier to their practical application. Pham and Le (2020) proposed an Auto-Encoding Variational Bayes based inference method to jointly visualize and infer topics. Since the proposed model is a black box, it can effectively handle model changes with little mathematical rederivation effort. Further, Pham and Le (2021) proposed an NTM approach for Hierarchical Topic detection and Visualization (HTV) that jointly detects the topic hierarchy and generates document visualizations with their topic structure. This helps in the quick detection of documents and significant topics with the desired granularity. To build an unbounded topic tree, they utilized a doubly-recurrent neural network (DRNN) for generating topic embeddings. They also used the KK layout objective function to regularize the model.

BTM has proved efficient in addressing sparsity issues by utilizing word co-occurrence relationships, but BTM and its extended models ignore the internal relationships between words. Considering the relationships between words, Lu et al. (2017) suggested a short text TM named RIBS-TM, which uses an RNN for relationship learning and IDF for filtering high-frequency words. For document embedding, Li et al. (2020) suggested a bi-Directional Recurrent Attentional Topic Model (bi-RATM). Apart from using the sequential order of sentences, bi-RATM also uses an attention mechanism to model the relationships between successive sentences. Besides, they presented a bi-Directional Recurrent Attentional Bayesian Process (bi-RABP) for handling the sequences. Based on bi-RABP, bi-RATM fully utilizes the bidirectional sequential information of sentences in a document. Furthermore, an online bi-RATM is suggested for handling large-scale corpora. Isonuma et al. (2020) suggested a tree-structured NTM that distributes topics over a tree with an infinite number of branches. This model parameterizes an unbounded ancestral and fraternal topic distribution by employing doubly-recurrent neural networks. It utilized autoencoding variational Bayes, which enhanced data scalability and performance when inducing latent topics and tree structures.

Peng, Xie, et al. (2018b) proposed a model called NSTC that incorporates a neural network and word embeddings to enhance sparse TMs. They replaced the complex inference process with backpropagation to reduce the computational complexity of TMs, while the external semantic information from word embeddings improves short text representation. Based on this work, Min Peng et al. (2019) presented a Bayesian hierarchical TM known as Bayesian Sparse Topical Coding with Poisson Distribution (BSTC-P), which imposes a hierarchical sparse prior to leverage prior information about the relevance between sparse coefficients. Moreover, sparse Bayesian learning was proposed in this model to enhance the learning of sparse word, topic, and document representations. BSTC-P benefits from learning the word, topic, and document proportions. In the meantime, a sparsity-improved version of BSTC was developed to obtain the sparsest optimal solution using the Normal-Jeffrey hierarchical prior. For effective learning, Expectation–Maximization and Variational Inference procedures are used.

Miao et al. (2017) proposed various NTMs based on Gaussian Softmax, Gaussian Stick-Breaking, and Recurrent Stick-Breaking constructions for parameterizing each document's latent multinomial topic distributions. Building on such constructions, and to detect latent topic sparsity while maintaining training stability and topic coherence, Lin et al. (2019) suggested two NTMs based on the Gaussian sparsemax (GSM) construction, which provides sparse posterior distributions over topics and allows effective training via stochastic backpropagation. They built an inference network conditioned on the input data and used the Relaxed Wasserstein (RW) divergence to infer the variational distribution. Feng, Zhang, et al. (2020b) suggested a Context Reinforced NTM called CRNTM. The proposed model infers the topic of each word within a narrow range, assuming that every short text covers only a few salient topics. Then, it exploits pre-trained word embeddings in the embedding space by treating topics as multivariate Gaussian distributions or Gaussian mixture distributions.

To model user interactions in microblogs, He et al. (2018) suggested a topic modelling approach called the Interaction-Aware Topic Model (IATM), which combines user attention and network embedding. To mine dynamic user behaviours, they built a conversation network by connecting users via repost and reply relationships. They learn interaction-aware edge embeddings with social context by modelling dynamic interactions and user attention, and then incorporate them into neural variational inference to produce more consistent topics. Card et al. (2018) utilized stochastic variational inference to introduce a general neural framework based on TMs for flexible metadata incorporation and rapid exploration of alternative models. This model was evaluated over a corpus of articles related to immigration in the United States. It achieves high performance while balancing perplexity, coherence, and sparsity in a manageable way. Dieng et al. (2020) proposed the Embedded Topic Model (ETM), which combines LDA with word embeddings to generate interpretable topics from large vocabularies containing rare words and stop words. ETM models each word with a categorical distribution whose natural parameter is the inner product between the word's embedding and an embedding of its assigned topic. Further, they proposed an efficient Amortized Variational Inference (AVI) algorithm to fit the ETM. Rezaee and Ferraro (2020) used discrete random variables for NTM learning, explicitly modelling the assigned topic of each word based on NVI; to handle the discrete variables, it does not rely on stochastic backpropagation. Using NVI, this model combines the expressive power of neural techniques for representing text sequences with the ability of TMs to capture global, thematic coherence.
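The ETM construction of topic-word distributions can be written compactly: each topic's distribution over the vocabulary is the softmax of the inner products between that topic's embedding and the word embeddings. The NumPy sketch below illustrates this; the embedding dimensions and random embeddings are placeholders, not values from the original paper.

```python
import numpy as np

def etm_topic_word_matrix(word_emb, topic_emb):
    """Sketch of how ETM (Dieng et al. 2020) forms its topic-word
    distributions: the logit for word w under topic k is the inner product
    of the word embedding and the topic embedding; a softmax over the
    vocabulary turns each row into a categorical distribution."""
    logits = topic_emb @ word_emb.T                  # shape (K, V)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)    # each row sums to 1

# toy example: V = 5 words, K = 2 topics, embedding dimension 3
rng = np.random.default_rng(0)
beta = etm_topic_word_matrix(rng.normal(size=(5, 3)), rng.normal(size=(2, 3)))
print(beta.round(3))
```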

4.3.4.2 Variational Autoencoder (VAE) based models

VAEs enable generative TMs using neural networks. Recently, advances in NVI have resulted in effective text processing. For instance, NTMs are typically built on VAEs with the goal of minimizing the error of reconstructing the original documents based on the learned latent topic vectors. However, reducing reconstruction error does not always result in high-quality topics. Srivastava and Sutton (2017) suggested an AEVB-based inference model for LDA called Autoencoded Variational Inference for Topic Models (AVITM). This model addresses the issues that the Dirichlet prior and component collapsing cause for AEVB. It matches the traditional methods in accuracy while taking much less time for inference; indeed, thanks to the inference network, the computational cost of running variational optimization on test data becomes unnecessary. Since AVITM is a black box, it can be applied easily to any TM. As an example, ProdLDA, a newly proposed TM approach that replaces the mixture model in LDA with a product of experts by modifying just one line of code, produces far more interpretable topics, even when LDA is trained using collapsed Gibbs sampling.
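A minimal VAE-style neural topic model in the spirit of AVITM/ProdLDA can be sketched as follows, assuming PyTorch is available; the layer sizes, the standard-Gaussian prior on the latent logits (a logistic-normal over topic proportions), and the simple softmax decoder are simplifying assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniNTM(nn.Module):
    """Minimal VAE-based neural topic model: an inference network maps a
    bag-of-words vector to the parameters of a Gaussian over latent logits,
    and a linear decoder reconstructs the document from the softmaxed
    topic proportions."""
    def __init__(self, vocab_size, num_topics=20, hidden=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)
        self.beta = nn.Linear(num_topics, vocab_size, bias=False)  # topic-word weights

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization
        theta = F.softmax(z, dim=-1)                               # topic proportions
        recon = F.log_softmax(self.beta(theta), dim=-1)            # word log-probs
        # negative ELBO: reconstruction term plus KL to a standard normal prior
        rec_loss = -(bow * recon).sum(-1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return (rec_loss + kl).mean()

# toy training step on random bag-of-words counts
model = MiniNTM(vocab_size=500)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
bow = torch.randint(0, 3, (32, 500)).float()
opt.zero_grad()
loss = model(bow)
loss.backward()
opt.step()
print(float(loss))
```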

Zeng et al. (2018) jointly investigated topic inference and short text classification by using Topic Memory Networks (TMN) to encode latent topic representations indicative of class labels. They focus on extending features by using external knowledge or pre-trained topics. Gui et al. (2019) utilized Reinforcement Learning (RL), with topic coherence measures as reward signals, to guide the learning of a VAE-based TM. This enables the automatic separation of background words from topic words, obviating the need for the pre-processing stage of filtering infrequent and/or highly frequent words, which is typically needed for learning traditional TMs. Bougteb et al. (2019) applied a deep autoencoder model with the K-means++ algorithm for detecting eventual topics in reconstructed data with less noise. Liu et al. (2019) presented a Centralized Transformation Flow to capture topic correlations by reshaping topic distributions. Moreover, they proposed the Transformation Flow Lower Bound to enhance the proposed model's performance. Lin et al. (2020b) suggested an NTM for short texts based on Auto-Encoding Variational Bayes. This model uses Clayton copulas to guide the estimated topic distributions obtained from linearly projected samples of re-parameterized posterior distributions. Wu et al. (2020b) designed an NTM to produce high-quality topics from short texts. This model utilizes a new topic distribution quantization technique to produce peakier distributions for modelling short texts. They further developed a negative sampling decoder to enhance the diversity of short text topics.

4.3.4.3 Graph-based DLTM models

To address LDA's failure to capture rich correlations between topics and its high inference complexity, as well as Probabilistic Latent Semantic Indexing's overfitting problem, Graph Neural Networks (GNNs) such as GCNs help in learning document representations efficiently by exploiting the semantic relation graph between documents and words. With a few exceptions, most of the previous works in this field do not take into account the underlying topical semantics inherent in document contents and the relation graph, making the representations less effective and difficult to interpret. A few recent studies attempt to apply latent topics to GNNs, where the topics are learned independently from the relation graph modelling. Xie et al. (2021) suggested a Graph Topic Neural Network (GTNN) model for extracting the semantics of latent topics for intelligible document representation learning, considering the word-to-word, document-to-word, and document-to-document relationships in the graph. To extract topics from concurrent word co-occurrence and semantic similarity graphs, Wang et al. (2021b) proposed the Dual Word Graph Topic Model (DWGTM), in which the global word co-occurrence graph is used to train DWGTM to learn word features. It then generates text features from word features and feeds them into an encoder network to obtain topic proportions per text; finally, it reconstructs texts and word co-occurrence graphs using topical distributions and word features, respectively. Furthermore, they used word features for rebuilding a word semantic similarity graph computed from pre-trained word embeddings in order to extract the semantics of words.

Zhu et al. (2018) suggested an approach for representing biterms as graphs, known as Graph-BTM. They also designed a graph convolutional network with residual connections for extracting transitive features from biterms. Moreover, to address LDA's data sparsity and BTM's strong assumption, they sample a fixed number of documents to create a mini-corpus as a training sample. Further, to generate more coherent topics, they proposed an AVI method for Graph-BTM. Yang et al. (2020) suggested an approach to address the overfitting issue of PLSI by applying AVI with word embeddings as input rather than the Dirichlet prior of LDA. To minimize the number of parameters, the AVI replaces the inference of the latent variable with a function that has shared learnable parameters; the number of shared parameters is fixed and independent of the scale of the corpus. They proposed a Graph Attention Topic Network (GATON) for modelling the topic structure of non-independent and identically distributed documents, overcoming the limitation of AVI's applicability to independent and identically distributed documents.

4.3.4.4 Other deep learning topic modelling based models

Traditional TMs often need specific inference procedures for certain tasks. They are also not intended for producing word-level semantic representations. Wang et al. (2019) suggested an NTM model called the Adversarial-neural Topic Model (ATM) using Generative Adversarial Nets (GANs). It models topics with a Dirichlet prior and applies a generator network for extracting the semantic patterns from latent topics. Furthermore, ATM was applied to open-domain event extraction to demonstrate the feasibility of the model for tasks other than TM. To enforce sparseness, LDA-based TMs typically use the Dirichlet distribution as a prior for the topic and word distributions. However, Dirichlet distributions entail a trade-off between sparsity and smoothness: sparsity is important for low reconstruction error during autoencoder training, while smoothness allows for generalization and leads to a higher log-likelihood on the test data. These properties were encoded in the Dirichlet parameter vector by Burkhardt and Kramer (2019b); this parameter vector can be rewritten as a product of a sparse binary vector and a smoothness vector. This results in a model with both competitive topic coherence and a high log-likelihood. For the reparameterization of the Dirichlet distribution, rejection sampling variational inference allows for efficient training.

Chuluunsaikhan et al. (2020) developed an approach for predicting the daily retail price of pork in South Korean local markets based on news articles by integrating DL with TM. They initially utilized TM to extract relevant keywords expressing price changes. Then, using these keywords, they built a prediction model based on statistical, ML, and DL methods. NTMs have shown improvements in overall coherence, and contextual embeddings have advanced the state of the art of neural models in general. Bianchi, Terragni, and Hovy (2021) integrated NTM with contextualized representations to generate more meaningful and coherent topics. Mishra et al. (2021) utilized TM to detect hidden topics and identify the narrative and direction of tourism, hospitality, and healthcare related topics. Furthermore, TM was used to detect inter-cluster similar terms and analyze the flow of information from groups with similar viewpoints. Finally, a cutting-edge DL classification model was utilized with various epoch sizes of the dataset to predict and classify people's feelings. Murakami and Chakraborty (2022) proposed a fine-tuning phase with an original corpus for training NTMs to generate semantically coherent and corpus-specific topics. They used eight NTMs to evaluate the effectiveness of the proposed additional fine-tuning phase and pre-trained word embeddings in generating high-quality interpretable topics through simulation analysis over several datasets.

4.3.5 Other ASTTM models

Certain other techniques that do not resemble the categories of models described in the earlier subsections have also been proposed. These methods exploit the benefits of multiple strategies without belonging to any of them. Liu et al. (2015) designed the Micro Blog Hierarchical Dirichlet Process (MB-HDP) topic model to tackle the problem of sparsity without a fixed number of topics as in the case of LDA. Xie et al. (2016) designed TopicSketch, a data sketch-based topic model, to perform real-time bursty topic detection. This model maps the tweet data into sketches and then extracts the topics from each sketch. Zhang et al. (2018a) developed the Pattern-based Topic Detection and Analysis System (PTDAS) to extract trending topics from Chinese tweets through interesting cosine patterns. Liu et al. (2018) developed the Multiple Relational Topic Model (MRTM) by establishing a document-attribute distribution and a two-step random sampling strategy to exploit both explicit and implicit relations. It improves both the coherence and accuracy of topic extraction but has limitations in terms of slow sampling speed and an unresolved trade-off between explicit and implicit relations. Wang et al. (2018) developed the Attention Segmentation based TM (ASTM) for short texts by integrating a human attention mechanism and word embedding as supplementary information to improve topic coherence. Although significantly efficient, this model depends entirely on a fixed similarity threshold, which might reduce the performance of segmentation. Zhu et al. (2019a) developed a Bayesian topic model for hierarchical topic-viewpoint discovery from tweets based on a same-depth tree formation.

5 Existing datasets

STTM models utilize many datasets; most are publicly available, while others were collected for specific studies. This section reviews these datasets and provides a comparative analysis based on the Number of Documents (ND), Vocabulary Size (VS), Labels/clusters (L), Average Document Length (AvgDL), utility, and Language (Lang), where EN denotes English and CHI denotes Chinese. We also provide the references of the publicly available datasets to facilitate their accessibility to researchers. Table 11 shows a comparative analysis of the prominent datasets used by STTM models, whereas Table 12 summarizes the datasets used by STTM models based on domain and availability. The dominant datasets are briefly described below.

Table 11 Comparative analysis of the prominent datasets used by STTM models
Table 12 Summary of the datasets used by STTM models based on domain and availability

5.1 Tweetset dataset

This dataset contains 2472 tweets that are highly relevant to 89 queries. The relevance between queries and tweets was labelled manually in the 2011 and 2012 microblog tracks at the Text REtrieval Conference (TREC). The vocabulary size of this dataset is 5098, and the Average Document Length (AvgDL) is 8.56. This dataset is denoted as the TweetSet dataset. The other details of the TweetSet dataset are listed in Table 11.

5.2 Tweets dataset

A huge collection of tweets was gathered and labelled by Zubiaga and Ji (2013). They scraped tweets containing URLs and classified them according to the categories of the web pages pointed to by the URLs. The web page categories are identified by the Open Directory Project (ODP). This dataset includes ten different groups and a total of around 360 k labelled tweets. The authors chose nine topic-related groups and sampled 182,671 tweets in total under those categories. This Tweets dataset has been utilized in several studies (Jiang et al. 2018; Wang et al. 2016; Zuo, Wu, et al. 2016a; Lin et al. 2020a; Indra and Pulungan 2019). Table 11 describes the details of this dataset.

5.3 Tweets2011 dataset

The Tweets2011 dataset is a standard short text collection published in the TREC 2011 microblog track, which includes approximately 16 million tweets sampled in the period from January 23rd to February 8th, 2011. Each tweet contains a timestamp and a user id. Most studies selected tweets from this dataset randomly for their experiments. For instance, Liang et al. (2018) chose 3200 tweets for their experiments before the pre-processing stage and 30,946 after pre-processing, whereas Cheng et al. (2014) randomly chose part of this dataset, with a total of 4,230,578 tweets after pre-processing. Moreover, Li et al. (2019b) obtained 5.42 million tweets after pre-processing. Table 11 presents the details of these variations of the dataset.

5.4 BaiduQA/Questions dataset

The BaiduQA or Questions dataset contains 648,514 questions gathered from a well-known Chinese Q&A website. Each question is labelled with one of 35 categories by its annotator. This dataset was prepared and utilized in (Cheng et al. 2014; Yan et al. 2013a; Li et al. 2021), where 189,080 questions were used for evaluating the proposed methods. In contrast, the dataset after pre-processing in other studies, such as (Li et al. 2016a; Zhu et al. 2019b), includes 179,042 questions. Another study, presented by Zuo et al. (2016a), used only 142,690 questions from the dataset for their experiments. The authors pre-processed the dataset, e.g., Chinese word segmentation was employed, and duplicate words were deleted. The details of this dataset are presented in Table 11.

5.5 News dataset

This dataset is a collection of 29,200 English news articles gathered from the RSS feeds of three famous news websites: reuters.com, usatoday.com, and nyt.com. It consists of seven clusters/categories: business, health, sport, sci&tech, U.S., world, and entertainment. The description of each news article in the dataset is retained as a standard short text, whose content is one or two simple sentences from the article. This dataset has been used for many STTM models, such as (Jiang et al. 2018; Wang et al. 2016; Zuo, Wu, et al. 2016a; Lin et al. 2020a), whereas other models (Nguyen et al. 2015; Yi et al. 2020) used 32,600 English news articles; the details are introduced in Table 11.

5.6 Google news dataset

The Google news dataset is one of the labelled collections utilized for evaluating clustering performance. The news articles are classified automatically into topics/stories/clusters. The authors took a snapshot of Google News on November 27th, 2013, crawled the snippets and titles of 11,109 news articles as short text documents, and grouped them into 152 clusters. The Google News dataset is split into three sub-datasets: SnippetSet (SSet), TitleSnippetSet (TSSet), and TitleSet (TSet), as used in (Qiang et al. 2018b; Yin and Wang 2014; J. Chen et al. 2020a). The TitleSnippetSet includes both snippets and titles, whereas the TitleSet and SnippetSet datasets include only the titles and snippets, respectively.

5.7 Web snippets dataset

The Web Snippets dataset contains 12,340 web search snippets, each belonging to one of eight groups. It has been used in many studies (e.g., Zhang et al. 2018b; Li et al. 2019c) that use all of the web search snippets in their experiments, while other studies (Li et al. 2016a; Li et al. 2017; Huang et al. 2020) use a pre-processed version containing 12,265 snippets. Moreover, Wang et al. (2018) selected only 1707 snippets to evaluate their model. Table 11 shows the details of these datasets.

5.8 DBLP dataset

The DBLP dataset consists of 55,290 short texts (titles of conference articles) from six research areas: NLP, Information Retrieval (IR), DataBase (DB), Data Mining (DM), Computer Vision, and Machine Learning (ML). The vocabulary size of this dataset is 7,525, and each conference title (short text) is labelled with one of the research areas. Zuo, Wu, et al. (2016a) and Lin et al. (2020a) used this dataset to evaluate their proposed models. The details of the DBLP dataset are provided in Table 11. The remaining datasets, each used only once for assessing advanced STTM models, are presented with their details in Table 12.

6 Tools and open-source library for topic modelling

This section presents the available open-source libraries and software tools that can be used for both traditional and short text topic modelling.

6.1 Tools and open-source library of STTM

Several open-source packages are available specifically for short text topic modelling. Table 13 lists the available STTM tools; the most prominent and well-known ones are briefly described in this sub-section.

  • STTM is an open-source Java package developed by Qiang et al. (2020), which integrates most of the representative state-of-the-art (SoTA) short text topic models, such as LDA, DMM, LF-LDA, LF-DMM, GPU-PDMM, GPU-DMM, SATM, PTM, WNTM, and BTM. The STTM package is intended to facilitate both the extension of new models in this area and comparisons between new models and the SoTA models in the field.

  • jLDADMM is an open-source Java toolkit for the traditional topic models DMM (Nigam et al. 2000) and LDA (Blei et al. 2003), using collapsed Gibbs sampling for inference. The jLDADMM package was developed by Nguyen et al. (2018) to provide alternative ways of performing topic modelling over short or long texts.

  • LFTM is an open-source Java package suggested by Nguyen et al. (2015) to provide different ways for extracting latent topics over short or long texts. It includes two models, LF-DMM and LF-LDA.

  • BTM and OnlineBTM are open-source C++ packages developed by Cheng et al. (2014) that learn hidden topics from word–word co-occurrence patterns (biterms). OnlineBTM includes two models: OBTM and iBTM.

  • CRFTM is an open-source Java and Python tool developed by Gao et al. (2019) for extracting hidden topics from short text data. It alleviates the data sparsity issue by aggregating the short texts into pseudo-documents.

  • Palmetto is a publicly available open-source tool developed by Röder, Both, and Hinneburg (2015) for measuring topic quality via coherence computations over an external corpus. It implements six well-known topic coherence measures: UMass (Mimno et al. 2011), UCI (Newman et al. 2010), C_V and C_P (Röder et al. 2015), and C_A and NPMI (Aletras and Stevenson 2013). These coherence metrics are computed from word co-occurrences in the English Wikipedia corpus and have been shown to correlate with human ratings.

Table 13 Tools and Open-source library of STTM

6.2 Tools and open-source library for traditional topic modelling

This sub-section presents the software tools and open-source libraries that can be used for traditional topic modelling. Table 14 shows the available tools.

  • MALLET is a Java-based package released in 2002 by McCallum (2002). It is used for document clustering, classification, information extraction, topic modelling, statistical NLP, and other machine learning applications to text. The MALLET topic modelling toolkit contains various models for discovering latent topics in a dataset, including hierarchical LDA and the Pachinko Allocation Model (PAM).

  • Stanford TMT was developed by Ramage et al. (2009) within the Stanford NLP group and first released in September 2009. TMT contains several TM models, including PLDA, labelled LDA, and LDA (Blei et al. 2003).

  • Mr. LDA is a Java topic modelling package built on the MapReduce framework, developed by Zhai et al. (2012). Mr. LDA uses variational Bayesian inference and offers two advantages over Gibbs-sampling-based models: (1) topic discovery can be guided by informed priors, and (2) latent topics can be discovered from multilingual datasets.

  • JGibbLDA is an LDA implementation developed in Java by Phan and Nguyen (2006) that uses Gibbs sampling for inference and parameter estimation. It is useful in many areas, such as information retrieval (latent topic inference and semantic analysis), text summarization, text clustering, text classification, and text data mining in general.

  • GibbsLDA++ is a free C/C++ implementation of LDA by Phan and Nguyen (2007) that uses Gibbs sampling for inference and parameter estimation. It is a fast implementation, well-suited to analyzing and extracting hidden topic structures from large text datasets.

  • Gensim is an open-source topic modelling toolkit written in Python, developed by Řehůřek and Sojka (2011), that can process large-scale unstructured text to discover latent topics efficiently. Gensim offers many models, such as LSI/LSA, LDA, SVD, hierarchical Dirichlet processes (HDP), and TF-IDF, and is faster and more scalable than the MALLET topic modelling tool (a brief usage sketch is given after this list).

  • TopicXP is an open-source Eclipse plugin written in Java, developed by Savage et al. (2010), that applies LDA to the natural language found in source-code comments and identifiers to extract latent topics and visualize them for users.

  • Keyphrase Extraction Algorithm (KEA) is an open-source tool written in Java, developed by Medelyan et al., that extracts keyphrases from the full text of documents. KEA supports both free and controlled-vocabulary indexing in a supervised manner.

  • Yahoo_LDA is an open-source implementation written in C++ and developed by Ahmed et al. (2012) for their proposed framework architecture. The source code can be found at https://github.com/shravanmn/Yahoo_LDA.

  • The R language has several libraries and packages for effective topic modelling. For example, LSAfun is a standard R package consisting of a set of functions developed by Günther et al. (2014).

  • The Structural Topic Model (STM) is an R package developed by Roberts et al. (2019) for the structural topic model. STM provides several features, such as uncertainty assessment, tools for exploring topics, and visualization of the discovered topics.
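As an illustration of how such toolkits are typically used, the following is a minimal sketch of topic discovery with Gensim's LDA implementation; the toy corpus and parameter values are illustrative only and do not come from the surveyed studies.

```python
# Minimal Gensim LDA sketch; the toy corpus and parameters are illustrative only.
from gensim import corpora
from gensim.models import LdaModel

texts = [["virus", "outbreak", "vaccine"],
         ["match", "goal", "league"],
         ["vaccine", "virus", "trial"]]

dictionary = corpora.Dictionary(texts)            # map tokens to integer ids
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words representation

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=1)
for topic_id, words in lda.print_topics(num_words=3):   # top words per topic
    print(topic_id, words)
```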

Table 14 Tools and open-source library for traditional Topic modelling

7 Quantitative analysis of the literature

This section quantitatively analyzes the literature on STTM models. It shows the percentage of publications by main category, publication year, sub-category, platform, and evaluation metric. Moreover, it quantitatively analyses the existing datasets by utility, language, and source, answering the following research questions:

  • How many recent research articles were yearly published in each STTM category?

  • Which categories of the STTM models are studied the most and the least?

  • What does the distribution of papers look like?

  • What are the prominent evaluation metrics in the literature on STTM that are used the most and the least?

  • What are the most used existing datasets by STTM?

  • What are the prominent sources of existing datasets?

  • What are the languages of existing datasets?

  • What is the implementation platform used by the existing STTM models?

To answer the aforesaid research questions, the reviewed research articles in this taxonomy were quantitatively analyzed by main category. Each main category was then analyzed by publication year, sub-category, the programming language/platform used by current models, and evaluation metrics. Finally, the existing datasets used for evaluating STTM models were quantitatively analyzed by language, source, and name.

7.1 STTM papers based on (TSTTM and ASTTM) main categories

The total number of surveyed articles in this taxonomy, including survey papers, is 231: 51.52% (119) of the research articles belong to the TSTTM main category, 41.13% (95) to the ASTTM main category, and the remaining 7.36% (17) are STTM surveys, as depicted in Fig. 7.

Fig. 7
figure 7

Distribution of TSTTM, ASTTM and surveys publications

7.2 STTM publications based on Publication year

This sub-section presents the quantitative analysis of both TSTTM and ASTTM publications based on the publication year.

7.2.1 TSTTM publications

This paper has surveyed 119 published research articles (51.52%) related to TSTTM, published from ≤ 2011 to 2022. This part of the taxonomy is divided into twelve classes, one per publication year up to 2022, where the first class (≤ 2011) covers publications earlier than 2012. Figure 8a illustrates the distribution of TSTTM publications by year: 12.61% of the TSTTM papers were published before 2012. In recent years, researchers have clearly given greater attention to STTM for platforms such as Twitter; 24.37% of the papers were published in 2019, the most productive year for TSTTM publications, accounting for roughly one-quarter of the total. The percentage of published papers increased continuously in the years leading up to 2019, at rates of 4.2%, 5.04%, 10.08%, 11.76%, and 24.37%, respectively. The rate for 2020 was 11.76%, decreasing to 4.2% in 2021 and 0.84% in 2022; the 2022 figure is not final, since the year was not yet complete at the time of writing.

Fig. 8
figure 8

Papers distribution based on year of publication in STTM

7.2.2 ASTTM publications

The taxonomy surveyed 95 published articles (41.13%) related to ASTTM, covering publications from 2014 to 2022, which are grouped into classes by publication year. As shown in Fig. 8b, 22.34% of the papers were published in 2020, the highest rate among all ASTTM classes. The number of published papers increased continuously from 2014 to 2020, except for a noticeable drop in 2017. The 2021 class accounts for 13.83% of the publications, which is a reasonable share, while 2022 accounts for 1.06%; again, this figure is not final because the year was not yet complete.

From Fig. 8a and b, we conclude that 2019 was the most productive year for TSTTM publications, whereas 2020 was the most productive year for ASTTM publications.

7.3 STTM publications based on sub-categories

This section presents the quantitative analysis of both TSTTM and ASTTM publications based on their sub-categories.

7.3.1 TSTTM publications

This part of the taxonomy categorizes the TSTTM models into eleven sub-categories: probabilistic models, matrix factorization, unsupervised, supervised, dynamic-based, exemplar-based, data-source-based, word types, application-based, Frequent Pattern Mining (FPM), and hybrid. As shown in Fig. 9a, the most prominent sub-category is probabilistic models: over one-third of the research papers (33.65%) fall into it, the highest rate among all sub-categories. The second and third most prominent groups are clustering-based and label-based models, respectively; each covers over one-tenth of the papers.

Fig. 9
figure 9

Papers distribution over selected categories

7.3.2 ASTTM publications

The second part of this taxonomy categorizes the ASTTM models into five sub-categories: DMM, global word co-occurrence, self-aggregation, deep learning-based models, and other ASTTM models. We can observe from Fig. 9b that over one-quarter (26.6%) of the papers concern global word co-occurrence based models, and over one-third (34.04%) concern deep learning-based models, to which most researchers have recently turned their attention. DMM-based models account for 17.02% of the published papers, making them the third most prominent sub-category of ASTTM publications. From Figs. 9a and b, we conclude that probabilistic models are the most published sub-category among TSTTM publications, whereas deep learning-based models are the most productive sub-category among ASTTM publications.

7.4 STTM publications based on both Publication years and sub-categories

This section quantitatively analyzes the STTM publications by both publication year and sub-category. Figure 10 shows the number of papers in each sub-category of TSTTM models over the publication years. In recent years, probabilistic, clustering-based, and supervised models have remained of interest to researchers, whereas exemplar-based, matrix factorization, and frequent pattern mining models have received less attention than before. Figure 11 illustrates the number of papers in each sub-category of ASTTM models over the publication years: deep learning-based models, global word co-occurrence models, and DMM models have received more attention, in that order, than self-aggregation-based models in recent years.

Fig. 10
figure 10

Distribution of papers in categories of TSTTM models over publication years

Fig. 11
figure 11

Distribution of papers in categories of ASTTM models over publication years

7.5 STTM models publications based on platform

In this section, we analyze the implementation platform used by the existing STTM models.

7.5.1 TSTTM publications

This sub-section quantitatively analyzes the TSTTM publications based on the platform used. From Fig. 12a, most publications do not mention the platform: the Not Mentioned (NM) class accounts for 48.21%. Among the mentioned platforms, Java has the highest rate at 17.88%, followed by Matlab at 14.29% and Python at 10.71%, whereas the R language has the lowest rate at 3.57%.

Fig. 12
figure 12

Distribution of platforms utilized for STTM models

7.5.2 ASTTM publications

In this sub-section, we quantitatively analyze the ASTTM publications based on the platform used. From Fig. 12b, Python accounts for 35.14% and Java for 17.75%, the C++ language for 5.41%, whereas Spark was used in only one paper (1.35%). From Fig. 12a and b, we conclude that Spark has received little attention from researchers; this gap could be bridged in future work, since Spark processes data in parallel and can reduce computational time.

7.6 Quantitative analysis of the existing datasets

This sub-section provides a quantitative analysis of the existing datasets utilized in the current STTM models based on their utility, language, and source.

7.6.1 Quantitative analysis of the existing datasets based on the utility

This section quantitatively analyzes the prominent datasets used to evaluate the existing STTM models. The total number of existing and collected datasets is 51; 20 of them were collected from the Twitter platform, whereas six were collected from the Chinese Sogou Labs and used only in specific studies, as introduced in Table 12. The remaining 25 are prominent, publicly available datasets, and we focus here on these 25 datasets, which are used across 114 papers. As shown in Fig. 13, the Web Snippets dataset has the highest usage rate at 10.53%, whereas the YouTube, Viber, Swiftkey, NOAA Radar, and Wikipedia datasets have the lowest rate at 1.7% each. The second most used datasets are BaiduQA/Questions and Sina Weibo, each at 8.77%. If we evaluate the prominent datasets by source, Twitter has the highest rate at 20.05% (the sum of the four prominent Twitter datasets). Finally, we observe that real-world (streaming) social media data has received less attention; it would therefore be beneficial to evaluate STTM models on real-world datasets to extract trending topics and discover emerging latent topics of discussion from the constant background chatter of social media.

Fig. 13
figure 13

Distribution of prominent datasets based on the utility of STTM

7.6.2 Quantitative analysis of the existing datasets based on language

This section quantitatively analyzes the languages of the datasets used in both TSTTM and ASTTM models. Six languages are represented in the existing datasets used to evaluate state-of-the-art STTM models: English (EN), Chinese (CHI), Turkish (TUR), Japanese (JPN), German (GER), and Portuguese (POR).

Dataset languages used in TSTTM models As shown in Fig. 14a, English datasets account for 77.94%, the highest rate, and Chinese datasets for 13.24%, whereas German accounts for 1.47%, the lowest rate.

Fig. 14
figure 14

Distribution of dataset languages on STTM models

Dataset languages used in ASTTM models From Fig. 14b, Portuguese (POR) accounts for 1.01% and Chinese (CHI) for 26.04%, whereas English (EN) accounts for 72.92%, making it the most used dataset language in ASTTM models. From Fig. 14a and b, we conclude that English datasets have received the most attention from researchers, while other languages have been largely neglected in existing works; researchers can bridge this gap in future work.

7.6.3 Quantitative analysis of the existing datasets based on sources

This section presents the quantitative analysis of the sources of the datasets used in both TSTTM and ASTTM models. The sources used to gather datasets for evaluating state-of-the-art STTM models are Twitter, Weibo, Sogou Labs, Yahoo, websites, Amazon, news, and others. As shown in Fig. 15, Twitter accounts for 30.77% of the datasets, the highest rate of all sources, and Weibo for 12.09%, whereas Amazon, at 3.3%, is the least used source. We conclude that the Twitter platform has received the most attention from researchers, while Facebook, Instagram, TikTok, and WhatsApp have not been used to gather datasets in existing works; researchers can bridge this gap in future work.

Fig. 15
figure 15

Distribution of the existing datasets utilized based on Sources

7.7 Quantitative analysis of STTM evaluation metrics

This section quantitatively summarizes the different metrics used for evaluating the quality of the topics extracted by TSTTM and ASTTM models, as shown in Tables 15 and 16, respectively. Extracted topics must be evaluated with appropriate performance metrics to measure their quality. The main evaluation metrics considered in the literature are: topic coherence (Fang et al. 2016a; Mimno et al. 2011), perplexity, Point-wise Mutual Information (PMI) (Newman et al. 2010), Normalized PMI (NPMI), purity (Zhao and Karypis 2001), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI) (Yan, Guo, Liu, et al. 2013), topic recall, accuracy, precision, recall, and F-scores. Other metrics are used only once or twice across the studies, such as the Rand Index (RI), Mean Average Precision (MAP), Topic Mixing Degree (MD), Estimate Error (ERR), Topic Effectiveness (T.E), KL Distance (KLD), Matthews Correlation Coefficient (MCC), Calinski–Harabasz index (CH), log-likelihood (Log-LH), Classification Error Rate (C.E.R), Mean Absolute Error (MAE), Topic Relevance (Topic Rev), Hellinger Distance (HD), Averaged Pearson's correlation coefficient (AP), Discounted Cumulative Gain (DCG), Normalized Discounted Cumulative Gain (NDCG), Similarity (Murshed et al. 2020), Root Mean Squared Error (RMSE), Mean Reciprocal Rank (MRR), Cohen's kappa score (Kappa), Homogeneity (H), Completeness (C) (Rosenberg and Hirschberg 2007), and a Word Embedding-based topic coherence measure (WE-based metric similarity) (Fang et al. 2016b), as shown in Tables 15 and 16.

Table 15 Analysis of the evaluation metrics utilized in prominent TSTTM models
Table 16 Comparative analysis of the evaluation metrics utilized in ASTTM models

This sub-section presents the quantitative analysis of the evaluation metrics utilized in both TSTTM and ASTTM models.

Evaluation metrics used in TSTTM models In terms of coherence, Fig. 16a shows that perplexity has been used more than the other coherence-related metrics, at a 9.68% rate, whereas topic coherence and PMI/NPMI account for 6.45% and 4.84%, respectively. Perplexity, however, is less effective at capturing the semantic quality of the learned topics, which is why researchers such as Fang et al. (2016a) and Mimno et al. (2011) proposed topic coherence metrics to address this issue. Accuracy, at 19.35%, is the most used evaluation metric in TSTTM models, whereas ARI, at 0.81%, is the least used metric in this category.

Fig. 16
figure 16

Distribution of using evaluation metrics in both TSTTM and ASTTM models

Evaluation metrics used in ASTTM models In this part of the taxonomy, researchers pay more attention to topic coherence for evaluating the quality of the extracted topics, so topic coherence has the highest rate of use at 17.9%. PMI/NPMI, at 12.35%, is the second most used coherence-related metric, whereas accuracy, F-measure, and precision account for 14.81%, 11.73%, and 10.49%, respectively. On the other hand, AMI and ARI have the lowest rates. Figure 16b depicts the percentage of use of the evaluation metrics in ASTTM models. There is also a topic coherence metric based on word embeddings (WE-based metric similarity), proposed by Fang et al. (2016b) and used by Huang et al. (2020). In summary, the question of which assessment metrics are most beneficial for topic discovery models remains unresolved (Qiang et al. 2020); topic coherence, for instance, cannot differentiate between topics, and only one category of topic models has adopted the more recent metrics. Developing new evaluation criteria that match how the models are actually used is a direction for future topic modelling research.

8 Experimental analysis

This section presents the evaluation of prominent topic models for extracting hidden topics from short texts, performed over two datasets crawled from the Twitter social media platform: the Real-World Pandemic Twitter dataset (RW-Pand-Twitter) and the Real-World Cyberbullying Twitter dataset (RW-CB-Twitter). Ten main models from the literature are selected. The traditional models LDA (Blei et al. 2003) and NMF are chosen because of their extensive use for short text topic modelling, especially on Twitter data. An extension of LDA known as Twitter-LDA (Zhao et al. 2011) is chosen because it is one of the first topic models that addressed the sparsity problem with an application-specific approach. The CDTM (Wang et al. 2008) and FTM (Rashid et al. 2019b) models are also chosen. BTM (Cheng et al. 2014), PTM (Zuo, Wu, et al. 2016a), SATM (Quan et al. 2015), WNTM (Zuo, Zhao, et al. 2016b), and GLTM (Liang et al. 2018) are chosen because they represent common and effective short text topic models for discovering latent topics. Hence, the ten models LDA, NMF, Twitter-LDA, CDTM, FTM, BTM, PTM, SATM, WNTM, and GLTM were run over both real-world datasets, RW-Pand-Twitter and RW-CB-Twitter, to evaluate their performance. The implementations of the considered models are written in Python, and the experiments were conducted in the PyCharm IDE. The datasets consist of tweets collected on multiple topics of interest; such Twitter datasets can be collected through the Twitter streaming API using Python with the tweepy package. A pre-processing step is then applied to clean the data using toolkits such as the NLTK Python package (Anil Phand and Chakkarwar 2018), which provides stop-word and punctuation removal, tokenization, lemmatization, stemming, n-gram identification, lowercase transformation, and other data cleansing steps (Murshed et al. 2021); a minimal sketch of this step is given below. Finally, the TM methods are applied to extract the recurrent themes/topics explored throughout the collection of posts and the extent to which each post reflects those themes. We compare the performance of the considered topic discovery models in terms of purity, NMI, ARI, accuracy, and topic coherence with different numbers of topics \( k = \left\{ {\# ground\, truth\, topics, 20, 40, 60, \,and\, 80} \right\}\), determined according to the labelled datasets. The number of iterations is fixed at 1000 for all experiments, except when analyzing the influence of the number of iterations on the performance of the selected topic models, in which case we set \(Number\_of\_Iterations = \left\{ {5,{ }10,{ }20,40,{ }60,{ }80,{ }100, 200, and\, 400} \right\}\). All setup parameters were set according to the original papers of the selected models. The evaluation metrics were selected to reflect the topic discovery performance of each model, and the implemented algorithms were run 20 times for each evaluation metric to obtain an average value.
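As a concrete illustration of the pre-processing described above, the following is a minimal sketch using NLTK; the function name and the exact cleaning choices are assumptions for illustration and are not taken from the surveyed implementations.

```python
# Minimal tweet pre-processing sketch with NLTK; names and cleaning choices are illustrative.
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

STOPWORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_tweet(text: str) -> list[str]:
    """Lowercase, strip URLs/mentions/punctuation, tokenize, remove stop words, lemmatize."""
    text = text.lower()
    text = re.sub(r'https?://\S+|@\w+|#', ' ', text)              # URLs, mentions, hash signs
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    return [LEMMATIZER.lemmatize(tok) for tok in tokens
            if tok not in STOPWORDS and tok.isalpha() and len(tok) > 2]

print(preprocess_tweet("New #Ebola cases reported today: https://example.com @WHO"))
```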

8.1 Parameters settings

In this sub-section, we provide the settings of the common parameters used in the considered models. The number of topics for all experiments is set to \(k = \left\{ {\# ground\, truth\, topics, 20, 40, 60, and\, 80} \right\}\). The number of iterations is fixed at 1000 for all models, except that we set \(Number\_of\_Iterations = \left\{ {5, 10, 20, 40, 60, 80, 100, 200, and\, 400} \right\}\) only when checking the influence of the number of iterations on the performance of the ten considered topic models, as illustrated in Sect. 8.4.4. All other model parameters are chosen as recommended by the authors of the original papers. In GLTM, BTM, and SATM, the value of α is fixed at 50/K; we set α = 0.1 for both PTM and WNTM, and α = 0.05 for LDA, Twitter-LDA, and CDTM. We set β = 0.1 for SATM and WNTM, and β = 0.01 for the other models (BTM, GLTM, LDA, Twitter-LDA, CDTM, and PTM). For the hyper-parameter λ, we set λ = 0.5 for GLTM and λ = 0.1 for PTM. The number of pseudo-documents is fixed at 1000 for PTM and 300 for SATM, and the sliding window size is fixed at 10 in the WNTM model. The sketch below summarizes these settings.
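The following sketch merely collects the parameter settings listed above into a Python dictionary for readability; it is a convenience illustration, not code taken from the original implementations.

```python
# Common experimental settings gathered from Sect. 8.1 (illustrative summary only).
K_VALUES = ["#ground-truth topics", 20, 40, 60, 80]
N_ITERATIONS = 1000      # values 5..400 are used only for the iteration-sensitivity study

MODEL_PARAMS = {
    # alpha = 50/K for GLTM, BTM and SATM (set once K is known)
    "GLTM":        {"alpha": "50/K", "beta": 0.01, "lambda": 0.5},
    "BTM":         {"alpha": "50/K", "beta": 0.01},
    "SATM":        {"alpha": "50/K", "beta": 0.1,  "pseudo_docs": 300},
    "PTM":         {"alpha": 0.1,    "beta": 0.01, "lambda": 0.1, "pseudo_docs": 1000},
    "WNTM":        {"alpha": 0.1,    "beta": 0.1,  "window": 10},
    "LDA":         {"alpha": 0.05,   "beta": 0.01},
    "Twitter-LDA": {"alpha": 0.05,   "beta": 0.01},
    "CDTM":        {"alpha": 0.05,   "beta": 0.01},
}
```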

8.2 Datasets

For the experiments, we collected two real-world Twitter datasets: the Real-World Pandemic Twitter dataset (RW-Pand-Twitter) and the Real-World Cyberbullying Twitter dataset (RW-CB-Twitter). Both datasets are described below, and Table 17 shows their statistics.

Table 17 The statistics of the datasets

8.2.1 Real-world Pandemic Twitter (RW-Pand-Twitter) dataset

This dataset consists of tweets collected on multiple topics of interest. Tweets on seven major pandemic-related topics were extracted through the Twitter streaming API using Python with the tweepy package. The seven selected topics are cholera, coronavirus, dengue fever, malaria, chikungunya, Ebola virus disease, and swine flu. A total of 971,132 tweets were collected using the streaming API and then filtered to remove re-tweets, duplicates, and non-English tweets, leaving 330,159 tweets across the seven topics (a minimal sketch of this filtering is given below). From these, 42,000 tweets were selected and fed to the system to evaluate the described topic models.
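The following is a minimal, hypothetical sketch of the filtering step described above (keeping English tweets and dropping re-tweets and exact duplicates); the field names follow the standard Twitter API tweet object, and the helper name filter_tweets is illustrative.

```python
# Hypothetical filtering sketch: keep English tweets, drop re-tweets and exact duplicates.
def filter_tweets(raw_tweets: list[dict]) -> list[dict]:
    seen_texts = set()
    kept = []
    for tw in raw_tweets:
        if tw.get("lang") != "en":                 # discard non-English tweets
            continue
        if "retweeted_status" in tw or tw.get("text", "").startswith("RT @"):
            continue                               # discard re-tweets
        text = tw.get("text", "").strip()
        if text in seen_texts:                     # discard duplicate texts
            continue
        seen_texts.add(text)
        kept.append(tw)
    return kept
```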

8.2.2 Real-world Cyberbullying Twitter (RW-CB-Twitter) dataset

The RW-CB-Twitter dataset was collected from the Twitter social media platform via the Twitter streaming API using cyberbullying keywords such as threat, terrorist, attack, kill, hate, black, Islam, racism, Islamic, and ban, as provided in (Zhang et al. 2018d), together with other keywords such as whale, fuck, pussy, fucking, moron, ugly, LGBTQ, poser, idiot, bitch, whore, nigger, etc., as suggested in the psychology literature (Nand et al. 2016; Squicciarini et al. 2015; Cortis and Handschuh 2015; Cheng et al. 2019). The RW-CB-Twitter dataset extends the dataset used in (Murshed et al. 2022) and is labelled into five classes: sexism, racism, aggressive (harassment), insult, and not-bullying. The gathered dataset consists of 435,764 tweets and includes several outliers: only English tweets are needed, so tweets in other languages were filtered out, and re-tweets were deleted because they are not informative. Finally, after removing the irrelevant tweets, we selected 20,000 tweets from the remainder for the experiments in this research.

8.3 Evaluation metrics

This sub-section briefly describes the evaluation metrics used to assess the efficiency of the STTM models. The evaluation in this study was conducted in terms of (1) clustering measures, namely Normalized Mutual Information (NMI) (Singh and Singh 2020), purity (Zhao and Karypis 2001), and the Adjusted Rand Index (ARI); (2) classification accuracy; and (3) Topic Coherence (TC) (Röder et al. 2015) using PMI. These measures are briefly described below.

8.3.1 Clustering evaluations

Suppose \(D = \left\{ {d_{1} ,d_{2} , \ldots ,d_{k} } \right\}\) is the set of derived clusters, with the number of clusters denoted as \(k\) and each \(d_{k}\) denoting the set of tweets assigned to cluster \(k\), and \(S = \left\{ {s_{1} ,s_{2} , \ldots ,s_{m} } \right\}\) is the set of labelled (ground-truth) clusters already determined in the dataset, with the number of labelled clusters denoted as \(m\). We adopt three measures to assess the quality of the clusters in \(D\).

8.3.1.1 Purity metric

Purity (Zhao and Karypis 2001) measures how well the clustered tweets match the labelled dataset. It is computed as the fraction of correctly clustered tweets over the total number of labelled tweets in the dataset and is expressed as given in Eq. (7).

$$ Purity \left( {D,S} \right) = \frac{1}{N}\mathop \sum \limits_{k} \mathop {\max }\limits_{m} |d_{k} \cap s_{m} | $$
(7)

Here, \(N\) denotes the number of labelled tweets in the dataset. Purity values lie between 0 and 1: low clustering quality yields a purity close to 0, while perfect clustering yields a purity of 1.
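A minimal sketch of Eq. (7) is given below; it assumes two integer label arrays of equal length, and the function name purity is illustrative.

```python
# Purity of predicted clusters D against ground-truth labels S, per Eq. (7).
import numpy as np

def purity(pred_clusters: np.ndarray, true_labels: np.ndarray) -> float:
    N = len(true_labels)
    total = 0
    for c in np.unique(pred_clusters):
        members = true_labels[pred_clusters == c]
        total += np.bincount(members).max()   # |d_k ∩ s_m| maximized over classes m
    return total / N

# Perfect clustering up to a label permutation gives purity 1.0
print(purity(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))  # -> 1.0
```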

8.3.1.2 NMI metrics

NMI (Yan, Guo, Liu, et al. 2013) computes the mutual information shared between \(D\) and \(S,\) which is normalized by the mean entropy of clusters represented as \(H\left( D \right)\) and entropy of classes represented as \(H\left( S \right)\). Similar to the purity value, the NMI ranges from 0 to 1, and the larger values of NMI imply a higher level of accuracy in clustering. It can be computed as given in Eq. (8).

$$ NMI \left( {D,S} \right) = \frac{{I\left( {D,S} \right)}}{{\left[ {H\left( D \right) + H\left( S \right)} \right]/2}} $$
(8)

Here \(I\left( {D,S} \right)\) is the mutual information between D and S which can be statistically computed as

$$ I\left( {D,S} \right) = \mathop \sum \limits_{k} \mathop \sum \limits_{m} P(d_{k} \cap s_{m} )log \frac{{P\left( {d_{k} \cap s_{m} } \right)}}{{P\left( {d_{k} } \right)P\left( {s_{m} } \right)}} $$
(9)

Here, \(P\left( {d_{k} } \right)\) denotes the probability that a tweet belongs to cluster \(d_{k}\), \(P\left( {s_{m} } \right)\) the probability that a tweet belongs to class \(s_{m}\), and \(P\left( {d_{k} \cap s_{m} } \right)\) the probability that a tweet belongs to both \(d_{k}\) and \(s_{m}\). The mutual information can be rewritten in terms of the number of tweets \(N\) in the labelled dataset, as in Eq. (10).

$$ I\left( {D,S} \right) = \mathop \sum \limits_{k} \mathop \sum \limits_{m} \frac{{\left| {d_{k} \cap s_{m} } \right|}}{N}log \frac{{N\left| {d_{k} \cap s_{m} } \right|}}{{\left| {d_{k} } \right|\left| {s_{m} } \right|}} $$
(10)

Here, \(N\) denotes the number of tweets in \(S\); \(\left| {d_{k} } \right|\) and \(\left| {s_{m} } \right|\) denote the numbers of tweets in \(d_{k}\) and \(s_{m}\), respectively, and \(\left| {d_{k} \cap s_{m} } \right|\) the number of tweets occurring in both \(d_{k}\) and \(s_{m}\).

Similarly, the entropy of classes \(H\left( S \right)\) and entropy of clusters \(H\left( D \right)\) are computed as follows:

$$ H\left( S \right) = - \mathop \sum \limits_{m} P(s_{m} )log P\left( {s_{m} } \right) = - \mathop \sum \limits_{m} \frac{{\left| {s_{m} } \right|}}{N}log\frac{{\left| {s_{m} } \right|}}{N} $$
(11)
$$ H\left( D \right) = - \mathop \sum \limits_{k} P(d_{k} )log P\left( {d_{k} } \right) = - \mathop \sum \limits_{k} \frac{{\left| {d_{k} } \right|}}{N}log\frac{{\left| {d_{k} } \right|}}{N} $$
(12)
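For completeness, NMI can also be computed directly with scikit-learn, which applies the same arithmetic-mean normalization as Eq. (8); the label arrays below are illustrative.

```python
# NMI via scikit-learn, normalizing by the mean of cluster and class entropies (Eq. 8).
from sklearn.metrics import normalized_mutual_info_score

pred = [0, 0, 1, 1, 2, 2]   # derived clusters D (illustrative)
true = [0, 0, 1, 1, 1, 2]   # ground-truth classes S (illustrative)
nmi = normalized_mutual_info_score(true, pred, average_method='arithmetic')
print(f"NMI = {nmi:.3f}")
```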
8.3.1.3 ARI metric

ARI, the Adjusted Rand Index, measures the correctness of pair-wise clustering decisions: a decision on a pair of tweets is correct if the two tweets are placed in the same cluster and belong to the same class, or are placed in different clusters and belong to different classes. The Rand Index (RI) is the fraction of correct decisions, and ARI is the chance-corrected version of RI. The highest value of 1 denotes perfect clustering, i.e., full agreement between the labels and the clustering results. ARI is computed as expressed in Eq. (13).

$$ ARI\left( {D,S} \right) = \frac{{\mathop \sum \nolimits_{k,m} \binom{{\left| {d_{k} \cap s_{m} } \right|}}{2} - \left[ {\mathop \sum \nolimits_{k} \binom{{\left| {d_{k} } \right|}}{2}\mathop \sum \nolimits_{m} \binom{{\left| {s_{m} } \right|}}{2}} \right]/\binom{N}{2}}}{{\frac{1}{2}\left[ {\mathop \sum \nolimits_{k} \binom{{\left| {d_{k} } \right|}}{2} + \mathop \sum \nolimits_{m} \binom{{\left| {s_{m} } \right|}}{2}} \right] - \left[ {\mathop \sum \nolimits_{k} \binom{{\left| {d_{k} } \right|}}{2}\mathop \sum \nolimits_{m} \binom{{\left| {s_{m} } \right|}}{2}} \right]/\binom{N}{2}}} $$
(13)
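In practice, ARI can be computed directly from the two label assignments with scikit-learn, which implements the chance-corrected formula of Eq. (13); the label arrays below are illustrative.

```python
# Chance-corrected Rand Index (Eq. 13) via scikit-learn.
from sklearn.metrics import adjusted_rand_score

pred = [0, 0, 1, 1, 2, 2]   # derived clusters D (illustrative)
true = [0, 0, 1, 1, 1, 2]   # ground-truth classes S (illustrative)
print(f"ARI = {adjusted_rand_score(true, pred):.3f}")
```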

8.3.2 Topic coherence metric

Topic coherence is a measure for assessing the quality of topic models. For each of the \(K\) generated topics, topic coherence is applied to its top \(N\) words; in the experiments, we selected the \(N = 10\) top words with the highest probabilities \((w_{1} , \ldots , w_{N} )\), with co-occurrences counted over a sliding window. Topic coherence is computed using the PMI metric, following (Li et al. 2017; Murshed et al. 2020). It measures the semantic score of a single topic via the degree of semantic similarity between its high-scoring words and is computed as follows.

$$ Coherence\left( K \right) = \frac{2}{{N\left( {N - 1} \right)}}\mathop \sum \limits_{1 \le i < j \le N} PMI (w_{i} ,w_{j} ) $$
(14)

Here, \(w_{i}\) and \(w_{j}\) denote a pair of top words of the topic. The score is computed by means of PMI, the Point-wise Mutual Information between topic words:

$$ PMI\left( {w_{i} ,w_{j} } \right) = \log \frac{{P\left( {w_{i} ,w_{j} } \right)}}{{P\left( {w_{i} } \right)P\left( {w_{j} } \right)}} $$
(15)

Here, \(P\left( \cdot \right)\) denotes the probability of a topic word (or word pair) occurring in the reference corpus.
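The following is a minimal sketch of Eqs. (14)–(15), estimating the PMI-based coherence of one topic's top-N words from document-level co-occurrence counts; the function name, the toy corpus, and the smoothing constant are illustrative assumptions.

```python
# PMI-based topic coherence (Eqs. 14-15) estimated from document co-occurrence counts.
import math
from itertools import combinations

def topic_coherence(top_words: list[str], docs: list[set[str]], eps: float = 1e-12) -> float:
    D = len(docs)
    def p(*words):                      # estimate P(w) / P(wi, wj) by document frequency
        return sum(all(w in d for w in words) for d in docs) / D
    n = len(top_words)
    pmi_sum = 0.0
    for wi, wj in combinations(top_words, 2):            # all pairs with i < j
        pmi_sum += math.log((p(wi, wj) + eps) / (p(wi) * p(wj) + eps))
    return 2.0 * pmi_sum / (n * (n - 1))

docs = [{"virus", "outbreak", "vaccine"}, {"virus", "vaccine"}, {"weather", "rain"}]
print(topic_coherence(["virus", "vaccine"], docs))
```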

8.3.3 Classification accuracy metric

Topic modelling performance can also be evaluated through text classification. We use the topics extracted by the STTM models as tweet features: each tweet is represented by its document-topic distribution \( p\left( {z_{i} |d} \right)\) for all the considered models, and accuracy is used as the classification metric. Both the RW-CB-Twitter and RW-Pand-Twitter datasets were randomly split into training and test subsets, a Support Vector Machine (SVM) classifier was used for classification, and K-fold Cross-Validation (CV) with K = 5 was used to compute the classification accuracy.
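A minimal sketch of this classification-based evaluation is given below; the document-topic matrix and labels are random placeholders standing in for the fitted model's \(p(z|d)\) features and the labelled dataset.

```python
# Classification-based evaluation: p(z|d) features fed to an SVM with 5-fold CV.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
doc_topic_matrix = rng.dirichlet(np.ones(20), size=500)   # placeholder p(z|d) for 500 tweets
labels = rng.integers(0, 5, size=500)                     # placeholder class labels

scores = cross_val_score(SVC(kernel='linear'), doc_topic_matrix, labels,
                         cv=5, scoring='accuracy')
print(f"5-fold CV accuracy: {scores.mean():.3f}")
```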

8.4 Experimental results

This section describes the experimental results obtained from the ten selected state-of-the-art models, LDA, Twitter-LDA, NMF, FTM, CDTM, SATM, BTM, PTM, WNTM, and GLTM, over the two Twitter datasets: the Real-World Pandemic Twitter dataset (RW-Pand-Twitter) and the Real-World Cyberbullying Twitter dataset (RW-CB-Twitter). We compare the performance of the considered models in terms of purity, NMI, ARI, accuracy, and topic coherence with different numbers of topics \({\text{ k}} = \left\{ {\# {\text{ ground truth topics}},{ }20,{ }40,{ }60,{\text{ and }}80} \right\}\).

8.4.1 Classification accuracy findings

The classification accuracy is used to assess the performance of the considered STTM models, using the document-topic distributions \( p\left( {z_{i} |d} \right)\) of the extracted topics as tweet features, as described in Sect. 8.3.3. The classification accuracies over the RW-Pand-Twitter and RW-CB-Twitter datasets for the considered models LDA, Twitter-LDA, NMF, CDTM, FTM, SATM, BTM, PTM, WNTM, and GLTM are shown in Fig. 17a and b. First, on the pandemic (RW-Pand-Twitter) dataset, Fig. 17a shows that FTM has the highest accuracy of all models for every number of topics \( k = \left\{ {7, 20, 40, 60, and\, 80} \right\}\), while GLTM has the second-highest accuracy across all numbers of topics. BTM has the third-highest accuracy for \(k = \left\{ {7, 20, and\, 80} \right\}\), but for \(k = 40\) and \(k = 60\) PTM obtains higher accuracy than BTM. SATM achieves the poorest accuracy among the ASTTM models, and NMF the poorest among all the considered ASTTM and TSTTM models. On the real-world cyberbullying (RW-CB-Twitter) dataset, the WNTM model achieves the best accuracy among all models, followed by GLTM, as shown in Fig. 17b; FTM is the third-best model among the ASTTM and TSTTM models, and PTM attains better accuracy than BTM. In conclusion, the models based on global word co-occurrence, such as GLTM, WNTM, and BTM, perform better than the self-aggregation models SATM and PTM, as well as better than Twitter-LDA, LDA, and NMF, on the RW-CB-Twitter dataset, indicating their higher efficiency, whereas FTM is the best model among the ASTTM and TSTTM models on the RW-Pand-Twitter dataset, thanks to its term weighting and fuzzy clustering, which enhance the topic model. WNTM, GLTM, and BTM show their robustness on both datasets. We also note that the self-aggregation models, especially SATM, do not reach the accuracy of the global word co-occurrence models GLTM, WNTM, and BTM; the effectiveness of self-aggregation-based models is therefore limited by the way they create long pseudo-documents.

Fig. 17
figure 17

Performance evaluation in terms of accuracy on both RW-Pand-Twitter and RW-CB-Twitter datasets with \({\text{ k}} = \left\{ {5{\text{ or }}7,{ }20,{ }40,{ }60,{\text{ and }}80} \right\}\) topics

8.4.2 Short text clustering NMI and purity findings

This sub-section illustrates the obtained results of the considered models over both Twitter datasets in terms of purity and NMI metrics.

8.4.2.1 Purity Findings

Figure 18a and b show the experimental results of the considered TSTTM and ASTTM models in terms of purity for discovering latent topics on the two real-world Twitter datasets, RW-Pand-Twitter and RW-CB-Twitter. From Fig. 18a, the FTM model achieves better purity than the other models for every number of topics on RW-Pand-Twitter. GLTM is the second-best model among all TSTTM and ASTTM models, and the best of the global word co-occurrence based models (ahead of WNTM and BTM) and the self-aggregation based models. PTM performs better than BTM for \( k = \left\{ {40, 60, and\, 80} \right\}\) topics, while BTM performs better than PTM for \( k = \left\{ {7\, and\, 20} \right\}\); NMF has the lowest purity of all models for all values of \(k\) on RW-Pand-Twitter, as displayed in Fig. 18a. On the cyberbullying Twitter dataset (RW-CB-Twitter), WNTM achieves the best purity among all the considered models, with FTM and GLTM achieving the second- and third-best purity for all values of \(k\), and PTM achieving better purity than BTM and the traditional models LDA, Twitter-LDA, CDTM, and NMF. As shown in Fig. 18b, the WNTM, GLTM, FTM, and PTM models achieve clearly higher purity than the traditional NMF, CDTM, LDA, and Twitter-LDA models, indicating their higher efficiency. Overall, FTM obtains the best purity on the RW-Pand-Twitter dataset, consistent with the classification accuracy findings, positioning it as the most prominent TSTTM among the considered models across the various values of \(k\), while WNTM achieves the best clustering (purity) performance on the RW-CB-Twitter dataset, followed by GLTM and FTM. In general, the self-aggregation models, especially SATM, achieve lower purity than the global word co-occurrence models, so the effectiveness of self-aggregation-based models is limited by the way they create long pseudo-documents, whereas the global word co-occurrence based models WNTM and GLTM show their robustness on both datasets.

Fig. 18
figure 18

Performance evaluation in terms of purity on both RW-Pand-Twitter and RW-CB-Twitter datasets with \({\text{ k}} = \left\{ {5{\text{ or }}7,{ }20,{ }40,{ }60,{\text{ and }}\,80} \right\}\) topics

8.4.2.2 NMI findings

The performance of the considered models is also assessed using the NMI measure. The NMI results of the selected models on the two real-world Twitter datasets, RW-Pand-Twitter and RW-CB-Twitter, are shown in Fig. 19a and b. First, the FTM model achieves better NMI values than the other models for all numbers of topics \(k = \left\{ {7, 20, 40, 60, and\, 80} \right\}\) on the RW-Pand-Twitter dataset, with GLTM and WNTM achieving the second- and third-best NMI among the topic models on this dataset. PTM performs better than BTM for all values of \(k\) except \(k = \left\{ {7\, and\, 20} \right\}\), where BTM is better, as depicted in Fig. 19a. Second, on the RW-CB-Twitter dataset, Fig. 19b shows that WNTM obtains the best NMI values compared with the other considered methods (GLTM, FTM, BTM, PTM, SATM, CDTM, Twitter-LDA, LDA, and NMF), and GLTM obtains the second-best results; here, PTM outperforms BTM for all numbers of topics \(k = \left\{ {7, 20, 40, 60, and\, 80} \right\}\). Overall, FTM outperforms all the considered ASTTM and TSTTM models on the RW-Pand-Twitter dataset, and WNTM achieves superior results compared with all the other considered models on the RW-CB-Twitter dataset. In contrast, SATM achieves poor NMI values compared with the other ASTTM models, and NMF has the lowest NMI among both the TSTTM and ASTTM models for all values of \(k\) on both datasets, as depicted in Fig. 19a and b.

Fig. 19
figure 19

Performance evaluation in terms of NMI on both RW-Pand-Twitter and RW-CB-Twitter datasets with \({\text{ k}} = \left\{ {5{\text{ or }}7,{ }20,{ }40,{ }60,{\text{ and }}80} \right\}\) topics

8.4.3 Topic coherence findings

This section compares the quality of the latent topics discovered by the considered models using topic coherence. The number of topmost words is fixed to \({\text{N}} = 10\), and the number of topics for all the selected STTM models is set to \({\text{k}} = \left\{ {\# {\text{ground truth topics}},20,{ }40,{ }60{\text{ and }}80} \right\}\). Figure 20a and b show the topic coherence of the selected TSTTM and ASTTM models on the RW-Pand-Twitter and RW-CB-Twitter datasets for different numbers of topics. First, on the RW-Pand-Twitter dataset, FTM achieves the best topic coherence for all numbers of topics \({\text{k}} = \left\{ {7,{ }20,{ }40,{ }60,{\text{ and }}80} \right\}\), consistent with the purity, NMI, and accuracy results. GLTM is the second-best model, superior to WNTM, PTM, BTM, LDA, Twitter-LDA, CDTM, and NMF, and WNTM is the third-best. BTM outperforms PTM in terms of topic coherence for all values of k except \({\text{k}} = \left\{ {7{\text{ and }}20} \right\}\), where PTM achieves better results. In contrast, NMF has the worst coherence for all values of \({\text{k}}\) compared with the other STTM models, as depicted in Fig. 20a. Second, on the real-world cyberbullying Twitter (RW-CB-Twitter) dataset, WNTM achieves the best topic coherence for all numbers of topics, with FTM and GLTM the second- and third-best models, respectively, as shown in Fig. 20b, and PTM obtaining better topic coherence than BTM. These results suggest that FTM (a TSTTM) performs significantly better than all the other TSTTM and ASTTM models on the RW-Pand-Twitter dataset, that the modern ASTTM models (WNTM, GLTM, PTM, BTM, and SATM) perform significantly better than the other TSTTM models (Twitter-LDA, LDA, CDTM, and NMF), and that WNTM performs significantly better than all TSTTM and ASTTM models on the RW-CB-Twitter dataset.

Fig. 20
figure 20

Performance evaluation in terms of topic coherence on both RW-Pand-Twitter and RW-CB-Twitter datasets with \({\text{ k}} = \left\{ {5{\text{ or }}7,{ }20,{ }40,{ }60,{\text{ and }}80} \right\}\) topics

8.4.4 The number of iterations influence on the performance of topic models

In this sub-section, we select the real-world pandemic Twitter (RW-Pand-Twitter) dataset and discuss the effect of the number of iterations on the performance of the selected STTM models (LDA, Twitter-LDA, NMF, FTM, CDTM, BTM, PTM, SATM, WNTM, and GLTM) using the purity, NMI, ARI, and accuracy metrics. Figure 21a–d shows the effect of the iteration count on the performance of the topic models in terms of purity, NMI, ARI, and accuracy, respectively. From the figures, we observe that with 5 iterations the values of all metrics are very low, and that they increase gradually at 20, 40, 60, 80, and more iterations; in other words, as the number of iterations increases, the performance of the topic models improves and eventually reaches convergence, as shown in Fig. 21. On the RW-Pand-Twitter dataset, FTM attains the maximum values of purity, NMI, ARI, and accuracy of all models. The FTM, GLTM, and WNTM models are superior to all the other models, securing the first-, second-, and third-best values of every metric, respectively, followed by BTM, PTM, SATM, CDTM, Twitter-LDA, LDA, and NMF. We also note that the global word co-occurrence based models become stable and converge by 400 iterations, approaching their optimal solutions, whereas the self-aggregation models have the slowest convergence rate and the poorest iterative performance in comparison.

Fig. 21
figure 21

Purity, NMI, ARI, and Accuracy values with various numbers of iterations on the RW-Pand-Twitter dataset

In conclusion, the TSTTM-based FTM model is the superior model among all TSTTM and ASTTM models on the real-world pandemic Twitter (RW-Pand-Twitter) dataset in terms of all metrics (accuracy, NMI, purity, and topic coherence), because FTM provides fuzzy clustering and term weighting, which enhance the topic model. The GLTM, WNTM, and BTM models, which rely on simple assumptions, are superior to the self-aggregation based SATM, while SATM consistently outperforms NMF, CDTM, LDA, and its Twitter-specific extension (Twitter-LDA); NMF achieves poor performance on all metrics compared with the other models. On the RW-CB-Twitter dataset, WNTM outperforms all the TSTTM and ASTTM models, with GLTM and FTM the second- and third-best models. On this dataset, the global word co-occurrence models WNTM and GLTM achieve high accuracy, NMI, purity, and topic coherence compared with the other models, PTM outperforms BTM on all metrics, SATM performs poorly compared with the other ASTTM models, and NMF again achieves the poorest performance among all ASTTM and TSTTM models. The global word co-occurrence models show their robustness on both datasets and indicate their higher efficiency, whereas the self-aggregation models, especially SATM, achieve lower purity, NMI, accuracy, and topic coherence; the effectiveness of self-aggregation-based models is therefore limited by the way they create long pseudo-documents. Moreover, the structures of SATM and PTM are very complex, with numerous hidden variables that must be sampled, resulting in significant time consumption.

9 Discussions

This section discusses the overall observations noted from the qualitative and quantitative analysis, as well as the comparative analysis.

Qualitative analysis The qualitative analysis highlighted some observations related to the existing TSTTM and ASTTM models. Regarding the existing TSTTM models, Tables 2, 3, and 4 show that some works attempted to increase accuracy (Wang et al. 2012; Fang et al. 2017; Kumar and Vardhan 2019; Sharath et al. 2019; Valdez et al. 2018; Belford et al. 2016; Yan et al. 2012; Muliawati and Murfi 2017; Capdevila et al. 2017; Li et al. 2015), to enhance coherence (Zhao et al. 2011; Zheng et al. 2019; Han et al. 2020; Kim et al. 2020; Farahat et al. 2015; Lacoste-Julien et al. 2009; He et al. 2019b), to alleviate the data sparsity problem (Ozyurt and Akcayol 2021; Akhtar et al. 2019b; Iskandar 2017; Pang et al. 2019), or to extract topics from unlabelled data (Korshunova et al. 2019; Ozyurt and Akcayol 2021). Other works addressed the lack of semantic information and local word co-occurrence (Akhtar et al. 2019a; Chen and Kao 2017; Chen et al. 2020b). Although the existing works address some of the above challenges, issues remain in this sub-category, such as time complexity and sparsity, and some models are incapable of handling large-scale short text data.

On the other hand, Tables 7, 8, 9, and 10 show that some ASTTM models are more efficient than TSTTM models, although the reverse also occurs: as concluded from the experimental analysis, the FTM model from the TSTTM category is superior to BTM, PTM, GLTM, and WNTM on the RW-Pand-Twitter dataset. The ASTTM models attempt to resolve common short text issues such as improving classification accuracy and increasing topic coherence. Accuracy can be improved by estimating mixture components via EM, as in the DMM category; this strategy assigns each document (short text) to a single topic and is well suited to short text (Yin and Wang 2014; Yu and Qiu 2019; Li et al. 2019c; Liu et al. 2020b; Garcia and Berton 2021). Moreover, because short texts do not contain enough word co-occurrence information on their own, some models such as BTM (Cheng et al. 2014; Pang et al. 2016; Huang et al. 2020) exploit the rich global word co-occurrence patterns to infer hidden topics, while other models alleviate the sparsity of short texts by aggregating them into long pseudo-documents before discovering latent topics, e.g., PTM (Zuo, Wu, et al. 2016a), SATM (Quan et al. 2015), WE-PTM (Zuo et al. 2021), and SenU-PTM (Lu et al. 2021). It can be concluded that short text topic modelling still suffers from data sparsity, time complexity, topic visualization, and data quality (noise) problems in social media. Since tweets are informal in nature, they contain slang, typos, elongated (repeated) characters, transpositions, concatenated words, and complex spelling mistakes such as unorthodox acronyms. Researchers are therefore advised to address these issues in future work.

Quantitative Analysis The quantitative analysis highlighted some observations regarding the publications related to STTM. Based on the timeline, the highest rate of TSTTM publications was in 2019, whereas for ASTTM, 2020 was the most productive year. Probabilistic TSTTM models scored the highest publication rate, whereas Deep Learning Topic Modelling (DLTM) based models were the most productive sub-category among the ASTTM categories, followed by global word co-occurrence models; in contrast, self-aggregation based models have received less attention from researchers. Regarding the social media data sources used by STTM, Twitter has received the most attention, whereas other sources such as Facebook, Instagram, TikTok, and WhatsApp have not; this indicates an opportunity for research on those platforms. To the best of our knowledge, these platforms have received less attention because of the limited access to their data, while Twitter facilitates data access through its streaming API. Researchers can therefore create new datasets from these platforms and make them publicly available to draw attention to them. Furthermore, real-world (streaming) social media data has been given less attention, so researchers are advised to evaluate STTM models on real-world datasets to extract trending topics and discover emerging latent topics of discussion from the constant background chatter of social media. Finally, we observed from Fig. 12a and b that Spark has received little attention; researchers can consider Spark streaming in further work, as it can process huge datasets in parallel with less computational time. Figure 14a and b also highlight that the English language has received more attention than other languages such as Chinese, Arabic, and Hindi, so researchers are advised to work on such languages as well.

Comparative Analysis The comparative analysis highlighted some observations from the experimental results obtained for the selected TSTTM and ASTTM models. This sub-section briefly discusses these observations as follows:

Topic coherence, purity, NMI, and accuracy evaluation metrics: The comparative analysis shows that the FTM model under TSTTM is superior to all TSTTM and ASTTM models on the Real-world pandemic Twitter (RW-Pand-Twitter) dataset in terms of accuracy, NMI, purity, and topic coherence, because FTM combines fuzzy clustering with term weighting, which enhances the topic model and indicates its high efficiency. The GLTM, WNTM, and BTM models, which rest on simple assumptions, are superior to the self-aggregation-based SATM. SATM consistently outperforms NMF, CDTM, LDA, and the LDA variant Twitter-LDA, while the NMF model performs worst of all models on every metric. On the RW-CB-Twitter dataset, WNTM outperforms all TSTTM and ASTTM models, with GLTM and FTM as the second- and third-best models. On this dataset, the global word co-occurrence models such as WNTM and GLTM attain high accuracy, NMI, purity, and topic coherence compared with the other models. PTM outperforms BTM on all metrics over the RW-CB-Twitter dataset, SATM scores poorly compared with the other ASTTM models, and NMF again performs worst among all ASTTM and TSTTM models. The global word co-occurrence-based models show their robustness on both datasets, indicating their higher efficiency. The self-aggregation models, especially SATM, yield lower purity, NMI, accuracy, and topic coherence than the global word co-occurrence models; the effectiveness of self-aggregation-based models is therefore limited by how the long pseudo-documents are created. Moreover, the structures of SATM and PTM are very complex, and the numerous hidden variables that need to be sampled result in significant time consumption.

Number of iterations: It was noted that when the number of iterations is 5, the values of all metrics are considerably lower, and they increase gradually as the number of iterations grows to 20, 40, 80, 120, and so on; overall, the performance of the topic models improves as the number of iterations increases, as shown in Fig. 21. For purity, NMI, ARI, and accuracy, FTM attains the maximum values among all models at 200 iterations. The FTM model is superior to all other models on all metrics over the RW-Pand-Twitter dataset, whereas GLTM and WNTM achieve the second- and third-best performance. On the RW-CB-Twitter dataset, the WNTM model performs best on the majority of the considered evaluation metrics. The BTM and PTM models show intermediate values on all metrics over both datasets, followed by SATM, CDTM, Twitter-LDA, LDA, and NMF.

10 Challenges and future directions

From the above detailed discussion, it can be observed that issues such as data sparsity, data noise, evaluation metrics, visualization, and deep learning have received less attention from the research community. This section briefly highlights these challenges and open research issues in STTM that can guide further research on STTM.

Data sparsity LDA, NMF, PLSA, and LSA are some of the key unsupervised techniques that extract topics from a collection of documents using only their content. Many extensions have been developed to support short texts, but many issues remain unaddressed. Models that rely entirely on short-text or tweet content still suffer from the data sparsity problem (Likhitha et al. 2019). The density of the word co-occurrence matrix over a collection of tweets can be as low as 0.274% on average (Nugroho et al. 2017), which results in few overlapping terms with weak semantic relationships.
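To illustrate the scale of this sparsity, the small Python sketch below (the function name and toy tweets are ours) estimates the fraction of non-zero entries in the word co-occurrence matrix of a tweet collection; on real corpora this density can drop to fractions of a percent, as reported above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def cooccurrence_density(tweets):
    """Rough sketch: fraction of non-zero entries in the word co-occurrence
    matrix of a tweet collection, illustrating how sparse short texts are."""
    bow = CountVectorizer(binary=True).fit_transform(tweets)   # docs x vocab
    cooc = (bow.T @ bow).toarray()                             # vocab x vocab counts
    np.fill_diagonal(cooc, 0)                                  # ignore self co-occurrence
    return np.count_nonzero(cooc) / cooc.size

print(cooccurrence_density([
    "flood warning issued for the river",
    "river levels rising fast",
    "new phone launch event today",
]))
```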

Data quality problem Augmenting short texts in TSTTM and ASTTM models with auxiliary content, mostly drawn from external sources, might seem effective. However, it often introduces noise and can lead to the selection of unrelated terms. Since tweets are informal in nature, slang, typos, elongated words (repeated characters), transpositions, concatenated words, complex spelling mistakes such as the unorthodox use of acronyms, word boundary errors, manifold abbreviations of the same words, and poor grammar make both learning and matching tweet content against external resources challenging. Furthermore, these external sources can cause stability issues owing to the additional resources required to handle the external data. Many models use the hashtag feature of tweets to extract and refine the topics learned for topic discovery (Alash and Al-Sultany 2020). Our earlier work (Murshed et al. 2021) addressed the data quality problem of Twitter social media datasets. However, since not all tweets carry hashtags, the sparsity problem persists. Further, some models require user information to track topics; but in practice a Twitter user can post tweets on many topics, and user mentions are not included in most of these tweets. In a dynamic Twitter environment, user-related information is also prone to privacy issues.
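As a concrete illustration of the kind of noise handling such models need, the short regex-based sketch below applies a light normalization pass to a noisy tweet; the rules are our own illustrative choices, not a prescribed or exhaustive cleaning pipeline.

```python
import re

def normalize_tweet(text):
    """Hedged example of a light normalization pass for noisy tweets."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)        # strip URLs
    text = re.sub(r"[@#](\w+)", r"\1", text)         # keep hashtag/mention words, drop symbols
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # soooo -> soo (elongated characters)
    text = re.sub(r"[^a-z0-9\s]", " ", text)         # drop punctuation and emoji
    return re.sub(r"\s+", " ", text).strip()

print(normalize_tweet("Sooooo excited!!! #Flood warning @city http://t.co/x"))
```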

Visibility Most STTM models fail to consider time as an important feature when extracting topics. Even methods with temporal features use time only as a window for topic extraction. However, it is important to treat the time factor as a quality factor to be improved, as it allows static tweet content to be processed dynamically. Advanced models such as DMM, BTM, PTM, WNTM, and SATM have been developed in recent years but are not widely used owing to limited visibility and lack of exposure, even though they tend to outperform the traditional models. To increase their adoption for STTM, the limitations discussed in Sect. 4.3 must be addressed. It is also important that these models can handle complex data so that they can support big data (large-scale datasets).

Visualization Topics are typically displayed as the most frequent terms/words of each topic (Chuang et al. 2012; Qiang et al. 2020). How to view a document or short text through the lens of a topic model is also a challenging issue. Since topic modelling uncovers the hidden thematic structure of short texts such as posts or tweets, this structure can help identify the most important parts of documents by linking them to topic labels. Visualizing a considerable number of topics (for example, more than 100) in a compact way remains a challenging issue (Sievert and Shirley 2014).
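A minimal sketch of the simplest visualization, listing the top words per topic with scikit-learn's LDA, is given below; the toy tweets and topic count are assumptions. Richer interactive views, such as LDAvis-style tools, build on the same topic-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus; in practice this would be a cleaned tweet collection.
tweets = ["flood warning river rising", "river flood rescue teams",
          "phone launch camera battery", "new phone battery review"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(bow)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[::-1][:4]]   # top-4 words per topic
    print(f"Topic {k}: {', '.join(top)}")
```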

Streaming data Concept drift and recurring concepts are two characteristics of streaming data, and most available models do not handle such scenarios adequately. Another future research direction is training TMs through active learning (Burkhardt and Kramer 2019a). Specifically, when dealing with text data streams, labelling every incoming document by hand is a hard task. Active learning selects the documents that differ most from recently seen documents, or on which the algorithm is least confident, for labelling and infers labels for the remainder. Extending semi-supervised approaches may also help models train better with less labelled data (Burkhardt et al. 2020; Burkhardt et al. 2018).
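Concept drift can at least partly be accommodated by updating the model incrementally as new batches arrive. The sketch below uses scikit-learn's online LDA as a stand-in (the batches and parameters are illustrative), though it does not by itself detect drift or recurring concepts.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Minimal sketch of incremental topic updating on a stream of tweet batches.
# The batches below are placeholders for data arriving over time.
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
lda = LatentDirichletAllocation(n_components=5, learning_method="online",
                                random_state=0)

stream_batches = [
    ["flood warning river rising", "river flood rescue underway"],
    ["election results announced", "votes counted overnight"],
]

for batch in stream_batches:
    bow = vectorizer.transform(batch)     # stateless vectorizer, safe for streams
    lda.partial_fit(bow)                  # update topics with the new mini-batch
```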

Deep learning Although the literature reports that several existing TMs have been trained using neural networks (Burkhardt and Kramer 2019b), research on multi-label classification using neural topic models (NTM) is still limited (Panda et al. 2019). This calls for advanced DL methods such as recurrent networks, convolutions, and various prior distributions. For instance, dropping the mixture-model assumption that every document is a mixture of topics does not increase the complexity of an NTM (Srivastava and Sutton 2017); it allows each document to be represented by various combinations of topics, which makes the model more expressive. Furthermore, neural networks can easily be extended with word vectors or other layer types, which can be pre-trained to capture the semantic and syntactic attributes of words (Burkhardt and Kramer 2019a).
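As an orientation, the compact PyTorch sketch below outlines the VAE-style structure that ProdLDA-like neural topic models build on; the class name, layer sizes, and usage are illustrative assumptions and not a reproduction of any specific published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniNTM(nn.Module):
    """A minimal VAE-style neural topic model sketch (ProdLDA-inspired)."""
    def __init__(self, vocab_size, num_topics, hidden=200):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)
        self.decoder = nn.Linear(num_topics, vocab_size, bias=False)  # topic-word weights

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        theta = F.softmax(z, dim=-1)                              # document-topic mixture
        recon = F.log_softmax(self.decoder(theta), dim=-1)        # word log-probabilities
        recon_loss = -(bow * recon).sum(-1)
        kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return (recon_loss + kld).mean()

# Illustrative usage with a random stand-in for a batch of bag-of-words vectors
model = MiniNTM(vocab_size=2000, num_topics=20)
loss = model(torch.rand(8, 2000))
loss.backward()
```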

Evaluation metrics Purity, NMI, and recall are the three most popularly used metrics for assessing the quality of extracted topics, and they depend on the statistical accuracy of the clustering outcome. Owing to the great sparsity of word correlations, a purely numerical examination of the coherence among the terms representing a topic may not yield a consistent estimate. Hence, more suitable and efficient metrics are needed, which is why PMI and topic coherence measures were used in the experiments section. Fully satisfactory assessment metrics for topic discovery models have yet to be established (Qiang et al. 2020): topic coherence cannot differentiate between topics, and current metrics evaluate only one aspect of a topic model. Developing new evaluation criteria is therefore future research work for topic modelling.
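For reference, the short sketch below computes purity and NMI for a toy clustering (the labels are made up and assume ground truth is available, as in the labelled Twitter datasets); coherence-style measures such as PMI instead score the top words of each topic against corpus co-occurrence statistics.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(true_labels, cluster_labels):
    """Purity: each cluster votes for its majority ground-truth class."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()        # size of the majority class
    return total / len(true_labels)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 0, 2]                        # cluster ids from a topic model
print(purity(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred))
```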

Optimization for STTM From the comprehensive review and extensive analysis, the limitations of the existing STTM models are regarded as critical for accurately representing the current state of knowledge. While there is a substantial body of research on STTM, model refinement and optimization are still required to improve accuracy and the quality of the generated output. Meta-heuristic optimization algorithms can be a good tool for achieving high accuracy and optimized results.
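As a very simple stand-in for such an optimizer, the sketch below randomly searches over the number of topics and keeps the configuration with the lowest perplexity on the (here reused) training matrix; a genuine meta-heuristic such as a genetic algorithm or particle swarm optimization would replace the random sampling loop, and the toy corpus and scoring choice are illustrative only.

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus; a real study would use the tweet collections discussed above.
tweets = ["flood warning river rising", "river flood rescue teams",
          "phone launch camera battery", "new phone battery review"] * 10
bow = CountVectorizer().fit_transform(tweets)

best = None
for _ in range(5):
    k = random.randint(2, 10)                                   # candidate topic count
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(bow)
    score = lda.perplexity(bow)                                 # lower is better
    if best is None or score < best[1]:
        best = (k, score)
print("best number of topics:", best[0])
```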

Comparative analysis A further direction is to compare short text topic modelling methods in terms of the sub-tasks that make up short text topic modelling, the methods designed to address each sub-task, and the general processing framework each method follows for every sub-task.

11 Conclusion

This article presented a comprehensive survey and comparative analysis, along with an extended structured taxonomy, of the most recent and ever-growing body of efficient STTM models in social media. It mainly focused on the principal aspects of TSTTM models, including probabilistic, matrix factorization, unsupervised and supervised, exemplar-based, dynamic-based, data source-based, word type-based, application-based, and Frequent Pattern Mining (FPM)-based categories. Moreover, it covered ASTTM models, namely DMM-based, global word co-occurrence-based, self-aggregation-based, and deep learning topic modelling models. The taxonomy provided a qualitative analysis of existing STTM models based on their performance and respective strengths and weaknesses. The datasets used by STTM were reviewed and analyzed quantitatively, and the useful software tools and open-source libraries for STTM were reviewed and summarized.

Furthermore, a quantitative analysis of the literature was performed to highlight the research trends and future directions, and a comparative analysis of the topic quality and performance of representative STTM models was presented. The performance evaluation was conducted on two real-world Twitter datasets, RW-Pand-Twitter and RW-CB-Twitter, in terms of several metrics such as topic coherence, purity, NMI, and accuracy. Finally, the open challenges and future research directions in this promising field were discussed. To the best of our knowledge, the findings of this study will serve as a catalyst for developing new and efficient models for topic discovery in short texts that overcome the limitations of existing STTM models. The final suggestion offered by this study is the development of an STTM method that incorporates all vital features without increasing complexity, which would be a viable solution for the future.