1 Introduction

As the volume of online data grows exponentially, it is essential to identify the topics in the available data promptly and accurately. The enormous volume of text data available in electronic form today, together with the limits of human reading capacity, has created a strong demand for automatic topic modeling systems. Multiple techniques have been used to discover topics from text, images, and videos [1]. Topic models have been extensively utilized in Topic Detection and Tracking tasks, which track, detect, and label topics in a stream of documents or texts. This machine learning technique is widely used in NLP applications to analyze unstructured textual data and automatically discover the abstract topics within it. Various methods have been proposed to accurately identify topics [2,3,4] in huge text corpora.

This study models various text mining techniques to infer hidden semantic structures, and hence the topics, in a text corpus. Topic modeling experiments are conducted on the CORD-19 dataset. The investigations are performed on sentence embeddings obtained after pre-processing. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA) for topic modeling has been studied in detail.

As an extension, the application of k-Means clustering to topic modeling has been studied. Since the computational complexity of clustering algorithms grows with the number of features, various dimensionality reduction methods are also applied to improve the performance of the proposed system. Finally, an efficient framework integrating clustering with a combined BERT-LDA model for automatic topic modeling is designed and implemented. The performance of the system is evaluated by computing the Silhouette Score. The experimental results are very promising and open opportunities for further research, particularly in applying hybrid methods to improve the efficiency of the proposed model.

The rest of the paper is organized as follows: Sect. 2 provides a brief review of related work on topic modeling and clustering. Section 3 presents the proposed topic modeling methodology. Section 4 illustrates the comprehensive experiments and results. Lastly, Sect. 5 concludes the work and shares future research directions.

2 Review of related works

In this section, recent research findings related to topic identification, including topic modeling and topic clustering, are summarised. Topic modeling is an essential study area in Natural Language Processing (NLP). It aims to identify the most important topics in a document, which helps to understand the document without reading its whole content. As digital content grows rapidly, it is virtually impossible to read everything in order to identify the relevant topics; a reliable topic modeling algorithm helps to draw relevant inferences quickly. Organizing, searching, and summarising large volumes of textual information are major tasks in NLP, and topic modeling can address these problems to a certain extent. Some of the important techniques that are broadly used for topic modeling are detailed below.

The Vector Space Model (VSM), in which a document is represented as a vector, was widely used in the early stages of text processing. Latent Semantic Analysis (LSA) [5] came into practice to overcome a drawback of VSM: its inability to capture inter- and intra-document statistical structure. LSA examines the relations between a set of documents and the terms they contain by creating a set of concepts related to those documents and terms. Using singular value decomposition (SVD), it searches unstructured data for hidden connections between words and ideas. LSA can significantly reduce the data when applied to huge corpora, but, being a linear model, it cannot account for nonlinear dependencies. Latent Dirichlet Allocation (LDA) overcomes the deficiencies of LSA [6]. LDA is the most widely used algorithm for topic modeling based on statistical analysis [7]. It aims to discover the topics a document belongs to, based on the specific words in it. Each document is characterized by a probabilistic distribution over latent topics, each topic by a distribution over words, and the word distributions of the topics share a common Dirichlet prior. Multiple variations of LDA have been proposed to obtain more accurate results on different types of data, including Correlated Topic Models, Supervised Topic Models, the Nested Hierarchical Dirichlet Process, and anchor-free correlated topic modeling [8,9,10,11]. Gupta et al. presented a scoring-based LDA algorithm to effectively extract and summarize significant life events from social networking sites [12]. Many research findings show that LDA with word embeddings gains better performance. S. Limwattana et al. proposed the Deep Word-Topic Latent Dirichlet Allocation (DWT-LDA) method for training LDA with word embeddings [13]: as an alternative to assigning word topics through the Collapsed Gibbs Sampling process, a neural network with word embeddings is used.

BERT is a natural language processing model proposed by Google researchers [14]. As a topic modeling technique, it can be combined with class-based TF-IDF over transformer (BERT) embeddings to generate dense clusters [16]. N. Peinelt et al. proposed tBERT, a topic-informed model built on Bidirectional Encoder Representations from Transformers (BERT) [15]. This method relies on topic mining and pre-trained external knowledge to capture more accurate contextual representations; it can also effectively predict topics and construct summaries from social texts. BERT variants such as FinBERT, RoBERTa, and DistilBERT have also been used for modeling topics. Pre-trained models provide more accurate word and phrase representations, making topics easier to identify in BERT-based models, since stemming and lemmatization are not required.

Combined clustering and topic modeling techniques have been used to improve the topic discovery process [17,18,19]. Recently, researchers have started applying such combined techniques to obtain more accurate topic predictions. The combined BERT-LDA model has shown better results than individual LDA or BERT models. Zhang et al. proposed a flexible framework for combining topic models with BERT; according to their analysis, introducing LDA topics into BERT significantly enhanced prediction performance across a variety of semantic similarity datasets [20, 35]. When measured against a number of standard benchmarks, SBERT was found to significantly outperform state-of-the-art sentence embedding approaches [36, 37]. A summary of the topic modeling techniques used in recent research is given in Table 1.

Table 1 Summary of recent topic modeling research findings

3 Proposed approach for topic modeling

This section presents the proposed topic modeling methodology, along with a detailed description of the CORD-19 dataset used. The proposed integrated clustering and BERT framework has four major components: feature extraction, topic modeling, dimensionality reduction, and document clustering. For topic modeling, a combined BERT-LDA model is proposed, and an improved topic modeling framework is implemented using various dimensionality reduction techniques. A combined clustering framework based on the k-Means algorithm is then applied to the reduced-dimensionality results to mine a set of meaningful topics from the input text corpora. A high-level block diagram of the proposed approach is depicted in Fig. 1. Each component in the block diagram is described in detail in the following subsections.

Fig. 1 Block diagram of the proposed integrated clustering and BERT-LDA based topic modeling framework

3.1 Dataset used

For experimentation, the COVID-19 Open Research Dataset (CORD-19), the largest open dataset of COVID-19 research papers, is used [23]. The National Library of Medicine (NLM) at the National Institutes of Health (NIH), the Chan Zuckerberg Initiative, Microsoft Research, IBM Research, Kaggle, and Georgetown's Center for Security and Emerging Technology worked together to develop this corpus of metadata and full text of COVID-19 publications and preprints, which was released daily.

Major features of the dataset are listed in Table 2. CORD-19 is designed to enable the development of text mining and information retrieval techniques over its rich collection of metadata and structured full-text papers [24, 25]. The dataset has been used in various text mining and COVID-19-related studies in the past [26, 27].
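
For illustration, a minimal sketch of loading the dataset is shown below. It assumes the `metadata.csv` file distributed with the public CORD-19 release and its `abstract` column; file layout and column names may vary across releases.

```python
import pandas as pd

# Load the CORD-19 metadata file (path and column names assumed from
# the public release; adjust to the release actually used).
df = pd.read_csv("metadata.csv", low_memory=False)

# Keep only records that have an abstract to mine topics from.
docs = df.dropna(subset=["abstract"])["abstract"].tolist()
print(f"{len(docs)} abstracts available for topic modeling")
```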

Table 2 Features of the COVID-19 dataset

3.2 Data preprocessing

Data preprocessing techniques are used for knowledge mining: they transform raw data and make it appropriate for machine learning models. Real-world data is often inaccurate, inconsistent, and/or incomplete, and is likely to contain many errors. Preprocessing is a proven technique for solving such problems, as it prepares raw data for further processing. Initially, the input data is pipelined through data cleaning, data integration, data transformation, data reduction, and data discretization steps.
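
A minimal sketch of the text-cleaning part of such a pipeline, using the NLTK toolkit mentioned in Sect. 4.1, is shown below. The exact rules applied here (lowercasing, stopword removal, lemmatization) are illustrative assumptions, as the paper does not enumerate its preprocessing steps; `docs` is the abstract list from the loading sketch above.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    """Lowercase, strip non-letters, drop stopwords, and lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [lemmatizer.lemmatize(t) for t in text.split()
              if t not in stop_words and len(t) > 2]
    return " ".join(tokens)

clean_docs = [preprocess(d) for d in docs]
```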

3.3 Feature extraction

Feature extraction techniques are applied to the preprocessed documents to extract their features. Among the various feature extraction techniques available in natural language processing, we apply TF-IDF (Term Frequency–Inverse Document Frequency) to vectorize the documents and extract the features. TF-IDF is a frequently used statistical technique in information retrieval and natural language processing: it determines a term's importance within a document relative to a collection of documents. As a text vectorization method, it converts the words in a text document into importance scores.
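
A minimal sketch of this vectorization step with scikit-learn (the library used in the experiments, Sect. 4.1) follows; the parameter values are illustrative, not those used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the cleaned documents from the preprocessing sketch;
# max_features and ngram_range are illustrative choices.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(clean_docs)
print(tfidf_matrix.shape)  # (n_documents, n_features)
```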

3.4 Topic modeling

Topic modeling techniques are applied to the vectorized documents obtained after preprocessing and feature extraction on the CORD-19 dataset. The topic model output comprises the probability of a word 'w' in a given topic 't' and the probability of a topic 't' occurring in a given document 'd'. A combined BERT and LDA topic model is proposed, with Sentence Transformers based on BERT used for sentence embedding to provide better performance.
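
A minimal sketch of how such a combined representation could be formed is given below, continuing from the earlier sketches (`docs`, `clean_docs`). It assumes the `sentence-transformers` package and scikit-learn's LDA implementation; the embedding model name, the number of topics, and the weighting factor `gamma` are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

# LDA side: probabilistic document-topic vectors. The paper passes
# TF-IDF vectors to LDA; scikit-learn's LDA conventionally takes raw
# term counts, so a count matrix is used in this sketch.
counts = CountVectorizer(max_features=5000).fit_transform(clean_docs)
lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topic = lda.fit_transform(counts)          # shape: (n_docs, 10)

# BERT side: dense sentence embeddings of the raw documents.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs)              # shape: (n_docs, 384)

# Combined representation: concatenate the two views; gamma is a
# hypothetical weight balancing topic versus contextual signal.
gamma = 15
combined = np.hstack([doc_topic * gamma, embeddings])
```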

3.5 Dimensionality reduction

Dimensionality reduction is a significant step in text mining [21]. It enhances the performance of clustering methods by reducing the number of dimensions, so that text mining techniques process data with fewer terms [28, 29]. The dimensionality reduction techniques PCA, UMAP, and t-SNE are used in this study, and the performance of the system is compared based on the topic modeling output.

PCA (Principal Component Analysis) is applied to reduce the dimensionality of massive data sets by transforming a large collection of variables into a smaller one that retains most of the information in the original data. t-SNE (t-Distributed Stochastic Neighbor Embedding) is another important dimensionality reduction technique, predominantly suited to visualizing high-dimensional datasets.

UMAP (Uniform Manifold Approximation and Projection) is a promising dimensionality reduction technique with solid mathematical foundations [22]. It is faster than t-SNE, preserves the global data structure better, and carries a considerable portion of the high-dimensional structure into the lower-dimensional representation. The efficacy and adoption of UMAP in numerous scientific domains show how powerful the algorithm is. UMAP-based topic modeling showed the best results in this study.
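
A minimal sketch of this step using the umap-learn package, operating on the combined representation from Sect. 3.4; the parameter values are illustrative.

```python
import umap

# Reduce the combined BERT-LDA representation to a low-dimensional
# space before clustering; n_neighbors, n_components, and min_dist
# are illustrative, not the values used in the paper.
reducer = umap.UMAP(n_neighbors=15, n_components=5,
                    min_dist=0.0, metric="cosine", random_state=42)
reduced = reducer.fit_transform(combined)
```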

3.6 Clustering

Clustering is an unsupervised method for forming groups of similar data points. Here, clustering is used to group topics so as to effectively categorize them within clusters [28]. The k-Means algorithm, an unsupervised learning algorithm that groups an unlabeled dataset into k clusters [30,31,32,33], is used; it is a partition-based clustering method widely used in topic model clustering. The clustering results are evaluated using silhouette scores, and the combined topic model with reduced-dimension clusters showed better results.
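
A minimal sketch of the clustering step with scikit-learn, continuing from the reduced vectors above; k = 10 is illustrative and would in practice be chosen with the elbow method described in Sect. 4.1.

```python
from sklearn.cluster import KMeans

# Cluster the reduced document vectors into k topic clusters.
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
labels = kmeans.fit_predict(reduced)
```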

The algorithm of the proposed topic modeling approach based on integrated clustering and combined BERT-LDA is detailed in Algorithm 1.

Algorithm 1 Integrated clustering and combined BERT-LDA based topic modeling
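
Since Algorithm 1 is rendered as an image in the original, the following is only a compact sketch of the overall flow, chaining the illustrative snippets from the previous subsections (it reuses their imports and the `preprocess` helper; all parameter values are assumptions, not the paper's settings).

```python
def topic_modeling_pipeline(raw_docs, n_topics=10, k=10, gamma=15):
    """Sketch of the integrated clustering and combined BERT-LDA
    flow described in Sect. 3."""
    clean = [preprocess(d) for d in raw_docs]                # Sect. 3.2
    counts = CountVectorizer(max_features=5000).fit_transform(clean)
    doc_topic = LatentDirichletAllocation(
        n_components=n_topics, random_state=42).fit_transform(counts)
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(raw_docs)
    combined = np.hstack([doc_topic * gamma, emb])           # Sect. 3.4
    reduced = umap.UMAP(n_components=5,
                        random_state=42).fit_transform(combined)
    return KMeans(n_clusters=k, random_state=42,             # Sect. 3.6
                  n_init=10).fit_predict(reduced)
```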

4 Experimental results and discussion

4.1 Experimental setup

As previously stated, unsupervised clustering and transformer models are used for topic modeling. The experiments, conducted on the CORD-19 dataset, aim to identify the topics in a large text corpus. Preprocessing techniques are first applied to the collected subset of data. Next, TF-IDF weights, which combine a term's frequency (TF) with its inverse document frequency (IDF), are computed from the preprocessed documents. The TF-IDF vectors are passed to the LDA and BERT models to identify the topics. Dimensionality reduction is then performed on the embeddings, as clustering algorithms handle high dimensionality poorly; the PCA, t-SNE, and UMAP algorithms are used for this purpose. Finally, the reduced-dimensionality results are clustered using the k-Means algorithm and the results are evaluated.

The Elbow Method is used to identify the optimal k for clustering. Figure 2 shows the elbow plot, which indicates the value of the cost function produced by different values of k.

Fig. 2 Elbow method plot to identify the optimum k value
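
A minimal sketch of producing such an elbow plot with scikit-learn and matplotlib, operating on the reduced vectors from Sect. 3.5; the range of k is illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow method: plot the k-means inertia (cost) over a range of k
# and look for the "elbow" where further gains level off.
ks = range(2, 15)
inertias = [KMeans(n_clusters=k, random_state=42,
                   n_init=10).fit(reduced).inertia_ for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Cost (inertia)")
plt.show()
```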

Spyder, an open-source cross-platform integrated development environment (IDE), is used for experimentation. The machine learning experiments use the Scikit-learn and NLTK libraries.

4.2 Results and discussion

The results are evaluated using the Silhouette Score (Silhouette Coefficient). The Silhouette Coefficient is a metric for assessing the goodness of a clustering method. Its value ranges from −1 to 1; a value close to 1 indicates that the clusters are well separated and clearly distinguishable. The Silhouette Coefficient of a point j in a topic cluster, S(j), is computed using Eq. (1).

$$ S(j) = \frac{B(j) - A(j)}{\max\left( B(j),\ A(j) \right)} $$
(1)

where A(j) is the average distance between point j and all other points in the same cluster, and B(j) is the average distance from point j to all the points in the nearest neighboring cluster.

After computing the silhouette coefficient for each point, the coefficients are averaged to obtain the silhouette score. The Silhouette Scores obtained for the different topic modeling methods with the PCA, UMAP, and t-SNE dimensionality reduction algorithms are tabulated in Table 3. The experimental results show that the combined BERT-LDA model with UMAP outperforms the other methods.
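
As a sketch, the per-point coefficients of Eq. (1) and their mean can be computed directly with scikit-learn, given the reduced vectors and cluster labels from the earlier snippets.

```python
from sklearn.metrics import silhouette_samples, silhouette_score

# Per-point silhouette coefficients S(j) as in Eq. (1), then their
# mean, which is the silhouette score reported in Table 3.
per_point = silhouette_samples(reduced, labels)
score = silhouette_score(reduced, labels)  # equals per_point.mean()
print(f"Silhouette score: {score:.3f}")
```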

Table 3 Silhouette Scores obtained for various models using different dimensionality reduction methods

Figures 3, 4, and 5 show the clustering visualizations and silhouette plots for the best outcomes obtained from the LDA, BERT, and combined BERT-LDA models, as given in Table 3. The topic clustering result and silhouette plot for LDA, best achieved with UMAP, are shown in Fig. 3. Similarly, BERT provided better clustering results with UMAP, as shown in Fig. 4. The combined BERT-LDA model achieved the best result with UMAP compared to all other techniques, as shown in Fig. 5.

Fig. 3 Clustering result and silhouette plot based on LDA with UMAP

Fig. 4 Clustering result and silhouette plot based on BERT with UMAP

Fig. 5 Clustering result and silhouette plot based on BERT-LDA with UMAP

4.3 Performance comparison of the proposed algorithm

The experimental results show that the proposed approach performs better than the algorithms reported in similar earlier research. The results are compared with the existing LDA and BERT algorithms. Figure 6 shows that the proposed integrated clustering and combined BERT-LDA framework outperforms the other methods under consideration. Note that UMAP dimensionality reduction is applied on top of the combined BERT-LDA model to obtain more precise context-based features for improved topic modeling.

Fig. 6 Result comparison with LDA, BERT and BERT-LDA

5 Conclusion and future work

In this paper, topic modeling from text corpora has been studied in detail. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA) topic modeling with unsupervised clustering has been applied. PCA, t-SNE, and UMAP-based dimensionality reduction methods are used, as clustering algorithms are computationally inefficient with higher numbers of features. The CORD-19 dataset is used after preprocessing, and the hybrid model with UMAP dimensionality reduction showed comparatively better results on this input data. The reduced-dimensionality results are clustered using the k-Means algorithm and the results are evaluated, with the Elbow Method used to identify the appropriate number of clusters. The model was developed by combining the probabilistic topic assignment vector taken from the LDA model with the sentence vectors taken from the BERT model. This hybrid method helps preserve the semantic information and creates contextual topic information.

Finally, a unified clustering-based framework using BERT-LDA is proposed and implemented for mining a set of meaningful topics from the input text corpora. The experiments demonstrate the effectiveness of this cluster-informed topic modeling framework. The experimental results show that clustering helps infer more coherent topics, and therefore the unified clustering and BERT-based approach can be effectively utilized for developing topic modeling applications.

As future work, BERT variants and Deep Neural Network (DNN) models can be used to infer topics. The work can be extended to verify the effectiveness of the proposed topic modeling framework with the advanced transformer and deep learning-based algorithms reported in the literature.