1 Introduction

Topic models seek to discover a set of latent topics from a collection of documents based on word co-occurrence information. Each topic represents an interpretable semantic concept and is described as a group of related words. For example, a topic about “sports” may relate to words like “baseball”, “basketball”, and “football”. Topic models also infer what topics a document contains to reveal its underlying semantics. Due to their effectiveness and interpretability, topic models have enabled various downstream applications, such as document retrieval, content recommendation, opinion/event mining, and trend analysis (Blei and Lafferty 2006b; Wang and Blei 2011; Boyd-Graber et al 2017; Duong et al 2022; Churchill and Singh 2022).

Conventional approaches to topic modeling embrace either probabilistic graphical models or non-negative matrix factorization. Approaches based on probabilistic graphical models, such as Latent Dirichlet Allocation (LDA, Blei et al 2003), have been extensively explored for the past two decades. They mainly model the document generation process with topics as latent variables (Blei 2012) and then infer model parameters through Variational Inference (Blei et al 2017) or Markov Chain Monte Carlo (MCMC) methods like Gibbs sampling (Steyvers and Griffiths 2007). The other conventional type of topic model uses non-negative matrix factorization. These methods directly discover topics by decomposing a term-document matrix into two low-rank factor matrices: one represents words and the other documents (Lee and Seung 2000; Kim et al 2015; Shi et al 2018). These conventional topic models have spawned various model structures, such as supervised LDA (Mcauliffe and Blei 2007) and correlated LDA (Blei and Lafferty 2006a). Besides the basic topic modeling scenario, researchers have extended topic models to other diverse scenarios, e.g., short text (Yan et al 2013; Yin and Wang 2014), cross-lingual (Mimno et al 2009), and dynamic topic modeling (Blei and Lafferty 2006b; Wang et al 2008).

However, despite the achievements of these conventional methods, they generally confront two limitations: (i) Inefficient and labor-intensive parameter inference. These methods necessitate complicated model-specific derivations for parameter inference, and the inference complexity grows with model complexity. Consequently, this requirement weakens their generalization ability to diverse model structures and application scenarios. (ii) Limited scalability to large datasets. Their inference algorithms typically cannot be parallelized, leading to significant time consumption. For example, training a probabilistic dynamic topic model on a dataset with 10k documents takes two days (Dieng et al 2019). Admittedly, some parallel inference algorithms have been proposed (Newman et al 2009; Wang et al 2009; Liu et al 2011), but unfortunately they cannot straightforwardly fit other model structures and application scenarios. As a result, designing effective, flexible, efficient, and scalable topic models has become a pressing need.

Fig. 1: The overview of this survey, including NTMs with different structures, NTMs for various scenarios, applications, and challenges

To overcome these challenges, Neural Topic Models (NTMs) have emerged as a promising solution. Unlike conventional topic models, NTMs adopt deep neural networks to model latent topics, such as the popular Variational AutoEncoder (VAE, Kingma and Welling 2014; Rezende et al 2014), so they can efficiently and flexibly infer model parameters through automatic gradient back-propagation. This flexibility enables researchers to tailor model structures to fit diverse application scenarios. In addition, NTMs can seamlessly handle large-scale datasets by harnessing parallel computing facilities like GPUs. Owing to these advantages, NTMs have spurred the exploration of numerous new methods and applications.

Previously, Zhao et al (2021a) provided a review with a primary focus on the methods of NTMs. However, their review is beset by the following limitations: (i) Their method taxonomy is incomplete because they ignore several recently proposed NTM methods, such as NTMs with contrastive learning, cross-lingual NTMs, and dynamic NTMs. (ii) They omit the popular applications based on NTMs, developed for a wide range of downstream tasks. (iii) They lack in-depth discussions of the challenges inherent in NTMs. As a consequence, a more comprehensive review of NTMs is necessary for the research field.

To address these limitations, we in this paper present an extensive and up-to-date survey of NTMs, which offers an in-depth and self-contained understanding of NTMs in terms of methods, applications, and challenges. We begin by systematically organizing existing NTMs according to their neural network structures, such as using embeddings or graph neural networks. We then introduce the NTMs designed for various prevalent topic modeling scenarios, e.g., short text, cross-lingual, and dynamic topic modeling, covering a wider range than the early survey (Zhao et al 2021a). Moreover, unlike the previous survey (Zhao et al 2021a), we also organize and discuss the popular applications based on NTMs, developed for diverse tasks like text analysis and text generation. Finally, we summarize the key research challenges for NTMs in detail to motivate future research directions. Fig. 1 depicts the overview of our survey. We summarize the main contributions of this paper as follows:

  • We extensively review methods of neural topic models through detailed discussions and comparisons, covering variants with different network structures.

  • We include a broader range of popular topic modeling scenarios and provide detailed background information for each scenario, accompanied by easy-to-understand illustrations and related neural topic models.

  • We introduce popular applications based on neural topic models, developed to tackle various tasks such as text analysis and generation.

  • We highlight the current vital challenges faced by neural topic models in detail to facilitate future research. Motivated by one of these challenges, we propose a new topic diversity metric that measures diversity along with word semantics and agrees more closely with human judgment.

We accompany this survey with a repository (Footnote 1) of the mentioned paper resources to provide easy access for researchers.

2 Preliminary

In this section, we introduce the preliminaries of topic modeling, including the problem setting, notations, and evaluation methods. Then we present the most basic and popular NTM in the framework of Variational AutoEncoder (VAE).

2.1 Problem setting and notations

We introduce the problem setting and notations of topic modeling following LDA (Blei et al 2003). Consider a collection of N documents with V unique words (vocabulary size), and a document is denoted as \(\varvec{\textbf{x}}\). As illustrated in Fig. 2, topic models aim to discover K latent topics from this collection. The number of topics K is a hyperparameter, usually determined by researchers manually according to the characteristics of datasets and their target tasks. Each topic is defined as a distribution over the vocabulary, i.e., topic-word distribution, \({\varvec{\beta}}_{k} \in {{\mathbb {R}}}^{V}\). Then we have \(\varvec{{\beta }}=(\varvec{ {\beta }}_{1},\dots ,\varvec{ {\beta }}_{K}) \in {{\mathbb {R}}}^{V \times K}\) as the topic-word distribution matrix of all topics. In addition, topic models also infer the topic distribution of a document (doc-topic distribution): \(\varvec{ {\theta }} \in \Delta _{K}\), implying what topics a document contains. Here \(\theta _{k}\) refers to the proportion of Topic#k in the document, and \(\Delta _{K}\) denotes a probability simplex \(\Delta _{K} = \{ \varvec{ {\theta }} \in {{\mathbb {R}}}^{K}_{+} | \sum _{k=1}^{K} \theta _{k} = 1 \}\).
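To make these notations concrete, the following minimal sketch shows the objects a topic model produces; the sizes and random values are hypothetical, for illustration only:

```python
import numpy as np

V, K = 5000, 50  # vocabulary size and number of topics (hypothetical values)

# Topic-word distribution matrix: column k is beta_k, a distribution over the vocabulary.
beta = np.random.rand(V, K)
beta /= beta.sum(axis=0, keepdims=True)  # each column sums to 1

# Doc-topic distribution of one document: a point on the probability simplex Delta_K.
theta = np.random.dirichlet(alpha=np.ones(K))
assert np.isclose(theta.sum(), 1.0) and (theta >= 0).all()

# Top-10 words of Topic#1, ranked by topic-word probability.
top_word_ids = beta[:, 0].argsort()[::-1][:10]
```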

Fig. 2: Illustration of topic modeling. Given a document collection, a topic model aims to discover K latent topics, interpreted as distributions over words (topic-word distributions). It also infers what topics a document contains, defined as distributions over all latent topics (doc-topic distributions). Here the topic-word distribution of Topic#k, \({\varvec{{\beta}}}_{k}\), has related words like “movie”, “film”, and “oscar”; the doc-topic distribution \({\varvec{{\theta}}}\) concentrates on Topic#1 and Topic#k

2.2 Evaluation of topic models

Given the absence of ground-truth labels in topic modeling tasks, how to reliably and comprehensively evaluate topic models remains inconclusive in the research community. We introduce the currently most prevalent evaluation methods for assessing topic models as follows.

2.2.1 Perplexity

Perplexity, borrowed from language models, measures how well a model predicts new documents. It is computed from the normalized log-likelihood of held-out test documents. Perplexity has been used for years to evaluate topic models. Nevertheless, prior studies have empirically demonstrated that perplexity inaccurately reflects the quality of discovered topics as it often contradicts human judgment (Chang et al 2009). Furthermore, computing the log-likelihood is inconsistent among different topic models. This is because they apply various sampling or approximation techniques (Wallach et al 2009; Buntine 2009) as well as diverse modeling approaches for topic-word distributions and doc-topic distributions. For instance, certain methods normalize topic-word distributions with respect to topics, some with respect to words, and others opt to keep them unnormalized. These disparities bring challenges to equitable comparisons. Finally, perplexity may not evaluate the practical utility of topic models since users typically employ topic models for content analysis rather than generating new documents (Zhao et al 2021a; Hoyle et al 2022). Due to these reasons, perplexity has waned in popularity for topic model evaluation in recent research.
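For reference, a generic sketch of this computation, assuming per-document held-out log-likelihoods are available (as discussed above, how each model estimates them varies, which is part of the comparability problem):

```python
import numpy as np

def perplexity(doc_log_likelihoods, doc_token_counts):
    """Perplexity = exp(-total held-out log-likelihood / total token count)."""
    return float(np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_token_counts)))
```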

2.2.2 Topic coherence

Rather than predictive ability, researchers have turned to evaluating the quality of produced topics. For this purpose, researchers propose topic coherence to measure the coherence among the most related words of topics, i.e., top words (determined by topic-word probabilities). Experiments show that topic coherence agrees with human evaluation of topic interpretability (Lau et al 2014). For example, one widely-used coherence metric is Normalized Point-wise Mutual Information (NPMI, Bouma 2009; Newman et al 2010; Lau et al 2014) (Footnote 2). Specifically, the NPMI score between two words \((x_i, x_j)\) is calculated as follows:

$$\begin{aligned} \textrm{NPMI}(x_i, x_j) = \frac{\log \frac{p(x_i, x_j) + \epsilon }{p(x_i) p(x_j)}}{-\log \left( p(x_i, x_j)+\epsilon \right) }. \end{aligned}$$
(1)

It computes the normalized mutual information of two words, and then takes the average over all word pairs in all topics. Here \(\epsilon \) is a small constant that avoids taking the logarithm of zero; \(p(x_i)\) is the probability of word \(x_i\), and \(p(x_i, x_j)\) is the co-occurrence probability of \((x_i, x_j)\). These probabilities are estimated from occurrence frequencies in a reference corpus. The reference corpus can be either internal (the training set) or external (e.g., Wikipedia articles). Basically, a large external corpus is recommended because it can alleviate the influence of data bias in training sets and facilitate fair topic coherence comparisons across different datasets.
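A minimal sketch of Equation (1), assuming probabilities are estimated as document-frequency proportions in the reference corpus (implementations also differ in window sizes and smoothing; the extra \(\epsilon \) guard on the marginals is our addition for empty counts):

```python
import numpy as np
from itertools import combinations

def npmi(wi, wj, docs, eps=1e-12):
    """NPMI of a word pair; `docs` is a list of token sets from the reference corpus."""
    n = len(docs)
    p_i = sum(wi in d for d in docs) / n
    p_j = sum(wj in d for d in docs) / n
    p_ij = sum(wi in d and wj in d for d in docs) / n
    return np.log((p_ij + eps) / (p_i * p_j + eps)) / -np.log(p_ij + eps)

def topic_coherence(topics, docs):
    """Average NPMI over all top-word pairs of all topics."""
    scores = [npmi(wi, wj, docs)
              for topic in topics for wi, wj in combinations(topic, 2)]
    return float(np.mean(scores))
```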

Later, Röder et al (2015) propose a new metric, \(C_V\), which calculates the cosine similarity between NPMI score vectors (Krasnashchok and Jouili 2018). Given the top T words of a topic, \((x_1, x_2, \dots , x_T)\), the exact calculation of \(C_V\) is formulated as

$$\begin{aligned} C_V&= \frac{1}{T} \sum _{i=1}^{T} \cos (\varvec{\textbf{v}}_{\text {{NPMI}}}(x_i), \varvec{\textbf{v}}_{\text {{NPMI}}}(\{x_i\}_{i=1}^{T}) ) \end{aligned}$$
(2)
$$\begin{aligned} {\varvec{\textbf{v}}}_{\text {{NPMI}}}\left( x_i\right)&=\left\{ \textrm{NPMI}\left( x_i, x_j\right) \right\} _{j=1, \ldots , T} \end{aligned}$$
(3)
$$\begin{aligned} {\varvec{\textbf{v}}}_{\text {{NPMI}}}\left( \{x_i\}_{i=1}^{T}\right)&=\left\{ \sum _{i=1}^T \textrm{NPMI}\left( x_i, x_j\right) \right\} _{j=1, \ldots , T}. \end{aligned}$$
(4)

The NPMI score computation follows Equation (1). Röder et al (2015) empirically demonstrate that \(C_V\) outperforms previous coherence metrics, NPMI, UCI, and UMass (Mimno et al 2011), since \(C_V\) agrees more closely with human judgment (see Röder et al (2015) for experimental results).
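As a rough sketch, Equations (2)–(4) can be implemented on top of the `npmi` function above; note this simplification omits the boolean sliding windows of the full \(C_V\), so for reported numbers dedicated tools such as Palmetto (recommended below) should be used:

```python
import numpy as np

def c_v(topic, docs):
    """Simplified C_V of one topic's top T words, per Equations (2)-(4)."""
    T = len(topic)
    # M[i, j] = NPMI(x_i, x_j); row i is the word vector of Equation (3).
    M = np.array([[npmi(topic[i], topic[j], docs) for j in range(T)]
                  for i in range(T)])
    v_set = M.sum(axis=0)  # vector of the whole word set, Equation (4)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.mean([cos(M[i], v_set) for i in range(T)]))
```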

We recommend the Palmetto tool (Footnote 3) to compute topic coherence. It includes almost all common coherence metrics and provides a pre-processed Wikipedia article collection as the reference corpus for easier reproducibility.

2.2.3 Topic diversity

To further evaluate the quality of topics, topic diversity is introduced to measure the difference between topics. This is driven by the anticipation that topics should exhibit diversity rather than redundancy, thereby enabling the comprehensive disclosure of latent semantics in corpora. At present, researchers employ the following diversity metrics (a code sketch implementing all three follows this list):

  • Nan et al (2019) propose Topic Uniqueness (TU), which computes the average reciprocal of top word occurrences in topics. In detail, given K topics and the top T words of each topic, TU is computed as

    $$\begin{aligned} \textrm{TU} = \frac{1}{K} \sum _{k=1}^{K} \frac{1}{T} \sum _{x_i \in t(k)} \frac{1}{\#(x_{i})} \end{aligned}$$
    (5)

    where t(k) means the top word set of the k-th topic, and \(\#(x_i)\) denotes the occurrence of word \(x_i\) in the top T words of all topics. TU ranges from 1/K to 1.0, and a higher TU score indicates more diverse topics.

  • Burkhardt and Kramer (2019) propose Topic Redundancy (TR) that calculates the average occurrences of a top word in other topics. Its computation is

    $$\begin{aligned} \textrm{TR} = \frac{1}{K} \sum _{k=1}^{K} \frac{1}{T} \sum _{x_i \in t(k)} \frac{\#(x_i) - 1}{K - 1}. \end{aligned}$$
    (6)

    A higher TR score means less diverse topics.

  • Dieng et al (2020) propose Topic Diversity (TD) which computes the proportion of unique top words of topics:

    $$\begin{aligned} \textrm{TD} = \frac{1}{K} \sum _{k=1}^{K} \frac{1}{T} \sum _{x_i \in t(k)} {{\mathbb {I}}}(\#(x_i)) \end{aligned}$$
    (7)

    where \({{\mathbb {I}}}(\cdot )\) is an indicator function that equals 1 if \(\#(x_i) = 1\) and 0 otherwise. TD ranges from 0 to 1.0, and a higher TD score indicates more diverse topics.
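The sketch below implements all three metrics from Equations (5)–(7); `topics` is assumed to be a list of K lists, each holding the top T words of a topic:

```python
from collections import Counter

def tu_tr_td(topics):
    """Topic Uniqueness, Topic Redundancy, and Topic Diversity."""
    K, T = len(topics), len(topics[0])
    counts = Counter(w for topic in topics for w in topic)  # #(x_i)
    tu = sum(1 / counts[w] for topic in topics for w in topic) / (K * T)
    tr = sum((counts[w] - 1) / (K - 1) for topic in topics for w in topic) / (K * T)
    td = sum(counts[w] == 1 for topic in topics for w in topic) / (K * T)
    return tu, tr, td
```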

These metrics all measure topic diversity based on the uniqueness of individual words. They posit that diversity is optimal when all topics are characterized by distinct top words. However, we question these diversity metrics because certain topics naturally share the same words. For example, the word “chip” could be shared by the topics of “potato chip” and “electronic chip”; similarly, the word “apple” may be covered by the topics of “fruit” and “company”. This issue remains unresolved for reliable diversity evaluation. We in this paper propose a new diversity metric to address this issue (see details in Sect. 6).

Fig. 3: Illustration of a VAE-based NTM. It mainly contains an encoder (inference network) and a decoder (generation network). The encoder outputs the doc-topic distribution \({\varvec{ {\theta }}}\) from the input document \({\varvec{\textbf{x}}}\) through MLPs using the reparameterization trick, where \({\varvec{ {\epsilon}}} \sim {\mathcal {N}}({\varvec{\textbf{0}}}, {\varvec{\textbf{I}}})\). The decoder reconstructs the input document from \({\varvec{ {\theta}}}\) with \({\varvec{ {\beta }}}\) as the topic-word distribution matrix. The objective includes the reconstruction error and the KL divergence

2.2.4 Downstream task performance

Beyond coherence and diversity, which measure topic quality, researchers also resort to extrinsic performance: they use doc-topic distributions \({\varvec{ {\theta}}}\) as low-dimensional document features and evaluate their quality on downstream tasks. These tasks mainly consist of document classification and document clustering. For document classification, researchers train ordinary classifiers (e.g., SVMs or Random Forests) with learned doc-topic distributions as document features and then predict the labels of testing documents. The performance can be evaluated by accuracy or F1 scores. For document clustering, the common way is to use the most significant topic in a doc-topic distribution as the clustering assignment of a document. Another way is to apply clustering algorithms, e.g., K-Means or DBSCAN, on doc-topic distributions (Zhao et al 2021b). The clustering performance can be measured by Purity and Normalized Mutual Information (NMI, Manning et al 2008).
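A brief sketch of both evaluations with scikit-learn; the doc-topic distributions and labels here are synthetic stand-ins for a trained model's outputs, and all names are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, normalized_mutual_info_score

rng = np.random.default_rng(0)  # synthetic stand-ins for inferred theta and labels
theta_train, theta_test = rng.dirichlet(np.ones(50), 500), rng.dirichlet(np.ones(50), 100)
y_train, y_test = rng.integers(0, 5, 500), rng.integers(0, 5, 100)

# Document classification: doc-topic distributions as low-dimensional features.
clf = LinearSVC().fit(theta_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(theta_test))

# Document clustering: the most significant topic as the cluster assignment.
clusters = theta_test.argmax(axis=1)
nmi = normalized_mutual_info_score(y_test, clusters)
purity = sum(np.bincount(y_test[clusters == c]).max()
             for c in np.unique(clusters)) / len(y_test)
```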

2.2.5 Visualization

Finally, researchers visualize topic models for evaluation. The typical visualization method is to show the top words of topics and doc-topic distributions, such as using pyLDAvis (Footnote 4) (Sievert and Shirley 2014) or word clouds (Footnote 5). Another strategy is to cluster documents on a 2D canvas by reducing the dimension of doc-topic distributions with tools like t-SNE (van der Maaten and Hinton 2008).
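For instance, a minimal sketch of the 2D projection strategy (`theta` is again a synthetic stand-in for an \(N \times K\) matrix of inferred doc-topic distributions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

theta = np.random.default_rng(0).dirichlet(np.ones(20), 1000)  # stand-in doc-topic matrix
coords = TSNE(n_components=2, init="pca").fit_transform(theta)  # (N, 2)
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.show()
```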

2.3 Basic NTM based on VAE

We introduce the most basic and popular NTM, built on the Variational AutoEncoder (VAE) framework with the neural variational inference technique (Miao et al 2016; Srivastava and Sutton 2017). As illustrated in Fig. 3, a VAE-based NTM mainly contains an encoder (inference network) and a decoder (generation network). The encoder infers doc-topic distributions from documents. To be specific, we use a latent variable \({\varvec{\textbf{r}}} \in {{\mathbb {R}}}^{K} \) following a logistic normal prior

$$\begin{aligned} p({\varvec{\textbf{r}}}) = {{\mathcal {L}}}{{\mathcal {N}}}({\varvec{ {\mu}}}_{0}, {\varvec{ {\Sigma}}}_{0}) \end{aligned}$$
(8)

where \({\varvec{ {\mu}}}_{0}\) and \({\varvec{ {\Sigma}}}_{0}\) are the mean vector and diagonal covariance matrix respectively. Here the prior distribution is specified with Laplace approximation (Hennig et al 2012) to approximate a symmetric Dirichlet prior as \(\mu _{0,k} = 0\) and \(\Sigma _{0, kk} = (K-1) / \alpha K\) with hyperparameter \(\alpha \) (Srivastava and Sutton 2017). The variational distribution is modeled as

$$\begin{aligned} q_{\Theta }({\varvec{\textbf{r}}} | {\varvec{\textbf{x}}}) = {\mathcal {N}}({\varvec{ {\mu}}}, {\varvec{ {\Sigma}}}) . \end{aligned}$$
(9)

We compute \({\varvec{ {\mu }}}\) and \({\varvec{ {\Sigma}}}\) with encoder networks parameterized by \(\Theta \):

$$\begin{aligned} {\varvec{ {\mu}}}&= f_{\Theta _{1}}(\varvec{\textbf{x}}) \end{aligned}$$
(10)
$$\begin{aligned} \varvec{ {\Sigma }}&= \textrm{diag}(f_{\Theta _{2}}(\varvec{\textbf{x}})) \end{aligned}$$
(11)

where \(\Theta = \{ \Theta _{1}, \Theta _{2} \}\) and \(\textrm{diag}(\cdot )\) denotes transforming a vector to a diagonal matrix. In practice, we transform document \(\varvec{\textbf{x}}\) into a Bag-of-Words (BoW) vector as the input and employ MLPs as encoder networks. Then, to reduce gradient variance (Kingma and Welling 2014; Rezende et al 2014), we sample \(\varvec{\textbf{r}}\) through the reparameterization trick by sampling a random variable \(\varvec{ {\epsilon }}\):

$$\begin{aligned} \varvec{\textbf{r}} = \varvec{ {\mu }} + (\varvec{ {\Sigma }})^{1/2} \varvec{ {\epsilon }} \quad \text {where} \quad \varvec{ {\epsilon }} \sim {\mathcal {N}}(\varvec{\textbf{0}}, \varvec{\textbf{I}}). \end{aligned}$$
(12)

We model the doc-topic distribution \(\varvec{ {\theta }}\) with a softmax function to restrict it on a simplex:

$$\begin{aligned} \varvec{ {\theta }} = \textrm{softmax}(\varvec{\textbf{r}}) . \end{aligned}$$
(13)

The decoder generates documents from doc-topic distributions. Specifically, we use a decoder network parameterized by \(\Phi \): \(f_{\Phi }(\varvec{ {\theta }}) = \textrm{softmax}(\varvec{ {\beta }}\varvec{ {\theta }})\), which represents the generation probability of each word. Here \(\Phi = \{ \varvec{ {\beta }} \}\). Then we sample words from its multinomial distribution: \( x \sim \textrm{Mult}(f_{\Phi }(\varvec{ {\theta }}))\). Following the Evidence Lower BOund (ELBO) of VAE, we formulate the learning objective of the NTM as

$$\begin{aligned} \min _{\Theta ,\Phi } -{{\mathbb {E}}}_{q_{\Theta }(\varvec{ {\theta }}|\varvec{\textbf{x}})} \left[ \log p_{\Phi }(\varvec{\textbf{x}}|\varvec{ {\theta }}) \right] + \textrm{KL} \left[ q_{\Theta }(\varvec{\textbf{r}}|\varvec{\textbf{x}}) \Vert p(\varvec{\textbf{r}}) \right] . \end{aligned}$$
(14)

The first term is the negative expected log-likelihood, i.e., the reconstruction error, where \(p_{\Phi }(\varvec{\textbf{x}}|\varvec{ {\theta }})\) denotes the generation probability of \(\varvec{\textbf{x}}\). As we sample words from the multinomial distribution, the first term becomes \(-\varvec{\textbf{x}}^{\top } \log (f_{\Phi }(\varvec{ {\theta }})) \). The second term is the Kullback–Leibler (KL) divergence between the variational and prior distributions, which can be computed through an analytical form (Srivastava and Sutton 2017). It is also known as a regularization term.
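To make the above concrete, below is a minimal PyTorch sketch of this VAE-based NTM in the spirit of Srivastava and Sutton (2017); the layer sizes, activations, and initialization are illustrative choices rather than a reference implementation:

```python
import torch
import torch.nn as nn

class VAENTM(nn.Module):
    def __init__(self, vocab_size, num_topics, hidden=200, alpha=1.0):
        super().__init__()
        # Encoder (inference network): BoW -> mu and log-variance of r.
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus())
        self.mu_layer = nn.Linear(hidden, num_topics)
        self.logvar_layer = nn.Linear(hidden, num_topics)
        # Decoder (generation network): Phi = {beta}, the V x K topic-word matrix.
        self.beta = nn.Parameter(0.02 * torch.randn(vocab_size, num_topics))
        # Laplace approximation of a symmetric Dirichlet(alpha) prior, Equation (8).
        K = num_topics
        self.register_buffer("prior_mu", torch.zeros(K))
        self.register_buffer("prior_var", torch.full((K,), (K - 1.0) / (alpha * K)))

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu_layer(h), self.logvar_layer(h)
        # Reparameterization trick, Equation (12).
        r = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        theta = torch.softmax(r, dim=-1)                    # Equation (13)
        recon = torch.softmax(theta @ self.beta.T, dim=-1)  # f_Phi(theta), word probabilities
        # Reconstruction error: -x^T log f_Phi(theta).
        rec_loss = -(bow * (recon + 1e-10).log()).sum(1).mean()
        # Analytical KL divergence between q(r|x) and the prior, Equation (14).
        var = logvar.exp()
        kl = 0.5 * ((var / self.prior_var
                     + (self.prior_mu - mu) ** 2 / self.prior_var
                     + self.prior_var.log() - logvar).sum(1) - mu.size(1)).mean()
        return rec_loss + kl
```

Training then simply minimizes the returned loss over mini-batches of BoW vectors with a standard optimizer such as Adam.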

The above is the fundamental structure of a VAE-based NTM. Based on this, NTMs with different structures are proposed to further improve performance and deal with various application scenarios.

3 Review of neural topic models

In this section, we review existing work on Neural Topic Models (NTMs). We first introduce the NTMs with different structures, and then discuss the NTMs for various use case scenarios.

3.1 NTMs with different structures

Apart from the basic VAE structure mentioned in Sect. 2, we introduce NTMs with other structures.

3.1.1 NTMs with various priors

VAE-based NTMs commonly employ Gaussian (normal) priors since they make it easy to apply the reparameterization trick and compute the analytical form of the KL divergence. Besides Gaussian priors, NTMs also leverage various other priors. Miao et al (2017) propose new priors like Gaussian softmax and the stick-breaking process. Zhang et al (2018) use a Weibull distribution to approximate gamma distributions. Joo et al (2020) leverage an auxiliary uniform distribution to approximate the cumulative distribution function of the gamma distribution. As Dirichlet priors are important for topic modeling, Burkhardt and Kramer (2019) utilize the proposal function of a rejection sampler for a gamma distribution to approximate Dirichlet priors. Tian et al (2020) draw from the rounded posterior distribution to approximate Dirichlet samples.

3.1.2 NTMs with embeddings

As an alternative to directly modeling topics, Miao et al (2017) propose to decompose topics into two embedding parameters:

$$\begin{aligned} \varvec{ {\beta }} = \varvec{\textbf{W}}^{\top }\varvec{\textbf{T}}. \end{aligned}$$
(15)

Here \(\varvec{\textbf{W}} \in {{\mathbb {R}}}^{D \times V}\) denotes V word embeddings, and \(\varvec{\textbf{T}} \in {{\mathbb {R}}}^{D \times K}\) denotes K topic embeddings, where D is the dimension of the embedding space. Dieng et al (2020) follow this setting and facilitate topic learning by initializing \(\varvec{\textbf{W}}\) with pre-trained word embeddings like Word2Vec (Mikolov et al 2013) or GloVe (Pennington et al 2014). This approach also confers flexibility and efficiency to other topic modeling scenarios. For instance, in dynamic topic modeling, it is much cheaper to repeat topic embeddings for each time slice than to repeat the entire topic-word distribution matrix (Dieng et al 2019).
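In code, this amounts to replacing the free topic-word matrix with two embedding matrices (a sketch with illustrative sizes; the softmax normalization over the vocabulary follows ETM (Dieng et al 2020)):

```python
import torch

D, V, K = 300, 5000, 50       # embedding dim, vocabulary size, topic number (illustrative)
W = torch.randn(D, V)         # word embeddings, optionally initialized from Word2Vec/GloVe
T = torch.randn(D, K)         # topic embeddings
beta = torch.softmax(W.T @ T, dim=0)  # (V, K) topic-word matrix from Equation (15)
```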

Alternatively, Zhao et al (2021b) also model topics as embeddings, but use the optimal transport distance between doc-topic distributions and input documents to measure the reconstruction error. Wang et al (2022a) share the same idea and instead use conditional transport distance. Duan et al (2022) learn a group of global topic embeddings for task-specific adaptations. Xu et al (2022) use hyperbolic embeddings to model topics. Due to the tree-likeness property of hyperbolic space, they can capture the hierarchy among topics. More recently, Wu et al (2023b) find that topic embeddings mostly collapse together in the space of NTMs with embeddings, which leads to topic collapsing, i.e., repetitive topics. To address this issue, they propose a regularization on embeddings in addition to the traditional objective based on ELBO. The regularization considers topic embeddings as cluster centers and word embeddings as cluster samples; then it forces topic embeddings to be the centers of separately aggregated word embeddings. This effectively mitigates the topic collapsing issue and extensively improves topic modeling performance.

3.1.3 NTMs with metadata

While common NTMs learn topics in an unsupervised manner, NTMs can also leverage the metadata of documents to guide topic modeling, similar to supervised LDA (Mcauliffe and Blei 2007). In detail, Card et al (2018) introduce an NTM that can incorporate various metadata of documents. It encodes a document together with its labels (e.g., sentiment) and covariates (e.g., publication year), and generates the document conditioned on the covariates. Korshunova et al (2019) model the generation of documents and labels together in a discriminative way and train their model with mean-field variational inference. Their model can also incorporate a variety of data modalities like images. Wang and Yang (2020) jointly model topics and train an RNN classifier to predict document labels; the two components are connected by an attention mechanism. Wang et al (2021a) incorporate document networks into an NTM and jointly reconstruct documents and networks.

3.1.4 NTMs with graph neural networks

In addition to taking the traditional BoW as inputs, several NTMs use graph neural networks to model documents. Specifically, Zhu et al (2018) transform documents into biterm graphs and follow the VAE framework to reconstruct the input graphs. A biterm refers to an unordered word pair that co-occurs in the same document, originally from Yan et al (2013). Similarly, Yang et al (2020); Zhou et al (2020) use a bipartite graph of documents and words, connected by word occurrences or TF-IDF values. Wang et al (2021b) use word co-occurrence and semantic correlation graphs. Wang et al (2022b) focus on graph topic modeling with micro-blogs. Zhu et al (2023) propose a graph neural topic model to incorporate commonsense knowledge.

3.1.5 NTMs with generative adversarial networks

Some studies focus on employing Generative Adversarial Networks (GANs) to facilitate topic modeling. Wang et al (2019) follow the idea of GANs: they use a generator to generate “fake” documents from a random Dirichlet sample and then use a discriminator to distinguish the generated documents from real ones. Note that their model cannot infer doc-topic distributions because it directly maps documents to representations based on TF-IDF. To overcome this limitation, Wang et al (2020) propose to use bidirectional adversarial training, which can meanwhile infer doc-topic distributions. Hu et al (2020) further present an extension that uses two cycle-consistency constraints to generate informative representations.

3.1.6 NTMs with pre-trained language models

Researchers frequently combine NTMs with pre-trained language models. Pre-trained language models based on Transformers (Vaswani et al 2017) have become prevalent in NLP, as they are pre-trained on large-scale corpora to capture contextual linguistic features. Multiple studies leverage contextual features from these pre-trained models to provide richer information than conventional BoW. For instance, Bianchi et al (2021a) input the concatenation of BoW and the contextual document embeddings from Sentence-BERT (Reimers and Gurevych 2019), and then reconstruct BoW as in previous work. Hoyle et al (2020) propose to distill knowledge from BERT (Devlin et al 2018) to NTMs. In detail, they produce pseudo BoW from the predictive word probability of BERT; their NTM then reconstructs both the real and pseudo BoW. Bianchi et al (2021b); Mueller and Dredze (2021) employ multilingual BERT to infer cross-lingual doc-topic distributions for zero-shot learning, but they cannot discover aligned cross-lingual topics.

3.1.7 NTMs with contrastive learning

As a self-supervised learning paradigm, contrastive learning has been employed to facilitate NTMs. The idea of contrastive learning is to measure the similarity relations among sample pairs in a representation space (Van den Oord et al 2018). Nguyen and Luu (2021) propose contrastive learning on doc-topic distributions, where they build positive and negative pairs by sampling salient words of documents. Differently, Wu et al (2022) directly sample positive and negative pairs based on the topic semantics of documents to capture relations among samples. Specifically, they quantize doc-topic distributions following Wu et al (2020b) and then sample documents with the same quantization indices as positive pairs and different indices as negative pairs. Their method can also capture the similarity relations among augmented samples. Zhou et al (2023) improve topic disentanglement with contrastive learning on word and topic embeddings. Han et al (2023) cluster documents, compute term weights, and make NTMs reconstruct salient words. They also use contrastive learning to refine doc-topic distributions, where positive samples come from pre-trained language models.

3.1.8 NTMs with reinforcement learning

Reinforcement learning has been utilized to facilitate the learning process of NTMs. To be specific, Gui et al (2019) enhance NTMs with a reinforcement learning framework. They evaluate topic coherence performance during training and use this performance as reward signals to guide the learning of topic modeling. Costello and Reformat (2023) follow this idea and add further improvements, such as using sentence embeddings, adding a weighting term to the ELBO, and tracking topic diversity and coherence during training.

3.1.9 Other NTMs

Apart from the aforementioned methods, we introduce NTMs with other structures.

Before the invention of VAE-based NTMs, researchers made various attempts to model latent topics with neural networks. Some studies focus on NTMs in the autoregressive framework. Larochelle and Lauly (2012) propose an autoregressive NTM, called DocNADE. Inspired by Replicated Softmax (Hinton and Salakhutdinov 2009), DocNADE predicts the probability of a word in a document conditioned on a hidden state, which is in turn conditioned on the previous words. It then interprets topics with the hidden state and infers doc-topic distributions from the hidden states of the document. Gupta et al (2019b) extend DocNADE by modeling the bi-directional dependencies between words. Gupta et al (2019a) then use an LSTM to enable DocNADE to incorporate external knowledge.

Cao et al (2015) also propose an early NTM that predates VAE-based NTMs. Their approach predicts how an n-gram correlates with documents. It computes the representation of an n-gram by transforming the accumulation of the word embeddings from Word2Vec (Mikolov et al 2013) and projects documents into representations with a look-up matrix table. In this way, it models topic-word distributions as the n-gram representations and doc-topic distributions as the projected document representations. For training, it uses the document containing the n-gram as a positive and randomly samples documents that do not contain this n-gram as negatives.

Lin et al (2019) replace the softmax function with the sparsemax to enhance the sparsity in doc-topic distributions. Nan et al (2019) use a Wasserstein AutoEncoder (WAE) to model topics, which minimizes the Wasserstein distance between generated documents and input documents. Rezaee and Ferraro (2020) propose an NTM without using the reparameterization trick; they generate discrete topic assignments from an RNN inspired by Dieng et al (2017). Wu et al (2021) focus on discovering latent topics from long-tailed corpora. They propose a causal inference framework to analyze how the long-tailed bias influences topic modeling, and then use a simple but effective causal intervention method to mitigate such influence.

Fig. 4: Illustration of hierarchical topic modeling. Topics at each level cover different semantic granularity: child topics are more specific than their parent topics. Topic#2-1 denotes the first topic at the second layer of the topic structure

3.1.10 Topic discovery by clustering

We discuss a special type of approach that discovers latent topics by clustering instead of modeling the generation process of documents. These methods typically leverage traditional word embeddings such as Word2Vec (Mikolov et al 2013) or contextual embeddings from pre-trained language models. We must clarify that they differ from the aforementioned ordinary topic models because they can only produce topics but cannot infer doc-topic distributions as required. On the other hand, one advantage of these methods is their enhanced computational efficiency. In detail, Thompson and Mimno (2020) straightforwardly cluster token-level word embeddings from pre-trained models like BERT and GPT-2 and produce topics from the words assigned to clusters. Similarly, Sia et al (2020); Angelov (2020); Zhang et al (2022c) cluster word or document embeddings and interpret hidden topics by sampling words from clusters via term weights like TF-IDF. Following this line of work, Grootendorst (2022) proposes BERTopic, which clusters document representations through HDBSCAN. Note that BERTopic can estimate the doc-topic distribution based on the term weights within a given document.
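A minimal sketch of this clustering paradigm (illustrative choices: K-Means on precomputed document embeddings and a simple damped term-frequency score for describing clusters; BERTopic itself uses HDBSCAN and class-based TF-IDF):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def topics_by_clustering(docs, doc_embeddings, num_topics=20, top_t=10):
    """Discover topics by clustering document embeddings, then describe
    each cluster with its highest-scoring words."""
    labels = KMeans(n_clusters=num_topics, n_init=10).fit_predict(doc_embeddings)
    counter = CountVectorizer(stop_words="english")
    X = counter.fit_transform(docs).toarray()
    vocab = np.array(counter.get_feature_names_out())
    topics = []
    for k in range(num_topics):
        freq = X[labels == k].sum(axis=0)     # word frequencies inside cluster k
        score = freq / (1.0 + X.sum(axis=0))  # damp globally frequent words
        topics.append(vocab[np.argsort(score)[::-1][:top_t]].tolist())
    return topics
```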

3.2 NTMs for various scenarios

Besides the basic scenario of normal documents, we in this section introduce NTMs tailored for various use case scenarios, such as hierarchical, cross-lingual, and dynamic topic modeling. We introduce the background of each scenario and present the related NTMs.

3.2.1 Hierarchical NTMs

Similar to conventional topic models (Griffiths et al 2003; Teh et al 2004; Blei et al 2010), NTMs can discover hierarchical topics to reveal topic structures from general to specific. Topics at each level in a hierarchy cover different semantic granularity: child topics tend to be more specific than their parent topics. As shown in Fig. 4, a topic about “sports” can derive more specific child topics, like “soccer”, “basketball”, and “tennis”; a topic about “computer” also has specific child topics like “linux”, “programming”, and “windows”. In addition, hierarchical topic modeling can relieve the challenge of determining the number of topics to some extent (Blei et al 2010).

To discover hierarchical topics, some work follows the previous non-parametric setting. Isonuma et al (2020) propose a tree-structured neural topic model with two doubly-recurrent neural networks over the ancestors and siblings respectively (Alvarez-Melis and Jaakkola 2017). Note that the tree structure is unbounded, i.e., it can be dynamically updated in a heuristic way during training. Pham and Le (2021) follow this spirit and jointly handle hierarchical topics and document visualization. Chen et al (2021b) leverage a stick-breaking process as prior for non-parametric modeling.

Later, the parametric fashion has attracted more attention; it requires setting the width and height of a topic hierarchy in advance. Chen et al (2021a) propose manifold regularization on topic hierarchy learning. Duan et al (2021) propose a Sawtooth Connection to model topic dependencies across hierarchical levels based on the model structure of ETM (Dieng et al 2020). As aforementioned, Xu et al (2022) use different layers in the hyperbolic embedding space to interpret hierarchical topics. Li et al (2022) employ skip-connections for decoding to alleviate the posterior collapsing issue and propose a policy gradient method for training. Recently, Duan et al (2023) propose to generate different documents for different levels. They craft documents with more related words through word similarity matrices for higher levels, and then progressively generate these documents at each level. Chen et al (2023) utilize a Gaussian mixture prior and nonlinear structural equations to model topic dependencies between hierarchical levels. The main issue of the parametric setting is that topic hierarchies cannot grow dynamically since their width and height must be predetermined before training.

3.2.2 Short text NTMs

Researchers apply NTMs to discover topics from short texts. Short texts, prevalent on the Internet in various forms such as tweets, comments, and news headlines, serve as a common medium for individuals to express ideas, comments, and opinions. However, normal topic models often struggle to handle short texts effectively. The principal reason is that topic models depend on word co-occurrence information to infer latent topics, but such information is extremely sparse in short texts due to their limited context. This challenge, referred to as data sparsity (Yan et al 2013; Wu and Li 2019), hinders topic models from discovering high-quality topics and thus has attracted considerable attention in the research community.

Several studies have been proposed to overcome this data sparsity challenge. Lin et al (2020) use Archimedean copulas to regularize the discreteness of topic distributions of short texts. Wu et al (2020b) propose to quantize the doc-topic distributions of short texts into quantization vectors following the idea of Van den Oord and Vinyals (2017). By carefully initializing the quantization vectors, they can produce sharper doc-topic distributions that better fit short texts with limited context. They also propose a negative sampling decoder, in addition to the negative log-likelihood, to avoid repetitive topics. To address the data sparsity issue, Wang et al (2021b) use word co-occurrence and semantic correlation graphs to enrich the learning signals of short texts. Zhao et al (2021d) propose to incorporate entity vector representations, learned from manually edited knowledge graphs, into an NTM for short texts. Based on Wu et al (2020b), Wu et al (2022) further propose a contrastive learning method according to the topic semantics of short texts, which better captures the similarity relations among them. This refines the representations of short texts and thus their doc-topic distributions. Their method can also employ data augmentation to further mitigate the data sparsity problem.

Fig. 5: Illustration of cross-lingual topic modeling (on English and Chinese documents). Corresponding cross-lingual topics are required to be aligned, like English and Chinese Topic#3, and English and Chinese Topic#5. Words in brackets are the English translations

3.2.3 Cross-lingual NTMs

Cross-lingual NTMs are proposed following cross-lingual topic models (Mimno et al 2009). Cross-lingual topic models aim to discover aligned topics in different languages. As exemplified in Fig. 5, English and Chinese Topic#3 both refer to “music”, and English and Chinese Topic#5 refer to “celebrity”. In addition, if two documents in different languages contain similar latent topics, their inferred doc-topic distributions should be similar. For instance, the doc-topic distributions of the parallel English and Chinese documents in Fig. 5 are similar. These aligned cross-lingual topics can reveal commonalities and differences across languages and cultures, which enables cross-lingual text analysis without supervision.

Wu et al (2020a) propose the first neural cross-lingual topic model. It transforms the topic-word distributions into the vocabulary space of the other language, so the topic-word distributions of one language can incorporate the semantics of the other, which aligns cross-lingual topics. They show that their model outperforms traditional multilingual topic models (Shi et al 2016; Yuan et al 2018). Later, Wu et al (2023a) propose to align cross-lingual topics from the perspective of mutual information. This can properly align cross-lingual topics and prevent degenerate topic representations. To address the low-coverage dictionary issue, they also propose a cross-lingual vocabulary linking method that finds more linked words for topic alignment beyond the given dictionary. Bianchi et al (2021b); Mueller and Dredze (2021) directly learn cross-lingual doc-topic distributions with multilingual BERT, but we emphasize that they cannot discover aligned cross-lingual topics as required.

3.2.4 Dynamic NTMs

Dynamic NTMs are explored following dynamic topic models (Blei and Lafferty 2006b; Wang et al 2012). Previous static topic models implicitly assume that documents are exchangeable. However, this assumption is inappropriate since documents are produced sequentially, such as scholarly journals, emails, and news articles. As such, dynamic topic models are proposed. While topics in previous methods are all static, dynamic topic models allow topics to shift over time to capture the topic evolution in sequential documents. To be specific, dynamic topic models assume that documents are divided by time slice, for example by year, and each time slice has K latent topics. The topics associated with slice t evolve from the topics associated with slice \(t-1\). As exemplified in Fig. 6, Topic#1 about Ukraine and Russia evolves from the year 2020 to 2022. Due to the emergence of the word “invasion”, we see Topic#1 captures the Ukraine-Russia war that broke out in 2022. Similarly, Topic#K about Covid-19 evolves from the year 2020 to 2022 with the explosion of the Omicron variant. Such topic evolution reveals how topics emerge, grow, and vanish, and has been applied to trend analysis and opinion mining.

Dieng et al (2019) first propose a neural dynamic topic model, the Dynamic Embedded Topic Model. It uses word and topic embeddings to interpret latent topics following Dieng et al (2020); the topic embeddings at slice t depend on the topic embeddings at slice \(t-1\). Besides, it uses an LSTM to learn temporal priors of doc-topic distributions. Rahimi et al (2023) discover topic evolution by clustering documents but cannot infer doc-topic distributions as required. Zhang and Lauw (2022) focus on the dynamic topics of temporal document networks and incorporate the linking information between documents. Rather than modeling topic evolution, Cvejoski et al (2023) model the activities of topics over time. Note that the activities of topics evolve over time but their topics are invariant; thus, this method does not precisely adhere to the original definition of dynamic topic modeling.

Fig. 6: Illustration of dynamic topic modeling. Topics associated with time slice t evolve from topics associated with slice \(t-1\). Here Topic#1 in 2022 evolves from Topic#1 in 2021, and similarly for the other topics

3.2.5 Correlated NTMs

Following the idea of correlated topic modeling (Blei and Lafferty 2006a), correlated NTMs have been explored. Correlated topic models seek to capture the correlations between latent topics. For example, a document about genetics is more likely to be also about disease than about x-ray astronomy (Blei and Lafferty 2006a). This leads to better expressiveness than LDA. Liu et al (2019) follow the VAE-based NTM and use a centralized transformation flow to capture topic correlations. To effectively infer the transformation flow, they present the transformation flow lower bound to regulate the KL divergence term.

3.2.6 Lifelong NTMs

Lifelong NTMs are proposed to solve the challenge of data sparsity, similar to short text NTMs but in a continual lifelong learning fashion. Gupta et al (2020) propose the first lifelong NTM. They retain prior knowledge, i.e., topics, from document streams and guide topic modeling on sparse datasets with the accumulated knowledge. In detail, they use topic regularization to transfer topical knowledge from several domains and prevent catastrophic forgetting, and a selective replay strategy to identify relevant historical documents. Zhang et al (2022a) propose a lifelong NTM enhanced with a knowledge extractor and adversarial networks.

Although lifelong and dynamic topic modeling both work on sequential documents, we emphasize their difference: dynamic topic modeling aims to discover topic evolution, i.e., replacing outdated topics with emergent ones, while lifelong topic modeling aims to mitigate the data sparsity issue by accumulating prior topical knowledge, so it must avoid forgetting past knowledge.

4 Applications of NTMs

In this section, we introduce the applications of NTMs, mainly including text analysis, text generation, and content recommendation.

4.1 Text analysis

The primary applications of NTMs concentrate on text analysis (Hoyle et al 2022; Laureate et al 2023).

Bai et al (2018) apply an NTM to analyze scientific articles. They enable an NTM to incorporate the citation graphs of scientific articles by predicting the connections between them; thus their model can also recommend related articles to users. Zeng et al (2018) combine an NTM and a memory network and jointly train them for short text classification. Their method classifies short texts and discovers topics from them simultaneously. Chaudhary et al (2020) combine BERT with an NTM, which reduces the amount of self-attention computation. They claim that this can greatly speed up their fine-tuning process and thus reduce CO2 emissions. Song et al (2021) propose a classification-aware NTM that includes an NTM and a classifier. They focus on classifying disinformation about COVID-19 to help deliver effective public health messages.

Zeng et al (2019) apply NTMs to understand the discourse in micro-blog conversations. Li et al (2020) use a dynamic NTM to understand the global impact of COVID-19 and non-pharmacological interventions in different countries and media sources. Their discovered dynamic topics help researchers understand the progression of the epidemic. Valero et al (2022) propose a short text NTM for podcast metadata. Gui et al (2020) propose a multitask mutual learning framework for sentiment analysis and topic detection; they make the topic-word distributions similar to the word-level attention vectors through mutual learning. Avasthi et al (2022) use NTMs to mine topics from large-scale scientific and biomedical text corpora.

4.2 Text generation

Several studies apply NTMs to text generation tasks. Specifically, Tang et al (2019) propose a text generation model that learns semantic and structural features through a VAE-based NTM. Yang et al (2021) leverage NTMs to alleviate the information sparsity issue in long story generation. They map a short text to a low-dimensional doc-topic distribution, from which they sample interrelated words as a skeleton. With the short text and the skeleton as input, they use a Transformer to generate long stories. Nguyen et al (2021) use the doc-topic distributions of an NTM to enrich and control the global semantics for text summarization. Zhang et al (2022b) propose a neural hierarchical topic model to discover hierarchical topics from documents and then generate keyphrases under the hierarchical topic guidance.

4.3 Content recommendation

Similar to early work (Wang and Blei 2011), NTMs can cooperate with recommendation systems. Esmaeili et al (2019) combine an NTM with a recommender system for reviews through a structured auto-encoder. Xie et al (2021) use a graph NTM for citation recommendation.

5 Challenges of NTMs

Despite their popularity, NTMs encounter several challenges. In this section, we summarize these main challenges as possible future research directions.

5.1 Lacking reliable evaluation

Inherited from conventional topic models, the critical challenge of NTMs primarily lies in the lack of reliable evaluation. Current evaluation methods have been developed over years, but they have the following issues.

5.1.1 Absence of standard evaluation metrics

The topic modeling field lacks standard evaluation metrics. Resorting to human judgment provides one effective way to evaluate topic models, such as topic rating and word intrusion tasks (Lau et al 2014). Unfortunately, its reliance on human raters renders it expensive and time-consuming, limiting its feasibility for wide-scale comparisons. Owing to this, researchers generally depend on automatic evaluation metrics, such as the topic coherence and diversity mentioned in Sect. 2.2. However, these automatic metrics encounter the following two problems:

  • Inconsistent usage and settings of automatic metrics. The usage and settings of automatic metrics vary across papers and even within a paper. For example, variations include the number of top words, the number of topics, reference corpora, and coherence or diversity metrics. Consequently, the results are often confined to specific studies, impeding the comparability of NTMs across different research papers. Such inconsistencies have led some benchmarking studies to argue that the conventional LDA can still outperform NTMs in certain aspects (Doan and Hoang 2021; Hoyle et al 2022).

  • Questionable agreement between automatic metrics and human judgment. Some investigations have revealed discrepancies between the coherence metrics and human evaluation: they find that automatic metrics declare winning models when the corresponding human evaluation does not. This raises concerns that coherence metrics, originally designed for older models, may be incompatible with the newer neural topic models (Doogan and Buntine 2021; Hoyle et al 2021). We believe similar concerns may extend to diversity metrics: they may also be inconsistent with human assessments. We detail the reasons and offer a heuristic solution in Sect. 6.

Owing to the above problems, researchers have called for automatic metrics that better approximate the preferences of real-world topic model users (Hoyle et al 2021; Stammbach et al 2023). Thus, proposing standard and practical evaluation metrics is a promising and urgent future research direction for topic modeling.

5.1.2 Lacking standardized dataset pre-processing settings

The topic modeling field lacks standardized dataset pre-processing settings for topic model comparisons. Researchers routinely pre-process datasets before running topic models, like removing less frequent words and stop words. Recent studies find that different dataset pre-processing settings greatly impact topic modeling outcomes, such as the minimum and maximum document frequency, maximum vocabulary size, and stop word sets (Card et al 2018; Wu et al 2023b). However, these pre-processing settings vary substantially across papers even if they use the same benchmark datasets like 20newsgroup. These variations raise questions about the generalization ability of their methods across different pre-processing settings. Thus their claimed performance improvements may be untenable. In consequence, establishing standardized dataset pre-processing settings emerges as an imperative prerequisite for ensuring reliable and consistent evaluations of topic models.

Table 1 Examples of trivial and repetitive topics.

5.2 Low-quality topics

Despite the simplicity and popularity of NTMs, the quality of their discovered topics has been questioned from two aspects:

  • Trivial Topics: Discovered topics are trivial with uninformative words. These topics cannot reveal the actual latent semantics of documents. As exemplified in Table 1, the topics include “even”, “just”, and “really”. It is difficult to discern their underlying conceptual semantics.

  • Repetitive Topics: Discovered topics are repetitive with the same words, also referred to as the topic collapsing problem. As shown in Table 1, the topics include the same words like “sports”, “games”, and “soccer”, making them hard to distinguish. Moreover, such repetitive topics imply that some semantics remain hidden in the documents.

More problematically, some NTMs may exhibit triviality and repetitiveness in their discovered topics simultaneously (Wu et al 2020b, 2023b). These two kinds of low-quality topics impede understanding, undermine the interpretability of topic models, and are less beneficial for downstream tasks and applications. In consequence, how to effectively and consistently overcome this challenge becomes a necessary and constructive research direction.

5.3 Sensitivity to hyperparameters

Another significant challenge of NTMs lies in their sensitivity to hyperparameters. Due to their complicated structures, NTMs typically possess more hyperparameters than conventional topic models. For example, hyperparameters such as dropout probability, batch size, and learning rate play critical roles in several NTMs (Srivastava and Sutton 2017; Card et al 2018). Besides, certain NTMs cannot perform well under a large number of topics (Wu et al 2020b). As a result, researchers must meticulously fine-tune these hyperparameters when applying NTMs, especially to new datasets. This sensitivity to hyperparameters curtails the generalization ability of NTMs, underscoring the necessity of mitigating it.

Table 2 Comparison of topic diversity metrics under three cases

6 Topic semantic-aware diversity

In this section, we propose a new diversity metric that considers the semantics of topics when measuring topic diversity.

6.1 Problem of previous diversity metrics

Previous topic diversity metrics may contradict human judgment. These diversity metrics, such as TR (Burkhardt and Kramer 2019), TU (Nan et al 2019), and TD (Dieng et al 2020), all consider the uniqueness of individual top words of topics. They assume that diversity is perfect only when each top word is unique. However, we argue that this measurement is over-strict since it ignores the fact that different topics may naturally share the same words due to word polysemy. As the examples in Table 2 show, “apple” refers to a kind of fruit in Topic#1 and a technology company in Topic#2, and “jobs” refers to Steve Jobs in Topic#2 and a paid position of employment in Topic#3. These topics convey different conceptual semantics although they share some words, so we argue their diversity score should be the highest. But we see that their TU score is only 0.867 and TD is 0.733 in Table 2, which disagrees with our judgment.

6.2 Topic semantic-aware diversity

To address this issue, we propose Topic Semantic-aware Diversity (TSD), a new metric that measures topic diversity along with word semantics.

6.2.1 Definition of topic semantic-aware diversity

In detail, we compute TSD based on the frequencies of word pairs. Given K topics and the top T words of each topic, we define TSD as follows:

$$\begin{aligned} \textrm{TSD} = \frac{2}{K T (T-1)} \sum _{k=1}^{K} \sum _{(x_i, x_j) \in t(k)} {{\mathbb {I}}}( \#(x_i, x_j) ). \end{aligned}$$
(16)

Here \(\#(x_i, x_j)\) denotes the number of occurrences of the unordered word pair \((x_i, x_j)\) in the top words of all K topics, \({{\mathbb {I}}}(\cdot )\) refers to an indicator function that equals 1 if \(\#(x_i, x_j) = 1\) and 0 otherwise, and t(k) denotes the top words of the k-th topic. Rather than the uniqueness of single words, our TSD measures the uniqueness of word pairs in the top words of topics. This is because we know what a word exactly refers to when it is paired with another one. For example, “apple” refers to a fruit if paired with “orange” or “banana” and to a company if paired with “technology” or “company”. Note that TSD reduces to TD when Equation (16) counts single words instead of word pairs.

We exemplify the difference between our TSD and previous diversity metrics. Table 2 Case 1 shows the TSD score of these three topics is 1.0. This is because “apple” does not co-occur with “orange”, “grape”, or “banana”, and “jobs” does not co-occur with “unemployment”, “economy”, or “salary” in Topic#2. Thus TSD considers Topic#1-3 as different topics regardless of the shared words. Naturally, TSD penalizes diversity if word pairs are repetitive. For instance, Table 2 Case 2 shows the TSD score of the three topics becomes lower since “apple” co-occurs with “orange” in both Topic#1 and #2. In the worst situation, Table 2 Case 3 has all identical topics; we see in this case both TD and TSD give 0 for topic diversity.
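A direct sketch of Equation (16), with topics again given as lists of top words:

```python
from collections import Counter
from itertools import combinations

def tsd(topics):
    """Topic Semantic-aware Diversity: the proportion of unique top-word pairs."""
    K, T = len(topics), len(topics[0])
    pair_counts = Counter(frozenset(p)  # #(x_i, x_j), unordered
                          for topic in topics for p in combinations(topic, 2))
    unique = sum(pair_counts[frozenset(p)] == 1
                 for topic in topics for p in combinations(topic, 2))
    return 2.0 * unique / (K * T * (T - 1))
```

For the three topics of Table 2 Case 1, every word pair occurs exactly once, so `tsd` returns 1.0 even though “apple” and “jobs” each appear twice.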

Table 3 Correlations between topic diversity metrics and human ratings on different datasets

6.2.2 Evaluation results

We conduct experiments to thoroughly compare our proposed topic diversity metric with previous ones. In detail, we employ a conventional topic model, LDA (Blei et al 2003), and a neural topic model, NSTM (Zhao et al 2021c), to discover latent topics from real-world datasets. Then we ask human raters to evaluate the diversity among the top words of sampled topics. The adopted datasets are as follows: (i) NeurIPS (Footnote 6), including published papers at the NeurIPS conference from 1987 to 2017. (ii) ACL (Bird et al 2008), including research articles between 1973 and 2006 from the ACL Anthology (Footnote 7). (iii) NYT (Footnote 8), including news articles on the New York Times website from 2012 to 2022. (iv) Wikitext103 (Footnote 9) (Merity et al 2016), including Wikipedia articles. Following Lau et al (2014); Röder et al (2015), we compute the Pearson correlation coefficients between the results of these topic diversity metrics and human ratings.

Table 3 shows the correlation results on different datasets and the average correlation. We notice that our TSD achieves relatively higher correlation scores with human ratings. This is because our TSD metric also considers word semantics while measuring topic diversity. These empirical results demonstrate that our TSD metric more closely aligns with human judgment concerning topic diversity evaluation.

7 Topic model toolkits

Several topic model toolkits have been developed by the research community. Early popular toolkits include MALLET (Footnote 10) (McCallum 2002), gensim (Footnote 11) (Rehurek and Sojka 2011), STTM (Footnote 12) (Qiang et al 2020), ToModAPI (Footnote 13) (Lisena et al 2020), and tomotopy (Footnote 14). However, these toolkits often neglect either the implementations of NTMs, dataset pre-processing, or evaluations, leaving a gap in meeting practical requirements. Recently, Terragni et al (2021) proposed the OCTIS toolkit (Footnote 15), which includes several NTM methods, evaluations, and Bayesian parameter optimization for research. The latest toolkit is TopMost (Footnote 16) (Wu et al 2023c). Compared to OCTIS, TopMost covers a wider range of topic modeling scenarios and more recently released NTMs. It also decouples model implementations from training, which eases the extension to new models. These toolkits provide a solid foundation for beginners to explore various topic models and empower users to apply diverse topic models in their applications.

8 Conclusion

Topic models have been prevalent for decades with diverse applications. Recently Neural Topic Models (NTMs) have attracted significant attention due to their flexibility and scalability. They stand out by offering advantages such as avoiding the requirement for model-specific derivations and efficiently handling large-scale datasets. With the emergence of NTMs, researchers have explored several promising applications for various tasks.

In this paper, we provide a comprehensive and up-to-date survey of NTMs. We introduce the preliminaries of topic modeling, including the problem setting, notations, and evaluation methods. We review the existing NTM methods that employ different network structures and discuss their applicability to different use case scenarios. In addition, we delve into an examination of the popular applications built on NTMs. Finally, we identify and discuss the challenges that lie ahead for NTM research in detail. We hope this survey can serve as a valuable resource for researchers interested in NTMs and contribute to the advancement of NTM research.