A Survey on Neural Topic Models: Methods, Applications, and Challenges

Topic models have been prevalent for decades to discover latent topics and infer topic proportions of documents in an unsupervised fashion. They have been widely used in various applications like text analysis and content recommendation. Recently, the rise of neural networks has facilitated the emergence of a new research field -- Neural Topic Models (NTMs). Different from conventional topic models, NTMs directly optimize parameters without requiring model-specific derivations. This endows NTMs with better scalability and flexibility, resulting in significant research attention and plentiful new methods and applications. In this paper, we present a comprehensive survey on neural topic models concerning methods, applications, and challenges. Specifically, we systematically organize current NTM methods according to their network structures and introduce the NTMs for various scenarios like short texts and bilingual documents. We also discuss a wide range of popular applications built on NTMs. Finally, we highlight the challenges confronted by NTMs to inspire future research. We accompany this survey with a repository for easier access to the mentioned paper resources: https://github.com/bobxwu/Paper-Neural-Topic-Models.


Introduction
Topic models seek to discover a set of latent topics from a collection of documents, depending on word co-occurrence information. Each topic represents an interpretable semantic concept and is described as a group of related words. For example, a topic about "sports" may relate to words like "baseball", "basketball", and "football". Topic models also infer what topics a document contains (topic proportions) to reveal their underlying semantics. Due to their effectiveness and interpretability, topic models have derived various downstream applications, such as document retrieval, content recommendation, opinion/event mining, and trend analysis (Blei and Lafferty, 2006b; Wang and Blei, 2011; Boyd-Graber et al., 2017; Duong et al., 2022; Churchill and Singh, 2022).
Conventional approaches to topic modeling embrace either probabilistic graphical models or non-negative matrix factorization. Approaches based on probabilistic graphical models, such as the classic Latent Dirichlet Allocation (LDA, Blei et al., 2003), have been extensively explored for the past two decades (Blei, 2012). They mainly model the document generation process with topics as latent variables, and then infer model parameters through Variational Inference (Blei et al., 2017) or Markov Chain Monte Carlo (MCMC) methods like Gibbs sampling (Steyvers and Griffiths, 2007). Alternatively, another conventional topic model type uses non-negative matrix factorization. These methods directly discover topics by decomposing a term-document matrix into two low-rank factor matrices: one represents words and the other documents (Lee and Seung, 2000; Kim et al., 2015; Shi et al., 2018). These conventional topic models have derived various model structures, such as supervised LDA (Mcauliffe and Blei, 2007) and correlated LDA (Blei and Lafferty, 2006a). Besides the basic topic modeling scenario, researchers have extended topic models to other diverse scenarios, e.g., short text (Yan et al., 2013; Yin and Wang, 2014), cross-lingual (Mimno et al., 2009), and dynamic topic modeling (Blei and Lafferty, 2006b; Wang et al., 2008).
However, despite the achievements of these conventional methods, they generally confront two limitations: (i) Inefficient and labor-intensive parameter inference. These methods necessitate complicated model-specific derivations for parameter inference, and the inference complexity grows along with model complexity. Consequently, this requirement weakens their generalization ability to diverse model structures and application scenarios. (ii) Limited scalability to large datasets. Their inference algorithms typically are not parallel, leading to significant time consumption. For example, training the probabilistic dynamic topic model DTM (Blei and Lafferty, 2006b) on a dataset with 10k documents takes two days (Dieng et al., 2019). Admittedly, some parallel inference algorithms have been proposed (Newman et al., 2009; Wang et al., 2009; Liu et al., 2011), but unfortunately they cannot straightforwardly fit other model structures and application scenarios. As a result, how to design effective, flexible, efficient, and scalable topic models has become an urgent imperative in the research field.
To overcome these challenges, Neural Topic Models (NTMs) have emerged as a promising solution. Unlike conventional topic models, NTMs can efficiently and flexibly infer model parameters through automatic gradient back-propagation by adopting deep neural networks to model latent topics, such as the popular Variational AutoEncoder (VAE; Kingma and Welling, 2014; Rezende et al., 2014). This flexibility enables researchers to tailor model structures to fit diverse application scenarios. In addition, NTMs can seamlessly handle large-scale datasets by harnessing parallel computing facilities like GPUs. Owing to these advantages, NTMs have witnessed the exploration of numerous new methods and applications.
Previously, Zhao et al. (2021a) provided a review with a primary focus on the methods of NTMs. However, their review is beset by the following limitations: (i) Their method taxonomy is incomplete because they ignore several recently proposed NTM methods, such as NTMs with contrastive learning, cross-lingual NTMs, and dynamic NTMs. (ii) They omit the popular applications based on NTMs, developed for a wide range of downstream tasks. (iii) They lack in-depth discussions on the challenges inherent in NTMs. As a consequence, a more comprehensive review on NTMs is necessary for the research field.
In this paper, to address these limitations, we present an extensive and up-to-date survey of NTMs, which offers an in-depth and self-contained understanding of NTMs in terms of methods, applications, and challenges. We begin by systematically organizing existing NTMs according to their neural network structures, such as using embeddings or graph neural networks. We then introduce the NTMs designed for various prevalent topic modeling scenarios, e.g., short text, cross-lingual, and dynamic topic modeling, covering a wider range than the early survey (Zhao et al., 2021a). Moreover, unlike the previous survey (Zhao et al., 2021a), we also organize and discuss the popular applications based on NTMs, developed for diverse tasks like text analysis and text generation. Finally, we summarize the key challenges for NTMs in detail to motivate future research directions. Figure 1 depicts the overview of our survey. We conclude the main contributions of this paper as follows:
• We extensively review methods of neural topic models through detailed discussions and comparisons, covering variants with different network structures.
• We include a broader range of popular topic modeling scenarios and provide detailed background information for each scenario, accompanied by easy-to-understand illustrations and related neural topic models.
• We introduce popular applications based on neural topic models, developed to tackle various tasks such as text analysis and generation.
• We highlight the current vital challenges faced by neural topic models in detail to facilitate future research. Motivated by this, we propose a new topic diversity metric that measures diversity together with word semantics, which agrees better with human judgment.

Preliminary of Topic Models
In this section, we introduce the preliminaries of topic modeling, including the problem setting, notations, and evaluation. Then we present the most basic and popular NTM in the framework of the Variational AutoEncoder (VAE).

Problem Setting and Notations
We introduce the problem setting and notations of topic modeling following LDA (Blei et al., 2003). Consider a collection of N documents with V unique words (vocabulary size), and a document is denoted as x. As illustrated in Figure 2, topic models aim to discover K latent topics from this collection. The number of topics K is a hyperparameter, usually determined manually by researchers according to the characteristics of datasets and their target tasks. Each topic is defined as a distribution over the vocabulary, i.e., a topic-word distribution β_k ∈ R^V. Then the topic-word distribution matrix of all topics is β = (β_1, . . . , β_K) ∈ R^{V×K}. In addition, topic models also infer the topic distribution of a document (doc-topic distribution): θ ∈ ∆^K, implying what topics a document contains. Here θ_k refers to the proportion of Topic#k in the document, and ∆^K denotes the probability simplex.

Evaluation of Topic Models
Given the absence of ground-truth labels in topic modeling tasks, how to reliably and comprehensively evaluate topic models remains inconclusive in the research community. We introduce the currently most prevalent evaluation methods for assessing topic models as follows.

Perplexity
Perplexity, borrowed from language models, measures how well a model can predict new documents. It is computed as the normalized log-likelihood of held-out test documents. Perplexity has been used for years to evaluate topic models. Nevertheless, prior studies have empirically demonstrated that perplexity inaccurately reflects the quality of discovered topics as it often contradicts human judgment (Chang et al., 2009). Furthermore, computing the log-likelihood is inconsistent among different topic models. This is because they apply various sampling or approximation techniques (Wallach et al., 2009; Buntine, 2009) as well as diverse modeling approaches for topic-word distributions and doc-topic distributions. For instance, certain methods normalize topic-word distributions with respect to topics, some with respect to words, and others opt to keep them unnormalized. These disparities bring challenges to equitable comparisons. Finally, perplexity may not evaluate the practical utility of topic models since users typically employ topic models for content analysis rather than generating new documents (Zhao et al., 2021a; Hoyle et al., 2022). Due to these reasons, perplexity has waned in popularity for topic model evaluation in the recent research field.
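As a concrete illustration (the function name and corpus are our own, not tied to any specific model), per-word perplexity can be computed from held-out per-document log-likelihoods as follows:

```python
import math

def perplexity(log_likelihoods, token_counts):
    """Per-word perplexity over held-out documents.

    log_likelihoods: total log-likelihood of each held-out document.
    token_counts: number of tokens in each held-out document.
    """
    return math.exp(-sum(log_likelihoods) / sum(token_counts))

# Sanity check: a model assigning probability 1/1000 to every token of a
# 50-token document has perplexity ~1000.
ll = [50 * math.log(1 / 1000)]
print(perplexity(ll, [50]))   # ≈ 1000
```

Lower perplexity means the model assigns higher probability to unseen documents; the caveats above explain why this alone is an unreliable topic quality signal.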

Topic Coherence
Rather than predictive abilities, researchers turn to evaluating the quality of produced topics. For this purpose, researchers propose topic coherence to measure the coherence among the most related words of topics, i.e., their top words (determined by topic-word probabilities). Experiments showcase that topic coherence can agree with human evaluation on topic interpretability (Lau et al., 2014). For example, one widely-used coherence metric is Normalized Point-wise Mutual Information (NPMI; Bouma, 2009; Newman et al., 2010; Lau et al., 2014). Specifically, the NPMI score between two words (x_i, x_j) is calculated as follows:

NPMI(x_i, x_j) = log( (p(x_i, x_j) + ε) / (p(x_i) p(x_j)) ) / ( −log(p(x_i, x_j) + ε) ).

It computes the normalized mutual information of two words, and the topic-level score takes the average over all word pairs in all topics. Here ε is to avoid zero; p(x_i) is the probability of word x_i, and p(x_i, x_j) is the co-occurrence probability of (x_i, x_j). These probabilities are estimated as their occurrence frequencies in a reference corpus. The reference corpus can be either internal (the training set) or external (e.g., Wikipedia articles). Basically, a large external corpus is recommended because it can alleviate the influence of data bias in training sets and facilitate fair topic coherence comparisons across different datasets.
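As an illustrative sketch (function names are ours; word probabilities are estimated from document-level co-occurrence in a tiny toy corpus), the NPMI of a topic's top words can be estimated like this:

```python
import math
from itertools import combinations

def npmi(word_i, word_j, docs, eps=1e-12):
    """NPMI of two words from document-level co-occurrence frequencies."""
    n = len(docs)
    p_i = sum(word_i in d for d in docs) / n
    p_j = sum(word_j in d for d in docs) / n
    p_ij = sum(word_i in d and word_j in d for d in docs) / n
    pmi = math.log((p_ij + eps) / (p_i * p_j + eps))
    return pmi / (-math.log(p_ij + eps))

def topic_npmi(top_words, docs):
    """Average NPMI over all top-word pairs of one topic."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)

# Tiny toy reference corpus of word sets; real evaluations use e.g. Wikipedia.
docs = [{"ball", "game", "team"}, {"ball", "team"}, {"market", "stock"}]
print(topic_npmi(["ball", "team"], docs))   # ≈ 1.0: they always co-occur
```

Two words that always co-occur score near 1, independent words score near 0, and words that never co-occur approach −1.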
Later, Röder et al. (2015) propose a new metric, C_V, which calculates the cosine similarity between NPMI score vectors (Krasnashchok and Jouili, 2018). Given the top T words of a topic, (x_1, x_2, . . . , x_T), C_V is formulated as

C_V = (1/T) Σ_{i=1}^{T} cos( v(x_i), Σ_{j=1}^{T} v(x_j) ),  where v(x_i) = ( NPMI(x_i, x_1), . . . , NPMI(x_i, x_T) ),

with NPMI scores computed as defined above. Röder et al. (2015) empirically demonstrate that C_V outperforms previous coherence metrics, NPMI, UCI, and UMass (Mimno et al., 2011), since C_V is more consistent with human judgment (see Röder et al. (2015) for experimental results).
We would like to recommend the Palmetto tool to compute topic coherence. It includes almost all common coherence metrics and provides a preprocessed Wikipedia article collection as the reference corpus for easier reproducibility.

Topic Diversity
To further evaluate the quality of topics, topic diversity is introduced to measure the difference between topics. This is driven by the anticipation that topics should exhibit diversity rather than redundancy, thereby enabling the comprehensive disclosure of latent semantics in corpora. At present, researchers propose the following common diversity metrics:
• Nan et al. (2019) propose Topic Uniqueness (TU), which computes the average reciprocal of top word occurrences in topics. In detail, given K topics and the top T words of each topic, TU is computed as

TU = (1/K) Σ_{k=1}^{K} (1/T) Σ_{x_i ∈ t(k)} 1/#(x_i),

where t(k) means the top word set of the k-th topic, and #(x_i) denotes the occurrence of word x_i in the top T words of all topics. TU ranges from 1/K to 1.0, and a higher TU score indicates more diverse topics.
• Burkhardt and Kramer (2019) propose Topic Redundancy (TR), which calculates the average occurrences of a top word in other topics:

TR = (1/K) Σ_{k=1}^{K} (1/T) Σ_{x_i ∈ t(k)} (#(x_i) − 1).

A lower TR score means more diverse topics.
• Dieng et al. (2020) propose Topic Diversity (TD), which computes the proportion of unique top words of topics:

TD = (1/(KT)) Σ_{k=1}^{K} Σ_{x_i ∈ t(k)} I(#(x_i) = 1),

where I(·) is an indicator function that equals 1 if #(x_i) = 1 and 0 otherwise. TD ranges from 0 to 1.0, and a higher TD score indicates more diverse topics.
These metrics all measure topic diversity based on the uniqueness of individual words. They posit that diversity is optimal when all topics are characterized by distinct top words. However, we question these diversity metrics because certain topics naturally share the same words. For example, the word "chip" could be shared by the topics of "potato chip" and "electronic chip"; similarly, the word "apple" may be covered by the topics of "fruit" and "company". This issue remains unresolved for reliable diversity evaluation. We in this paper propose a new diversity metric to address this issue (see details in Sec. 7).
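For reference, the three word-uniqueness metrics can be computed directly from top-word lists, as in this minimal sketch (our own implementation; the exact normalization of TR varies across papers):

```python
from collections import Counter

def diversity_metrics(topics):
    """TU, TR, and TD from topics' top-word lists.

    topics: K lists, each holding the top T words of one topic.
    """
    K, T = len(topics), len(topics[0])
    counts = Counter(w for topic in topics for w in topic)   # #(x_i)
    tu = sum(1 / counts[w] for t in topics for w in t) / (K * T)
    tr = sum(counts[w] - 1 for t in topics for w in t) / (K * T)
    td = sum(counts[w] == 1 for t in topics for w in t) / (K * T)
    return tu, tr, td

# "apple" is shared between the two topics; the other four words are unique.
topics = [["apple", "fruit", "pear"], ["apple", "stock", "market"]]
tu, tr, td = diversity_metrics(topics)
print(tu, tr, td)   # TU = 5/6, TR = 1/3, TD = 2/3
```

The example also illustrates the criticism above: the shared word "apple" is penalized by all three metrics even though both topics are semantically sensible.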

Downstream Task Performance
Besides coherence and diversity for measuring topic quality, researchers also resort to extrinsic performance: they use doc-topic distributions θ as low-dimensional document features and evaluate their quality on downstream tasks. These tasks mainly consist of document classification and document clustering. For document classification, researchers train ordinary classifiers (e.g., SVMs or Random Forests) with learned doc-topic distributions as document features and then predict the labels of testing documents.
The performance can be evaluated by accuracy or F1 scores. For document clustering, the common way is to use the most significant topic in a doc-topic distribution as the clustering assignment of a document. Another way is to apply clustering algorithms, e.g., K-Means or DBSCAN, to doc-topic distributions (Zhao et al., 2021b). The clustering performance can be measured by Purity and Normalized Mutual Information (NMI; Manning et al., 2008).
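A minimal sketch of this clustering evaluation, using the most significant topic as the cluster assignment and cluster purity as the measure (our own helper names; NMI can be computed analogously, e.g., with scikit-learn's normalized_mutual_info_score):

```python
from collections import Counter

def purity(assignments, labels):
    """Cluster purity: each cluster votes for its majority gold label."""
    clusters = {}
    for a, l in zip(assignments, labels):
        clusters.setdefault(a, []).append(l)
    majority = sum(max(Counter(ls).values()) for ls in clusters.values())
    return majority / len(labels)

# Doc-topic distributions of four documents and their gold labels.
theta = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
labels = ["sports", "sports", "tech", "tech"]

# The most significant topic of each document is its cluster assignment.
assignments = [max(range(len(row)), key=row.__getitem__) for row in theta]
print(assignments)                   # [0, 0, 2, 2]
print(purity(assignments, labels))   # 1.0
```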

Visualization
Finally, researchers visualize topic models for evaluation. The typical visualization method is to show the top words of topics and doc-topic distributions, for example using pyLDAvis (Sievert and Shirley, 2014) or word clouds. Another strategy is to cluster documents on a 2D canvas by reducing the dimension of doc-topic distributions with tools like t-SNE (van der Maaten and Hinton, 2008).

Basic NTM based on VAE
We introduce the most basic and popular NTM based on the Variational AutoEncoder (VAE) framework with the neural variational inference technique (Miao et al., 2016; Srivastava and Sutton, 2017). As illustrated in Figure 3, a VAE-based NTM mainly contains an encoder (inference network) and a decoder (generation network).
The encoder infers doc-topic distributions from documents. To be specific, we use a latent variable r ∈ R^K following a logistic normal prior:

p(r) = N(µ_0, Σ_0),

where µ_0 and Σ_0 are the mean vector and diagonal covariance matrix respectively. Here the prior distribution is specified with a Laplace approximation (Hennig et al., 2012) to approximate a symmetric Dirichlet prior as µ_{0,k} = 0 and Σ_{0,kk} = (K − 1)/(αK) with hyperparameter α (Srivastava and Sutton, 2017). The variational distribution is modeled by parameters Θ as

q_Θ(r|x) = N(µ, Σ).

We compute µ and Σ with encoder networks f_{Θ_1} and f_{Θ_2}:

µ = f_{Θ_1}(x),  Σ = diag(f_{Θ_2}(x)),

where Θ = {Θ_1, Θ_2} and diag(·) denotes transforming a vector into a diagonal matrix. In practice, we employ MLPs as the encoder networks and transform document x into a Bag-of-Words (BoW) vector as input. Then, to reduce gradient variance (Kingma and Welling, 2014; Rezende et al., 2014), we sample r through the reparameterization trick with a random variable ϵ:

r = µ + Σ^{1/2} ⊙ ϵ,  ϵ ∼ N(0, I).

We model the doc-topic distribution θ with a softmax function to restrict it to a simplex:

θ = softmax(r).

The decoder generates documents from doc-topic distributions. Specifically, we use a decoder network parameterized by Φ: f_Φ(θ) = softmax(βθ), which represents the generation probability of each word. Here Φ = {β}. Then we sample words from its multinomial distribution: x ∼ Mult(f_Φ(θ)). Following the Evidence Lower BOund (ELBO) of VAE, we formulate the learning objective of the NTM as

min_{Θ,Φ}  −E_{q_Θ(r|x)}[ log p_Φ(x|θ) ] + KL( q_Θ(r|x) ∥ p(r) ).

The first term is the negative expected log-likelihood, i.e., the reconstruction error, where p_Φ(x|θ) denotes the generation probability of x.
As we sample words from the multinomial distribution, the first term becomes −x^⊤ log(f_Φ(θ)). The second term is the Kullback-Leibler (KL) divergence between the variational and prior distributions, which can be computed through an analytical form (Srivastava and Sutton, 2017). It is also known as a regularization term.
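To make the pipeline concrete, here is a toy NumPy sketch of one forward pass and the objective of a VAE-based NTM. It simplifies under stated assumptions: a single-hidden-layer encoder with random untrained weights, toy dimensions, and a standard normal prior in the KL term instead of the Laplace-approximated Dirichlet prior described above:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, H = 200, 10, 64   # toy vocabulary size, topic number, hidden size

# Encoder and decoder parameters (random, untrained).
W_h = rng.normal(0, 0.02, (H, V))
W_mu = rng.normal(0, 0.02, (K, H))
W_logvar = rng.normal(0, 0.02, (K, H))
beta = rng.normal(0, 0.02, (V, K))   # topic-word matrix of the decoder

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x):
    """One pass of a VAE-based NTM on a BoW vector x: loss and theta."""
    h = np.tanh(W_h @ x)                     # encoder MLP
    mu, logvar = W_mu @ h, W_logvar @ h
    eps = rng.standard_normal(K)
    r = mu + np.exp(0.5 * logvar) * eps      # reparameterization trick
    theta = softmax(r)                       # doc-topic distribution
    recon = softmax(beta @ theta)            # word generation probabilities
    recon_loss = -x @ np.log(recon + 1e-10)  # -x^T log f_Phi(theta)
    # Analytical KL against a standard normal prior (a simplification of
    # the logistic-normal prior used by Srivastava and Sutton).
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon_loss + kl, theta

x = np.zeros(V)
x[rng.integers(0, V, 30)] += 1.0   # a toy BoW document with 30 tokens
loss, theta = forward(x)
print(float(loss), theta.sum())    # theta sums to 1
```

In a real model the loss would be minimized with gradient back-propagation over mini-batches; the sketch only shows how the reconstruction and KL terms fit together.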
The above is the fundamental structure of a VAE-based NTM, adopted by pioneering studies like NVDM (Miao et al., 2016) and ProdLDA (Srivastava and Sutton, 2017). Based on this, NTMs with different structures have been proposed to further improve performance and handle various application scenarios.

NTMs with Different Structures
Apart from the basic VAE structure mentioned in Sec. 2, we in this section introduce NTMs with other structures.

NTMs with Various Priors
VAE-based NTMs commonly employ Gaussian (Normal) priors since it is easy to apply the reparameterization trick and compute the analytical form of the KL divergence. Besides Gaussian priors, NTMs also leverage various other priors. Miao et al. (2017) propose new priors like the Gaussian softmax and the stick-breaking process. Zhang et al. (2018) use a Weibull distribution to approximate gamma distributions. Joo et al. (2020) leverage an auxiliary uniform distribution to approximate the cumulative distribution function of the gamma distribution. As Dirichlet priors are important for topic modeling, Burkhardt and Kramer (2019) utilize the proposal function of a rejection sampler for a gamma distribution to approximate Dirichlet priors. Tian et al. (2020) draw from the rounded posterior distribution to approximate Dirichlet samples.

NTMs with Embeddings
As an alternative to directly modeling topics, Miao et al. (2017) propose to decompose topics into two embedding parameters:

β = softmax(W^⊤ T).

Here W ∈ R^{D×V} denotes V word embeddings, and T ∈ R^{D×K} denotes K topic embeddings, where D is the dimension of the embedding space.
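This factorization can be sketched as follows, assuming the softmax normalizes over the vocabulary so each topic column is a distribution over words; the embeddings here are random toy values rather than learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
D, V, K = 50, 300, 8           # toy embedding dim, vocab size, topic number
W = rng.normal(size=(D, V))    # word embeddings
T = rng.normal(size=(D, K))    # topic embeddings

logits = W.T @ T               # (V, K) word-topic affinities
# Softmax over the vocabulary: each column becomes a topic-word distribution.
beta = np.exp(logits - logits.max(axis=0))
beta = beta / beta.sum(axis=0)

print(beta.shape)              # (300, 8)
print(beta[:, 0].sum())        # each topic sums to 1 over the vocabulary
```

The appeal of this design is that topics live in the same space as words, so W can be initialized with pretrained embeddings and T stays cheap to extend.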
Then Dieng et al. (2020) follow this setting and propose ETM (Embedded Topic Model). ETM facilitates topic learning by initializing W with pretrained word embeddings like Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014). This approach also confers flexibility and efficiency to other topic modeling scenarios. For instance, it is much cheaper to repeat topic embeddings for each time slice in dynamic topic modeling than to repeat the entire topic-word distribution matrix (Dieng et al., 2019). Alternatively, Zhao et al. (2021b) propose NSTM. It also models topics as embeddings, but uses the optimal transport distance between doc-topic distributions and input documents to measure the reconstruction error. Wang et al. (2022a) share the same idea and instead use the conditional transport distance. Duan et al. (2022) learn a group of global topic embeddings for task-specific adaptations. Xu et al. (2022) propose HyperMiner, using hyperbolic embeddings to model topics. Due to the tree-likeness property of hyperbolic space, they can capture the hierarchy among topics.
Differently, Wu et al. (2023b) propose ECRTM, which models the topic-word distribution matrix as

β_jk = exp(−||w_j − t_k|| / τ) / Σ_{k′=1}^{K} exp(−||w_j − t_k′|| / τ).

Here β_jk denotes the correlation between the j-th word and the k-th topic with τ as a temperature hyperparameter; w_j is the j-th word embedding in W, and t_k is the k-th topic embedding in T. It computes the Euclidean distance between topic and word embeddings and normalizes over all topics in a softmax manner. This works together with a clustering regularization method. The regularization considers topic embeddings as cluster centers and word embeddings as cluster samples; then it forces topic embeddings to be the centers of separately aggregated word embeddings by optimal transport. This effectively avoids the topic collapsing issue where topics are repetitive to each other.

NTMs with Metadata
While common NTMs learn topics in an unsupervised manner (only using documents), NTMs can also leverage the metadata of documents to guide topic modeling, similar to supervised LDA (Mcauliffe and Blei, 2007).

NTMs with Adversarial Training
Some studies focus on employing adversarial training to facilitate topic modeling. Wang et al. (2019) follow the idea of the Generative Adversarial Network (GAN): they use a generator to generate "fake" documents from a random Dirichlet sample and then use a discriminator to distinguish the generated documents from real ones. Note that their model cannot infer doc-topic distributions because it directly maps documents to representations based on TF-IDF. To lift this limitation, Wang et al. (2020) propose to use bidirectional adversarial training, which can meanwhile infer doc-topic distributions. Hu et al. (2020) further present an extension that uses two cycle-consistency constraints to generate informative representations.

NTMs with Pre-trained Language Models
Researchers frequently combine NTMs with pre-trained language models. Pre-trained language models based on Transformers (Vaswani et al., 2017) have been prevalent in NLP, as they are pre-trained on large-scale corpora to capture contextual linguistic features. Multiple studies leverage contextual features from these pre-trained models to provide richer information than conventional BoW. For instance, Bianchi et al. (2021a) input the concatenation of BoW and the contextual document embeddings from Sentence-BERT (Reimers and Gurevych, 2019), and then reconstruct BoW as in previous work. Hoyle et al. (2020) propose to distill knowledge from BERT (Devlin et al., 2018) to NTMs. In detail, they produce pseudo BoW from the predictive word probability of BERT; their NTM then reconstructs both the real and pseudo BoW. Bianchi et al. (2021b) and Mueller and Dredze (2021) employ multilingual BERT to infer cross-lingual doc-topic distributions for zero-shot learning, but they cannot discover aligned cross-lingual topics.

NTMs with Contrastive Learning
As a self-supervised learning approach, contrastive learning has been employed to facilitate NTMs (Hadsell et al., 2006; Nguyen et al., 2022, 2024a).

Other NTMs
Apart from the aforementioned methods, we introduce NTMs with other structures. Before the invention of VAE-based NTMs, researchers made various attempts to model latent topics with neural networks. Some studies focus on NTMs in the autoregressive framework. Larochelle and Lauly (2012) propose an autoregressive NTM called DocNADE. Inspired by Replicated Softmax (Hinton and Salakhutdinov, 2009), DocNADE predicts the probability of a word in a document conditioned on its hidden state, which is in turn conditioned on previous words. It then interprets topics with a hidden state and infers doc-topic distributions with the hidden states of the document. Gupta et al. (2019b) extend DocNADE by modeling the bi-directional dependencies between words. Gupta et al. (2019a) then use an LSTM to enable DocNADE to incorporate external knowledge. Cao et al. (2015) also propose an early NTM before VAE-based NTMs. Their approach predicts how an n-gram correlates with documents. It computes the representation of an n-gram by transforming the accumulation of the word embeddings from Word2Vec (Mikolov et al., 2013) and projects documents into representations with a look-up matrix table. In this way, they model topic-word distributions as the n-gram representations and doc-topic distributions as the projected document representations. For training, it uses the document containing the n-gram as a positive sample and randomly samples documents that do not contain this n-gram as negatives. Lin et al. (2019) replace the softmax function with sparsemax to enhance the sparsity of doc-topic distributions. Nan et al. (2019) use a Wasserstein AutoEncoder (WAE) to model topics, which minimizes the Wasserstein distance between generated documents and input documents. Rezaee and Ferraro (2020) propose an NTM without using the reparameterization trick: they generate discrete topic assignments from an RNN, inspired by Dieng et al. (2017). Wu et al. (2021) focus on discovering latent topics from long-tailed corpora. They propose a causal inference framework to analyze how the long-tailed bias influences topic modeling, and then use a simple but effective causal intervention method to mitigate such influence. Pham et al. (2024) propose TopicGPT, prompting Large Language Models (LLMs) to augment seed topics. It defines each topic as a textual description rather than the word distributions of LDA. It also prompts LLMs to assign one or more topics to a document, like a classification task. We emphasize that TopicGPT cannot provide the topic-word distributions or doc-topic distributions required by downstream applications.

Topic Discovery by Clustering
We discuss a special type of approach that discovers latent topics by clustering instead of modeling the generation process of documents. These methods typically leverage traditional word embeddings such as Word2Vec (Mikolov et al., 2013) or contextual embeddings from pre-trained language models. We must clarify that some of these methods differ from the aforementioned ordinary topic models because they can only produce topics but cannot infer doc-topic distributions as required. In exchange, one advantage is their enhanced computational efficiency. In detail, Thompson and Mimno (2020) straightforwardly cluster token-level word embeddings from pre-trained models like BERT and GPT-2 and produce topics from the words assigned to clusters. Similarly, Sia et al. (2020) and Zhang et al. (2022c) cluster word embeddings and interpret hidden topics by sampling words from clusters via term weights like TF-IDF. Angelov (2020) proposes Top2Vec. It obtains document and word embeddings through Doc2Vec (Le and Mikolov, 2014), reduces the dimension of document embeddings with UMAP, and clusters them with HDBSCAN. It extracts the n-closest words of a cluster to represent a topic. Following Top2Vec, Grootendorst (2022) proposes BERTopic. BERTopic extracts the words of a cluster based on c-TF-IDF, which calculates the TF-IDF of a word over a cluster. It estimates the doc-topic distribution based on the term weights within a given document.
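The clustering paradigm can be illustrated with a toy sketch: cluster word embeddings (here plain k-means stands in for the UMAP/HDBSCAN pipelines above, and the "embeddings" are hand-crafted 2-D points) and represent each topic by the n-closest words to its cluster center:

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=50):
    """Plain k-means, standing in for the clustering pipelines above."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        centers = np.stack([X[assign == j].mean(0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return centers, assign

def topics_from_embeddings(vocab, emb, k, top_n=3):
    """Represent each cluster by the top_n words closest to its center."""
    centers, _ = kmeans(emb, k)
    topics = []
    for c in centers:
        order = np.argsort(((emb - c) ** 2).sum(1))
        topics.append([vocab[i] for i in order[:top_n]])
    return topics

# Toy 2-D "embeddings" with two obvious word clusters.
vocab = ["ball", "team", "game", "stock", "market", "trade"]
emb = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
print(topics_from_embeddings(vocab, emb, k=2))  # sports vs. finance words
```

Note that this pipeline yields topics without any doc-topic distributions, matching the limitation discussed above.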

Wu et al. (2024b) propose FASTopic, following a new paradigm rather than the VAE-based or clustering-based ones. Similar to Top2Vec and BERTopic, FASTopic leverages document embeddings, e.g., from pretrained Transformers (Reimers and Gurevych, 2019; Pan et al., 2023; Wu et al., 2024d). Differently, it conducts optimal transport (Peyré et al., 2019) between document and topic embeddings and uses the transport plans to model doc-topic distributions. In the same way, it models topic-word distributions with the transport plans between topic and word embeddings. It then optimizes topic and word embeddings by reconstruction with these semantic relations. This avoids the previously complicated VAE or simple clustering structures, leading to a neat and efficient paradigm.
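To illustrate the transport-plan idea (a sketch under our own assumptions, not FASTopic's actual implementation; the embeddings are random toy vectors), an entropic-regularized plan between document and topic embeddings can be computed with Sinkhorn iterations, and its normalized rows read as doc-topic distributions:

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.1, iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations."""
    Kmat = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (Kmat.T @ u)
        u = a / (Kmat @ v)
    return u[:, None] * Kmat * v[None, :]   # the transport plan

rng = np.random.default_rng(3)
n_docs, n_topics, dim = 5, 3, 16
doc_emb = rng.normal(size=(n_docs, dim))       # toy document embeddings
topic_emb = rng.normal(size=(n_topics, dim))   # toy topic embeddings

# Cost: squared Euclidean distance, scaled for numerical stability.
cost = ((doc_emb[:, None, :] - topic_emb[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()
plan = sinkhorn(cost, np.full(n_docs, 1 / n_docs),
                np.full(n_topics, 1 / n_topics))

# Renormalizing each row of the plan yields a doc-topic distribution.
theta = plan / plan.sum(1, keepdims=True)
print(theta.sum(1))   # each row sums to 1
```

In FASTopic the embeddings themselves are then optimized by reconstruction through such plans; the sketch only shows how a plan turns pairwise distances into distributions.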

NTMs for Various Scenarios
Apart from the basic scenario on normal documents, we in this section introduce NTMs tailored for various scenarios, such as hierarchical, cross-lingual, and dynamic topic modeling. We present the background of each scenario and its related NTMs.

Hierarchical NTMs
Similar to conventional topic models (Griffiths et al., 2003; Teh et al., 2004; Blei et al., 2010), NTMs can discover hierarchical topics to reveal topic structures from general to specific. Topics at each level in a hierarchy cover different semantic granularity: child topics tend to be more specific than their parent topics. As shown in Figure 4, a topic about "sports" can derive more specific child topics, like "soccer", "basketball", and "tennis"; a topic about "computer" also has specific child topics like "linux", "programming", and "windows". In addition, hierarchical topic modeling can relieve the challenge of determining the number of topics to some extent (Blei et al., 2010).
To discover hierarchical topics, some NTMs follow the previous non-parametric setting, which allows topic hierarchies to grow dynamically. Isonuma et al. (2020) propose a tree-structured neural topic model with two doubly-recurrent neural networks over the ancestors and siblings respectively (Alvarez-Melis and Jaakkola, 2017). Note that the tree structure is unbounded, i.e., it can be dynamically updated in a heuristic way during training. Pham and Le (2021) follow this spirit and jointly handle hierarchical topics and document visualization. Chen et al. (2021b) leverage a stick-breaking process as the prior for non-parametric modeling.
Later, the parametric fashion has attracted more attention, which presets the width and height of a topic hierarchy before learning. This leads to the main issue that topic hierarchies cannot grow dynamically. Chen et al. (2021a) propose manifold regularization on topic hierarchy learning. Duan et al. (2021, 2023) propose to generate different documents for different levels: they craft documents with more related words through word similarity matrices for higher levels, and then progressively generate these documents at each level. Chen et al. (2023) utilize a Gaussian mixture prior and nonlinear structural equations to model topic dependencies between hierarchical levels. Wu et al. (2024c) propose TraCo. It produces sparse and balanced dependencies by modeling them as the transport plan solutions of specially defined optimal transport problems between hierarchical topics. It then introduces a context-aware disentangled decoder to decode documents with topics at each level, which better distributes different semantic granularity to different levels.
Non-parametric vs. Parametric As aforementioned, the non-parametric setting allows topic hierarchies to grow dynamically while the parametric setting does not: we must preset the structure of a topic hierarchy in parametric settings. However, the non-parametric setting still requires effort to determine its hyperparameters, e.g., the stick-breaking process prior (Chen et al., 2021b), which control the growing process of topic hierarchies. As a result, we conceive that the parametric setting fits better when enough prior knowledge of a dataset is available to determine the topic hierarchy structure in advance.

Short Text NTMs
Researchers also apply NTMs to discover topics from short texts. Short texts, prevalent on the Internet in various forms such as tweets, comments, and news headlines, serve as a common medium for individuals to express ideas, comments, and opinions. However, normal topic models often struggle to effectively handle short texts. The principal reason is that topic models depend on word co-occurrence information to infer latent topics, but such information is extremely sparse in short texts due to their limited context. This challenge, referred to as data sparsity (Yan et al., 2013; Wu and Li, 2019), hinders topic models from discovering high-quality topics and thus has attracted considerable attention in the research community.
Several studies have been proposed to overcome this data sparsity challenge. Lin et al. (2020) use Archimedean copulas to regularize the discreteness of topic distributions of short texts. Wu et al. (2020b) propose NQTM, which quantizes the doc-topic distributions of short texts into quantization vectors following the idea of Van den Oord and Vinyals (2017). By carefully initializing the quantization vectors, it produces sharper doc-topic distributions that better fit short texts with limited context. Besides the negative log-likelihood, they also propose a negative sampling decoder to avoid repetitive topics. To address the data sparsity issue, Wang et al. (2021b) use word co-occurrence and semantic correlation graphs to enrich the learning signals of short texts. Zhao et al. (2021d) incorporate entity vector representations, learned from manually edited knowledge graphs, into an NTM for short texts. Based on NQTM (Wu et al., 2020b), Wu et al. (2022) further propose TSCTM, a contrastive learning method based on the topic semantics of short texts, which better captures the similarity relations among them. This refines the representations of short texts and thus their doc-topic distributions. TSCTM can also leverage data augmentation to further mitigate the data sparsity problem.
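The vector-quantization idea that NQTM builds on can be sketched as follows: a doc-topic vector is replaced by its nearest codebook vector, and near-one-hot codebook entries yield sharper distributions. The codebook values and distance function here are our own illustrative assumptions, not NQTM's actual configuration:

```python
def quantize(theta, codebook):
    """Map a doc-topic vector to its nearest codebook vector.

    Uses squared Euclidean distance; with near-one-hot codebook
    entries, the quantized output is a sharper (peakier) distribution.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebook, key=lambda entry: sqdist(theta, entry))

# A codebook initialized close to one-hot vectors encourages sharp
# doc-topic distributions for short texts (illustrative values).
codebook = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
sharp = quantize([0.6, 0.3, 0.1], codebook)  # nearest entry: [0.9, 0.05, 0.05]
```

In the full model, the codebook vectors are learned jointly with the network rather than fixed, and the quantized vectors are fed to the decoder.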

Cross-lingual NTMs
Cross-lingual NTMs are proposed following cross-lingual topic models (Mimno et al., 2009), which aim to discover aligned topics in different languages. As exemplified in Figure 5, English and Chinese Topic#3 both refer to "music", and English and Chinese Topic#5 both refer to "celebrity". In addition, if two documents in different languages contain similar latent topics, their inferred doc-topic distributions should be similar; for instance, the doc-topic distributions of the parallel English and Chinese documents in Figure 5 are similar. These aligned cross-lingual topics can reveal commonalities and differences across languages and cultures, enabling cross-lingual text analysis without supervision. Wu et al. (2020a) propose the first neural cross-lingual topic model, NMTM. It transforms the topic-word distribution into the vocabulary space of another language. Thus the topic-word distributions of one language can incorporate the semantics of another, which aligns cross-lingual topics. They show that their model outperforms traditional multilingual topic models (Shi et al., 2016; Yuan et al., 2018). Later, Wu et al. (2023a) propose InfoCTM, which aligns cross-lingual topics from the perspective of mutual information. This properly aligns cross-lingual topics and prevents degenerate topic representations. To address the low-coverage dictionary issue, they also propose a cross-lingual vocabulary linking method that finds more linked words for topic alignment beyond the given dictionary. Bianchi et al. (2021b) and Mueller and Dredze (2021) directly learn cross-lingual doc-topic distributions with multilingual BERT, but we emphasize that they cannot discover aligned cross-lingual topics as required.

Dynamic NTMs
Dynamic NTMs are explored following dynamic topic models (Blei and Lafferty, 2006b; Wang et al., 2008). Previous static topic models implicitly assume that documents are exchangeable. However, this assumption is inappropriate since documents are produced sequentially, such as scholarly journals, emails, and news articles. As such, dynamic topic models are proposed. While topics in previous methods are all static, dynamic topic models allow topics to shift over time to capture the topic evolution in sequential documents.
To be specific, dynamic topic models assume that documents are divided by time slice, for example by year, and each time slice has K latent topics. The topics at slice t evolve from the topics at slice t − 1. As exemplified in Figure 6, Topic#1 about Ukraine and Russia evolves from the year 2020 to 2022. Due to the emergence of the word "invasion", we see that Topic#1 captures the Ukraine-Russia war that broke out in 2022. Similarly, Topic#K about Covid-19 evolves from 2020 to 2022 with the explosion of the Omicron variant. Such topic evolution reveals how topics emerge, grow, and vanish, which has been applied to trend analysis and opinion mining. Dieng et al. (2019) first propose a neural dynamic topic model, DETM (Dynamic Embedding Topic Model). It uses word and topic embeddings to interpret latent topics following Dieng et al. (2020) and chains topic embeddings at slice t with topic embeddings at slice t − 1 by Markov chains. Besides, it uses an LSTM to learn temporal priors of doc-topic distributions. Rahimi et al. (2023) discover topic evolution by clustering documents but cannot infer doc-topic distributions as required. Zhang and Lauw (2022) focus on the dynamic topics of temporal document networks and incorporate the linking information between documents. Following DETM, Miyamoto et al. (2023) employ a self-attention mechanism to model the dependencies among dynamic topics. Wu et al. (2024a) focus on the unassociated topic and repetitive topic issues. Instead of the previous Markov chain fashion, they propose CFDTM with a contrastive learning method to resolve these issues and track topic evolution. Rather than modeling topic evolution, Cvejoski et al. (2023) model the activities of topics over time. Note that the activities of topics evolve over time but the topics themselves are invariant; thus, this method does not precisely adhere to the original definition of dynamic topic modeling.

Correlated NTMs
Following the idea of correlated topic modeling (Blei and Lafferty, 2006a), correlated NTMs have been explored. Correlated topic models seek to consider the correlation between latent topics. For example, a document about genetics is more likely to also be about disease than about x-ray astronomy (Blei and Lafferty, 2006a). This leads to better expressiveness than LDA. Liu et al. (2019) follow the VAE-based NTM and use a centralized transformation flow to capture topic correlations. To effectively infer the transformation flow, they present the transformation flow lower bound to regulate the KL divergence term.

Lifelong NTMs
Lifelong NTMs are proposed to solve the challenge of data sparsity, similar to short text NTMs but in a continual lifelong learning fashion. Gupta et al. (2020) propose the first lifelong NTM. They retain prior knowledge, i.e., topics, from document streams and guide topic modeling on sparse datasets with the accumulated knowledge. In detail, they use topic regularization to transfer topical knowledge from several domains and prevent catastrophic forgetting, and a selective replay strategy to identify relevant historical documents. Zhang et al. (2022a) propose a lifelong NTM enhanced with a knowledge extractor and adversarial networks.
Although lifelong and dynamic topic modeling both work on sequential documents, we clarify their differences: dynamic topic modeling aims to discover topic evolution, i.e., replacing outdated topics with emergent ones, whereas lifelong topic modeling aims to mitigate the data sparsity issue by accumulating prior topical knowledge, so it must refrain from forgetting past knowledge.

Applications of NTMs
In this section, we introduce the applications of NTMs, mainly including text analysis, text generation, and content recommendation.

Text Analysis
The primary applications of NTMs concentrate on text analysis (Boyd-Graber et al., 2017; Laureate et al., 2023). Bai et al. (2018) apply an NTM to analyze scientific articles. They enable an NTM to incorporate the citation graphs of scientific articles by predicting the connections between them; thus, their model can also recommend related articles to users. Zeng et al. (2018) combine an NTM and a memory network and jointly train them for short text classification. Their method classifies short texts and discovers topics from them simultaneously. Chaudhary et al. (2020) combine BERT with an NTM, which reduces the self-attention operations. They claim that this can greatly speed up their fine-tuning process and thus reduce CO2 emissions. Song et al. (2021) propose a classification-aware NTM that includes an NTM and a classifier. They focus on classifying disinformation about COVID-19 to help deliver effective public health messages. Zeng et al. (2019) apply NTMs to understand the discourse in micro-blog conversations. Li et al. (2020) use a dynamic NTM to understand the global impact of COVID-19 and non-pharmacological interventions in different countries and media sources. Their discovered dynamic topics help researchers understand the progression of the epidemic. Valero et al. (2022) propose a short text NTM for podcast short-text metadata. Gui et al. (2020) propose a multitask mutual learning framework for sentiment analysis and topic detection. They make the topic-word distributions similar to the word-level attention vectors through mutual learning. Avasthi et al. (2022) use NTMs to mine topics from large-scale scientific and biomedical text corpora.

Text Generation
Several studies apply NTMs to text generation tasks. Specifically, Tang et al. (2019) propose a text generation model that learns semantic and structural features through a VAE-based NTM. Yang et al. (2021) leverage NTMs to alleviate the information sparsity issue in long story generation. They map a short text to a low-dimensional doc-topic distribution, from which they sample interrelated words as a skeleton. With the short text and the skeleton as input, they use a Transformer to generate long stories. Nguyen et al. (2021) use the doc-topic distributions of an NTM to enrich and control the global semantics for text summarization. Zhang et al. (2022b) propose a neural hierarchical topic model to discover hierarchical topics from documents and then generate keyphrases under the guidance of the hierarchical topics.

Content Recommendation
Similar to early work (Wang and Blei, 2011), NTMs can cooperate with recommendation systems. Esmaeili et al. (2019) combine an NTM with a recommender system for reviews through a structured auto-encoder. Xie et al. (2021) use a graph NTM for citation recommendation.

Challenges of NTMs
Despite their popularity, NTMs encounter several challenges. In this section, we summarize the main challenges as possible future research directions.

Lacking Reliable Evaluation
Inheriting from conventional topic models, the critical challenge of NTMs primarily lies in the lack of reliable evaluation. Current evaluation methods have been developed for years, but they still encounter the following issues.

Absence of Standard Evaluation Metrics
The topic modeling field lacks standard evaluation metrics. Resorting to human judgment provides one effective way to evaluate topic models, such as topic rating and word intrusion tasks (Lau et al., 2014). Unfortunately, its reliance on human raters renders it expensive and time-consuming, limiting its feasibility for wide-scale comparisons. Owing to this, researchers generally depend on automatic evaluation metrics, such as the topic coherence and diversity mentioned in Sec. 2.2. However, these automatic metrics encounter the following two problems:
• Inconsistent usage of automatic metrics. The usage and settings of automatic metrics vary across papers and even within a paper. For example, variations include the number of top words, the number of topics, reference corpora, and the choice of coherence or diversity metrics. Consequently, the results are often confined to specific studies, impeding the comparability of NTMs across different research papers. Such inconsistencies have led some benchmarking studies to argue that the conventional LDA can still outperform NTMs in certain aspects (Doan and Hoang, 2021; Hoyle et al., 2022).
• Questionable agreement between automatic metrics and human judgment. Some investigations have revealed discrepancies between the coherence metrics and human evaluation: they find that automatic metrics declare winner models when the corresponding human evaluation does not. This raises concerns that coherence metrics, originally designed for older models, may be incompatible with the newer neural topic models (Doogan and Buntine, 2021; Hoyle et al., 2021). We believe similar concerns may extend to diversity metrics: they may also be inconsistent with human assessments. We detail the reasons and offer a heuristic solution in Sec. 7.
Owing to the above problems, researchers have called for exploring automatic metrics that better approximate the preferences of real-world topic model users (Hoyle et al., 2021; Stammbach et al., 2023). Thus, proposing standard and practical evaluation metrics is a promising and urgent future research direction for topic modeling.

Lacking Standardized Dataset Pre-processing Settings
The topic modeling field lacks standardized dataset pre-processing settings for topic model comparisons. Researchers routinely pre-process datasets before running topic models, for example by removing infrequent words and stop words. Recent studies find that different pre-processing settings greatly impact topic modeling outcomes, such as the minimum and maximum document frequency, maximum vocabulary size, and stop word sets (Card et al., 2018; Wu et al., 2023b). However, these pre-processing settings vary substantially across papers even when they use the same benchmark datasets like 20newsgroup. These variations raise questions about the generalization ability of the proposed methods across different pre-processing settings, so their claimed performance improvements may be untenable. In consequence, establishing standardized dataset pre-processing settings emerges as an imperative prerequisite for ensuring reliable and consistent evaluations of topic models.
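To make the interaction of these settings concrete, the sketch below filters a tokenized corpus by minimum document frequency, maximum document-frequency ratio, a stop-word list, and a vocabulary cap. All default threshold values are illustrative, not recommendations:

```python
from collections import Counter

def preprocess(docs, min_df=5, max_df_ratio=0.5,
               stopwords=frozenset(), max_vocab=5000):
    """Filter tokenized documents for topic modeling.

    Keeps a word only if its document frequency is at least min_df,
    its document-frequency ratio is at most max_df_ratio, and it is
    not a stop word; then caps the vocabulary at max_vocab words.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    kept = {w for w, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df_ratio
            and w not in stopwords}
    # Keep the most frequent surviving words up to the vocabulary cap.
    vocab = sorted(kept, key=lambda w: (-doc_freq[w], w))[:max_vocab]
    vocab_set = set(vocab)
    return [[w for w in doc if w in vocab_set] for doc in docs], vocab
```

Because each threshold prunes the vocabulary differently, two papers using the same raw dataset but different values for `min_df`, `max_df_ratio`, or the stop-word set can end up modeling substantially different corpora.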

Low-Quality Topics
Despite the simplicity and popularity of NTMs, the quality of their discovered topics has been questioned from two aspects:
• Trivial Topics: Discovered topics are trivial with uninformative words. These topics cannot reveal the actual latent semantics of documents. As exemplified in Table 1, the topics include "even", "just", and "really"; it is difficult to discern their underlying conceptual semantics.
• Repetitive Topics: Discovered topics are repetitive with the same words, also referred to as the topic collapsing problem. As shown in Table 1, the topics include the same words like "sports", "games", and "soccer", making them hard to distinguish. Apart from that, these repetitive topics imply that some semantics remain hidden in documents.
Worse still, some NTMs may exhibit triviality and repetitiveness simultaneously in their discovered topics (Wu et al., 2020b, 2023b). These two kinds of low-quality topics impede understanding, undermine the interpretability of topic models, and are less beneficial for downstream tasks and applications. In consequence, how to effectively and consistently overcome this challenge becomes a necessary and constructive research direction.

Sensitivity to Hyperparameters
Another significant challenge of NTMs lies in their sensitivity to hyperparameters. Due to their complicated structures, NTMs typically possess more hyperparameters than conventional topic models. For example, hyperparameters such as dropout probability, batch size, and learning rate play critical roles in several NTMs (Srivastava and Sutton, 2017; Card et al., 2018). Besides, certain NTMs cannot perform well under a large number of topics (Wu et al., 2020b). As a result, researchers must meticulously fine-tune these hyperparameters when applying NTMs, especially to new datasets. This sensitivity curtails the generalization ability of NTMs and underscores the necessity of mitigating it.
Topic Semantic-aware Diversity
In this section, we propose a new diversity metric that considers the semantics of topics when measuring topic diversity.

Problem of Previous Diversity Metrics
Previous topic diversity metrics may contradict human judgment. These metrics, such as TR (Burkhardt and Kramer, 2019), TU (Nan et al., 2019), and TD (Dieng et al., 2020), all consider the uniqueness of individual top words of topics: diversity is perfect only when each top word is unique. However, we argue that this measurement is over-strict since it ignores the fact that different topics may naturally share the same words due to word polysemy. As the examples in Table 2 show, "apple" refers to a kind of fruit in Topic#1 and a technology company in Topic#2, and "jobs" refers to Steve Jobs in Topic#2 but a paid position of employment in Topic#3. These topics convey different conceptual semantics although they share some words, so we believe their diversity score should be the highest. However, their TU score is only 0.867 and TD is 0.733 in Table 2, which disagrees with this judgment.
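For concreteness, TU and TD can be computed as below. This is a sketch based on the definitions of Nan et al. (2019) and Dieng et al. (2020), assuming an equal number of top words per topic; it reproduces the Case 1 scores in Table 2:

```python
from collections import Counter

def topic_uniqueness(topics):
    """TU: average inverse count of how many topics each top word appears in."""
    counts = Counter(word for topic in topics for word in set(topic))
    total = sum(len(topic) for topic in topics)
    return sum(1 / counts[word] for topic in topics for word in topic) / total

def topic_diversity(topics):
    """TD: fraction of top-word slots whose word appears in only one topic."""
    counts = Counter(word for topic in topics for word in set(topic))
    total = sum(len(topic) for topic in topics)
    return sum(1 for topic in topics for word in topic if counts[word] == 1) / total

topics = [
    "apple peach grape orange banana".split(),          # Topic#1
    "apple company steve jobs macintosh".split(),       # Topic#2
    "jobs unemployment economy worker salary".split(),  # Topic#3
]
print(round(topic_uniqueness(topics), 3), round(topic_diversity(topics), 3))
# 0.867 0.733
```

Both metrics penalize every repeated top word, even when the repeated word carries a different sense in each topic, which is exactly the over-strictness discussed above.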

Topic Semantic-aware Diversity
To address this issue, we propose Topic Semantic-aware Diversity (TSD), a new metric that measures topic diversity along with word semantics.

Definition of Topic Semantic-aware Diversity
In detail, we compute TSD based on the frequencies of word pairs. Given K topics and the top T words of each topic, we define TSD as

TSD = \frac{2}{K T (T-1)} \sum_{k=1}^{K} \sum_{x_i, x_j \in t(k),\, i < j} \mathbb{I}\big(\#(x_i, x_j) = 1\big),    (18)

where #(x_i, x_j) denotes the number of times the unordered word pair (x_i, x_j) occurs in the top words of all K topics, I(·) is an indicator function that equals 1 if #(x_i, x_j) = 1 and 0 otherwise, and t(k) denotes the top words of the k-th topic. Rather than the uniqueness of single words, our TSD measures the uniqueness of word pairs in the top words of topics. This is because we know what a word exactly refers to when it is paired with another one. For example, "apple" refers to fruit if paired with "orange" or "banana" and to a company if paired with "technology" or "company". Note that TSD degrades to TD when measuring the frequency of each single word in Eq. (18).
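Eq. (18) can be implemented directly by counting unordered word pairs within each topic's top words; the sketch below reproduces the TSD scores of Cases 1 and 2 in Table 2 (1.0 and roughly 0.933):

```python
from collections import Counter
from itertools import combinations

def tsd(topics):
    """Topic Semantic-aware Diversity: fraction of within-topic word pairs
    that occur in the top words of exactly one topic (Eq. (18))."""
    pair_counts = Counter()
    for topic in topics:
        # Sort so each unordered pair has one canonical key across topics.
        for pair in combinations(sorted(set(topic)), 2):
            pair_counts[pair] += 1
    unique = sum(1 for topic in topics
                 for pair in combinations(sorted(set(topic)), 2)
                 if pair_counts[pair] == 1)
    total = sum(len(set(t)) * (len(set(t)) - 1) // 2 for t in topics)
    return unique / total

case1 = [
    "apple peach grape orange banana".split(),
    "apple company steve jobs macintosh".split(),
    "jobs unemployment economy worker salary".split(),
]
print(tsd(case1))  # 1.0: no word pair repeats across the three topics
```

Replacing Topic#2 with "apple orange steve jobs macintosh" makes the pair (apple, orange) occur in two topics, which removes two of the thirty pairs and lowers TSD to 28/30 ≈ 0.933, matching Case 2.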
We exemplify the difference between our TSD and previous diversity metrics. Table 2 Case 1 shows that the TSD score of the three topics is 1.0. This is because "apple" does not co-occur with "orange", "grape", or "banana", and "jobs" does not co-occur with "unemployment", "economy", or "salary" in Topic#2. Thus TSD regards Topic#1-3 as different topics regardless of the shared words.

                                                           TU     TD     TSD
Case 1  Topic#1: apple peach grape orange banana          0.867  0.733  1.000
        Topic#2: apple company steve jobs macintosh
        Topic#3: jobs unemployment economy worker salary
Case 2  Topic#1: apple peach grape orange banana          0.800  0.600  0.933
        Topic#2: apple orange steve jobs macintosh
        Topic#3: jobs unemployment economy worker salary
Case 3  Topic#1: apple peach grape orange banana          0.333  0.000  0.000
        Topic#2: apple peach grape orange banana
        Topic#3: apple peach grape orange banana

Table 2: Comparison of topic diversity metrics under three cases. TU (Eq. (6)) and TD (Eq. (8)) refer to previous diversity metrics in Sec. 2.2, and TSD (Eq. (18)) is our proposed new diversity metric. Each row represents the top words of a topic. Repetitive words are underlined.

Naturally, TSD penalizes diversity when word pairs are repetitive. For instance, Table 2 Case 2 shows that the TSD score of the three topics becomes lower since "apple" co-occurs with "orange" in both Topic#1 and Topic#2. In the worst situation, Table 2 Case 3 has three identical topics; in this case both TD and TSD give 0 for topic diversity.

Evaluation Results
We conduct experiments to compare our proposed topic diversity metric with previous ones. In detail, we employ a conventional topic model, LDA (Blei et al., 2003), and a neural topic model, NSTM (Zhao et al., 2021c), to discover latent topics from real-world datasets. Then we ask human raters to evaluate the diversity among the top words of sampled topics. The adopted datasets are as follows: (i) NeurIPS (https://www.kaggle.com/datasets/benhamner/nips-papers), including papers published at the NeurIPS conference from 1987 to 2017; (ii) ACL (Bird et al., 2008), including research articles between 1973 and 2006 from the ACL Anthology; (iii) NYT, including news articles on the New York Times website from 2012 to 2022; (iv) Wikitext103 (Merity et al., 2016), including Wikipedia articles. Following Lau et al. (2014) and Röder et al. (2015), we compute the Pearson correlation coefficients between the results of these topic diversity metrics and human ratings.
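The Pearson correlation between metric scores and human ratings follows the standard formula; a minimal sketch (the score lists here are toy values, not our experimental data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy example: metric scores perfectly tracking human ratings give 1.0.
metric_scores = [0.2, 0.5, 0.9]
human_ratings = [1.0, 2.5, 4.5]
correlation = pearson(metric_scores, human_ratings)
```

A higher coefficient means the automatic metric ranks topic sets more consistently with human raters.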
Table 3 shows the correlation results on different datasets and the average correlation. We notice that our TSD achieves relatively higher correlation scores with human ratings. This is because our TSD metric also considers word semantics while measuring topic diversity. These empirical results demonstrate that our TSD metric more closely aligns with human judgment concerning topic diversity evaluation.

Topic Model Toolkits
We introduce several topic model toolkits developed by the research community. Early popular toolkits include MALLET (McCallum, 2002), gensim (Rehurek and Sojka, 2011), STTM (Qiang et al., 2020), ToModAPI (Lisena et al., 2020), and tomotopy. However, these toolkits often neglect the implementations of NTMs, dataset pre-processing, or evaluations, leaving a gap in meeting practical requirements. Recently, Terragni et al. (2021) propose the OCTIS toolkit, which includes several NTM methods, evaluations, and Bayesian parameter optimization for research. The latest toolkit is TopMost (Wu et al., 2023c). While OCTIS includes only two NTMs proposed after 2018, TopMost covers more recently released NTMs and a wider range of topic modeling scenarios. It also decouples model implementations from training, which eases the extension of new models. These toolkits provide a solid foundation for beginners to explore various topic models and empower users to leverage diverse topic models in their applications.

Conclusion
Topic models have been prevalent for decades with diverse applications. Recently, Neural Topic Models (NTMs) have attracted significant attention due to their flexibility and scalability. They stand out by offering advantages such as avoiding the requirement for model-specific derivations and efficiently handling large-scale datasets. With the emergence of NTMs, researchers have explored several promising applications for various tasks.
In this paper, we provide a comprehensive and up-to-date survey of NTMs. We introduce the preliminaries of topic modeling, including the problem setting, notations, and evaluation methods. We review the existing NTM methods that employ different network structures and discuss their applicability to different use case scenarios. In addition, we examine the popular applications built on NTMs. Finally, we identify and discuss in detail the challenges that lie ahead for NTM research. We hope this survey can serve as a valuable resource for researchers interested in NTMs and contribute to the advancement of NTM research.

Figure 1: The overview of this survey: NTMs with different structures, NTMs for various scenarios, applications of NTMs, and challenges of NTMs.

Figure 2: Illustration of topic modeling. Given a document collection, a topic model aims to discover K latent topics, interpreted as distributions over words (topic-word distributions). It also infers the topic proportions of each document (what topics a document contains), defined as distributions over all latent topics (doc-topic distributions). Here the topic-word distribution of Topic#k, β_k, has related words like "movie", "film", and "oscar"; the doc-topic distribution θ concentrates on Topic#1 and Topic#k.

Figure 3: Illustration of a VAE-based NTM. It mainly contains an encoder (inference network) and a decoder (generation network). The encoder outputs the doc-topic distribution θ from an input document x through MLPs using the reparameterization trick, where ϵ ∼ N(0, I). The decoder reconstructs the input document from θ with β as the topic-word distribution matrix. The objective includes a reconstruction error and a KL divergence.

Figure 4: Illustration of hierarchical topic modeling. Topics at each level cover different semantic granularity: child topics are more specific than their parent topics. Topic#2-1 denotes the first topic of the second level in the topic hierarchy.

Figure 5: Illustration of cross-lingual topic modeling (on English and Chinese documents). Corresponding cross-lingual topics are required to be aligned, like English and Chinese Topic#3, and English and Chinese Topic#5. Words in brackets are the English translations.

Figure 6: Illustration of dynamic topic modeling. Topics at time slice t evolve from topics at slice t − 1. Here Topic#1 in 2022 evolves from Topic#1 in 2021; other topics evolve similarly.

Duan et al. (2021) propose a Sawtooth Connection to model topic dependencies across hierarchical levels based on the model structure of ETM (Dieng et al., 2020). Xu et al. (2022) use different layers in the hyperbolic embedding space to interpret hierarchical topics. Li et al. (2022) employ skip-connections for decoding to alleviate the posterior collapsing issue and propose a policy gradient method for training.

Table 1: Examples of trivial and repetitive topics. Each row represents the top words of a topic. Trivial topics include less informative words; repetitive topics contain repeating words. Repetitions are underlined.

Table 3: Correlations between topic diversity metrics and human ratings on different datasets.