Deep mixtures of unigrams for uncovering topics in textual data

Mixtures of unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight into the grouping structure. In this work, we developed a deep version of mixtures of unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for models with further deeper latent layers; the proposal is derived in a Bayesian framework. The behavior of the deep mixtures of unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely k-means with cosine distance, k-means with Euclidean distance on data transformed according to semantic analysis, partition around medoids, mixture of Gaussians on semantic-based transformed data, hierarchical clustering according to Ward’s method with cosine dissimilarity, latent Dirichlet allocation, mixtures of unigrams estimated via the EM algorithm, spectral clustering and affinity propagation clustering. The performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real data analysis prove that going deep in clustering such data highly improves the classification accuracy.


Introduction
Deep learning methods are receiving an exponentially increasing interest in the last years as powerful tools to learn complex representations of data. They can be basically defined as a multilayer stack of algorithms or modules able to gradually learn a huge number of parameters in an architecture composed by multiple nonlinear transformations (LeCun et al. 2015). Typically, and for historical reasons, a structure for deep learning is identified with advanced neural networks: deep feed forward, recurrent, auto-encoder, convolution neural networks are very effective and used algorithms of deep learning (Schmidhuber 2015). They demonstrated to be particularly successful in supervised classification problems arising in several fields such as image and speech recognition, gene expression data, topic classification. Under the framework of graph-based learning, Peng et al. (2016) proposed an efficient method to produce robust subspace clustering and B  Peng et al. (2018) to map input data points into nonlinear latent spaces while preserving the local and global subspace structure. Zhou et al. (2018) addressed the data sparsity issue in hashing.
When the aim is uncovering unknown classes in an unsupervised classification perspective, important methods of deep learning have been developed along the lines of mixture modeling, because of their ability to decompose a heterogeneous collection of units into a finite number of sub-groups with homogeneous structures (Fraley and Raftery 2002;McLachlan and Peel 2000). In this direction, van den Oord and Schrauwen (2014) proposed Multilayer Gaussian Mixture Models for modeling natural images; Tang et al. (2012) defined deep mixture of factor analyzers with a greedy layerwise learning algorithm able to learn each layer at a time. Viroli and McLachlan (2019) developed a general framework for deep Gaussian mixture models that generalizes and encompasses the previous strategies and several flexible model-based clustering methods such as mixtures of mixture models (Li 2005), mixtures of Factor Analyzers (McLachlan et al. 2003), mixtures of factor analyzers with common factor loadings (Baek et al. 2010), heteroscedastic factor mixture analysis (Montanari and Viroli 2010) and mixtures of factor mixture analyzers introduced by Viroli (2010). A general 'take-home-message' coming from the existing deep clustering strategies is that deep methods vs shallow ones appear to be very efficient and powerful tools especially for complex high-dimensional data; on the contrary, for simple and small data structures, a deep learning strategy cannot improve performance of simpler and conventional methods or, to better say, it is like to use a 'sledgehammer to crack a nut' (Viroli and McLachlan 2019).
The motivating problem behind this work derives from ticket data (i.e. content of calls made to the customer service) of an important mobile phone company, collected in Italian language. When a customer calls the assistance service, a ticket is created: the agent classifies it as, e.g. a claim, a request of information for specific services, deals or promotions. Our dataset consists of tickets related to five classes of services, previously classified from independent operators. The aim is to define an efficient clustering strategy to automatically assign the tickets into the same classes without the human judgment of an operator. The data are textual and information are collected in a document-term matrix with raw frequencies at each cell. They have a very complex and a high-dimensional structure, caused by the huge number of tickets and terms used by people that call the company for a specific request and by a relevant degree of sparsity (after a preprocessing step, the tickets have indeed an average length of only 5 words and, thus, the document-term matrix contains zero almost everywhere). The simplest topic model for clustering document-term data is represented by mixture of unigrams (MoU) (Nigam et al. 2000) MoUs are based on the idea of modeling the word frequencies as multinomial distributions, under the assumption that a document refers to a single topic. Table 5 (last column) shows that the method appears to be the most efficient tool for classifying the complex ticket data, compared to other conventional clustering strategies such as k-means, partition around medoids and hierarchical clustering. The reason is probably related to the fact that, by using proportions, MoU is not affected by the large amount of zeros, differently from the other competitors. We also compared MoU with the latent Dirichlet allocation model (LDA) (Blei et al. 2003), which represents an important and very popular model in textual data analysis, allowing documents to exhibit multiple topics with a different degree of importance. The latent Dirichlet allocation model has demonstrated great success on long texts (Griffiths and Steyvers 2004), and it could be thought of as a generalization of the MoU, because it adds a hierarchical level to it and, hence, much more flexibility. However, when dealing with very short documents like the ticket dataset, it is very rare that a single unit could refer to more than one topic; in such cases, the LDA model may not improve the clustering performance of MoU.
The aim of this paper is to derive a deep generalization of mixtures of unigrams, in order to better uncover topics or groups in case of complex and high-dimensional data. The proposal will be derived in a Bayesian framework and we will show that it will be particularly efficient for classifying the ticket data with respect to the 'shallow' MoU model. We will also show the good performance of the proposed method in a simulation study.
The paper is organized as follows. In the next section, the mixture of unigrams model is described. In Sect. 3, the deep formulation of the model is developed. Section 4 is devoted to the estimation algorithm for fitting the model. Experimental results on simulated and real data are presented in Sects. 5 and 6, respectively. We conclude the paper with some final remarks (Sect. 7 ).

Mixtures of unigrams
Let X be a document-term matrix of dimension n × T containing the word frequencies of each document in row and let k be the number of homogeneous groups in which documents could be allocated. Let x d be the single document of length T , with d = 1, . . . , n.
In MoU, the distribution of each document has a specific distribution function conditional on the values of a discrete latent allocation variable z d describing the probability of each topic. More precisely, with p(z d = i) = π i , under the restrictions (i) π i > 0 for i = 1, . . . , k and (ii) k i=1 π i = 1. The natural distribution for p(x d |z d = i) is represented by the multinomial distribution with a parameter vector, say ω i , that is cluster-specific: x dt denoting the word-length of the document d. The multinomial distribution assumes that, conditionally to the cluster membership, all the terms can be regarded as independently distributed.
The model is indeed a simple mixture of multinomial distributions that can be easily estimated by the EM algorithm (see Nigam et al. 2000 for further details) under the assumption that a document belongs to a single topic and the number of groups coincides with the number of topics. The approach has been successfully applied not only to textual data, where it originated from, but to genomic data analysis. In this latter field, a particular improvement of the method consists in relaxing the conditional independence assumption of the variables/terms, by using m-order Markov properties, thus leading to the so-called m-gram models (Tomović et al. 2006). Due to the limited average length of documents in ticket data, co-occurrence information is very rare and this extension would be not effective on these data.

Going deep into mixtures of unigrams: a novel approach
We aim at extending MoU by allowing a further layer in the probabilistic generative model, so as to get a nested architecture of nonlinear transformations able to describe the data structure in a very flexible way. At the deepest latent layer, the documents can come from k 2 groups with different probabilities, say π (2) j j = 1, . . . , k 2 . Conditionally to what happened at this level, at the top observable layer the documents can belong to k 1 groups with conditional probabilities π (1) i| j i = 1, . . . , k 1 . For the sake of a simple notation, we refer in the following to a generic document denoted by x. The distribution of x conditionally to the two layers is a multinomial distribution with cluster-specific proportions ω: where z (1) and z (2) are the allocation variables at the top and at the bottom layers, respectively. They are discrete latent variables that follow the distributions: and In mixture of unigrams, the proportions ω were fixed parameters. Here we assume they are realizations of random variables with a Dirichlet distribution. In order to have a flexible and deep model we assume that the parameters of the Dirichlet distribution are a linear function of two sets of parameters that originate in the two subsequent layers. The Dirichlet parameters are β i + α j β i = β i (1 + α j ), where β i and α j are vectors of length T . Since they must be positive and the overall model must be identifiable, we further assume that β it > 0 and −1 < α jt < 1. These choices lead to a nice interpretative perspective: at the bottom layer α j acts as a perturbation on the cluster-specific β i parameters of the top layer. Therefore, the distribution is: where Γ denotes the Gamma function. An example of deep MoU structure is depicted in Fig. 1 for the case k 1 = 3 and k 2 = 2. Notice that in a model with k 1 = 3 and k 2 = 2 components we have an overall number M of sub-components equal to 5, but 6 > M possible paths for each document. The paths share and combine the parameters of the two levels, thus achieving great flexibility with less parameters to be estimated.
By combining equations (2) and (5), the latent variable ω can be integrated out from the model estimation, thus gaining efficiency without losing flexibility and interpretability. More precisely: where B denotes the Beta function. Figure 2 shows the Directed Acyclic Graph (DAG) summarizing the dependence structure of the model.
Finally by combining formulas (6) with (3) and (4), the density of the data is a mixture of mixtures of Multinomial-Dirichlet distributions: In other terms, adding a level to the hierarchy resulted in adding a mixing step. A double mixture is deeper, more flexible and it can capture more heterogeneity of the data, than a simple mixture of Multinomial-Dirichlet distributions.
Having two layers and two number of groups for each, that are k 1 and k 2 , it is important to define the procedure by which the units are clustered.

Cluster assignment
Theoretically, under this double mixtures, we could group units into k 1 groups, k 2 groups or k 1 × k 2 groups. However, note that, under the constraint −1 < α jt < 1, the role of k 2 components at the deepest layer is confined to add flexibility to the model, while the real cluster-distribution is specified by the β parameters. For this reason, the number of 'real' clusters is given by k 1 . Their internal heterogeneity is captured by the k 2 sub-groups that help in adding more flexibility to the model. Therefore, the final allocation of the documents to the clusters is given by the posterior probability p(z (1) |x) that can be obtained as follows: (8) The model encompasses the simple MoU, that can be obtained as special case when k 2 = 1 and without any prior on ω. When k 2 = 1 and a Dirichlet prior is put on ω, a mixture of Dirichlet-Multinomials is defined. In this case, in order to assure identifiability, we assume α = 0.
The approach can be generalized to multilayer of latent variables, where at each layer perturbation parameters to the final β are introduced, under the constraint that their values are limited between -1 and 1. However, we will show in the next sections that the structure with just one additional latent layer is generally sufficient to gain large flexibility and very good clustering performance.

Model estimation
In this section, we present a Bayesian algorithm for parameter estimation. The prior distribution for the weights of the mixture components is assumed to follow a Dirichlet distribution with hyperparameter δ. We want non-informative priors for the model parameters. Hence, the value of the Dirichlet hyperparameter is δ = 1 in order to have a flat Dirichlet distribution. The prior distributions for each α j and β i are given by the Uniform in the interval [-1,1] and in (0, 1000], respectively.
By using the previous model assumptions, the posterior distribution can be expressed as where p (X|z (1) , z (2) , α, β) is the likelihood function of the model. By indexing the documents for which it holds that z (2) d j = 1, the likelihood function can be expressed as: with .
In order to sample parameters and latent variables from the posterior distribution, we determine the full conditionals of each unobservable variable given the other ones.

Full conditionals
The posterior distribution of the parameters and latent allocation variables given the other variables are proportional to known quantities. By using | . . . to denote conditioning on all other variables, they are: A Gibbs sampling MCMC algorithm can be thus easily implemented for generating values from the posterior distributions.
Note that in order to get α j and β i we need an acceptreject mechanism. We consider as proposal value for β i , the average value of the parameters generated by n×k 2 Dirichletmultinomials, given α j fixed and vice versa.
The computational time of the algorithm depends on the desired number of runs of the MCMC algorithm, the number of nodes k 1 and k 2 and on the length of the vectors α j and β i . For a two-class dataset, described in Sect. 6, with D = 240, T = 357 and k 2 = 2 the MCMC algorithm with 5.000 iterations requires about 10 minutes on a processor Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz, 2592 Mhz, 2 cores, under R cran 3.6.1.
We will show the estimation and clustering performance through a simulation study and real applications in the next sections.

Simulation study
The performance of the proposed method is evaluated under different aspects in an empirical simulation study. In order to prove the capability of the deep MoU to uncover the clusters in complex data, data were generated with a high level of sparsity. Several simulation studies are presented and discussed in the following.
The first simulation study aimed to check the capability of the deep MoU to cluster well the data, when these are generated according to a deep generative process. More precisely, we set T = 200 and n = 200, k 1 = 3, k 2 = 2 and balanced classes. We randomly generated β from a Uniform distribution in (0,20] and α from a Uniform distribution in [-1,1]. In order to assure a high level of sparsity, for each document the total number of terms has been generated according to a Poisson distribution with parameter N d = 20, ∀d. Data are then organized in a document-term matrix containing the term frequencies of each pseudo-document. Panel (a) of Fig. 3 shows the row frequency distributions of the features across the clusters and provides a representation of the group overlapping.
Clustering performance has been measured by using both the Adjusted Rand Index (ARI) and the accuracy rate. The former is a corrected-for-chance version of the Rand Index (Hubert and Arabie 1985), that measures the degree to which two partition of objects agree; Romano et al. (2016) proved that this measure is particularly indicated in presence of large equal sized reference clusters. The accuracy is defined as the complement of the misclassification error rate. Table 1 shows the Adjusted Rand Index and the accuracy obtained by a deep MoU model for different values of k 2 , ranging from 1 to 5. We run an MCMC chain with 5000 iterations, discarding the first 2000 as burn-in. Visual inspection assured that this burn-in was largely sufficient.
The results show how the model with k 2 > 1 is really effective in clustering the data and, as desirable, the model with k 2 = 2 (i.e. the setting that reflects the generative process of the data) resulted to be the best one in terms of recovering the 'true' grouping structure. The gap between k 2 = 1 and k 2 > 1 is relevant; however, the performance remains elevate for the various k 2 > 1, thus indicating that a deep structure can be really effective in clustering such kind of data. Statistics and Computing (2021)  In a second simulation study, we tested the performance of a deep MoU with data that are not originated according to a deep generative process, but simply by k 1 = 3 balanced groups, T = n = 200 and N d = 20, ∀d. Panel (b) of Fig. 3 displays the row frequency distributions of the generated features. As shown in Table 2, in this situation, the deep model with k 2 > 1 does not significantly improve the clustering per-formance, and the accuracy remains stable as k 2 increases. This suggests that when the data are pretty simple, and are not high-dimensional, a deep algorithm is not more efficient than the conventional MoU.
The third simulation study aimed at measuring the accuracy of the estimated parameters α and β in data with double structure k 1 = 3 and k 2 = 2, allowing for different combinations of T , n and N , so as to measure the effect of data dimensionality and level of sparsity on the goodness of fit. We considered a total of 8 different scenarios generated according to the combinations of T = {100, 200}, n = {100, 200} and N = {10, 20}. Table 3 contains the Euclidean distance between the true parameter vectors and the posterior means, normalized over T .
As expectable, for a given T the goodness of fit improves as the number of documents increases. The level of sparsity has a relevant role as well: when N increases the documents are more informative and the parameter estimates become more accurate.

Application to real data
The effectiveness of the proposed model is demonstrated by using four textual datasets, including the introduced ticket data. We compare the deep MoU with conventional clustering strategies: k-means (Lloyd 1982) with cosine distance (k-means) and with Euclidean distance on data transformed according to semantic analysis (LSA k-means), partition around medoids (PAM) (Kaufman and Rousseeuw 2009),   ster et al. 1977), spectral clustering (SpeCl) (Ng et al. 2002) and affinity propagation clustering (AffPr) (Frey and Dueck 2007) with normalized linear kernel. The CNAE-9 dataset contains 1080 documents of free text business descriptions of Brazilian companies categorized into 9 balanced categories (Ciarelli and Oliveira 2009; Cia- relli et al. 2010) for a total of 856 preprocessed words. Since the classes 4 and 9 are the most overlapped we considered also the reduced CNAE-2 dataset composed by these two groups only which consists of 240 documents and 357 words. This dataset is highly sparse: the 99.22% of the document-term matrix entries are zeros. The BBC dataset consists of 737 documents from the BBC Sport website corresponding to sport news articles in five topics/areas (athletics, cricket, football, rugby, tennis) from 2004 to 2005 (Greene and Cunningham 2006). After a preprocessing phase aimed at discarding non-relevant words, the total number of features is 1075. This dataset is moderately sparse with a fraction of zeros equal to 92.36%.
The ticket dataset contains n = 2129 tickets and T = 489 terms obtained after preprocessing: original raw data were processed via stemming, so as to reduce inflected or derived words to their unique word stem, and some terms have been filtered out in order to remove very common noninformative stopwords words in the Italian language. The tickets have then been classified by independent operators to k = 5 main classes described in Table 4.
The peculiarity and major challenge of this dataset is the limited number of words used, on average, for each ticket. In Statistics and Computing (2021)   fact, after preprocessing, the tickets have an average length of 5 words. A graphical representation of the data, after the preprocessing step, is shown in Fig. 4. The heatmap shows the row frequency distributions of the most common terms (overall occurrences ≥ 50). As clear from the shades, each class is characterized by a limited set of specific words, while the majority is uniformly distributed across them, thus making the classification problem particularly challenging. In order to have further details about the terms characterizing the tickets within each topic, Fig. 5 displays the most frequent words. Data are characterized by a large amount of sparsity with the 99.05% of zeros: as a consequence, most conventional clustering strategies fail. Tables 5 and 6 show, respectively, the accuracy and the ARI of the different methods on the presented datasets. For comparative reasons, the true number of clusters was considered as known for all the methods.
Among the considered methods, spectral clustering, kmeans and mixtures of unigrams are the most effective method for classifying the datasets. An advantage of MoU, which is inherited by the proposed deep MoU, is that it seems to be not affected by the large number of zeros like the other methods, as it is based on proportions. The latent Dirichlet allocation model, despite its flexibility, is not able to improve the classification on these short documents, because it is based on the assumption of multiple topics for each docu- ment, which is not realistic for short data, as in the case of ticket dataset. We applied deep MoU with k 2 = 1, . . . , 5. For each setting, we run 5000 iterations of the MCMC algorithm, discarding the first 2000 as burn-in. From graphical visualization and diagnostic criteria we observed convergence and stability to the different choices of starting points and hyperparameters δ, so we considered δ = 1.
Tables 7 and 8 contain the clustering results of the deep MoU on the datasets, measured by accuracy and ARI, respectively. The case k 2 = 1 corresponds to a Bayesian MoU and classification is improved with respect to the conventional MoU in most all empirical cases. A probable reason for that is the adoption of Dirichlet-multinomials (in Bayesian MoU) instead of classical multinomials (in frequentist MoU), which are more suitable to capture overdispersion with respect to the multinomial framework (Wilson and Koehler 1991). In our analysis, the only exception is represented by the dataset CNAE-9, which is particularly challenging because it is composed by 9 unbalanced groups. In this case, classical MoU performs slightly better in terms of misclassification rate and ARI. This may depend on a particularly good starting point of the EM algorithm for fitting MoU, which is initialized by the k-means clustering.
When we move from k 2 = 1 to a deeper structure with k 2 = 2 or k 2 = 3 the accuracy improves in all the analyzed experiments. This proves how the introduction of the parameters α j may be beneficial for classification. When the hidden nodes are greater than 3 (k 2 = 4 or k 2 = 5) results seem to be little worsened, probably because of the larger number of parameters to be estimated.

Final remarks
In this paper, we have proposed a deep learning strategy that extends the mixtures of unigrams model. With respect to other clustering methods, MoUs have desirable features for textual data. Firstly, document-term matrices usually contain the frequencies of the words in each document; for this reason, MoUs represent an intuitive choice, since they are based on multinomial distributions that are the probabilistic distributions for modeling positive frequencies. Moreover, as MoUs naturally model proportions, they are not affected by the large amount of zeros of the datasets like other methods, so they are a proper choice for modeling very short texts and sparse datasets. Furthermore, MoU is based on the idea that documents related to the same topic have similar distributions of terms, which is realistic in practice. Taking a mixture of k multinomials means doing clustering into k topics/groups: there is a unique association between documents and topics.
All these nice characteristic are inherited by the proposed deep MoU model. The proposed deep MoU is particularly effective in clustering with challenging issues (sparsity, overdispersion, short document length and highdimensionality). Being hierarchical in its nature, the model can be easily estimated by a MCMC algorithm in a Bayesian framework. In our analysis, we chose non-informative priors, because there is little prior information available on the empirical context. The estimation algorithm produces good results in all the simulated and real situations considered here.
The proposed model could be extended in several directions: as discussed in Sect. 3, several hidden layers (instead of just a single one) could be considered. The merging function β i (1 + α j ) has been defined for identifiability reasons under the idea that the number of estimated groups is k 1 and the latent layer is only aimed at perturbing the β i parameters for capturing some residual heterogeneity inside the groups. Of course, more complex nonlinear functions could be considered, without however losing sight of identifiability. In case of non-extreme sparsity and long documents, the model could be also extended to allow for deep m-gram models. We leave all these ideas to future research.
Funding Open access funding provided by Alma Mater Studiorum -Università di Bologna within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.