Variational Inference and Sparsity in High-Dimensional Deep Gaussian Mixture Models

Gaussian mixture models are a popular tool for model-based clustering, and mixtures of factor analyzers are Gaussian mixture models whose components have parsimonious factor covariance structure. There are several recent extensions of the mixture of factor analyzers to deep mixtures, where the Gaussian model for the latent factors is replaced by a mixture of factor analyzers. This construction can be iterated to obtain a model with many layers. These deep models are challenging to fit, and we consider Bayesian inference using sparsity priors to further regularize the estimation. A scalable natural gradient variational inference algorithm is developed for fitting the model, and we suggest computationally efficient approaches to the architecture choice using overfitted mixtures in which unnecessary components drop out during estimation. In a number of simulated examples and two real data applications, we demonstrate the versatility of our approach for high-dimensional problems and show that sparsity-inducing priors can be helpful for obtaining improved clustering results.


Introduction
Exploratory data analysis tools such as cluster analysis need to be increasingly flexible to deal with the greater complexity of datasets arising in modern applications. For high-dimensional data, there has been much recent interest in deep clustering methods based on latent variable models with multiple layers. Our work considers deep Gaussian mixture models for clustering, and in particular a deep mixture of factor analyzers (DMFA) model recently introduced by Viroli and McLachlan (2019). The DMFA model is highly parametrized, and to make the estimation tractable Viroli and McLachlan (2019) consider a variety of constraints on model parameters which enable them to obtain promising results. The objective of the current work is to consider a Bayesian approach to estimation of the DMFA model which uses sparsity-inducing priors to further regularize the estimation. We use the horseshoe prior of Carvalho and Polson (2010), and demonstrate in a number of problems that the use of sparsity-inducing priors is helpful. This is particularly true in clustering problems which are high-dimensional and involve a large number of noise features. A computationally efficient natural gradient variational inference scheme is developed, which is scalable to high dimensions and large datasets. A difficult problem in the application of the DMFA model is the choice of architecture, i.e. determining the number of layers and the dimension of each layer, since trying many different architectures is computationally burdensome. We discuss useful heuristics for choosing good models and selecting the number of clusters in a computationally thrifty way, using overfitted mixtures (Rousseau and Mengersen, 2011) with suitable priors on the mixing weights.
There are several suggested methods in the literature for constructing deep versions of Gaussian mixture models (GMMs) involving multiple layers of latent variables. One of the earliest is the mixture of mixtures model of Li (2005). Li (2005) considers modelling non-Gaussian components in mixture models using a mixture of Gaussians, resulting in a mixture of mixtures structure. The model is not identifiable through the likelihood, and Li (2005) suggests several ways to address this issue, as well as methods for choosing the number of components for each cluster. Malsiner-Walli et al. (2017) identify the model through the prior in a Bayesian approach, and consider Bayesian methods for model choice.
Another deep mixture architecture is suggested in van den Oord and Schrauwen (2014). The authors consider a network with linear transformations at different layers, and the random sampling of a path through the layers. Each path defines a mixture component by applying the corresponding sequence of transformations to a standard normal random vector. Their approach allows a large number of mixture components, but without an explosion of the number of parameters due to the way parameters are shared between components.
The model of Viroli and McLachlan (2019) considered here is a multi-layered extension of the mixture of factor analyzers (MFA) model (Ghahramani and Hinton, 1997; McLachlan et al., 2003). The MFA model is a GMM in which component covariance matrices have a parsimonious factor structure, making this model suitable for high-dimensional data. The factor covariance structure can be interpreted as explaining dependence in terms of a low-dimensional latent Gaussian variable. Extending the MFA model to a deep model with multiple layers has two main advantages. First, non-Gaussian clusters can be obtained, and second, it enables the fitting of GMMs with a large number of components, which is particularly relevant for high-dimensional data and cases where the number of true clusters is large. Tang et al. (2012) were the first to consider a deep MFA model, where a mixture of factor analyzers model is used as a prior for the latent factors instead of a Gaussian distribution. Applying this idea recursively leads to deep models with many layers. Their architecture splits components into several subcomponents at the next layer in a tree-type structure. Yang et al. (2017) consider a two-layer version of the model of Tang et al. (2012) incorporating a common factor loading matrix at the first level and some other restrictions on the parametrization.
The model of Viroli and McLachlan (2019) combines some elements of the models of Tang et al. (2012) and van den Oord and Schrauwen (2014). Similar to Tang et al. (2012), an MFA prior is considered for the factors in an MFA model, and multiple layers can be stacked together. However, similar to the model of van den Oord and Schrauwen (2014), their architecture has parameters for Gaussian component distributions with factor structure arranged in a network, with each Gaussian mixture component corresponding to a path through the network. Parameter sharing between components and the factor structure makes the model parsimonious. Viroli and McLachlan (2019) consider restrictions on the dimensionality of the factors at different layers and other model parameters to help identify the model. A stochastic EM algorithm is used for estimation, and they report improved performance compared to greedy layerwise fitting algorithms. For choosing the network architecture, they fit a large number of different architectures and use the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) to choose the final model. Selosse et al. (2020) report that even with the identification restrictions suggested in Viroli and McLachlan (2019) it can be very challenging to fit the DMFA model. The likelihood surface has a large number of local modes, and it can be hard to find good modes, possibly because the estimation of a large number of latent variables is difficult. This is one motivation for the sparsity priors we introduce here. While sparse Bayesian approaches have been considered before for both factor models (Carvalho et al. (2008); Ročková and George (2016); Hahn et al. (2018), among many others) and the MFA model (Ghahramani and Beal, 2000), they have not been considered for the DMFA model.
The structure of the paper is as follows. In Section 2 we introduce the DMFA model of Viroli and McLachlan (2019) and our Bayesian treatment making appropriate prior choices. Section 3 describes our scalable natural gradient variational inference method, and the approach to choice of the model architecture. Section 4 illustrates the good performance of our method for simulated data in a number of scenarios and compares our method to several alternative methods from the literature for simulated and benchmark datasets considered in Viroli and McLachlan (2019). Although it remains challenging to fit complex latent variable models such as the DMFA, we find that in many situations which are well-suited to sparsity our approach is helpful for producing better clusterings. Sections 5 and 6 consider two real high-dimensional examples on gene expression data and taxi networks illustrating the potential of our Bayesian treatment of DMFA models. Section 7 gives some concluding discussion.

Bayesian Deep Mixture of Factor Analyzers
This section describes our Bayesian DMFA model. We consider a conventional MFA model first in Section 2.1, and then the deep extension of this by Viroli and McLachlan (2019) in Section 2.2. The priors used for Bayesian inference are discussed in Section 2.3.

MFA model
In what follows we write N(µ, Σ) for the normal distribution with mean vector µ and covariance matrix Σ, and φ(y; µ, Σ) for the corresponding density function. Suppose y_i = (y_{i1}, ..., y_{id})^T, i = 1, ..., n, are independent identically distributed observations of dimension d. An MFA model for y_i assumes a density of the form

p(y_i) = Σ_{k=1}^K p_k φ(y_i; µ_k, B_k B_k^T + δ_k),   (1)

where p_k > 0 are mixing weights, k = 1, ..., K, with Σ_{k=1}^K p_k = 1; µ_k = (µ_{k1}, ..., µ_{kd})^T are component mean vectors; and B_k B_k^T + δ_k, k = 1, ..., K, are component-specific covariance matrices, with B_k a d × D matrix, D ≪ d, and δ_k a d × d diagonal matrix with diagonal entries (δ_{k1}, ..., δ_{kd})^T. The model (1) has a generative representation in terms of D-dimensional latent Gaussian random vectors z_i ∼ N(0, I), i = 1, ..., n. Suppose that y_i is generated with probability p_k as

y_i = µ_k + B_k z_i + ε_{ik},

where ε_{ik} ∼ N(0, δ_k). Under this generative model, y_i has density (1). The latent variables z_i are called factors, and the matrices B_k are called factor loadings or factor loading matrices. In ordinary factor analysis (Bartholomew et al., 2011), which corresponds to K = 1, restrictions on the factor loading matrices are made to ensure identifiability, and similar restrictions are needed in the MFA model. Common restrictions are a lower-triangular structure with positive diagonal elements, or orthogonality of the columns of the factor loading matrix. With continuous shrinkage sparsity-inducing priors and careful initialization of the variational optimization algorithm, we did not find imposing an identifiability restriction such as a lower-triangular structure on the factor loading matrices to be necessary in our applications later. Identifiability issues for sparse factor models with point mass mixture priors have been considered recently in Frühwirth-Schnatter and Lopes (2018).
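The generative representation above can be sketched directly. The following minimal simulation of an MFA model uses illustrative dimensions and randomly drawn parameters; all names and values are assumptions for the example, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, D, K = 5, 2, 3                      # observed dim, factor dim (D << d), components
p = np.array([0.5, 0.3, 0.2])          # mixing weights, sum to 1
mu = rng.normal(size=(K, d))           # component means mu_k
B = rng.normal(size=(K, d, D))         # factor loading matrices B_k (d x D)
delta = rng.uniform(0.1, 0.5, size=(K, d))  # diagonal idiosyncratic variances

def sample_mfa(n):
    """Draw n observations y_i = mu_k + B_k z_i + eps_ik with z_i ~ N(0, I)."""
    k = rng.choice(K, size=n, p=p)     # component indicator, chosen with prob. p_k
    z = rng.normal(size=(n, D))        # latent factors
    eps = rng.normal(size=(n, d)) * np.sqrt(delta[k])
    y = mu[k] + np.einsum('nij,nj->ni', B[k], z) + eps
    return y, k

y, k = sample_mfa(1000)
# the implied covariance of component 0 is B_0 B_0^T + diag(delta_0)
cov0 = B[0] @ B[0].T + np.diag(delta[0])
```

Marginalizing the factors z_i analytically gives exactly the mixture density (1), which is why the factor structure yields parsimonious component covariances.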
The MFA model is well-suited to analyzing high-dimensional data, because the modelling of dependence in terms of low-dimensional latent factors results in a parsimonious model. The idea of deep versions of the MFA model is to replace the Gaussian assumption z_i ∼ N(0, I) with the assumption that the z_i themselves follow an MFA model. This idea can be applied recursively, to define a deep model with many layers.

DMFA model
Suppose once again that y_i, i = 1, ..., n, are independent and identically distributed observations of dimension D^(0) = d, and define z_i^(0) = y_i. We consider a hierarchical model in which latent variables z_i^(l) at layers l = 1, ..., L are generated according to the following generative process. Let K^(l) denote the number of mixture components at layer l. Then, with probability p_k^(l), k = 1, ..., K^(l),

z_i^(l-1) = µ_k^(l) + B_k^(l) z_i^(l) + ε_{ik}^(l),   ε_{ik}^(l) ∼ N(0, δ_k^(l)),

with z_i^(L) ∼ N(0, I) at the deepest layer. Viroli and McLachlan (2019) observe that their DMFA model is just a Gaussian mixture model with ∏_{l=1}^L K^(l) components. The components correspond to "paths" through the factor mixture components at the different levels. Write k_l ∈ {1, ..., K^(l)} for the index of a factor mixture component at level l, let k = (k_1, ..., k_L)^T index a path, and let p(k) = ∏_{l=1}^L p_{k_l}^(l). Then the DMFA model corresponds to the Gaussian mixture density

Σ_k p(k) φ(y; µ(k), Σ(k)),

where µ(k) and Σ(k) are the mean and covariance implied by composing the affine maps along the path k.
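As a concrete illustration of how a path determines a Gaussian component, the sketch below composes the layer-wise affine maps from the deepest layer upward; all dimensions, component counts and parameter values are illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

dims = [6, 3, 2]        # D^(0)=d, D^(1), D^(2); z^(L) ~ N(0, I) at the deepest layer
Ks = [2, 3]             # K^(1), K^(2): components per layer
L = len(Ks)

# layer-l parameters: mu[l][k] (length D^(l-1)), B[l][k] (D^(l-1) x D^(l)), diag delta[l][k]
mu = [[rng.normal(size=dims[l]) for _ in range(Ks[l])] for l in range(L)]
B = [[rng.normal(size=(dims[l], dims[l + 1])) for _ in range(Ks[l])] for l in range(L)]
delta = [[rng.uniform(0.1, 0.3, size=dims[l]) for _ in range(Ks[l])] for l in range(L)]

def path_component(path):
    """Mean mu(k) and covariance Sigma(k) of the Gaussian component for path (k_1,...,k_L)."""
    m = np.zeros(dims[L])
    S = np.eye(dims[L])              # z^(L) ~ N(0, I)
    for l in reversed(range(L)):     # fold layers from the deepest upward
        k = path[l]
        m = mu[l][k] + B[l][k] @ m
        S = B[l][k] @ S @ B[l][k].T + np.diag(delta[l][k])
    return m, S

paths = list(product(*[range(K) for K in Ks]))   # prod_l K^(l) components in total
```

With K^(1) = 2 and K^(2) = 3 this yields six Gaussian components while the layers share only 2 + 3 parameter sets, which is the source of the model's parsimony.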
The DMFA allows expressive modelling through a large number of mixture components, but due to the parameter sharing and the factor structure the model remains parsimonious.

Prior distributions
A precise description of the priors is given now. Denote the complete vector of parameters by ϑ = (µ^T, B^T, z^T, p^T, δ^T, γ^T)^T. First, we assume independence between µ, B, z, δ and p in their prior distribution. For the marginal prior distribution of µ, we furthermore assume independence between all components, and the marginal prior density for µ_kj^(l) is assumed to be a Cauchy density, C(0, G^(l)), where G^(l) is a known hyperparameter possibly depending on l, l = 1, ..., L. This Cauchy prior can be represented hierarchically as a scale mixture of a univariate Gaussian with an inverse gamma mixing distribution, and we augment the model to include the parameters g in this hierarchical representation.
For the prior on B, we assume the elements are independently distributed a priori with a horseshoe prior (Carvalho and Polson, 2010); here G(a, b) denotes a gamma distribution with shape a and scale parameter b. For δ, we assume all elements are independent in the prior, with the standard deviations √(δ_kj^(l)) being half-Cauchy distributed, √(δ_kj^(l)) ∼ HC(A^(l)), where the scale parameter A^(l) is known and possibly depends on l, l = 1, ..., L. This prior can also be expressed hierarchically, and once again we augment the original model to include the resulting latent scale parameters ψ.
The prior on p assumes independence between p^(l), l = 1, ..., L, and the marginal prior for each p^(l) is a Dirichlet distribution with hyperparameters ρ_k^(l). The full set of unknowns in the model is {µ, B, z, p, δ, γ, ψ, g, h, c, τ, ξ}, and we denote the corresponding vector of unknowns by θ.

Variational Inference for the Bayesian DMFA
Next we review basic ideas of variational inference and introduce a scalable variational inference method for the DMFA model.

Variational inference
Variational inference (VI) approximates a posterior density p(θ|y) by assuming a parametric form q_λ(θ) for it and then minimizing some measure of closeness to p(θ|y). Here, λ denotes variational parameters to be optimized, such as the mean and covariance matrix in a multivariate Gaussian approximation. If the measure of closeness adopted is the Kullback-Leibler divergence, the optimal approximation maximizes the evidence lower bound (ELBO)

L(λ) = E_{q_λ}[log h(θ) − log q_λ(θ)]   (4)

with respect to λ. Eq. (4) is an expectation with respect to q_λ(θ), where h(θ) = p(y|θ)p(θ), and this observation enables unbiased Monte Carlo estimation of the gradient of L(λ) after differentiating under the integral sign. This is often used with stochastic gradient ascent methods to optimize the ELBO; see Blei et al. (2017) for further background on VI methods. For some models and appropriate forms of q_λ(θ), L(λ) can be expressed analytically, and this is the case for our proposed VI method for the DMFA model. In this case, finding the optimal value λ* does not require Monte Carlo sampling from the approximation to implement the variational optimization (see e.g. Honkela et al., 2008), although the use of subsampling to deal with large datasets may still require stochastic optimization.
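The Monte Carlo gradient estimation mentioned above can be illustrated with the score-function (REINFORCE) estimator for a one-dimensional Gaussian approximation. The toy target and all parameter values below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_h(theta):
    """log p(y|theta) + log p(theta); here a toy unnormalized N(2, 1) 'posterior'."""
    return -0.5 * (theta - 2.0) ** 2

def elbo_grad_estimate(lam, n_samples=2000):
    """Score-function estimate of grad_lambda E_q[log h(theta) - log q(theta)]
    for q = N(m, exp(2s)), lam = (m, s): E_q[(grad log q) * (log h - log q)]."""
    m, s = lam
    sd = np.exp(s)
    theta = m + sd * rng.normal(size=n_samples)
    log_q = -0.5 * ((theta - m) / sd) ** 2 - s - 0.5 * np.log(2 * np.pi)
    gm = (theta - m) / sd ** 2          # d log q / dm at the sampled theta
    gs = ((theta - m) / sd) ** 2 - 1.0  # d log q / ds at the sampled theta
    f = log_h(theta) - log_q
    return np.array([np.mean(gm * f), np.mean(gs * f)])

g = elbo_grad_estimate((0.0, 0.0))
```

Started at m = 0, the estimated gradient for the mean is positive, pushing q toward the target mean 2; for conjugate models like the DMFA the bound is instead available in closed form, so this sampling step is not needed.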

Mean field variational approximation for DMFA
Choosing the form of q_λ(θ) requires balancing flexibility and computational tractability. Here we consider the factorized form

q_λ(θ) = q(µ) q(vec(B)) q(z) q(p) q(δ) q(γ) q(ψ) q(g) q(h) q(c) q(τ) q(ξ),   (6)

and each factor in Eq. (6) will also be fully factorized. For the DMFA model, the conditional conjugacy arising from our prior choices means that the parametric form of the factors follows from the factorization assumptions made. We can give an explicit expression for the ELBO (see Web Appendix A.2 for details) and use numerical optimization techniques to find the optimal variational parameter λ*. It would be possible to consider less restrictive factorization assumptions, retaining some of the dependence structure while still obtaining a closed form for the lower bound; a fully factorized approximation is used here to reduce the number of variational parameters to optimize.

Approach to computations for large datasets
Optimization of the lower bound (5) with respect to the high-dimensional vector of variational parameters is difficult. One problem is that the number of variational parameters grows with the sample size. This occurs because we need to approximate posterior distributions of the observation-specific latent variables z_i^(l) and γ_i^(l), i = 1, ..., n, l = 1, ..., L.
To address this issue, we adapt the idea of stochastic VI by Hoffman et al. (2013). More specifically, we split θ into so-called "global" and "local" parameters, θ = (β^T, ζ^T)^T, where β collects the model parameters shared across observations and ζ = (z^T, γ^T)^T denotes the observation-specific "local" parameters.
We can similarly partition the variational parameters λ into global and local components, λ = (λ_G^T, λ_L^T)^T, where λ_G and λ_L are the variational parameters appearing in the variational posterior factors involving elements of β and ζ, respectively. Write the lower bound L(λ) as L(λ_G, λ_L), and write M(λ_G) for the value of λ_L optimizing the lower bound with fixed global parameter λ_G. When optimizing L(λ), the optimal value of λ_G maximizes L(λ_G) := L(λ_G, M(λ_G)), whose gradient satisfies

∇_{λ_G} L(λ_G) = ∇_{λ_G} L(λ_G, λ_L)|_{λ_L = M(λ_G)}.   (7)

This follows from ∇_{λ_L} L(λ_G, λ_L)|_{λ_L = M(λ_G)} = 0, since M(λ_G) gives the optimal local variational parameter values for fixed λ_G. Eq. (7) says that the gradient ∇_{λ_G} L(λ_G) can be computed by first optimizing to find M(λ_G), and then computing the gradient of L(λ) with respect to λ_G with λ_L fixed at M(λ_G).
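Spelled out, the chain-rule argument behind this identity is:

```latex
\nabla_{\lambda_G}\,\mathcal{L}\bigl(\lambda_G, M(\lambda_G)\bigr)
  = \nabla_{\lambda_G}\,\mathcal{L}(\lambda_G,\lambda_L)\Big|_{\lambda_L=M(\lambda_G)}
  + \bigl(\nabla_{\lambda_G} M(\lambda_G)\bigr)^{\top}
    \underbrace{\nabla_{\lambda_L}\,\mathcal{L}(\lambda_G,\lambda_L)\Big|_{\lambda_L=M(\lambda_G)}}_{=\,0},
```

where the second term vanishes because M(λ_G) is a maximizer of the bound in λ_L for fixed λ_G, so the partial gradient with respect to λ_L is zero there.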
We optimize L(λ_G) using stochastic gradient ascent methods, which iteratively update an initial value λ_G^(1) for λ_G via

λ_G^(t+1) = λ_G^(t) + a_t ∘ ∇̂_{λ_G} L(λ_G^(t)),   t ≥ 1,

where a_t is a vector-valued step size sequence, ∘ denotes elementwise multiplication, and the estimate ∇̂_{λ_G} L(λ_G) of the gradient (7) is constructed by randomly sampling a mini-batch of the data. This enables us to avoid dealing with all local variational parameters at once, lowering the dimension of any optimization. Let A be a subset of {1, ..., n} chosen uniformly at random without replacement, giving the indices of a data mini-batch. Write L(λ) = L_F(λ_G) + Σ_{i=1}^n L_i(λ), where L_i(λ) is the contribution to L(λ) of all terms involving the local parameters for the ith observation and L_F(λ_G) is the remainder. Then, in view of (7), an unbiased estimate of the gradient is obtained by differentiating

L_F(λ_G) + (n/|A|) Σ_{i∈A} L_i(λ)

with respect to λ_G, with λ_L fixed at M(λ_G), where |A| is the cardinality of A. For this estimate, computation of all components of M(λ_G) is not required, since only the optimal local variational parameters for the mini-batch are needed.
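The unbiasedness of the mini-batch construction is easy to check numerically. Below, stand-in values play the role of the per-observation bound terms L_i and the remainder L_F; the batch-size rule (5% of n, clipped to [1, 1024]) follows the hyperparameter settings described later.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 1000
L_i = rng.normal(size=n)          # stand-ins for the per-observation bound terms L_i
L_F = 5.0                         # remainder term not tied to any observation

def minibatch_estimate(batch_frac=0.05):
    """Unbiased estimate of L = L_F + sum_i L_i from a random mini-batch A."""
    m = max(1, min(1024, int(batch_frac * n)))   # 5% of n, clipped to [1, 1024]
    A = rng.choice(n, size=m, replace=False)     # uniform sampling w/o replacement
    return L_F + (n / m) * L_i[A].sum()

full = L_F + L_i.sum()
est = np.mean([minibatch_estimate() for _ in range(4000)])   # averages toward full
```

Since each index appears in A with probability |A|/n, rescaling the batch sum by n/|A| makes the estimate unbiased for the full sum; the same argument applies term-wise to the gradient.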
To optimize λ_G we use the natural gradient (Amari, 1998) rather than the ordinary gradient. The natural gradient uses an update which respects the meaning of the variational parameters in terms of the underlying distributions they index. The natural gradient is given by I_F(λ_G)^{-1} ∇_{λ_G} L(λ_G), where I_F(λ_G) denotes the Fisher information matrix of the variational posterior. Because of the factorization of the variational posterior into independent components, it suffices to compute the submatrices of I_F(λ_G) corresponding to each factor separately.
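For a single Gaussian factor q = N(m, v) the Fisher information is diagonal, F = diag(1/v, 1/(2v²)), so the natural gradient is a simple rescaling; the numbers below are illustrative.

```python
import numpy as np

def natural_gradient_gaussian(grad_m, grad_v, v):
    """Natural gradient for a Gaussian factor q = N(m, v) with v the variance:
    premultiply the ordinary gradient (grad_m, grad_v) by the inverse Fisher
    information F^{-1} = diag(v, 2 v^2), which is block-diagonal per factor
    under the mean-field factorization."""
    return v * grad_m, 2.0 * v ** 2 * grad_v

# illustrative ordinary gradient at variance v = 2
ng_m, ng_v = natural_gradient_gaussian(grad_m=0.5, grad_v=-0.25, v=2.0)
```

The rescaling makes the step size meaningful on the scale of the distribution itself: a mean update of a given size moves a high-variance factor by the same "statistical distance" as a low-variance one.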

Algorithm
The complete algorithm can be divided into two nested steps iterating over the local and global parameters, respectively. First, an update of the global parameters is considered using a stochastic natural gradient ascent step for optimizing Lpλ G q, with the adaptive stepsize choice proposed in Ranganath et al. (2013). For estimating the gradient in this step, the optimal local parameters M pλ G q for the current λ G need to be identified for observations in a mini-batch. In theory, this can be done numerically by a gradient ascent algorithm. But this leads to a long run time, because one has to run this local gradient ascent until convergence for each step of the global stochastic gradient ascent. In our experience it is not necessary to calculate M pλ G q exactly, and it suffices to use an estimate, which helps to decrease the run time of the algorithm. A natural approximate estimator for M pλ G q can be constructed in the following way. Start by estimating the most likely path γ i using the clustering approach

Hyperparameter choices
For all experiments we set the hyperparameters G^(l) = 2, ν^(l) = 1 and A^(l) = 2.5. The hyperparameter ρ_k^(l) is set either to 1 if the number of components in the respective layer is known, or to 0.5 if the number of components is unknown; see Section 3.7 for further discussion. The size of the mini-batch is set to 5% of the data size, but not less than 1 or larger than 1024. This way the number of variational parameters which need to be updated at every step remains bounded regardless of the sample size n. We found these hyperparameter choices to work well in all the diverse scenarios considered in Sections 4, 5 and 6.

Clustering with DMFA
The algorithm described in the previous section returns optimal variational parameters λ_G* for the global parameters β. A canonical point estimator for β is then given by the mode of q_{λ*}(θ). Following Viroli and McLachlan (2019), we consider only the first layer of the model for clustering. Specifically, data point y is assigned to the first-layer component k maximizing the posterior component probability, which is proportional to p_k^(1) p(y | β, γ_k^(1) = 1). By integration over the local parameters for y, p(y | β, γ_k^(1) = 1) can be written as a mixture of ∏_{r=2}^L K^(r) Gaussians of dimension D^(0), for k = 1, ..., K^(1). This way, the parameters of the deeper layers can be viewed as defining a density approximation to the different clusters, and the overall model can be interpreted as a mixture of Gaussian mixtures. Alternatively, the DMFA can be viewed as a GMM with ∏_{l=1}^L K^(l) components, where each component corresponds to a path through the model. By considering each of these Gaussian components as a separate cluster, the DMFA can be used to find a very large number of Gaussian clusters.
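The first-layer clustering rule can be sketched as follows, with each first-layer component represented by a small Gaussian mixture standing in for the integrated-out deeper layers; all names, dimensions and parameter values are illustrative assumptions.

```python
import numpy as np

def log_gauss(y, m, S):
    """Log density of N(m, S) at y."""
    d = len(m)
    diff = y - m
    logdet = np.linalg.slogdet(S)[1]
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(S, diff))

def assign_cluster(y, weights, comp):
    """argmax_k of p_k^(1) * p(y | beta, gamma^(1)=k), where each first-layer
    component density is itself a mixture of Gaussians (the deeper-layer paths)."""
    scores = []
    for p_k, parts in zip(weights, comp):
        ls = np.array([np.log(w) + log_gauss(y, m, S) for w, m, S in parts])
        mx = ls.max()                       # log-sum-exp over the deeper paths
        scores.append(np.log(p_k) + mx + np.log(np.exp(ls - mx).sum()))
    return int(np.argmax(scores))

# toy example: two first-layer clusters, each a 2-component Gaussian mixture
I2 = np.eye(2)
comp = [
    [(0.5, np.array([-3.0, 0.0]), I2), (0.5, np.array([-5.0, 0.0]), I2)],
    [(0.5, np.array([3.0, 0.0]), I2), (0.5, np.array([5.0, 0.0]), I2)],
]
lab = assign_cluster(np.array([4.0, 0.0]), [0.5, 0.5], comp)
```

Points near the right-hand pair of components are assigned to the second first-layer cluster, even though each cluster's density is itself non-Gaussian.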
However, Selosse et al. (2020) note that some of these "global" Gaussian components might be empty, even when there are data points assigned to every component in each layer. We investigate the idea of using the DMFA to fit a large scale GMM further in Section 6.

Model architecture and model selection
So far it has been assumed that the number of layers L, the number of components in each layer K^(l), and the dimensions of each layer D^(l), l = 1, ..., L, are known. As this is usually not the case in practice, we now discuss how to choose a suitable architecture.
If the number of mixture components K^(l) in a layer is unknown, we initialize the model with a relatively large number of components, and set the Dirichlet prior hyperparameters on the component weights to ρ_k^(l) = 0.5, so that unnecessary components are emptied out during estimation. Once an architecture is chosen based on short runs, the fitting algorithm is fully run to convergence for the optimal choice.
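The overfitted-mixture heuristic amounts to a prune-and-refit loop, sketched below. The 1% occupancy threshold is an illustrative assumption, not a value taken from the paper.

```python
import numpy as np

def prune_components(weights, counts, min_frac=0.01):
    """Overfitted-mixture heuristic: start with a deliberately large K, then keep
    only components that retain a non-negligible share of the data before refitting
    with the reduced K. min_frac is an illustrative threshold."""
    total = counts.sum()
    keep = [k for k, c in enumerate(counts) if c / total >= min_frac]
    w = np.array([weights[k] for k in keep])
    return keep, w / w.sum()          # renormalize the surviving weights

keep, w = prune_components(np.array([0.40, 0.35, 0.20, 0.04, 0.01]),
                           np.array([400, 350, 200, 45, 5]))
```

With a sparsity-inducing Dirichlet hyperparameter such as ρ = 0.5, redundant components receive vanishing weight, so this pruning typically recovers a sensible K from a single overfitted run instead of refitting for every candidate K.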

Benchmarking using simulated and real data
To demonstrate the advantages of our Bayesian DMFA with respect to clustering high-dimensional non-Gaussian data, computational efficiency of model selection, accommodating sparse structure and scalability to large datasets, we experiment on several simulated and publicly available benchmark examples.

Design of numerical experiments
First we consider two simulated datasets where our Bayesian approach with sparsity-inducing priors is able to outperform the maximum likelihood method considered in Viroli and McLachlan (2019). The data generating process for the first dataset scenario S1 is a GMM with many noisy (uninformative) features and sparse covariance matrices. Here, sparsity priors are helpful because there are many noise features. The data in scenario S2 has a similar structure to the real data from the application presented in Section 5, with unbalanced non-Gaussian clusters. Here, the true generative model has highly unbalanced group sizes, and the regularization that our priors provide is useful in this setting for stabilizing the estimation.
Five real datasets considered in Viroli and McLachlan (2019) are also used for benchmarking. We compare our variational Bayes DMFA (VIdmfa) and the deep Gaussian mixture model of Viroli and McLachlan (2019) fitted by stochastic EM (EMdgmm) with a variational Bayes Gaussian mixture model (VIgmm), a skew-normal mixture (SNmm), a skew-t mixture (STmm), k-means (kmeans), partitioning around medoids (PAM), hierarchical clustering with Ward distance (Hclust), a factor mixture analyzer (FMA) and a mixture of factor analyzers (MFA). To measure how close a set of clusters is to the ground truth labels, we consider three popular performance measures (e.g. Vinh et al. (2010)): the misclassification rate (MR), the adjusted Rand index (ARI) and the adjusted mutual information (AMI).

Datasets
Below we describe the two simulated scenarios S1 and S2 involving sparsity, as well as the five real datasets used in Viroli and McLachlan (2019).
Scenario S1: Sparse location-scale mixture. Datasets with D^(0) = 30 features are drawn from a mixture of high-dimensional Gaussian distributions with sparse covariance structure. The first 15 features yield information on the clusters and are obtained from a mixture of K = 5 Gaussian distributions with different means µ_k and covariances Σ_k.
The mean for component k = 1, ..., 5 has entries specified for j = 1, ..., 15. The covariance matrices Σ_k are drawn independently via their Cholesky factors using the function make_sparse_spd_matrix from the Python package scikit-learn (Pedregosa et al., 2011). An entry of the Cholesky factor of Σ_k is set to zero with probability α = 0.95, and all non-zero entries lie in the interval [0.4, 0.7]. The remaining 15 noise (irrelevant) features are drawn independently from an N(0, 1) distribution. The number of elements in each cluster is discrete uniformly distributed on {50, ..., 200}, leading to datasets of sizes n ∈ [250, 1000]. This clustering task is not as easy as it might first appear, since only half of the features contain any information on the class labels.
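A data-generating sketch for Scenario S1 follows. Since the component-mean formula is not reproduced in the text above, the means here are drawn randomly as an illustrative stand-in, and the sparse Cholesky construction mimics scikit-learn's make_sparse_spd_matrix with plain NumPy so the sketch is self-contained.

```python
import numpy as np

rng = np.random.default_rng(4)

def sparse_spd(dim, alpha=0.95, low=0.4, high=0.7):
    """Sparse SPD covariance via a sparse unit-diagonal Cholesky factor, mimicking
    scikit-learn's make_sparse_spd_matrix: off-diagonal entries are zero with
    probability alpha, non-zeros have magnitude in [low, high]."""
    L = np.eye(dim)
    for i in range(1, dim):
        for j in range(i):
            if rng.random() > alpha:
                L[i, j] = rng.uniform(low, high) * rng.choice([-1.0, 1.0])
    return L @ L.T

def draw_s1(K=5, d_inf=15, d_noise=15):
    """One Scenario-S1 dataset: informative features from a K-component GMM with
    sparse covariances, plus N(0,1) noise features; cluster sizes ~ U{50,...,200}.
    The component means are illustrative, not the paper's formula."""
    sizes = rng.integers(50, 201, size=K)
    ys, labels = [], []
    for k in range(K):
        Sigma = sparse_spd(d_inf)
        mu_k = rng.normal(scale=2.0, size=d_inf)     # illustrative component means
        y_inf = rng.multivariate_normal(mu_k, Sigma, size=sizes[k])
        y = np.hstack([y_inf, rng.normal(size=(sizes[k], d_noise))])
        ys.append(y)
        labels += [k] * sizes[k]
    return np.vstack(ys), np.array(labels)

y, lab = draw_s1()
```

Because only the first 15 columns depend on the labels, any method that cannot shrink away the 15 pure-noise columns pays a price here, which is the point of the scenario.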
Scenario S2: Cyclic data. We follow Yeung et al. (2001) and generate synthetic data modelling sinusoidal cyclic behaviour of gene expression over time. We simulate n = 235 genes under D^(0) = 24 experiments in K = 10 classes. Two genes belong to the same class if they have similar phase shifts. Each element y_ij of the data matrix, with i = 1, ..., n and j = 1, ..., D^(0), is simulated from a sinusoidal model in which α_i ∼ N(0, 1) represents the average expression level of gene i, β_i ∼ N(3, 0.5) is the amplitude control of gene i, λ_j ∼ N(3, 0.5) models the amplitude control of condition j, while the additive experimental error δ_j and the idiosyncratic noise ε_ij are drawn independently from N(0, 1) distributions. The different classes are represented by phase shifts ω_k, k = 1, ..., K, which are assumed to be uniformly distributed, ω_k ∼ U(0, 2π). The sizes of the classes are generated according to Zipf's law, meaning that gene i is in class k with probability proportional to k^{-1}. Each observation vector y_i = (y_{i1}, ..., y_{iD^(0)})^T is individually standardized to have zero mean and unit variance. Recovering the class labels from the data can be difficult, because the classes are very unbalanced and some classes are very small; on average, only eight observations belong to cluster k = 10. Furthermore, it is hard to distinguish between classes k and k′ if ω_k is close to ω_{k′}.
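A sketch of this generator follows. The exact sinusoidal formula is not reproduced in the text above, so the functional form below (including the period of 8 time points) is an assumption consistent with the stated roles of α_i, β_i, λ_j, δ_j, ω_k and ε_ij, not a quote from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_s2(n=235, D=24, K=10, period=8.0):
    """Cyclic gene-expression data in the spirit of Scenario S2; the form
    y_ij = alpha_i + beta_i * lambda_j * sin(2*pi*j/period - omega_k) + delta_j + eps_ij
    is an illustrative assumption."""
    omega = rng.uniform(0, 2 * np.pi, size=K)        # class-specific phase shifts
    zipf = 1.0 / np.arange(1, K + 1)                 # class sizes follow Zipf's law
    classes = rng.choice(K, size=n, p=zipf / zipf.sum())
    alpha = rng.normal(0.0, 1.0, size=n)             # average expression level
    beta = rng.normal(3.0, 0.5, size=n)              # gene amplitude control
    lam = rng.normal(3.0, 0.5, size=D)               # condition amplitude control
    delta = rng.normal(0.0, 1.0, size=D)             # additive experimental error
    j = np.arange(D)
    y = (alpha[:, None]
         + beta[:, None] * lam[None, :]
           * np.sin(2 * np.pi * j / period - omega[classes][:, None])
         + delta[None, :]
         + rng.normal(size=(n, D)))                  # idiosyncratic noise eps_ij
    # standardize each observation vector individually (zero mean, unit variance)
    y = (y - y.mean(axis=1, keepdims=True)) / y.std(axis=1, keepdims=True)
    return y, classes

y, cls = draw_s2()
```

The Zipf class probabilities make the smallest classes hold only a handful of genes, reproducing the imbalance that makes label recovery hard in this scenario.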
Olive data: The dataset from the R-package cepp (Dayal, 2016) contains percentage composition of eight fatty acids found by lipid fraction of 572 Italian olive oils from three regions (323 Southern Italy; 98 Sardinia; 151 Northern Italy), see Forina et al. (1983).
Ecoli data: This dataset from (Dua and Graff, 2017) describes the amino acid sequences of 336 proteins using seven features provided by Horton and Nakai (1996). The class is the localization site (143 cytoplasm; 77 inner membrane without signal sequence; 52 periplasm; 35 inner membrane, uncleavable signal sequence; 20 outer membrane; five outer membrane lipoprotein; two inner membrane lipoprotein; two inner membrane, cleavable signal sequence).
Vehicle data: This dataset from the R-package mlbench (Leisch and Dimitriadou, 2010) describes the silhouette of one of four types of vehicles, using a set of 19 features extracted from the silhouette. It consists of 846 observations (218 double decker bus; 199 Chevrolet van; 217 Saab 9000; 212 Opel Manta 400).
Satellite data: This dataset from (Dua and Graff, 2017) consists of four digital images of the same scene in different spectral bands, structured into 3×3 squares of pixels defining the neighborhood structure. The 36 pixels define the features. We use all 6435 scenes (1533 red soil; 703 cotton crop; 1358 grey soil; 626 damp grey soil; 707 soil with vegetation stubble; 1508 very damp grey soil) as observations.
All real datasets are mean-centered and componentwise scaled to have unit variance.

Results
To compare the overall clustering performance we assume in a first step that the true number of clusters is known and fixed for all methods. Data-driven choice of the number of clusters is considered later. For Scenarios S1 and S2, R = 100 datasets were independently generated.
Boxplots of the ARI, AMI and MR across replicates are presented in Figure 1 for S1 and S2.
Our method outperforms other approaches in both simulated scenarios. Even though the data generating process in Scenario S1 is a GMM, VIdmfa outperforms the classical GMM.
One reason for this is that the data simulated in S1 is relatively noisy, and all information regarding the clusters is contained in a subspace smaller than the number of features. This is not only an assumption for VIdmfa, but also for FMA and MFA. However, compared to the latter two, VIdmfa is able to better recover the sparsity of the covariance matrices due to the shrinkage prior as Figure 2 demonstrates. The clusters derived by VIdmfa are Gaussian in this scenario, because the deeper layer has only one component. The data generating process of Scenario S2 is non-Gaussian and so are the clusters derived by VIdmfa, due to the fact that the deeper layer might have multiple components. The performance on the real data examples is summarized in Table 1. Here our approach is competitive with the other methods considered, but does not outperform the EMdgmm approach.
The assumption that the number of clusters is known is an artificial one. Clustering is often the starting point of the data investigation process and used to discover structure in the data. Selecting the number of clusters in those settings is sometimes a difficult task. In Bayesian mixture modelling this difficulty can be overcome by initializing the model with a relatively large number of components and using a shrinkage prior on the component weights, so that unnecessary components empty out (Rousseau and Mengersen, 2011). This is also the idea of our model selection process described in Section 3.7. To indicate that this is also a valid approach for VIdmfa, we fit both Scenarios S1 and S2 for different choices of K^(0) ranging from 2 to 16 and compare the average ARI and AMI over 10 independent replicates. The results can be found in Figure 3. This suggests that our method is able to find the general structure of the data even when K^(0) is larger than necessary, but the best results are obtained when the model is initialized with the correct number of components. We have found it useful to fit the model with a potentially large number of components and then to refit the model with the number of components selected. This observation justifies our approach to selecting the number of components for the deeper layers discussed in Section 3.7. In both real data examples the optimal model architecture is unknown and VIdmfa selects a reasonable model.

Table 1: The MR, ARI and AMI for the real datasets, rounded to three decimals; the best result is marked in each column. For each method the best out of 10 runs (according either to BIC or the highest ELBO) is presented.

Application to Gene Expression Data
In microarray experiments, levels of gene expression for a large number of genes are measured under different experimental conditions. In this example we consider a time course experiment where the data can be represented by a real-valued expression matrix {m_ij, 1 ≤ i ≤ n, 1 ≤ j ≤ D}, where m_ij is the expression level of gene i at time j. We will consider rows of this matrix to be observations, so that we are clustering the genes according to the time series of their expression values.
Our model is well-suited to the analysis of gene expression datasets. In many time course microarray experiments both the number of genes and the number of time points are large, and our VIdmfa method scales well with both the sample size and the dimension. If the time series expression profiles are smooth, their dependence may be well represented using a low-rank factor structure. Our VIdmfa method also provides a computationally efficient mechanism for choosing the number of mixture components. In the simulated Scenario S2 of Section 4, VIdmfa is able to detect unbalanced and comparably small clusters, which is the main advantage of our approach over the benchmark methods on that simulated gene expression data set; similar behaviour is expected in real data applications.
As an example we consider the Yeast data set from Yeung et al. (2001), originally considered in Cho et al. (1998). This data set has been analyzed by many previous authors (Medvedovic and Sivaganesan, 2002; Tamayo et al., 1999; Lukashin and Fuchs, 2001) and contains $n = 384$ genes over two cell cycles ($D = 17$ time points) whose expression levels peak at different phases of the cell cycle. The goal is to cluster the genes according to those peaks, and the structure is similar to the simulated cyclic data in Scenario S2 from Section 4. The genes were assigned to five cell cycle phases by Cho et al. (1998), but other cluster assignments with more groups are available (Medvedovic and Sivaganesan, 2002; Lukashin and Fuchs, 2001). Hence, the optimal number of clusters for this data set is unknown, and we aim to illustrate that VIdmfa is able to select the number of clusters meaningfully. Also, the gene expression levels change relatively quickly between time points in this data set, suggesting a sparse correlation structure; VIdmfa is well suited to this scenario due to its sparsity-inducing priors.
As in Scenario S2 and following Yeung et al. (2001), we individually scale each gene to have zero mean and unit variance. The simulation study in Section 4 suggests that VIdmfa can be initialized with a large number of potential components when the number of clusters is unknown, since the unnecessary components empty out. When initialized with $K^{(0)} = \lfloor \sqrt{n} \rfloor = 19$ potential components, the model selection process described in Section 3.7 selects a model with $L = 2$ layers with dimensions $D^{(1)} = 4$, $D^{(2)} = 1$ and $K^{(2)} = 1$ components in the second layer. VIdmfa returns twelve clusters, as shown in Figure 4. While some of the clusters are small, others are large and match well with the clusters proposed in Cho et al. (1998); for example, cluster eleven corresponds to their largest cluster. Comparisons to other kinds of information would be needed to decide whether all twelve clusters are useful here, or whether some of them should be merged. For example, clusters two and twelve peak at similar phases of the cell cycle and could possibly be merged, while cluster nine does not seem to have a similar cluster and could be kept even though it contains only $n_9 = 6$ genes. An investigation of the fitted DMFA suggests that the clusters differ mainly by their means, while the covariance matrices are highly sparse, which matches our expectation of (distant) time points being only weakly correlated.

Figure 1: Scenario S1 (a) and Scenario S2 (b). Boxplots summarize row-wise the MR and ARI across 100 replications. Only the best performing methods are shown to make the plots more informative; a plot containing all benchmark methods and runs outside the interquartile range can be found in the Web Appendix C.3.

Figure 2: Scenario S1. Heat-maps of the true, the empirical and the estimated covariance matrices for each of the 5 clusters. For the calculation of the empirical covariance matrices the true labels of the data points were used, which are not available in practice.

Figure 3: Scenario S1 (a, b) and Scenario S2 (c, d). The average AMI (a, c) and ARI (b, d) across 10 independent replicates for different choices of the initial number of clusters $K^{(0)} = 2, \dots, 16$. The x-axis denotes the initialized number of clusters and the y-axis the average AMI/ARI. The optimal number of clusters for each data set is denoted by the dashed line.

Figure 4: Yeast data. VIdmfa selects twelve clusters when initialized with $K^{(0)} = 19$. For each cluster the cluster mean (bold line), as well as the minimum and maximum (dotted lines) of the normalized expression patterns are shown, and $n$ denotes the number of elements in the respective cluster. The x-axis denotes time and the y-axis the normalized gene expression.
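The per-gene standardization and the initialization heuristic $K^{(0)} = \lfloor\sqrt{n}\rfloor$ used for the Yeast data can be sketched as follows. The expression matrix here is simulated with the same shape as the real data, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(384, 17))   # stand-in: n = 384 genes, D = 17 time points

# Scale each gene (row) to zero mean and unit variance, as in Yeung et al. (2001).
expr_std = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

# Heuristic initial number of components: K0 = floor(sqrt(n)).
K0 = int(np.floor(np.sqrt(expr.shape[0])))
print(K0)  # 19
```

Starting from these 19 potential components, unnecessary ones empty out during fitting.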

Application to Taxi Trajectories
In this section, we consider a complex high-dimensional clustering problem for taxi trajectories. We use the publicly available data set of Moreira-Matias et al. (2013), which has previously been analyzed by first reducing the dimension of the trajectories and then fitting a Gaussian mixture model in a second step. This approach is successful in finding hidden structures in the data such as busy roads and frequently taken paths. We will illustrate that VIdgmm is a good alternative, because it does not require a separate dimension reduction step and is able to fit a GMM with a much larger number of components. VIdgmm also scales well with the sample size, due to the sub-sampling approach described in Section 3.3.
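The scalability in the sample size rests on sub-sampling: the sum over observations in the objective is replaced by an unbiased minibatch estimate. A minimal sketch, with a stand-in per-observation term rather than the actual ELBO of our model:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)            # stand-in data set
obj = lambda xi: -0.5 * xi**2        # stand-in per-observation objective term

full = obj(x).sum()                  # full-data objective (cost grows with n)

# Unbiased minibatch estimate: average over a random batch, rescaled by n,
# so that its expectation equals the full-data sum.
n, batch_size = x.size, 100
idx = rng.choice(n, size=batch_size, replace=False)
estimate = n * obj(x[idx]).mean()

print(full, estimate)                # stochastic, but close in expectation
```

Each gradient step then only touches a minibatch, so the per-iteration cost is independent of the total sample size.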
In our analysis we focus on the paths of the trajectories, ignoring the temporal dependencies, and interpolate the data accordingly. Each trajectory $f_i$ can be viewed as a mapping $f_i(t): \{t_0, \dots, t_{T_i}\} \to \mathbb{R}^2$ of $T_i + 1$ equally spaced time points onto coordinate pairs $(x_{i,t_h}, y_{i,t_h})$ for $h = 0, \dots, T_i$, with 50 time points on average. We consider the interpolation $\tilde{f}_i$, where $L_{i,h} = \sum_{s=0}^{h-1} \| f_i(t_s) - f_i(t_{s+1}) \|$ is the length of the trajectory up to time point $t_h$. This interpolation is time-independent in the sense that $d(\tilde{f}_i(t), \tilde{f}_i(t+\delta))$ is linear in $t$ for all $\delta > 0$, where $d(\cdot, \cdot)$ denotes the distance between two points along a linear interpolation of the trajectory $f_i$. In this way two trajectories $f_i$ and $f_{i'}$ with the same path, i.e. the same image in $\mathbb{R}^2$, have similar interpolations $\tilde{f}_i \approx \tilde{f}_{i'}$, even when the taxi on one trajectory was much slower than the taxi on the other. Additionally, we consider a tour and its time reversal as equivalent, and therefore replace an observation by its time reversal if the origin point after reordering is closer to the city center. We use a discretization of $\tilde{f}_i(t)$ at 50 equally spaced points $t = 0, \frac{1}{49}, \frac{2}{49}, \dots, 1$ as feature inputs for our clustering algorithm, leading to data points in $\mathbb{R}^{100}$. To avoid numerical errors, all coordinates are centered around the city center and scaled to have unit variance.
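A time-independent interpolation of this kind can be realized as an arc-length reparameterization: compute the cumulative segment lengths $L_{i,h}$ and resample the path at 50 points equally spaced in arc length. The sketch below is one plausible implementation of the construction, not the paper's exact formula; the sine-curve paths are made up for illustration:

```python
import numpy as np

def resample_by_arclength(path, num=50):
    """Resample a 2-D path of shape (T+1, 2) at `num` points equally
    spaced in arc length along the linear interpolation of the path."""
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)  # segment lengths
    L = np.concatenate([[0.0], np.cumsum(seg)])          # L[h]: length up to t_h
    s = np.linspace(0.0, L[-1], num)                     # equally spaced arc lengths
    x = np.interp(s, L, path[:, 0])
    y = np.interp(s, L, path[:, 1])
    return np.column_stack([x, y])

# Two traversals of the same path with different speed profiles
# yield (nearly) the same feature vector in R^100.
t = np.linspace(0.0, 1.0, 81)
path_a = np.column_stack([t, np.sin(2 * np.pi * t)])     # steady taxi
w = t ** 2                                               # slow start, fast finish
path_b = np.column_stack([w, np.sin(2 * np.pi * w)])     # same path, warped timing

feat_a = resample_by_arclength(path_a).ravel()           # shape (100,)
feat_b = resample_by_arclength(path_b).ravel()
print(feat_a.shape)  # (100,)
```

Because the features depend only on the image of the path, a slow and a fast taxi on the same route map to nearly identical inputs for the clustering algorithm.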
The architecture chosen has $L = 2$ layers with ten components in each layer, and the dimensions are set to $D^{(1)} = 5$ and $D^{(2)} = 2$. This model is equivalent to a 100-dimensional GMM with 100 components, and we consider each of these Gaussian components as a separate cluster. In fact, each observation can be matched to one path $k = (k_1, k_2)$ through the model, building Gaussian clusters. Linking the different paths to the clusters leads to a natural hierarchical clustering, where the clusters are built layer-wise. First, the observations are divided into $K^{(1)} = 10$ large main clusters based on the components of the first layer, where an observation is in cluster $k_1$ if and only if $\gamma^{(1)}_{ik_1} = 1$. Those clusters are then divided further into $K^{(2)} = 10$ sub-clusters: an observation of main cluster $k_1$ is in sub-cluster $k_2$ if $\gamma^{(1)}_{ik_1} = 1$ and $\gamma^{(2)}_{ik_2} = 1$, leading to $K^{(1)} \times K^{(2)} = 100$ potential clusters in total. A graphical representation of this idea based on our fitted DGMM can be found in Figure 5.
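The layer-wise path labels can be combined into flat cluster indices and decomposed again into the hierarchy. A minimal sketch, where `k1` and `k2` are made-up component assignments standing in for the fitted indicators $\gamma^{(1)}$ and $\gamma^{(2)}$:

```python
import numpy as np

K1, K2 = 10, 10                          # components per layer
rng = np.random.default_rng(2)

# Stand-ins for the fitted component assignments of each layer.
k1 = rng.integers(0, K1, size=1000)      # first-layer component per observation
k2 = rng.integers(0, K2, size=1000)      # second-layer component per observation

# Path k = (k1, k2) -> one of K1 * K2 = 100 flat cluster labels.
flat = k1 * K2 + k2

# The flat label recovers both levels of the hierarchy.
main_cluster = flat // K2                # coarse, first-layer view
sub_cluster = flat % K2                  # refinement within each main cluster
print(len(np.unique(flat)))              # number of non-empty clusters (at most 100)
```

This is why the nested structure comes for free: the main clusters give a broad overview, and each one splits into at most $K^{(2)}$ sub-clusters.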
Even though this hierarchical representation does not have a very clear interpretation for all clusters, it can be used as a starting point for further exploration. City planners and urban decision makers could be interested in a comparison between the start and end points of trajectories in the various clusters, to further evaluate the need for connections, for example via public transport, between different regions of the city. Here the nested structure of the clusters can be helpful when scanning the clusters for interesting patterns: the ten large main clusters based on the first layer give a broad overview, while the sub-clusters allow a more nuanced analysis. An example is given in Figure 6.

Conclusion and Discussion
In this paper, we introduced a new method for high-dimensional clustering. The method uses sparsity-inducing priors to regularize the estimation of the DMFA model of Viroli and McLachlan (2019), and uses variational inference methods for computation to achieve scalability to large datasets. We consider the use of overfitted mixtures and ELBO values from short runs to choose a suitable architecture in a computationally thrifty way.
As noted recently by Selosse et al. (2020), deep latent variable models like the DMFA are challenging to fit. It is difficult to estimate large numbers of latent variables, and there are many local modes in the likelihood function. While our sparsity priors are helpful in some cases for obtaining improved clustering and making the estimation more reliable, there is much more work to be done in understanding and robustifying the training of these models.
The way that parameters for mixture components at different layers are combined by considering all paths through the network allows a large number of mixture components without a large number of parameters, but this feature may also result in probability mass being placed in regions where there is no data. It is also not easy to interpret how the deeper layers of these models assist in explaining variation. A deeper understanding of these issues could lead to further improved architectures and interesting new applications in high-dimensional density estimation and regression.

Figure 5: Taxi data. The nested clustering for 10000 randomly selected trajectories is shown. There are ten non-empty components in the first layer and nine non-empty components in the second layer, corresponding to a GMM with 90 components shown in the first row. These clusters can be divided into ten subsets containing up to nine clusters each; the ten subsets are shown in the second row. Note that many trajectories connect to endpoints outside the shown region. The map was taken from OpenStreetMap.

Figure 6: Taxi data. VIdmfa detects two clusters consisting of short trajectories connected to the harbour and seaside. Both clusters belong to the same main cluster 5. Cluster [5, 2] connects to a larger region on the west side of the city outside the center, with the taxis passing through smaller streets, while cluster [5, 4] connects to the city center, where the taxis pass through one of three parallel streets. The dots denote the start and end points of the trajectories.