Model-based clustering of multinomial count data

We consider the problem of inferring an unknown number of clusters in replicated multinomial data. Under a model-based clustering point of view, this task can be treated by estimating finite mixtures of multinomial distributions, with or without covariates. Both Maximum Likelihood (ML) and Bayesian estimation are considered. Under the Maximum Likelihood approach, we provide an Expectation-Maximization (EM) algorithm which exploits a careful initialization procedure combined with a ridge-stabilized implementation of the Newton-Raphson method in the M-step. Under the Bayesian setup, a stochastic gradient Markov chain Monte Carlo (MCMC) algorithm embedded within a prior parallel tempering scheme is devised. The number of clusters is selected according to the Integrated Completed Likelihood criterion in the ML approach, and by estimating the number of non-empty components of an overfitting mixture model in the Bayesian case. Our method is illustrated on simulated data and applied to two real datasets. An R package is available at https://github.com/mqbssppe/multinomialLogitMix.


Introduction
Multinomial count data arise in various applications and clustering them is a task of particular interest (Bouguila, 2008; Chen et al., 2020). Finite mixture models are widely used for clustering heterogeneous datasets. Their applicability extends beyond the model-based clustering framework, since they also provide a means for semiparametric inference; for example, mixtures of multinomial distributions can model extra-multinomial variation in count data.
In many instances, the resulting inference can be improved by taking into account the presence of covariates, when available. Naturally, the framework of mixtures of multinomial logistic regressions can be used for dealing with such data under a model-based clustering point of view. These models belong to the broader family of mixtures of generalized linear models (Leisch, 2007, 2008), which are estimated either under a maximum likelihood approach via the EM algorithm (Dempster et al., 1977) or in a Bayesian fashion using MCMC sampling.
Various latent class models are based on mixtures of multinomial distributions. Durante et al. (2019) cluster multivariate categorical data by estimating mixtures of products of multinomial distributions, under the presence of covariates in the mixing proportions. Galindo estimates latent class models using Bayesian Maximum A Posteriori estimation and illustrates via simulations that the Bayesian approach is more accurate than maximum likelihood estimation. More general latent class models based on multinomial distributions include hidden Markov models and Markov random fields.
In this paper, our goal is to cluster multinomial count data using finite mixtures of multinomial logistic regression models. Before proceeding we introduce some notation. Let Y = (Y_1, ..., Y_J; Y_{J+1})^⊤ denote a random vector distributed according to a multinomial distribution, Y ∼ M_{J+1}(S, θ).
The R package mixtools can be used to estimate mixtures of multinomial distributions (among numerous other functionalities) under a maximum likelihood approach using the EM algorithm. However, the usage of covariates is not considered. On the other hand, the flexmix package (Leisch, 2007, 2008) can estimate mixtures of multinomial logistic regression models using the FLXMRmultinom() function, which also implements the EM algorithm. The package allows the user to run the EM algorithm repeatedly for different numbers of components and returns the maximum likelihood solution for each. However, alternative (and perhaps more efficient) initialization schemes are not considered. Finally, a fully Bayesian implementation is not available in either package.
The contribution of the present study is to offer an integrated approach to the problem of clustering multinomial count data using mixtures of multinomial logit models. For this purpose we use frequentist as well as Bayesian methods. Both the EM algorithm (for the frequentist approach) and the MCMC sampler (for the Bayesian approach) are carefully implemented in order to deal with various computational and inferential challenges imposed by the complex nature of mixture likelihoods and posterior distributions. First, it is well known that the EM algorithm may converge to local modes of the likelihood surface. We tackle this problem by extending an initialization strategy originally suggested for mixtures of Poisson regression models. Second, we implement a ridge-stabilized version of the Newton-Raphson algorithm in the M-step. This adjustment is based on a quadratic approximation of the function of interest on a suitably chosen spherical region and effectively avoids many of the pitfalls of standard Newton-Raphson iterations. In the presented applications and simulation studies, our interest lies in cases where the data consist of a large number of replications for each multinomial observation. When the number of replicates is small, identifiability of the model is not guaranteed.
Under a Bayesian approach, traditional methods estimate the number of clusters using reversible jump MCMC or the birth-death MCMC technique. In multivariate settings, however, the practical application of these methods is limited. More recently, alternative Bayesian methods for estimating the number of clusters focus on the use of overfitting mixture models, information-theoretic techniques which post-process MCMC samples of partitions in order to summarize the posterior distribution in Bayesian clustering models, and generalized mixtures of finite mixtures. Our Bayesian model combines recent advances on overfitting mixture models with stochastic gradient MCMC sampling and with running parallel MCMC chains which can exchange states. Moreover, we efficiently deal with the label switching problem using the Equivalence Classes Representatives (ECR) algorithm. In such a way, the returned MCMC output is directly interpretable and provides various summaries of marginal posterior distributions, such as posterior means and Bayesian credible intervals of the parameters of interest.
The combination of Maximum Likelihood and Bayesian estimation provides additional insights: it is demonstrated that the best-performing approach is to initialize the MCMC algorithm using information from the solution obtained by the EM implementation. Therefore, our proposed method provides a powerful and practical approach for estimating the unknown number of clusters and the related parameters in multinomial count datasets.
The rest of the paper is organized as follows. Maximum likelihood estimation of finite mixtures of multinomial distributions with or without covariates via the EM algorithm is described in Section 2. The careful treatment of the M-step is extensively described in Section 2.2. Section 2.3 discusses initialization issues in the EM implementation. Section 2.4 describes the selection of the number of clusters under the EM algorithm. The Bayesian formulation is described in Section 3. The proposed MCMC sampler is introduced in Section 3.1. Section 3.2 describes the estimation of the number of clusters using overfitting mixture models as well as how we deal with the label switching problem. Applications are illustrated in Section 4. The paper concludes with a Discussion in Section 5. An Appendix contains further implementation details, additional simulation results and comparisons with alternative approaches (flexmix).

Maximum Likelihood estimation via the EM algorithm
In this section we describe the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) for estimating mixtures of multinomial logistic regressions. In the case of covariates, the complete log-likelihood is written as

log L_c(π, β) = Σ_{i=1}^n Σ_{k=1}^K z_ik { log π_k + log c_i + Σ_{j=1}^{J+1} y_ij log θ_ij^{(k)} },

where c_i = S_i! / ∏_{j=1}^{J+1} y_ij!. The EM algorithm proceeds by computing the expectation of the complete log-likelihood (see Section 2.1) with respect to the latent allocation variables Z (given y and x). Then, the expected complete log-likelihood is maximized with respect to the parameters π, β (see Section 2.2), given the current expected values of the missing data. In the case of mixtures of multinomial logistic regressions this task can become quite challenging, since typical numerical implementations (such as the standard Newton-Raphson algorithm) may fail. For this reason, it is crucial to apply more robust numerical implementations, as discussed in Section 2.2. In Section 2.3 special attention is given to the important issue of initialization of the EM algorithm. Section 2.4 describes the selection of the number of clusters under the EM algorithm.

Expectation step
The expectation step (E-step) consists of evaluating the expected complete log-likelihood with respect to the conditional distribution of Z given the observed data y (and x in the covariates case), as well as a current estimate of the parameters (π^(t), β^(t)). Define the posterior membership probabilities w_ik as

w_ik = π_k^(t) f(y_i | x_i, β_k^(t)) / Σ_{l=1}^K π_l^(t) f(y_i | x_i, β_l^(t)), i = 1, ..., n; k = 1, ..., K.
Note that, according to the Maximum A Posteriori (MAP) rule, the estimated clusters are obtained as ẑ_i = argmax_k w_ik, i = 1, ..., n. The expected complete log-likelihood is equal to

Q(π, β) = Σ_{i=1}^n Σ_{k=1}^K w_ik { log π_k + log c_i + Σ_{j=1}^{J+1} y_ij log θ_ij^{(k)} },

where the current parameter values (π^(t), β^(t)) are used to compute w_ik.
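As a concrete illustration, the E-step in the covariate-free case reduces to a few array operations. The following is a minimal Python sketch (the accompanying package is written in R; the function name and array layout here are our own, not the package's API):

```python
import numpy as np

def e_step(y, log_pi, log_theta):
    """Posterior membership probabilities w_ik for a mixture of multinomial
    distributions without covariates.
    y: (n, J+1) count matrix, log_pi: (K,), log_theta: (K, J+1)."""
    # log f_k(y_i) up to the multinomial coefficient log c_i, which is common
    # to all components and therefore cancels in w_ik
    log_w = log_pi[None, :] + y @ log_theta.T          # (n, K)
    log_w -= log_w.max(axis=1, keepdims=True)          # stabilise exp()
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)
```

Working on the log scale with a row-wise maximum subtraction avoids underflow when the counts S_i are large.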

Maximization step
In the maximization step (M-step), (10) is maximized with respect to the parameters π, β. The maximization of the expected complete log-likelihood with respect to the mixing proportions leads to

π_k^{(t+1)} = (1/n) Σ_{i=1}^n w_ik, k = 1, ..., K.

The maximization with respect to β is analytically tractable only when P = 1 (that is, a model with just a constant term). Recall that when no covariates are present the model is reparameterized with respect to the multinomial probabilities θ_kj, and the expected complete log-likelihood is maximized with respect to θ_kj. The analytical solution of the M-step in this case is

θ_kj^{(t+1)} = Σ_{i=1}^n w_ik y_ij / Σ_{i=1}^n w_ik S_i,

for k = 1, ..., K and j = 1, ..., J + 1. In the case where P ⩾ 2, numerical methods are implemented. We have used two optimization techniques: the typical Newton-Raphson algorithm, as well as a ridge-stabilized version. The partial derivative of (10) with respect to β_kjp is

∂Q/∂β_kjp = Σ_{i=1}^n w_ik (y_ij − S_i θ_ij^{(k)}) x_ip,

for k = 1, ..., K, j = 1, ..., J and p = 1, ..., P. Thus, defining g_ikj := y_ij − S_i θ_ij^{(k)} and g_ik := (g_ik1, ..., g_ikJ)^⊤, k = 1, ..., K, the gradient vector for component k can be expressed as Σ_{i=1}^n w_ik (g_ik ⊗ x_i), where ⊗ denotes the Kronecker product.
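The closed-form updates for the covariate-free case can be sketched in a few lines of Python (an illustrative sketch under our own naming, not the package implementation):

```python
import numpy as np

def m_step_no_covariates(y, w):
    """Closed-form M-step for the covariate-free case.
    y: (n, J+1) count matrix, w: (n, K) posterior membership probabilities.
    Returns updated mixing proportions pi (K,) and probabilities theta (K, J+1)."""
    pi = w.mean(axis=0)                    # pi_k = (1/n) sum_i w_ik
    num = w.T @ y                          # (K, J+1): sum_i w_ik y_ij
    # row sums of num equal sum_i w_ik S_i, since sum_j y_ij = S_i
    theta = num / num.sum(axis=1, keepdims=True)
    return pi, theta
```

Note that the denominator Σ_i w_ik S_i is obtained for free as the row sum of the weighted count matrix, since the counts of each observation sum to S_i.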
The second partial derivative of (10) with respect to β_kjp and β_k′j′p′ is

∂²Q/∂β_kjp ∂β_k′j′p′ = −δ_kk′ Σ_{i=1}^n w_ik S_i θ_ij^{(k)} (δ_jj′ − θ_ij′^{(k)}) x_ip x_ip′,

where δ_ij denotes the Kronecker delta, for k, k′ = 1, ..., K, j, j′ = 1, ..., J and p, p′ = 1, ..., P. Note that the corresponding Hessian is a block diagonal matrix consisting of K blocks H_k, each of them being a JP × JP-dimensional matrix. This is particularly useful because the inverse of this KJP × KJP-dimensional matrix is the corresponding block diagonal matrix of the inverses H_k^{-1}. Consequently, the Newton-Raphson update can be performed independently for each k = 1, ..., K, as described in the sequel. In order to maximize the expected complete log-likelihood with respect to β we used a ridge-stabilized version of the Newton-Raphson algorithm.
Denote by β^{(t,1)} the initial value of β at the M-step of iteration t of the EM algorithm. Then, the typical Newton-Raphson update at the (m+1)-th iteration takes the form

β_k^{(t,m+1)} = β_k^{(t,m)} − H_k^{-1} ∇_k, k = 1, ..., K,

with the gradient ∇_k and Hessian H_k evaluated at β_k^{(t,m)}. Let M denote the last iteration of the sequence of Newton-Raphson updates; the updated value of β for iteration t of the EM algorithm is then β^{(t+1)} = β^{(t,M)}. When a second-order Taylor expansion is a good approximation of the underlying function around a maximum, the Newton-Raphson method converges rapidly (Crockett et al., 1955). In general settings, however, it may happen that the step of the basic update in Equation (14) is too large, or that −H fails to be positive definite, in which case the quadratic approximation has no validity.
The following technique addresses these issues by maximizing a quadratic approximation to the function on a suitably chosen spherical region. The ridge-stabilized algorithm is based on the updates

β_k^{(t,m+1)} = β_k^{(t,m)} − (H_k − αI)^{-1} ∇_k, with α = λ_1 + R ||∇_k||,

where λ_1 and ||x|| denote the largest eigenvalue of H_k and the length of vector x, respectively. The parameter R controls the step size of the update: smaller values result in larger step sizes. This parameter is adjusted as the algorithm progresses: the step size tends to increase when the quadratic approximation appears to be satisfactory. Figure 1 illustrates the two algorithms using simulated data of n = 250 observations from a typical (K = 1) multinomial logit model with D = 6 categories and a varying number of explanatory variables p. In each case, the same random starting value was used for both the standard Newton-Raphson and the modified version. Observe that, especially as the number of parameters increases, the standard Newton-Raphson updates may decrease the log-likelihood function. On the other hand, the ridge-stabilized version produces a sequence of updates which converges to the mode of the log-likelihood function (as indicated by the gray line).
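A single ridge-stabilized step can be sketched as follows. This is an illustrative Python version of a quadratic hill-climbing update; we take α = max(0, λ_1 + R||g||) so that the pure Newton step is recovered whenever the Hessian is safely negative definite, which is one common variant and may differ from the exact safeguards used in the paper's implementation:

```python
import numpy as np

def ridge_stabilized_step(beta, grad, hess, R=1.0):
    """One ridge-stabilized Newton-Raphson update for maximization:
    beta_new = beta - (H - alpha I)^{-1} g, alpha = max(0, lambda_1 + R ||g||),
    where lambda_1 is the largest eigenvalue of the Hessian H."""
    lam1 = np.linalg.eigvalsh(hess).max()
    alpha = max(0.0, lam1 + R * np.linalg.norm(grad))
    # For alpha > 0, H - alpha I is negative definite by construction,
    # so the step is always well defined even when H is indefinite.
    return beta - np.linalg.solve(hess - alpha * np.eye(len(beta)), grad)
```

On a concave quadratic target the iterates reduce to plain Newton steps once the gradient is small, so the scheme inherits fast local convergence while remaining safe far from the mode.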

EM Initialization
Careful selection of initial values for the EM algorithm is crucial (Biernacki et al., 2003) in order to avoid convergence to minor modes of the likelihood surface. A small-EM (Biernacki et al., 2003) procedure is used. A small-EM initialization refers to the strategy of starting the main EM algorithm from values arising from a series of short runs of the EM. Each run consists of a small number of iterations (say 5 or 10), under different starting values. The selected values that will be used to initialize the main EM algorithm are the ones that correspond to the largest log-likelihood across all small-EM runs. The starting values of each small-EM run are selected according to three alternative strategies, namely the random, split and shake small-EM schemes, described in detail in the sequel.
We write w_ik^{(K)}, i = 1, ..., n; k = 1, ..., K, in order to explicitly refer to the estimated membership probabilities arising from a mixture model with K components, where π and β are the parameter estimates obtained at the last iteration of the EM algorithm.
Random small-EM This strategy corresponds to the selection of M random starting values and running the EM for a small number (say T = 5 or 10) of iterations. The parameters of the run which results in the largest log-likelihood value at the last (T-th) iteration are used to initialize the main EM algorithm. The random selection can refer either to choosing random values for the coefficients of the multinomial logit model or for the posterior membership probabilities. The latter scheme is followed in our approach; in particular, each row of the n × K matrix of posterior probabilities is generated according to the U(0, 1) distribution and then normalized according to the unit sum constraint.
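The random start described above amounts to the following sketch (Python, with illustrative names; the package itself is in R):

```python
import numpy as np
rng = np.random.default_rng(0)

def random_membership(n, K):
    """Random small-EM start: each row of the n x K membership matrix is drawn
    from U(0, 1) and normalised to satisfy the unit sum constraint."""
    w = rng.uniform(size=(n, K))
    return w / w.sum(axis=1, keepdims=True)
```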
Split small-EM This strategy begins the EM algorithm from a model that underestimates the number of clusters and consecutively adds one component using a splitting procedure among the previously estimated clusters. In our setup, the procedure begins with estimating the one-component (K = 1) mixture model. Then, for g = 2, ..., K, we estimate a g-component mixture by proposing to randomly split clusters obtained by the estimated model corresponding to g − 1 components. The way that clusters are split is determined by a random transformation of the estimated posterior classification probabilities (cf. Equation (18)). Given w_ik^{(g-1)}, denote by I_1, ..., I_{g-1} the clusters obtained by applying the Maximum A Posteriori rule on the estimated model with g − 1 components. First, a non-empty component I_{g*} is chosen at random among {I_1, ..., I_{g-1}}. Second, a new component labelled g is formed by splitting the selected cluster I_{g*} into two new ones, via a random transformation of the estimated posterior probabilities:

w_{i,g*}^{(g)} = u_i w_{i,g*}^{(g-1)}, w_{i,g}^{(g)} = (1 − u_i) w_{i,g*}^{(g-1)}, w_{ik}^{(g)} = w_{ik}^{(g-1)} otherwise,

where u_i ∼ Beta(a, b), for i = 1, ..., n, with Beta(a, b) denoting the Beta distribution with parameters a > 0 and b > 0. In our examples we have used a = b = 1, that is, a uniform distribution on (0, 1). Another valid option would be to set a = b < 1 in order to enforce greater cluster separation. Finally, the EM algorithm for a mixture with g components starts by plugging in {w_ik^{(g)}, i = 1, ..., n; k = 1, ..., g} as starting values for the posterior membership probabilities. This procedure is repeated M_split times by running small EM algorithms, and the run resulting in the largest log-likelihood value is chosen to start the main EM for model g. We will refer to this strategy as the split-small-EM initialization scheme. A comparison between the random-small-EM strategy using mixtools and the split-small-EM scheme for a mixture of 8 multinomial distributions is shown in Figure 2.
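The splitting transformation can be sketched as follows (a hedged Python illustration; the displayed Beta-based transformation is the one described above, but function and variable names are ours):

```python
import numpy as np
rng = np.random.default_rng(1)

def split_component(w, g_star, a=1.0, b=1.0):
    """Split small-EM proposal: component g_star of a membership matrix w with
    n rows and g-1 columns is split in two via u_i ~ Beta(a, b); the new
    component is appended as the last column, giving an n x g matrix."""
    n = w.shape[0]
    u = rng.beta(a, b, size=n)
    w_new = np.column_stack([w, (1.0 - u) * w[:, g_star]])  # new component g
    w_new[:, g_star] = u * w[:, g_star]                     # mass kept in g_star
    return w_new
```

Each row still sums to one after the split, since the selected column's mass is merely redistributed between the old and the new component.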
More detailed comparisons between the random small-EM initializations are reported in the simulation study of Section 4.1 and in Appendix B.
Shake small-EM Assume that there are at least K ⩾ 2 clusters in the fitted model and that the estimated posterior membership probabilities are equal to w_ik^{(K)}, i = 1, ..., n, k = 1, ..., K. We randomly select 2 of them (say k_1 and k_2) and propose to randomly re-allocate the assigned observations within those 2 clusters. More specifically, let I_{k_1} and I_{k_2} denote the observations assigned (according to the MAP rule) to clusters k_1 and k_2, respectively. A small-EM algorithm is initialized by a state which uses a matrix (w′_ik^{(K)}) obtained by a random perturbation of the posterior probabilities of the observations in I_{k_1} ∪ I_{k_2}. Note that in this way only the posterior probabilities of those observations assigned to components k_1 and k_2 are affected. This procedure is repeated M_shake times and the run leading to the highest log-likelihood value after T EM iterations is selected in order to initialize the algorithm. We will refer to this strategy as the shake small-EM initialization. The aforementioned initialization schemes will be compared in our simulation study in Section 4.1 (see also Appendix B). We will use the notation EM(M_split, M_shake, M_random) to refer to a small-EM initialization consisting of M_split split-small-EM rounds, followed by M_shake shake-small-EM rounds and M_random random-small-EM rounds.

Estimation of the number of clusters under the EM algorithm
There is a plethora of techniques for selecting the number of components in a mixture model. One of the most popular choices is the Bayesian Information Criterion (BIC), defined as

BIC(K) = −2 log L(θ̂_K) + d_K log n,

where θ̂_K and d_K denote the Maximum Likelihood estimate and the number of free parameters of the mixture model with K components, respectively. Another criterion, particularly suited to the task of model-based clustering, is the Integrated Completed Likelihood (ICL) criterion (Biernacki et al., 2000).
It has been demonstrated that BIC may overestimate the number of clusters. In what follows, the number of clusters in the EM approach is selected according to the ICL criterion.
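Both criteria are cheap to compute from the fitted model. The sketch below uses the widely used "ICL-BIC" approximation, BIC plus twice the entropy of the posterior membership probabilities; the paper's exact ICL definition may differ in constants, so treat this as illustrative:

```python
import numpy as np

def bic(loglik, d, n):
    """BIC(K) = -2 log L(theta_hat_K) + d_K log n (smaller is better)."""
    return -2.0 * loglik + d * np.log(n)

def icl_bic(loglik, d, n, w, eps=1e-12):
    """ICL-BIC approximation: BIC plus twice the entropy of the posterior
    membership probabilities w (n, K), penalising poorly separated clusters."""
    ent = -np.sum(w * np.log(w + eps))
    return bic(loglik, d, n) + 2.0 * ent
```

With a crisp (0/1) classification the entropy term vanishes and ICL coincides with BIC; overlapping clusters inflate ICL and are therefore penalized.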

Bayesian Formulation
We assume that the mixing proportions of the mixture model (7) follow a Dirichlet prior distribution, that is, π ∼ D(α_1, ..., α_K), for some fixed hyper-parameters α_k > 0, k = 1, ..., K. Usually, there is no prior information separating the components from each other, so we typically set α_1 = ... = α_K = α > 0 (see also Section 3.2). The prior distribution of the coefficients β_kjp is normal, centered on zero, that is, β_kjp ∼ N(0, ν²), independently for k = 1, ..., K, j = 1, ..., J, p = 1, ..., P. The prior variance ν² is assumed constant. A default value of ν² = 100 was used in all our examples presented in subsequent sections, which corresponds essentially to a vague prior distribution; however, we will also consider a more informative choice (ν = 1) in order to penalize large values of the coefficients. We furthermore assume that β and π are a-priori independent random variables; thus, the joint prior distribution of the parameters and latent allocation variables factorizes as p(π, β, z) = p(π) p(β) p(z | π). The joint posterior distribution is then written as p(z, π, β | y, x, K) ∝ f(y | x, z, β) p(z | π) p(π) p(β).

A Hybrid Metropolis-Adjusted-Langevin within Gibbs MCMC Algorithm
From Equation (9) it follows that the full conditional posterior distribution of the latent allocation vector factorizes over observations, with P(z_i = k | · · ·) ∝ π_k f(y_i | x_i, β_k), independently for i = 1, ..., n.
The full conditional posterior distribution of the mixing proportions is a Dirichlet distribution, π | · · · ∼ D(α_1 + n_1, ..., α_K + n_K), where n_k = Σ_{i=1}^n z_ik. For the regression coefficients we use a Metropolis-Hastings step, although other approaches based on the Gibbs sampler have been proposed (Dellaportas and Smith, 1993). Note however that these approaches impose additional augmentation steps in the hierarchical model and have been applied only to simple (that is, K = 1) logistic regression models.
One could use a random walk for proposing updates to β, but it is well known that the large number of parameters would lead to slow mixing and poor convergence of the MCMC sampler. In order to overcome this issue, we use a proposal distribution based on the gradient information of the full conditional distribution. The Metropolis Adjusted Langevin Algorithm (MALA) is based on the proposal mechanism

β̃ = β^{(t)} + τ ∇ log f(β^{(t)} | y, x, z, π) + √(2τ) ε,

where ε ∼ N(0, I) and ∇ log f(β^{(t)} | y, x, z, π) denotes the gradient vector of the logarithm of the full conditional of β, evaluated at β^{(t)}. In order to select a value of τ with a reasonable acceptance rate between proposed moves, the MCMC sampler runs for an initial warm-up period. During this period τ is adaptively tuned as the sampler progresses, in order to achieve acceptance rates of the proposed updates between user-specified limits (see Appendix A for details). The final value of τ is the one used in the subsequent main MCMC sampler. The derivative of the logarithm of the joint posterior distribution of β, conditional on z and π, consists of two terms: the first corresponds to the derivative of the complete log-likelihood (that is, given z), while the second corresponds to the derivative of the prior distribution in (20). The proposal in (23) is accepted according to the usual Metropolis-Hastings probability, that is,

min{1, [f(β̃ | y, x, z, π) P(β̃ → β^{(t)})] / [f(β^{(t)} | y, x, z, π) P(β^{(t)} → β̃)]},

where P(a → b) denotes the probability density of proposing state b while in a. From (23), P(β^{(t)} → β̃) is the density of the N(β^{(t)} + τ ∇ log f(β^{(t)} | y, x, z, π), 2τ I_{KJP}) distribution, evaluated at β̃. The density of the reverse transition β̃ → β^{(t)} is the density of the N(β̃ + τ ∇ log f(β̃ | y, x, z, π), 2τ I_{KJP}) distribution, evaluated at β^{(t)}. The overall procedure is summarized in Algorithm 1.
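A generic MALA update can be sketched as follows. This is a minimal, self-contained Python illustration of the proposal-plus-correction mechanism described above, not the paper's implementation; the target here is supplied by the user as log-density and gradient callables:

```python
import numpy as np
rng = np.random.default_rng(2)

def mala_step(beta, log_post, grad_log_post, tau):
    """One MALA update: propose beta' = beta + tau * grad + sqrt(2 tau) * eps,
    accept with the Metropolis-Hastings probability using the asymmetric
    Gaussian proposal density N(beta + tau * grad, 2 tau I)."""
    def log_q(to, frm):
        # log proposal density up to an additive constant (cancels in the ratio)
        mean = frm + tau * grad_log_post(frm)
        return -np.sum((to - mean) ** 2) / (4.0 * tau)
    prop = beta + tau * grad_log_post(beta) + np.sqrt(2.0 * tau) * rng.normal(size=beta.shape)
    log_alpha = (log_post(prop) - log_post(beta)
                 + log_q(beta, prop) - log_q(prop, beta))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return beta, False
```

The asymmetry correction log_q(beta, prop) - log_q(prop, beta) is what distinguishes MALA from unadjusted Langevin dynamics and makes the target the exact stationary distribution.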

Estimation of the number of clusters using overfitting Bayesian mixtures
The Bayesian setup allows estimating the number of clusters by using overfitting mixture models, that is, models where the number of mixture components is much larger than the number of clusters. Let K_max > K denote an upper bound on the number of clusters and define the overfitting mixture model

f(y) = Σ_{k=1}^{K_max} π_k f_k(y),

where f_k ∈ F_Θ = {f(· | θ); θ ∈ Θ} denotes a member of a parametric family of distributions. Let also d denote the dimension of the free parameters of the distribution f_k(·); for instance, in the case of a mixture of multinomial logistic regression models with J + 1 categories and P covariates (including the constant term), d = JP. It has been shown that the asymptotic behavior of the K_max − K extra components depends on the prior distribution of the mixing proportions π = (π_1, ..., π_{K_max}). For the case of a Dirichlet prior as in Equation (19), if max{α_k; k = 1, ..., K_max} < d/2, the posterior weight of the extra components tends to zero as n → ∞, forcing the posterior distribution to put all its mass on the sparsest way to approximate the true density.
Following Papastamoulis (2018), we set α_1 = ... = α_{K_max} = α, so the distribution of the mixing proportions in Equation (19) becomes π ∼ D(α, ..., α), where 0 < α < d/2 denotes a pre-specified positive number. Therefore, the inference on the number of mixture components can be based on the posterior distribution of the "alive" components of the overfitted model, that is, the components which contain at least one allocated observation. In order to estimate the number of clusters we only have to keep track of the number of components with at least one allocated observation across the MCMC run, that is, the number of distinct values in the simulated allocation vector (z_1^{(t)}, ..., z_n^{(t)}), where z_i^{(t)} denotes the simulated allocation for observation i at MCMC iteration t = 1, 2, ....
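Tracking the alive components at each iteration is a one-liner; a Python sketch (naming is ours):

```python
import numpy as np

def n_alive(z, K_max):
    """Number of 'alive' components at one MCMC iteration: components of the
    overfitted mixture with at least one allocated observation.
    z: (n,) vector of allocations taking values in {0, ..., K_max - 1}."""
    counts = np.bincount(z, minlength=K_max)
    return int(np.sum(counts > 0))
```

The posterior distribution of the number of clusters is then estimated from the MCMC sample of these counts after the burn-in period.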
In order to produce an MCMC sample from the joint posterior distribution of the parameters of the overfitting mixture model (including the number of clusters), we embed the scheme described in Section 3.1 within a prior parallel tempering scheme. Each heated chain (c = 1, ..., C) corresponds to a model with a likelihood identical to the original, but with a different prior distribution. Although the prior tempering could in principle be imposed on any part of the prior, only the Dirichlet prior of the mixing proportions is tempered here.
Let us denote by f_c(φ|x) and f_c(φ), c = 1, ..., C, the posterior and prior distribution of the c-th chain, respectively. Obviously, f_c(φ|x) ∝ f(x|φ) f_c(φ). Let φ_c^{(t)} denote the state of chain c at iteration t and assume that a swap between chains c_1 and c_2 is proposed. The proposed move is accepted with probability min{1, A(π_{c_1}, π_{c_2})}, where

A(π_{c_1}, π_{c_2}) = [f_{c_1}(π_{c_2}) f_{c_2}(π_{c_1})] / [f_{c_1}(π_{c_1}) f_{c_2}(π_{c_2})],

and f_c(·) corresponds to the probability density function of the Dirichlet prior distribution related to chain c = 1, ..., C. According to Equation (26), this is the density of the D(α^{(c)}, ..., α^{(c)}) distribution, for a pre-specified set of parameters α^{(c)} > 0 for chain c = 1, ..., C.
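Since the likelihood is identical across chains, only the Dirichlet prior densities enter the swap ratio; it can be computed on the log scale as follows (an illustrative Python sketch with a hand-rolled symmetric Dirichlet log-density, not the package code):

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(x, alpha):
    """Log-density of a symmetric Dirichlet D(alpha, ..., alpha) at x (simplex)."""
    K = len(x)
    return (lgamma(K * alpha) - K * lgamma(alpha)
            + (alpha - 1.0) * np.sum(np.log(x)))

def log_swap_accept(pi1, pi2, alpha1, alpha2):
    """Log acceptance ratio for a prior-tempering swap between two chains with
    symmetric Dirichlet hyper-parameters alpha1 and alpha2."""
    return (dirichlet_logpdf(pi2, alpha1) + dirichlet_logpdf(pi1, alpha2)
            - dirichlet_logpdf(pi1, alpha1) - dirichlet_logpdf(pi2, alpha2))
```

When the two chains share the same hyper-parameter, or when they hold identical states, the log ratio is exactly zero and the swap is always accepted, as expected.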
When estimating a Bayesian mixture model, a well-known problem stems from the label switching phenomenon, which arises from the fact that both the likelihood and the prior distribution are invariant to permutations of the labels of the mixture components. The posterior distribution of the parameters is then also invariant, so the parameters are not marginally identifiable. We deal with this problem by post-processing the MCMC output of the overfitting mixture via the ECR algorithm (Papastamoulis and Iliopoulos, 2010). Note that after post-processing the MCMC output for correcting label switching, the estimated classification for observation i is obtained as the mode of the (reordered) simulated values of Z_i in Equation (21) across the MCMC run (after discarding the draws corresponding to the burn-in period of the sampler), i = 1, ..., n. For more details the reader is referred to the label.switching package. The overall procedure is summarized in Algorithm 2.
Regarding the initialization of the overfitting mixture model (see Step 0 of Algorithm 2), we use two alternative approaches. The first is based on random starting values (the "MCMC-RANDOM" scheme) and the second is more elaborate, exploiting the output of the EM algorithm under the split-small-EM scheme (the "MCMC-EM" scheme). As expected, the latter performs better, as illustrated in the simulation studies. The overall procedure of the MCMC sampler is summarized in Algorithms 1 and 2. The typical choices of the parameters, as well as further details on the prior parallel tempering scheme and the initialization schemes, are given in Section A of the Appendix.

Applications
In Section 4.1 we use a simulation study in order to evaluate and rank the proposed methods in terms of their ability to cluster multinomial data. Next, we present two applications on real data: in Section 4.2 our method is used to identify clusters of age profiles within a regional unit in Greece, and in Section 4.3 we study clusters of Facebook engagement metrics in Thailand. Further simulation results and comparisons with flexmix are reported in Appendix C.

[Algorithm 2, in outline. Step 1: for each chain c = 1, ..., C, update (z, π, β) via the MALA-within-Gibbs scheme of Algorithm 1, with the scale of the MALA proposal equal to τ = τ^{(c)}. Step 2 (chain swapping): randomly choose 1 ⩽ c ⩽ C − 1, set c_1 = c, c_2 = c + 1 and propose a swap between chains c_1 and c_2. Step 3 (undo label switching): apply the ECR algorithm on the output of chain c = 1 and return an identifiable MCMC sample {z^{(t)}, π_alive^{(t)}, β_alive^{(t)}; t = 1, ..., T}.]
See Table 1 for the simulation study set-up. For the EM implementations, the three numbers in parentheses refer to the number of split, shake and random small-EM runs. For the MCMC implementations, the number in parentheses denotes the prior variance (ν²) of the coefficients β_kjp in Equation (20).
In order to evaluate the ability of the proposed methods to cluster multinomial count data, we considered synthetic datasets generated from a mixture of multinomial logistic regression models (7). The number of multinomial replicates (S_i) per observation is drawn from a negative binomial distribution, S_i ∼ NB(r, p), with number of successful trials r = 20 and probability of success p = 0.025. We simulated 500 datasets in total, where the values of n, K, P, J are uniformly drawn in the range of values shown in Table 1. Given K, the weight of each cluster was equal to π_k ∝ k, k = 1, ..., K. Notice that this setup gives rise to mixture models with a total number of free parameters ranging from 10 up to 535.
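The data-generating choices above (negative binomial totals and linearly increasing weights) can be sketched as follows; the function name and return layout are illustrative, not taken from the paper:

```python
import numpy as np
rng = np.random.default_rng(3)

def simulate_sizes_and_weights(n, K):
    """Simulation set-up sketch: multinomial totals S_i ~ NB(r = 20, p = 0.025)
    (mean r(1-p)/p = 780 replicates) and cluster weights pi_k proportional to k."""
    S = rng.negative_binomial(20, 0.025, size=n)
    pi = np.arange(1, K + 1) / (K * (K + 1) / 2.0)  # normalised so sum(pi) = 1
    return S, pi
```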
The true values of the regression coefficients were simulated (conditionally) independently from a spike-and-slab distribution for k = 1, ..., K; j = 1, ..., J; p = 1, ..., P, where I_{0}(β) denotes a discrete distribution degenerate at 0 and ϕ(β; μ, σ²) denotes the density function of the normal distribution with mean μ and variance σ². We applied the proposed methodology in order to estimate mixtures of multinomial logit models. In particular, we compared the EM algorithm under three initialization schemes, as well as the MCMC sampling scheme under random initialization and under an initialization based on the output of the EM algorithm (under the split-shake-random small-EM scheme). In total we considered 24 different starts in the small-EM schemes: a random small-EM with 24 starts, EM(0, 0, 24); a combination of split and random small-EM with 12 starts each, EM(12, 0, 12); and a combination of split, random and shake small-EM with 8 starts each, EM(8, 8, 8). The total number of MCMC iterations is held fixed at 100,000. We also present results when considering double the number of iterations (both in the warm-up period as well as in the main MCMC sampler) under the MCMC-RANDOM scheme, which we denote by MCMC-RANDOM-2x. Finally, we considered two different values for the prior variance of the coefficients in Equation (20): ν = 1 and ν² = 100. The first choice corresponds to an informative prior distribution, heavily penalizing large values of |β_kjp|; the second corresponds to a vague prior distribution. The chosen value of ν² will be denoted in parentheses, that is, MCMC-RANDOM(ν²) and MCMC-EM(ν²) indicate the output of the MCMC algorithm with random and EM initialization schemes, respectively, and prior variance equal to ν². See Appendix A for further details on the various other parameters of the EM and MCMC algorithms. Figure 3 presents a graphical summary of the simulation study findings, based on our 500 synthetic datasets.
We focus on the following metrics. The left panel shows the mean relative absolute error |K̂ − K|/K between the estimated number of clusters (K̂) and the corresponding true value (K). The right panel displays the mean of the adjusted Rand index (with respect to the ground-truth classification) subtracted from 1. In all cases we conclude that the EM algorithm with a random small-EM initialization (denoted as EM(0, 0, 24)) is worse than the split-shake-random small-EM initialization (denoted as EM(8, 8, 8)). Regarding the MCMC sampler, we see that the random initialization scheme (MCMC-RANDOM) is worse than the EM-based initialization (MCMC-EM) when both run for the same number of iterations. However, as the number of iterations increases in the randomly initialized MCMC sampler (MCMC-RANDOM-2x), the results improve, particularly for the mean relative absolute error of the estimated number of clusters. Overall, the MCMC algorithm initialized by the (split-small) EM solution is the best-performing method, closely followed by the EM algorithm under the split-small-EM scheme. Naturally, the informative prior distribution (ν = 1, corresponding to the green-coloured bars in Figure 3) outperforms the vague prior distribution (ν² = 100, corresponding to the red-coloured bars). More detailed summaries of the resulting estimates are given in Appendix A.

Phthiotis Population Dataset
In this example we present an application of our methodology to clustering areas within a certain region with respect to the age profiles of their population, also taking into account geographical covariate information. For this purpose we considered population data based on the 2011 census of Eurostat. We considered the Phthiotis area, a regional unit located in central Greece. Our extracted dataset consists of the number of people per age group (21 groups: 0−4, 5−9, . . . , 95−99, > 99 years old) for a total of n = 187 settlements (such as villages, towns, and the central city of the regional unit). The research question is to cluster these settlements based on the age profiles of their population. If we cluster the raw dataset of age counts within each group using a mixture of multinomial distributions without any covariate information, then a large number of clusters is found. In particular, when using the mixtools package, a relatively large number of clusters is selected (K = 8 using ICL). Therefore, we opt to apply our method using the following covariate information for settlement i = 1, . . . , n:
x_i1: distance (in km) from Lamia (capital city of the regional unit);
x_i2: logarithm of the altitude (elevation, in m).
Both covariates were scaled to zero mean and unit variance. Let us denote by y_ij the number of people in age group j = 1, . . . , 21 for settlement i = 1, . . . , n. The probability θ^(i)_kj denotes the proportion of the population in age group j conditional on the event that settlement i belongs to cluster k, where Σ_{j=1}^{21} θ^(i)_kj = 1 for all k. Conditional on cluster k = 1, . . . , K, the random vector Y_i = (Y_i1, . . . , Y_i,21)^⊤ is distributed according to a multinomial distribution M(S_i; θ^(i)_k1, . . . , θ^(i)_k,21), where S_i denotes the total population of settlement i, for i = 1, . . . , n. Note that we have used the 1st category (ages between 0 and 4) as baseline in order to express the log-odds of the remaining groups. The distribution of counts per age group is written as a mixture of multinomial distributions, where π_k denotes the weight of cluster k. Hence, each cluster represents areas with a different age profile, as reflected by the corresponding vector of multinomial probabilities. The total number of clusters (K) is unknown.
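To make the parameterization concrete, the following sketch maps a settlement's covariates to its cluster-conditional multinomial probabilities via the inverse multinomial-logit transform. The coefficient values, covariates and population total here are hypothetical:

```r
# Inverse multinomial-logit map for one settlement (hypothetical values).
# Age group 1 (ages 0-4) is the baseline category, as in the text.
set.seed(1)
J <- 21; P <- 3                       # 21 age groups; constant + 2 covariates
beta <- matrix(rnorm((J - 1) * P, sd = 0.3), nrow = J - 1)  # hypothetical
x_i  <- c(1, 0.5, -1.2)               # intercept, scaled distance, scaled log-altitude

eta   <- as.vector(beta %*% x_i)      # log-odds of groups 2..21 vs group 1
theta <- c(1, exp(eta)) / (1 + sum(exp(eta)))   # cluster-conditional probabilities

# multinomial likelihood of observed age counts y_i, given total population S_i
S_i <- 350
y_i <- rmultinom(1, size = S_i, prob = theta)[, 1]
lik <- dmultinom(y_i, size = S_i, prob = theta)
```

Summing such cluster-conditional terms over k with weights π_k yields the mixture likelihood of settlement i.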
At first we used the EM algorithm under the proposed initialization scheme to estimate mixtures of multinomial logistic regression models for a series of K = 1, 2, . . . , K_max = 10 components. According to the ICL criterion, the selected number of clusters is equal to K = 3. Next we estimated an overfitting Bayesian mixture of K_max = 10 components, using a prior parallel tempering scheme based on 12 chains. The MCMC algorithm was initialized from the EM solution, while all remaining parameters were initialized at zero. The MCMC sampler ran for a warm-up period of 100000 iterations, followed by 400000 iterations. A thinned MCMC sample of 20000 iterations was retained for inference. In almost all MCMC draws the number of non-empty mixture components was equal to K_0 = 3 (estimated posterior probability equal to 99.5%). The retained MCMC sample was then post-processed according to the ECR algorithm in order to undo label switching. The confusion matrix of the single best clusterings between the two methods (EM and MCMC) is displayed in Table 2. The corresponding adjusted Rand index is equal to 0.67, indicating strong agreement between the two resulting partitions. Next we focus on the results according to the MCMC algorithm (after post-processing). Figure 5 illustrates the posterior mean (and 95% credible region) of the probability θ^(i)_kj for age group j in cluster k = 1, 2, 3. Three characteristic configurations of covariate levels were used, namely the 0.1, 0.5 and 0.9 percentiles of the two covariates. In all cases we see distinct age group characteristics; most evident is the presence of a group which contains places with younger age profiles (cluster 1). In cluster 3, notice a strong peak at the ages between 76 and 84, which emerges for moderate to large values of the two covariates. In cluster 2, the peak is also located at the older age groups; however, it is less pronounced compared to cluster 3.
Figure 6 visualizes the three clusters on the map of the regional unit. We may conclude that cluster 1 (the "younger" cluster) mainly consists of settlements that are either located close to Lamia (gray spot on the map), including Lamia itself, or have a total population larger than 1000 (towns such as Makrakomi, Malessina, Sperchiada, Atalanti, Domokos and Stavros, and the central city of Lamia). However, this younger group of age profiles is also present in some of the most distant and mountainous southwestern areas (Dafni, Neochori Ypatis, Kastanea, Pavliani and Anatoli: the altitude of these small villages is larger than 1000 m). In general, however, as we move further away from Lamia, the "older" and "eldest" clusters dominate, particularly in areas with small populations. See also the histogram of settlement populations per cluster in Figure 7. Note that the majority of smaller villages (with populations of approximately 100 citizens) are mainly assigned to the third cluster (the eldest group).
Finally, we note that this specific application involves an ordinal rather than a nominal response. Therefore, one could use alternative techniques to model the data, such as proportional odds models, or models that smooth the changes between adjacent categories.

Facebook Live Sellers in Thailand Data Set
The dataset of Dehouche (2020) (see also ) contains engagement metrics of Facebook pages for Thai fashion and cosmetics retail sellers. We consider the number of emoji reactions for each Facebook post, which are known as "like", "love", "wow", "haha", "sad" and "angry". The aim of our analysis is to cluster posts based on their reaction profiles, using additional covariate information. Each post can be of a different nature ("video", "photo" or "status"), which we take into account as a categorical predictor. In addition, we also use as a covariate the number of shares per post (in log-scale). The dataset is available at the UCI machine learning repository 3 .
We considered the period between 2017-01-01 and 2018-12-31, taking into account posts with a minimum overall number of reactions equal to 40. We then randomly selected 100 posts per type (100 videos, 100 photos and 100 statuses), that is, 300 posts in total. The observed data are displayed in Figure 8 (note that, for the sole purpose of visualization in log-scale, each count is increased by 1). It is evident that most reactions correspond to "loves" and "likes". There is also some visual evidence that videos may result in a larger number of "loves" compared to photos or statuses. On the other hand, many posts lead to zero counts for any kind of reaction other than "like". So we might expect that such a dataset exhibits heterogeneity, due to zero inflation in the first five categories. Thus, it makes sense to cluster posts according to their reaction profiles, i.e. the reaction probabilities.
Let us denote by y_i = (y_i1, y_i2, y_i3, y_i4, y_i5, y_i6)^⊤ the observed vector of reaction counts for post i = 1, . . . , n (n = 300), where y_ij denotes the number of reactions of type j for post i. We assume that y_i, conditional on post type and number of shares (as well as the total number of reactions for that particular post), is distributed according to a mixture of multinomial distributions with J + 1 = 6 categories.
The type of each post serves as a categorical predictor with three levels ("video", "photo" and "status"). Selecting the probability of "like" as the reference category and conditional on cluster k = 1, . . . , K, the multinomial logit model is written as
log(θ^(i)_kj / θ^(i)_k6) = β_kj1 + β_kj2 x^photo_i + β_kj3 x^status_i + β_kj4 x^shares_i,
where θ^(i)_kj denotes the probability of reaction j, corresponding to "angry" (j = 1), "sad" (j = 2), "haha" (j = 3), "wow" (j = 4), "love" (j = 5) and "like" (j = 6). Note that, since the categorical predictor consists of three levels, we created the two dummy variables x^photo_i and x^status_i after selecting the "video" type as the baseline. In addition, x^shares_i denotes the number of shares for post i (in log-scale).
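The dummy coding and the resulting reaction probabilities can be sketched as follows; the coefficient values and covariate values are hypothetical, while the baselines ("video" for post type, "like" for reactions) are those of the text:

```r
# Dummy coding and reaction probabilities for the Facebook model
# (hypothetical coefficients and covariates).
set.seed(2)
dat <- data.frame(
  post_type  = factor(c("video", "photo", "status"),
                      levels = c("video", "photo", "status")),
  log_shares = c(2.1, 3.0, 0.7)
)
X <- model.matrix(~ post_type + log_shares, data = dat)
# columns: (Intercept), post_typephoto, post_typestatus, log_shares

beta_k <- matrix(rnorm(5 * 4, sd = 0.5), nrow = 5)  # hypothetical cluster-k coefficients
eta    <- X %*% t(beta_k)                           # log-odds of reactions j = 1..5 vs "like"
theta  <- cbind(exp(eta), 1)
theta  <- theta / rowSums(theta)                    # probabilities over the 6 reactions
```

Each row of `theta` gives the six reaction probabilities of a post, conditional on cluster k.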
We applied our method using the EM algorithm with the proposed initialization scheme, as well as the MCMC sampler using an overfitting mixture model with K_max = 10 components. A total of 12 chains under the prior parallel tempering scheme were considered. The MCMC sampler ran for an initial warm-up period of 100000 iterations, followed by 400000 iterations. A thinned MCMC sample of 20000 iterations was retained for inference. Both methods select K = 4 clusters. In the MCMC sampler we considered two different levels of the prior variance of the regression coefficients, namely ν² = 100 (vague prior distribution) and ν² = 1 (informative prior).
More specifically, for the EM algorithm the minimum value of ICL is equal to 4610.89 (corresponding to a model with K = 4 clusters), while for the MCMC sampler the mode of the posterior distribution of the number of non-empty mixture components corresponds to K_0 = 4, with P̂(K_0 = 4 | data) = 0.67 for the sampler with ν² = 100. The same number of components is also selected under the prior distribution with ν² = 1, where P̂(K_0 = 4 | data) = 0.80. The confusion matrix of the single best clusterings arising after applying the Maximum A Posteriori rule is displayed in Table 3. The corresponding adjusted Rand indices between the partitions are equal to ARI(EM, MCMC(100)) ≈ 0.96, ARI(EM, MCMC(1)) ≈ 0.74 and ARI(MCMC(100), MCMC(1)) ≈ 0.76, indicating a high level of agreement between the three approaches.
Next we are concerned with the identifiability of the selected model with K = 4 components. Recall that in our extracted dataset the minimum number of reactions is equal to 40; thus, condition 1.(a) in Theorem 2 is satisfied. If we were considering only the categorical predictor (post type) in our model, the number of distinct hyperplanes (lines in this case) needed to cover the covariates of each cluster would be equal to 2: e.g., one line covering the points (0, 0) (the origin) and (x^photo, x^status) = (1, 0), and another covering (0, 1); hence the corresponding condition is violated. However, this condition is satisfied after including a continuous covariate with cluster-specific effect (the number of shares). Finally, the generated MCMC sample has been post-processed according to the ECR algorithm in order to deal with the label switching issue. Figure 9 displays the posterior mean of the reaction probabilities per cluster and the corresponding (equally tailed) 95% credible intervals, for both prior setups. The continuous covariate (number of shares) is set equal to its observed mean per post type. In all cases, there is an increased probability of a "love" reaction when the post is a video, compared to a photo or a status. However, the average probability of such a reaction differs between the clusters, with the most notable difference obtained in cluster 4. Finally, notice the similarity of the cluster profiles under both prior distributions.

Table 3: Confusion matrix between the single best clusterings of the Facebook Live Sellers dataset arising from the EM and MCMC algorithms with prior variance of the regression coefficients equal to ν² = 100 and ν² = 1 (after post-processing the MCMC output to correct for label switching).

Discussion
The problem of clustering multinomial count data in the presence of covariates has been treated under both a frequentist and a Bayesian approach. Our simulations showed that the proposed models perform well, provided that the suggested estimation and initialization schemes are used. The application of our method to clustering real count datasets demonstrates the interpretability of our approach on real-world data. Our contributed R package makes the method directly available to the research community. Under the frequentist approach we demonstrated that an efficient initialization (i.e. the split-shake-random small-EM scheme in Section 2.3) yields improved results compared to a more standard random small-EM initialization scheme. Furthermore, a crucial point for the implementation of the maximization step of the EM algorithm is the control of the step size of the Newton-Raphson iterations, which was achieved using the ridge-stabilized version in Section 2.2.
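The step-size control mentioned above can be illustrated with a generic ridge-stabilized Newton step on a toy concave objective. This is only a sketch of the idea, not the paper's M-step: the Hessian is shifted by a ridge term R, and R is inflated until the proposed step increases the objective:

```r
# Generic ridge-stabilized Newton step on a toy concave objective
# (illustration only; not the paper's M-step).
f    <- function(b) -sum((b - 1)^2)       # toy objective, maximized at (1, 1)
grad <- function(b) -2 * (b - 1)
hess <- function(b) diag(-2, length(b))

ridge_newton_step <- function(b, R = 0.1, max_tries = 20) {
  for (i in seq_len(max_tries)) {
    # shift the (negative definite) Hessian by -R * I, shrinking the step
    step <- solve(hess(b) - R * diag(length(b)), -grad(b))
    if (f(b + step) > f(b)) return(b + step)   # accept only ascent steps
    R <- 4 * R                                 # otherwise inflate the ridge
  }
  b
}

b <- c(5, -3)
for (it in 1:50) b <- ridge_newton_step(b)     # converges to the maximizer
```

Larger values of R make the update closer to a small gradient step, while R → 0 recovers the plain Newton step.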
We did not address the issue of estimating standard errors in our EM implementation. However, these can be obtained by approximating the covariance matrix of the estimates by the inverse of the observed information matrix, or by using bootstrap approaches. Maximum likelihood estimation with the EM algorithm can also be modified to provide Maximum A Posteriori estimates under a regularized likelihood approach, as implemented in Galindo.
The Bayesian framework of Section 3 has clear benefits over the frequentist approach, albeit at the cost of increased computing time. As demonstrated in our simulations, the proposed MCMC scheme outperforms the EM algorithm in terms of the estimation of the number of clusters as well as the clustering of the observed data, as measured by the adjusted Rand index. Moreover, the Bayesian setup allows for even greater flexibility in the resulting inference, such as the calculation of Bayesian credible intervals from the MCMC output, which provide a direct assessment of the uncertainty in our point estimates. For this purpose we used state-of-the-art algorithms that deal with the label switching problem in mixture models, suitably adjusted to the special framework of overfitting mixtures.
A natural and interesting extension of our research is to consider the problem of variable selection in model-based clustering (Dean and Raftery, 2010). In the Bayesian setting, one could consider alternative prior distributions for the multinomial logit coefficients per cluster, e.g. spike-and-slab or shrinkage prior distributions that encourage sparsity in the model. Another direction for future research is to combine our mixture model with alternative Bayesian logistic regression models that exploit data augmentation schemes (Choi and Hobert, 2013) and assess whether MCMC inference is improved.

Declarations

A Details of the EM and MCMC algorithms
Main MCMC sampler: After the warm-up period, the MCMC sampler runs for a total of T = 2600 cycles. Each cycle consists of m_1 = 20 iterations. A chain swap is attempted at the end of each MCMC cycle. The results are obtained by retaining the last 2500 cycles of the MCMC sampler (discarding the first 100 cycles as a burn-in period).
Note that the initial warm-up period and the main MCMC sampler consist of 48000 + 2600 × 20 = 100000 MCMC iterations in total.
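The iteration bookkeeping above can be written out explicitly (values as stated in this appendix):

```r
# Iteration bookkeeping of the main MCMC sampler, as described above.
warm_up        <- 48000    # adaptive warm-up iterations
mcmc_cycles    <- 2600     # T: cycles after warm-up
iter_per_cycle <- 20       # m1: iterations per cycle
burn_cycles    <- 100      # cycles discarded as burn-in

total_iterations <- warm_up + mcmc_cycles * iter_per_cycle   # 100000
retained_cycles  <- mcmc_cycles - burn_cycles                # 2500
```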
MCMC sampler initialization: Each chain was initialized under two different schemes: a scheme based on randomly selected starting values (referred to as "MCMC-RANDOM") and a more elaborate initialization scheme based on the output of the EM algorithm (the "MCMC-EM" starting scheme). More specifically, in "MCMC-RANDOM" all parameters are initialized by simulating from the prior distributions. In "MCMC-EM" we first estimate the number of clusters as well as the model parameters according to the EM algorithm under our split-small-EM scheme. Let us denote by K̂^(EM) the selected number of clusters according to the EM algorithm (using ICL) under the split-small-EM initialization. Next, consider a Bayesian overfitting mixture with K_max > K̂^(EM) components. The parameters of the first K̂^(EM) components are all set equal to the values of the corresponding parameters obtained at the last iteration of the EM algorithm for that particular model. The parameters of the remaining K_max − K̂^(EM) components are all initialized at zero. Finally, a random permutation is drawn among the initial parameters of the K_max components for each of the different chains, in order to encourage the presence of the label switching phenomenon in the MCMC sampler.
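The final permutation step can be sketched as follows. This is a toy version with a single scalar parameter per component (the actual parameters are arrays of coefficients), and the variable names are ours:

```r
# Randomly permuting component labels across chains (toy sketch).
set.seed(42)
K_max <- 10; K_em <- 3; n_chains <- 12

# EM-based values for the first K_em components, zeros for the rest
init <- c(rnorm(K_em), rep(0, K_max - K_em))

# each chain receives the same components under a different random relabelling
chain_inits <- replicate(n_chains, init[sample(K_max)])   # K_max x n_chains
```

Each column holds one chain's starting values: the same K_max components, relabelled at random.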

B Computational details -package in R
The computational pipeline for the proposed methodology has been implemented in R. It is publicly available as a contributed R package named multinomialLogitMix at https://CRAN.R-project.org/package=multinomialLogitMix. The proposal scheme of the MALA sampler is implemented via the Rcpp (Eddelbuettel and François, 2011; Eddelbuettel, 2013; Eddelbuettel and Balamuta, 2018) and RcppArmadillo packages, which integrate R and C++. Figure .1 illustrates the tremendous gain in computing time when the plain R code is replaced with Rcpp.
The basic pipeline is illustrated next. For this purpose we use a simulated dataset similar to the ones in Section 4.1. Now let us explore the output based on the EM algorithm only. At first we retrieve the selected number of clusters according to ICL. Then we display the estimated clustering conditional on the selected value. Finally, we retrieve the estimated parameters (mixing proportions and coefficients of the mixture of multinomial logits) of the model. Note that in the last chunk of the output the rows correspond to multinomial categories (there are 6 categories in total, so j = 1, . . . , 5) and the columns correspond to covariates (p = 1, . . . , 3), per cluster. For example, the estimate of the coefficient of the second multinomial category (j = 2) for the second covariate (so p = 3, because the model includes a constant term) for cluster 1 (k = 1) is equal to β̂_kjp = β̂_{1,2,3} = 2.21. The corresponding estimate for cluster 2 is equal to β̂_kjp = β̂_{2,2,3} = −0.02.
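The indexing convention of the coefficient output can be illustrated on a plain array. The values 2.21 and −0.02 are the ones quoted above; the array layout is ours for illustration, and the package's actual return structure may differ:

```r
# Coefficient indexing beta[k, j, p]: cluster k, multinomial category j
# (baseline excluded, so j = 1..5), covariate p (the constant term is p = 1).
K <- 2; J <- 5; P <- 3
beta_hat <- array(0, dim = c(K, J, P))
beta_hat[1, 2, 3] <-  2.21   # cluster 1, category j = 2, second covariate (p = 3)
beta_hat[2, 2, 3] <- -0.02   # the same coefficient for cluster 2
```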
Let's explore the MCMC output now. We stress once again that the raw MCMC sample of the overfitting mixture is not directly interpretable due to label switching. Moreover, bear in mind that it also contains the values of the empty mixture components, which are not relevant for practical purposes. These points are illustrated in Figure .2, which displays the raw MCMC output for β_{k,2,3} for all components k = 1, 2, . . . , 10 of the overfitting mixture model. A careful inspection of this graph reveals that, up to a switching of the labels, the sampled values of 2 (among 10) components are concentrated around the two horizontal dotted lines, which correspond to the estimates of the corresponding parameters according to the EM algorithm. The remaining values, which lie further away from the dotted lines, correspond to the sampled values of this parameter for the remaining 8 empty mixture components (again, up to a switching of the labels).
Let's proceed now by inspecting the post-processed output according to the ECR algorithm. At first we can retrieve the estimated posterior distribution of the number of clusters (which corresponds to the number of non-empty mixture components across the MCMC run), as well as the number of assigned observations per cluster, conditional on the most probable number of clusters. Next we concentrate on the post-processed output of the non-empty mixture components, essentially discarding the sampled values of the remaining 8 empty mixture components. We can retrieve basic MCMC summaries using the coda package. For example, let us retrieve MCMC summaries for the coefficients β_{1,2,3} and β_{2,2,3}. The last command produces Figure .3, which displays the trace of the post-processed values of β_{1,2,3} (left) along with the corresponding estimate of the marginal posterior distribution (right).
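The coda-based summaries can be sketched on a stand-in sample; in the actual pipeline, the input would be the post-processed draws of β_{1,2,3} and β_{2,2,3} rather than the simulated values used here:

```r
# MCMC summaries with coda on simulated stand-in draws.
library(coda)
set.seed(1)
draws <- mcmc(cbind(beta_123 = rnorm(1000, mean = 2.2, sd = 0.1),
                    beta_223 = rnorm(1000, mean = 0.0, sd = 0.1)))
summary(draws)    # posterior means, SDs, naive and time-series SEs, quantiles
# plot(draws)     # trace plots and marginal density estimates
```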
Of particular interest is also the matrix of posterior membership probabilities for each observation. The next chunk of code shows how one can retrieve these estimates from the output of the EM and the MCMC algorithm, respectively. The previous example uses the parameter setup detailed in Section A. The user can modify these settings by passing the desired input to the optional arguments em_parameters and mcmc_parameters of the multinomialLogitMix() function. Both em_parameters and mcmc_parameters should be lists, consisting of the following entries.
Arguments of the em_parameters list.

maxIter: Maximum number of EM iterations. Default: 100.
emthreshold: Positive real threshold for terminating the EM algorithm. The algorithm stops when the difference between two successive evaluations of the observed log-likelihood is less than this threshold. Default: 10^{-8}.
maxNR: Maximum number of Newton-Raphson iterations. Default: 10.
tsplit: Number of different starts used within the small-EM scheme (this quantity refers to all schemes: random, split and shake). Default: 16.
msplit: Number of iterations for each small-EM start. Default: 10.
split: Boolean denoting whether the EM algorithm will use the split-small-EM scheme. Default: TRUE. Otherwise, the small-EM scheme uses only randomly selected initial values.
R0: The initial value of the parameter R that controls the step size of the ridge-stabilized Newton-Raphson scheme (see Equation (16)). Default: 0.1.

Arguments of the mcmc_parameters list.

tau: Initial value for the scale of the MALA proposal (positive; it corresponds to the parameter τ in Equation (23)). Default: 0.00035. This parameter is adjusted during the initial (warm-up) period of the sampler in order to achieve the desired acceptance rate of the MALA proposal.
nu2: The variance of the normal prior distribution of the logit coefficients (the parameter ν² in Equation (20)). Default: 100.
mcmc_cycles: Total number of MCMC cycles (after the initial warm-up period of the sampler). Default: 2600. At the end of each MCMC cycle a swap between chains is attempted.
iter_per_cycle: Number of MCMC iterations per cycle. Default: 20.
nChains: Number of MCMC chains that run in parallel. Each chain uses a different prior distribution for the mixing proportions. The inference is based on the first chain.
dirPriorAlphas: The concentration parameters of the Dirichlet prior distributions per chain (see Equation (28)). It should be a vector of length equal to nChains. The default is 5 * exp(seq(2, 14, length = nChains - 1)/100)/200; see Equations (.1) and (.2).
warm_up: Initial warm-up period of the sampler, used to adaptively tune the scale of the MALA proposal. Default: 48000.
checkAR: Number of iterations between adjustments of the scale of the MALA proposal during the initial warm-up phase of the sampler. Default: 500.
ar_low: Lowest threshold for the acceptance rate of the MALA proposal. Default: 0.15.
ar_up: Highest threshold for the acceptance rate of the MALA proposal. Default: 0.25.
burn: Number of MCMC cycles that will be discarded as burn-in. Default: 100.
withRandom: Boolean value indicating whether or not to apply a random permutation to the supplied starting values for each chain. Default: TRUE.
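Collected into R lists, the defaults read as follows. The underscore-separated names are our rendering of the argument names quoted in this appendix; dirPriorAlphas is left NULL here and should be set to a vector of length nChains:

```r
# Default control lists, as described above (names assumed from the text).
em_parameters <- list(
  maxIter = 100, emthreshold = 1e-8, maxNR = 10,
  tsplit = 16, msplit = 10, split = TRUE, R0 = 0.1
)
mcmc_parameters <- list(
  tau = 0.00035, nu2 = 100,
  mcmc_cycles = 2600, iter_per_cycle = 20,
  nChains = 12, dirPriorAlphas = NULL,  # per-chain Dirichlet concentrations
  warm_up = 48000, checkAR = 500,
  ar_low = 0.15, ar_up = 0.25,
  burn = 100, withRandom = TRUE
)
```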

C Further simulation results
A Main simulation study

We observe that for both indices (estimation of the number of clusters and partition agreement as measured by the adjusted Rand index), the clustering accuracy tends to decrease as the number of covariates increases. This effect is clearly visible for the smaller sample sizes (n). However, as n gets larger, the impact of the number of covariates becomes smaller. Figure A.5 displays the difference between the estimated and true number of clusters (K̂ − K) (left) and the adjusted Rand index (right) for each value of K (horizontal axis), stratified by sample size (n ∈ {125, 250, 500, 1000}). It is evident that as the number of clusters increases, overestimation of the number of clusters occurs more often, and this effect is more severe for small sample sizes. Once again we note that the MCMC sampler with small prior variance (MCMC-EM(1)) outperforms the remaining implementations.
Next we are concerned with the accuracy of the point estimates β̂_kjp, that is, the coefficient values at the last iteration of the EM implementation, and the estimates of the posterior means after reordering the MCMC output to deal with label switching for the MCMC implementation. We generated 100 synthetic datasets with n = 250 observations and 100 datasets with n = 500, considering K = 4 clusters, J + 1 = 6 multinomial categories and P = 3 covariates (including a constant term). In each case, the number of components was set equal to the true number of clusters, that is, K = 4. All other parameters were generated as described in Section 4.1. In order to meaningfully compare the true values β_kjp with the corresponding point estimates β̂_kjp (arising either from EM(8,8,8) or from MCMC-EM(100)), we relabelled the resulting point estimates by applying the ECR algorithm, with the pivot allocation vector (required by the ECR algorithm) set to the true partition. This procedure ensures that the labelling between β_kjp and β̂_kjp (for j = 1, . . . , J; p = 1, . . . , P) is consistent for all k = 1, . . . , K, and it has no impact on the quality of the estimates themselves. Figure A.6 displays the estimated Mean Absolute Error (MAE) between the point estimates β̂_kjp and the corresponding true values β_kjp, for k = 1, . . . , K, j = 1, . . . , J and p = 1, . . . , P. As expected, the estimated MAEs become smaller on average as the sample size (n) increases. Observe that when n = 250 the MAEs become smaller as k increases. This is due to the fact that in our simulation study (see Section 4.1) the mixing proportions are generated according to π_k ∝ k, for k = 1, . . . , K. Thus, in our 4-cluster scenario we have π_1 = 0.1, π_2 = 0.2, π_3 = 0.3 and π_4 = 0.4. Naturally, the point estimates are less accurate for very small clusters, a behaviour which is vividly illustrated when n = 250. However, when n = 500 we do not spot any systematic pattern in the resulting MAEs across clusters.
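The relabelling-then-MAE computation can be mimicked on simulated values; the paper uses the ECR algorithm with the true partition as pivot, whereas here a brute-force best-permutation match over hypothetical coefficient arrays stands in for it:

```r
# MAE of coefficient estimates after matching cluster labels to the truth
# (hypothetical values; permutation search stands in for ECR relabelling).
set.seed(3)
K <- 4; J <- 5; P <- 3
beta_true <- array(rnorm(K * J * P), dim = c(K, J, P))
beta_hat  <- beta_true + rnorm(K * J * P, sd = 0.05)  # noisy "estimates"
beta_hat  <- beta_hat[sample(K), , ]                  # unknown label permutation

# search all K! relabellings for the one minimizing total absolute error
perms <- as.matrix(expand.grid(rep(list(1:K), K)))
perms <- perms[apply(perms, 1, function(p) length(unique(p)) == K), ]
err   <- apply(perms, 1, function(p) sum(abs(beta_hat[p, , ] - beta_true)))
best  <- perms[which.min(err), ]

mae_per_cluster <- apply(abs(beta_hat[best, , ] - beta_true), 1, mean)
```

With the labels matched, `mae_per_cluster` reflects only estimation error, as in Figure A.6.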

B Further simulations and comparison with flexmix
In this section we compare the proposed methods with the popular R package flexmix (Grün and Leisch, 2007, 2008) and we also explore the impact of the average number of multinomial replicates, s̄ = Σ_{i=1}^{n} S_i / n. In the main simulation study of Section 4.1, the number of multinomial replicates (S_i) per observation is drawn from a negative binomial distribution NB(r, p) with number of trials equal to r = 20 and p = 0.025. This yields a potentially large number of multinomial replicates: the average number is equal to 781. In order to assess the sensitivity of our results to the number of multinomial replicates, we let r vary in the set {2.5, 5, 10, 20} and use the NB(r, 0.025) distribution to simulate S_i (in case the generated number is 0 we set it to 1). It follows that the average value of S_i is approximately 100, 200, 400 and 800, respectively (more precisely, the average values in our simulated datasets are 97.5, 195, 390 and 781, respectively).
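The stated averages follow from the negative binomial mean r(1 − p)/p and can be checked by simulation (simulation size and seed are ours):

```r
# Average number of multinomial replicates implied by NB(r, p) with
# p = 0.025, truncating zeros to 1 as described above.
set.seed(7)
p <- 0.025
for (r in c(2.5, 5, 10, 20)) {
  S <- rnbinom(1e5, size = r, prob = p)
  S[S == 0] <- 1                      # truncation used in the simulations
  cat(sprintf("r = %4.1f  simulated mean = %6.1f  theoretical mean = %5.1f\n",
              r, mean(S), r * (1 - p) / p))
}
```

The theoretical means are 97.5, 195, 390 and 780, matching the values reported above (the truncation at 1 has a negligible effect).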
For this task we considered the following simulation scenario. In all cases the number of multinomial categories was set equal to J + 1 = 6. The number of covariates (including a constant term) was set equal to P = 3. The true values of the regression coefficients were generated as in Section 4.1. The number of clusters varied between 1 ⩽ K ⩽ 5. We considered sample sizes of n = 250 and n = 500 observations. We generated 5 synthetic datasets for each unique combination of K (number of clusters), level of the average number of multinomial replicates s̄, and sample size (n), resulting in 200 datasets in total.
Next, we fitted mixtures of multinomial logistic regressions using the proposed methodology, as well as the EM implementation in the R package flexmix. The configuration of the EM algorithm for our proposed method was EM(8,8,8), that is, an EM algorithm with 8 splits, 8 shakes and 8 random starts of small-EM runs for each possible value of the number of components. We then used the selected model to initialize a run of a Bayesian overfitting mixture model. The prior variance of the regression coefficients was set equal to ν² = 100 (vague prior distribution), using 8 chains in total. Finally, we ran flexmix repeatedly considering 24 random starts for each possible value of the number of components, using the stepFlexmix function with options k = 1:Kmax (where Kmax denotes the maximum number of components; see the next paragraph), nrep = 24, control = list(minprior = 0) and model = flexmix::FLXMRmultinom().
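A runnable sketch of the flexmix fit is given below. Only the stepFlexmix options are taken from the text; the simulated dataset, its size, and the smaller values of k and nrep (chosen for speed) are ours:

```r
# Sketch of the flexmix comparison on a small simulated dataset
# (two clusters, three reaction categories; data-generation is ours).
library(flexmix)
set.seed(11)
n <- 200
x <- rnorm(n)
z <- rep(1:2, each = n / 2)

# cluster-specific log-odds of categories 2 and 3 versus category 1
eta2 <- ifelse(z == 1,  1 + x, -1 - x)
eta3 <- ifelse(z == 1, -1,      1)
pr   <- cbind(1, exp(eta2), exp(eta3))
pr   <- pr / rowSums(pr)
y    <- t(sapply(seq_len(n), function(i) rmultinom(1, size = 50, prob = pr[i, ])))

fit <- stepFlexmix(y ~ x, k = 1:3, nrep = 3,
                   control = list(minprior = 0),
                   model = FLXMRmultinom())
getModel(fit, which = "ICL")   # model selected by ICL
```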
In all cases, the maximum number of mixture components was set equal to K_max = K + 2, where K denotes the true number of clusters in each case. This upper bound on the number of components is much more informative regarding the number of clusters than the one in our main simulation study of Section 4.1 (where K_max = 20). The reason for choosing such a value was mainly to reduce the computing time of flexmix, which in most cases is significantly higher (see Figure B.7) than that of the proposed implementation.
The resulting estimates are summarized in Figure B.8. We conclude that in all cases the estimation becomes more challenging when the average number of multinomial replicates is smaller. Note also that our proposed methods (EM and MCMC) outperform flexmix both in terms of the estimation of the number of clusters and in terms of classification accuracy under the adjusted Rand index.

Figure 9: Posterior mean and 95% credible region of the reaction probabilities (θ_1, θ_2, θ_3, θ_4, θ_5, θ_6) per cluster for the Facebook Sellers data, when setting the continuous covariate (number of shares) equal to its mean per post type. The left and right columns correspond to the MCMC samplers with the large (ν² = 100) and small (ν² = 1) prior variance of the regression coefficients in Equation (20), respectively.