1 Introduction

A well-known definition states that an ordinal variable is one measured on a categorical scale whose categories describe an ordering with differing degrees of dissimilarity (Agresti 2014). Thus, although ordinal variables are affected by the distances among their ordered categories, those distances are not known. For example, in a questionnaire, answers on a Likert scale could be labelled strongly disagree, disagree, neutral, agree, and strongly agree; these are frequently coded as an equally spaced 1–5 scale, but they could be coded using any other increasing sequence of numerical values. Ordinal scales are very common in a wide range of areas such as medical studies, ecology, and marketing.

Cluster analysis is the study of techniques for grouping related objects into the same cluster (Everitt et al. 2011) and can be applied to identify groups, patterns, or clusters in a data set. Clustering is used for a wide range of applications, in fields including business, biology, psychology, and medicine. Two examples of this are an application to gene microarray data proposed by Rocci and Vichi (2008) and an application to genomic abnormality data, in which the developmental patterns of different types of tumours are used to identify clusters of tumours (Hoff 2005).

Many different approaches to clustering have been developed. The earliest approaches use partition optimization; the most common method is k-means clustering (MacQueen 1967; Hartigan and Wong 1979), and several authors have proposed extensions to this approach [see e.g. Vichi (2001) and Rocci and Vichi (2008)]. Alternatively, objects can be clustered hierarchically, gradually agglomerating them into larger and larger clusters (Ward 1963; Johnson 1967). All the approaches listed above are based on mathematical distance metrics; because they lack an underlying probability model, statistical inference, model selection procedures, and goodness-of-fit assessments cannot easily be applied (Everitt et al. 2011; Fernández et al. 2016).

Cluster analysis based on finite mixture models (Peel and McLachlan 2000) assumes that the variables in the data matrix arise from mixtures of statistical distributions, with each cluster corresponding to one component of the mixture. The parameters of those distributions are estimated by maximum likelihood based on the observed data. Likelihood-based methods include those proposed by McLachlan and Basford (1988), Peel and McLachlan (2000), Böhning et al. (2007), and Melnykov and Maitra (2010), among others. More recently, Govaert and Nadif (2010) and Pledger and Arnold (2014) proposed finite-mixture approaches for binary and count data using Bernoulli or Poisson building blocks. Other authors have introduced clustering algorithms specifically for ordinal data: see e.g. Giordan and Diana (2011), Biernacki and Jacques (2016), Ranalli and Rocci (2016), Matechou et al. (2016), and Fernández et al. (2016, 2019). Matechou et al. (2016) proposed a mixture-based biclustering solution relying on the proportional odds assumption of the cumulative logit model (McCullagh 1980). Fernández et al. (2016) developed an equivalent model-based clustering approach using the ordered stereotype model (Anderson 1984), although this approach assumes that no covariates are available. Methods that cluster observations using both ordinal and continuous variables simultaneously, such as the approach proposed by Ranalli and Rocci (2017), are also relevant points of comparison for our proposed method.

Unlike distance-based methods, which only determine which objects should be clustered together, likelihood-based methods can additionally describe the properties of each cluster, based on the estimated parameters, and can estimate the probability of each object being allocated to each cluster. The mixture-based approaches for ordinal responses introduced above focus on finding cluster structures based only on the matrix of ordinal responses, and assume that no associated covariates are available. Any available covariates can still be analyzed alongside the clustering results to assist with the interpretation of the cluster structures, even though the covariates played no role in the clustering process. Actually incorporating covariates in the clustering process, however, could lead to different estimated clustering structures and a different estimate of the number of clusters. Generally speaking, if a model with covariates is estimated, subjects tend to be clustered according to both their responses and the covariate effects. It is therefore desirable to make available covariates endogenous to the clustering process, to improve the interpretation of the main characteristics of the clusters (Murphy and Murphy 2020).

Other studies have investigated the associations between clustering structures and covariates. Gudicha and Vermunt (2013) described several methods for clustering categorical responses via a three-step approach: (1) estimate the mixture model; (2) assign subjects to clusters; (3) regress cluster assignments on the covariates. Another proposal, the cluster-weighted model (CWM) approach, fits the joint distribution of a random vector composed of a response variable and a set of covariates (Ingrassia et al. 2012; Lamont et al. 2016). Ingrassia et al. (2015) also introduced a version of the CWM for mixed-type covariates that assumes continuous covariates arise from Gaussian distributions. Finally, several methods in the literature use the mixture of experts (MoE) paradigm in which the parameters of the mixture are modelled as functions of fixed, potentially mixed-type, covariates (Formann 1992; Jacobs et al. 1991; Murphy and Murphy 2020).

Our approach to mixture-based clustering constructs an additive linear model of parameters, connected to the response data via a link function, so additional terms such as covariates may easily be added to the linear predictor. To the best of our knowledge, Fernández et al. (2019) introduced this formulation of model-based clustering for ordinal data with covariates, but the performance of these covariate methods and, more importantly, their influence on the resulting clustering structures have not been documented so far. The main purpose of this article is to extend such models to include covariates and allow them to affect the detection of cluster structures. We are also interested in how the resulting clustering structures compare with those obtained without covariates, and how these changes may affect the interpretation of the results.

We will focus on extending the one-dimensional clustering approach proposed in Matechou et al. (2016), which models ordinal response data using the proportional odds assumption of the cumulative logit model (hereafter referred to as the “proportional odds model”). We will include covariates directly in the linear predictor. Our approach to clustering follows the constructivist approach described by Hennig (2015), but with an interest in realist clustering: we think there are many scenarios where patterns in the data can be simplified by identifying clusters of observations that follow similar patterns, but if there is a real structure in the data, then we wish to determine that structure. Many real-world scenarios can be modelled as a response variable being affected by predictor variables, and in some of those scenarios, certain groups of observations may respond to the predictors differently from other groups. If those groups have already been identified, then we might attempt mixed-model analysis or multilevel modelling; if they have not, then the method we propose here provides a pathway to detecting these groupings of response patterns. Our approach can thus be seen as a bridge between regression modelling and cluster analysis.

The rest of the article is organised as follows. Section 2 introduces the one-dimensional clustering models and their formulation. Section 3 describes the measures used to compare different clustering structures. Section 4 uses a simulation study to assess the performance of the method, and Sect. 5 applies the method to a real-world application: the well-known arthritis clinical trial data. Section 6 describes our conclusions.

2 The row clustering model

When the data are in matrix form, clustering of rows is called row clustering. In this section, we present the row clustering formulation for finite mixtures based on the proportional odds model. This closely follows the model formulations in Matechou et al. (2016) and Fernández et al. (2019). We decided to focus on row clustering because it is more common to have covariates linked to observations (rows) than to variables (columns).

2.1 Model formulation

We consider a set of n subjects and m ordinal response variables, each with q possible ordinal response categories. Thus, the data can be represented by an \(n \times m\) matrix \({\textbf {Y}}\) with ordinal entries \(y_{ij}\). The index r (\(r = 1,\ldots ,R\)) labels the row clusters, and the symbol \(i\in r\) indicates that row i is allocated to row cluster r. We shall assume that all rows belonging to the same row cluster r have ordinal responses driven by the same row cluster effect, i.e. that there are no individual row effects. In the proportional odds model in which the effect of rows on the response is considered, the probability that \(y_{ij}\) takes category k, when row i is in row cluster r, is defined by

$$\begin{aligned} P[y_{ij}=k|i\in r]=\theta _{ijrk} , \end{aligned}$$

where \(i = 1,\ldots ,n, \ j=1,\ldots ,m\) and \(\ k=1,\ldots , q\) with \(\sum _{k=1}^q \theta _{ijrk} = 1\) for a given ij and r. This can be expressed using linear predictor terms as

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijrh}\right) = \eta _{ijrk} = \mu _k - (\alpha _{r} + \beta _j + \gamma _{rj}),\ \end{aligned}$$
(1)

where the parameters \(\{ \mu _k \}\) are the cutpoints, \(\{\alpha _{r}\}\) and \(\{\beta _j\}\) indicate the effects of row cluster r and column j, respectively, and \(\{\gamma _{rj}\}\) represent the associations between row clusters and individual columns. Corner-point or sum-to-zero constraints on \(\{\alpha _{r}\}\), \(\{\beta _j\}\) and \(\{\gamma _{rj}\}\) must be imposed to avoid identifiability problems, and the monotonically increasing constraint \(\mu _1< \mu _2< \ldots < \mu _q(=\infty )\) is included to capture the ordinal nature of the responses. The (unknown) proportions of rows in the row clusters are \(\{\pi _1, \ldots , \pi _R \}\), with \(\sum _{r=1}^R \pi _r = 1\).
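To make the formulation concrete, the following minimal R sketch (not the authors' code; the parameter values are purely illustrative) computes the category probabilities \(\theta _{ijrk}\) implied by Eq. (1) as differences of inverse-logit cumulative probabilities.

```r
# Minimal sketch: category probabilities theta_{ijrk} implied by Eq. (1),
# for hypothetical parameter values with q = 4 categories, m = 2 columns
# and R = 2 row clusters.
q <- 4
mu    <- c(-1, 0, 1)       # cutpoints mu_1 < mu_2 < mu_3 (mu_q = Inf)
alpha <- c(0, 1.5)         # row cluster effects, corner point alpha_1 = 0
beta  <- c(0, -0.5)        # column effects, corner point beta_1 = 0
gamma <- matrix(0, 2, 2)   # cluster-by-column associations (zero here)

theta <- function(r, j) {
  eta <- mu - (alpha[r] + beta[j] + gamma[r, j])  # linear predictor in Eq. (1)
  cum <- c(plogis(eta), 1)                        # cumulative probabilities; last is 1
  diff(c(0, cum))                                 # cell probabilities, summing to 1
}
theta(r = 2, j = 1)
```

Note that the ordering constraint on the cutpoints guarantees that the cumulative probabilities are non-decreasing, so the differences above are valid probabilities.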

In a simpler model with clustering of rows, the clustering is based solely on the response patterns of the rows (observations/subjects), without considering the information in the covariates. For instance, let us consider a hypothetical matrix of subjects answering a set of 5-level Likert-scale questions from a self-report questionnaire intended to measure the degree of suffering in patients diagnosed with cancer. If covariate information is not incorporated in the clustering process, the resulting clusters are based only on the patients' response patterns; for example, the clusters might be characterized as low, middle, and high scores. However, when covariate information such as type of cancer, treatment dose, initial tumor burden, tumor size, gender, and age is included in the clustering process, the resulting clusters may differ, because patients with equal or similar covariate values are assumed to be a priori more likely to co-cluster than others. For instance, patients with larger tumors may tend to be clustered together, regardless of their questionnaire responses. This motivational example is based on the one given in Müller et al. (2011).

We now define the model formulation of row clustering using the proportional odds model, with additional covariates \(\bf{x}_{i}=(x_{i1},\ldots , x_{ip})^T\), as follows,

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijrh}\right) = \eta _{ijrk} = \mu _k - (\alpha _{r} + \beta _{j} + \gamma _{rj} + \bf{x}_i^T{\varvec{\delta }_{r}}), \end{aligned}$$
(2)

where \(\bf{x}_{i}=(x_{i1},\ldots , x_{ip})^T\) is the set of p covariates associated with row i of the data matrix; these covariates can be categorical or continuous. The parameters \(\{\varvec{\delta }_{r}\}\) represent the effects of the covariates, and we assume these effects are the same for all rows in the same row cluster r. When fitting this model, the subjects are clustered according to both their response patterns and the values of their covariates, which may lead to different estimates of the cluster assignments.

Considering the simplest row clustering model, without column effects, the proportional odds model without covariates can be expressed as

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijrh}\right) = \eta _{ijrk} = \mu _k - \alpha _{r}, \end{aligned}$$
(3)

where the number of parameters, including the \(R-1\) independent values of \(\pi _r\), is \(q+2R-3\).

Adding p covariates into model (3), we obtain

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijrh}\right) = \eta _{ijrk} = \mu _k - (\alpha _{r} + \bf{x}^T_i{\varvec{\delta}_{r}} ), \end{aligned}$$
(4)

where there are now \(q+(p+2)R-3\) parameters in the model.

The row clustering model with individual column effects can be expressed as

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijr h}\right) = \eta _{ijr k} = \mu _k - (\alpha _{r} + \beta _j), \end{aligned}$$
(5)

where the number of parameters, including \(\pi _r\), is \(q+2R+m-4\).

Adding p covariates into model (5), we obtain the following model

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijr h}\right) = \eta _{ijr k} = \mu _k - (\alpha _{r} + \beta _j + \bf{x}^T_{i}{\varvec{\delta} _{r}}), \end{aligned}$$
(6)

where the number of parameters, including \(\pi _r\), is \(q+(p+2)R+m-4\).

Models (3) and (4) will be used in the simulation and application sections to compare the clustering structures.

2.2 Estimation of the parameters

The Expectation-Maximization (EM) algorithm (Dempster et al. 1977; McLachlan and Krishnan 1997) is a well-known iterative procedure to compute maximum likelihood estimates in the presence of missing or incomplete data. In the suite of models introduced in the previous section, the actual row cluster memberships, i.e. the allocation of rows into row clusters, are unknown or missing. Thus, the EM algorithm is a natural approach to fit these models. Previous examples of this approach include Bernoulli and Poisson distributions (Pledger and Arnold 2014), the proportional odds model (Matechou et al. 2016), and the ordered stereotype model (Fernández et al. 2016).

We have modified the EM algorithm used in Matechou et al. (2016) and Fernández et al. (2016) to incorporate covariates. Under the local independence assumption, whereby the variables within a row are conditionally independent of each other given the row's cluster membership (Clogg 1988), the incomplete data likelihood function can be expressed as

$$\begin{aligned} L \left( \Theta | \{y_{ij}\}, \{\bf{x}_{i}\} \right) = \prod _{i=1}^n \left[ \sum _{r=1}^{R} \pi _{r} \prod _{j=1}^{m} \prod _{k=1}^{q} \left( \theta _{ijrk} | \{\bf{x}_{i} \}\right) ^{I(y_{ij}=k)} \right] , \end{aligned}$$
(7)

which sums over all possible partitions of the rows into R clusters. Here \(\textbf{Y} = \{y_{ij}\}\) is the data matrix of observed responses, \(\textbf{X} = (\bf{x}^T_1,\ldots ,\bf{x}^T_n)^T\) is the matrix of p covariates for all n rows, \(\Theta\) contains all unknown parameters, and \(\pi _{r}\) is the a priori probability that a row belongs to row cluster r.
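As an illustration, the following R sketch evaluates (7) on the log scale; it assumes, purely for compactness, that the probabilities \(\theta _{ijrk}\) (already incorporating the covariates \(\bf{x}_i\)) are supplied as an \(n \times m \times R \times q\) array.

```r
# Sketch of the incomplete-data log-likelihood (7). `theta_arr` is assumed
# to be an n x m x R x q array holding theta_{ijrk} (already a function of
# the covariates x_i); `pi_r` is the vector of mixing proportions.
loglik_incomplete <- function(Y, theta_arr, pi_r) {
  n <- nrow(Y); m <- ncol(Y); R <- length(pi_r)
  ll <- 0
  for (i in 1:n) {
    comp <- numeric(R)
    for (r in 1:R) {
      # log(pi_r * prod_j prod_k theta^I(y_ij = k)) = log pi_r + sum_j log theta_{ijr,y_ij}
      comp[r] <- log(pi_r[r]) + sum(log(theta_arr[cbind(i, 1:m, r, Y[i, ])]))
    }
    ll <- ll + max(comp) + log(sum(exp(comp - max(comp))))  # stable log-sum-exp
  }
  ll
}
```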

The incomplete data likelihood is difficult to optimize numerically because it does not have a simple form. Therefore, it is more natural to work with the complete data likelihood which we define below.

Let \(\textbf{Z} = \{Z_{ir}\}\) be a set of random vectors corresponding to the missing information, i.e., the unknown row cluster memberships: \(Z_{ir} = 1\) if row i is in row cluster r and 0 otherwise, so that \(\sum _{r=1}^R {Z_{ir}} = 1\) for all i. We can then suppose that a complete data set \((\textbf{Y}, \textbf{X}, \textbf{Z})\) exists, and the complete data log-likelihood function can be defined as

$$\begin{aligned} \ell _c(\Theta | \{y_{ij}\}, \{\bf{x}_{i}\}, \{Z_{ir}\}) = \sum _{i=1}^n \sum _{r=1}^R Z_{ir}\log (\pi _{r}) + \sum _{i=1}^n \sum _{j=1}^m \sum _{r=1}^R \sum _{k=1}^q Z_{ir} I(y_{ij}=k) \log \left( \theta _{ijr k}|\{\bf{x}_{i}\} \right) . \end{aligned}$$
(8)

Using the previous equation, we can now derive the E- and M-steps of the EM algorithm. Given the estimates \(\widehat{\Theta }^{(t-1)}\) from the previous iteration, the expected value of the complete data log-likelihood over \(Z_{ir}\), given the observed data \(\{\bf{x}_i\}\) and \(\{y_{ij}\}\), becomes

$$\begin{aligned} Q(\Theta | \widehat{\Theta }^{(t-1)} ) &= E_{\{Z_{ir}\} | \{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)} } [\ell _c (\Theta |\{y_{ij}\},\{\bf{x}_{i}\},\{Z_{ir}\})] \\ &= \sum _{i=1}^n \sum _{r=1}^R \log (\pi _{r}) E[Z_{ir} | \{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)} ] \\ &\quad + \sum _{i=1}^n \sum _{j=1}^m \sum _{r=1}^R \sum _{k=1}^q I(y_{ij}=k)\log \left( \theta _{ijrk} | \{\bf{x}_{i}\} \right) E[Z_{ir} | \{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)} ]. \end{aligned}$$
(9)

In the E-step, we use the latest parameter estimates \(\widehat{\Theta }^{(t-1)}\) to find the expected values of the \(Z_{ir}\). The expected value of \(Z_{ir}\), a Bernoulli variable, is the posterior probability that individual i belongs to cluster r given the observed data. Therefore, using Bayes' rule, we can compute it as

$$\begin{aligned} \hat{Z}_{ir}^{(t)} &= P[Z_{ir}=1|\{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)} ] \\ &= \frac{P(\{y_{ij}\} | Z_{ir}=1, \widehat{\Theta }^{(t-1)}, \{\bf{x}_{i}\})\, P(Z_{ir}=1)}{\sum _{\ell =1}^R P(\{y_{ij}\} | Z_{i\ell }=1, \widehat{\Theta }^{(t-1)}, \{\bf{x}_{i}\})\, P(Z_{i\ell }=1)} \\ &= \frac{\hat{\pi }_{r}^{(t-1)}\prod _{j=1}^m\prod _{k=1}^q (\hat{\theta }_{ijrk}^{(t-1)} | \{\bf{x}_{i}\})^{I(y_{ij}=k)}}{\sum _{\ell =1}^R \hat{\pi }_{\ell }^{(t-1)}\prod _{j=1}^m\prod _{k=1}^q (\hat{\theta }_{ij\ell k}^{(t-1)} | \{\bf{x}_{i}\})^{I(y_{ij}=k)}}. \end{aligned}$$
(10)
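A direct R translation of (10), reusing the conventions of the likelihood sketch above, is given below; working on the log scale avoids numerical underflow when m is large.

```r
# Sketch of the E-step (10): posterior membership probabilities, computed
# on the log scale for numerical stability. Inputs as in the earlier sketch.
e_step <- function(Y, theta_arr, pi_r) {
  n <- nrow(Y); m <- ncol(Y); R <- length(pi_r)
  logw <- matrix(NA_real_, n, R)
  for (i in 1:n)
    for (r in 1:R)
      logw[i, r] <- log(pi_r[r]) + sum(log(theta_arr[cbind(i, 1:m, r, Y[i, ])]))
  w <- exp(logw - apply(logw, 1, max))  # subtract row maxima before exponentiating
  w / rowSums(w)                        # row i holds Zhat_{i1}, ..., Zhat_{iR}
}
```

Hard cluster assignments, as used at the end of this section, are then simply `apply(e_step(Y, theta_arr, pi_r), 1, which.max)`.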

Then, we substitute this expected value of \(Z_{ir}\) in the complete data log-likelihood (9) at iteration t to complete the E-step,

$$\begin{aligned} \hat{Q}(\Theta | \widehat{\Theta }^{(t-1)} ) = \sum _{i=1}^n \sum _{r=1}^R \hat{Z}_{ir}^{(t)}\log (\pi _{r}) + \sum _{i=1}^n \sum _{j=1}^m \sum _{r=1}^R \sum _{k=1}^q \hat{Z}_{ir}^{(t)} I(y_{ij}=k)\log \left( \theta _{ijrk} | \{\bf{x}_{i}\} \right) . \end{aligned}$$
(11)

At the M-step, we maximize expression (11) obtained in the E-step with respect to \(\pi _r\) and \(\Theta\). The M-step estimates for finite mixture models can be calculated in two parts: the row-cluster proportions \(\hat{\pi }_1,\ldots ,\hat{\pi }_R\) and the remaining parameters \(\widehat{\Theta }\). To find the estimates of \(\pi _r\), following Fernández et al. (2016), we substitute the conditional expectation (10) into the following expression at iteration t,

$$\begin{aligned} \hat{\pi }_r^{(t)} = \frac{1}{n} \sum _{i=1}^n E[Z_{ir} | \{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)}] = \frac{1}{n} \sum _{i=1}^n \hat{Z}_{ir}^{(t)}. \end{aligned}$$
(12)

Similarly, to estimate the parameters \(\Theta\) in the second part of (11), the derivative of the second term can be taken with respect to \(\Theta\). However, this has no simple analytical solution, so this part of the conditional expectation of the complete data log-likelihood (9) must be maximized numerically.
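A sketch of the corresponding M-step follows; `neg_Q2()` is a hypothetical user-supplied function returning the negative of the second term of (11) for a packed parameter vector, so its exact form depends on which of models (3)–(6) is being fitted.

```r
# Sketch of the M-step: closed-form update (12) for pi_r, then numerical
# maximisation of the second term of (11) with optim()'s L-BFGS-B method.
# `neg_Q2(par, Zhat)` is a hypothetical, model-dependent objective function.
m_step <- function(Zhat, neg_Q2, par_start) {
  pi_new <- colMeans(Zhat)   # Eq. (12): average of Zhat_ir over the rows
  fit <- optim(par_start, neg_Q2, Zhat = Zhat, method = "L-BFGS-B")
  list(pi_r = pi_new, par = fit$par)
}
```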

We then iterate the E-step and the M-step until convergence. Various convergence conditions can be specified; we use a criterion based on the incomplete likelihood: we iterate until the absolute difference between the incomplete data log-likelihoods at two consecutive iterations, relative to the value at the latest iteration, is close to zero. That is,

$$\begin{aligned} \frac{\left| \log L(\Theta ^{(t+1)} | \{y_{ij}\}, \{\bf{x}_{i}\}) - \log L(\Theta ^{(t)} | \{y_{ij}\}, \{\bf{x}_{i}\}) \right| }{\left| \log L(\Theta ^{(t)} | \{y_{ij}\}, \{\bf{x}_{i}\}) \right| } \approx 0. \end{aligned}$$
(13)

At the end of the process, we have estimates of the posterior probability of cluster membership for each row; these probabilities can take any value between 0 and 1. We assign each observation to the cluster for which it has the highest posterior probability.

We implemented the EM algorithm described above for the mixture-based proportional odds model, and set up the simulation study, using the statistical software R 4.0.2 (R Development Core Team, 2019). The numerical maximization in the M-step was carried out with the quasi-Newton method L-BFGS-B available in the built-in R function optim(), with default settings for all other control parameters. Alternative routines for maximum likelihood estimation of the cumulative proportional odds model, assuming effects \(\delta _{rj}\), could be explored to simplify the implementation.

We remark that an inherent drawback of mixture modelling is that the associated likelihood surface may be multimodal. We therefore tried different starting points, covering a comprehensive range of parameter values, to avoid becoming trapped in a local maximum. We reran the EM algorithm 10 times with random starting points and kept the run with the highest log-likelihood; in preliminary tests with up to 100 random starting points, 10 proved sufficient to avoid convergence to local optima. Finally, to ensure that this approach does not affect the final estimates, we used the EM estimates of the complete data likelihood as starting points (Fernández et al. 2016) for a direct numerical maximization of the incomplete data log-likelihood (7).
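Schematically, the restart strategy looks as follows, where `fit_em()` is a hypothetical wrapper around the E- and M-steps above that returns, among other things, the final log-likelihood.

```r
# Sketch of the multiple-random-starts strategy: rerun the EM algorithm from
# 10 random starting points and keep the run with the highest log-likelihood.
# `fit_em()` is a hypothetical wrapper returning a list with a $loglik entry.
runs <- lapply(1:10, function(s) fit_em(Y, X, R = 3, seed = s))
best <- runs[[which.max(sapply(runs, `[[`, "loglik"))]]
```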

3 Measures to compare clustering structures

This section discusses three popular measures for comparing clustering structures: the Adjusted Rand Index (ARI; Hubert and Arabie 1985), the variation of information (VI), and the normalised information distance (NID). Comparing clustering structures is complicated by the “label-switching” problem, whereby different labellings can describe identical clusters. The measures used here do not rely on cluster labels, but instead consider pairs of rows that are clustered together. The Rand Index (RI; Rand 1971) measures the similarity between two clustering structures based on how data points are assigned to clusters, but it has known limitations when comparing the replicability of different classifications. The ARI adjusts the Rand Index to correct for the agreement expected by chance, and ranges from 0 (totally independent structures) to 1 (identical structures). The VI measures the distance between partitions of the same data set using concepts of entropy and information (Meila 2007), and its normalised version (NVI) bounds it between 0 and 1 for comparability with the ARI. The NID is another information-theoretic measure bounded between 0 and 1; for both the NVI and NID, 0 indicates identical clustering structures and 1 indicates totally independent structures. To simplify interpretation, we work with the unit complements 1-NVI and 1-NID throughout.
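All three quantities are available in off-the-shelf R packages; the sketch below uses the aricode package (one possible implementation among several) to compare two illustrative hard partitions.

```r
# Illustrative sketch: comparing two hard partitions of 100 rows with the
# aricode package (one of several packages implementing these indices).
library(aricode)
set.seed(1)
z1 <- sample(1:3, 100, replace = TRUE)    # first clustering
z2 <- z1
flip <- runif(100) < 0.2                  # perturb about 20% of the labels
z2[flip] <- sample(1:3, sum(flip), replace = TRUE)
c(ARI = ARI(z1, z2), `1-NVI` = 1 - NVI(z1, z2), `1-NID` = 1 - NID(z1, z2))
```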

4 Simulation study

We set up a small-scale simulation study to test, in a diverse range of scenarios, how reliably we were able to estimate both the parameters of our proposed row clustering model (4) and the cluster allocations, using the EM algorithm. We are not testing model selection here: we simulate data sets and then fit the correct model to those data. This study is closely related to the one in Fernández et al. (2016).

We simulated data from the simplest covariate model (4), with only a single covariate \(x_i\) and no column effects. We designed two main scenarios for the true model by varying the values of the covariate effect parameters \(\{\delta _{r}\}\) across the clusters. Scenario 1 has both negative and positive covariate effects, so different clusters can have dramatically different covariate effects. Scenario 2, by contrast, has only positive covariate effects, which is likely to make the cluster parameters more difficult to fit, because the different clusters are more likely to produce similar response data than in Scenario 1.

The simulation program was written in R, and we did not observe any convergence issues. For each scenario, we ran several cases varying the following features:

  • Sample size: \(n=100,1000\)

  • Number of response categories: \(q=3, 4, 5, 6\)

  • Number of columns: \(m=3, 5,10\)

  • Number of row clusters: \(R=3, 5\)

  • Distribution of covariates: Normal (\(N(0, 1)\)), Binomial (\(Bin(1, 0.5)\))

In total, we ran 96 cases within each scenario. We generated 2000 replicate datasets for each combination of features using model (4) and calculated maximum likelihood estimates (MLEs) of the model parameters and their standard errors for each replicate. We then compared the estimated parameter values with the true parameter values and assessed the agreement between the true and estimated clustering structures using the Adjusted Rand Index (ARI), 1-Normalised Variation of Information (1-NVI), and 1-Normalised Information Distance (1-NID). To report the results, we computed the means of both the estimated model parameters and their corresponding standard errors over the 2000 simulated datasets.
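The data-generation step for a single replicate under model (4) can be sketched in R as follows; the parameter values are illustrative only.

```r
# Sketch of one simulated replicate from model (4): a single N(0,1) covariate
# and no column effects. Parameter values are illustrative only.
set.seed(123)
n <- 100; m <- 5; q <- 3; R <- 3
pi_r  <- rep(1/R, R)
mu    <- c(log(1/2), log(2))   # cutpoints for q = 3 (see Sect. 4.1)
alpha <- c(0, 2, 4)            # corner-point constraint alpha_1 = 0
delta <- c(-1, 0, 1)           # mixed-sign covariate effects (Scenario 1 style)
x <- rnorm(n)
z <- sample(1:R, n, replace = TRUE, prob = pi_r)  # true row-cluster memberships
Y <- matrix(NA_integer_, n, m)
for (i in 1:n) {
  eta <- mu - (alpha[z[i]] + x[i] * delta[z[i]])
  p   <- diff(c(0, plogis(eta), 1))               # category probabilities
  Y[i, ] <- sample(1:q, m, replace = TRUE, prob = p)
}
```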

4.1 Scenario 1

We simplified the study by using equal proportions of rows in each cluster: \((\pi _1,\ldots ,\pi _R) = (1/R,\ldots ,1/R)\). The cutpoint values \(\{\mu _k\}\) were chosen from the quantile function of the logistic distribution, i.e. \(\mu _k = \text {logit}(k/q)\). Therefore, the cutpoint values are \(\{ \mu _1 = \log (1/2), \mu _2= \log (2) \}\) when \(q=3\), and \(\{ \mu _1 = \log (1/4), \mu _2=\log (2/3), \mu _3 = \log (3/2), \mu _4 = \log (4) \}\) when \(q=5\). We used evenly spaced values for the row cluster effect parameters \(\alpha _r\), with the corner-point constraint \(\alpha _1 = 0\).
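In R, this choice of cutpoints is a single call to the logistic quantile function:

```r
# Cutpoints mu_k = logit(k/q), here for q = 5
q <- 5
mu <- qlogis((1:(q - 1)) / q)   # log(1/4), log(2/3), log(3/2), log(4)
```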

Table 1 summarizes the average absolute bias and the corresponding standard errors for each parameter over 2000 simulations when the fitted models are models (3) and (4). In all cases, the estimated parameters of model (4) are close to the true values, with small bias, and, as expected, the variability decreases with increasing sample size n. We also remark that the standard errors decrease as the number of ordinal categories increases (see additional results in Tables 4 and 5, Figs. 4 and 5, Appendix A). We believe this is because, as the number of ordinal categories increases, the response data become closer to continuous and the responses contain more information. On the other hand, the estimated parameters of model (3) perform poorly, i.e., they are very far from the true parameter values despite having modest standard errors (see additional results in Tables 6 and 7, Figs. 6 and 7, Appendix A).

Table 1 Scenario 1: The average absolute bias and standard error obtained for each parameter over 2000 simulations for models formulated in Eqs. (3) and (4)

4.2 Scenario 2

Scenario 2 is configured in the same way as Scenario 1, apart from the specific values of the covariate effect parameters \(\{\delta _{r}\}\): in Scenario 2, the covariate effects are in the same direction for all the clusters.

Figures 1 and 2 summarize the results of Scenario 2 for \(R=3\) for models (4) and (3). The performance is remarkably similar to that of Scenario 1: the estimates of the parameters in model (4) are closer to their true values than those in model (3), and the performance improves as the number of response categories q increases. The results of our simulation study indicate that the clustering procedure described in this article is able to recover the true parameter values in all tested instances. The remaining results for this scenario are shown in Tables 8, 9, 10, and 11 and Figs. 8 and 9 in Appendix A.

Fig. 1 Boxplots for Scenario 2, representing the estimated distribution of each parameter when the fitted model is model (4) and \(R = 3\)

Fig. 2 Boxplots for Scenario 2, representing the estimated distribution of each parameter when the fitted model is model (3) and \(R = 3\)

Fig. 3 Boxplots comparing the clustering structures recovered in the simulations for Scenarios 1 and 2, using the ARI, 1-NVI and 1-NID. Values on the vertical axis are averages across 2000 simulations. The group labels have the format “Sx.Ry.qz”, where x is the scenario number (1, 2), y the number of clusters (\(R=3,5\)), and z the number of ordinal categories (\(q=3,4,5,6\))

Finally, Fig. 3 shows the average ARI, 1-NVI, and 1-NID over all replicates and numbers of rows, by scenario, R, and q. These measures compare the row clustering structures of the true models with those of the fitted models. For Scenario 1, the mean ARI between the true and predicted cluster memberships is 0.65 when the data were fitted by model (4) with \(R=3\), \(q=3\), and 0.69 when \(R=3\), \(q=5\). Thus, the measure increases with increasing q, most likely because the data with \(q=5\) contain more information about the response. We observed equivalent results for the other two measures, 1-NVI and 1-NID. For Scenario 2, all three measures show patterns equivalent to Scenario 1, but the indices are smaller; for example, the ARI is 0.46 when \(R=3\), \(q=3\) and 0.49 when \(R=3\), \(q=5\). We conclude that if some covariate effects are positive and others negative (Scenario 1), it is easier to detect the correct clustering structure than if the covariate effects are all in the same direction (Scenario 2).

5 Application

We applied the models proposed in this article to the arthritis clinical trial data set (Lipsitz et al. 1996), which compares the drug auranofin with placebo therapy for the treatment of rheumatoid arthritis. The data set is obtained from the \({\textbf {R}}\) package multgee (Touloumis 2015). The response variable is the patient's self-assessment of arthritis, measured on a five-level ordinal scale from very poor (1) to very good (5). A total of 302 eligible patients were in the original data set, but only 289 patients completed the rheumatoid self-assessment questionnaire at all three follow-up times (first, third and fifth month of treatment); our analysis uses these 289 patients. The data can therefore be represented by a \(289 \times 3\) matrix \(\textbf{Y}\). The covariates we include in our model are gender (1 = female, 0 = male), age (in years), and treatment (1 = placebo, 0 = drug).

In this application, covariate-dependent clustering could help to identify subsets of patients with similar covariate patterns. This insight would be important because it provides a flexible approach for identifying potentially heterogeneous gender, age, and auranofin treatment effects on the arthritis scores. For instance, if elderly patients experience more symptoms and consequently tend to be more pessimistic about their arthritis status, our proposed model would allow us to distinguish subsets of older people who tend to report higher or lower arthritis scores. We note, however, that this is only an example, and we do not advocate the clinical relevance of the covariate-dependent clustering model; in real settings, clinicians and statisticians together should decide which model, i.e. no clustering, clustering with covariates, or clustering without covariates, is more relevant to their research questions.

After fitting the models without covariates (3) and with covariates (4), with different numbers of row clusters, we compared them using the information criteria AIC (Akaike 1973) and BIC (Schwarz 1978) (see results in Table 2). AIC indicates that the best model is the version of the row clustering model including the age and treatment covariates (\(\mu _k - (\alpha _{r} + x_{i1}\delta _{1r} + x_{i2}\delta _{2r})\)) with \(R=4\) row clusters (AIC = 2136.78), which outperforms its counterpart without covariates (AIC = 2154.40). However, BIC selects the model without covariates (\(\mu _k - \alpha _{r}\)) with \(R=4\) (BIC = 2202.05), likely because BIC penalizes the number of parameters more strongly than AIC does, leading to a preference for more parsimonious models. Based on our experience working with practitioners and researchers from other areas, we have chosen AIC as the standard measure for model selection; nevertheless, we acknowledge the value of BIC in providing more parsimonious models, and we include its results to ensure a comprehensive evaluation of model performance.
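The data preparation can be sketched as follows; the column names (y, id, time, sex, age, trt) follow the multgee documentation and should be treated as assumptions here.

```r
# Sketch of the data preparation; column names (y, id, time, sex, age, trt)
# follow the multgee documentation and are assumptions here.
library(multgee)
data(arthritis)                # long format: one row per patient x occasion
ok <- with(arthritis,          # keep patients with all 3 responses observed
           tapply(!is.na(y), id, function(v) length(v) == 3 && all(v)))
arth <- arthritis[arthritis$id %in% names(ok)[ok], ]
arth <- arth[order(arth$id, arth$time), ]
Y <- matrix(arth$y, ncol = 3, byrow = TRUE)              # 289 x 3 response matrix
X <- arth[!duplicated(arth$id), c("sex", "age", "trt")]  # row covariates
```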

Table 2 Results of row clustering models fitted to the arthritis data set
Table 3 Estimated parameters of two models, the first with no covariates and the second with the covariates age and treatment

Table 3 shows the estimated parameters of the two models. The row clustering model without covariates (3) separates the patients into four clusters (sorted from the best to the worst self-assessment scores). The first cluster has the largest row cluster effect (\(\alpha _1 = 4.20\)), followed by cluster 2 (\(\alpha _2 = 1.26\)), cluster 3 (\(\alpha _3 = -1.41\)), and cluster 4 (\(\alpha _4 = -4.04\)), suggesting that patients in cluster 1 feel best about their current state of arthritis among all the clusters. When we add the age and treatment covariates to the clustering model, the parameters \(\{\delta _{1r}\}\) and \(\{\delta _{2r}\}\) indicate the age and treatment effects within the clusters. For instance, the auranofin treatment did not show improvement for patients in cluster 1 (\(\delta _{21}= 0.23\)), but it did show improvement, to differing degrees, for patients in clusters 2, 3, and 4 (\(\delta _{22} = -0.84\), \(\delta _{23} = -0.82\) and \(\delta _{24} = -2.00\)). Moreover, older patients in cluster 3 (\(\delta _{13} = -0.22\)) tended to feel worse about their current arthritis status than older patients in the other clusters. The clustering model without covariates (3) therefore allows us to describe the overall patterns of patient feelings, and once we add the covariates (4), we can also identify subgroups of patients with similar covariate patterns.

Additionally, Table 12 (see results in Appendix A) compares the clustering structure agreement of the selected models with and without covariates, using the ARI, 1-NVI and 1-NID measures. The results assume each patient is allocated to the cluster for which they have the highest posterior probability of membership.

The comparison of clustering structure agreement (measured by the ARI, 1-NVI, and 1-NID) between the best model (model (4) with \(R=4\), including the age and treatment covariates) and its counterpart without covariates (model (3)) revealed distinct differences. The values of the three measures were 0.66 (ARI), 0.47 (1-NVI), and 0.64 (1-NID), indicating that models (3) and (4) resulted in different clustering structures. This was further confirmed by examining the detailed estimated memberships for individuals in Table 13 (see results in Appendix A). For example, when the data were fitted by model (4), nine patients (1, 62, 119, 124, 131, 223, 239, 243, and 266) originally assigned to cluster 1 were re-allocated to cluster 2. Similarly, two patients (79 and 238) from cluster 2 were re-allocated to cluster 3, and ten patients (63, 125, 153, 192, 215, 217, 219, 245, 267, and 285) from cluster 3 were re-allocated to cluster 4. These findings highlight the substantial impact of including covariates in model (4) on the clustering structures, underscoring the importance of considering covariate effects in the analysis.

Finally, Table 14 in Appendix C presents a comparison of the average age for different combinations of clusters and treatment (placebo or drug) under two models: one without covariates (\(\mu _k - \alpha _{r}\)) with \(R=4\) clusters, and one incorporating the covariates age and treatment (\(\mu _k - (\alpha _{r} + x_{i1}\delta _{1r} + x_{i2}\delta _{2r})\), where \(x_1\) represents age and \(x_2\) represents treatment), also with \(R=4\) clusters. Notable differences in mean age were observed within specific groups. For instance, individuals in group 1 receiving the drug treatment exhibited an increase in mean age from 35 to 41 when covariates were incorporated, whereas a discrepancy in the opposite direction was observed for individuals in group 2 receiving the placebo (54.5 vs 50.8). This comparison highlights the added insight provided by including covariates, shedding light on how the covariates relate to the composition of specific groups and treatment categories.

6 Discussion

This paper uses finite mixture models to cluster ordinal data via the proportional odds model, including covariates in the linear predictor; the proportional odds structure captures the inherent natural order of the responses.

We set up a simulation study to explore the reliability of the models with covariates across a range of cases, considering two scenarios in which the covariate effects vary from mixed directions to all in the same direction. In all cases, the estimates of the parameters are close to their true values, and we observed that the standard errors decrease as the number of ordinal categories increases and with increasing sample size n. Moreover, we compared the similarity of the true model and the fitted model for both scenarios based on the ARI, 1-NVI, and 1-NID indices. The row clustering structure with mixed-direction covariate effects was recovered better than the one with all-positive covariate effects.

We also illustrated our approach with the well-known “arthritis clinical trial” data set. The results of this application indicated that the best model according to AIC was the row clustering model with \(R=4\) including the age and treatment covariates. However, we remark that AIC is a standard procedure, and we consider that subject-matter experts and statisticians together should decide which model (with or without covariates) is more relevant to answer the research questions. The patients were clustered according to their similar patterns of responses and the effects of the covariates. That is, the four clusters have different age and treatment effects, which changes the interpretation of the clustering structure relative to the model in which the covariates are not taken into account. In this case, we could identify the individuals in each of the four groups based on their self-assessment scales and on how age and treatment are associated with these groups.

It is important to note that our proposed model is based on the cumulative version of the proportional odds model, which applies the proportional odds assumption. This assumption can be assessed in several ways: (1) examining graphical diagnostics, e.g. plotting the cumulative logit probabilities against the covariates, where systematic departures from parallelism indicate violations of the proportional odds assumption; (2) performing formal statistical tests, e.g. the Wald test proposed by Brant (1990); and (3) performing model diagnostics, e.g. examining the residuals for patterns, such as non-linearity or heteroscedasticity, which can provide evidence of violations of the assumptions. Additionally, we acknowledge that this represents a simplified approach that assumes a uniform covariate effect, which may not always be valid and may not capture the true complexity of the relationship between covariates and response variables in all cases. Further research is needed to explore more flexible models that can account for varying covariate effects on different response variables in different situations.
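As an illustration of check (2), the sketch below fits a no-cluster proportional odds model with MASS::polr and applies the Wald test from the brant package; the data are simulated here only so that the block is self-contained.

```r
# Sketch of check (2): Brant's Wald test of the proportional odds assumption
# on a no-cluster fit. Synthetic data, used only to keep the example runnable.
library(MASS)    # polr()
library(brant)   # brant()
set.seed(1)
n   <- 289
age <- rnorm(n, 50, 10)
trt <- rbinom(n, 1, 0.5)
lat <- 0.04 * age - 0.8 * trt + rlogis(n)   # latent continuous score
score <- cut(lat, quantile(lat, 0:5 / 5), labels = 1:5,
             include.lowest = TRUE, ordered_result = TRUE)
fit <- polr(score ~ age + trt, Hess = TRUE)
brant(fit)   # non-significant results support parallel cumulative logits
```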

Deciding whether a variable should be considered a response or a covariate is a crucial step in statistical analyses, including mixture-based clustering. By necessity, statistical analyses distinguish between response (dependent) variables and explanatory (independent) variables (Agresti 2014). In making this decision, it is essential to consider the research questions, theoretical considerations, and subject-matter knowledge about the variables under study. Researchers need to carefully assess whether a variable is of primary interest in answering the research questions, or whether it plays a supporting role because of its association with the variable of interest. Additionally, exploratory data analysis techniques, such as visualizations and correlation analyses, can provide valuable insights into the relationships between variables. By considering these factors and utilizing appropriate statistical techniques, researchers can make informed decisions regarding the allocation of variables as responses or covariates. In this study, we follow the general convention of treating variables as responses or covariates based on research interest. For instance, in our application, the main interest is the effect of a drug on arthritis status; however, if a researcher were investigating how patient age varies with arthritis status, age would likely be treated as the response variable. We acknowledge that this allocation of variables may vary depending on the specific research context, and researchers should adapt their approach accordingly.

Our approach assumes that all available variables are used in the modeling procedure. However, in many situations, considering all the variables unnecessarily increases the model complexity. Moreover, some variables may not possess any clustering information and are of no use in the detection of the group structure; rather, they could be detrimental to the clustering. The case where all the variables contain clustering information can also be problematic: along with the increasing number of dimensions comes the curse of dimensionality, and including superfluous variables in the model leads to identifiability problems and over-parameterization (Bouveyron and Brunet-Saumard 2014). Therefore, resorting to variable selection techniques can facilitate model fitting, ease the interpretation of the results, and lead to data classifications of better quality. Even in situations of moderate or low dimensionality, reducing the set of variables employed in the clustering process can be beneficial (Fowlkes et al. 1988; Raftery and Dean 2006; Andrews and McNicholas 2014). How the variable selection algorithm interacts with the model fitting process defines the overall approach to the problem; for a general learning task, the principal distinction is whether the selection is carried out separately from or jointly with the learning procedure (John et al. 1994; Dash and Liu 1997; Dy and Brodley 2004). Thus, the application of information criteria such as AIC or BIC would be a direct way to perform variable selection in our approach. Other alternatives are model-based selection methods such as stepwise selection and the LASSO. Additionally, domain knowledge or subject-matter expertise can guide the variable selection process by considering the relevance of covariates based on their theoretical importance or prior knowledge.

We performed a robustness analysis to assess the impact of outliers by introducing 3% outliers in the numerical variable age and re-fitting the models incorporating this covariate; the results are included in Table 5 of Appendix E. Interestingly, our analysis consistently revealed the emergence of an additional group according to AIC, resulting in a total of five groups, in contrast to the four groups found without outliers. Importantly, this newly identified group aligned with the rows containing the artificially introduced outliers. These findings demonstrate the sensitivity of our proposed method to outliers and its ability to capture their influence on the clustering structure. However, this analysis serves only as an illustrative example, and a more comprehensive robustness analysis is a potential avenue for future investigations.

We compared the best model according to AIC with the Partitioning Around Medoids (PAM) method using the Gower dissimilarity measure, to assess the equivalence in terms of the number of clusters and the cluster structure of covariate values. The results of this comparison can be found in Table 15 of Appendix D. Interestingly, this comparison revealed consistent results in the number of groups (\(R=4\)), while exhibiting slight differences in both the cluster structures and the covariate values. Notably, incorporating the covariates age and treatment in the model (\(\mu _k - (\alpha _{r} + x_{i1}\delta _{1r} + x_{i2}\delta _{2r})\)) resulted in a lower mean age for individuals in group 1 receiving the drug treatment, compared to the results obtained with the PAM method. A more comprehensive comparison of clustering methods would be an intriguing avenue for future research.
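A minimal sketch of this comparison pipeline, on synthetic mixed-type data of the same shape as the arthritis example, is given below; daisy() and pam() from the cluster package provide the Gower dissimilarities and PAM, respectively.

```r
# Sketch of the PAM comparison: Gower dissimilarities on mixed-type data
# (ordinal responses plus covariates), then PAM with k = 4 medoids.
# Synthetic data with the same shape as the arthritis example.
library(cluster)
set.seed(1)
dat <- data.frame(y1  = ordered(sample(1:5, 289, replace = TRUE)),
                  y2  = ordered(sample(1:5, 289, replace = TRUE)),
                  y3  = ordered(sample(1:5, 289, replace = TRUE)),
                  age = rnorm(289, 50, 10),
                  trt = factor(rbinom(289, 1, 0.5)))
d   <- daisy(dat, metric = "gower")   # handles ordinal, numeric, binary columns
fit <- pam(d, k = 4)
table(fit$clustering)                 # cluster sizes under the PAM solution
```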

This article demonstrated that including available covariates in the fitting process of mixture-based approaches for ordinal responses improves insight into the main characteristics of the clusters. The same idea could be implemented for different types of models focused on ordinal responses, such as the ordered stereotype and the adjacent-categories logit models. For future research, we plan to extend the model shown in this article to cluster rows and columns simultaneously (a.k.a. co-clustering or biclustering), which is a natural extension and can give more insight into the clustering structure of the data sets. Another natural and challenging extension would be to incorporate row and column covariates into our current approach, capturing their potential interactions; one possible approach might be a multi-level modelling framework or fitting two separate mixture models. Another interesting avenue to explore would be the potential application of our proposed procedure for data imputation in the presence of missing data. It could be extended to impute both ordinal responses and covariate values simultaneously, leveraging the estimated mixture models and capturing non-linear relationships and interactions between variables; an uncertainty-aware imputation approach using the EM algorithm could provide more realistic and robust imputed values. However, further research and validation would be needed to evaluate the performance of our proposed procedure as a data imputation method, in comparison with existing techniques, across various settings and data scenarios. As future work, we also plan to conduct additional comparisons with other existing methods to further evaluate the performance of our proposed method and provide a more comprehensive analysis. Finally, this research has considered the case where the responses in each column have the same number of ordinal response levels. This could be varied, but may require a separate set of parameters \(\{\mu _{jk}\}\) and \(\{\phi _{jk}\}\). The simulation and model fitting code in R is available on GitHub at https://github.com/vuw-clustering/clustering-covariates.