1 Introduction

A well-known definition states that an ordinal variable is one measured on a categorical scale whose categories describe an ordering with differing degrees of dissimilarity (Agresti 2014). Thus, although ordinal variables are affected by the distances among their ordered categories, those distances are not known. For example, in a questionnaire, answers on a Likert scale could be labelled strongly disagree, disagree, neutral, agree, and strongly agree; these are frequently coded as an equally spaced 1–5 scale, but they could be coded using any other increasing sequence of numerical values. Ordinal scales are very common in a wide range of areas such as medical studies, ecology, and marketing.

Cluster analysis is the study of techniques for grouping related objects into the same cluster (Everitt et al. 2011) and can be applied to identify groups, patterns, or clusters in a data set. Clustering is used for a wide range of applications, in fields including business, biology, psychology, and medicine. Two examples of this are an application to gene microarray data proposed by Rocci and Vichi (2008) and an application to genomic abnormality data, in which the developmental patterns of different types of tumours are used to identify clusters of tumours (Hoff 2005).

Many different approaches to clustering have been developed. The earliest approaches use partition optimization; the most common method is k-means clustering (MacQueen 1967; Hartigan and Wong 1979), and several authors have proposed extensions to this approach [see e.g. Vichi (2001) and Rocci and Vichi (2008)]. Alternatively, objects can be clustered hierarchically, gradually agglomerating them into larger and larger clusters (Ward 1963; Johnson 1967). All the approaches listed above are based on mathematical distance metrics; because they lack an underlying probability model, statistical inference, model selection procedures, and goodness-of-fit assessments cannot easily be applied (Everitt et al. 2011; Fernández et al. 2016).

Cluster analysis based on finite mixture models (Peel and McLachlan 2000) assumes that the variables in the data matrix arise from mixtures of statistical distributions, with each cluster corresponding to one component of the mixture. The parameters of those distributions are estimated by maximum likelihood based on the observed data. Likelihood-based methods include those proposed by McLachlan and Basford (1988), Peel and McLachlan (2000), Böhning et al. (2007), and Melnykov and Maitra (2010), among others. More recently, Govaert and Nadif (2010) and Pledger and Arnold (2014) proposed finite-mixture approaches for binary and count data using Bernoulli or Poisson building blocks. Other authors have introduced clustering algorithms specifically for ordinal data: see e.g. Giordan and Diana (2011), Biernacki and Jacques (2016), Ranalli and Rocci (2016), Matechou et al. (2016), and Fernández et al. (2016, 2019). Matechou et al. (2016) proposed a mixture-based biclustering solution relying on the proportional odds assumption of the cumulative logit model (McCullagh 1980). Fernández et al. (2016) developed an equivalent model-based clustering approach using the ordered stereotype model (Anderson 1984), although this approach assumes that no covariates are available. Methods that cluster observations using both ordinal and continuous variables simultaneously, such as the approach proposed by Ranalli and Rocci (2017), are also relevant points of comparison for our proposed method.

Unlike distance-based methods, which only determine which objects should be clustered together, likelihood-based methods can additionally describe the properties of each cluster, based on the estimated parameters, and can estimate the probability of each object being allocated to each cluster. The mixture-based approaches for ordinal responses introduced above focus on finding cluster structures based only on the matrix of ordinal responses, and assume that no associated covariates are available. Any available covariates can still be analyzed alongside the clustering results to assist with the interpretation of the cluster structures, even though the covariates played no role in the clustering process. Actually incorporating covariates in the clustering process, however, could lead to different estimated clustering structures and a different estimate of the number of clusters. Generally speaking, if a model with covariates is estimated, subjects tend to be clustered according to both their responses and the covariate effects. It is therefore desirable to make available covariates endogenous to the clustering process, to improve the interpretation of the main characteristics of the clusters (Murphy and Murphy 2020).

Other studies have investigated the associations between clustering structures and covariates. Gudicha and Vermunt (2013) described several methods for clustering categorical responses via a three-step approach: (1) estimate the mixture model; (2) assign subjects to clusters; (3) regress cluster assignments on the covariates. Another proposal, the cluster-weighted model (CWM) approach, fits the joint distribution of a random vector composed of a response variable and a set of covariates (Ingrassia et al. 2012; Lamont et al. 2016). Ingrassia et al. (2015) also introduced a version of the CWM for mixed-type covariates that assumes continuous covariates arise from Gaussian distributions. Finally, several methods in the literature use the mixture of experts (MoE) paradigm in which the parameters of the mixture are modelled as functions of fixed, potentially mixed-type, covariates (Formann 1992; Jacobs et al. 1991; Murphy and Murphy 2020).

Our approach to mixture-based clustering constructs an additive linear model of parameters, connected to the response data via a link function, so additional terms such as covariates may easily be added to the linear predictor. To the best of our knowledge, Fernández et al. (2019) introduced this formulation of model-based clustering for ordinal data with covariates, but the performance of these covariate methods and, more importantly, their influence on the resulting clustering structures have not been documented so far. The main purpose of this article is to extend such models to include covariates and allow them to affect the detection of cluster structures. We are also interested in how the resulting clustering structures compare with those obtained without covariates, and how these changes may affect the interpretation of the results.

We will focus on extending the one-dimensional clustering approach proposed in Matechou et al. (2016), which models ordinal response data using the proportional odds assumption of the cumulative logit model (hereafter referred to as the “proportional odds model”). We will include covariates directly in the linear predictor. Our approach to clustering follows the constructivist approach described by Hennig (2015), but with an interest in realist clustering: we think there are many scenarios where patterns in the data can be simplified by identifying clusters of observations that follow similar patterns, but if there is a real structure in the data, then we wish to determine that structure. Many real-world scenarios can be modelled as a response variable being affected by predictor variables, and in some of those scenarios, certain groups of observations may respond to the predictors differently from other groups. If those groups have already been identified, then we might attempt mixed-model analysis or multilevel modelling; if they have not, then the method we propose here provides a pathway to detecting these groupings of response patterns. Our approach can thus be seen as a bridge between regression modelling and cluster analysis.

The rest of the article is organised as follows. Section 2 introduces the one-dimensional clustering models and their formulation. Section 3 describes the measures used to compare different clustering structures. Section 4 uses a simulation study to assess the performance of the method, and Sect. 5 applies the method to a real-world application: the well-known arthritis clinical trial data. Section 6 describes our conclusions.

2 The row clustering model

When the data are in matrix form, clustering of rows is called row clustering. In this section, we present the row clustering formulation for finite mixtures based on the proportional odds model. This closely follows the model formulations in Matechou et al. (2016) and Fernández et al. (2019). We decided to focus on row clustering because it is more common to have covariates linked to observations (rows) than to variables (columns).

2.1 Model formulation

We consider a set of n subjects and m ordinal response variables, each with q possible ordinal response categories. Thus, the data can be represented by an \(n \times m\) matrix \({\textbf {Y}}\) with ordinal entries \(y_{ij}\). The index r (\(r = 1,\ldots ,R\)) labels the row clusters, and the symbol \(i\in r\) indicates that row i is allocated to row cluster r. We shall assume that all rows belonging to the same row cluster r have ordinal responses driven by the same row cluster effect, i.e. that there are no individual row effects. In the proportional odds model in which the effect of rows on the response is considered, the probability that \(y_{ij}\) takes category k, when row i is in row cluster r, is defined by

$$\begin{aligned} P[y_{ij}=k|i\in r]=\theta _{ijrk} , \end{aligned}$$

where \(i = 1,\ldots ,n, \ j=1,\ldots ,m\) and \(\ k=1,\ldots , q\) with \(\sum _{k=1}^q \theta _{ijrk} = 1\) for a given ij and r. This can be expressed using linear predictor terms as

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijrh}\right) = \eta _{ijrk} = \mu _k - (\alpha _{r} + \beta _j + \gamma _{rj}),\ \end{aligned}$$
(1)

where the parameters \(\{ \mu _k \}\) are the cutpoints, \(\{\alpha _{r}\}\) and \(\{\beta _j\}\) indicate the effects of row cluster r and column j, respectively, and \(\{\gamma _{rj}\}\) represent the associations between row clusters and individual columns. Corner-point or sum-to-zero constraints on \(\{\alpha _{r}\}\), \(\{\beta _j\}\) and \(\{\gamma _{rj}\}\) must be imposed to avoid identifiability problems, and the monotonically increasing constraint \(\mu _1< \mu _2< \ldots < \mu _q(=\infty )\) is included to capture the ordinal nature of the responses. The (unknown) proportions of rows in the row clusters are \(\{\pi _1, \ldots , \pi _R \}\), with \(\sum _{r=1}^R \pi _r = 1\).
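To make the formulation concrete, the following minimal R sketch (not the authors' code; the parameter values are purely illustrative) computes the category probabilities \(\theta _{ijrk}\) implied by Eq. (1) as differences of inverse-logit cumulative probabilities.

```r
# Minimal sketch: category probabilities theta_{ijrk} implied by Eq. (1),
# for hypothetical parameter values with q = 4 categories, m = 2 columns
# and R = 2 row clusters.
q <- 4
mu    <- c(-1, 0, 1)       # cutpoints mu_1 < mu_2 < mu_3 (mu_q = Inf)
alpha <- c(0, 1.5)         # row cluster effects, corner point alpha_1 = 0
beta  <- c(0, -0.5)        # column effects, corner point beta_1 = 0
gamma <- matrix(0, 2, 2)   # cluster-by-column associations (zero here)

theta <- function(r, j) {
  eta <- mu - (alpha[r] + beta[j] + gamma[r, j])  # linear predictor in Eq. (1)
  cum <- c(plogis(eta), 1)                        # cumulative probabilities; last is 1
  diff(c(0, cum))                                 # cell probabilities, summing to 1
}
theta(r = 2, j = 1)
```

Note that the ordering constraint on the cutpoints guarantees that the cumulative probabilities are non-decreasing, so the differences above are valid probabilities.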

In a simpler model with clustering of rows, the clustering is based solely on the response patterns of the rows (observations/subjects), without considering the information in the covariates. For instance, let us consider a hypothetical matrix of subjects answering a set of 5-level Likert-scale questions from a self-report questionnaire intended to measure the degree of suffering in patients diagnosed with cancer. If covariate information is not incorporated in the clustering process, the resulting clusters are based only on the patients' response patterns; for example, the clusters might be characterized as low, middle, and high scores. However, when covariate information such as type of cancer, treatment dose, initial tumor burden, tumor size, gender, and age is included in the clustering process, the resulting clusters may differ, because patients with equal or similar covariate values are assumed to be a priori more likely to co-cluster than others. For instance, patients with larger tumors may tend to be clustered together, regardless of their questionnaire responses. This motivational example is based on the one given in Müller et al. (2011).

We now define the model formulation of row clustering using the proportional odds model, with additional covariates \(\bf{x}_{i}=(x_{i1},\ldots , x_{ip})^T\), as follows,

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijrh}\right) = \eta _{ijrk} = \mu _k - (\alpha _{r} + \beta _{j} + \gamma _{rj} + \bf{x}_i^T{\varvec{\delta }_{r}}), \end{aligned}$$
(2)

where \(\bf{x}_{i}=(x_{i1},\ldots , x_{ip})^T\) is the set of p covariates associated with row i of the data matrix; these covariates can be categorical or continuous. The parameters \(\{\varvec{\delta }_{r}\}\) represent the effects of the covariates, and we assume these effects are the same for all rows in the same row cluster r. When fitting this model, the subjects are clustered according to both their response patterns and the values of their covariates, which may lead to different estimates of the cluster assignments.

Considering the simplest row clustering model, without column effects, the proportional odds model without covariates can be expressed as

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijrh}\right) = \eta _{ijrk} = \mu _k - \alpha _{r}, \end{aligned}$$
(3)

where the number of parameters, including the \(R-1\) independent values of \(\pi _r\), is \(q+2R-3\).

Adding p covariates into model (3), we obtain

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijrh}\right) = \eta _{ijrk} = \mu _k - (\alpha _{r} + \bf{x}^T_i{\varvec{\delta}_{r}} ), \end{aligned}$$
(4)

where there are now \(q+(p+2)R-3\) parameters in the model.

The row clustering model with individual column effects can be expressed as

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijr h}\right) = \eta _{ijr k} = \mu _k - (\alpha _{r} + \beta _j), \end{aligned}$$
(5)

where the number of parameters, including \(\pi _r\), is \(q+2R+m-4\).

Adding p covariates into model (5), we obtain the following model

$$\begin{aligned} \text {logit}\left( \sum _{h=1}^k\theta _{ijr h}\right) = \eta _{ijr k} = \mu _k - (\alpha _{r} + \beta _j + \bf{x}^T_{i}{\varvec{\delta} _{r}}), \end{aligned}$$
(6)

where the number of parameters, including \(\pi _r\), is \(q+(p+2)R+m-4\).

Models (3) and (4) will be used in the simulation and application sections to compare the clustering structures.

2.2 Estimation of the parameters

The Expectation-Maximization (EM) algorithm (Dempster et al. 1977; McLachlan and Krishnan 1997) is a well-known iterative procedure to compute maximum likelihood estimates in the presence of missing or incomplete data. In the suite of models introduced in the previous section, the actual row cluster memberships, i.e. the allocation of rows into row clusters, are unknown or missing. Thus, the EM algorithm is a natural approach to fit these models. Previous examples of this approach include Bernoulli and Poisson distributions (Pledger and Arnold 2014), the proportional odds model (Matechou et al. 2016), and the ordered stereotype model (Fernández et al. 2016).

We have modified the EM algorithm used in Matechou et al. (2016) and Fernández et al. (2016) to incorporate covariates. Under the local independence assumption, whereby the variables within a row are conditionally independent of each other given the row's cluster membership (Clogg 1988), the incomplete data likelihood function can be expressed as

$$\begin{aligned} L \left( \Theta | \{y_{ij}\}, \{\bf{x}_{i}\} \right) = \prod _{i=1}^n \left[ \sum _{r=1}^{R} \pi _{r} \prod _{j=1}^{m} \prod _{k=1}^{q} \left( \theta _{ijrk} | \{\bf{x}_{i} \}\right) ^{I(y_{ij}=k)} \right] , \end{aligned}$$
(7)

which sums over all possible partitions of the rows into R clusters. Here \(\textbf{Y} = \{y_{ij}\}\) is the data matrix of observed responses, \(\textbf{X} = (\bf{x}^T_1,\ldots ,\bf{x}^T_n)^T\) is the matrix of p covariates for all n rows, \(\Theta\) contains all unknown parameters, and \(\pi _{r}\) is the a priori probability that a row belongs to row cluster r.
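As an illustration, the following R sketch evaluates (7) on the log scale; it assumes, purely for compactness, that the probabilities \(\theta _{ijrk}\) (already incorporating the covariates \(\bf{x}_i\)) are supplied as an \(n \times m \times R \times q\) array.

```r
# Sketch of the incomplete-data log-likelihood (7). `theta_arr` is assumed
# to be an n x m x R x q array holding theta_{ijrk} (already a function of
# the covariates x_i); `pi_r` is the vector of mixing proportions.
loglik_incomplete <- function(Y, theta_arr, pi_r) {
  n <- nrow(Y); m <- ncol(Y); R <- length(pi_r)
  ll <- 0
  for (i in 1:n) {
    comp <- numeric(R)
    for (r in 1:R) {
      # log(pi_r * prod_j prod_k theta^I(y_ij = k)) = log pi_r + sum_j log theta_{ijr,y_ij}
      comp[r] <- log(pi_r[r]) + sum(log(theta_arr[cbind(i, 1:m, r, Y[i, ])]))
    }
    ll <- ll + max(comp) + log(sum(exp(comp - max(comp))))  # stable log-sum-exp
  }
  ll
}
```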

The incomplete data likelihood is difficult to optimize numerically because it does not have a simple form. Therefore, it is more natural to work with the complete data likelihood which we define below.

Let \(\textbf{Z} = \{Z_{ir}\}\) be a set of random vectors corresponding to the missing information, i.e., the unknown row cluster memberships: \(Z_{ir} = 1\) if row i is in row cluster r and 0 otherwise, so that \(\sum _{r=1}^R {Z_{ir}} = 1\) for all i. We can then suppose that a complete data set \((\textbf{Y}, \textbf{X}, \textbf{Z})\) exists, and the complete data log-likelihood function can be defined as

$$\begin{aligned} \ell _c(\Theta | \{y_{ij}\}, \{\bf{x}_{i}\}, \{Z_{ir}\}) = \sum _{i=1}^n \sum _{r=1}^R Z_{ir}\log (\pi _{r}) + \sum _{i=1}^n \sum _{j=1}^m \sum _{r=1}^R \sum _{k=1}^q Z_{ir} I(y_{ij}=k) \log \left( \theta _{ijr k}|\{\bf{x}_{i}\} \right) . \end{aligned}$$
(8)

Using the previous equation, we can now derive the E- and M-steps of the EM algorithm. Given the estimates \(\widehat{\Theta }^{(t-1)}\) from the previous iteration, the expected value of the complete data log-likelihood over \(Z_{ir}\), given the observed data \(\{\bf{x}_i\}\) and \(\{y_{ij}\}\), becomes

$$\begin{aligned} Q(\Theta | \widehat{\Theta }^{(t-1)} ) &= E_{\{Z_{ir}\} | \{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)} } [\ell _c (\Theta |\{y_{ij}\},\{\bf{x}_{i}\},\{Z_{ir}\})] \\ &= \sum _{i=1}^n \sum _{r=1}^R \log (\pi _{r}) E[Z_{ir} | \{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)} ] \\ &\quad + \sum _{i=1}^n \sum _{j=1}^m \sum _{r=1}^R \sum _{k=1}^q I(y_{ij}=k)\log \left( \theta _{ijrk} | \{\bf{x}_{i}\} \right) E[Z_{ir} | \{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)} ]. \end{aligned}$$
(9)

In the E-step, we use the latest parameter estimates \(\widehat{\Theta }^{(t-1)}\) to find the expected values of the \(Z_{ir}\). The expected value of \(Z_{ir}\), a Bernoulli variable, is the posterior probability that individual i belongs to cluster r given the observed data. Therefore, using Bayes' rule, we can compute it as

$$\begin{aligned} \hat{Z}_{ir}^{(t)} &= P[Z_{ir}=1|\{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)} ] \\ &= \frac{P(\{y_{ij}\} | Z_{ir}=1, \widehat{\Theta }^{(t-1)}, \{\bf{x}_{i}\})\, P(Z_{ir}=1)}{\sum _{\ell =1}^R P(\{y_{ij}\} | Z_{i\ell }=1, \widehat{\Theta }^{(t-1)}, \{\bf{x}_{i}\})\, P(Z_{i\ell }=1)} \\ &= \frac{\hat{\pi }_{r}^{(t-1)}\prod _{j=1}^m\prod _{k=1}^q (\hat{\theta }_{ijrk}^{(t-1)} | \{\bf{x}_{i}\})^{I(y_{ij}=k)}}{\sum _{\ell =1}^R \hat{\pi }_{\ell }^{(t-1)}\prod _{j=1}^m\prod _{k=1}^q (\hat{\theta }_{ij\ell k}^{(t-1)} | \{\bf{x}_{i}\})^{I(y_{ij}=k)}}. \end{aligned}$$
(10)
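A direct R translation of (10), reusing the conventions of the likelihood sketch above, is given below; working on the log scale avoids numerical underflow when m is large.

```r
# Sketch of the E-step (10): posterior membership probabilities, computed
# on the log scale for numerical stability. Inputs as in the earlier sketch.
e_step <- function(Y, theta_arr, pi_r) {
  n <- nrow(Y); m <- ncol(Y); R <- length(pi_r)
  logw <- matrix(NA_real_, n, R)
  for (i in 1:n)
    for (r in 1:R)
      logw[i, r] <- log(pi_r[r]) + sum(log(theta_arr[cbind(i, 1:m, r, Y[i, ])]))
  w <- exp(logw - apply(logw, 1, max))  # subtract row maxima before exponentiating
  w / rowSums(w)                        # row i holds Zhat_{i1}, ..., Zhat_{iR}
}
```

Hard cluster assignments, as used at the end of this section, are then simply `apply(e_step(Y, theta_arr, pi_r), 1, which.max)`.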

Then, we substitute this expected value of \(Z_{ir}\) in the complete data log-likelihood (9) at iteration t to complete the E-step,

$$\begin{aligned} \hat{Q}(\Theta | \widehat{\Theta }^{(t-1)} ) = \sum _{i=1}^n \sum _{r=1}^R \hat{Z}_{ir}^{(t)}\log (\pi _{r}) + \sum _{i=1}^n \sum _{j=1}^m \sum _{r=1}^R \sum _{k=1}^q \hat{Z}_{ir}^{(t)} I(y_{ij}=k)\log \left( \theta _{ijrk} | \{\bf{x}_{i}\} \right) . \end{aligned}$$
(11)

At the M-step, we maximize expression (11) obtained in the E-step with respect to \(\pi _r\) and \(\Theta\). The M-step estimates for finite mixture models can be calculated in two parts: the row-cluster proportions \(\hat{\pi }_1,\ldots ,\hat{\pi }_R\) and the remaining parameters \(\widehat{\Theta }\). To find the estimates of \(\pi _r\), following Fernández et al. (2016), we substitute the conditional expectation (10) into the following expression at iteration t,

$$\begin{aligned} \hat{\pi }_r^{(t)} = \frac{1}{n} \sum _{i=1}^n E[Z_{ir} | \{y_{ij}\}, \{\bf{x}_{i}\}, \widehat{\Theta }^{(t-1)}] = \frac{1}{n} \sum _{i=1}^n \hat{Z}_{ir}^{(t)}. \end{aligned}$$
(12)

Similarly, to estimate the parameters \(\Theta\) in the second part of (11), the derivative of the second term can be taken with respect to \(\Theta\). However, this has no simple analytical solution, so this part of the conditional expectation of the complete data log-likelihood (9) must be maximized numerically.
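A sketch of the corresponding M-step follows; `neg_Q2()` is a hypothetical user-supplied function returning the negative of the second term of (11) for a packed parameter vector, so its exact form depends on which of models (3)–(6) is being fitted.

```r
# Sketch of the M-step: closed-form update (12) for pi_r, then numerical
# maximisation of the second term of (11) with optim()'s L-BFGS-B method.
# `neg_Q2(par, Zhat)` is a hypothetical, model-dependent objective function.
m_step <- function(Zhat, neg_Q2, par_start) {
  pi_new <- colMeans(Zhat)   # Eq. (12): average of Zhat_ir over the rows
  fit <- optim(par_start, neg_Q2, Zhat = Zhat, method = "L-BFGS-B")
  list(pi_r = pi_new, par = fit$par)
}
```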

We then iterate the E-step and the M-step until convergence. Various convergence conditions can be specified; we use a criterion based on the incomplete likelihood: we iterate until the absolute difference between the incomplete data log-likelihoods at two consecutive iterations, relative to the value at the latest iteration, is close to zero. That is,

$$\begin{aligned} \frac{\left| \log L(\Theta ^{(t+1)} | \{y_{ij}\}, \{\bf{x}_{i}\}) - \log L(\Theta ^{(t)} | \{y_{ij}\}, \{\bf{x}_{i}\}) \right| }{\left| \log L(\Theta ^{(t)} | \{y_{ij}\}, \{\bf{x}_{i}\}) \right| } \approx 0. \end{aligned}$$
(13)

At the end of the process, we have estimates of the posterior probability of cluster membership for each row; these probabilities can take any value between 0 and 1. We assign each observation to the cluster for which it has the highest posterior probability.

We implemented the EM algorithm described above for the mixture-based proportional odds model, and set up the simulation study, using the statistical software R 4.0.2 (R Development Core Team, 2019). The numerical maximization in the M-step was carried out with the quasi-Newton method L-BFGS-B available in the built-in R function optim(), with default settings for all other control parameters. Alternative routines for maximum likelihood estimation of the cumulative proportional odds model, assuming effects \(\delta _{rj}\), could be explored to simplify the implementation.

We remark that an inherent drawback of mixture modelling is that the associated likelihood surface may be multimodal. We therefore tried different starting points, covering a comprehensive range of parameter values, to avoid becoming trapped in a local maximum. We reran the EM algorithm 10 times with random starting points and kept the run with the highest log-likelihood; in preliminary tests with up to 100 random starting points, 10 proved sufficient to avoid convergence to local optima. Finally, to ensure that this approach does not affect the final estimates, we used the EM estimates of the complete data likelihood as starting points (Fernández et al. 2016) for a direct numerical maximization of the incomplete data log-likelihood (7).
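Schematically, the restart strategy looks as follows, where `fit_em()` is a hypothetical wrapper around the E- and M-steps above that returns, among other things, the final log-likelihood.

```r
# Sketch of the multiple-random-starts strategy: rerun the EM algorithm from
# 10 random starting points and keep the run with the highest log-likelihood.
# `fit_em()` is a hypothetical wrapper returning a list with a $loglik entry.
runs <- lapply(1:10, function(s) fit_em(Y, X, R = 3, seed = s))
best <- runs[[which.max(sapply(runs, `[[`, "loglik"))]]
```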

3 Measures to compare clustering structures

This section discusses three popular measures for comparing clustering structures: the Adjusted Rand Index (ARI; Hubert and Arabie 1985), the variation of information (VI), and the normalised information distance (NID). Comparing clustering structures is complicated by the “label-switching” problem, whereby different labellings can describe identical clusters. The measures used here do not rely on cluster labels, but instead consider pairs of rows that are clustered together. The Rand Index (RI; Rand 1971) measures the similarity between two clustering structures based on how data points are assigned to clusters, but it has known limitations when comparing the replicability of different classifications. The ARI adjusts the Rand Index to correct for the agreement expected by chance, and ranges from 0 (totally independent structures) to 1 (identical structures). The VI measures the distance between partitions of the same data set using concepts of entropy and information (Meila 2007), and its normalised version (NVI) bounds it between 0 and 1 for comparability with the ARI. The NID is another information-theoretic measure bounded between 0 and 1; for both the NVI and NID, 0 indicates identical clustering structures and 1 indicates totally independent structures. To simplify interpretation, we work with the unit complements 1-NVI and 1-NID throughout.
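All three quantities are available in off-the-shelf R packages; the sketch below uses the aricode package (one possible implementation among several) to compare two illustrative hard partitions.

```r
# Illustrative sketch: comparing two hard partitions of 100 rows with the
# aricode package (one of several packages implementing these indices).
library(aricode)
set.seed(1)
z1 <- sample(1:3, 100, replace = TRUE)    # first clustering
z2 <- z1
flip <- runif(100) < 0.2                  # perturb about 20% of the labels
z2[flip] <- sample(1:3, sum(flip), replace = TRUE)
c(ARI = ARI(z1, z2), `1-NVI` = 1 - NVI(z1, z2), `1-NID` = 1 - NID(z1, z2))
```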

4 Simulation study

We set up a small-scale simulation study to test, in a diverse range of scenarios, how reliably we were able to estimate both the parameters of our proposed row clustering model (4) and the cluster allocations, using the EM algorithm. We are not testing model selection here: we simulate data sets and then fit the correct model to those data. This study is closely related to the one in Fernández et al. (2016).

We simulated data from the simplest covariate model (4), with only a single covariate \(x_i\) and no column effects. We designed two main scenarios for the true model by varying the values of the covariate effect parameters \(\{\delta _{r}\}\) across the clusters. Scenario 1 has both negative and positive covariate effects, so different clusters can have dramatically different covariate effects. Scenario 2, by contrast, has only positive covariate effects, which is likely to make the cluster parameters more difficult to fit, because the different clusters are more likely to produce similar response data than in Scenario 1.

The simulation program was written in R, and we did not observe any convergence issues. For each scenario, we ran several cases varying the following features:

  • Sample size: \(n=100,1000\)

  • Number of response categories: \(q=3, 4, 5, 6\)

  • Number of columns: \(m=3, 5,10\)

  • Number of row clusters: \(R=3, 5\)

  • Distribution of covariates: Normal (\(N(0, 1)\)), Binomial (\(Bin(1, 0.5)\))

In total, we ran 96 cases within each scenario. We generated 2000 replicate datasets for each combination of features using model (4) and calculated maximum likelihood estimates (MLEs) of the model parameters and their standard errors for each replicate. We then compared the estimated parameter values with the true parameter values and assessed the agreement between the true and estimated clustering structures using the Adjusted Rand Index (ARI), 1-Normalised Variation of Information (1-NVI), and 1-Normalised Information Distance (1-NID). To report the results, we computed the means of both the estimated model parameters and their corresponding standard errors over the 2000 simulated datasets.
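The data-generation step for a single replicate under model (4) can be sketched in R as follows; the parameter values are illustrative only.

```r
# Sketch of one simulated replicate from model (4): a single N(0,1) covariate
# and no column effects. Parameter values are illustrative only.
set.seed(123)
n <- 100; m <- 5; q <- 3; R <- 3
pi_r  <- rep(1/R, R)
mu    <- c(log(1/2), log(2))   # cutpoints for q = 3 (see Sect. 4.1)
alpha <- c(0, 2, 4)            # corner-point constraint alpha_1 = 0
delta <- c(-1, 0, 1)           # mixed-sign covariate effects (Scenario 1 style)
x <- rnorm(n)
z <- sample(1:R, n, replace = TRUE, prob = pi_r)  # true row-cluster memberships
Y <- matrix(NA_integer_, n, m)
for (i in 1:n) {
  eta <- mu - (alpha[z[i]] + x[i] * delta[z[i]])
  p   <- diff(c(0, plogis(eta), 1))               # category probabilities
  Y[i, ] <- sample(1:q, m, replace = TRUE, prob = p)
}
```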

4.1 Scenario 1

We simplified the study by using equal proportions of rows in each cluster: \((\pi _1,\ldots ,\pi _R) = (1/R,\ldots ,1/R)\). The cutpoint values \(\{\mu _k\}\) were chosen from the quantile function of the logistic distribution, i.e. \(\mu _k = \text {logit}(k/q)\). Therefore, the cutpoint values are \(\{ \mu _1 = \log (1/2), \mu _2= \log (2) \}\) when \(q=3\), and \(\{ \mu _1 = \log (1/4), \mu _2=\log (2/3), \mu _3 = \log (3/2), \mu _4 = \log (4) \}\) when \(q=5\). We used evenly spaced values for the row cluster effect parameters \(\alpha _r\), with the corner-point constraint \(\alpha _1 = 0\).
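In R, this choice of cutpoints is a single call to the logistic quantile function:

```r
# Cutpoints mu_k = logit(k/q), here for q = 5
q <- 5
mu <- qlogis((1:(q - 1)) / q)   # log(1/4), log(2/3), log(3/2), log(4)
```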

Table 1 summarizes the average absolute bias and the corresponding standard errors for each parameter over 2000 simulations when the fitted models are models (3) and (4). In all cases, the estimated parameters of model (4) are close to the true values, with small bias, and, as expected, the variability decreases with increasing sample size n. We also remark that the standard errors decrease as the number of ordinal categories increases (see additional results in Tables 4 and 5, Figs. 4 and 5, Appendix A). We believe this is because, as the number of ordinal categories increases, the response data become closer to continuous and the responses contain more information. On the other hand, the estimated parameters of model (3) perform poorly, i.e., they are very far from the true parameter values despite having modest standard errors (see additional results in Tables 6 and 7, Figs. 6 and 7, Appendix A).

Table 1 Scenario 1: The average absolute bias and standard error obtained for each parameter over 2000 simulations for models formulated in Eqs. (3) and (4)

4.2 Scenario 2

Scenario 2 is configured in the same way as Scenario 1, apart from the specific values of the covariate effect parameters \(\{\delta _{r}\}\): in Scenario 2, the covariate effects are in the same direction for all the clusters.

Figures 1 and 2 summarize the results of Scenario 2 for \(R=3\) for models (4) and (3). The performance is remarkably similar to that of Scenario 1: the estimates of the parameters in model (4) are closer to their true values than those in model (3), and the performance improves as the number of response categories q increases. The results of our simulation study indicate that the clustering procedure described in this article is able to recover the true parameter values in all tested instances. The remaining results for this scenario are shown in Tables 8, 9, 10, and 11 and Figs. 8 and 9 in Appendix A.

Fig. 1 Boxplots for Scenario 2, representing the estimated distribution of each parameter when the fitted model is model (4) and \(R = 3\)

Fig. 2 Boxplots for Scenario 2, representing the estimated distribution of each parameter when the fitted model is model (3) and \(R = 3\)

Fig. 3 Boxplots comparing the clustering structures recovered in the simulations for Scenarios 1 and 2, using the ARI, 1-NVI and 1-NID. Values on the vertical axis are averages across 2000 simulations. The group labels have the format “Sx.Ry.qz”, where x is the scenario number (1, 2), y the number of clusters (\(R=3,5\)), and z the number of ordinal categories (\(q=3,4,5,6\))

Finally, Fig. 3 shows the average ARI, 1-NVI, and 1-NID over all replicates and numbers of rows, by scenario, R, and q. These measures compare the row clustering structures of the true models with those of the fitted models. For Scenario 1, the mean ARI between the true and predicted cluster memberships is 0.65 when the data were fitted by model (4) with \(R=3\), \(q=3\), and 0.69 when \(R=3\), \(q=5\). Thus, the measure increases with increasing q, most likely because the data with \(q=5\) contain more information about the response. We observed equivalent results for the other two measures, 1-NVI and 1-NID. For Scenario 2, all three measures show patterns equivalent to Scenario 1, but the indices are smaller; for example, the ARI is 0.46 when \(R=3\), \(q=3\) and 0.49 when \(R=3\), \(q=5\). We conclude that if some covariate effects are positive and others negative (Scenario 1), it is easier to detect the correct clustering structure than if the covariate effects are all in the same direction (Scenario 2).

5 Application

We applied the models proposed in this article to the arthritis clinical trial data set (Lipsitz et al. 1996), which compares the drug auranofin with placebo therapy for the treatment of rheumatoid arthritis. The data set is obtained from the \({\textbf {R}}\) package multgee (Touloumis 2015). The response variable is the patient's self-assessment of arthritis, measured on a five-level ordinal scale from very poor (1) to very good (5). A total of 302 eligible patients were in the original data set, but only 289 patients completed the rheumatoid self-assessment questionnaire at all three follow-up times (first, third and fifth month of treatment); our analysis uses these 289 patients. The data can therefore be represented by a \(289 \times 3\) matrix \(\textbf{Y}\). The covariates we include in our model are gender (1 = female, 0 = male), age (in years), and treatment (1 = placebo, 0 = drug).

In this application, covariate-dependent clustering could help to identify subsets of patients with similar covariate patterns. This insight would be important because it provides a flexible approach for identifying potentially heterogeneous gender, age, and auranofin treatment effects on the arthritis scores. For instance, if elderly patients experience more symptoms and consequently tend to be more pessimistic about their arthritis status, our proposed model would allow us to distinguish subsets of older people who tend to report higher or lower arthritis scores. We note, however, that this is only an example, and we do not advocate the clinical relevance of the covariate-dependent clustering model; in real settings, clinicians and statisticians together should decide which model, i.e. no clustering, clustering with covariates, or clustering without covariates, is more relevant to their research questions.

After fitting the models without covariates (3) and with covariates (4), with different numbers of row clusters, we compared them using the information criteria AIC (Akaike 1973) and BIC (Schwarz 1978) (see results in Table 2). AIC indicates that the best model is the version of the row clustering model including the age and treatment covariates (\(\mu _k - (\alpha _{r} + x_{i1}\delta _{1r} + x_{i2}\delta _{2r})\)) with \(R=4\) row clusters (AIC = 2136.78), which outperforms its counterpart without covariates (AIC = 2154.40). However, BIC selects the model without covariates (\(\mu _k - \alpha _{r}\)) with \(R=4\) (BIC = 2202.05), likely because BIC penalizes the number of parameters more strongly than AIC does, leading to a preference for more parsimonious models. Based on our experience working with practitioners and researchers from other areas, we have chosen AIC as the standard measure for model selection; nevertheless, we acknowledge the value of BIC in providing more parsimonious models, and we include its results to ensure a comprehensive evaluation of model performance.
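The data preparation can be sketched as follows; the column names (y, id, time, sex, age, trt) follow the multgee documentation and should be treated as assumptions here.

```r
# Sketch of the data preparation; column names (y, id, time, sex, age, trt)
# follow the multgee documentation and are assumptions here.
library(multgee)
data(arthritis)                # long format: one row per patient x occasion
ok <- with(arthritis,          # keep patients with all 3 responses observed
           tapply(!is.na(y), id, function(v) length(v) == 3 && all(v)))
arth <- arthritis[arthritis$id %in% names(ok)[ok], ]
arth <- arth[order(arth$id, arth$time), ]
Y <- matrix(arth$y, ncol = 3, byrow = TRUE)              # 289 x 3 response matrix
X <- arth[!duplicated(arth$id), c("sex", "age", "trt")]  # row covariates
```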

Table 2 Results of row clustering models fitted to the arthritis data set
Table 3 Estimated parameters of two models, the first with no covariates and the second with the covariates age and treatment

Table 3 shows the estimated parameters of the two models. The row clustering model without covariates (3) separates the patients into four clusters (sorted from the best to the worst self-assessment scores). The first cluster has the largest row cluster effect (\(\alpha _1 = 4.20\)), followed by cluster 2 (\(\alpha _2 = 1.26\)), cluster 3 (\(\alpha _3 = -1.41\)), and cluster 4 (\(\alpha _4 = -4.04\)), suggesting that patients in cluster 1 feel best about their current state of arthritis among all the clusters. When we add the age and treatment covariates to the clustering model, the parameters \(\{\delta _{1r}\}\) and \(\{\delta _{2r}\}\) indicate the age and treatment effects within the clusters. For instance, the auranofin treatment did not show improvement for patients in cluster 1 (\(\delta _{21}= 0.23\)), but it did show improvement, to differing degrees, for patients in clusters 2, 3, and 4 (\(\delta _{22} = -0.84\), \(\delta _{23} = -0.82\) and \(\delta _{24} = -2.00\)). Moreover, older patients in cluster 3 (\(\delta _{13} = -0.22\)) tended to feel worse about their current arthritis status than older patients in the other clusters. The clustering model without covariates (3) therefore allows us to describe the overall patterns of patient feelings, and once we add the covariates (4), we can also identify subgroups of patients with similar covariate patterns.

Additionally, Table 12 (see results in Appendix A) compares the clustering structure agreement of the selected models with and without covariates, using the ARI, 1-NVI and 1-NID measures. The results assume each patient is allocated to the cluster for which they have the highest posterior probability of membership.

The comparison of clustering structure agreement (measured by the ARI, 1-NVI, and 1-NID) between the best model (model (4) with \(R=4\), including the age and treatment covariates) and its counterpart without covariates (model (3)) revealed distinct differences. The values of the three measures were 0.66 (ARI), 0.47 (1-NVI), and 0.64 (1-NID), indicating that models (3) and (4) resulted in different clustering structures. This was further confirmed by examining the detailed estimated memberships for individuals in Table 13 (see results in Appendix A). For example, when the data were fitted by model (4), nine patients (1, 62, 119, 124, 131, 223, 239, 243, and 266) originally assigned to cluster 1 were re-allocated to cluster 2. Similarly, two patients (79 and 238) from cluster 2 were re-allocated to cluster 3, and ten patients (63, 125, 153, 192, 215, 217, 219, 245, 267, and 285) from cluster 3 were re-allocated to cluster 4. These findings highlight the substantial impact of including covariates in model (4) on the clustering structures, underscoring the importance of considering covariate effects in the analysis.

Finally, Table 14 in Appendix C presents a comparison of the average age for different combinations of clusters and treatment (placebo or drug) under two models: one without covariates (\(\mu _k - \alpha _{r}\)) with \(R=4\) clusters, and one incorporating the covariates age and treatment (\(\mu _k - (\alpha _{r} + x_{i1}\delta _{1r} + x_{i2}\delta _{2r})\), where \(x_1\) represents age and \(x_2\) represents treatment), also with \(R=4\) clusters. Notable differences in mean age were observed within specific groups. For instance, individuals in group 1 receiving the drug treatment exhibited an increase in mean age from 35 to 41 when covariates were incorporated, whereas a discrepancy in the opposite direction was observed for individuals in group 2 receiving the placebo (54.5 vs 50.8). This comparison highlights the added insight provided by including covariates, shedding light on how the covariates relate to the composition of specific groups and treatment categories.

6 Discussion

This paper uses finite mixture models to cluster ordinal data via the proportional odds model, including covariates in the linear predictor; the proportional odds structure captures the inherent natural order of the responses.

We set up a simulation study to explore the reliability of the models with covariates across a range of cases, considering two scenarios in which the covariate effects vary from mixed directions to all in the same direction. In all cases, the estimates of the parameters are close to their true values, and we observed that the standard errors decrease as the number of ordinal categories increases and with increasing sample size n. Moreover, we compared the similarity of the true model and the fitted model for both scenarios based on the ARI, 1-NVI, and 1-NID indices. The row clustering structure with mixed-direction covariate effects was recovered better than the one with all-positive covariate effects.

We also illustrated our approach with the well-known “arthritis clinical trial” data set. The results of this application indicated that the best model according to AIC was the row clustering model with \(R=4\) including the age and treatment covariates. However, we remark that AIC is a standard procedure, and we consider that subject-matter experts and statisticians together should decide which model (with or without covariates) is more relevant to answer the research questions. The patients were clustered according to their similar patterns of responses and the effects of the covariates. That is, the four clusters have different age and treatment effects, which changes the interpretation of the clustering structure relative to the model in which the covariates are not taken into account. In this case, we could identify the individuals in each of the four groups based on their self-assessment scales and on how age and treatment are associated with these groups.

It is important to note that our proposed model is based on the cumulative version of the proportional odds model, which applies the proportional odds assumption. This assumption can be assessed in several ways: (1) examining graphical diagnostics, e.g. plotting the cumulative logit probabilities against the covariates, where systematic departures from parallelism indicate violations of the proportional odds assumption; (2) performing formal statistical tests, e.g. the Wald test proposed by Brant (1990); and (3) performing model diagnostics, e.g. examining the residuals for patterns, such as non-linearity or heteroscedasticity, which can provide evidence of violations of the assumptions. Additionally, we acknowledge that this represents a simplified approach that assumes a uniform covariate effect, which may not always be valid and may not capture the true complexity of the relationship between covariates and response variables in all cases. Further research is needed to explore more flexible models that can account for varying covariate effects on different response variables in different situations.
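As an illustration of check (2), the sketch below fits a no-cluster proportional odds model with MASS::polr and applies the Wald test from the brant package; the data are simulated here only so that the block is self-contained.

```r
# Sketch of check (2): Brant's Wald test of the proportional odds assumption
# on a no-cluster fit. Synthetic data, used only to keep the example runnable.
library(MASS)    # polr()
library(brant)   # brant()
set.seed(1)
n   <- 289
age <- rnorm(n, 50, 10)
trt <- rbinom(n, 1, 0.5)
lat <- 0.04 * age - 0.8 * trt + rlogis(n)   # latent continuous score
score <- cut(lat, quantile(lat, 0:5 / 5), labels = 1:5,
             include.lowest = TRUE, ordered_result = TRUE)
fit <- polr(score ~ age + trt, Hess = TRUE)
brant(fit)   # non-significant results support parallel cumulative logits
```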

Deciding whether a variable should be considered a response or a covariate is a crucial step in statistical analyses, including mixture-based clustering. By necessity, statistical analyses distinguish between response (dependent) variables and explanatory (independent) variables (Agresti 2014). In making this decision, it is essential to consider the research questions, theoretical considerations, and subject-matter knowledge about the variables under study. Researchers need to carefully assess whether a variable is of primary interest in answering the research questions, or whether it plays a supporting role because of its association with the variable of interest. Additionally, exploratory data analysis techniques, such as visualizations and correlation analyses, can provide valuable insights into the relationships between variables. By considering these factors and utilizing appropriate statistical techniques, researchers can make informed decisions regarding the allocation of variables as responses or covariates. In this study, we follow the general convention of treating variables as responses or covariates based on research interest. For instance, in our application, the main interest is the effect of a drug on arthritis status; however, if a researcher were investigating how patient age varies with arthritis status, age would likely be treated as the response variable. We acknowledge that this allocation of variables may vary depending on the specific research context, and researchers should adapt their approach accordingly.

Our approach assumes that all available variables are used in the modeling procedure. However, in many situations, considering all the variables unnecessarily increases the model complexity. Moreover, some variables may not possess any clustering information and are of no use in the detection of the group structure; rather, they could be detrimental to the clustering. The case where all the variables contain clustering information can also be problematic: along with the increasing number of dimensions comes the curse of dimensionality, and including superfluous variables in the model leads to identifiability problems and over-parameterization (Bouveyron and Brunet-Saumard 2014). Therefore, resorting to variable selection techniques can facilitate model fitting, ease the interpretation of the results, and lead to data classifications of better quality. Even in situations of moderate or low dimensionality, reducing the set of variables employed in the clustering process can be beneficial (Fowlkes et al. 1988; Raftery and Dean 2006; Andrews and McNicholas 2014). How the variable selection algorithm interacts with the model fitting process defines the overall approach to the problem; for a general learning task, the principal distinction is whether the selection is carried out separately from or jointly with the learning procedure (John et al. 1994; Dash and Liu 1997; Dy and Brodley 2004). Thus, the application of information criteria such as AIC or BIC would be a direct way to perform variable selection in our approach. Other alternatives are model-based selection methods such as stepwise selection and the LASSO. Additionally, domain knowledge or subject-matter expertise can guide the variable selection process by considering the relevance of covariates based on their theoretical importance or prior knowledge.

We performed a robustness analysis to assess the impact of outliers by introducing 3% outliers in the numerical variable age and re-fitting the models incorporating this covariate; the results are included in Table 5 of Appendix E. Interestingly, our analysis consistently revealed the emergence of an additional group according to AIC, resulting in a total of five groups, in contrast to the four groups found without outliers. Importantly, this newly identified group aligned with the rows containing the artificially introduced outliers. These findings demonstrate the sensitivity of our proposed method to outliers and its ability to capture their influence on the clustering structure. However, this analysis serves only as an illustrative example, and a more comprehensive robustness analysis is a potential avenue for future investigations.

We compared the best model according to AIC with the Partitioning Around Medoids (PAM) method using the Gower dissimilarity measure, to assess the equivalence in terms of the number of clusters and the cluster structure of covariate values. The results of this comparison can be found in Table 15 of Appendix D. Interestingly, this comparison revealed consistent results in the number of groups (\(R=4\)), while exhibiting slight differences in both the cluster structures and the covariate values. Notably, incorporating the covariates age and treatment in the model (\(\mu _k - (\alpha _{r} + x_{i1}\delta _{1r} + x_{i2}\delta _{2r})\)) resulted in a lower mean age for individuals in group 1 receiving the drug treatment, compared to the results obtained with the PAM method. A more comprehensive comparison of clustering methods would be an intriguing avenue for future research.
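A minimal sketch of this comparison pipeline, on synthetic mixed-type data of the same shape as the arthritis example, is given below; daisy() and pam() from the cluster package provide the Gower dissimilarities and PAM, respectively.

```r
# Sketch of the PAM comparison: Gower dissimilarities on mixed-type data
# (ordinal responses plus covariates), then PAM with k = 4 medoids.
# Synthetic data with the same shape as the arthritis example.
library(cluster)
set.seed(1)
dat <- data.frame(y1  = ordered(sample(1:5, 289, replace = TRUE)),
                  y2  = ordered(sample(1:5, 289, replace = TRUE)),
                  y3  = ordered(sample(1:5, 289, replace = TRUE)),
                  age = rnorm(289, 50, 10),
                  trt = factor(rbinom(289, 1, 0.5)))
d   <- daisy(dat, metric = "gower")   # handles ordinal, numeric, binary columns
fit <- pam(d, k = 4)
table(fit$clustering)                 # cluster sizes under the PAM solution
```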

This article demonstrated that including available covariates in the fitting process of mixture-based approaches for ordinal responses improves insight into the main characteristics of the clusters. The same idea could be implemented for different types of models focused on ordinal responses, such as the ordered stereotype and the adjacent-categories logit models. For future research, we plan to extend the model shown in this article to cluster rows and columns simultaneously (a.k.a. co-clustering or biclustering), which is a natural extension and can give more insight into the clustering structure of the data sets. Another natural and challenging extension would be to incorporate row and column covariates into our current approach, capturing their potential interactions; one possible approach might be a multi-level modelling framework or fitting two separate mixture models. Another interesting avenue to explore would be the potential application of our proposed procedure for data imputation in the presence of missing data. It could be extended to impute both ordinal responses and covariate values simultaneously, leveraging the estimated mixture models and capturing non-linear relationships and interactions between variables; an uncertainty-aware imputation approach using the EM algorithm could provide more realistic and robust imputed values. However, further research and validation would be needed to evaluate the performance of our proposed procedure as a data imputation method, in comparison with existing techniques, across various settings and data scenarios. As future work, we also plan to conduct additional comparisons with other existing methods to further evaluate the performance of our proposed method and provide a more comprehensive analysis. Finally, this research has considered the case where the responses in each column have the same number of ordinal response levels. This could be varied, but may require a separate set of parameters \(\{\mu _{jk}\}\) and \(\{\phi _{jk}\}\). The simulation and model fitting code in R is available on GitHub at https://github.com/vuw-clustering/clustering-covariates.