Biclustering Models for Two-Mode Ordinal Data

The work in this paper introduces finite mixture models that can be used to simultaneously cluster the rows and columns of two-mode ordinal categorical response data, such as those resulting from Likert scale responses. We use the popular proportional odds parameterisation and propose models which provide insights into major patterns in the data. Model-fitting is performed using the EM algorithm, and a fuzzy allocation of rows and columns to corresponding clusters is obtained. The clustering ability of the models is evaluated in a simulation study and demonstrated using two real data sets. Electronic supplementary material The online version of this article (doi:10.1007/s11336-016-9503-3) contains supplementary material, which is available to authorized users.


Introduction
Measurement data with ordinal categories occur frequently and in many fields of application. For example in medicine, a continuous clinical response is often categorised into ordered subtypes based on histological or morphological terms. In a questionnaire, Likert scale responses might be "better", "unchanged" or "worse". When analysing such data, it is of interest to link the ordinal responses to a set of explanatory variables.
One motivation for the PO model assumes that the ordinal response has an underlying continuous variable (Anderson & Philips, 1981), called a latent variable, that follows a logistic distribution. The extensive use of the PO model is due to its parsimony for modelling the effect of covariates on the response, compared to other similar models such as the baseline-category 612 PSYCHOMETRIKA logit model, thanks to the use of the proportional odds property (Agresti, 2010, Sect. 3.3.1). Additionally, the model parameters are invariant to the way the categories for the ordinal response are formed (Agresti, 2010, Sect. 3.3.3).
In the analysis of two-mode data matrices, with the modes being for example subjects and questions and with all of the elements being ordered categorical responses, one might be interested in modelling the effect of both the rows and columns on the response. An example of such data is an n by p matrix that summarises the responses of n individuals to p questions, each with q possible (ordered) responses. In this case, the PO model can be fitted to identify, for example, individuals and questions that tend to be linked with higher values of the ordinal response.
However, the number of parameters in the PO model increases as the number of rows or columns in the data set increases. As a result, interpretation becomes problematic for large data sets. Identifying patterns related to the heterogeneity of the data, for example clusters of rows or columns that have similar effect on the response, is challenging. Therefore, the formulation of model approaches taking into account the row and column cluster structure of the data is needed.
The work in this paper has been motivated by this need to model potential heterogeneity among the, assumed independent, ordinal responses in two-mode data by identifying row and/or column clusters. As well as a single-mode clustering, our proposed model provides a two-mode clustering, or biclustering, for fuzzy allocation of the rows and/or columns to corresponding clusters. This way, the number of parameters can be reduced considerably as rows and/or columns are clustered in corresponding homogeneous groups assumed to have the same effect on the response. The results provide insights into major patterns in the data, and row/column clusters can be compared and ranked according to their effect on the ordinal response.
A number of model-based or distance-minimising biclustering methods exist that allocate, probabilistically or not, the rows and columns of a data set containing continuous, binary or count data to corresponding clusters. Examples include the double k-means method of Vichi (2001) and Rocci and Vichi (2008) which, as the name suggests, resembles the k-means algorithm (Hartigan & Wong, 1979), and the block mixture models of Nadif (2003, 2010). Pledger and Arnold (2014) have recently proposed a group of likelihood-based models fitted using the Expectation-Maximisation algorithm (EM) (Dempster, Laird, & Rubin, 1977) for simultaneous fuzzy clustering of the rows and columns of binary or count data.
In this paper, we generalise the Pledger and Arnold (2014) work to the case of ordinal categorical response data, specifically using the PO model parameterisation. The proposed model structure is an extension of the one-mode clustering model given by Desantis et al. (2008).
Section 2 describes the model structure. The performance of several model selection criteria in selecting the true number of clusters in the data when our proposed model is used is assessed in Sect. 3.1. The reliability of the clustering resulting from our proposed model is evaluated, using simulation, in Sect. 3.2. Finally, applications to two real data sets are shown in Sects. 4.1 and 4.2 and the resulting clusters are compared to those obtained by double k-means (Vichi, 2001).

Background: Proportional Odds Model
Consider the data set as an n × p matrix Y with entry y i j the realisation of a categorical distribution with q cells and θ i j1 , . . . , θ i jq probabilities, q k=1 θ i jk = 1, ∀i, j. Let the set of model parameters be denoted by φ φ φ.
Under the PO model, and in the case where the additive effect of rows and columns on the response is considered where μ k is the kth cut-off point, with μ 1 < μ 2 < · · · < μ q−1 , and α i , β j are, respectively, the effect of row i, column j on the response, with α 1 = β 1 = 0. The total number of model parameters is equal to ν = (q − 1) + (n − 1) + ( p − 1).

Biclustering: Simultaneous Clustering of Rows and Columns
Suppose that the rows come from a finite mixture with R components or row clusters while the columns come from a finite mixture with C components or column clusters. Rows that belong to the same row cluster, r , are assumed to have the same effect on the response, modelled using parameter α r . Similarly, columns that belong to the same column cluster c have the same effect on the response modelled by parameter β c . If cell i, j belongs to row group r and column group c then, under the PO model and assuming an additive effect of the clusters on the response, The proportion of rows in row group r is π r and the proportion of columns in column group c is κ c , with R r =1 π r = C c=1 κ c = 1. As the rows and columns in the same row and column cluster, respectively, share the same parameters, α r and β c , respectively, there are now (q − 1) + 2(R − 1) + 2(C − 1) parameters in the model, where R ≤ n and C ≤ p. Choosing R n and C p ensures that the number of independent parameters in this model is lower than the number of parameters in the proportional odds model formulated in expression (2).
However, cluster membership is typically unknown and hence the (incomplete data) likelihood sums over all possible partitions of rows into R clusters and over all possible partitions of columns into C clusters (φ φ φ, π π π, κ κ κ|Y) = log 614 PSYCHOMETRIKA where π r i and κ c j is the proportion of rows and columns, respectively, that belong to row group r , column group c for the particular partition i, j, of rows and columns into R and C clusters, respectively.
Here, following Pledger and Arnold (2014, Sect. 2.2.2), we adopt a finite mixture model which, assuming row-based conditional independence, we can describe using the following (incomplete data) log-likelihood which sums over the possible column cluster partitions only. Equation (5) is obtained from Eq. (4) by taking terms of the i product through the r sums. The additive model shown in Eq.
(3) can be extended to a model which allows for an interaction between the row and column cluster effects, denoted by parameters γ , by modelling the logits of the cumulative probabilities as and, assuming constraints r γ rc = 0 ∀c and c γ rc = 0 ∀r , increasing the number of parameters by (R − 1)(C − 1) compared to the additive case. The model can also be altered to consider one-mode clustering, and the set of different models that can be fitted are shown in Table 1 with details given in Appendix A. The first two columns in Table 1, labelled as "R" and "C", denote, respectively, the number of row and column clusters assumed in the model when R = 1 and C = 1 all rows/columns are homogeneous forming a single row/column cluster, when R = n and C = p all rows/columns are heterogeneous, each forming its own row/column cluster, when R = r and C = c there are r and c homogeneous row/column clusters, respectively. Additionally, models incorporating an interaction term are indicated by the associated parameters γ lk with l indexing the row clusters and k the column clusters.
We denote by Z ir and X jc the indicator random variables for group membership of row i in row group r and column j in column group c, respectively. We use the EM algorithm (Dempster et al., 1977) by treating cluster membership as the missing data and derive estimates of the posterior probability of allocation of row i to row cluster r and of column j to column cluster c, given respectively by E(Z i j ) = z ir and E(X jc ) = x jc , for i = 1, . . . , n, j = 1, . . . , p, r = 1, . . . , R and c = 1, . . . , C with R r =1 z ir = C c=1 x jc = 1, ∀i, j. The lack of a posteriori independence of the Z ir and X jc makes the evaluation of the expected value of their product computationally expensive as it requires a sum either over all possible allocations of rows to row groups, or over all possible allocations of columns to column groups. The variational approximation (Govaert & Nadif, 2005) which we employ (see Appendix A.3.1. for details) is a solution to this problem.
We give details of the EM algorithm steps in Appendix A for all models listed in Table 1. All the computer code is written in R (R Core Team, 2014), and the (complete data) loglikelihood (given in Appendix A) is maximised using the Newton-Raphson algorithm provided as an option in optim to estimate parameters μ 1 , . . . , μ q−1 and the effects of row and column clusters, as well as their interaction, if these exist in the model being fitted. Since the likelihood surface is multimodal, the EM algorithm is started from a number of different points and the iteration with the highest obtained likelihood value is retained (Everitt, Landau, Leese, & Stahl, 2011). The R code to fit the models is available upon request from the first author.
The following constraints are placed, where appropriate: : a single row cluster, R = r : r row clusters, R = n: each row is in its own cluster. Similarly, C = 1: a single column cluster, C = c: c column clusters and C = p: each column is in its own cluster. For example, when R = 1, C = c, the rows form one cluster, while the columns form c clusters and the logits of the cumulative probabilities in the PO model for column cluster c and 1 ≤ k < q are logit P(Y i j ≤ k) = μ k − β c , for all rows. If on the other hand R = n, C = c, the cumulative probabilities for row i, column cluster c are, assuming an interaction between row and column effects and 1

Simulation Studies
We have performed two simulation studies: one to evaluate the performance of 10 model selection criteria in recovering the true number of clusters when our proposed models are used (Sect. 3.1) and one to evaluate the reliability of our proposed models (Sect. 3.2).
Following Fernández, Arnold, and Pledger (2014), we set up a simulation study to empirically establish a relationship between our likelihood-based models for ordinal data, specifically using the PO model, and the performance of 10 information criteria (Table 2) in recovering the true number of cluster components.
We set n = 150, p = 15, q = 4, R = 3 and C = 2. We specified five scenarios by varying the row and column mixing proportions: a data set with similar dimensions (n = 150 and p = 15) to the data analysed in the example in Sect. 4.2 (Scenario 1), balanced row and column mixing proportions (Scenario 2), balanced column mixing proportions but unbalanced row proportions (Scenario 3), unbalanced row and column mixing proportions (Scenario 4) and one of the row mixing proportions close to zero (Scenario 5).
For each scenario, we simulated 100 data sets and noted the selected model using each of the 10 criteria out of models with R = 1, 2, 3, 4, 5 and C = 1, 2, 3, 4, 5. For each simulated data set, the EM algorithm was repeated 10 times with random starting points and the best ML estimates (those that led to highest log-likelihood value) were kept. Figure 1 displays the percentage of cases in which each information criterion correctly recovered the true number of row and column clusters, i.e. the true model that generated the data, 616 PSYCHOMETRIKA  (Bozdogan, 1987) −2 + ν(1 + log(np)) BIC (Schwarz, 1978) −2 + ν log(np) AIC3 (Bozdogan, 1994) −2 + 3ν Clustering ν CLC (Biernacki & Govaert, 1997) −2 + 2EN EN NEC(R) (Biernacki, Celeux, & Govaert, 1999) EN − (1) ICL-BIC (Biernacki et al., 2000) −2 c + ν log(np) ν , np and EN AWE (Banfield & Raftery, 1993) −2 c + 2ν 3 2 + log (np) is the maximised incomplete-data log-likelihood (see Eq. 5); (1) is the maximised incomplete-data loglikelihood without clustering structure; and c is the maximised complete-data log-likelihood given in Appendix A. The third column categorises the criteria according to whether they were proposed for model selection in a regression setting or for clustering. The last column indicates whether the penalty depends on the number of parameters, ν, the total sample size which is the number of elements in the response matrix Y , np, and/or the entropy function, EN(·) = − c . averaged across the five scenarios. AIC3 has the best performance (selecting the correct model in 78 % of cases), followed by BIC (75 %), AIC, AIC c , AIC u and CAIC (73 %).
Our results are in accordance with Fonseca and Cardoso (2007) for the categorical case. ICL-BIC is underestimating the number of clusters (selecting a smaller number of clusters in 32 % of cases) and CLC is overestimating the number of clusters in 29 % of cases. A very poor performance is obtained by AWE and NEC (selecting the correct model in 46 and 24 % of cases, respectively).
It is important to highlight that these results are simply evaluating the ability of model selection criteria in selecting the right number of clusters in the mixture, but not necessarily in providing the best clustering structure for the data.

Model Evaluation
In this section, we evaluate the performance of our proposed method in (i) biclustering, varying the cluster sizes and the sample size and (ii) one-dimensional row clustering, compared to that of double k-means (Vichi, 2001) and standard k-means, respectively.
The response {Y i j } values are generated from a categorical distribution with size 1 and probabilities constrained as in expression (1). We assign the first 1/3 of rows to row cluster 1, the second 1/3 to row cluster 2 and the last 1/3 to row cluster 3. Similarly, the first 1/κ 1 of columns are assigned to column cluster 1, and the rest of the columns to column cluster 2. We simulate 100 data sets for each scenario. Table 3 shows the mean of parameter estimates obtained for α 2 , α 3 and β 2 from 100 simulated data sets. We are aware of the bias in the estimated parameters when n or p are small. This is due to the fact that the clusters are not fixed and hence their effect on the response is not fixed either. For example, a group of subjects who belong to a certain cluster in the true model might be allocated into a different cluster for a simulated data set. Or, they might be separated into different clusters. However, when both n and p are large, the means are close to the true parameters, because it is less likely to allocate a large number of subjects to a wrong cluster and, hence, the clusters themselves are more similar to the true clusters.
Regardless of the bias, the overall result shows that for balanced cases with (κ 1 , κ 2 ) = (0.5, 0.5), the estimates of the column effects are closer to the truth than for highly unbalanced cases (κ 1 , κ 2 ) = (0.2, 0.8) when n is small. The unbalanced column clusters do not affect the quality of the row cluster effect estimates. In general, when both n and p increase, the quality of row cluster effect estimates improves. The standard errors are between 0.05 to 0.5 for the cases of p = 10. For the other cases, they range from 0.001 to 0.08.
To evaluate the clustering ability of our proposed method, we calculate the average proportion of times that the pairwise grouping is correct (Rand index, Rand, 1971) over 100 simulated data sets. For example, if two rows are in the same cluster for the true model, but the proposed method allocates them to different clusters, then this pair is mis-clustered and vice-versa. We report the average Rand index for all row/column pairs in Table 4 when (κ 1 , κ 2 ) = (0.5, 0.5) and (0.2, 0.8) for both our proposed approach and the double k-means algorithm (Vichi, 2001). The two approaches have similar performance which improves as n and p increase and when the column clusters are balanced. For our approach, the largest standard error is 0.03 for the highly unbalanced cases and most standard errors are between 0.001 to 0.01.
When p is large, there are more data points for each row. When q is large, the ordered categorical response has a finer scale. For the row cluster effects {α r , r = 1, 2, 3}, the last setting (0, 1, 4) gives an unbalanced effect where the difference between the first two clusters is small, but the first two clusters are quite different from the third cluster.   Table 5 shows the average Rand index for 1000 simulated data sets for each of the scenarios, comparing the proposed method (POFM) with k-means. All standard errors for the index are less than 0.0026. Most of them are around 0.001. POFM performs better than k-means when the cluster effects are balanced. In general, the greater n, p, q or the cluster effects are, the better the performance. The only case when k-means considerably outperforms POFM is when (α 1 , α 2 , α 3 ) = (0, 1, 4) and p is large. For this particular case, POFM fails to distinguish between Clusters 1 and 2, and partitions the individuals into only two clusters, leaving one of the clusters empty. However, the quality of the row clustering is still satisfactory, with the average Rand index greater than 70 % in all cases.

Religious beliefs
We consider part of the data set from a study first published by Wiech et al. (2008). Twelve individuals, self-classified as religious, replied to 16 questions, shown in Appendix B, all rated on a 6-point Likert scale, (1) "Strongly disagree", …, (6) "Strongly agree". The questions were  designed to assess an individual's beliefs on the level of control that god (first 8 questions) and powerful other individuals (last eight questions) have on their lives. The biclustering model proposed in Sect. 2 was fitted to the 12 by 16 matrix by considering R, C = 2, . . . , 4. The model with the greatest support by AIC3 has R = 3, C = 2 and an interaction between row group effects and column group effects.
The two column clusters separate the questions into the two categories (god and others) almost perfectly. Cluster 1 includes questions {1, 2, 3, 4, 5, 6, 8, 10}, while Cluster 2 includes questions {7, 9, 11, 12, 13, 14, 15, 16}. The three row clusters are {3, 4, 5, 6, 8, 9, 10, 12}, {1, 2, 11} and {7}. Double k-means (Vichi, 2001) gives the same row clusters and similar column clusters {2, 3, 4, 5, 6, 8}, {1, 7, 9, 10, 11, 12, 13, 14, 15, 16}. The estimated probabilities of replying 3 or above to each of the two question clusters for the three row clusters are shown in Figure 2. All row groups tend to agree more with godrelated questions than with questions related to the effect of other powerful people. The estimated probabilities of agreeing with the god-related questions do not vary considerably between the three row clusters. However, that is not the case for the second column group since Row Cluster 1 and particularly Row Cluster 3, which consists of Individual 7 alone, tend to give lower scores than individuals in Row Cluster 2. Note that in addition, Individual 7 strongly agrees with questions in Cluster 1, demonstrating more extreme views than individuals belonging to the other clusters, who tend to be more moderate in their answers.

Attempted Suicides
The data set was collected as part of a study of patients admitted for deliberate self-harm (DSH) at the Acute Medical Departments of three major hospitals in Eastern Norway. We consider the answers of 151 individuals to 13 questions, shown in Appendix C, that were designed to assess the level of depression of the respondent by means of the Beck Depression Inventory-Short Form Percent of individuals from the five POFM clusters, represented in the rows, that are clustered in the corresponding five double k-means (Vichi, 2001)  (BDI-SF) (Furlanetto, Mendlowicz, & Romildo Bueno, 2005). Response options range from 1 to 4, with higher scores indicating higher levels of depression (Beck, Schuyler, & Herman, 1974). We fitted biclustering models with R = 2, . . . , 5 and C=2 or 3. The model supported by AIC3 has R = 5, C = 2 and an additive effect of row and column groups on the response.
The two column clusters are {1, 2, 3, 4, 5, 7, 8, 10, 13} and {6, 9, 11, 12}, with the first cluster receiving higher scores than the second ( β 2 = −0.99(0.10)), suggesting that the nine questions of Cluster 1 are, possibly, markers of more severe forms of depression. The allocation of individuals to the five row groups is in proportion to 0.211, 0.266, 0.208, 0.030, 0.285. Double k-means (Vichi, 2001) gives the following column clusters: {2, 3, 4, 5, 6, 7, 8} and {1, 9, 10, 11, 12, 13}. For row clusters, we present the proportion of individuals from each of our clusters that are allocated to each of the double k-means clusters in Table 6, where it can be seen that with the exception of Cluster 4, the highest proportions appear in the diagonal of the table.
The fourth row cluster, which consists of four individuals, is believed to show the most signs of depression since α 4 = 1.8(0.32). The first cluster follows with α 1 = 0 since it is the baseline, followed by Clusters 5 ( α 5 = −1.14(0.12)), 2 ( α 2 − 2.37(0.13)), and 3 ( α 3 = −3.79(0.16)). In fact, no one in Cluster 4 contacted someone for help after their attempt, while the corresponding proportions for the other four clusters are all greater than 25 %, which demonstrates the greater determination of individuals in Cluster 4 to succeed in their attempt. Of course, the size of Cluster 4 is possibly too small to make meaningful comparisons of this type. However, the proportion of individuals in Clusters 1, 5, 2 and 3 that had at least one episode of DSH within three months after the study is, respectively, equal to 30, 24, 16 and 3.4 %. DSH is one of the most robust predictors of subsequent death by suicide (Hawton, Casanas, Comabella, Haw, & Saunders, 2013). The risk of suicide among DSH patients treated at hospital is 30-to 200-fold in the year following an episode compared to individuals with no history of DSH (Cooper et al., 2005;Hawton et al., 2012;Owens, Horrocks, & House, 2002). Our model has successfully ordered the groups in terms of their risk of DSH within three months since the data we considered were collected.

Discussion
Our biclustering models identify homogeneous groups of both rows and columns in twomode data sets of ordinal responses, reducing the number of parameters needed to adequately describe the data and therefore easing interpretation. They fully account for the ordinal nature of the responses, while, being likelihood-based, give access to tools for selecting between possible models.
We have performed an extensive simulation study to compare the performance of a number of model selection criteria in identifying the correct number of mixture components for models and data such as the ones we considered in our applications, conditional on using the EM algorithm and the variational approximation of Govaert and Nadif (2005). The variational approximation is known to produce local optima, and hence it is recommended to use different random starting values for several runs of the EM algorithm. Recently, Keribin, Brault, Celeux, and Govaert (2014) developed latent block models for categorical data, considering a Bayesian approach, which do not require the aforementioned approximation. The potential to develop such models for the PO parameterization is a matter of future research.
In the two real data applications considered, both including questionnaire-type data designed to gain knowledge about the participants' personality, feelings and way of thinking, the clusters identified by the model agree with our knowledge of the system and provide useful insight of the characteristics of the participants. Especially in the example of Sect. 4.2, the way the participants were clustered agrees with information collected three months after the study was conducted.
In the analysis presented in Sect. 4.2 we have considered only individuals with complete records, excluding participants with missing data. Missing data are often present in similar studies; and, hence, future work could extend the models to deal with such issues. Fitting the models using a Bayesian approach could provide a way of dealing with the missing data and also of choosing the right number of clusters, as, for example, in van Dijk, van Rosmalen, and Paap (2009) and Wyse and Friel (2012), or of appropriately averaging over models, for example using reversible jump MCMC (Green, 1995).
Substantial developments in specialised methods for ordinal data have recently been made (see Liu & Agresti, 2005, for an overview). For instance, Fernández et al. (2014) have recently developed one-and two-dimensional clustering models for ordinal data having a likelihood-based foundation. They did this by using the assumption of the ordinal stereotype model, which allows the determination of a new spacing of the ordinal categories, as dictated by the data. The models presented in this paper may be extended to other ordinal models such as the adjacent-categories logit models, continuation-ratio logit models, and mean response models (see Agresti, 2012, for details on these models). Similarly, incorporating covariates into the model, when these are available, is straightforward by adjusting the linear predictor accordingly.
We have presented the case when q, i.e. the number of levels, is the same for all variables. However, the models are easily extended to allow for a set of cutpoints to be calculated for each unique value of q observed in the data set.
The area of application of these models is extremely wide and includes market research, where questions of the type "How likely are you to buy this product in the future" have possible responses "Very likely to buy", "Likely to buy", "May or may not buy", etc. Additionally, the models are useful for services, such as websites, that review products, such as books, music albums, hotels. and provide recommendations to the users according to their own past reviews, as they can simultaneously cluster the individuals according to their taste, but also the products according to the reviews they have received from all users.
Future research will develop a graphical method for matrix visualisation, taking the resulting probabilities of allocation for each individual data point into account. The existing graphical methods rely on the use of ad hoc distance metrics and similarity measures which, as we have noted above, do not respect the full ordinal nature of the data.