Background

In unsupervised clustering, the class label or state of nature of a sample is unknown, and the partitioning of the data is instead driven by similarity or distance measures. In some applications (e.g. genome-wide association studies and multi-omics data analyses), the number of clusters is also unknown. Because such biological measurements usually tend to follow a normal distribution, the samples within each cluster can be assumed to be Gaussian distributed.

Hierarchical clustering methods, which can be broadly categorized into agglomerative (bottom-up) and divisive (top-down) procedures, are well known [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. In agglomerative procedures, each sample is initially treated as its own cluster. At each step, the two nearest clusters (according to a distance measure or criterion function) are merged. Merging continues until all the samples belong to a single group, yielding a tree-like structure known as a dendrogram. If the number of clusters is provided, the amalgamation of clusters can be terminated once the desired number of clusters is reached. The first step of an agglomerative procedure considers all possible mergers of two samples, which requires n(n − 1)/2 combinations (where n denotes the number of samples). Divisive procedures, on the other hand, perform clustering in the inverse direction. They begin with one group containing all the samples and divide it into two groups at each stage until every group contains only a single sample [21, 22]. In the first step of a divisive procedure all possible bipartitions of the sample set are considered, which amounts to 2^(n − 1) − 1 combinations. This number grows exponentially and makes exhaustive divisive clustering practically difficult to implement, although a few divisive approaches do not consider all possible divisions [21]. In hierarchical classifications, each subcluster can be formed either by splitting a larger cluster into two or by the union of two smaller clusters. In either case, false decisions made in the early stages cannot be corrected later on. For this reason, divisive procedures, which start with the entire dataset, are in general considered safer than agglomerative approaches [21, 23], and the accuracy of a divisive procedure is expected to be higher than that of an agglomerative one [24]. However, the high computational demand (O(2^n) to O(n^5)) of divisive procedures has severely restricted their usage [24, 25] (though for special cases the complexity can be reduced further [26]). As a consequence, the divisive procedure has not been generally used for hierarchical clustering and remains largely ignored in the literature.
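As a rough illustration of this gap (not part of the original work), the following snippet evaluates both first-step combination counts for a small sample set:

```python
# Candidate operations at the first step of each strategy for n samples (illustrative only).
n = 20
agglomerative_first_step = n * (n - 1) // 2   # all pairwise mergers: n(n-1)/2
divisive_first_step = 2 ** (n - 1) - 1        # all two-way splits of one cluster: 2^(n-1) - 1

print(agglomerative_first_step)  # 190
print(divisive_first_step)       # 524287
```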

Hierarchical approaches do not require initial parameter settings and generally employ either linear or non-linear regression models [27,28,29]. Over the last few decades, a number of hierarchical approaches have been proposed; some of the popular schemes are summarized below. The single linkage agglomerative hierarchical approach (SLink) [30] merges the two nearest neighbouring groups, using the Euclidean distance to compute the proximity between two clusters. SLink is very sensitive to data location and occasionally generates groups in a long chain (known as the chaining effect). The chaining effect can be reduced by basing the linkage on the farthest distance instead, which is done by the complete linkage (CLink) hierarchical approach [2]. Nevertheless, CLink is sensitive to outliers. This sensitivity is further reduced by the average linkage (ALink) hierarchical approach [31, 32], which links groups using the average distance between them. In a similar way, the median linkage (MLink) hierarchical approach [33] uses the median distance for linking. In Ward's linkage (Wa-Link), clusters are merged based on the optimal value of an objective function [34]. In weighted average distance linkage (Wt-Link) hierarchical clustering [35, 36], group sizes are not considered when computing average distances; consequently, smaller groups are assigned larger weights during the clustering process [35]. Similarly, model-based hierarchical clustering [4, 8] uses an objective function: the method in [8] follows a Bayesian analysis with Dirichlet priors and a multinomial likelihood function, whereas the approach in [4] optimizes the distance between two Gaussian mixture models (GMMs). In these methods the number of groups must be defined in advance. Most of these approaches are constructed using the agglomerative procedure, though their construction is equally possible (at a higher computational cost) using the divisive procedure. Although divisive clustering is generally disregarded, some approaches, such as the DIANA (DIvisive ANAlysis) program, have been established [21]. Despite the existence of well-established methods (e.g. the EM algorithm [37, 38]) for estimating the parameters of a Gaussian mixture model, it is worth noting that hierarchical and expectation-maximization (EM) algorithms are very different in nature. The EM algorithm is an iterative optimization method that requires prior knowledge of the number of clusters; it begins with a random choice of cluster centers and therefore returns different sets of clusters for distinct runs. Hierarchical clustering algorithms, on the other hand, do not require such prior knowledge and return a unique set of clusters. These advantages often make hierarchical clustering methods preferable to the EM algorithm for biological datasets, where unique solutions are of utmost importance [39,40,41].

In this work, we describe a new Divisive hieRArchical maximum likelihOod clusteriNg approach, abbreviated hereafter as DRAGON. It is a top-down procedure that does not search over pairs of clusters. Instead, it removes one sample at a time, choosing the sample whose removal maximally increases the likelihood function. This process continues until the first cluster is obtained. This cluster is not subdivided further but removed from the sample set, and the same procedure is repeated on the remaining samples to obtain all the clusters. Removing one sample out of n samples requires a search over n candidates. This reduces the total search complexity to O(n^2 c) (where c is the number of clusters), a significant reduction for a top-down procedure. The following sections present the mathematical derivation of the proposed model, and the analyses carried out on synthetic as well as biological data to illustrate its usefulness.

Methods

Maximum likelihood clustering: an overview

This section provides an overview of the maximum likelihood method for clustering [22]. We are not yet introducing our method; instead, we give a brief description of the conventional maximum likelihood approach to clustering. It is possible to learn from unlabeled data if certain assumptions are made. We begin with the assumption that the parametric form of the probability densities is known and that the unknown parameter vector θ has to be estimated. The solution turns out to be similar to the supervised maximum likelihood estimation case; however, in the supervised case the group structure of the data is known, whereas in the unsupervised case one has to assume a parametric form of the data to reach a solution. Here we describe how to obtain the maximum likelihood estimate of the clusters of a given sample set χ. The cluster label is denoted by ω. Assuming there are c clusters in the sample set (c ≥ 1), we define Ω = {ω_j} (for j = 1, 2, …, c), where ω_j is the label of the jth cluster χ_j. (In many clustering problems c is unknown; this issue is dealt with in detail in a later section and in Additional file 1.) In this paper, we follow the notation of Duda et al. [22] for the convenience of readers. Let a sample set χ = {x_1, x_2, …, x_n} be defined in a d-dimensional space (it is assumed that d < n; for d ≫ n, dimensionality reduction techniques can first be applied for supervised or unsupervised learning tasks [42,43,44,45,46,47]). Let θ be an unknown parameter vector consisting of the mean μ and the covariance Σ. This specifies the mixture density as

$$ p\left(\left.{\mathbf{x}}_k\right|\boldsymbol{\uptheta} \right)={\sum}_{j=1}^cp\left(\left.{\mathbf{x}}_k\right|{\omega}_j,{\boldsymbol{\uptheta}}_j\right)P\left({\omega}_j\right) $$
(1)

where p(x_k|ω_j, θ_j) (for j = 1, …, c) is the conditional density, P(ω_j) is the a priori probability and θ = {θ_j}. The log-likelihood of the joint density is then

$$ L=\log p\left(\left.\chi \right|\boldsymbol{\uptheta} \right)=\log {\prod}_{k=1}^np\left(\left.{\mathbf{x}}_k\right|\boldsymbol{\uptheta} \right)={\sum}_{k=1}^n\log p\left(\left.{\mathbf{x}}_k\right|\boldsymbol{\uptheta} \right) $$
(2)
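Assuming Gaussian component densities, Eqs. (1) and (2) can be evaluated directly; the following NumPy/SciPy sketch is our own illustration (the function name and argument layout are not from the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, means, covs, priors):
    """L = sum_k log sum_j p(x_k | w_j, theta_j) P(w_j), as in Eqs. (1)-(2)."""
    # densities[k, j] = p(x_k | w_j, theta_j) for Gaussian components
    densities = np.column_stack([multivariate_normal.pdf(X, mean=m, cov=S)
                                 for m, S in zip(means, covs)])
    mixture = densities @ np.asarray(priors)   # p(x_k | theta), Eq. (1)
    return float(np.sum(np.log(mixture)))      # Eq. (2)
```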

Assuming the joint density p(χ|θ) is differentiable w.r.t. θ, then from Eqs. (1) and (2)

$$ {\nabla}_{{\boldsymbol{\uptheta}}_i}L={\sum}_{k=1}^n\frac{1}{p\left(\left.{\mathbf{x}}_k\right|\boldsymbol{\uptheta} \right)}{\nabla}_{{\boldsymbol{\uptheta}}_{\boldsymbol{i}}}\left[{\sum}_{j=1}^cp\left(\left.{\mathbf{x}}_k\right|{\omega}_j,{\boldsymbol{\uptheta}}_j\right)P\left({\omega}_j\right)\right] $$
(3)

where \( {\nabla}_{{\boldsymbol{\uptheta}}_i}L \) is the gradient of L w.r.t. θ_i. Assuming that θ_i and θ_j are independent, and writing the a posteriori probability as

$$ P\left(\left.{\omega}_i\right|{\mathbf{x}}_k,\boldsymbol{\uptheta} \right)=\frac{p\left(\left.{\mathbf{x}}_k\right|{\omega}_i,{\boldsymbol{\uptheta}}_i\right)P\left({\omega}_i\right)}{p\left(\left.{\mathbf{x}}_k\right|\boldsymbol{\uptheta} \right)} $$
(4)

then from Eq. (4) we can observe that \( \frac{1}{p\left(\left.{\mathbf{x}}_k\right|\boldsymbol{\uptheta} \right)}=\frac{P\left(\left.{\omega}_i\right|{\mathbf{x}}_k,\boldsymbol{\uptheta} \right)}{p\left(\left.{\mathbf{x}}_k\right|{\omega}_i,{\boldsymbol{\uptheta}}_i\right)P\left({\omega}_i\right)} \). Substituting this value in Eq. (3), we obtain

$$ {\nabla}_{{\boldsymbol{\uptheta}}_i}L={\sum}_{k=1}^nP\left(\left.{\omega}_i\right|{\mathbf{x}}_k,\boldsymbol{\uptheta} \right){\nabla}_{{\boldsymbol{\uptheta}}_i}\log p\left(\left.{\mathbf{x}}_k\right|{\omega}_i,{\boldsymbol{\uptheta}}_i\right) $$
(5)

Note that in Eq. (5), \( (1/f(z))\,{\nabla}_zf(z) \) has been rewritten as \( {\nabla}_z\log f(z) \). Equation (5) can be set equal to zero (\( {\nabla}_{{\boldsymbol{\uptheta}}_i}L=0 \)) to obtain the maximum likelihood estimate \( {\widehat{\boldsymbol{\uptheta}}}_i \). This gives the following solution (interested readers may refer to Duda et al. [22] for further details):

$$ P\left({\omega}_i\right)=\frac{1}{n}{\sum}_{k=1}^nP\left(\left.{\omega}_i\right|{\mathbf{x}}_k,\widehat{\boldsymbol{\uptheta}}\right) $$
(6)
$$ {\sum}_{k=1}^nP\left(\left.{\omega}_i\right|{\mathbf{x}}_k,\widehat{\boldsymbol{\uptheta}}\right){\nabla}_{{\boldsymbol{\uptheta}}_i}\log p\left(\left.{\mathbf{x}}_k\right|{\omega}_i,{\widehat{\boldsymbol{\uptheta}}}_i\right)=0 $$
(7)
$$ P\left(\left.{\omega}_i\right|{\mathbf{x}}_k,\widehat{\boldsymbol{\uptheta}}\right)=\frac{p\left(\left.{\mathbf{x}}_k\right|{\omega}_i,{\widehat{\boldsymbol{\uptheta}}}_i\right)P\left({\omega}_i\right)}{\sum_{j=1}^cp\left(\left.{\mathbf{x}}_k\right|{\omega}_j,{\widehat{\boldsymbol{\uptheta}}}_j\right)P\left({\omega}_j\right)} $$
(8)

In the case of a normal distribution, the unknown mean and covariance parameters θ = {μ, Σ} are substituted into Eqs. (6), (7) and (8) to yield the maximum likelihood estimates. The parameter θ is usually updated in an iterative fashion to attain \( \widehat{\boldsymbol{\uptheta}} \), e.g. by EM algorithms or hill-climbing schemes.
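For Gaussian components, Eqs. (6), (7) and (8) suggest a fixed-point (EM-style) update of the parameters; the sketch below is our own illustration of one such update, not the authors' implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, priors):
    """One fixed-point update of Eqs. (6)-(8) for Gaussian component densities."""
    c = len(priors)
    # Eq. (8): posterior P(w_i | x_k, theta_hat)
    dens = np.column_stack([multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
                            for i in range(c)]) * np.asarray(priors)
    post = dens / dens.sum(axis=1, keepdims=True)        # shape (n, c)
    # Eq. (6): updated priors; solving Eq. (7) for the normal case gives the
    # posterior-weighted mean and covariance updates below.
    new_priors = post.mean(axis=0)
    new_means = [(post[:, i] @ X) / post[:, i].sum() for i in range(c)]
    new_covs = []
    for i in range(c):
        D = X - new_means[i]
        new_covs.append((post[:, i, None] * D).T @ D / post[:, i].sum())
    return new_means, new_covs, new_priors
```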

DRAGON method: concept

Here we illustrate the clustering method DRAGON. In brief, the proposed procedure is top-down in nature. It initially considers the entire sample set as one cluster, from which one sample is removed at a time. Each removal increases the likelihood function, and the process continues until the maximum likelihood is reached, as depicted in Fig. 1 (where L_1 is the cluster likelihood at the beginning of the process and L_3 is the maximum likelihood after removing two samples). Once the first cluster is obtained, it is removed from the sample set and the procedure is repeated to obtain the subsequent clusters. Consequently, one cluster at a time is retrieved from the sample set. It is assumed that the samples within a cluster follow a multivariate normal distribution; however, the number of clusters is not known at the beginning of the process.

Fig. 1

An illustration of the DRAGON method. This procedure results in the formation of one cluster. One sample is removed at a time, chosen so as to maximally increase the likelihood function. At the beginning the likelihood is L_1, and after two iterations it becomes L_3, where L_3 > L_2 > L_1

To establish the maximum likelihood estimate in the divisive hierarchical context, we investigate the criterion function and the distance measure that satisfy it.

DRAGON method: algorithm

To find the distance measure, we first define the log-likelihood function of a cluster χ_s, where χ_s is a subset of χ. At the beginning, χ_s is the same as χ; in every subsequent iteration, however, a sample \( \widehat{\mathbf{x}} \) is removed from χ_s such that the likelihood function

$$ L={\sum}_{\mathbf{x}\in {\chi}_s}\log \left[p\left(\left.\mathbf{x}\right|\omega, \boldsymbol{\uptheta} \right)P\left(\omega \right)\right] $$
(9)

is maximized.

Since we are finding only one cluster in the sample set χ, the a priori probability P(ω) can be ignored. We would like to explore how the function L changes when a sample \( \widehat{\mathbf{x}} \) is taken out. Suppose the centroid μ and covariance Σ of χ_s are defined as

$$ \boldsymbol{\upmu} =\frac{1}{n}{\sum}_{\mathbf{x}\in {\chi}_s}\mathbf{x} $$
(10)
$$ \varSigma =\frac{1}{n}{\sum}_{\mathbf{x}\in {\chi}_s}\left(\mathbf{x}-\boldsymbol{\upmu} \right){\left(\mathbf{x}-\boldsymbol{\upmu} \right)}^{\mathrm{T}} $$
(11)

where n denotes the number of samples in χ_s. Assuming the component density is normal, Eq. (9) can be simplified as

$$ {\displaystyle \begin{array}{c}L=\sum \limits_{\mathbf{x}\in {\chi}_s}\log \left[\frac{1}{{\left(2\pi \right)}^{d/2}{\left|\Sigma \right|}^{1/2}}\exp \left[-\frac{1}{2}{\left(\mathbf{x}-\boldsymbol{\upmu} \right)}^{\mathrm{T}}{\varSigma}^{-1}\left(\mathbf{x}-\boldsymbol{\upmu} \right)\right]\right]\\ {}=-\frac{1}{2} tr\left[{\varSigma}^{-1}{\sum}_{\mathbf{x}\in {\chi}_s}\left(\mathbf{x}-\boldsymbol{\upmu} \right){\left(\mathbf{x}-\boldsymbol{\upmu} \right)}^{\mathrm{T}}\right]-\frac{nd}{2}\log 2\pi -\frac{n}{2}\log \left|\varSigma \right|\end{array}} $$

where the trace function is denoted by tr(·). Since \( tr\left[{\varSigma}^{-1}{\sum}_{\mathbf{x}\in {\chi}_s}\left(\mathbf{x}-\boldsymbol{\upmu} \right){\left(\mathbf{x}-\boldsymbol{\upmu} \right)}^{\mathrm{T}}\right]= tr\left({nI}_{d\times d}\right)= nd \), we can write L as

$$ L=-\frac{1}{2} nd-\frac{nd}{2}\log 2\pi -\frac{n}{2}\log \left|\varSigma \right| $$
(12)
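Eq. (12) can be checked numerically: with the maximum likelihood mean and covariance of Eqs. (10) and (11), the summed Gaussian log-density reduces exactly to this closed form. The check below is ours, for illustration only:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # a toy cluster chi_s
n, d = X.shape
mu = X.mean(axis=0)                           # Eq. (10)
Sigma = (X - mu).T @ (X - mu) / n             # Eq. (11)

direct = multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()
closed = -0.5 * n * d - 0.5 * n * d * np.log(2 * np.pi) - 0.5 * n * np.log(np.linalg.det(Sigma))
print(np.isclose(direct, closed))             # True
```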

If a sample \( \widehat{\mathbf{x}} \) is removed from χ_s, then the centroid and covariance (Eqs. (10) and (11)) change as follows

$$ {\boldsymbol{\upmu}}^{\ast }=\boldsymbol{\upmu} -\frac{\widehat{\mathbf{x}}-\boldsymbol{\upmu}}{n-1} $$
(13)
$$ {\varSigma}^{\ast }=\frac{n}{n-1}\varSigma -\frac{n}{{\left(n-1\right)}^2}\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right){\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right)}^{\mathrm{T}} $$
(14)
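The rank-one updates of Eqs. (13) and (14) can be verified against a direct recomputation after deleting the sample; the snippet below (ours) does so on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
n = len(X)
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / n

x_hat = X[7]                                  # sample to be removed
X_rest = np.delete(X, 7, axis=0)

# Recompute from scratch ...
mu_direct = X_rest.mean(axis=0)
Sigma_direct = (X_rest - mu_direct).T @ (X_rest - mu_direct) / (n - 1)

# ... and via the downdates of Eqs. (13) and (14)
mu_star = mu - (x_hat - mu) / (n - 1)
Sigma_star = n / (n - 1) * Sigma - n / (n - 1) ** 2 * np.outer(x_hat - mu, x_hat - mu)

print(np.allclose(mu_star, mu_direct), np.allclose(Sigma_star, Sigma_direct))   # True True
```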

In order to observe the change in the likelihood function of Eq. (12), we provide the following Lemma.

Lemma 1 Assume a point \( \widehat{\mathbf{x}} \) is taken out of the set χ_s, changing the centroid and covariance as described by Eqs. (13) and (14). Then the determinant of Σ* is given by

$$ \left|{\varSigma}^{\ast}\right|={\left(\frac{n}{n-1}\right)}^d\left|\varSigma \right|\left(1-\frac{1}{n-1}{\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right)}^{\mathrm{T}}{\varSigma}^{-1}\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right)\right) $$

Proof From Eq. (14), the determinant of Σ * will be

$$ \left|{\varSigma}^{\ast}\right|=\left|\frac{n}{n-1}\varSigma -\frac{n}{{\left(n-1\right)}^2}\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right){\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right)}^{\mathrm{T}}\right| $$
(L1)

For any square matrices A and B of size m × m, we can write |AB| = |A||B| and |cA| = c^m|A|, where c is any scalar. This enables us to rewrite Eq. (L1) as

$$ \left|{\varSigma}^{\ast}\right|={\left(\frac{n}{n-1}\right)}^d\left|\varSigma \right|\left|{I}_{d\times d}-\frac{1}{n-1}\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right){\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right)}^{\mathrm{T}}{\varSigma}^{-1}\right| $$
(L2)

By Sylvester's determinant theorem, \( \left|{I}_{m\times m}+ AB\right| = \left|{I}_{n\times n}+ BA\right| \), where A ∈ ℝ^{m × n} and B ∈ ℝ^{n × m} are rectangular matrices. This allows us to write

$$ \left|{I}_{d\times d}-\frac{1}{n-1}\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right){\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right)}^{\mathrm{T}}{\varSigma}^{-1}\right|=\left|1-\frac{1}{n-1}{\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right)}^{\mathrm{T}}{\varSigma}^{-1}\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right)\right| $$

Since the determinant of a 1 × 1 matrix is the scalar itself, the Lemma is proved by substituting this expression into Eq. (L2). □
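A quick numerical check of Lemma 1 (our own, illustrative) compares the determinant of the downdated covariance of Eq. (14) with the closed form stated above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 3
X = rng.normal(size=(n, d))
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / n
x_hat = X[0]

# Downdated covariance from Eq. (14)
Sigma_star = n / (n - 1) * Sigma - n / (n - 1) ** 2 * np.outer(x_hat - mu, x_hat - mu)

P = (x_hat - mu) @ np.linalg.solve(Sigma, x_hat - mu)      # (x_hat - mu)^T Sigma^{-1} (x_hat - mu)
lemma = (n / (n - 1)) ** d * np.linalg.det(Sigma) * (1 - P / (n - 1))
print(np.isclose(np.linalg.det(Sigma_star), lemma))        # True
```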

It is now possible to define the change in L as

$$ {L}^{\ast }=L-\varDelta L $$
(15)

where ΔL is defined as

$$ \varDelta L=-\frac{1}{2}\log \left|\varSigma \right|+\frac{n-1}{2}\log \left(1-\frac{P}{n-1}\right)+\frac{n-1}{2}d\log \frac{n}{n-1}-\frac{d}{2}-\frac{d}{2}\log 2\pi $$
(16)

and P is expressed as

$$ P={\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right)}^{\mathrm{T}}{\varSigma}^{-1}\left(\widehat{\mathbf{x}}-\boldsymbol{\upmu} \right) $$
(17)

It can be observed from Eqs. (15) to (17) that when a sample \( \widehat{\mathbf{x}} \) is taken out of cluster χ_s, the change in L mainly depends on the term P, as all the other terms remain unchanged. If we want to select \( \widehat{\mathbf{x}} \) such that L* > L, we need to solve the following maximization problem

$$ \widehat{\mathbf{x}}=\arg {\max}_{\mathbf{x}\in {\upchi}_{\mathrm{s}}}P $$
(18)

Therefore, by repeatedly removing \( \widehat{\mathbf{x}} \) in this way, the likelihood increases until its maximum value is reached.
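In code, the selection rule of Eqs. (17) and (18) amounts to picking the sample with the largest Mahalanobis distance from the current centroid. The helper below is our own sketch (its name is ours, not the paper's):

```python
import numpy as np

def sample_to_remove(X_s):
    """Index of the sample whose removal maximally increases L, per Eqs. (17)-(18)."""
    n = len(X_s)
    mu = X_s.mean(axis=0)                      # Eq. (10)
    Sigma = (X_s - mu).T @ (X_s - mu) / n      # Eq. (11)
    diffs = X_s - mu
    # P_k = (x_k - mu)^T Sigma^{-1} (x_k - mu), Eq. (17), for every candidate sample
    P = np.einsum('ij,ij->i', diffs, np.linalg.solve(Sigma, diffs.T).T)
    return int(np.argmax(P))                   # Eq. (18)
```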

This procedure can track the location of the cluster with the highest density or likelihood. Because one sample is taken out at a time, it can be sensitive to the data positions around the center of the cluster; it is thus possible to locate the center of a cluster while missing its complete topology. To reduce this sensitivity, additional processing to refine the cluster is useful.

By taking out one sample at a time, we obtain a cluster χ_s that provides the maximum likelihood. All the samples taken out are collected in a set \( {\chi}_s^{\ast } \), where \( {\chi}_s\cup {\chi}_s^{\ast }=\chi \). The centroid of cluster χ_s is obtained as μ_s = E[χ_s]. It can then be employed to compute the distance of every sample from μ_s; i.e. d_x = δ(x, μ_s) ∀ x ∈ χ, where δ denotes a distance measure (in this case the Euclidean metric) and d_x is a 1-dimensional sample or point corresponding to x. Thereafter, a centroid-based clustering scheme can be applied in this distance metric or inner product space (by considering the 1-dimensional data and partitioning it into 2 groups) to readjust the cluster χ_s. Here clustering is applied on a distance measure, which can be either the Euclidean norm or any form of kernel (as it is derived from the dot product). This procedure can be repeated if μ_s changes substantially. The overall method is summarized in Table 1, illustrated in Additional file 1 (slides 1–6), and sketched in code immediately after Table 1 below.

Table 1 DRAGON Method
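A minimal end-to-end sketch of one DRAGON pass follows, reflecting our reading of Table 1. The stopping test (peeling ends once no removal increases the likelihood of Eq. (12)), the minimum-size guard and all names are our own assumptions; the readjustment uses 2-group k-means on the 1-D distances, as described above:

```python
import numpy as np
from sklearn.cluster import KMeans

def log_likelihood(X_s):
    """Cluster log-likelihood, Eq. (12)."""
    n, d = X_s.shape
    mu = X_s.mean(axis=0)
    Sigma = (X_s - mu).T @ (X_s - mu) / n
    return -0.5 * n * d - 0.5 * n * d * np.log(2 * np.pi) - 0.5 * n * np.log(np.linalg.det(Sigma))

def extract_one_cluster(X):
    """Peel off samples until the likelihood stops increasing, then readjust the cluster."""
    idx = np.arange(len(X))
    L = log_likelihood(X)
    while len(idx) > X.shape[1] + 1:                 # guard so the covariance stays full rank (our choice)
        Xs = X[idx]
        diffs = Xs - Xs.mean(axis=0)
        Sigma = diffs.T @ diffs / len(Xs)
        P = np.einsum('ij,ij->i', diffs, np.linalg.solve(Sigma, diffs.T).T)   # Eq. (17)
        cand = np.delete(idx, int(np.argmax(P)))                              # Eq. (18)
        L_new = log_likelihood(X[cand])
        if L_new <= L:                               # assumed stopping rule: maximum likelihood reached
            break
        idx, L = cand, L_new
    # Readjustment: 2-group k-means on the 1-D distances to the retained cluster's centroid
    mu_s = X[idx].mean(axis=0)
    d_x = np.linalg.norm(X - mu_s, axis=1).reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(d_x)
    near = labels[int(np.argmin(d_x))]               # the group nearest to the centroid
    return np.where(labels == near)[0]               # indices of the refined cluster

# Subsequent clusters would be obtained by removing the returned indices from the
# data and repeating the call, as described in the text.
```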

The next issue is the estimation of the number of clusters (c). If the value of c is given, the clusters are straightforward to locate. However, in some applications c is unknown. In such situations, a range of values of c can be supplied to the procedure, and the best value in the range can be estimated. If no clue about c is available, a maximum number of possible clusters C can be investigated and the best value among them (c ≤ C) chosen. In order to estimate c, we first define the total likelihood function as

$$ {L}_t(c)={\sum}_{i=1}^c{L}_i $$
(19)

where L_i is the likelihood of the ith cluster. The total likelihood L_t can be computed for different values of c. If, for a particular number of clusters k, the variation between two successive total likelihoods is not significant, then we estimate c to be k. Let the difference between two successive total likelihoods be δL_t(k) = L_t(k + 1) − L_t(k); this quantity can be normalized as

$$ \delta {L}_t\leftarrow \frac{\delta {L}_t-\min \left(\delta {L}_t\right)}{\max \left(\delta {L}_t\right)-\min \left(\delta {L}_t\right)} $$
(20)

where δL_t is normalized over all the possible values of k.

Figure S5 in Additional file 1 illustrates the above explanation with a dataset of 4 clusters.
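Given total likelihoods L_t(1), …, L_t(C) from Eq. (19), the normalization of Eq. (20) and the "no significant change" rule can be sketched as follows (the threshold value is our own illustrative choice, not prescribed by the paper):

```python
import numpy as np

def estimate_num_clusters(L_t, threshold=0.1):
    """Estimate c from total likelihoods L_t(1..C) via Eqs. (19)-(20)."""
    L_t = np.asarray(L_t, dtype=float)
    dLt = L_t[1:] - L_t[:-1]                             # delta L_t(k) = L_t(k+1) - L_t(k)
    dLt = (dLt - dLt.min()) / (dLt.max() - dLt.min())    # Eq. (20), min-max normalization
    small = np.where(dLt < threshold)[0]                 # first k whose change is insignificant
    return int(small[0]) + 1 if len(small) else len(L_t)
```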

DRAGON method: search complexity

In this section, we briefly discuss the search complexity of the DRAGON method. As explained above, the proposed method begins by taking one sample out of the sample set so as to increase the likelihood function, which requires a search over n candidates. In the second iteration the search reduces to n − 1 candidates. Finding a cluster containing n_1 samples therefore requires (1/2)(n − n_1)(n + n_1 + 1) searches in total (see Additional file 2 for details), and the search for c clusters results in O(n^2 c). Note that the corresponding search in conventional divisive hierarchical approaches is very expensive, in the order of O(2^n), so DRAGON's O(n^2 c) represents a considerable reduction. Furthermore, DRAGON employs the k-means clustering algorithm in the dot product space as an intermediate step. The computational complexity of k-means is linear in the number of samples (e.g. using Lloyd's algorithm it is O(2nt), because the dimensionality in the intermediate step is 1 and the number of classes is 2; here t represents the number of iterations).
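As a small worked example (ours), the closed-form search count can be checked against the step-by-step sum of searches:

```python
# Searches needed to isolate one cluster of n1 samples out of n:
# n + (n-1) + ... + (n1+1) = (1/2)(n - n1)(n + n1 + 1)
n, n1 = 400, 100
step_by_step = sum(range(n1 + 1, n + 1))
closed_form = (n - n1) * (n + n1 + 1) // 2
print(step_by_step, closed_form)   # 75150 75150
```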

Results and discussion

To validate the DRAGON method, we performed analyses using synthetic and biological data. We further compared its clustering accuracy with that of existing hierarchical methods.

Analysis on synthetic data

For the analysis on synthetic data, we generated Gaussian data of dimensionality d with 4 clusters. The data consisted of 400 samples with a topology similar to that described in Figure S1 of Additional file 1. Using different random seeds, we generated the data 20 times and calculated the clustering accuracy on each occasion. We then computed the average clustering accuracy over the 20 attempts to obtain a statistically stable value. The dimensionality of the generated data was increased from 2 to 30. For evaluation purposes, we also applied agglomerative hierarchical methods, namely SLink, CLink, MLink, ALink, Wa-Link and Wt-Link. The average clustering accuracies of these approaches over dimensionality d are shown in Fig. 2a. For all the above methods we provided the number of clusters, i.e. c = 4. For the DRAGON method (Table 1), we performed two iterations of step 5 of Table 1 in all the experiments. Additionally, we assessed seven previously compared divisive hierarchical methods [24]: ALink (average link), SLink (single link), CLink (complete link), DunnsOrg (Dunn's original), DunnsVar (Dunn's variant), Mac-Smith (Macnaughton-Smith) and PDDP (Principal Direction Divisive Partitioning). The average clustering accuracies of these divisive methods are summarized in Fig. 2b. As Fig. 2b shows, when the dimensionality of the data increases, Mac-Smith performs poorly, whereas ALink, SLink and DunnsOrg slightly improve their performance. The data dimension did not noticeably affect the accuracy of DunnsVar, which remained around 50%. The highest accuracy among the divisive methods was achieved by PDDP (roughly 80%); however, this accuracy is still lower than that of DRAGON.

Fig. 2

Average clustering accuracy (over 20 attempts) on synthetic data with four clusters for (a) DRAGON and various agglomerative hierarchical methods, and (b) divisive hierarchical methods

It can be observed from Fig. 2 that, on Gaussian data, the DRAGON method provides promising results compared with the other hierarchical methods, whether agglomerative or divisive. The Wa-Link hierarchical method shows the next-best results after DRAGON over dimensionality d = 2, …, 30.

To avoid limiting our validation to Gaussian datasets, we carried out additional analyses on synthetic data that included Pathbased [48], Flame [49] and Aggregation [50] datasets. The results are summarized in Tables S1.1, S1.2 and S1.3 of Additional file 1.

Analysis on biological data

We also used biological datasets, namely acute leukemia [51], mixed-lineage leukemia (MLL) [52] and mutation data from The Cancer Genome Atlas, to assess the clustering accuracy of the hierarchical approaches studied in this paper. The datasets are summarized below:

Acute leukemia dataset – comprises DNA microarray gene expression profiles from human acute leukemia samples. Two subtypes are present: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The dataset has 25 AML and 47 ALL bone marrow samples with a dimensionality of 7129.

MLL dataset – contains three classes: ALL, MLL and AML. It comprises 72 leukemia samples, of which 24 belong to ALL, 20 to MLL and 28 to AML, with a dimensionality of 12,582.

Mutation dataset – derived from The Cancer Genome Atlas project (https://tcga-data.nci.nih.gov/docs/publications/tcga/?). It includes mutation data for breast cancer, glioblastoma, kidney cancer and ovarian cancer. The data are divided into two groups of 416 samples and 90 samples, which contain 1636 genes.

To vary the number of features (dimensions), we employed the Chi-squared method to rank the genes (the InfoGain feature selection method was also used; see Additional file 3). We then performed clustering on the samples and computed the clustering accuracy for dimensionality d = 2, …, 5. The clustering accuracies (measured with the AccMeasure package, 2011: http://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement/content/AccMeasure.m) on the acute leukemia, MLL and mutation datasets are reported in Tables 2, 3 and 4, respectively. The best outcomes are highlighted in bold.

Table 2 Clustering accuracy (%) on acute leukemia dataset
Table 3 Clustering accuracy (%) on MLL dataset
Table 4 Clustering accuracy (%) on mutation dataset

From Table 2 it can be noticed that Wa-Link, Wt-Link and MLink achieve the highest performance when d = 2, while DRAGON shows the highest performance for the other dimensions (d = 3, 4 and 5). From Table 3, it can be seen that DRAGON achieves reasonable performance for all the dimensions. On the mutation data (Table 4), DRAGON shows the highest performance for d = 2 and slightly lower performance (82.2%) for the other dimensions; Dunn's original and Macnaughton-Smith provide the highest performance when d = 3. Despite the reasonable performance of the divisive clustering methods in Table 4, it is worth noting that their running times were extremely long, especially as the number of samples increases, as is the case with the mutation data. In general, the DRAGON method exhibited promising results in terms of clustering accuracy compared with the other hierarchical methods, and its search complexity of O(n^2 c) is significantly lower than that of conventional divisive approaches.

Conclusions

In this work, we proposed a divisive hierarchical maximum likelihood clustering method, DRAGON, whose search complexity is reduced to O(n^2 c). Its overall clustering accuracy showed a significant improvement over that of agglomerative hierarchical clustering methods on both synthetic and biological datasets.