Abstract
Clustering is an important data processing tool for interpreting microarray data and for genomic network inference. In this article, we propose a clustering algorithm based on the hierarchical Dirichlet process (HDP). HDP clustering introduces a hierarchical structure into the statistical model, which captures the hierarchical features prevalent in biological data such as gene expression data. We develop a Gibbs sampling algorithm based on the Chinese restaurant metaphor for HDP clustering. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces unnecessary clustering fragments.
1 Introduction
Microarray technology has made it possible to monitor the expression levels of thousands of genes in parallel under various conditions [1]. Due to the high-volume nature of microarray data, one often needs algorithms to investigate gene functions, regulatory relations, etc. Clustering is considered an important tool for analyzing biological data [2–4]. The aim of clustering is to group the data into disjoint subsets such that the data within each subset are similar to each other. In particular, for microarray data, genes in each cluster exhibit correlated expression patterns across experiments.
Several clustering methods have been proposed, most of which are distance-based algorithms. That is, a distance is first defined and the clusters are then formed based on the distances between data points. Typical algorithms in this category include the K-means algorithm [5] and the self-organizing map (SOM) algorithm [6]. These algorithms are based on simple rules and often suffer from robustness issues, i.e., they are sensitive to noise, which is extensive in biological data [7]. They also depend on user-supplied parameters. For example, the SOM algorithm requires the user to provide the number of clusters in advance, so an incorrect estimate of this parameter may lead to a wrong result.
Another important category of clustering methods is the model-based algorithms. These algorithms employ a statistical approach to model the structure of clusters. Specifically, data are assumed to be generated by some mixture distribution. Each component of the mixture corresponds to a cluster. Usually, the parameters of the mixture distribution are estimated by the EM algorithm [8]. The finite-mixture model [9–11] assumes that the number of mixture components is finite and that this number can be estimated using the Bayesian information criterion [12] or the Akaike information criterion [13]. However, since the estimation of the number of clusters and the estimation of the mixture parameters are performed separately, the finite-mixture model may be sensitive to the choice of the number of clusters [14].
The infinite-mixture model has been proposed to cope with the above sensitivity problem of the finite-mixture model. This model does not assume a specific number of components and is primarily based on Dirichlet processes [15, 16]. The clustering process can equivalently be viewed as a Chinese restaurant process [17], where the data are considered as customers entering a restaurant. Each component corresponds to a table with infinite capacity. A new customer joins a table according to the current assignment of seats.
Hierarchical clustering (HC) is yet another approach, especially suited to biological data [18], which groups together data with similar features based on the underlying hierarchical structure. Biological data often exhibit hierarchical structure, e.g., one cluster may overlap heavily with, or be embedded in, another cluster [19]. If such hierarchical structure is ignored, the clustering result may contain many fragmental clusters that could have been combined. Hence, for biological data, HC has advantages over many traditional clustering algorithms. The performance of HC algorithms depends highly on the quality of the data and on the specific agglomerative or divisive strategy used for combining clusters.
Traditional clustering algorithms for microarray data usually assign each gene a feature vector formed by its expressions in different experiments, and the clustering is carried out on these vectors. It is well known that many genes share different levels of functionalities [20]. The resemblances of different genes are commonly represented at different levels of perspectives, e.g., at the cluster level instead of the individual gene level. In other words, the relationships among different genes may vary across experiments. In Figure 1, we illustrate the hierarchical structure of genes in microarray data. Gene groups A and B may show a close relationship to gene group C in some experiments, while gene group D shows correlations to groups A, B, and C in other experiments. Group D thus has a hierarchical relationship to the other gene groups. In this case, we desire an HC algorithm that recognizes gene resemblances not at the single-gene level but at the higher cluster level, to avoid unnecessary fragmental clusters that impede the proper interpretation of the biological information. Such an HC algorithm may also provide new information by taking the hierarchical similarities into account.
In this article, we propose a model-based clustering algorithm for gene expression data based on the hierarchical Dirichlet process (HDP) [21]. The HDP model incorporates the merits of both the infinite-mixture model and HC. The hierarchical structure is introduced to allow sharing data among related clusters. On the other hand, the model uses Dirichlet processes as the nonparametric Bayesian prior, which does not assume a fixed number of clusters a priori.
The remainder of the article is organized as follows. In Section 2, we introduce some necessary mathematical background and formulate the HC problem as a statistical inference problem. In Section 3, we derive a Gibbs sampler-based inference algorithm based on the Chinese restaurant metaphor of the HDP model. In Section 4, we provide experimental results of the proposed HDP algorithm for two applications, regulatory network segmentation and gene expression clustering. Finally, Section 5 concludes the article.
2 System model and problem formulation
As in any model-based clustering method, it is assumed that the gene expression data are random samples from some underlying distributions. All data in one cluster are generated by the same distribution. For most existing clustering algorithms, each gene is associated with a vector containing its expressions in all experiments, and the clustering of the genes is based on these vectors. However, such an approach ignores the fact that genes may show different functionalities under various experimental conditions, i.e., different clusters may be formed under different experiments. In order to cope with this phenomenon, we treat each expression separately. More specifically, we allow different expressions of the same gene to be generated by different statistical models.
Suppose that for the microarray data, there are N genes in total. For each gene, we conduct M experiments. Let g_{ j i } denote the expression of the i th gene in the j th experiment, 1≤i≤N, and 1≤j≤M. For each g_{ j i }, we associate a latent membership variable z_{ j i }, which indicates the cluster membership of g_{ j i }. That is, if gene i under experiment j and gene i^{′} under experiment j^{′} are in the same cluster, we have $z_{ji}=z_{j'i'}$. Note that z_{ j i } is supported on a countable set such as $\mathbb{N}$ or $\mathbb{Z}$. For each g_{ j i }, we associate a coefficient $\theta_{z_{ji}}$, whose index is determined by its membership variable z_{ j i }. In order to have a Bayesian approach, we also assume that each coefficient θ_{ k } is drawn independently from a prior distribution G_{0}
$$\theta_k\mid G_0\sim G_0,$$(1)
where k is determined by z_{ j i }.
The membership variable z={z_{ j i }}_{j,i} has a discrete joint distribution
$$\mathbf{z}\sim\Pi.$$(2)
Note that in this article, the boldface letter always refers to a set formed by the elements with specified indices.
We assume that each g_{ j i } is drawn independently from a distribution $F(\theta_{z_{ji}})$
$$g_{ji}\mid z_{ji},\boldsymbol{\theta}\sim F(\theta_{z_{ji}}),$$(3)
where $\theta_{z_{ji}}$ is a coefficient associated with g_{ j i } and F is a distribution family such as the Gaussian distribution family. In summary, we have the following model for the expression data
$$\theta_k\mid G_0\sim G_0,\qquad \mathbf{z}\sim\Pi,\qquad g_{ji}\mid z_{ji},\boldsymbol{\theta}\sim F(\theta_{z_{ji}}).$$(4)
The above model is a relatively general one that subsumes many previous models. In a Bayesian approach, all variables are assigned proper priors. It is very popular to use a mixture model as the prior, which models the data as generated by a mixture of distributions, e.g., a linear combination of a family of distributions such as Gaussian distributions. Each cluster is generated by one component of the mixture distribution given the membership variable [14]. This approach corresponds to our model if we assume that Π is finitely supported and F is Gaussian.
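As an illustration only, the finite-mixture special case of the model can be simulated with the following minimal Python sketch. It assumes a Gaussian G_{0}, a Gaussian F with unit variance, and memberships drawn independently from fixed weights (one particular choice of Π); the function name and the parameters `prior_sd` and `noise_sd` are hypothetical and do not come from the article.

```python
import random


def sample_finite_mixture(n_genes, n_experiments, weights,
                          prior_sd=3.0, noise_sd=1.0, seed=0):
    """Draw (theta, z, g) from a finite Gaussian mixture version of the model.

    Assumptions (not from the article): G_0 = N(0, prior_sd^2),
    F(theta) = N(theta, noise_sd^2), and z_ji drawn i.i.d. from `weights`.
    """
    rng = random.Random(seed)
    n_clusters = len(weights)
    # theta_k ~ G_0, one coefficient per cluster
    theta = [rng.gauss(0.0, prior_sd) for _ in range(n_clusters)]
    # z[j][i]: membership of gene i in experiment j
    z = [[rng.choices(range(n_clusters), weights=weights)[0]
          for _ in range(n_genes)]
         for _ in range(n_experiments)]
    # g[j][i] ~ F(theta_{z_ji})
    g = [[rng.gauss(theta[z[j][i]], noise_sd) for i in range(n_genes)]
         for j in range(n_experiments)]
    return theta, z, g
```

Note that each expression g_{ j i } carries its own membership, so the same gene may fall into different clusters in different experiments.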
The aim of clustering is to determine the posterior probability of the latent membership variables given the observed gene expressions
$$P(\mathbf{z}\mid\mathbf{g}),$$(5)
where g={g_{ j i }}_{j,i}.
For a clustering algorithm, the final result is given in the form of clusters. Each gene has to be assigned to one and only one cluster. Once we have the inference result in (5), we can apply the maximum a posteriori criterion to obtain an estimate of the membership variable $\widehat{z}_{\cdot i}$ for the i th gene as
$$\widehat{z}_{\cdot i}=\arg\max_{a}\sum_{j}P(z_{ji}=a\mid\mathbf{g}).$$(6)
We note that if one is interested in finding other related clusters for a gene, one can simply use the inferred distribution of the membership variables to obtain this information.
2.1 Dirichlet processes and infinite mixture model
Instead of assuming a fixed number of clusters a priori, one can assume an infinite number of clusters to avoid the problem of accurately estimating the number of clusters mentioned earlier. Correspondingly, in (4) the prior Π is an infinite discrete distribution. As usual in the Bayesian approach, we introduce priors for all parameters. The Dirichlet process is one such prior. It can be viewed as a random measure [15], i.e., the domain of this process (viewed as a measure) is a collection of probability measures. In this section, we give a brief introduction to the Dirichlet process, which serves as the vital prior in our HDP model.
Recall that the Dirichlet distribution $\mathcal{D}(u_1,\dots,u_K)$ of order K on a (K−1)-simplex in $\mathbb{R}^{K-1}$ with parameters u_{1},…,u_{ K } is given by the following probability density function
$$f(x_1,\dots,x_K)=\frac{\Gamma\left(\sum_{i=1}^{K}u_i\right)}{\prod_{i=1}^{K}\Gamma(u_i)}\prod_{i=1}^{K}x_i^{u_i-1},$$(7)
where $\sum_{i=1}^{K}x_i=1$, $x_i\ge 0$, $u_i>0$, $i=1,\dots,K$, and Γ(·) is the Gamma function. Since every point in the domain is a discrete probability measure, the Dirichlet distribution is a random measure in the finite discrete probability space.
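For concreteness, the Dirichlet density can be evaluated numerically as in the following minimal sketch. It uses the standard form of the density with the normalizer computed through log-Gamma functions for numerical stability; the function name is illustrative, and points on the boundary of the simplex are simply assigned density zero as a simplification.

```python
import math


def dirichlet_pdf(x, u):
    """Density of the Dirichlet distribution D(u_1, ..., u_K) at x.

    x must lie in the interior of the simplex (positive entries summing to 1);
    any other point is given density 0.0 in this sketch.
    """
    if len(x) != len(u):
        raise ValueError("x and u must have the same length")
    if any(ui <= 0 for ui in u):
        raise ValueError("all parameters u_i must be positive")
    if any(xi <= 0 for xi in x) or abs(sum(x) - 1.0) > 1e-9:
        return 0.0
    # log of Gamma(sum u_i) / prod Gamma(u_i)
    log_norm = math.lgamma(sum(u)) - sum(math.lgamma(ui) for ui in u)
    # log of prod x_i^(u_i - 1)
    log_kernel = sum((ui - 1.0) * math.log(xi) for xi, ui in zip(x, u))
    return math.exp(log_norm + log_kernel)
```

With u_{1}=…=u_{ K }=1 the density is constant on the simplex and equals Γ(K), which is a convenient sanity check.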
The Dirichlet processes are the generalization of the Dirichlet distribution into the continuous space. There are various constructive or nonconstructive definitions of Dirichlet processes. For simplicity, we use the following nonconstructive definition.
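Let (X,σ,μ_{0}) be a probability space. A Dirichlet process D(α_{0},μ_{0}) with parameter α_{0}>0 is defined as a random measure such that, for any nontrivial finite partition (χ_{1},…,χ_{ r }) of X with χ_{ i }∈σ, the random vector
$$\left(\mathcal{G}(\chi_1),\dots,\mathcal{G}(\chi_r)\right)\sim\mathcal{D}\left(\alpha_0\mu_0(\chi_1),\dots,\alpha_0\mu_0(\chi_r)\right),$$(8)
where $\mathcal{G}$ is drawn from D(α_{0},μ_{0}).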
The Dirichlet processes can be characterized in various ways [15] such as the stick-breaking construction [22] and the Chinese restaurant process [23]. The Chinese restaurant process serves as a visualized characterization of the Dirichlet process.
Let x_{1},x_{2},… be a sequence of random variables drawn from the Dirichlet process D(α_{0},μ_{0}). Although we do not have an explicit formula for D, we would like to know the conditional probability of x_{ i } given x_{1},…,x_{i−1}. In the Chinese restaurant model, the data can be viewed as customers sequentially entering a restaurant with an infinite number of tables. Each table corresponds to a cluster with unlimited capacity. Each customer x_{ i } entering the restaurant joins an occupied table with probability proportional to the number of customers already seated there, or sits at a new table with probability proportional to α_{0}. Tables that are already occupied by many customers therefore tend to gain more and more customers.
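The seating rule of the Chinese restaurant process can be sketched in a few lines of Python. This is a minimal simulation of the prior only (no data are involved), with illustrative names; occupied tables are weighted by their current number of customers and a new table by α_{0}.

```python
import random


def chinese_restaurant_process(n_customers, alpha0, seed=0):
    """Simulate table assignments under a Chinese restaurant process.

    Returns (seating, counts): seating[i] is the table index of customer i,
    counts[t] is the number of customers at table t.
    """
    rng = random.Random(seed)
    counts = []   # customers per occupied table
    seating = []
    for _ in range(n_customers):
        # occupied tables are weighted by their size, a new table by alpha0
        weights = counts + [alpha0]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[table] += 1    # join an occupied table
        seating.append(table)
    return seating, counts
```

Larger values of α_{0} tend to open more tables, i.e., to produce more clusters for the same number of customers.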
One remarkable property of the Dirichlet process is that although it is generated by a continuous process, it is discrete (countably many) almost surely [15]. In other words, almost every sample distribution drawn from the Dirichlet process is a discrete distribution. As a consequence, the Dirichlet process is suitable to serve as a nonparametric prior of the infinite mixture model.
The Dirichlet mixture model uses the Dirichlet process as a prior. The model in (4) can then be represented as follows:
θ_{ k } is generated by the measure μ_{0}
{z_{ j i }} is generated by a Dirichlet process D(α_{0},μ_{0})
Recall that a draw from D(α_{0},μ_{0}) is discrete almost surely; its atoms correspond to the indices of the clusters.
2.2 HDP model
Biological data such as expression data often exhibit hierarchical structures. For example, although clusters can be formed based on similarities, some clusters may still share certain similarities among themselves at different levels of perspectives. Within one cluster, the genes may share similar features. But at the level of clusters, one cluster may share some similar features with other clusters. Many traditional clustering algorithms fail to recognize such hierarchical information and are not able to group these similar clusters into a new cluster, producing many fragments in the final clustering result. As a consequence, it is difficult to interpret the functionalities and meanings of these fragments. Therefore, it is desirable to have an algorithm that is able to cluster among clusters. In other words, the algorithm should be able to cluster based on multiple features at different levels. In order to capture the hierarchical structure of the gene expressions, we now introduce the hierarchical model to allow clustering at different levels. The clustering algorithm based on the hierarchical model not only reduces the number of cluster fragments, but may also reveal more details about the unknown functionalities of certain genes, since the clusters may share multiple features.
Recall that in the statistical model (11), the clustering effect is induced by the Dirichlet process D(α_{0},μ_{0}). If we need to take into account different levels of clusters, it is natural to introduce a prior with a clustering effect on the base measure μ_{0}. Again, the Dirichlet process can serve as such a prior. The intuition is that, given the base measure, the clustering effect is represented through a Dirichlet process at the single-gene level. By the Dirichlet process assumption on the base measure, the base measure also exhibits the clustering effect, which leads to clustering at the cluster level. We simply set the prior on the base measure μ_{0} as
$$\mu_0\sim D_1(\alpha_1,\mu_1),$$(12)
where D_{1}(α_{1},μ_{1}) is another Dirichlet process. In this article, we use the same letter for the measure, the distribution it induces, and the corresponding density function as long as it is clear from the context. Moreover, we could extend the hierarchy to as many levels as we wish at the expense of the complexity of the inference algorithm. The desired number of levels can be determined by prior biological knowledge. In this article, we focus on a two-level hierarchy.
As a remark, we would like to point out the connection and difference between the “hierarchy” in the proposed HDP method and that in traditional HC [4]. Both the HDP and HC algorithms can provide hierarchical clustering results. The hierarchy in the HDP method is manifested by the Chinese restaurant process, which will be introduced later, where the data sitting at the same table can be viewed as the first level and all tables sharing the same dish can be viewed as the second level. In contrast, the hierarchy in HC is obtained by merging existing clusters based on their distances. However, its specific merging strategy is heuristic and is irreversible for merged clusters. A hierarchy formed in this fashion often may not reflect the true structure in the data, since various hierarchical structures can be formed by choosing different distance metrics. The HDP algorithm, on the other hand, captures the hierarchical structure at the model level. The merging is carried out automatically during the inference. Therefore, it naturally takes the hierarchy into consideration.
In summary, we have the following HDP model for the data:
where a and b are some fixed constants. We assume that F and μ_{1} form a conjugate pair. In this article, F is assumed to be the Gaussian distribution and μ_{1} is the inverse Gamma distribution.
3 Inference algorithm
It is intractable to obtain a closed-form solution to the inference problem (5). In this section, we develop a Gibbs sampling algorithm for estimating the posterior distribution in (5). At each iteration l, we draw a sample $z_{ji}^{(l)}$ sequentially from the distribution:
Under regularity conditions, the distribution of ${\left\{{z}_{\mathit{\text{ji}}}^{(l)}\right\}}_{j,i}$ will converge to the true posterior distribution in (5) [24]. The proposed Gibbs sampling algorithm is similar to the HDP inference algorithm proposed in [21], since both the Gibbs algorithms use the Chinese restaurant metaphor which we will elaborate later. However, because of the differences in modeling, we still need to provide details for the inference algorithm based on our model.
3.1 Chinese restaurant metaphor
The Chinese restaurant model [23] is a visualized characterization for interpreting the Dirichlet process. Because there is no explicit formula to describe the Dirichlet process, we will employ the Chinese restaurant model for HDP inference instead of directly computing the posterior distribution in (5). We refer to [23, 25] for the proof and other details of the equivalence between the Chinese restaurant metaphor and the Dirichlet processes.
In the Chinese restaurant metaphor for the HDP model (13), we view {z_{ j i }} as customers entering a restaurant sequentially. The restaurant has an infinite number of rows and columns of tables, which are labeled by t_{ j i }. Each z_{ j i } is associated with one and only one table in the j th row. We use ϕ(z_{ j i }) to denote the column index of the table in the j th row taken by z_{ j i }, i.e., z_{ j i } sits at table $t_{j\varphi(z_{ji})}$. If it is clear from the context, we write ϕ_{ j i } for ϕ(z_{ j i }). The index of the random variable θ_{ k } in (13) is characterized by a menu containing various dishes. Each table picks one and only one dish from the menu {m_{ k }}_{k=1,2,…}, whose dishes are drawn independently from the base measure μ_{1}. g_{ j i } is drawn independently according to the dish it chooses through the distribution F(·) as in (13). We denote by λ(t_{ j i }) the index of the dish taken by table t_{ j i }, i.e., table t_{ j i } chooses dish $m_{\lambda(t_{ji})}$. As before, we may write λ_{ j i } for λ(t_{ j i }). In summary, customer z_{ j i } sits at table $t_{j\varphi_{ji}}$ and enjoys dish $m_{\lambda_{j\varphi_{ji}}}$. The HDP is reflected in this metaphor in that the customers choose the tables as well as the dishes in a Dirichlet process fashion. The customers sitting at the same table are classified into one cluster. Moreover, the customers sitting at different tables but ordering the same dish are also clustered into the same group. Hence, clustering is performed at the cluster level, i.e., we allow “clustering among clusters”. In Figure 2, we show an illustration of the Chinese restaurant metaphor. The different patterns of shades represent different clusters.
We also introduce two useful counter variables: c_{ j i } denotes the number of customers sitting at table t_{ j i }; d_{ j k } counts the number of tables in row j serving dish m_{ k }.
Using the Chinese restaurant metaphor, instead of inferring z_{ j i }, we can directly infer ϕ_{ j i } and λ_{ j i }. The membership variable z_{ j i } is completely determined by $\lambda(t_{j\varphi(z_{ji})})$. That is, $z_{ji}=z_{j'i'}$ if and only if $\lambda(t_{j\varphi(z_{ji})})=\lambda(t_{j'\varphi(z_{j'i'})})$. As we pointed out before, the specific values of the membership variable z_{ j i } are not relevant to the clustering as long as z_{ j i } is supported on a countable set. Hence, we could simply let
$$z_{ji}=\lambda_{j\varphi_{ji}}.$$(15)
According to [25], we have the following conditional probabilities for the HDP model
where $\sum_{k}d_{jk}$ is the number of tables taken in the j th row and δ_{(·)} is the Kronecker delta function. The interpretation of (16) is that customer z_{ j i } chooses a table already taken with probability proportional to the number of customers seated there. In addition, z_{ j i } may choose a new table with probability proportional to α_{0}.
By the hierarchical assumption, the distribution of the dish chosen at an occupied table is another Dirichlet process. We have the following conditional distribution of the dishes
where $\sum_{j}d_{jk}$ counts the number of tables serving dish m_{ k }; $\sum_{j,k}d_{jk}$ counts the total number of occupied tables; K_{ j i } denotes the number of distinct dishes served before λ_{ j i } is chosen, where a dish served at multiple tables is counted only once.
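The two-level seating scheme described above can be simulated from the prior with the following minimal Python sketch. It is an illustration under stated assumptions rather than the inference algorithm itself: rows are filled one after another, a customer picks an occupied table in its row with weight equal to the table's customer count or a new table with weight α_{0}, and a new table picks an existing dish with weight equal to the number of tables (over all rows) serving it or a new dish with weight α_{1}. All names are hypothetical.

```python
import random


def chinese_restaurant_franchise(row_sizes, alpha0, alpha1, seed=0):
    """Simulate tables (phi) and dishes (lambda) for each row of the restaurant.

    row_sizes[j] is the number of customers in row j.
    Returns (rows, dish_tables) where dish_tables[k] is the number of tables,
    over all rows, that serve dish k.
    """
    rng = random.Random(seed)
    dish_tables = []
    rows = []
    for n_customers in row_sizes:
        table_counts = []   # customers per table in this row
        table_dish = []     # dish index served at each table in this row
        phi = []            # table index chosen by each customer
        for _ in range(n_customers):
            w = table_counts + [alpha0]
            t = rng.choices(range(len(w)), weights=w)[0]
            if t == len(table_counts):
                # a new table chooses its dish at the second level
                dw = dish_tables + [alpha1]
                k = rng.choices(range(len(dw)), weights=dw)[0]
                if k == len(dish_tables):
                    dish_tables.append(0)   # a dish not served before
                dish_tables[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            phi.append(t)
        # membership of each customer is the dish of its table
        z = [table_dish[t] for t in phi]
        rows.append({"phi": phi, "table_dish": table_dish,
                     "table_counts": table_counts, "z": z})
    return rows, dish_tables
```

Customers at different tables, possibly in different rows, share a cluster whenever their tables serve the same dish, which is the "clustering among clusters" effect.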
3.2 A Gibbs sampler for HDP inference
Instead of sampling the posterior probability in (5), we will sample ϕ={ϕ_{11},ϕ_{12},…} and λ={λ_{11},λ_{12},…} from the following posterior distribution
$$P(\boldsymbol{\varphi},\boldsymbol{\lambda}\mid\mathbf{g}).$$(18)
We can calculate the related conditional probabilities as follows.
If a is a value that has been taken before, the conditional probability of ϕ_{ j i }=a is given by
where θ={θ_{ j i }}_{j,i} and λ={λ_{ j i }}_{j,i}. The superscript c denotes the complement of the variables in its category, i.e., $\mathbf{g}_{ji}^{c}=\{g_{j'i'}\}_{(j',i')\ne(j,i)}$ and $\boldsymbol{\varphi}_{ji}^{c}=\{\varphi_{j'i'}\}_{(j',i')\ne(j,i)}$. $f_{\lambda_{ja}}\left(g_{ji}\mid\mathbf{g}_{ji}^{c}\right)$ denotes the conditional density of g_{ j i } given all other data generated according to dish $m_{\lambda_{ja}}$, which can be calculated as
The numerator of (20) is the joint density of the data which are generated by the same dish. By the assumption that $g_{j'i'}$ are conditionally independent given the chosen dish, we have the conditional density of the data in the product form. The denominator is the joint density excluding the specific g_{ j i } term. The integrals in (20) can be calculated either by numerical integration or by Monte Carlo integration. For example, to calculate the integral $\int_{a}^{b}f(x)p(x)\,dx$, where p(x) is a density function supported on [a,b], we can draw samples x_{1},x_{2},…,x_{ n } from p(x) and approximate the integral by $\int_{a}^{b}f(x)p(x)\,dx=E_{p(x)}[f(x)]\approx\frac{1}{n}\sum_{i=1}^{n}f(x_i)$. To calculate (20), we view μ_{1}(·) as p(·) and $F(g_{j'i'}\mid\cdot)$ as f(·).
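As a small illustration of the Monte Carlo integration step, the sketch below estimates the integral of F(g|θ)μ_{1}(θ) over θ for a single observation, using the Gaussian choices of the numerical example in Section 3.3, i.e., μ_{1}=N(0,1) and F(·|θ)=N(θ,1). The function names and the sample size are illustrative assumptions. In this Gaussian case the integral also has the closed form N(g;0,2), which serves as a check.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mc_marginal_likelihood(g, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the integral of F(g | theta) mu_1(theta) d theta.

    Assumes mu_1 = N(0, 1) plays the role of p and F(g | .) = N(., 1)
    plays the role of f, as in the numerical example of the article.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(0.0, 1.0)          # theta ~ mu_1
        total += normal_pdf(g, theta, 1.0)   # f(theta) = F(g | theta)
    return total / n_samples
```

For conjugate pairs such as this one, the closed form is preferable in practice; the Monte Carlo estimate is useful when no closed form is available.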
On the other hand, if a is a new value then we have
We also have the following conditional probabilities for λ_{ j i }. If a is used before, we have
otherwise we have
The derivations of (19), (21), (22), and (23) are given in the Appendix.
Before we present the Gibbs sampling algorithm, we recall the Metropolis–Hastings (M–H) algorithm [26] for drawing samples from a target distribution whose density function f(x) is only known up to a scaling factor, i.e., f(x)∝p(x). To draw samples from f(x), we make use of some fixed conditional distribution q(x_{2}|x_{1}) that satisfies q(x_{2}|x_{1})=q(x_{1}|x_{2}), ∀x_{1},x_{2}. The M–H algorithm proceeds as follows.

Start with an arbitrary value x_{0} with p(x_{0})>0.

For l=1,2,…
Given the previous sample x_{l−1}, draw a candidate sample x^{⋆} from q(x^{⋆}|x_{l−1}).
Calculate $\beta=\frac{p(x^{\star})}{p(x_{l-1})}$. If β≥1, accept the candidate and let x_{ l }=x^{⋆}. Otherwise, accept it with probability β, or reject it and let x_{ l }=x_{l−1} with probability 1−β.
After a “burn-in” period, say l_{0}, the samples $\{x_l\}_{l>l_0}$ follow the distribution f(x).
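The M–H steps above can be sketched in Python as follows. This is a minimal sketch for a real-valued target with a Gaussian random-walk proposal, which is one symmetric choice of q; the function name, the `step` parameter, and the proposal itself are assumptions made for illustration and are not the proposal used later for the discrete indices.

```python
import math
import random


def metropolis_hastings(unnormalized_p, x0, n_samples, burn_in, step=1.0, seed=0):
    """Draw samples from a density known only up to a scaling factor.

    unnormalized_p: function p(x) proportional to the target density f(x).
    The proposal x* = x + N(0, step^2) is symmetric, so the acceptance
    ratio reduces to beta = p(x*) / p(x_{l-1}).
    """
    rng = random.Random(seed)
    x = x0
    px = unnormalized_p(x)
    if px <= 0:
        raise ValueError("the starting point must satisfy p(x0) > 0")
    kept = []
    for l in range(burn_in + n_samples):
        candidate = x + rng.gauss(0.0, step)
        pc = unnormalized_p(candidate)
        beta = pc / px
        # accept with probability min(1, beta); otherwise keep the old sample
        if beta >= 1.0 or rng.random() < beta:
            x, px = candidate, pc
        if l >= burn_in:
            kept.append(x)   # only samples after the burn-in period are kept
    return kept
```

For instance, calling it with `lambda x: math.exp(-0.5 * x * x)` targets the standard normal distribution without ever using its normalizing constant.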
We now summarize the Gibbs sampling algorithm for the HDP inference as follows.

Initialization: randomly assign the indices ${\mathit{\varphi}}^{(0)}=\left\{{\varphi}_{11}^{(0)},{\varphi}_{12}^{(0)},\dots \right\}$ and ${\mathit{\lambda}}^{(0)}=\left\{{\lambda}_{11}^{(0)},{\lambda}_{12}^{(0)},\dots \right\}$. Note that once we have all the indices, the counters {c_{ j i }} and {d_{ j k }} are also determined.

For l=1,2,…,l_{0}+L,
Draw samples of $\left\{{\varphi}_{\mathit{\text{ji}}}^{(l)}\right\}$ from their posteriors
given by (19) and (21) using the M–H algorithm. We view the probability in (24) as the target density and choose q(·|·) to be a distribution supported on $\mathbb{N}$. For example, we can use $q(i\mid j)=\frac{j}{(j+1)^{i}}$, $i,j\in\mathbb{N}$.
Draw samples of $\left\{{\lambda}_{j{\varphi}_{\mathit{\text{ji}}}^{(l)}}^{(l)}\right\}$ from their posteriors
given by (22) and (23) using the M–H algorithm. We view the probability in (25) as the target density and use q(·|·) as specified in the previous step.
Since P(α_{0}|ϕ,λ,α_{1},g)=P(α_{0}) and P(α_{1}|ϕ,λ,α_{0},g)=P(α_{1}), simply draw samples of $\alpha_0^{(l)}$ and $\alpha_1^{(l)}$ from their prior Gamma distributions.

Use the samples after the “burn-in” period $\{\boldsymbol{\varphi}^{(l)},\boldsymbol{\lambda}^{(l)}\}_{l=l_0+1}^{l_0+L}$ to calculate $\widehat{P}(\boldsymbol{\varphi},\boldsymbol{\lambda}\mid\mathbf{g})$, which is given by
$$\widehat{P}\left(\varphi_{ji}=a,\lambda_{j\varphi_{ji}}=b\right)=\frac{\sum_{l=l_0+1}^{l_0+L}\mathbf{1}\left\{\varphi_{ji}^{(l)}=a,\lambda_{j\varphi_{ji}^{(l)}}^{(l)}=b\right\}}{L},$$(26)
where 1(·) is the indicator function. Determine the membership distribution P(z|g) from the inferred joint distribution $\widehat{P}(\boldsymbol{\varphi},\boldsymbol{\lambda}\mid\mathbf{g})$ by $P(z_{ji}=a\mid\mathbf{g})=\sum_{b}\widehat{P}(\lambda_{jb}=a\mid\mathbf{g},\varphi_{ji}=b)\widehat{P}(\varphi_{ji}=b\mid\mathbf{g})$.

Calculate the estimate of the clustering index $\widehat{z}_{\cdot i}$ for the i th gene by $\widehat{z}_{\cdot i}=\arg\max_{a}\sum_{j}P(z_{ji}=a\mid\mathbf{g})$.
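The last two steps of the algorithm can be sketched as follows, assuming the post-burn-in samples have already been converted to memberships z_{ j i } (the dish of the table chosen by each customer) and stored as nested lists, and assuming that the labels are comparable across iterations. The function name and the data layout are hypothetical.

```python
from collections import Counter


def map_cluster_per_gene(z_samples):
    """Estimate one cluster label per gene from kept Gibbs samples.

    z_samples[l][j][i] is the sampled membership z_ji at kept iteration l.
    For gene i, counts[a] accumulates, over experiments j, the number of
    kept iterations with z_ji = a. Dividing by the number of kept iterations
    would give the sum over j of the estimated P(z_ji = a | g), so the
    argmax over a is unchanged.
    """
    n_experiments = len(z_samples[0])
    n_genes = len(z_samples[0][0])
    z_hat = []
    for i in range(n_genes):
        counts = Counter()
        for sample in z_samples:
            for j in range(n_experiments):
                counts[sample[j][i]] += 1
        # ties are broken in favor of the smallest label
        z_hat.append(max(sorted(counts), key=counts.get))
    return z_hat
```

Each gene thus receives exactly one label, while the per-experiment frequencies remain available if related clusters of a gene are of interest.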
3.3 A numerical example
In this section, we provide a simple numerical example to illustrate the proposed Gibbs sampler. Let us consider the case N=M=2, i.e., there are 2 genes and 2 experiments. Assume that the expressions are g_{11}=0, g_{12}=1, g_{21}=−1, and g_{22}=2. We assume $\mu_1(\theta)\sim\mathcal{N}(0,1)$ and $F(g_{ji}\mid\theta)\sim\mathcal{N}(\theta,1)$. For initialization, we set $\varphi_{11}^{(0)}=1,\varphi_{12}^{(0)}=2,\varphi_{21}^{(0)}=3,\varphi_{22}^{(0)}=4$; $\lambda_{1\varphi_{11}^{(0)}}^{(0)}=1,\lambda_{1\varphi_{12}^{(0)}}^{(0)}=1,\lambda_{2\varphi_{21}^{(0)}}^{(0)}=2,\lambda_{2\varphi_{22}^{(0)}}^{(0)}=2$, and $\alpha_0^{(0)}=\alpha_1^{(0)}=1$.
We first show how to draw a sample from $P\left(\varphi_{11}^{(1)}\mid\boldsymbol{\varphi}_{11}^{(0)c},\boldsymbol{\lambda}^{(0)},\alpha_1^{(0)},\alpha_0^{(0)},\mathbf{g}\right)$ by the M–H algorithm. Given the initial value, assume that q(·|·) returns ϕ_{11}=3 as a candidate sample. By (19), we have $P\left(\varphi_{11}^{(1)}=1\mid\boldsymbol{\varphi}_{11}^{(0)c},\boldsymbol{\lambda}^{(0)},\alpha_1^{(0)},\alpha_0^{(0)},\mathbf{g}\right)\propto c_{11}f_{\lambda_{11}}\left(g_{11}\mid\mathbf{g}_{11}^{c}\right)$, where c_{11}=1 and λ_{11}=1. We also have
Note that the above integral can be calculated either numerically or by using the Monte Carlo integration method.
By (21) and using the specific values of the variables, we obtain
with K_{11}=1, ${\sum}_{j}{d}_{j1}=2$, ${\sum}_{\mathit{\text{jk}}}{d}_{\mathit{\text{jk}}}=4$, α_{0}=α_{1}=1. Plugging in these values, we have
Since $\beta=\frac{0.1483}{0.22971}\approx 0.6456<1$, we accept the candidate sample ϕ_{11}=3 with probability 0.6456. After the burn-in period, suppose the sample returned by the M–H algorithm is ϕ_{11}=4; we then update $\varphi_{11}^{(1)}=4$ and move on to draw samples of the remaining variables ϕ_{12}, ϕ_{21}, and ϕ_{22}.
Assume that we obtain samples of ϕ^{(1)} as $\varphi_{11}^{(1)}=4,\varphi_{12}^{(1)}=1,\varphi_{21}^{(1)}=1,\varphi_{22}^{(1)}=2$. We next draw the sample λ^{(1)}. Suppose that, given the initial value $\lambda_{1\varphi_{11}^{(1)}}=1$, q(·|·) returns $\lambda_{1\varphi_{11}^{(1)}}=3$ as a candidate sample. By (22), we obtain $P\left(\lambda_{1\varphi_{11}^{(1)}}^{(1)}=1\mid\boldsymbol{\varphi}^{(1)},\boldsymbol{\lambda}_{1\varphi_{11}^{(1)}}^{(0)c},\alpha_1^{(0)},\alpha_0^{(0)},\mathbf{g}\right)\propto\left(\sum_{j}d_{j1}\right)f_1\left(g_{11}\mid\mathbf{g}_{11}^{c}\right)$. Furthermore, we have $\sum_{j}d_{j1}=2$ and $f_1\left(g_{11}\mid\mathbf{g}_{11}^{c}\right)\approx 0.22971$ as calculated before.
By (23), we obtain $P\left(\lambda_{1\varphi_{11}}^{(1)}=3\mid\boldsymbol{\varphi}^{(1)},\boldsymbol{\lambda}_{1\varphi_{11}}^{(0)c},\alpha_1^{(0)},\alpha_0^{(0)},\mathbf{g}\right)\propto\alpha_1\int F(g_{11}\mid\theta)\mu_1(\theta)\,d\theta$. Moreover, we have α_{1}=1 and $\int F(g_{11}\mid\theta)\mu_1(\theta)\,d\theta\approx 0.28208$ as calculated before. So we have $\beta=\frac{0.28208}{2\times 0.22971}\approx 0.614<1$. After the burn-in period, assume that the M–H algorithm returns a sample $\lambda_{1\varphi_{11}^{(1)}}=2$; we then update $\lambda_{1\varphi_{11}^{(1)}}^{(1)}=2$ and move on to sample the remaining λ variables as well as α_{0} and α_{1}.
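The arithmetic of this example can be checked with a short Python sketch. It takes the value 0.22971 of the conditional density from the text as given, computes the Gaussian integral in closed form as N(g_{11};0,2), and assumes that the new-table probability in (21) weights the existing dish by $\sum_j d_{j1}/(\sum_{j,k}d_{jk}+\alpha_1)$ and a new dish by $\alpha_1/(\sum_{j,k}d_{jk}+\alpha_1)$; this assumed form reproduces the quoted value 0.1483. The function names are illustrative.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def example_acceptance_ratios(f_existing=0.22971):
    """Recompute the two M-H acceptance ratios of the numerical example.

    f_existing is the conditional density f_1(g_11 | g_11^c) quoted in the text.
    The weighting used for the new-table case is an assumption (see lead-in).
    """
    g11 = 0.0
    alpha0 = alpha1 = 1.0
    c11 = 1              # customers at the current table
    tables_dish1 = 2     # sum_j d_j1
    tables_total = 4     # sum_jk d_jk
    # integral of N(g11; theta, 1) N(theta; 0, 1) d theta = N(g11; 0, 2)
    marginal = gaussian_pdf(g11, 0.0, 2.0)

    # table index phi_11: current table versus a new table
    p_phi_old = c11 * f_existing
    p_phi_new = alpha0 * (tables_dish1 / (tables_total + alpha1) * f_existing
                          + alpha1 / (tables_total + alpha1) * marginal)
    beta_phi = p_phi_new / p_phi_old

    # dish index lambda: current dish versus a new dish
    p_lambda_old = tables_dish1 * f_existing
    p_lambda_new = alpha1 * marginal
    beta_lambda = p_lambda_new / p_lambda_old
    return marginal, p_phi_new, beta_phi, beta_lambda
```

The returned values agree with the figures 0.28208, 0.1483, 0.6456, and 0.614 quoted in the text up to rounding.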
After the burn-in period of the whole Gibbs sampler, we can calculate the posterior joint distribution P(ϕ,λ|g) from the samples and determine the clusters following the last two steps in the proposed Gibbs sampling algorithm.
4 Experimental results
The HDP clustering algorithm proposed in this article can be employed for gene expression analysis or as a segmentation algorithm for gene regulatory network inference. In this section, we first introduce two performance measures for clustering, the Rand Index (RI) [27] and the Silhouette Index (SI) [28]. We compare the HDP algorithm to the support vector machine (SVM) algorithm for network segmentation on synthetic data. We then conduct various experiments on both synthetic and real datasets including the AD400 datasets [29], the yeast galactose datasets [30], the yeast sporulation datasets [31], the human fibroblasts serum datasets [32], and the yeast cell cycle data [33]. We compare the HDP algorithm to latent Dirichlet allocation (LDA), MCLUST, SVM, K-means, Bayesian Infinite Mixture Clustering (BIMC), and HC [4, 14, 34–37] based on the performance measures and the functional relationships.
4.1 Performance measures
In order to evaluate the clustering results, we utilize two measures: the RI [27] and the SI [28]. The first index is used when a ground truth is known a priori, and the second index measures the performance without any knowledge of the ground truth.
The RI is a measure of agreement between two clustering results. It takes a value between 0 and 1; the higher the score, the stronger the agreement.
Let A denote the dataset with a total of n elements, and let X={X_{1},…,X_{S}} and Y={Y_{1},…,Y_{T}} be two clustering results of A, i.e., $A=\bigcup _{i=1}^{S}{X}_{i}=\bigcup _{j=1}^{T}{Y}_{j}$ with ${X}_{i}\cap {X}_{j}=\varnothing $ and ${Y}_{i}\cap {Y}_{j}=\varnothing $ for i≠j. For any pair of elements (a,b) in A, we say that they are in the same set under a clustering result if a and b are in the same cluster; otherwise we say that they are in different sets. Note that there are $\binom{n}{2}$ pairs of elements in total. We define the following four counts: Z_{1} denotes the number of pairs that are in the same set in both X and Y; Z_{2} denotes the number of pairs that are in different sets in both X and Y; Z_{3} denotes the number of pairs that are in the same set in X and in different sets in Y; and Z_{4} denotes the number of pairs that are in different sets in X and in the same set in Y. The RI is then given by
$$\mathrm{RI}=\frac{{Z}_{1}+{Z}_{2}}{{Z}_{1}+{Z}_{2}+{Z}_{3}+{Z}_{4}}=\frac{{Z}_{1}+{Z}_{2}}{\binom{n}{2}}.$$
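A minimal Python sketch of this pair-counting procedure follows. It assumes the standard definition of the RI as the fraction of agreeing pairs, (Z_{1}+Z_{2}) divided by the total number of pairs, and it represents each clustering as a list of cluster labels, one per element. The function name `rand_index` and the toy labels are illustrative only.

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """Rand Index between two clusterings given as label sequences."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both clusterings must label the same elements")
    z1 = z2 = z3 = z4 = 0
    for a, b in combinations(range(len(labels_x)), 2):
        same_x = labels_x[a] == labels_x[b]
        same_y = labels_y[a] == labels_y[b]
        if same_x and same_y:
            z1 += 1          # same set in X and in Y
        elif not same_x and not same_y:
            z2 += 1          # different sets in X and in Y
        elif same_x:
            z3 += 1          # same in X, different in Y
        else:
            z4 += 1          # different in X, same in Y
    total = z1 + z2 + z3 + z4
    if total == 0:
        raise ValueError("need at least two elements")
    return (z1 + z2) / total

# Toy example with n = 6 elements (15 pairs).
x = [0, 0, 1, 1, 2, 2]
y = [0, 0, 1, 2, 2, 2]
print(rand_index(x, y))  # 0.8
```

In the toy example, Z_{1}=2, Z_{2}=10, Z_{3}=1, and Z_{4}=2, so 12 of the 15 pairs agree. The loop is quadratic in n, which is adequate for datasets of a few thousand genes.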
Due to the lack of a ground truth in most real applications, we utilize the SI to evaluate the clustering performance. The SI is the average silhouette width over all data points, which reflects the compactness of the clustering. Let x denote the average distance between a point p in a cluster and all other points within that cluster. Let y be the minimum, over all other clusters, of the average distance between p and the points of that cluster. The Silhouette distance for p is defined as
$$s(p)=\frac{y-x}{\max (x,y)}.$$
The SI is the average Silhouette distance among all data points. The value of the SI lies in [−1,1], and a higher score indicates better performance.
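The SI can be computed directly from these definitions. The sketch below assumes the standard silhouette formula s(p) = (y − x)/max(x, y), Euclidean distance, and the common convention that a point in a singleton cluster scores 0. None of these choices is stated in the article, and the name `silhouette_index` is illustrative.

```python
import math

def silhouette_index(points, labels):
    """Average silhouette distance over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        if len(clusters[lab]) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        x = mean_dist(i, clusters[lab])
        y = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(x, y)
        scores.append((y - x) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated toy clusters give a score close to 1.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
si = silhouette_index(pts, [0, 0, 1, 1])
print(si)
```

Swapping the labels so that each cluster mixes points from both sides drives the score negative, which is the behaviour the index is meant to expose.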
4.2 Network segmentation on synthetic data
In regulatory network inference, due to the large size of the network, it is often useful to perform a network segmentation. The segmented subnetworks usually have far fewer nodes than the original network, leading to faster and more accurate analysis of the original network [38]. Clustering algorithms can be employed for this purpose. However, traditional clustering algorithms often provide segmentation results that are either too fine or too coarse, i.e., the resulting subnetworks contain either too few or too many genes. In addition, the hierarchical structure of the network cannot be discovered by those algorithms. Thanks to its hierarchical model assumption, the HDP algorithm can provide better segmentation results. We demonstrate the segmentation application of HDP on a synthetic network and compare it to the SVM algorithm, which is widely used for clustering and segmentation.
The network under consideration is shown in Figure 3. We assume that the distributions for all nodes are Gaussian. The directed links indicate that the parent nodes are the priors of the child nodes. Disconnected nodes are mutually independent. We generate the data in the following way. Nodes 1, 2, and 8 are generated independently by Gaussian distributions of unit variance with means 1, 2, and 3, respectively. Nodes 3, 4, 5, 6, 9, and 10 are generated independently by unit-variance Gaussian distributions with means determined by their respective parent nodes. Node 7 is generated by a Gaussian distribution with mean determined by node 4 and variance determined by the absolute value of node 5. The network contains two isolated segments, one containing nodes 1–7 and the other containing nodes 8–10. The HDP algorithm is applied to this network and segments it into three clusters: nodes 2, 4, and 6 form one cluster; nodes 1, 3, 5, and 7 form another; and nodes 8, 9, and 10 form the third. The SVM algorithm, on the other hand, produces two clusters, one containing nodes 1–7 and the other containing nodes 8–10. The left segment of the network, i.e., nodes 1–7, contains two hierarchies. The SVM fails to recognize these hierarchies and provides a result coarser than that given by the HDP algorithm.
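The data-generating procedure for this synthetic network can be sketched as follows. The edge set is in Figure 3 and is not reproduced in the text, so the `PARENT` map below is a hypothetical choice that is merely consistent with the description. The sketch also assumes that "mean determined by the parent" means that the child's mean equals the parent's sampled value. Only the root means and the rule for node 7 are taken from the text.

```python
import random

# From the text: roots 1, 2, 8 have unit variance and means 1, 2, 3.
ROOT_MEANS = {1: 1.0, 2: 2.0, 8: 3.0}

# HYPOTHETICAL single-parent edges (child -> parent); the actual edges
# are those of Figure 3, which is not available here.
PARENT = {3: 1, 5: 3, 4: 2, 6: 4, 9: 8, 10: 8}

def sample_network(rng):
    """Draw one observation of all 10 nodes; returns values for nodes 1..10."""
    values = {}
    for node, mean in ROOT_MEANS.items():
        values[node] = rng.gauss(mean, 1.0)
    # Visit children in an order where every parent is already sampled.
    for node in (3, 4, 5, 6, 9, 10):
        values[node] = rng.gauss(values[PARENT[node]], 1.0)
    # Node 7: mean from node 4, variance |node 5|, so std = sqrt(|node 5|).
    values[7] = rng.gauss(values[4], abs(values[5]) ** 0.5)
    return [values[n] for n in range(1, 11)]

rng = random.Random(0)
data = [sample_network(rng) for _ in range(200)]
print(len(data), len(data[0]))  # 200 10
```

Each row of `data` is one joint sample of the ten nodes; a clustering algorithm would then be run on the ten columns.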
4.3 AD400 data
The AD400 is a synthetic dataset proposed in [29], which is used to evaluate the performance of clustering algorithms. The dataset consists of 400 genes with 10 time points. As the ground truth, the AD400 dataset has 10 clusters, each containing 40 genes.
For randomized algorithms such as LDA, BIMC, and HDP, we average the results over 20 runs of the algorithms. We compare the HDP algorithm to other widely used algorithms such as LDA, SVM, MCLUST, K-means, BIMC, and HC. The results are presented in Table 1. As we can see, the HDP algorithm performs similarly to the MCLUST algorithm and generally performs better than the other widely used algorithms.
4.4 Yeast galactose data
We conduct an experiment on the yeast galactose data, which consist of 205 genes. The true number of clusters based on the functional categories is 4 [39]. We calculate the RI between each clustering result and the result in [39], which is regarded as the standard benchmark. The LDA model is a generative probabilistic model for document classification [34], which also uses the Dirichlet distribution as a prior. We adapt the LDA model to the yeast galactose data for comparison with the proposed HDP algorithm. Since the LDA and HDP methods are randomized algorithms, we run each algorithm 20 times and use the average as the final score. In Figure 4, we illustrate the performance of each run of the HDP method. The performances of the algorithms under consideration are listed in Table 2.
It is seen that the HDP algorithm performs the best among the algorithms compared. Unlike the MCLUST and LDA algorithms, which produce more than 4 clusters, the average number of clusters given by the HDP algorithm is very close to the “true” value 4. Compared to the SVM method, the HDP algorithm produces a result that is more similar to the “ground truth”, i.e., it has the highest RI value.
4.5 Yeast sporulation data
The yeast sporulation dataset consists of 6,118 genes with 7 time points, which were obtained during the sporulation process [31]. We preprocessed the dataset by applying a logarithmic transform and removing the genes whose expression levels did not change significantly. After preprocessing, 513 genes remain. In Table 3, we compare the HDP clustering result to those of LDA, MCLUST, K-means, BIMC, and HC. For randomized algorithms such as LDA, BIMC, and HDP, we average the scores over 20 runs of the algorithm.
From Table 3, we can see that the HDP has the highest SI score. This suggests that the clusters provided by HDP are more compact and better separated than those from the other algorithms. The K-means and HC algorithms suggest a higher number of clusters. However, their SI scores indicate that their clusters are not as tight as those of the other algorithms.
4.6 Human fibroblasts serum data
The human fibroblasts serum data consist of 8,613 genes with 12 time points [32]. Again, a logarithmic transform has been applied to the data, and genes without significant changes have been removed. The remaining dataset has 532 genes.
In Table 4, we show the performance of the HDP algorithm and the other algorithms. It is seen that the clustering results of the HDP algorithm are the most compact among those algorithms. The LDA algorithm suggests 9.4 clusters on average with the lowest SI score, which indicates that some of its clusters can be further tightened. HC provides a result consisting of five clusters. However, the SI score of the HC result is not the highest, which suggests that its clusters may not be well formed.
4.7 Yeast cell cycle data
We next apply the proposed HDP clustering algorithm to the yeast (Saccharomyces cerevisiae) cell cycle dataset [2, 40]. The data are obtained by synchronizing cells and collecting the mRNAs at 10-min intervals over the course of two cell cycles. The dataset has been widely used for testing the performance of clustering algorithms [2, 14, 41]. The expression data have been log-transformed and lie in the interval [−2,2]. We preprocessed the data to remove the genes whose expression did not change significantly over time. We also removed the genes whose means are below a small threshold. After the preprocessing, 1,515 genes remain. We then apply the HDP algorithm and obtain 10 clusters in total. The plots of the clusters are shown in Figure 5.
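The filtering steps described here can be sketched in a few lines. The article does not give the thresholds, the log base, or the exact measure of "significant change", so everything in the sketch is an assumption: a base-2 logarithm, the standard deviation of the log profile as the measure of change, the mean of the raw levels for the low-expression filter, and the hypothetical cut-offs `min_std` and `min_mean`. The gene names are placeholders.

```python
import math
import statistics

def preprocess(profiles, min_std=0.3, min_mean=1.0):
    """Filter and log-transform expression profiles.

    profiles: dict mapping gene name -> list of positive raw levels.
    min_std, min_mean: HYPOTHETICAL thresholds, not taken from the article.
    Returns a dict mapping each kept gene to its log2 profile.
    """
    kept = {}
    for gene, raw in profiles.items():
        if statistics.fmean(raw) < min_mean:
            continue  # mean expression below the small threshold
        logged = [math.log2(v) for v in raw]
        if statistics.pstdev(logged) < min_std:
            continue  # profile does not change enough over time
        kept[gene] = logged
    return kept

toy = {
    "geneA": [1.0, 2.0, 4.0, 8.0],   # varies over time, kept
    "geneB": [4.0, 4.0, 4.0, 4.0],   # flat profile, removed
    "geneC": [0.1, 0.2, 0.4, 0.8],   # low mean expression, removed
}
kept = preprocess(toy)
print(sorted(kept))  # ['geneA']
```

On real data the two thresholds would have to be tuned so that the number of retained genes matches the reported counts.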
We resort to the MIPS database [42] to determine the functional categories for each cluster. The inferred functional category of a cluster is the category shared by the majority of its member elements. After applying the cell-cycle selection criterion in [2], we find that there are 126 genes identified by the proposed HDP algorithm but not discovered in [2]. We list in Table 5 the numbers of newly discovered genes in various functional categories. We also observe that some of the newly discovered unclassified genes belong to clusters with classified categories. Given the hierarchical characteristic of the HDP algorithm, this may suggest multiple descriptions of those genes that might have been overlooked before.
Note that in [14] a Bayesian model with an infinite number of clusters is proposed based on the Dirichlet process. The model in [14] is a special case of the HDP model proposed in this article when there is only one hierarchy. In terms of discovering new gene functionalities, we find that the performances of the two algorithms are similar, as the method in [14] discovered 106 new genes compared to the result in [2]. However, by taking the hierarchical structure into account, the total number of clusters found by the HDP algorithm is significantly smaller than the 43 clusters given in [14]. The SI scores for BIMC and HDP are 0.321 and 0.392, respectively. The HDP clustering consolidates many fragmental clusters, which may provide an easier way to interpret the clustering results.
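The rule for assigning a functional category to a cluster can be illustrated with a short sketch. It assumes that each gene carries at most one category label, with `None` marking unclassified genes, and it uses the most frequent label among the classified members (a plurality rule, which coincides with a majority whenever one label covers more than half of them). The function name and the category strings are placeholders and are not taken from the MIPS catalogue.

```python
from collections import Counter

def infer_category(member_categories):
    """Most frequent category among the classified members of a cluster.

    Unclassified members are given as None and ignored; returns None
    when no member of the cluster is classified.
    """
    counts = Counter(c for c in member_categories if c is not None)
    if not counts:
        return None
    category, _ = counts.most_common(1)[0]
    return category

cluster = ["cell cycle", "cell cycle", "metabolism", None]
print(infer_category(cluster))  # cell cycle
```

Under this rule, an unclassified gene such as the last member of `cluster` inherits a candidate annotation from the cluster it falls into.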
In Table 6, we list the new genes discovered by the HDP algorithm which are not found in [2].
5 Conclusions
In this article, we have proposed a new clustering approach based on the HDP. The HDP clustering explicitly models the hierarchical structure that is prevalent in biological data such as gene expressions. We have developed a statistical inference algorithm for the proposed HDP model based on the Chinese restaurant metaphor and the Gibbs sampler. We have applied the proposed HDP clustering algorithm to both regulatory network segmentation and gene expression clustering. By incorporating the hierarchical knowledge into the model, the HDP algorithm is shown to reveal more structural information about the data than popular algorithms such as SVM and MCLUST.
Appendix
Derivation of formulas (19) and (21)
By (16), if a has appeared before, we have
Otherwise we have
If a has appeared before, by the assumption that the data are conditionally independent, we also have
where ${f}_{{\lambda}_{ja}}({g}_{ji} \mid {\mathbf{g}}_{ji}^{c})$ can be calculated by Bayes’ formula:
Combining (35) and (37), we have (19).
If a has not appeared before, by (17), we have
Combining (36) and (39), we have (21).
Derivation of (22) and (23)
By (17), if a has appeared before, we have
Otherwise we have
If a is used before, we have
Otherwise, the customer chooses a new table. The data are generated from F based on a sample from μ_{1}. We have
Combining (43), (44), (45), and (46), we have (22) and (23).
References
1. Schena M, Shalon D, Davis R, Brown P: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467–470. 10.1126/science.270.5235.467
2. Cho R, Campbell M, Winzeler E, Steinmetz L, Conway A, Wodicka L, Wolfsberg T, Gabrielian A, Landsman D, Lockhart D: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 1998, 2: 65–73. 10.1016/S1097-2765(00)80114-8
3. Hughes J, Estep P, Tavazoie S, Church G: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol 2000, 296(5):1205–1214. 10.1006/jmbi.2000.3519
4. Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci 1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863
5. MacQueen J: Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. California: University of California Press; 1967:281–297.
6. Kohonen T: Self-Organization and Associative Memory. New York: Springer; 1988.
7. Jiang D, Tang C, Zhang A: Cluster analysis for gene expression data: a survey. IEEE Trans. Knowledge Data Eng 2004, 16(11):1370–1386. 10.1109/TKDE.2004.68
8. Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodological) 1977, 39: 1–38.
9. McLachlan G, Peel D: Finite Mixture Models. New York: Wiley-Interscience; 2000.
10. Fraley C, Raftery A: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc 2002, 97(458):611–631. 10.1198/016214502760047131
11. Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W: Model-based clustering and data transformations for gene expression data. Bioinformatics 2001, 17(10):977–987. 10.1093/bioinformatics/17.10.977
12. Schwarz G: Estimating the dimension of a model. Ann. Stat 1978, 6(2):461–464. 10.1214/aos/1176344136
13. Akaike H: A new look at the statistical model identification. IEEE Trans Autom. Control 1974, 19(6):716–723. 10.1109/TAC.1974.1100705
14. Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 2002, 18(9):1194–1206. 10.1093/bioinformatics/18.9.1194
15. Ferguson T: A Bayesian analysis of some nonparametric problems. Ann. Stat 1973, 1(2):209–230. 10.1214/aos/1176342360
16. Neal R: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat 2000, 9(2):249–265.
17. Pitman J: Some developments of the Blackwell-MacQueen urn scheme. Lecture Notes-Monograph Series 1996, 245–267.
18. Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Online Library; 1990.
19. Jiang D, Pei J, Zhang A: DHC: a density-based hierarchical clustering method for time series gene expression data. In Proceedings of Third IEEE Symposium on Bioinformatics and Bioengineering. Bethesda: IEEE; 2003:393–400.
20. Piatigorsky J: Gene Sharing and Evolution: The Diversity of Protein Functions. Cambridge: Harvard University Press; 2007.
21. Teh Y, Jordan M, Beal M, Blei D: Hierarchical Dirichlet processes. J. Am. Stat. Assoc 2006, 101(476):1566–1581. 10.1198/016214506000000302
22. Sethuraman J: A constructive definition of Dirichlet priors. Stat. Sinica 1991, 4: 639–650.
23. Aldous D: Exchangeability and related topics. École d’Été de Probabilités de Saint-Flour XIII 1985, 1–198.
24. Casella G, George E: Explaining the Gibbs sampler. Am. Stat 1992, 46(3):167–174.
25. Blackwell D, MacQueen J: Ferguson distributions via Pólya urn schemes. Ann. Stat 1973, 1(2):353–355. 10.1214/aos/1176342372
26. Brooks S: Markov chain Monte Carlo method and its application. J. R. Stat. Soc. Ser. D (The Statistician) 1998, 47: 69–100. 10.1111/1467-9884.00117
27. Hubert L, Arabie P: Comparing partitions. J. Classif 1985, 2: 193–218. 10.1007/BF01908075
28. Rousseeuw PJ: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math 1987, 20: 53–65.
29. Yeung KY, Ruzzo WL: Principal component analysis for clustering gene expression data. Bioinformatics 2001, 17(9):763–774. 10.1093/bioinformatics/17.9.763
30. Yeung K, Medvedovic M, Bumgarner R: Clustering gene-expression data with repeated measurements. Genome Biol 2003, 4(5):R34. 10.1186/gb-2003-4-5-r34
31. Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The transcriptional program of sporulation in budding yeast. Science 1998, 282(5389):699–705.
32. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson J, Boguski MS: The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283(5398):83–87. 10.1126/science.283.5398.83
33. Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998, 9(12):3273.
34. Blei D, Ng A, Jordan M: Latent Dirichlet allocation. J. Mach. Learn. Res 2003, 3: 993–1022.
35. Fraley C, Raftery A: MCLUST: software for model-based cluster analysis. J. Classif 1999, 16(2):297–306. 10.1007/s003579900058
36. Furey T, Cristianini N, Duffy N, Bednarski D, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16(10):906–914. 10.1093/bioinformatics/16.10.906
37. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat. Genetics 1999, 22: 281–285. 10.1038/10343
38. Chung F, Lu L: Complex Graphs and Networks. CBMS Lecture Series no. 107. Providence: American Mathematical Society; 2006.
39. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J: Gene ontology: tool for the unification of biology. Nat. Genet 2000, 25: 25–29. 10.1038/75556
40. Stanford University: Yeast cell cycle datasets. http://genome-www.stanford.edu/cellcycle/data/rawdata
41. Lukashin A, Fuchs R: Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001, 17(5):405–414. 10.1093/bioinformatics/17.5.405
42. Mewes H, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2002, 30: 31–34. 10.1093/nar/30.1.31
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Wang, L., Wang, X. Hierarchical Dirichlet process model for gene expression clustering. J Bioinform Sys Biology 2013, 5 (2013). https://doi.org/10.1186/1687-4153-2013-5
Keywords
 Support Vector Machine
 Latent Dirichlet Allocation
 Dirichlet Process
 Inference Algorithm
 Rand Index