Background

RNA sequencing (RNA-seq) is used to determine the transcriptional dynamics of a biological system by measuring the expression levels of thousands of genes simultaneously [1, 2]. This technique provides counts of reads that can be mapped back to a biological entity, such as a gene or an exon, and these counts serve as a measure of expression under the experimental conditions. Analyzing RNA-seq data is challenging for several reasons, including the nature of the data, which is characterized by high dimensionality, skewness, and a dynamic range that can span from zero to over a million counts. Further, multivariate count data from RNA-seq are generally overdispersed. Upon obtaining raw read counts from an RNA-seq study, a typical bioinformatics analysis pipeline involves trimming, mapping, summarization, normalization, and downstream analysis [3]. Cluster analysis is often performed as part of downstream analysis to identify groups of observations with similar patterns.

Clustering algorithms can be classified into two broad categories: distance-based and model-based approaches [4]. Distance-based approaches utilize a distance function between pairs of data objects and group similar objects together into clusters; examples include hierarchical and partitional clustering [4]. Model-based approaches cluster data objects using a mixture-modeling framework [48]. Compared to distance-based approaches, model-based approaches offer better interpretability because the fitted model for each cluster directly characterizes that cluster [4]. In model-based approaches, the conditional probability that each data object belongs to a given cluster is calculated.

The probability distribution function of a mixture model is \(f(\boldsymbol {y}| \pi _{1}, \ldots, \pi _{G},\boldsymbol {\vartheta }_{1}, \ldots, \boldsymbol {\vartheta }_{G}) = \sum \nolimits _{g=1}^{G} \pi _{g} f_{g}(\boldsymbol {y} | \boldsymbol {\vartheta }_{g})\), where G is the total number of components, fg(·) is the distribution function of the gth component with parameters 𝜗g, and πg>0 is the mixing weight of the gth component such that \(\sum \nolimits _{g=1}^{G} \pi _{g}=1\). An indicator variable zig is used for cluster membership, such that zig equals 1 if the ith observation belongs to component g and 0 otherwise. The predicted cluster memberships at the maximum likelihood estimates of the model parameters are given by the maximum a posteriori probability, \(\text {MAP}(\hat {z}_{ig})\). Here, \(\text {MAP}(\hat {z}_{ig}) = 1\) if \(\arg \max _{h}\{\hat {z}_{ih}\}=g\) and \(\text {MAP}(\hat {z}_{ig}) = 0\) otherwise. Parameter estimation is typically carried out using maximum likelihood algorithms, such as the expectation-maximization (EM) algorithm [9]. The model is fitted for a range of possible numbers of components, and the optimal number is selected using a model selection criterion. Typically, one component represents one cluster [8].
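As a small illustration of the MAP rule, the following R sketch assigns each observation to the component with the largest estimated posterior probability; the matrix `z_hat` of posterior probabilities is an illustrative input.

```r
# Minimal sketch of the MAP rule: given an n x G matrix `z_hat` of estimated
# posterior probabilities (rows sum to one), assign each observation to the
# component with the largest posterior probability.
map_classification <- function(z_hat) {
  labels <- apply(z_hat, 1, which.max)           # arg max_h z_hat[i, h]
  map <- matrix(0, nrow = nrow(z_hat), ncol = ncol(z_hat))
  map[cbind(seq_len(nrow(z_hat)), labels)] <- 1  # MAP(z_hat[i, g]) = 1 if g is the arg max
  list(labels = labels, map = map)
}

# Toy example with three observations and two components
z_hat <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.6, 0.4), ncol = 2, byrow = TRUE)
map_classification(z_hat)$labels  # 1 2 1
```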

Clustering of gene expression data allows the identification of groups of genes with similar expression patterns, often referred to as gene co-expression networks. Inference of gene networks from expression data can lead to a better understanding of the biological pathways that are active under the experimental conditions. This information can also be used to infer the biological function of genes with unknown or hypothetical functions, based on their cluster membership with genes of known functions and pathways [10]. Over the past few years, a number of mixture model-based clustering approaches for gene expression data from RNA-seq studies have emerged based on the univariate Poisson and negative binomial (NB) distributions [11–13]. Although these distributions seem a natural fit for count data, they have limitations when applied in the context of RNA-seq, as outlined in the following paragraph.

The Poisson distribution is used to model discrete data, including expression data from RNA-seq studies. However, the multivariate extension of the Poisson distribution can be computationally expensive. As a result, the univariate Poisson distribution is often utilized in clustering algorithms, which leads to the assumption that samples are conditionally independent given the components [11, 12, 14]. This assumption is unlikely to hold in real situations. Further, the mean and variance coincide in the Poisson distribution. As a result, the Poisson distribution may provide a good fit to RNA-seq studies with a single biological replicate across technical replicates [15]. However, current RNA-seq studies often utilize more than one biological replicate in order to estimate the biological variation between treatment groups. In such studies, RNA-seq data exhibit more variability than expected under the Poisson model (called “overdispersion”), and the Poisson distribution may not provide a good fit for the data [15, 16]. Due to the smaller variation predicted by the Poisson distribution, the type-I error rate can be underestimated [16]. The use of the NB distribution may alleviate some of these issues, as its mean and variance differ. However, the NB distribution can fail to provide a good fit to heavy-tailed data such as RNA-seq counts [17].

The multivariate Poisson-log normal (MPLN) distribution [18] is a multivariate log normal mixture of independent Poisson distributions. It is a two-layer hierarchical model, where the observed layer is a multivariate Poisson distribution and the hidden layer is a multivariate Gaussian distribution [18, 19]. The MPLN distribution is suitable for analyzing multivariate count measurements and offers many advantages over other discrete distributions [20, 21]. Importantly, the hidden layer of the MPLN distribution is a multivariate Gaussian distribution, which allows for the specification of a covariance structure. As a result, independence no longer needs to be assumed between variables. The MPLN distribution can also account for overdispersion in count data and supports negative and positive correlations, unlike other multivariate discrete distributions such as multinomial or negative multinomial [22].
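The two-layer structure can be illustrated with a short simulation sketch: a hidden multivariate Gaussian layer is drawn first, and conditionally independent Poisson counts are then generated from it. The parameter values below are arbitrary and library sizes are omitted.

```r
# Sketch of the MPLN hierarchy: a hidden multivariate Gaussian layer followed
# by conditionally independent Poisson counts (library sizes omitted here).
library(MASS)  # for mvrnorm

set.seed(1)
d     <- 6                       # number of samples
mu    <- rep(2, d)               # mean of the hidden Gaussian layer
Sigma <- 0.5 * diag(d) + 0.2     # covariance allowing correlation between samples
n     <- 5                       # number of observations to simulate

theta <- mvrnorm(n, mu = mu, Sigma = Sigma)                   # hidden layer: N_d(mu, Sigma)
Y     <- matrix(rpois(n * d, lambda = exp(theta)), nrow = n)  # observed counts
Y
```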

Here, a novel mixture model-based clustering method is presented for RNA-seq using MPLN distributions. The proposed clustering technique is explored in the context of clustering genes. The performance of the method is evaluated through data-driven simulations and real data.

Results

Transcriptome data analysis

To illustrate the applicability of mixtures of MPLN distributions, the method was applied to an RNA-seq dataset. For comparison purposes, three other model-based clustering methods were also used: HTSCluster [11, 14], Poisson.glm.mix [12] and MBCluster.Seq [13]. Poisson.glm.mix offers three different parameterizations of the Poisson mean, termed m = 1, m = 2, and m = 3. MBCluster.Seq offers clustering via mixtures of Poisson distributions, termed MBCluster.Seq, Poisson, and via mixtures of NB distributions, termed MBCluster.Seq, NB.

Typically, only a subset of differentially expressed genes is used for cluster analysis. For all methods, normalization factors representing library size estimates for the samples were obtained using the trimmed mean of M-values (TMM) method [23, 24], via the calcNormFactors function of the edgeR package. Initialization was done via k-means for HTSCluster and MBCluster.Seq. An option to specify the normalization or initialization method was not available for Poisson.glm.mix, so default settings were used. Note that G=1 cannot be run for MBCluster.Seq.
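A brief sketch of this normalization step is shown below; the `counts` matrix is a toy genes-by-samples matrix used only for illustration.

```r
# Sketch of obtaining TMM normalization factors with edgeR.
library(edgeR)

counts <- matrix(rpois(600, lambda = 10), nrow = 100, ncol = 6)  # toy genes x samples matrix

dge          <- DGEList(counts = counts)
dge          <- calcNormFactors(dge, method = "TMM")
norm_factors <- dge$samples$norm.factors                   # per-sample TMM factors
eff_lib_size <- dge$samples$lib.size * norm_factors        # effective library sizes
```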

In the context of real data clustering, it is not possible to compare the clustering results obtained from each method to a ‘true’ clustering of the data, as such a classification does not exist. To identify whether co-expressed genes are implicated in similar biological processes, functions or components, an enrichment analysis was performed on the gene clusters using the Singular Enrichment Analysis tool available on AgriGO [25]. Given a list of gene identifiers, the Singular Enrichment Analysis tool identifies enriched gene ontology (GO) terms by comparing the query list to a background population or reference from which it was derived [25]. A significance level of 5% was used with the Fisher statistical test and the Yekutieli multi-test adjustment. GO defines three distinct ontologies: biological process, molecular function, and cellular component.

Transcriptome data analysis: cranberry bean RNA-seq data

In the study by Freixas-Coutin et al. [26], RNA-seq was used to monitor transcriptional dynamics in the seed coats of darkening (D) and non-darkening (ND) cranberry beans (Phaseolus vulgaris L.) at three developmental stages: early (E), intermediate (I) and mature (M). A summary of this dataset is provided in Table 1. The aim of their study was to evaluate if the changes in the seed coat transcriptome were associated with proanthocyanidin levels as a function of seed development in cranberry beans. For each developmental stage, 3 biological replicates were considered for a total of 18 samples. The RNA-seq data are available on the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under the BioProject PRJNA380220. The study identified 1336 differentially expressed genes, which were used for the cluster analysis.

Table 1 Summary of the cranberry bean RNA-seq dataset used for cluster analysis

The raw read counts for genes were obtained from Binary Alignment/Map files using samtools [27] and HTSeq [28]. For each developmental stage, the median value of the 3 replicates was used. In the first run, T1, the data were clustered for a range of G=1,…,11 using k-means initialization with 3 runs. (Note that G=1 cannot be run for MBCluster.Seq.) Since model selection criteria selected G=2 or G=11 for HTSCluster, Poisson.glm.mix, and MBCluster.Seq, further clustering runs were performed for these methods using the ranges T2: G=1,…,20; T3: G=1,…,30; T4: G=1,…,40; T5: G=1,…,50 and T6: G=1,…,100. The clustering results are summarized in Table 2. Note that more than 10 models need to be considered to apply the slope heuristics, dimension jump (Djump) and data-driven slope estimation (DDSE); because G=1 cannot be run for MBCluster.Seq, the slope heuristics could not be applied for T1.

Table 2 Number of clusters selected using different model selection criteria for the cranberry bean RNA-seq dataset for T1 to T6

For the mixtures of MPLN distributions, all information criteria selected a model with G=4, with the exception of the AIC, which selected a G=5 model in T1. Recall that the AIC is known to favor more complex models with more parameters. A cross-tabulation of the G=4 model against the G=5 model did not reveal any clear patterns; rather, the classifications appeared random. For the G=4 model, the clusters contained 71, 731, 415 and 119 genes, respectively, and their expression patterns are provided in Fig. 1. For MBCluster.Seq, NB, a model with G=2 was selected. This is the lowest cluster size considered in the range of clusters for this method, as G=1 cannot be run for MBCluster.Seq. For the G=2 model, Cluster 1 contained 467 genes and Cluster 2 contained 869 genes (expression patterns provided in Additional file 1: Figure S1). A comparison of this model with the G=4 model from mixtures of MPLN distributions did not reveal any clear patterns. For all other methods in T1, the information criteria selected G=11.

Fig. 1 The expression patterns for the G=4 model for the cranberry bean RNA-seq dataset clustered using mixtures of MPLN distributions. The expression represents the log-transformed counts. The yellow line represents the mean expression level for each cluster

During T2, a model with G=14 was selected for MBCluster.Seq, Poisson by the BIC and ICL (expression patterns provided in Additional file 1: Figure S2). A comparison of this model with the G=4 model from mixtures of MPLN distributions did not reveal any clear patterns. With further runs (T3,…,T6), it was evident that the highest cluster size was selected for HTSCluster and Poisson.glm.mix. No changes were observed for MBCluster.Seq, NB, as the lowest cluster size, G=2, was selected. All information criteria (BIC, ICL, AIC, AIC3) gave similar results, suggesting a high degree of certainty in the assignment of genes to clusters, i.e., that the posterior probabilities \(\hat {z}_{ig}\) are generally close to zero or one. The results from the slope heuristics (Djump and DDSE) varied considerably across T1,…,T6. For this reason, the methods that over- and underfit were run for G=1,…,100, as in T6, 20 different times. Results for both the information criteria and the slope heuristics are provided in Table 3. The results from the slope heuristics varied considerably across the 20 clustering runs, as evidenced by the large range in the number of clusters selected.

Table 3 Range of clusters selected using different model selection criteria for the cranberry bean RNA-seq dataset for T6, repeated 20 times

Due to the model selection issues with over- and underfitting, downstream analysis was only conducted using the G=4 model of mixtures of MPLN distributions, the G=14 model of MBCluster.Seq, Poisson, and the G=2 model of MBCluster.Seq, NB. The GO enrichment analysis results for all models are provided in Additional file 2. Only 1 of 2, 3 of 4, and 5 of 14 clusters contained enriched GO terms in the G=2, G=4, and G=14 models, respectively. Among the models, the clearest expression patterns were evident for the G=14 model, which can be attributed to the larger number of clusters in this model. However, only 5 of the 14 clusters exhibited significant GO terms.

Further analysis was only conducted on the G=4 model of the mixtures of MPLN distributions, because comparing the cluster composition of genes across different methods, with respect to biological context, is beyond the scope of this article. For the G=4 model, Cluster 1 genes were highly expressed in the intermediate developmental stage, compared to the other developmental stages, regardless of the variety (see Fig. 1). The GO enrichment analysis identified genes belonging to pathogenesis, multi-organism process and nutrient reservoir activity (see Additional file 2). For Cluster 2, no GO terms exhibited enrichment, and the expression of these genes might be better represented by two or more distinct clusters.

Cluster 3 genes showed higher expression in the early developmental stage, compared to the other developmental stages, regardless of the variety. Here, genes belonged to oxidoreductase activity, enzyme activity, binding and dehydrogenase activity. Finally, Cluster 4 genes were more highly expressed in the darkening variety relative to the non-darkening variety. The GO enrichment analysis identified Cluster 4 genes as containing biosynthetic genes. Further examination showed that many of these genes are annotated as flavonoid/proanthocyanidin biosynthesis genes in the P. vulgaris genome. Polyphenols, such as proanthocyanidins, are synthesized by the phenylpropanoid pathway and are found in seed coats (Reinprecht et al. 2013). Proanthocyanidins have been shown to convert from colorless to visible pigments during oxidation [29]. Beans with regular darkening of the seed coat color are known to have higher levels of polyphenols compared to beans with slow darkening [29, 30].

Simulation data analysis: mixtures of MPLN distributions

To mimic real data, the library sizes and count ranges of the simulated datasets were constrained to lie within the same 5–95% ranges as those observed for the real data. For the simulation study, three different settings were considered. In simulations 1 and 2, 50 datasets with one underlying cluster and 50 datasets with two underlying clusters were generated, respectively. In simulation 3, 30 datasets with three underlying clusters were generated. All datasets had n=1000 observations and d=6 samples, generated using mixtures of MPLN distributions. The covariance matrices for each setting were generated using the genPositiveDefMat function in the clusterGeneration package, with a specified range for the variances of the covariance matrix [31].
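A sketch of how such a simulation setting could be generated is shown below, loosely following simulation 2; the parameter values are illustrative and are not those used in the study.

```r
# Sketch of generating component covariance matrices and a two-cluster MPLN
# dataset; all parameter values are illustrative.
library(clusterGeneration)  # genPositiveDefMat
library(MASS)               # mvrnorm

set.seed(123)
d        <- 6
n        <- 1000
mix_prop <- c(0.79, 0.21)   # mixing proportions

Sigma1 <- genPositiveDefMat(dim = d, covMethod = "unifcorrmat",
                            rangeVar = c(1, 1.7))$Sigma
Sigma2 <- genPositiveDefMat(dim = d, covMethod = "unifcorrmat",
                            rangeVar = c(1, 1.7))$Sigma
mu1 <- rep(6, d)
mu2 <- rep(1, d)

z_true <- sample(1:2, n, replace = TRUE, prob = mix_prop)   # true memberships
theta  <- t(sapply(z_true, function(g)
  if (g == 1) mvrnorm(1, mu1, Sigma1) else mvrnorm(1, mu2, Sigma2)))
Y      <- matrix(rpois(n * d, exp(theta)), nrow = n)        # observed counts
```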

Comparative studies were conducted to evaluate the ability to recover the true underlying number of clusters. For this purpose, the following model-based methods were used: HTSCluster, Poisson.glm.mix and MBCluster.Seq. Initialization of zig for all methods was done using the k-means algorithm with 3 runs. For simulation 1, π1=1 and a clustering range of G=1,…,3 was considered. For simulation 2, π1=0.79 and a clustering range of G=1,…,3 was considered. For simulation 3, π1=0.3 and π2=0.5, and a clustering range of G=2,…,4 was considered. In addition to the model-based methods, three distance-based methods were also used: k-means [32], partitioning around medoids [33] and hierarchical clustering. These were only applied to simulations 2 and 3. Further, a graph-based method employing the Louvain algorithm [34] was also used. The parameter estimation results for the mixtures of MPLN algorithm are provided in Additional file 3. The clustering results for all methods are summarized in Table 4.

Table 4 Number of clusters selected (average ARI, standard deviation) for each simulation setting using mixtures of MPLN distributions

The adjusted Rand index (ARI) values obtained for mixtures of MPLN were equal to or very close to one, indicating that the algorithm is able to assign observations to the proper clusters, i.e., the clusters originally used to generate the simulated datasets. Note that G=1 cannot be run for MBCluster.Seq, so the corresponding row of results has been left blank in Table 4. Although a range of clusters G=1,2,3 was selected for Poisson.glm.mix, m = 3 in simulation 1, an ARI value of one was obtained because all runs resulted in only one non-empty cluster (the others were empty). The distance-based methods and the graph-based method resulted in low ARI values.
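The ARI between a true and an estimated partition can be computed, for example, with the adjustedRandIndex function of the mclust package (an assumption; any ARI implementation could be used):

```r
# Sketch of computing the adjusted Rand index (ARI) between true and
# estimated cluster labels.
library(mclust)  # adjustedRandIndex

true_labels      <- c(1, 1, 2, 2, 2, 1)
estimated_labels <- c(2, 2, 1, 1, 1, 2)           # same partition, different label names
adjustedRandIndex(true_labels, estimated_labels)  # 1: perfect agreement
```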

Simulation data analysis: mixtures of negative binomial distributions

In this simulation, 50 datasets with two underlying clusters were generated. All datasets had n=200 observations and d=6 samples generated using mixtures of negative binomial distributions. Comparative studies were conducted as specified earlier. Initialization of zig for all methods was done using the k-means algorithm with 3 runs. Here, π1=0.79 and a clustering range of G=1,…,3 was considered. The clustering results are summarized in Table 5. The ARI values obtained for mixtures of MPLN were equal to or very close to one, indicating that the algorithm is able to assign observations to the proper clusters. Low ARI values were observed for all other model-based clustering methods and the graph-based method. Interestingly, application of distance-based methods resulted in high ARI values.

Table 5 Number of clusters selected (average ARI, standard deviation) for the simulation setting using mixtures of negative binomial distributions

Discussion

A model-based clustering technique for RNA-seq data has been introduced. The approach utilizes a mixture of MPLN distributions, which has not previously been used for model-based clustering of RNA-seq data. The transcriptome data analysis showed the applicability of mixture model-based clustering methods on RNA-seq data. Information criteria selected the highest cluster size considered in the range of clusters for HTSCluster and Poisson.glm.mix. For MBCluster.Seq, NB, the lowest cluster size considered in the range of clusters was selected. This could potentially imply that these mixtures of Poisson and NB models are not providing a good fit to the data. However, further research is needed in this direction, including the search for other model selection criteria. The GO enrichment analysis (p-value <0.05) identified enriched terms in 75% of the clusters resulting from mixtures of MPLN distributions, whereas only 50% of clusters from MBCluster.Seq, NB and 36% of the clusters from MBCluster.Seq, Poisson contained enriched GO terms.

Using simulated data from mixtures of MPLN distributions, it was illustrated that the algorithm for mixtures of MPLN distributions is effective and returned favorable clustering results. It was observed that other model-based methods from the current literature failed to identify the true number of underlying clusters a majority of the time. Clustering trends similar to those observed in the transcriptome data analysis were observed for the other model-based methods during the simulation data analysis. The distance-based methods failed to assign observations to the proper clusters, as evidenced by the low ARI values. The graph-based method, Louvain, also failed to identify the true number of underlying clusters.

Using simulated data from mixtures of negative binomial distributions, it was illustrated that the algorithm for mixtures of MPLN distributions is effective and returned favorable clustering results. The distance-based methods also assigned observations to the proper clusters, resulting in high ARI values. It was observed that other model-based methods from the current literature, as well as the graph-based method, failed to identify the true number of underlying clusters a majority of the time. Although the correct number of clusters was selected by MBCluster.Seq, proper cluster assignment did not take place, as evidenced by the low ARI values. Note that although MBCluster.Seq, NB is based on negative binomial distributions, it has a low ARI (approximately 0). This could be because the implementation of the approach of [35], available in the R package MBCluster.Seq, currently only performs clustering based on expression profiles. Si et al. [35] mention that clustering could be done according to both the overall expression levels and the expression profiles with some modification to the parameters, but this was not available in the R package. Additionally, across all studies (both real and simulated), G=2 was selected via information criteria when MBCluster.Seq, NB was used for clustering.

Overall, the transcriptome data analysis together with simulation studies show superior performance of mixtures of MPLN distributions, compared to other methods presented.

Conclusions

The mixture model-based clustering method based on MPLN distributions is an excellent tool for analysis of RNA-seq data. The MPLN distribution is able to describe a wide range of correlation and overdispersion situations, and is ideal for modeling RNA-seq data, which is generally overdispersed. Importantly, the hidden layer of the MPLN distribution is a multivariate Gaussian distribution, which accounts for the covariance structure of the data. As a result, independence does not need to be assumed between variables in clustering applications.

The scripts used to implement this approach are publicly available and reusable, such that they can be easily modified and utilized in any RNA-seq data analysis pipeline. Further, the requirement of a vector of library size estimates for the samples can be relaxed, allowing the proposed clustering approach to be applied to any discrete count dataset. A direction for future work would be to investigate subspace clustering methods to overcome the curse of dimensionality as high-dimensional RNA-seq datasets become increasingly available.

Methods

Mixtures of MPLN Distributions

The sequencing depth can differ between samples in an RNA-seq study. Therefore, the assumption of equal means across conditions is unlikely to hold. To account for the differences in library sizes, a fixed, known constant sj, representing the normalized library size of sample j, is incorporated as an offset, log sj, added to the log mean of the Poisson distribution. Thus, for genes i∈{1,…,n} and samples j∈{1,…,d}, the MPLN distribution is modified to give

$$\begin{aligned} Y_{ij} | \theta_{ij} &\sim \mathscr{P}(\exp\{\theta_{ij}+\log s_{j}\})\\ (\theta_{i1}, \ldots,\theta_{id})^{\prime} &\sim \mathscr{N}_{d}(\boldsymbol{\mu},\boldsymbol{\Sigma}). \end{aligned} $$

A G-component mixture of MPLN distributions can be written

$${\begin{aligned} f(\mathbf{y};\boldsymbol{\Theta})&=\sum\limits_{g=1}^{G}\pi_{g}f_{\mathbf{Y}}(\mathbf{y}|\boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g})\\ &= \sum\limits_{g=1}^{G}\pi_{g}\int_{\mathbb{R}^{d}} \left(\! \prod_{j=1}^{d} f(y_{ij}| \theta_{ijg}, s_{j})\!\right) ~f\!(\boldsymbol{\theta}_{ig}|\boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g}\!)~d\boldsymbol{\theta}_{ig}, \end{aligned}} $$

where Θ=(π1,…,πG,μ1,…,μG,Σ1,…,ΣG) denotes all model parameters and fY(y|μg,Σg) denotes the distribution of the gth component with parameters μg and Σg. The unconditional moments of the MPLN distribution can be obtained via conditional expectation results and standard properties of the Poisson and log normal distributions. For the gth component of a mixture of MPLN distributions, the mean of Yj is \(\mathbb {E}(Y_{j}) = \exp \left \{\mu _{jg} + \frac {1}{2} \sigma _{jjg} \right \} \overset {\text {def}}{=} m_{jg}\) and the variance is \(\mathbb {V}\text {ar}(Y_{j}) = m_{jg} + m_{jg}^{2} (\exp \{\sigma _{jjg}\} - 1)\). Here, μjg denotes the jth element of μg and σjjg the jth diagonal element of Σg, for j=1,…,d. Since \(\mathbb {V}\text {ar}(Y_{j}) \geq \mathbb {E}(Y_{j})\), the marginal distribution is overdispersed with respect to the Poisson distribution.
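As a quick sanity check of these moment expressions, the following sketch simulates a large univariate Poisson-log normal sample and compares the empirical mean and variance with the formulas above; the parameter values are arbitrary.

```r
# Monte Carlo check of the marginal MPLN moments for a single dimension:
# E(Y) = exp(mu + sigma2/2) = m and Var(Y) = m + m^2 * (exp(sigma2) - 1).
set.seed(42)
mu_j     <- 1.5
sigma2_j <- 0.4
theta    <- rnorm(1e6, mean = mu_j, sd = sqrt(sigma2_j))  # hidden Gaussian layer
y        <- rpois(1e6, lambda = exp(theta))               # observed counts

m_j <- exp(mu_j + sigma2_j / 2)
c(empirical = mean(y), theoretical = m_j)                                    # means
c(empirical = var(y),  theoretical = m_j + m_j^2 * (exp(sigma2_j) - 1))      # variances
```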

Parameter Estimation

To estimate the parameters, a maximum likelihood estimation procedure based on the EM algorithm is used. In the context of clustering, the unknown cluster membership variable is denoted by Zi such that Zig=1 if an observation i belongs to group g and Zig=0 otherwise, for i=1,…,n;g=1,…,G. The complete-data consist of (y,z,θ), the observed and missing data. Here, z is a realization of Z. The complete-data log-likelihood for the MPLN mixture model is

$$ {\begin{aligned} l_{c}(\boldsymbol{\Theta}) & = \sum\limits_{i=1}^{n} \sum\limits_{g=1}^{G} z_{ig} \log \left[ \pi_{g} \left(\prod\limits_{j=1}^{d} f(y_{ij}| \theta_{ijg}, s_{j})\right) f(\boldsymbol{\theta}_{ig}|\boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g}) \right]\\ & = \sum\limits_{g=1}^{G} n_{g} \log \pi_{g} - \sum\limits_{i=1}^{n} \sum\limits_{g=1}^{G} \sum\limits_{j=1}^{d} z_{ig} \exp \{\theta_{ijg} + \log s_{j}\}\\ &\quad+ \sum\limits_{i=1}^{n} \sum\limits_{g=1}^{G} z_{ig} (\boldsymbol{\theta}_{ig} + \log \mathbf{s}) \mathbf{y}_{i}^{\prime} - \sum\limits_{i=1}^{n} \sum\limits_{g=1}^{G} \sum\limits_{j=1}^{d} z_{ig} \log y_{ij}!\\ &\quad- \frac{nd}{2} \log 2 \pi - \frac{1}{2} \sum\limits_{g=1}^{G} n_{g} \log | \boldsymbol{\Sigma}_{g} |\\ &\quad- \frac{1}{2} \sum\limits_{i=1}^{n} \sum\limits_{g=1}^{G} z_{ig} (\boldsymbol{\theta}_{ig} - \boldsymbol{\mu}_{g}) \boldsymbol{\Sigma}_{g}^{-1} (\boldsymbol{\theta}_{ig} - \boldsymbol{\mu}_{g})^{\prime}, \end{aligned}} $$

where \(n_{g} = \sum \nolimits _{i=1}^{n} z_{ig}^{(t)}\). The conditional expectation of the complete-data log-likelihood given the observed data, \(\mathcal {Q}\), is

$$ \begin{aligned} \mathcal{Q}(\boldsymbol{\Theta}) & = \mathbb{E} \left[ l_{c}(\boldsymbol{\Theta}) \right] = \mathbb{E} \left[ \log \big(\pi_{g} f(\boldsymbol{y}|\boldsymbol{\theta}_{g}, \boldsymbol{s}) f(\boldsymbol{\theta}_{g} | \boldsymbol{\vartheta}_{g}) \big) \right]. \end{aligned} $$
(1)

Here, 𝜗g=(μg,Σg), for g=1,…,G. Because the first term of (1) does not depend on parameters \(\boldsymbol {\vartheta }_{g}, \mathcal {Q}\) can be written

$$ \begin{aligned} \mathcal{Q}(\boldsymbol{\vartheta}_{g}| \boldsymbol{\vartheta}_{g}^{(t)}) & = \mathbb{E} \left[\log f(\boldsymbol{\theta}_{g}| \boldsymbol{Y}, \boldsymbol{\vartheta}_{g}) | \boldsymbol{Y} = \boldsymbol{y} \right] + c(\boldsymbol{y}), \end{aligned} $$
(2)

where c is independent of 𝜗g. The density of the term f(θg|y,𝜗g) in (2) is

$$ {\begin{aligned} f(\boldsymbol{\theta}_{g} | \boldsymbol{y}, \boldsymbol{\vartheta}_{g}) & = \frac{f(\boldsymbol{y} | \boldsymbol{\theta}_{g}) f(\boldsymbol{\theta}_{g}, \boldsymbol{\vartheta}_{g})} {f(\boldsymbol{y}, \boldsymbol{\vartheta}_{g})} = \frac{f(\boldsymbol{y} | \boldsymbol{\theta}_{g}) f(\boldsymbol{\theta}_{g}, \boldsymbol{\vartheta}_{g})} {\int_{\boldsymbol{\theta}_{g}} f(\boldsymbol{y} | \boldsymbol{\theta}_{g}) f(\boldsymbol{\theta}_{g}, \boldsymbol{\vartheta}_{g}) \mathrm{d} \boldsymbol{\theta}_{g}}. \end{aligned}} $$
(3)

Due to the integral present in (3), evaluation of f(y,𝜗g) is difficult. Therefore, the E-step cannot be solved analytically. Here, an extension of the EM algorithm, called Monte Carlo EM (MCEM) [36], can be used to approximate the \(\mathcal {Q}\) function. MCEM involves simulating, at each iteration t and for each observation yi, a random sample of size B, i.e., \(\boldsymbol {\theta }^{(1)}_{ig}, \ldots,\boldsymbol {\theta }^{(B)}_{ig}\), from the distribution f(θg|y,𝜗g) in order to obtain a Monte Carlo approximation to the conditional expectation of the complete-data log-likelihood given the observed data. Each draw from the MCEM simulation is indexed by k, where k=1,…,B. Because the values from the initial iterations are discarded to minimize bias, the number of draws used for parameter estimation is N, where N<B. Thus, a Monte Carlo approximation for \(\mathcal {Q}\) in (2) is

$$\begin{aligned} \mathcal{Q}(\boldsymbol{\vartheta}_{g}| \boldsymbol{\vartheta}_{g}^{(t)}) & =\sum\limits_{g=1}^{G} \sum\limits_{i=1}^{n} \mathcal{Q}_{ig}(\boldsymbol{\vartheta}_{g}| \boldsymbol{\vartheta}_{g}^{(t)}),\\ \mathcal{Q}_{ig}(\boldsymbol{\vartheta}_{g}| \boldsymbol{\vartheta}_{g}^{(t)}) & \simeq \frac{1}{N} \sum\limits_{k=1}^{N} \log f(\boldsymbol{\theta}_{ig}^{(k)}|\boldsymbol{y}_{i},\boldsymbol{\vartheta}_{g}) + c(\boldsymbol{y}_{i}). \end{aligned} $$

However, another layer of complexity arises because the distribution f(θg|y,𝜗g) is not of a known form. Therefore, a variant of MCEM based on Markov chains, Markov chain Monte Carlo expectation-maximization (MCMC-EM), is proposed. MCMC-EM is implemented via Stan, a probabilistic programming language written in C++; the R interface to Stan is available via RStan.

Bayesian Inference With Stan

Bayesian approaches to mixture modeling offer the flexibility of sampling from computationally complex models using MCMC algorithms. For the mixtures of MPLN distributions, the random sample \(\boldsymbol {\theta }^{(1)}_{ig}, \ldots,\boldsymbol {\theta }^{(B)}_{ig}\) is simulated via the RStan package. RStan carries out sampling from the posterior distribution via the No-U-Turn Sampler (NUTS). The prior on θig is a multivariate Gaussian distribution and the likelihood is Poisson. Within RStan, the warmup argument is set to half the total number of iterations, as recommended [37]. The warmup samples are used to tune the sampler and are discarded from further analysis.
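A minimal, illustrative sketch of this sampling step for a single observation is given below; it is not the exact implementation used here. The Stan program, data values and variable names are assumptions chosen for illustration, with a multivariate Gaussian prior on θ and a Poisson likelihood with a log library-size offset.

```r
# Illustrative sketch (not the exact implementation): sampling the hidden
# vector theta for one observation with RStan/NUTS.
library(rstan)

stan_code <- "
data {
  int<lower=1> d;
  int<lower=0> y[d];       // pre-Stan-2.33 array syntax, as supported by CRAN rstan
  vector[d] log_s;         // log library-size offsets
  vector[d] mu;            // component mean (assumed known here)
  cov_matrix[d] Sigma;     // component covariance (assumed known here)
}
parameters {
  vector[d] theta;
}
model {
  theta ~ multi_normal(mu, Sigma);   // hidden Gaussian layer
  y ~ poisson_log(theta + log_s);    // Poisson counts on the log scale
}
"

d   <- 6
dat <- list(d = d, y = as.integer(c(12, 5, 30, 8, 22, 15)),
            log_s = rep(0, d), mu = rep(2, d), Sigma = diag(d))

fit <- stan(model_code = stan_code, data = dat,
            iter = 1000, warmup = 500, chains = 3)   # warmup = half of iter
theta_draws <- extract(fit, pars = "theta")$theta     # posterior draws of theta
```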

Using MCMC-EM, the expected values of θig and of the group membership variable Zig are updated in the E-step as follows

$$\begin{aligned} &\mathbb{E}(\boldsymbol{\theta}_{ig} |\mathbf{y}_{i}) \simeq \frac{1}{N} \sum\limits_{k=1}^{N} \boldsymbol{\theta}_{ig}^{(k)} \simeq \boldsymbol{\theta}^{(t)}_{ig},\\ &\mathbb{E}(Z_{ig} |\mathbf{y}_{i}, \boldsymbol{\theta}_{ig}, \mathbf{s}) = \frac{ \pi_{g} f\left(\mathbf{y}_{i}| \boldsymbol{\theta}_{ig}^{(t)}, \mathbf{s}\right) f\left(\boldsymbol{\theta}_{ig}|\boldsymbol{\mu}_{g}^{(t)}, \boldsymbol{\Sigma}_{g}^{(t)}\right)}{\sum\nolimits_{h=1}^{G} \pi_{h}^{(t)} f(\boldsymbol{y}_{i}|\boldsymbol{\theta}_{ih}^{(t)}, \mathbf{s}) f(\boldsymbol{\theta}_{ih}|\boldsymbol{\mu}_{h}^{(t)}, \boldsymbol{\Sigma}_{h}^{(t)})} =: z_{ig}^{(t)}. \end{aligned} $$

During the M-step, the updates of the parameters are obtained as follows

$$ \begin{aligned} &\pi_{g}^{(t+1)} = \frac{\sum\nolimits_{i=1}^{n} z_{ig}^{(t)}}{n},\qquad \boldsymbol{\mu}^{(t+1)}_{g} = \frac{\sum\nolimits_{i=1}^{n} z_{ig}^{(t)} \mathbb{E} (\boldsymbol{\theta}_{ig})} {\sum\nolimits_{i=1}^{n} z_{ig}^{(t)}},\\ &\boldsymbol{\Sigma}^{(t+1)}_{g} = \frac{\sum\nolimits_{i=1}^{n} z_{ig}^{(t)} \mathbb{E} \left(\left(\boldsymbol{\theta}_{ig} - \boldsymbol{\mu}^{(t+1)}_{g}\right) \left(\boldsymbol{\theta}_{ig} - \boldsymbol{\mu}^{(t+1)}_{g}\right)^{\prime} \right)} {\sum\nolimits_{i=1}^{n} z_{ig}^{(t)} }. \end{aligned} $$
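The following sketch illustrates these M-step updates in R, assuming an n × G posterior-probability matrix `z` and an n × d × G array `E_theta` of conditional expectations of θ (both illustrative names). For brevity, the expected outer product in the Σg update is approximated by the outer product of the expected θ; in the MCMC-EM algorithm this expectation would be averaged over the Monte Carlo draws.

```r
# Sketch of the M-step updates given E-step quantities (illustrative names).
m_step <- function(z, E_theta) {
  n <- nrow(z); G <- ncol(z); d <- dim(E_theta)[2]
  pi_new    <- colSums(z) / n                       # mixing proportions
  mu_new    <- vector("list", G)
  Sigma_new <- vector("list", G)
  for (g in 1:G) {
    mu_new[[g]] <- colSums(z[, g] * E_theta[, , g]) / sum(z[, g])   # weighted mean
    centred        <- sweep(E_theta[, , g], 2, mu_new[[g]])          # theta - mu
    Sigma_new[[g]] <- crossprod(sqrt(z[, g]) * centred) / sum(z[, g])  # weighted covariance
  }
  list(pi = pi_new, mu = mu_new, Sigma = Sigma_new)
}

# Toy usage: n = 4 observations, d = 2 dimensions, G = 2 components
set.seed(7)
z <- matrix(runif(8), nrow = 4, ncol = 2); z <- z / rowSums(z)
E_theta <- array(rnorm(4 * 2 * 2), dim = c(4, 2, 2))
str(m_step(z, E_theta))
```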

Convergence

To determine whether the MCMC chains have converged to the posterior distribution, two diagnostic criteria are used: the potential scale reduction factor [38] and the effective number of samples [39]. The algorithm for mixtures of MPLN distributions checks whether the RStan-generated chains have a potential scale reduction factor less than 1.1 and an effective number of samples greater than 100 [37]. If both criteria are met, the algorithm proceeds; otherwise, the chain length is increased by 100 iterations and sampling is redone. A total of 3 chains are run at once, as recommended [37]. The Monte Carlo sample size should be increased with the MCMC-EM iteration count due to persistent Monte Carlo error [40], which can contribute to slow or no convergence. For the algorithm for mixtures of MPLN distributions, the number of RStan iterations starts at a modest 1000 and is increased with each MCMC-EM iteration as the algorithm proceeds. To check whether the likelihood has reached its maximum, the Heidelberger and Welch convergence diagnostic [41] is applied to all log-likelihood values after each MCMC-EM iteration, using a significance level of 0.05. The diagnostic is implemented via the heidel.diag function in the coda package [42]. If convergence has not been reached, further MCMC-EM iterations are performed until it is.
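A brief sketch of how these checks could be carried out is given below, assuming `fit` is an RStan fit object (for example, from the sketch above) and `loglik` holds the log-likelihood values collected after successive MCMC-EM iterations; a toy log-likelihood trace is used here for illustration.

```r
# Sketch of the MCMC diagnostics and the likelihood-convergence check.
library(rstan)
library(coda)

s <- summary(fit)$summary            # `fit` assumed to be an RStan fit object
all(s[, "Rhat"]  < 1.1)              # potential scale reduction factor below 1.1?
all(s[, "n_eff"] > 100)              # effective number of samples above 100?

loglik <- -1000 + 50 * (1 - exp(-(1:40) / 5))   # toy log-likelihood trace over MCMC-EM iterations
heidel.diag(mcmc(loglik), pvalue = 0.05)        # Heidelberger and Welch diagnostic
```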

Initialization

For initialization of the parameters μg and Σg, the mean and cov functions in R are applied to the input dataset, respectively, and the logarithms of the resulting values are used. For initialization of \(\hat {z}_{ig}\), two algorithms are provided: k-means and random. For k-means initialization, k-means clustering is performed on the dataset and the resulting group memberships are used to initialize \(\hat {z}_{ig}\). The mixtures of MPLN algorithm is then run for 10 iterations and the resulting \(\hat {z}_{ig}\) values are used as starting values. For random initialization, random values are chosen for \(\hat {z}_{ig} \in [0,1]\) such that \(\sum \nolimits _{g=1}^{G} \hat {z}_{ig} = 1\) for each i. The mixtures of MPLN algorithm is then run for 10 iterations and the resulting \(\hat {z}_{ig}\) values are used as starting values. If multiple initialization runs are considered, the \(\hat {z}_{ig}\) values corresponding to the run with the highest log-likelihood value are used for downstream analysis. The value of the fixed, known constant that accounts for the differences in library sizes, s, is calculated using the calcNormFactors function from the edgeR package [43].
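A minimal sketch of the random initialization of \(\hat{z}_{ig}\) is given below; the function name is illustrative.

```r
# Random initialization of the posterior-probability matrix: each row is drawn
# uniformly and rescaled so that its G entries sum to one.
random_z_init <- function(n, G) {
  z <- matrix(runif(n * G), nrow = n, ncol = G)
  z / rowSums(z)
}

z0 <- random_z_init(n = 1000, G = 4)
range(rowSums(z0))  # each row sums to 1
```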

Parallel Implementation

Coarse-grain parallelization has been developed in the context of model-based clustering of Gaussian mixtures [44]. When a range of numbers of clusters, Gmin,…,Gmax, is considered for a dataset, each cluster size G is independent of the others, so each G can be run in parallel on a different processor. Here, the algorithm for mixtures of MPLN distributions is parallelized using the parallel [45] and foreach [46] packages. Parallelization reduced the running time (results not shown), and all analyses were performed using the parallelized code.
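A minimal sketch of this coarse-grain parallelization is shown below, using the doParallel package as the foreach backend (an assumption); `run_mpln` is a placeholder for a function that fits a G-component mixture of MPLN distributions and is stubbed here so the example runs.

```r
# Sketch of coarse-grain parallelization over the range of cluster sizes.
library(parallel)
library(foreach)
library(doParallel)

run_mpln <- function(data, G) kmeans(data, centers = G)  # placeholder for the MPLN fit
data     <- matrix(rpois(600, lambda = 10), ncol = 6)    # toy dataset

cl <- makeCluster(max(1, detectCores() - 1))
registerDoParallel(cl)

G_range <- 1:5
fits <- foreach(g = G_range) %dopar% {
  run_mpln(data, G = g)   # each cluster size runs on a separate worker
}

stopCluster(cl)
```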

Model selection

The Bayesian information criterion (BIC) [47] remains the most popular criterion for model-based clustering applications [8]. For this analysis, four model selection criteria were used: the Akaike information criterion (AIC) [48],

$$\text{AIC} = -2 \log \mathcal{L} (\hat{\boldsymbol{\vartheta}} |\boldsymbol{y}) + 2K;$$

the BIC,

$$\text{BIC} = -2 \log \mathcal{L} (\hat{\boldsymbol{\vartheta}} |\boldsymbol{y}) + K \log (n);$$

a variation on the AIC used by [49],

$$\text{AIC3} = -2 \log \mathcal{L} (\hat{\boldsymbol{\vartheta}}|\boldsymbol{y}) + 3K;$$

and the integrated completed likelihood (ICL) of [50],

$$\text{ICL} \approx \text{BIC} - 2 \sum\limits_{i=1}^{n} \sum\limits_{g=1}^{G} \text{MAP}\{ \hat{z}_{ig}\} \log \hat{z}_{ig}.$$

Here, \(\mathcal {L} (\hat {\boldsymbol {\vartheta }} |\mathbf {y})\) represents the maximized likelihood, \(\hat {\boldsymbol {\vartheta }}\) is the maximum likelihood estimate of the model parameters 𝜗, n is the number of observations, and \(\text {MAP}\{ \hat {z}_{ig}\}\) is the maximum a posteriori classification given \(\hat {z}_{ig}\). K represents the number of free parameters in the model, calculated as K=(G−1)+Gd+Gd(d+1)/2 for G clusters. These model selection criteria differ in how they penalize the log-likelihood. Rau et al. [14] make use of an alternative approach to model selection based on slope heuristics [51, 52]. Following their work, Djump and DDSE, available via the capushe package, were also used. More than 10 models need to be considered to apply the slope heuristics.
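A short sketch of how these four criteria could be computed for a fitted model is given below; `ll`, `z_hat`, `n`, `G` and `d` are illustrative inputs (the maximized log-likelihood, the n × G posterior-probability matrix, and the data dimensions).

```r
# Sketch of computing AIC, BIC, AIC3 and ICL from a fitted mixture model.
info_criteria <- function(ll, z_hat, n, G, d) {
  K    <- (G - 1) + G * d + G * d * (d + 1) / 2      # free parameters
  AIC  <- -2 * ll + 2 * K
  BIC  <- -2 * ll + K * log(n)
  AIC3 <- -2 * ll + 3 * K
  map  <- (z_hat == apply(z_hat, 1, max)) * 1        # MAP classification matrix
  ICL  <- BIC - 2 * sum(map * log(pmax(z_hat, .Machine$double.xmin)))  # entropy penalty
  c(AIC = AIC, BIC = BIC, AIC3 = AIC3, ICL = ICL)
}

# Toy usage with two observations and two components
z_hat <- matrix(c(0.9, 0.1,
                  0.2, 0.8), ncol = 2, byrow = TRUE)
info_criteria(ll = -5000, z_hat = z_hat, n = 2, G = 2, d = 6)
```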