Background

Flow Cytometry is among the most widely used platforms in biomedical research and clinical labs. It is used for investigation of a wide variety of biological problems at single cell level. Classical applications of flow cytometry include quantitative measurements of DNA content and cell cycle progression [1]. It is also one of the key platforms for studying dynamic cellular properties such as differentiation, proliferation and apoptosis, especially in the contexts of stem cells and cancer [2]. Such applications make flow cytometry the ideal platform for the purpose of identifying and monitoring the myriad states and functions in different specimens that vary over time under particular conditions and stimuli.

Typically, a flow sample is stained with fluorescent dyes, possibly attached to antibodies, and per cell events such as the expression of a cell-surface marker or the DNA content are measured in terms of fluorescence intensity. The distribution of these events are then plotted or modeled statistically for identification of important features in the sample. While developments in computational cytomics have produced many useful analytical methods (e.g. [3]), several important problems have not yet been addressed adequately. One such issue involves precise parametric modeling of dynamic features in temporal profiles such that the model parameters can characterize the transition of the populations in a sample through different cellular states. Often simple statistics such as population mean or size can be imprecise in the presence of unusually shaped distributions and outliers in temporal profiles. The modeling scenario could be complicated further by the adoption of different trajectories by different subpopulations. Indeed a rigorous algorithm for modeling cellular state transitions can not only automate the traditionally manual approach, which is subjective and labor-intensive, but also extend it to increasingly complex and high-throughput experiments.

Many major cytometric studies have highlighted the importance of characterizing temporal profiles at single cell resolution for a variety of purposes such as cell cycle expression kinetics (e.g. [4, 5]), pharmacodynamics (e.g. [6, 7]), signaling alterations in specific subpopulations (e.g. [8, 9]), dynamics of differentiation into distinct lineages (e.g. [10, 11]), and so on. Clearly, mathematical formulation of a cellular state-space, and the transitions therein, can help us model a given collection of temporal flow cytometric profiles with the required rigor. Thereupon we can study the changes in features (say, in comparison with those in control profiles) and monitor trends in parametric detail. Precise probabilistic modeling of sample distributions at each stage can automatically reveal such dynamic features as emergence of a tail subpopulation or change in the skewness of a cluster that are statistically well-defined as well as biologically insightful [3].

Temporal profiling of cellular state transitions in flow data can, however, present unique modeling challenges. Often the transient states produce non-Gaussian features such as asymmetric or trailing subpopulations owing to rush or delay in progression from one state to another [5]. Intermediate states might also produce outliers that cannot be clearly distinguished from the more distinctive states. Moreover certain metastable states may appear only inconsistently in a given time course [11]. Often the transient features appear and disappear at the tails of the more prominent distributions, and may be hard to model via automation. Thus a framework that uses robust probabilistic density functions to model time course data may be the best way to represent the underlying state-space, and reveal any sudden or gradual transition therein. In terms of the distribution of events in a flow sample, characteristics of different states may be determined by variation in size (say, percentage of cells in a peak or cluster), location (such as mean or mode) or significance (peak density) of the model components. While traditionally such changes were detected with manual or non-parametric techniques, several model-based frameworks have recently been applied with success, e.g. [3, 1215].

Here we present StateProfiler, a new framework based on finite mixture models of skew t-Normal distributions (STNMIX) for statistical characterization of flow cytometric time course data. In particular, we present in StateProfiler a new greedy Expectation-Maximization (EM) algorithm for fitting our STNMIX model. The greedy EM algorithm starts with a minimum number of distributions (or components) and sequentially inserts a new component to the mixture until model convergence is achieved. This parsimonious approach allows us to detect the dynamic appearance (and disappearance) of transient features that are characteristic of many state transitions. In addition, intermediate states are known to produce spatial features in the form of distributions with unusual shapes or low separation, which can lead to overlapping components, and hence to an overestimated number of model components. For optimal model selection, we therefore also provide in StateProfiler a new procedure for merging skew t-Normal components that are significantly overlapping in the mixture such that the change in entropy of the resulting model is minimal. Besides profiling of unusually shaped distributions and less well-separated features, this allows StateProfiler to tackle cellular heterogeneity that exists even within clonal populations.

We applied StateProfiler to learn the temporal features of cell cycle progression in two mutant strains of budding yeast Saccharomyces cerevisiae. Based on knockout of S-phase triggering cyclins Clb5 and Clb6, we compared the S-phase delay phenotypes resulting from the differential regulation of the two cyclins. Also we used StateProfiler to construct the overall temporal profile of clonal divergence underlying lineage selection in mammalian hematopoietic progenitor EML cells. By comparing the fitted models at each time point, we observed a slow and non-montonic convergence of clonal outlier subpopulations to a final median state.

Results and discussion

Temporal profiling with StateProfiler has several distinct advantages. First, the skew t-Normal mixture fitted to the data is defined by a probability density function (pdf). This function is well-defined at any resolution and can be visualized as a smooth profile, which is, unlike kernel-based non-parametric representations, not dependent on bandwidth specification. Importantly, the pdf rigorously specifies the significance of every feature, which allows us to detect the significant ones in the profile, while ignoring the ones which are not. StateProfiler bases its optimal modeling on 3 strategies: (1) to begin with, asymmetric and heavy-tailed STNMIX components model the data precisely even in the presence of outliers or skewed populations, further (2) the parsimonious fitting of the model with greedy EM yields accurately estimated components, and finally, (3) any redundant components are merged into a well-fitted output profile. By design, our STNMIX model is computationally faster to fit than the skew t mixture (STMIX) model [3, 12, 16, 17] without sacrificing precision or rigor. Ho et al. [13] summarized the differences between the STMIX and STNMIX models and showed the implementation of the STNMIX model is generally much simpler and faster than that of STMIX model.

For temporal profiling, certain parameters of STNMIX model such as shape are uniquely suited to detect lagging or hastening trends in subpopulations (such as delay phenotypes in gene knockout experiments) that directly correspond to interesting cellular states and functions. Clearly this is neither possible with non-parametric representations nor using traditional parametric models based on Gaussian, t or other symmetric components [5]. Moreover, such shape or size parameters could be used to test for separability among components - i.e. to identify tendencies of subpopulations to move towards or away from each other without actually changing their mean locations. Parametric "snapshots" of such back-and-forth trends can shed light on the the discrete (switch-like) or continuous (spectrum-like) nature of the state transitions, leading to statistical observation of systems exhibiting multistable dynamics [10].

To illustrate some applications of StateProfiler, we analyzed two previously generated datasets for studying (a) cell division cycle and (b) cell differentiation in different species.

Cell cycle profiling

We applied StateProfiler to identify the temporal features of budding yeast cell cycle progression based on knockout of S-phase triggering cyclins Clb5 and Clb6. In late G1-phase, while both Clb5 and Clb6 activate Cdc28p to promote initiation of DNA synthesis, the exact mechanisms and extents of regulating this transition from G1 to S phase are distinct for the two cyclins [4]. In particular, Clb5 knockout causes a more prominent S phase defect during cell cycle progression in yeast cells than Clb6 knockout. Since DNA replication happens in S phase, we studied the dynamics of transition from the start and end states corresponding of one and two copies of the chromosomes (respectively, G1 and G2-M phases) while passing through intermediate states corresponding to S phase delay in the mutants. Interestingly, while genetic mutations are long known to produce delay phenotypes in cell cycle progression, few algorithms prior to StateProfiler could model the lag in the DNA distributions with precision.

We fitted STNMIX models to flow samples from two cell cycle time courses with 10 time-points each in yeast cells with knockout of Clb5 (Clb5Δ) and Clb6 (Clb6Δ3P). The time courses spanned more than one cell cycle period with respect to wild-type yeast cells dividing under the same protocol. The fitted mixture models identified two or more components in every sample, which typically corresponded to the 1C and 2C peaks before and after DNA synthesis, along with subpopulations in the intermediate S-phase, thus characterizing an overall spectrum of profiles of different state transitions (Figure 1).

Figure 1
figure 1

Cell cycle time-course profiles. Cell cycle time-course profiles. Overall spectrum of temporal profiles based on STNMIX modeling of flow cytometric DNA content data.

The smooth profiles of the noisy DNA histograms at every time-point are constructed with StateProfiler according to optimal change in the entropy values of the fitted model (Figure 2). For example, the entropy plot (Figure 2a, b) suggests a jump in entropy (or elbow) beyond g = 2 components for Clb5Δ data at t = 25 min (blue histogram in Figure 2c). The resulting 2-component profile is depicted by the orange curve in Figure 2c. The individual components involved in the model are identified and shown as black dotted curves. Their parameters could be used to detect features for purposes like sorting cells (FACS) or monitoring trends in specific subpopulations (e.g. note the lag in the left component in Figure 2).

Figure 2
figure 2

Modeling a temporal flow cytometric profile. Modeling a temporal flow cytometric profile. (a) Entropy values for a profile combining a given number of components (g) based on the results of the Greedy EM algorithm for Clb5Δ at t = 25 min. (b) Differences between successive entropy values as g increases. (c) DNA distribution for Clb5Δ at t = 25 min, as depicted by a histogram, is modeled with a 2-component skew t-Normal mixture. The orange curve shows the fitted profile while the underlying components are shown in dotted curves.

To determine the precision of STNMIX, we computed log-likelihood maxima ^ max , BIC values, and distances D n based on Kolmogorov-Smirnov (K-S) test statistic, and compared in Table 1 with the same for four competing 2-component mixture models (of normal, t, skew normal, and skew t) known from the literature [3, 18]. According to BIC, the optimal selection of the STNMIX model with equal dfs is evident (e.g. the 2-component model at t = 25). As seen from Dn, we also conclude that STNMIX achieves the most precise modeling in terms of both the count and asymmetry of components in the given data. Further, we used the models for objective comparison of profiles both within and across time-courses. We computed the Gap statistic [19] as a measure of dispersion of cellular events between the two extreme states corresponding to the 1C and 2C peaks or clusters. Tested against a reference distribution of data with no clustering, the Gap statistics support the biological observation of Jackson et al. [4] that the Clb5 mutant shows more pronounced S-phase delay phenotypes than the Clb6 mutant and hence has less well-separated components in mid-cell cycle (e.g. t = 25). The contrast between the samples in terms of cells showing a slower state transition from 1C to 2C may be observed in Table 2 for different time-points. Finally, we observe the gradual variation in the key features at each successive time-point to gain insights into the differential regulation of the S-phase by the cyclins Clb5 and Clb6 (Figure 3).

Table 1 Details of competing models for Clb5 data
Table 2 Measuring dispersion of events at each time point
Figure 3
figure 3

Comparison of time-course profiles. Comparison of time-course profiles constructed with StateProfiler. The orange-red and green-blue curves represent DNA distributions of Clb5Δ and Clb6Δ3P cells respectively. The time-points in minutes and budding information are indicated.

Cell differentiation profiling

Another key area in which flow data are extremely insightful about different state transitions is cell differentiation. In recent years, many important advances in biology have been made by studying the modes and mechanisms of differentiation especially in the context of stem cells and cancer. Stem cell differentiation has also been studied for their clinical applications such as in the field of regenerative medicine. An excellent review of the field is given in a recent text edited by Krishan et al. [2]. Over the course of differentiation, the profiles of expression of various markers - including those indicating stemness or commitment to a lineage - vary according to transitions of populations through unstable, metastable and eventually stable states. Often measurable phenotypic diversity appears due to cell-to-cell variability even within clonal populations, which are manifest and can be studied as outlier events or asymmetric or tail subpopulations. Sometimes these features are transient and peripheral, and could be hard to distinguish via automation. Accurate modeling of dynamic flow profiles is thus essential to identify or monitor transitional features as and when they appear (or disappear) for objective temporal characterization of the state-space components involved in differentiation.

In the present study, we analyzed clonal populations of EML cells, a multipotent mouse haematopoietic cell line that can differentiate into myeloid, erythroid, and other lineages. In a recent study, Chang et al. [11] measured the expression levels of the stem cell marker Sca-1 in different subpopulations of EML cells as time course data. They observed that cell-to-cell heterogeneity in this clonal progenitor population gave rise to Sca-1 outlier cells - cells that exhibit very high or low Sca-1 expression - and possessed distinct gene expression patterns. The heterogeneity could not be attributed to measurement noise or cell-cycle-dependent cell size variation. Eventually, however, each of these distinct Sca-1 subpopulations' profiles became similar to that of the median cells, thus revealing an attractor state. Yet it was noted [11] that the divergence lasted long enough to allow different propensities for either subpopulation, i.e. low and high Sca-1, to enter into a transient state that primes them for either the erythroid or the myeloid lineage, as captured by their differential expression of lineage-specific transcription factors.

For precise characterization of the dynamics by which population heterogeneity arose in this clonal population via outliers and subsided ultimately, cells with the lowest, middle and highest levels of Sca-1 expression were isolated by [11] using fluorescence-activated cell sorting (FACS). We call these subsets Sca-1low, Sca-1mid, and Sca-1high. Following FCAS, the sorted cells were immediately stripped of the staining antibody and cultured in standard growth medium. Subsequently, Sca-1 fluorescence intensity were measured individually for each of the 3 subpopulations as time course data. Similar measurements were made for an original clonal population of EML cells for comparison (we call it Sca-1all).

We applied the StateProfiler framework to model the flow profiles for 14-point time course data for each of the 4 populations. Often finite mixtures of Gaussians are used for modeling the theoretical subpopulation structure in such profiles [11, 20]. However, using Gaussian components, precise modeling in the presence of outliers due to cell-to-cell heterogeneity is particularly difficult for clonal populations. This is because an optimal model must be able to accommodate such heterogeneity without requiring extra components, but Gaussian components with sharp tails are hardly robust against outliers. It leads to sub-optimal models with spurious subpopulations, which makes their biological interpretation difficult.

StateProfiler addressed the modeling problem in two ways. First, its skew t-Normal components are robust to outliers and asymmetry in the distributions. This helps in modeling transitional features even if they lead to unusually shaped or heavy tailed distributions. Second, even if redundant subpopulations were identified, the new merging procedure in StateProfiler can re-construct any significantly overlapping components in a statistically optimal fashion, i.e. to produce a combined profile by causing minimal change in entropy of the model pre- and post-reconstruction.

The dual advantages of the StateProfiler modeling algorithm allowed us to compute highly accurate profiles of Sca-1 expression in the time course datasets for the three sorted and the unsorted EML cells. The steps of the merging procedure through which an optimal structure for the model is "stitched" together are illustrated with an example in Figure 4. Finally, we compared the divergence of the 3 sorted subsets from the corresponding unsorted population using Kullback-Leibler distances between the probability density functions specifying their profiles. A visual comparison of the profiles is shown in Figure 5. The trend of decreasing divergence, as the 3 sorted profiles become similar to the unsorted profile with progression of time, is shown in Figure 6.

Figure 4
figure 4

An example of merging mixture components. The Sca-1 expression data for the unsorted population of EML cells at 264 h is shown in the histogram. At each step of the merging algorithm, the fitted profile is shown as a thick grey curve, and the individual components in think black curves. (a) Initial profile computed by Greedy EM with g = 5, Entropy = 2351. (b) Merged profile with g = 4, Entropy = 573. Combining a group of components in the left significantly reduces entropy. (c) Merged profile with g = 3, Entropy = 297. (d) The final merged profile with g = 2 components and Entropy = 48.

Figure 5
figure 5

Comparison of time-course profiles. The temporal profiles of the 3 sorted subsets and the unsorted clonal population are constructed with StateProfiler, and plotted for visual comparison.

Figure 6
figure 6

The trend of convergence to the unsorted profile. Kullback-Leibler (KL) distance of the profiles for each of the 3 sorted subpopulations from the unsorted profile at a given time-point. While the distances decrease with time, the trend is slow and does not appear to be monotonic.

StateProfiler's parametric characterization can reveal various features and trends of interest in terms of specific parameters. For instance, we observe that by 3 days, both Sca-1mid and Sca-1high have already started to resemble the unsorted population, and by 6 days, they actually have their own low Sca-1 tails. Another trend of possible interest is the slow but continuous fluctuation in the proportion of low Sca-1 outliers in the unsorted population. Finally, it appears that the eventual stable state when the 3 profiles finally coincide is reached at a point of time much later than 9 days, as suggested by [11], and takes probably double that time (432 h). In the mean time, as we see in Figure 6, the states might continue to drift closer and apart as in a dynamical system exhibiting multistable behaviour. If indeed the departure from the average state has biological functionality in the priming of cell fate commitment, then a non-monotonic, delayed restoration of the underlying molecular mechanisms may be justified by having more than a few cells with random fluctuation and call for further investigation.

Conclusions

In this study, we described StateProfiler, a framework to construct temporal profiles with flow data, which can facilitate parametric modeling of cellular state transitions Towards this, we presented 3 key features of the framework. First, we described a finite mixture of skew t-Normal distributions. Second, we presented a new greedy EM algorithm for fast and optimal model selection. The parsimonious approach of our greedy algorithm allows us to detect the variation in the features as and when they appear and disappear at different points of time thereby offering a parametric characterization of the overall nature of state transition. Third, we designed a mixture merging procedure for ensuring robust estimation of the fitted profile. The code implementing the framework is available from the authors upon request. Indeed the proposed framework is effective, general and may be applied to other similar domains.

Methods and materials

Mixtures of skew Student-t-normal distributions

We describe the skew t-Normal mixture model (STNMIX) of StateProfiler. To simplify notation, we let ϕ(.) and Φ(.) denote the probability density function (pdf) and the cumulative distribution function (cdf) of the standard normal distribution, respectively. Let

t ( x | ξ , σ 2 , ν ) = Γ ( ν + 1 ) / 2 ) Γ ( ν / 2 ) π ν σ ( 1 + ( x ξ ) 2 ν σ 2 ) ( ν + 1 ) / 2

denote the pdf of the t distribution with location ξ, scale σ2 and degrees of freedom (df) v, and t(x|v ) simply for the case when ξ = 0 and σ = 1; and let Γ(α,β) be the gamma distribution with density g(x|α, β) ∝ xα-1exp{-βx}. We start by defining the STN distribution and then note further properties.

As introduced by Gómez et al. [21], a random variable Y is said to follow the STN with location parameter ξ ∈ ℝ, scale parameter σ2 ∈ (0, ∞), skewness parameter λ ∈ ℝ and degrees of freedom v ∈ (0,∞) it is has the density

ψ ( y ) = 2 t ( y | ξ , σ 2 , v ) Φ λ y - ξ σ .
(1)

We shall write Y~ STN(ξ,σ2,λ,v) if Y has the density of (1).

Ho et al. [13] give following hierarchical representation of STN to establish an EM-type algorithm [22].

Y | γ , τ ~ N ξ + σ λ τ + λ 2 γ , σ 2 τ + λ 2 , γ | τ ~ T N 0 , τ + λ 2 τ ; ( 0 , ) , τ ~ Γ ( v / 2 , v / 2 ) ,
(2)

where T N(µ, σ2; (a, b)) represents the truncated normal distribution for N(µ, σ2) lying within the truncated interval (a, b).

Consider n independent random variables Y1,..., Y n , which are taken from a mixture of STN distributions. The pdf of a g-component STNMIX model is

f ( y j | Θ g ) = i = 1 g w i ψ ( y j | θ i ) ,
(3)

Where w i 's are mixing proportions which are constrained to be positive and i = 1 g w i = 1 , ψ ( y j | θ i ) is the STN density defined in (1) and Θ g = (w1,..., wg- 1, θ1,..., θ g ) represents all unknown parameters. Note that the component vector θ i consists of ( ξ i , σ i 2 , λ i , v i ) .

Based on (2), a practical ECM/ECME algorithm [23, 24] proceeds are described by Ho et al. [13] as follows:

E-step: Given Θ g = Θ ^ g ( h ) ,compute following z i j ( h ) ^ , τ ^ i j ( h ) , κ i j ( h ) ^ and γ 1 ^ i j ( h ) for i = 1,..., g and j = 1,..., n.

i j ( h ) = ŵ i ( h ) ψ ( y j | θ ^ i ( h ) ) f ( y j | Θ ^ ( h ) ) , τ ^ i j ( h ) = v ^ i ( h ) + 1 v ^ i ( h ) + u ^ i j 2 ( h ) , κ ^ i j ( h ) = DG v ^ i ( h ) + 1 2 - log v ^ i ( h ) + u ^ i j 2 ( h ) 2 , γ ^ 1 i j ( h ) = λ ^ i ( h ) u ^ i j ( h ) + ϕ ( λ ^ i ( h ) u ^ i j ( h ) ) Φ ( λ ^ i ( h ) u ^ i j ( h ) ) ,

where u ^ i j ( h ) = ( y j - ξ ^ i ( h ) ) / σ ^ i ( h ) .

CM-step: Update the estimation by

ŵ i ( h + 1 ) = n ^ i ( h ) / n , ξ ^ i ( h + 1 ) = b ^ 1 i ( h ) + λ ^ i 2 ( h ) b ^ 2 i ( h ) - σ ^ i ( h ) λ ^ i ( h ) b ^ 3 i ( h ) j = 1 n i j ( h ) τ ^ i j ( h ) + λ ^ i 2 ( h ) n ^ i ( h ) , σ ^ i 2 ( h + 1 ) = 1 n ^ i ( h ) j = 1 n i j ( h ) τ ^ i j ( h ) ( y j - ξ ^ i ( h + 1 ) ) 2 , λ ^ i ( h + 1 ) = j = 1 n i j ( h ) γ ^ 1 i j ( h ) u ^ i j ( h + 1 ) j = 1 n i j ( h ) u ^ i j 2 ( h + 1 ) , v ^ i ( h + 1 ) = arg max v i v i 2 log v i 2 - log Γ v i 2 + v i 2 b ^ 4 i ( h ) ,

where n ^ i ( h ) = j = 1 n i j ( h ) , b ^ 1 i ( h ) = j = 1 n i j ( h ) τ ^ i j ( h ) y j , b ^ 2 i ( h ) = j = 1 n i j ( h ) y j , b ^ 3 i ( h ) = j = 1 n i j ( h ) γ ^ 1 i j ( h ) , b ^ 4 i ( h ) = j = 1 n i j ( h ) ( κ ^ i j ( h ) - τ ^ i j ( h ) ) / n ^ i ( h ) and u i j ^ ( h + 1 ) = ( y j - ξ i ( h + 1 ) ^ ) / σ i ( h + 1 ) ^ .

If the dfs are assumed to be identical, say v 1 = ⋯ = v g = v, we could update v ^ ( h ) by

v ^ ( h + 1 ) = arg max ν { j = 1 n log { i = 1 g w ^ i ( h + 1 ) × ψ ( y j | ξ ^ i ( h + 1 ) , σ ^ i 2 ( h + 1 ) , λ ^ i ( h + 1 ) , v ) } } .

The E-step and CM/CML-steps are alternately repeated until a suitable convergence rule is satisfied, e.g., the Aitken acceleration based stopping criterion | ( h + 1 ) - ( h + 1 ) | < ε , where ( h + 1 ) is the observed log-likelihood evaluated at Θ ^ g ( h ) , ( h + 1 ) is the asymptotic estimate of the log-likelihood at iteration h + 1 (see [18]; Chap. 4.9) and ε is the desired tolerance. For numerical analyses in this paper, a default value of ε = 10-6 was used to terminate the iterations.

Greedy learning for STN mixtures

In this section, we present a new greedy version of the EM algorithm to determine the optimum number of components in the fitting of STNMIX models. The greedy EM approach was first introduced by Vlassis and Likas [25]. The fundamental concept of the greedy EM algorithm is to start from a minimum number of components and sequentially insert a new component to the mixture until convergence is achieved. The stopping criterion can be a pre-specified maximum number of components or a pre-specified convergence tolerance.

Suppose a new component ψ ( y j | θ g + 1 ) is added to a g-component f ( y j | Θ g ) . The resulting mixture takes the form of

f ( y j | Θ g + 1 ) = ( 1 - a ) f ( y j | Θ g ) + a ψ ( y j | θ g + 1 ) ,

where 0 < a < 1 and Θ g + 1 = ( Θ g , a , θ g + 1 ) with θ g + 1 being the added parameters ( ξ g + 1 , σ g + 1 2 , λ g + 1 , v g + 1 ) . Given an old mixture f ( y j | Θ ^ g ) , the weight a and θ g + 1 are optimally chosen to maximize the new log-likelihood

L g + 1 = j = 1 n log f ( y j | Θ g + 1 ) = j = 1 n log [ ( 1 - a ) f ( y j | Θ ^ g ) + a ψ ( y j | θ g + 1 ) ] .
(4)

To find the optimal solution in (4), we start by performing a local search with for the newly inserted component. This gives rise to the following partial EM steps where θ ̃ denotes and the partial ML estimates of θ. For notational simplicity, the subscript (g + 1) is suppressed below in the Partial E-step.

Partial E-step: Calculating the conditional expectation of latent variables at the k th iteration, this yields

z ̃ j ( k ) = a ˜ ( k ) ψ ( y j | θ ̃ ( k ) ) ( 1 - a ˜ ( k ) ) f ( y j | Θ g ( k ) ^ ) + a ˜ ( k ) ψ ( y j | θ ̃ ( k ) ) , τ j ( k ) ^ = ( k ) + 1 ( k ) + ũ j 2 ( k ) , γ ̃ 1 j ( k ) = λ ̃ ( k ) ũ j ( k ) + ϕ ( λ ̃ ( k ) ũ j ( k ) ) Φ ( λ ̃ ( k ) ũ j ( k ) ) , κ ̃ j ( k ) = DG ( k ) + 1 2 - log ( k ) + ũ j 2 ( k ) 2 ,

Where ũ j ( k ) = ( y j - ξ ̃ ( k ) ) / σ ̃ ( k ) .

Partial M-step: Updating the new parameters in (a, θg+1), we get

a ˜ ( k + 1 ) = j = 1 n z ̃ j ( k ) n , ξ ̃ g + 1 ( k + 1 ) = b ̃ 1 ( k ) + λ ̃ 2 ( k ) b ̃ 2 ( k ) - σ ̃ ( k ) λ ̃ ( k ) b ̃ 3 ( k ) j = 1 n z ̃ j ( k ) τ ̃ j ( k ) + λ ̃ 2 ( k ) j = 1 n z ̃ j ( k ) , σ ̃ g + 1 2 ( k + 1 ) = j = 1 n z ̃ j ( k ) τ ̃ j ( k ) ( y j - ξ ̃ ( k + 1 ) ) 2 j = 1 n z ̃ j ( k ) , λ ̃ g + 1 ( k + 1 ) = j = 1 n z ̃ j ( k ) γ ̃ 1 j ( k ) ũ j ( k + 1 ) j = 1 n z ̃ j ( k ) ũ j 2 ( k + 1 ) , ν ̃ g + 1 ( k + 1 ) = arg max ν ν 2 log ν 2 - log Γ ν 2 + ν 2 b ̃ 4 ( k ) ,

Where ũ j ( k + 1 ) = ( y j - ξ ̃ ( k + 1 ) ) / σ ̃ ( k + 1 ) , b ̃ 1 ( k ) = j = 1 n z ̃ j ( k ) τ ̃ j ( k ) y j , b ̃ 2 ( k ) = j = 1 n z ̃ j ( k ) y j , b ̃ 3 ( k ) = j = 1 n z ̃ j ( k ) γ ̃ 1 j ( k ) , and b ̃ 4 ( k ) = j = 1 n z ̃ j ( k ) ( κ ̃ j ( k ) - τ ̃ j ( k ) ) / j = 1 n z ̃ j ( k ) .

The above partial EM steps constitute a fast and simple procedure to locally seek for the maximum of L g + 1 . To our experience, this local search scheme is very sensitive the initialization of a and ξg+1. Similar to Vlassis and Likas [25], we provided a global search strategy for extracting proper parameter initialization for a and ξ g + 1 ( 0 ) . By a second-order Taylor expansion for L g + 1 , we obtain the following approximation:

^ g + 1 = g + 1 ( a 0 ) [ ˙ g + 1 ( a 0 ) ] 2 2 ¨ g + 1 ( a 0 ) ,
(5)

where ˙ g + 1 ( a 0 ) and ¨ g + 1 ( a 0 ) are the first and second derivatives of L g + 1 evaluated at a = a0. It can be deduced from (5) that a local maximum of L g + 1 around a0 = 0.5 is given by

L ^ g + 1 = j = 1 n log f ( y j | Θ ^ g ) + ψ ( y j | θ g + 1 ) 2 + j = 1 n δ j ( θ g + 1 ) 2 2 j = 1 n δ j 2 ( θ g + 1 )
(6)

with

δ j ( θ g + 1 ) = f ( y j | Θ ^ g ) - ψ ( y j | θ g + 1 ) f ( y j | Θ ^ g ) + ψ ( y j | θ g + 1 ) .

So the the optimal value of a can be calculated as

a ^ = 1 2 ( 1 j = 1 n δ j ( θ g + 1 ) j = 1 n δ j 2 ( θ g + 1 ) ) .
(7)

Following the suggestion of Li and Barron [26], one may set a ^ = 0.5 for g = 1 and a ^ = 2 / ( g + 1 ) for g ≥ 2 as a default recommendation when the estimated value (7) fall outside the range of (0, 1).

In our global search, a convenience choice of σ ̃ g + 1 2 ( 0 ) is n-1/5 times half of the sample variance s y 2 whereas λ ̃ g + 1 2 ( 0 ) and ν g + 1 ̃ 2 ( 0 ) are always fixed at 0 and 10, respectively. For the initial choice of ξg+1, we search over the 5th, 10th, 15th, ⋯ 95th quantiles of y and set ξ ̃ g + 1 ( 0 ) to the one that maximizes (6).

The implementation of the greedy EM algorithm is summarized below.

  1. 1.

    Start with g = 1 and compute the ML estimates of the single-component STNMIX model via the ECME algorithm.

  2. 2.

    If g > 1, estimate Θ g via the EM-type algorithms.

  3. 3.

    Perform a global search to find a proper initialization of a and ξg+1.

  4. 4.

    Apply the partial EM-steps until convergence. For instance, | L ^ g + 1 ( k ) / L ^ g + 1 ( k - 1 ) -1|<1 0 - 6 .

  5. 5.

    If L ^ g + 1 L ^ g +m then terminate, where m > 0 is a penalty term. Otherwise allocate the new component to the model and go to 2. Set g = g + 1.

Given r candidates (we have 19 quantiles of sample), the time complexity of our greedy EM algorithm is O(ngr). If overall sample was considered as candidates in the global search, then the running time is similar to Vlassis and Likas [25].

Merging mixture algorithm

The greedy EM algorithm provides a convenient method for automatically selecting a number of components for a mixture model under reasonable assumptions (such as convexity of components). Yet if data have certain spatial features due to distributions with unusual shapes or low separation [8], it can lead to overlapping components, and hence to overestimation in the number of components in spite of the parsimonious approach. To augment our greedy algorithm for obtaining a robust estimate of the number of components, we extend the merging mixture approach of Baudry et al. [27] to skew t-Normal components. While merging techniques have been applied in the past to symmetric distributions [27, 28], designing a procedure for asymmetric distributions obviates any need for spurious components that may be required for the sole purpose of modeling asymmetry, and thus avoids redundant merging.

The basic idea behind the procedure is to use the maximum merged entropy to iteratively combine two possibly overlapping clusters, until the result of combination belong a single cluster (see implementation in [28]). The steps of the merging algorithm in StateProfiler are described below.

  1. 1.

    Calculate the mean entropy of maximum estimation for g components as

    Ent ( g ) = - j = 1 n i = 1 g i j log i j 0 ,

where i j denotes the posterior probability given Θ g fix at Θ ^ g .

  1. 2.

    Two clusters l and l' to be combined are those maximizing the criterion:

    - j = 1 n i l log i l + i l log i l + j = 1 n ( i l + i l ) log ( i l + i l )

among all possible pairs of clusters (l, l').

  1. 3.

    Obtain the merged entropy

    Ent ( g - 1 ) = - j = 1 n i l , l i j log i j + i , l l log i , l l ,

where i , l l = i l + i l is the posterior probability of the new cluster ll'.

  1. 4.

    Update z j ^ consists of the unmerged and merged posterior probabilities.

  2. 5.

    Set g = g - 1 and go to 2. Repeat until g = 1.

  3. 6.

    A solution of number of components can be identified (i) a sudden jump or "elbow" in a plot of the entropy of clustering versus the number of clusters, or (ii) peaks in a plot of the number of clusters versus the difference in entropy.

Data and experiments

For details of the yeast cell cycle experiments and timecourse data analyzed by StateProfiler, see [4]. For details of EML cell differentiation data, see [11].