A non-homogeneous dynamic Bayesian network with a hidden Markov model dependency structure among the temporal data points

In the topical field of systems biology there is considerable interest in learning regulatory networks, and various probabilistic machine learning methods have been proposed to this end. Popular approaches include non-homogeneous dynamic Bayesian networks (DBNs), which can be employed to model time-varying regulatory processes. Almost all non-homogeneous DBNs that have been proposed in the literature follow the same paradigm and relax the homogeneity assumption by complementing the standard homogeneous DBN with a multiple changepoint process. Each time series segment defined by two demarcating changepoints is associated with separate interactions, and in this way the regulatory relationships are allowed to vary over time. However, the configuration space of the data segmentations (allocations) that can be obtained by changepoints is restricted. A complementary paradigm is to combine DBNs with mixture models, which allow for free allocations of the data points to mixture components. But this extension of the configuration space comes with the disadvantage that the temporal order of the data points can no longer be taken into account. In this paper I present a novel non-homogeneous DBN model, which can be seen as a consensus between the free allocation mixture DBN model and the changepoint-segmented DBN model. The key idea is to assume that the underlying allocation of the temporal data points follows a Hidden Markov model (HMM). The novel HMM–DBN model takes the temporal structure of the time series into account without putting a restriction onto the configuration space of the data point allocations. I define the novel HMM–DBN model and the competing models such that the regulatory network structure is kept fixed among components, while the network interaction parameters are allowed to vary, and I show how the novel HMM–DBN model can be inferred with Markov Chain Monte Carlo (MCMC) simulations. 
For the new HMM–DBN model I also present two new pairs of MCMC moves, which can be incorporated into the recently proposed allocation sampler for mixture models to improve convergence of the MCMC simulations. In an extensive comparative evaluation study I systematically compare the performance of the proposed HMM–DBN model with the performances of the competing DBN models in a reverse engineering context, where the objective is to learn the structure of a network from temporal network data.


Introduction
In the topical field of systems biology there is considerable interest in learning regulatory networks, such as gene regulatory transcription networks (Friedman et al. 2000), protein signal transduction cascades (Sachs et al. 2005), neural information flow networks (Smith et al. 2006), or ecological networks (Aderhold et al. 2013). In the computational biology and machine learning literature a variety of powerful probabilistic machine learning methods based on graphical models, such as Bayesian networks (Friedman et al. 2000), have been proposed to learn these networks from data. The standard assumption underlying the conventional graphical models is that the observed time series are homogeneous so that potential changes in the regulatory interactions are not taken into account. That is, the standard graphical models, e.g. the conventional homogeneous Gaussian dynamic Bayesian network (DBN) model, describe a simple homogeneous linear dynamical system. Unfortunately, the assumptions of homogeneity and linearity are unrealistic for many applications in systems biology, and thus can cause erroneous and misleading inference results. Regulatory interactions in systems biology applications tend to be non-linear and adaptive so that they vary over time, e.g. in response to changing environmental and experimental conditions. A more appropriate approach would therefore be the deduction of a detailed mathematical description of the entire network domain in terms of mechanistic models, e.g. in the form of coupled non-linear stochastic differential equations (DEs). Seminal examples have for example been presented in Vyshemirsky and Girolami (2008) and Toni et al. (2009). Since a proper Bayesian inference for those mechanistic models is computationally expensive, usually only very small network domains with typically only 3-4 nodes are considered (Vyshemirsky and Girolami 2008) or the inference is based on approximations (Toni et al. 2009). 
Therefore, in standard applications of mechanistic models only a limited number of different hypotheses about the underlying network structure is compared, and the space of network structures is not systematically searched for those networks that are most consistent with the observed data. That is, mechanistic models cannot be used to learn regulatory networks from scratch (i.e. without any prior hypotheses about potential network structures). Hence, there have been various efforts to relax the homogeneity assumption for undirected (see, e.g., Talih and Hengartner 2005 or Xuan and Murphy 2007) and directed (see, e.g., Ahmed and Xing 2009) graphical models, as well as for dynamic Bayesian networks (see references below).
The key idea is to leave the class of homogeneous linear dynamic models, and to develop novel non-homogeneous graphical models that balance two requirements: On the one hand, those models should offer enough flexibility so that they can appropriately capture the underlying non-homogeneous biological processes, and thus become competitive with the mechanistic models. On the other hand, from a computational perspective it must be possible to use these models to systematically search the space of network structures and to learn the underlying regulatory relationships from scratch (i.e. in the absence of any hypothesis about the underlying network structure). The focus of this paper is to propose a novel non-homogeneous dynamic Bayesian network (DBN) model that fulfils both requirements.
Various DBN models have been proposed in the literature, and one can distinguish between DBNs for which the parameters in the likelihood can be integrated out in closed form, and DBNs for which the marginal likelihood is intractable. The latter DBNs tend to have a greater flexibility, but they are more susceptible to over-fitting, since the network structures and the interaction parameters have to be estimated simultaneously. Flexible DBNs with an intractable likelihood can, for example, be constructed along the lines proposed in Imoto et al. (2003), Rogers and Girolami (2005), or Ko et al. (2007). Here, I concentrate on DBNs for which the network parameters can be integrated out in closed form. Although this requires certain regularity conditions, such as parameter independence and prior conjugacy, to be fulfilled, these DBNs have two attractive features: (i) the data-overfitting problem is intrinsically avoided, and (ii) "model-averaging" can be realised by efficient Reversible Jump Markov Chain Monte Carlo (RJMCMC) simulations in discrete configuration spaces (Green 1995).
To obtain a closed-form expression of the marginal likelihood in DBN models three models with their respective conjugate prior distributions have been proposed in the literature: (i) the multinomial distribution with the Dirichlet prior, leading to the BDe score (Cooper and Herskovits 1992), (ii) the linear Gaussian distribution with the normal-Wishart prior, leading to the BGe score (Geiger and Heckerman 1994), and (iii) a Bayesian linear regression model with a Gaussian prior on the regression coefficients (see, e.g., Lèbre et al. 2010). The former two approaches have originally been proposed for static Bayesian networks, but they can be extended straightforwardly to model homogeneous DBNs, as demonstrated in Friedman et al. (2000). Non-homogeneous DBNs with these two standard scores have for example been developed in Robinson and Hartemink (2009) and Robinson and Hartemink (2010) (with BDe), and in Grzegorczyk and Husmeier (2009) and Grzegorczyk and Husmeier (2011) (with BGe). The key idea behind these non-homogeneous DBNs is to relax the homogeneity assumption by complementing the standard homogeneous DBN with a Bayesian multiple changepoint process. Each time series segment defined by two demarcating changepoints is associated with separate interaction parameters, and in this way the regulatory relationships are allowed to vary over time.
Recently, the Bayesian regression model, described in Lèbre et al. (2010), has become a popular probabilistic model for non-homogeneous DBNs. A shortcoming of this "Bayesian regression" DBN (BR-DBN) model, as originally proposed by Lèbre et al. (2010), is potential model over-flexibility, as different time series segments are associated with different network structures, which for short time series will lead to over-fitting and inflated inference uncertainty. Various regularised variants of this BR-DBN model have been proposed (see, e.g., Dondelinger et al. 2010, 2012), and in other instantiations of the BR-DBN model, the authors follow Grzegorczyk and Husmeier (2011) or Grzegorczyk and Husmeier (2013) and keep the network structure fixed among segments so that only the interaction parameters vary from segment to segment. In this paper I follow the latter works and focus on applications where cellular processes take place on a short time scale so that it is not the network structure but rather the strength of the regulatory interactions that changes with time. 2 For those models, which do not allow for segment-wise network changes, various information coupling schemes with respect to the segment-specific network parameters have recently been proposed (Grzegorczyk and Husmeier 2012a, b, 2013).
All these non-homogeneous DBNs, mentioned above, follow the same paradigm and combine a classical homogeneous DBN model with a multiple changepoint process. However, the configuration space of the data segmentations that can be obtained by changepoints is restricted. Let us consider these changepoint processes in the broader context of mixture models, which are based on a free allocation of the data points to mixture components. From this perspective the changepoints divide the time series into disjoint temporal segments, and the segments (i.e. the data points within each segment) are assigned to disjoint ("mixture") components. That is, there is a one-to-one mapping between the temporal segments and the mixture components, and hence, distant segments cannot be allocated to the same component; throughout the paper I will also say: "a component once left cannot be revisited". For instance, if there are 10 temporal data points, then allocation schemes, such as [1112222211], are not part of the segmentation space of multiple changepoint processes and would have to be "approximated" by segmentations, such as [1112222233].
In earlier papers it has been proposed to combine Bayesian networks with classical mixture models (see, e.g., Ko et al. 2007 or Grzegorczyk et al. 2008). Unlike the DBNs with changepoints (CPS-DBNs), the proposed mixture DBN (MIX-DBN) allows for an unrestricted free allocation of the data points to (mixture) components, and, hence, substantially increases the configuration space of the possible data segmentations. However, for time series the temporal order of the data points is not taken into account, and this inevitably incurs an information loss, e.g. when a priori temporally neighbouring data points should be more likely to be assigned to the same component than distant ones.
In biological systems various examples of periodic gene regulatory processes can be found. For example, plants, such as Arabidopsis thaliana, possess a circadian clock, and the underlying molecular mechanisms depend on the presence/absence of light (see Sect. 3.3 for details and literature references). That is, the gene regulatory processes in Arabidopsis are diurnal and periodically depend on the daily dark:light (night:day) cycle. These daily alternations of darkness ("1") and light ("2") phases are caused by an external factor, namely the rotation of the earth, and they impose a periodic diurnal segmentation on the gene regulatory processes in the circadian clock, e.g. a segmentation of the form [111222111222]. Apart from this plant biology example, described in more detail in Sect. 3.3, circadian rhythms also play an important role in the regulatory processes in mammalian cells (see, e.g., Yan et al. 2008). Further examples are the periodic regulatory processes that can be observed during the cell cycle (see, e.g., Whitfield et al. 2002 or Rustici et al. 2004). As discussed above, neither the changepoint processes (CPS-DBN) nor the free allocation mixture models (MIX-DBN) are adequate for learning periodic segmentations; the CPS-DBN model cannot revisit states once left, while the MIX-DBN model completely ignores the temporal arrangement of the data points.
In this paper I present a novel non-homogeneous dynamic Bayesian network model, which can be seen as a consensus between the free allocation mixture DBN model (MIX-DBN) and the changepoint-process-segmented DBN model (CPS-DBN). The idea is to assume that the underlying allocation of the temporal data points follows a Hidden Markov model (HMM). The novel DBN model, which I will refer to as the HMM-DBN model, does take the temporal structure of the time series into account without putting any restriction onto the configuration space of the allocations. With the HMM-DBN model, periodic segmentations, such as [111222111222], can be inferred properly. In this paper I implement the novel model with a network structure that is kept fixed among segments and I only allow the network interaction parameters to vary in time. In a comparative evaluation study I demonstrate that the novel HMM-DBN model has the attractive feature that it is competitive to both (i) the CPS-DBN model for changepoint-segmented allocations and (ii) the MIX-DBN model for free mixture allocations. I also show how the allocation of the data points can be inferred with the allocation sampler (Nobile and Fearnside 2007). As the allocation sampler has been developed for classical Gaussian mixture models, it does not exploit the temporal information. I therefore propose to improve the allocation sampler by introducing two new pairs of complementary MCMC moves, which utilise the temporal arrangement of the data points. Although the key idea behind the proposed HMM-DBN model is generic, I present it in the context of the BR-DBN model. With regard to the real-world applications (see Sects. 3.2 and 3.3) I follow Grzegorczyk and Husmeier (2011) and Grzegorczyk and Husmeier (2013) and keep the network structure fixed among segments (components).
Footnote 2 (continued): In scenarios, such as morphogenesis, where the cellular processes take place on a long time scale, the assumption of a fixed network structure might turn out to be too restrictive.
This paper is organized as follows: Sect. 2 provides a comprehensive exposition of the mathematical details behind the HMM-DBN model. I also present two new pairs of moves for the MCMC inference, and I briefly summarise the competing non-homogeneous DBN models. Section 3 gives an overview of the data on which I apply and cross-compare the models. I provide the details on how I implemented the HMM-DBN model for the comparative evaluation study in Sect. 4. The results of a study, in which I systematically compare the performances of the MIX-DBN, the CPS-DBN and the HMM-DBN models, are presented in Sect. 5. A discussion of the computational costs and a brief outlook on future work are provided in Sect. 6, before I draw my final conclusions in Sect. 7. Note that mathematical details from Sect. 2 have been relegated to Appendices 1-4.

Bayesian regression models
In this subsection I briefly summarise the non-homogeneous Bayesian regression DBN (BR-DBN) model, proposed by Lèbre et al. (2010). Recently, various variants of the original BR-DBN model have been proposed and applied in the literature. Here I consider the uncoupled BR-DBN variant, which has recently been used in Grzegorczyk and Husmeier (2012a) and Grzegorczyk and Husmeier (2012b). Unlike all BR-DBN model instantiations that have been developed so far, I combine the BR-DBN model with a free allocation model rather than a multiple changepoint process. The free allocation BR-DBN model, considered here, allows for more flexibility with respect to the configuration space of the possible data allocations.
Consider a set of N nodes, g ∈ {1, . . . , N}, in a network, M = (π_1(M), . . . , π_N(M)), where π_g(M) denotes the parents of node g in M, that is, the set of nodes with a directed edge pointing to node g. For notational convenience, I write π_g = π_g(M) in the following; i.e. I do not indicate the dependency on M explicitly. Given an N-by-T data set matrix, D, where the rows correspond to the N nodes and the columns correspond to T temporal observations, let y_{g,t} denote the realisation of the random variable associated with node g at time point t ∈ {1, . . . , T}, and let x_{π_g,t} denote the vector of realisations of the random variables associated with the parent nodes of node g, π_g, at the previous time point, (t − 1), augmented by a constant element equal to 1 (for the intercept). With |π_g| denoting the cardinality of the parent node set π_g, the vector x_{π_g,t} is of size |π_g| + 1.
Unlike the mixture model DBN in Grzegorczyk et al. (2008), I here consider node-specific allocation vectors, V_g (g = 1, . . . , N), where each vector V_g is of size T and defines a free allocation of the last T − 1 observations, y_{g,2}, . . . , y_{g,T}, of node g to K_g components. V_g(t) = k means that the observation y_{g,t} is allocated to the kth component (t = 2, . . . , T and k = 1, . . . , K_g). Furthermore, I define y_{g,k} to be the vector of observations that have been allocated to component k by V_g (1 ≤ k ≤ K_g). In the free allocation regression models, described below, the nodes g = 1, . . . , N are considered as target variables and their regressor variables are the variables in their parent sets, π_1, . . . , π_N. More precisely, y_{g,k} is the target vector for component k, and I have to arrange the corresponding observations of the parent nodes, π_g, appropriately in a regressor (or design) matrix, which I denote X_{π_g,k}. Let the vector y_{g,k} be of size n_k, i.e. suppose that n_k observations have been allocated to component k; then X_{π_g,k} is an (|π_g| + 1)-by-n_k matrix, and if the jth element of the target vector y_{g,k} is the observation y_{g,t}, then the jth column of the regressor matrix, X_{π_g,k}, has to be the vector x_{π_g,t}. As each vector x_{π_g,t} includes a constant element for the intercept, the first row of the design matrix, X_{π_g,k}, is a row of 1's, which corresponds to the intercept.
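To make the notation concrete, the component-specific target vectors y_{g,k} and design matrices X_{π_g,k} can be assembled as in the following sketch (the function name and array conventions are my own illustration, not part of the model definition):

```python
import numpy as np

def build_component_data(D, g, parents, V_g, k):
    """Assemble the target vector y_{g,k} and design matrix X_{pi_g,k}
    for component k of node g from the N-by-T data matrix D.
    V_g is a length-T list; V_g[t-1] holds the component of time point t
    (the first entry is unused). Each observation y_{g,t} (t = 2..T) is
    paired with the parent values at time t-1, plus a leading 1 for the
    intercept."""
    T = D.shape[1]
    ts = [t for t in range(2, T + 1) if V_g[t - 1] == k]   # time points in component k
    y = np.array([D[g, t - 1] for t in ts], dtype=float)   # targets y_{g,t}
    X = np.ones((len(parents) + 1, len(ts)))               # first row: intercept 1's
    for j, t in enumerate(ts):
        X[1:, j] = D[parents, t - 2]                       # parent values at time t-1
    return y, X
```

For example, with D = np.arange(12).reshape(3, 4), node g = 0, parents [1, 2] and V_g = [0, 1, 1, 2], component k = 1 collects the time points t = 2, 3.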
Given a fixed graph topology M, which implies the parent node sets, π_g, and thus the regressor variables for each node g, as well as fixed allocation vectors, V_g, which imply the node-specific allocations, I follow Lèbre et al. (2010) and apply a linear Gaussian regression model to each target vector y_{g,k} using X_{π_g,k} as regressor matrix:

y_{g,k} = X_{π_g,k}^T w_{g,k} + ε_{g,k}   (1)

where w_{g,k} is the (|π_g| + 1)-dimensional vector of regression parameters, ε_{g,k} is the noise vector, and the superscript symbol "T" denotes matrix transposition. I assume that the individual elements of the noise vectors, ε_{g,k}, are i.i.d. Gaussian distributed with zero mean and variance σ_g^2; i.e. the noise variances are node-specific but do not depend on the component k. The vectors ε_{g,k} (k = 1, . . . , K_g) are then independently multivariate Gaussian distributed with zero mean vector and covariance matrix σ_g^2 I, where I denotes the unit matrix. The likelihood of the regression model is given by:

P(y_{g,k} | X_{π_g,k}, w_{g,k}, σ_g^2) = N(y_{g,k} | X_{π_g,k}^T w_{g,k}, σ_g^2 I)

On the component-specific regression parameter vectors, w_{g,k}, I impose the following conjugate Gaussian priors:

w_{g,k} | σ_g^2, δ_g ~ N(0, δ_g σ_g^2 I)   (2)

where δ_g can be interpreted as a gene-specific "signal-to-noise" (SNR) hyperparameter. On the inverse noise variances, σ_g^{−2}, and on the inverse SNR hyperparameters, δ_g^{−1}, I also impose conjugate priors, i.e. Gamma priors:

σ_g^{−2} ~ GAM(A_σ, B_σ)   (3)
δ_g^{−1} ~ GAM(A_δ, B_δ)   (4)

with the fixed level-2 hyperparameters A_σ, B_σ, A_δ and B_δ. A compact representation of the relationships among the (hyper-)parameters of the Bayesian regression model, described above, can be found in Fig. 1. [Fig. 1: Compact representation of the employed free allocation Bayesian regression model. The grey circles refer to fixed (hyper-)parameters and the data (X_{π_g,k} and y_{g,k}), while the white circles refer to free (hyper-)parameters. A detailed model description is provided in Sect. 2.1.] The free model parameters, indicated by white circles in Fig. 1, have to be sampled from the posterior distribution.
Due to standard conjugacy arguments the full conditional distributions of the free parameters can be computed in closed form, and the Gibbs-sampling scheme from Grzegorczyk and Husmeier (2012b) can be applied to generate a sample from the posterior distribution P(w_{g,1}, . . . , w_{g,K_g}, δ_g, σ_g^2 | D). To indicate the allocations implied by the allocation vector V_g, I introduce the symbols:

y_{g,V_g} := {y_{g,k}}_{k=1,...,K_g}   (5)
X_{π_g,V_g} := {X_{π_g,k}}_{k=1,...,K_g}   (6)
w_{g,V_g} := {w_{g,k}}_{k=1,...,K_g}   (7)

Table 1: The MCMC sampling scheme for the free allocation Bayesian regression model shown in Fig. 1. For each node g = 1, . . . , N:

Input:
The parent node set, π_g, the allocation vector, V_g, and the current SNR hyperparameter, δ_g.
MCMC iteration:
• First sample the inverse noise variance, σ_g^{−2}, from its collapsed full conditional distribution [see Eq. (10)], in which the regression parameter vectors have been integrated out.
• Afterwards sample the component-specific regression parameter vectors, w_{g,k} (k = 1, . . . , K_g) [see Eq. (8)], and the inverse SNR hyperparameter, δ_g^{−1} [see Eq. (9)].

The full conditional distributions of w_{g,k} and δ_g^{−1} are given by:

w_{g,k} | y_{g,k}, X_{π_g,k}, σ_g^2, δ_g ~ N(Σ_{g,k} X_{π_g,k} y_{g,k}, σ_g^2 Σ_{g,k})   (8)

δ_g^{−1} | {w_{g,k}}_k, σ_g^2 ~ GAM(A_δ + K_g(|π_g| + 1)/2, B_δ + (2σ_g^2)^{−1} Σ_{k=1}^{K_g} w_{g,k}^T w_{g,k})   (9)

where K_g is the number of components for node g, |π_g| is the cardinality of the parent set, π_g, and Σ_{g,k}^{−1} = δ_g^{−1} I + X_{π_g,k} X_{π_g,k}^T. The inverse variance hyperparameters, σ_g^{−2}, could also be sampled from the full conditional distribution, but a computationally more efficient way is to use a collapsed Gibbs sampling step, in which the regression parameter vectors, w_{g,k}, have been integrated out. This marginalization yields:

σ_g^{−2} | y_{g,V_g}, X_{π_g,V_g}, δ_g ~ GAM(A_σ + (T − 1)/2, B_σ + (1/2) Σ_{k=1}^{K_g} Δ^2_{g,k})   (10)

with the squared Mahalanobis distance Δ^2_{g,k} = y_{g,k}^T (I + δ_g X_{π_g,k}^T X_{π_g,k})^{−1} y_{g,k}. If the parent node sets, π_g, and the allocation vectors, V_g, are known and kept fixed, Eqs. (8-10) can be used, as indicated in Table 1, to generate a sample from the posterior distribution:

P(w_{g,V_g}, δ_g, σ_g^2 | π_g, V_g, D)   (11)

However, in real-world applications the allocation vectors, V_g, are usually unknown and the objective is to infer the parent node sets, π_g, which form the network structure, M = (π_1, . . . , π_N). Note that the regression model is defined such that the likelihood can be marginalized over both the regression parameters, w_{g,k}, and the noise variances, σ_g^2. For each node g the marginal likelihood is given by:

P(y_{g,V_g} | X_{π_g,V_g}, δ_g) = (Γ((T − 1)/2 + A_σ) / Γ(A_σ)) · π^{−(T−1)/2} · (2B_σ)^{A_σ} · (2B_σ + Δ^2_g)^{−((T−1)/2 + A_σ)} · Π_{k=1}^{K_g} det(Ĩ_{g,k})^{−1/2}   (12)

where Ĩ_{g,k} = I + δ_g X_{π_g,k}^T X_{π_g,k}, Δ^2_g = Σ_{k=1}^{K_g} Δ^2_{g,k}, and the squared Mahalanobis distance terms, Δ^2_{g,k}, were defined below Eq. (10); for a derivation see Grzegorczyk and Husmeier (2012b). Note that the marginal likelihood in Eq. (12) is invariant with respect to a permutation of the components' labels, as I have imposed exchangeable (i.i.d.) priors on the component-specific regression parameter vectors [see Eq. (2)].
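Under my reconstruction of Eqs. (8)-(10), one sweep of the Gibbs sampling scheme of Table 1 can be sketched as follows (a minimal illustration, not a reference implementation; all function and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_step(y_list, X_list, delta, A_sig, B_sig, A_del, B_del):
    """One sweep per node g: sample sigma^{-2} from its collapsed full
    conditional (w's integrated out), then the component-specific
    regression vectors w_{g,k}, then the inverse SNR delta^{-1}."""
    T1 = sum(len(y) for y in y_list)   # total number of modelled time points, T-1
    # collapsed full conditional of sigma^{-2}, cf. Eq. (10)
    D2 = sum(y @ np.linalg.solve(np.eye(len(y)) + delta * X.T @ X, y)
             for y, X in zip(y_list, X_list))
    sig2 = 1.0 / rng.gamma(A_sig + T1 / 2, 1.0 / (B_sig + D2 / 2))
    # component-specific regression vectors w_{g,k}, cf. Eq. (8)
    ws = []
    for y, X in zip(y_list, X_list):
        Sigma = np.linalg.inv(np.eye(X.shape[0]) / delta + X @ X.T)
        ws.append(rng.multivariate_normal(Sigma @ (X @ y), sig2 * Sigma))
    # inverse SNR hyperparameter delta^{-1}, cf. Eq. (9)
    m = X_list[0].shape[0]             # |pi_g| + 1
    ssw = sum(w @ w for w in ws)
    delta_inv = rng.gamma(A_del + len(ws) * m / 2,
                          1.0 / (B_del + ssw / (2 * sig2)))
    return sig2, ws, 1.0 / delta_inv
```

Note that NumPy's gamma sampler is parameterised by shape and scale, so the rate parameters B_σ + Δ²/2 and B_δ + Σw'w/(2σ²) enter as reciprocals.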

Network structure inference
I assume the allocation vectors, V_g, still to be fixed, and I describe how the network structure, M, can be inferred. For the prior on the network structures, M = (π_1, . . . , π_N), I assume a modular form:

P(M) = Π_{g=1}^{N} P(π_g)   (13)

and uniform distributions for P(π_g), subject to a fan-in restriction, |π_g| ≤ F, for each g.
The individual parent node sets, π_g, can then be inferred independently for each node g, and the collection of parent node sets forms the network structure, M = (π_1, . . . , π_N). For each node g the full conditional distribution is given by:

P(π_g | V_g, δ_g, D) ∝ P(y_{g,V_g} | X_{π_g,V_g}, δ_g) · P(π_g)   (14)

where the expressions P(y_{g,V_g} | X_{π_g,V_g}, δ_g) can be computed with Eq. (12). As the full conditional distribution of π_g in Eq. (14) is not of closed form, I resort to Metropolis-Hastings sampling techniques. For each node g the MCMC algorithm keeps the SNR hyperparameter, δ_g, and the allocation vector, V_g, fixed, and proposes to move from the current parent node set, π_g^{(i−1)}, to a new set, π_g^{(⋆)}, where π_g^{(⋆)} is randomly chosen from the system S(π_g^{(i−1)}) of all parent sets which can be reached (i) either by removing a single parent node from π_g^{(i−1)}, (ii) or by adding a single parent node to π_g^{(i−1)}, unless the maximal fan-in, F, is reached, (iii) or by a parent-node flip move. 7 According to the Metropolis-Hastings criterion, the move is accepted with probability

A(π_g^{(i−1)} → π_g^{(⋆)}) = min{1, [P(y_{g,V_g} | X_{π_g^{(⋆)},V_g}, δ_g) / P(y_{g,V_g} | X_{π_g^{(i−1)},V_g}, δ_g)] · [|S(π_g^{(i−1)})| / |S(π_g^{(⋆)})|]}   (15)

where the likelihood-ratio can be computed with Eq. (12), the prior ratio is equal to 1, and the Hastings-ratio is the ratio of the cardinalities of the two parent node set systems, S(π_g^{(i−1)}) and S(π_g^{(⋆)}). If the move is accepted, set π_g^{(i)} = π_g^{(⋆)}; otherwise leave the parent set unchanged, i.e. set π_g^{(i)} = π_g^{(i−1)}. Pseudo code for this Metropolis-Hastings step is given in Table 2. Given the current network, M^{(i−1)} = (π_1^{(i−1)}, . . . , π_N^{(i−1)}), successively updating the parent node sets, symbolically π_g^{(i−1)} → π_g^{(i)} (g = 1, . . . , N), yields the new network M^{(i)} = (π_1^{(i)}, . . . , π_N^{(i)}).
Footnote 7: The parent-node flip move was proposed in Grzegorczyk and Husmeier (2011) and randomly chooses a parent node, u ∈ π_g^{(i−1)}, and a node v ∉ π_g^{(i−1)}, and then modifies π_g^{(i−1)} by substituting parent node u for node v.

Table 2: Pseudo-code for the MCMC inference of the parent node sets, π_g, in the free allocation Bayesian regression model shown in Fig. 1.
For each node g = 1, . . . , N:
Input: The SNR hyperparameter, δ_g, the allocation vector, V_g, and the current parent node set, π_g^{(i−1)}.
MCMC iteration (i − 1) → i:
• Determine the system of parent sets, S(π_g^{(i−1)}), that is the system of parent node sets that can be reached from π_g^{(i−1)} by (i) either adding a node to π_g^{(i−1)}, (ii) or by deleting a node from π_g^{(i−1)}, or (iii) by exchanging a node u ∈ π_g^{(i−1)} for a node v ∉ π_g^{(i−1)}.
• Randomly select a new candidate parent set, π_g^{(⋆)}, from S(π_g^{(i−1)}).
• Accept the new parent node set, π_g^{(⋆)}, with the probability given in Eq. (15). If the move is accepted, set π_g^{(i)} = π_g^{(⋆)}; otherwise leave the parent set unchanged, i.e. set π_g^{(i)} = π_g^{(i−1)}.
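The structure move of Table 2 can be sketched as follows (an illustration under the stated neighbourhood definition; `log_ml` stands for the log of the marginal likelihood in Eq. (12) and is treated as a black box; all names are my own):

```python
import math
import random

def neighbourhood(pa, all_nodes, fan_in):
    """All parent sets reachable from pa by one deletion, one addition
    (respecting the fan-in F), or one flip (exchange u in pa for v not in pa)."""
    pa = frozenset(pa)
    out = [pa - {u} for u in pa]                                         # deletions
    if len(pa) < fan_in:
        out += [pa | {v} for v in all_nodes if v not in pa]              # additions
    out += [(pa - {u}) | {v} for u in pa for v in all_nodes if v not in pa]  # flips
    return out

def mh_structure_move(pa, all_nodes, fan_in, log_ml, rng=random):
    """One Metropolis-Hastings move on the parent set of a single node.
    Acceptance combines the likelihood ratio with the Hastings ratio
    |S(old)| / |S(new)|; the prior ratio is 1 under the uniform prior."""
    S_old = neighbourhood(pa, all_nodes, fan_in)
    pa_new = rng.choice(S_old)
    S_new = neighbourhood(pa_new, all_nodes, fan_in)
    log_A = (log_ml(pa_new) - log_ml(pa)) + math.log(len(S_old) / len(S_new))
    if math.log(rng.random()) < min(0.0, log_A):
        return pa_new
    return pa
```

For instance, with four nodes, fan-in F = 2 and current parent set {0}, the neighbourhood contains 1 deletion, 3 additions and 3 flips, i.e. 7 candidate sets.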

Modelling the allocation vectors
In the last two subsections I have assumed that the allocation vectors are known and fixed, although they will be unknown in many real-world applications. The focus of this subsection is on inferring the allocation vectors from the data. A common choice in the context of dynamic Bayesian networks (DBNs) is the application of (node-specific) multiple changepoint processes to infer the segmentations; see references in Sect. 1.
Another approach, presented in Ko et al. (2007) and Grzegorczyk et al. (2008), is to combine DBNs with a mixture model. The mixture approach is more flexible, as it allows for a free allocation of the data points. For example, for 11 data points and 3 mixture components the allocation scheme [V_g(2), . . . , V_g(11)] = [1, 1, 3, 2, 3, 1, 2, 2, 1, 1] for the last 10 data points is valid. The changepoint approach imposes sets of changepoints to divide the temporal data points into disjoint segments. Since temporal observations follow a natural time ordering and a priori neighbouring time points should be more likely to be allocated to the same component than distant time points, the changepoint approach includes plausible prior knowledge. However, changepoint approaches have a restricted allocation space, since data points in different segments have to be allocated to different components; i.e. "a (segment) component once left cannot be revisited". Consequently, certain allocation schemes can only be approximated by imposing additional changepoints; e.g. in this example the true allocation scheme [V_g(2), . . . , V_g(11)] = [1, 1, 1, 2, 2, 2, 1, 1, 1, 1] cannot be modelled properly with changepoints; the best changepoint set approximation might be: [V_g(2), . . . , V_g(11)] = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]. The mixture model, on the other hand, can infer the correct allocation, but it ignores the temporal ordering of the data points. That is, it treats the temporal data points (time points) as interchangeable units. In this example this information loss implies that all (10 choose 3) = 120 allocation vectors which allocate seven time points to component k = 1 and three time points to component k = 2 are a priori equally likely, including allocation schemes, such as [V_g(2), . . . , V_g(11)] = [1, 2, 1, 1, 1, 2, 1, 1, 2, 1], which might be very unlikely a priori.
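The restriction can be made concrete with a small check (my own illustration): an allocation vector lies in the changepoint segmentation space exactly when no component is revisited after it has been left.

```python
def changepoint_representable(alloc):
    """True iff the allocation vector can be produced by a multiple
    changepoint process, i.e. no component is revisited once left."""
    seen = set()
    prev = None
    for k in alloc:
        if k != prev:
            if k in seen:        # component revisited => not representable
                return False
            seen.add(k)
            prev = k
    return True
```

The scheme [1, 1, 1, 2, 2, 2, 3, 3, 3, 3] passes this check, whereas [1, 1, 1, 2, 2, 2, 1, 1, 1, 1] and the free mixture allocation [1, 1, 3, 2, 3, 1, 2, 2, 1, 1] do not.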
A compromise between the mixture model and the changepoint process is a hidden Markov model (HMM). In HMMs there is a homogeneous Markovian dependency between the allocations of the data points. In a Markov chain of order τ = 1 the allocation (state) of the tth data point, given the states of all earlier time points 2, 3, 4, . . . , t − 1, depends only on the state of the immediately preceding time point t − 1. Moreover, in a homogeneous Markov chain these transition probabilities stay constant over time, i.e. they do not depend on t. The homogeneous state-transition probabilities can be chosen such that neighbouring points are likely to be allocated to the same state, and states once left can be revisited. In this subsection I show how to employ an HMM for the allocation vectors, V_g.
I model the allocation vectors, V_g, for each node, g, independently with an HMM. In a first step I impose a truncated Poisson distribution with parameter λ on the number of states (components). For K_g = 1, . . . , K_MAX this yields:

P(K_g) ∝ (λ^{K_g} / K_g!) · e^{−λ}   (16)

re-normalized over the range 1 ≤ K_g ≤ K_MAX. Afterwards, I impose an HMM with K_g states on the allocation vector, V_g. The allocation vector can be identified with the temporally ordered sequence [V_g(2), . . . , V_g(T)] and its probability is the probability of the sequence:

P(V_g | K_g) = P(V_g(2), . . . , V_g(T) | K_g)   (17)

Assuming a Markovian dependency of order τ = 1 for the state sequence, this leads to:

P(V_g(2), . . . , V_g(T) | K_g) = P(V_g(2)) · Π_{t=3}^{T} P(V_g(t) | V_g(t − 1))   (18)

For t = 3, . . . , T let p^g_{k,j} denote the probability for a transition from state k to state j:

p^g_{k,j} = P(V_g(t) = j | V_g(t − 1) = k)

This gives Σ_{j=1}^{K_g} p^g_{k,j} = 1, and the probability vectors p^g_k = (p^g_{k,1}, . . . , p^g_{k,K_g})^T define categorical (or multinomial) random variables (k = 1, . . . , K_g). On V_g(2) I impose a discrete uniform distribution with the possible outcomes {1, . . . , K_g}. The probability of the sequence given the transition probabilities {p^g_k}_{k=1,...,K_g} is then given by:

P(V_g | {p^g_k}_k, K_g) = (1/K_g) · Π_{k=1}^{K_g} Π_{j=1}^{K_g} (p^g_{k,j})^{n_{k,j}}   (19)

where n_{k,j} = |{t | 3 ≤ t ≤ T ∧ V_g(t) = j ∧ V_g(t − 1) = k}| is the number of transitions from state k to state j in the sequence [V_g(2), . . . , V_g(T)]. For k = 1, . . . , K_g I impose a Dirichlet distribution with hyperparameter vector α_k = (α_{k,1}, . . . , α_{k,K_g})^T on p^g_k:

p^g_k ~ DIR(α_{k,1}, . . . , α_{k,K_g})   (20)

Marginalizing over the set {p^g_k}_k in Eq. (19) gives the marginal distribution:

P(V_g | K_g) = ∫ P(V_g | {p^g_k}_k, K_g) · Π_{k=1}^{K_g} P(p^g_k) d{p^g_k}_k   (21)

As the Dirichlet densities P(p^g_k), defined in Eq. (20), factorize over k, the integral in Eq. (21) is effectively a product of integrals. Inserting Eq. (19) into Eq. (21) yields:

P(V_g | K_g) = (1/K_g) · Π_{k=1}^{K_g} ∫ Π_{j=1}^{K_g} (p^g_{k,j})^{n_{k,j}} · P(p^g_k) dp^g_k   (22)

The inner integrals correspond to Dirichlet-multinomial distributions, which can be computed in closed form. This yields:

P(V_g | K_g) = (1/K_g) · Π_{k=1}^{K_g} [Γ(Σ_{j=1}^{K_g} α_{k,j}) / Γ(Σ_{j=1}^{K_g} (α_{k,j} + n_{k,j}))] · Π_{j=1}^{K_g} [Γ(α_{k,j} + n_{k,j}) / Γ(α_{k,j})]   (23)

In the absence of any genuine prior knowledge about the state-transition probabilities, p^g_{k,j}, I set α_{k,j} = α in Eq. (20). The marginal distribution P(V_g | K_g) in Eq. (23) is then invariant to permutations of the states' labels.

The proposed HMM-DBN model
The proposed Hidden Markov model (HMM) dynamic Bayesian network (DBN) model, which I refer to as the HMM-DBN model, is now fully specified. A compact representation of the relationships among the data and all (hyper-)parameters of the HMM-DBN model is given in Fig. 2. Figure 2 adds flexible parent node sets and allocation vectors along with their prior distributions to Fig. 1. Unlike in the earlier Bayesian regression DBN model, shown in Fig. 1, the parent node sets and the allocation vectors are now flexible and have to be inferred. The joint posterior distribution of the HMM-DBN model is given by:

P(M, {V_g}, {K_g}, {δ_g} | D) ∝ Π_{g=1}^{N} P(π_g) · P(K_g) · P(V_g | K_g) · P(δ_g) · P(y_{g,V_g} | X_{π_g,V_g}, δ_g)   (24)

where M = (π_1, . . . , π_N). In the latter equation K_g is the number of possible states (components) for the gth allocation vector, V_g, which implies the target vector segmentation, y_{g,V_g} = {y_{g,1}, . . . , y_{g,K_g}}, and the segmentation of the regressor matrices, X_{π_g,V_g} = {X_{π_g,1}, . . . , X_{π_g,K_g}}. The marginal likelihood, P(y_{g,V_g} | X_{π_g,V_g}, δ_g), can be computed with Eq. (12). With regard to the MCMC inference, described in Sect. 2.5, note that the posterior distribution in Eq. (24) is invariant to permutations of the states' labels, as the marginal likelihood in Eq. (12) and the prior on the allocation vectors in Eq. (23) (if α_{k,j} = α) are invariant. For Bayesian mixture models with invariant posterior distributions it is challenging to infer the component-specific model parameters, since their marginal posterior distributions are identical. There is a so-called "non-identifiability problem" with respect to the components' labels (see, e.g., Nobile and Fearnside 2007). For the HMM-DBN model the problem of non-identifiability does not have to be tackled, as the interest is not in state-specific parameters. Here, I am interested in the network structure, M, and in the co-allocation of data points. 9
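Because the posterior is label-invariant, the allocations are best summarised by label-invariant quantities such as the co-allocation matrix C_g mentioned in footnote 9. A minimal sketch (my own implementation); averaging these matrices over MCMC samples of V_g estimates the posterior probability that two time points share a state:

```python
import numpy as np

def coallocation(V):
    """Co-allocation matrix C_g for an allocation vector V_g:
    element (i, j) is 1 if data points i and j are allocated to the
    same state, else 0; invariant to permutations of the state labels."""
    V = np.asarray(V)
    return (V[:, None] == V[None, :]).astype(int)
```

For example, the allocations [1, 1, 2] and [2, 2, 1] differ only by a label swap and therefore yield the same co-allocation matrix.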

Allocation vector inference
For the allocation vector inference, in principle, two different RJMCMC sampling strategies (Green 1995) can be employed. The first technique is to implement an RJMCMC approach in a continuous configuration space, where concrete instantiations of all free parameters of the Bayesian regression model (i.e. the white circles in Fig. 1) are sampled, before the forward-backward simulation algorithm (Boys et al. 2000) is used to sample the allocation vector from its full conditional distribution via a Gibbs sampling step. This RJMCMC approach for hidden Markov models was proposed by Robert et al. (2000) and has become popular in various fields of application, such as DNA sequence analysis (see, e.g., Boys and Henderson 2004). However, the disadvantage of this approach is that the variation of the number of hidden states, K_g, requires the implementation of efficient RJMCMC moves which switch between models with different dimensionalities in continuous parameter spaces. Otherwise, the RJMCMC simulations may become computationally inefficient (see, e.g., Nobile and Fearnside 2007).

^9 For each node g the allocation vector, V_g, effectively defines a node-specific co-allocation matrix, C_g, whose element (s, t) is 1 if V_g(s) = V_g(t), and 0 otherwise. Note that the node-specific co-allocation matrices, C_g, are invariant to permutations of the states' labels. See, e.g., Jasra et al. (2005) for a detailed argumentation.
The second sampling strategy is based on RJMCMC moves in the discrete allocation vector configuration space. In this approach the numbers of states and the allocation vectors are sampled from the posterior distribution. This strategy can be used when all state-specific parameters can be integrated out analytically, so that the marginal likelihood does not depend on state-specific continuous parameters.^10 The main advantage of this second RJMCMC strategy is that the resulting sampling scheme does not require any particular trans-dimensional jumping moves in continuous configuration spaces. In the present paper I resort to this second RJMCMC sampling strategy, and I employ the "allocation sampler" (Nobile and Fearnside 2007) for the allocation vector inference. The allocation sampler has already been utilised in the context of mixture dynamic Bayesian networks (MIX-DBNs) by Grzegorczyk et al. (2008). It consists of a simple Gibbs sampling move and various more involved Metropolis-Hastings moves. The mathematical details are briefly summarised in the "Appendix". In Appendix 1 I describe a simple Gibbs sampling move, which re-samples the allocation state of one single data point from the full conditional distribution. Since this type of move yields very small steps in the configuration space, Nobile and Fearnside (2007) proposed a set of more involved allocation sampler moves. In Appendix 2 I describe these allocation sampler moves, namely the M1, the M2, and the Ejection-Absorption (EA) move. However, the allocation sampler moves have been developed for free allocation models, where data points are treated as interchangeable units without any natural (here: temporal) arrangement. These moves are sub-optimal when a Markovian dependency structure among the (temporal) data points is given. In Sects.
2.5.1 and 2.5.2 I therefore propose two new pairs of Metropolis-Hastings moves, which exploit the temporal structure and thus improve convergence and mixing for the HMM-DBN model. While the conceptualization of the ideas behind these moves is relatively simple and intuitive, the mathematical implementation is involved, due to the need to ensure that the sampling scheme satisfies the equations of detailed balance and converges to the proper posterior distribution. In Appendices 3 and 4 I rigorously formulate the mathematical details, and I show for both pairs of moves that the two moves are complementary to each other. Hence, the acceptance probabilities can be chosen according to the Metropolis-Hastings criterion, so as to guarantee that the equation of detailed balance is fulfilled. Combining the SNR hyperparameter inference (see Table 1) and the network inference (see Table 2) with the moves on the allocation vectors yields the MCMC sampling scheme for generating a sample from the posterior distribution in Eq. (24). Table 3 shows how the sampling steps can be combined.
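As an illustration of the simple Gibbs sampling move of Appendix 1 (a sketch under my own simplified interface, not the paper's code), the state of a single data point is re-sampled from its full conditional distribution; `log_score` is a hypothetical stand-in for the log of P(V_g|K_g) · P(y_{g,V_g}|X_{π_g,V_g}, δ_g) evaluated at a candidate allocation:

```python
import math
import random

def gibbs_resample_point(v, idx, K, log_score, rng=random):
    """Re-sample the state of data point idx from its full conditional:
    P(V_g(idx) = k | rest) proportional to exp(log_score(v with v[idx] = k))."""
    logs = []
    for k in range(1, K + 1):
        cand = list(v)
        cand[idx] = k
        logs.append(log_score(cand))
    m = max(logs)                           # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logs]
    total = sum(weights)
    r = rng.random() * total                # inverse-CDF sampling over K states
    acc = 0.0
    for k, w in enumerate(weights, start=1):
        acc += w
        if r <= acc:
            v[idx] = k
            break
    else:
        v[idx] = K                          # guard against floating-point round-off
    return v
```

All other entries of the allocation vector stay fixed, which is why this move only takes very small steps in the configuration space.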

First pair of new HMM moves: the inclusion and the exclusion move
In this subsection I propose and verbally describe the novel inclusion move and the novel exclusion move for the HMM-DBN model. For each exclusion move there is a unique complementary inclusion move, and vice-versa. The introduction of this pair of moves can be best motivated by a simple example: Given 11 time points and the allocation [V_g(2), . . . , V_g(11)] = [1, 1, 2, 2, 1, 1, 1, 2, 2, 2] for the last 10 data points. If there is a Markovian dependency structure, it appears to be useful to propose to re-allocate the coherent time sequence [V_g(4), V_g(5)] = [2, 2] to state k = 1, since the surrounding earlier (lower) and later (higher) time points are allocated to state k = 1. The inclusion move merges ("includes") the subsequence [V_g(4), V_g(5)] into the state of the surrounding data points. This gives the new allocation [V_g(2), . . . , V_g(11)] = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]. Given the new allocation, the complementary exclusion move has to cut the subsequence [V_g(4), V_g(5)] out of the coherent sequence [V_g(2), . . . , V_g(8)] to move back to the original allocation. To this end, the exclusion move selects the coherent sequence [V_g(2), . . . , V_g(8)] of data points that are allocated to the same state (k = 1). Subsequently, it proposes to cut out a randomly selected subsequence, which is then "excluded", i.e. it is cut out and re-allocated to a new state (here: k = 2). To guarantee that there is a complementary inclusion move for each exclusion move, it is important to impose a constraint: the randomly selected subsequence is not allowed to include the two limiting data points, i.e. the lower limit V_g(2) and the upper limit V_g(8) in the example. In Appendix 3 I rigorously formulate the mathematical details, and I show that there is a unique exclusion move for each inclusion move, and vice-versa.

Table 3 Pseudo code for the MCMC sampling scheme. Input: the current state of the MCMC simulation, that is, the network, M^(i−1), the allocation vectors, V_g^(i−1), and the SNR hyperparameters, δ_g^(i−1) (g = 1, . . . , N). (i) Keep the network, M^(i−1), and the allocation vectors, V_g^(i−1) (g = 1, . . . , N), fixed, and update the SNR hyperparameters with the MCMC sampling scheme described in Table 1; for each g replace δ_g^(i−1) by the new value δ_g^(i). (ii) Keep the allocation vectors, V_g^(i−1) (g = 1, . . . , N), and the SNR hyperparameters, δ_g^(i) (g = 1, . . . , N), fixed, and update the network structure with the MCMC sampling scheme described in Table 2; replace the old graph, M^(i−1), by the outputted new graph, M^(i). (iii) For g = 1, . . . , N: keep the parent set, π_g^(i), the number of states, K_g^(i−1), and the SNR hyperparameter, δ_g^(i), fixed, and perform the Gibbs sampling move, described in Appendix 1, on V_g^(i−1).

Second pair of new HMM moves: the birth and the death move
In this subsection I propose and verbally describe the novel death move and the novel birth move for the HMM-DBN model. For each birth move there is a unique complementary death move, and vice-versa. The introduction of this pair of novel Metropolis-Hastings moves can be best motivated by a simple example: Given 11 time points and the allocation vector [V_g(2), . . . , V_g(11)] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] for the last 10 data points, it appears to be useful to impose a changepoint, which re-allocates the last data points to a new state k = 2. For example, re-allocating the last four data points yields the new allocation vector [V_g(2), . . . , V_g(11)] = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]. The birth move randomly selects a state k and re-allocates the last data points that are allocated to k to a new state k_new. Thereby the novel birth move also allows for moves such as [1, 1, 2, 2, 1, 1, 2, 2, 1, 1] → [1, 1, 2, 2, 1, 3, 2, 2, 3, 3], where the last three data points that were allocated to state k = 1 have been re-allocated to a new state k_new = 3.
Given the new allocation vector, [V_g(2), . . . , V_g(11)] = [1, 1, 2, 2, 1, 3, 2, 2, 3, 3], the complementary death move has to re-allocate all data points that are allocated to state k = 3 back to state k = 1. To this end the death move selects the two states k = 1 and k = 3, and then tests whether the data points allocated to state k = 1 and the data points allocated to state k = 3 are "separated" (do not "overlap"). Formally, I will say that the two sets T_1 = {t : V_g(t) = 1} and T_3 = {t : V_g(t) = 3} are separated if every element of T_1 precedes every element of T_3, i.e. if max(T_1) < min(T_3). If the "separation test" is successful, the death move is valid and can be performed. In the example, the highest time point allocated to k = 1, namely t = 6, precedes the lowest time point allocated to k = 3, namely t = 7, so that the "test for separation" is successful and the death move is valid. This formal test for separation is required, since otherwise the new allocation vector could not have been reached by the novel birth move, described above. In Appendix 4 I rigorously formulate the mathematical details, and I show that there is a unique novel death move for each novel birth move, and vice-versa.
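The separation test can be sketched as follows (my own illustration of the described test, not code from the paper):

```python
def separated(v, k_old, k_new, t0=2):
    """Death-move validity test: True if every time point allocated to k_old
    precedes every time point allocated to k_new, i.e. max(T_old) < min(T_new).
    v = [V_g(t0), ..., V_g(T)], so position i corresponds to time point t0 + i."""
    t_old = [t for t, s in enumerate(v, start=t0) if s == k_old]
    t_new = [t for t, s in enumerate(v, start=t0) if s == k_new]
    return bool(t_old) and bool(t_new) and max(t_old) < min(t_new)
```

For the example allocation [1, 1, 2, 2, 1, 3, 2, 2, 3, 3] the state pair (1, 3) passes the test, so the death move re-allocating state 3 back to state 1 is valid, whereas the pair (2, 3) fails, because the state-2 and state-3 time points overlap.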

Competing dynamic Bayesian network models
I will perform a systematic comparative evaluation, in which I compare the proposed HMM-DBN model with three competing DBN models. The traditional homogeneous DBN model (HOM-DBN) is described in Sect. 2.6.1, and in Sects. 2.6.2 and 2.6.3 the free allocation mixture DBN model (MIX-DBN) and the changepoint-segmented DBN model (CPS-DBN) are briefly summarised. An overview of the models is given in Table 4.

The conventional homogeneous DBN model (HOM-DBN)
In the homogeneous DBN model the network interactions do not vary over time. There is only one single state, K_g = 1, for each node g, and the allocation vectors assign all data points to state 1, V_g = (1, . . . , 1)^T. The HOM-DBN model is a special case of the HMM-DBN model, where K_g and V_g are fixed and non-adaptable. In Fig. 2 the nodes V_g and K_g become fixed (grey), and the nodes for α^g_k, p^g_k, λ, and K_MAX can be removed. The HOM-DBN model can be inferred with the MCMC sampling scheme in Table 3, but the allocation vector moves have to be left out, as K_g = 1 is fixed. (Table 4 also lists the provenances of the competing models: the MIX-DBN model follows Ko et al. (2007) and Grzegorczyk et al. (2008), the CPS-DBN model is akin to Lèbre et al. (2010) and various follow-up works, and the HMM-DBN model is proposed here.)

The non-homogeneous mixture DBN model (MIX-DBN)

In the mixture DBN model (MIX-DBN) the allocation vector, V_g, follows a free allocation mixture model with a component weight vector, p_g = (p_1, . . . , p_{K_g})^T. The probability of the allocation vector is then given by:

P(V_g | K_g, p_g) = ∏_{k=1}^{K_g} (p_k)^{n_k},

where n_k = |{t | 2 ≤ t ≤ T ∧ V_g(t) = k}| is the number of data points that are allocated to component k by V_g. On p_g I impose a conjugate Dirichlet distribution with hyperparameters α = (α_1, . . . , α_{K_g})^T, P(p_g) = Dir(p_g | α). Marginalizing over p_g yields:

P(V_g | K_g) = ( Γ(∑_{k=1}^{K_g} α_k) / Γ(T − 1 + ∑_{k=1}^{K_g} α_k) ) · ∏_{k=1}^{K_g} Γ(n_k + α_k) / Γ(α_k).   (27)

For α_k = α the posterior distribution of the MIX-DBN model becomes invariant to permutations of the components' labels. The MIX-DBN model can be inferred with the MCMC sampling scheme in Table 3, but exclusively allocation sampler moves can be performed on V_g. The moves from Sects. 2.5.1 and 2.5.2 cannot be used, as the MIX-DBN model treats the data points as interchangeable units (without any ordering). If the allocation sampler moves, described in Appendix 2, are performed, the terms P(V_g | K_g) in the acceptance probabilities have to be computed with Eq. (27) instead of Eq. (23).
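For comparison with the HMM prior, the following sketch (my own illustration, not the paper's code) computes the log of the Dirichlet-multinomial marginal of Eq. (27) for a symmetric hyperparameter α_k = α:

```python
from math import lgamma, exp

def log_mix_allocation_prior(v, K, alpha=1.0):
    """Log of P(V_g | K_g) in Eq. (27); v holds the states of the T - 1
    allocated data points, and alpha is the symmetric Dirichlet hyperparameter."""
    counts = [sum(1 for s in v if s == k) for k in range(1, K + 1)]
    log_p = lgamma(K * alpha) - lgamma(K * alpha + len(v))
    for n_k in counts:
        log_p += lgamma(alpha + n_k) - lgamma(alpha)
    return log_p
```

In contrast to the HMM prior of Eq. (23), this prior depends on the allocation only through the component counts n_k, so any reordering of the data points leaves it unchanged.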

The non-homogeneous changepoint DBN model (CPS-DBN)
The changepoint DBN model (CPS-DBN) combines the traditional DBN model with a multiple changepoint process. As before, I assume that K_g follows a truncated Poisson distribution. The K_g − 1 changepoints, b_{g,1} < · · · < b_{g,K_g−1}, divide the time points into K_g contiguous segments, where b_{g,0} := 1 and b_{g,K_g} := T. Following Green (1995) I assume that the changepoints are distributed as the even-numbered order statistics of L := 2(K_g − 1) + 1 points uniformly and independently distributed on the set {2, . . . , T − 1}. This induces the following prior distribution on the allocation vectors:

P(V_g | K_g) = ( ∏_{j=1}^{K_g} (b_{g,j} − b_{g,j−1} − 1) ) / binom(T − 2, 2K_g − 1).   (28)

The allocation vectors can be inferred via changepoint birth, death and re-allocation moves along the lines of the RJMCMC algorithm of Green (1995).
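A sketch of this even-numbered order statistics prior (my own reconstruction of the combinatorial argument, offered under that assumption rather than as the paper's code):

```python
from math import comb

def cps_allocation_prior(changepoints, T):
    """Prior probability of a segmentation with changepoints
    b_{g,1} < ... < b_{g,K_g-1} drawn from {2, ..., T - 1} as the
    even-numbered order statistics of 2(K_g - 1) + 1 uniform points,
    with b_{g,0} := 1 and b_{g,K_g} := T."""
    K = len(changepoints) + 1
    b = [1] + sorted(changepoints) + [T]
    num = 1
    for j in range(1, K + 1):
        # free positions for the one odd-numbered point inside segment j
        num *= b[j] - b[j - 1] - 1
    return num / comb(T - 2, 2 * K - 1)
```

For K_g = 1 (no changepoints) the probability is one, and for fixed K_g the probabilities of all admissible changepoint sets sum to one.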
The changepoint re-allocation move from V_g^(i−1) to V_g^⋆ randomly selects one changepoint, b_{g,j}, from the changepoint set, {b_{g,1}, . . . , b_{g,K_g−1}}, and the replacement changepoint is randomly drawn from the set {b_{g,j−1} + 2, . . . , b_{g,j+1} − 2}. This yields the new candidate allocation vector, V_g^⋆, and K_g^⋆ = K_g^(i−1).
The changepoint birth move randomly draws the location of one single new changepoint from the set of all valid new changepoint locations. Adding the new changepoint to the changepoint set yields V_g^⋆, and K_g^⋆ = K_g^(i−1) + 1. The changepoint death move is complementary to the birth move: it randomly selects one of the changepoints induced by V_g^(i−1) and deletes it. V_g^⋆ is the new candidate allocation vector after deletion, and K_g^⋆ = K_g^(i−1) − 1. The acceptance probabilities for these moves are given by A = min{1, R}, with

R = ( P(y_{g,V_g^⋆} | X_{π_g,V_g^⋆}, δ_g) · P(V_g^⋆ | K_g^⋆) · P(K_g^⋆) ) / ( P(y_{g,V_g^(i−1)} | X_{π_g,V_g^(i−1)}, δ_g) · P(V_g^(i−1) | K_g^(i−1)) · P(K_g^(i−1)) ) · Q,

where Q is the Hastings ratio, which can be computed for each of the three changepoint move types (see, e.g., Green 1995). If the move is accepted, set V_g^(i) = V_g^⋆ and K_g^(i) = K_g^⋆; otherwise keep the current state. The CPS-DBN model can be inferred with the MCMC sampling scheme described in Table 3, but the moves on the allocation vectors have to be replaced by the changepoint birth, death and re-allocation moves, described in this subsection.
I also include the globally coupled variant of the CPS-DBN model, proposed in Grzegorczyk and Husmeier (2012b) and Grzegorczyk and Husmeier (2013), in my comparative evaluation study. The key idea is to hierarchically couple the segment-specific regression parameter vectors, w g,k , in Eq. (2) to allow for information-sharing with respect to the regression parameters. In the coupled CPS-DBN model Eq.
(2) is replaced by P(w_{g,k} | σ²_g, δ_g) = N(w_{g,k} | m_g, δ_g σ²_g I), and the mean vector, m_g, is now a flexible hyperparameter with a multivariate standard Gaussian prior distribution, symbolically: m_g ∼ N(0, I); see, e.g., Grzegorczyk and Husmeier (2013) for the mathematical details. However, as the coupled CPS-DBN model is not in the primary scope of the present paper, I focus on the standard CPS-DBN model and discuss the results of the coupled CPS-DBN model only in passing.

Network-wide (shared) allocation vectors
The non-homogeneous DBN models have been formulated with node-specific allocation vectors, V_g (g = 1, . . . , N). That is, the allocations vary from node to node and have to be inferred independently for each node g. This gives very flexible DBN models. For applications where all nodes are a priori expected to share the same segmentation, the node-specific allocation vectors can be replaced by a network-wide allocation vector, which is then shared by all nodes: V_g = V and K_g = K for all g. For network-wide allocation vectors the moves from Sect. 2.5 have to be adapted. The probability terms P(K_g) and P(V_g | K_g) have to be replaced by P(K) and P(V | K), respectively, and each allocation vector change, V^(i−1) → V^⋆, applies to all nodes. The marginal likelihood terms (e.g. in the acceptance probabilities), P(y_{g,V_g} | X_{π_g,V_g}, δ_g), have to be replaced by the product terms ∏_{g=1}^{N} P(y_{g,V} | X_{π_g,V}, δ_g). The usage of network-wide allocation vectors imposes a substantial restriction on the configuration space of the allocations. The underlying allocation vector can then be inferred more accurately, as conceptual problems associated with model over-flexibility (data overfitting) are alleviated.

Marginal edge posterior probabilities
The MCMC sampling scheme for the HMM-DBN model is outlined in Table 3, and in Sects. 2.6.1-2.6.3 I provide details on how to modify this scheme for the competing models. I perform 200I iterations in total, and to avoid autocorrelations in the MCMC trajectories I take samples at equidistant intervals (every 100th iteration). From the sample of length 2I I discard the first I samples to allow for a "burn-in phase", and I keep the remaining sample of length I. The marginal posterior probability of the edge n → j is then estimated by:

ê_{n,j} = (1/I) · ∑_{i=1}^{I} M^(i)(n, j),

where M^(i)(n, j) is 1 if M^(i) contains the edge n → j, and 0 otherwise. I also estimate the marginal posterior probabilities, Ĉ^g_{s,t}, of two data points s and t (s, t ∈ {2, . . . , T}) being assigned to the same state by the allocation vector V_g:

Ĉ^g_{s,t} = (1/I) · ∑_{i=1}^{I} 1{V_g^(i)(s) = V_g^(i)(t)},

where 1{·} denotes the indicator function. I will refer to Ĉ_g = (Ĉ^g_{s,t})_{s,t ∈ {2,...,T}} as the estimated connectivity (co-allocation) matrix.
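The two Monte Carlo estimators of this subsection can be sketched as follows (plain-Python illustration of the averaging, with graphs given as adjacency matrices; not code from the paper):

```python
def edge_posteriors(graph_samples):
    """Estimated marginal edge posterior probabilities: the element-wise
    average of the sampled adjacency matrices M^(1), ..., M^(I)."""
    I = len(graph_samples)
    N = len(graph_samples[0])
    return [[sum(M[a][b] for M in graph_samples) / I for b in range(N)]
            for a in range(N)]

def coallocation_matrix(alloc_samples):
    """Estimated connectivity matrix: entry (s, t) is the fraction of sampled
    allocation vectors that assign data points s and t to the same state."""
    I = len(alloc_samples)
    T1 = len(alloc_samples[0])
    return [[sum(1 for v in alloc_samples if v[s] == v[t]) / I
             for t in range(T1)] for s in range(T1)]
```

Because the co-allocation indicator is invariant to permutations of the states' labels, the connectivity matrix can be estimated despite the non-identifiability of the labels.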

Criteria for quantifying the network reconstruction accuracy
If the true network, M‡, is known, I evaluate the network reconstruction accuracy in terms of the area under the precision-recall curve (AUC-PR). Let M‡(n, j) = 1 indicate that M‡ possesses the edge from node n to node j, while M‡(n, j) = 0 indicates that the edge n → j is not in M‡. The models yield marginal edge posterior probabilities ê_{n,j} ∈ [0, 1] for every possible edge n → j. For ζ ∈ [0, 1] I define E(ζ) as the set of all edges whose posterior probabilities exceed the threshold ζ. For each threshold ζ the precision is the fraction of the edges in E(ζ) that are contained in M‡, and the recall is the fraction of the edges of M‡ that are contained in E(ζ). Varying ζ traces out the precision-recall curve, and I compute the area under this curve as described in Davis and Goadrich (2006).

Potential scale reduction factors (PSRFs) for network edges
The diagnostic that I apply to evaluate convergence, proposed in Grzegorczyk and Husmeier (2011), is based on the potential scale reduction factors (PSRFs); see Brooks and Gelman (1998). I run H independent MCMC simulations (h = 1, . . . , H) and monitor the estimated marginal edge posterior probabilities along the simulations (s = 1, 2, . . . , I). Let ê^[h,s]_{n,j} denote the estimated probability of the edge n → j obtained with MCMC simulation h after 200s iterations, where s equidistant samples (every 100th iteration) are taken after a burn-in phase of length 100s. For s = 1, . . . , I I compute the "between-chain" and the "within-chain" variances:

B_s(n, j) = (s/(H − 1)) · ∑_{h=1}^{H} ( ê^[h,s]_{n,j} − ê^[·,s]_{n,j} )²,

W_s(n, j) = (1/H) · ∑_{h=1}^{H} (1/(s − 1)) · ∑_{i=1}^{s} ( M^(i,h)(n, j) − ê^[h,s]_{n,j} )²,

where ê^[·,s]_{n,j} is the mean of ê^[1,s]_{n,j}, . . . , ê^[H,s]_{n,j}, and M^(i,h)(n, j) is 1 if the ith network in the sample, taken from the hth simulation, contains the edge n → j, and 0 otherwise. Following Brooks and Gelman (1998), the PSRF_s(n, j) of the edge n → j is given by:

PSRF_s(n, j) = sqrt( ( ((s − 1)/s) · W_s(n, j) + (1/s) · B_s(n, j) ) / W_s(n, j) ),

where PSRF values near 1 indicate that the MCMC simulations are close to the stationary distribution. As a PSRF-based convergence diagnostic I use the fraction C(ξ, s) of edges whose PSRF is lower than a threshold ξ (e.g. ξ = 1.1 or ξ = 1.01). The fractions C(ξ, s) can be monitored against the number of MCMC iterations, 200s.
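As an illustration, the following sketch computes a PSRF for one edge from H equally long chains of edge indicators; I use the classic Gelman and Rubin (1992) form here, so the exact variant of Brooks and Gelman (1998) used in the paper may differ in detail:

```python
def psrf(chains):
    """Potential scale reduction factor for one scalar quantity, given H
    chains of equal length s (here: 0/1 indicators of one edge n -> j)."""
    H = len(chains)
    s = len(chains[0])
    means = [sum(c) / s for c in chains]
    grand_mean = sum(means) / H
    # between-chain variance B and average within-chain variance W
    B = s / (H - 1) * sum((m - grand_mean) ** 2 for m in means)
    W = sum(sum((x - m) ** 2 for x in c) / (s - 1)
            for c, m in zip(chains, means)) / H
    var_plus = (s - 1) / s * W + B / s   # pooled posterior variance estimate
    return (var_plus / W) ** 0.5
```

Values near 1 indicate convergence; chains that visit an edge with clearly different frequencies yield values above 1.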

Simulated data from the RAF pathway
For the RAF pathway, shown in Fig. 3, I generate synthetic network data. I employ a function V, which assigns a state k ∈ {1, . . . , K} to each temporal data point t = 2, . . . , T; V(t) = k means that data point t is assigned to the kth state. For each interaction between a node, g, and its parent nodes, as defined by the RAF pathway, I require regression parameter vectors which vary over time. Data points that are assigned to the same state k share the same regression parameter vector, while the regression parameters differ among states. Let w_{g,k} denote the regression parameter vector (including the intercept) for the interaction between node g and its parent nodes for all time points that are assigned to state k. I distinguish two scenarios for sampling random regression parameter vector instantiations. The first sampling strategy (scenario S1) has recently been employed in Grzegorczyk and Husmeier (2012b) and Grzegorczyk and Husmeier (2013) and guarantees that all regression parameter vectors, w_{g,k}, share the same amplitude, |w_{g,k}|₂ = 1. The second sampling strategy (scenario S2), which has for example been employed in Werhli et al. (2006), guarantees that the absolute value of each single element of the regression parameter vector lies between 0.5 and 2.
Fig. 3 The topology of the RAF pathway, as reported in Sachs et al. (2005). The RAF protein signalling transduction pathway consists of 11 proteins (pip3, plcg, pip2, pkc, p38, raf, pka, jnk, mek, erk, and akt), and the edges represent protein interactions.

Sampling scenario (S1): For each node g ∈ {1, . . . , N} and each state k ∈ {1, . . . , K}, I sample a random vector from a standard multivariate Gaussian distribution, w†_{g,k} ∼ N(0, I), and I normalize it to obtain a regression parameter vector of Euclidean norm (amplitude) one: w_{g,k} = w†_{g,k} / |w†_{g,k}|₂.

Sampling scenario (S2): For each node g ∈ {1, . . . , N} and each state k ∈ {1, . . . , K}, I sample the absolute value of each element of the regression parameter vector, w_{g,k}, independently from a continuous uniform distribution on the interval [0.5, 2], and for each element (regression coefficient) I afterwards flip a coin to determine its sign.
As strategy (S2) yields higher amplitudes, |w_{g,k}|₂, on average, I employ this sampling scenario when I compare the DBN models with node-specific allocation vectors. For the DBN models with shared allocation vectors, V_g = V for all g, I follow strategy (S1).^12 Given the sampled regression parameter vectors, w_{g,k}, which stem either from S1 or from S2, concrete data set instantiations, D, can be generated. Let D_{g,t} denote the observation for node g at time point t. For the first time point, t = 1, I sample the realisations of the N = 11 nodes from independent univariate Gaussian distributions, D_{g,1} ∼ N(0, 1) for all g. Afterwards, I generate realisations for t = 2, . . . , T:

D_{g,t} = w_{g,V(t)}^T · (1, D_{π_g,t−1}^T)^T + ε_{g,t},

where D_{π_g,t−1} is the vector of the realisations of the gth node's parent nodes at the previous time point, t − 1, the function V(·) assigns each data point t to a state k ∈ {1, . . . , K}, and the noise variables ε_{g,t} are independently standard Gaussian distributed, ε_{g,t} ∼ N(0, 1). The element 1 is included for the intercept. For each data set instantiation, D, I add white noise in a gene-wise manner to vary the signal-to-noise ratio (SNR). For each node, g, I compute the standard deviation, s_g, of its T realisations, D_{g,1}, . . . , D_{g,T}, and I add i.i.d. Gaussian noise with zero mean and standard deviation SNR⁻¹ · s_g to each data point. That is, I replace D_{g,t} by D_{g,t} + v_{g,t} (t = 1, . . . , T), where the v_{g,t} are independently N(0, (SNR⁻¹ · s_g)²) distributed and SNR is the pre-defined signal-to-noise ratio level. Furthermore, let the symbol "MIX" indicate an allocation scheme that does not consist of segments, but assigns each of the states k ∈ {1, . . . , K} to T̃ = (T − 1)/K randomly selected data points. For example, for T = 33 and K = 2 I divide the time point set {2, . . . , T} randomly into two disjoint subsets, consisting of T̃ = 16 data points each.
Then I assign the state k = 1 to the data points in the first subset, and the state k = 2 to the data points in the second subset. An overview of the allocation schemes that I employ in my study is given in Table 5. For each of the nine allocation schemes I distinguish five SNR levels, and I generate 20 independent data instantiations for each combination of allocation scheme and SNR level; i.e. 9 × 5 × 20 = 900 data sets in total.
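The two regression parameter sampling scenarios can be sketched as follows (my own illustration; the function names are mine):

```python
import math
import random

def sample_w_s1(dim, rng=random):
    """Scenario S1: standard Gaussian vector, normalized to Euclidean norm 1,
    so all regression parameter vectors share the same amplitude."""
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

def sample_w_s2(dim, rng=random):
    """Scenario S2: each |coefficient| uniform on [0.5, 2], with a random sign."""
    return [rng.uniform(0.5, 2.0) * rng.choice((-1.0, 1.0)) for _ in range(dim)]
```

An S2 vector of dimension d has squared norm of at least 0.25·d, so for d ≥ 5 its amplitude always exceeds the unit amplitude of S1, consistent with the statement that S2 yields higher amplitudes on average.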

Synthetic biology in Saccharomyces cerevisiae
A popular benchmark gene expression data set for non-homogeneous DBN models has been provided by Cantone et al. (2009). The authors synthetically designed a small network in Saccharomyces cerevisiae (yeast). This network, consisting of N = 5 genes, is depicted in the right panel of Fig. 10. The authors measured the expression levels of these genes in vivo with quantitative real-time Polymerase Chain Reaction at 37 time points over 8 h. During the experiment Cantone et al. (2009) changed the carbon source from galactose to glucose.^13 As 16 measurements were taken in galactose and 21 measurements were taken in glucose, there are the following observations for each node g: D^gal_{g,1}, . . . , D^gal_{g,16}, D^glu_{g,1}, . . . , D^glu_{g,21}.
The first measurements in galactose and glucose, D^gal_{g,1} and D^glu_{g,1}, were taken during washing steps, in which the extant glucose (galactose) was removed and new galactose (glucose) was added. Consequently, these two measurements were biased by external circumstances and have to be removed from the time series. After removal of these two measurements, the remaining time series were (i) standardized via a log transformation, before (ii) a z-score transformation was applied over all measured expression values.
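The two-step preprocessing can be sketched as follows (my own illustration, assuming the z-score is taken over exactly the values passed in):

```python
import math

def preprocess(values):
    """(i) log-transform the expression values, then (ii) z-score standardize:
    subtract the mean and divide by the standard deviation of the logged data."""
    logged = [math.log(x) for x in values]
    mean = sum(logged) / len(logged)
    sd = math.sqrt(sum((x - mean) ** 2 for x in logged) / len(logged))
    return [(x - mean) / sd for x in logged]
```

After this transformation the values have mean zero and unit variance, which puts all genes on a comparable scale for the Bayesian regression models.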

Circadian rhythms in Arabidopsis thaliana
Plants assimilate carbon via photosynthesis during the day, but have a negative carbon balance at night. The plants can buffer these daily carbon budget alternations by diurnal gene regulatory processes. They store some of the assimilated carbon as starch during the day (in the presence of light), and use the stored starch as a carbon supply during the night (in the absence of light). In order to synchronize this diurnal process with the external 24-h photo period, plants have a circadian clock that can potentially provide predictive, temporal regulation of metabolic processes over the day:night (light:dark) cycle. The molecular mechanisms behind this circadian regulation have not been fully elucidated yet.
I use four individual (independent) gene expression time series from Arabidopsis thaliana to study the diurnal gene regulatory processes among nine genes involved in the circadian clock.^14 In the four experiments E1-E4 the Arabidopsis plants were entrained in different dark:light cycles: 12 h:12 h (E1 and E2), 10 h:10 h (E3), and 14 h:14 h (E4). In the experiments T = 12 (E1) or T = 13 (E2-E4) measurements were taken, either in 4-h (E1 and E2) or in 2-h (E3 and E4) intervals. After the pre-experimental dark:light entrainment, the measurements were taken under an experimentally generated constant-light condition. The gene expression levels were measured with Affymetrix microarrays, and the data were background-corrected and RMA-normalized. The experimental protocols as well as more details on the time series can be found in Mockler et al. (2007). For my data analysis I merge the four time series E1-E4 into one single data set by successively arranging them, symbolically: E1, . . . , E4. The expression values at the first time point of each time series are not related to the expression values at the last time point of the preceding time series; e.g. the value of gene g at the first time point in E2, D^E2_{g,1}, is not related to the values of the genes at the last time point of E1, {D^E1_{g,T} | g = 1, . . . , N}. Therefore, the first time points, D^E1_{g,1}, D^E2_{g,1}, D^E3_{g,1}, and D^E4_{g,1}, have to be removed from the merged time series. That is, those four observations cannot be used as targets, as there are no measurements for their potential parent nodes (at the preceding time points).
My objective differs from the earlier studies. Neither do I assume the three boundaries between the four individual time series to be known (as in Grzegorczyk and Husmeier (2013)) nor do I try to infer them (as in Grzegorczyk and Husmeier (2011)). My focus is on capturing the diurnal nature (i.e. the alternating dark:light cycles) of the gene regulatory processes in the circadian clock.

The objectives of my empirical studies
First, I want to perform a comparative evaluation study to investigate under which circumstances the proposed HMM-DBN model achieves a higher network reconstruction accuracy than the competing DBN models. Second, I want to provide empirical evidence that the new MCMC moves, proposed in Sects. 2.5.1 and 2.5.2, improve convergence and mixing of the MCMC simulations. In Sect. 5.2 I employ data from the RAF pathway to systematically compare the network reconstruction accuracies of the DBN models, shown in Table 4, for various underlying segmentation schemes, shown in Table 5. The data are generated as explained in Sect. 3.1, and I distinguish five different SNR levels. I infer the DBN models with MCMC simulations, and I compute marginal edge posterior probabilities to reverse-engineer the RAF pathway. As the RAF pathway does not possess self-feedback loops, i.e. edges such as g → g, I impose the constraint g ∉ π_g (g = 1, . . . , N). Except for a first preliminary study in Sect. 5.1, I assume the segmentations to be unknown. That is, unlike related studies (see, e.g., Dondelinger et al. 2010; Husmeier et al. 2010; Dondelinger et al. 2012; Grzegorczyk and Husmeier 2012a, b, 2013), I here follow an unsupervised approach, in which the allocation vectors have to be inferred from the data. For the RAF pathway data I also compare the inferred segmentations with the true segmentations, and I show that the new MCMC moves substantially improve convergence and mixing. In Sect. 5.4 I employ the gene expression time series from Saccharomyces cerevisiae, described in Sect. 3.2, to extend my comparative evaluation by a real-world in vivo application from synthetic biology. Again I assume the segmentations to be unknown, and I exclude self-feedback loops, as the true network does not possess self-feedback loops. Although this application is quite small, the data have been measured in a true biological system, for which the true network is known.
This study allows for an objective comparison of the performances of the DBN models on real biological data. In Sect. 5.5 I analyse the four gene expression time series from Arabidopsis thaliana, described in Sect. 3.3. For the Arabidopsis data a proper evaluation in terms of the network reconstruction accuracy is infeasible owing to the absence of a gold standard. My primary focus is thus on capturing the diurnal nature of the regulatory processes. Since the true Arabidopsis network is not known, I do not rule out self-feedback loops.

Hyperparameter settings
The HMM-DBN model is presented as a graphical model in Fig. 2, and values for the fixed hyperparameters have to be chosen. In consistency with earlier studies on Bayesian networks, I restrict the maximal cardinality of the parent node sets to F = 3.^15 According to Eqs. (3-4), the inverse variance hyperparameters, σ⁻²_g (g = 1, . . . , N), and the inverse SNR hyperparameters, δ⁻¹_g (g = 1, . . . , N), are Gamma distributed with two hyperparameters each. For these four hyperparameters I again follow earlier related studies, in which the Bayesian regression DBN model from Sect. 2.1 was used.^16 Note that an extensive study in Grzegorczyk and Husmeier (2013) has shown that the results are robust with respect to different choices of these four hyperparameters. I also have to fix the hyperparameters of the Dirichlet priors for the MIX-DBN and the HMM-DBN model. In the absence of prior knowledge I follow Nobile and Fearnside (2007).

I infer the DBN models with MCMC simulations, and for each simulation I perform 200I
(with I = 500) iterations. I take samples at equidistant intervals (every 100th iteration). From the resulting sample of length 1000 I discard the first 500 samples ("burn-in phase"), and I use the remaining sample of length 500 to compute the marginal edge posterior probabilities (see Sect. 2.8). To assess convergence and mixing I apply trace plot (Giudici and Castelo 2003) and potential scale reduction factor (Gelman and Rubin 1992) diagnostics. With respect to the PSRF-based criterion, described in Sect. 2.10, I found that the PSRFs of all edges were below 1.1 for the above-mentioned simulation lengths. If the true network is known, I evaluate the network reconstruction accuracy in terms of the area under the precision-recall curve (AUC-PR), as described in Sect. 2.9.

Pre-study: the supervised approach
I start with a pre-study, in which I cross-compare the network reconstruction accuracies of the proposed HMM-DBN model and the CPS-DBN model. I generate RAF pathway data for the segmentation (V_g(2), . . . , V_g(T)) = 1212, and I employ strategy (S1) from Sect. 3.1 to sample the regression parameters. Unlike in the later studies, (i) I here fix the noise level (SNR = 16) and vary the number of data points instead, and (ii) I assume the segmentation to be known and fixed ("supervised approach"). For the proposed HMM-DBN model I can impose the true underlying allocation vectors. The CPS-DBN model employs changepoints to divide the data into disjoint segments with different states. Consequently, the true segmentation, 1212, is not a member of the allocation vector configuration space of the CPS-DBN model and has to be approximated by 1234. I vary the number of data points.

Network reconstruction and allocation vector accuracy for various segmentation schemes
In this subsection I cross-compare the performances of the four DBN models from Table 4. I generate RAF pathway data for various segmentations, as listed in Table 5, and I follow an unsupervised approach, i.e. I assume the segmentations to be unknown so that the allocation vectors have to be inferred from the data. I implement the models with node-specific and network-wide allocation vectors, and I distinguish the strategies (S1) and (S2) from Sect. 3.1 for sampling random instantiations of the regression parameters. I keep the number of data points per segment fixed (T = 8) and I vary the noise level (SNR ∈ {16, 8, 4, 2, 1}). The network reconstruction accuracy results for the models with network-wide allocation vectors, V_g = V, are shown in Figs. 5 and 6. The results obtained with node-specific allocation vectors, V_g, are shown in Fig. 7.

Fig. 5 Network reconstruction accuracy for the synthetic RAF pathway data for different segmentation schemes. Data were generated with the regression parameter sampling strategy (S1) for different allocation schemes; see Sect. 3.1 and Table 5 for details. The DBN models were implemented with network-wide allocation vectors, V_g = V. The three columns refer to three different segmentation schemes: 1111, 1122, and 112233. The panels in the top row monitor the network reconstruction accuracy in terms of average AUC-PR scores for the HOM-DBN, the CPS-DBN, the MIX-DBN, and the proposed HMM-DBN model. The horizontal axis refers to five different SNR levels. The following rows monitor the average AUC-PR differences between the proposed HMM-DBN model and the other three DBN models: HMM versus HOM (2nd row), HMM versus CPS (3rd row), and HMM versus MIX (4th row). The AUC-PR scores and AUC-PR score differences are averages over 20 independent data instantiations, with error bars indicating two-sided 95% t-test confidence intervals. Note that identical plots with AUC-ROC scores (not provided) show very similar trends.

The results can be summarised as follows.
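Throughout this section the network reconstruction accuracy is scored by the area under the precision-recall curve, computed from the ranked marginal edge posterior probabilities. As an illustration (not the exact implementation behind the reported scores), a minimal AUC-PR sketch in Python, using a simple trapezoidal approximation and the common convention that precision at recall 0 is 1:

```python
def auc_pr(scores, truth):
    """Area under the precision-recall curve for ranked edge scores.
    scores: dict mapping edge -> posterior probability; truth: set of true edges."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    tp = fp = 0
    n_pos = len(truth)
    points = [(0.0, 1.0)]                 # (recall, precision), start at recall 0
    for edge in ranked:
        if edge in truth:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    # trapezoidal integration over recall
    return sum((r2 - r1) * (p1 + p2) / 2
               for (r1, p1), (r2, p2) in zip(points, points[1:]))

# A ranking that places all true edges first scores 1.0.
perfect = auc_pr({"a": 0.9, "b": 0.8, "c": 0.2}, truth={"a", "b"})
```

A ranking that buries the true edges at the bottom drifts towards the fraction of true edges among all candidate edges, which is why AUC-PR (unlike AUC-ROC) is sensitive to the sparsity of the true network.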

Network reconstruction accuracies
(1) Homogeneous data: The segmentation 1111 in Fig. 5 refers to homogeneous data. As the number of states is equal to one, K = 1, the regression parameter vectors w_{g,1} (g = 1, ..., N) do not vary over time. Figure 5 shows that the models perform approximately equally well for this scenario. That is, the non-homogeneous models (CPS, MIX, and HMM) do not overfit the data by inferring spurious segmentations and are thus not inferior to the homogeneous DBN (HOM).
(2) Changepoint-segmented data: The segmentations 1122 and 112233 in Fig. 5 and the segmentation 1122 in Fig. 7 refer to classical changepoint-segmented time series. There are 2-3 different states, and states once left are not revisited. Consequently, these segmentations lie within the configuration space of the changepoint process underlying the CPS-DBN model.
(3) Mixture data: A further scenario refers to mixture data, i.e. a random free allocation of the data points. Although the HMM-DBN model can infer free allocations, it does take the temporal ordering into account by putting less prior weight onto (random) allocations without any temporal dependencies. For noisy data the prior on the allocation vectors becomes important, and so the HMM-DBN model is disadvantaged compared to the MIX-DBN model, whose allocation vector prior ignores the temporal ordering of the data points altogether.
(4) Periodic data: The segmentations 1212 and 121212 refer to time series in which states are revisited periodically. The dependency structure behind these segmentations is compatible with a Hidden Markov model, and I will refer to them as "periodic data". As for the mixture data, the HOM-DBN and the CPS-DBN model cannot deal with these periodic segmentations and perform consistently and significantly worse than the HMM-DBN model, unless the data are very noisy (SNR = 1). Only for the simulations with network-wide allocation vectors on segmentation 1212 in Fig. 6 are the differences between the HMM-DBN and the CPS-DBN model moderate. The MIX-DBN model also achieves consistently lower network reconstruction accuracies than the proposed HMM-DBN model, but the differences in favour of the HMM-DBN model are less pronounced. From the left and middle columns in Fig. 6 it appears that the MIX-DBN model is outperformed for moderate noise levels, where the network reconstruction is neither perfect (AUC-PR ≈ 1) nor impossible (AUC-PR ≈ 0.5).
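The segmentation schemes above are compact labels for allocation vectors: each character is the state label of one segment of T = 8 consecutive data points. A small sketch of how such an allocation vector can be expanded (the helper name is mine, not from the paper):

```python
def allocation_vector(scheme, points_per_segment=8):
    # Expand a scheme string such as "1212" into one state label per data point,
    # e.g. "1122" -> eight points in state 1 followed by eight points in state 2,
    # then eight more in each of states 1 and 2 for "1212", and so on.
    return [int(c) for c in scheme for _ in range(points_per_segment)]

V = allocation_vector("1212")
K = len(set(V))   # number of distinct states; here K = 2, states are revisited
```

Under scheme 1212 the first and third segments share state 1, which is exactly the case a changepoint process cannot represent without splitting the series into four distinct states.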

The estimated marginal connectivity matrices
For the non-homogeneous DBN models I estimate the marginal connectivity matrices, as described in Sect. 2.8. Figure 8 shows heatmap representations of the average connectivity matrices for the segmentations 1122, 112233, 1212, and 121212 of the simulations with network-wide allocation vectors and SNR = 16. Figure 8 shows that the estimated connectivity matrices are consistent with my findings for the network reconstruction accuracy. The HMM-DBN model (bottom row in Fig. 8) infers the underlying segmentations (top row in Fig. 8) more accurately than the MIX-DBN model (2nd row in Fig. 8). That is, both models detect the underlying components, but the components are separated substantially more strongly by the proposed HMM-DBN model. The CPS-DBN model perfectly separates the segments only for those segmentations, 1122 and 112233, that are in agreement with its allocation vector configuration space. The segmentations 1212 and 121212 can only be approximated by 1234 and 123456, respectively, and the segments are then separated only weakly (last two panels in the 3rd row of Fig. 8).

Summary
The results shown in Figs. 5, 6, 7 and 8 demonstrate that the proposed HMM-DBN model is more robust than the competing DBN models with respect to variations of the underlying allocation. The HOM-DBN model cannot deal with non-homogeneous data at all. The CPS-DBN model fails when the underlying segmentation cannot be approximated properly by changepoints. The MIX-DBN model fails when the underlying segmentation has a temporal structure, which it cannot take into account. The proposed HMM-DBN model is always among the best-scoring models, and it significantly outperforms the competing models for periodic segmentations, such as 1212 and 121212. Finally, note that I also applied the coupled variant of the CPS-DBN model from Grzegorczyk and Husmeier (2013); see Sect. 2.6.3 for a brief description of the coupling scheme. However, for the RAF-pathway data I have never observed a significant difference between the AUC-PR scores of the coupled CPS-DBN model and the AUC-PR scores of the standard CPS-DBN model. This finding is not surprising and is consistent with the empirical results reported in Grzegorczyk and Husmeier (2013): As described in Sect. 3.1, I here sample independent state-specific regression parameters, so that coupling the regression parameters is unlikely to yield any information gain.

Convergence comparison for the HMM-DBN model
In this subsection I assess the degree of convergence and mixing of three different MCMC sampling schemes for the proposed HMM-DBN model. The MCMC sampling scheme for the HMM-DBN model is outlined in Table 3; here I vary the 4th sampling step, i.e. the allocation vector inference part, to demonstrate that the adoption of the new moves, proposed in Sects. 2.5.1 and 2.5.2, improves convergence. I compare three schemes:

(i) MIX and HMM moves: Exactly as indicated in Table 3, an unbiased coin is drawn to decide whether an allocation sampler (MIX) move or a new (HMM) move is performed. Both move types are equally likely (p_MIX = 0.5 and p_HMM = 0.5).

(ii) MIX moves only: Perform the allocation sampler (MIX) moves with probability 1.

(iii) HMM moves only: Perform the new HMM moves with probability 1.

With each of the three sampling schemes I perform 5 independent MCMC simulations per data set instantiation. Afterwards, for each sampling scheme the 5 independent MCMC inference results are used to compute a potential scale reduction factor (PSRF) for each edge, and the fractions of edges whose PSRF is lower than the thresholds ξ = 1.1 and ξ = 1.01 are computed; details on how I defined a PSRF for an edge are given in Sect. 2.10. This procedure is repeated for five individual data set instantiations, and Fig. 9 shows overlaid trace plots of the average fractions of edges whose PSRF is lower than the threshold ξ = 1.1 (left panels) and ξ = 1.01 (right panels). The data sets were generated with sampling strategy (S2) for two segmentations, 1122 (top row) and 121212 (bottom row), and the models were implemented with node-specific allocation vectors.

A clear outcome of the convergence diagnostic is that the MCMC sampling scheme MIX and HMM moves, which combines both types of moves, yields the best convergence: About 100% of the edges satisfy the standard convergence criterion (PSRF < 1.1) already after 50k iterations. For the HMM moves only scheme there is considerable scope for improvement: For the segmentation scheme 121212, on average only about 90% of the edges satisfy the convergence criterion PSRF < 1.1 after 50k iterations. The MIX moves only scheme fails to converge properly for the segmentation scheme 1122; only about 98% (92%) of the edges satisfy the criterion PSRF < 1.1 (PSRF < 1.01) after 50k iterations. Note that the same average percentage rates are reached with the MIX and HMM moves already after 10k (20k) iterations. This suggests that the inclusion of the new MCMC moves, proposed in Sects. 2.5.1 and 2.5.2, is advantageous.
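For illustration, the generic Gelman-Rubin PSRF computed from several independent chains can be sketched as follows. This is a simplified textbook version applied to arbitrary scalar traces, not the exact edge-specific definition of Sect. 2.10:

```python
import random
from statistics import mean, variance

def psrf(chains):
    # Gelman-Rubin potential scale reduction factor from C chains of S samples each:
    # compares the pooled variance estimate against the within-chain variance.
    S = len(chains[0])
    W = mean(variance(c) for c in chains)            # within-chain variance
    B = S * variance([mean(c) for c in chains])      # between-chain variance
    var_hat = (S - 1) / S * W + B / S                # pooled variance estimate
    return (var_hat / W) ** 0.5

random.seed(0)
# Five chains sampling the same distribution: PSRF close to 1 (converged).
good = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(5)]
# The same chains shifted apart (stuck in different modes): PSRF well above 1.1.
bad = [[x + 3.0 * k for x in c] for k, c in enumerate(good)]
```

A PSRF near 1 indicates that the between-chain variability is negligible relative to the within-chain variability, which motivates the ξ = 1.1 and ξ = 1.01 thresholds used above.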
The novel moves are as straightforward to implement as the allocation sampler moves, described in Appendix 2, and yield a convergence improvement.

Network reconstruction in Saccharomyces cerevisiae (yeast)
In this subsection I cross-compare the network reconstruction accuracy of the DBN models on a small but topical data set from synthetic biology. The (true) yeast network, which was synthetically designed by Cantone et al. (2009), is depicted in the right panel of Fig. 10. Gene expression time series were measured in synthetically designed yeast cells, as described in Sect. 3.2. I apply each of the non-homogeneous DBN models (CPS, coupled CPS, HMM, and MIX) with node-specific, V_g, and with network-wide, V_g = V, allocation vectors. Hence, I compare the performances of eight non-homogeneous DBN models and the conventional homogeneous DBN model. For each of the nine DBN models I run 5 independent MCMC simulations. The network reconstruction accuracy results (in terms of mean AUC-PR scores) are represented as histograms in Fig. 10. It can be seen that the non-homogeneous DBN models consistently achieve higher AUC-PR scores when they are implemented with node-specific allocation vectors. Two-sided Student's t-tests show that the improvement achieved with node-specific allocation vectors is significant for the CPS-DBN model (p = 0.015), the coupled CPS-DBN model (p = 0.048) and the HMM-DBN model (p = 0.011). For both allocation vector variants (node-specific and network-wide) the proposed HMM-DBN model reaches the highest average AUC-PR scores. In terms of the p values of two-sided t-tests, the differences in favour of the proposed HMM-DBN model are significant except for the comparison with the coupled CPS-DBN model. When implemented with node-specific allocation vectors the coupled CPS-DBN model and the proposed HMM-DBN model perform approximately equally well (p = 0.517).
This finding is in agreement with earlier results on the RAF-pathway data in Sect. 5.2. Because of the carbon source switch from galactose to glucose, the true segmentation of the yeast time series should be roughly of the form 1122. For this segmentation it was found that the MIX-DBN model, which ignores the temporal order of the time points, performs substantially worse than the CPS-DBN and the HMM-DBN model; see, e.g., the middle column of Fig. 7. The improved network reconstruction accuracy of the coupled CPS-DBN model is in agreement with earlier reported results (see, e.g., Fig. 12 in Grzegorczyk and Husmeier 2013). The results in Fig. 10 suggest that the same improvement (i.e. the same "regularisation effect") can also be reached by a more flexible data segmentation scheme, namely the proposed HMM-DBN model. Finally, I also applied the coupling scheme from Grzegorczyk and Husmeier (2013) to the proposed HMM-DBN model; see Sect. 2.6.3 for a brief description of this coupling scheme. For the "coupled" HMM-DBN model I have not observed further improvements, but a slight (non-significant) decrease of the average AUC-PR scores.

Network reconstruction in Arabidopsis thaliana
In this subsection I compare the performances of the non-homogeneous DBN models on a merged gene expression time series from Arabidopsis thaliana. One single long Arabidopsis time series has been obtained by successively arranging four individual short gene expression time series from different experiments, as explained in more detail in Sect. 3.3. In the four individual experiments (E1-E4) the gene expressions were measured under constant light condition, but the plants were entrained in different experimentally controlled light/dark cycles. In the first two experiments, E1 and E2, the plants were entrained in a 12 h:12 h light/dark cycle and measurements were taken in 4 h intervals, while in E3 and E4 measurements were taken in 2 h intervals and the plants were entrained in the light/dark cycles 10 h:10 h (E3) and 14 h:14 h (E4). From a biological perspective the regulatory relationships among the circadian genes in Arabidopsis follow a two-stage process, which is related to the diurnal nature of the environmental dark:light cycle. Two groups of genes can be distinguished: morning genes, whose activities peak in the presence of light (i.e. in the morning), and evening genes, whose activities peak in the absence of light (i.e. in the evening). Although all gene expression measurements in E1-E4 were taken under artificially generated constant light condition, the two-stage nature of the regulatory mechanisms is preserved by the circadian clock (see, e.g., Johnson et al. 2003; McClung 2006). That is, even under constant light condition the regulatory processes (approximately) follow the diurnal dark:light cycle in which the plants were entrained before the experiment. Since the dark:light cycle affects the activities of both the morning and the evening genes (i.e. the whole regulatory network) rather than specific genes only (Johnson et al. 2003; McClung 2006), I implement the non-homogeneous DBN models with network-wide allocation vectors, V_g = V; see Sect. 2.7 for details.
Heatmap representations of the inferred connectivity matrices are shown in Fig. 11a. All the non-homogeneous models (MIX-DBN, CPS-DBN, coupled CPS-DBN, and HMM-DBN) infer a two-stage process, with the number of states (components) peaking at K = 2. The CPS-DBN model and the coupled CPS-DBN model (see Sect. 2.6.3) both infer the same segmentation with one single changepoint between E2 and E3 (see left panel in Fig. 11a). As the CPS-DBN models can only infer changepoint-divided segmentations, where the segments are assigned to disjoint components (i.e. a state once left cannot be revisited), they do not capture the true underlying segmentation of the Arabidopsis time series. The changepoint of the CPS-DBN models appears to be related to the different experimental conditions in E1-E2 and E3-E4 (here: e.g. the distance between measurements). The inferred segmentation does not reflect the diurnal nature of the regulatory process. For the merged Arabidopsis time series the preservation of the entrained dark:light cycles corresponds to a segmentation scheme of the form "121212...". Hence, the failure of the CPS-DBN model is in agreement with the results observed for the synthetic RAF-pathway data: In Sect. 5.2 I found for segmentations such as 1212 and 121212 that the CPS-DBN model cannot infer the correct segmentation; see, e.g., the last two panels in the third row of Fig. 8.
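The heatmaps in Fig. 11 visualise marginal connectivity matrices of the kind described in Sect. 2.8: entry (s, t) is the posterior probability that data points s and t share a state. From MCMC output this is simply the fraction of sampled allocation vectors that co-allocate the two points; a minimal sketch (the function name is mine):

```python
def connectivity_matrix(allocation_samples):
    # Entry (s, t): fraction of MCMC samples in which data points s and t are
    # allocated to the same state (0 = never, rendered black; 1 = always, white).
    S = len(allocation_samples)
    T = len(allocation_samples[0])
    return [[sum(V[s] == V[t] for V in allocation_samples) / S
             for t in range(T)] for s in range(T)]

# Three posterior samples over four data points: points 0 and 1 share a state
# in two of the three samples, so C[0][1] = 2/3.
C = connectivity_matrix([[1, 1, 2, 2], [1, 1, 2, 2], [1, 2, 2, 2]])
```

Blocks of entries near 1 around the diagonal then appear as the "white windows" discussed for the individual time series E1-E4.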
From the middle and the right panel in Fig. 11a it can be seen that the inferred connectivity structures of the MIX-DBN model and the proposed HMM-DBN model are (also) very similar. I now have a closer look at the connectivity structures within the four individual time series. Figure 11b shows heatmap representations of the connectivity structures within E1-E4. The (sub-)heatmaps in Fig. 11b confirm the conjecture that the MIX-DBN and the HMM-DBN model infer very similar connectivities; i.e. the patterns in the top row are almost identical to the patterns in the bottom row. In particular, it can also be seen that the inferred segmentations are actually related to the dark:light cycles in which the Arabidopsis plants were entrained. In the heatmaps the "white windows around the diagonal" represent connected blocks, i.e. segments of data points that are assigned to the same state (component). The "white windows" in E3 (time points 26, ..., 30) and in E4 (time points 38, ..., 44) represent time intervals of length (5 × 2 h =) 10 h and (7 × 2 h =) 14 h, and are thus in agreement with the entrainment cycles 10 h:10 h (E3) and 14 h:14 h (E4), respectively. In E1 and E2 there is at least a certain tendency towards segments ("white windows") consisting of 3 data points. In E1 and E2, where measurements were taken in 4 h intervals, three neighbouring data points cover a time interval of length (3 × 4 h =) 12 h, which corresponds to the entrainment cycle 12 h:12 h of E1-E2. This suggests that the MIX-DBN model and the HMM-DBN model infer the same connectivity structure, which is related to the diurnal nature of the dark:light cycle and thus in agreement with biology.

Fig. 11 Inference results on the Arabidopsis gene expression data. In the heatmaps in panels (a) and (b) the grey shading indicates the posterior probability of two data points being assigned to the same state, ranging from 0 (black) to 1 (white).
(a) Heatmaps of the connectivity matrices inferred on the merged data set. The merged data set consists of four individual time series (E1-E4), which were arranged successively; see Sect. 3.3 for details. In the three panels the axes represent the indices of the data points, and the axes are ticked at the boundaries of the four individual time series. The CPS-DBN and the coupled CPS-DBN model both infer approximately the same segmentation (see left panel) with one single changepoint in between E2 and E3. (b) Sub-heatmaps extracted ("cut out") from the heatmaps in panel (a). The extracted sub-heatmaps show the connectivity structures within the four individual time series E1-E4. Note that the temporal distance between neighbouring data points is 4 h in E1 and E2, while measurements in E3 and E4 have been taken in 2 h intervals.

Finally, I use the inferred marginal edge posterior probabilities of the proposed HMM-DBN model to predict the regulatory relationships in the circadian clock. Figure 12 shows the predicted network possessing only those edges whose marginal posterior probability exceeds the threshold of 0.5. Unfortunately, there is no gold-standard network for the circadian clock in Arabidopsis, so that the network reconstruction accuracy cannot be evaluated properly. However, the reconstructed network, shown in Fig. 12, possesses several edges that are consistent with the biological literature: According to the biological literature (see, e.g., McClung 2006), the morning genes activate the evening genes, and the evening genes inhibit the morning genes. In the predicted network there are six edges pointing from the morning genes to the evening genes and five edges pointing from the evening genes to the morning genes. McClung (2006) also reports that CCA1 and LHY are the central regulators among the morning genes. From Fig. 12 it can be seen that four of the six edges pointing from the morning to the evening genes actually originate from LHY and CCA1, and that four of the five evening genes are regulated either by LHY or by CCA1. In particular, the regulation of the evening genes TOC1 and ELF4 by the central regulators CCA1/LHY has already been reported in Alabadi et al. (2001) and Kikis et al. (2005). Among the edges originating from the evening genes, the two edges ELF3 → CCA1 and ELF3 → LHY are consistent with the biological finding in Kikis et al. (2005) that ELF3 is necessary for light-induced CCA1 and LHY expression. Moreover, the edges ELF3 → TOC1 and GI → TOC1 are also in agreement with the literature, as Miwa et al. (2006) found that both genes, ELF3 and GI, are involved in the interaction between CCA1 and TOC1. Within the group of evening genes, the reconstructed network contains one single feedback loop, GI ↔ TOC1, between GI and TOC1. Exactly this feedback loop has also been found in Locke et al. (2005).

(Table notes: In the first set of scenarios all models were implemented with network-wide (shared) segmentations; in the second set, with node-specific segmentations. The computational costs in these tables are given in seconds per simulation with 10,000 MCMC iterations. More details on the simulation settings can be found in Table 5. All simulations were run using Matlab on a Desktop PC with a 3.20 GHz Intel Core processor and 8 GB RAM.)

Computational costs of the MCMC inference
In this subsection I briefly discuss the computational costs of the required MCMC simulations. For the proposed HMM-DBN model I used the Matlab software to implement the MCMC algorithm, as outlined in the pseudo code provided in Tables 1, 2 and 3. Since all MCMC simulations on a network with N = 11 nodes could be finished within minutes on a standard Desktop PC, I would expect that the novel HMM-DBN model can also be applied to larger network domains (e.g. with N = 100 nodes) in reasonable time. On the other hand, it certainly has to be taken into account that the number of possible parent sets grows at least polynomially in the number of network nodes N. In this context it is worth mentioning that the MCMC simulations for the HMM-DBN model with node-specific allocation vectors can be run in parallel (e.g. on a computer cluster). That is, as there is no information-sharing among genes, the HMM-DBN model can be applied independently to each gene g to infer its particular parent set π_g and its allocation vector V_g. Given the increasing availability of high-performance computer clusters, I would thus argue that it is not the number of network nodes N but the number of observations T which restricts the applicability of the HMM-DBN model. For example, in modern systems biology applications the number of measured observations T is usually substantially smaller than the number of variables (e.g. genes) N, symbolically T ≪ N, leading to diffuse posterior distributions. Hence, even if an MCMC sampling scheme guaranteed that the huge space of possible network structures could be systematically searched for those networks with "high" posterior probabilities, a lack of significance would have to be expected. I would then recommend reducing the size of the network by restricting attention to the most important variables (e.g. genes). Often biological prior knowledge can be exploited to reduce the network to a reasonable size; e.g. for the Arabidopsis thaliana data (see Sect. 3.3) the focus was set on the nine potentially "most important" circadian clock genes.
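Because there is no information-sharing among genes in the node-specific variant, the per-gene inference described above is embarrassingly parallel. A schematic sketch (the inference function is a stub of my own; a real CPU-bound MCMC run would use separate processes or a cluster rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def infer_gene(g):
    # Stub for the per-gene MCMC inference of the parent set pi_g and the
    # allocation vector V_g; a real implementation would run the sampler
    # outlined in Table 3 on the data for gene g.
    return g, {"parents": set(), "allocation": []}

# N = 11 genes, inferred independently and therefore in parallel.
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(infer_gene, range(11)))
```

The per-gene results can then be combined into the network by collecting the inferred parent sets of all genes.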

Outlook and future work
In this article I proposed a novel non-homogeneous DBN model, namely the HMM-DBN model, for which I assumed that the regulatory network structure, G, is identical for all components (segments). Keeping the network structure constant allows for information-sharing among components (w.r.t. the network topology), and is certainly an appropriate assumption for the two presented real-world applications: (i) the cellular response to a fast environmental change in yeast (see Sect. 5.4) and (ii) the circadian clock network in Arabidopsis thaliana (see Sect. 5.5). For certain other scenarios, e.g. morphogenesis, where the cellular processes take place on a longer time scale, the assumption of a fixed network structure might turn out to be too restrictive. For those applications it might be interesting to allow the network structure to vary with time and to implement the HMM-DBN model with component-specific network structures. This can, in principle, be accomplished straightforwardly, e.g. along the lines proposed and discussed in Lèbre et al. (2010) or Dondelinger et al. (2012).
An alternative extension of the proposed HMM-DBN model can be reached by incorporating the hierarchical global information-coupling scheme proposed in Grzegorczyk and Husmeier (2013). The implementation of a coupled version of the HMM-DBN model is straightforward, and can be beneficial for applications where the component-specific network interaction parameters are similar to each other. For the yeast data (see Sect. 5.4) I incorporated the global information-coupling scheme into the changepoint-segmented DBN model (CPS-DBN) and the proposed HMM-DBN model, and I included these two new model variants into my cross-method comparison. In both cases the coupling did not yield substantially different results: For the coupled CPS-DBN model there was a slight (significant) improvement of the network reconstruction accuracy (see Fig. 10), and for the coupled HMM-DBN model I saw a slight (non-significant) decrease in the network reconstruction accuracy (see main text in Sect. 5.4). From a more general perspective, I would expect the improvement through information-coupling to be less pronounced for the HMM-DBN model than for the CPS-DBN model. With the CPS-DBN model, states once left cannot be revisited, so that information-coupling is required for sharing information between distant time points. Unlike the CPS-DBN model, the proposed HMM-DBN model explicitly allows distant time points to be allocated to the same component and to share the same network interaction parameters.

Conclusion
I have proposed a novel non-homogeneous dynamic Bayesian network (DBN) model, which combines a conventional DBN with a Hidden Markov model (HMM). The key idea behind this HMM-DBN model is to assume that the temporal data points of a time series are allocated to different states (components) by a HMM. A graphical representation of the HMM-DBN model is provided in Table 2. My work complements earlier works which combined DBN models either with multiple changepoint processes (CPS-DBN) or with free allocation mixture (MIX-DBN) models; see Sect. 1 for various literature references. The CPS-DBN models, on the one hand, employ a multiple changepoint process to divide a time series into temporal segments with a one-to-one mapping between segments and states (components): All data points within a segment are assigned to the same state, but data points from different segments have to be allocated to different states; i.e. "a state (component) once left cannot be revisited". This imposes a very strong restriction onto the configuration space of the possible data segmentations. The MIX-DBN model, on the other hand, allows for an unrestricted free allocation of the data points to states (mixture components) but loses important information about the data, since it cannot take the temporal ordering of the data points into account.
The novel HMM-DBN model is a consensus between the CPS-DBN and the MIX-DBN model, as it does take the temporal structure of the data into account without putting any restriction onto the configuration space of the data segmentations. The novel HMM-DBN model can be inferred with two different Reversible Jump Markov Chain Monte Carlo (RJMCMC) techniques, as briefly discussed in Sect. 2.5. In this paper I have shown how the allocation sampler from Nobile and Fearnside (2007) can be used for inference, and in Sects. 2.5.1-2.5.2 I have proposed two new pairs of complementary moves to improve mixing and convergence of the allocation sampler. Pseudo code of the proposed MCMC sampling scheme is provided in Table 3.
In Sect. 5.2 I have performed an extensive comparative evaluation study on synthetic RAF-pathway data to provide empirical evidence that the proposed HMM-DBN model is a consensus between the MIX-DBN and the CPS-DBN model. A brief overview of the four competing DBN models, which I cross-compared in my evaluation study, is given in Table 4. In my study I considered various segmentation scenarios, as listed in Table 5. For scenarios where the CPS-DBN model performed significantly better than the MIX-DBN model, and vice versa, I found that the performance of the proposed HMM-DBN model was always very close to (and only rarely significantly different from) the performance of the better-scoring DBN model. For scenarios with periodic segmentations the proposed HMM-DBN model outperformed both the MIX-DBN model and the CPS-DBN model. I have also cross-compared the learning performances of the four DBN models on two real-world applications from systems biology (see Sects. 5.4 and 5.5). My cross-method comparison for the real-world data also confirmed that the proposed HMM-DBN model is a consensus between the CPS-DBN and the MIX-DBN model. For a non-homogeneous yeast gene expression time series, which consists of two ("changepoint-divided") temporal segments related to two different carbon sources, the free allocation MIX-DBN model failed to reconstruct the underlying network, while the coupled CPS-DBN model (Grzegorczyk and Husmeier 2013) and the proposed HMM-DBN model performed substantially better; see Sect. 5.4 for details.
For a non-homogeneous Arabidopsis gene expression time series, in which the regulatory processes are diurnal and periodic, i.e. the processes follow recurrent entrained dark:light cycles, the CPS-DBN models have failed to capture the underlying data segmentation, while the MIX-DBN and the proposed HMM-DBN model both inferred a segmentation, which is in agreement with plant biology; see Sects. 3.3 and 5.5 for details. I have used the results of the HMM-DBN model to reconstruct the network among the circadian genes in Arabidopsis. As discussed in Sect. 5.5, the reconstructed network shows features that are consistent with the biological literature.
As this move does not change the number of states, set: K_g^(i) = K_g^(i−1).

Appendix 2: The mixture model allocation sampler (MIX) moves
The MIX MCMC moves, presented in this appendix, have been developed by Nobile and Fearnside (2007) for Gaussian mixture models. Nobile and Fearnside (2007) proposed the resulting "allocation sampler" as an alternative to computationally expensive Reversible Jump Markov Chain Monte Carlo sampling schemes (Green 1995); see Nobile and Fearnside (2007) for details.

The M1 move
If the number of states is currently equal to one, K_g^(i−1) = 1, skip the move. Otherwise randomly select two states k and k̃ among the K_g^(i−1) available, and draw a random number p̃ from a Beta(a, a) distribution with a = 1. Consider the set H = {t : V_g^(i−1)(t) = k or V_g^(i−1)(t) = k̃} of all data points that are allocated either to state k or to state k̃ by V_g^(i−1). Reallocate each point of the set H either to state k (with probability p̃) or to state k̃ (with probability 1 − p̃). This gives a new candidate allocation vector, V*_g, which is accepted with the probability given in Eq. (37). The prior probabilities and the marginal likelihood terms can be computed with Eqs. (12), (23) and (16). Nobile and Fearnside (2007) show that the Hastings ratio depends only on n_k and n_k̃, the numbers of data points that are allocated to the states k and k̃ by V_g^(i−1), and on n*_k and n*_k̃, the numbers of data points that are allocated to the states k and k̃ by V*_g. If the move is accepted, set V_g^(i) = V*_g; otherwise set V_g^(i) = V_g^(i−1). As the move cannot change the number of states, set: K_g^(i) = K_g^(i−1).
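For illustration, the proposal part of the M1 move can be sketched as follows. The acceptance step (Eq. (37)) is omitted, and the function name and the convention that states are labelled 1, ..., K are mine:

```python
import random

def propose_m1(V, K, a=1.0):
    # M1 proposal: pick two states k and k_tilde, draw p ~ Beta(a, a), and
    # re-allocate every data point of H = {t : V[t] in {k, k_tilde}} to state k
    # with probability p and to state k_tilde otherwise.
    if K == 1:
        return None                       # move is skipped
    k, k_tilde = random.sample(range(1, K + 1), 2)
    p = random.betavariate(a, a)
    W = list(V)
    for t, state in enumerate(V):
        if state in (k, k_tilde):         # the set H of affected data points
            W[t] = k if random.random() < p else k_tilde
    return W

candidate = propose_m1([1, 1, 2, 2, 1, 2], K=2)
```

In a full sampler the candidate would then be accepted or rejected with the Metropolis-Hastings probability built from the prior, marginal likelihood, and Hastings ratios described above.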

The M2 move
If the number of states is currently equal to one, K_g^(i−1) = 1, skip the move. Otherwise randomly select two states k and k̃ among the K_g^(i−1) available. If the kth component is empty, the move fails outright. Otherwise draw a random number u from a uniform distribution on {1, ..., n_k}, where n_k is the number of data points t with V_g^(i−1)(t) = k. Randomly select u observations from the n_k data points and re-allocate them to state k̃ to obtain the new candidate allocation vector V*_g. The new allocation vector is accepted with the probability given in Eq. (37), except that the Hastings ratio is different. As shown in Nobile and Fearnside (2007), the Hastings ratio depends only on u and on n_k and n_k̃, the numbers of data points allocated to the states k and k̃ by V_g^(i−1). If the move is accepted, set V_g^(i) = V*_g; otherwise set V_g^(i) = V_g^(i−1). As the M2 move cannot change the number of states, set: K_g^(i) = K_g^(i−1).

The EA (ejection/absorption) moves
If K_g^(i−1) = 1, then an ejection move has to be performed. If K_g^(i−1) = K_MAX, then an absorption move has to be performed. For K_g^(i−1) ∈ {2, ..., K_MAX − 1} the move type (ejection or absorption) is randomly drawn.

The ejection move
Randomly select a state k ∈ {1, ..., K_g^(i−1)}. Make a draw p_E from a Beta(a, a) distribution and re-allocate each data point allocated to component k by V_g^(i−1) with probability p_E to a new state with label K_g^(i−1) + 1 to obtain the new candidate allocation vector V*_g. The new number of states, associated with V*_g, is K*_g = K_g^(i−1) + 1. The acceptance probability is A = min{1, R}, where R is the product of the prior ratio P(V*_g | K*_g) P(K*_g) / [P(V_g^(i−1) | K_g^(i−1)) P(K_g^(i−1))], the marginal likelihood ratio P(y_{g,V*_g} | X_{g,V*_g}, δ_g) / P(y_{g,V_g^(i−1)} | X_{g,V_g^(i−1)}, δ_g), and the Hastings ratio. Nobile and Fearnside (2007) show that the Hastings ratio depends on n_k, the number of observations allocated to the kth state by V_g^(i−1), on the numbers n*_k and n*_k̃ of data points allocated by V*_g to state k and to the ejected state k̃ = K_g^(i−1) + 1, and on a factor p*_E. The factor p*_E is equal to one for K_g^(i−1) ∈ {2, ..., K_MAX − 2}, while p*_E = 0.5 for K_g^(i−1) = 1, and p*_E = 2 for K_g^(i−1) = K_MAX − 1. If the move is accepted, set V_g^(i) = V*_g and K_g^(i) = K_g^(i−1) + 1; otherwise set V_g^(i) = V_g^(i−1) and K_g^(i) = K_g^(i−1).

The exclusion move

Consider the sequence s_E + 1, ..., t_E − 1, where s_E and t_E are the demarcating time points defined in Eqs. (42-43). It follows from Eqs. (42-43) that all data points in the sequence are currently allocated to state k. The length of this sequence is L_E = t_E − s_E − 1.
If $L_E < 3$, skip the move. Otherwise, draw a random number $u_1$ from the set $\{1, \ldots, L_E - 2\}$, and subsequently a random number $u_2$ from the set $\{0, \ldots, L_E - 2 - u_1\}$. Here $u_1$ can be interpreted as the "subsequence length" and $u_2$ as the "lag", since the exclusion move proposes to re-allocate the data points $s_E + u_2 + 2, \ldots, s_E + u_2 + 1 + u_1$ to a new state $\tilde{k}$, where $\tilde{k} \neq k$ is randomly drawn from all $K_g^{(i-1)} - 1$ states unequal to $k$. For the new candidate allocation vector, $V_g^{\star}$, this yields: $V_g^{\star}(t) = \tilde{k}$ if $t \in \{s_E + u_2 + 2, \ldots, s_E + u_2 + 1 + u_1\}$, and $V_g^{\star}(t) = V_g^{(i-1)}(t)$ for $t \notin \{s_E + u_2 + 2, \ldots, s_E + u_2 + 1 + u_1\}$. The proposal probability is given by:
$$Q(V_g^{\star}|V_g^{(i-1)}) = \frac{L_E}{T-1} \cdot \frac{1}{L_E - 2} \cdot \frac{1}{L_E - 1 - u_1} \cdot \frac{1}{K_g^{(i-1)} - 1} \quad (44)$$
The first factor is the probability of selecting one point of the sequence $s_E + 1, \ldots, t_E - 1$ of length $L_E$, the second and the third factors are the probabilities for selecting $u_1$ and $u_2$, respectively, and the last factor is the probability for selecting $\tilde{k} \neq k$.
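The exclusion proposal of Eq. (44) and the construction of the candidate allocation vector can be sketched in Python as follows (a minimal sketch; the function names and the dict representation of the allocation vector are illustrative assumptions):

```python
def exclusion_proposal_prob(L_E, u1, K, T):
    """Proposal probability of the exclusion move, Eq. (44):
    any of the L_E run points among the T-1 selectable time points
    yields this move, then u1, then u2, then the new state k_tilde
    among the K-1 states unequal to k."""
    return (L_E / (T - 1)) * (1 / (L_E - 2)) \
        * (1 / (L_E - 1 - u1)) * (1 / (K - 1))

def exclusion_candidate(V, s_E, u1, u2, k_tilde):
    """Re-allocate the points s_E+u2+2, ..., s_E+u2+1+u1 to state k_tilde."""
    V_new = dict(V)
    for t in range(s_E + u2 + 2, s_E + u2 + 2 + u1):
        V_new[t] = k_tilde
    return V_new
```

For example, with $T = 11$, $L_E = 5$, $u_1 = 2$ and $K_g^{(i-1)} = 3$, Eq. (44) gives $\frac{5}{10}\cdot\frac{1}{3}\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{24}$.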

The inclusion move
Randomly select one time point $t_0 \in \{2, \ldots, T\}$, and consider the state $k := V_g^{(i-1)}(t_0)$ to which the selected time point is currently allocated. Determine the highest time point $s_I \in \{2, \ldots, t_0 - 1\}$ that is not allocated to state $k$:
$$s_I = \max\{t \in \{2, \ldots, t_0 - 1\} : V_g^{(i-1)}(t) \neq k\} \quad (45)$$
If $s_I$ is not well-defined, skip the move. Otherwise, determine the lowest time point $t_I \in \{t_0 + 1, \ldots, T\}$ that is not allocated to state $k$:
$$t_I = \min\{t \in \{t_0 + 1, \ldots, T\} : V_g^{(i-1)}(t) \neq k\} \quad (46)$$
If $t_I$ is not well-defined, skip the inclusion move. Only if $s_I$ and $t_I$ are both well-defined, test whether $V_g^{(i-1)}(s_I)$ is equal to $V_g^{(i-1)}(t_I)$. If this "equal boundaries" test fails, skip the inclusion move. If the test is successful, it holds: $V_g^{(i-1)}(t) = k$ for $t \in \{s_I + 1, \ldots, t_I - 1\}$ and $V_g^{(i-1)}(s_I) = V_g^{(i-1)}(t_I) =: \tilde{k}$, where $\tilde{k} \neq k$. The inclusion move proposes to re-allocate all time points $t \in \{s_I + 1, \ldots, t_I - 1\}$ to state $\tilde{k}$, i.e. to the state of the surrounding time points $s_I$ and $t_I$. This yields for the new candidate allocation vector, $V_g^{\star}$: for $t = 2, \ldots, T$ set $V_g^{\star}(t) = \tilde{k}$ if $t \in \{s_I + 1, \ldots, t_I - 1\}$, or otherwise set $V_g^{\star}(t) = V_g^{(i-1)}(t)$. The proposal probability is the probability of selecting one point $t_0$ of the sequence $s_I + 1, \ldots, t_I - 1$ of length $L_I = t_I - s_I - 1$:
$$Q(V_g^{\star}|V_g^{(i-1)}) = \frac{L_I}{T-1} \quad (47)$$
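The inclusion move, including the two boundary determinations of Eqs. (45-46) and the "equal boundaries" test, can be sketched in Python as follows (a minimal sketch; the function name and the dict representation of the allocation vector over $t = 2, \ldots, T$ are illustrative assumptions):

```python
def propose_inclusion(V, T, t0):
    """Attempt the inclusion move for a selected time point t0.
    V maps t = 2,...,T to state labels. Returns (V_new, L_I), or None
    if the move is skipped."""
    k = V[t0]
    # highest earlier time point not allocated to k, Eq. (45)
    left = [t for t in range(2, t0) if V[t] != k]
    if not left:
        return None                  # s_I not well-defined: skip
    s_I = max(left)
    # lowest later time point not allocated to k, Eq. (46)
    right = [t for t in range(t0 + 1, T + 1) if V[t] != k]
    if not right:
        return None                  # t_I not well-defined: skip
    t_I = min(right)
    if V[s_I] != V[t_I]:
        return None                  # "equal boundaries" test fails: skip
    k_tilde = V[s_I]
    V_new = dict(V)
    for t in range(s_I + 1, t_I):    # re-allocate the whole inner run
        V_new[t] = k_tilde
    return V_new, t_I - s_I - 1      # candidate allocation and L_I
```

For $T = 8$, $V = (1, 2, 2, 2, 1, 3, 3)$ over $t = 2, \ldots, 8$ and $t_0 = 4$, the move finds $s_I = 2$, $t_I = 6$, passes the test, and re-allocates $t = 3, 4, 5$ to state $\tilde{k} = 1$, with $L_I = 3$ and proposal probability $3/7$ by Eq. (47).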

Complementary inclusion move for the exclusion move
Consider the exclusion move from $V_g^{(i-1)}$ to $V_g^{\star}$, described above. The data points in the sequence $s_E + u_2 + 2, \ldots, s_E + u_2 + 1 + u_1$, which were originally allocated to state $k$, have been re-allocated to state $\tilde{k}$. The design of the exclusion move ensures that the new candidate vector, $V_g^{\star}$, still allocates the two surrounding time points to state $k$: $V_g^{\star}(s_E + u_2 + 1) = k = V_g^{\star}(s_E + u_2 + 2 + u_1)$. The complementary move, which proposes to move back from $V_g^{\star}$ to $V_g^{(i-1)}$, is the inclusion move, which re-allocates the sequence $s_E + u_2 + 2, \ldots, s_E + u_2 + 1 + u_1$ back to state $k$. To this end, the complementary inclusion move has to select a point $t_0 \in \{s_E + u_2 + 2, \ldots, s_E + u_2 + u_1 + 1\}$. It follows that $s_I = s_E + u_2 + 1$ and $t_I = s_E + u_2 + u_1 + 2$ in Eqs. (45-46) are well-defined, and it is guaranteed that the "equal boundaries" test, $V_g^{\star}(s_E + u_2 + 1) = k = V_g^{\star}(s_E + u_2 + u_1 + 2)$ with $k \neq \tilde{k}$, is successful. Thus, the complementary inclusion move has the proposal probability:
$$Q(V_g^{(i-1)}|V_g^{\star}) = \frac{u_1}{T-1} \quad (48)$$
where $u_1$ is the subsequence length parameter, which has been randomly drawn during the exclusion move. Hence, according to the Metropolis-Hastings criterion, the exclusion move, described above, is accepted with probability $A = \min\{1, R\}$, where
$$R = \frac{P(V_g^{\star}|K_g^{(i-1)}) \, P(\mathbf{y}_{g,V_g^{\star}}|\mathbf{X}_{g,V_g^{\star}}, \delta_g)}{P(V_g^{(i-1)}|K_g^{(i-1)}) \, P(\mathbf{y}_{g,V_g^{(i-1)}}|\mathbf{X}_{g,V_g^{(i-1)}}, \delta_g)} \times \frac{Q(V_g^{(i-1)}|V_g^{\star})}{Q(V_g^{\star}|V_g^{(i-1)})}$$
The likelihood ratio can be computed with Eq. (12), and the Hastings ratio can be computed with Eqs. (44) and (48).
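The Hastings ratio of the exclusion move, i.e. the ratio of Eq. (48) over Eq. (44), can be computed as follows (a minimal Python sketch; the function name is an illustrative choice, and the prior and likelihood ratios entering $R$ are deliberately left out):

```python
def exclusion_hastings_ratio(L_E, u1, K, T):
    """Hastings ratio of the exclusion move: the complementary inclusion
    proposal probability, Eq. (48), divided by the exclusion proposal
    probability, Eq. (44)."""
    q_forward = (L_E / (T - 1)) / ((L_E - 2) * (L_E - 1 - u1) * (K - 1))
    q_reverse = u1 / (T - 1)        # complementary inclusion, Eq. (48)
    return q_reverse / q_forward
```

Note that the factor $1/(T-1)$ cancels, so the Hastings ratio simplifies to $u_1 (L_E - 2)(L_E - 1 - u_1)(K_g^{(i-1)} - 1) / L_E$ and does not depend on $T$.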

Complementary exclusion move for the inclusion move
Consider the inclusion move from $V_g^{(i-1)}$ to $V_g^{\star}$, described above. The data points in the sequence $s_I + 1, \ldots, t_I - 1$, which were allocated to state $k$, have been re-allocated to state $\tilde{k}$, and the design of the inclusion move ensures that the new candidate vector, $V_g^{\star}$, allocates the surrounding time points to state $\tilde{k}$ as well: $V_g^{\star}(s_I) = \tilde{k} = V_g^{\star}(t_I)$.
The complementary move, which proposes to move back from $V_g^{\star}$ to $V_g^{(i-1)}$, is the exclusion move, which re-allocates the subsequence $s_I + 1, \ldots, t_I - 1$ of length $L_I = t_I - s_I - 1$ back to state $k$. To this end, the complementary exclusion move has to select one single point $t_0$ out of the sequence $s_E^C + 1, \ldots, t_E^C - 1$, where
$$s_E^C := \max\{t \in \{2, \ldots, s_I - 1\} : V_g^{\star}(t) \neq \tilde{k}\} \quad (50)$$
and $s_E^C = 1$ if $s_E^C$ is not well-defined, and
$$t_E^C := \min\{t \in \{t_I + 1, \ldots, T\} : V_g^{\star}(t) \neq \tilde{k}\} \quad (51)$$
and $t_E^C = T + 1$ if $t_E^C$ is not well-defined. Having selected $t_0$, the "subsequence length" $u_1 := L_I$ and the "lag" $u_2 := s_I - s_E^C - 1$ have to be sampled out of the sets $\{1, \ldots, L_E^C - 2\}$ and $\{0, \ldots, L_E^C - 2 - u_1\}$, respectively, where $L_E^C := t_E^C - s_E^C - 1$ is the length of the sequence $s_E^C + 1, \ldots, t_E^C - 1$. Finally, the complementary exclusion move has to randomly draw the state $k$ from all $K_g^{(i-1)} - 1$ states unequal to $\tilde{k}$. Thus, the complementary exclusion move has the proposal probability:
$$Q(V_g^{(i-1)}|V_g^{\star}) = \frac{L_E^C}{T-1} \cdot \frac{1}{L_E^C - 2} \cdot \frac{1}{L_E^C - 1 - u_1} \cdot \frac{1}{K_g^{(i-1)} - 1} \quad (52)$$
Hence, according to the standard Metropolis-Hastings criterion, the inclusion move, described above, is accepted with probability $A = \min\{1, R\}$, where
$$R = \frac{P(V_g^{\star}|K_g^{(i-1)}) \, P(\mathbf{y}_{g,V_g^{\star}}|\mathbf{X}_{g,V_g^{\star}}, \delta_g)}{P(V_g^{(i-1)}|K_g^{(i-1)}) \, P(\mathbf{y}_{g,V_g^{(i-1)}}|\mathbf{X}_{g,V_g^{(i-1)}}, \delta_g)} \times \frac{Q(V_g^{(i-1)}|V_g^{\star})}{Q(V_g^{\star}|V_g^{(i-1)})}$$
The likelihood ratio can be computed with Eq. (12), and the Hastings ratio can be computed with Eqs. (47) and (52).
that are currently allocated to $k_1$ and $k_2$: