Model averaging for sparse seemingly unrelated regression using Bayesian networks among the errors

Multivariate Bayesian linear regression (MBLR) is a popular statistical tool with many applications in a variety of scientific fields. However, a shortcoming is potential model over-complexity, as the model assumes that all responses depend on the same covariates and that all errors are mutually pairwise correlated. The class of Bayesian seemingly unrelated regression (SUR) models generalizes the class of MBLR models by allowing for response-specific covariate sets. In a recent work it has been proposed to employ Gaussian graphical models for learning sparse SUR (SSUR) models with conditional independencies among the errors. The proposed SSUR model infers undirected edges among the errors, and the proposed Reversible Jump Markov Chain Monte Carlo (RJMCMC) inference algorithm relies on approximations of the marginal likelihoods. In this paper, we propose a new refined SSUR model that replaces the undirected graphs (Gaussian graphical models) by directed acyclic graphs (Gaussian Bayesian networks). Unlike the earlier proposed model, our new model is therefore able to learn some directed edges among the errors. We also derive an RJMCMC algorithm that does not require approximations of the marginal likelihoods. In particular, we present an algorithm for sampling covariance matrices that are coherent with a given directed acyclic graph. The proposed RJMCMC algorithm allows for exact Bayesian model averaging across both the response-specific covariate sets and the directed acyclic graphs.


Introduction
Multivariate Bayesian linear regression (MBLR) is an important statistical tool for modelling the relationship between covariates $z_1, \dots, z_N$ and a continuous multivariate response vector $\mathbf{y} = (y_1, \dots, y_S)^{\top}$. The idea is to impose a univariate linear regression model for each element $y_s$ of $\mathbf{y}$ and to infer the covariance matrix among the response-specific errors $e_1, \dots, e_S$. Because of the correlated errors, there is information-sharing among the univariate models. MBLR was proposed by Tiao and Zellner (1964) and Geisser (1965), and detailed descriptions can be found in many textbooks, e.g. in Sect. 2.8.5 of Rossi et al. (2005). Recent applications range from the earth (Arroyo and Ordaz 2010) and climate (Seidou et al. 2007) sciences via geophysics (Talarico et al. 2017) to quality control (Ahmadi Yazdi et al. 2019). In other works it was proposed to replace the inverse Wishart prior of the covariance matrix by other priors, such as a multivariate Gaussian prior for the unique elements of the matrix logarithm of the covariance matrix (Sinay and Hsu 2014), or a non-informative Jeffreys prior (Saputro et al. 2018). In some scientific fields MBLR appears in the form of vector autoregressive (VAR) models. VAR models are popular tools for modelling multivariate time series, and they are effectively a special case of MBLR models, where the observations $\mathbf{y}_t \in \mathbb{R}^S$ at the current time point $t$ are the responses and the observations $\mathbf{y}_{t-1}, \dots, \mathbf{y}_{t-\tau} \in \mathbb{R}^S$ at the $\tau$ previous time points are the $N = S \cdot \tau$ covariates. Applications of VAR models can be found in econometrics (Banbura et al. 2010), in the behavioural sciences (Beltz and Molenaar 2016) and in neuroimaging (Chiang et al. 2017). The above works use 'full models' that feature all $N \cdot S$ possible covariate-response interactions. To compensate for the over-complexity, Bayesian parameter shrinkage is applied, but shrinkage priors are often inferior to Bayesian model averaging (BMA); see, e.g., Hoeting et al. (1999), Wasserman (2000) and Fragoso et al. (2018).
Seemingly unrelated regression (SUR) models (Zellner 1962; Zellner and Huang 1962) avoid the over-complexity of MBLR by allowing for response-specific covariate sets. But unlike for MBLR models, the inference for SUR models is challenging, and the proposed approaches range from Reversible Jump MCMC (RJMCMC) techniques (Green 1995) for sampling response-specific covariate sets (Holmes et al. 2002) to direct Monte Carlo approaches for sampling the model parameters (Zellner and Ando 2010). A promising sparse SUR (SSUR) model has been proposed by Wang (2010). The SSUR model from Wang imposes a spike-and-slab prior on the regression coefficients and uses Gaussian graphical models (Lauritzen 1996) to model conditional independencies among the errors. A disadvantage of the SSUR model is that the marginal likelihoods that are required for the RJMCMC inference algorithm in the model space (covariate sets and undirected graphs) cannot be computed analytically, so that the algorithm has to rely on marginal likelihood approximations. Wang proposes to use Chib's method (Chib 1995) to approximate the marginal likelihoods by MCMC techniques. This leads to inflated computational costs, since every RJMCMC step in the model space requires separate MCMC simulations for approximating the model-specific marginal likelihoods. The intention of our work is to improve the SSUR model from Wang (2010) in three important respects:
(i) We do not use a spike-and-slab prior for the regression coefficients. Instead we apply Bayesian model averaging (Holmes et al. 2002) and sample response-specific covariate sets from the posterior distribution.
(ii) For modelling the conditional independencies among the errors we replace the undirected graphs (Gaussian graphical models) by directed acyclic graphs (Gaussian Bayesian networks). Unlike the earlier proposed SSUR model, our model can therefore learn a mixture of directed and undirected edges among the errors; cf. Sect. 2.5 for more details.
(iii) For the new model we derive an RJMCMC algorithm that does not need marginal likelihood approximations. We achieve this by combining Bayesian model averaging (BMA) across the response-specific covariate sets along the lines of Raftery et al. (1997) and Holmes et al. (2002) with BMA across all possible error Bayesian networks (directed acyclic graphs) along the lines of Giudici and Castelo (2003).
We show that the required marginal likelihoods and full conditional distributions have analytic solutions. An important methodological contribution of our work is that we show how to sample covariance matrices that are coherent with a given directed acyclic graph (DAG); cf. Sect. 2.4.2.
The remainder of this paper is organised as follows. Section 2 is on the mathematical background. We briefly review the traditional MBLR in Sect. 2.1 and the SUR model in Sect. 2.2. In Sect. 2.3 we present the new sparse SUR (SSUR) model, and Sect. 2.4 is devoted to the prior distributions and the proposed RJMCMC sampling scheme. Sections 2.5 and 2.6 are on Bayesian model averaging and predictive probabilities, respectively. In Sect. 2.7 we distinguish six model variants that we cross-compare in our empirical studies. Section 3 provides technical and implementation details. The synthetic data and the three real-world data sets, on which we cross-compare the model variants, are described in Sect. 4. In Sect. 5 we perform the cross-method comparison, before we conclude with a discussion in Sect. 6.

Traditional multivariate Bayesian linear regression
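The traditional multivariate Bayesian linear regression (MBLR) model (Tiao and Zellner 1964; Geisser 1965) assumes a set of S regression models that are related by common covariates and correlated errors. There are S regression equations

$$y_s = \mathbf{x}^{\top} \boldsymbol{\beta}_s + e_s \quad (s = 1, \dots, S) \tag{1}$$

where $y_s$ is the s-th response, $\mathbf{x} \in \mathbb{R}^N$ is the shared vector of covariate values, and $\boldsymbol{\beta}_s \in \mathbb{R}^N$ and $e_s$ are the s-th regression (coefficient) vector and error, respectively. An S-dimensional Gaussian distribution with zero mean vector $\mathbf{0} \in \mathbb{R}^S$ and covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{S,S}$ is assumed for the error vector:

$$\mathbf{e} := (e_1, \dots, e_S)^{\top} \sim N_S(\mathbf{0}, \boldsymbol{\Sigma}) \tag{2}$$

Given observations, $\{y_{1,t}, \dots, y_{S,t}, \mathbf{x}_t\}_{t=1,\dots,T}$, one can compactly write

$$\mathbf{Y} = \mathbf{X} \mathbf{B} + \mathbf{E} \tag{3}$$

where $\mathbf{B} := (\boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_S) \in \mathbb{R}^{N,S}$ is the matrix of regression coefficients, the t-th row of $\mathbf{Y} \in \mathbb{R}^{T,S}$ is $(y_{1,t}, \dots, y_{S,t})$ with $y_{s,t}$ being the t-th value of response $y_s$, and the t-th row of $\mathbf{X} \in \mathbb{R}^{T,N}$ is $\mathbf{x}_t^{\top}$ with $\mathbf{x}_t \in \mathbb{R}^N$ being the covariate vector of observation t. By using a matrix-variate Gaussian distribution, Eq. (3) can be written as:

$$\mathbf{Y} \mid \mathbf{B}, \boldsymbol{\Sigma} \sim MN_{T,S}(\mathbf{X}\mathbf{B}, \mathbf{I}, \boldsymbol{\Sigma})$$

The random matrix $\mathbf{E} \in \mathbb{R}^{T,S}$, whose elements are the errors $e_{s,t} := y_{s,t} - \mathbf{x}_t^{\top} \boldsymbol{\beta}_s$, has a matrix-variate Gaussian distribution

$$\mathbf{E} \sim MN_{T,S}(\mathbf{0}, \mathbf{I}, \boldsymbol{\Sigma}) \;\Leftrightarrow\; \mathrm{vec}(\mathbf{E}) \sim N_{T \cdot S}(\mathrm{vec}(\mathbf{0}), \boldsymbol{\Sigma} \otimes \mathbf{I})$$

where $\mathbf{0}$ is a zero matrix, $\mathbf{I}$ is the identity matrix, vec(.) is the vectorisation operator that stacks the columns, and $\otimes$ is the Kronecker product. For each row $\mathbf{e}_t := (e_{1,t}, \dots, e_{S,t})$ of $\mathbf{E}$ we then have $\mathbf{e}_t \sim N_S(\mathbf{0}, \boldsymbol{\Sigma})$. For $\boldsymbol{\Sigma}$ an inverse Wishart distribution with positive definite scale matrix $\mathbf{V} \in \mathbb{R}^{S,S}$ and $\alpha > S + 1$ degrees of freedom is used, $\boldsymbol{\Sigma} \sim W^{-1}_S(\mathbf{V}, \alpha)$. Given $\boldsymbol{\Sigma}$, a matrix-variate Gaussian prior is used for $\mathbf{B}$:

$$\mathbf{B} \mid \boldsymbol{\Sigma} \sim MN_{N,S}(\mathbf{B}_0, \mathbf{A}_0^{-1}, \boldsymbol{\Sigma})$$

where the expectation matrix $\mathbf{B}_0 \in \mathbb{R}^{N,S}$ and the 'scale' matrix $\mathbf{A}_0^{-1} \in \mathbb{R}^{N,N}$ are hyperparameters. Sampling from the posterior distribution, $p(\mathbf{B}, \boldsymbol{\Sigma} \mid \mathcal{D})$, is computationally convenient, since the fully conjugate prior allows $p(\boldsymbol{\Sigma} \mid \mathcal{D})$ to be computed in closed form, so that $\boldsymbol{\Sigma}$ can be sampled by a collapsed Gibbs sampling step (marginalised over $\mathbf{B}$), before $\mathbf{B}$ is sampled from its full conditional distribution $p(\mathbf{B} \mid \boldsymbol{\Sigma}, \mathcal{D})$. For the analytic closed-form solutions of this Gibbs sampling scheme we refer to Sect. 2.8.5 in Rossi et al. (2005).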

Seemingly unrelated regression (SUR) modelling
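Allowing for response-specific covariate sets, $\mathcal{Z}_1, \dots, \mathcal{Z}_S \subset \{z_1, \dots, z_N\}$, yields the so-called 'seemingly unrelated regression' (SUR) model from Zellner (1962) and Zellner and Huang (1962), in which Eq. (1) becomes:

$$y_s = \mathbf{x}_s^{\top} \boldsymbol{\beta}_s + e_s \quad (s = 1, \dots, S) \tag{7}$$

where $\mathbf{x}_s \in \mathbb{R}^{k_s}$ contains the values of the covariates in $\mathcal{Z}_s$, and $\boldsymbol{\beta}_s \in \mathbb{R}^{k_s}$. One then gets in replacement of Eq. (3):

$$\mathbf{y}_t = \mathbf{X}_t \boldsymbol{\beta} + \mathbf{e}_t \quad (t = 1, \dots, T) \tag{8}$$

where

$$\boldsymbol{\beta} := (\boldsymbol{\beta}_1^{\top}, \dots, \boldsymbol{\beta}_S^{\top})^{\top} \in \mathbb{R}^{k} \tag{9}$$

staples the vectors $\boldsymbol{\beta}_s \in \mathbb{R}^{k_s}$, so that $k := \sum_{s=1}^{S} k_s$, and

$$\mathbf{X}_t := \mathrm{blockdiag}(\mathbf{x}_{t,1}^{\top}, \dots, \mathbf{x}_{t,S}^{\top}) \in \mathbb{R}^{S,k} \tag{10}$$

is a block design matrix, in which $\mathbf{x}_{t,s} \in \mathbb{R}^{k_s}$ contains the values of the covariates in $\mathcal{Z}_s$ from observation t. Optionally, we can extend each $\mathbf{x}_{t,s}$ by an initial '1' element for the intercept. For $\boldsymbol{\beta}_s \in \mathbb{R}^{k_s}$ we thus either have $k_s = |\mathcal{Z}_s|$ (without intercept) or $k_s = |\mathcal{Z}_s| + 1$ (with intercept).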
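The following is a minimal sketch of how a block design matrix and a stapled regression vector combine into the S response means of one observation. The function names, the 0-based covariate indices and the toy numbers are illustrative assumptions and are not taken from the Matlab implementation of this work.

```python
def block_design_matrix(x_t, covariate_sets, intercept=True):
    """Build the S-by-k block design matrix for one observation.

    x_t            : list with the N covariate values of observation t
    covariate_sets : list of S lists holding 0-based covariate indices
    """
    blocks = []
    for idx in covariate_sets:
        # optional leading 1 for the intercept, then the selected covariates
        blocks.append(([1.0] if intercept else []) + [x_t[i] for i in idx])
    k = sum(len(b) for b in blocks)
    X_t, offset = [], 0
    for block in blocks:
        row = [0.0] * k
        row[offset:offset + len(block)] = block   # place block on the "diagonal"
        X_t.append(row)
        offset += len(block)
    return X_t


def response_means(X_t, beta):
    """Matrix-vector product X_t * beta (one mean per response)."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X_t]


x_t = [0.5, -1.0, 2.0]             # N = 3 covariate values
covariate_sets = [[0, 2], [1]]     # response 1 uses z_1 and z_3, response 2 uses z_2
X_t = block_design_matrix(x_t, covariate_sets)
beta = [1.0, 2.0, 0.5, -1.0, 3.0]  # stapled vector, k = 3 + 2 (with intercepts)
means = response_means(X_t, beta)
```

Each row of `X_t` is zero outside its own block, so response s only "sees" its own coefficients, which is what makes the covariate sets response-specific.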

The refined sparse SUR (SSUR) model
We identify the S response-specific covariate sets, $\mathcal{Z}_1, \dots, \mathcal{Z}_S \subset \{z_1, \dots, z_N\}$, with an N-by-S 'covariate matrix' $D$. The elements of $D$ indicate all covariate-response interactions: if $z_i \in \mathcal{Z}_s$ we set $D_{i,s} = 1$, and for $z_i \notin \mathcal{Z}_s$ we set $D_{i,s} = 0$. Moreover, we use Gaussian Bayesian networks (Geiger and Heckerman 2002; Kuipers et al. 2014) to model the conditional independencies among the errors. Let $G$ denote the S-by-S adjacency matrix of a directed acyclic graph (DAG) among $e_1, \dots, e_S$. There is a directed edge $e_i \to e_j$ if $G_{i,j} = 1$, and there is no edge from $e_i$ to $e_j$ if $G_{i,j} = 0$. We define $\boldsymbol{\pi}_j$ as the parent set of $e_j$, so that $e_i \in \boldsymbol{\pi}_j \Leftrightarrow G_{i,j} = 1$. We refer to $G$ as the 'error DAG'. Given $G$, the joint distribution of the errors can be factorised into univariate conditional distributions:

$$p(e_1, \dots, e_S \mid G) = \prod_{s=1}^{S} p(e_s \mid \boldsymbol{\pi}_s) \tag{11}$$

Each model $\mathcal{M}$ thus consists of two components: the S-by-S error DAG $G$ and the N-by-S covariate matrix $D$. We write $\mathcal{M} = (G, D)$. While $G$ has to be a DAG, there is no restriction on $D$. The model parameters are the covariance matrix $\boldsymbol{\Sigma}_G$ and the regression vector $\boldsymbol{\beta}_D$, where the subscripts indicate that the parameters must be coherent with $G$ and $D$, respectively. For a given model $\mathcal{M} = (G, D)$, we can rewrite Eq. (8) as:

$$\mathbf{y}_t = \mathbf{X}_{t,D} \boldsymbol{\beta}_D + \mathbf{e}_t, \quad \mathbf{e}_t \sim N_S(\mathbf{0}, \boldsymbol{\Sigma}_G)$$

Given data $\mathcal{D} = \{\mathbf{y}_t, \mathbf{x}_t\}_{t=1,\dots,T}$, where $\mathbf{x}_t$ is the vector of the N covariate values in observation t, our goal is to sample models $\mathcal{M} = (G, D)$ along with their parameters $(\boldsymbol{\Sigma}_G, \boldsymbol{\beta}_D)$ from the posterior distribution.
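To make the two model components concrete, the sketch below reads covariate sets off a covariate matrix D, reads parent sets off an adjacency matrix G, and checks that G is acyclic. It is an illustration under our own assumptions: matrices are nested Python lists, indices are 0-based, and the acyclicity check uses Kahn's algorithm, which is a standard choice and not necessarily the one used by the authors.

```python
def covariate_sets(D):
    """Column s of the N-by-S matrix D -> covariate indices i with D[i][s] == 1."""
    N, S = len(D), len(D[0])
    return [[i for i in range(N) if D[i][s] == 1] for s in range(S)]


def parent_sets(G):
    """Column j of the S-by-S matrix G -> parent indices i with G[i][j] == 1."""
    S = len(G)
    return [[i for i in range(S) if G[i][j] == 1] for j in range(S)]


def is_dag(G):
    """Kahn's algorithm: True if the directed graph G contains no cycle."""
    S = len(G)
    indegree = [sum(G[i][j] for i in range(S)) for j in range(S)]
    stack = [j for j in range(S) if indegree[j] == 0]
    visited = 0
    while stack:
        i = stack.pop()
        visited += 1
        for j in range(S):
            if G[i][j] == 1:
                indegree[j] -= 1          # remove the edge i -> j
                if indegree[j] == 0:
                    stack.append(j)
    return visited == S                   # all nodes removed <=> no cycle


D = [[1, 0], [0, 1], [1, 1]]              # N = 3 covariates, S = 2 responses
G = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]     # e_1 -> e_2, e_1 -> e_3, e_2 -> e_3
```

For this toy input, `covariate_sets(D)` gives one index list per response and `parent_sets(G)` one parent list per error; a model is admissible whenever `is_dag(G)` holds, irrespective of D.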

Conjugate priors and RJMCMC inference
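For the posterior density of the models $\mathcal{M} = (G, D)$ we have:

$$p(G, D \mid \mathcal{D}) \propto p(\mathcal{D} \mid G, D) \cdot p(G) \cdot p(D) \tag{12}$$

where $p(\mathcal{D} \mid G, D)$ is the marginal likelihood (marginalised over $\boldsymbol{\Sigma}_G$ and $\boldsymbol{\beta}_D$), and $p(G)$ and $p(D)$ are the prior probabilities of $G$ and $D$, respectively. As there is no analytic solution for this marginal likelihood, which is a double integral over $\boldsymbol{\Sigma}_G$ and $\boldsymbol{\beta}_D$, Eq. (12) does not allow for exact posterior inference. However, when imposing conjugate priors on $\boldsymbol{\beta}_D$ and $\boldsymbol{\Sigma}_G$, we can compute the marginal likelihoods (single integrals) $p(\mathcal{D} \mid G, \boldsymbol{\beta}_D)$ and $p(\mathcal{D} \mid \boldsymbol{\Sigma}_G, D)$, and we show how to sample from the full conditional distributions $p(\boldsymbol{\beta}_D \mid \boldsymbol{\Sigma}_G, D, \mathcal{D})$ and $p(\boldsymbol{\Sigma}_G \mid G, \boldsymbol{\beta}_D, \mathcal{D})$. While sampling from $p(\boldsymbol{\beta}_D \mid \boldsymbol{\Sigma}_G, D, \mathcal{D})$ is conceptually easy, sampling a covariance matrix $\boldsymbol{\Sigma}_G$ that is coherent with a given DAG $G$ is challenging. Since to the best of our knowledge no algorithm has been described in the literature yet, we build on results from Geiger and Heckerman (2002) and derive an algorithm for sampling from $p(\boldsymbol{\Sigma}_G \mid G, \boldsymbol{\beta}_D, \mathcal{D})$; please see Sect. 2.4.2 for the details.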
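We use the four 'ingredients' to design a Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm for exact posterior inference. From a given state, the algorithm proceeds in consecutive steps:
1. Given $D^{\diamond}$ with parameters $\boldsymbol{\beta}^{\diamond}_{D^{\diamond}}$, we use the marginal likelihood $p(\mathcal{D} \mid G, \boldsymbol{\beta}^{\diamond}_{D^{\diamond}})$ and a Metropolis-Hastings step to sample a new DAG $G^{\star}$; see Sect. 2.4.1.
2. Given $G^{\star}$ and $D^{\diamond}$ with parameters …

For a given covariate matrix $D$ with regression vector $\boldsymbol{\beta}_D$, let $\mathbf{e}_t := \mathbf{y}_t - \mathbf{X}_t \boldsymbol{\beta}_D$ denote the error vector of observation t. In terms of the precision matrix, $\mathbf{W} := \boldsymbol{\Sigma}^{-1}$, we then have the error likelihood (see Supplement S1):

$$p(\mathbf{E} \mid \mathbf{W}) = (2\pi)^{-\frac{T \cdot S}{2}} \det(\mathbf{W})^{\frac{T}{2}} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(\mathbf{W} \mathbf{Q}) \right\} \tag{14}$$

where $\mathbf{E} := \{\mathbf{e}_1, \dots, \mathbf{e}_T\}$, and $\mathbf{Q} := \sum_{t=1}^{T} \mathbf{e}_t \cdot \mathbf{e}_t^{\top}$. On the precision matrix we impose the Wishart prior, $\mathbf{W} \sim W_S(\mathbf{V}, \alpha)$, with scale matrix $\mathbf{V}$ and $\alpha > S + 1$ degrees of freedom.¹ In Supplement S1 we show that this implies a closed-form marginal likelihood $p(\mathbf{E})$ for the errors (Eq. 15), which involves the multivariate Gamma function. If we compute the marginal likelihood for only an l-dimensional subset $L \subset \{e_1, \dots, e_S\}$ of the errors, symbolically $\mathbf{E}^{L}$, we get a corresponding closed-form expression (Eq. 16; Geiger and Heckerman 2002; Kuipers et al. 2014), in which the matrices are replaced by the submatrices that keep only the rows and columns that belong to L. Geiger and Heckerman (2002) show that only this Wishart prior, $\mathbf{W} \sim W_S(\mathbf{V}, \alpha)$, leads to 'score equivalence' in Gaussian Bayesian networks. 'Score equivalence' is required in Bayesian networks, as it ensures that equivalent DAGs, i.e. DAGs that impose the same conditional independencies among the nodes (errors), have the same marginal likelihood value; cf. Eq. (18).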
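The error DAG $G$ imposes conditional independence relations among the errors $e_1, \dots, e_S$. Therefore, conditional on $G$, the covariance matrix $\boldsymbol{\Sigma}$ is enforced to be consistent with those relations, symbolically $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_G$. We need the equivalence relation (cf. Eq. 11):

$$p(\mathbf{e} \mid \boldsymbol{\Sigma}_G) = \prod_{s=1}^{S} p\left(e_s \mid \mathbf{e}_{\boldsymbol{\pi}_s}, (\boldsymbol{\Sigma}_G)_{L_s, L_s}\right) \tag{17}$$

where $\boldsymbol{\pi}_s$ is the parent set of $e_s$ in $G$, and $\mathbf{e}_{\boldsymbol{\pi}_s}$ and $(\boldsymbol{\Sigma}_G)_{L_s, L_s}$ are the subvector and submatrix that keep only the elements that belong to $\boldsymbol{\pi}_s$ and $L_s := \{e_s, \boldsymbol{\pi}_s\}$.² With the BGe score (Geiger and Heckerman 1994, 2002; Kuipers et al. 2014) the marginal likelihood of any error DAG $G$ can be computed:

$$p(\mathbf{E} \mid G) = \prod_{s=1}^{S} \frac{p(\mathbf{E}^{\{e_s, \boldsymbol{\pi}_s\}})}{p(\mathbf{E}^{\{\boldsymbol{\pi}_s\}})} \tag{18}$$

where $\boldsymbol{\pi}_s$ is the parent set of $e_s$ in $G$, and the marginal likelihoods $p(\mathbf{E}^{\{e_s, \boldsymbol{\pi}_s\}})$ and $p(\mathbf{E}^{\{\boldsymbol{\pi}_s\}})$ of $L_1 := \{e_s, \boldsymbol{\pi}_s\}$ and $L_2 := \{\boldsymbol{\pi}_s\}$ can be computed with Eq. (16). Recalling that we defined $\mathbf{e}_t := \mathbf{y}_t - \mathbf{X}_t \boldsymbol{\beta}_D$, we have

$$p(\mathcal{D} \mid G, \boldsymbol{\beta}_D) = p(\mathbf{E} \mid G)$$

Hence, given any covariate matrix $D$ with regression vector $\boldsymbol{\beta}_D$, the marginal likelihood $p(\mathcal{D} \mid G, \boldsymbol{\beta}_D)$ of any error DAG $G$ can be computed analytically.³ For sampling error DAGs, we use the 'structure MCMC' Metropolis-Hastings (MH) sampling scheme from Madigan and York (1995), which we implement using the algorithms from Giudici and Castelo (2003). Let $N(G)$ denote the 'neighbourhood' of $G$, i.e. the set of all DAGs that can be reached from $G$ by adding, deleting or reversing one single edge. Given the current error DAG $G$, we propose to move to a randomly selected neighbour $G^{\star} \in N(G)$. The acceptance probability for the move is:

$$A(G, G^{\star}) = \min\left\{1, \frac{p(\mathcal{D} \mid G^{\star}, \boldsymbol{\beta}_D) \cdot p(G^{\star})}{p(\mathcal{D} \mid G, \boldsymbol{\beta}_D) \cdot p(G)} \cdot HR(G, G^{\star}) \right\}$$

where the Hastings ratio is $HR(G, G^{\star}) = |N(G)| / |N(G^{\star})|$ with |.| denoting the cardinality. If the move is accepted, we replace $G$ by $G^{\star}$; otherwise we keep $G$ unchanged.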
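The structure MCMC move described above can be sketched as follows. This is a simplified illustration, not the Giudici and Castelo (2003) implementation: the neighbourhood is enumerated by brute force, `log_score` is a hypothetical callback that is assumed to return the log marginal likelihood of a DAG plus its log prior, and the graph is assumed to have at least two nodes.

```python
import math
import random


def is_dag(G):
    """Kahn's algorithm: True if the adjacency matrix G has no directed cycle."""
    S = len(G)
    indegree = [sum(G[i][j] for i in range(S)) for j in range(S)]
    stack = [j for j in range(S) if indegree[j] == 0]
    visited = 0
    while stack:
        i = stack.pop()
        visited += 1
        for j in range(S):
            if G[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    stack.append(j)
    return visited == S


def neighbours(G):
    """All DAGs reachable from G by adding, deleting or reversing one edge."""
    S = len(G)
    result = []
    for i in range(S):
        for j in range(S):
            if i == j:
                continue
            H = [row[:] for row in G]
            if G[i][j]:
                H[i][j] = 0                  # deletion never creates a cycle
                result.append(H)
                R = [row[:] for row in H]
                R[j][i] = 1                  # reversal: i -> j becomes j -> i
                if is_dag(R):
                    result.append(R)
            elif not G[j][i]:
                H[i][j] = 1                  # addition of the edge i -> j
                if is_dag(H):
                    result.append(H)
    return result


def structure_mcmc_step(G, log_score, rng=random):
    """One Metropolis-Hastings step in DAG space (log_score is hypothetical)."""
    nb = neighbours(G)                       # assumes S >= 2, so nb is non-empty
    G_new = rng.choice(nb)                   # uniform proposal from N(G)
    log_hr = math.log(len(nb)) - math.log(len(neighbours(G_new)))
    log_ratio = log_score(G_new) - log_score(G) + log_hr
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return G_new
    return G
```

Working on the log scale avoids numerical underflow of the marginal likelihoods. Enumerating both neighbourhoods in every step is simple but costly; it is used here only to make the Hastings ratio explicit.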

Sampling covariance matrices for error DAGs
Given the error DAG $G$, we need to sample from the posterior a covariance matrix $\boldsymbol{\Sigma}_G$ that is coherent with $G$; cf. Eq. (17). To this end, we first specify the prior distribution of $\boldsymbol{\Sigma}_G$, where $G$ can be any valid DAG. We follow Geiger and Heckerman (2002) and exploit that a Wishart prior for the 'unrestricted' precision matrix $\mathbf{W} := \boldsymbol{\Sigma}^{-1}$ of a 'full' DAG $G_F$, i.e. a DAG $G_F$ that does not impose any pairwise independencies, also implies a prior distribution for every precision matrix $\mathbf{W}_G$ that is coherent with a given DAG $G$. Inverting $\mathbf{W}_G$ then yields $\boldsymbol{\Sigma}_G$. Consider the Wishart prior for $G_F$, $\mathbf{W} \sim W_S(\mathbf{V}, \alpha)$, let $L \subset \{e_1, \dots, e_S\}$ be an l-dimensional error subset, and let $\mathbf{W}_{L,L}$ denote the submatrix of $\mathbf{W}$ that keeps only the rows and columns that belong to L. Theorem 5.1.4 in Press (1972) then implies a marginal Wishart distribution for the subset L (Eq. 20). As the DAG $G$ implies the factorization (conditional independence relations) given in Eq. (11), we can assign a factorised prior to $\mathbf{W}_G \mid G$ (Eq. 21), where $\boldsymbol{\pi}_s$ is the $|\boldsymbol{\pi}_s|$-dimensional parent node set of $e_s$ implied by $G$, and $L_s := \{e_s, \boldsymbol{\pi}_s\}$ is an $l_s = |\boldsymbol{\pi}_s| + 1$ dimensional subset of errors, so that the densities on the right-hand side of Eq. (21) are marginal Wishart densities; cf. Eq. (20).
² The conditional Gaussian of $e_s \mid \boldsymbol{\pi}_s$ can be computed from $\mathbf{e}_{\boldsymbol{\pi}_s}$ and $(\boldsymbol{\Sigma}_G)_{\{e_s, \boldsymbol{\pi}_s\},\{e_s, \boldsymbol{\pi}_s\}}$.
³ BGe stands for Bayesian metric for Gaussian networks having score equivalence. Geiger and Heckerman (2002) show that only the Wishart prior, $\mathbf{W} \sim W_S(\mathbf{V}, \alpha)$, yields 'score equivalence' in Eq. (18), i.e. $p(\mathbf{E} \mid G_1) = p(\mathbf{E} \mid G_2)$ for equivalent DAGs $G_1$ and $G_2$.
On the other hand, it is not straightforward to sample a precision matrix, $\mathbf{W}_G$, which is coherent with a given DAG $G$; cf. Eqs. (17) and (21). The Wishart distribution of the 'unrestricted' precision matrix $\mathbf{W}$ implies the conditional distributions required by the factorisation, but the difficulty arises from the fact that these conditionals are not of a standard form, so that we cannot sample from them directly. To the best of our knowledge, no sampling algorithm has been proposed in the literature yet.⁴ We propose the following algorithm, which exploits that any 'unrestricted' precision matrix $\mathbf{W}$ implies a unique precision matrix $\mathbf{W}_G$ for any DAG $G$ (Geiger and Heckerman 2002). Therefore, given any DAG $G$, we first sample an 'unrestricted' precision matrix $\mathbf{W}$, before our algorithm extracts the coherent precision matrix $\mathbf{W}_G$ from $\mathbf{W}$. Inversion of $\mathbf{W}_G$ yields the covariance matrix $\boldsymbol{\Sigma}_G$. Our algorithm consists of three consecutive steps.
(1) Sample an 'unrestricted' precision matrix, $\mathbf{W} \sim W_S(\mathbf{V}, \alpha)$. Inverting $\mathbf{W}$ yields the 'unrestricted' covariance matrix $\boldsymbol{\Sigma}$.
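(2) For a full DAG $G_F$ the error vector would have the multivariate Gaussian distribution, $\mathbf{e} \sim N(\mathbf{0}, \boldsymbol{\Sigma})$, but the DAG $G$ implies the factorization (conditional independencies) given in Eq. (17). That is, $G$ implies that error $e_s$ is only allowed to depend on its parent nodes in $\boldsymbol{\pi}_s$. Standard rules (see Supplement S2) allow us to compute the univariate conditional Gaussians appearing in Eq. (17):

$$e_s \mid \mathbf{e}_{\boldsymbol{\pi}_s} \sim N\left(\mu_{e_s|\boldsymbol{\pi}_s}, \sigma^2_{e_s|\boldsymbol{\pi}_s}\right) \tag{22}$$

with conditional mean

$$\mu_{e_s|\boldsymbol{\pi}_s} = \boldsymbol{\Sigma}_{e_s,\boldsymbol{\pi}_s} \boldsymbol{\Sigma}_{\boldsymbol{\pi}_s,\boldsymbol{\pi}_s}^{-1} \mathbf{e}_{\boldsymbol{\pi}_s}$$

and conditional variance

$$\sigma^2_{e_s|\boldsymbol{\pi}_s} = \boldsymbol{\Sigma}_{e_s,e_s} - \boldsymbol{\Sigma}_{e_s,\boldsymbol{\pi}_s} \boldsymbol{\Sigma}_{\boldsymbol{\pi}_s,\boldsymbol{\pi}_s}^{-1} \boldsymbol{\Sigma}_{\boldsymbol{\pi}_s,e_s}$$

where all submatrices and subvectors consist of only those rows and columns that belong to the elements in the subscripts. The $|\boldsymbol{\pi}_s|$ elements of the row vector $\tilde{\boldsymbol{\beta}}_{e_s|\boldsymbol{\pi}_s} := \boldsymbol{\Sigma}_{e_s,\boldsymbol{\pi}_s} \boldsymbol{\Sigma}_{\boldsymbol{\pi}_s,\boldsymbol{\pi}_s}^{-1}$ are often referred to as (partial) regression coefficients.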
(3) To extract the precision matrix $\mathbf{W}_G$ which is coherent with the DAG $G$, we apply the recursion from Shachter and Kenley (1989). The recursion (see Supplement S3) allows us to compute $\mathbf{W}_G$ from $\sigma^2_{e_s|\boldsymbol{\pi}_s}$ and $\tilde{\boldsymbol{\beta}}_{e_s|\boldsymbol{\pi}_s}$ ($s = 1, \dots, S$). The key idea is to recompose the univariate conditional Gaussians from Eq. (22), which were computed from $\boldsymbol{\Sigma} = \mathbf{W}^{-1}$, back into a joint Gaussian, $\mathbf{e} \sim N(\mathbf{0}, \mathbf{W}_G^{-1})$. By construction, only the conditional dependencies captured in Eq. (22) are brought into $\mathbf{W}_G$, while all other conditional dependencies (present in $\mathbf{W}$ but not coherent with $G$) are omitted from $\mathbf{W}_G$. In Supplement S4 we give a more comprehensive description of this third step.
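The proposed algorithm can also be used for generating posterior samples of covariance matrices $\boldsymbol{\Sigma}_G$ that are coherent with a DAG $G$. For a given covariate matrix $D$ with regression vector $\boldsymbol{\beta}_D$, we defined $\mathbf{e}_t := \mathbf{y}_t - \mathbf{X}_t \boldsymbol{\beta}_D$ and $\mathbf{E} := \{\mathbf{e}_1, \dots, \mathbf{e}_T\}$. In Supplement S5 we show that the Gaussian likelihood from Eq. (14) in combination with the Wishart prior, $\mathbf{W} \sim W_S(\mathbf{V}, \alpha)$, implies for the 'unrestricted' precision matrix $\mathbf{W}$ the posterior distribution:

$$\mathbf{W} \mid \mathbf{E} \sim W_S\left( (\mathbf{Q} + \mathbf{V}^{-1})^{-1}, \alpha + T \right) \quad \text{with} \quad \mathbf{Q} := \sum_{t=1}^{T} \mathbf{e}_t \cdot \mathbf{e}_t^{\top}$$

Hence, for posterior sampling of $\boldsymbol{\Sigma}_G$, we replace in step (1) the Wishart prior, $\mathbf{W} \sim W_S(\mathbf{V}, \alpha)$, by the Wishart posterior, $\mathbf{W} \mid \mathbf{E} \sim W_S\left( (\mathbf{Q} + \mathbf{V}^{-1})^{-1}, \alpha + T \right)$.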
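Steps (2) and (3) can be sketched in a few lines. The sketch takes an 'unrestricted' covariance matrix as input (step (1), the Wishart draw, is omitted), computes the conditional Gaussians per node, and recomposes them into a coherent precision matrix. Note two assumptions on our side: instead of the Shachter and Kenley (1989) recursion it uses the closed-form product (I − B) · diag(1/σ²) · (I − B)ᵀ, where B[i][s] is the regression coefficient of parent i for node s, which should describe the same joint Gaussian; and the helper `solve` is a plain Gaussian elimination written only for this illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[r][:] + [b[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def coherent_precision(Sigma, parents):
    """Precision matrix implied by an unrestricted Sigma and DAG parent sets.

    parents[s] lists the 0-based parent indices of node s.
    """
    S = len(Sigma)
    coef = [[0.0] * S for _ in range(S)]   # coef[i][s]: weight of e_i in e_s | parents
    cvar = [0.0] * S                       # conditional variances
    for s in range(S):
        pa = parents[s]
        w = []
        if pa:
            A = [[Sigma[i][j] for j in pa] for i in pa]
            b = [Sigma[i][s] for i in pa]
            w = solve(A, b)                # partial regression coefficients
        for i, wi in zip(pa, w):
            coef[i][s] = wi
        cvar[s] = Sigma[s][s] - sum(wi * Sigma[i][s] for i, wi in zip(pa, w))
    # recompose: W_G = (I - coef) * diag(1 / cvar) * (I - coef)^T
    L = [[(1.0 if i == s else 0.0) - coef[i][s] for s in range(S)] for i in range(S)]
    return [[sum(L[i][s] * L[j][s] / cvar[s] for s in range(S))
             for j in range(S)] for i in range(S)]


Sigma = [[2.0, 0.6, 0.3], [0.6, 1.0, 0.4], [0.3, 0.4, 1.5]]
W_chain = coherent_precision(Sigma, [[], [0], [1]])     # e_1 -> e_2 -> e_3
W_full = coherent_precision(Sigma, [[], [0], [0, 1]])   # full DAG
```

For the chain, the entry of `W_chain` that links the first and the third error is zero, which encodes their conditional independence given the second error. For the full DAG, no dependency is dropped and `W_full` is the ordinary inverse of `Sigma`.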

Metropolis-Hastings step for the covariate matrix
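Given the error DAG $G$ with covariance matrix $\boldsymbol{\Sigma}_G$ and the covariate matrix $D$, we have:

$$\mathbf{y}_t \mid \boldsymbol{\Sigma}_G, \boldsymbol{\beta}_D, D \sim N_S(\mathbf{X}_t \boldsymbol{\beta}_D, \boldsymbol{\Sigma}_G) \quad (t = 1, \dots, T)$$

where the block design matrices $\mathbf{X}_t \in \mathbb{R}^{S,k}$ depend on the covariate matrix $D$ and have the form of Eq. (10), and $\boldsymbol{\beta}_D \in \mathbb{R}^{k}$ has the form of Eq. (9) and consists of S stapled regression vectors. Stacking the T observations gives a compact representation, and marginalising over $\boldsymbol{\beta}_D$ yields the density $p(\mathcal{D} \mid \boldsymbol{\Sigma}_G, D)$ in Eq. (28). Computing this density requires the inverse and the determinant of its covariance matrix. In Supplement S8 we show how the matrix inversion lemma and the matrix determinant lemma (see Supplement S7) can be applied to compute the density in Eq. (28) more efficiently. Hence, given the error DAG $G$ with covariance matrix $\boldsymbol{\Sigma}_G$, the marginal likelihood $p(\mathcal{D} \mid \boldsymbol{\Sigma}_G, D)$ of each covariate matrix $D$ can be computed in closed form.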
For sampling covariate matrices, we apply a Metropolis-Hastings (MH) move. We randomly select one of the elements of $D$, say $D_{n,s}$, and we propose to flip its value $1 \leftrightarrow 0$. That is, we propose either to add $z_n$ to the covariate set of response $y_s$ (if $D_{n,s} = 0$) or to remove $z_n$ from it (if $D_{n,s} = 1$). This yields a new candidate covariate matrix $D^{\star}$, and the acceptance probability is:

$$A(D, D^{\star}) = \min\left\{1, \frac{p(\mathcal{D} \mid \boldsymbol{\Sigma}_G, D^{\star}) \cdot p(D^{\star})}{p(\mathcal{D} \mid \boldsymbol{\Sigma}_G, D) \cdot p(D)} \cdot HR \right\}$$

The move design ensures that the forward and the backward move are equally likely, so that the Hastings ratio HR is equal to 1. If the move is accepted, we replace $D$ by $D^{\star}$; otherwise we keep $D$ unchanged.
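A minimal sketch of the flip move is given below. The callback `log_marginal` is hypothetical: it is assumed to return the log marginal likelihood of a covariate matrix (given the current error covariance matrix) plus its log prior, and the closed-form computation of that quantity is not reproduced here.

```python
import math
import random


def flip_move(D, log_marginal, rng=random):
    """One Metropolis-Hastings flip move on the N-by-S covariate matrix D."""
    N, S = len(D), len(D[0])
    n, s = rng.randrange(N), rng.randrange(S)   # pick one element uniformly
    D_new = [row[:] for row in D]
    D_new[n][s] = 1 - D_new[n][s]               # flip 1 <-> 0
    # symmetric proposal: the Hastings ratio equals 1 and drops out
    log_ratio = log_marginal(D_new) - log_marginal(D)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return D_new
    return D
```

Because the same element can be flipped back with the same probability, no correction term for the proposal is needed, in line with the Hastings ratio of 1 stated above.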

Sampling regression vectors
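Given the error DAG $G$ with covariance matrix $\boldsymbol{\Sigma}_G$ and the covariate matrix $D$, we have for the full conditional:

$$p(\boldsymbol{\beta}_D \mid \boldsymbol{\Sigma}_G, D, \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\Sigma}_G, \boldsymbol{\beta}_D, D) \cdot p(\boldsymbol{\beta}_D \mid D)$$

In Supplement S9 we show that the Gaussian likelihood in combination with the Gaussian prior on $\boldsymbol{\beta}_D$ implies a Gaussian full conditional distribution. Hence, given the error DAG $G$ with covariance matrix $\boldsymbol{\Sigma}_G$ and the covariate matrix $D$, we can Gibbs-sample $\boldsymbol{\beta}_D$ from its full conditional distribution $p(\boldsymbol{\beta}_D \mid \boldsymbol{\Sigma}_G, D, \mathcal{D})$.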

Bayesian model averaging and equivalence classes of DAGs
The RJMCMC algorithm from Sect. 2.4 generates a posterior sample

$$\left\{ G^{(r)}, D^{(r)}, \boldsymbol{\Sigma}^{(r)}_{G^{(r)}}, \boldsymbol{\beta}^{(r)}_{D^{(r)}} \right\}_{r=1,\dots,R}$$

Given the sample, marginal posterior probabilities, such as the probability $p^{\dagger}_{i,j}$ that covariate $z_i$ has an effect on response $y_j$, or the probability $p^{\ddagger}_{i,j}$ that there is an edge from error $e_i$ to error $e_j$, can be estimated. We refer to these probabilities as covariate-response and error-error interaction scores. For the covariate-response interactions we compute the mean scores:

$$\hat{p}^{\dagger}_{i,j} = \frac{1}{R} \sum_{r=1}^{R} D^{(r)}_{i,j}$$

For the error-error interactions we have to take into account that DAGs fall into equivalence classes. Each equivalence class contains the DAGs that encode the same conditional independence relations (Chickering 1995). In Chickering (2002) it has been shown that each equivalence class of DAGs can be represented by a 'completed partially directed acyclic graph' (CPDAG). Loosely speaking, the CPDAG replaces some of the directed edges of the DAG by undirected edges, so as to indicate that their directions are not unique within its equivalence class.⁵ A directed edge in a CPDAG indicates that all DAGs within the equivalence class agree on this edge direction. That means that the inferred conditional independence relations require the edge to have this particular direction. Hence, the edge has been learned along with its direction. On the other hand, an undirected edge in a CPDAG indicates that the DAGs within the equivalence class disagree on the direction of this edge, so that the edge direction is not unique and stays unclear. When computing marginal edge posterior probabilities, we interpret undirected CPDAG edges as bidirectional edges $e_i \leftrightarrow e_j$. With this interpretation we translate each DAG $G^{(r)}$ into a graph $G^{(r),\ddagger}$, whose element $G^{(r),\ddagger}_{i,j}$ is 1 if the CPDAG of $G^{(r)}$ contains either the directed edge $e_i \to e_j$ or an undirected edge $e_i - e_j$, which we interpret as $e_i \leftrightarrow e_j$. For the error-error interactions we compute the mean scores:

$$\hat{p}^{\ddagger}_{i,j} = \frac{1}{R} \sum_{r=1}^{R} G^{(r),\ddagger}_{i,j}$$

Since MCMC samples can be strongly auto-correlated, we note that the effective sample size (ESS) can be substantially lower than R; see Sect. 2.9 for details on how to estimate interaction-specific ESSs. For each interaction $i \to j$ we have an effective sample size $ESS_{i,j}$ and a marginal existence probability, $p_{i,j}$, which we estimate by the mean $\hat{p}_{i,j}$. When the MCMC-generated Bernoulli sample of size R is of effective size $ESS_{i,j}$, asymptotic $1 - \gamma$ confidence intervals for $p_{i,j}$ are:

$$\hat{p}_{i,j} \pm q_{1-\gamma/2} \cdot \sqrt{\frac{\hat{p}_{i,j} (1 - \hat{p}_{i,j})}{ESS_{i,j}}}$$

where $q_{1-\gamma/2}$ is the $(1 - \gamma/2)$ quantile of the N(0, 1) distribution.
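The mean scores and the ESS-based confidence intervals can be computed as sketched below. The sketch assumes that the sampled graphs have already been translated into their CPDAG-based 0/1 matrices (that translation is not shown), that the interval has the usual asymptotic normal form for a Bernoulli mean with the ESS in place of R, and that the bounds may be clipped to [0, 1]; the clipping and the function names are our additions.

```python
from statistics import NormalDist


def interaction_score(samples, i, j):
    """Mean of the (i, j) indicator over a list of sampled 0/1 matrices."""
    return sum(M[i][j] for M in samples) / len(samples)


def ess_confidence_interval(p_hat, ess, level=0.95):
    """Asymptotic interval for a Bernoulli mean with effective sample size ess."""
    q = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)   # standard normal quantile
    half_width = q * (p_hat * (1.0 - p_hat) / ess) ** 0.5
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# four sampled graphs on two nodes; the edge 0 -> 1 is present in three of them
samples = [[[0, 1], [0, 0]], [[0, 1], [0, 0]], [[0, 0], [0, 0]], [[0, 1], [0, 0]]]
score = interaction_score(samples, 0, 1)
lower, upper = ess_confidence_interval(0.5, ess=100.0)
```

Using the ESS instead of the nominal sample size widens the interval whenever the chain is auto-correlated, which guards against overconfident edge scores.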

Predictive probabilities
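Given a validation data set $\tilde{\mathcal{D}}$ with $\tilde{T}$ observations, $\tilde{\mathcal{D}} = \{\tilde{\mathbf{y}}_t, \tilde{\mathbf{x}}_t\}_{t=1,\dots,\tilde{T}}$, where $\tilde{\mathbf{x}}_t$ contains the values of the covariates in observation t, we can Monte-Carlo approximate the predictive probability:

$$p(\tilde{\mathcal{D}} \mid \mathcal{D}) \approx \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{\tilde{T}} p\left( \tilde{\mathbf{y}}_t \mid \tilde{\mathbf{X}}_{t,D^{(r)}} \boldsymbol{\beta}^{(r)}_{D^{(r)}}, \boldsymbol{\Sigma}^{(r)}_{G^{(r)}} \right) \tag{30}$$

where $\tilde{\mathbf{X}}_{t,D^{(r)}}$ is a block design matrix (cf. Eq. 10), built from the covariate values in $\tilde{\mathbf{x}}_t$ using the covariate sets implied by $D^{(r)}$, and $p(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the density of the $N_S(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution.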

Model variants and their scores
The proposed SSUR model adds two features to the traditional MBLR model. It infers response-specific covariate sets, and it infers a Bayesian network among the errors. The traditional MBLR is akin to the new model with the DAG being enforced to be a 'full' DAG, $G_F$, and the covariate matrix being enforced to be a matrix full of 1's ($D_F := \mathbf{1}_{N,S}$). A full DAG $G_F$ has the maximal number of edges (e.g. $(G_F)_{i,j} = 1$ for all $i < j$), so that the errors are mutually pairwise dependent. A full covariate matrix $D_F$ implies that all responses depend on all covariates. As the two model refinements can be implemented separately, we can build two 'in-between' models, each featuring only one of the two refinements. In total, we have four model variants:
• The model $M_{0,0}$ enforces a full DAG, $G_F$, and a full covariate matrix, $D_F$, and only samples the model parameters $(\boldsymbol{\Sigma}^{(r)}_{G_F}, \boldsymbol{\beta}^{(r)}_{D_F})$. We refer to $M_{0,0}$ as the 'baseline' model, since it is akin to the traditional MBLR.
• The model $M_{1,0}$ enforces a full covariate matrix $D_F$, and only samples error DAGs $G^{(r)}$ along with parameters of the form $(\boldsymbol{\Sigma}^{(r)}_{G^{(r)}}, \boldsymbol{\beta}^{(r)}_{D_F})$.
• The model $M_{0,1}$ enforces a full error DAG $G_F$, and only samples covariate matrices $D^{(r)}$ along with parameters of the form $(\boldsymbol{\Sigma}^{(r)}_{G_F}, \boldsymbol{\beta}^{(r)}_{D^{(r)}})$.
• $M_{1,1}$ is the newly proposed model. It samples error DAGs $G^{(r)}$ and covariate matrices $D^{(r)}$ along with their parameters $(\boldsymbol{\Sigma}^{(r)}_{G^{(r)}}, \boldsymbol{\beta}^{(r)}_{D^{(r)}})$.
We also consider two model variants that correspond to static ( M S ) and dynamic ( M D ) Bayesian networks.
• The model $M_S$ does not allow for covariate-response ($z_i \to y_j$) interactions, so that the covariate matrix $D$ is enforced to be a zero matrix. The model is akin to a static Bayesian network model that only infers the error interactions ($e_i \to e_j$) in the form of DAGs $G$. The design matrices then contain only the intercepts.
• The model $M_D$ does not allow for error interactions ($e_i \to e_j$), so that the error DAG $G$ is enforced to be a graph without any edges. The model only learns the covariate-response interactions ($z_i \to y_j$). The error covariance matrix is a diagonal matrix, whose diagonal entries are the S error variances. For vector autoregressive (VAR) modelling approaches, where the current observations of a time series take the role of responses and past observations become the covariates, the model is thus akin to a dynamic Bayesian network model; cf. Sects. 4.2 and 4.3.
Our major goal is to show that the new $M_{1,1}$ model performs better than the traditional/baseline $M_{0,0}$ model. But we will also cross-compare with the 'in-between' models ($M_{1,0}$ and $M_{0,1}$), so as to elucidate what the individual gains of the two model refinements are. In two time series applications (mTOR and ANDRO) we also compare the $M_{1,1}$ model with the static ($M_S$) and the dynamic ($M_D$) Bayesian network, so as to provide empirical evidence that simultaneously learning both types of interactions leads to improved inference results. In an additional study on synthetic data we also compare with the traditional MBLR, as described in Sect. 2.1. Since all models generate parameter samples, we can cross-compare them in terms of predictive probabilities; cf. Eq. (30). But we have to take into account that we cannot compute interaction scores from a constantly full DAG $G_F$ and/or from a constantly full covariate matrix $D_F = \mathbf{1}_{N,S}$. We then replace the interaction scores by the fraction of sampled parameters that have the same sign. The $M_{1,0}$ model, which enforces the full covariate matrix $D_F$, samples regression vectors $\boldsymbol{\beta}^{(r)}_{D_F}$ that have a specific regression coefficient $\beta^{(r)}_{i,j}$ for each possible interaction $z_i \to y_j$. We determine the fractions of positive ($p^{\star,+}_{i,j}$) and negative ($p^{\star,-}_{i,j}$) regression coefficients in the sample $\beta^{(1)}_{i,j}, \dots, \beta^{(R)}_{i,j}$, and we use $p^{\star}_{i,j} := 2 \cdot \max\{p^{\star,+}_{i,j}, p^{\star,-}_{i,j}\} - 1$ for scoring the interaction $z_i \to y_j$. Accordingly, in the $M_{0,1}$ model, which enforces the full DAG $G_F$, for each error-error interaction $e_i \to e_j$ we extract the (i, j) elements from the sampled covariance matrices, $\Sigma^{(1)}_{i,j}, \dots, \Sigma^{(R)}_{i,j}$, and determine the fractions of positive ($p^{\diamond,+}_{i,j}$) and negative ($p^{\diamond,-}_{i,j}$) elements among them. We then use $p^{\diamond}_{i,j} := 2 \cdot \max\{p^{\diamond,+}_{i,j}, p^{\diamond,-}_{i,j}\} - 1$ for scoring the error-error interaction $e_i \to e_j$.
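The sign-based replacement score is simple enough to state in a few lines. The function name and the toy values below are illustrative only.

```python
def sign_score(values):
    """2 * max(fraction positive, fraction negative) - 1 for sampled parameters."""
    R = len(values)
    p_pos = sum(v > 0 for v in values) / R
    p_neg = sum(v < 0 for v in values) / R
    return 2.0 * max(p_pos, p_neg) - 1.0


consistent = sign_score([0.8, 1.1, 0.4, 0.9])     # all samples share one sign
mixed = sign_score([0.3, -0.2, 0.1, -0.4])        # signs split evenly
mostly = sign_score([1.0, 2.0, -1.0, 3.0])        # three of four are positive
```

The score is close to 1 when almost all sampled parameters share the same sign and close to 0 when the signs are evenly split, so it lives on the same [0, 1] scale as the marginal posterior probabilities it replaces.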

Reconstruction accuracy
The covariate-response and the error-error interaction scores defined in Sect. 2.5 and in Sect. 2.7 allow for rankings of the interactions. If the true interactions are known, the rankings can be used to define precision-recall (PR) curves. For each threshold $\psi \in [0, 1]$ we extract the interactions whose scores exceed $\psi$. We then compute the precision p (fraction of true interactions among the extracted interactions, TP/(TP+FP)) and the recall r (fraction of true interactions that have been extracted, TP/(TP+FN)). Plotting p against r yields the PR curve, and the area under the curve (AUC) is a measure for quantifying the accuracy of the inferred interactions (Davis and Goadrich 2006). AUC values lie between 0 and 1, and higher AUCs indicate a higher accuracy.
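The construction of a PR curve from interaction scores can be sketched as follows. Two simplifications are assumptions of this sketch: thresholds are placed at the observed score values (extracting all interactions whose score is at least that value), and the area is computed with the trapezoidal rule rather than with the non-linear interpolation discussed by Davis and Goadrich (2006).

```python
def pr_curve(scores, truth):
    """Return (recall, precision) points, one per distinct score value.

    scores : dict mapping an interaction (i, j) to its score in [0, 1]
    truth  : dict mapping the same keys to 1 (true interaction) or 0
    """
    n_true = sum(truth.values())           # assumed to be positive
    points = []
    for th in sorted(set(scores.values()), reverse=True):
        extracted = [k for k, v in scores.items() if v >= th]
        tp = sum(truth[k] for k in extracted)
        points.append((tp / n_true, tp / len(extracted)))
    return points


def pr_auc(points):
    """Trapezoidal area under the PR points, starting at recall 0."""
    area, prev_r, prev_p = 0.0, 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return area


scores = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.2, (2, 0): 0.1}
truth = {(0, 1): 1, (1, 2): 1, (0, 2): 0, (2, 0): 0}
auc = pr_auc(pr_curve(scores, truth))
```

In this toy example both true interactions are ranked above both false ones, so the precision stays at 1 until full recall is reached.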

Convergence diagnostics and effective sample sizes
Trace plot diagnostics and potential scale reduction factors (PSRFs) are widely applied tools for assessing MCMC convergence. On a given data set we perform H independent MCMC simulations with V = 100,000 (100k) iterations each. With trace plots we monitor global characteristics, such as the number of interactions of the sampled models. To monitor convergence via PSRFs we evaluate at equidistant time points, say after $2s \in \{10, 20, \dots, 100k\}$ iterations, and compute the interaction-specific PSRFs. When withdrawing the first s samples to account for a burn-in phase, we keep s MCMC samples. Let $\hat{p}^{[h,2s]}_{i,j}$ and $V^{[h,2s]}_{i,j}$ denote the mean and the variance of the interaction $i \to j$ obtained from simulation h, computed from the last R = s samples of a run with 2s iterations (cf. Sect. 2.5). The 'between-chain' variance $B_{2s}(i,j)$ is the variance of the means $\hat{p}^{[1,2s]}_{i,j}, \dots, \hat{p}^{[H,2s]}_{i,j}$. The 'within-chain' variance $W_{2s}(i,j)$ is the mean of the variances $V^{[1,2s]}_{i,j}, \dots, V^{[H,2s]}_{i,j}$. Brooks and Gelman (1998) define the PSRF as

$$PSRF_{2s}(i,j) = \frac{\left(1 - \frac{1}{s}\right) W_{2s}(i,j) + \left(1 + \frac{1}{H}\right) B_{2s}(i,j)}{W_{2s}(i,j)}$$

where H is the number of MCMC simulations, and s is the number of samples taken after a burn-in phase of the same length s. PSRFs near 1 indicate that the simulations are close to the stationary distribution (Brooks and Gelman 1998), where $\xi = 1.1$ has become a widely applied threshold. We use the more conservative threshold $\xi = 1.05$, and as convergence diagnostic we monitor the fraction of edges $F_{2s}$ whose PSRF is lower than $\xi$ against the number of MCMC iterations 2s. Hence, $F_{2s} \in [0, 1]$ is the relative frequency of edges that have a PSRF lower than 1.05 after 2s iterations. For estimating the effective sample sizes (ESSs) we use the approach from Vehtari et al. (2021). For each interaction $i \to j$ the ESS is computed from R = s, the number of MCMC samples taken during the sampling phase, and from $\rho^{t}_{i,j}$, the auto-correlation of interaction $i \to j$ at lag t. The auto-correlations can be estimated from the H independent MCMC runs using the estimator of Vehtari et al. (2021), which combines the estimates $\hat{\rho}^{t,[h]}_{i,j}$ of the auto-correlation at lag t obtained from the output of each simulation h. To avoid that $ESS_{i,j}$ is biased by noisy estimates at very large lags, we again follow Vehtari et al. (2021) and apply the truncation rule from Geyer (1992).
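A PSRF for a single interaction can be computed from the per-chain indicator samples as sketched below. The sketch assumes the form PSRF = ((1 − 1/s) · W + (1 + 1/H) · B) / W with B being the variance of the chain means and W the mean of the chain variances, and it uses sample variances from the standard library; whether the original code uses the same variance normalisation is not known.

```python
from statistics import mean, variance


def psrf(chains):
    """Potential scale reduction factor for one interaction.

    chains : list of H lists, each with the s post-burn-in 0/1 indicators
             of the interaction in one MCMC simulation
    """
    H, s = len(chains), len(chains[0])
    B = variance([mean(c) for c in chains])   # between-chain variance of the means
    W = mean([variance(c) for c in chains])   # within-chain variance (must be > 0)
    return ((1.0 - 1.0 / s) * W + (1.0 + 1.0 / H) * B) / W


agree = psrf([[0, 1, 0, 1], [1, 0, 1, 0]])      # both chains have mean 0.5
disagree = psrf([[0, 0, 0, 1], [1, 1, 1, 0]])   # chain means 0.25 and 0.75
```

When the chains agree on the mean, the between-chain term vanishes and the factor drops slightly below 1; when they disagree, the factor grows well beyond a threshold such as 1.05. An interaction that is constant in every chain gives a zero within-chain variance, so such cases have to be handled separately.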

Hyperparameter settings
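In the absence of prior knowledge, we impose uniform distributions on $G$ and $D$. We then have in Eq. (12):

$$p(G, D \mid \mathcal{D}) \propto p(\mathcal{D} \mid G, D)$$

The model uses an inverse Wishart prior for the covariance matrix $\boldsymbol{\Sigma}$, from which it extracts the covariance matrix $\boldsymbol{\Sigma}_G$ for any DAG $G$, and it uses a Gaussian prior for the regression vector $\boldsymbol{\beta}_D$, whose individual vectors $\boldsymbol{\beta}_s$ (cf. Eq. 9) always include an initial intercept parameter (see text below Eq. 10):

$$\mathbf{W} = \boldsymbol{\Sigma}^{-1} \sim W_S(\mathbf{V}, \alpha), \qquad \boldsymbol{\beta}_D \sim N_k(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$$

This implies that $\boldsymbol{\Sigma}$ is inverse Wishart distributed with scale matrix $\mathbf{V}^{-1}$ and $\alpha$ degrees of freedom, so that $E[\boldsymbol{\Sigma}] = \frac{1}{\alpha - S - 1} \mathbf{V}^{-1}$. We set $\mathbf{V}^{-1} = (\alpha - S - 1) \mathbf{I}$ to get $E[\boldsymbol{\Sigma}] = \mathbf{I}$. For $\alpha = S + 2$ we obtain an uninformative prior (our standard choice). In Sect. 5.2.1 we make the prior stronger by setting $\alpha = x(S + 2)$ with $x = 1, \dots, 4$. We set $\boldsymbol{\mu}_0 = \mathbf{0}$ and $\boldsymbol{\Sigma}_0 = \lambda \mathbf{I}$ with $\lambda > 0$, so that all regression coefficients $\beta_{i,s}$ (for $z_i \to y_s$) have independent $N(0, \lambda)$ priors. The prior covariance matrix $\boldsymbol{\Sigma}$ imposes regression relationships among the errors, and the joint distribution $\mathbf{e} \sim N_S(\mathbf{0}, \boldsymbol{\Sigma})$ can be factorised (see Supplement S4):

$$e_s \mid e_1, \dots, e_{s-1} \sim N\left( \sum_{i=1}^{s-1} \tilde{\beta}_{i,s} e_i, \; v_s^2 \right) \quad (s = 1, \dots, S)$$

We define $y_s := \mu_s + e_s$ with $\mu_s = \mathbf{x}_s^{\top} \boldsymbol{\beta}_s$, so that

$$y_s \mid y_1, \dots, y_{s-1} \sim N\left( \mu_s + \sum_{i=1}^{s-1} \tilde{\beta}_{i,s} (y_i - \mu_i), \; v_s^2 \right)$$

For our setting $E[\boldsymbol{\Sigma}] = \mathbf{I}$ the prior expectation of each $\tilde{\beta}_{i,s}$ (for $e_i \to e_s$ or $y_i \to y_s$) is zero, and we can Monte Carlo approximate the prior variance $VAR(\tilde{\beta}_{i,s})$. We sample 10,000 covariance matrices from the prior, extract the conditional Gaussians, and compute the empirical variance of each $\tilde{\beta}_{i,s}$. In $\beta_{i,s} \sim N(0, \lambda)$ we then set $\lambda$ equal to the mean variance, so as to ensure that the two regression coefficient types $\beta_{i,s}$ (for $z_i \to y_s$) and $\tilde{\beta}_{i,s}$ (for $y_i \to y_s$) have the same (average) prior variance $\lambda$, which depends on S and the Wishart hyperparameters.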

MCMC simulations and software (Matlab) implementation
We run all RJMCMC simulations for $V = 100{,}000$ (100k) iterations. Setting the burn-in to $0.5V$ (50%) and thinning out by a factor of 10 during the sampling phase yields $R = 5\mathrm{k}$ MCMC samples. In Sect. 5.1 we provide exemplary convergence diagnostics and effective sample size calculations. Measurements of the computational costs for running RJMCMC simulations with our Matlab code are provided in Table 2.
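The arithmetic behind the number of retained samples can be checked with a short sketch. Which iterations exactly are kept after thinning is not stated in the text; the choice below (every 10th iteration after the burn-in) is an assumption, and the function name `sampling_schedule` is hypothetical.

```python
def sampling_schedule(total_iterations=100_000, burn_in_fraction=0.5, thinning=10):
    """Return the burn-in length and the (1-based) iterations that are kept."""
    burn_in = int(total_iterations * burn_in_fraction)
    # Assumption: keep every `thinning`-th iteration of the sampling phase.
    kept = range(burn_in + thinning, total_iterations + 1, thinning)
    return burn_in, kept


burn_in, kept = sampling_schedule()
print(burn_in, len(kept))  # 50000 5000
```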

Synthetic data
We generate models with $S = 10$ responses and $N = 20$ covariates. Since the regression equations for the responses $y_s$ have dependent errors $e_s$, we start with the error DAG $G$. Without loss of generality, we assume the topological order $e_1, \ldots, e_S$, so that $G$ can only have edges $e_i \to e_s$ with $i < s$. We randomly select 10 edges from $\{G_{i,s} : i < s\}$ and for each of the 10 regression coefficients $\tilde{\beta}_{i,s}$ we sample its absolute value from a uniform distribution on the interval $[1, 2]$, before we assign a random sign to $\tilde{\beta}_{i,s}$. Among the errors we then impose relationships of the form:

$$e_s = \tilde{\beta}_{0,s}\, v_s + \sum_{i=1}^{s-1} \tilde{\beta}_{i,s}\, e_i \qquad (32)$$

where $v_s \sim N(0, 1)$ is unexplained noise, $\tilde{\beta}_{0,s} = 1$, and $\tilde{\beta}_{i,s} = 0$ whenever $G$ has no edge $e_i \to e_s$. To ensure that each error $e_s$ has a $N(0, 1)$ marginal distribution, we re-scale the coefficients in Eq. (32) to Euclidean norm 1. For $i \in \{0, \ldots, s - 1\}$ with $\tilde{\beta}_{i,s} \neq 0$:

$$\tilde{\beta}_{i,s} \leftarrow \frac{\tilde{\beta}_{i,s}}{\sqrt{\sum_{j=0}^{s-1} \tilde{\beta}_{j,s}^2}}$$

After re-scaling, we follow the topological order and successively sample the $e_{t,s}$'s from Eq. (32), so as to obtain realisations $\mathbf{e}_t = (e_{t,1}, \ldots, e_{t,S})$. Subsequently, we randomly select 10 covariate-response connections ($z_i \to y_s$) from $D \in \mathbb{R}^{N,S}$. For each of the 10 regression coefficients $\beta_{i,s}$ we sample its absolute value from a uniform distribution on the interval $[0.25, 0.50]$, before we assign a random sign to $\beta_{i,s}$. Data points $(\mathbf{z}_t, \mathbf{y}_t)$ with $\mathbf{z}_t = (z_{t,1}, \ldots, z_{t,N})$ and $\mathbf{y}_t = (y_{t,1}, \ldots, y_{t,S})$ can be generated by sampling error values $\mathbf{e}_t \in \mathbb{R}^S$, as described above, and covariate values $\mathbf{z}_t \sim N_N(\mathbf{0}, \Sigma_z)$. The response values $y_{t,s}$ then follow from the regression equations, i.e. by adding the error $e_{t,s}$ to the linear combination of the covariate values with the coefficients $\beta_{i,s}$. We set $(\Sigma_z)_{i,j} = 0.25 + 0.75\,\delta_{i,j}$, where $\delta_{i,j}$ is the Kronecker delta, so that each covariate has expectation 0, variance 1, and the pairwise correlation is 0.25. For our study we sample 50 different models $(D, G)$ along with parameters, and for each model we simulate two data sets: one data set with $T = 50$ observations for inference and another data set with 10 observations for validation (predictive probabilities).
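The data-generating process described above can be sketched with the Python standard library. This is an illustrative reading of the text, not the authors' implementation: it assumes zero intercepts in the response equations, uses 0-based indices, and draws the equicorrelated covariates via a shared Gaussian factor. The function name `simulate` and all variable names are hypothetical.

```python
import math
import random


def simulate(S=10, N=20, T=50, n_edges=10, n_links=10, seed=1):
    rng = random.Random(seed)

    def signed(lo, hi):
        # Absolute value uniform on [lo, hi], then a random sign.
        return rng.choice((-1.0, 1.0)) * rng.uniform(lo, hi)

    # Error DAG: edges e_i -> e_s are only allowed for i < s.
    pairs = [(i, s) for s in range(S) for i in range(s)]
    b_err = {edge: signed(1.0, 2.0) for edge in rng.sample(pairs, n_edges)}

    # Re-scale the noise coefficient (initially 1) and the parent
    # coefficients of every error to Euclidean norm 1.
    noise = [1.0] * S
    for s in range(S):
        parents = [i for i in range(s) if (i, s) in b_err]
        norm = math.sqrt(1.0 + sum(b_err[i, s] ** 2 for i in parents))
        noise[s] /= norm
        for i in parents:
            b_err[i, s] /= norm

    # Covariate-response connections z_i -> y_s.
    links = [(i, s) for i in range(N) for s in range(S)]
    b_cov = {link: signed(0.25, 0.5) for link in rng.sample(links, n_links)}

    Z, Y = [], []
    for _ in range(T):
        e = []
        for s in range(S):  # follow the topological order
            parent_part = sum(b * e[i] for (i, c), b in b_err.items() if c == s)
            e.append(noise[s] * rng.gauss(0.0, 1.0) + parent_part)
        # Shared factor gives variance 0.25 + 0.75 = 1 and correlation 0.25.
        common = rng.gauss(0.0, 1.0)
        z = [0.5 * common + math.sqrt(0.75) * rng.gauss(0.0, 1.0) for _ in range(N)]
        y = [e[s] + sum(b * z[i] for (i, c), b in b_cov.items() if c == s)
             for s in range(S)]
        Z.append(z)
        Y.append(y)
    return b_err, noise, b_cov, Z, Y
```

Calling `simulate(T=50)` and `simulate(T=10, seed=2)` would give an inference and a validation data set, although in the study both data sets of a model share the same parameters, which this sketch would need to pass in rather than redraw.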

Protein signalling (mTOR) data
The mammalian target of rapamycin complex 1 (mTOR) pathway is an important protein signalling pathway in all eukaryotic cells. The proteins activate and inactivate each other by phosphorylation, and the activity of each protein depends on which of its sites are phosphorylated. The immunoblotting data from Dalle Pezze et al. (2016) contain measurements of $S = 11$ phosphorylation states of eight key proteins across the mTOR signalling network. After two treatments (only amino acids vs. amino acids and insulin) the phosphorylation states were measured at 10 non-equidistant time points $m_t \in \{\ldots, 3, 5, 10, 15, 30, 45, 60, 120\}$ [in min]. We follow Shafiee Kamalabad et al. (2019) and z-score standardise the values of each phosphorylation site to mean 0 and variance 1. For our analyses we select a vector auto-regressive (VAR) approach, which is a special case of the MBLR approach. Let $y_{m_t,i}$ denote the measurement for protein site $i \in \{1, \ldots, 11\}$ at time point $m_t$. From each of the two experiments we build nine data points, $(\mathbf{z}_t, \mathbf{y}_t)_{t=1,\ldots,9}$, where $\mathbf{y}_t = (y_{m_{t+1},1}, \ldots, y_{m_{t+1},11})$ and $\mathbf{z}_t = (y_{m_t,1}, \ldots, y_{m_t,11})$ are the measurements at time points $m_{t+1}$ and $m_t$. By merging the data points from both experiments, we get $T = 18$ data points. The model infers error-error ($e_i \to e_j$) and covariate-response ($z_i \to y_j$) interactions, which here correspond to static ($y_{m_t,i} \to y_{m_t,j}$) and dynamic ($y_{m_t,i} \to y_{m_{t+1},j}$) protein interactions. We provide the mTOR data as supplementary material.
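How the 18 data points arise from two time series of length 10 can be illustrated with a small sketch. The toy numbers below are placeholders, not the mTOR measurements, and the function names `consecutive_pairs` and `build_var_data` are hypothetical.

```python
def consecutive_pairs(series):
    """Pair each measurement vector (covariates) with its successor (responses)."""
    return [(series[t], series[t + 1]) for t in range(len(series) - 1)]


def build_var_data(experiments):
    """Merge the data points of several experiments without bridging them."""
    data = []
    for series in experiments:
        data.extend(consecutive_pairs(series))
    return data


# Toy stand-in: 2 experiments, 10 time points, 11 sites each.
toy_experiments = [
    [[float(100 * k + 10 * t + i) for i in range(11)] for t in range(10)]
    for k in range(2)
]
toy_data = build_var_data(toy_experiments)
print(len(toy_data))  # 18
```

Pairs are formed within each experiment only, so no data point combines the last time point of one treatment with the first time point of the other.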

Andromeda (ANDRO) data
Hatzikos et al. (2008) provided water quality data from the Thermaikos Gulf of Thessaloniki (Greece). The data set contains daily underwater measurements of six variables, namely: temperature, pH, conductivity, salinity, oxygen, and turbidity. The data are available from Hatzikos et al. (2008). We follow Spyromitros-Xioufis et al. (2016) and apply a vector auto-regressive (VAR) approach in which the $S = 6$ variables are the responses, $\mathbf{y}_t = (y_{t,1}, \ldots, y_{t,6}) \in \mathbb{R}^6$, and the measurements of the same variables from 6-10 days before are the $N = 30$ potential covariates, symbolically $\mathbf{z}_t = (\mathbf{y}_{t-6}, \ldots, \mathbf{y}_{t-10}) \in \mathbb{R}^{30}$. With the time lag of up to $\tau = 10$ days, $T = 49$ data points $(\mathbf{z}_t, \mathbf{y}_t)$ can be built from the data. As for the mTOR data from Sect. 4.2, the model infers static ($y_{t,i} \to y_{t,j}$) and dynamic interactions ($y_{t-\tau,i} \to y_{t,j}$), where the latter can be subject to different time lags $\tau \in \{6, \ldots, 10\}$.
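The construction of the lagged covariate vectors can be sketched as follows. The toy series is a placeholder rather than the ANDRO measurements; its length of 59 days is an assumption chosen so that 49 data points result, and the function name `lagged_design` is hypothetical.

```python
def lagged_design(series, lags=range(6, 11)):
    """Build (covariates, responses) pairs from a daily multivariate series.

    The covariate vector stacks the measurements from `lags` days before.
    """
    max_lag = max(lags)
    data = []
    for t in range(max_lag, len(series)):
        z = [value for lag in lags for value in series[t - lag]]
        data.append((z, series[t]))
    return data


# Toy stand-in: 59 days with 6 variables each.
toy_series = [[float(10 * day + j) for j in range(6)] for day in range(59)]
toy_pairs = lagged_design(toy_series)
print(len(toy_pairs), len(toy_pairs[0][0]))  # 49 30
```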

Occupational Employment Survey (OES) 2010
On a yearly basis, the US Bureau of Labor Statistics performs occupational employment surveys (OES) in large US cities. For each city the number of full-time equivalent employees for different job types is reported. The OES data set from 2010 is available from Spyromitros-Xioufis et al. (2016), where it was used to compare the performances of multi-target prediction algorithms. We follow Spyromitros-Xioufis et al. (2016) and (i) focus only on those job types that were present in at least 50 percent of the cities and (ii) replace missing values by job type sample means. This way one obtains 403 cities (observations) and 314 job types (variables). As the counts differ by orders of magnitude (they range from 30 to 156,740), we z-score standardise the counts of each city to mean 0 and variance 1. For our empirical studies, we generate data sets with $S = 20$ job types as responses and $N = 20$ other job types as covariates, and we then sample $T \in \{10, 20, 40, 80, 160, 320\}$ cities as data points. For each $T$ we generate 10 data instantiations. Each time we randomly sample $S$ responses, $N$ covariates, $T$ cities, and 10 more cities for validation (predictive probabilities).

Results
In Sect. 5.1 we provide exemplary convergence diagnostics and effective sample size calculations. In Sect. 5.2 we cross-compare the performances of the different models described in Sect. 2.7 in terms of predictive probabilities and precision-recall AUC scores. Finally, in Sect. 5.3 we investigate the scalability of the proposed RJMCMC inference algorithm and we provide measurements of the computational costs.

Convergence diagnostics and effective sample sizes
To assess whether $V = 100\mathrm{k}$ MCMC iterations are sufficient to reach satisfactory convergence, we employ trace plot and potential scale reduction factor (PSRF) diagnostics. In this subsection our focus is on the mTOR data from Sect. 4.2 and the ANDRO data from Sect. 4.3 to exemplify how we assessed convergence. With the trace plots we monitor global quantities, such as the numbers of static and dynamic interactions or the posterior scores of the sampled models. Example trace plots are provided in Supplement S10. The trace plot diagnostics in Figs. 1-2 of Supplement S10 show the same trend: Already after relatively few iterations, the trace plots of $H = 10$ independent MCMC simulations run into plateaus and overlay smoothly. In Fig. 1 we monitor the fraction $F_{2s}$ of edges that meet the criterion of a PSRF lower than the threshold 1.05; see Sect. 2.9 for the technical details. It can be seen that at the end of the burn-in phase of length 50k (50% of $V = 100\mathrm{k}$) all interactions fulfill the PSRF-based convergence criterion. During the subsequent sampling phase (results not shown) the fractions stayed constantly equal to 1. For each possible interaction $i \to j$ we also compute the individual effective sample size $\mathrm{ESS}_{i,j}$, as explained in Sect. 2.9. Figure 2 shows boxplots of the effective sample sizes (ESSs), distinguishing for both data sets (mTOR vs. ANDRO) the two interaction types (contemporaneous vs. dynamic). The boxplots display the distributions of the interaction-specific ESS values, with the medians taken across all interactions of the respective type; the medians are also provided in Table 1 along with the average computing time [in seconds] for 50k MCMC iterations. The sampling efficiency is the ratio of ESS and computing time, i.e. the number of (effective) posterior samples that are generated per second.
Although the ESSs are rather small relative to the sampling phase length of 50k MCMC iterations, they seem sufficiently large for reliably estimating the marginal edge scores; cf. the asymptotic confidence intervals from Sect. 2.5.

Fig. 1 Convergence diagnostics based on potential scale reduction factors (PSRFs). For each individual interaction we computed a PSRF, and the panels show trace plots of the fractions of the edges whose PSRF was lower than the threshold 1.05 (vertical axis) monitored along the number of MCMC iterations (horizontal axis). At the end of the burn-in phase (50k iterations) all PSRFs were below 1.05 and this did not change anymore during the sampling phase. The rows correspond to the mTOR (top) and to the ANDRO (bottom) data. The columns refer to the contemporaneous (left) and dynamic (right) interactions

Fig. 2 Effective sample sizes (ESSs) per data set and interaction type. Each ESS was computed from the 50k samples that were taken during the sampling phase and averaged across H = 10 independent MCMC simulations. Each panel refers to a data set (mTOR or ANDRO) and interaction type (contemporaneous or dynamic interactions). The boxplots display the distributions of the interaction-specific ESSs. The medians of the boxplots are provided in Table 1

Method comparisons

Synthetic data
First, we cross-compare the performances of the four models: $M_{0,0}$ (baseline), $M_{1,1}$ (proposed), and $M_{1,0}$ and $M_{0,1}$ ('in-between') on the 50 synthetically generated data sets (see Sect. 4.1). Figure 3 shows the results in terms of predictive probabilities and areas under precision-recall curves (AUCs). In both criteria the proposed $M_{1,1}$ model clearly outperforms the over-complex baseline model, $M_{0,0}$, that employs a full error DAG and a full covariate matrix. The predictive probabilities of the $M_{1,1}$ model are consistently higher than those of the $M_{0,0}$ model. To gain more insight into where the improvement comes from, we compute separate AUCs for the two types of interactions (error-error and covariate-response). We see that both AUC types are significantly in favour of the $M_{1,1}$ model. The average AUC improvements are 0.34 (error-error) and 0.33 (covariate-response). The AUC results for the two 'in-between' models meet our expectations. While the $M_{1,0}$ model (full covariate matrix) mainly loses accuracy w.r.t. the covariate-response interactions, the $M_{0,1}$ model (full error DAG) yields lower accuracies for the error-error interactions. This suggests that the two improvements are independent and that each (mainly) concerns a specific interaction type. Interestingly, a decreased accuracy in the covariate-response interactions also seems to result in a slightly decreased accuracy for the error-error interactions (see results of $M_{1,0}$). In terms of predictive probabilities the $M_{1,1}$ model significantly outperforms all three competitors (Wilcoxon signed-rank test). However, the predictive probabilities of the $M_{0,1}$ model are only slightly worse and partly overlap. Apparently, with regard to the prediction accuracy the over-complexity of a full error graph ($M_{0,1}$) is less misleading than over-complex covariate sets ($M_{1,0}$). In Supplement S11 we provide more empirical results. First, we studied other hyperparameters with stronger penalties for the regression parameters.
We set the degrees-of-freedom hyperparameter to 24, 36 or 48 rather than to 12, as in Fig. 3. Second, we cross-compared the performances of the $M_{1,1}$ model and the traditional MBLR model from Sect. 2.1. Our findings can be summarised as follows: For larger hyperparameters the trends from Fig. 3 are conserved and the differences stay significant, but the relative differences get smaller as the hyperparameter increases (see Supplementary Figs. 3-5). For the traditional MBLR model (see Figure 6 in Supplement S11) it can be seen that it yields a low accuracy for the error-error interactions ($\mathrm{AUC} \approx 0.32$). A possible explanation is that the traditional model re-employs the error covariance matrix in the regression parameter prior; cf. Eq. (6). The interference between the regression parameters and the error covariance matrix might introduce a systematic bias. For more details we refer to Supplement S11.
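For readers unfamiliar with precision-recall AUC scores, the following sketch computes average precision, one common step-wise estimate of the area under the precision-recall curve. How the AUCs in this study were computed (for instance, with which interpolation) is not specified in this section, so the snippet illustrates the criterion rather than the exact procedure. The function name `average_precision` is hypothetical; `scores` stands for marginal edge scores and `labels` for the 0/1 ground truth.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the ranks of the true interactions."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one true interaction is required")
    # Rank the candidate interactions by decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / n_pos


toy_ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

In the toy call the two true interactions are ranked first and third, giving precisions 1 and 2/3 and hence an average precision of 5/6.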

mTOR and ANDRO data
For the mTOR (see Sect. 4.2) and the ANDRO data (see Sect. 4.3) we follow a vector auto-regressive (VAR) approach, so that the observations at the previous time points (= covariates) explain the observations at the current time point (= responses). Covariate-response interactions then refer to dynamic interactions (with a time lag), while error-error interactions refer to contemporaneous interactions (within the time points). Unlike the ANDRO data, the mTOR data were measured at non-equidistant time points. Therefore neither static Bayesian networks (which ignore the data point order) nor dynamic Bayesian networks (which assume equidistant data points) seem appropriate. In our study, we cross-compare the performances of static Bayesian networks ($M_S$), dynamic Bayesian networks ($M_D$) and the two MBLR models $M_{0,0}$ and $M_{1,1}$. Since the true interactions are unknown, we resort to predictive probabilities, and we follow a leave-one-out cross-validation approach, yielding a predictive probability for each individual left-out data point. Figure 4 shows the results for the mTOR data. It can be seen that the dynamic $M_D$ model performs slightly better than the static $M_S$ model (p-value: $p = 0.1297$) and that the two MBLR models, $M_{0,0}$ and $M_{1,1}$, improve over the two Bayesian network types. Consistent with our finding for the synthetic data, the proposed $M_{1,1}$ model yields the best results and significantly outperforms the $M_{0,0}$ baseline model (see right panel of Fig. 4). The results for the ANDRO data are shown in Fig. 5. Unlike for the mTOR data, here the static $M_S$ model performs better than the dynamic $M_D$ model (p-value: $p < 10^{-4}$), indicating that the contemporaneous interactions are of greater importance. Again the two MBLR models ($M_{0,0}$ and $M_{1,1}$) improve over both Bayesian network types, and the proposed $M_{1,1}$ model performs best and is significantly superior to its three competitors $M_D$, $M_S$ and $M_{0,0}$ (see right panel of Fig. 5).
In Supplement S12 we compare the results of the four MBLR model variants $M_{i,j}$ ($i, j \in \{0, 1\}$). The results can be summarised as follows: While the models $M_{0,0}$ and $M_{1,0}$ (with full covariate matrix) perform worse than the $M_{1,1}$ model, the predictive probabilities of the $M_{0,1}$ model (with full error DAG) are comparable to those of the $M_{1,1}$ model. A possible explanation is that our data have in common that there are more possible covariate-response interactions than error-error interactions. For more details we refer to Supplement S12.

OES 2010 data
We use the OES 2010 data from Sect. 4.4 to study sample size effects. We cross-compare the performances of the four model variants $M_{i,j}$ ($i, j \in \{0, 1\}$) on data sets with different sample sizes $T$. For each $T \in \{10, 20, 40, 80, 160, 320\}$ we generate 10 independent data sets. For each we sample $N = 20$ covariates, $S = 20$ responses, and then $T + 10$ observations, where the last 10 are for computing predictive probabilities. Figure 6 shows boxplots of the relative log predictive probability differences. The proposed $M_{1,1}$ model yields the highest predictive probabilities for all sample sizes $T$, and the relative differences decrease as the sample size $T$ increases. The baseline model $M_{0,0}$ shows the worst performance. And, as observed before, averaging across the covariate matrices ($M_{0,1}$) yields more improvement than averaging across the error DAGs ($M_{1,0}$). In Supplement S13 we provide three additional figures. Figure 9 in S13 shows boxplots of the model-specific predictive probabilities from which the differences in Fig. 6 were computed. Figure 10 in S13 shows how the predictive probabilities of the $M_{1,1}$ model increase in the sample size $T$. Figure 11 in S13 rearranges the results from Fig. 6, so as to show more explicitly how the relative predictive probability differences decrease as the sample size $T$ increases.

Fig. 6 Each panel refers to a sample size $T$ and shows box plots of the log predictive probability differences in favour of the $M_{1,1}$ model. For computing the predictive probabilities we each time generated validation data with 10 observations. In Supplement S13 we provide more results

Scalability and computational costs
We also use the OES 2010 data to monitor the scalability of the RJMCMC algorithm from Sect. 2.4. For all combinations of $N \in \{10, 20, 30\}$, $S \in \{10, 20, 30\}$ and $T \in \{25, 50, 100, 200\}$ we generate 10 data sets, each consisting of $N$ random covariates, $S$ random responses and $T$ random data points. Table 2 lists the average computational costs for $V = 100\mathrm{k}$ RJMCMC iterations. It can be seen that the number of covariates $N$ has only a minor effect, while the number of responses $S$ and the number of data points $T$ clearly affect the run times. The computational costs seem to increase linearly in $T$, whereas the increase in $S$ does not follow a clear trend: the transition from $S = 10$ to $S = 20$ increases the computational costs by factors in the range 2-3, but the transition from $S = 20$ to $S = 30$ increases them only by factors in the range 1-2. An exact investigation of the computational costs is beyond the scope of this paper and might be difficult, as our algorithm combines algorithms for Bayesian networks (Giudici and Castelo 2003) and Bayesian linear regression (Raftery et al. 1997). However, the measured run times suggest that our Matlab implementation does not scale well in $S$ and $T$. For example, running 100k RJMCMC iterations on a data set with $S = 30$ responses and $T = 200$ data points takes more than 8 hours of computational time on a standard desktop PC. For large sample sizes $T$ one could resort to the traditional MBLR, as our results from Sect. 5.2.3 suggest that the performance differences decrease as $T$ increases. For high-dimensional response vectors ($S > 30$) our current Matlab implementation seems to require 'unreasonably' long run times. However, we would argue that there are plenty of applications where the number of responses is (clearly) below $S = 30$ and where the new model can yield a substantial improvement in reasonable time (in terms of hours).
To improve scalability, one could implement the algorithm in a more efficient programming language, such as C++, or parallelize the code and run it on a computer cluster. For example, the shotgun stochastic search method from Hans et al. (2007) could directly be adapted for speeding up the covariate sampling part. In Supplement S14 we propose and briefly discuss alternative strategies and approximations that might improve scalability.

Discussion and conclusion
In this work, we have proposed a new sparse seemingly unrelated regression (SSUR) model. The new model improves in two important respects over earlier proposed SSUR models (Wang 2010). First, our model employs directed acyclic graphs (Gaussian Bayesian networks) rather than undirected graphs (Gaussian graphical models) to model the conditional independencies among the errors. Unlike all earlier proposed SSUR models, our new model is therefore capable of learning some directed edges among the errors. Second, the RJMCMC scheme from Wang (2010) requires computationally expensive approximations of the marginal likelihood. For the new model we have designed an RJMCMC algorithm that does not have to rely on approximations. We have shown that all required marginal likelihoods and full conditional distributions can be computed analytically. In particular, we have presented an algorithm for sampling covariance matrices that are coherent with a given directed acyclic graph. When compared with the multivariate Bayesian linear regression (MBLR) model, the proposed SSUR model improves in two ways: It infers response-specific covariate sets and it infers directed acyclic graphs (Gaussian Bayesian networks) among the errors. In a comparative evaluation study we have compared the new SSUR model with the traditional MBLR model as well as with two 'in-between' models, each featuring only one of the two improvements. The results suggest that allowing for response-specific covariate sets yields clearer gains than inferring Gaussian Bayesian networks among the errors. For synthetic data, where the ground truth is known, we have shown that the new model identifies the true interactions with a higher accuracy than the competing model variants.
Our current implementation of the RJMCMC algorithm does not scale up well. In particular, the computational costs increase in the number of responses and in the sample size. We feel that the new SSUR model might be of high relevance for vector auto-regressive (VAR) models. In VAR models one distinguishes between dynamic and contemporaneous interactions, and both interaction types are considered to be (equally) important. The existing VAR and SSUR models only learn undirected contemporaneous interactions, while our new model can learn a mixture of directed and undirected edges among the errors. When applied in a vector auto-regressive fashion, the error-error edges correspond to response-response edges, so that our model is capable of learning some directed contemporaneous interactions. Our future work will aim to improve the scalability of the RJMCMC algorithm so that new model features can be added. For VAR models, it would for example be interesting to combine our new method with multiple change point processes (Fearnhead 2006), so as to be able to infer time-varying VAR models (Koop and Korobilis 2013), where the model parameters can undergo temporal changes. Although conceptually easy to achieve, changepoints increase the model complexity further. Hence, the combination of both approaches is subject to the condition that the scalability can be improved.