Operator-valued kernel-based vector autoregressive models for network inference
Abstract
Reverse-engineering of high-dimensional dynamical systems from time-course data remains a challenging and important problem in knowledge discovery. For this learning task, a number of approaches primarily based on sparse linear models or Granger causality concepts have been proposed in the literature. However, when a system exhibits nonlinear dynamics, there does not exist a systematic approach that takes into account the nature of the underlying system. In this work, we introduce a novel family of vector autoregressive models based on different operator-valued kernels to identify the dynamical system and retrieve the target network that characterizes the interactions of its components. Assuming a sparse underlying structure, a key challenge, also present in the linear case, is to control the model’s sparsity. This is achieved through the joint learning of the structure of the kernel and the basis vectors. To solve this learning task, we propose an alternating optimization algorithm based on proximal gradient procedures that learns both the structure of the kernel and the basis vectors. Results on the DREAM3 competition gene regulatory benchmark networks of sizes 10 and 100 show that the new model outperforms existing methods. Another application of the model on climate data identifies interesting and interpretable interactions between natural and human activity factors, thus confirming the ability of the learning scheme to retrieve dependencies between state variables.
Keywords
Network inference · Operator-valued kernel · Regularization · Proximal gradient methods · Vector autoregressive model · Jacobian

1 Introduction
In many scientific problems, high-dimensional data with network structure play a key role in knowledge discovery (Kolaczyk 2009). For example, recent advances in high-throughput technologies have facilitated the simultaneous study of components of complex biological systems. Hence, molecular biologists are able to measure the expression levels of the entire genome and a good portion of the proteome and metabolome under different conditions and thus gain insight on how organisms respond to their environment. For this reason, reconstruction of gene regulatory networks from expression data has become a canonical problem in computational systems biology (Lawrence et al. 2010). Similar data structures emerge in other scientific domains. For instance, political scientists have focused on the analysis of roll call data of legislative bodies, since they allow them to study party cohesion and coalition formation through the underlying network reconstruction (Morton and Williams 2010), while economists have focused on understanding companies’ creditworthiness or contagion (Gilchrist et al. 2009). Understanding climate change requires the ability to predict the behavior of climate variables and their dependence relationships (Parry et al. 2007; Liu et al. 2010). Two classes of network inference problems have emerged simultaneously from all these fields: the inference of association networks that represent coupling between variables of interest (Meinshausen and Bühlmann 2006; Kramer et al. 2009) and the inference of “causal” networks that describe how variables influence each other (Murphy 1998; Perrin et al. 2003; Auliac et al. 2008; Zou and Feng 2009; Shojaie and Michailidis 2010; Maathuis et al. 2010; Bolstad et al. 2011; Dondelinger et al. 2013; Chatterjee et al. 2012).
Over the last decade, a number of statistical techniques have been introduced for estimating networks from high-dimensional data in both cases. They divide into model-free and model-driven approaches. Model-free approaches for association networks directly estimate information-theoretic measures, such as mutual information, to detect edges in the network (Hartemink 2005; Margolin et al. 2006). Among model-driven approaches, graphical models have emerged as a powerful class of models, and many algorithmic and theoretical advances have occurred for static (independent and identically distributed) data under the assumption of sparsity. For instance, Gaussian graphical models have been thoroughly studied (see Bühlmann and van de Geer (2011) and references therein) under different regularization schemes that reinforce sparsity in linear models in an unstructured or a structured way. In order to infer causal relationship networks, Bayesian networks (Friedman 2004; Lèbre 2009) have been developed either from static data or from time-series data within the framework of dynamical Bayesian networks. In the case of continuous variables, linear multivariate autoregressive modeling (Michailidis and d’Alché-Buc 2013) has been developed, again with an important focus on sparse models. In this latter framework, Granger causality models have attracted increasing interest to capture causal relationships.
However, to date, few papers in the literature have focused on network inference for continuous variables in the presence of nonlinear dynamics, despite the fact that many mechanisms (e.g. regulatory ones in biology) involve such dynamics. Of special interest are approaches based on parametric ordinary differential equations (Chou and Voit 2009) that alternately learn the structure of the model and its parameters. The most successful approaches are based on Bayesian learning (Mazur et al. 2009; Aijo and Lahdesmaki 2009), which deals with the stochasticity of biological data while easily incorporating prior knowledge, and on genetic programming (Iba 2008), which provides a population-based algorithm for stochastic search in the structure space. In this study, we start from a regularization theory perspective and introduce a general framework for nonlinear multivariate modeling and network inference. Our aim is to extend the framework of sparse linear modeling to that of sparse nonlinear modeling. In the machine learning community, a powerful tool to extend linear models to nonlinear ones is kernels. The famous “kernel trick” allows one to deal with nonlinear learning problems by working implicitly in a new feature space, where inner products can be computed using a symmetric positive semidefinite function of two variables, called a kernel. In particular, a given kernel allows one to build a unique Reproducing Kernel Hilbert Space (RKHS), i.e., a functional space where regularized models can be defined from data using representer theorems. RKHS theory provides a unified framework for many kernel-based models and a principled way to build new (nonlinear) models.
Since multivariate time-series modeling requires defining vector-valued models, we propose to build on operator-valued kernels and their associated reproducing kernel Hilbert space theory (Senkene and Tempel’man 1973), which were introduced in machine learning by Micchelli and Pontil (2005) for the multi-task learning problem with vector-valued functions. This is an active area (see the review by Alvarez et al. 2011), with new applications on vector field regression (Baldassarre et al. 2010), structured classification (Dinuzzo and Fukumizu 2011), functional regression (Kadri et al. 2011) and link prediction (Brouard et al. 2011). However, the use of operator-valued kernels in the context of time series is novel.
Building upon our previous work (Lim et al. 2013), which focused on a specific model, we define a whole family of nonlinear vector autoregressive models based on various operator-valued kernels. Once an operator-valued kernel-based model is learnt, we compute an empirical estimate of its Jacobian, providing a generic and simple way to extract dependence relationships among variables. We discuss how a specific operator-valued kernel can produce not only a good approximation of the system dynamics, but also a flexible and controllable Jacobian estimate. To obtain sparse networks and sparse Jacobian estimates, we extend the sparsity constraints regularly used in linear modeling.
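The Jacobian-based extraction step can be illustrated with a short numerical sketch. The helper names below are ours, and a central finite-difference estimate stands in for the empirical Jacobian of a learned model \(h\); edge \((j \rightarrow i)\) is scored by the mean absolute Jacobian entry over the observed states:

```python
import numpy as np

def numerical_jacobian(h, x, eps=1e-6):
    """Central-difference estimate of the Jacobian of h: R^d -> R^d at x."""
    d = x.shape[0]
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (h(x + e) - h(x - e)) / (2 * eps)
    return J

def adjacency_from_jacobian(h, X, threshold=1e-2):
    """Score edge (i, j) by the mean absolute Jacobian entry over all
    observed states X (one state per row), then threshold the scores
    to obtain a binary adjacency matrix."""
    scores = np.mean([np.abs(numerical_jacobian(h, x)) for x in X], axis=0)
    return (scores > threshold).astype(int), scores
```

For a linear model \(h({\mathbf {x}}) = A{\mathbf {x}}\) this recovers \(|A|\) exactly, which is the sense in which the Jacobian generalizes the coefficient matrix of a linear VAR.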
To control the smoothness of the model, the definition of the loss function involves an \(\ell _2\)-norm penalty and, additionally, may include two different types of penalties, either an \(\ell _1\)-norm penalty or a mixed \(\ell _1/\ell _2\)-norm regularization applied to the matrix of parameter vectors, depending on the nature of the estimation problem. To optimize a loss function that contains these non-differentiable terms, we develop a general proximal gradient algorithm.
Note that selected operator-valued kernels involve a positive semidefinite matrix as a hyperparameter. The background knowledge required for its definition is in general not available, especially in a network inference task. To address this kernel design task together with learning the other parameters, we introduce an efficient strategy that alternately learns the parameter vectors and the positive semidefinite matrix that characterizes the kernel. This matrix plays an important role regarding the Jacobian sparsity; the estimation procedure for the matrix parameter also involves an \(\ell _1\) penalty and a positive semidefiniteness constraint. We show that, without prior knowledge of the relationships between variables, the proposed algorithm is able to retrieve the network structure of a given underlying dynamical system from the observation of its behavior through time.
The remainder of the paper is organized as follows: in Sect. 2, we present the general network inference scheme. In Sect. 3, we recall elements of RKHS theory devoted to vector-valued functions and introduce operator-valued kernel-based autoregressive models. Section 4 presents the learning algorithm that estimates both the parameters of the model and the parameters of the kernel. Section 5 illustrates the performance of the model and the algorithm through extensive numerical work based on both synthetic and real data, and comparison with state-of-the-art methods.
2 Network inference from nonlinear vector autoregressive models
Note that to obtain a high-quality estimate of the network, we need a class of functions \(h\) whose Jacobian matrices can be controlled during learning in such a way that they provide good continuous approximations of \(A\). In this work, we propose a new class of nonparametric vector autoregressive models that exhibit such properties. Specifically, we introduce Operator-valued Kernel-based Vector AutoRegressive (OKVAR) models, which constitute a rich class as discussed in the next section.
3 Operatorvalued kernels and vector autoregressive models
3.1 From scalar-valued kernel to operator-valued kernel models of autoregression
3.2 Basics of operator-valued kernel-based theory
In RKHS theory with operatorvalued kernels, we consider functions with input in some set \({\fancyscript{X}}\) and output with vector values in some given Hilbert space \({{\fancyscript{F}}_y}\). For completeness, we first describe the general framework and then come back to the case of interest, namely \({\fancyscript{X}}= {{\fancyscript{F}}_y}= {\mathbb {R}}^d\). Denote by \(L({{\fancyscript{F}}_y})\), the set of all bounded linear operators from \({{\fancyscript{F}}_y}\) to itself. Given \(A \in L({{\fancyscript{F}}_y})\), \(A^*\) denotes its adjoint. Then, an operatorvalued kernel \(K\) is defined as follows:
Definition 1

\(\forall ({\mathbf {x}},\mathbf{z}) \in {\fancyscript{X}}\times {\fancyscript{X}}\), \(K({\mathbf {x}},\mathbf{z}) = K(\mathbf{z},{\mathbf {x}})^*\)

\(\forall m \in {\mathbb {N}}\), \(\forall \{({\mathbf {x}}_i,{\mathbf {y}}_i), i=1,\ldots ,m\} \subseteq {\fancyscript{X}}\times {{\fancyscript{F}}_y}, \sum _{i,j=1}^m \langle {\mathbf {y}}_i, K({\mathbf {x}}_i,{\mathbf {x}}_j) {\mathbf {y}}_j \rangle _{{{\fancyscript{F}}_y}}\ge 0\)
3.3 The OKVAR family
Let us recall the definition of the scalar-valued Gaussian kernel \(k_{Gauss}: {\mathbb {R}}^d \times {\mathbb {R}}^d\rightarrow {\mathbb {R}}\): \(k_{Gauss}({\mathbf {x}},\mathbf{z}) = \exp (-\gamma \Vert {\mathbf {x}}-\mathbf{z}\Vert ^2)\). Note that in the special case \(d=1\), \(k_{Gauss}(x,z)\) reduces to \(\exp (-\gamma (x-z)^2)\).
As a baseline, we first consider the Gaussian transformable kernel, which extends the standard Gaussian kernel to the matrix-valued case. If \({\mathbf {x}}\) is a vector, we denote by \(x^m\) its \(m\)th coordinate. Then the Gaussian transformable kernel is defined as follows:
Definition 2

\(\forall ({\mathbf {x}},\mathbf{z}) \in {\mathbb {R}}^d \times {\mathbb {R}}^d\), the Gaussian transformable kernel \(K_{Gauss}\) is defined entrywise by \(K_{Gauss}({\mathbf {x}},\mathbf{z})_{ij} = \exp (-\gamma (x^i - z^j)^2)\), \(i,j=1,\ldots ,d\).
Interestingly, each \((i,j)\)-coefficient of the kernel \(K_{Gauss}\) compares the \(i\)th coordinate of \({\mathbf {x}}\) to the \(j\)th coordinate of \(\mathbf{z}\), allowing a richer comparison between \({\mathbf {x}}\) and \(\mathbf{z}\). For the sake of simplicity, we will call this kernel the Gaussian kernel in the remainder of the paper. Note that the Gaussian kernel depends on a single hyperparameter \(\gamma \). It gives rise to the following Gaussian OKVAR model.
Definition 3

Given basis vectors \({\mathbf {x}}_1,\ldots ,{\mathbf {x}}_N\) and parameter vectors \(\mathbf{c}_1,\ldots ,\mathbf{c}_N \in {\mathbb {R}}^d\), the Gaussian OKVAR model is defined as \(h_{Gauss}({\mathbf {x}}_t) = \sum _{\ell =1}^{N} K_{Gauss}({\mathbf {x}}_t,{\mathbf {x}}_{\ell })\, \mathbf{c}_{\ell }\).
An interesting feature of the Gaussian kernel-based OKVAR model is that each coordinate \(i\) of the vector model \(h_{Gauss}({\mathbf {x}}_t)^i\) can be expressed as a linear combination of nonlinear functions of variables \(j= 1, \ldots , d\): \(h_{Gauss}({\mathbf {x}}_t)^{i} = \sum _{\ell } \sum _j \exp (-\gamma (x_t^i - x_{\ell }^j)^2)\, c_{\ell }^j\).
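A minimal numerical sketch of the transformable Gaussian kernel and the induced model may help fix ideas; the helper names are hypothetical, and \(\gamma\) is the kernel hyperparameter used in the experiments:

```python
import numpy as np

def k_gauss_transformable(x, z, gamma=0.2):
    """Transformable Gaussian kernel: entry (i, j) compares coordinate i
    of x with coordinate j of z, K(x, z)_{ij} = exp(-gamma*(x_i - z_j)^2)."""
    diff = x[:, None] - z[None, :]
    return np.exp(-gamma * diff ** 2)

def h_gauss(x, X_basis, C, gamma=0.2):
    """OKVAR-style prediction h(x) = sum_l K(x, x_l) c_l, with basis
    vectors X_basis (one per row) and parameter matrix C (row l is c_l)."""
    return sum(k_gauss_transformable(x, xl, gamma) @ cl
               for xl, cl in zip(X_basis, C))
```

One can check on examples that \(K({\mathbf {x}},\mathbf{z}) = K(\mathbf{z},{\mathbf {x}})^\top\), the Hermitian property required by Definition 1.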
Decomposable kernels are another class of operator-valued kernels, first defined by Micchelli and Pontil (2005) to address multi-task regression problems and structured classification. When based on Gaussian kernels, they are defined as follows:
Definition 4

\(\forall ({\mathbf {x}},\mathbf{z}) \in {\mathbb {R}}^d \times {\mathbb {R}}^d\), \(K_{dec}({\mathbf {x}},\mathbf{z}) = \exp (-\gamma _1 \Vert {\mathbf {x}}-\mathbf{z}\Vert ^2)\, B\), where \(B\) is a positive semidefinite matrix.
In this kernel, \(B\) is related to the structure underlying the outputs: \(B\) imposes dependence amongst selected outputs. This kernel was shown to be universal by Caponnetto et al. (2008), i.e., the induced RKHS is a family of universal approximators. The decomposable Gaussian OKVAR model is then defined as follows:
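A short sketch of a decomposable Gaussian kernel (hypothetical helper names), together with a numerical check of the positivity property of Definition 1, which holds because the Gram matrix of a decomposable kernel is a Kronecker product of two positive semidefinite matrices:

```python
import numpy as np

def k_dec(x, z, B, gamma=0.1):
    """Decomposable Gaussian kernel: K(x, z) = exp(-gamma*||x - z||^2) * B,
    with B a positive semidefinite matrix encoding output structure."""
    return np.exp(-gamma * np.sum((x - z) ** 2)) * B
```

For any points \({\mathbf {x}}_i\) and vectors \({\mathbf {y}}_i\), the quadratic form \(\sum _{i,j} \langle {\mathbf {y}}_i, K({\mathbf {x}}_i,{\mathbf {x}}_j) {\mathbf {y}}_j \rangle\) should be nonnegative whenever \(B\) is positive semidefinite.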
Definition 5
Now let \(K_{dec}\) be a decomposable Gaussian kernel with scalar parameter \(\gamma _1\) and matrix parameter \(B\), and let \(K_{Gauss}\) be a Gaussian kernel with scalar parameter \(\gamma _2\). As proposed in Lim et al. (2013), we combine the Gaussian kernel and the decomposable kernel through the Hadamard product to get a kernel that involves nonlinear functions of single coordinates of the input vectors, while imposing some structure on the kernel through a positive semidefinite matrix \(B\). The resulting kernel is called the Hadamard kernel.
Definition 6

\(\forall ({\mathbf {x}},\mathbf{z}) \in {\mathbb {R}}^d \times {\mathbb {R}}^d\), \(K_{Hadamard}({\mathbf {x}},\mathbf{z}) = K_{dec}({\mathbf {x}},\mathbf{z}) \circ K_{Gauss}({\mathbf {x}},\mathbf{z})\), where \(\circ \) denotes the Hadamard (entrywise) product.
The resulting kernel \(K_{Hadamard}\) possesses the kernel property, i.e.:
Proposition 1
The kernel defined by (11) is a matrix-valued kernel.
Proof
A Hadamard product of two matrix-valued kernels is a matrix-valued kernel [Proposition 4 in Caponnetto et al. (2008)]. \(\square \)
The Hadamard OKVAR model has the following form:
Definition 7
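The entrywise-product construction above can be sketched numerically as follows (hypothetical helper names; \(\gamma _1, \gamma _2\) are the values used later in the experiments). Note how zero entries of \(B\) zero out the corresponding kernel entries, which is what makes \(B\) a handle on the sparsity of the resulting Jacobian:

```python
import numpy as np

def k_hadamard(x, z, B, gamma1=1e-5, gamma2=0.2):
    """Hadamard kernel: entrywise product of a decomposable Gaussian kernel
    (scalar parameter gamma1, structure matrix B) and a transformable
    Gaussian kernel (scalar parameter gamma2)."""
    K_dec = np.exp(-gamma1 * np.sum((x - z) ** 2)) * B
    diff = x[:, None] - z[None, :]
    K_gauss = np.exp(-gamma2 * diff ** 2)
    return K_dec * K_gauss   # elementwise (Hadamard) product
```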
3.4 Jacobians of the OKVAR models
4 Learning OKVAR with proximal gradient algorithms
4.1 Learning \(C\) for fixed kernel

\(L_{C}\) is a Lipschitz constant of \(\nabla _{C}f_C\), the derivative of \(f_{C}\) with respect to the variable \(C\)

For \(s>0\), the proximal operator of a function \(g\) applied to some \({\mathbf {v}} \in {\mathbb {R}}^{Nd}\) is given by: \(\text {prox}_s(g)({\mathbf {v}}) = \hbox {argmin}_{\mathbf{u}} \left\{ g(\mathbf{u}) + \frac{1}{2s} \Vert \mathbf{u}-{\mathbf {v}}\Vert ^2\right\} \)

Intermediary variables \(t^{(m)}\) and \({\mathbf {y}}^{(m)}\) in Step 2 and Step 3 respectively are introduced to accelerate the proximal gradient method (Beck and Teboulle 2010).
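For instance, when \(g\) is the \(\ell _1\)-norm, the proximal operator has the well-known closed form of soft-thresholding; a minimal sketch:

```python
import numpy as np

def prox_l1(v, s):
    """Proximal operator of g(u) = ||u||_1 with step s: soft-thresholding,
    prox_s(g)(v)_k = sign(v_k) * max(|v_k| - s, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)
```

This is the operation that sets small coefficients exactly to zero and hence produces sparse parameter matrices.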
Remark
It is worth noting that Algorithm 1 is very general and may be used as long as the loss function can be split into two convex terms, one of which is differentiable.
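The structure of such an accelerated proximal gradient scheme can be sketched generically; this is a standard FISTA-style loop under the stated splitting assumption, not the paper's exact Algorithm 1, and the names are ours:

```python
import numpy as np

def fista(grad_f, prox_g, L, x0, n_iter=200):
    """Accelerated proximal gradient (FISTA): minimizes f(x) + g(x), where
    f is smooth with L-Lipschitz gradient and g admits a cheap proximal
    operator prox_g(v, s)."""
    x = x0.copy(); y = x0.copy(); t = 1.0
    for _ in range(n_iter):
        x_new = prox_g(y - grad_f(y) / L, 1.0 / L)     # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)  # momentum extrapolation
        x, t = x_new, t_new
    return x
```

On a toy LASSO-type problem, \(\min _x \frac{1}{2}\Vert x-b\Vert ^2 + \lambda \Vert x\Vert _1\), the scheme converges to the soft-thresholded solution.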
4.2 Learning \(C\) and the kernel
Further, for fixed \(\hat{B}\), the loss function \({\fancyscript{L}}(\hat{B},C)\) is convex in \(C\) and conversely, for fixed \(C\), \({\fancyscript{L}}(B,\hat{C})\) is convex in \(B\). We propose an alternating optimization scheme to minimize the overall loss \({\fancyscript{L}}(B,C)\). Since both loss functions \({\fancyscript{L}}(\hat{B},C)\) and \({\fancyscript{L}}(B,\hat{C})\) involve a sum of two terms, one being differentiable and the other being subdifferentiable, we employ proximal gradient algorithms to achieve the minimization.
At iteration \(m\), \(B_m\) is fixed, and thus kernel \(K\) is defined. Hence, estimation of \(C_m\) in Step 1 boils down to applying Algorithm 1 to minimize (20).
4.2.1 Learning the matrix \(B\) for fixed \(C\)
Two proximal operators need to be computed. The proximal operator of \(g_{B,1}\) is the soft-thresholding operator, while the proximal operator corresponding to the indicator function \(1_{{\fancyscript{S}}_d^+}\) is the projection onto the cone of positive semidefinite matrices: for \(Q\in {\fancyscript{S}}_d\), \(\text {prox}(g_{B,2})(Q) =\hbox {argmin}_{B\in {\fancyscript{S}}_d^+}\Vert B-Q\Vert _{F}\). The sequence \((B_m)\) is guaranteed to converge under the following assumptions [Theorem 2.1 in Raguet et al. (2011)]:
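The PSD projection has a classical closed form: eigendecompose the (symmetrized) matrix and clip the negative eigenvalues at zero. A minimal sketch with hypothetical names:

```python
import numpy as np

def proj_psd(Q):
    """Frobenius-norm projection of a matrix onto the PSD cone:
    symmetrize, eigendecompose, and clip negative eigenvalues at zero."""
    Q = (Q + Q.T) / 2.0          # symmetrize for numerical safety
    w, V = np.linalg.eigh(Q)
    return (V * np.maximum(w, 0.0)) @ V.T
```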
 (A) (i) \(0<\underline{\text {lim}}\, \mu _m \le \overline{\text {lim}}\, \mu _m < \min \left( \frac{3}{2},\frac{1+2/(L_B\overline{\eta })}{2}\right) \)

 (ii) \(\sum _{m=0}^{+\infty } \Vert u_{2,m}\Vert <+\infty \), and for \(i\in \{1,2\}\), \(\sum _{m=0}^{+\infty } \Vert u_{1,m,i}\Vert <+\infty \)

 (B) (i) \(0<\underline{\text {lim}}\, \eta _m \le \overline{\eta }< \frac{2}{L_B}\)

 (ii) \(I_{\mu }=(0,1]\)
5 Results
5.1 Implementation
The performance of the developed OKVAR model family and the proposed optimization algorithms was assessed on two tracks: using simulated data from a biological system (DREAM3 challenge data set) and real climate data (Climate data set). These algorithms include a number of tuning parameters. Specifically, in Algorithm 3, we set \(Z_1^{(0)}=Z_2^{(0)}=B_0\in {\fancyscript{S}}_d^+\), \(\alpha =0.5\) and \(\mu _m=1\). The kernel parameters were also fixed a priori: parameter \(\gamma \) was set to 0.2 for the transformable Gaussian kernel and to 0.1 for the decomposable Gaussian kernel. In the case of the Hadamard kernel, two parameters need to be chosen: parameter \(\gamma _2\) of the transformable Gaussian kernel remains unchanged (\(\gamma _2=0.2\)), while, as discussed in Sect. 3.4, parameter \(\gamma _1\) of the decomposable Gaussian kernel is fixed to a low value (\(\gamma _1=10^{-5}\)) since it does not play a key role in the network inference task.
5.2 DREAM3 dataset
We start our investigation by considering data sets obtained from the DREAM3 challenge (Prill et al. 2010). DREAM stands for Dialogue for Reverse Engineering Assessments and Methods (http://wiki.c2b2.columbia.edu/dream/index.php/The_DREAM_Project) and is a scientific consortium that organizes challenges in computational biology, especially for gene regulatory network inference. In a gene regulatory network, a gene \(i\) is said to regulate another gene \(j\) if the expression of gene \(i\) at time \(t\) influences the expression of gene \(j\) at time \(t+1\). The DREAM3 project provides realistic simulated data for several networks corresponding to different organisms (e.g. E. coli, Yeast, etc.) of different sizes and topological complexity. We focus here on the size-10, size-50 and size-100 networks generated for the DREAM3 in silico challenges. Each of these networks corresponds to a subgraph of the currently accepted E. coli and S. cerevisiae gene regulatory networks and exhibits varying patterns of sparsity and topological structure. They are referred to as E1, E2, Y1, Y2 and Y3, with an indication of their size. The data were generated by imbuing the networks with dynamics from a thermodynamic model of gene expression and Gaussian noise. Specifically, 4, 23 and 46 time series consisting of 21 points were available for the size-10, size-50 and size-100 networks, respectively. We generated additional time series and extended their length up to 50 time points to study the behavior of the algorithm under various conditions. For that purpose, we used the same tool that generated the original time series, the open-source software GeneNetWeaver (Schaffter et al. 2011), which provided the DREAM3 competition with its network inference challenges.
In all the conducted experiments, we assess the performance of our model using the area under the ROC curve (AUROC) and under the Precision-Recall curve (AUPR), ignoring the sign of regulation (positive vs. negative influence). The interested reader may, however, refer to the Supplementary Material, where we provide additional results regarding this particular feature. The selected values of the penalty hyperparameters \(\lambda _h,\lambda _C\) and \(\lambda _B\) are displayed in Table 1. For the DREAM3 data sets we also show the best results obtained by other competing teams using only time-course data. The challenge made available other data sets, including ones obtained from perturbation (knockout/knockdown) experiments as well as steady-state observations of the organism, but these were not considered in the results shown in the ensuing tables.
Selected hyperparameters of OKVAR for DREAM3 size-10 and size-100 data sets

|  | Size-10 |  |  |  |  | Size-100 |  |  |  |  |
|  | E1 | E2 | Y1 | Y2 | Y3 | E1 | E2 | Y1 | Y2 | Y3 |
| \(\lambda _h\) | 1 | 1 | 1 | 1 | 1 | 10 | 1 | 1 | 1 | 1 |
| \(\lambda _C\) | 0.01 | 1 | 1 | 1 | 1 | 10 | 0.01 | 100 | 100 | 100 |
| \(\lambda _B\) | 0.1 | 1 | 10 | 0.01 | 0.01 | 10 | 1 | 1 | 1 | 1 |
5.2.1 Comparison between OKVAR models
Summary of the studied OKVAR models

|  | \(h^{Ridge}_{Gauss}\) | \(h^{\ell _1}_{Gauss}\) | \(h^{\ell _1/\ell _2}_{Gauss}\) | \(h^{\ell _1}_{dec}\) | \(h^{\ell _1/\ell _2}_{dec}\) | \(h^{\ell _1}_{Hadamard}\) | \(h^{\ell _1/\ell _2}_{Hadamard}\) |
| Kernel | Transformable Gaussian |  |  | Decomposable Gaussian |  | Hadamard |  |
| Loss | Eq. (16) | Eq. (19) |  |  |  |  |  |
| \(\varOmega (C)\) | 0 | \(\varOmega _1\) | \(\varOmega _{struct}\) | \(\varOmega _1\) | \(\varOmega _{struct}\) | \(\varOmega _1\) | \(\varOmega _{struct}\) |
For the mixed \(\ell _1/\ell _2\)-norm regularization, we need to define the coefficients \(w_{\ell }\). As the observed time-course data correspond to the response of a dynamical system to some given initial condition, \(w_{\ell }\) should increase with \(\ell \), meaning that we put more emphasis on the first time points. We thus propose to define \(w_\ell \) as follows: \(w_\ell = 1-\exp (-(\ell -1))\).
Consensus AUROC and AUPR (given in %) for the DREAM3 size-10 networks using the DREAM3 original data sets (4 time series of 21 points)

| OKVAR model | AUROC: E1 | E2 | Y1 | Y2 | Y3 | AUPR: E1 | E2 | Y1 | Y2 | Y3 |
| \(h^{Ridge}_{Gauss}\) | 68.8 | 37.7 | 62.1 | 68.6 | 66.7 | 15.6 | 11.2 | 15.5 | 46.9 | 32.9 |
| \(h^{\ell _1}_{Gauss}\) | 69.3 | 38.0 | 61.9 | 69.3 | 66.7 | 15.7 | 11.3 | 15.2 | 47.4 | 32.8 |
| \(h^{\ell _1/\ell _2}_{Gauss}\) | 68.7 | 37.1 | 62.4 | 68.6 | 66.7 | 15.5 | 11.1 | 15.6 | 47.5 | 32.6 |
| \(h^{\ell _1}_{dec}\) | 67.0 | 68.5 | 38.2 | 45.4 | 38.3 | 23.6 | 20.8 | 7.4 | 21.1 | 16.8 |
| \(h^{\ell _1/\ell _2}_{dec}\) | 65.9 | 47.8 | 45.3 | 56.6 | 38.5 | 23.1 | 14.0 | 8.3 | 28.5 | 16.8 |
| \(h^{\ell _1}_{Hadamard}\) | 81.2 | 46.2 | 47.7 | 76.2 | 70.5 | 23.5 | 12.7 | 8.7 | 50.1 | 39.5 |
| \(h^{\ell _1/\ell _2}_{Hadamard}\) | 81.5 | 78.7 | 76.5 | 70.3 | 75.1 | 32.1 | 50.1 | 35.4 | 37.4 | 39.7 |
5.2.2 Effects of hyperparameters, noise, sample size and network size
Next, we study the impact of different combinations of parameters including the sample size of the dataset (number of time points), the number of time series, the noise level and hyperparameters \(\lambda _C\) and \(\lambda _B\).
Consensus AUROC and AUPR (given in %) obtained by OKVAR-Prox and the LASSO for the DREAM3 E1 networks using the DREAM3 original data sets (4, 23 and 46 time series of 21 points for size-10, size-50 and size-100 E1, respectively) and different noise levels (standard deviations \(\sigma =0, 0.05, 0.1, 0.15, 0.2\) and \(0.3\))

Size-10 E1

|  | AUROC: \(\sigma =0\) | 0.05 | 0.1 | 0.15 | 0.2 | 0.3 | AUPR: \(\sigma =0\) | 0.05 | 0.1 | 0.15 | 0.2 | 0.3 |
| OKVAR-Prox | 81.5 | 67.9 | 72.0 | 75.2 | 72.4 | 74.4 | 32.1 | 20.7 | 21.2 | 24.3 | 22.7 | 21.4 |
| LASSO | 69.5 | 64.2 | 61.7 | 43.0 | 50.7 | 48.8 | 17.0 | 13.9 | 17.8 | 9.0 | 10.1 | 10.7 |

Size-50 E1

|  | AUROC: \(\sigma =0\) | 0.05 | 0.1 | 0.15 | 0.2 | 0.3 | AUPR: \(\sigma =0\) | 0.05 | 0.1 | 0.15 | 0.2 | 0.3 |
| OKVAR-Prox | 66.4 | 67.3 | 68.6 | 69.2 | 69.8 | 70.9 | 4.1 | 4.3 | 5.0 | 5.9 | 6.6 | 6.9 |
| LASSO | 52.8 | 54.5 | 46.0 | 54.9 | 45.9 | 50.7 | 2.9 | 2.8 | 2.1 | 3.5 | 2.2 | 2.6 |

Size-100 E1

|  | AUROC: \(\sigma =0\) | 0.05 | 0.1 | 0.15 | 0.2 | 0.3 | AUPR: \(\sigma =0\) | 0.05 | 0.1 | 0.15 | 0.2 | 0.3 |
| OKVAR-Prox | 65.4 | 56.1 | 56.5 | 56.4 | 57.2 | 58.0 | 4.6 | 1.7 | 1.7 | 1.7 | 1.7 | 1.8 |
| LASSO | 52.2 | 54.3 | 53.3 | 47.2 | 51.3 | 51.5 | 1.4 | 1.6 | 1.4 | 1.2 | 1.3 | 1.3 |
Table 4 first indicates that both algorithms’ performance deteriorates with increasing network size. Second, OKVAR-Prox clearly outperforms the LASSO in terms of both AUROC and AUPR for every configuration of noise level and network size. Indeed, for a given size, OKVAR-Prox performs quite well in the presence of high levels of noise, while the LASSO is strongly affected by noise, which speaks to OKVAR’s robustness.
Consensus AUROC/AUPR (given in %) for the DREAM3 size-10 E1 network using the DREAM3 original data sets (4 time series of 21 points) for different combinations of hyperparameters \((\lambda _C,\lambda _B)\)

| \(\lambda _C \backslash \lambda _B\) | \(10^{-2}\) | \(10^{-1}\) | \(1\) |
| \(10^{-2}\) | 79.1/31.5 | 81.5/32.1 | 73.5/21.2 |
| \(10^{-1}\) | 79.7/36.5 | 76.9/21.6 | 66.7/25.9 |
| \(1\) | 78.1/25.9 | 71.0/20.7 | 61.4/16.0 |
5.2.3 Choice of penalty component
In order to study the impact of the type of regularization employed, we assessed OKVAR’s performance on the following three network tasks: the size-10, size-50 and size-100 E1 networks.
5.2.4 Comparison with stateoftheart methods
The performance of the OKVAR approach for predicting network structure is assessed on the DREAM3 size-10 and size-100 datasets.
Consensus AUROC and AUPR (given in %) for OKVAR-Prox, LASSO, GPODE, G1DBN, Team 236 and Team 190 (DREAM3 competing teams) run on DREAM3 size-10 networks using the DREAM3 original data sets (4 time series of 21 points)

| Size-10 | AUROC: E1 | E2 | Y1 | Y2 | Y3 | AUPR: E1 | E2 | Y1 | Y2 | Y3 |
| OKVAR + True \(B\) | 96.2 | 86.9 | 89.2 | 75.6 | 86.6 | 75.2 | 67.7 | 47.3 | 52.3 | 58.6 |
| OKVAR-Prox | 81.5 | 78.7 | 76.5 | 70.3 | 75.1 | 32.1 | 50.1 | 35.4 | 37.4 | 39.7 |
| LASSO | 69.5 | 57.2 | 46.6 | 62.0 | 54.5 | 17.0 | 16.9 | 8.5 | 32.9 | 23.2 |
| GPODE | 60.7 | 51.6 | 49.4 | 61.3 | 57.1 | 18.0 | 14.6 | 8.9 | 37.7 | 34.1 |
| G1DBN | 63.4 | 77.4 | 60.9 | 50.3 | 62.4 | 16.5 | 36.4 | 11.6 | 23.2 | 26.3 |
| Team 236 | 62.1 | 65.0 | 64.6 | 43.8 | 48.8 | 19.7 | 37.8 | 19.4 | 23.6 | 23.9 |
| Team 190 | 57.3 | 51.5 | 63.1 | 57.7 | 60.3 | 15.2 | 18.1 | 16.7 | 37.1 | 37.3 |
Consensus AUROC and AUPR (given in %) for OKVAR-Prox, LASSO, G1DBN and Team 236 (DREAM3 competing team) run on DREAM3 size-100 networks using the DREAM3 original data sets (46 time series of 21 points)

| Size-100 | AUROC: E1 | E2 | Y1 | Y2 | Y3 | AUPR: E1 | E2 | Y1 | Y2 | Y3 |
| OKVAR + True \(B\) | 96.2 | 97.1 | 95.8 | 90.6 | 89.7 | 43.2 | 51.6 | 27.9 | 40.7 | 36.4 |
| OKVAR-Prox | 65.4 | 64.0 | 54.9 | 56.8 | 53.5 | 4.6 | 2.6 | 2.3 | 5.0 | 6.3 |
| LASSO | 52.2 | 55.0 | 53.2 | 52.4 | 52.3 | 1.4 | 1.3 | 1.8 | 4.3 | 6.1 |
| G1DBN | 53.4 | 55.8 | 47.0 | 58.1 | 43.4 | 1.6 | 6.3 | 2.2 | 4.6 | 4.4 |
| Team 236 | 52.7 | 54.6 | 53.2 | 50.8 | 50.8 | 1.9 | 4.2 | 3.5 | 4.6 | 6.5 |
The AUROC and AUPR values obtained for the size-10 networks (Table 6) strongly indicate that OKVAR-Prox outperforms state-of-the-art models and the teams that exclusively used the same set of time series data in the DREAM3 competition, except for size-10 Y2 (nearly equivalent AUPR). In particular, we note that the OKVAR consensus runs exhibited excellent AUPR values compared to those obtained by other approaches.
A comparison of competing algorithms on the size-100 networks (Table 7) shows that the OKVAR method again achieves superior AUROC results compared to Team 236, lagging behind only by a slight margin for size-100 Y1 and Y3 in terms of AUPR. Team 236 was the only team that exclusively used time series data for the size-100 network challenge, since Team 190 did not submit any results. No results are provided for the GPODE method on size-100 networks either, since that algorithm requires a full combinatorial search when no prior knowledge is available, which is computationally intractable for large networks. The OKVAR method is outperformed by G1DBN for size-100 E2 in terms of AUPR and for size-100 Y2, with quite comparable AUROC values. It is noticeable that the AUPR values in all rows are rather small (lower than 10 %) compared to their size-10 counterparts. Such difficult tasks require longer time series (more time points) and many more available time series to achieve better results in terms of AUROC and AUPR. Therefore, for the size-100 datasets, we applied a pure \(\ell _1\)-norm constraint on the model parameters, allowing any coefficients of \(C\) to be set to 0, rather than a mixed \(\ell _1/\ell _2\)-norm regularization that would have been too stringent in terms of data parsimony.
Finally, it is worth noting that OKVAR-Prox would have ranked in the top five and top ten, respectively, for the size-10 and size-100 challenges, while the best results employed knockout/knockdown data, which are rich in information content, in addition to time-series data (Michailidis 2012).
5.2.5 Comparison with OKVAR-Boost

Consensus AUROC and AUPR (given in %) for OKVAR-Prox and OKVAR-Boost run on DREAM3 size-10 and size-100 networks using the DREAM3 original data sets (4 and 46 time series of 21 points for size-10 and size-100 networks, respectively)

Size-10

|  | AUROC: E1 | E2 | Y1 | Y2 | Y3 | AUPR: E1 | E2 | Y1 | Y2 | Y3 |
| OKVAR-Prox | 81.5 | 78.7 | 76.5 | 70.3 | 75.1 | 32.1 | 50.1 | 35.4 | 37.4 | 39.7 |
| OKVAR-Boost | 85.3 | 74.9 | 68.9 | 65.3 | 69.5 | 58.3 | 53.6 | 28.3 | 26.8 | 44.3 |

Size-100

|  | AUROC: E1 | E2 | Y1 | Y2 | Y3 | AUPR: E1 | E2 | Y1 | Y2 | Y3 |
| OKVAR-Prox | 65.4 | 64.0 | 54.9 | 56.8 | 53.5 | 4.6 | 2.6 | 2.3 | 5.0 | 6.3 |
| OKVAR-Boost | 71.8 | 77.2 | 72.9 | 65.0 | 64.3 | 3.6 | 10.7 | 4.2 | 7.3 | 6.9 |
OKVAR-Prox achieves better AUROC values than OKVAR-Boost on the size-10 networks, except for the E1 network, while there is no clear winner in terms of AUPR. On the size-100 inference tasks, OKVAR-Boost, which benefits from projections on random subspaces, clearly outperforms OKVAR-Prox, which operates directly in the 100-dimensional space with a limited number of time points.
5.3 Climate dataset
Our second example examines climate data, originally presented in Liu et al. (2010). It contains measurements of climate forcing factors and feedback mechanisms obtained from different databases. We extracted monthly measurements of 12 variables for the year 2002 (i.e. 12 time points): temperature (TMP), precipitation (PRE), vapor (VAP), cloud cover (CLD), wet days (WET), frost days (FRS), methane (CH4), carbon dioxide (CO2), hydrogen (H2), carbon monoxide (CO), solar radiation (SOL) and aerosols (AER). The measurements were obtained from 125 equally spaced meteorological stations located throughout the United States.
We used an \(\ell _1\)-norm regularized OKVAR-Prox model to identify and explore dependencies between natural and anthropogenic (linked to human activity) factors. We learn a separate causal model from the multivariate time series of each specific area in the United States. For clarity of exposition, we first consider only a single location in northern Texas.
Average \(\pm \) standard deviation of the BIC for the climate data set at one location (northern Texas)

| \(\lambda _C \backslash \lambda _B\) | \(10^{-2}\) | \(10^{-1}\) | \(1\) |
| \(10^{-2}\) | 279.89 \(\pm \) 8.20 | 224.09 \(\pm \) 5.76 | 129.76 \(\pm \) 4.23 |
| \(10^{-1}\) | 311.24 \(\pm \) 305.32 | 115.52 \(\pm \) 8.66 | 60.66 \(\pm \) 0.38 |
| \(1\) | \(5.63\times 10^5\pm 1.78\times 10^6\) | \(5.89\times 10^6\pm 1.86\times 10^7\) | 51.70 \(\pm \) 0.79 |
Most of the edges that OKVAR-Prox identifies are reasonable and supported by external knowledge about the interactions of the underlying variables: specifically, VAP influences CLD, since the likelihood of clouds (CLD) appearing increases with vapor concentration (VAP). Vapor (VAP) is also the main natural greenhouse gas on earth, corroborating its impact on temperature (TMP). Aerosols (AER) interact with hydrogen (H2) through atmospheric chemistry and lower the presence of vapor (VAP) by favoring water condensation. Of course, some likely causal links are missing from our final model; one would expect an impact of the concentration of carbon dioxide (CO2) or methane (CH4) on temperature (TMP). However, most of these edges appear in the initial consensus graph and would have been recovered had we set a lower selection threshold.
Since the physical and chemical processes at work in the atmosphere do not change drastically between neighboring locations, the causal graphs learned across the US should exhibit a certain degree of similarity. On the other hand, causal graphs corresponding to distant locations are likely to show topological differences due to regional specificities in both climate and human activity. We define the structural similarity \(s\) between two graphs \(G_1\) and \(G_2\) based on the Hamming distance between the corresponding adjacency matrices \(A_1\) and \(A_2\): \(s(G_1,G_2)= 1 - \frac{1}{d^2} \sum _{i,j} |{A_1}_{ij}-{A_2}_{ij}|\). We then applied a spectral clustering algorithm to this similarity matrix with the number of classes set to three; this number of clusters was chosen a priori to match the number of hidden variables considered in Liu et al. (2010) for their latent variable model focusing on spatial interactions. Figure 5b shows the labels of the resulting graphs at their corresponding locations on a map of the United States. A very clear segmentation into geographical areas emerges, each sharing the same network structure: the first area (black) includes western and mid-northern locations that have a humid continental climate; another zone (blue) covers developed regions in the South and East, where high levels of CO2 concentration play a role; the red locations, which mostly stretch across the center of the US, correspond to less populated areas where human activity factors are less dominant in our model.
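The graph similarity defined above is straightforward to compute; a minimal sketch assuming binary adjacency matrices stored as NumPy arrays (the resulting pairwise similarity matrix can then be fed to any spectral clustering routine accepting a precomputed affinity):

```python
import numpy as np

def graph_similarity(A1, A2):
    """s(G1, G2) = 1 - (1/d^2) * sum_ij |A1_ij - A2_ij|,
    i.e. one minus the normalized Hamming distance between the
    binary d x d adjacency matrices of the two graphs."""
    d = A1.shape[0]
    return 1.0 - np.abs(A1 - A2).sum() / d**2

# Toy check: two 3-node graphs differing in exactly one edge,
# so one mismatched entry out of d^2 = 9.
A1 = np.array([[0, 1, 0],
               [0, 0, 1],
               [0, 0, 0]])
A2 = np.array([[0, 1, 0],
               [0, 0, 0],
               [0, 0, 0]])
print(graph_similarity(A1, A2))  # 1 - 1/9, approximately 0.889
```

Identical graphs get similarity 1, and fully disagreeing graphs get 0, so the matrix of pairwise similarities is a valid affinity matrix.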
6 Conclusion
Network inference from multivariate time series is a key problem in numerous scientific domains. In this paper, we addressed it by introducing and learning a new family of nonlinear vector autoregressive models based on operator-valued kernels. The new models generalize linear vector autoregressive models and benefit from the regularization framework. To obtain a sparse network estimate, we defined appropriate non-smooth penalties on the model parameters and developed a proximal gradient algorithm to handle them. Some of the proposed operator-valued kernels are characterized by a positive semidefinite matrix that also plays a role in the network estimation; in this case, an alternating optimization scheme based on two proximal gradient procedures is proposed to learn both sets of parameters. Results obtained on benchmark size-10 and size-100 biological networks as well as on a real climate dataset show very good performance of the OKVAR model. Future extensions include applications of OKVAR to other scientific fields, the use of the model in ensemble methods, and the application of proximal gradient algorithms to other structured output prediction tasks.
Acknowledgments
FAB’s work was supported in part by ANR (call SYSCOM 2009, ODESSA project). GM’s work was supported in part by NSF Grants DMS-1161838 and DMS-1228164 and NIH Grant 1R21GM101719-01A1.
Supplementary material
References
 Äijö, T., & Lähdesmäki, H. (2009). Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics. Bioinformatics, 25(22), 2937–2944.
 Alvarez, M. A., Rosasco, L., & Lawrence, N. D. (2011). Kernels for vector-valued functions: A review. Technical report, MIT-CSAIL-TR-2011-033.
 Auliac, C., Frouin, V., & Gidrol, X. (2008). Evolutionary approaches for the reverse-engineering of gene regulatory networks: A study on a biologically realistic dataset. BMC Bioinformatics, 9(1), 91.
 Baldassarre, L., Rosasco, L., Barla, A., & Verri, A. (2010). Vector field learning via spectral filtering. In J. Balcázar, F. Bonchi, A. Gionis, & M. Sebag (Eds.), Machine learning and knowledge discovery in databases. Lecture notes in computer science (Vol. 6321, pp. 56–71). Berlin/Heidelberg: Springer.
 Beck, A., & Teboulle, M. (2010). Gradient-based algorithms with applications to signal recovery problems. In D. Palomar & Y. Eldar (Eds.), Convex optimization in signal processing and communications (pp. 42–88). Cambridge: Cambridge University Press.
 Bolstad, A., Van Veen, B., & Nowak, R. (2011). Causal network inference via group sparsity regularization. IEEE Transactions on Signal Processing, 59(6), 2628–2641.
 Brouard, C., d’Alché-Buc, F., & Szafranski, M. (2011). Semi-supervised penalized output kernel regression for link prediction. In ICML 2011 (pp. 593–600).
 Bühlmann, P., & van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Berlin: Springer.
 Caponnetto, A., Micchelli, C. A., Pontil, M., & Ying, Y. (2008). Universal multi-task kernels. The Journal of Machine Learning Research, 9, 1615–1646.
 Chatterjee, S., Steinhaeuser, K., Banerjee, A., Chatterjee, S., & Ganguly, A. R. (2012). Sparse group lasso: Consistency and climate applications. In SDM (pp. 47–58). SIAM/Omnipress.
 Chou, I., & Voit, E. O. (2009). Recent developments in parameter estimation and structure identification of biochemical and genomic systems. Mathematical Biosciences, 219(2), 57–83.
 Combettes, P. L., & Pesquet, J. C. (2011). Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering. Springer Optimization and Its Applications (Vol. 49, pp. 185–212).
 Dinuzzo, F., & Fukumizu, K. (2011). Learning low-rank output kernels. In Proceedings of the 3rd Asian conference on machine learning, JMLR: Workshop and conference proceedings (Vol. 20).
 Dondelinger, F., Lèbre, S., & Husmeier, D. (2013). Non-homogeneous dynamic Bayesian networks with Bayesian regularization for inferring gene regulatory networks with gradually time-varying structure. Machine Learning Journal, 90(2), 191–230.
 Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303(5659), 799–805.
 Gilchrist, S., Yankov, V., & Zakrajšek, E. (2009). Credit market shocks and economic fluctuations: Evidence from corporate bond and stock markets. Journal of Monetary Economics, 56(4), 471–493.
 Hartemink, A. (2005). Reverse engineering gene regulatory networks. Nature Biotechnology, 23(5), 554–555.
 Iba, H. (2008). Inference of differential equation models by genetic programming. Information Sciences, 178(23), 4453–4468.
 Kadri, H., Rabaoui, A., Preux, P., Duflos, E., & Rakotomamonjy, A. (2011). Functional regularized least squares classification with operator-valued kernels. In ICML 2011 (pp. 993–1000).
 Kolaczyk, E. D. (2009). Statistical analysis of network data: Methods and models. Springer Series in Statistics. Berlin: Springer.
 Kramer, M. A., Eden, U. T., Cash, S. S., & Kolaczyk, E. D. (2009). Network inference with confidence from multivariate time series. Physical Review E, 79(6), 061916.
 Lawrence, N., Girolami, M., Rattray, M., & Sanguinetti, G. (Eds.) (2010). Learning and inference in computational systems biology. Cambridge: MIT Press.
 Lèbre, S. (2009). Inferring dynamic genetic networks with low order independencies. Statistical Applications in Genetics and Molecular Biology, 8(1), 1–38.
 Lim, N., Senbabaoglu, Y., & Michailidis, G. (2013). OKVAR-Boost: A novel boosting algorithm to infer nonlinear dynamics and interactions in gene regulatory networks. Bioinformatics, 29(11), 1416–1423.
 Liu, Y., Niculescu-Mizil, A., & Lozano, A. (2010). Learning temporal causal graphs for relational time-series analysis. In J. Fürnkranz & T. Joachims (Eds.), ICML 2010.
 Maathuis, M., Colombo, D., Kalisch, M., & Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature Methods, 7, 247–248.
 Margolin, A., & Nemenman, I. (2006). ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7(Suppl 1), S7.
 Mazur, J., Ritter, D., Reinelt, G., & Kaderali, L. (2009). Reconstructing nonlinear dynamic models of gene regulation using stochastic sampling. BMC Bioinformatics, 10(1), 448.
 Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34, 1436–1462.
 Micchelli, C. A., & Pontil, M. A. (2005). On learning vector-valued functions. Neural Computation, 17, 177–204.
 Michailidis, G. (2012). Statistical challenges in biological networks. Journal of Computational and Graphical Statistics, 21(4), 840–855.
 Michailidis, G., & d’Alché-Buc, F. (2013). Autoregressive models for gene regulatory network inference: Sparsity, stability and causality issues. Mathematical Biosciences, 246(2), 326–334.
 Morton, R., & Williams, K. C. (2010). Experimental political science and the study of causality. Cambridge: Cambridge University Press.
 Murphy, K. P. (1998). Dynamic Bayesian networks: Representation, inference and learning. PhD thesis, Computer Science, University of California, Berkeley, CA, USA.
 Parry, M., Canziani, O., Palutikof, J., van der Linden, P., Hanson, C., et al. (2007). Climate change 2007: Impacts, adaptation and vulnerability. Intergovernmental Panel on Climate Change.
 Perrin, B. E., Ralaivola, L., Mazurie, A., Bottani, S., Mallet, J., & d’Alché-Buc, F. (2003). Gene networks inference using dynamic Bayesian networks. Bioinformatics, 19(S2), 38–48.
 Prill, R., Marbach, D., Saez-Rodriguez, J., Sorger, P., Alexopoulos, L., Xue, X., et al. (2010). Towards a rigorous assessment of systems biology models: The DREAM3 challenges. PLoS ONE, 5(2), e9202.
 Raguet, H., Fadili, J., & Peyré, G. (2011). Generalized forward-backward splitting. arXiv preprint arXiv:1108.4404.
 Richard, E., Savalle, P. A., & Vayatis, N. (2012). Estimation of simultaneously sparse and low rank matrices. In J. Langford & J. Pineau (Eds.), ICML 2012 (pp. 1351–1358). New York, NY, USA: Omnipress.
 Schaffter, T., Marbach, D., & Floreano, D. (2011). GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16), 2263–2270.
 Senkene, E., & Tempel’man, A. (1973). Hilbert spaces of operator-valued functions. Lithuanian Mathematical Journal, 13(4), 665–670.
 Shojaie, A., & Michailidis, G. (2010). Discovering graphical Granger causality using a truncating lasso penalty. Bioinformatics, 26(18), i517–i523.
 Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68(1), 49–67.
 Zou, C., & Feng, J. (2009). Granger causality vs. dynamic Bayesian network inference: A comparative study. BMC Bioinformatics, 10(1), 122.