Learning sparse structural changes in high-dimensional Markov networks
Abstract
Recent years have seen an increasing popularity of learning the sparse changes in Markov Networks. Changes in the structure of Markov Networks reflect alterations of interactions between random variables under different regimes and provide insights into the underlying system. While each individual network structure can be complicated and difficult to learn, the overall change from one network to another can be simple. This intuition gave birth to an approach that directly learns the sparse changes without modelling and learning the individual (possibly dense) networks. In this paper, we review such a direct learning method together with some of the latest developments along this line of research.
Keywords
Markov network, Density ratio estimation, Change detection
1 Introduction
The problem of learning the changes of interactions between random variables can be useful in many applications. For example, genes may regulate each other in different ways when external conditions are changed; the number of daily flu-like symptom reports in nearby hospitals may become correlated when a major epidemic disease breaks out; EEG signals from different regions of the brain may be synchronized/desynchronized when the subject is performing different activities. Spotting such changes in interactions may provide key insights into the underlying system.
The interactions among random variables can be formulated as undirected probabilistic graphical models, or Markov Networks (MNs) (Koller and Friedman 2009), expressing the interactions via the conditional independence. We consider a simple model: the pairwise MNs where the links are only encoded for single or pairs of random variables. Due to the Hammersley–Clifford theorem (Hammersley and Clifford 1971), the underlying joint probability density function can be represented as the product of univariate and bivariate factors.
As an important challenge, structure learning of MNs has also attracted a significant amount of attention. Earlier methods (Spirtes et al. 2000) use hypothesis testing to learn the conditional independence among random variables, which reflects the absence of edges. It has been proved that such a problem is generally NP-hard (Chickering 1996). Methods restricted to a subclass of graphical models (such as trees or forests) (Chow and Liu 1968; Geman and Geman 1984; Liu et al. 2011) also suffer from growing computational cost.
However, the Hammersley–Clifford theorem together with the recent breakthrough on sparsity-inducing methods (Tibshirani 1996; Zhao and Yu 2006; Wainwright 2009) gave birth to many sparse structure learning ideas where the sparse factorization of the joint/conditional density function was estimated to infer the underlying structure of the MN (Friedman et al. 2008; Banerjee et al. 2008; Meinshausen and Bühlmann 2006; Ravikumar et al. 2010). Although most works focused on parametric models, structure learning has also been conducted on semiparametric ones in recent years (Liu et al. 2009, 2012).
There is also a trend of learning the changes between MNs (Zhang and Wang 2010; Liu et al. 2014; Zhao et al. 2014). Compared to standard structure learning, the learning of changes views the problem in a more dynamic fashion: instead of estimating a static pattern, we hope to obtain a dynamic one, namely “the change”, by comparing two sets of data, since in some applications the static pattern may not be computationally tractable, or may simply be too hard to comprehend. However, the difference between two patterns may be represented by some simple incremental effects involving only a small number of nodes or bonds. Thus it takes much less effort to learn and understand.
One of the main uses of structural change learning is to spot responding variables in “controlled experiments” (Zhang and Wang 2010) where some key external factors of the experiments are altered, and two sets of samples are obtained. By discovering the changes in the MNs, we can see how random variables have responded to the change of the external stimuli.
In this paper, we first review a recently proposed method of structural change learning between MNs (Liu et al. 2014). This follows a simple idea: if the MNs are products of the pairwise factors, the ratio of two MNs must also be proportional to the ratios of those factors. Moreover, factors that do not change between two MNs will have no contribution to the ratio. This naturally suggests the idea of modelling the changes between two MNs P and Q as the ratio between two MN density functions \(p({\varvec{x}})\) and \(q({\varvec{x}})\). The ratio \(p({\varvec{x}})/q({\varvec{x}})\) is directly estimated in a one-shot estimation (Sugiyama et al. 2012). This density-ratio approach can work well even when each MN is dense (as long as the change is sparse).
We also present some very recent theoretical results along this line of research. These works prove the consistency of the density ratio method in the high-dimensional setting. Support consistency means that the support of the estimated parameter converges to the support of the true parameter in probability. This is an important property for sparsity-inducing methods. It is shown that under certain conditions the density ratio method recovers the correct parameter sparsity with high probability (Liu et al. 2017b). Moreover, Fazayeli and Banerjee (2016) introduced a theorem for the regularized density ratio estimator showing that the estimation error, i.e., the \(\ell _2\) distance between the estimated parameter and the true parameter, converges to zero under milder conditions.
For comparison, we will also show a few alternative approaches to the change detection problem between MNs. The differential graphical model learning approach (Zhao et al. 2014) uses a covariance-precision matrix equality to learn changes without going through the learning of the individual MNs. The “jumping” MNs (Kolar and Xing 2012) setting considers a scenario where the observations are received as a sequence and multiple subsequences are generated via different parametrizations of the MN.
We organize this paper as follows: Firstly, we introduce the problem formulation of learning changes between MNs in Sect. 2. Secondly, the density ratio approach and two other alternatives are explained in Sect. 3. Section 4 reviews the theoretical results of these approaches. Synthetic and real-world experiments are conducted in Sect. 5 to compare the performance of methods. Finally, in Sects. 6 and 7, we give a few possible future directions and conclude the current developments along this line of research.
2 Formulating changes
In this section, we focus on formulating the change of MNs using density ratio. At the end of this section, a few alternatives are also introduced.
2.1 Structural changes by parametric differences
Directly estimating an MN in this generic form is challenging since \(Z({\varvec{\theta }}^{(p)})\) usually does not have a closed form except for a few special cases (e.g. Gaussian distribution). Markov Chain Monte Carlo (Robert and Casella 2005) is used to approximate such an integral. However, this would bring extra approximation errors.
2.2 Density ratio modelling
Interestingly, if one uses \(\psi _{u,v}(x_u, x_v) = x_u x_v\) in the ratio model, it does not mean one assumes the two individual MNs are Gaussian or Ising; it simply means we assume the changes of interactions are linear while other nonlinear interactions remain unchanged. This formulation allows us to consider highly complicated MNs as long as their changes are “simple”.
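The cancellation of unchanged factors can be checked numerically. The following is a minimal sketch, assuming the feature \(\psi(x_u, x_v) = x_u x_v\) and a scalar parameter per pair; the helper `pairwise_energy` and the chosen edge are illustrative names and values, not part of the reviewed method. It shows that the unnormalised log-ratio of two dense models that differ in a single pairwise factor depends on that factor only.

```python
import numpy as np

def pairwise_energy(x, theta):
    """Sum over u >= v of theta[u, v] * x[u] * x[v] (lower triangle incl. diagonal)."""
    return float(np.sum(np.tril(theta) * np.outer(x, x)))

rng = np.random.default_rng(0)
m = 6
theta_q = np.tril(rng.normal(size=(m, m)))   # dense parameters of Q (hypothetical values)
theta_p = theta_q.copy()
theta_p[4, 1] += 0.7                          # P differs from Q in one pairwise factor only

x = rng.normal(size=m)
# Unnormalised log p(x)/q(x) computed from the two full (dense) models ...
full_log_ratio = pairwise_energy(x, theta_p) - pairwise_energy(x, theta_q)
# ... and from the single changed factor alone.
sparse_log_ratio = 0.7 * x[4] * x[1]
print(full_log_ratio, sparse_log_ratio)
```

The two printed numbers agree up to floating-point error; the normalisation term, which does not depend on \({\varvec{x}}\), is left out of this sketch.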
Throughout the rest of the paper, we simplify the notation from \({\varvec{\psi }}_{u,v}\) to \({\varvec{\psi }}\) by assuming the feature functions are the same for all pairs of random variables.
2.3 Quasi log-likelihood equality
This direct formulation specifically uses a property of Gaussian MNs: the covariance matrix computed from the data and the precision matrix that encodes the MN structure should approximately cancel each other when multiplied. However, such a relationship does not hold for other distributions in general. The generality of this equality is an interesting open question (see Sect. 6).
Remark
In fact, it is not necessary to combine \({\varvec{\theta }}^{(p)} - {\varvec{\theta }}^{(q)}\) in (2) (or \({\varvec{\varTheta }}^{(p)} - {\varvec{\varTheta }}^{(q)}\) in (4)) into one parameter. However, such a model will be unidentifiable since there are many combinations of \({\varvec{\varTheta }}^{(p)}\) and \({\varvec{\varTheta }}^{(q)}\) that can produce the same difference. Nonetheless, such indirect modelling may still be useful when the individual structures of the MNs are also of interest. We review an example of such indirect modelling in Sect. 3.4.
3 Learning sparse changes in Markov networks
3.1 Density ratio estimation
Density ratio estimation has been recently introduced to the machine learning community and has proven to be useful in a wide range of applications (Sugiyama et al. 2012). In Liu et al. (2014), a density ratio estimator called the Kullback–Leibler importance estimation procedure (KLIEP) for log-linear models (Sugiyama et al. 2008; Tsuboi et al. 2009) was employed in learning structural changes.
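As a rough illustration, a log-linear KLIEP objective can be written in a few lines. This is a minimal sketch under stated assumptions: the ratio model is \(\exp({\varvec{\delta}}^\top {\varvec{f}}({\varvec{x}}))\) normalised over the samples from Q, the features are the pairwise products \(x_u x_v\), and the function names `pairwise_features` and `kliep_loss_and_grad` are hypothetical. It is not the authors' MATLAB implementation.

```python
import numpy as np

def pairwise_features(X):
    """Map samples of shape (n, m) to products x_u * x_v for all u >= v."""
    rows, cols = np.tril_indices(X.shape[1])
    return X[:, rows] * X[:, cols]

def kliep_loss_and_grad(delta, Fp, Fq):
    """Negative log-likelihood of a log-linear ratio model and its gradient.

    loss = -mean_i delta^T f(x_p^i) + log mean_j exp(delta^T f(x_q^j))
    """
    zq = Fq @ delta
    zmax = zq.max()
    w = np.exp(zq - zmax)                  # numerically stabilised weights on Q samples
    log_norm = zmax + np.log(w.mean())     # log of the sample-based normalisation term
    loss = -(Fp @ delta).mean() + log_norm
    grad = -Fp.mean(axis=0) + (w[:, None] * Fq).sum(axis=0) / w.sum()
    return loss, grad

rng = np.random.default_rng(0)
Xp = rng.normal(size=(200, 4))             # samples from P (synthetic, for illustration)
Xq = rng.normal(size=(300, 4))             # samples from Q (synthetic, for illustration)
Fp, Fq = pairwise_features(Xp), pairwise_features(Xq)
loss0, grad0 = kliep_loss_and_grad(np.zeros(Fp.shape[1]), Fp, Fq)
print(loss0, grad0.shape)
```

At \({\varvec{\delta}} = {\varvec{0}}\) the modelled ratio is constant, so the loss is zero and the gradient reduces to the difference between the feature means under Q and P.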
3.2 Sparsity inducing and regularizations
In the search for sparse changes, one may regularize the KLIEP solution with a sparsity-inducing norm \(\sum _{u\ge v} \Vert {\varvec{\delta }}_{u,v} \Vert \), i.e., the group-lasso penalty (Yuan and Lin 2006), where we use \(\Vert \cdot \Vert \) to denote the \(\ell _2\) norm.
Optimization: Although the original objective of KLIEP is smooth and convex, sparsity-inducing norms are in general non-smooth. Proximal gradient methods, such as the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) (Beck and Teboulle 2009), can be utilized to solve regularized KLIEP objectives. A FISTA-like algorithm with a faster rate of convergence was proposed by Fazayeli and Banerjee (2016).
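The proximal step for the group-lasso penalty has a closed form (group soft-thresholding). The sketch below is a plain, non-accelerated proximal gradient loop rather than FISTA itself; `grad_fn`, `groups`, `step` and `lam` are illustrative names, and the quadratic toy objective stands in for the smooth KLIEP loss.

```python
import numpy as np

def group_soft_threshold(delta, groups, thresh):
    """Proximal operator of thresh * sum_g ||delta_g||_2."""
    out = np.zeros_like(delta)
    for g in groups:
        norm = np.linalg.norm(delta[g])
        if norm > thresh:
            out[g] = (1.0 - thresh / norm) * delta[g]   # shrink the whole group
    return out                                           # groups below thresh stay zero

def proximal_gradient(grad_fn, delta0, groups, lam, step, n_iter=100):
    """Gradient step on the smooth loss followed by group soft-thresholding."""
    delta = delta0.copy()
    for _ in range(n_iter):
        delta = group_soft_threshold(delta - step * grad_fn(delta), groups, step * lam)
    return delta

# Toy smooth loss 0.5 * ||delta - a||^2, whose regularised minimiser is known in closed form.
a = np.array([3.0, 4.0, 0.1, 0.1])
groups = [[0, 1], [2, 3]]
solution = proximal_gradient(lambda d: d - a, np.zeros(4), groups, lam=1.0, step=1.0)
print(solution)
```

In this toy problem the first group (norm 5) is shrunk by a factor of 0.8 and the second group (norm below the threshold) is set exactly to zero, which is the mechanism that produces sparse changes.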
3.3 Covariance-precision matching
Optimization: This method is quite computationally demanding as the dimension m grows. The Alternating Direction Method of Multipliers (ADMM) procedure (Boyd et al. 2011) was implemented based on an augmented version of (9) (see Section 3.3, Zhao et al. 2014 for details).
3.4 Maximizing joint likelihood
As mentioned in Sect. 2.3, one does not have to use direct modelling to learn sparse changes between MNs. In fact, separate modelling may not only discover changes, but can also recover the individual MNs themselves. Recently, a method based on the fused lasso (Tibshirani et al. 2005) has been developed (Zhang and Wang 2010). This method also sparsifies \({\varvec{\theta }}^{(p)} - {\varvec{\theta }}^{(q)}\) directly.
Since the fused-lasso-based method directly sparsifies the changes in MN structure, it can work well even when each MN is not sparse (when \(\lambda _1\) is set to 0).
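The role of \(\lambda _1\) can be seen from the shape of a fused-lasso-type penalty. The sketch below assumes the common form \(\lambda _1 (\Vert {\varvec{\theta }}^{(p)}\Vert _1 + \Vert {\varvec{\theta }}^{(q)}\Vert _1) + \lambda _2 \Vert {\varvec{\theta }}^{(p)} - {\varvec{\theta }}^{(q)}\Vert _1\); the name `fused_lasso_penalty` and the name `lam2` for the fusion weight are assumptions made for illustration.

```python
import numpy as np

def fused_lasso_penalty(theta_p, theta_q, lam1, lam2):
    """lam1 penalises density of each MN, lam2 penalises the change between them."""
    sparsity = lam1 * (np.abs(theta_p).sum() + np.abs(theta_q).sum())
    fusion = lam2 * np.abs(theta_p - theta_q).sum()
    return sparsity + fusion

theta_q = np.full(6, 2.0)       # a dense parameter vector (hypothetical values)
theta_p = theta_q.copy()
theta_p[3] += 0.5               # one changed parameter
penalty_dense = fused_lasso_penalty(theta_p, theta_q, lam1=0.0, lam2=1.0)
print(penalty_dense)
```

With `lam1=0.0` the dense individual parameters are not penalised at all; only the single changed entry contributes to the penalty.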
4 Theoretical analysis
The KLIEP algorithm does not only perform well in practice, but is also justified theoretically. In this section, we first introduce the support recovery theorem of KLIEP and then review some recent theoretical developments of direct change learning.
4.1 Preliminaries
In the previous section, a subvector of \({\varvec{\delta }}\) indexed by a pair (u, v) corresponds to a specific edge of an MN. From now on, we switch to a “unitary” index system as our analysis is not dependent on the edge nor the structure setting of the graph.
Sample Fisher information matrix: \({\mathcal {I}} \in {\mathbb {R}}^{\frac{b(m^2+m)}{2} \times \frac{b(m^2+m)}{2}}\) denotes the Hessian of the log-likelihood: \({\mathcal {I}} = \nabla ^2 \ell _{\mathrm{KLIEP}} ({\varvec{\delta }}^*)\). \({\mathcal {I}}_{AB}\) is the submatrix of \({\mathcal {I}}\) whose rows and columns are indexed by the two index sets \(A, B \subseteq E\), respectively.
In this section, we prove support consistency, i.e., that \(S={\hat{S}}\) and \(S^c={\hat{S}}^c\) hold with high probability (see e.g., Chapter 11 in Hastie et al. 2015 for an introduction to support consistency).
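In practice, support consistency is assessed by comparing the set of nonzero groups of an estimate with the true set. A minimal sketch, assuming the estimate is stored as a flat vector with one index list per group (the names `estimated_support`, `groups` and `tol` are hypothetical):

```python
import numpy as np

def estimated_support(delta_hat, groups, tol=1e-8):
    """Indices t of groups whose l2 norm is (numerically) nonzero."""
    return {t for t, g in enumerate(groups) if np.linalg.norm(delta_hat[g]) > tol}

delta_hat = np.array([0.0, 0.0, 0.3, -0.2, 0.0, 0.0])   # illustrative estimate
groups = [[0, 1], [2, 3], [4, 5]]
S_true = {1}                                             # illustrative true support
S_hat = estimated_support(delta_hat, groups)
all_groups = set(range(len(groups)))
recovered = (S_hat == S_true) and (all_groups - S_hat == all_groups - S_true)
print(S_hat, recovered)
```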
4.2 Assumptions
We try not to impose assumptions directly on the individual MNs, as the essence of the KLIEP method is that it can handle various changes regardless of the types of the individual MNs.
The first two assumptions are essential to many support consistency theorems (e.g. Eqs. (15) and (16) in Wainwright 2009, Assumption A1 and A2 in Ravikumar et al. 2010). These assumptions are made on the Fisher information matrix.
Assumption 1
(Dependency assumption) The sample Fisher information submatrix \({\mathcal {I}}_{{SS}}\) has bounded eigenvalues: \( \varLambda _{\mathrm{min}}({\mathcal {I}}_{{SS}}) \ge \lambda _{\mathrm{min}} > 0, \) with probability \(1-\xi _q\), where \(\varLambda _{{\mathrm {min}}}\) is the minimum-eigenvalue operator of a symmetric matrix.
This assumption on the submatrix of \({\mathcal {I}}\) is to ensure that the density ratio model is identifiable and the objective function is “reasonably convex”.
Assumption 2
(Incoherence assumption) \( \max _{t'' \in S^c}\Vert {\mathcal {I}}_{t''S} {\mathcal {I}}_{SS}^{-1}\Vert _1 \le 1-\alpha , \; 0<\alpha \le 1, \) with probability 1, where \(\Vert Y\Vert _1 = \sum _{i,j} \Vert Y_{i,j}\Vert _1\).
This assumption says the unchanged edges cannot exert overly strong effects on changed edges. Note that this assumption is sometimes called the “irrepresentability” condition.
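Both quantities in Assumptions 1 and 2 can be computed for a given matrix and support. The sketch below assumes one scalar parameter per edge (\(b = 1\)), so that \(\Vert \cdot \Vert _1\) of a row is the sum of absolute values; the matrix `fisher` is a toy example rather than a Fisher information matrix computed from data, and `check_assumptions` is a hypothetical helper.

```python
import numpy as np

def check_assumptions(fisher, support):
    """Return (minimum eigenvalue of I_SS, max_t ||I_tS I_SS^{-1}||_1 over t in S^c)."""
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(fisher.shape[0]), S)
    I_SS = fisher[np.ix_(S, S)]
    lam_min = np.linalg.eigvalsh(I_SS).min()
    cross = fisher[np.ix_(Sc, S)] @ np.linalg.inv(I_SS)
    incoherence = np.abs(cross).sum(axis=1).max()
    return lam_min, incoherence

# Toy symmetric matrix: unit diagonal, weak (0.1) coupling between all parameters.
fisher = np.eye(4) + 0.1 * (np.ones((4, 4)) - np.eye(4))
lam_min, incoherence = check_assumptions(fisher, support=[0, 1])
alpha = 1.0 - incoherence      # the incoherence assumption needs 0 < alpha <= 1
print(lam_min, incoherence, alpha)
```

For this toy matrix the minimum eigenvalue is 0.9 and the incoherence value is about 0.18, so both conditions hold with a comfortable margin; stronger coupling between \(S^c\) and \(S\) pushes the incoherence value towards 1.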
Assumption 3
\(\left\| \cdot \right\| , {\left| \left| \left| \cdot \right| \right| \right| }\) are the spectral norms of a matrix and a tensor, respectively (see, e.g., Tomioka and Suzuki 2014 for the definition of the spectral norm of a tensor). This assumption guarantees that the log-likelihood function is well-behaved. Now, we state the following assumptions on the density ratio:
Assumption 4
Although analysing the misspecified ratio model (Kanamori et al. 2010) is certainly an interesting open question, we focus on correctly specified models in this section.
Assumption 5
4.3 Successful support recovery of KLIEP (Liu et al. 2017a, b)
Theorem 1

Unique Solution: The solution of (11) is unique.

Successful Change Detection: \({\hat{S}} = S\) and \({\hat{S}}^c = S^c\).
The proof of this theorem follows the primal-dual witness construction (see, e.g., Section 11.4.2 in Hastie et al. 2015).
Remark
The main conclusion of this theorem states that if the regularization parameter is reasonably chosen (13) and the true nonzero groups \(\Vert {\varvec{\delta }}^*_{t'}\Vert , {t'}\in S\) are large enough (12), then with high probability we are guaranteed to recover the correct support of the parameters. The required \(n_p\) grows only linearly with \(\log m\), and the required \(n_q\) grows quadratically with \(n_p\).
4.4 \(\ell _2\) Consistency of KLIEP with atomic norm (Fazayeli and Banerjee 2016)
As introduced in Sect. 3.2, atomic norms can be used to learn changes with special topological structures. Instead of support recovery, we focus on the \(\ell _2\) loss between the estimated parameter \(\hat{{\varvec{\delta }}}\) and the true parameter \({\varvec{\delta }}^*\), i.e., \(\Vert {\varvec{\delta }}^* - \hat{{\varvec{\delta }}}\Vert \).
Such a theorem relies on the Restricted Strong Convexity (RSC) property of the objective function on the Error Set. Intuitively, if \(\ell _{\mathrm {KLIEP}}({\varvec{\delta }})\) is “highly curved”, a small \(\ell _{\mathrm {KLIEP}}(\hat{{\varvec{\delta }}}) - \ell _{\mathrm {KLIEP}}({\varvec{\delta }}^*)\) ensures a small \(\Vert \hat{{\varvec{\delta }}} - {\varvec{\delta }}^*\Vert \). Thus we only need to figure out how \(\ell _{\mathrm {KLIEP}}(\hat{{\varvec{\delta }}}) - \ell _{\mathrm {KLIEP}}({\varvec{\delta }}^*)\) reaches zero as the number of samples goes to infinity, which is a more accessible target.
To make sure our objective has such a “strongly convex” curvature, one needs to impose a uniform lower bound on the eigenvalues of the objective Hessian (a.k.a. the sample Fisher information matrix \({\mathcal {I}}\)). However, this is not realistic in the high-dimensional setting, as \({\mathcal {I}}\) is certainly rank-deficient. As an alternative, we impose an assumption on the convexity of \(\ell_{{\rm KLIEP}}\) over a constrained set:
Two things remain to be shown. First, we need to find such a cone which contains \(\hat{{\varvec{\delta }}} - {\varvec{\delta }}^*\). Second, we need to prove \(\ell _{{\rm KLIEP}}\) is RSC on this cone. We start with the first problem.
As to the second problem, it can be proven that \(\ell _{{\rm KLIEP}}\) is RSC at \(C_r\) with high probability once \(n_q \ge n_0, n_0 = w^2(C_r \cap S)\), where S is a unit hypersphere (Theorem 2 in Fazayeli and Banerjee 2016). Thus \(n_0\) is the minimum number of samples required from Q to be able to apply this theorem.
Putting everything together, we have the main theorem proved in Fazayeli and Banerjee (2016):
Theorem 2
Remark
Although this bound does not directly prove the support consistency, we can learn that the sample complexity \(\min (n_p, n_q) = \varOmega ( d\log m)\) guarantees the convergence of the estimation error in \(\ell _2\) norm. As to \(n_q\), it should also satisfy \(n_q \ge c_1 w^2(C_r\cap S)\), which is again \(n_q = \varOmega ( d \log m )\) in the case of the \(\ell _1\) norm. This sample complexity is milder than what Liu et al. (2017b) have obtained in the previous section, \(\varOmega (d^2 \log (m^2 + m)/2)\) and \(n_q = \varOmega (n_p^2)\). Nonetheless, both theories can be applied to the high-dimensional regime \(m\gg \min (n_p, n_q)\).
4.5 Support consistency of covariance-precision matching (Zhao et al. 2014)
In this section, we introduce the support recovery theorem of the Covariance-Precision Matching method (9) in terms of support consistency on Gaussian MNs. Specifically for Gaussian MNs, we need a slightly different set of notations, as they are parametrized in matrix forms. \(\varSigma _{j,k}^{(p)}\) is the (j, k)-th element of the matrix \({\varvec{\varSigma }}^{(p)}\) and \(\varSigma _{\mathrm{max}}^{(p)}\) is \(\max _j \varSigma ^{(p)}_{j,j}\).
The first assumption is to ensure that the “amount of change” is fixed and the change is always sparse, and does not grow with the dimension m.
Assumption 6
The difference matrix \({\varvec{\varDelta }}\) has \(d \le m\) nonzero elements in its upper triangular submatrix. \(\Vert {\varvec{\varDelta }}\Vert _1 \le M_0\), and both d and \(M_0\) do not depend on dimension m.
The second assumption assures that the covariates are not strongly dependent if there are many changes in the precision matrix. This is similar to the incoherence assumption used in Assumption 2.
Assumption 7
We first intuitively explain how the proof works. The proof of the support consistency can be thought of as controlling \(\Vert \hat{{\varvec{\varDelta }}} - {\varvec{\varDelta }}^*\Vert _\infty \). Clearly, for the population covariance matrices \({\varvec{\varSigma }}^{(p)}\) and \({\varvec{\varSigma }}^{(q)},\) \({\varvec{\varSigma }}^{(p)}{\varvec{\varDelta }}^*{\varvec{\varSigma }}^{(q)} + {\varvec{\varSigma }}^{(p)} - {\varvec{\varSigma }}^{(q)} = {\varvec{0}}\). If we replace the above population covariances with their sample versions, we can expect \(\Vert \hat{{\varvec{\varSigma }}}^{(p)}{\varvec{\varDelta }}^*\hat{{\varvec{\varSigma }}}^{(q)} + \hat{{\varvec{\varSigma }}}^{(p)} - \hat{{\varvec{\varSigma }}}^{(q)}\Vert _\infty \le \epsilon ,\) if the number of samples is large enough. Furthermore, \(\epsilon \) can be a function decreasing with \(\min (n_p, n_q)\) as the estimated covariances get closer and closer to the population ones.
Therefore, if we set the \(\epsilon \) to a decreasing function, we can still “contain” the optimal parameter \({\varvec{\varDelta }}^*\) in the feasible zone with high probability. By definition, the estimated difference \(\hat{{\varvec{\varDelta }}}\) should also be in the feasible zone; thus they should not be far off, if the zone is small enough. The rigorous proof of the above statements is given in the Appendix of Zhao et al. (2014).
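The intuition above can be checked numerically. The following is a minimal sketch, assuming zero-mean Gaussian MNs and the sign convention \({\varvec{\varDelta }}^* = {\varvec{\varTheta }}^{(p)} - {\varvec{\varTheta }}^{(q)}\) with \({\varvec{\varSigma }} = {\varvec{\varTheta }}^{-1}\); the precision matrices and the helper names `cp_residual` and `sample_residual` are illustrative. It evaluates the sup-norm residual of the matching equality for population covariances and for sample covariances with increasing sample size.

```python
import numpy as np

def cp_residual(sigma_p, sigma_q, delta):
    """Sup-norm of Sigma_p Delta Sigma_q + Sigma_p - Sigma_q."""
    return np.abs(sigma_p @ delta @ sigma_q + sigma_p - sigma_q).max()

rng = np.random.default_rng(0)
m = 5
A = rng.normal(size=(m, m))
theta_q = A @ A.T + m * np.eye(m)      # dense, positive-definite precision matrix of Q
theta_p = theta_q.copy()
theta_p[0, 1] += 0.5                   # one changed edge (kept symmetric)
theta_p[1, 0] += 0.5
delta_star = theta_p - theta_q         # sparse difference of precision matrices
sigma_p, sigma_q = np.linalg.inv(theta_p), np.linalg.inv(theta_q)

# With population covariances the equality holds up to rounding error.
population_residual = cp_residual(sigma_p, sigma_q, delta_star)

def sample_residual(n):
    """Residual when both covariances are estimated from n Gaussian samples each."""
    Xp = rng.multivariate_normal(np.zeros(m), sigma_p, size=n)
    Xq = rng.multivariate_normal(np.zeros(m), sigma_q, size=n)
    return cp_residual(np.cov(Xp, rowvar=False), np.cov(Xq, rowvar=False), delta_star)

residuals = {n: sample_residual(n) for n in (100, 1000, 10000)}
print(population_residual, residuals)
```

The sample residuals play the role of \(\epsilon \): they are small but nonzero, and they shrink as the sample covariances approach the population ones.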
Now, we give the support recovery theorem^{3} as follows (see Section 4 in Zhao et al. 2014 for details):
Theorem 3
This support consistency theorem, although it applies only to Gaussian MNs, has a similar structure to the one derived for KLIEP (Sect. 4.3). First, they both assume the true nonzero parameters should be large enough. Second, they both assume the sparsity-inducing factor (\(\lambda _{n_p, n_q}\) and \(\tau _{n_p, n_q}\)) should decay as the sample size \(\min (n_p, n_q)\) increases, and grow as the log-dimension \(\log m\) increases.
4.6 Summary and discussion

None of the above proofs require the sparsity assumption on each individual MN. Thus in theory, all methods should work well when individual MNs are dense.

The efficiency of all methods is affected by the sparsity of changes (i.e. d). This makes sense since the sparsity assumption is made on the changes between two MNs.

All theorems apply to the highdimensional regime (\(m \gg \min (n_p, n_q)\)). None requires \(n_p\) or \(n_q\) to be comparable to the dimensionality m.
5 Experiments
In this section, we compare the performance of two direct change detection methods, KLIEP and Covariance-Precision (CP) Matching, using synthetic and real-world examples.
5.1 Implementations
Sparsity-inducing KLIEP can be implemented using a subgradient descent approach. The MATLAB® code can be found at http://www.ism.ac.jp/~liu/kliep_sparse/demo_sparse.html.
The R (R Core Team 2016) implementation of CP matching using ADMM can be obtained at https://github.com/sdzhao/dpm.
5.2 Synthetic examples
The same experiments are repeated using the CP matching method. However, since the sparsity control of CP matching is via the selection of the threshold \(\tau \), we set \(\epsilon = 0.2\) which shows good performance empirically and plot the learned \(\hat{{\varvec{\varDelta }}}\) using different thresholds. Results are shown in Fig. 1f–h.
As we can see, both approaches recover the change pattern well as we increase the sparsity control parameter.
The area under the curve (AUC) of ROC plot in Fig. 1b (“K” for KLIEP and “CP” for CP matching)
Method  \(m = 9\)  \(m = 16\)  \(m = 25\)  \(m = 36\)  \(m = 49\)  \(m = 64\)  \(m = 81\)  \(m = 100\)
K  0.8746  0.8865  0.8899  0.8890  0.8902  0.8878  0.8903  0.8866
CP  0.8165  0.7917  0.7627  0.6829  0.5574  0.5914  0.5475  0.5656
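For reference, an AUC value of this kind can be computed from per-edge scores with the rank-based (Mann–Whitney) formula. This is a minimal sketch and not the evaluation script used for the table: it assumes each candidate edge receives a score (for instance, the magnitude of its estimated change) and a binary label marking truly changed edges, and the function name `auc` is a hypothetical helper.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a changed edge is scored above an unchanged one (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

scores = [0.9, 0.5, 0.4, 0.0]      # illustrative edge scores
labels = [1, 0, 1, 0]              # 1 = truly changed edge
print(auc(scores, labels))         # 3 of the 4 (changed, unchanged) pairs are ordered correctly
```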
5.3 Running time
Although a rigorous timing comparison is difficult due to the different implementations of KLIEP and CP matching, in our experience KLIEP is faster but more memory-consuming, as our implementation stores the entire parameter vector in memory. On a server with 16 Xeon cores, it takes KLIEP about 15 min to run the experiments needed for plotting Fig. 1b, while it takes CP matching roughly 1 h.
As to KLIEP, we also observe that “early stopping” heuristics (e.g., stopping at 100 iterations) can provide an accurate nonzero pattern within a short period of time.
5.4 Image difference detection
Two photos were taken on a rainy afternoon using a camera pointing at the parking lot of The Institute of Statistical Mathematics (ISM). In this task, we are interested in learning the changes of the parking patterns marked by green boxes in Fig. 2b. As we can see from Fig. 2a, b, the light conditions and positions of raindrops vary between the two pictures.
To construct samples, we use windows of pixels (Fig. 2c). Each window is a dimension of a dataset, and the samples are the pixel RGB values within this window. By sliding the window across the entire picture, we may obtain samples of different dimensions. Two sets of data can be obtained by using this sample generating mechanism over two images.
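The sample construction described above can be sketched as follows. The window size matches the \(16 \times 16\) windows used in the experiment, but the stride and the function name `window_samples` are assumptions made for illustration, so the resulting number of dimensions need not match the value of m reported for the experiment.

```python
import numpy as np

def window_samples(image, win=16, stride=16):
    """Turn an (H, W, C) image into an array of shape (win*win, m, C).

    Each window is one dimension u; the win*win pixels inside it are the samples,
    each sample being the C-channel (e.g. RGB) value of one pixel.
    """
    H, W, C = image.shape
    windows = []
    for top in range(0, H - win + 1, stride):
        for left in range(0, W - win + 1, stride):
            patch = image[top:top + win, left:left + win]     # (win, win, C)
            windows.append(patch.reshape(win * win, C))
    return np.stack(windows, axis=1)

rng = np.random.default_rng(0)
image_p = rng.random((150, 200, 3))    # stand-ins for the two photos
image_q = rng.random((150, 200, 3))
Xp, Xq = window_samples(image_p), window_samples(image_q)
print(Xp.shape, Xq.shape)
```

With non-overlapping windows this yields 256 samples over 108 window dimensions per image; a smaller stride produces more, overlapping windows.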
Assuming an image can be represented by an MN of windows, changes of pixel values within a window may cause changes of “interactions” between neighbouring windows. In other words, we can discover a change by looking at the change of the dependency of pixel values between a certain window and its neighbours. This is more advantageous than simply looking at the pixel values since changing the brightness of a picture may increase the pixel values in many windows simultaneously, even if the “contrast” between two windows does not change by much.
By applying KLIEP to these two sets of data and highlighting adjacent window pairs that are involved in the changes of pairwise interactions, we may spot changes between two images. In our experiment, we use sliding windows of size \(16 \times 16\) on a \(200\times 150\) image, generating two sets of samples with \(m=999\) and \(n_p=n_q=256\). We reduce \(\lambda \) until \(|{\hat{S}}| > 40\). The spotted changes are plotted in Fig. 2d. It can be seen that KLIEP has correctly labelled almost all changed parking spaces between the two images, except for one missed on the left.
Note that here we set \(\psi ({\varvec{x}}_u, {\varvec{x}}_v) = \exp (-\frac{\Vert {\varvec{x}}_u - {\varvec{x}}_v\Vert ^2}{0.5})\), and the underlying MN is highly non-Gaussian, so CP matching cannot be applied here.
6 Open problems
Although pioneering works have been conducted in this area, there are still important unsolved open problems. In this section, we list a few examples.
Generalized covariance-precision matching: In Sect. 3.3, we introduced an equality between the Gaussian covariance and precision matrices (4). This leads to a direct sparse change learning approach. However, it does not apply to more general pairwise MNs. A natural question is, can we extend this relationship between covariance and precision matrices to a more general principle? In particular, in a recent work (Loh and Wainwright 2013), the generalized covariance matrix was used to learn a non-Gaussian graphical model structure. Would a generalized equality (4) provide us with a universal framework for learning changes between MNs?
Learning changes from multiple MNs: In this paper, we have only reviewed the algorithms that learn changes between two MNs. In fact, in some applications, datasets may be obtained as multiple “snapshots”. For example, gene activities may be measured at a few different time points. Under the same assumption that changes between adjacent time points are “mild” and “sparse”, can we perform multiple change detections in one shot?
Asymmetry versus symmetry: As we have pointed out in Sect. 4.6, there exists an asymmetry in KLIEP while Covariance-Precision matching has a symmetric formulation. An interesting future direction is to systematically investigate how such an asymmetry affects the change detection results, and more importantly, how we can automatically determine which density should be Q and which one should be P in the ratio formulation.
We believe thorough investigations in these three directions will significantly expand our knowledge over the domain of learning changes between MNs in the future.
7 Conclusion
In this paper, we have reviewed an MN change learning method based on density ratio estimation and other alternative approaches. Statistical guarantees regarding the support recovery and \(\ell _2\) consistency were also reported and compared. Through their direct modelling and theoretical results, we can see an interesting common pattern in all these methods: they work well regardless of the difficulty of learning individual MNs.
These results are inspiring as they shed light on a new family of methods that only learn the incremental patterns. They show that if the change itself is simple enough, even with a limited amount of information, we can have good learning performance. Compared to classic, static pattern recognition, such methods are well suited for analysing dynamic datasets, where the “absolute” pattern is not the main interest and learning the change itself is more valuable.
These works have offered a new vision of research on learning changes between patterns. We believe these methods and theorems may have many potential applications in the years to come.
Footnotes
 1. If one models the ratio \(\frac{q(x)}{p(x)}\), the normalization $$\begin{aligned} N({\varvec{\delta }}) = \int p(x) \exp \left( \sum _{u,v = 1, u\ge v}^m {\varvec{\delta }}_{u,v}^\top {\varvec{\psi }}_{u,v}(x_{u},x_{v}) \right) {\mathrm{d}}{\varvec{x}}\end{aligned}$$ should be used.
 2.
\(q({\varvec{x}})\) should not be confused with \(q(x;{\varvec{\theta }})\).
 3.
In fact, the support recovery theorem was proved for a slightly augmented version of (9).
 4.
\(\tau _{n_p, n_q}, \epsilon _{n_p,n_q}\) are the sample-dependent versions of \(\tau ,\epsilon \) introduced in Sect. 3.3.
 5.
We convert \(\hat{{\varvec{\delta }}}\) into its corresponding matrix form.
Notes
Acknowledgements
We would like to thank Masashi Sugiyama, John Quinn and Michael Gutmann for their tremendous help during the development of the density ratio modelling idea. This work was partially supported by JSPS Grant-in-Aid for Scientific Research Activity Start-up 15H06823, MEXT KAKENHI (25730013, 25120012, 26280009, 15H01678 and 15H05707), and JST-PRESTO. We would also like to thank the anonymous reviewers and Matthew Ames for their helpful comments and suggestions on this paper.
Compliance with ethical standards
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
References
 Banerjee O, El Ghaoui L, d’Aspremont A (2008) Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J Mach Learn Res 9:485–516MathSciNetMATHGoogle Scholar
 Banerjee A, Chen S, Fazayeli F, Sivakumar V (2014) Estimation with norm regularization. Adv Neural Inf Process Syst 26:1556–1564Google Scholar
 Beck A, Teboulle M (2009) A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202MathSciNetCrossRefMATHGoogle Scholar
 Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, CambridgeCrossRefMATHGoogle Scholar
 Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends® Mach Learn 3(1):1–122CrossRefMATHGoogle Scholar
 Chandrasekaran V, Recht B, Parrilo PA, Willsky AS (2012) The convex geometry of linear inverse problems. Found Comput Math 12(6):805–849MathSciNetCrossRefMATHGoogle Scholar
 Chickering DM (1996) Learning Bayesian networks is NPcomplete. In: Learning from data. Springer, Berlin, pp 121–130Google Scholar
 Chow C, Liu C (1968) Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14(3):462–467MathSciNetCrossRefMATHGoogle Scholar
 Fazayeli F, Banerjee A (2016) Generalized direct change estimation in ising model structure. In: Proceedings of the 33rd international conference on machine learning, pp 2281–2290. http://jmlr.org/proceedings/papers/v48/fazayeli16.html
 Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphical Lasso. Biostatistics 9(3):432–441
 Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741
 Hammersley JM, Clifford P (1971) Markov fields on finite graphs and lattices (unpublished)
 Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the Lasso and generalizations. CRC Press, Boca Raton
 Kanamori T, Suzuki T, Sugiyama M (2010) Theoretical analysis of density ratio estimation. IEICE Trans Fundam Electron Commun Comput Sci E93-A(4):787–798
 Kolar M, Xing EP (2012) Estimating networks with jumps. Electron J Stat 6:2069–2106
 Kolar M, Song L, Ahmed A, Xing EP (2010) Estimating time-varying networks. Ann Appl Stat 4(1):94–123
 Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, Cambridge
 Ledoux M, Talagrand M (2013) Probability in Banach spaces: isoperimetry and processes. Springer Science & Business Media, Berlin
 Liu H, Lafferty J, Wasserman L (2009) The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J Mach Learn Res 10:2295–2328
 Liu H, Xu M, Gu H, Gupta A, Lafferty J, Wasserman L (2011) Forest density estimation. J Mach Learn Res 12(Mar):907–951
 Liu H, Han F, Yuan M, Lafferty J, Wasserman L (2012) The nonparanormal skeptic. In: Proceedings of the 29th international conference on machine learning (ICML2012) (accepted)
 Liu S, Quinn JA, Gutmann MU, Suzuki T, Sugiyama M (2014) Direct learning of sparse changes in Markov networks by density ratio estimation. Neural Comput 26(6):1169–1197
 Liu S, Suzuki T, Relator R, Sese J, Sugiyama M, Fukumizu K (2017a) Supplement to "support consistency of direct sparse-change learning in Markov networks" (accepted)
 Liu S, Suzuki T, Relator R, Sese J, Sugiyama M, Fukumizu K (2017b) Support consistency of direct sparse-change learning in Markov networks. Ann Stat (accepted)
 Loh PL, Wainwright MJ (2013) Structure estimation for discrete graphical models: generalized covariance matrices and their inverses. Ann Stat 41(6):3022–3049
 Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the Lasso. Ann Stat 34(3):1436–1462
 Mohan K, London P, Fazel M, Witten DM, Lee S (2014) Node-based learning of multiple Gaussian graphical models. J Mach Learn Res 15(1):445–488
 Negahban S, Yu B, Wainwright MJ, Ravikumar PK (2009) A unified framework for high-dimensional analysis of \(m\)-estimators with decomposable regularizers. Adv Neural Inf Process Syst 21:1348–1356
 R Core Team (2016) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
 Ravikumar P, Wainwright MJ, Lafferty JD (2010) High-dimensional Ising model selection using \(\ell_1\)-regularized logistic regression. Ann Stat 38(3):1287–1319
 Robert CP, Casella G (2005) Monte Carlo statistical methods. Springer, Berlin
 Spirtes P, Glymour CN, Scheines R (2000) Causation, prediction, and search. MIT Press, Cambridge
 Sugiyama M, Nakajima S, Kashima H, von Bünau P, Kawanabe M (2008) Direct importance estimation with model selection and its application to covariate shift adaptation. In: Advances in neural information processing systems, vol 20, pp 1433–1440
 Sugiyama M, Suzuki T, Kanamori T (2012) Density ratio estimation in machine learning. Cambridge University Press, Cambridge
 Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B (Methodological) 58(1):267–288
 Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused Lasso. J R Stat Soc Ser B (Stat Methodol) 67(1):91–108
 Tomioka R, Suzuki T (2014) Spectral norm of random tensors. arXiv preprint arXiv:1407.1870 [math.ST]
 Tsuboi Y, Kashima H, Hido S, Bickel S, Sugiyama M (2009) Direct density ratio estimation for large-scale covariate shift adaptation. J Inf Process 17:138–155
 Wainwright MJ (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell_1\)-constrained quadratic programming (Lasso). IEEE Trans Inf Theory 55(5):2183–2202
 Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67
 Zhang B, Wang YJ (2010) Learning structural changes of Gaussian graphical models in controlled experiments. In: Proceedings of the twenty-sixth conference on uncertainty in artificial intelligence (UAI2010), pp 701–708
 Zhao P, Yu B (2006) On model selection consistency of Lasso. J Mach Learn Res 7:2541–2563
 Zhao S, Cai T, Li H (2014) Direct estimation of differential networks. Biometrika 101(2):253–268
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.