Vine copula approximation: a generic method for coping with conditional dependence

Pair-copula constructions (or vine copulas) are built, in the layout of vines, from bivariate copulas and conditional bivariate copulas. The main contribution of the current work is an approach to a long-standing problem: how to cope with the dependence structure between the two conditioned variables indicated by an edge, acknowledging that this structure changes with the values of the conditioning variables. The changeable dependence problem, though recognized as crucial in multivariate modelling, remains largely unexplored owing to its inherent complexity, and it is the motivation for the current work. Rather than resorting to traditional parametric or nonparametric methods, we proceed from a different viewpoint: approximating a conditional copula, to any required degree of accuracy, by utilizing a family of basis functions. We fully incorporate the impact of the conditioning variables on the functional form of a conditional copula by employing local learning methods. The attractions and dilemmas of the pair-copula approximating technique are revealed via simulated data, and its practical importance is evidenced via a real data set.


Introduction
Pair-copula constructions (or vine copulas), introduced by Joe (1996) and further developed by Bedford and Cooke (2001), Bedford and Cooke (2002) and Aas et al. (2009), provide an adaptable and manageable way of modelling the dependence structure within a random vector. While a multivariate copula is superior to a multivariate joint distribution (in that the former divides the problem of specifying a full joint distribution into two: the problem of modelling marginal distributions and the problem of modelling the multivariate dependence structure), a vine copula is preferable to a multivariate copula (in that, compared with bivariate copulas, the multivariate copulas developed in the literature are few and are incapable of capturing all the possible dependence structures within a random vector). A vine copula owes its flexibility and competence in modelling multivariate dependence to its vine hierarchy, a graphical tool for stacking (conditional) bivariate copulas. Over the past decade, vine copulas have been used in a variety of applied work, including finance, hydrology, meteorology, biostatistics, machine learning, geology and wind energy; see, e.g., Soto et al. (2012), Fan and Patton (2014), Hao and Singh (2015) and Valizadeh et al. (2015). Pircalabelu et al. (2015) incorporated vine copulas into Bayesian networks to deal with continuous variables, while Panagiotelis et al. (2012) studied the problem of applying vine copulas to discrete multivariate data. We refer the reader to Joe (2014) for a comprehensive review of vine copulas and related topics.
A vine copula is a hierarchy of bivariate copulas and conditional bivariate copulas. For a conditional bivariate copula, the dependence structure between the two conditioned variables (i.e., the functional form of the conditional copula) can be highly influenced by the conditioning variables. Though the theoretical and applied literature on vine copulas is large, the vast majority of the documented work adopts the simplifying assumption that the functional form of a conditional bivariate copula does not change with the conditioning variables (Acar et al. 2012). To name a few, Haff (2013) extended the work of Aas et al. (2009) to develop a stepwise semi-parametric estimator for parameter estimation of vine copulas; both Aas et al. (2009) and Haff (2013) assumed that the parameters of the conditional bivariate copulas are all fixed. Later, Haff and Segers (2015) developed a method for nonparametric estimation of vine copulas; again, they employed the simplifying assumption. Likewise, adopting the simplifying assumption, Kauermann and Schellhase (2014) approximated conditional bivariate copulas by tensor products of a family of basis functions. So and Yeung (2014) assumed that certain dependence measures, e.g., rank correlation, change with time yet not with the conditioning variables. See Stöber et al. (2013) for a discussion of the limitations of simplified pair-copula constructions.
Apparently, ignoring the role of the conditioning variables in a conditional bivariate copula will degrade the overall performance of the fitted multivariate copula. A natural way to model a changeable conditional bivariate copula is to employ a parametric copula whose parameter is a function of the conditioning variables; see, e.g., Gijbels et al. (2011), Veraverbeke et al. (2011) and Dißmann et al. (2013). Acar et al. (2011) approximated the function by local polynomials. Lopez-Paz et al. (2013) employed a type of parametric copula that can be fully determined by Kendall's τ rank correlation coefficient; they related Kendall's τ to the conditioning variables through the standard normal distribution. We point out that, on the one hand, the choice of a parametric copula is always preconceived, usually from the existing parametric copulas in the literature. On the other hand, the functional form relating the copula parameter and the conditioning variables is always subjectively determined. The main contribution of the current work is a generic approach to the changeable dependence problem. One distinguishing feature of our approach is that we do not impose any structural assumption on the true (conditional) bivariate copula, except that (1) the copula is continuous w.r.t. its two arguments and its parameters and (2) the parameters are continuous functions of the conditioning variables. We approximate the true copula by a family of basis functions (to any required degree of approximation). The feasibility of the pair-copula approximating approach is guaranteed by the theoretical work developed in Bedford et al. (2016).
The remainder of the paper is organized as follows. In Sect. 2, we give a brief summary of vine copula and relative information. In Sect. 3, we present the general procedure for approximating bivariate copulas and conditional bivariate copulas. Section 4 is devoted to dealing with some technical issues when approximating a conditional copula. In Sect. 5, the attractions and dilemmas of the pair-copula approximating technique are revealed via simulated data, and its practical importance is evidenced via a real data set.
2 Vine copula and relative information

Vine copula
A vine is a graphical tool that helps construct multivariate distributions in a flexible and explicit manner. A vine on n variables is a nested set of connected trees {T_1, ..., T_{n−1}} in which the edges of tree T_i (i = 1, ..., n−2) are the nodes of tree T_{i+1}, and each tree has the maximum number of edges. A regular vine on n variables is a particular vine in which two edges in tree T_i (i = 1, ..., n−2) are joined by an edge in tree T_{i+1} if and only if the two edges share a common node. Formally, a vine is defined as follows (Kurowicka 2011, Chapter 3).
Here, Δ is the symmetric difference operator, and # is the cardinality operator. A regular vine (called a D-vine) on four variables is exemplified in Fig. 1 (a regular vine on four variables). In Fig. 1, the set of variables to the right of the vertical slash is called a conditioning set, and the set of variables to the left of the vertical slash is called a conditioned set. The constraint set, conditioning set and conditioned set of an edge are defined as follows.
Definition 2 For every e = {e_1, e_2} ∈ E_i (i = 2, ..., n−1), the constraint set related to edge e is the subset of {1, ..., n} reachable from e. Write U*_e for the constraint set of e. The conditioning set of e is D_e = U*_{e_1} ∩ U*_{e_2}, and the conditioned set of e is {U*_{e_1} \ D_e, U*_{e_2} \ D_e}. Here, U*_{e_1} \ D_e represents the relative complement of D_e in U*_{e_1}. We write ė_1 for U*_{e_1} \ D_e and ė_2 for U*_{e_2} \ D_e; hence the conditioned set of e is {ė_1, ė_2}. Throughout the work, we represent edge e by {ė_1, ė_2 | D_e}. Referring to Fig. 1, the set of edges for tree T_3 contains only one element, E_3 = {{1, 4 | 2, 3}}; the constraint set of this edge is {1, 2, 3, 4}, its conditioned set is {1, 4}, and its conditioning set is {2, 3}. If e = {e_1, e_2} ∈ E_1, we have U*_e = {e_1, e_2} and D_e is empty. An n-variate copula is an n-variate probability distribution defined on the unit hypercube [0, 1]^n with uniform marginal distributions. There is a one-to-one correspondence between the set of n-variate copulas and the set of n-variate distributions, as stated in Sklar's theorem (Sklar 1959).
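Definition 2 can be made concrete with a few lines of code. The sketch below (illustrative, not from the paper) computes the constraint, conditioning and conditioned sets for the D-vine of Fig. 1, representing an edge as a nested pair of the two edges (or variables) it joins:

```python
# Sketch: constraint/conditioning/conditioned sets for the D-vine on four
# variables of Fig. 1. Edges are nested pairs; leaves are variable indices.

def constraint_set(edge):
    """U*_e: the set of variables reachable from edge e."""
    if isinstance(edge, int):                 # a node of tree T1 is a variable
        return {edge}
    e1, e2 = edge
    return constraint_set(e1) | constraint_set(e2)

def label(edge):
    """Return (conditioned set, conditioning set) for an edge of tree T_i, i >= 2."""
    e1, e2 = edge
    u1, u2 = constraint_set(e1), constraint_set(e2)
    d = u1 & u2                               # conditioning set D_e
    return (u1 - d) | (u2 - d), d             # conditioned set

# Tree T1 edges of the D-vine 1-2-3-4:
e12, e23, e34 = (1, 2), (2, 3), (3, 4)
# Tree T2 edges join T1 edges sharing a common node:
e13_2, e24_3 = (e12, e23), (e23, e34)
# Tree T3 edge:
e14_23 = (e13_2, e24_3)

print(label(e14_23))   # conditioned set {1, 4}, conditioning set {2, 3}
```

Running this reproduces the sets stated above for the edge {1, 4 | 2, 3}.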
The coupling of regular vines and bivariate copulas produces a particularly versatile tool, called a vine copula or pair-copula construction, for modelling multivariate data. The backbone of a vine copula is the re-forming, according to the structure of a regular vine, of a multivariate copula into a hierarchy of (conditional) bivariate copulas. Given a regular vine V, for any e ∈ E(V) with conditioned set {ė_1, ė_2} and conditioning set D_e, let X_e = (X_v : v ∈ D_e) denote the vector of random variables indicated by the conditioning set D_e. Throughout the work, all vectors are defined to be row vectors. Define C_{ė_1ė_2|D_e}(·) (resp. c_{ė_1ė_2|D_e}(·)) to be the bivariate copula (resp. copula density) for the edge e; both are conditioned on X_e. Let x_{ė_1}, x_{ė_2} and x_e, respectively, denote generic values of X_{ė_1}, X_{ė_2} and X_e. We have the following theorem (Bedford and Cooke 2001).
Theorem 2 Let V = {T_1, ..., T_{n−1}} be a regular vine on the random variables {X_1, ..., X_n}, and let the marginal distribution functions F_i(x_i) and density functions f_i(x_i) (i = 1, ..., n) be given. Then the vine-dependent n-variate distribution is uniquely determined, with density function

f(x_1, ..., x_n) = ∏_{i=1}^{n} f_i(x_i) ∏_{e ∈ E(V)} c_{ė_1ė_2|D_e}(u_{x_e}, w_{x_e} | X_e = x_e).

Here, u_{x_e} = F_{ė_1|D_e}(x_{ė_1} | X_e = x_e) and w_{x_e} = F_{ė_2|D_e}(x_{ė_2} | X_e = x_e) are two conditional marginal distributions, both conditioned on X_e. All the involved conditional marginal distributions can be derived from the marginal distribution functions and copula densities. (See Section 2.2 of the supplementary material for more discussion on deriving conditional marginal distributions.) Theorem 2 states that we are able to derive the n-variate density function once we are given the n marginal distribution functions and all the bivariate copulas originating from the regular vine. The n marginal distribution functions can be readily estimated from collected data, either parametrically or empirically, using standard univariate methods. The estimation of the involved bivariate copulas is, however, non-trivial and still remains an open problem. Note that the form of the copula density c_{ė_1ė_2|D_e}(·) (namely the dependence structure between F_{ė_1|D_e}(X_{ė_1} | X_e = x_e) and F_{ė_2|D_e}(X_{ė_2} | X_e = x_e)) can be highly influenced by the value of X_e. This dependence of the form of c_{ė_1ė_2|D_e}(·) on X_e is commonly ignored in the vine copula community, owing to practical concerns such as computational load and the curse of dimensionality (see, e.g., Kauermann and Schellhase 2014).
In Sect. 3, we will introduce a family of minimally informative copulas that can cope with the dependence of c_{ė_1ė_2|D_e}(·) on X_e. Deriving a minimally informative copula involves the notion of relative information (Kullback-Leibler divergence).

Relative information
Definition 3 The relative information of Q from P is a nonsymmetric measure of the information lost when a probability measure P is approximated by another probability measure Q (over a set Ω). P should be absolutely continuous w.r.t. Q. The relative information of Q from P, denoted by I(P|Q), is defined by

I(P|Q) = ∫_Ω log(dP/dQ) dP.

Here, dP/dQ is the Radon-Nikodym derivative of P w.r.t. Q. If μ is any measure on Ω for which dP/dμ and dQ/dμ exist, then the relative information of Q from P can be written as

I(P|Q) = ∫_Ω p log(p/q) dμ,

where p = dP/dμ and q = dQ/dμ.
Relative information is always nonnegative and equals 0 if and only if P = Q almost everywhere. Relative information is a popular "metric" for measuring the distance between probability distributions. Two elegant properties of relative information make it a natural criterion for copula selection. Firstly, relative information is invariant under monotonic transformations. For example, let the n-variate distributions f(x_1, ..., x_n) and g(x_1, ..., x_n) have identical marginal distributions f_i(x_i), i = 1, ..., n. Write c_f(·) for the copula density of f(x_1, ..., x_n), and c_g(·) for the copula density of g(x_1, ..., x_n). If we want to approximate f by g, the invariance property gives

I(f | g) = I(c_f | c_g).   (2)

Therefore, if f(x_1, ..., x_n) is the true law and c_g(F_1(x_1), ..., F_n(x_n)) has the minimum relative information w.r.t. c_f(F_1(x_1), ..., F_n(x_n)), then g(x_1, ..., x_n) has the minimum relative information w.r.t. f(x_1, ..., x_n). In what follows, we say an n-variate copula is minimally informative if its relative information from the independence copula is minimal. Therefore, a minimally informative copula is the most "independent" copula among all the qualified copulas. See (Jaynes 2003, Chapter 11) for an enlightening explanation of relative information (therein called entropy), which justifies the employment of minimally informative copulas for analyzing multivariate data. Given a multivariate data set, Eq. (2) reduces the problem of finding the minimally informative multivariate distribution to the problem of finding the minimally informative multivariate copula. How, then, do we find the minimally informative multivariate copula? The second property of relative information states that a vine-dependent distribution is minimally informative if and only if all its bivariate copulas are minimally informative (Bedford and Cooke 2002). Therefore, to guarantee that a multivariate copula is minimally informative, we only need to find the minimally informative bivariate copula for each edge in the regular vine.
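The relative information of the independence copula from a given copula density can be computed numerically. The sketch below (illustrative, not from the paper) evaluates it by the midpoint rule for a Farlie-Gumbel-Morgenstern (FGM) copula, whose density has a simple closed form:

```python
# Sketch: relative information I(P|Q) computed numerically for a bivariate FGM
# copula density p(u, v) = 1 + theta*(1-2u)*(1-2v) against the independence
# copula q(u, v) = 1, on a midpoint grid over [0, 1]^2.
import numpy as np

def relative_information(p, q, n=400):
    """I(P|Q) = integral of p * log(p/q) over the unit square (midpoint rule)."""
    t = (np.arange(n) + 0.5) / n
    u, v = np.meshgrid(t, t)
    pv, qv = p(u, v), q(u, v)
    return np.mean(pv * np.log(pv / qv))      # cell area = 1/n^2, so mean = sum/n^2

theta = 0.8
fgm = lambda u, v: 1.0 + theta * (1 - 2*u) * (1 - 2*v)
indep = lambda u, v: np.ones_like(u)

print(relative_information(fgm, indep))       # positive; equals 0 iff p = q a.e.
print(relative_information(indep, indep))     # 0 when P = Q
```

The larger |theta| is, the further the FGM copula is from independence and the larger its relative information.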
We now frame our line of approach to modelling multivariate data as follows. Given a multivariate data set and a regular vine on the involved random variables, we will formulate the optimal bivariate copula for every edge in the regular vine from the top level to the bottom level. A bivariate copula is optimal in the sense that it meets all the specified constraints and is minimally informative, making the corresponding multivariate copula minimally informative. Clearly, there are two problems related to our approach, which will be attended to in the following section: (1) the type of constraints that a bivariate copula needs to meet and (2) how to analytically formulate the optimal copula.
Remark 1 Here, we presume that the structure of the regular vine is given. The determination of the structure of a regular vine is an open topic of great importance. A well-structured vine copula can capture the underlying multivariate law by copulas in the lower hierarchy of the vine and therefore can reduce computational load and mitigate the curse of dimensionality by simplifying copulas in the deeper hierarchy of the vine (Dißmann et al. 2013). This topic is beyond the scope of the current work and will be addressed in the future.

Minimally informative copula
In the following, for exposition convenience, we assume that the n random variables X 1 , . . ., X n are uniformly distributed. (We can readily transform an arbitrary multivariate data set into a data set with uniform marginal distributions by taking the probability integral transformation.) The constraints that a qualified copula for an edge needs to meet are called expectation constraints. We illustrate, via an example, the manner and rationale of specifying expectation constraints, while a more detailed explanation is given by Bedford et al. (2016).
Example 1 Let V = {T_1, ..., T_{n−1}} be a regular vine on the uniform random variables {X_1, ..., X_n}. For now, we concentrate on an edge e ∈ E_1; that is, the conditioning set of e is empty. Let X_i and X_j (1 ≤ i < j ≤ n) be the two uniform random variables joined by the edge e. A bivariate copula density c̃_e(x_i, x_j) for the edge e is said to be qualified if the following k (≥ 1) expectation constraints are satisfied:

E[h_ℓ(X_i, X_j)] = ∫_0^1 ∫_0^1 h_ℓ(x_i, x_j) c̃_e(x_i, x_j) dx_i dx_j = α_ℓ,   ℓ = 1, ..., k.   (3)

Here, the real-valued functions h_1(x_i, x_j), ..., h_k(x_i, x_j) are linearly independent, modulo the constants; {α_1, ..., α_k} are known, and their values can be obtained from data or expert elicitation. Equation (3) says that a qualified copula should satisfy the constraints that the expected value of the random variable h_ℓ(X_i, X_j) equals α_ℓ. For instance, if h_k(x_i, x_j) = 12 x_i x_j − 3, then the (Spearman) rank correlation of a qualified copula should be α_k.
In practice, we know a priori the expected values of the random variables {h 1 (X i , X j ), . . . , h k (X i , X j )}; then, every qualified copula for edge e should satisfy the constraints given in Eq. (3), and we select from these qualified copulas the minimally informative one. Many constraints can be written in the form of expectation constraints. For example, constraints are commonly specified in the form of probabilities. Yet, a probability can be expressed as the expectation of an identity function. Another conventional way to specify constraints is in the form of various kinds of correlations, such as product-moment correlations. Yet, due to the one-to-one correspondence between the set of n-variate copulas and the set of n-variate distributions, any correlation can be expressed as an expectation w.r.t. an appropriate copula. The way of specifying expectation constraints also allows a wider range of constraints if desired.
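As an illustration of how constraint values {α_ℓ} can be evaluated from data, the sketch below (not from the paper; the bivariate normal sample is purely illustrative) estimates α for the basis function h(u, v) = u v as a sample mean, and converts it to Spearman's rank correlation via the copula identity ρ_S = 12 E[UV] − 3:

```python
# Sketch: evaluating a constraint value alpha from data as the sample mean of a
# basis function, here h1(u, v) = u*v. For this choice the constraint fixes
# Spearman's rank correlation, since rho_S = 12*E[UV] - 3 for a copula.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n)
u = (np.argsort(np.argsort(z[:, 0])) + 0.5) / n     # probability integral transform
v = (np.argsort(np.argsort(z[:, 1])) + 0.5) / n     # via normalized ranks

alpha1 = np.mean(u * v)                             # sample estimate of E[UV]
rho_s = 12 * alpha1 - 3                             # implied Spearman's rho
print(alpha1, rho_s)
```

For a Gaussian copula with correlation 0.6, Spearman's rho is about 0.58, which the sample estimate recovers closely.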
Another major advantage of specifying expectation constraints is that the minimally informative bivariate copula for an edge, satisfying all the specified expectation constraints, can be readily determined.
Example 2 (continued) According to Nussbaum (1989) and Borwein et al. (1994) (see Bedford and Wilson (2014) for a summary), there exists a unique minimally informative bivariate copula satisfying all the expectation constraints in (3), with copula density given by

ĉ_e(x_i, x_j) = d_1(x_i) d_2(x_j) exp( Σ_{ℓ=1}^{k} λ_ℓ h_ℓ(x_i, x_j) ).   (4)

The Lagrange multipliers λ_1, ..., λ_k are unknown and depend nonlinearly on α_1, ..., α_k. The functions d_1(·) and d_2(·) are two regularity functions, making ĉ_e(x_i, x_j) a copula density. Let A(x_i, x_j) denote the exponential part:

A(x_i, x_j) = exp( Σ_{ℓ=1}^{k} λ_ℓ h_ℓ(x_i, x_j) ).

Though A(x_i, x_j) has a closed-form expression, the two regularity functions do not; hence ĉ_e(x_i, x_j) needs to be determined numerically. (See Section 1 of the supplementary material for the determination of the two regularity functions and the Lagrange multipliers.) Let C(f) denote the set of all possible bivariate copula densities originating from the multivariate distribution f, and L(f) its logarithmic counterpart. It has been proved by Bedford et al. (2016) that the set C(f) (and therefore L(f)) is relatively compact in the space C([0, 1]^2). Therefore, by selecting sufficiently many basis functions, ĉ_e(x_i, x_j) defined in Eq. (4) can approximate the true copula density c_e(x_i, x_j) arbitrarily well. Here, the metric employed on the space C([0, 1]^2) is the sup norm, hence requiring only the continuity of c_e(x_i, x_j).
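The numerical determination of the regularity functions can be sketched as an alternating rescaling on a grid. The following is a minimal illustration (not the supplementary material's method): it takes an illustrative Lagrange multiplier as given and rescales rows and columns of A until both margins of the discretized density are uniform:

```python
# Sketch: numerically determining the regularity functions d1, d2 of Eq. (4) on
# a grid, by alternately rescaling rows and columns of A(u, v) until both
# margins of the discretized density are uniform. The lambda used below is an
# illustrative input, not calibrated to any constraint values.
import numpy as np

def fit_regularity(A_vals, tol=1e-10, max_iter=500):
    """Return d1, d2 such that d1[i]*A[i,j]*d2[j] is a copula density on the grid."""
    n = A_vals.shape[0]
    d1, d2 = np.ones(n), np.ones(n)
    for _ in range(max_iter):
        d1 = 1.0 / (A_vals @ d2 / n)          # make each row integrate to 1
        d2 = 1.0 / (d1 @ A_vals / n)          # make each column integrate to 1
        c = d1[:, None] * A_vals * d2[None, :]
        if max(np.abs(c.mean(axis=0) - 1).max(),
               np.abs(c.mean(axis=1) - 1).max()) < tol:
            break
    return d1, d2

n = 200
t = (np.arange(n) + 0.5) / n
u, v = np.meshgrid(t, t, indexing="ij")
A = np.exp(1.5 * u * v)                       # one basis function h1(u, v) = u*v
d1, d2 = fit_regularity(A)
c = d1[:, None] * A * d2[None, :]
print(c.mean(axis=0).max(), c.mean(axis=1).max())   # both margins close to 1
```

The alternating rescaling is a fixed-point iteration of the same flavour as iterative proportional fitting; it converges quickly for smooth A.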
For later reference, we call the set of basis functions {h_1(x_i, x_j), ..., h_k(x_i, x_j)} the "information set." We further explain Example 2 from a backward point of view. We know that the set C(f) is relatively compact in the space C([0, 1]^2). Let {h_1(·), ..., h_k(·), ...} be a countable set of basis functions that span the space C([0, 1]^2). Then, for any copula density in C(f) and any required level ε (> 0), we can select from {h_1(·), ..., h_k(·), ...} finitely many appropriate basis functions whose linear combination approximates the copula density to the required level. However, the resulting linear combination is not minimally informative. We therefore turn to the logarithmic counterpart of C(f), i.e., L(f). Due to the one-to-one correspondence, L(f) is also relatively compact in the space C([0, 1]^2). Therefore, for any element in L(f) and any required level δ (> 0), we can select from {h_1(·), ..., h_k(·), ...} finitely many appropriate basis functions whose linear combination approximates that element to the required level δ. By the theoretical work of Nussbaum (1989) and Borwein et al. (1994), we are then able to derive the minimally informative copula density from this linear combination, i.e., Eq. (4). The derived minimally informative copula density well approximates the true underlying copula density.
There are many bases for the space C([0, 1]^2), e.g., {x_i^p x_j^q : p, q ≥ 0}. Note that we are not selecting basis functions from a whole basis, which is impossible and unnecessary; we are instead selecting from a finite set, e.g., {x_i^p x_j^q : 0 ≤ p, q ≤ r} with an appropriate power limit r. Theoretically, by letting k be sufficiently large, we can well approximate any copula density in C(f) from the same finite set. However, more basis functions bring about more λ's to be estimated and, consequently, more computational load. Hence, when approximating an individual copula density, we determine the entry of a basis function into the information set according to its contribution to the approximation. The procedure for selecting basis functions is outlined in Algorithm 1.

Algorithm 1 Information Set Determination
1: Set the information set to be empty.
2: Select from B the basis function that yields the largest value of the log-likelihood (5).
3: while the stopping criterion is not met do
4:   Move the selected basis function from B into the information set.
5:   Select from B the basis function which, together with the basis functions in the information set, yields the largest value of the log-likelihood (5).
6: end while

For ease of exposition, in what follows, we refer to the log-likelihood from fitting a minimally informative copula to a data set as the "estimated log-likelihood" and to the log-likelihood from fitting the true underlying copula to a data set as the "true log-likelihood." We should note the following points.
- The stopping criterion in Algorithm 1 could be a maximal number of basis functions or a minimal improvement in the estimated log-likelihood. Adding a new basis function to the information set always increases the estimated log-likelihood; hence, there is no optimal number of basis functions. The choice of the number of basis functions involves a trade-off between approximation accuracy and computational load.
- Selecting basis functions according to the estimated log-likelihood is consistent with information minimization. It is well known that the values of the Lagrange multipliers λ_ℓ (ℓ = 1, ..., k) satisfying the expectation constraints are also maximum likelihood estimates (see, e.g., Barron and Sheu 1991).
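A schematic implementation of Algorithm 1's greedy selection is sketched below. For brevity, the exponential model is normalized to a plain density on [0, 1]^2 by a single constant rather than by the regularity functions d_1, d_2, so it is the selection logic, not the full copula fit, that is faithful here; the basis functions and data are illustrative:

```python
# Sketch of Algorithm 1: greedy forward selection of basis functions by the
# gain in fitted log-likelihood. The exponential model exp(sum_l lam_l h_l) is
# normalized by one constant (not by d1, d2) to keep the example short.
import numpy as np
from scipy.optimize import minimize

def neg_loglik(lam, H, Hgrid):
    """Negative mean log-likelihood of the density exp(H @ lam)/Z on [0,1]^2."""
    logZ = np.log(np.mean(np.exp(Hgrid @ lam)))    # normalizing constant via grid
    return -(H @ lam - logZ).mean()

def greedy_select(data_uv, basis, grid_uv, max_terms=3, min_gain=1e-6):
    """Grow the information set by largest log-likelihood improvement."""
    chosen, loglik = [], -np.inf
    pool = list(range(len(basis)))
    while pool and len(chosen) < max_terms:
        best = None
        for b in pool:
            idx = chosen + [b]
            H = np.column_stack([basis[i](*data_uv) for i in idx])
            Hg = np.column_stack([basis[i](*grid_uv) for i in idx])
            fit = minimize(neg_loglik, np.zeros(len(idx)), args=(H, Hg))
            if -fit.fun > loglik + min_gain and (best is None or -fit.fun > best[0]):
                best = (-fit.fun, b)
        if best is None:
            break                                   # stopping criterion met
        loglik, b = best
        chosen.append(b)
        pool.remove(b)
    return chosen, loglik

rng = np.random.default_rng(1)
z = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=800)
u = (np.argsort(np.argsort(z[:, 0])) + 0.5) / 800   # ranks -> uniform margins
v = (np.argsort(np.argsort(z[:, 1])) + 0.5) / 800
t = (np.arange(50) + 0.5) / 50
gu, gv = [a.ravel() for a in np.meshgrid(t, t)]
basis = [lambda x, y: x * y,
         lambda x, y: (1 - 2 * x) * (1 - 2 * y),
         lambda x, y: x**2 * y**2]
chosen, ll = greedy_select((u, v), basis, (gu, gv))
print(chosen, ll)
```

The per-observation log-likelihood of the independence copula is 0, so any positive value of `ll` reflects a genuine gain from the selected basis functions.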
A distinguishing feature of our approach is that we have made no assumption on the structure of the underlying multivariate distribution f (x 1 , . . . , x n ), except that all the (conditional) bivariate copulas should be continuous. The expectation constraints are extracted from available data or expert judgement, and will be further studied in the following section.

Conditional copula approximation
For the practical implementation of our approach, one fundamental problem needs to be solved: how to evaluate {α_1, ..., α_k} from the data at hand. In Sect. 3, we took as an example an edge in E_1; there, α_ℓ (ℓ = 1, ..., k) can be readily evaluated as the sample mean of h_ℓ over the observed pairs. It should be noted that for an edge e in tree T_i (i = 2, ..., n−1), the conditioning set D_e is no longer empty. For notational simplicity, we define U_{x_e} = F_{ė_1|D_e}(X_{ė_1} | X_e = x_e) and W_{x_e} = F_{ė_2|D_e}(X_{ė_2} | X_e = x_e). Let u_{x_e} (resp. w_{x_e}) denote the generic value of U_{x_e} (resp. W_{x_e}). The expectation constraints should now take into account the value of X_e:

α_ℓ(x_e) = E[ h_ℓ(U_{x_e}, W_{x_e}) | X_e = x_e ],   ℓ = 1, ..., k.   (6)

Clearly, α_ℓ(·) is now a function of X_e, the conditioning random vector related to the edge e. If, at a point X_e = x_e, we had a sizable sample of (U_{x_e}, W_{x_e}), we could intuitively estimate α_ℓ(x_e) by the sample mean of the random variable h_ℓ(U_{x_e}, W_{x_e}). Furthermore, if there were sufficiently many realizations of the conditioning vector, conditioned on each of which we had a sizable sample of (U_{x_e}, W_{x_e}), we would be able to approximate the functional form of α_ℓ(X_e), which is a classical regression problem. Real-life data, however, will never be of this idealized form. Because {X_1, ..., X_n} are continuous random variables, for any point X_e = x_e there will be only one realization of the random variable h_ℓ(U_{x_e}, W_{x_e}). We certainly cannot replace "estimating α_ℓ(x_e) by the sample mean of h_ℓ(U_{x_e}, W_{x_e})" with "estimating α_ℓ(x_e) by one realization of h_ℓ(U_{x_e}, W_{x_e})". One simple explanation is that we conventionally treat a realization of a random variable as the mode of the distribution of that random variable (e.g., when conducting maximum likelihood estimation).
The mean and the mode of a distribution usually take different values, and the relationship between them changes from distribution to distribution. As stated in the "Appendix," under some mild assumptions, α_ℓ(x_e) (and, therefore, the Lagrange multipliers {λ_1, ..., λ_k}) is a continuous function of x_e. One may suggest approximating this function by fitting a regression model to the data, but we first have to come up with an efficient surrogate for the sample mean of the random variable h_ℓ(U_{x_e}, W_{x_e}). Bedford et al. (2016) approached the above problem by dividing the domain of X_e into equal-volume subregions and assuming that the copula density c_{ė_1ė_2|D_e}(·) does not change when X_e varies within an individual subregion. Evidently, their approach suffers from certain inherent drawbacks. For example, the partition of the domain of X_e is rather subjective. Even if the domain is divided using, say, CART (classification and regression trees), the fitted copula density is still not appealing: it is bumpy. In the following, we propose to relax the conditional expectation (6) and compute an average over a neighborhood of X_e = x_e, which is achieved by invoking the kernel-regression technique; see (Hastie et al. 2009, Chapter 6) for a brief introduction to kernel regression. Compared with parametric methods, kernel smoothing methods have the advantage of making relatively milder structural assumptions. In employing kernel smoothing methods, two approximations happen here:
- the expectation is approximated by averaging over sample data;
- conditioning at a point is relaxed to conditioning on a region encircling that point.
Remark 2 Note that Eq. (6) provides a method to test whether the simplifying assumption can be employed for a particular edge. Specifically, if the simplifying assumption holds, then the conditional copula c_{ė_1ė_2|D_e}(u_{x_e}, w_{x_e} | X_e = x_e) does not depend on the value of X_e, and so α_ℓ(x_e) should be a constant function of x_e. Hence, for each value of X_e = x_e, we calculate α_ℓ(x_e). If α_ℓ(x_e) is constant (or varies within a small range), then we might employ the simplifying assumption. Otherwise, if α_ℓ(x_e) varies within a wide range or demonstrates an obvious relationship with x_e, then we cannot employ the simplifying assumption.

Weighted average and weighted linear regression
For an edge e ∈ E_i (2 ≤ i ≤ n−1), X_{ė_1} and X_{ė_2} are the two conditioned random variables, and X_e is the conditioning random vector (having i−1 elements). Let (x_{ė_1}^{(v)}, x_{ė_2}^{(v)}, x_e^{(v)}) denote the vth realization of (X_{ė_1}, X_{ė_2}, X_e) for v = 1, ..., m. We now approximate the conditional expectation of the random variable h_ℓ(U_{x_e}, W_{x_e}), for 1 ≤ ℓ ≤ k and an arbitrary point X_e = x_e.
Let K_μ(x_e, x) be a kernel function, allocating an appropriate weight to x (∈ [0, 1]^{i−1}) according to its distance from x_e. For example, the radial Epanechnikov kernel is defined by

K_μ(x_e, x) = D( ||x − x_e|| / μ ),

in which ||·|| is the Euclidean norm, and

D(z) = (3/4)(1 − z²) if |z| ≤ 1, and D(z) = 0 otherwise.

Here, the parameter μ (> 0), controlling the range of the local neighborhood, is usually called the bandwidth or window width. (Section 2.2 of the supplementary material discusses the high-dimensional problem for local learning methods.) The normal kernel, with D(z) = φ(z), is another popular kernel, where φ(z) is the standard normal density function. The radial Epanechnikov kernel is optimal in terms of mean squared error (Epanechnikov 1969), while the normal kernel is more mathematically tractable. Intuitively, we can approximate α_ℓ(x_e) by the Nadaraya-Watson kernel-weighted average α̂_ℓ(x_e, μ):

α̂_ℓ(x_e, μ) = Σ_{v=1}^{m} K_μ(x_e, x_e^{(v)}) h_ℓ(u^{(v)}, w^{(v)}) / Σ_{v=1}^{m} K_μ(x_e, x_e^{(v)}),

in which u^{(v)} = F_{ė_1|D_e}(x_{ė_1}^{(v)} | X_e = x_e^{(v)}) and w^{(v)} = F_{ė_2|D_e}(x_{ė_2}^{(v)} | X_e = x_e^{(v)}). The weighted average α̂_ℓ(x_e, μ) puts more weight on the data points that fall within distance μ from x_e.
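The kernel and weighted average above can be sketched as follows; the conditioning variable, the target function α(x) = 0.25 + 0.1 sin(2πx), and the noise level are all hypothetical stand-ins for h_ℓ(U, W) data:

```python
# Sketch: the radial Epanechnikov kernel and the Nadaraya-Watson weighted
# average, applied to synthetic realizations whose conditional mean is the
# hypothetical target alpha(x) = 0.25 + 0.1*sin(2*pi*x).
import numpy as np

def epanechnikov(xe, X, mu):
    """Radial Epanechnikov weights for points X relative to the point xe."""
    z = np.linalg.norm(np.atleast_2d(X) - xe, axis=1) / mu
    return np.where(z <= 1, 0.75 * (1 - z**2), 0.0)

def nw_average(xe, X, y, mu):
    """Nadaraya-Watson kernel-weighted average at the point xe."""
    w = epanechnikov(xe, X, mu)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(2000, 1))                 # conditioning values
alpha = 0.25 + 0.1 * np.sin(2 * np.pi * X[:, 0])      # true alpha(x)
y = alpha + rng.normal(0, 0.05, size=2000)            # noisy h(u, w) values

print(nw_average(np.array([0.5]), X, y, mu=0.1))      # approx alpha(0.5) = 0.25
```

Only observations with conditioning value within distance μ of the evaluation point receive positive weight, exactly as described above.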
One drawback of the locally weighted average is that it can be biased near the boundary of the domain of X_e, because of the asymmetry of the data there. Locally weighted linear regression can reduce this bias dramatically at a modest cost in variance (Fan 1992). It exploits the fact that, over a small enough subset of the domain, any sufficiently nice function can be well approximated by an affine function. (See the "Appendix" for the continuity and differentiability of the function α_ℓ(·).) However, locally weighted linear regression cannot be directly applied here, because it may return an impractical estimate of α_ℓ(x_e). Take the polynomial basis {x_i^p x_j^q : p, q ≥ 0}, for example: for any basis function from this basis, the value of the conditional expectation α_ℓ(x_e) should fall within the interval (0, 1), but locally weighted linear regression cannot guarantee that its estimate does. To avoid impractical estimates, we put an inequality restriction on the regression coefficients. The inequality-constrained weighted linear regression approach to approximating α_ℓ(x_e) proceeds as follows. Let θ = (θ_0, θ_1, ..., θ_{i−1}) be a vector of coefficients. At any point X_e = x_e, solve the inequality-constrained least-squares problem

θ̂(x_e, μ) = arg min_θ Σ_{v=1}^{m} K_μ(x_e, x_e^{(v)}) [ h_ℓ(u^{(v)}, w^{(v)}) − (1, x_e^{(v)}) θ^{tr} ]²,   subject to 0 < (1, x_e) θ^{tr} < 1.

Here, "tr" is the transpose operator. The optimal estimate of α_ℓ(x_e) is then

α̂_ℓ(x_e, μ) = (1, x_e) θ̂(x_e, μ)^{tr}.

Guaranteed by the additional constraint, the estimate α̂_ℓ(x_e, μ) will always be practical. The estimate θ̂(x_e, μ) can be calculated by Dantzig-Cottle algorithms (Cottle and Dantzig 1968). According to Theorem 1 of Liew (1976), it is easy to show that, when m is sufficiently large, the inequality-constrained least-squares problem reduces to an unconstrained least-squares problem.
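The constrained local fit can be sketched with a general-purpose solver (SLSQP) standing in for the Dantzig-Cottle algorithm; the linear target α(x) = 0.2 + 0.5x and the data below are illustrative:

```python
# Sketch: inequality-constrained locally weighted linear regression. The
# constraint keeps the fitted value (1, x_e) theta^tr inside (0, 1). SLSQP is
# used here in place of the Dantzig-Cottle algorithm mentioned in the text.
import numpy as np
from scipy.optimize import minimize

def local_linear_constrained(xe, X, y, mu, eps=1e-6):
    z = np.linalg.norm(X - xe, axis=1) / mu
    w = np.where(z <= 1, 0.75 * (1 - z**2), 0.0)       # Epanechnikov weights
    Z = np.hstack([np.ones((len(X), 1)), X])           # design matrix (1, x)
    z0 = np.concatenate([[1.0], xe])                   # evaluation point (1, xe)
    obj = lambda th: np.sum(w * (y - Z @ th) ** 2)
    cons = [{"type": "ineq", "fun": lambda th: z0 @ th - eps},
            {"type": "ineq", "fun": lambda th: 1 - eps - z0 @ th}]
    th = minimize(obj, np.zeros(Z.shape[1]),
                  constraints=cons, method="SLSQP").x
    return float(z0 @ th)                              # alpha_hat(x_e, mu)

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(1500, 1))
y = 0.2 + 0.5 * X[:, 0] + rng.normal(0, 0.05, 1500)   # true alpha(x) = 0.2 + 0.5x

print(local_linear_constrained(np.array([0.0]), X, y, mu=0.15))  # approx 0.2
```

The evaluation point x_e = 0 sits on the boundary of the domain, where the locally weighted average would be biased upward; the local linear fit recovers the boundary value accurately while the constraint keeps it inside (0, 1).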
Note that, though we fit a linear model to the data within the neighborhood of X_e = x_e, we only use the fitted linear model to evaluate α̂_ℓ(x_e, μ) at the single point X_e = x_e. Apparently, compared with the locally weighted average, inequality-constrained locally weighted linear regression is more computationally demanding. As with any local learning problem, one needs to determine the optimal range of the neighborhood, i.e., the value of μ. A variety of automatic, data-based methods have been developed for optimizing the bandwidth, with the general consensus that the plug-in technique (Sheather and Jones 1991) and the cross-validation technique (Rudemo 1982) are the most powerful; see Köhler et al. (2014) for a recent review. Plug-in methods require subjective estimators of certain unknown functions and perform badly in multivariate regression. Hence, we here illustrate the leave-one-out cross-validation approach to determining the optimal bandwidth μ. For five- or tenfold cross-validation, the appropriate translations are obvious.
The optimal bandwidth for edge e, denoted by μ*_{ė_1ė_2|D_e}, is the value of μ that maximizes the cross-validated estimated log-likelihood. Note that the optimal bandwidths for different edges are different.
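Leave-one-out cross-validation for μ can be sketched as follows; squared prediction error is used here as a simple stand-in for the estimated log-likelihood criterion, and the data are synthetic:

```python
# Sketch: leave-one-out cross-validation over a bandwidth grid, scoring each
# candidate mu by the prediction error of the Nadaraya-Watson average.
import numpy as np

def loo_cv_score(X, y, mu):
    """Mean squared leave-one-out prediction error for bandwidth mu."""
    err = 0.0
    for v in range(len(X)):
        z = np.linalg.norm(X - X[v], axis=1) / mu
        w = np.where(z <= 1, 0.75 * (1 - z**2), 0.0)   # Epanechnikov weights
        w[v] = 0.0                                     # leave observation v out
        if w.sum() == 0:
            return np.inf                              # mu too small: empty window
        err += (y[v] - np.sum(w * y) / w.sum()) ** 2
    return err / len(X)

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(300, 1))
y = 0.25 + 0.1 * np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.05, 300)

grid = [0.02, 0.05, 0.1, 0.2, 0.5]
scores = [loo_cv_score(X, y, mu) for mu in grid]
print(grid[int(np.argmin(scores))])                    # selected bandwidth
```

Too small a bandwidth leaves the estimate noisy (or undefined in empty windows), while too large a bandwidth oversmooths the sinusoidal signal; cross-validation balances the two.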

Basis function selection for conditional copulas
Recall that for an unconditional copula, we select from B the basis function that most improves the log-likelihood (5); if we have m data points, then the log-likelihood is a summation of m elements. However, for each conditional copula, the corresponding log-likelihood contains only one element. For example, if we have data {(x_e^(v), u^(v), w^(v)) : 1 ≤ v ≤ m}, then the datum (u^(v), w^(v)) comes from the conditional copula c_{ė1ė2|De}(u, w | X_e = x_e^(v)), and the estimated log-likelihood for this copula contains only that one element. If we select basis functions for c_{ė1ė2|De}(u, w | X_e = x_e^(v)) according to this estimated log-likelihood, then the selected basis functions are optimal only in terms of the single datum (u^(v), w^(v)). Consequently, the approximating copula ĉ_{ė1ė2|De}(u, w | X_e = x_e^(v); μ) will overfit. Another problem with selecting basis functions for conditional copulas is the huge computational load. If we are dealing with an n-variate vine copula, then we have to approximate m × (n−1)(n−2)/2 conditional bivariate copulas. Even worse, taking account of cross-validation, if we determine the optimal value of μ among, say, a set of 100 candidate values, then the computational load becomes 100 × m × (n−1)(n−2)/2.
To alleviate the two problems mentioned above, we here develop a two-stage procedure for approximating all the conditional copulas related to an edge e ∈ E_i (2 ≤ i ≤ n − 1); see Algorithm 2.
Algorithm 2 Two-Stage Procedure
1: procedure Stage One
2: Divide the domain of X_e into z regions R_1, ..., R_z.
3: for 1 ≤ j ≤ z do
4: Treat the data {(u^(v), w^(v)) : x_e^(v) ∈ R_j, 1 ≤ v ≤ m} as coming from an unconditional copula.
5: Determine the information set, denoted by B_j, for them by using Algorithm 1.
6: end for
7: end procedure
8: procedure Stage Two
9: Let μ denote the value of the bandwidth.
10: for 1 ≤ v ≤ m do
11: If x_e^(v) ∈ R_j, then the information set used for approximating the conditional copula c_{ė1ė2|De}(u, w | X_e = x_e^(v)) is B_j.
12: Use a local learning method to calculate the conditional expectations of the basis functions in B_j at X_e = x_e^(v).
13: Use the method developed in Section 1 of the supplementary material to calculate the Lagrange multipliers {λ_k(μ)} and numerically determine the two regularity functions d_1^(v)(u; μ) and d_2^(v)(w; μ).
14: These yield the approximating conditional copula ĉ_{ė1ė2|De}(u, w | X_e = x_e^(v); μ).
15: end for
16: Calculate the estimated log-likelihood (9).
17: Repeat steps 9-16 for different values of μ and select the optimal one that maximizes the estimated log-likelihood (9).
18: end procedure
We should note the following points.
- The two-stage procedure utilizes the continuity of c_{ė1ė2|De}(u, w | X_e = x_e) in x_e. According to Appendix A, the dependence structure c_{ė1ė2|De}(u, w | X_e = x_e) changes continuously with x_e. Therefore, for a small region, we can assign the same information set to all the conditional copulas whose conditioning variables lie in that region. Though the basis functions are the same, the Lagrange multipliers will be different for different conditional copulas.
- For a conditional copula in tree T_2, there is only one conditioning variable. Since every individual variable is uniformly distributed, we can divide its domain (i.e., the [0, 1] interval) into a few equal-length subintervals. For a conditional copula in tree T_i (i ≥ 3), the elements of the conditioning random vector are mutually dependent. Instead of subjectively dividing the domain of X_e into z regions, we can employ k-means clustering to partition the observations {x_e^(v) : 1 ≤ v ≤ m} into z clusters, such that the within-cluster distance is minimized and the between-cluster distance is maximized; see, e.g., Hartigan and Wong (1979). The "kmeans" function in the R software serves this purpose.
- If parallel computing is available, Stage Two can be run in parallel on each tree level.
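The clustering step above can be sketched as follows. The paper uses the R "kmeans" function; here, for illustration only, a plain NumPy implementation of Lloyd's algorithm is assumed.

```python
import numpy as np

def kmeans(points, z, iters=100, seed=0):
    """Plain Lloyd's algorithm: partition `points` (m, d) into z clusters."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), z, replace=False)]
    for _ in range(iters):
        # assign each observation to its nearest center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center as the mean of its assigned observations
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j] for j in range(z)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The resulting labels define the regions R_1, ..., R_z used in Stage One of Algorithm 2.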

Numerical study
In this section, we examine the performance of the two-stage procedure via simulated data. Due to lack of space, a real data set is analyzed in the supplementary material.
For illustrative purposes, we focus on the D-vine structure of 6 random variables, with the nodes in tree T_1 labeled, from left to right, X_1, X_2, X_3, X_4, X_5, X_6. All the bivariate copulas derived from f(x_1, ..., x_6) are from the same family. For example, all the bivariate copulas derived from a 6-variate t-copula are t-copulas and have the same degrees of freedom.
Three types of bivariate copulas are examined: the Gaussian copula, the t-copula and the Gumbel copula. The Gaussian copula and the t-copula are able to model moderate and/or heavy tails; the Gumbel copula is able to capture asymmetric dependence. In terms of tail dependence, the Gaussian copula is neither lower nor upper tail dependent; the t-copula is both lower and upper tail dependent; the Gumbel copula is upper tail dependent. Due to lack of space, we here only present simulation results for the t-copula. Simulation results for the other copulas are given in the supplementary material. The bivariate t-copula with υ (> 2) degrees of freedom is given by C_{ρ,υ}(u, w) = t_{ρ,υ}(t_υ^{−1}(u), t_υ^{−1}(w)), where t_υ(·) is the cdf of the one-dimensional t-distribution with υ degrees of freedom, and t_{ρ,υ}(·) is the cdf of the bivariate t-distribution with υ degrees of freedom and correlation ρ ∈ (−1, 1). The superscript "−1" denotes the inverse of a function.
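The construction of the t-copula suggests a direct sampling scheme: draw from the bivariate t-distribution and push each margin through the univariate t cdf. The sketch below assumes SciPy ≥ 1.6 (for `scipy.stats.multivariate_t`); the function name is an illustration, not part of the paper.

```python
import numpy as np
from scipy import stats

def sample_t_copula(rho, nu, n, seed=0):
    """Draw n pairs (u, w) from the bivariate t-copula C_{rho,nu}:
    sample (x, y) ~ bivariate t, then set u = t_nu(x), w = t_nu(y)."""
    shape = np.array([[1.0, rho], [rho, 1.0]])
    xy = stats.multivariate_t(loc=[0, 0], shape=shape, df=nu).rvs(
        size=n, random_state=seed)
    return stats.t.cdf(xy[:, 0], df=nu), stats.t.cdf(xy[:, 1], df=nu)
```

By construction both margins are uniform on [0, 1], and the dependence between them is exactly that of C_{ρ,υ}.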
The parameter setting for simulating data is as follows. For the five bivariate t-copulas in tree T_1, the correlation parameter ρ takes in turn (from left to right) the five values {0.3, 0.4, 0.5, 0.6, 0.7}, so the dependence structure evolves from weak to strong dependence. All the bivariate t-copulas have 3 degrees of freedom. For a conditional bivariate t-copula related to an edge e, the correlation parameter depends on the conditioning variables through x̄_e, the average of the values of the conditioning variables, via a smooth relation; consequently, α(x_e) is differentiable. Under the above parameter setting, we randomly simulate a sample of 1000 data points from the 6-variate t-copula. The scatter plots for the copulas c_12(u, w) and c_56(u, w) are given in Fig. 3, in which scatter plots for two Gaussian copulas and two Gumbel copulas are also included; τ represents the Kendall rank correlation coefficient. Figure 3 shows that, when ρ = 0.3 or τ = 0.3, the simulated data do not have an evident pattern; when ρ = 0.7 or τ = 0.7, the relation between the two random variables involved is noticeable from the data. Two different bases are used to construct two different families of minimally informative copulas: the Bernstein basis functions {C(6,p) u^p (1 − u)^{6−p} C(6,q) w^q (1 − w)^{6−q} : 0 ≤ p, q ≤ 6}, and the polynomial basis functions {u^p w^q : 0 ≤ p, q ≤ 6}. Here the polynomial degree is 6. (Via an intensive simulation study, we found that increasing the degree beyond 6 improves the approximation only a little, but imposes a lot of additional computational load.) Although we can approximate a bivariate copula to any required degree, given only a finite set of candidate basis functions, different bases will have different efficiency. Hence, we want to compare the Bernstein basis with the polynomial basis.
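The tensor-product Bernstein basis functions above are straightforward to evaluate; a small sketch (function names are illustrative) is:

```python
from math import comb

def bernstein(p, n, u):
    """Bernstein basis polynomial b_{p,n}(u) = C(n,p) u^p (1-u)^(n-p)."""
    return comb(n, p) * u ** p * (1 - u) ** (n - p)

def bernstein2d(p, q, u, w, n=6):
    """Tensor-product basis function b_{p,6}(u) * b_{q,6}(w)."""
    return bernstein(p, n, u) * bernstein(q, n, w)
```

A useful sanity check is the partition-of-unity property: the 49 functions {b_{p,6}(u) b_{q,6}(w) : 0 ≤ p, q ≤ 6} sum to 1 at every point of the unit square, which also explains why they remain bounded and well conditioned, unlike raw monomials u^p w^q near the corners.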
We now approximate the five bivariate copulas in tree T_1 by minimally informative copulas. The estimated and true log-likelihoods are summarized in Table 1.
In Table 1, rows 2-6 correspond to approximating the five copulas by Bernstein basis functions, and rows 7-11 correspond to approximating the same five copulas by polynomial basis functions. For example, when approximating c_12(u, w), the estimated log-likelihood after selecting the first optimal Bernstein basis function (resp. polynomial basis function) is 46.7498 (resp. 49.1366). After adding the second optimal Bernstein basis function (resp. polynomial basis function), the estimated log-likelihood increases to 72.9410 (resp. 53.8103). After selecting five Bernstein basis functions (resp. polynomial basis functions), the final estimated log-likelihood is 81.9875 (resp. 72.5107), while the true log-likelihood is 95.7630. It is observed from Table 1 that, for the Bernstein basis, adding the fifth basis function does not contribute much to the estimated log-likelihood. For each edge, the estimated log-likelihood with five Bernstein basis functions is larger than that with five polynomial basis functions. Indeed, for each edge, the estimated log-likelihood with three Bernstein basis functions is already larger than that with four polynomial basis functions, showing the competence of the Bernstein basis.
Surface plots and contour plots for copulas c 12 (u, w) and c 56 (u, w) are drawn in Figs. 4 and 5.
In Fig. 4 (resp. Fig. 5), the left two panels are the surface plot and contour plot of the true bivariate t-copula c_12(u, w) (resp. c_56(u, w)); the middle two panels are those of the minimally informative copula having five Bernstein basis functions; the right two panels are those of the minimally informative copula having five polynomial basis functions. Figures 4 and 5 show that, compared with the minimally informative copula having polynomial basis functions, the minimally informative copula having Bernstein basis functions bears a stronger resemblance to the true copula. Simulation results in the supplementary material also verify that the Bernstein basis is more efficient than the polynomial basis. Moreover, it is found that a combination of four Bernstein basis functions is capable of producing a good approximation. Hence, in the following, only the Bernstein basis is used, and the cardinality of the information set for a conditional copula is set to four.
We simulate another sample of 1000 data points from the 6-variate D-vine t-copula and approximate the five bivariate t-copulas in tree T_1 by minimally informative copulas having five Bernstein basis functions selected from {C(6,p) u^p (1 − u)^{6−p} C(6,q) w^q (1 − w)^{6−q} : 0 ≤ p, q ≤ 6}. For each t-copula in tree T_1, we calculate the correlation coefficient from the 1000 data points, which is a sample value of ρ for that copula; we denote this value by ρ̂. The correlation between the two random variables of a minimally informative copula can be evaluated via numerical integration; we denote this value by ρ̃. Apparently, if the minimally informative copula approximates the true copula well, the two values ρ̂ and ρ̃ should be close. We repeat the above procedure 100 times and obtain, for each t-copula in tree T_1, two sequences: {ρ̂_i : i = 1, ..., 100} and {ρ̃_i : i = 1, ..., 100}. We plot them in Fig. 6, in which the five colors {"black," "red," "blue," "green," "purple"}, respectively, correspond to the five t-copulas having correlation coefficients {0.3, 0.4, 0.5, 0.6, 0.7}. Solid lines represent the sequences {ρ̂_i : i = 1, ..., 100}, and dotted lines represent the sequences {ρ̃_i : i = 1, ..., 100}. It is observed from Fig. 6 that, for each t-copula, the two values ρ̂_i and ρ̃_i are close to each other, showing the competence of the minimally informative copula.
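The comparison of ρ̂ and ρ̃ above can be illustrated on a case with a known answer. For a Gaussian copula with parameter ρ, the Pearson correlation of the uniform-scale pair (U, W) has the closed form (6/π) arcsin(ρ/2), so a Monte Carlo estimate of it should recover that value. This sketch uses a Gaussian copula purely because its answer is known; the paper instead integrates the fitted minimally informative copula numerically.

```python
import numpy as np
from scipy import stats

def copula_scale_corr(rho, n=200_000, seed=0):
    """Pearson correlation of (U, W) = (Phi(X), Phi(Y)) under a Gaussian
    copula with parameter rho, estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    xy = rng.multivariate_normal([0, 0], cov, size=n)
    u, w = stats.norm.cdf(xy[:, 0]), stats.norm.cdf(xy[:, 1])
    return np.corrcoef(u, w)[0, 1]
```

For ρ = 0.7 the closed form gives (6/π) arcsin(0.35) ≈ 0.683, and the Monte Carlo estimate agrees to two decimal places.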
Figure 7 plots, for each edge in tree T_2, the evolution of the deviation between the true log-likelihood and the estimated log-likelihood as the conditioning variable varies. For example, the top-left panel shows the evolution of the log-likelihood deviation for edge {1, 3|2} when the value of X_2 increases from 0 to 1. Figure 7 shows that the true log-likelihood is generally larger than the estimated log-likelihood, and the log-likelihood deviation fluctuates within a small range around zero.
From Remark 2, we know that Eq. (6) can be used to test the simplifying assumption. As here we know the underlying true law, to examine the performance of the two-stage procedure, we can also compare the true expected value α(x_e) with its estimate α̂(x_e, μ*):
α̂(x_e, μ*) = ∫_0^1 ∫_0^1 ĉ_{ė1ė2|De}(u, w | X_e = x_e; μ*) h(u, w) du dw,
where ĉ_{ė1ė2|De}(·) is the approximating minimally informative copula. Note that, for testing purposes, h(u, w) need not belong to the information set of ĉ_{ė1ė2|De}(·); here we might set h(u, w) to be one of the Bernstein basis functions. In each panel of Fig. 8, the red smooth curve is the evolving path of the true expected values {α(x_e)}. Each of the top two panels has two jumps; the bottom left panel has one jump, and the bottom right panel has two small jumps. The jumps in Fig. 8 are due to the fact that we assign different information sets to different subintervals. As we divide the [0, 1] interval into four subintervals, there are at most three jumps. Yet, the four panels in Fig. 8 all have fewer than three jumps, implying that the minimally informative copula evolves slowly even when the value of the conditioning variable crosses a splitting point. Figure 8 shows that the estimated expected values are close to the true expected values; the relative errors are all within the interval (−0.02, 0.08). Figures 7 and 8 (and figures in the supplementary material) show that the two-stage procedure is capable of approximating conditional copulas.
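Integrals of the form ∫∫ h(u, w) c(u, w) du dw over the unit square, as in the display above, can be evaluated by a tensor-product Gauss-Legendre rule. This is a generic numerical-integration sketch (the quadrature order and function names are assumptions, not the paper's implementation):

```python
import numpy as np

def expect_h(copula_density, h, n=64):
    """Approximate the integral over [0,1]^2 of h(u,w) c(u,w) du dw
    with an n-by-n tensor-product Gauss-Legendre rule."""
    x, w = np.polynomial.legendre.leggauss(n)  # nodes/weights on [-1, 1]
    u = 0.5 * (x + 1.0)                        # map nodes to [0, 1]
    wq = 0.5 * w                               # rescale weights
    U, W = np.meshgrid(u, u)
    WU, WW = np.meshgrid(wq, wq)
    return np.sum(WU * WW * h(U, W) * copula_density(U, W))
```

As a sanity check, under the independence copula (c ≡ 1) with h(u, w) = uw, the integral is exactly 1/4.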
We reuse the previously simulated 100 sets of data points (each of size 1000). For each data set, we have approximated the five bivariate copulas in tree T_1; we now approximate the conditional copulas in tree T_2. Note that, for each edge in tree T_2, the conditioning variable is always uniformly distributed on the [0, 1] interval. Hence, it suffices to study only one edge, say edge {1, 3|2}. For each value of X_2, we first calculate the correlation coefficient of the true conditional copula and then numerically calculate the correlation coefficient of the approximating minimally informative copula obtained from one data set. Since we have 100 data sets, we will have 100 such minimally informative copulas. In other words, for each value of X_2, we have one true correlation coefficient and 100 estimated correlation coefficients. We draw the 90% point-wise confidence intervals at 41 equally spaced grid points from 0 to 1; see Fig. 9. (Fig. 9 shows the true and estimated correlation coefficients varying with the value of the conditioning variable X_2: the true correlation coefficient as a dashed curve, the average of the estimates over 100 Monte Carlo samples as a solid curve, and the 90% Monte Carlo confidence intervals as dotted curves.)
In Fig. 9, it is observed that when X_2 varies in, say, the interval (0.2, 0.8), the estimated correlation coefficient is close to the true correlation coefficient. However, when X_2 is too large or too small, the estimated correlation coefficient is biased. This is because the locally weighted average becomes biased approaching the boundary of the domain of X_2, due to the asymmetry of the data near the boundary. One solution is to use locally weighted linear regression. Figures 6 and 9 suggest that minimally informative copulas are competent and the two-stage procedure is robust.

We now approximate the conditional copulas in trees T_3, T_4 and T_5, in the same manner as the conditional copulas in tree T_2, except that we do not divide the domain of the conditioning random vector into equal-volume subregions. We employ k-means clustering to partition the 1000 observations of the conditioning random vector into 4 clusters. For T_3, the longest distance between two points in the domain of X_e is √2; hence, the candidate values of μ are taken from a grid on (0, √2]. The two optimal bandwidths μ*_{15|234} and μ*_{26|345} are 0.8 and 0.7, respectively. The total true log-likelihood for edge {1, 6|2, 3, 4, 5} is 62.5579; the corresponding total estimated log-likelihood is 21.6081. The optimal bandwidth μ*_{16|2345} is 0.9.
The 1000 log-likelihood deviations for each edge in trees T_3, T_4 and T_5 are plotted in Fig. 10.
The upper three panels correspond to tree T_3, the bottom left two panels to tree T_4, and the bottom rightmost panel to tree T_5. Figures 7 and 10 show that the two-stage procedure performs well; the estimated log-likelihoods are close to the true log-likelihoods. Compared with Fig. 7, the log-likelihood deviation in Fig. 10 fluctuates more violently, implying that the performance of the two-stage procedure deteriorates as the tree level increases. This is because the estimation error accumulates over tree levels.
To show that our final 6-variate minimally informative vine copula fits the given data well, we randomly simulate 1000 data points from it. We first calculate the upper tail-dependence coefficient (Frahm et al. 2005). For a bivariate copula in T_1, we let Ĥ_0(u, w) denote the bivariate empirical copula distribution function (BECDF) obtained from the original data and Ĥ_1(u, w) denote the BECDF obtained from the simulated data. To draw the Q-Q plot, we calculate two sets of quantiles: one set using Ĥ_0(u, w) and the other using Ĥ_1(u, w). The Q-Q plots for the five bivariate copulas in T_1 are given in Fig. 11.
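The BECDF used above can be computed directly from ranks. A minimal sketch (the function name and the rank-based pseudo-observations are standard practice, but the exact variant used in the paper is not specified):

```python
import numpy as np
from scipy.stats import rankdata

def becdf(data, u, w):
    """Bivariate empirical copula cdf:
    H(u, w) = (1/m) * #{i : U_i <= u, W_i <= w},
    where (U_i, W_i) = (rank_i / m, rank_i / m) are pseudo-observations."""
    m = len(data)
    pu = rankdata(data[:, 0]) / m
    pw = rankdata(data[:, 1]) / m
    return np.mean((pu <= u) & (pw <= w))
```

Evaluating Ĥ_0 and Ĥ_1 on a common grid then yields the two sets of quantiles needed for the Q-Q plots.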
It is clear from Fig. 11 that, for each bivariate copula, the two BECDFs are very close to each other. Consequently, we can conclude that our final 6-variate minimally informative vine copula fits the given data well.
Conclusions

In this work, we addressed the problem of approximating a conditional copula that changes with its conditioning variables. To avoid overfitting and to reduce the computational load, we developed a two-stage procedure. The numerical study showed that the two-stage procedure is both feasible and competent. We use (and reuse) all the data points that are local; hence, the two-stage procedure can be applied to higher-dimensional vine copulas than the method developed by Bedford et al. (2016). From the illustrative examples, it is clear that modelling data by a vine hierarchy of minimally informative copulas demands a lot of computing effort. Unfortunately, increased computational load is the price we must pay if we stand by the viewpoint that a conditional copula should change with its conditioning variables: we have to approximate the conditional copula for every configuration of its conditioning variables.
In our approach, there are a number of parameters whose values need to be subjectively determined. After intensive numerical study, we list below some rules of thumb for reference.
- For a conditional copula, three to four basis functions from {C(6,p) u^p (1−u)^{6−p} C(6,q) w^q (1−w)^{6−q} : 0 ≤ p, q ≤ 6} yield an acceptable compromise between approximation accuracy and computational load.
- In Stage One of the two-stage procedure, the number of regions {R_1, ..., R_z} is usually determined with the purpose of maintaining a sizable sample in each region.
Note that the optimal number of basis functions for a particular bivariate data set may depend on multiple factors: the data size, the correlation coefficient, the tail-dependence coefficient, etc. For example, if the correlation coefficient is small and we use five basis functions, we may have an overfitting problem. On the other hand, if the tail-dependence coefficient is large and we use three basis functions, the approximation may be of low accuracy. One approach is to split the data into a training set and a testing set; for this issue, more intensive numerical study is needed. To select the optimal (number of) basis functions, an alternative approach is regularization: imposing an appropriate penalty on the log-likelihood [see, e.g., Kauermann and Schellhase (2014)]. However, in terms of the set {C(6,p) u^p (1 − u)^{6−p} C(6,q) w^q (1 − w)^{6−q} : 0 ≤ p, q ≤ 6}, the penalized log-likelihood function will have no fewer than 49 unknown parameters. Consequently, one has to develop a robust optimization method. Moreover, it is likely that the penalty parameter has to be determined numerically by cross-validation.

This work can be enriched in several ways. One promising direction is to study when we can employ the simplifying assumption. As the number of conditioning variables increases, their effects become complicated and may cancel out. Hence, in high-dimensional problems, it may be better to employ the simplifying assumption on copulas in the deeper hierarchy of a vine.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Appendix

In Sect. 4, we approximated α(x_e) by utilizing two local learning methods, under the assumption that α(x_e) is a nice function of x_e. Hence, herein, we study some properties of α(x_e).
The dependence of a conditional copula on its conditioning variables is commonly expressed through the parameters of the conditional copula. Specifically, if θ is the parameter of c_{ė1ė2|De}(·), then c_{ė1ė2|De}(u, w | X_e = x_e) is usually formulated as c_{ė1ė2|De}(u, w; θ(x_e)), indicating that the value of the parameter depends on the conditioning variables. Throughout, we assume that θ(x_e) is differentiable w.r.t. x_e.
To examine the differentiability of α(x_e), we take the partial derivative of α(x_e) = ∫_0^1 ∫_0^1 h(u, w) c_{ė1ė2|De}(u, w; θ(x_e)) du dw with respect to x_e; by the chain rule,
∂α(x_e)/∂x_e = ∫_0^1 ∫_0^1 h(u, w) [∂c_{ė1ė2|De}(u, w; θ(x_e))/∂θ] θ′(x_e) du dw.
Hence, a sufficient condition for α(x_e) to be differentiable is that c_{ė1ė2|De}(u, w; θ(x_e)) is differentiable w.r.t. θ(x_e). Commonly used copulas, e.g., the Gaussian copula, are continuous and differentiable w.r.t. the involved parameters.
Remark 3 With x_e varying within a small region, we can closely approximate c_{ė1ė2|De}(u, w; θ(x_e)) using a single information set (provided the information set contains sufficient basis functions). Since c_{ė1ė2|De}(u, w; θ(x_e)) is a continuous function of x_e, it is easy to prove that each of the Lagrange multipliers in the minimally informative copula is a continuous function of x_e.