Abstract
We explore the use of penalized complexity (PC) priors for assessing the dependence structure in a multivariate distribution F, with a particular emphasis on the bivariate case. We use the copula representation of F and derive the PC prior for the parameter governing the copula. We show that any \(\alpha \)-divergence between a multivariate distribution and its counterpart with independent components does not depend on the marginal distribution of the components. This implies that the PC prior for the parameters of the copula can be elicited independently of the specific form of the marginal distributions. This represents a useful simplification in the model building step and may offer a new perspective in the field of objective Bayesian methodology. We also consider strategies for minimizing the role of subjective inputs in the prior elicitation step. Finally, we explore the use of PC priors in Bayesian hypothesis testing. Our prior is compared with competing default priors both for estimation purposes and testing.
1 Introduction
In statistical modelling, it is typical to face the problem of calibrating a model which is complex enough to capture at least the main features of the phenomenon of interest and, at the same time, simple enough to maintain interpretability. Starting from a model with a given level of complexity, called the base model, say \(M_0\), a feasible way to obtain a richer and more flexible structure is to include an extra component, so that the base model is nested in the more complex one, say \(M_1\). For example, in many scenarios, it is important to verify whether some components of the model might be considered mutually independent or not. This amounts to comparing two nested models and, in a Bayesian framework, it is of paramount importance to adequately specify a prior distribution on the parameters governing the extra complexity of the larger model.
This issue is at the core of the idea of penalized complexity (PC) priors, proposed in Simpson et al. (2017), where a principled way to construct a prior distribution for the parameters, say \(\xi \), of the extra-component of the model is described. The basic idea is to consider a reparameterization of \(\xi \) in terms of the Kullback-Leibler divergence between \(M_0\) and \(M_1\) and then assign an exponential prior over a specific function of the new parameter. The idea of using Kullback-Leibler divergence measures to construct priors, and in particular testing priors, was originally proposed in Jeffreys (1961), while Bayarri and Garcia-Donato (2008) and Bayarri and Garcia-Donato (2007) represent central contributions.
In this paper, we consider the case where the base model assumes independence among the marginal distributions of the components, while the more complex one includes a certain degree of dependence among them. The most natural way of highlighting the differences between the two above models is certainly through their copula representation (Sklar 1959). In fact, a copula representation is very popular in many applied settings as it allows one to separate the estimation of the marginal distributions from that of the dependence structure. A copula is a multivariate distribution defined on the unit hyper-cube \([0,1]^k\), where k is the dimension of a multivariate random variable X. More formally,
$$\begin{aligned} C(u_1, \ldots , u_k) = \Pr (U_1 \le u_1, \ldots , U_k \le u_k), \end{aligned}$$
where \(U_i \sim \text {U}(0,1)\), for \(i=1,2, \ldots , k\). If \(X = (X_1, \ldots , X_k) \in {\mathbb {R}}^k\) has a continuous multivariate cumulative distribution function \(F(x_1, \ldots , x_k) = \Pr (X_1 \le x_1, \ldots , X_k \le x_k)\), then F can be uniquely represented by a copula function \(C: [0,1]^k \rightarrow [0,1]\) so that
$$\begin{aligned} F(x_1, \ldots , x_k) = C(F_1(x_1), \ldots , F_k(x_k)), \end{aligned}$$
where \(F_i(x_i)\) is the marginal cumulative distribution function of \(X_i\), for \(i=1,2, \ldots , k\) (Sklar 1959). In addition, in the absolutely continuous case, the density function of X can be written as
$$\begin{aligned} f(x_1, \ldots , x_k) = c(F_1(x_1), \ldots , F_k(x_k)) \prod _{j=1}^{k} f_j(x_j), \end{aligned}$$
where \(f_j(\cdot )\) is the marginal density function of the j-th component of X and \(c(\cdot )\) is the copula density, that is the k-th order mixed partial derivative of C.
The above representation allows one to model the marginals and the dependence structure of a multivariate distribution separately. Thus, it makes sense, in a parametric set-up, to assume specific parametric forms for the marginal distributions of the components and for the copula function in a separate fashion. In addition, from a Bayesian perspective, prior elicitation can be done independently. In this sense, the penalized complexity approach is particularly well suited to this kind of situation when we interpret the extra parameter as the one governing the dependence structure. This intuition is reinforced by a general result, which we present in Sect. 2.2 and which states that the \(\alpha \)-divergence between a multivariate distribution and its counterpart with independent components does not depend on the marginals: the elicitation of the “copula” parameters, following the penalized complexity approach, can thus be carried out without considering the specific nature of the marginals.
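The separation embodied in Sklar’s theorem can be illustrated numerically. The sketch below (using numpy and scipy; the correlation value and exponential marginals are illustrative choices, not taken from the paper) draws from a Gaussian copula and attaches arbitrary marginals, showing that the marginals are unaffected by the dependence parameter.

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)
rho, n = 0.7, 50_000

# 1) draw from the Gaussian copula: correlated normals pushed through Phi
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
u, v = norm.cdf(z[:, 0]), norm.cdf(z[:, 1])      # (U, V) ~ C(., . | rho)

# 2) attach arbitrary marginals through their quantile functions
x1 = expon.ppf(u, scale=2.0)                     # X1 ~ Exponential, mean 2
x2 = expon.ppf(v, scale=0.5)                     # X2 ~ Exponential, mean 0.5

# the marginals are untouched by the dependence parameter ...
print(np.mean(x1), np.mean(x2))                  # close to 2.0 and 0.5
# ... while the dependence lives entirely on the copula scale
print(np.corrcoef(u, v)[0, 1])                   # positive, induced by rho
```

Changing `rho` alters only the joint behaviour of `(x1, x2)`, never their marginal laws, which is exactly the modularity the PC prior construction exploits.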
There is no consolidated theory of non-informative priors in a copula framework, Guillotte and Perron (2012) being a noticeable exception in a nonparametric framework. A typical approach for defining a prior over the parameters of a parametric copula consists either in using standard tractable priors or in defining a uniform prior for summaries of the dependence defined on a compact space, like for example the Spearman’s \(\rho \) or Kendall’s \(\tau \), or the correlation parameters in the case of elliptical copulas. The gist of this paper is to adopt a PC perspective in order to propose a general technique for deriving weakly informative priors for the copula parameter. In particular, we illustrate the situation where the copula function is assumed to be Gaussian and discuss several choices of the prior for the correlation parameter \(\rho \). The prior distribution we propose can also be used within a hierarchical modelling approach, where a Gaussian copula is instrumental to flexibly build more complex models. For example, Elfadaly and Garthwaite (2017) proposed a prior distribution for the parameters of a multinomial model based on a Gaussian copula representation with beta marginal distributions. See also Wilson (2018) for a vine-copula specification. Similarly, Sharma and Das (2017) proposed Gaussian and t copula priors representing generalizations of lasso, elastic net and g-priors. The methodology presented in this work can be easily adapted—in a hierarchical way—to the above context.
This work is focussed on bivariate copula models, and the goal is to define a prior distribution for a scalar parameter; while extensions to multivariate parameters are conceptually possible, the theory of penalized complexity priors for the multivariate case is not yet fully developed (Simpson et al. 2017). We leave this case to future research.
The rest of the paper is organized as follows. In Sect. 2, we briefly describe the PC prior method and then use it in a copula framework. In Sect. 3, we derive and discuss the PC prior and compare it with some alternative competitors often used in the literature. Section 4 is devoted to numerical comparison: Section 4.1 discusses some issues related to the potential use of PC priors as formal objective priors and their use in testing. Section 5 concludes the paper.
2 PC priors and copula modelling
2.1 Background about PC priors
In this section, we briefly sketch the idea of PC priors, introduced by Simpson et al. (2017). The construction of a PC prior is principled, with the goal of producing a weakly informative prior where researchers are able to quantify their belief in the simple model through the calibration of a hyper-parameter.
Consider a generic statistical model \({\mathcal {G}} = \{ g(\cdot \,\vert \,{\varvec{\lambda }}), {\varvec{\lambda }} \in \Lambda \}\), with \(\Lambda \subset {\mathbb {R}}^q\), and a potential generalization of \({\mathcal {G}}\), that is a model \({\mathcal {F}} = \{f(\cdot \,\vert \,{\varvec{\lambda }}, \xi ), {\varvec{\lambda }} \in \Lambda , \xi \in \Xi \}\), where \(\Xi \subset {\mathbb {R}}^p\). In practice, we assume there is a value of \(\xi \), say \(\xi _0\), such that
$$\begin{aligned} f(\cdot \,\vert \,{\varvec{\lambda }}, \xi = \xi _0) = g(\cdot \,\vert \,{\varvec{\lambda }}). \end{aligned}$$
Simpson et al. (2017) defined four basic principles behind the construction of a PC prior for the generic parameter \(\xi \).
-
Occam’s Razor. A simpler model formulation should be preferred until there is enough evidence for a more sophisticated model. In this setting, the simpler model is the base model where \(\xi =\xi _0\) and the prior will penalize deviations from it.
-
Complexity measure. The prior distribution is defined as a reparameterization of a measure of the increment in complexity, moving from \(M_0\) to \(M_1\). The increased complexity is evaluated in terms of Kullback-Leibler divergence (Kullback and Leibler 1951)
$$\begin{aligned} KL (f\Vert g)=\int _{\mathcal {X}} f(x\,\vert \,{\varvec{\lambda }}, \xi )\log \biggl (\frac{f(x\,\vert \,{\varvec{\lambda }}, \xi )}{g(x\,\vert \,{\varvec{\lambda }})}\biggr )dx, \end{aligned}$$(1)where \(g(x\,\vert \,{\varvec{\lambda }})=f(x\,\vert \,{\varvec{\lambda }}, \xi =\xi _0)\). Given that the prior distribution depends on the Kullback-Leibler divergence between \(M_0\) and \(M_1\), it will be concentrated in areas where the loss of information one would incur when using the base model instead of the complex one is small. The Kullback-Leibler divergence is then transformed into the so-called distance \(s(f\Vert g)=\sqrt{2KL (f\Vert g)}\).
-
Constant Rate Penalization. A constant decay-rate r on the distance
$$\begin{aligned} \frac{\pi _s(s+\nu )}{\pi _s(s)}=r^\nu , \qquad s,\nu \ge 0, \end{aligned}$$(2)with \(0< r <1\), and where \(\pi _s\) is a prior on the distance, ensures that the change in the prior when increasing the distance by an amount \(\nu \) does not depend on s. This condition implies an exponential prior on s, say \(\pi _s(s)=\theta \exp (-\theta s)\). As a consequence, \(r=\exp (-\theta )\). Notice that \(\pi _s(s)\) has a mode at \(s=0\), which prevents overfitting. The constant decay-rate implies that one is willing to equally penalize each additional portion of distance in the parameter space, independently of the value of \(s=s(\xi )\). This is reasonable when there is no clear idea about the distance scale. Finally, one defines the PC prior through a simple change of variable
$$\begin{aligned} \pi (\xi )=\pi _s(s(\xi ))\Bigg \vert \frac{\partial s(\xi )}{\partial \xi }\Bigg \vert . \end{aligned}$$(3)The resulting prior, being constructed in terms of s, does not depend on the specific adopted parameterization.
-
User-defined scaling. The hyper-parameter \(\theta \) can be chosen in terms of a subjective assumption on a tail event. One elicits a value W such that, for a specific probability level \(\alpha \),
$$\begin{aligned} \Pr (Q(\xi ) > W )= \alpha , \end{aligned}$$(4)where \(Q(\xi )\) is a specific transformation of \(\xi \). This is a crucial point: a non-robust choice of the rate parameter may significantly affect inference.
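The user-defined scaling step can be solved in closed form under the Gaussian-copula distance derived later in Sect. 2.3. The sketch below (plain Python) assumes \(Q(\xi ) = \vert \rho \vert \) and illustrative values of W and \(\alpha \): since \(s\,\vert \,\theta \sim \text {Exp}(\theta )\) and \(s(\rho )=\sqrt{-\log (1-\rho ^2)}\) is increasing in \(\vert \rho \vert \), the tail condition \(\Pr (\vert \rho \vert> W) = \alpha \) inverts to \(\theta = -\log (\alpha )/s(W)\).

```python
import math

# User-defined scaling sketch: choose theta so that Pr(|rho| > W) = alpha.
# Pr(|rho| > W) = Pr(s > s(W)) = exp(-theta * s(W)),
# so the tail condition inverts to theta = -log(alpha) / s(W).
def theta_from_tail(W: float, alpha: float) -> float:
    s_W = math.sqrt(-math.log(1.0 - W ** 2))   # distance at rho = W
    return -math.log(alpha) / s_W

# e.g. "the prior probability that |rho| exceeds 0.5 is 0.1"
theta = theta_from_tail(W=0.5, alpha=0.1)
print(theta)   # approximately 4.29
```

Larger W or larger \(\alpha \) both yield a smaller \(\theta \), i.e. a prior that shrinks less aggressively towards the base model.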
In the next sections, we describe a hierarchical approach in order to mitigate the impact of the hyper-parameter selection of the PC prior in a generic copula context, with a particular focus on the analysis of the correlation coefficient in a Gaussian copula.
2.2 PC prior for a copula parameter
We now exploit the idea of the PC prior construction to derive a prior on the parameter which regulates the dependence among random variables in a copula model. Consider a random vector \(X = (X_1, \ldots , X_k)\), such that \(X \in {\mathbb {R}}^k\), with continuous cumulative distribution function \(F(x_1, \ldots , x_k)\). In order to assess the strength of dependence, we consider, as base model, the one with independent marginal components
$$\begin{aligned} f_{0}(x_1, \ldots , x_k\,\vert \,{\varvec{\lambda }}) = \prod _{j=1}^{k} f_j(x_j\,\vert \,\lambda _j). \end{aligned}$$
Here, \({\varvec{\lambda }}=(\lambda _1, \dots , \lambda _k)\) denotes the vector of parameters appearing in the marginal components of the density and \(\Lambda \subset {\mathbb {R}}^q\), with q integer, \(q\ge 1\); notice that each \(\lambda _j\) can be either a scalar parameter or a vector of parameters.
The corresponding more complex model assumes a non-constant copula function depending on \(\xi \), that is
$$\begin{aligned} F(x_1, \ldots , x_k\,\vert \,{\varvec{\lambda }}, \xi ) = C(F_1(x_1\,\vert \,\lambda _1), \ldots , F_k(x_k\,\vert \,\lambda _k)\,\vert \,\xi ), \end{aligned}$$
where \(\Xi \subset {\mathbb {R}}^p\). According to Sklar’s theorem, the joint density can be written as
$$\begin{aligned} f_{1}(x_1, \ldots , x_k\,\vert \,{\varvec{\lambda }}, \xi ) = c(F_1(x_1\,\vert \,\lambda _1), \ldots , F_k(x_k\,\vert \,\lambda _k)\,\vert \,\xi ) \prod _{j=1}^{k} f_j(x_j\,\vert \,\lambda _j). \end{aligned}$$
Clearly, model \(M_0\) is nested in model \(M_1\), as we retrieve it for \(\xi =\xi _0\).
It is well known that the Kullback-Leibler divergence between two measures is invariant with respect to strictly monotone transformations. This directly implies that, in our case, the divergence between \(M_0\) and \(M_1\) does not depend on the marginal distributions and it is a function of \(\xi \) only.
This result can be further generalized. In fact, the Kullback-Leibler divergence is a member of the \(\alpha \)-divergences family. In Cichocki and Amari (2010), the \(\alpha \)-divergence between two generic absolutely continuous positive measures, p and q, is defined, for \(\alpha \in {\mathbb {R}}\,\backslash \,\{0,1\}\), as
$$\begin{aligned} D_\alpha (p\Vert q) = \frac{1}{\alpha (\alpha -1)} \int \left( p(x)^{\alpha }\, q(x)^{1-\alpha } - \alpha \, p(x) + (\alpha -1)\, q(x) \right) dx. \end{aligned}$$(6)
Then, it is possible to derive the following theorem.
Theorem 1
(Invariance of \(\alpha \)-divergences wrt the marginals) Let \(f_{1}(x_1, \dots , x_k\,\vert \,{\varvec{\lambda }}, \xi )\) be the generic member of a class of densities \(M_1\), which is assumed to be absolutely continuous with respect to the Lebesgue measure. Let \(f_{0}(x_1,\dots ,x_k)=\prod _{j=1}^{k} f_j(x_j\,\vert \lambda _j)\) be a density with independent components, where \(f_j\) is the marginal density of the j-th component of \(f_{1}(\cdot )\). Then, for any value of \(\alpha \), the \(\alpha \)-divergence (6) of \(f_1\) from \(f_0\) does not depend on \({\varvec{\lambda }}\), and
$$\begin{aligned} D_\alpha (f_1 \Vert f_0) = \frac{1}{\alpha (\alpha -1)} \left( \int _{[0,1]^k} c(u_1, \dots , u_k)^{\alpha }\, du_1 \cdots du_k - 1 \right) , \end{aligned}$$(7)
where \(c(u_1,\dots ,u_k)\) is the copula density associated with \(f_1(\cdot )\).
The proof is given in Appendix A. Notice that Theorem 1 is valid for any copula function and any dimension k of the random vector X. It is easy to see that \(D_\alpha (f_{X\vert {\varvec{\lambda }},\xi }\Vert f_0)=KL (f_{X\vert {\varvec{\lambda }},\xi }\Vert f_0)\) for \(\alpha \rightarrow 1\). Then, Equation (7) becomes
$$\begin{aligned} KL (f_{X\vert {\varvec{\lambda }},\xi } \Vert f_0) = \int _{[0,1]^k} c(u_1, \dots , u_k) \log c(u_1, \dots , u_k)\, du_1 \cdots du_k . \end{aligned}$$
Following Simpson et al. (2017), we consider a function of the original Kullback-Leibler divergence
$$\begin{aligned} s(\xi ) = \sqrt{2\, KL (f_{X\vert {\varvec{\lambda }},\xi } \Vert f_0)}, \end{aligned}$$
and assume a constant rate penalization, which induces an exponential prior on \(s(\xi )\).
In view of Theorem 1, \(s(\xi )\) only depends on \(\xi \): this allows an elicitation process on \(\xi \) which is independent of the values of the parameters of the marginal distributions.
2.3 PC Prior for the bivariate Gaussian copula model
In this section, we restrict the analysis to the case of a bivariate distribution, where the dependence is tuned by the scalar parameter of a Gaussian copula. Although Theorem 1 is valid in any dimension, the theory behind PC priors is well established for scalar parameters only, and we defer the general case to the final discussion.
Consider a two-dimensional random variable \(X \in {\mathbb {R}}^2\). The Kullback-Leibler divergence between the copula model and the model of independence among the marginal components can now be written as
$$\begin{aligned} KL (\xi ) = \int _0^1 \int _0^1 c(u,v\,\vert \,\xi ) \log c(u,v\,\vert \,\xi )\, du\, dv. \end{aligned}$$(10)
We assume that \(c(\cdot , \cdot \,\vert \,\rho )\) is a bivariate Gaussian copula and \(\rho \) is the correlation coefficient, so that \(\rho \in [-1,1]\), and \(u = F_1(x_1)\), and \(v = F_2(x_2)\). Then,
$$\begin{aligned} c(u,v\,\vert \,\rho ) = \frac{1}{\sqrt{1-\rho ^2}} \exp \left\{ \frac{2\rho \Phi ^{-1}(u)\Phi ^{-1}(v) - \rho ^2\left( \Phi ^{-1}(u)^2 + \Phi ^{-1}(v)^2\right) }{2(1-\rho ^2)} \right\} , \end{aligned}$$(11)where \(\Phi ^{-1}(\cdot )\) denotes the quantile function of the standard normal distribution.
Of course, the nested simpler model \(M_0\) is obtained by setting \(\rho =0\).
The Gaussian assumption allows us to obtain a closed form expression for the PC prior of \(\rho \). Starting from equations (10) and (11), the Kullback-Leibler divergence is given in Guo et al. (2017) as
$$\begin{aligned} KL (\rho ) = -\frac{1}{2} \log (1-\rho ^2), \end{aligned}$$(12)
which is equivalent to the mutual information of a bivariate normal distribution. Then, the corresponding distance, according to Simpson et al. (2017), is \(s(\rho ) = \sqrt{2KL (\rho )} = \sqrt{-\log (1-\rho ^2)}\). From the above equation, it is apparent that the Kullback-Leibler divergence and the corresponding distance function are symmetric with respect to the base model, i.e. \(\rho =0\). In particular, the Kullback-Leibler divergence is only piece-wise monotone, depending on the sign of \(\rho \). Therefore, in this case, the exponential prior distribution on the distance function must consider the two branches of s separately, that is
where \(s_1(\rho )\) and \(s_2(\rho )\) are the distances when \(-1\le \rho <0\) and \(0\le \rho \le 1\), respectively. Notice that, if the simpler model corresponded to a \(\rho \not = 0\), the resulting PC prior would be more skewed, reflecting the “asymmetry” of the null model. In our symmetric case, a half exponential distribution is assigned to each branch of \(s(\rho )\). The resulting PC prior distribution for \(\rho \), for a fixed scale hyper-parameter \(\theta \), is then the sum of the two branches
Since
$$\begin{aligned} \Bigg \vert \frac{d s(\rho )}{d \rho } \Bigg \vert = \frac{\vert \rho \vert }{(1-\rho ^2)\sqrt{-\log (1-\rho ^2)}}, \end{aligned}$$
we obtain
$$\begin{aligned} \pi (\rho \,\vert \,\theta ) = \frac{\theta }{2} \exp \left( -\theta \sqrt{-\log (1-\rho ^2)}\right) \frac{\vert \rho \vert }{(1-\rho ^2)\sqrt{-\log (1-\rho ^2)}}, \qquad \rho \in [-1,1]. \end{aligned}$$(13)
Equation (13) produces a proper prior, which is symmetric around zero. All odd prior moments are consequently equal to zero.
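Both properties can be checked numerically. With \(s(\rho ) = \sqrt{-\log (1-\rho ^2)}\) as above, the prior is \((\theta /2)\, e^{-\theta s(\rho )}\, \vert ds/d\rho \vert \); the sketch below (numpy/scipy; the value of \(\theta \) is an illustrative choice) verifies properness and symmetry.

```python
import numpy as np
from scipy.integrate import quad

def pc_prior(rho: float, theta: float) -> float:
    """PC prior for the Gaussian-copula correlation: half-exponential
    density on each branch of s(rho) = sqrt(-log(1 - rho^2)),
    multiplied by the Jacobian |ds/drho|."""
    s = np.sqrt(-np.log1p(-rho ** 2))
    jac = abs(rho) / ((1.0 - rho ** 2) * s)        # |ds/drho|
    return 0.5 * theta * np.exp(-theta * s) * jac

theta = 1.0                                         # illustrative scale
mass, _ = quad(pc_prior, -1.0, 1.0, args=(theta,), points=[0.0], limit=200)
print(mass)                                         # ~ 1: the prior is proper
print(pc_prior(0.3, theta) - pc_prior(-0.3, theta))  # 0: symmetric in rho
```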
2.4 The Jeffreys’ and arc-sine priors for \(\rho \)
As we illustrated in Sect. 2.3, the PC prior for \(\rho \) can be obtained modulo a scalar parameter \(\theta \), whose value must be tuned in a subjective way. In the next sections, we will explore, admittedly in a non-standard way, several well-established objective Bayesian techniques for selecting the hyper-prior distribution for \(\theta \).
We will not consider alternative priors to the PC one for several reasons. First of all, at least in an estimation context, and for scalar parameters, almost any objective approach for selecting a prior would lead to the well-known Jeffreys’ prior (Jeffreys 1946). However, here we are more concerned with Bayesian hypothesis testing based on the Bayes factor; hence, a proper prior is needed.
Jeffreys (1961) himself derived the standard improper prior for the correlation coefficient in the bivariate Gaussian model, conditional on the variances, as
Being improper, this prior cannot be used in testing scenarios involving the calculation of the Bayes factor; then, Jeffreys (1961) also proposed the arc-sine prior, which shows a similar behaviour, although its integral on \([-1,1]\) is finite:
$$\begin{aligned} \pi ^{AS}(\rho ) = \frac{1}{\pi \sqrt{1-\rho ^2}}, \qquad \rho \in (-1,1). \end{aligned}$$(15)
The arc-sine prior is not invariant to reparameterizations, and we will not discuss this aspect further in this paper.
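The arc-sine prior \(\pi (\rho ) = 1/(\pi \sqrt{1-\rho ^2})\) is proper despite being unbounded at \(\pm 1\). A quick numerical check, together with an exact sampling recipe (the \(\cos (\pi U)\) transformation is a standard device, stated here as an aside, not taken from the paper):

```python
import numpy as np
from scipy.integrate import quad

# arc-sine prior pi(rho) = 1 / (pi * sqrt(1 - rho^2)) on (-1, 1):
# unbounded at the endpoints, yet with total mass one
arc_sine = lambda rho: 1.0 / (np.pi * np.sqrt(1.0 - rho ** 2))
mass, _ = quad(arc_sine, -1.0, 1.0)
print(mass)                                  # ~ 1: the prior is proper

# exact sampling: if U ~ Uniform(0, 1), then cos(pi * U) has this density
rng = np.random.default_rng(1)
rho_draws = np.cos(np.pi * rng.uniform(size=100_000))
print(np.mean(rho_draws))                    # ~ 0: symmetric around zero
```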
3 Choosing the hyper-parameter of the PC Prior
A crucial step in the definition of the PC prior is the selection of the hyper-parameter \(\theta \), in that it establishes how fast the prior shrinks towards the base model. In this regard, Simpson et al. (2017) proposed to introduce a subjective input in the form of a probability statement related to a tail event and stated that the prior distribution is relatively insensitive to the choice of the hyper-parameter \(\theta \), unless it is set to an “extremely poor” value. However, given the artificial nature of the hyper-parameter \(\theta \), prior information about it may be unavailable, or there may be uncertainty on how much the experimenter supports the base model a priori. Figure 1 shows the PC prior in Equation (13) for different choices of \(\theta \).
Instead of selecting a specific value for \(\theta \), one could adopt a hierarchical perspective and assign a hyper-prior to the parameter \(\theta \). We discuss in this section two possible hierarchical strategies, although we remark that only proper hyper-priors for \(\theta \) guarantee a proper prior for \(\rho \). We illustrate strategies for deriving hyper-prior distributions in the particular case where \(\rho \) is the correlation parameter of a bivariate Gaussian copula. However, the methodology based on intrinsic priors is more general, and we prove below that it produces the same intrinsic prior no matter what kind of copula is adopted.
3.1 Jeffreys’ prior for \(\theta \)
An objective way to originate a hyper-prior distribution for \(\theta \) is to adopt the Jeffreys’ approach (Jeffreys 1946) with some adjustments. A formal way of deriving the Jeffreys’ prior for a hyper-parameter would imply the marginalization of the parameter of interest \(\rho \) in order to consider a marginal copula model, namely
$$\begin{aligned} c(u,v\,\vert \,\theta ) = \int _{-1}^{1} c_\rho (u,v\,\vert \,\rho )\, \pi (\rho \,\vert \,\theta )\, d\rho , \end{aligned}$$
where \(c_\rho (u,v\,\vert \,\rho ) \) and \(\pi (\rho \,\vert \,\theta ) \) are given, respectively, in (11) and (13). This way we are considering a model with uniform marginals in which the only unknown parameter is \(\theta \). Consequently, the Jeffreys’ prior for the “marginal” model can be easily obtained as
Unfortunately, this derivation does not produce a closed form expression for \(\pi (\theta )\). Alternatively, one could simply note that \(\theta \) acts as a scale parameter for the exponential density of \(s(\rho )\); so the natural objective choice is the use of the improper prior
$$\begin{aligned} \pi ^N(\theta ) \propto \frac{1}{\theta }, \qquad \theta > 0, \end{aligned}$$(16)
where the superscript N stands for non-informative.
3.2 Intrinsic prior for \(\theta \)
In the spirit of making comparisons between \(M_0\) and \(M_1\), similarly to the PC prior idea, an alternative way of deriving a hyper-prior for \(\theta \) is provided by the intrinsic prior methodology (Berger and Pericchi 1996). Here, we follow the approach described in Pérez and Berger (2002). Let
$$\begin{aligned} m_1(\rho ) = \int _0^{\infty } \pi _{\rho \vert \theta }(\rho \,\vert \,\theta )\, \pi ^N(\theta )\, d\theta , \end{aligned}$$
where \(\pi _{\rho \vert \theta }(\rho \,\vert \,\theta )\) is the PC prior for \(\rho \) in Equation (13), and
$$\begin{aligned} m_0(\rho ) = \pi _{\rho \vert \theta _0}(\rho \,\vert \,\theta _0), \end{aligned}$$
where \(\pi _{\rho \vert \theta _0}(\rho \,\vert \,\theta _0)\) is the conditional distribution of \(\rho \) for a specific value \(\theta _0\). Suppose one wants to compare the hypotheses
$$\begin{aligned} H_0: \theta = \theta _0 \qquad \text {vs.} \qquad H_1: \theta \ne \theta _0. \end{aligned}$$
Starting from the improper pseudo-Jeffreys’ prior in Equation (16), the intrinsic prior \(\pi ^I(\theta )\) is obtained by selecting the minimum (virtual) sample size that makes the corresponding posterior distribution proper. Here, the prior for \(\rho \,\vert \,\theta \) plays the role of the likelihood in the standard application of the intrinsic prior methodology. The minimum sample size is equal to one, and the intrinsic prior for \(\theta \) is
$$\begin{aligned} \pi ^I(\theta ) = \int _{-1}^{1} \pi (\theta \,\vert \,\rho _{\ell })\, \pi (\rho _{\ell }\,\vert \,H_0)\, d\rho _{\ell }, \end{aligned}$$(17)
where \(\pi (\theta \,\vert \,\rho _{\ell })\) can be interpreted as the posterior distribution of \(\theta \) after observing a virtual sample of size one, \(\rho _{\ell }\), and \(\pi (\rho _{\ell }\,\vert \,H_0)\) is given by (13) by fixing \(\theta \) at \(\theta _0\). The calculation of the intrinsic prior for other copulas implies the modification of the parameter space of reference.
The following theorem provides the expression of the intrinsic prior for any bivariate copula model depending on a scalar parameter \(\rho \), besides Gaussian copula.
Theorem 2
Consider a generic copula density \(c(u,v\,\vert \,\rho )\) and a penalized complexity prior on \(\rho \) given by
$$\begin{aligned} \pi (\rho \,\vert \,\theta ) = \theta \, e^{-\theta s(\rho )}\, \vert s^{\prime }(\rho ) \vert , \end{aligned}$$
where \(s(\rho )\) is the already introduced transformation of the Kullback-Leibler divergence of \(c(u,v\,\vert \,\rho )\) from the independence copula \(c_0(u,v) =1\), that is \(s(\rho ) = \sqrt{2 KL(c \vert \vert c_0)}\), with \(KL(c \vert \vert c_0) = \int _{0}^1 \int _{0}^1 c(u,v\,\vert \,\rho ) \log c(u,v\,\vert \,\rho )\,du \,dv \), and \(s^\prime (\rho )\) is the first derivative of \(s(\rho )\) wrt \(\rho \). Then, the intrinsic prior for testing \(\theta =\theta _0\) vs. \(\theta \not = \theta _0\) is
$$\begin{aligned} \pi ^I(\theta ) = \frac{\theta _0}{(\theta +\theta _0)^2}, \qquad \theta > 0, \end{aligned}$$
independently of the specific expression of \(c(u,v\,\vert \,\rho )\).
Although in this paper we are confined to copulas, the theorem above is valid for any PC prior arising from the comparison of two generic nested statistical models, as well as for multivariate copulas which are governed by a single parameter, such as in the case of equicorrelation. The proof is given in Appendix B.
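The intrinsic prior \(\pi ^I(\theta ) = \theta _0/(\theta +\theta _0)^2\), as used in the Monte Carlo scheme of Sect. 4.1, is proper for any \(\theta _0 > 0\) and has the closed-form CDF \(F(\theta ) = \theta /(\theta +\theta _0)\), so its median is exactly \(\theta _0\). A minimal numerical check (numpy/scipy; the value of \(\theta _0\) anticipates Sect. 3.2):

```python
import numpy as np
from scipy.integrate import quad

theta0 = 0.491525                 # calibrated value from Sect. 3.2

# intrinsic prior pi^I(theta) = theta0 / (theta + theta0)^2 on (0, inf)
intrinsic = lambda t: theta0 / (t + theta0) ** 2

mass, _ = quad(intrinsic, 0.0, np.inf)
print(mass)                        # ~ 1: proper for any theta0 > 0

# CDF F(theta) = theta / (theta + theta0) implies the median equals theta0
below_median, _ = quad(intrinsic, 0.0, theta0)
print(below_median)                # ~ 0.5
```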
The last step of the procedure consists in selecting the \(\theta _0\) value in a sensible way, since its value is not suggested by the hypotheses system in this setting. Our proposal is to calibrate the value of \(\theta _0\) in order to maximize the prior variance of \(\rho \). The new prior for \(\rho \) is then
$$\begin{aligned} \pi (\rho ) = \int _0^{\infty } \pi ^{PC}(\rho \,\vert \,\theta )\, \pi ^I(\theta )\, d\theta , \end{aligned}$$(18)
and, since \({\mathbb {E}}(\rho \,\vert \,\theta ) = 0, \,\forall \theta \), one has to maximize the quantity
$$\begin{aligned} {\mathbb {E}}(\rho ^2) = \int _{-1}^{1} \rho ^2\, \pi (\rho )\, d\rho \end{aligned}$$
as a function of \(\theta _0\).
The numerical problem has been tackled using two different methods, namely Brent’s method (Brent 1973) and the Golden Section method, implemented in the R suite deconstructSigs. Both produced the same value \({\hat{\theta }}_0 = 0.491\,525\). It should be noticed that the hierarchical model stabilizes the variance of the PC prior for \(\rho \); in fact, without taking into account the intrinsic prior for \(\theta \), the variance is hard to compute, even numerically, especially for values of \(\theta \) between 0 and 1. This is a consequence of the fact that, for these values of \(\theta \), the variance of the PC prior for \(\rho \) increases as the prior spreads out towards the alternative model. Figure 2 compares the Jeffreys’ prior, the arc-sine prior and the hierarchical PC prior with \(\theta _0=0.491\,525\).
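Sampling from the hierarchical prior is straightforward by inverting the distance transform: from \(s(\rho ) = \sqrt{-\log (1-\rho ^2)}\) one gets \(\rho = \pm \sqrt{1-e^{-s^2}}\). A Monte Carlo sketch (numpy; the sample size is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, n = 0.491525, 200_000

# theta ~ pi^I via the inverse CDF: theta = theta0 * u / (1 - u)
u = rng.uniform(size=n)
theta = theta0 * u / (1.0 - u)

# s | theta ~ Exp(theta); then invert the distance transform and
# attach a random sign for the two symmetric branches
s = rng.exponential(scale=1.0 / theta)
rho = rng.choice([-1.0, 1.0], size=n) * np.sqrt(1.0 - np.exp(-s ** 2))

print(np.mean(rho))   # ~ 0: the hierarchical prior is symmetric
print(np.var(rho))    # Monte Carlo estimate of the prior variance of rho
```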
4 Simulation study
In this section, we perform a simulation study in order to compare the frequentist behaviour of the hierarchical PC prior in (18) relative to its natural competitors discussed before. We also analyse the performance of the hierarchical PC prior for the parameter \(\rho \)—with abuse of notation—controlling the strength of association of a Frank copula, against the classical version of the PC prior. In this case, the PC prior derivation does not provide a closed form expression. Notice also that, in the Frank copula, \(\rho \) does not assume the meaning of a correlation coefficient. For details about the Frank copula, we refer the reader to Joe (2014).
For each true value of \(\rho ^*\) in the set \((-0.95,-0.5,0,0.05,0.5,0.95,0.999)\) for the Gaussian copula and in the set (0.01, 1, 5, 10) for the Frank copula, and for each sample size \((n=5,30,100,1\,000)\), we have generated 200 independent samples from a Gaussian copula and a Frank copula, respectively. For each sample, we have computed i) the posterior mean of \(\rho \); ii) the \(95\%\) equal-tailed credible interval; iii) the Bayes factor for testing the hypotheses \(c(u,v) =c_0(u,v)=1\) vs. \(c(u,v)\not =c_0(u,v)\). In addition, we have computed an estimate of the mean squared error of the posterior mean \(\hat{\rho }\), namely
$$\begin{aligned} \widehat{MSE }({\hat{\rho }}) = \frac{1}{200} \sum _{i=1}^{200} \left( {\hat{\rho }}_i - \rho ^*\right) ^2, \end{aligned}$$where \({\hat{\rho }}_i\) is the posterior mean computed on the i-th simulated sample.
The posterior distribution \(\pi (\rho \,\vert \,x)\) was obtained using a standard Metropolis-Hastings-within-Gibbs algorithm, using a truncated Normal distribution in the Random Walk Metropolis step. The variance of the proposal has been calibrated to have an acceptance rate of about 30%.
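A minimal sketch of this sampler, under simplifying assumptions (a fixed \(\theta \) rather than the full hierarchy, data generated directly on the Gaussian scale, and illustrative tuning constants):

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(4)

def log_copula(rho, z1, z2):
    """Log-likelihood of the Gaussian copula, with z_i = Phi^{-1}(u_i)."""
    return (-0.5 * len(z1) * np.log(1.0 - rho ** 2)
            + (2 * rho * z1 * z2 - rho ** 2 * (z1 ** 2 + z2 ** 2)).sum()
            / (2.0 * (1.0 - rho ** 2)))

def log_pc_prior(rho, theta):
    s = np.sqrt(-np.log1p(-rho ** 2))
    return (np.log(theta / 2.0) - theta * s + np.log(abs(rho))
            - np.log(1.0 - rho ** 2) - np.log(s))

# simulated data: n pairs from a Gaussian copula with rho* = 0.5
rho_true, n = 0.5, 500
z = rng.multivariate_normal([0, 0], [[1, rho_true], [rho_true, 1]], size=n)
z1, z2 = z[:, 0], z[:, 1]

theta, sd = 0.5, 0.2          # fixed PC-prior scale and proposal std dev
rho_c, draws = 0.1, []
for _ in range(4000):
    a, b = (-1 - rho_c) / sd, (1 - rho_c) / sd
    rho_p = truncnorm.rvs(a, b, loc=rho_c, scale=sd, random_state=rng)
    # MH ratio; the truncated proposal is not symmetric, so correct for it
    log_r = (log_copula(rho_p, z1, z2) + log_pc_prior(rho_p, theta)
             - log_copula(rho_c, z1, z2) - log_pc_prior(rho_c, theta)
             + truncnorm.logpdf(rho_c, (-1 - rho_p) / sd, (1 - rho_p) / sd,
                                loc=rho_p, scale=sd)
             - truncnorm.logpdf(rho_p, a, b, loc=rho_c, scale=sd))
    if np.log(rng.uniform()) < log_r:
        rho_c = rho_p
    draws.append(rho_c)

print(np.mean(draws[500:]))    # posterior mean, roughly rho* = 0.5
```

The proposal standard deviation `sd` plays the role of the variance calibration mentioned above; in practice it would be tuned to reach the stated acceptance rate of about 30%.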
From an “estimation” perspective, and only for the Gaussian copula, we consider the Jeffreys’ prior in (14) as the natural competitor. Our simulation shows that the hierarchical PC prior produces a smaller MSE when the true \(\rho \) is close to zero, as illustrated in Fig. .
Also, notice how the sampling distributions of the posterior mean, using the PC prior, tend to be more concentrated around the true value \(\rho ^*\), especially for moderate sample sizes. The Jeffreys’ prior produces smaller values of the mean squared error for intermediate levels of correlation. In this case, a sort of bias-variance trade-off seems to occur: the bias is smaller with the Jeffreys’ prior, while the variance is smaller with the PC prior. For small values of \(\rho ^*\), the PC prior is superior in terms of MSE compared with the Jeffreys’ prior; this can be related to the spike of the PC prior in a neighbourhood of \(\rho =0\) (see Fig. 2).
We have also analysed the sensitivity of inference with respect to the choice of \(\theta \) in the standard implementation of the PC prior. For this purpose, we have run a simulation study using three different choices of \(\theta \), namely 0.1, 1 and 5, and compared the results with the hierarchical approach described in Sect. 3.2 and with the Jeffreys’ prior in Equation (14), this latter only for the Gaussian copula. Results are summarized in Tables 1 and 2, where we report the outputs for several sample sizes (\(n=5, 30, 100\)). In Table 1, related to the Gaussian copula, the Jeffreys’ prior shows the best performance in terms of MSE for large values of \(\vert \rho ^*\vert \); for small \(\vert \rho ^*\vert \), each “subjective” PC prior performs better in specific cases, while the hierarchical PC prior always performs well and represents the “safer” option. In Table 2, referring to the Frank copula, we see again that there is no single value of \(\theta \) which makes the subjective PC prior always preferable to the hierarchical one. Therefore, we opt for the latter, where again we maximized the variance as in Sect. 3.2 for the Gaussian copula, obtaining a similar value, i.e. \({\hat{\theta }}_0 = 0.549\,829\).
Table , related to the Gaussian copula, reports the coverage probabilities of the Jeffreys’ and hierarchical PC priors. Again, the latter outperforms the former for small \(\vert \rho ^*\vert \), while for large \(\vert \rho ^*\vert \) the behaviour of the two priors is similar. Table displays the coverage probabilities of the subjective and “objective” PC priors of \(\rho \) from a Frank copula. Again, it is evident how a poor selection of \(\theta \) may affect the estimates, especially for small sample sizes.
4.1 Bayesian hypothesis testing
PC priors were not originally proposed for testing hypotheses. However, it seems natural to analyse their performance in a Bayesian testing framework, since they are proper and particularly suitable for comparing nested models. Here, we consider the case where \(M_0\) is the nested model, that is \(c(u,v)=c_0(u,v)=1\), versus the alternative \(M_1: \, c(u,v)\not =c_0(u,v)\). In the following, we restrict the comparison to the case of “uniform marginals”. A more general discussion, involving uncertainty in the marginal parameters, would inevitably depend on the specific situation.
The standard tool for Bayesian testing nested models is the Bayes factor of \(M_0\) vs. \(M_1\), say \(B_{01}\). It is known that when the models have parameter spaces of differing dimensions, the use of improper priors yields indeterminate answers (Berger and Pericchi 2001). The lack of suitable objective priors for testing has led researchers to circumvent the problem in several different ways, either using conventional priors, based on some specific principles (Bayarri et al. 2012), or proposing alternative methods, like the already mentioned intrinsic prior discussed in Sect. 3.2.
First, for the Gaussian copula model, we compare the use of the hierarchical PC prior, based on the intrinsic approach, with the conventional arc-sine prior defined in Equation (15), which can be considered a “proper” version of the improper Jeffreys’ prior analysed before. Also, we explore the performance of the hierarchical PC prior when the complex model is described by a Frank copula.
Let \(\gamma _0=P(H_0)\) be the prior probability of the null hypothesis and let \(\pi ^{PC}(\rho \,\vert \,\theta _0)\), with \(\theta _0 \approx 0.491 \) for the Gaussian copula, and \(\theta _0 \approx 0.55 \) for the Frank one, be the prior density of \(\rho \) under \(M_1\). Since the likelihood under \(M_0\) is identically equal to one, the Bayes factor \(B_{01}\) can be written as
$$\begin{aligned} B_{01} = \left[ \int \prod _{i=1}^{n} c(u_i,v_i\,\vert \,\rho )\, \pi ^{PC}(\rho \,\vert \,\theta _0)\, d\rho \right] ^{-1}, \end{aligned}$$
and the posterior probability of the simplest model is
$$\begin{aligned} P(M_0\,\vert \,\text {data}) = \frac{\gamma _0 \, B_{01}}{\gamma _0 \, B_{01} + 1 - \gamma _0}. \end{aligned}$$
Again, the selection of the hyper-parameter \(\theta \) in a subjective way is crucial in this setting. Choosing a small value of \(\theta \) would produce Bayes factors that support the base model, even when \(M_1\) is true; in particular, in the limiting case of infinite prior variance, the Bayes factor would always select \(M_0\), independently of the observed data. On the other hand, setting a large value of \(\theta \) would make the Bayes factor shrink to 1, independently of the true value of \(\rho \). Our approach based on the hierarchical PC prior is able to alleviate these drawbacks. Table reports the performance of the Bayes Factor for each combination of \((n=\{5,30,100\}, \vert \rho ^*\vert )\) for the Gaussian copula model. Similar results, relating to the Frank copula, are reported in Table 4.
The computation of the Bayes factor was performed using a standard hierarchical Monte Carlo approach: we first draw a value \({\tilde{\theta }}\) from \(\pi ^I(\theta )={\theta _0}\big /{(\theta +\theta _0)^2}\) and then generate a value of \(\rho \) from \(\pi ^{PC}(\rho \,\vert \,{\tilde{\theta }})\).
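As an illustration, the hierarchical Monte Carlo scheme above can be sketched in Python. The sketch assumes, as in the PC construction for the Gaussian copula, that the distance \(d(\rho )=\sqrt{-\log (1-\rho ^2)}\) follows an exponential distribution with rate \(\theta \) under \(\pi ^{PC}(\rho \,\vert \,\theta )\), with a random sign attached to \(\rho \); the data and all numerical settings are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta0 = 0.491  # hyper-parameter value used in the paper for the Gaussian copula

def gauss_copula_density(u, v, rho):
    # Bivariate Gaussian copula density c(u, v | rho).
    x, y = norm.ppf(u), norm.ppf(v)
    return np.exp(-(rho**2 * (x**2 + y**2) - 2 * rho * x * y)
                  / (2 * (1 - rho**2))) / np.sqrt(1 - rho**2)

def sample_rho_pc(theta, rng):
    # PC prior draw (assumption): d(rho) = sqrt(-log(1 - rho^2)) ~ Exp(rate = theta),
    # with a random sign for rho (symmetric prior).
    d = rng.exponential(scale=1.0 / theta)
    rho = np.sqrt(1.0 - np.exp(-d * d))
    rho = min(rho, 1.0 - 1e-12)          # guard against rho = 1 numerically
    return rho if rng.random() < 0.5 else -rho

# Illustrative pseudo-observations (drawn here under independence, i.e. M0).
n = 30
u, v = rng.uniform(size=n), rng.uniform(size=n)

# Hierarchical Monte Carlo: theta ~ pi^I(theta), then rho | theta ~ PC prior.
M = 20_000
like = np.empty(M)
for m in range(M):
    w = rng.uniform()
    theta = theta0 * w / (1.0 - w)       # inverse-CDF draw from theta0 / (theta + theta0)^2
    theta = max(theta, 1e-12)            # guard against an exact-zero draw
    rho = sample_rho_pc(theta, rng)
    like[m] = np.prod(gauss_copula_density(u, v, rho))

m1 = like.mean()   # Monte Carlo estimate of the copula marginal likelihood under M1
B01 = 1.0 / m1     # under M0 the copula density is identically 1
print(B01)
```

The inverse-CDF step uses \(F(\theta )=\theta /(\theta +\theta _0)\), so \(\theta =\theta _0 w/(1-w)\) for \(w\sim U(0,1)\).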
Regarding the Gaussian copula, Fig. 2 illustrates how the hierarchical PC prior with \(\theta _0\approx 0.491\), when compared to the arc-sine prior, represents a compromise between “flatness” and the presence of a significant mass in a neighbourhood of the null hypothesis. This appears to be a safer strategy than setting an arbitrary value of \(\theta \), in absence of specific subjective information.
Tables 4 and 5 report the relative frequency with which \(B_{01}< 0.5\). This value has been chosen as a conventional threshold, below which data provide evidence in favour of \(M_1\) (Liseo and Loperfido 2006): when \(\gamma _0=0.5\), \(B_{01}< 0.5\) corresponds to \(P(H_0 \vert \text {data})< 1/3\). In Table 5, the results of the comparison between the hierarchical PC prior and the arc-sine prior show a similar pattern; however, it seems that the PC prior prefers \(M_1\) slightly more often than the arc-sine prior for small values of \(\rho ^*\). Our conclusion is that the PC prior is more sensitive than the arc-sine prior in capturing small dependencies; this behaviour becomes even more evident as the sample size grows.
Our analysis in this section was restricted to comparing two different dependence structures using a copula representation. In a more general setting, when uncertainty on the marginal modelling comes into play, the Bayes factor would also depend on the marginal priors, although this dependence is typically mild in our experience. However, even in the general case, the derivation of the hierarchical PC prior would not depend on the marginal modelling.
4.2 Prior on the model space
Testing statistical hypotheses is the most controversial aspect of statistical inference, with at least three major competing schools approaching the problem from different angles and often reaching irreconcilable decisions (Robert 2014). The Bayesian approach has often been criticized because of the potential appearance of the Jeffreys-Lindley paradox. Robert (1993) proposed a solution which consists in calibrating the prior probability associated with \(H_0\) in terms of the variance of the prior under \(H_1\). To address this issue, we discuss here an alternative strategy for selecting the prior probabilities to attach to the competing models, one which depends on the particular choice of the hyper-parameter.
It is known (Berk 1966) that, when comparing two misspecified models, the posterior probability tends to accumulate on the model which is closer, in terms of the Kullback-Leibler divergence, to the true one. Following a proposal developed in Villa and Walker (2015), the divergence \(KL (c_\rho (u,v\,\vert \,\rho ) \;\Vert \; c_{\rho _0}(u,v\,\vert \,\rho _0))\) is interpreted as the loss we would incur if model \(M_1\) were removed when it is actually the true model. Here, we derive expressions for the Gaussian copula model. Therefore, the prior expected loss is equal to
$$\int KL \big (c_\rho \;\Vert \; c_{\rho _0}\big )\, \pi ^{PC}(\rho \,\vert \,\theta )\, \textrm{d}\rho .$$
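As a sketch of how this expected loss can simplify in the Gaussian-copula case: if, as in the PC construction, the distance \(d(\rho )=\sqrt{2\,KL(c_\rho \Vert c_0)}\) is taken to be exponentially distributed with rate \(\theta \) a priori (an assumption we make here for illustration, with base value \(\rho _0=0\)), the expected loss admits a closed form.

```latex
% Gaussian copula: KL(c_rho || c_0) = -log(1 - rho^2)/2 = d(rho)^2 / 2.
% If d ~ Exp(theta) a priori, then E[d^2] = 2/theta^2, hence
\mathbb{E}_{\pi^{PC}}\!\left[\, KL(c_\rho \,\Vert\, c_0) \,\right]
  \;=\; \tfrac{1}{2}\,\mathbb{E}[d^2]
  \;=\; \frac{1}{\theta^2}.
```

This makes explicit how the prior probabilities below inherit their dependence on the hyper-parameter \(\theta \).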
The model prior probabilities can then be obtained in terms of the self-information loss function, which represents the loss connected to a probability statement. By equating the self-information and the expected loss, and following Villa and Walker (2015), one has
$$\gamma _1(\theta ) \propto \exp \left\{ \int KL \big (c_\rho \;\Vert \; c_{\rho _0}\big )\, \pi ^{PC}(\rho \,\vert \,\theta )\, \textrm{d}\rho \right\} .$$
Setting \(\gamma _0 (\theta )\propto 1\), one obtains
$$\gamma _0(\theta ) = \left[ 1 + \exp \left\{ \int KL \big (c_\rho \;\Vert \; c_{\rho _0}\big )\, \pi ^{PC}(\rho \,\vert \,\theta )\, \textrm{d}\rho \right\} \right] ^{-1}.$$
The above derivation shows how PC priors, although proposed and developed for estimation problems, may also play a role in the search for objective priors in model selection. This possibility will be explored in the future.
5 Discussion
The PC prior methodology is a powerful and useful idea for model construction. In this paper, we have explored its use in the general problem of checking the need to introduce dependence among the components of a random vector. We have done this by exploiting the copula representation of the distribution of a random vector. We focussed on the bivariate case and, for the sake of simplicity, presented the case of the Gaussian copula model in Sect. 2.3. We have considered, as the base model, the one with a constant copula density, i.e. \(c_0(u,v)=1\), which implies independence. In particular, for a Gaussian copula, this is equivalent to setting \(\rho =0\). In different contexts, other base models may be considered. For instance, Sørbye and Rue (2017) considered, in an autoregressive setting, as the base model either the one which assumes independence (\(\rho =0\)) or the random walk one (\(\rho =1\)); Franco-Villoria et al. (2019) also considered the case \(\rho =1\) for varying coefficient models.
There are several advantages in using the proposed PC prior for the problem discussed in this paper. First, the hierarchical PC prior can well be considered objective and yet is proper. An objective prior with this property offers many advantages, such as its suitability for testing scenarios and the fact that it avoids onerous proofs of the propriety of the resulting posterior. Furthermore, it represents an important step towards outlining a theory of objective priors for copula modelling which, with the exception of Guillotte and Perron (2012), is still missing. In addition, our approach circumvents the problem of calibrating the shrinkage parameter, which may lead to erroneous inference, especially in hypothesis testing. In the absence of strong prior information on the distance scale, this is a convenient property, which allows an objective approach at the second level of the hierarchy. Second, given the nested nature of the models being compared, and since the PC prior is a proper probability distribution, its implementation for testing problems becomes natural. Although not specifically designed for this type of scenario, our experience suggests that the PC prior methodology can be viewed as intimately connected to the model selection problem. Obviously, this makes sense only when we fix the value of the extra-parameter in such a way that the complex model boils down to a simpler model, and not if we arbitrarily pick a value of the extra-parameter in the parameter space. As a final remark, the specification of the mutual information of random vectors as the negative copula entropy plays a key role in the derivation of the prior for any copula model, as it allows one to get rid of the parameters of the marginals. This fact has strong consequences for estimation procedures, since it allows a separate elicitation of the priors for the marginals and for the dependence structure.
In this work, we focussed our attention on the univariate case. Simpson et al. (2017) developed a PC prior for the regression matrix of a probit model. In our context, suppose we have a matrix \({\textbf{R}}\) for a Gaussian copula, with elements \(\rho _{ij}\), for \(i,j=1, \ldots , k\), where k is the dimension of the multivariate random variable. Then, it is possible to fix a value \(r \ge 0\) such that \(\varvec{\rho } \in {\mathcal {S}}_r = \{\varvec{\rho } \in {\mathbb {R}}^k : s(\varvec{\rho })=r\}\). The subsets \({\mathcal {S}}_r\) represent a foliation, and the PC approach can be extended by assuming an exponential distribution for \(s(\varvec{\rho })\) and a uniform distribution on the leaves \({\mathcal {S}}_{s(\varvec{\rho })}\). The main difference is that computational techniques may be needed to derive the distribution \(\pi (\varvec{\rho })\). Simpson et al. (2017) provided a prior distribution for a correlation matrix which exploits a different parameterization based on a Cholesky decomposition, which, however, creates a dependence on the ordering. In general, it may not be easy to define a distance \(s(\varvec{\rho })\) involving all the parameters. Sørbye and Rue (2017) proposed a sequential approach to define PC prior distributions for the parameters of an AR(p) model: the Kullback-Leibler divergence is computed conditionally on the terms already included in the model, i.e. the divergence for the k-th parameter is calculated with respect to the model based on the other \(k-1\) parameters and with \(\theta _k={\bar{\theta }}\). Therefore, conditional PC priors are defined for each parameter. However, inference may depend on the order chosen for the sequential procedure and, in the setting considered in this work, it may not be obvious how to define this order.
In the specific setting of copula models, it is possible to redefine the multivariate copula model through a vine construction (Czado 2019): under conditions of differentiability, the joint multivariate distribution can be written, in the trivariate case, as
$$f(x_1,x_2,x_3) = \left[ \prod _{j=1}^{3} f_j(x_j) \right] c_{12}\big (F_1(x_1), F_2(x_2)\big )\, c_{23}\big (F_2(x_2), F_3(x_3)\big )\, c_{13\vert 2}\big (F_{1\vert 2}(x_1\vert x_2), F_{3\vert 2}(x_3\vert x_2)\big ),$$
which can be easily generalized for generic dimension k. Therefore, it is possible to select each conditional model (Czado et al. 2013; Gruber and Czado 2018) and then define a PC prior for the univariate parameter of each bivariate model. Vine copulas tend to be computationally inefficient; it may therefore be necessary to use faster algorithms in high dimensions (Tagasovska et al. 2019). Moreover, while the order of the conditioning should not affect inference, it may be difficult to assess the impact on inference of univariate PC priors defined for different vine structures. It seems evident that the multivariate extension of the proposed approach is not trivial and, therefore, is left for further research.
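As a small numerical check of the pair-copula factorization just described, the sketch below evaluates a trivariate Gaussian copula density both directly and through its D-vine decomposition; the correlation values are hypothetical, and we use the standard result that, for Gaussian pair copulas, the parameter of \(c_{13\vert 2}\) is the partial correlation.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gauss_pair_density(u, v, rho):
    # Bivariate Gaussian copula density c(u, v | rho).
    x, y = norm.ppf(u), norm.ppf(v)
    return np.exp(-(rho**2 * (x**2 + y**2) - 2 * rho * x * y)
                  / (2 * (1 - rho**2))) / np.sqrt(1 - rho**2)

def h(u, v, rho):
    # h-function: conditional CDF F(u | v) under a Gaussian pair copula.
    return norm.cdf((norm.ppf(u) - rho * norm.ppf(v)) / np.sqrt(1 - rho**2))

# Pairwise correlations (hypothetical values) and the implied partial correlation.
r12, r23, r13 = 0.5, 0.3, 0.4
r13_2 = (r13 - r12 * r23) / np.sqrt((1 - r12**2) * (1 - r23**2))

u = np.array([0.3, 0.6, 0.8])  # an arbitrary evaluation point in (0,1)^3

# D-vine decomposition of the trivariate copula density.
vine = (gauss_pair_density(u[0], u[1], r12)
        * gauss_pair_density(u[1], u[2], r23)
        * gauss_pair_density(h(u[0], u[1], r12), h(u[2], u[1], r23), r13_2))

# Direct trivariate Gaussian copula density for comparison.
R = np.array([[1, r12, r13], [r12, 1, r23], [r13, r23, 1]])
x = norm.ppf(u)
direct = multivariate_normal(cov=R).pdf(x) / np.prod(norm.pdf(x))

print(vine, direct)  # the two factorizations agree
```

For the Gaussian copula the decomposition is exact, so the two numbers coincide up to floating-point error; for other families the conditional copula generally varies with the conditioning value.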
References
Bayarri MJ, Garcia-Donato G (2007) Extending conventional priors for testing general hypotheses in linear models. Biometrika 94(1):135–152
Bayarri MJ, Garcia-Donato G (2008) Generalization of Jeffreys divergence-based priors for Bayesian hypothesis testing. J Roy Stat Soc B 70(5):981–1003
Bayarri MJ, Berger JO, Forte A, García-Donato G (2012) Criteria for Bayesian model choice with application to variable selection. Ann Stat 40(3):1550–1577
Berger JO, Pericchi LR (2001) Objective Bayesian methods for model selection: introduction and comparison [with discussion]. In: Lahiri P (ed) Model Selection, vol 38. Institute of Mathematical Statistics, Lecture Notes - Monograph Series, Beachwood Ohio, pp 135–207
Berger JO, Pericchi LR (1996) The intrinsic Bayes factor for model selection and prediction. J Am Stat Assoc 91:109–122
Berk RH (1966) Limiting behavior of posterior distributions when the model is incorrect. Ann Math Stat 37(1):51–58
Brent RP (1973) Algorithms for minimization without derivatives. Prentice-Hall, Englewood Cliffs, NJ
Cichocki A, Amari S-I (2010) Families of Alpha- Beta- and Gamma- divergences: flexible and robust measures of similarities. Entropy 12(6):1532–1568
Cichocki A, Cruces S, Amari S (2011) Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 13(1):134–170
Czado C (2019) Analyzing Dependent Data with Vine Copulas: a Practical Guide with R, vol 222, 1st edn. Lecture Notes in Statistics. Springer, Cham, Switzerland
Czado C, Brechmann EC, Gruber L (2013) Selection of vine copulas. In: Jaworski P, Durante F, Härdle WK (eds) Copulae in Mathematical and Quantitative Finance. Lecture Notes in Statistics, vol 213, pp 17–37. Springer, New York
Elfadaly FG, Garthwaite PH (2017) Eliciting Dirichlet and Gaussian copula prior distributions for multinomial models. Stat Comput 27(2):449–467
Franco-Villoria M, Ventrucci M, Rue H (2019) A unified view on Bayesian varying coefficient models. Electron J Statistics 13(2):5334–5359
Gruber LF, Czado C (2018) Bayesian model selection of regular vine copulas. Bayesian Anal 13(4):1111–1135
Guillotte S, Perron F (2012) Bayesian estimation of a bivariate copula using the Jeffreys prior. Bernoulli 18(2):496–519
Guo J, Riebler A, Rue H (2017) Bayesian bivariate meta-analysis of diagnostic test studies with interpretable priors. Stat Med 36(19):3039–3058
Jeffreys H (1946) An invariant form for the prior probability in estimation problems. Proc R Soc Lond Ser A Math Phys Sci 186(1007):453–461
Jeffreys H (1961) Theory of Probability. Oxford University Press, New York
Joe H (2014) Dependence Modeling with Copulas, 1st edn. Monographs on Statistics and Applied Probability, vol 134. Chapman & Hall/CRC, New York
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Liseo B, Loperfido N (2006) A note on reference priors for the scalar skew-normal distribution. J Stat Plan Inference 136:373–389
Pérez JM, Berger JO (2002) Expected-posterior prior distributions for model selection. Biometrika 89:491–511
Robert CP (1993) A note on Jeffreys-Lindley paradox. Stat Sin 3:601–608
Robert CP (2014) On the Jeffreys-Lindley paradox. Philos Sci 81:216–232
Sharma R, Das S (2017) Regularization and Variable Selection with Copula Prior. arXiv preprint arXiv:1709.05514
Simpson D, Rue H, Riebler A, Martins TG, Sørbye SH (2017) Penalising model component complexity: a principled, practical approach to constructing priors (with discussion). Stat Sci 32(1):1–28
Sklar A (1959) Fonctions de Répartition à \(n\) Dimensions et Leurs Marges. Publication de l’Institut Statistique de l’Université de Paris 8:229–231
Sørbye SH, Rue H (2017) Penalised complexity priors for stationary autoregressive processes. J Time Ser Anal 38(6):923–935
Tagasovska N, Ackerer D, Vatter T (2019) Copulas as high-dimensional generative models: Vine copula autoencoders. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R (eds) Advances in Neural Information Processing Systems, vol 32
Villa C, Walker S (2015) An objective Bayesian criterion to determine model prior probabilities. Scand J Stat 42(4):947–966
Wilson KJ (2018) Specification of informative prior distributions for multinomial models using vine copulas. Bayesian Anal 13(3):749–766
Funding
Open access funding provided by Università degli Studi di Sassari within the CRUI-CARE Agreement.
Ethics declarations
Conflict of interest
Not applicable.
Appendices
Appendix A: Proof of Theorem 1
Proof
The Alpha-Beta divergence of a positive measure p from another positive measure q is given in Cichocki et al. (2011) as
$$D_{AB}^{(\alpha ,\beta )}(p \,\Vert \, q) = -\frac{1}{\alpha \beta } \int \left( p(x)^{\alpha } q(x)^{\beta } - \frac{\alpha }{\alpha +\beta }\, p(x)^{\alpha +\beta } - \frac{\beta }{\alpha +\beta }\, q(x)^{\alpha +\beta } \right) \textrm{d}x, \qquad \alpha , \beta , \alpha +\beta \ne 0.$$
Let \(q(x)=f_0(x\,\vert \,{\varvec{\lambda }}, \xi =\xi _0)=\prod _{j=1}^{k} f_j(x_j\,\vert \,\lambda _j)\) be the base model in the case of independence, and let \(p(x)=f_{X\vert {\varvec{\lambda }},\xi }(x\,\vert \,{\varvec{\lambda }},\xi )=c(u_1, \dots ,u_k \,\vert \,\xi ) \cdot \prod _{j=1}^{k} f_j(x_j\,\vert \,\lambda _j)\) be the more complex model. Then,
If one restricts the family of divergences with the constraint \(\alpha +\beta =1\), we are left with the \(\alpha \)-divergences subfamily \(D_\alpha \); in that case, divergences are invariant with respect to the marginal distributions. This occurs because, when passing from \({\mathcal {X}}\) to \({\mathcal {U}}\), the marginal distributions simplify with the determinant of the Jacobian of the transformation. In practice, for \(\beta =1-\alpha \)
where the last equality descends from
\(\square \)
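The invariance established above can be illustrated numerically in the limiting case \(\alpha \rightarrow 1\) (the Kullback-Leibler divergence). The Monte Carlo sketch below (our construction, with illustrative values) estimates the divergence from independence for a Gaussian copula once with standard normal marginals and once after transforming to exponential marginals; both estimates match the known value \(-\tfrac{1}{2}\log (1-\rho ^2)\), since the marginal terms cancel in the log-ratio.

```python
import numpy as np
from scipy.stats import norm, expon, multivariate_normal

rng = np.random.default_rng(1)
rho, N = 0.6, 200_000

# Sample from a bivariate Gaussian copula with standard normal marginals.
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=N)
x, y = xy[:, 0], xy[:, 1]

# KL divergence from independence with Gaussian marginals:
# E[ log f(x, y) - log f1(x) - log f2(y) ].
kl_gauss = np.mean(multivariate_normal(cov=cov).logpdf(xy)
                   - norm.logpdf(x) - norm.logpdf(y))

# Transform to exponential marginals: same copula, different marginals.
s, t = expon.ppf(norm.cdf(x)), expon.ppf(norm.cdf(y))
# Joint density under the copula model is c(u, v) * g1(s) * g2(t), so the
# marginal densities cancel in the log-ratio: only log c(u, v) remains.
u, v = expon.cdf(s), expon.cdf(t)
xs, ys = norm.ppf(u), norm.ppf(v)
log_c = (-(rho**2 * (xs**2 + ys**2) - 2 * rho * xs * ys)
         / (2 * (1 - rho**2))) - 0.5 * np.log(1 - rho**2)
kl_expon = np.mean(log_c)

true_kl = -0.5 * np.log(1 - rho**2)
print(kl_gauss, kl_expon, true_kl)  # all approximately equal
```

The agreement of the two estimates reflects exactly the Jacobian cancellation used in the proof: the divergence depends on the copula alone, not on the marginals.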
Appendix B: Proof of Theorem 2
Proof
After “observing” a pseudo-value \(\rho _\ell \), the posterior of \(\theta \) is
By using equation (17), and noticing that the parameter space of \(\rho _{\ell }\) does not necessarily need to be the subset \([-1,1]\) as in the Gaussian copula, we obtain
\(\square \)
Cite this article
Battagliese, D., Grazian, C., Liseo, B. et al. Copula modelling with penalized complexity priors: the bivariate case. TEST 32, 542–565 (2023). https://doi.org/10.1007/s11749-022-00843-w