1 Introduction

In statistical modelling, it is typical to face the problem of calibrating a model which is complex enough to capture at least the main features of the phenomenon of interest and, at the same time, simple enough to maintain interpretability. Starting from a model with a given level of complexity, called the base model, say \(M_0\), a feasible way to obtain a richer and more flexible structure is to include an extra component, so that the base model is nested in the more complex one, say \(M_1\). For example, in many scenarios it is important to verify whether some components of the model may be considered mutually independent. This amounts to comparing two nested models and, in a Bayesian framework, it is of paramount importance to adequately specify a prior distribution on the parameters governing the extra complexity of the larger model.

This issue is at the core of the idea of penalized complexity (PC) priors, proposed in Simpson et al. (2017), where a principled way to construct a prior distribution for the parameters, say \(\xi \), of the extra component of the model is described. The basic idea is to reparameterize \(\xi \) in terms of the Kullback-Leibler divergence between \(M_0\) and \(M_1\) and then assign an exponential prior to a specific function of the new parameter. The idea of using Kullback-Leibler divergence measures to construct priors, and in particular testing priors, was originally proposed in Jeffreys (1961), while Bayarri and Garcia-Donato (2008) and Bayarri and Garcia-Donato (2007) represent central contributions.

In this paper, we consider the case where the base model assumes independence among the marginal distributions of the components, while the more complex one includes a certain degree of dependence among them. The most natural way of highlighting the differences between the two models above is certainly through their copula representation (Sklar 1959). In fact, a copula representation is very popular in many applied settings as it allows one to separate the estimation of the marginal distributions from that of the dependence structure. A copula is a multivariate distribution defined on the unit hyper-cube \([0,1]^k\), where k is the dimension of the multivariate random variable X. More formally,

$$\begin{aligned} C(u_1, \ldots , u_k) = \Pr (U_1 \le u_1, \ldots , U_k \le u_k), \end{aligned}$$

where \(U_i \sim \text {U}(0,1)\), for \(i=1,2, \ldots , k\). If \(X = (X_1, \ldots , X_k) \in {\mathbb {R}}^k\) has a continuous multivariate cumulative distribution function \(F(x_1, \ldots , x_k) = \Pr (X_1 \le x_1, \ldots , X_k \le x_k)\), then F can be uniquely represented by a copula function \(C: [0,1]^k \rightarrow [0,1]\) so that

$$\begin{aligned} F(x_1, \ldots , x_k) = C(F_1(x_1), \ldots , F_k(x_k)), \end{aligned}$$

where \(F_i(x_i)\) is the marginal cumulative distribution function of \(X_i\), for \(i=1,2, \ldots , k\) (Sklar 1959). In addition, in the absolutely continuous case, the density function of X can be written as

$$\begin{aligned} f(x_1, \ldots , x_k) = \prod _{j=1}^k f_j(x_j) c(F_1(x_1), \ldots , F_k(x_k)), \end{aligned}$$

where \(f_j(\cdot )\) is the marginal density function of the j-th component of X and \(c(\cdot )\) is the copula density, that is the k-th order mixed partial derivative of C.

The above representation allows one to model the marginals and the dependence structure of a multivariate distribution separately. Thus, in a parametric set-up, it makes sense to assume specific parametric forms for the marginal distributions of the components and for the copula function in a separate fashion. In addition, from a Bayesian perspective, prior elicitation can be carried out independently. In this sense, the penalized complexity approach is particularly suited to this kind of situation when we interpret the extra parameter as the one governing the dependence structure. This intuition is reinforced by a general result, which we present in Sect. 2.2 and which states that the \(\alpha \)-divergence between a multivariate distribution and its counterpart with independent components does not depend on the marginals: the elicitation of the “copula” parameters, following the penalized complexity approach, can therefore be carried out without considering the specific nature of the marginals.
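As a quick illustration of Sklar's representation, the following R sketch evaluates a joint density through the factorization above and checks it against the density implied by the underlying copula construction. The specific ingredients (a bivariate Gaussian copula with exponential and standard normal marginals, and the evaluation point) are arbitrary choices of this sketch, not part of the paper.

```r
## Minimal sketch: f(x1, x2) = f1(x1) f2(x2) c(F1(x1), F2(x2)) for a bivariate
## Gaussian copula with illustrative marginals (Exp(1) and N(0,1)).

# Gaussian copula density on [0,1]^2 (the expression given later in Eq. (11))
c_gauss <- function(u, v, rho) {
  z1 <- qnorm(u); z2 <- qnorm(v)
  exp(-(rho^2 * z1^2 + rho^2 * z2^2 - 2 * rho * z1 * z2) / (2 * (1 - rho^2))) /
    sqrt(1 - rho^2)
}

rho <- 0.6
x1 <- 1.3; x2 <- -0.4                                  # arbitrary evaluation point
f_sklar <- dexp(x1) * dnorm(x2) * c_gauss(pexp(x1), pnorm(x2), rho)

# Check: (qnorm(F1(X1)), qnorm(F2(X2))) is bivariate normal with correlation rho,
# so the joint density can also be written through that bivariate normal density.
z <- c(qnorm(pexp(x1)), qnorm(pnorm(x2)))
log_phi2 <- -log(2 * pi) - 0.5 * log(1 - rho^2) -
  (z[1]^2 - 2 * rho * z[1] * z[2] + z[2]^2) / (2 * (1 - rho^2))
f_direct <- exp(log_phi2) * dexp(x1) * dnorm(x2) / (dnorm(z[1]) * dnorm(z[2]))
all.equal(f_sklar, f_direct)                           # TRUE up to rounding
```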

There is no consolidated theory of non-informative priors in a copula framework, Guillotte and Perron (2012) being a notable exception in a nonparametric setting. A typical approach for defining a prior over the parameters of a parametric copula consists either in using standard tractable priors or in defining a uniform prior on summaries of the dependence defined on a compact space, such as Spearman’s \(\rho \), Kendall’s \(\tau \), or the correlation parameters in the case of elliptical copulas. The gist of this paper is to adopt a PC perspective in order to propose a general technique for deriving weakly informative priors for the copula parameter. In particular, we illustrate the situation where the copula function is assumed to be Gaussian and discuss several choices of prior for the correlation parameter \(\rho \). The prior distribution we propose can also be used within a hierarchical modelling approach, where a Gaussian copula is instrumental to flexibly build more complex models. For example, Elfadaly and Garthwaite (2017) proposed a prior distribution for the parameters of a multinomial model based on a Gaussian copula representation with beta marginal distributions. See also Wilson (2018) for a vine-copula specification. Similarly, Sharma and Das (2017) proposed Gaussian and t copula priors representing generalizations of lasso, elastic net and g-priors. The methodology presented in this work can be easily adapted—in a hierarchical way—to the above context.

This work is focussed on bivariate copula models, and the goal is to define a prior distribution for a scalar parameter; while extensions to multivariate parameters are conceptually possible, the theory of penalized complexity priors for the multivariate case is not yet fully developed (Simpson et al. 2017). We leave this case to future research.

The rest of the paper is organized as follows. In Sect. 2, we briefly describe the PC prior method and then use it in a copula framework. In Sect. 3, we derive and discuss the PC prior and compare it with some alternative competitors often used in the literature. Section 4 is devoted to numerical comparisons; Sect. 4.1 discusses some issues related to the potential use of PC priors as formal objective priors and to their use in testing. Section 5 concludes the paper.

2 PC priors and copula modelling

2.1 Background about PC priors

In this section, we briefly sketch the idea of PC priors, introduced by Simpson et al. (2017). The construction of a PC prior is principled, with the goal of producing a weakly informative prior where researchers are able to quantify their belief in the simple model through the calibration of a hyper-parameter.

Consider a generic statistical model \({\mathcal {G}} = \{ g(\cdot \,\vert \,{\varvec{\lambda }}), {\varvec{\lambda }} \in \Lambda \}\), with \(\Lambda \subset {\mathbb {R}}^q\), and a potential generalization of \({\mathcal {G}}\), that is a model \({\mathcal {F}} = \{f(\cdot \,\vert \,{\varvec{\lambda }}, \xi ), {\varvec{\lambda }} \in \Lambda , \xi \in \Xi \}\), where \(\Xi \subset {\mathbb {R}}^p\). In practice, we assume there is a value of \(\xi \), say \(\xi _0\), such that

$$\begin{aligned} f(\cdot \,\vert \,{\varvec{\lambda }}, \xi _0) = g(\cdot \,\vert \,{\varvec{\lambda }}), \qquad \forall {\varvec{\lambda }} \in \Lambda . \end{aligned}$$

Simpson et al. (2017) defined four basic principles behind the construction of a PC prior for the generic parameter \(\xi \).

  • Occam’s Razor. A simpler model formulation should be preferred until there is enough evidence for a more sophisticated model. In this setting, the simpler model is the base model where \(\xi =\xi _0\) and the prior will penalize deviations from it.

  • Complexity measure. The prior distribution is defined as a reparameterization of a measure of the increment in complexity, moving from \(M_0\) to \(M_1\). The increased complexity is evaluated in terms of Kullback-Leibler divergence (Kullback and Leibler 1951)

    $$\begin{aligned} KL (f\Vert g)=\int _{\mathcal {X}} f(x\,\vert \,{\varvec{\lambda }}, \xi )\log \biggl (\frac{f(x\,\vert \,{\varvec{\lambda }}, \xi )}{g(x\,\vert \,{\varvec{\lambda }})}\biggr )dx, \end{aligned}$$
    (1)

    where \(g(x\,\vert \,{\varvec{\lambda }})=f(x\,\vert \,{\varvec{\lambda }}, \xi =\xi _0)\). Given that the prior distribution depends on the Kullback-Leibler divergence between \(M_0\) and \(M_1\), it will be concentrated in areas where the loss of information one would incur when using the base model instead of the complex one is small. The Kullback-Leibler divergence is then transformed into the so-called distance \(s(f\Vert g)=\sqrt{2KL (f\Vert g)}\).

  • Constant Rate Penalization. A constant decay-rate r on the distance

    $$\begin{aligned} \frac{\pi _s(s+\nu )}{\pi _s(s)}=r^\nu , \qquad s,\nu \ge 0, \end{aligned}$$
    (2)

    with \(0< r <1\), and where \(\pi _s\) is a prior on the distance, ensures that the change in the prior by increasing the distance of a factor \(\nu \) does not depend on s. This condition implies an exponential prior on s, say \(\pi _s(s)=\theta \exp (-\theta s)\). As a consequence, \(r=\exp (-\theta )\). Notice that \(\pi _s(s)\) has a mode at \(s=0\), which prevents overfitting. The constant decay-rate implies that one is willing to equally penalize each additional portion of distance in the parameter space, independently of the value of \(s=s(\xi )\). This is reasonable when there is not a clear idea about the distance scale. Finally, one defines the PC prior through a simple change of variable

    $$\begin{aligned} \pi (\xi )=\pi _s(s(\xi ))\Bigg \vert \frac{\partial s(\xi )}{\partial \xi }\Bigg \vert . \end{aligned}$$
    (3)

    The resulting prior, being constructed in terms of s, does not depend on the particular parameterization adopted.

  • User-defined scaling. The hyper-parameter \(\theta \) can be chosen in terms of a subjective assumption on a tail event. One elicits a value W such that, for a specific probability level \(\alpha \),

    $$\begin{aligned} \Pr (Q(\xi ) > W )= \alpha , \end{aligned}$$
    (4)

    where \(Q(\xi )\) is a specific transformation of \(\xi \). This is a crucial point: a non-robust choice of the rate parameter may significantly affect inference. An illustrative numerical sketch of the whole construction is given right after this list.
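To make the four principles concrete, the following R sketch builds a PC prior numerically for a toy scalar parameter. The toy model is an assumption of this sketch, not taken from the paper: the more complex model is an exponential distribution with rate \(e^{-\xi }\), the base model corresponds to \(\xi _0=0\), attention is restricted to \(\xi \ge 0\) so that the distance is monotone, and \(Q(\xi )=\xi \) is used in the tail statement (4).

```r
## Illustrative PC-prior construction for a toy model (assumption of this
## sketch): complex model Exp(rate = exp(-xi)), base model xi_0 = 0, xi >= 0.

kl <- function(xi) exp(xi) - xi - 1            # KL divergence from the base model
s  <- function(xi) sqrt(2 * kl(xi))            # distance s(xi) = sqrt(2 KL(xi))
ds <- function(xi, h = 1e-6) (s(xi + h) - s(xi)) / h   # numerical derivative

# change of variable as in Eq. (3): pi(xi) = theta exp(-theta s(xi)) |ds/dxi|
pc_prior <- function(xi, theta) theta * exp(-theta * s(xi)) * ds(xi)

# user-defined scaling as in Eq. (4): with Q(xi) = xi and s monotone,
# Pr(xi > W) = Pr(s(xi) > s(W)) = exp(-theta s(W)) = alpha, which gives theta
W <- 2; alpha <- 0.05
theta <- -log(alpha) / s(W)

# sanity check: the prior mass on xi >= 0 is (approximately) one
integrate(pc_prior, lower = 0, upper = 20, theta = theta)$value  # mass beyond 20 is negligible
```

The same recipe (distance, change of variable, calibration of \(\theta \)) is what the next sections specialize to the copula setting, where the distance turns out to depend on the copula parameter only.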

In the next sections, we describe a hierarchical approach in order to mitigate the impact of the hyper-parameter selection of the PC prior in a generic copula context, with a particular focus on the analysis of the correlation coefficient in a Gaussian copula.

2.2 PC prior for a copula parameter

We now exploit the idea of the PC prior construction to derive a prior on the parameter which regulates the dependence among random variables in a copula model. Consider a random vector \(X = (X_1, \ldots , X_k)\), such that \(X \in {\mathbb {R}}^k\), with continuous cumulative distribution function \(F(x_1, \ldots , x_k)\). In order to assess the strength of dependence, we consider, as base model, the one with independent marginal components

$$\begin{aligned} M_0=\{ f_0(x\,\vert \,{\varvec{\lambda }}, \xi =\xi _0)=f_{X\vert {\varvec{\lambda }}}(x\,\vert \,{\varvec{\lambda }})=\prod _{j=1}^{k} f_j(x_j\,\vert \,\lambda _j), X\in {\mathbb {R}}^k, {\varvec{\lambda }}\in \Lambda \}. \end{aligned}$$

Here, \({\varvec{\lambda }}=(\lambda _1, \dots , \lambda _k)\) denotes the vector of parameters appearing in the marginal components of the density and \(\Lambda \subset {\mathbb {R}}^q\), with q integer, \(q\ge 1\); notice that each \(\lambda _j\) can be either a scalar parameter or a vector of parameters.

The corresponding more complex model assumes a non-constant copula function depending on \(\xi \), that is

$$\begin{aligned} M_1=\{ f_{X\vert {\varvec{\lambda }}, \xi }(x\,\vert \,{\varvec{\lambda }},\xi ), X\in {\mathbb {R}}^k, \xi \in \Xi , {\varvec{\lambda }} \in \Lambda \}, \end{aligned}$$

where \(\Xi \subset {\mathbb {R}}^p\). According to Sklar’s theorem, the joint density can be written as

$$\begin{aligned} f_{X\vert {\varvec{\lambda }}, \xi }(x\,\vert \,{\varvec{\lambda }},\xi )=\prod _{j=1}^{k} f_j (x_j\,\vert \,\lambda _j) c_\xi (F_1(x_1\,\vert \,\lambda _1), \dots , F_k(x_k\,\vert \,\lambda _k)\,\vert \,\xi ). \end{aligned}$$
(5)

Clearly, model \(M_0\) is nested in model \(M_1\), as we retrieve it for \(\xi =\xi _0\).

It is well known that the Kullback-Leibler divergence between two measures is invariant with respect to strictly monotone transformations. This directly implies that, in our case, the divergence between \(M_0\) and \(M_1\) does not depend on the marginal distributions and it is a function of \(\xi \) only.

This result can be further generalized. In fact, the Kullback-Leibler divergence is a member of the \(\alpha \)-divergences family. In Cichocki and Amari (2010), the \(\alpha \)-divergence between two generic absolutely continuous positive measures, p and q, is defined, for \(\alpha \in {\mathbb {R}}\,\backslash \,\{0,1\}\), as

$$\begin{aligned} D_\alpha (p\Vert q)=\frac{1}{\alpha (1-\alpha )}\int _{\mathcal {X}}\left( \alpha p(x)+(1-\alpha )q(x)-p^\alpha (x)q^{1-\alpha }(x)\right) dx. \end{aligned}$$
(6)

Then, it is possible to derive the following theorem.

Theorem 1

(Invariance of \(\alpha \)-divergences wrt the marginals) Let \(f_{1}(x_1, \dots , x_k\,\vert \,{\varvec{\lambda }}, \xi )\) be the generic member of a class of densities \(M_1\), which is assumed to be absolutely continuous with respect to the Lebesgue measure. Let \(f_{0}(x_1,\dots ,x_k)=\prod _{j=1}^{k} f_j(x_j\,\vert \,\lambda _j)\) be the density with independent components, where \(f_j\) is the marginal density of the j-th component of \(f_{1}(\cdot )\). Then, for any value of \(\alpha \), the \(\alpha \)-divergence (6) of \(f_1\) from \(f_0\) does not depend on \({\varvec{\lambda }}\), and

$$\begin{aligned} D_{\alpha }(\xi )=\frac{1}{\alpha (1-\alpha )}\left[ 1-\int _{[0,1]^k} c^\alpha (u_1,\dots ,u_k\,\vert \,\xi )\,du_1\dots du_k \right] , \end{aligned}$$
(7)

where \(c(u_1,\dots ,u_k\,\vert \,\xi )\) is the copula density associated with \(f_1(\cdot )\).

The proof is given in Appendix A. Notice that Theorem 1 is valid for any copula function and any dimension k of the random vector X. It is easy to see that \(D_\alpha (f_{X\vert {\varvec{\lambda }},\xi }\Vert f_0) \rightarrow KL (f_{X\vert {\varvec{\lambda }},\xi }\Vert f_0)\) as \(\alpha \rightarrow 1\). Then, Equation (7) becomes

$$\begin{aligned} \int _{[0,1]^k} c(u_1, \dots , u_k\,\vert \,\xi ) \log c(u_1, \dots , u_k\,\vert \,\xi )\,du_1\dots du_k. \end{aligned}$$
(8)

Following Simpson et al. (2017), we consider a function of the original Kullback-Leibler divergence

$$\begin{aligned} s(\xi )=\sqrt{2KL (\xi )}, \end{aligned}$$
(9)

and assume a constant rate penalization, which induces an exponential prior on \(s(\xi )\).

In view of Theorem 1, \(s(\xi )\) only depends on \(\xi \): this allows an elicitation process on \(\xi \) which is independent of the values of the parameters of the marginal distributions.
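As a sanity check of this invariance, Eq. (8) can be verified by Monte Carlo directly on the copula scale, since the divergence involves the copula density only. The sketch below assumes a bivariate Gaussian copula and uses the closed form \(KL(\rho )=-\tfrac{1}{2}\log (1-\rho ^2)\), which is anticipated here and derived in Sect. 2.3; the value of \(\rho \) and the number of draws are arbitrary.

```r
## Sketch: Monte Carlo check of Eq. (8) for a bivariate Gaussian copula.
set.seed(1)
rho <- 0.7; n <- 2e5

# draw (U, V) from the Gaussian copula: correlate on the normal scale, then map back
z1 <- rnorm(n); z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
u <- pnorm(z1); v <- pnorm(z2)

log_c <- function(u, v, rho) {                  # log Gaussian copula density, Eq. (11)
  q1 <- qnorm(u); q2 <- qnorm(v)
  -0.5 * log(1 - rho^2) -
    (rho^2 * q1^2 + rho^2 * q2^2 - 2 * rho * q1 * q2) / (2 * (1 - rho^2))
}

mean(log_c(u, v, rho))                          # Monte Carlo estimate of KL(rho)
-0.5 * log(1 - rho^2)                           # closed form; the two should agree
```

Transforming (u, v) to arbitrary continuous marginals before the computation does not change the estimate, in line with Theorem 1.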

2.3 PC Prior for the bivariate Gaussian copula model

In this section, we restrict the analysis to the case of a bivariate distribution, where the dependence is tuned by the scalar parameter of a Gaussian copula. Although Theorem 1 holds in any dimension, the theory behind PC priors is well established for scalar parameters only, and we postpone the general case to the discussion in Sect. 5.

Consider a two-dimensional random variable \(X \in {\mathbb {R}}^2\). The Kullback-Leibler divergence between the copula model and the model of independence among the marginal components can now be written as

$$\begin{aligned} KL (\xi )=\int _0^1\int _0^1 c(u,v\,\vert \,\xi ) \log c(u,v\,\vert \,\xi )\,du\,dv. \end{aligned}$$
(10)

We assume that \(c(\cdot , \cdot \,\vert \,\rho )\) is the density of a bivariate Gaussian copula, where \(\rho \in [-1,1]\) is the correlation coefficient, \(u = F_1(x_1)\) and \(v = F_2(x_2)\). Then,

$$\begin{aligned} c (u,v\,\vert \,\rho )=\frac{1}{\sqrt{1-\rho ^2}}\exp \left( -\frac{\rho ^2 \Phi ^{-1}(u)^2+\rho ^2 \Phi ^{-1}(v)^2-2\rho \Phi ^{-1}(u)\Phi ^{-1}(v)}{2(1-\rho ^2)}\right) . \end{aligned}$$
(11)

Of course, the nested simpler model \(M_0\) is obtained by setting \(\rho =0\).

The Gaussian assumption allows one to obtain a closed-form expression for the PC prior of \(\rho \). Starting from Equations (10) and (11), the Kullback-Leibler divergence is given in Guo et al. (2017) as

$$\begin{aligned} KL (f_X\Vert f_0)=-\frac{1}{2}\log (1-\rho ^2), \end{aligned}$$
(12)

which is equivalent to the mutual information of a bivariate normal distribution. Then, the corresponding distance, according to Simpson et al. (2017), is \(s(\rho ) = \sqrt{2KL (\rho )} = \sqrt{-\log (1-\rho ^2)}\). From the above equation, it is apparent that the Kullback-Leibler divergence and the corresponding distance function are symmetric with respect to the base model, i.e. \(\rho =0\). In particular, the Kullback-Leibler divergence is only piece-wise monotone, depending on the sign of \(\rho \). Therefore, in this case, the exponential prior distribution on the distance function must consider the two branches of s separately, that is

$$\begin{aligned} \pi \left( s_i(\rho )\right) =\frac{1}{2}\theta \exp \left( -\theta s_i(\rho )\right) , \qquad i=1,2, \end{aligned}$$

where \(s_1(\rho )\) and \(s_2(\rho )\) are the distances when \(-1\le \rho <0\) and \(0\le \rho \le 1\), respectively. Notice that, if the simpler model corresponded to a value \(\rho \not = 0\), the resulting PC prior would be skewed, reflecting the “asymmetry” of the null model. In our symmetric case, a half exponential distribution is assigned to each branch of \(s(\rho )\). The resulting PC prior distribution for \(\rho \), for a fixed scale hyper-parameter \(\theta \), is then the sum of the two branches

$$\begin{aligned} \pi (\rho )= \pi \left( s_1(\rho )\right) \Bigg \vert \frac{\partial s_1(\rho )}{\partial \rho }\Bigg \vert I_{(-1,0)}(\rho ) + \pi \left( s_2(\rho )\right) \Bigg \vert \frac{\partial s_2(\rho )}{\partial \rho }\Bigg \vert I_{(0,1)}(\rho ) \end{aligned}$$

Since

$$\begin{aligned} \Biggl |\frac{\partial s(\rho )}{\partial \rho } \Biggr |= \frac{|\rho |}{(1-\rho ^2)\sqrt{-\log (1-\rho ^2)}}, \end{aligned}$$

we obtain

$$\begin{aligned} \pi (\rho )&= \frac{1}{2}\theta e^{-\theta \sqrt{-\log (1-\rho ^2)}}\frac{|\rho |}{(1-\rho ^2)\sqrt{-\log (1-\rho ^2)}} ~I_{(-1,0)}(\rho ) \nonumber \\&\quad + \frac{1}{2}\theta e^{-\theta \sqrt{-\log (1-\rho ^2)}}\frac{|\rho |}{(1-\rho ^2)\sqrt{-\log (1-\rho ^2)}} ~I_{(0,1)}(\rho ) \nonumber \\&=\frac{\theta }{2} \exp \Bigl (-\theta \sqrt{-\log (1-\rho ^2)}\Bigr )\frac{|\rho |}{(1-\rho ^2)\sqrt{-\log (1-\rho ^2)}} . \end{aligned}$$
(13)

Equation (13) produces a proper prior, which is symmetric around zero. All odd prior moments are consequently equal to zero.
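A minimal R sketch of Eq. (13) follows. It is purely illustrative (the value of \(\theta \), the number of draws and the interval used in the check are arbitrary): the prior is simulated on the distance scale, exploiting \(s\,\vert \,\theta \sim \mathrm {Exp}(\theta )\) with a random sign for the branch and \(\rho =\pm \sqrt{1-e^{-s^2}}\), and the result is compared with quadrature of the closed-form density.

```r
## Sketch: the PC prior for rho in Eq. (13), for a fixed hyper-parameter theta.
pc_prior_rho <- function(rho, theta) {
  d <- -log(1 - rho^2)                          # d = 2 KL(rho)
  out <- 0.5 * theta * exp(-theta * sqrt(d)) * abs(rho) / ((1 - rho^2) * sqrt(d))
  out[rho == 0] <- 0.5 * theta                  # continuous limit at rho = 0
  out
}

theta <- 1
set.seed(2)
s_draws <- rexp(1e5, rate = theta)              # exponential prior on the distance
sgn <- sample(c(-1, 1), 1e5, replace = TRUE)    # the two symmetric branches
rho_draws <- sgn * sqrt(1 - exp(-s_draws^2))    # invert s(rho) = sqrt(-log(1 - rho^2))

# P(|rho| < 0.5): simulation vs quadrature of Eq. (13); the two should agree
mean(abs(rho_draws) < 0.5)
integrate(pc_prior_rho, lower = -0.5, upper = 0.5, theta = theta)$value
```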

2.4 The Jeffreys’ and arc-sine priors for \(\rho \)

As we illustrated in Sect. 2.3, the PC prior for \(\rho \) can be obtained up to a scalar hyper-parameter \(\theta \), whose value must be tuned subjectively. In the next sections, we will explore, admittedly in a non-standard way, several well-established objective Bayesian techniques for selecting the hyper-prior distribution for \(\theta \).

We consider only a few alternative priors to the PC one, for several reasons. First of all, at least in an estimation context and for scalar parameters, almost any objective approach for selecting a prior would lead to the well-known Jeffreys’ prior (Jeffreys 1946). However, here we are more concerned with Bayesian hypothesis testing based on the Bayes factor, for which a proper prior is needed.

Jeffreys (1961) himself derived the standard improper prior for the correlation coefficient in the bivariate Gaussian model, conditional on the variances, as

$$\begin{aligned} \pi ^J(\rho )\propto \frac{\sqrt{1+\rho ^2}}{1-\rho ^2}. \end{aligned}$$
(14)

Being improper, this prior cannot be used in testing scenarios involving the calculation of the Bayes factor; hence, Jeffreys (1961) also proposed the arc-sine prior, which shows a similar behaviour, although its integral on \([-1,1]\) is finite:

$$\begin{aligned} \pi _{\text {arc-sine}}(\rho )=\frac{1}{\pi }\frac{1}{\sqrt{1-\rho ^2}}. \end{aligned}$$
(15)

The arc-sine prior is not invariant to reparameterizations, and we will not discuss this aspect further in this paper.

3 Choosing the hyper-parameter of the PC Prior

A crucial step in the definition of the PC prior is the selection of the hyper-parameter \(\theta \), in that it establishes how fast the prior shrinks towards the base model. In this regard, Simpson et al. (2017) proposed to introduce a subjective input in the form of a probability statement related to a tail event and stated that the prior distribution is relatively insensitive to the choice of the hyper-parameter \(\theta \), unless it is set to an “extremely poor” value. However, given the artificial nature of the hyper-parameter \(\theta \), prior information about it may be unavailable, or there may be uncertainty on how much the experimenter supports the base model a priori. Figure 1 shows the PC prior in Equation (13) for different choices of \(\theta \).

Instead of selecting a specific value for \(\theta \), one could adopt a hierarchical perspective and assign a hyper-prior to the parameter \(\theta \). We discuss in this section two possible hierarchical strategies, although we remark that only proper hyper-priors for \(\theta \) guarantee to obtain a proper prior for \(\rho \). We illustrate strategies for deriving hyper-prior distributions in the particular case where \(\rho \) is the correlation parameter of a bivariate Gaussian copula. However, the methodology based on intrinsic priors is more general and we prove below that it produces the same intrinsic prior no matter what kind of copula is adopted.

Fig. 1 The Jeffreys’ prior in (14) and the PC prior in (13) for varying \(\theta \)

3.1 Jeffreys’ prior for \(\theta \)

An objective way to derive a hyper-prior distribution for \(\theta \) is to adopt the Jeffreys’ approach (Jeffreys 1946) with some adjustments. A formal derivation of the Jeffreys’ prior for the hyper-parameter would require marginalizing out the parameter of interest \(\rho \), so as to consider a marginal copula model, namely

$$\begin{aligned} c(u,v\,\vert \,\theta ) = \int _{-1}^1 c_\rho (u,v\,\vert \,\rho ) \pi (\rho \,\vert \,\theta ) d\rho , \end{aligned}$$

where \(c_\rho (u,v\,\vert \,\rho ) \) and \(\pi (\rho \,\vert \,\theta ) \) are given, respectively, in (11) and (13). This way we are considering a model with uniform marginals in which the only unknown parameter is \(\theta \). Consequently, the Jeffreys’ prior for the “marginal” model can be obtained as

$$\begin{aligned} \pi ^J(\theta ) \propto \left[ - {\mathbb {E}}_{\theta }\left( \frac{d^2 \log c(u,v\,\vert \,\theta ) }{d \theta ^2}\right) \right] ^{1/2}. \end{aligned}$$

Unfortunately, this derivation does not produce a closed-form expression for \(\pi ^J(\theta )\). Alternatively, one can simply note that \(\theta \) is the rate (inverse scale) parameter of the exponential density of \(s(\rho )\); the natural objective choice is then the scale-invariant improper prior

$$\begin{aligned} \pi ^N(\theta ) \propto {\theta }^{-1}, \end{aligned}$$
(16)

where the superscript N stands for non-informative.
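For completeness, the marginalization above can be carried out numerically. The sketch below is illustrative only (the evaluation point and the value of \(\theta \) are arbitrary, and the functions repeat definitions given in earlier sketches): it evaluates \(c(u,v\,\vert \,\theta )\) by one-dimensional quadrature over \(\rho \), which is how the quantity would have to be handled in practice.

```r
## Sketch: the marginalized copula density c(u, v | theta) of Sect. 3.1,
## obtained by integrating rho out numerically.

c_gauss <- function(u, v, rho) {                # Gaussian copula density, Eq. (11)
  z1 <- qnorm(u); z2 <- qnorm(v)
  exp(-(rho^2 * z1^2 + rho^2 * z2^2 - 2 * rho * z1 * z2) / (2 * (1 - rho^2))) /
    sqrt(1 - rho^2)
}
pc_prior_rho <- function(rho, theta) {          # PC prior, Eq. (13)
  d <- -log(1 - rho^2)
  out <- 0.5 * theta * exp(-theta * sqrt(d)) * abs(rho) / ((1 - rho^2) * sqrt(d))
  out[rho == 0] <- 0.5 * theta
  out
}

c_marg <- function(u, v, theta) {
  integrate(function(r) c_gauss(u, v, r) * pc_prior_rho(r, theta),
            lower = -1, upper = 1)$value
}
c_marg(0.3, 0.8, theta = 1)                     # one evaluation of c(u, v | theta)
```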

3.2 Intrinsic prior for \(\theta \)

In the spirit of making comparison between \(M_0\) and \(M_1\), similarly to the PC prior idea, an alternative way of deriving a hyper-prior for \(\theta \) is provided by the intrinsic prior methodology (Berger and Pericchi 1996). Here, we follow the approach described in Pérez and Berger (2002). Let

$$\begin{aligned} M_1=\bigl \{\pi _{\rho \vert \theta }(\rho \,\vert \,\theta ), \rho \in (-1,1), \theta \in {\mathbb {R}}^+\bigr \}, \end{aligned}$$

where \(\pi _{\rho \vert \theta }(\rho \,\vert \,\theta )\) is the PC prior for \(\rho \) in Equation (13), and

$$\begin{aligned} M_0=\bigl \{\pi _{\rho \vert \theta _0}(\rho \,\vert \,\theta _0), \rho \in (-1,1), \theta _0\in {\mathbb {R}}^+\bigr \}, \end{aligned}$$

where \(\pi _{\rho \vert \theta _0}(\rho \,\vert \,\theta _0)\) is the conditional distribution of \(\rho \) for a specific value \(\theta _0\). Suppose one wants to compare the hypotheses

$$\begin{aligned} H_0:\theta =\theta _0 \quad \text {vs.} \quad H_1:\theta \ne \theta _0. \end{aligned}$$

Starting from the improper pseudo-Jeffreys’ prior in Equation (16), the intrinsic prior \(\pi ^I(\theta )\) is obtained by using a virtual training sample of minimal size, that is the smallest sample size which makes the corresponding posterior proper. Here, the prior for \(\rho \,\vert \,\theta \) plays the role of the likelihood in the standard application of the intrinsic prior methodology. The minimal sample size is equal to one, and the intrinsic prior for \(\theta \) is

$$\begin{aligned} \pi ^I(\theta )=\int _{-1}^1\pi (\theta \,\vert \,\rho _{\ell })\,\pi (\rho _{\ell }\,\vert \,H_0)d\rho _{\ell }, \end{aligned}$$
(17)

where \(\pi (\theta \,\vert \,\rho _{\ell })\) can be interpreted as the posterior distribution of \(\theta \) after observing a virtual sample of size one, \(\rho _{\ell }\), and \(\pi (\rho _{\ell }\,\vert \,H_0)\) is given by (13) by fixing \(\theta \) at \(\theta _0\). The calculation of the intrinsic prior for other copulas implies the modification of the parameter space of reference.

The following theorem provides the expression of the intrinsic prior for any bivariate copula model depending on a scalar parameter \(\rho \), not only the Gaussian one.

Theorem 2

Consider a generic copula density \(c(u,v\,\vert \,\rho )\) and a penalized complexity prior on \(\rho \) given by

$$\begin{aligned} \pi (\rho \,\vert \,\theta ) = \theta \exp \left( - \theta s(\rho ) \right) \left| s^\prime (\rho ) \right| , \end{aligned}$$

where \(s(\rho )\) is the already introduced transformation of the Kullback-Leibler divergence of \(c(u,v\,\vert \,\rho )\) from the independence copula \(c_0(u,v) =1\), that is \(s(\rho ) = \sqrt{2 KL(c \Vert c_0)}\), with \(KL(c \Vert c_0) = \int _{0}^1 \int _{0}^1 c(u,v\,\vert \,\rho ) \log c(u,v\,\vert \,\rho )\,du \,dv \), and \(s^\prime (\rho )\) is the first derivative of \(s(\rho )\) wrt \(\rho \). Then, the intrinsic prior for testing \(\theta =\theta _0\) vs. \(\theta \not = \theta _0\) is

$$\begin{aligned} \pi ^I(\theta )=\frac{\theta _0}{(\theta +\theta _0)^2}, \quad \theta >0, \end{aligned}$$

independently of the specific expression of \(c(u,v\,\vert \,\rho )\).

Although in this paper we are confined to copulas, the theorem above is valid for any PC prior arising from the comparison of two generic nested statistical models, as well as for multivariate copulas which are regulated by a single parameter, as in the case of equicorrelation. The proof is given in Appendix B.
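Theorem 2 can also be checked numerically. With the improper prior (16), the “posterior” \(\pi (\theta \,\vert \,\rho _\ell )\) is an exponential density with rate \(s(\rho _\ell )\), and under \(H_0\) the PC prior of \(\rho _\ell \) corresponds exactly to \(s(\rho _\ell )\sim \mathrm {Exp}(\theta _0)\); hence Eq. (17) can be approximated by averaging \(s\,e^{-\theta s}\) over such draws. The test values of \(\theta \) and \(\theta _0\) below are arbitrary.

```r
## Sketch: Monte Carlo check of Theorem 2, working on the distance scale so
## that no copula-specific quantities are needed.
set.seed(3)
theta0 <- 0.5; theta <- 1.7                   # arbitrary test values
s <- rexp(2e5, rate = theta0)                 # s(rho_ell) with rho_ell ~ pi(. | H_0)
c(monte_carlo = mean(s * exp(-theta * s)),    # approximation of Eq. (17)
  closed_form = theta0 / (theta + theta0)^2)  # Theorem 2
```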

The last step of the procedure consists in selecting the \(\theta _0\) value in a sensible way, since its value is not suggested by the hypotheses system in this setting. Our proposal is to calibrate the value of \(\theta _0\) in order to maximize the prior variance of \(\rho \). The new prior for \(\rho \) is then

$$\begin{aligned} \pi (\rho \,\vert \,\theta _0)=\int _0^\infty \pi ^{PC}(\rho \,\vert \,\theta )\pi ^I (\theta \,\vert \,\theta _0) d\theta , \end{aligned}$$
(18)

and, since \({\mathbb {E}}(\rho \,\vert \,\theta ) = 0, \,\forall \theta \), one has to maximize the quantity

$$\begin{aligned} {\mathbb {V}}\text{ ar }(\rho \,\vert \,\theta _0) = \int _{-1}^1\int _0^\infty \rho ^2\pi ^{PC}(\rho \,\vert \,\theta )\pi ^I (\theta \,\vert \,\theta _0) \,d\theta \,d\rho . \end{aligned}$$

The numerical problem has been tackled using two different methods, namely Brent’s method (Brent 1973) and the Golden Section method, implemented in the R package deconstructSigs. Both produced the same value \({\hat{\theta }}_0 = 0.491\,525\). It should be noticed that the hierarchical model stabilizes the variance of the PC prior for \(\rho \); in fact, without taking into account the intrinsic prior for \(\theta \), the variance is hard to compute, even numerically, especially for values of \(\theta \) between 0 and 1. This is a consequence of the fact that, for these values of \(\theta \), the variance of the PC prior for \(\rho \) increases as the prior spreads out towards the alternative model. Figure 2 compares the Jeffreys’ prior, the arc-sine prior and the hierarchical PC prior with \(\theta _0=0.491\,525\).
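A possible way to evaluate the objective function above is sketched below: the prior variance for a given \(\theta _0\) is computed by nested quadrature, working on the distance scale (where \(\rho ^2=1-e^{-s^2}\) and \(s\,\vert \,\theta \sim \mathrm {Exp}(\theta )\)) for numerical stability. This is an illustrative reconstruction, not the code used for the paper; the resulting function can then be passed to a univariate optimizer over \(\theta _0\), as described in the text.

```r
## Sketch: prior variance of rho under the hierarchical prior (18), for a given
## theta_0, computed on the distance scale (rho^2 = 1 - exp(-s^2)).
prior_var <- function(theta0) {
  e_rho2_given_theta <- function(th) sapply(th, function(t)
    integrate(function(s) (1 - exp(-s^2)) * t * exp(-t * s),
              lower = 0, upper = Inf)$value)
  integrate(function(th) e_rho2_given_theta(th) * theta0 / (th + theta0)^2,
            lower = 0, upper = Inf)$value
}

prior_var(0.491525)   # prior variance at the value of theta_0 reported in the text
```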

Fig. 2 The Jeffreys’ prior in (14), the arc-sine prior in (15), and the hierarchical PC prior in (18) with \(\theta _0=0.491\,525\) for \(\rho \) of the Gaussian copula model

4 Simulation study

In this section, we perform a simulation study in order to compare the frequentist behaviour of the hierarchical PC prior in (18) relative to the natural competitors discussed above. We also analyse the performance of the hierarchical PC prior for the parameter—denoted, with a slight abuse of notation, by \(\rho \)—controlling the strength of association of a Frank copula, against the classical version of the PC prior. In this case, the PC prior derivation does not provide a closed-form expression. Notice also that, in the Frank copula, \(\rho \) does not have the meaning of a correlation coefficient. For details about the Frank copula, we refer the reader to Joe (2014).

For each true value of \(\rho ^*\) in the set \((-0.95,-0.5,0,0.05,0.5,0.95,0.999)\) for the Gaussian copula and in the set (0.01, 1, 5, 10) for the Frank copula, and for each sample size \((n=5,30,100,1\,000)\), we have generated 200 independent samples from a Gaussian copula and a Frank copula, respectively. For each sample, we have computed i) the posterior mean of \(\rho \); ii) the \(95\%\) equal-tailed credible interval; iii) the Bayes factor for testing the hypotheses \(c(u,v) =c_0(u,v)=1\) vs. \(c(u,v)\not =c_0(u,v)\). In addition, we have computed an estimate of the mean squared error of the posterior mean \(\hat{\rho }\), namely

$$\begin{aligned} MSE \bigl (\hat{\rho }(x)\bigr )={\mathbb {E}} \Bigl [\bigl (\hat{\rho }(x)-\rho ^*\bigr )^2\Bigr ]. \end{aligned}$$

The posterior distribution \(\pi (\rho \,\vert \,x)\) was obtained using a standard Metropolis-Hastings-within-Gibbs algorithm, with a truncated Normal proposal in the Random-Walk Metropolis step. The variance of the proposal was calibrated to achieve an acceptance rate of about 30%.
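A simplified sketch of such a sampler is given below for the Gaussian copula with uniform marginals and a fixed \(\theta \) (the hierarchical version would add a further updating step for \(\theta \)). Sample size, true value of \(\rho \), proposal scale, number of iterations and burn-in are arbitrary choices of this sketch.

```r
## Simplified random-walk Metropolis sketch for rho under the PC prior (13),
## Gaussian copula with uniform marginals and fixed theta.
set.seed(4)
n <- 100; rho_true <- 0.5
z1 <- rnorm(n); z2 <- rho_true * z1 + sqrt(1 - rho_true^2) * rnorm(n)
u <- pnorm(z1); v <- pnorm(z2)                     # simulated copula data

log_lik <- function(rho, u, v) {                   # Gaussian copula log-likelihood
  q1 <- qnorm(u); q2 <- qnorm(v)
  sum(-0.5 * log(1 - rho^2) -
      (rho^2 * q1^2 + rho^2 * q2^2 - 2 * rho * q1 * q2) / (2 * (1 - rho^2)))
}
log_prior <- function(rho, theta) {                # log of Eq. (13)
  d <- -log(1 - rho^2)
  log(theta / 2) - theta * sqrt(d) + log(abs(rho)) - log(1 - rho^2) - 0.5 * log(d)
}

theta <- 0.5; sd_prop <- 0.2; n_iter <- 5000
rho_cur <- cor(qnorm(u), qnorm(v))                 # start away from rho = 0
draws <- numeric(n_iter)
for (iter in 1:n_iter) {
  # truncated-Normal random-walk proposal on (-1, 1)
  repeat { rho_prop <- rnorm(1, rho_cur, sd_prop); if (abs(rho_prop) < 1) break }
  z_cur  <- pnorm(1, rho_cur,  sd_prop) - pnorm(-1, rho_cur,  sd_prop)
  z_prop <- pnorm(1, rho_prop, sd_prop) - pnorm(-1, rho_prop, sd_prop)
  log_acc <- log_lik(rho_prop, u, v) + log_prior(rho_prop, theta) + log(z_cur) -
             log_lik(rho_cur,  u, v) - log_prior(rho_cur,  theta) - log(z_prop)
  if (log(runif(1)) < log_acc) rho_cur <- rho_prop
  draws[iter] <- rho_cur
}
mean(draws[-(1:1000)])                             # posterior mean after burn-in
```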

From an “estimation” perspective, and only for the Gaussian copula, we consider the Jeffreys’ prior in (14) as the natural competitor. For the Gaussian copula, our simulation shows that the hierarchical PC prior produces a smaller MSE when the true \(\rho \) is close to zero, as illustrated in Fig. 3.

Fig. 3 Box-plots of the posterior mean, over 200 samples, for different values of \(\rho \) from a Gaussian copula model. Sample sizes, \(n= 5, 30, 100, 1\,000\), are laid out by row

Also, notice how the sampling distributions of the posterior mean under the PC prior tend to be more concentrated around the true value \(\rho ^*\), especially for moderate sample sizes. The Jeffreys’ prior produces smaller values of the mean squared error for intermediate levels of correlation. In this case, a sort of bias-variance trade-off seems to occur: the bias is smaller with the Jeffreys’ prior, while the variance is smaller with the PC prior. For small values of \(\rho ^*\), the PC prior is superior in terms of MSE compared with the Jeffreys’ prior; this can be related to the spike of the PC prior in a neighbourhood of \(\rho =0\) (see Fig. 2).

We have also analysed the sensitivity of inference with respect to the choice of \(\theta \) in the standard implementation of the PC prior. For this purpose, we have run a simulation study using three different choices of \(\theta \), namely 0.1, 1 and 5, and compared the results with the hierarchical approach described in Sect. 3.2 and with the Jeffreys’ prior in Equation (14), the latter only for the Gaussian copula. Results are summarized in Tables 1 and 2, where we report the outputs for several sample sizes (\(n=5, 30, 100\)). In Table 1, related to the Gaussian copula, the Jeffreys’ prior shows the best performance in terms of MSE for large values of \(\vert \rho ^*\vert \); for small \(\vert \rho ^*\vert \), each “subjective” PC prior performs best only in specific cases, while the hierarchical PC prior always performs well and represents the “safer” option. In Table 2, which refers to the Frank copula, we see again that there is no single value of \(\theta \) which makes the subjective PC prior always preferable to the hierarchical one. Therefore, we opt for the latter, where again we maximized the prior variance as in Sect. 3.2 for the Gaussian copula, obtaining a similar value, i.e. \({\hat{\theta }}_0 = 0.549\,829\).

Table 1 Mean squared error of the posterior mean, for different choices of the parameter \(\theta \) in the subjective PC prior in (13), for the hierarchical (objective) PC prior in (18) and for the Jeffreys’ prior in (14) on \(\rho \) of a Gaussian copula
Table 2 Mean squared error of the posterior mean, for different choices of the parameter \(\theta \) in the subjective PC prior and for the hierarchical (objective) PC prior on \(\rho \) of a Frank copula

Table 3, related to the Gaussian copula, reports the coverage probabilities of the Jeffreys’ and hierarchical PC priors. Again, the latter outperforms the former for small \(\vert \rho ^*\vert \), while, for large \(\vert \rho ^*\vert \), the behaviour of the two priors is similar. Table 4 displays the coverage probabilities of the subjective and “objective” PC priors for \(\rho \) of a Frank copula. Again, it is evident how a poor selection of \(\theta \) may affect the estimates, especially for small sample sizes.

Table 3 Coverage probabilities, over 200 samples, for the Gaussian copula model
Table 4 Coverage probabilities and the frequency of times that \(BF _{01}<0.5\), over 200 samples, for the Frank copula model

4.1 Bayesian hypothesis testing

PC priors have not been proposed for testing hypotheses. However, it seems natural to analyse their performance in a Bayesian testing framework, since they are proper and particularly suitable for comparing nested models. Here, we consider the case where \(M_0\) is the nested model, that is \(c(u,v)=c_0(u,v)=1\), versus the alternative \(M_1: \, c(u,v)\not =c_0(u,v)\). In the following, we restrict the comparison to the case of “uniform marginals”. A more general discussion, involving uncertainty in the marginal parameters, would inevitably depend on the specific situation.

The standard tool for Bayesian testing of nested models is the Bayes factor of \(M_0\) vs. \(M_1\), say \(B_{01}\). It is known that, when the models have parameter spaces of differing dimensions, the use of improper priors yields indeterminate answers (Berger and Pericchi 2001). The lack of suitable objective priors for testing has led researchers to circumvent the problem in several different ways, either using conventional priors based on some specific principles (Bayarri et al. 2012), or proposing alternative methods, like the intrinsic prior approach already discussed in Sect. 3.2.

First, for the Gaussian copula model, we compare the use of the hierarchical PC prior, based on the intrinsic approach, with the conventional arc-sine prior defined in Equation (15), which can be considered a “proper” version of the improper Jeffreys’ prior analysed before. Also, we explore the performance of the hierarchical PC prior when the complex model is described by a Frank copula.

Let \(\gamma _0=P(H_0)\) be the prior probability of the null hypothesis and let \(\pi ^{PC}(\rho \,\vert \,\theta _0)\), with \(\theta _0 \approx 0.491 \) for the Gaussian copula, and \(\theta _0 \approx 0.55 \) for the Frank one, be the prior density of \(\rho \) under \(M_1\). The Bayes factor \(B_{01}\) can be written as

$$\begin{aligned} B_{01}=\frac{1}{\int _{\rho } c_\rho (u,v\,\vert \,\rho )\pi ^{PC}(\rho \,\vert \,\theta _0)d\rho }, \end{aligned}$$

and the posterior probability of the simpler model is

$$\begin{aligned} P(M_0\,\vert \,x)=\biggl [1+\frac{1-P(M_0)}{P(M_0)}\frac{1}{B_{01}}\biggr ]^{-1}. \end{aligned}$$

Again, the selection of the hyper-parameter \(\theta \) in a subjective way is crucial in this setting. Choosing a small value of \(\theta \) would produce Bayes factors that support the base model, even when \(M_1\) is true; in particular, in the limiting case of infinite prior variance, the Bayes factor would always select \(M_0\), independently of the observed data. On the other hand, setting a large value of \(\theta \) would make the Bayes factor shrink to 1, independently of the true value of \(\rho \). Our approach based on the hierarchical PC prior is able to alleviate these drawbacks. Table 5 reports the performance of the Bayes factor for each combination of \((n=\{5,30,100\}, \vert \rho ^*\vert )\) for the Gaussian copula model. Similar results, relating to the Frank copula, are reported in Table 4.

The computation of the Bayes factor was performed using a standard hierarchical Monte Carlo approach: we first draw a value \({\tilde{\theta }}\) from \(\pi ^I(\theta )={\theta _0}\big /{(\theta +\theta _0)^2}\) and then generate a value of \(\rho \) from \(\pi ^{PC}(\rho \,\vert \,{\tilde{\theta }})\).
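A sketch of this hierarchical Monte Carlo step follows; the sample size, the true value of \(\rho \), the number of prior draws and the simulated data are arbitrary choices of this sketch. Values of \(\tilde{\theta }\) are drawn by inverting the cdf of \(\pi ^I\), \(\rho \) is drawn on the distance scale, and the marginal likelihood under \(M_1\) is averaged on the log scale for numerical stability.

```r
## Sketch: hierarchical Monte Carlo estimate of B_01 for the Gaussian copula
## model with uniform marginals.
set.seed(6)
theta0 <- 0.491525
n <- 100; rho_true <- 0.5
z1 <- rnorm(n); z2 <- rho_true * z1 + sqrt(1 - rho_true^2) * rnorm(n)
u <- pnorm(z1); v <- pnorm(z2)

log_lik <- function(rho, u, v) {                  # Gaussian copula log-likelihood
  q1 <- qnorm(u); q2 <- qnorm(v)
  sum(-0.5 * log(1 - rho^2) -
      (rho^2 * q1^2 + rho^2 * q2^2 - 2 * rho * q1 * q2) / (2 * (1 - rho^2)))
}

M <- 1e4
uu <- runif(M)
theta_tilde <- theta0 * uu / (1 - uu)             # inverse-cdf draws from pi^I(theta)
s_draws <- pmin(rexp(M, rate = theta_tilde), 5)   # s | theta ~ Exp(theta), capped so |rho| < 1
rho_draws <- sample(c(-1, 1), M, replace = TRUE) * sqrt(1 - exp(-s_draws^2))

ll <- sapply(rho_draws, log_lik, u = u, v = v)
log_m1 <- max(ll) + log(mean(exp(ll - max(ll))))  # log marginal likelihood under M_1
B01 <- exp(-log_m1)                               # under M_0 the likelihood is 1
B01                                               # small values favour M_1
```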

Table 5 The frequency of times that \(BF _{01}<0.5\), over 200 samples, for the Gaussian copula model

Regarding the Gaussian copula, Fig. 2 illustrates how the hierarchical PC prior with \(\theta _0\approx 0.491\), when compared to the arc-sine prior, represents a compromise between “flatness” and the presence of a significant mass in a neighbourhood of the null hypothesis. This appears to be a safer strategy than setting an arbitrary value of \(\theta \), in the absence of specific subjective information.

Tables 4 and 5 report the relative frequencies of times that \(B_{01}< 0.5\). This value has been chosen as a conventional threshold, below which data provide evidence in favour of \(M_1\) (Liseo and Loperfido 2006): when \(\gamma _0=0.5\), \(B_{01}< 0.5\) corresponds to \(P(H_0 \vert \text {data})< 1/3\). In Table 5, the results of the comparison between the hierarchical PC prior and the arc-sine prior show a similar pattern; however, it seems that the PC prior prefers \(M_1\) slightly more often than the arc-sine prior for small values of \(\rho ^*\). Our conclusion is that the PC prior is more sensitive than the arc-sine prior in capturing small dependencies; this behaviour becomes even more evident as the sample size grows.

Our analysis in this section was restricted to comparing two different dependence structures using a copula representation. In a more general setting, when uncertainty about the marginal modelling comes into play, the Bayes factor would also depend on the marginal priors, although, in our experience, this dependence is typically mild. However, also in the general case, the derivation of the hierarchical PC prior would not depend on the marginal modelling.

4.2 Prior on the model space

Testing statistical hypotheses is the most controversial aspect of statistical inference, with at least three major competing schools approaching the problem from different angles and often reaching irreconcilable decisions (Robert 2014). The Bayesian approach has often been criticized because of the potential appearance of the Jeffreys-Lindley paradox. Robert (1993) proposed a solution which consists in calibrating the prior probability associated with \(H_0\) in terms of the variance of the prior under \(H_1\). To address this issue, we discuss here an alternative strategy for selecting the prior probabilities to attach to the competing models, which depends on the particular choice of the hyper-parameter.

It is known (Berk 1966) that, when comparing two misspecified models, the posterior probability tends to accumulate on the model which is closer, in terms of the Kullback-Leibler divergence, to the true one. Following a proposal developed in Villa and Walker (2015), the divergence \(KL (c_\rho (u,v\,\vert \,\rho ) \;\Vert \; c_{\rho _0}(u,v\,\vert \,\rho _0))\) is interpreted as the loss we would incur if model \(M_1\) were removed when it is actually the true model. Here, we derive expressions for the Gaussian copula model. Therefore, the prior expected loss is equal to

$$\begin{aligned}&\int _{-1}^1 KL \bigl (f_X\Vert f_0\bigr ) \pi ^{PC}(\rho \,\vert \,\theta ) d\rho \nonumber \\&\quad =\int _{-1}^1 -\frac{1}{2}\log (1-\rho ^2)\frac{\theta }{2} e^{-\theta \sqrt{-\log (1-\rho ^2)}}\frac{|\rho |}{(1-\rho ^2)\sqrt{-\log (1-\rho ^2)}}d\rho =\frac{1}{\theta ^2}. \end{aligned}$$
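The identity above can be checked numerically: substituting \(t=s(\rho )\), the prior expected loss is half the second moment of an \(\mathrm {Exp}(\theta )\) distribution. A short sketch follows (the value of \(\theta \) is arbitrary), which also evaluates the resulting model prior probabilities.

```r
## Sketch: prior expected KL loss under the PC prior and induced model prior
## probabilities; the substitution t = s(rho) turns the integral over rho into
## an expectation under an Exp(theta) distribution.
theta <- 1.3
expected_loss <- integrate(function(t) 0.5 * t^2 * theta * exp(-theta * t),
                           lower = 0, upper = Inf)$value
c(numerical = expected_loss, closed_form = 1 / theta^2)

gamma_0 <- 1 / (1 + exp(1 / theta^2))          # prior probability of M_0
c(P_M0 = gamma_0, P_M1 = 1 - gamma_0)
```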

The model prior probabilities can then be obtained in terms of the self-information loss function, which represents the loss connected to a probability statement. By equating the self-information and the expected loss, and following Villa and Walker (2015), one has

$$\begin{aligned} 1-\gamma _0(\theta )\propto \exp \left( \frac{1}{\theta ^2} \right) . \end{aligned}$$

Setting \(\gamma _0 (\theta )\propto 1\), one obtains

$$\begin{aligned} \gamma _0(\theta )=\frac{1}{1+\exp \bigl (1/\theta ^2\bigr )}, \qquad 1-\gamma _0(\theta )=\frac{\exp \bigl (1/\theta ^2\bigr )}{1+\exp \bigl (1/\theta ^2\bigr )}. \end{aligned}$$

The above derivation shows how PC priors, although proposed and developed for estimation problems, may also play a role in the search for objective priors for model selection. This possibility will be explored in future work.

5 Discussion

The PC priors methodology is a powerful and useful idea for model construction. In this paper, we have explored its use in the general problem of assessing the need to introduce dependence among the components of a random vector. We have done this by exploiting the copula representation of the distribution of a random vector. We focussed on the bivariate case and, for the sake of simplicity, we presented the case of the Gaussian copula model in Sect. 2.3. We have considered, as a base model, the one with a constant copula density, i.e. \(c_0(u,v)=1\), which implies independence. In particular, for a Gaussian copula, this is equivalent to considering \(\rho =0\). In different contexts, other base models may be considered. For instance, Sørbye and Rue (2017) considered, in an autoregressive setting, as the base model either the one which assumes independence (\(\rho =0\)) or the random walk one (\(\rho =1\)); Franco-Villoria et al. (2019) also considered the case of \(\rho =1\) for varying coefficient models.

There are several advantages in using the proposed PC prior for the problem discussed in the paper. First, the hierarchical PC prior can well be considered objective, and yet it is proper. There are many advantages in having an objective prior with this property, such as its suitability for testing scenarios and the fact that onerous proofs of the propriety of the resulting posterior are avoided. Furthermore, it represents an important step towards the outline of a theory of objective priors for copula modelling which, with the exception of Guillotte and Perron (2012), is still missing. In addition, our approach circumvents the problem of calibrating the shrinkage parameter, which may lead to erroneous inference, especially in testing hypotheses. In the absence of strong prior information on the distance scale, this is a convenient property, which allows an objective approach at the second level of the hierarchy. Second, given the nested nature of the models that are compared, and since the PC prior is a proper probability distribution, its implementation for testing problems becomes natural. Although not specifically designed for this type of scenario, the PC prior methodology can, in our experience, be viewed as intimately connected to the model selection problem. Obviously, this makes sense only when we fix the value of the extra parameter in such a way that the complex model boils down to a simpler model, and not if we arbitrarily pick a value of the extra parameter in the parameter space. As a final remark, the characterization of the mutual information of a random vector as the negative copula entropy plays a key role in the derivation of the prior for any copula model, as it allows one to get rid of the parameters of the marginals. This fact has strong consequences for estimation procedures, since it allows a separate elicitation of the priors for the marginals and for the dependence structure.

In this work, we focussed our attention on the univariate (scalar-parameter) case. Simpson et al. (2017) developed a PC prior for the correlation matrix of a multivariate probit model. In our context, suppose we have a correlation matrix \({\textbf{R}}\) for a Gaussian copula, with elements \(\rho _{ij}\), for \(i,j=1, \ldots , k\), where k is the dimension of the multivariate random variable. Then, it is possible to fix a value \(r \ge 0\) such that \(\varvec{\rho } \in {\mathcal {S}}_r = \{\varvec{\rho } \in {\mathbb {R}}^k : s(\varvec{\rho })=r\}\). The subsets \({\mathcal {S}}_r\) represent a foliation, and the PC approach can be extended by assuming an exponential distribution for \(s(\varvec{\rho })\) and a uniform distribution on the leaves \({\mathcal {S}}_{s(\varvec{\rho })}\). The main difference is that computational techniques may be needed to derive the distribution \(\pi (\varvec{\rho })\). Simpson et al. (2017) provided a prior distribution for a correlation matrix which exploits a different parameterization based on a Cholesky decomposition, which, however, creates a dependence on the ordering. In general, it may not be easy to define a distance \(s(\varvec{\rho })\) involving all the parameters. Sørbye and Rue (2017) proposed to use a sequential approach to define PC prior distributions for the parameters of an AR(p) model: the Kullback-Leibler divergence is computed conditionally on the terms already included in the model, i.e. the divergence for the model including the k-th parameter is calculated with respect to the model based on the first \(k-1\) parameters, with \(\theta _k={\bar{\theta }}\). Therefore, conditional PC priors are defined for each parameter. However, inference may depend on the order chosen for the sequential procedure and, in the setting we are considering in this work, it may not be obvious how to define this order.

In the specific setting of copula models, it is possible to redefine the multivariate copula model through a vine construction (Czado 2019): under differentiability conditions, the joint multivariate density can be written, for \(k=3\), as

$$\begin{aligned} f(x_1, x_2, x_3)&= f_1(x_1) f_2(x_2) f_3(x_3) \times c_{1,2} (F_1(x_1), F_2(x_2)) \\&\quad \times c_{1,3} (F_1(x_1), F_3(x_3)) \times c_{2,3\mid 1} (F_{2\mid 1}(x_2\mid x_1), F_{3\mid 1}(x_3\mid x_1)), \end{aligned}$$

which can be easily generalized to a generic dimension k. Therefore, it is possible to select each conditional model (Czado et al. 2013; Gruber and Czado 2018) and then define a PC prior for the univariate parameter of each bivariate model. Vine copulas tend to be computationally demanding; therefore, it may be necessary to use faster algorithms in high dimensions (Tagasovska et al. 2019). Moreover, while the order of the conditioning should not affect inference, it may be difficult to assess the impact on inference of the univariate PC priors defined for different vine structures. It seems evident that the multivariate extension of the proposed approach is not trivial and, therefore, it is left to future research.