Changing correlations: a flexible definition of non-Gaussian multivariate dependence

Dependencies between variables are often very complex and may, for high values, differ from those for low values. As the normal distribution and the corresponding copula behave symmetrically for low and high values, the frequent application of the normal copula to describe such dependence may be inappropriate. In this contribution a new way of defining high dimensional multivariate distributions with changing correlations is presented. The method can also be used for a flexible definition of tail dependence. Examples of copulas with linearly changing correlations illustrate the methodology. Parameter estimation methods and simulation procedures are discussed. A five dimensional example using groundwater quality data and a four dimensional one using air pollution data illustrate the methodology.


Introduction
The dependencies between two or more variables can be very complex. In the case of environmental variables, physical, chemical and biological processes have a major influence on the variables of interest. These processes are often non-linear and not explicitly understood. Statistical investigations of dependence are frequently based on correlations between the variables of interest. This, however, may lead to a sub-optimal recognition and description of dependencies.
Multivariate distributions relating variables are often considered to be normal, or normal after data transformations. Popular transformations like the logarithmic or Box-Cox transformations transform the one dimensional marginals to normal, which does not imply that the bi- or multivariate marginal distributions also become normal. Relationships between variables are often more complex, even non-monotonic, and frequently deviate from normal.
Copulas (Sklar 1959) offer an elegant way to describe complex relationships between continuous variables. They enable the investigation of the dependence independently of the one dimensional marginal distributions. A large number of theoretical copulas are described in publications such as Joe (1997) or Nelsen (1999).
Copulas are frequently used in many different disciplines such as finance, economics, environmental studies and hydrology. In hydrology they are used for describing multivariate flood frequencies (Gräler et al. 2013), relationships between flood characteristics (Chen and Guo 2019), precipitation (Favre et al. 2018) or drought (Won et al. 2020), to mention just a few. In Bárdossy (2006) copulas were used for the spatial statistics of groundwater quality parameters. In Brunner et al. (2019) the Fisher copula was used to investigate the complex dependencies of flood occurrences.
Some well-known copulas are derived from known multivariate distributions such as the Gaussian or the t-distribution. Another way to construct complex dependencies is to use vine copulas (Czado and Nagler 2022). While vine copulas offer a very flexible way to describe dependence, their construction and application in high dimensional cases is relatively complicated.
The purpose of this paper is to introduce a very flexible family of distributions with value dependent (asymmetrical) dependence and varying tail dependence. The paper is divided into eight sections. After the introduction, the definition of copulas is presented and a new family of multivariate distributions with non-Gaussian copulas is introduced. Parameter estimation issues and simulation procedures are then presented. Theoretical examples illustrate the flexibility of the methodology. Two data sets, one five dimensional on groundwater quality and one four dimensional on air pollution, serve as real-life examples. A short discussion and conclusions section completes the paper.

Copulas
A copula C is defined as a multivariate distribution on the n dimensional unit cube,

C : [0, 1]^n → [0, 1],    (1)

which has uniform one dimensional marginals. Copulas are related to multivariate distributions through Sklar's theorem (Sklar 1959): each multivariate distribution F(t_1, ..., t_n) can be represented with the help of a copula,

F(t_1, ..., t_n) = C(F_{t_1}(t_1), ..., F_{t_n}(t_n)),    (2)

where F_{t_i}(t) represents the i-th one dimensional marginal distribution of the multivariate distribution. If the marginal distributions are continuous, then the copula C in (2) is unique. Hence copulas can be regarded as the pure expression of the dependence, without the influence of the marginal distributions. Copulas can be defined explicitly using their density functions; another possibility is to define copulas from multivariate distributions by inverting (2), so that the copula is given by

C(u_1, ..., u_n) = F(F_{t_1}^{-1}(u_1), ..., F_{t_n}^{-1}(u_n)).    (3)

Many well known copulas, such as the normal, the t- and the skew-normal copula, are constructed using this method.
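The inversion in (3) can be sketched numerically. The following minimal Python illustration (NumPy and SciPy assumed; the function name is ours) samples from the normal copula by drawing from a multivariate normal distribution and passing each marginal through the standard normal CDF, the probability integral transform:

```python
import numpy as np
from scipy.stats import norm

def normal_copula_sample(R, n, seed=None):
    """Sample from the copula of a multivariate normal with correlation
    matrix R: draw X ~ N(0, R), then U = Phi(X) has uniform marginals
    while retaining the dependence structure of X."""
    rng = np.random.default_rng(seed)
    k = R.shape[0]
    x = rng.multivariate_normal(np.zeros(k), R, size=n)
    return norm.cdf(x)

# example: a bivariate normal copula with correlation 0.6
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])
u = normal_copula_sample(R, 5000, seed=42)
```

A scatter plot of the columns of `u` shows the familiar symmetric, lens-shaped cloud of the Gaussian copula.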

Distributions with value dependent correlations
In this paper a very flexible way of defining multivariate distributions, which lead to copulas with interesting non-Gaussian properties, is presented. The basic idea behind the construction is to gradually change the dependence structure depending on the values of the variables. As described in Guthke and Bárdossy (2012), one obtains very similar spatial fields if the same set of random numbers is used to generate them. A similar methodology was used in Bárdossy and Pegram (2012) to exchange correlation structures of simulated precipitation. This idea, combined with continuity, can be used to define random variables with a changing dependence structure. Formally, a function R(s) of correlation matrices is called continuous if each element R(s)_{i,j} is a continuous function of s.

Definition: Let R(s), s in [0, 1], be a continuous function of correlation matrices of dimension k × k, and let X(s) be a k-dimensional normal random vector with correlation matrix R(s) and standard normal marginal distributions. X(s) can be coupled with the help of independent standard normal variables U = (U_1, ..., U_k) in the form

X(s) = B(s) · U  with  B(s) · B(s)^T = R(s).    (4)

If the correlation matrices R(s) are continuous in s, then U defines a set of interdependent random variables X(s) which are continuous in s. A random vector Z = (Z_1, ..., Z_k) is defined component-wise by the first intersection of X_i(s) with the diagonal:

Z_i = X_i(s_i)  with  s_i = min{ s in [0, 1] : X_i(s) = Φ^{-1}(s) }.    (5)

Note that as Φ^{-1}(0) = -∞ < X(0)_i and Φ^{-1}(1) = +∞ > X(1)_i, and X(s)_i is by definition a continuous function for any i and U, the above definition leads to well defined Z_i-s. This k-dimensional vector variable has, by this definition, a dependence structure which is different for small and large values, resulting in a non-Gaussian multivariate distribution. Figure 1 explains the construction: for a given vector of random numbers (u_1, ..., u_k) and for two selected pairs of variables (i_1, i_2) and (j_1, j_2), the corresponding (X_{i_1}(s), X_{i_2}(s)) and (X_{j_1}(s), X_{j_2}(s)) are plotted as functions of Φ^{-1}(s).
The (Z_{i_1}, Z_{i_2}) and (Z_{j_1}, Z_{j_2}) values correspond to the first intersections of the diagonal with the (X_{i_1}(s), X_{i_2}(s)) and (X_{j_1}(s), X_{j_2}(s)) curves. These curves are continuous due to the continuity of the function of correlation matrices, and the intersection with the diagonal yields the value assigned to each Z_i. As the figure shows, the two pairs obtain their values from different X(s)-s.
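The definition of Z can be sketched directly with a root finder (a minimal Python sketch, assuming NumPy/SciPy; the symmetric eigendecomposition root is one illustrative choice of B(s), and the root finder returns one intersection, which coincides with the first one whenever the crossing is unique):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def sqrt_corr(R):
    """A symmetric square root B with B @ B.T == R (eigendecomposition)."""
    w, V = np.linalg.eigh(R)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def draw_Z(R_of_s, k, rng):
    """One realization of Z: with X(s) = B(s) @ u, each Z_i is the value of
    X_i at an intersection with the diagonal, i.e. a root of
    X_i(s) - Phi^{-1}(s) = 0 (unique-crossing case assumed; a fine grid
    scan would be needed to isolate the *first* of several crossings)."""
    u = rng.standard_normal(k)
    # bracket so that Phi^{-1}(lo) < X_i everywhere and Phi^{-1}(hi) > X_i:
    # since rows of B have unit norm, |X_i(s)| <= ||u|| for every s
    bound = np.linalg.norm(u) + 1.0
    lo, hi = norm.cdf(-bound), norm.cdf(bound)
    z = np.empty(k)
    for i in range(k):
        f = lambda s: (sqrt_corr(R_of_s(s)) @ u)[i] - norm.ppf(s)
        z[i] = norm.ppf(brentq(f, lo, hi))
    return z

# example: bivariate correlation rising linearly from 0.1 to 0.9
R_of_s = lambda s: np.array([[1.0, 0.1 + 0.8 * s],
                             [0.1 + 0.8 * s, 1.0]])
rng = np.random.default_rng(7)
z = draw_Z(R_of_s, 2, rng)
```

For a constant R(s) = R the curves X_i(s) are flat, so the intersection simply returns X_i itself, recovering the multivariate normal case.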
If R(s) = R is constant for all s values, then the resulting random variable is multivariate normal with the correlation matrix R.

Construction of continuous R(s) functions
For the definition of the random variables in equation (5), a continuous correlation matrix valued function is required. The construction of such a function is a non-trivial task, as valid correlation matrices must not have negative eigenvalues.

Construction using the square roots of the matrices
In order to define such matrices one can use the fact that C is a covariance matrix if and only if it can be written as the product of a matrix Y and its transpose:

C = Y^T · Y

(Y is a square root of the covariance matrix.) If Y(s) is a continuous matrix valued function, then C(s) = Y(s)^T · Y(s) is also a continuous matrix valued function, and all these matrices are valid covariance matrices. By defining R(s) = (r(s)_{i,j}) from C(s) = (c(s)_{i,j}) as

r(s)_{i,j} = c(s)_{i,j} / sqrt(c(s)_{i,i} · c(s)_{j,j}),    (6)

one obtains a continuous matrix valued function of correlation matrices, due to the continuity of the transformation in (6). The general construction uses infinitely many matrices; for practical use one has to find simplifications. The simplest construct is to use two parameters for each pair of variables, a starting and a final correlation, called the two-correlations model in the subsequent text. Formally, let R(0) and R(1) be two valid correlation matrices with square roots Y(0) and Y(1). Then

Y*(s) = (1 - s) · Y(0) + s · Y(1)    (7)

can be used to define the covariance matrices C(s) = Y*(s)^T · Y*(s). The correlation matrices R(s) corresponding to these covariance matrices form a continuous function of correlation matrices connecting R(0) and R(1). This model is called the two-correlations linear model. The construction can be generalized to a set of m + 1 correlation matrices R(s_0), ..., R(s_m) with breakpoints 0 = s_0 < s_1 < ... < s_m = 1, the (m + 1)-correlation model:

Y*(s) = ((s_{j+1} - s) · Y(s_j) + (s - s_j) · Y(s_{j+1})) / (s_{j+1} - s_j)  for  s_j ≤ s ≤ s_{j+1},    (8)

which defines the sequence of covariance matrices, and subsequently a continuous sequence of correlation matrices.
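The two-correlations construction of (6) and (7) can be sketched as follows (Python with NumPy assumed; the symmetric eigendecomposition square root is one of several valid choices of Y):

```python
import numpy as np

def sqrt_corr(R):
    """A matrix square root Y with Y @ Y.T == R (symmetric choice)."""
    w, V = np.linalg.eigh(R)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def R_linear(s, R0, R1):
    """Two-correlations linear model: interpolate the square roots,
    Y*(s) = (1-s) Y(0) + s Y(1), form C(s) = Y* @ Y*.T, and renormalize
    to a correlation matrix as in Eq. (6)."""
    Y = (1.0 - s) * sqrt_corr(R0) + s * sqrt_corr(R1)
    C = Y @ Y.T
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

R0 = np.array([[1.0, 0.1], [0.1, 1.0]])
R1 = np.array([[1.0, 0.9], [0.9, 1.0]])
Rh = R_linear(0.5, R0, R1)
```

By construction every R_linear(s, ...) is positive semi-definite with unit diagonal, so the whole path consists of valid correlation matrices.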
For any continuous function R of correlation matrices and any continuous function h : [0, 1] → [0, 1], R(h(t)) is also a continuous function of correlation matrices. This new matrix function defines a different random variable Z^(h) with a different copula. The choice of the function h(t) can change the shape of the corresponding copula even if all correlation matrices remain the same. The one-sided derivatives of the transformation function h at 0 and at 1, h'(0+) and h'(1-), are responsible for the tail behavior of the corresponding copula.
For those pairs where the starting matrix R(0) and/or the ending matrix R(1) contains off-diagonal values equal to 1, lower and/or upper tail dependence may occur. The value of the tail dependence can be anything between 0 and 1, adjusted via the speed of convergence of s to 0 or 1, respectively. The corresponding proof is in the Appendix of this paper.

Construction using spatial statistics
In this case the correlation matrices are constructed using curves in an m dimensional space. All curves are parameterized with s in [0, 1], and the correlation matrices are defined using a stationary spatial correlation function.
Let y_1(s), ..., y_k(s) be curves such that y_i(s) is a point in an m-dimensional space (m < k) and y_i(s) is continuous in s. If C(h) is a valid continuous correlation function in space, then

R(s)_{i,j} = C(|y_i(s) - y_j(s)|)

is a valid correlation matrix, and due to the continuity of the y_i(s)-s the corresponding matrices are continuous in s.
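This spatial construction can be sketched as follows (Python with NumPy assumed; the exponential correlation function and the straight-line curves are illustrative choices, not prescribed by the text):

```python
import numpy as np

def R_from_curves(s, curves, corr_fn):
    """Spatial construction: R(s)_ij = C(|y_i(s) - y_j(s)|) for curves
    y_i(s) in R^m and a stationary correlation function C(h)."""
    pts = np.array([y(s) for y in curves])               # (k, m) positions at s
    h = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return corr_fn(h)

# illustrative: exponential correlation function, straight-line curves in R^2;
# coincident points (here at s = 0) give perfect pairwise correlation
corr_exp = lambda h: np.exp(-h)
curves = [lambda s: np.array([0.0, 0.0]),      # fixed at the origin
          lambda s: np.array([s, 0.0]),        # drifts away along x
          lambda s: np.array([0.0, 2.0 * s])]  # drifts away faster along y

R_start = R_from_curves(0.0, curves, corr_exp)
R_end = R_from_curves(1.0, curves, corr_exp)
```

Positive definiteness of the resulting matrices is inherited from the validity of the spatial correlation function, so no separate eigenvalue check is needed.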

Simulation
The simulation of a realization of the model is quite simple: one only has to know the correlation matrix for each s and to simulate k independent standard normal variables.
The exact simulation can be done using the following procedure:
1. Select the starting R(0) and ending R(1) correlation matrices.
2. Draw k independent standard normally distributed random numbers (u_1, ..., u_k).
3. Solve equation (5) for s_i for each i and set z_i = X_i(s_i).
This procedure is slightly slower than an approximate simulation using a discrete set of possible s values. For the simulation of the linear model (7) the following step-by-step procedure can be used.
1. Select the starting R(0) and ending R(1) correlation matrices.
2. Calculate the square roots of the starting and ending correlation matrices.
3. Select a set of s values 0 = s_0 < s_1 < ... < s_m = 1.
4. Calculate the in-between correlation matrices for each s_i using linear interpolation of the square roots and renorming.
5. Draw k independent standard normally distributed random numbers (u_1, ..., u_k).
6. For each variable i the simulated value is x_i = X_i(s_j) at the first grid point s_j for which X_i(s_j) ≤ Φ^{-1}(s_j); (x_1, ..., x_k) is the simulated member.
7. Repeat steps 5-6 N times.
Both algorithms are very simple, and large samples can be generated with little computational effort.
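The step-by-step procedure above can be sketched in Python (NumPy/SciPy assumed; the grid resolution and the fallback for the rare draw without a grid crossing are implementation choices of this sketch):

```python
import numpy as np
from scipy.stats import norm

def msqrt(R):
    """Symmetric matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(R)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def simulate_two_corr(R0, R1, n_samples, n_grid=201, seed=None):
    """Approximate simulation of the two-correlations linear model on a
    discrete grid of s values (steps 1-7 of the procedure above)."""
    rng = np.random.default_rng(seed)
    k = R0.shape[0]
    s = np.linspace(0.0, 1.0, n_grid)
    # steps 2-4: interpolate square roots, renormalize, take matrix roots
    Y0, Y1 = msqrt(R0), msqrt(R1)
    B = np.empty((n_grid, k, k))
    for g, sg in enumerate(s):
        C = (1.0 - sg) * Y0 + sg * Y1
        C = C @ C.T
        d = np.sqrt(np.diag(C))
        B[g] = msqrt(C / np.outer(d, d))
    diag = norm.ppf(np.clip(s, 1e-12, 1.0 - 1e-12))   # Phi^{-1}(s) on the grid
    out = np.empty((n_samples, k))
    for n in range(n_samples):
        u = rng.standard_normal(k)                    # step 5
        X = B @ u                                     # X(s) on the grid, (n_grid, k)
        for i in range(k):                            # step 6: first diagonal crossing
            below = X[:, i] <= diag
            j = np.argmax(below) if below.any() else n_grid - 1
            out[n, i] = X[j, i]
    return out

R0 = np.array([[1.0, 0.1], [0.1, 1.0]])
R1 = np.array([[1.0, 0.9], [0.9, 1.0]])
sample = simulate_two_corr(R0, R1, 2000, seed=3)
```

With R(0) = R(1) the grid search returns the constant X exactly, so the routine degenerates to an ordinary multivariate normal simulator, which provides a convenient correctness check.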

Examples
As a first example consider a bi-variate distribution with standard normal marginals and correlation matrices changing linearly according to (7). The resulting Pearson correlation of the two variables is 0.60. For comparison, the bi-variate normal distribution with the same Pearson correlation (0.60) is considered. For both bi-variate distributions a sample of N = 2000 was generated. Figure 2 shows the corresponding empirical copulas. One can see the effect of the changing correlations, leading to a weak dependence for the low values and a strong dependence for the high values.
Another example is a tri-variate distribution with two variables being independent, the third having a lower tail dependence with one of them and an upper tail dependence with the other. The value of both tail dependences can be arbitrary. This distribution can be constructed using correlation matrices R(0) and R(1) with suitably chosen unit off-diagonal entries (cf. Sect. 4). Figure 3 shows the result of a simulation with N = 3000 points. One can see that the variables Z_1 and Z_2 are independent, while Z_1 and Z_3 show upper tail dependence and Z_2 and Z_3 show lower tail dependence. Such copulas are difficult to construct with other methods. Note that the exact value of the tail dependence is determined by the speed of convergence to the matrices R(0) and R(1).
To show the high flexibility of the copulas obtained by this construction, two three-correlations examples in the sense of (8) are shown, each corresponding to a set of three correlation matrices. The first shows a relationship weaker than normal dependence for the medium values. For the second the relationship is reversed: the medium values show a stronger dependence than the extremes (Figure 4). The construction of similar examples in higher dimensions is not difficult; one only has to take care that the specified matrices are valid correlation matrices.

Parameter estimation
The parameters of the distribution depend on how the changing correlation structure of the random variable is chosen. As there is no general analytic expression for the density and distribution function of the corresponding random variable, a straightforward maximum likelihood estimation is not possible; instead, specific procedures have to be used. For the simplest two-correlations linear model, the two matrices R(0) and R(1) have to be estimated. This can be done in a pairwise manner, as in the case of the multivariate normal distribution, since the random variable inherits the pairwise definition of the dependence. For each pair of variables (i, j) the two correlation coefficients ρ_{i,j}(0) and ρ_{i,j}(1) have to be estimated. For this purpose two parameters describing properties of the dependence are needed. One possibility is to pick two classical dependence measures from Pearson's correlation of the normal score transformed data, Spearman's rank correlation or Kendall's τ. However, these measures all focus on the strength of the dependence and not on value-related differences in dependence. As an alternative, one may use the measure of dependence asymmetry introduced for spatial statistics in Bárdossy (2008), equation (14), where n is the number of samples and F_i and F_j are the empirical distribution functions of variables i and j. This measure describes the difference between the dependence of the high values and that of the low values. It is well suited for variables with positive dependence; for negative dependence it is better to first invert one of the variables.
The bi-variate two-correlations linear model is the simplest model. Its parameters can be estimated using the Pearson correlation and the dependence asymmetry. The estimation can be done using large simulated samples. The algorithm is as follows:
1. The observed data are transformed to normal using the normal score transformation.
2. The Pearson correlation ρ and the asymmetry a of the transformed data are calculated.
3. A pair of correlations (r(0), r(1)) is selected.
4. A random sample of size N of the two-correlations model is simulated with the correlations (r(0), r(1)), denoted as S_Z = {(z*_1(n), z*_2(n)); n = 1, ..., N}.
5. The correlation ρ* and the asymmetry a* of the simulated data are calculated.
6. The difference function D(ρ*, a*) = (ρ - ρ*)^2 + (a - a*)^2 is minimized using an appropriate algorithm (for example the steepest descent method). In the minimization, a new random sample of size N is generated at each evaluation. The optimal (r(0), r(1)) is taken as the parameters of the model.
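A minimal version of this estimation loop can be sketched as follows (Python with NumPy/SciPy assumed). Two caveats: the asymmetry is written here in the assumed form a = mean((F_i + F_j - 1)^3), since equation (14) is not reproduced in this text and the exact expression may differ; and a coarse grid search replaces the steepest descent step for simplicity:

```python
import numpy as np
from scipy.stats import norm, rankdata

def _sqrt2(r):
    """Closed-form square root [[a, b], [b, a]] of [[1, r], [r, 1]]."""
    a = 0.5 * (np.sqrt(1.0 + r) + np.sqrt(1.0 - r))
    b = 0.5 * (np.sqrt(1.0 + r) - np.sqrt(1.0 - r))
    return a, b

def simulate_pair(r0, r1, N, rng, n_grid=101):
    """Bivariate two-correlations linear model, vectorized grid simulation."""
    s = np.linspace(0.0, 1.0, n_grid)
    a0, b0 = _sqrt2(r0)
    a1, b1 = _sqrt2(r1)
    A = (1 - s) * a0 + s * a1
    Bb = (1 - s) * b0 + s * b1
    r_s = 2 * A * Bb / (A * A + Bb * Bb)      # renormalized correlation path
    a, b = _sqrt2(r_s)                        # roots of R(s) on the grid
    U = rng.standard_normal((N, 2))
    X = np.stack([np.outer(U[:, 0], a) + np.outer(U[:, 1], b),
                  np.outer(U[:, 0], b) + np.outer(U[:, 1], a)], axis=-1)
    diag = norm.ppf(np.clip(s, 1e-12, 1 - 1e-12))
    j = np.argmax(X <= diag[None, :, None], axis=1)   # first diagonal crossing
    return np.take_along_axis(X, j[:, None, :], axis=1)[:, 0, :]

def asymmetry(u, v):
    """Dependence asymmetry of rank-transformed data; ASSUMED form
    a = mean((F_i + F_j - 1)^3), standing in for Eq. (14)."""
    return np.mean((u + v - 1.0) ** 3)

def fit_two_corr(x, y, N=4000, step=0.2, seed=1):
    """Steps 1-6 above: match Pearson correlation of the normal scores and
    the asymmetry via a coarse grid search over (r(0), r(1))."""
    n = len(x)
    u, v = rankdata(x) / (n + 1), rankdata(y) / (n + 1)
    xs, ys = norm.ppf(u), norm.ppf(v)                  # step 1: normal scores
    rho, a = np.corrcoef(xs, ys)[0, 1], asymmetry(u, v)  # step 2
    rng = np.random.default_rng(seed)
    best, best_D = None, np.inf
    for r0 in np.arange(-0.8, 0.81, step):             # steps 3-6
        for r1 in np.arange(-0.8, 0.81, step):
            z = simulate_pair(r0, r1, N, rng)
            uz = rankdata(z[:, 0]) / (N + 1)
            vz = rankdata(z[:, 1]) / (N + 1)
            D = ((rho - np.corrcoef(z[:, 0], z[:, 1])[0, 1]) ** 2
                 + (a - asymmetry(uz, vz)) ** 2)
            if D < best_D:
                best, best_D = (r0, r1), D
    return best
```

Because every objective evaluation uses a fresh random sample, the objective is noisy; as the text notes, large N keeps this noise small.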
As the simulation of the two-correlations model is very simple and fast, very large sample sizes N can be generated to ensure that the parameter estimates become stable. The Chebyshev inequality can be used to estimate the appropriate sample size.
As the above procedure is computationally intensive, a numerical approximation of the mapping G : (r(0), r(1)) → (ρ, a) can be computed on a regular grid of (r(0), r(1)) values via simulation. The resulting table can then be used to estimate the two parameters of the two-correlations model. Figure 5 shows the Pearson correlation and the asymmetry as functions of the lower (r(0)) and the upper (r(1)) correlation for the two-correlations linear model. The two measures are nearly orthogonal, allowing a simple estimation of r(0) and r(1) from the observed Pearson correlation and asymmetry.
For the multivariate case the parameters of the two correlations model are estimated in a pairwise manner as in the case of the multivariate normal distribution.
An interesting case is one where the correlations change their sign: for example, low values show a negative correlation while high values are positively correlated. The overall Pearson correlation of such variables is usually close to zero, but the dependence is present and can be detected by other measures, such as entropy.
Note that the number of parameters of the simplest two-correlations linear model is only twice that of the multivariate normal distribution.
For the estimation of the parameters of more complex dependence models, additional measures have to be used. These could correspond to higher moments. As an alternative, one may also use a set of indicator correlations, defined for thresholds 0 < h < 1 as the correlations of the indicator variables of the events F_i(x_i) < h and F_j(x_j) < h. The indicator correlations considered as a function of h can also be used to see whether a dependence is symmetrical or not. For copulas like the Gaussian or the t-copula these functions are symmetrical around 0.5, i.e. the indicator correlations ρ_I(h) and ρ_I(1-h) are equal.
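An empirical indicator correlation is straightforward to compute from ranks (a minimal Python sketch, assuming NumPy/SciPy; the data in the example are synthetic):

```python
import numpy as np
from scipy.stats import rankdata

def indicator_correlation(x, y, h):
    """rho_I(h): Pearson correlation of the indicators 1{F_x(x) < h} and
    1{F_y(y) < h}, with F estimated by empirical ranks."""
    n = len(x)
    u = rankdata(x) / (n + 1)
    v = rankdata(y) / (n + 1)
    return np.corrcoef(u < h, v < h)[0, 1]

# symmetry check for a Gaussian-type sample: rho_I(h) vs rho_I(1 - h)
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
y = 0.6 * x + 0.8 * rng.standard_normal(5000)   # bivariate normal, corr 0.6
lo = indicator_correlation(x, y, 0.25)
hi = indicator_correlation(x, y, 0.75)
```

For this Gaussian sample `lo` and `hi` agree up to sampling noise; a clear gap between rho_I(h) and rho_I(1 - h) signals an asymmetric, non-Gaussian dependence.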
The parameter estimation of the more complex models is also based on large simulated random samples. In this case a difference function combining the deviations of the chosen dependence measures should be minimized. For the (m + 1)-correlation model with m > 1 the breakpoints s_i also have to be estimated. This changes the parameter estimation problem, as a pairwise estimation is not possible due to the common breakpoints. In order to simplify this problem, it is reasonable to select, for a given m, s_j = j/m for j = 0, ..., m. Note that the third example in Sect. 5 defines random variables whose dependence is symmetrical with respect to the values; thus the asymmetry measure defined in (14) cannot be used for the estimation of the parameters. Instead one can use indicator correlations for the parameter estimation.

Application
As a first application, groundwater quality data from the state of Baden-Württemberg in Germany are considered. In the framework of regular groundwater observation, samples from more than 2300 wells were collected and analysed for their chemical composition. The parameters selected are chloride, nitrate, pH, sulfate and oxygen. The basic statistics of the data are listed in Table 1. The data are highly skewed, with the exception of pH, which has a small negative skew. Due to the high skewness, the Pearson correlation of the raw data is strongly influenced by the large values and is thus an imperfect description of the strength of the pairwise dependence. In order to investigate the interdependence between the different parameters, the data were transformed to normal using the normal score transformation. As a next step, the Pearson correlations (Table 2) and the dependence asymmetry according to equation (14) were calculated for each pair of parameters. In order to investigate the Gaussianity of the dependence, the asymmetry of the Gaussian distribution was calculated N = 1000 times for simulated samples of the same size and with the same Pearson correlations. The corresponding asymmetries were calculated and the 95% confidence interval was identified. For all 10 pairs of parameters the observed asymmetry lay outside the confidence interval of the Gaussian with the same correlation, indicating that the dependence is not normal.
A two-correlations linear model was fitted to the data using the Pearson correlations and the asymmetries as target measures. The two correlation matrices R(0) and R(1) were estimated in a pairwise manner (Table 3).
It is interesting to observe that the Pearson correlation between pH and sulfate is very low, while the two-correlations model shows a relationship with changing sign: the lower correlation (corresponding to s = 0) is 0.70, while the upper one (corresponding to s = 1) is -0.52. This changing relationship cannot be captured by the normal copula, nor by other commonly used copulas. The indicator correlations shown in Fig. 6 confirm the changing relationship between the two variables, which is reasonably well captured by the two-correlations model but not by the normal model. This leads to a loss of information when the normal copula based model is applied. A three-correlations model was also fitted to the data, such that only the relationship between pH and sulfate was altered. The three correlations were assessed using the indicator correlations. The new relationship is 0.85 → 0.00 → -0.30 (corresponding to s_i = 0, 0.5 and 1). Note that the correlation for the low values (s = 0) increased, but decreases faster. Figure 6 shows the improvement of the fit to the indicator correlations.
A comparison of the two dimensional marginals of the simulated data with the observed ones shows that in all cases the Kolmogorov distance between the simulated two-correlations model and the observations is lower than for the normal copula based simulations.
For the second example, air quality measurements taken near Zürich-Schimmelstrasse in Switzerland were used. Four parameters, NOx, SO2, NO2 and PM10, were selected. These data are publicly available for the period 2011-2021 at opendata.swiss. Basic statistics of the data are listed in Table 4. The Pearson correlations of the normal score transformed data are given in Table 5; the correlations are higher than those of the groundwater example.
The matrices of the two-correlations model were fitted to the data using the Pearson correlation and the asymmetry as measures, with the algorithm described in Sect. 6. The fit was done numerically using large simulated samples. The pairwise calculated asymmetries showed that, out of the 6 pairs, 4 differ significantly from the normal. The fitted model in this case contains two pairs for which r_{i,j}(0) = r_{i,j}(1). In order to investigate the appropriateness of the model, N = 1000 simulations of the 4-dimensional distribution were carried out using the two-correlations linear model with the parameters listed in Table 6. The same number of simulations was carried out for the normal copula case. As an example, Fig. 7 shows the empirical copula together with a simulated normal and a simulated changing-correlations copula; the sample sizes are the same in all cases. As one can see, the strong dependence of the low values is not captured by the normal simulation but is well captured by the two-correlations model. A comparison of the two dimensional marginals shows that the two-correlations linear model captures the asymmetrical dependence well. Note that, in contrast to the previous example, in this case all correlations are positive. Deviations from Gaussian dependence are very frequent, and both examples include cases where the high values have a stronger dependence than the low ones, as well as the reverse.

Discussion and conclusions
In this paper a method to construct multivariate non-Gaussian distributions was presented. The construction is very general; special cases with two or more parameters for the description of the dependence of each pair can be used.
The copulas obtained via these distributions are not only useful for monotonic dependence but can also represent dependencies with changing character, for example negative dependence for small values and positive dependence for high values.
Copulas have most frequently been used for the investigation and description of the dependence of extremes. However, deviations from normal dependence are also relevant for non-extreme values.
The copulas defined in this paper can describe arbitrary upper and lower tail dependence, and can also be used for the description of asymmetrical dependence even without focusing on the extremes.
A disadvantage of this construction is that the distribution functions have no general closed analytical form, which makes the estimation of the parameters difficult. The distributions defined using this construction may have many parameters, but due to the increasing data volumes of environmental variables they may well be used for the investigation of big data. The development of parameter estimation methods for more complex models requires further research.
The presented construction yields bi-variate marginal copulas which are symmetrical with respect to the main axis of the copula. Non-symmetrical dependence can be achieved by using the same idea with other multivariate distributions, such as the skew-normal distribution. In this case both the parameters of the correlation matrix and the parameter defining the skewness can be varied in the same way as done in equation (5).
The use of these copulas as an alternative for multivariate linear regression is also possible, but goes beyond the scope of this contribution.
The methodology can be extended to time series and spatial random fields.
Author Contributions AB did everything.
Funding Open Access funding enabled and organized by Projekt DEAL.

Conflict of interest The authors declare no competing interests.
Fig. 7 Empirical and simulated copulas for the air quality parameters NOx and NO2 for Zürich: left, simulated using the normal bivariate copula; middle, observed; right, simulated using the two-correlations model. Pairs with non-normal dependence at the 95% significance level are in boldface.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.