Evaluating the Discrimination Ability of Proper Multivariate Scoring Rules

Proper scoring rules are commonly applied to quantify the accuracy of distribution forecasts. Given an observation they assign a scalar score to each distribution forecast, with the the lowest expected score attributed to the true distribution. The energy and variogram scores are two rules that have recently gained some popularity in multivariate settings because their computation does not require a forecast to have parametric density function and so they are broadly applicable. Here we conduct a simulation study to compare the discrimination ability between the energy score and three variogram scores. Compared with other studies, our simulation design is more realistic because it is supported by a historical data set containing commodity prices, currencies and interest rates, and our data generating processes include a diverse selection of models with different marginal distributions, dependence structure, and calibration windows. This facilitates a comprehensive comparison of the performance of proper scoring rules in different settings. To compare the scores we use three metrics: the mean relative score, error rate and a generalised discrimination heuristic. Overall, we find that the variogram score with parameter p=0.5 outperforms the energy score and the other two variogram scores.


Introduction
Alongside the profusion of models for generating point or distribution forecasts, a prolific strand of theoretical research focusses on developing methods for evaluating these forecasts.A standard approach is to quantify the accuracy of each prediction with a proper scoring rule (see, for example, Gneiting and Raftery 2007).Many scoring rules are proposed and considered in the literature: Bao et al. (2007) advocate using the Kullback-Leibler information criterion which is derived from the logarithmic score; Gneiting and Raftery (2007) advocate using the continuous ranked probability score (CRPS) and Gneiting and Ranjan (2011) extend this to adopt the weighting approach of Amisano and Giacomini (2007) so that evaluation can be focused on a specific area of the distribution, such as a tail or the centre; Gneiting and Raftery (2007) generalise the CRPS to the energy score for multivariate distributions; Scheuerer and Hamill (2015) use the concept of variograms from geostatistics to derive a proper score; Hyvärinen (2005) and Parry et al. (2012) consider local scoring rules, which are based on the density function or its derivatives; Diks et al. (2011) and Diks et al. (2014) generalise the log score to conditional likelihood and censored likelihood, which can allocate more weights on specific regions.
Given the large number of scoring rules, conventional wisdom dictates to apply a suitable one for the application at hand.While it is generally agreed upon that only proper scoring rules quantify the accuracy of probabilistic forecasts adequately (Winkler, 1996;Gneiting and Ranjan, 2011), the question of which of the proper scores to use remains largely open (Gneiting and Raftery, 2007).This problem is especially relevant for multivariate evaluation, since the rankings of univariate scoring rules mostly coincide, which reduces the risk of conflicting conclusions (see, for example, Staël von Holstein, 1970;Winkler, 1971;Bickel, 2007).Some previous studies provide an analytic assessment of proper scoring rules for a specific forecasting problem, but these are restricted to a univariate setting in which some strong assumptions are made (see, for example, Machete 2013; Buja et al. 2005;Merkle and Steyvers 2013;Johnstone et al. 2011).
Previous studies comparing multivariate proper scoring rules are limited to very simple simulation settings.In their evaluation of the energy score, Pinson and Tastu (2013) restrict themselves to a bivariate Gaussian as the true distribution.They introduce a useful discrimination heuristic which measures the average relative distance of the sub-optimal scores (i.e.those obtained from mis-specified models) from the score obtained from the true distribution.They conclude that the energy score is able to discriminate errors in mean but lacks sensitivity to errors in variance and especially to errors in correlation. 1 Scheuerer and Hamill (2015) compare three different variogram scores with the energy score and the score proposed by Dawid and Sebastiani (1999).Their system has 5 or 15 dimensions but again they only consider a very simple true model, i.e. a Gaussian or Poisson distribution.A simulation study assesses the ability of each score to identify the correct model through a simple heuristic, i.e. the score sample mean.Overall, only the variogram score with p = 0.5 is devoid of ranking issues in their study.Also, they confirm the finding of Pinson and Tastu (2013) that the energy score lacks sensitivity to misspecification in the dependency structure.Ziel and Berk (2019) compare the sensitivity of the energy score, variogram score and the score proposed by Dawid and Sebastiani (1999).They first consider a simulation study focusing on multivariate Gaussian distributions of unit variances and different correlations.Then they consider a time series empirical study where autoregressive (AR) models with different lags are applied to airline passengers.They conclude that the energy score performs the best in their simulation study, which seems to contradict the conclusion from Pinson and Tastu (2013).Diks and Fang (2020) use conditional likelihood and censored likelihood to compare multivariate and univariate approaches for forecasting portfolio returns.Their empirical study considers the univariate and multivariate generalised autoregressive conditional heteroscedasticity (GARCH) models with elliptical distributional assumptions applied to financial data.Their focus is not to compare scoring rules, but to illustrate that multivariate models do not necessarily lead to better forecasting accuracy for portfolio returns when compared with a univariate approach.
We make two major contributions to this literature.The first is to propose new metrics for discriminating between correctly specified and misspecified models using proper multivariate scoring rules.The second is to provide the first empirical study on real financial data of the relative performance of multivariate scoring rules.Most empirical research in the forecasting literature in finance and economics includes a short empirical study, but extensive application of proper scoring rules, even to univariate distribution forecasts of financial returns, are hard to find.We fill this gap in the literature by conducting an extensive comparison between different rules across a diverse selection of eight static and dynamic models for multivariate distributions.We do this in a forecasting setting using three large sets of daily financial data spanning 1991 to 2018: commodity prices, exchange rates and interest rates.
Each data set has eight dimensions.We apply our new metrics to draw conclusions about the ability of different proper scoring rules to detect model misspecification.We do this by calibrating each model to one of the datasets and then simulating day-ahead values using the calibrated model.The exercise is repeated quarterly across many years and, every time, results are generated with each of eight different models as the true data generation process.An online appendix contains further results which have been omitted here for lack of space.
In the following: Section 2 describes the scoring rules; Section 3 describes the models for multivariate distribution predictions; Section 4 describes the design of the experiment; Section 5 presents our results and Section 6 concludes.

Review of Scoring Rules
Forecasting accuracy evaluations rely on loss measures to quantify the performance of a distribution forecast.As Diebold and Mariano (1995) mention, this loss generally depends on the underlying economic structures associated with the forecast.Scoring rules offer a promising measure by condensing the accuracy of a distribution forecast to a single penalty oriented value while retaining attractive statistical properties.
Let F be the convex class of distributions on (Ω, A).Let y be an observation of the random variable Y and let F be a forecast of the distribution of Y .A scoring rule is a function A scoring rule S is said to be proper if for all distributions F and the true distribution G, the following holds, (1) Further, a scoring rule is strictly proper if equation (1) holds if G is the unique minimiser.
Propriety of a scoring rule is important because the ideal forecast is preferred irrespective of the costloss structure (Diebold et al., 1998;Granger and Pesaran, 2000).A proper scoring rule is designed so that a forecaster who believes the future distribution to be G has no incentive to predict any distribution Gneiting et al., 2007).The term has been coined by Winkler (1996Winkler ( , 1977) ) who shows that proper scoring rules test for both calibration and sharpness of a distribution forecast simultaneously.
The usage of non-proper scoring rules is generally not recommended since those can lead to wrong inferences (Gneiting and Ranjan, 2011).
The most well-known score is undoubtedly the log score, defined as follows, where f is the probability density function (PDF) of the distribution F .The log score has been widely applied in both model evaluation and estimation.In the log score, the PDF f clearly cannot have value 0. To overcome this weakness, quadratic and pseudo-spherical scores have been considered as alternative PDF-based scores where f α . .= f (y) α µ dy 1/α .The spherical score is a special case of the pseudo-spherical score with α = 2.All three scores are strictly proper.However, these scores have been criticised for only crediting forecasts for high probabilities of the realizing value but not for high probabilities of values near the realizing one (Gneiting and Raftery, 2007).In addition, the PDF-based scores rely on predictive densities which might not be available, especially with ensemble forecasts.In view of these limitations, efforts have been devoted to developing alternative scores that do not require predictive densities.In the rest of section, we briefly review three such scores: the CRPS, energy score and variogram score.

Conditional Ranked Probability Score
The CRPS introduced by Matheson and Winkler (1976) and augmented by Gneiting and Ranjan (2011) has been widely used to properly compare distribution forecasts with a potential focus on certain regions of interest.The CRPS is defined as where F −1 (α) = inf{x|F (x) = α} denotes the α quantile of F , ν : [0, 1] → R ≥0 is a quantile weight function and QS α is known as the quantile score.
Apart from this quantile score representation, CRPS can also be expressed using the Brier probability score through with threshold weight function u : R → R ≥0 as shown by Laio and Tamea (2007).Given a realization y, the integral of equation ( 3) splits into two easily interpretable parts which get penalized by the score as visualized in Figure 1.We remark that equations ( 2) and (3) are generally not equivalent.
Additionally, Gneiting and Raftery (2007) derive the kernel score representation where Y and Y are independent random variables with sampling distribution F .This concise expression serves as a foundation for the generalization of CRPS to the multivariate energy score discussed in the next subsection.For distributions with finite first moment, the CRPS is strictly proper.Thus, the true probability function receives the lowest CRPS and is preferred to any other probabilistic forecast.Emphasizing specific parts of the distribution by the choice of the quantile or threshold weight functions is simple since any non-negative functions ν and u can be used, provided that equations (2) and (3) are convergent.
Table 1 lists the proposed functions by Amisano and Giacomini (2007) that we use in our analysis.
The weight functions ν : [0, 1] → R ≥0 and u : R → R ≥0 put additional emphasis on certain parts of the distribution.Forecasts which deviate on those parts are penalized additionally and receive a higher CRPS.Here, ϕ and Φ denote the density and the distribution functions of the standard normal distribution.

Energy Score
The energy score is a popular multivariate strictly proper score introduced by Gneiting and Raftery (2007) which generalizes the kernel representation of the CRPS in equation ( 4).Let y = (y 1 , . . ., y d ) be an observation of the random vector Y and let F be a forecast of the distribution of Y such that ) is finite.The energy score is then defined as where Y and Y are independent random vectors with distribution F .Székely (2003) shows that the energy score with β ∈ (0, 2) is strictly proper while Gneiting and Raftery (2007) provide an alternative general proof.In practice, β = 1 has been a common choice.In this case, the energy score reduces to the CRPS in the univariate case.Closed form expressions of the energy score are generally unavailable which means that computations are done through Monte Carlo methods.
Despite its popularity, this score has been criticized for being insensitive to misspecification of the dependency structure (Pinson and Girard, 2012;Pinson and Tastu, 2013) and for being unable to distinguish a good representation of the predictive distribution from a very sparse one (Scheuerer and Hamill, 2015).

Variogram Score
The variogram score proposed by Scheuerer and Hamill (2015) is based on the concept of variograms from geostatistics.Similar to diagnostic methods by Hamill (2001) andFeldmann et al. (2015), the score examines pairwise element differences of the variables y i of y.The variogram score of order p is defined as where Y i and Y j are the i-th and j-th components of a random vector Y with distribution F .Intuitively, the score makes use of the variogram of order p , which quantifies the degree of spatial dependence of a stochastic process.Pairwise comparisons measure the closeness of the deviations in the observations relative to those of the corresponding expectations.Scheuerer and Hamill (2015) show that the score is proper relative to the class of distributions for which the 2p-th moments of all elements are finite.The variogram score is not strictly proper because it depends only on the p-th absolute moment of the distribution of the element differences.Therefore, it cannot distinguish any distributions where the element differences deviate in higher moments but are the same for moments of order less than or equal to p.
In practice, the choice of p depends on the forecasted distribution and should generally be large enough to consider all relevant moments of the pairwise deviations but not too large to overly emphasize outliers through the exponentiation.The values p = 0.5, 1, 2 are often suggested, and Figure 2 shows the effect of p by illustrating the observed variogram |y i − y j | p of different popular orders relative to changes in |y i − y j |.It is clearly visible that the magnitude of the effect depends heavily on the value of |y i − y j | with the absolute slope varying between 0 and 3 in the depicted domain (−1.5, 1.5).The sensitivity of the observed variogram in a neighbourhood of zero deviation is strongest for p = 0.5 and very weak for p = 2.This order reverses for |y i − y j | > (1/4) 2/3 .We expect parameter p = 0.5 to be more sensitive to similar y i and y j while p = 2 will respond more to cases where |y i − y j | is expected to be large.The choice p = 1 should be able to strike a balance between these two cases.As with the energy score, the encapsulation of the information into a single score leads to a loss of information.However, empirical applications support the sensitivity of the score to flawed forecasts, especially regarding the dependency structure (Scheuerer and Hamill, 2015).Since the score is based on pairwise deviations, any bias that is the same for all components of the forecast cancels out and is therefore undetectable.This further motivates the good practice to using multiple proper scores for the evaluation of multivariate distribution forecasts.

Models for Predicting Multivariate Distributions
While most prior studies of this type have only considered static parametric distributions with fixed parameters as data generating process (DGP) and parameter variations for the misspecified models, we aim here to instead select from a broad range of realistic models for both the DGP and the misspecified models.There is a very long list of alternative static and dynamic models that could be selected for our study.We require a diverse selection but would like all of these to fit the historical datasets reasonably well, while differing in features such as whether the marginal distributions are static, whether the dependence structure is static, how much of the noise is filtered from the historical data,2 and finally how much sampling error is generated by the length of the calibration window.In this section we summarize all the models to be used in our study, beginning with static models before moving to dynamic models.

Static Models
Our simplest model assumes that the next period joint distribution of the variables can be well approximated by their joint historical distribution, i.e. the frequency distribution of contemporaneous observations on the random variables.The lack of complexity makes it this approach particularly popular with practitioners.In a survey by Pérignon and Smith (2010) of risk models used by commercial banks, almost 3/4 used such historical simulation techniques.Moreover, Danielsson et al. (2016) shows that it remains unclear whether dynamic time-series models can outperform this so-called 'historical' approach, at least for predicting quantiles.We therefore include models that use historical data to build an empirical distribution function (EDF) for each marginal and apply the Gaussian copula for the dependence structure.More sophisticated copulas could be selected but we only apply the Gaussian copula for consistency with the other models employed.We use EDF C n to denote the empirical marginal distribution with Gaussian copula, calibrated to a sample size n.
The Factor Quantile (FQ) methodology developed by Alexander and Han (2020) uses quantile regression to capture the marginals.Each random variable in the system is regressed on a set of m common factors, which may be exogenously selected or determined endogenously.Here we use the latter approach, with latent factors x m derived from a principal component analysis of all the variables in the system.Given a quantile partition Q, common to each variable, we perform a set of quantile regressions for each variable, i.e. one for each quantile τ ∈ Q.Then, conditional on a set of predetermined values (one for each of the latent factors) we compute the fitted quantiles of each variable, and apply a shapepreserving spline to the fitted quantiles over all τ ∈ Q.This way we interpolate a conditional marginal distribution for each variable, all conditional on the same latent-factor values.Alexander and Han (2020) claim that such latent-factor FQ models are at least as successful for predicting distributions of financial asset returns as multivariate GARCH models.Furthermore, they have several advantages in that they are very quick to calibrate, more computationally robust than GARCH models, and they scale far more easily to high dimensions.
Here we confine our selection to two types of latent factor FQ models, each with a Gaussian copula  (1996).We refer to Alexander and Han (2020) for further technical details.

Dynamic Models
The most common class of dynamic model for an expected value is the family of autoregressive moving average (ARMA) time-series models -because they are flexible, have the temporal aggregation property, and allow both easy estimation and straightforward inference.However, for distribution forecasts we also require a parametric forecast of variance.To this end, we employ time-series models for conditional variance forecasts using the GARCH(p, q) model class, a generalization by Bollerslev (1986) of the autoregressive conditional heteroscedasticity (ARCH) model of Engle (1982).The GARCH(p, q) process assumes a time-varying conditional variance as follows: where ε t is the market shock or innovation at time t and I t is the information set containing all past returns up to t.The parameters (α i ) q i=1 and (β i ) p i=1 measure the reaction of the conditional variance to market shocks and the persistence of conditional variance respectively.
These models have been particularly successful in modelling univariate time-series of financial returns (Engle, 2001) partly because they are designed to capture volatility clustering effects which are often present in finance (Mandelbrot, 1963).In the presence of such effects, volatility becomes timedependent and can exhibit prolonged periods of exceptionally high or low values. 3Multivariate GARCH models are designed to capture time-variation in conditional variances and conditional covariances.We only consider the GARCH(1,1) structure.Further lags could be considered, of course, but we need to limit the number of models selected in order to focus our results on the performance of proper scoring rules, and not on the models themselves.Having said this, we should also select the best models in the GARCH(1, 1) class, so we shall allow for asymmetries in innovations as well as leverage effects, and for two kinds of conditional covariance dynamics.In multivariate analysis, clustering extends beyond volatilities to correlations and generalizations of univariate GARCH models also capture time-varying conditional covariances and spillover of volatility between different assets.Bollerslev (1990) introduces the constant conditional correlation GARCH (CCC-GARCH) model where the conditional correlations are assumed to be time-invariant.The covariance matrix is estimated as is a matrix such that V t is the covariance matrix (e.g.V 1/2 t can be chosen as the Cholesky factorization of V t ) z t is the standardised residual which has mean zero and the identity covariance matrix, C is a constant correlation matrix and D t is the diagonal matrix containing the time-varying individual volatilities.In practice, each volatility can be estimated by an univariate GARCH model while C can be specified as the sample correlation between standardized residuals.The model is easy to estimate since dependency and the volatilities are examined separately.
The assumption of constant correlation may seem too strong and is not fulfilled for many assets (Tsui and Yu, 1999).Dynamic conditional correlation GARCH (DCC-GARCH) by Engle (2002) generalizes CCC-GARCH to account for time-varying correlations.The correlation is estimated directly from the residuals of the univariate models and adjusted depending on the co-movement of the returns.As such, the covariance matrix is given by of an ARMA model to the conditional means and the second being the estimation of a GARCH model to the residuals of the conditional mean equations.
where the conditional correlation C t is described by with The transformation of Q t to C t guarantees a well-defined correlation matrix as long as Q t is positive definite.Similar to CCC-GARCH, there are no restrictions on the choice of univariate GARCH models for the volatility.For both the CCC-GARCH and DCC-GARCH models, z t can be chosen as the multivariate Gaussian distribution, which leads to consistent parameter estimators even under distributional misspecification (Bauwens and Laurent, 2005).It is also possible to consider non-normal distributions.
For example, Bauwens and Laurent (2005) use multivariate skew distributions while Cajigas and Urga (2006) and Pelagatti (2004) apply the Laplace distribution and general elliptical distributions, respectively.There are several alternative multivariate GARCH models but in this paper we limit our analysis to CCC-GARCH(1,1) and DCC-GARCH(1,1).

Simulation Study
Our discussion in Section 2 introduced several univariate and multivariate proper scoring rules that are prevalent in the literature.While it is agreed upon that these offer a sound way to quantify the accuracy of probabilistic forecasts (Winkler, 1996;Gneiting and Ranjan, 2011), the question of which score to use remains largely open (Gneiting and Raftery, 2007).Conventional wisdom dictates to apply a suitable scoring rule for the application at hand (Machete, 2013) but this only provides a few requirements and does not sufficiently restrict the selection.4 Our aim is to analyse the ability of different scoring rules to discriminate between true and misspecified distributions.Improving on previous studies we employ a realistic simulation setting that approximates the conditions of practical applications in finance.This is reflected by our simulation design which calibrates the models described in the previous section to daily USD-denominated exchange rates from 1999 -2018; US interest rates from 1994 -2018; and Bloomberg investable commodity indices from 1991 -2018.
In the following: Subsection 4.1 describes the data we use to calibrate the models used in the simulation study.Subsection 4.2 motivates our choice of models.Subsection 4.3 describes the design of the simulation study.All results are discussed in Section 5.

Data Description
We obtain three eight-dimensional time series of daily frequency: the first is on USD-denominated exchange rates, the second is on US interest rates and the third is on Bloomberg investable commodity indices.We obtain the daily exchange rates and commodity index values from Thomson Reuters Datastream and the interest rates data from the US Treasury website.All time series end on 30 June 2018 but the start date varies with data availability.Within each set we have selected variables to broadly represent the asset class.The exchange rates are those with the highest trading volume excluding the Chinese Renminbi, which was pegged to the USD until recently (Bank of International Settlements, 2016).Our data starts in January 1999 with the introduction of the Euro as accounting currency.The interest rates span the term structure of US Treasury bonds from 6 months to 20 years.
Alternative available maturities are 1 month, 2 month, 3 month, and 30 years but those maturities are missing data for an extended period of time and are therefore excluded.Our data starts in January 1994 after the 20-year maturity interest rate became available in October 1993.The commodity indices are chosen to reflect the most liquid commodities with the highest USD-weighted production value and are diversified to represent the energy, grains, industrial / precious metals, softs and livestock sectors (Bloomberg, 2017).The Bloomberg commodity indices were launched in 1998 with a backward projection to January 1991.We include all available data in our study.We summarize the total sample period and the starting date of our out-of sample evaluation in Table 2.
Summary statistics of the data are listed in Table 3 for monthly returns of exchange rates and commodites, and for monthly changes in interest rates.The sample mean for all monthly returns and changes is near zero which allows us to apply the principal component representation of Factor Quantile models without prior transformations.Furthermore, all assets are leptokurtic and require heavy-tailed

distributions.
To examine the robustness of our analysis, we segment the data into three parts, covering 2006-2010, 2010-2014, and 2014-2018 with breakpoints at the end of June in each case.Because the exchange rate data starts much later than the other data sets, the period from June 2006 to February 2007 is still used for calibration.Therefore, the first sub-period begins in March 2007 in this case.

Model Selection
Although there have been a few studies explicitly comparing scoring rules (as reviewed in Section 1), the models chosen and the corresponding discussions generally fail to put enough emphasise on the distinct features of financial data, such as time-varying volatility, covariance and leptokurtic conditional distributions.Pinson and Tastu (2013) restrict themselves to bivariate Gaussian distributions with different means, variances and correlations, and do not consider time series in their study.Similarly Scheuerer and Hamill (2015) only consider simple distributions such as Gaussian and Poisson distributions. 5 Ziel and Berk (2019) conduct a time series empirical study but they only consider AR models for the mean with no dynamic volatility structure.In addition, their data is essentially a univariate time series, where observations on multiple days are viewed jointly as multivariate distributions.This is inherently different from standard financial applications, where synchronised financial data are viewed as a multivariate system.Diks and Fang (2020) do consider financial applications, in which they use the univariate and multivariate GARCH models to capture the dynamic variance and covariance structure.
However, as we explained in Section 1, their focus is not to compare scoring rules.
Most of the empirical literature on financial forecasting accuracy concerns univariate times series and, moreover, considers only point forecasts.Yet, even in this restricted context, it remains unclear The monthly returns and changes are calculated using the values at the start of each month which the summary statistic aggregates over the time periods specified in Table 2. Our study only applies daily data but we use a monthly frequency in this table to avoid minuscule magnitudes.
whether a dynamic model structure is always preferable to a static one which assumes that the next period distribution of variables can be well approximated by the historical distribution.Furthermore, multivariate models can differ both according to the specification of the marginals and their dependence structure.As described in Section 3, a static marginal distribution is commonly represented by an empirical distribution function, while we also consider models in the Factor Quantile class.The dynamic models are instead from the GARCH class.Regarding dependence, all but one of our models assumes a static dependence structure (typically the Gaussian copula) -only the DCC-GARCH models extends this assumption to dynamic conditional correlation.
Our simulation study selects a DGP as the true distribution and assesses the ability of different proper scoring rules to detect model misspecification.To reduce dependence on our choice of DGP we employ a diverse selection of models for multivariate distributions that can be used as either the DGP or a misspecified model.Also, for our results to be relevant to practical applications we calibrate these models to a very comprehensive selection of real-world data.However, even though the models are calibrated to actual data, it is important to emphasize that we do not seek to compare the forecasting performance of different models.Instead, our focus is on a comparison of the ability of different proper scoring rules to detect model misspecification.Nonetheless, we do choose models which have been shown in Alexander and Han (2020) to have similar forecasting abilities in typical financial datasets, thereby posing a greater challenge to scoring rules in terms of identifying the DGP and discriminating between competing models.As compared with other studies, we thus emply a more realistic setting with models that are better representations of financial data, we calibrate these models to actual data and our results are much more comprehensive.
We employ a sample size n = 250 and n = 2000 for our results based on static models.However, the dynamic multivariate GARCH models cannot be calibrated using n = 250 because the optimization of their complex likelihood function does not converge, most of the time.Hence, their parameters are estimated using n = 2000 only.The FQ models are much easier to calibrate on small samples, so we present results for these models based on both n = 250 and n = 2000.6Because the EDF and FQ models are calibrated on two different sample sizes we end up with eight competing models in total: (i) Two FQ-AL models, (ii) two FQ-AB models, (iii) two EDF models and (iv) two multivariate GARCH models.This way, each model has one associated model that is similar but differs either in the calibration length or the conditional covariance structure.Table 4 summarizes the models employed.
Our motivation for selecting the models described here is to use a small but representative selection from the static and dynamic modelling paradigms that are commonly used for multivariate financial systems.Expanding the model set beyond the eight models described in this section may hinder rather than help our aim.Our selection of models strikes a balance between the need to represent the salient qualities of real data adequately and a sufficiently simple structure that calibration difficulties will not interfere with the very large scale of our simulation study.Column 1 summarizes the notation used for the model, column 2 describes the marginals, column 3 the dependency structure and column 4 the length of the calibration window.For example, the GARCH models are calibrated to an eight-dimensional time series spanning 2000 days and the FQ-AB C 250 models are calibrated to an eight-dimensional time series spanning 250 days.

Simulation Design
Our simulation study quantifies and compares the ability of the energy score and the variogram score with p = 0.5, 1, 2 to distinguish the correct DPG from misspecified models.These scoring rules were defined in Section 2 and the selected values of p, which have also been used by Scheuerer and Hamill (2015) are considered typical choices (Jordan et al., 2017).
The models described above are calibrated on the systems of daily, eight-dimensional USD-denominated exchange rates, interest rates and Bloomberg investable commodity indices that we discussed in Section 4.1.The parameters for each model are estimated on an initial sample size n, then the calibrated model is used to generate a one-day-ahead multivariate distribution forecast, then the sample is rolled forward keeping the sample size fixed, and then the calibration and forecasting is repeated until the data set is exhausted.Then we move to another data set and repeat, until we have a continuous set of daily, rolling distribution forecasts, for each model and for each of the three data sets.
Our aim is not to assess the forecasting performance of different models.It is to assess the ability of a scoring rule to distinguish the true DGP from misspecified models, when used for multivariate distribution forecasting.Our simulation setting controls the DGP so that we know the true distribution every time we evaluate the forecasts and the other seven models are misspecified.For the forecasts made at time t all the models are calibrated using the n most recent observations up to time t, as previously mentioned.In addition, we select one of these models to generate the observations for time t + 1 that are used to analyse the performance of scoring rules.Put another way, realisations are simulated from just one of these models -i.e. the one that is designated as the DGP.The reason why we have selected such diverse models is that we seek to reduce the dependence of our results on the choice of a specific DGP.Therefore, we rotate the choice of DGP to span all eight models.
The simulation algorithm is summarized as follows.First, fix a proper scoring rule s and a model m * that is designated as the true DGP.This rule s will quantify each of the eight model's distribution forecasts, for each data set, as follows: Stage 1 Calibrate a model using data ending at time t and forecast a multivariate distribution for time t + 1, i.e. one-day-ahead.Repeat this eight times, once for each model in Table 4; Stage 4 Roll t forward by one quarter of a year and repeat stages 1, 2 and 3 until the entire sample for that data set is exhausted.
The four stages outlined above are then repeated eight times, so that each of the models in Table 4 is selected as the designated DGP m * .Then we change the scoring rule s and repeat the four stages eight times, so that again each model can be the designated DGP m * .We do this four times, so that s spans all four scoring rules and every time we change the scoring rule we repeat stages 1 to 4 eight times, so that each of the eight models can be the designated DGP.Finally, we change the data set and repeat the entire process again.
We evaluate the scoring rules at the first date of each quarter in our evaluation period.This yields new simulations (each with eight dimensions) on 50, 66, and 82 dates in USD-denominated exchange rates, US interest rates and Bloomberg investable commodity indices, respectively.Since we have eight possible DGPs, this leads to approximately 1,600 applications of the simulation study above for each of the four multivariate scoring rules.
This setting gives us a very detailed view on the discrimination ability for each scoring rule over time (i.e.different market conditions) and for various choices of the DGP.Note that our simulation design provides optimistic conditions for the scoring rules since it knows the distribution of the DGP for certain, and moreover it samples a very large number of realisations at each time t. 7In practice, we only observe one realisation and therefore must consider the scores over a large period -and the longer the sample used the less likely the DGP is to be To approximate a more realistic setting, in which the length of the out-of-sample period is restricted due to lack of data, we shall also compare the scoring rules on smaller sub-samples, with only 100 realisations instead of 5,000.Recall that all models use historical information to forecast their joint distribution, but in our setting they are evaluated using samples from the chosen DGP, rather than realisations of the original time series.
The models are only punished if their forecast deviates from that of the true DGP and a good scoring rule should be able to distinguish misspecified models from the true distribution.

Discrimination Metrics
We shall generalize the approach of Pinson and Tastu (2013) to compare the discrimination ability of several scoring rules and introduce a new metric, the error rate, as an additional measure of the sensitivity of scoring rules.Suppose there are M models in the study and that scores are based on N realisations at time t.The N × 1 score vector assigned by scoring rule s to model m at time t, when model m * is the DGP is defined as: We employ several metrics aimed at summarizing the properties of (8).First, and following Scheuerer and Hamill (2015), the simplest metric we employ is the mean relative score: almost always obtains the lowest score among all misspecified models as expected given its similarity to DCC-GARCH.The scores can distinguish distributions which differ only in their marginals and they are able to identify FQ-AL C 250 as one of the misspecified models with great confidence.In contrast, the difference between FQ-AB C 250 and EDF C 250 is less pronounced which means that they produce predictions which each deviate by a similar amount to the distribution forecast of DCC-GARCH.
Both the variogram score with p = 0.5 and p = 1 show clear and robust rankings between the misspecified models and distinguish them well from the DGP.The discrimination ability is weaker for the energy score.As pointed out by Pinson and Tastu (2013) and Scheuerer and Hamill (2015), the energy score changes only by a small amount between the DGP and other models.This is evident in Figure 3 as well, where the average score of the worst model is only 25% larger than that of the DGP.In comparison, the variogram scores with p = 0.5 and p = 1 assign average scores over 200% and 100% larger than that of the DGP respectively.Unlike the other scoring rules, the variogram score with p = 2 changes the rankings at several times and is also the only scoring rule which makes wrong inferences even with a large sample size of 5,000 scores.For instance, FQ-AB C 250 is preferred over the DGP around the end of 2017.Hence, the energy score and variogram scores with p = 0.5 and p = 1 may be preferable to the variogram score with p = 2.
However, there are vast differences in the discrimination ability of scoring rules which can lead to wrong inferences in smaller sample sizes: 1. Despite the overall success of the variogram score with p = 0.5 and p = 1, wrong inferences may occur with only 100 samples.The shaded areas of CCC-GARCH dip below 1 frequently which means that a slightly misspecified model may be chosen over the DGP.(ii) the sample mean is at times higher than the sample 0.75-quantile.This is, for instance, the case around the end of 2013.
These initial findings suggest that variogram score with p = 0.5 and p = 1 offer superior discrimination ability to the more popular energy score.The variogram score with p = 2 performs very poorly and may yield erroneous rankings of the forecasting models, even with a very large sample of scores.

Error Rate Comparison
Figure 4 shows the results from applying (10) with DCC-GARCH as DGP for all datasets and all competing models and uses the scores for all t to generate the density.Each column of the figure illustrates the density of equation ( 10) for a specific misspecified model, under various scoring rules and data sets.We include the error rate in the upper right corner of each sub-figure which shows the probability that equation (10) yields a negative value.For clarity, we do not use the same x-axis for all sub-figures but show all values between the 0.001-and 0.999-quantiles of each distribution.This means that the magnitude of the error is not visible in these figures but instead we gain insight on the shape of the error density.Figures for alternative DGPs can be found in the supplementary materials.
Figure 4 shows that the probability of getting scores which are lower than that of the DGP is quite high and varies significantly across cases.Depending on the data set and scoring rule, the average error rate (for each row of plots) varies between 31% and 54%.The variogram score with p = 2 in particular often assigns lower scores to misspecified models.This happens in 50%, 54%, 53% of cases for exchange rate returns, interest rate changes and commodity rate returns respectively and is therefore around 60% worse than the error rate of the variogram score with p = 0.5 (with corresponding error rates of 31%, 32% and 32%).This scoring rule achieves the lowest error rate, followed by the variogram score with p = 1 and the energy score, which both achieve error rates in the 39-42% range.
It is important to note that the error rate is only a binary statistic which does not take into account the magnitude by which the scores of misspecified models are smaller than that of the DGP.
By averaging over a sample of scores, the error rate decreases, until it reaches zero due to the propriety of the scoring rules.The number of samples needed for a sample mean that favours the DGP depends on the shape of the distribution.If the tail of the shaded area is small in comparison to the tail of the non-shaded area, a small sample might be sufficient.However, many of the distributions in Figure 4 are approximately symmetric which means that large positive and negative values in equation ( 10) are equally likely.As an additional indicator for the convergence speed, we illustrate the expectation of the distributions with a dotted line.These are always non-negative due to propriety of the scoring rules, but an expectation far right from the shaded area corresponds to a faster convergence towards lower mean relative scores for the DGP.Again, the values are generally close to zero, which suggests slow convergence towards positive sample mean scores.
The average error rate over all DGPs for the evaluation period of the multivariate scoring rules is compared in Figure 5. Similar to Figure 4, we examine the number of times the score of a misspecified model is lower than that of the DGP but we now include the error rate across multiple choices of the DGP.The conclusions from Figure 5 are similar to Figure 4, although the magnitude of the error rates is lower than for the case of only DCC-GARCH as DGP.The variogram score with p = 2 has a significantly higher error rate that is more than 47% higher than that of the variogram score with    p = 0.5.This is also consistent with Figure 3 where misspecified models were sometimes preferred over the DGP.Again, there is typically a clear ranking of the scoring rules that persists with all three data sets and the entire evaluation period.For the variogram scores, the error rate increases with the parameter p and the error rate of the energy score typically falls between the variogram score with p = 1 and p = 2, but interestingly it often outperforms variogram with p = 1 for the interest rate data.

Generalised Discrimination Heuristic
Through the consideration of multiple models, we go beyond ceteris paribus sensitivities to obtain more general results.Our misspecified models combine various misspecifications at once and are therefore more similar to the settings under which the proper scoring rules are applied in practice.Our generalized discrimination heuristic defined in (11) allows us to test discrimination ability against multiple competing models at once.Note that we do not subtract the scores of the DGP from those of the misspecified models in the numerator, but this does not affect the rankings of the scoring rules.Figure 6 depicts our discrimination heuristic for each scoring rule over time with a logarithmic scale.Unlike Figure 3, results are illustrated for all DGPs (eight rows of plots) and all data sets (three columns of plots).
Figure 6 shows a clear distinction between scoring rules in most cases, but the preferences of the discrimination heuristic for the four rules can vary slightly depending on the data set, the time period and the DGP chosen.Overall, there are several key features which emerge: 1.The energy score is always the scoring rule with the lowest discrimination heuristic.This, again, We display the discrimination heuristic of equation ( 11) for all three data sets and eight DGPs with a logarithmic scale.Scoring rules which separate the scores of misspecified models and the DGP by a larger relative distance are assigned higher values for the discrimination heuristic.We smooth the discrimination heuristic with a moving average of 8 observations to improve the interpretability of the figure, but the same patterns are present in case no smoothing is applied.
variogram score with p = 1 in terms of discrimination ability since differences between models are magnified less.On the whole, variogram score with p = 0.5 seems to perform best in most settings, but another important conclusion is that one should typically consider multiple performance metrics and be aware of how different scoring rules tend to behave in different model settings.Such insights can undoubtedly be of great value when deciding which scoring rule to use in different multivariate forecasting applications.

Figure
Figure 2: Variogram observation of various orders capturing dependence, which are designated FQ-AL C n and FQ-AB C n where n is again the number of observations used to calibrate the model and C denotes the Gaussian copula.The difference between the AL and AB versions stems from the choice of factors x m : the AB version uses the first m components and the AL version uses the last m principal components.The components are ordered by size of eigenvalue, so that the variance explained by each component progressively decreases.The AL version simply separates relevant information (captured by the intercept) from the noise (captured by the last m components).In effect, it is a means of removing unwanted variation from the EDF.This may also be viewed as a latent-factor version of the Jensen (1968) 'alpha', hence the terminology 'alpha-latent' or AL.The AB version uses the first m eigenvectors as latent factors, but in this case each quantile forecast is associated with a large uncertainty.Therefore a variance reduction technique based on "bootstrap aggregation" or bagging is applied to reduce this uncertainty, via an algorithm proposed by Breiman

Stage 2
Draw 5,000 observations from the distribution forecasted by model m * .These observations are realisations from the true model at time t + 1; Stage 3 Given a distribution forecast for time t+1 by model m, apply s to each of the 5,000 realisations generated in Stage 2, to quantify the performance of model m.Repeating this for all models inTable 4 gives 5,000 scores for each of the eight models at time t including the designated DGP model m * ;

Figure 3 :
Figure 3: Average scores relative to score of DGP (USD exchange rates)

2.
The energy score suffers from misidentification risk to a much larger extent.Besides CCC-GARCH, FQ-AB C 250 and EDF C 250 are also at times assigned lower scores than the DGP in 2010, 2013, 2016 and 2017 with the smaller sample size.Overall though, the energy score still manages to produce a clear ranking that is mostly accurate.3. The variogram score with p = 2 largely fails to yield any meaningful results with the smaller sample size.The rankings can change considerably, and all obtain a lower sample mean than the DGP at various times.Even FQ-AL C 250 , which is regarded as the worst model by all other scoring rules, has lower scores than DCC-GARCH around 2016.Additionally, the variogram score with p = 2 may assign scores of very large magnitude that greatly affect the sample mean.This is visible in Figure 3 in two aspects: (i) The scoring rule has wide confidence intervals and

Figure 4 :
Figure 4: Density of differences between scores with DCC-GARCH as DGP This figure displays the density of the difference between the scores of the DGP and the misspecified models described in equation(10).A Gaussian kernel is used to smooth the densities.The shaded areas correspond to negative values, where a lower score is assigned to the misspecified models.In the upper right corner of each sub-figures, the probability of the shaded area is displayed.The dotted vertical line shows the expectation of the density.For clarity, we limit the sub-figure to values between the 0.001-and 0.999-quantiles.Figures for alternative DGPs can be found in the supplementary materials.

Figure
Figure 5: Error rates of scoring rules

Figure
Figure 6: Discrimination heuristic of scoring rules

Table 1 :
Possible weight functions s for CRPS

Table 2 :
Sample for each data setAll three data sets use daily frequencies, yielding over 18,000 observations in total.The first dates vary due to data availability.

Table 3 :
Summary statistics of the monthly returns / changes

Table 4 :
Summary of Models used in the Simulation Study