Osband’s principle for identification functions

Given a statistical functional of interest such as the mean or median, a (strict) identification function is zero in expectation at (and only at) the true functional value. Identification functions are key objects in forecast validation, statistical estimation and dynamic modelling. For a possibly vector-valued functional of interest, we fully characterise the class of (strict) identification functions subject to mild regularity conditions.


Introduction and informal statement of main result
Consider a statistical functional T of the random variable Y ∼ F, that is, a mapping F ↦ T(F), such as the mean or the median. In the theory of forecast validation, a corresponding strict identification function V(x, y) takes the forecast x and the realisation y of Y as arguments, and its expectation with respect to Y ∼ F is zero if and only if x equals the true functional value T(F). This defining property makes identification functions a central tool in forecast validation through calibration tests (Nolde and Ziegel, 2017), often referred to as backtests in finance, and through forecast rationality (or optimality) tests in economics (Elliott et al., 2005; Dimitriadis et al., 2021b). Furthermore, these functions are fundamental to zero (Z) or generalised method of moments (GMM) estimation (Huber, 1967; Hansen, 1982; Newey and McFadden, 1994), where they are often called moment functions or moment conditions. Their statistical applications extend well beyond these two fields: among others, they drive dynamic modelling through generalised autoregressive score (GAS) models (Creal et al., 2013), isotonic regression estimates (Jordan et al., 2022), and the derivation of anytime-valid sequential tests (Casgrain et al., 2022). A complete understanding of the full class of (strict) identification functions for a given functional is crucial in these applications. Our main contribution, Theorem 4, provides such a full characterisation result.
In the jargon of decision theory (Gneiting, 2011) [...] or the pair consisting of the quantile and the Expected Shortfall (ES) at the same level with natural action domain [...]; see Examples 2 and 3 for details. To present the formal definition of an identification function, let us introduce the convention that V is called F-integrable if, for each of its components V_i, the integral ∫_O V_i(x, y) dF(y) exists and is finite for all x ∈ A and F ∈ F. Moreover, we shall use the shorthand V(x, F) = ∫_O V(x, y) dF(y) for any x ∈ A, F ∈ F, where the integral is understood componentwise.
On the class of distributions on R with a finite mean, F_1(R), the mean is identifiable with strict F_1(R)-identification function V(x, y) = x − y. On F_α(R), the class of distributions on R such that there exists an x with F(x) = α, the α-quantile admits the strict F_α(R)-identification function V(x, y) = 1{y ≤ x} − α. Functionals failing to be identifiable on practically relevant classes of distributions are the variance and Expected Shortfall. On such classes F, both of them violate the selective convex level sets property, which is necessary for identifiability (Osband, 1985; Fissler et al., 2021).¹ However, the pairs (mean, variance) and (quantile, ES) turn out to be identifiable with corresponding two-dimensional strict identification functions; see Examples 2 and 3.
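The defining property can be checked numerically. The following is a minimal sketch, assuming the standard choices V(x, y) = x − y for the mean and V(x, y) = 1{y ≤ x} − α for the α-quantile; the Gaussian distribution, the level α = 0.1, the sample size and the helper names are illustrative assumptions, not part of the paper.

```python
from statistics import NormalDist


def v_mean(x, y):
    """Identification function for the mean: V(x, y) = x - y."""
    return x - y


def v_quantile(x, y, alpha):
    """Identification function for the alpha-quantile: 1{y <= x} - alpha."""
    return (1.0 if y <= x else 0.0) - alpha


def sample_average(v, x, sample):
    """Monte Carlo estimate of V(x, F) = E_F[V(x, Y)]."""
    return sum(v(x, y) for y in sample) / len(sample)


# Hypothetical setup: Y ~ N(1, 2^2), level alpha = 0.1, fixed seed.
dist = NormalDist(mu=1.0, sigma=2.0)
alpha = 0.1
sample = dist.samples(100_000, seed=1)

true_mean = dist.mean
true_quantile = dist.inv_cdf(alpha)


def v_q(x, y):
    return v_quantile(x, y, alpha)


# Averages should be close to zero at the true value and clearly
# non-zero at a shifted forecast.
mean_at_truth = sample_average(v_mean, true_mean, sample)
mean_off = sample_average(v_mean, true_mean + 0.5, sample)
quantile_at_truth = sample_average(v_q, true_quantile, sample)
quantile_off = sample_average(v_q, true_quantile + 0.5, sample)

print(f"mean:     at truth {mean_at_truth:+.4f}, shifted {mean_off:+.4f}")
print(f"quantile: at truth {quantile_at_truth:+.4f}, shifted {quantile_off:+.4f}")
```

The averages at the true functional values are only approximately zero because they are Monte Carlo estimates; the exact statement concerns the expectation under F.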
Regarding the flexibility of the class of identification functions, the following observation is immediate: if V(x, y) is a strict F-identification function for T : F ։ A ⊆ R^k, it can be multiplied by any R^{k×k}-valued function h(x) of full rank and remains a strict identification function for T. Intriguingly, Theorem 4 formally states that, subject to mild regularity conditions, the reverse is also true, and the entire class of strict identification functions is given by

\[ \bigl\{\, h(x)\,V(x, y) \;\big|\; h\colon A \to \mathbb{R}^{k \times k},\ \det(h(x)) \neq 0 \text{ for all } x \in A \,\bigr\}. \tag{1} \]

Besides its theoretical appeal, this characterisation result opens the way for diverse applications. First, it can be used to optimise the power of (conditional) calibration (forecast rationality or optimality) tests studied in Nolde and Ziegel (2017). It is further related to efficient Z- or GMM-estimation based on conditional moment conditions in the sense of Chamberlain (1987) and Newey (1993), where the matrix h is absorbed into the choice of an optimal instrument matrix; see Theorem 3.1 and especially Remark 3.2 in Dimitriadis et al. (2021a) for details.
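The "immediate" direction can be illustrated with exact arithmetic. The sketch below is an assumption-laden toy example, not taken from the paper: it uses the pair (first moment, second moment) with V(x, y) = (x_1 − y, x_2 − y²), a hypothetical full-rank matrix h(x) with determinant 1 + x_2² > 0, and Y uniform on {0, 1, 2, 3}.

```python
from fractions import Fraction as Fr

SUPPORT = [Fr(k) for k in range(4)]  # Y uniform on {0, 1, 2, 3} (illustrative)


def V(x, y):
    """Identification function for (first moment, second moment)."""
    x1, x2 = x
    return (x1 - y, x2 - y ** 2)


def h(x):
    """Hypothetical full-rank matrix; det h(x) = 1 + x2^2 > 0."""
    x1, x2 = x
    return ((Fr(1), x1), (Fr(0), 1 + x2 ** 2))


def V_prime(x, y):
    """Transformed function V'(x, y) = h(x) V(x, y)."""
    hx, vx = h(x), V(x, y)
    return tuple(hx[i][0] * vx[0] + hx[i][1] * vx[1] for i in range(2))


def expectation(v, x):
    """Exact componentwise expectation of v(x, Y) under the uniform law."""
    values = [v(x, y) for y in SUPPORT]
    return tuple(sum(component) / len(values) for component in zip(*values))


truth = (Fr(3, 2), Fr(7, 2))   # E[Y] = 3/2, E[Y^2] = 7/2
wrong = (Fr(2), Fr(7, 2))      # mean forecast shifted by 1/2

print(expectation(V_prime, truth))
print(expectation(V_prime, wrong))
```

By linearity, the expectation of V′(x, Y) equals h(x) applied to the expectation of V(x, Y), so it vanishes exactly where the latter does, provided h(x) is invertible.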
Based on the choice of an identification function (called score by these authors) as their forcing variable, the dynamic GAS models of Creal et al. (2013) determine an autoregressive model structure for a corresponding functional of interest that nests classical ARMA and GARCH models for the mean and variance. In these models, the so-called scaling matrix takes the place of the matrix h and, as already called for by Creal et al. (2013, p. 779), this choice "warrants separate inspection".
The following examples discuss interesting applications of our characterisation result in (1) to vector-valued functionals.
Example 2 (Mean and variance). The pair (mean, variance) is identifiable on the class F_2(R) of distributions with finite variance with the two-dimensional strict F_2(R)-identification function

\[ V(x_1, x_2, y) = \begin{pmatrix} x_1 - y \\ x_2 - (x_1 - y)^2 \end{pmatrix}. \]

One can use the characterisation result (1) to produce a multitude of other strict F_2(R)-identification functions. Motivated by the decomposition of the variance into the difference of the second moment and the squared expectation, a comparably intuitive one is

\[ V'(x_1, x_2, y) = \begin{pmatrix} x_1 - y \\ x_2 + x_1^2 - y^2 \end{pmatrix}, \]

which arises by choosing the full rank matrix

\[ h(x_1, x_2) = \begin{pmatrix} 1 & 0 \\ 2x_1 & 1 \end{pmatrix}. \]
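The relation between the two identification functions of this example can be verified symbolically by hand or, as in the following sketch, on a grid with exact rational arithmetic. The concrete forms V = (x_1 − y, x_2 − (x_1 − y)²) and V′ = (x_1 − y, x_2 + x_1² − y²) are assumed here; the grid and the uniform test distribution on {0, 1, 2, 3} are illustrative choices.

```python
from fractions import Fraction as Fr
from itertools import product


def V(x1, x2, y):
    """Assumed form: (x1 - y, x2 - (x1 - y)^2)."""
    return (x1 - y, x2 - (x1 - y) ** 2)


def V_prime(x1, x2, y):
    """Assumed form: (x1 - y, x2 + x1^2 - y^2)."""
    return (x1 - y, x2 + x1 ** 2 - y ** 2)


def h(x1, x2):
    """Matrix from the example, rows (1, 0) and (2 x1, 1)."""
    return ((1, 0), (2 * x1, 1))


def matvec(m, v):
    return tuple(m[i][0] * v[0] + m[i][1] * v[1] for i in range(2))


# Pointwise identity h(x) V(x, y) = V'(x, y) on a grid of rationals.
grid = [Fr(n, 2) for n in range(-4, 5)]
identity_holds = all(
    matvec(h(x1, x2), V(x1, x2, y)) == V_prime(x1, x2, y)
    for x1, x2, y in product(grid, repeat=3)
)

# Y uniform on {0, 1, 2, 3}: mean 3/2, variance 5/4.
support = [Fr(k) for k in range(4)]
mean = sum(support) / len(support)
variance = sum((y - mean) ** 2 for y in support) / len(support)


def expectation(v, x1, x2):
    values = [v(x1, x2, y) for y in support]
    return tuple(sum(component) / len(values) for component in zip(*values))


print(identity_holds)
print(expectation(V, mean, variance), expectation(V_prime, mean, variance))
```

Both functions vanish in expectation at (mean, variance), and they agree up to the invertible transformation h, as the characterisation (1) suggests.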
Example 3 (Quantile and ES). In financial mathematics, Value-at-Risk at level α ∈ (0, 1) (VaR_α) denotes the lower α-quantile, VaR_α(F) = inf q_α(F) = inf{x ∈ R | α ≤ F(x)}. Then, the ES at level α ∈ (0, 1) of a distribution F is formally defined as

\[ \mathrm{ES}_\alpha(F) = \frac{1}{\alpha}\int_0^\alpha \mathrm{VaR}_\beta(F)\,\mathrm{d}\beta = \frac{1}{\alpha}\,\mathbb{E}_F\bigl[Y\,\mathbb{1}\{Y \le \mathrm{VaR}_\alpha(F)\}\bigr] - \frac{1}{\alpha}\,\mathrm{VaR}_\alpha(F)\bigl(F(\mathrm{VaR}_\alpha(F)) - \alpha\bigr). \tag{3} \]

On any subclass of F_α(R) where ES_α is finite, e.g. on F_α(R) ∩ F_1(R), there is the following strict identification function for (q_α, ES_α):

\[ V(x_1, x_2, y) = \begin{pmatrix} \mathbb{1}\{y \le x_1\} - \alpha \\ x_2 - \frac{1}{\alpha}\, y\, \mathbb{1}\{y \le x_1\} \end{pmatrix}, \]

where the second component naturally corresponds to a truncated expectation. Applying (1) with the full rank matrix

\[ h(x_1, x_2) = \begin{pmatrix} 1 & 0 \\ x_1/\alpha & 1 \end{pmatrix}, \]

one obtains the alternative strict identification function

\[ V'(x_1, x_2, y) = \begin{pmatrix} \mathbb{1}\{y \le x_1\} - \alpha \\ x_2 - x_1 + \frac{1}{\alpha}\, \mathbb{1}\{y \le x_1\}\,(x_1 - y) \end{pmatrix}. \tag{4} \]

The advantage of V′ over V is that when evaluating V′ on a discontinuous distribution with F(VaR_α(F)) > α, even though the first components of V and V′ fail to be an identification function for q_α,² the second component of V′ still vanishes in expectation when plugging in the correct values q_α(F) and ES_α(F) for x_1 and x_2. Intuitively, the second component of V′ adds a correction term corresponding to the one on the right-hand side of (3). The choice (4) is already utilised by Dimitriadis and Bayer (2019, Equation (4)) for Z-estimation of a joint quantile and ES regression model and naturally shows up in consistent scoring functions for (q_α, ES_α); see Fissler and Ziegel (2016, Corollary 5.5). Finally, notice that ES_α(F) is sometimes also defined as the upper average quantile over VaR_β with β ∈ (α, 1). Then, our results apply mutatis mutandis.
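The claimed advantage of V′ on distributions with an atom at VaR_α can be checked exactly on a small discrete example. The sketch below assumes the usual forms V = (1{y ≤ x_1} − α, x_2 − y 1{y ≤ x_1}/α) and V′ = (1{y ≤ x_1} − α, x_2 − x_1 + 1{y ≤ x_1}(x_1 − y)/α); the three-point distribution, the level α = 1/10 and all helper names are illustrative assumptions.

```python
from fractions import Fraction as Fr

ALPHA = Fr(1, 10)
# Atoms (value, probability); F jumps from 1/20 to 1/5 at y = -1,
# so F(VaR_alpha) = 1/5 > alpha.
DIST = [(Fr(-3), Fr(1, 20)), (Fr(-1), Fr(3, 20)), (Fr(0), Fr(4, 5))]


def var_alpha(dist, alpha):
    """Lower alpha-quantile inf{x : alpha <= F(x)} of a discrete law."""
    cumulative = Fr(0)
    for y, p in sorted(dist):
        cumulative += p
        if cumulative >= alpha:
            return y
    raise ValueError("probabilities do not reach alpha")


def es_alpha(dist, alpha):
    """(1/alpha) times the integral of VaR_beta over beta in (0, alpha)."""
    remaining, total = alpha, Fr(0)
    for y, p in sorted(dist):
        mass = min(p, remaining)
        total += mass * y
        remaining -= mass
        if remaining == 0:
            break
    return total / alpha


def V(x1, x2, y):
    ind = Fr(int(y <= x1))
    return (ind - ALPHA, x2 - ind * y / ALPHA)


def V_prime(x1, x2, y):
    ind = Fr(int(y <= x1))
    return (ind - ALPHA, x2 - x1 + ind * (x1 - y) / ALPHA)


def expectation(v, x1, x2, dist):
    return tuple(sum(p * v(x1, x2, y)[i] for y, p in dist) for i in range(2))


q = var_alpha(DIST, ALPHA)
es = es_alpha(DIST, ALPHA)
print(q, es)
print(expectation(V, q, es, DIST))        # second component is non-zero
print(expectation(V_prime, q, es, DIST))  # second component vanishes
```

In this toy case the first components of both functions equal F(VaR_α) − α ≠ 0, while only the second component of V′ vanishes at the true pair, in line with the discussion above.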

Formal statement of main result
The assertion of Theorem 4, and in particular its proof, parallels Osband's principle for consistent scoring functions (Fissler and Ziegel, 2016, Theorem 3.2); see also Osband (1985) and Gneiting (2011). To the best of our knowledge, the assertion was first stated in the PhD thesis of Fissler (2017, Proposition 3.2.1). We need the following assumptions.
Assumption (1). Let F be a convex class of distributions on O such that for every [...]

Assumption (2). For every y ∈ R^d there exists a sequence (F_n)_{n∈N} of distributions F_n ∈ F that converges weakly to the Dirac measure δ_y and a compact set K ⊂ R^d such that the support of F_n is contained in K for all n.

² To obtain a better understanding of identifiability for the possibly set-valued α-quantile and its lower endpoint VaR_α, one can distinguish three cases. First, if F is strictly increasing and continuous at its α-quantile, the latter is singleton-valued and V(x, y) = 1{y ≤ x} − α is a strict identification function both for q_α and for VaR_α. Second, if F is flat at its set-valued α-quantile, V is still a strict identification function for the set-valued q_α, but it is only a (non-strict) identification function for the singleton-valued VaR_α. Third, if F is discontinuous at VaR_α(F) such that F(VaR_α(F)) > α (that is, if F ∉ F_α(R)), neither q_α nor VaR_α is identified by V.
Assumption (3). Suppose that for Lebesgue almost all x ∈ int(A) the maps V(x, ·) and V′(x, ·) are locally bounded. Moreover, suppose that the complement of the set {(x, y) ∈ A × O | V(x, ·) and V′(x, ·) are continuous at the point y} has (k + d)-dimensional Lebesgue measure zero.

Assumptions (1), (2), and (3) basically correspond to Assumptions (V1), (F1), and (VS1) in Fissler and Ziegel (2016), respectively. Assumption (1) ensures that the class F is sufficiently rich, implying in particular the surjectivity of T onto int(A) and the fact that there are no redundancies in V in the sense that all its components are needed; see Remark 5 for some further comments. Assumptions (2) and (3) ensure that V(x, y) can be approximated by a sequence of integrals V(x, F_n).
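The role of the continuity requirement in this approximation can be made concrete. The sketch below is only an illustration under assumptions not made in the paper: F_n is taken to be the uniform distribution on the two points y ± 1/n (which converges weakly to δ_y with supports in a compact set), and the two integrands are the second component of the (mean, variance) identification function and the quantile identification function at level 1/10.

```python
from fractions import Fraction as Fr

ALPHA = Fr(1, 10)


def v_var(x1, x2, y):
    """x2 - (x1 - y)^2, continuous in y."""
    return x2 - (x1 - y) ** 2


def v_quantile(x, y):
    """1{y <= x} - ALPHA, which jumps at y = x."""
    return Fr(int(y <= x)) - ALPHA


def two_point_average(v, y, n):
    """Integrate v against F_n, the uniform law on {y - 1/n, y + 1/n}."""
    eps = Fr(1, n)
    return (v(y - eps) + v(y + eps)) / 2


x1, x2, y = Fr(1), Fr(2), Fr(3)

# Continuous case: the approximation error equals -1/n^2 and vanishes.
smooth_errors = [
    two_point_average(lambda t: v_var(x1, x2, t), y, n) - v_var(x1, x2, y)
    for n in (1, 10, 100)
]

# Discontinuity at y = x = 0: the averages stay at 1/2 - ALPHA,
# whereas the pointwise value is 1 - ALPHA.
jump_errors = [
    two_point_average(lambda t: v_quantile(0, t), Fr(0), n) - v_quantile(0, Fr(0))
    for n in (1, 10, 100)
]

print(smooth_errors)
print(jump_errors)
```

The second list does not shrink with n, which is why pointwise statements can only be expected at points (x, y) where the identification functions are continuous in y.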
Theorem 4. Let T : F ։ A ⊆ R^k be a functional with a strict F-identification function V : A × O → R^k. Then the following two assertions hold: [...] for all x ∈ int(A) and for all F ∈ F.
If V′ is a strict F-identification function for T and it also satisfies Assumption (1), then additionally det(h(x)) ≠ 0 for all x ∈ int(A). If the integrated identification functions V(·, F) and V′(·, F) are continuous, then h is also continuous, which implies that either det(h(x)) > 0 for all x ∈ int(A) or det(h(x)) < 0 for all x ∈ int(A).
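If one reads the theorem as asserting a relation of the form V′(x, F) = h(x) V(x, F) for all F in the class, then h(x) is pinned down as soon as k distributions yield linearly independent vectors V(x, F). The following sketch illustrates this for k = 2 with the (mean, variance) pair; it is not the construction used in the proof. The functional forms of V and V′, the use of Dirac measures δ_a and δ_b, and the helper `solve_h` are all illustrative assumptions.

```python
from fractions import Fraction as Fr


def V(x, y):
    """Assumed (mean, variance) identification function; equals V(x, delta_y)."""
    x1, x2 = x
    return (x1 - y, x2 - (x1 - y) ** 2)


def V_prime(x, y):
    """Assumed alternative identification function; equals V'(x, delta_y)."""
    x1, x2 = x
    return (x1 - y, x2 + x1 ** 2 - y ** 2)


def solve_h(x, a, b):
    """Solve h(x) A = B, where the columns of A and B are V and V'
    evaluated at the Dirac measures in a and b."""
    (a11, a21), (a12, a22) = V(x, a), V(x, b)
    (b11, b21), (b12, b22) = V_prime(x, a), V_prime(x, b)
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("V(x, delta_a) and V(x, delta_b) are linearly dependent")
    # Inverse of A = [[a11, a12], [a21, a22]].
    inv = ((a22 / det, -a12 / det), (-a21 / det, a11 / det))
    B = ((b11, b12), (b21, b22))
    return tuple(
        tuple(B[i][0] * inv[0][j] + B[i][1] * inv[1][j] for j in range(2))
        for i in range(2)
    )


h_at_1_2 = solve_h((Fr(1), Fr(2)), Fr(0), Fr(2))
h_at_3_5 = solve_h((Fr(3), Fr(5)), Fr(0), Fr(1))
print(h_at_1_2)
print(h_at_3_5)
```

In both cases the recovered matrix has rows (1, 0) and (2 x_1, 1), matching the matrix of Example 2 and having determinant 1 throughout, consistent with the constant-sign statement above.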
Proof of Theorem 4. Part (i) is a direct consequence of the linearity of the expectation. For (ii), the proof of the existence of h follows along the lines of Theorem 3.2 in Fissler and Ziegel (2016). One just needs to replace ∇S(x, F) with V′(x, F). If V′ satisfies Assumption (1) as well, one directly obtains that h must have full rank on int(A) by exchanging the roles of V and V′. If the expected identification functions are both continuous, the continuity of h follows again exactly as in the proof of Theorem 3.2 in Fissler and Ziegel (2016).
For the pointwise assertion (5), consider (x, y) ∈ int(A) × O such that both V(x, ·) and V′(x, ·) are continuous at y. (Due to Assumption (3), this holds for Lebesgue almost all (x, y).) Let (F_n)_{n∈N} ⊆ F be a sequence as specified in Assumption (2). That is, (F_n)_{n∈N} converges weakly to the Dirac measure δ_y [...]

¹ [...] on the class of absolutely continuous distributions with positive density; see Fissler and Hoga (2022, Remark 4.3).
The quantity of interest Y attains values in an observation domain O ⊆ R^d, which is equipped with the Borel σ-algebra. The class of potential probability distributions F of Y is denoted by F. Forecasts are elements of an action domain A ⊆ R^k. Formally, the functional of interest T is a potentially set-valued mapping from F to A, denoted by T : F ։ A, where the notation ։ indicates that the values of T are subsets of A, with the convention that we identify point-valued functionals such as the mean with the singleton containing this value. For O = A = R, prime examples for T are the mean or the α-quantile q_α(F) = {x ∈ R | lim_{t↑x} F(t) ≤ α ≤ F(x)}, α ∈ (0, 1), where the latter is interval-valued. A prime example of a multivariate functional is the mean functional in the case of multivariate observations (O = A = R^k). For univariate observations, examples are multiple quantiles at different levels, the pair (mean, variance) with the natural action domain [...]

For any set B ⊆ R^k, int(B) denotes the interior of B and conv(B) denotes the convex hull of B.
•) are continuous at the point y} has (k + d)-dimensional Lebesgue measure zero.