Horizon confidence sets

This paper introduces a new statistical procedure to discriminate between competing forecasting models at different forecast horizons. Unlike existing tests, which eliminate a model from all horizons if dominated according to some loss measure, our methodology identifies an ‘optimal’ set of models at each horizon, retaining a model that performs well at a given horizon even if dominated at others. While our method is especially useful in applications to long-term forecasting as well as short-term nowcasting, it can also be applied in wider settings like the comparison of forecasting models across countries. We conduct a small Monte Carlo study to investigate the finite sample properties and apply our procedure to nowcasting US real GDP growth and its subcomponents, comparing a factor-based nowcasting method to a naïve benchmark. Unlike existing methods, ours can formally detect the point in the quarter at which the factor method beats the benchmark or vice versa.


Introduction
Forecasters are often interested in the performance of econometric models at forecasting different horizons into the future. However, as soon as we compare two or more competing models across many horizons, challenging questions arise. Is there a single best model at every horizon? Do models all perform equally well at all horizons? Is there a different optimal model at different horizons? Many empirical studies in different settings have suggested that the best predictive model does, indeed, change with horizon. For instance, in long-run exchange rate forecasting, the survey of Rossi (2013) finds that univariate models dominate at short horizons while models with economic fundamentals are best at long horizons. In short-term nowcasting, where horizons are typically thought of in terms of days before the release of gross domestic product (GDP) data, studies such as Banbura et al. (2013) find that big data methods like factor models only dominate when the daily nowcast horizon is small enough to allow a lot of relevant information to become available.
In this paper, we propose a statistical procedure which addresses the above questions, and obtains the collection of models which dominate at different horizons with a given level of confidence. We coin the term Horizon Confidence Set to denote the collection of 'optimal horizon-specific models'. Our approach is based on the model confidence set (MCS) procedure developed by Hansen et al. (2011), but is modified to operate over multiple horizons on the same set of models. Specifically, we compute Diebold-Mariano t-statistics (Diebold and Mariano 1995) for equal predictive ability (EPA) to compare two competing models in each of the horizons. Then we propose an elimination rule based on the maximal t-statistic which removes a model from a specific horizon if its p-value falls below the nominal level. Unlike existing procedures, our methodology therefore does not operate as a 'horse race'-type test identifying only the dominant model overall, but retains the optimal set of models across all horizons. It does so while guarding against the multiple testing problem which arises from testing across horizons. In the 'Appendix', we generalize our procedure to allow more than two models at each horizon.
Besides the multi-horizon context of forecasting and nowcasting, our procedure can also be applied to other settings: for instance, replacing 'horizon' with 'country', the methodology allows the comparison of two competing models across countries, retaining possibly a different model for different countries. Alternatively, taking the exchange rate forecasting example again, dating back to Mark (1995) it has become custom to compare the predictive ability of the different exchange rate models for different currency pairs with the US dollar. Our method could therefore be used to perform this cross-currency comparison, instead of the multi-horizon aspect.
The horizon confidence set procedure proposed in this paper differs from the original MCS procedure of Hansen et al. (2011) in two important ways. Firstly, one could consider directly applying the MCS procedure to all model-horizon pairs jointly. However, this procedure would potentially eliminate all models from a single horizon and would also involve computing unfair comparisons of, say, model A at horizon x to model B at horizon y. Secondly, one might consider applying the MCS procedure to each horizon independently. However, by not guarding against the multiple testing issue across horizons, this may produce too many false positives and provide the researcher with a sparser set of models than is statistically justified. Such 'sparsity' cannot arise in our case as models are removed at a given horizon only when the performance is worse relative to comparisons from other horizons. Moreover, note that standard Bonferroni-type methods are typically not advisable in many of the settings we consider as they become too conservative when the number of horizons is large.
Our method also differs from other procedures which are in principle applicable to multiple models at multiple horizons, such as the MCS procedure based on the concept of uniform and average superior predictive ability (SPA) proposed by Quaedvlieg (2020). In his procedure, Quaedvlieg (2020) aims to detect the model(s) which either strictly dominate the competitor models (uniform SPA) or which exhibit the best average performance (average SPA) across all horizons. Though of separate interest, in the case where the 'optimal' set of models changes across horizons, uniform SPA would fail to provide a conclusive answer, while average SPA may lead us to retain all models in all horizons, even if models are dominated at specific horizons. On the contrary, our procedure is able to potentially identify this changing pattern of 'optimal' models across horizons.
We will apply our methodology to short-term nowcasting, which we consider to be a leading case. Our method is complementary to various existing nowcasting papers which have tended to shut down one of the two channels of multiple testing in model evaluation. On the one hand, some studies have performed tests for nowcast evaluation on a pair of models at single nowcast horizons (for example Giannone et al. 2016), whereas other studies focus on the performance of a single nowcasting model across many nowcast horizons (Fosten and Gutknecht 2020). To further add to the literature of nowcasting, we outline how our method can be helpful in performing nowcast combination at different release dates and demonstrate how this nowcast combination approach using our confidence set output can be used to test nowcast monotonicity (see also Banbura et al. 2013; Aastveit et al. 2014; Knotek and Zaman 2017).
More specifically, our empirical application looks at the factor model method used by Bok et al. (2018) in making the New York Fed Staff Nowcasts. We extend their analysis to consider nowcasts of the five subcomponents of US real GDP as well as aggregate GDP. We compare the nowcasts of this factor method to a simple autoregressive benchmark across the different GDP subcomponents. This builds on existing empirical nowcasting studies, including Marcellino and Schumacher (2010), Banbura et al. (2013), Luciani and Ricci (2014), Foroni and Marcellino (2014), Aastveit et al. (2014, 2017), Foroni et al. (2015), Antolin Diaz et al. (2017), Kim and Swanson (2018) and McCracken et al. (2019). As a preview of the results, our procedure does not find any evidence of substantial differences between the factor method and the benchmark for aggregate GDP growth or consumption growth. On the other hand, in subcomponents like investment and government spending, we are able to determine the point in the nowcast period at which the factor method beats the benchmark, or vice versa. This finding demonstrates how our method improves over the use of average or uniform SPA which would have forced us to reject or retain the models across all horizons.
The rest of the paper is divided as follows. Section 2 describes the horizon confidence set procedure, and Sect. 3 provides some further uses of this methodology for practitioners. Section 4 contains a small Monte Carlo study to assess the procedure's small sample behaviour, while Sect. 5 is the empirical application to nowcasting US real GDP. Finally, Sect. 6 concludes the paper. Additional figures and the extension to multiple models are given in the Appendix.

The horizon confidence set
In what follows, we outline the set-up and details of our horizon confidence set approach. We are interested in predicting a target variable $y_t$, for which we have observations $t = 1, \ldots, T$. As stated above, for tractability we will take the simplest possible modelling set-up where we wish to compare $M = 2$ models, which we collect into the set $\mathcal{M}^0 = \{1, 2\}$, over a set of $h = 1, \ldots, H$ different horizons. In the 'Appendix', we will set out how this extends so that $\mathcal{M}^0$ contains more than two models. We note that these horizons can be as in the traditional multi-step forecasting sense where we make forecasts of $y_{t+h}$ at increasing horizons for $h = 1, \ldots, H$. Alternatively, in the near-term nowcasting literature, where we nowcast $y_t$ (at a quarterly horizon of zero), the term horizon usually refers to daily, weekly or monthly horizons throughout the nowcast quarter at which we make predictions before the release date of the target variable.
In order to compare the predictions from different methods, we will use the losses of each model computed at each of the $H$ horizons. We define $L^h_{i,t}$ to be the loss of model $i \in \mathcal{M}^0$ at horizon $h$ in period $t$. With squared error loss, for instance, we obtain the commonly used mean squared forecast error (MSFE) losses. In order to illustrate this set-up, Fig. 1 shows two examples to visualize the kind of loss behaviour one might observe in practice.
The examples in Fig. 1 mimic a typical case in nowcasting, where we compare a big data factor method (Method 2) which is regularly updated at many nowcast horizons, with an autoregressive model (Method 1) which is only updated once in the quarter. We will explore this type of comparison in our empirical illustration later. Figure 1 is useful to show how different patterns of optimal models can arise, even in a two-model multi-horizon setting. On the one hand, in Example 1 in Fig. 1a we see that Method 2 dominates Method 1 in all horizons and we want a statistical procedure which is capable of detecting this. On the other hand, in Example 2 in Fig. 1b we see that Method 2 only dominates Method 1 in the second half of the horizons. In this case, we want a procedure which can formally detect at which point Method 2 improves over Method 1.
We define $d^h_{ij,t} = L^h_{i,t} - L^h_{j,t}$ to be the loss differential between models $i$ and $j$ in time period $t$ and horizon $h$. Moreover, we define its expectation to be $\mu^h_{ij} = E[d^h_{ij,t}]$. Note that we generally expect the pairs of loss differentials to be correlated across time $t$ since one (if not both) of the models in $\mathcal{M}^0$ will be (dynamically) misspecified. At each horizon $h = 1, \ldots, H$, the horizon confidence set is defined as:
$$\mathcal{M}^*_h = \{ i \in \mathcal{M}^0 : \mu^h_{ij} \le 0 \text{ for all } j \in \mathcal{M}^0 \}. \qquad (1)$$
This gives us the identity of the model or models at each horizon which are weakly dominant. The collection of horizon confidence sets $\{\mathcal{M}^*_h\}_{h=1}^H$, where each $\mathcal{M}^*_h$ is determined according to Eq. (1), fully describes which models should be used throughout the whole sequence of horizons. Note that this differs from the MCS procedures proposed in Quaedvlieg (2020) and Hansen et al. (2011), which can be compared best in terms of the differing null hypothesis being tested (see below). Thus, our idea is to arrive at a (different) subset of $\mathcal{M}^0$ at each horizon by eliminating any model which is inferior to the other model at that same horizon. This is accomplished by testing a sequence of null hypotheses $H_{0,h}$, $h = 1, \ldots, H$, given by:
$$H_{0,h}: \mu^h_{ij} = 0 \text{ for all } i, j \in \mathcal{M}_h. \qquad (2)$$
Here, each $\mathcal{M}_h$, $h = 1, \ldots, H$, is a subset of $\mathcal{M}^0$, and may in fact be different across $h$ depending on which model survives elimination in specific horizons in the sequential procedure we outline further below. As mentioned above, this differs conceptually from Hansen et al. (2011), whose direct translation into a multi-horizon framework (HLN-M) would be based on testing the following null: [equation missing in the source], where $\mathcal{M}$ denotes a generic subset of $\mathcal{M}^0$ containing models across horizons $1, \ldots, H$. By contrast, the concepts of uniform and average SPA (uSPA and aSPA, respectively) as defined by Quaedvlieg (2020) condense to testing the nulls: [equations missing in the source], where $\omega_h$, $h = 1, \ldots, H$, denote predetermined weights.
Thus, the main conceptual difference between our approach and the multi-horizon version of Hansen et al. (2011) as well as uSPA and aSPA of Quaedvlieg (2020), respectively, is that our horizon-specific tests are carried out focusing exclusively on comparisons at each horizon $h$, while both Hansen et al. (2011) and Quaedvlieg (2020) compare models across horizons $h$ and $l$, for $h, l \in \{1, \ldots, H\}$. As outlined in the Introduction, restricting ourselves to 'within-horizon' tests avoids unfair comparisons of, say, model 1 at horizon $h$ with model 2 at horizon $l$. It also allows us to retain models that outperform at specific horizons, but underperform at others (and could thus be eliminated by concepts such as uSPA or aSPA).
Finally, note that an alternative application of Hansen et al. (2011) could be to apply their procedure to each horizon separately. This amounts to testing $H_{0,h}$, $h = 1, \ldots, H$, for each $h$ as in our case. However, this independent procedure does not take account of the issue of multiple testing which occurs when we compare models across horizons. We therefore expect this method to over-reject the null, whereas we expect our method to be better able to guard against over-rejections. Our Monte Carlo simulations in Sect. 4 further explore this point.
An alternative way of writing our null hypotheses in Eq. (2) is to write a horizon-stacked version of Hansen et al. (2011):
$$H_0: \mu^h_{ij} = 0 \text{ for all } i, j \in \mathcal{M}_h \text{ and all } h \in \{1, \ldots, H\}.$$
The alternative hypothesis, denoted by $H_A$, is that $\mu^h_{ij} \neq 0$ for some $i, j \in \mathcal{M}_h$ and some $h$. The test statistic takes the maximum over model pairs and over horizons:
$$T_{\max} = \max_{h \in \{1, \ldots, H\}} \; \max_{i, j \in \mathcal{M}_h} \left| t^h_{ij} \right|. \qquad (4)$$
Here, $t^h_{ij}$ can either denote Diebold-Mariano t-statistics (Diebold and Mariano 1995) of the form:
$$t^h_{ij} = \frac{\bar{d}^h_{ij}}{\sqrt{\hat{V}(\bar{d}^h_{ij})}},$$
where $\bar{d}^h_{ij}$ and $\hat{V}(\bar{d}^h_{ij})$ are the estimated mean loss differential of models $i, j$ at horizon $h$ and its estimated variance (see below), or simply a non-studentized statistic $t^h_{ij} = \bar{d}^h_{ij}$. The use of non-studentized statistics (along with an appropriate bootstrap procedure) has been seen in papers such as White (2000). Moreover, standard HAC estimators for the variance may suffer from size distortions in small samples unless appropriately corrected (see Coroneo and Iacone 2019, and references therein). On the other hand, as argued by Romano and Wolf (2005) in the context of superior predictive ability testing, studentization may have favourable properties in terms of improving the power in finite samples. We therefore proceed in a general way allowing for $t^h_{ij}$ to be either a studentized or a non-studentized statistic at this stage, although we will restrict ourselves to non-studentized statistics later on in the empirical section.
In order to construct $\bar{d}^h_{ij}$ and $\hat{V}(\bar{d}^h_{ij})$, respectively, we use a sample of $T$ time series observations. However, while $\bar{d}^h_{ij}$ and $\hat{V}(\bar{d}^h_{ij})$ typically depend on parameters which need to be estimated, we abstract from the parameter estimation problem in this context to avoid additional notation. From a technical perspective, this may be motivated by the fact that when the sample is split into sub-samples of sizes $R$ and $P$, where $R$ is the number of observations retained for parameter estimation and $P$ the number of observations used for pseudo-out-of-sample forecasts, a condition such as $\lim_{T \to \infty} (P/R) = 0$ is sufficient to make parameter estimation error in the Diebold-Mariano test negligible asymptotically; the same holds when the in-sample and out-of-sample loss functions are the same (cf. West 1996). Over the periods $t = 1, \ldots, T$, we define $\bar{d}^h_{ij}$ as:
$$\bar{d}^h_{ij} = \frac{1}{T} \sum_{t=1}^{T} d^h_{ij,t},$$
while an appropriate and consistent HAC-type estimator $\hat{V}(\bar{d}^h_{ij})$ can be used for the long-run variance of $d^h_{ij,t}$ (see Newey and West 1987) to account for the (potential) serial correlation in the loss differential due to model misspecification.
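As an illustration of the statistics described above, the following minimal Python sketch computes the mean loss differential, a Bartlett-kernel (Newey-West) long-run variance, the resulting Diebold-Mariano statistic, and the maximum absolute statistic across horizons for the two-model case. The function names, the `lags` truncation argument and the data layout (one loss series per horizon) are assumptions made for this example and are not taken from the paper.

```python
import math

def newey_west_lrv(x, lags):
    """Bartlett-kernel (Newey-West) long-run variance estimate of a series."""
    T = len(x)
    mean = sum(x) / T
    u = [v - mean for v in x]
    lrv = sum(e * e for e in u) / T
    for k in range(1, lags + 1):
        gamma_k = sum(u[t] * u[t - k] for t in range(k, T)) / T
        lrv += 2.0 * (1.0 - k / (lags + 1)) * gamma_k
    return lrv

def dm_statistic(loss_i, loss_j, lags=0, studentize=True):
    """Statistic for the loss differential d_t = L_{i,t} - L_{j,t} at one horizon.

    With studentize=False the plain mean differential is returned.
    lags=0 is only a placeholder default (an assumption); a data-driven
    truncation lag would normally be chosen. A constant differential gives
    a zero variance and is not handled here.
    """
    d = [a - b for a, b in zip(loss_i, loss_j)]
    T = len(d)
    d_bar = sum(d) / T
    if not studentize:
        return d_bar
    var_d_bar = newey_west_lrv(d, lags) / T  # variance of the sample mean
    return d_bar / math.sqrt(var_d_bar)

def max_statistic(losses_1, losses_2, lags=0, studentize=True):
    """losses_m[h] is the loss series of model m at horizon h (two-model case)."""
    stats = [dm_statistic(l1, l2, lags, studentize)
             for l1, l2 in zip(losses_1, losses_2)]
    return max(abs(s) for s in stats), stats
```

A positive statistic at a horizon indicates that model 1 has the larger average loss there; the first returned value is the largest absolute deviation across horizons.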
Intuitively, the test statistic in (4) picks out the largest gap across all pairwise model comparisons for each horizon given the surviving set of models in each $\mathcal{M}_h$, $h \in \{1, \ldots, H\}$, and then chooses the largest such deviation across horizons. Letting $\mathcal{H}$ denote the set of all horizons $1, \ldots, H$, the procedure is completed by using the following elimination rule:
$$(h_e, i_e) = \arg\max_{h \in \mathcal{H},\, i \in \mathcal{M}_h} \; \max_{j \in \mathcal{M}_h} t^h_{ij}.$$
This rule eliminates model $i$ from the horizon $h$, identified by taking the arg max over both $h$ and $i$. We are now ready to state the horizon confidence set algorithm: [algorithm listing missing in the source; the surviving fragment cites Hansen et al. (2011)].
Since the asymptotic distribution of the test based on Diebold-Mariano statistics depends on unknown nuisance parameters, a bootstrap procedure will be used for inference. More specifically, letting $d_{ij,t}$ denote an $H$-dimensional vector containing $d^h_{ij,t}$, $h = 1, \ldots, H$, as elements, we can construct bootstrap samples as follows. For each $b = 1, \ldots, B$:
• Re-sample blocks of length $l$ from $d_{ij,t}$, $t = 1, \ldots, T$, $i, j \in \mathcal{M}_h$, with replacement using the moving block bootstrap of Künsch (1989). Call these draws $d^b_{ij,t}$ […], $h = 1, \ldots, H$, and, when the studentized version of $t^h_{ij}$ is used, the bootstrap variance is estimated according to Götze and Künsch (1996) and Gonçalves and White (2004): [equation missing in the source], where $Q$ denotes the number of blocks and $q$ is the corresponding counter (for simplicity, assume that $T = Q \cdot l$).
• Construct the statistic: [equation missing in the source] and obtain: [equation missing in the source]
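The sequential logic of the procedure can be sketched in Python for the two-model case with non-studentized statistics. This is only a hedged illustration, not the authors' algorithm: the function names are hypothetical, and several details are assumptions, namely that bootstrap means are centred at the sample means, that the p-value is the share of bootstrap statistics at least as large as the sample statistic, and that fresh bootstrap draws are taken at every elimination step. The same resampled time indices are applied to all horizons so that cross-horizon dependence is preserved.

```python
import random

def mean(x):
    return sum(x) / len(x)

def moving_block_indices(T, block_len, rng):
    """Time indices for one moving-block bootstrap sample of length T."""
    n_blocks = -(-T // block_len)  # ceiling division
    idx = []
    for _ in range(n_blocks):
        start = rng.randrange(T - block_len + 1)
        idx.extend(range(start, start + block_len))
    return idx[:T]

def horizon_confidence_sets(d, alpha=0.25, B=400, block_len=4, seed=0):
    """d[h][t] = L^h_{1,t} - L^h_{2,t}. Returns {h: set of surviving models}."""
    rng = random.Random(seed)
    H, T = len(d), len(d[0])
    d_bar = [mean(d_h) for d_h in d]
    sets = {h: {1, 2} for h in range(H)}
    while True:
        # Horizons that still contain both models are the only ones tested.
        active = [h for h in range(H) if len(sets[h]) == 2]
        if not active:
            break
        t_max = max(abs(d_bar[h]) for h in active)
        exceed = 0
        for _ in range(B):
            idx = moving_block_indices(T, block_len, rng)
            t_boot = 0.0
            for h in active:
                centred = mean([d[h][t] for t in idx]) - d_bar[h]
                t_boot = max(t_boot, abs(centred))
            if t_boot >= t_max:
                exceed += 1
        p_value = exceed / B
        if p_value >= alpha:
            break  # equal predictive ability not rejected: stop eliminating
        # Remove the model with the larger average loss at the worst horizon.
        worst_h = max(active, key=lambda h: abs(d_bar[h]))
        worse_model = 1 if d_bar[worst_h] > 0 else 2
        sets[worst_h].discard(worse_model)
    return sets
```

Because only the single worst horizon loses a model at each step, a horizon can never end up empty, and the remaining horizons are re-tested against a maximum taken over the surviving comparisons only.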

Extensions to the horizon confidence set
There are several ways in which the horizon confidence set might be extended and used for further analysis. In this section, we shine particular light on two such extensions. Firstly, we might want to perform model averaging at the various horizons, thereby extending the original suggestion of Hansen et al. (2011) to the multi-horizon setting. Secondly, in the nowcasting case we might use these results to test for nowcast monotonicity, which has become a common criterion used to check whether nowcasting methods improve as we add information (Banbura et al. 2013).

Model averaging
The horizon confidence set procedure outlined above gives rise to a set of models which are to be used at different horizons. Asymptotically speaking, one should in principle be 'indifferent' between the model(s) included at a given horizon. Depending on the scenario, this could potentially lead to cases where individual models move in and out as the horizon changes and the number of models used in each horizon could change repeatedly. If there is more than one optimal model at different horizons, it may be operationally preferable to just form averages constructed across the different $\hat{\mathcal{M}}^*_{h,1-\alpha}$, $h = 1, \ldots, H$. For example, for every model $i$ and horizon $h$, one could form simple averaging weights from the non-eliminated models as follows:
$$w_{ih} = \frac{I\{ i \in \hat{\mathcal{M}}^*_{h,1-\alpha} \}}{| \hat{\mathcal{M}}^*_{h,1-\alpha} |},$$
where $| \hat{\mathcal{M}}^*_{h,1-\alpha} |$ denotes the number of models in $\hat{\mathcal{M}}^*_{h,1-\alpha}$ and the indicator function $I\{\cdot\}$ returns a value of 1 if model $i$ is included in $\hat{\mathcal{M}}^*_{h,1-\alpha}$ and zero otherwise. The nowcast combination is then calculated as:
$$\hat{y}_{ht} = \sum_{i \in \mathcal{M}^0} w_{ih} \, \hat{y}_{iht}, \qquad (7)$$
where $\hat{y}_{iht}$ is method $i$'s prediction at horizon $h$ in quarter $t$. In the case of nowcasting, this procedure gives an alternative to recent nowcast averaging procedures such as in Aastveit et al. (2018) where the weights are estimated directly. We also note that the resulting nowcasts retain the same asymptotic 'optimality' as the individual models from the collection $\{\mathcal{M}^*_h\}_{h=1}^H$ as they are just a linear combination of optimal models. Unlike using the individual models themselves, however, this simplification of the horizon confidence set not only allows us to reduce the complexity of the nowcasting procedure, but also to perform further specification tests such as the monotonicity test outlined next.
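The equal-weight averaging over surviving models can be written compactly. The sketch below assumes the estimated confidence sets are stored as a dictionary mapping each horizon to the set of retained model labels; the function names and this data layout are illustrative assumptions.

```python
def averaging_weights(confidence_sets, models=(1, 2)):
    """Equal weights over the models retained at each horizon, zero otherwise.

    confidence_sets[h] is the set of surviving model labels at horizon h.
    """
    return {
        h: {i: (1.0 / len(kept) if i in kept else 0.0) for i in models}
        for h, kept in confidence_sets.items()
    }

def combine_nowcasts(weights_h, nowcasts_h):
    """Weighted nowcast at one horizon; nowcasts_h[i] is model i's prediction."""
    return sum(weights_h[i] * nowcasts_h[i] for i in weights_h)

# Example: both models retained at horizon 1, only model 2 at horizon 2.
weights = averaging_weights({1: {1, 2}, 2: {2}})
combined_h1 = combine_nowcasts(weights[1], {1: 2.0, 2: 3.0})
combined_h2 = combine_nowcasts(weights[2], {1: 2.0, 2: 3.0})
```

When a single model survives at a horizon, the combination collapses to that model's nowcast, so the same code path covers both the averaging and the single-model case.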

Monotonicity testing
When looking at nowcasting, in addition to the selection of relevant nowcast models, an important consideration in the empirical literature is whether or not nowcast methods are monotonically improving as we add information (Banbura et al. 2013 is one example). In the presence of more than one model, however, there is little guidance on how to perform monotonicity tests, with the recently proposed test of Fosten and Gutknecht (2020) being established in the single-model case. One appeal of the averaging approach from the previous subsection is that it results in a single nowcast from the different models and allows us to perform this monotonicity test as if it were a single model. More specifically, suppose one constructs the nowcast combination $\hat{y}_{ht}$ as in Eq. (7) and the nowcast errors $\hat{\varepsilon}_{ht} = y_t - \hat{y}_{ht}$. Then, these errors can be used to construct a monotonicity test to assess whether the losses from these multiple models are (on average) declining over the horizons $1, \ldots, H$, i.e. as the nowcast period approaches the publication date of the target variable.

Monte Carlo simulation
In this section, we report the results of Monte Carlo simulations to investigate the properties of the horizon confidence set procedure developed in Sect. 2. In order to assess the performance, we will compare the results to those where the MCS procedure of Hansen et al. (2011) is applied independently to each horizon, where we expect the rejection rate to be too high even if models have equal predictive ability.
We set up our Monte Carlo design to be similar in spirit to the related simulation studies in Hansen et al. (2011) and Quaedvlieg (2020). As such, we will directly generate the losses $L^h_{i,t}$, which gives us the flexibility to freely change various aspects like which of the models appear in the true model set across horizons. This is important given that our set-up has an additional dimension to the aforementioned studies, specifically the operation of the elimination rule across $h = 1, \ldots, H$. We focus on the two-model set-up described above with $M = 2$, and for models $i = 1, 2$, we generate a sample of $T$ observations of the losses across horizons $L_{i,t} = [L^1_{i,t}, \ldots, L^H_{i,t}]$ according to the following data generating process (DGP): [equation missing in the source], where $\delta$ is a scalar parameter which controls the time series dependence of the losses, $N(0, I_H)$ is an $H \times 1$ random draw from a multivariate standard normal distribution and $\Sigma$ is used to control the dependence of model $i$'s losses across horizons. In the nowcasting setting, we indeed expect a reasonable amount of correlation of the losses for a given model across data release dates. We set $\Sigma$ to be the Toeplitz matrix with elements $\Sigma_{i,i'} = \rho^{|i-i'|}$ so that cross-horizon dependence is only governed by the single parameter $\rho$.
The important parameter vector $\theta_i = [\theta^1_i, \ldots, \theta^H_i]$ controls the behaviour of the mean of the loss differential because $E[L^h_{i,t}] = \theta^h_i$, and therefore in the terminology of the null hypotheses formulated in Eq. (2), we have that $\mu^h_{ij} = E[d^h_{ij,t}] = \theta^h_i - \theta^h_j$ for models $i, j$ and for horizons $h = 1, \ldots, H$. In this two-model case, for $\theta^h_1, \theta^h_2$ we use the following specification: [equation missing in the source], where $\theta$ is a scalar parameter. When $\theta = 0$, the models have equal loss at every $h = 1, \ldots, H$ and we do not expect the average rejection rate of the horizon confidence set procedure to be larger than $\alpha$. When $\theta > 0$, the models do not have equal loss: model 1 has lower loss in the second half of the horizons ($h > H/2$), whereas model 2 has lower loss in the first half of the horizons ($h \le H/2$). This mimics the nowcasting setting described above where one model may have better predictive ability for earlier data releases but be beaten by another model at other points in the data flow. Clearly, the larger the $\theta$, the greater the average loss differential and we expect the rejection rate to increase towards unity.
We consider a variety of values for this DGP set-up. We let the number of horizons be $H \in \{2, 4\}$, which gives a small but representative example of the set-up above. We let $\theta \in \{0, 0.1, 0.2, 0.5\}$ to give a range for the loss differential behaviour, and we consider $\delta = 0.2$ for the time series dependence and $\rho \in \{0, 0.5\}$ for the cross-horizon dependence. The sample sizes we consider are $T \in \{100, 200, 500\}$. We use $B = 400$ bootstrap replications to derive the critical values at each step of the MCS procedure and perform $K = 1000$ Monte Carlo replications. The nominal size is set to be $\alpha = 0.1$.
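A loss-generating design of this kind can be sketched as follows. Since the exact DGP and mean specification are not reproduced in the text, the code makes explicit assumptions: losses are the mean vector plus an AR(1) deviation with coefficient `delta`, the innovations are standard normal draws premultiplied by the Cholesky factor of the Toeplitz matrix with elements rho^|a-b|, and the mean vectors set model 1's mean loss to theta in the first half of the horizons and model 2's to theta in the second half (zero otherwise). All function names are hypothetical.

```python
import math
import random

def toeplitz_corr(H, rho):
    """H x H Toeplitz matrix with (a, b) element rho ** |a - b|."""
    return [[rho ** abs(a - b) for b in range(H)] for a in range(H)]

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(low[r][k] * low[c][k] for k in range(c))
            if r == c:
                low[r][c] = math.sqrt(A[r][r] - s)
            else:
                low[r][c] = (A[r][c] - s) / low[c][c]
    return low

def mean_losses(H, theta):
    """Assumed means: model 2 is better for h <= H/2, model 1 for h > H/2."""
    theta_1 = [theta if h <= H / 2 else 0.0 for h in range(1, H + 1)]
    theta_2 = [0.0 if h <= H / 2 else theta for h in range(1, H + 1)]
    return theta_1, theta_2

def simulate_losses(theta_i, T, delta, rho, rng):
    """Assumed DGP: L_t = theta_i + v_t, v_t = delta * v_{t-1} + chol(Sigma) z_t."""
    H = len(theta_i)
    C = cholesky(toeplitz_corr(H, rho))
    dev = [0.0] * H
    out = []
    for _ in range(T):
        z = [rng.gauss(0.0, 1.0) for _ in range(H)]
        shock = [sum(C[a][b] * z[b] for b in range(a + 1)) for a in range(H)]
        dev = [delta * dev[a] + shock[a] for a in range(H)]
        out.append([theta_i[a] + dev[a] for a in range(H)])
    return out
```

Under these assumptions, setting theta to zero gives equal mean losses at every horizon, while a positive theta produces loss differentials of opposite sign in the two halves of the horizon range.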
The results of these simulations are displayed in Table 1. The results compare the average rejection rate for our method relative to the method where we independently apply the MCS procedure of Hansen et al. (2011) to each horizon. We see that, in the […] Hansen et al. (2011) procedure to each horizon, the rejection rate is too high, at more than 20%. As we increase $\theta$, we see that the rejection rate moves closer to 100% and this rejection rate improves with the sample size, as seen in the $T = 500$ case. We note that the results are not very sensitive to changing the $\rho$ parameter which controls the degree of correlation of the losses across horizons. Overall, we find that our procedure is better able to control the rejection rate than an independent procedure across horizons, as is expected.

Empirical application
In order to illustrate our methodology, we use the example of nowcasting quarterly aggregate US real GDP growth and its subcomponents. We will focus on comparing the performance of factor-based methods, which use the common component from a data set of macroeconomic series, to the predictions of a naïve autoregressive benchmark. This kind of comparison of factor methods to a univariate benchmark has become standard in the factor model nowcasting literature (see Anesti et al. 2019 for a recent example). Our method will be able to formally detect at which points in the nowcast period the univariate benchmark is dominated (if at all). We will report results for aggregate GDP growth and five subcomponents (consumption, investment, government spending, imports and exports). This will shed more light on recent analyses of GDP subcomponent nowcasting, including Antolin Diaz et al. (2017) and Fosten and Gutknecht (2020). In what follows, we will describe the data and empirical set-up before presenting the results.

Data and set-up
In predicting quarterly real GDP and its subcomponents, we will follow the approach of Bok et al. (2018) who document the New York Fed Staff Nowcast procedure based on the factor model nowcasting methodology of Giannone et al. (2008). The data series and their transformations to stationarity are described at length in Bok et al. (2018). They construct a parsimonious database of the series most widely followed by market participants, only focusing on the headline series and not disaggregates. The data set comprises mostly monthly variables related to production, employment, consumption and consumer sentiment, housing, trade and so on. We update the data set using the FRED Economic Data service, starting in 1985M1 as in Bok et al. (2018) and ending in 2020M2, with the final data on the quarterly GDP series being in 2019Q3. We remove some variables which do not have sufficient data for the out-of-sample analysis detailed below, which results in a total of N = 25 series being used. As is customary in nowcasting studies, we keep track of the calendar of releases of all predictors, which dictates the nowcast horizons in the nowcast updating procedure. We will make nowcasts which are updated at intervals of 10 days from the start of the nowcast quarter up until day 20 of the first month of the following quarter, which is just before when GDP is first released by the Bureau of Economic Analysis (BEA). This gives a total of H = 11 nowcast updates per quarter observed, each of which corresponds, respectively, to days 10, 20, 30, …, 110 after the beginning of the reference nowcast quarter. Therefore, the last two nowcast updates are actually backcasts.
The factor-based method we use can be described as follows: we denote $y_{i,t}$ as the monthly or quarterly variables in the data set, where $i = 1, \ldots, N$ and $N = 25$ as above. We assume that there exists one latent factor, $f_t$, which drives the co-movement amongst the $y_{i,t}$ series as follows:
$$y_{i,t} = \lambda_i f_t + \varepsilon_{i,t}, \qquad (11)$$
where $\lambda_i$ is the factor loading for variable $i$ and $\varepsilon_{i,t}$ is an idiosyncratic error term. As in Bok et al. (2018), we only use one global factor in this relatively small data set. We do not mimic their use of additional local block factors due to the lack of data availability in our initial estimation window in the pseudo-out-of-sample experiment described below. We therefore have one factor, which we treat as fixed across all estimation windows. The model in Eq. (11) is specified at the monthly frequency, with the quarterly series treated as a filtered monthly series with missing observations. For these quarterly series, the aggregation from a latent monthly growth rate to the quarterly growth rate is dealt with using the method of Mariano and Murasawa (2003). In order to cast the model into state space form, the factor and idiosyncratic disturbances are state variables which are both assumed to follow AR(1) processes with normal innovations:
$$f_t = \phi f_{t-1} + u_t, \qquad (12)$$
$$\varepsilon_{i,t} = \psi_i \varepsilon_{i,t-1} + v_{i,t}, \qquad (13)$$
for $i = 1, \ldots, N$, where $u_t$ and $v_{i,t}$ are i.i.d. normal processes. Equations (11)-(13) jointly form the state space model which is estimated using the Kalman smoother and the expectation maximization (EM) algorithm (see Giannone et al. 2008; Doz et al. 2011, 2012; Bańbura and Modugno 2014 for full details of this procedure).
Having estimated the factor model and obtained the nowcasts for aggregate real GDP growth and the five subcomponents, we will compare the predictions to those of a simple AR(1) model as this is the most commonly used benchmark model in the literature. For these quarterly target series, the AR(1) method only has one distinct release date in the nowcast period, and so we only observe two distinct nowcasts throughout the prediction period.
To generate the nowcasts and nowcast errors, we will split the sample into T = R + P observations and use the pseudo-out-of-sample procedure as in West (1996). We will use the rolling scheme as suggested by Hansen et al. (2011), but we will compare the results to those of the recursive scheme which is widely used in empirical studies. We therefore start using data on the first R quarterly observations to estimate the models. The first predictions are made of quarter R + 1 where we begin adding information released at the beginning of the quarter and we update the nowcasts H = 11 times every 10 days until day 20 of the first month of the next quarter, just before the GDP data are released. Then, the sample is expanded by one quarter and the procedure is repeated, adding one quarter at a time, until the end of the sample. We start making nowcasts in 1994Q1 which results in P = 103 out-of-sample evaluation periods. For the nowcast error loss function, we will consider both the absolute value function L(e) = |e| and the squared error loss function L(e) = e^2, giving rise to test statistics involving mean absolute error (MAE) and mean squared forecast error (MSFE). As pointed out in Sect. 2, the tests are constructed using non-studentized statistics.
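The difference between the rolling and recursive schemes, and the two loss functions, can be made concrete with a short sketch. The generator names are hypothetical and quarters are indexed from zero; the code only enumerates estimation windows and target quarters, leaving model estimation and the within-quarter nowcast updates aside.

```python
def rolling_windows(T, R):
    """Yield (estimation_indices, target_index) with a fixed window of length R."""
    for target in range(R, T):
        yield range(target - R, target), target

def recursive_windows(T, R):
    """Yield (estimation_indices, target_index) with an expanding window."""
    for target in range(R, T):
        yield range(0, target), target

def absolute_loss(e):
    """L(e) = |e|, which averages to the MAE."""
    return abs(e)

def squared_loss(e):
    """L(e) = e^2, which averages to the MSFE."""
    return e * e
```

Both generators produce P = T - R target quarters; they differ only in whether the oldest observation is dropped as the window moves forward.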
For the horizon confidence set procedure, we will construct these at the 75% confidence level which, although using a somewhat high nominal level, is common practice in most empirical work using MCS including that of Hansen et al. (2003, 2011). Finally, the number of bootstrap repetitions is B = 400, where we chose the block length as the estimated optimal AR length across the loss differential series.

Horizon confidence set results
To gain a preliminary insight into the performance of the competing nowcast methods, Figs. 2 and 3 graph the evolution of the MAE and MSFE throughout the H = 11 horizons of the nowcast prediction period. These graphs are for the rolling scheme, whereas the corresponding figures for the recursive scheme are in the Appendix. From these sets of charts, we can see that these six different target variables produce rather different behaviour in terms of the loss differentials between the factor model and the AR(1) benchmark across different nowcast horizons. This gives us a good mixture of settings in which to apply the horizon confidence set procedure.
In the case of aggregate GDP, we see that there is no clear 'winner' between the factor method and the AR(1) benchmark uniformly across nowcast horizons in terms of MAE or MSFE. Looking at the scale of the MAE/MSFE, the two models indeed appear to deliver very close nowcast error losses, which will be formally assessed by our horizon confidence set procedure. For the cases of consumption, government spending and exports, we find that the AR(1) model provides lower MAE and MSFE than the factor method, whereas for investment the AR(1) model is worse over all nowcast horizons. The loss differential appears to be much larger in the case of MSFE in Fig. 3, especially for investment and government spending. In the case of imports, there is a less clear differential between the two models.
In order to make more formal statements about the performance of the models in Figs. 2 and 3, we now perform the horizon confidence set procedure. Figures 4 and 5 are graphical depictions of the final estimated collection $\{M^{*}_{h,1-\alpha}\}_{h=1}^{H}$. Looking at the main case, aggregate GDP growth, we find no evidence that either model outperforms the other at any of the nowcast horizons. This seems to confirm the graphical evidence from above which shows there to be only a small loss differential on average. The same picture holds in the case of consumption and imports, where we do not see a rejection of any model over any of the nowcast horizons.
Looking at the other variables, focusing first on the MSFE results in Fig. 5, for investment we see that the horizon MCS procedure removes the AR(1) model in every nowcast horizon after the first 2 months of the nowcast quarter. This implies that one might consider averaging the nowcasts from the factor and AR(1) models only up to day 60 of the nowcast period and only use the factor model nowcasts thereafter. For government spending, the reverse is true: both the factor method and the AR(1) [...] Fig. 4 shows fewer rejections for investment and government spending, which reinforces the earlier comment analysing the graphical evidence in Figs. 2 and 3. In the case of exports, we also see a handful of rejections of the factor method in the earlier part of the nowcast period. The fact that we see a variety of different features in these horizon confidence sets highlights the usefulness of our method in making decisions about the use of different nowcast models across multiple nowcast horizons. In the case of aggregate GDP, consumption and imports, our method justifies the use of model averaging of these two methods. On the other hand, for investment and government spending, model averaging only makes sense up to a certain point in the nowcast period, after which one of the two models is dominated. This kind of pattern would have remained undetected by the multi-horizon tests of Quaedvlieg (2020), where we would have to reject or retain a given model across all horizons.

Comparison with the independent horizons MCS
In this section, we compare the results of our horizon confidence set procedure to the case where we treat horizons independently and run the MCS procedure of Hansen et al. (2011) separately at each of the horizons. Figures 6 and 7 display the equivalent results to those in the previous section. We firstly see that there are many more rejections using this procedure than using our procedure. This could be likened to the results of Sect. 4 where simulation evidence suggested that the independent method has a high rejection rate even if the models have equal predictive ability. Our method also tends to produce more stable results with less fluctuation of models in and out of the confidence set across horizons than in Figs. 6 and 7.
To give an example of this switching behaviour, looking at the results for aggregate GDP under MSFE loss in Fig. 7, we see that the AR(1) model is removed in the second and third nowcast periods, whereas the factor method is removed in the fifth. This seems at odds with the graphical evidence in Fig. 3, where the crossings in the MSFE profiles are of very small magnitude relative to the scale.

Nowcast averaging and monotonicity test results
We next aim to shed some light on the performance of nowcast averaging in terms of monotonicity, when weights are constructed using the horizon confidence set results or when a simple equal weights scheme is used. We first construct nowcast averages y_kt in Eq. (7) using the two models from the horizon confidence set procedure (we denote this method 'MCS_AVE' in the following figures and tables). Clearly, these weights will fluctuate between (1, 0), (0, 1) and (1/2, 1/2) across all the nowcast horizons. This gives a single combined nowcast at each horizon which can be used to test monotonicity using the recent procedure of Fosten and Gutknecht (2020). We will also compare the MCS_AVE method to that of using a simple average across the models in all nowcast horizons (denoted simply 'AVE').
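The weighting logic described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes equal weights over the models retained at each horizon, and the function names and numerical values are hypothetical rather than taken from Eq. (7) or from our results.

```python
def mcs_weights(in_set):
    """Equal weights over the models retained at one horizon.
    `in_set` holds one boolean per model (True = retained)."""
    k = sum(in_set)
    if k == 0:
        raise ValueError("confidence set is empty at this horizon")
    return [1.0 / k if keep else 0.0 for keep in in_set]

def combine_nowcasts(nowcasts, retained):
    """nowcasts[h][m]: nowcast of model m at horizon h for one target quarter.
    retained[h][m]: True if model m is in the horizon-h confidence set.
    Returns the combined nowcast at each horizon."""
    combined = []
    for preds, in_set in zip(nowcasts, retained):
        w = mcs_weights(in_set)
        combined.append(sum(wi * p for wi, p in zip(w, preds)))
    return combined

# Made-up numbers: two models (factor, AR(1)) over three horizons.
nowcasts = [[2.0, 3.0], [2.5, 3.0], [2.8, 3.0]]
retained = [[True, True], [True, False], [False, True]]

mcs_ave = combine_nowcasts(nowcasts, retained)           # weights vary by horizon
ave = combine_nowcasts(nowcasts, [[True, True]] * 3)     # simple average throughout
```

With two models, `mcs_weights` can only return (1/2, 1/2), (1, 0) or (0, 1), which matches the three weight pairs mentioned above. The two combinations coincide at any horizon where both models are retained.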
Focusing on the MSFE results, Fig. 3 shows that, at least graphically, there is evidence of non-monotonicity in the MSFE profiles of the factor method for aggregate GDP, whereas cases like investment appear to be monotonically declining. Looking at these two cases of aggregate GDP and investment, the MSFE of the nowcast combination is shown in Fig. 8 which uses the MCS weights derived from the results displayed in Fig. 5. For aggregate GDP, the results coincide as no model is removed at any horizon. For investment, the MSFE of the MCS_AVE method is slightly lower than that of the simple average across horizons.
To assess these two cases formally, we obtain results of the monotonicity test of Fosten and Gutknecht (2020) for the null hypothesis of nowcast monotonicity of the averaged MSFE profiles in Fig. 8. These results are provided in Table 2. We find no evidence to reject the null hypothesis of nowcast monotonicity for either of the two series. This indicates that, while we do see graphical evidence of non-monotonicity for aggregate real GDP growth, these movements are very small and not statistically different from the case of a flat MSFE profile. For investment, we see that the p-value reduces slightly for the MCS_AVE version of the test relative to the AVE version. Both results demonstrate that a simple average across all models is equally capable of producing monotonically declining MSFE, even if it is slightly outperformed by MCS_AVE. Overall, this indicates that the method of nowcast averaging is capable of producing nowcasts of GDP and its subcomponents which are monotonically improving as the time horizon shrinks until the publication date of GDP.

Note to Table 2: κ represents the number of moment inequalities used in the monotonicity test and U* is the test statistic of Fosten and Gutknecht (2020), reported along with the bootstrap 50th, 90th and 95th percentiles and the one-sided p-value.

Conclusion
In this paper, we have proposed a methodology, which we call the Horizon Confidence Set, that allows researchers to determine which models to use at different horizons. We build on the MCS procedure introduced by Hansen et al. (2011), adapting it to the multi-horizon set-up. In both long-term forecasting and near-term nowcasting, which are the main applications of our methodology, we argue that we need a method capable of retaining models at horizons where they perform well while removing them at horizons where they underperform. Our proposed method sequentially eliminates the worst models in each horizon, resulting in a potentially changing set of 'optimal' models across different horizons. This provides an advantage over existing multi-horizon approaches which look for uniform or average model superiority (Quaedvlieg 2020). Our method also has advantages over naïvely applying the MCS procedure independently across horizons.
To facilitate the practical applicability of our methodology, we discuss how model combination can be performed on the basis of the horizon confidence sets. This is similar in spirit to various existing nowcast combination studies (including Kuzin et al. 2013; Aastveit et al. 2018) and also allows formal monotonicity tests to be conducted, as recently proposed by Fosten and Gutknecht (2020).
Finally, we apply our methodology using the factor model methodology employed by Bok et al. (2018) in making the New York Fed Staff Nowcasts. However, we extend their analysis of nowcasting the aggregate US real GDP growth rate to that of the five GDP subcomponents. Comparing this factor method to a naïve autoregressive benchmark, our procedure shows the point in the quarter at which the factor method beats the benchmark model or vice versa. This finding is novel and could have potentially remained undetected by existing tests for uniform or average SPA. On the other hand, when no model is dominant at any horizon our method is capable of reaching this conclusion, and model averaging is considered in these cases. Our results are also more stable than a simple independent horizons application of Hansen et al. (2011). We therefore deem our procedure a useful and complementary tool to existing methods in the literature which can yield new insights into the comparison of multi-horizon forecasts and nowcasts.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.