Evaluation of economic forecasts for Austria

In this paper, we evaluate macroeconomic forecasts for Austria and analyze the effects of external assumptions on forecast errors. We consider the growth rates of real GDP and the demand components as well as the inflation rate and the unemployment rate. The analyses are based on univariate measures like RMSE and Theil’s inequality coefficient and also on the Mahalanobis distance, a multivariate measure that takes the variances of and the correlations between the variables into account. We compare forecasts generated by the two leading Austrian economic research institutes, the Institute for Advanced Studies (IHS) and the Austrian Institute of Economic Research (WIFO), and additionally consider the forecasts produced by the European Commission. The results indicate that there are no systematic differences between the forecasts of the two Austrian institutes, neither for the traditional measures nor for the Mahalanobis distance. Generally, forecasts become more accurate with a decreasing forecast horizon, as expected; they are unbiased for forecast horizons of less than a year considering traditional measures and for the shortest forecast horizon considering the Mahalanobis distance. Finally, we find that mistakes in external assumptions, in particular regarding EU GDP and the oil price, translate into forecast errors for GDP and inflation.


Introduction
Macroeconomic forecasts provide important information for economic policy makers, companies and private households. In Austria, two research institutes, the Institute for Advanced Studies (Institut für Höhere Studien, IHS) and the Austrian Institute of Economic Research (Wirtschaftsforschungsinstitut, WIFO), have a long tradition of providing economic forecasts. In addition, the Austrian National Bank (Oesterreichische Nationalbank, OeNB) and some private banks publish macroeconomic forecasts for Austria. Further, international institutions like the European Commission (EC), the International Monetary Fund (IMF), and the Organisation for Economic Cooperation and Development (OECD) periodically produce forecasts for Austria.
This paper focuses on the evaluation of IHS and WIFO forecasts, because they share certain features that make them particularly suitable for comparison. First, both institutes publish their forecasts always and exactly at the same time in a joint press conference. This implies that the information sets for both institutes are very similar, as they use roughly the same cutoff day until which information is taken into account. The period between the cutoff day and the publication day is, at around one week, rather short, unlike for OeNB and international institutions (OECD, IMF, EC), which are part of a forecasting system that includes a large number of countries. The longer this interval, the more likely it is that recent events cannot be incorporated in the newest forecast publication. Both IHS and WIFO produce four short-term forecasts per year, which are published toward the end of each quarter. OeNB, on the other hand, publishes macroeconomic forecasts twice a year, in June and December.
Second, the target variables of IHS and WIFO are exactly the same. One might assume that this is true for all forecasts, but as practitioners know, there is a wide range of possible choices regarding the exact specification of the variable in question. On a quarterly basis the forecaster can, for example, choose whether to include a seasonal adjustment, a working-day adjustment, or the irregular component. Regarding the growth rates of GDP and its demand components, both IHS and WIFO forecast annual averages of the original series, i.e., not adjusted for working days. This might be true also for the private banks, but it is not for OeNB. The OeNB forecasts are part of the Eurosystem forecasts, which are based on series adjusted for working days to facilitate a better comparison across the different European countries. Different series are also available when it comes to inflation and unemployment. The national inflation rate usually differs only slightly from the harmonized European rate. However, this is not true for the unemployment rate due to substantial methodological differences.
Third, the objectives and therefore the evaluation and the underlying loss functions are the same for IHS and WIFO. Both institutes aim at producing "exact" point forecasts. Apart from definitional issues about what "exact" means, producing exact forecasts is not always the main goal of forecasters. It may well be, for example, that internal macroeconomic forecasts of a finance ministry should instead produce a conservative GDP forecast in order to reduce bad surprises when it comes to tax receipts. Overly pessimistic (and sometimes optimistic) forecasts, for their part, might increase the forecaster's media coverage and thus enhance its public visibility. 1 In any case, we assume that the main objective of both IHS and WIFO is to generate "exact" point forecasts, i.e., forecasts with the smallest forecast error. 2 Yet the precise definition of forecast error is far from obvious. On the one hand, there are different "realized" values with which the forecasts can be compared. We take as realizations the first release of the annual national accounts published by the Austrian statistical office (Statistik Austria) around nine months after the end of the year. Inflation and unemployment figures are usually not subject to revision. On the other hand, the choice of the loss function with respect to the forecasting error could alter the determination of the "best" forecast.
Traditionally, measures of accuracy like the mean absolute error (MAE), the mean absolute percentage error (MAPE), the mean-squared error (MSE), the root-mean-squared error (RMSE), the mean directional accuracy (MDA), and Theil's inequality measure U2 have been used. Although these measures are widely accepted and regularly applied in forecast evaluations, they have the drawback of being one-dimensional; i.e., they examine each variable separately and do not take the relationship between different forecasts into account. Therefore, we additionally consider the Mahalanobis distance, which jointly assesses the forecasts of a group of variables. 3 This multidimensional measure takes both the (potentially different) variances of the variables and the correlations between the different variables into account. In addition, we test for the unbiasedness of the forecasts and for differences between the forecasters.
Specifically, we evaluate the growth forecasts of GDP and its demand components: private consumption, gross fixed capital formation, exports, and imports. We also consider the forecasts of unemployment and inflation. The forecast assessment is based on the four forecasts published per year for the current year and the following year.
Furthermore, we consider forecasts of the European Commission, where we point out that a direct comparison with the national forecasts by IHS and WIFO may be flawed due to the different sets of information available at the time of forecasting. We examine the EC spring and autumn forecasts for the current year and the following year, published in May and November. The related analysis excludes inflation and unemployment, since national and international forecasters use different definitions.
Finally, we complement our forecast evaluation with an analysis of the effects of errors regarding external assumptions on forecast errors for GDP and inflation. For this purpose, we consider a number of assumptions related to the international environment, including the GDP growth of the European Union, the oil price, and the foreign exchange rate (euro versus US dollar), and examine whether and to what extent mistakes in these assumptions translate into forecast errors for GDP growth and inflation.

1 Obviously, both IHS and WIFO are also interested in public visibility, but not at the cost of producing "extreme" (false) forecasts.
2 Alternative objectives might be to predict business-cycle turning points. See, for example, Giusto and Piger (2017) and Kovacs et al. (2017).
3 Sinclair and Stekler (2013) were probably the first to use the Mahalanobis distance in the context of macroeconomic forecasting, when they compared different vintages of US GDP and major component estimates. Other applications of this method in forecast evaluation include Sinclair et al. (2012, 2015), Döhrn (2015) and Sinclair et al. (2016).
The paper is structured as follows. The next section sketches the dataset. We then describe the traditional evaluation measures and the Mahalanobis distance. In the following section we present and discuss the results, including the findings related to the effects of external assumptions on the forecast error. Finally, we summarize the main findings and conclude.

Data
The data include IHS and WIFO economic forecasts published in the period 1995 to 2017. In each year, the two institutes publish four forecasts, usually at the end of each quarter, i.e., in March, June, September, and December. At each forecast date and for each forecast variable, annual forecasts for the years t and t + 1 are published. 4 Hence, the included forecast years cover the period 1995 to 2017, and for each year eight forecasts are available, four published in the year t − 1 (year-ahead forecasts), and four published in the year t (current-year forecasts). For the first year, 1995, only four forecasts are available, as the t +1 forecasts of 1994 are missing. 5 We consider the following variables: annual growth rates of real GDP, private consumption, real gross fixed capital formation (investment), exports, and imports. In addition to these demand components, forecasts of the inflation rate as measured by the national consumer price index and forecasts of the unemployment rate (according to national statistics, i.e., registered unemployment) are evaluated.
As actual data, we use the first (and usually final) publications of annual inflation and annual unemployment data; for the national accounts variables, we take the first release of the annual accounts, published by the Austrian statistical office approximately nine months after the end of a given year. While there are almost no revisions of unemployment and inflation rates, GDP and its components are frequently revised, and these revisions are sometimes large relative to the absolute values of the growth rates. This issue is particularly relevant for capital formation, due to its large variability. We use the first release of the annual accounts rather than the first "preliminary" release based on quarterly accounts, which are available three months after the end of the year, as the annual accounts are based on a larger information set. In a robustness check, we find that the differences between these data vintages are very small (see Sect. 4.3). Alternatively, it would be possible to take the latest data release.
The question of what values to take as realizations or actual values is a much debated issue. In the literature on forecast evaluations, both first and later data vintages (releases) have been used as benchmarks. Sinclair et al. (2016) evaluate German macroeconomic forecasts for the year 2013, published around December 2012. They compare these forecasts with the first release of actual data for 2013, published in January 2014, as well as the "final" (second) release of February 2014. The corresponding results are very similar. Sinclair and Stekler (2013) compare different vintages of US GDP and its ten major components: namely, the initial estimates available one month after the end of the quarter, and the estimates available three months after the end of the quarter. Despite the existence of some biases, overall the differences are rather small. Kirby et al. (2015) look at the accuracy of the NIESR's forecasts of GDP growth in the UK, the USA, and the euro area, comparing these forecasts to the first release of the variables. Sheng (2015) evaluates real GDP growth, inflation, and unemployment forecasts of members of the FOMC of the US Federal Reserve System, using "final" estimates that are released roughly three months after the end of the quarter. Chen et al. (2016) evaluate forecasts of GDP growth and inflation for ten Asian countries, where forecasts are compared to initial releases. As a robustness check, they use revised data and find that the results are rather similar. The European Commission, in its regular forecast assessments, 6 uses different realized values for current-year and year-ahead forecasts. The realizations are taken from the same publication as the forecasts, i.e., from the autumn publications for year-ahead forecasts and from the spring publications for current-year forecasts. Evaluations of macroeconomic forecasts for Austria usually take the first preliminary release of the national accounts provided by WIFO. 7 Baumgartner (2002a, b) additionally considers the first "final" release produced by the Austrian statistical office, as do we, and concludes that the differences are very small. 8

Evaluation measures
The selection of appropriate measures for the evaluation of forecasts is not straightforward. Obviously, the "best" economic forecast is a forecast that accurately predicts the realization of the forecast variable. Equally obviously, it is almost impossible to deliver "exact" predictions. Therefore, intervals, e.g., the 68% confidence interval, are sometimes published. In business-cycle research, the correct anticipation of so-called turning points is even more important than the exact forecast of a certain GDP growth rate. A standard evaluation criterion is that the forecast beats the naive no-change forecast. Furthermore, the forecasts should be unbiased, implying that the forecast should not be systematically too optimistic or too pessimistic. We evaluate the accuracy of the IHS and WIFO forecasts on the basis of traditional one-dimensional and more novel multi-dimensional measures.

6 The first study is Keereman (1999); the most recent evaluation is Fioramanti et al. (2016).
7 These studies include Baumgartner (2002a, b), Ragacs and Schneider (2007), and Schuster (2018).
8 We also find that the differences between using the first preliminary release and the first official release are very minor (see Sect. 4.3).

Traditional measures
We apply the following traditional measures: the mean absolute error (MAE) 9 , the root-mean-squared error (RMSE) 10 , the mean absolute percentage error (MAPE) 11 , Theil's inequality coefficient (U2) 12 , and the mean directional accuracy (MDA) 13 . The RMSE is defined as the square root of the average squared difference between the forecast and the actual realization. Since it is scale dependent, the RMSE is useful for comparing different forecasts of the same variable, but its magnitude is not meaningful in itself. Theil's inequality coefficient (see Theil 1966) compares a given model-generated forecast with the naive forecast of no change. If the forecast is better than the naive no-change assumption, then U2 is smaller than one. 14

The test for unbiasedness of the forecast is based on a procedure introduced by Mincer and Zarnowitz (1969). The test rests on regressing the realized values on a constant and the forecast. However, as Sinclair et al. (2010) point out, forecast errors might depend on the state of the economy, e.g., the stage of the business cycle. Therefore, these authors suggest including a dummy for the state of the economy. This gives rise to estimating Eq. (1):

$$A_t = \alpha + \beta F_t + \gamma D_t + \varepsilon_t \tag{1}$$
where F_t and A_t are the forecast for and the actual value at year t, respectively, and D_t is the recession dummy, which is not present in the original version of the test. The recession dummy takes the value 1 if the economy is in a recession, and 0 otherwise. To identify the stage of the business cycle, a simple Hodrick-Prescott (HP) filter is first applied to the level of GDP over the entire data sample, and the output gap is then calculated as the deviation of actual GDP from the HP trend. We define a recession as a year with a negative output gap. 15 For the forecast to be unbiased, α should not be significantly different from zero, β should not significantly deviate from one, and γ should not be significantly different from zero. We test this joint hypothesis with a Wald test. For the growth rates of GDP and the demand components, we employ both the original Mincer-Zarnowitz test, i.e., without the recession dummy, and the modified version that includes this dummy. However, we believe that for the inflation rate and the unemployment rate, only the original test is meaningful. The labor market lags behind real economic activity. Hence, in general, unemployment does not rise at the same time as GDP growth falls (or even turns negative), but only later on, sometimes only when real activity has already picked up again. Furthermore, the labor market in Austria, as in many other European countries, is characterized by labor hoarding: companies reduce employment only to an attenuated extent in an economic downturn so as to avoid hiring costs in the following recovery. Inflation, likewise, does not follow the economic cycle very closely; hence, there are periods which according to our simple definition would be classified as a recession, but that also involve high and persistent inflation. At the same time, defining the recession dummy for each variable separately would also not be very meaningful, since a recession is usually characterized as a significant and widespread decline in activity across the economy lasting longer than a few months.

13 [...] where F_t and A_t are the forecast for and the actual value at t, respectively, and h is the forecast horizon (h = 1 for current-year forecasts and h = 2 for year-ahead forecasts). Note that in the definition of the mean directional accuracy we do not consider the values of zero separately but together with the values of ones. We thus conclude that the forecast direction of a given variable is assessed correctly if either the forecast goes down when the actual value goes down, or if the forecast goes up or stays the same when the actual value goes up or stays the same.
14 Note that for GDP and the demand components the assumption of no change refers to the growth rates of these variables.
15 Strictly speaking, this is a downturn rather than a recession.
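The modified Mincer-Zarnowitz regression described above can be illustrated with a minimal sketch. It assumes the regression form A_t = α + β F_t + γ D_t + ε_t, plain OLS, and a Wald statistic for the joint hypothesis (α, β, γ) = (0, 1, 0) based on the usual homoskedastic covariance estimate. The function names `solve`, `ols`, and `mincer_zarnowitz` and the example numbers are hypothetical and not taken from the paper.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(y, columns):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    n = len(y)
    xtx = [[sum(ci[t] * cj[t] for t in range(n)) for cj in columns] for ci in columns]
    xty = [sum(ci[t] * y[t] for t in range(n)) for ci in columns]
    return solve(xtx, xty)


def mincer_zarnowitz(actual, forecast, recession):
    """Return (alpha, beta, gamma), the residuals, and the Wald statistic
    for H0: alpha = 0, beta = 1, gamma = 0 (asymptotically chi-squared, 3 df)."""
    n = len(actual)
    cols = [[1.0] * n, list(forecast), [float(d) for d in recession]]
    alpha, beta, gamma = ols(actual, cols)
    fitted = [alpha + beta * f + gamma * d for f, d in zip(forecast, recession)]
    resid = [a - fit for a, fit in zip(actual, fitted)]
    s2 = sum(e * e for e in resid) / (n - 3)  # residual variance
    # With V = s2 * (X'X)^-1 and d = (alpha, beta - 1, gamma),
    # d' V^-1 d = ||X d||^2 / s2, and X d = fitted - forecast.
    wald = sum((fit - f) ** 2 for fit, f in zip(fitted, forecast)) / s2
    return (alpha, beta, gamma), resid, wald


# Illustrative (made-up) growth forecasts, realizations, and downturn years.
forecast = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
recession = [0, 1, 0, 1, 0, 0]
actual = [1.1, 1.9, 3.05, 3.95, 5.1, 5.9]
coefs, resid, wald = mincer_zarnowitz(actual, forecast, recession)
print(coefs, wald)
```

Dropping the dummy column gives the original test; in that case the degrees of freedom of the Wald statistic and of the residual variance would change to two.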
In addition, we apply the encompassing test introduced by Chong and Hendry (1986) in order to judge whether one institute's forecast contains all the information inherent in the other institute's forecast: The null hypothesis is that all information included in one forecast is already contained in the other forecast and hence β 1 = 0 or β 2 = 0. In the general version of the test, one given forecast is compared with a number of other forecasts, and the idea is that if a single forecast contains all the information contained in the other individual forecasts, that forecast will be just as good as a combination of all other forecasts.
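To make the accuracy measures named at the start of this subsection concrete, the following sketch computes MAE, RMSE, MAPE, Theil's U2, and MDA for a single variable. The exact formulas used in the paper are not reproduced in this text, so the definitions below are assumptions: U2 is taken as the ratio of the forecast's root sum of squared errors to that of a no-change forecast based on the previous realization, and the direction of a forecast is measured relative to the previous realization, with "no change" grouped together with "up".

```python
import math


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root-mean-squared error: square root of the mean squared error."""
    return math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    """Mean absolute percentage error (undefined if any realization is zero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def theil_u2(actual, forecast):
    """Assumed form: forecast errors relative to a naive no-change forecast,
    where the naive forecast for period t is the realization in t - 1."""
    idx = range(1, len(actual))
    num = sum((forecast[t] - actual[t]) ** 2 for t in idx)
    den = sum((actual[t - 1] - actual[t]) ** 2 for t in idx)
    return math.sqrt(num / den)


def mda(actual, forecast):
    """Share of periods in which the forecast direction is correct.
    'Down' is a strict decrease; 'up' and 'unchanged' form one category."""
    hits = 0
    for t in range(1, len(actual)):
        actual_down = actual[t] - actual[t - 1] < 0
        forecast_down = forecast[t] - actual[t - 1] < 0
        hits += actual_down == forecast_down
    return hits / (len(actual) - 1)


# Illustrative (made-up) realizations and forecasts.
actual = [2.0, 3.0, 1.0, 2.0]
forecast = [2.5, 2.5, 2.0, 1.0]
print(mae(actual, forecast), rmse(actual, forecast), theil_u2(actual, forecast))
```

Under this assumed definition a U2 below one means the forecast beats the no-change benchmark, which matches the interpretation given in the text.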

The Mahalanobis distance
The aforementioned measures share the drawback of being one-dimensional. Therefore, we also apply the Mahalanobis distance, which is a multi-dimensional evaluation measure taking the variances of and the correlations between the variables into account. This measure allocates weights to the individual forecast errors, which are implied by the variance-covariance matrix of the variables. In utilizing this methodology, we follow Sinclair and Stekler (2013).
In order to formally define the Mahalanobis distance, let us assume that F t is an m-dimensional vector of forecasts for time period t, and A t is an m-dimensional vector of actual realizations of a variable at time period t. Let m be the number of variables to be predicted. If, for example, the growth rates of GDP, the inflation rates, and the unemployment rates are taken into account, m equals three.
Let F̄ and Ā be the mean column vectors of F_t and A_t, respectively, and let W be the pooled sample variance-covariance matrix of F_t and A_t. Then we define the Mahalanobis distance, M, as

$$M = \sqrt{(\bar{F} - \bar{A})'\, W^{-1}\, (\bar{F} - \bar{A})}$$

Under the assumption of normality, one can construct an F-statistic based on the squared Mahalanobis distance, M^2, to test the null hypothesis that two sets of forecasts have the same population means. 16 We employ this test in order to examine the difference between the IHS and WIFO joint forecasting accuracies.
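As an illustration of the definition above, the sketch below computes the squared Mahalanobis distance between the mean vectors of two samples using the pooled variance-covariance matrix, together with the F-statistic F = n1 n2 (n1 + n2 − m − 1) / ((n1 + n2) m (n1 + n2 − 2)) · M^2 given in the paper's footnote on this test. The function names and the toy data are hypothetical; each sample is assumed to be a list of m-dimensional rows (for example, one row of forecasts or realizations per year).

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mean_vector(rows):
    """Column means of a list of equally long rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter(rows, mean):
    """Sum of outer products of deviations from the mean."""
    m = len(mean)
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
             for j in range(m)] for i in range(m)]


def mahalanobis_f(sample1, sample2):
    """Return (M^2, F) for two samples with n1 and n2 rows of m variables."""
    n1, n2, m = len(sample1), len(sample2), len(sample1[0])
    mean1, mean2 = mean_vector(sample1), mean_vector(sample2)
    s1, s2 = scatter(sample1, mean1), scatter(sample2, mean2)
    # Pooled sample variance-covariance matrix W.
    w = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(m)] for i in range(m)]
    d = [a - b for a, b in zip(mean1, mean2)]
    z = solve(w, d)                                   # z = W^-1 d
    m2 = sum(di * zi for di, zi in zip(d, z))         # d' W^-1 d
    f_stat = n1 * n2 * (n1 + n2 - m - 1) / ((n1 + n2) * m * (n1 + n2 - 2)) * m2
    return m2, f_stat


# Toy data: two variables, four observations per sample.
sample_a = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
sample_b = [[2.0, 0.0], [4.0, 0.0], [2.0, 2.0], [4.0, 2.0]]
print(mahalanobis_f(sample_a, sample_b))
```

Solving the linear system W z = d avoids forming the inverse of W explicitly, which is a design choice of this sketch rather than something stated in the paper.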
For assessing the unbiasedness of the joint forecasts, or rather the existence of any systematic errors, we follow the procedure used by Sinclair and Stekler (2013). This is a more general approach than investigating the variables separately (as done in Sect. 3.1). According to this approach, a given forecast error should not depend on past forecast errors of either the variable itself or other variables. We thus construct a first-order vector autoregression (VAR(1)) of the forecast errors of each variable, which is given by

$$e_t = \beta_0 + \beta_1 e_{t-1} + u_t$$

where e_t is the m-dimensional vector of forecast errors at time t, u_t is the residual vector, β_0 is an m-dimensional vector of constants, and β_1 is an m × m matrix of coefficients on the lags of the forecast error. If the joint forecasts are unbiased, then none of the coefficients in the VAR should be significant: the constant estimates should be zero, the coefficients on the own lags should be zero, and none of the past errors made in forecasting the other variables should Granger-cause any of the other errors. 17 This implies a total number of m × 3 single hypotheses that need to be examined. Instead of testing each null hypothesis separately at a given level α, we use the Bonferroni-Holm test (see Holm 1979). This is a multiple-level α test, whereby the probability of committing any type I error is always smaller than or equal to a given level α. 18 In contrast to our procedure, Sinclair and Stekler (2013) examine the null hypotheses separately and do not employ multiple-level α tests.
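The Bonferroni-Holm step-down procedure mentioned above can be sketched as follows. The function takes the p values of the individual hypotheses (here these would be the m × 3 tests from the VAR) and returns which hypotheses are rejected at family-wise level α. The function name and the example p values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm (1979) step-down procedure.

    The p values are sorted in ascending order; the i-th smallest (i = 0, 1, ...)
    is compared with alpha / (k - i). Testing stops at the first non-rejection,
    and all remaining hypotheses are retained.
    """
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (k - rank):
            reject[idx] = True
        else:
            break
    return reject


# Made-up p values for four individual hypotheses.
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05))
```

In this made-up example two hypotheses have p values below 0.05 but are not rejected, which shows how the procedure is stricter than testing each hypothesis separately at level α.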

Results
Both IHS and WIFO forecasts of GDP growth show a smoothing pattern over the business cycle. While in upturns both institutes tend to underestimate growth, in downturns they both overestimate growth. The pattern is shown in Fig. 1. This smoothing behavior seems to be characteristic of economic forecasters, 19 as unexpected shocks ("extreme" events) are not highly predictable in general.

16 $F = \frac{n_1 n_2 (n_1 + n_2 - m - 1)}{(n_1 + n_2)\, m\, (n_1 + n_2 - 2)} M^2$, with m and n_1 + n_2 − m − 1 the degrees of freedom, where n_1 and n_2 are the numbers of observations of the first and second group of variables (see McLachlan 1999).
17 This is a generalization of the Holden and Peel (1990) test for bias.
Another observation is that the forecast disagreement concerning GDP growth across the two institutes, measured in terms of the absolute difference between the respective forecasts, seems to be larger in downturns than in upswings. This is reflected by a negative correlation between the forecast differences and the realized values, for all current-year forecasts of GDP. 20 This pattern suggests that forecasters particularly disagree in the assessment of economic downturns or recessions.

Accuracy
Tables 1 and 2 report the forecast evaluation results based on the traditional measures. Table 1 presents the values of the respective test statistics, while Table 2 assesses the improvement of the forecasts over time. For the latter, the evaluation measure for a given forecast horizon is divided by the "best" evaluation measure, no matter whether this best measure is provided by IHS or WIFO. Usually the best forecast is the one with the shortest forecast horizon (i.e., the December t forecast). Also, better forecasts usually go along with smaller values of the evaluation measures, except for the mean directional accuracy (MDA), where larger values imply better forecasts. The best forecast for a given variable thus shows a value of one. A value of two, for example, indicates a measure twice as large as that of the best forecast.

Note to Table 2: The number for a given forecast variable and a given evaluation criterion is the ratio with respect to the minimum criterion (of either IHS or WIFO) for that variable.

In order to judge whether the forecasts across the two institutes differ significantly from each other, we employ a standard Diebold-Mariano test. For basically all variables and all forecast horizons, our results imply that the forecasts of IHS and WIFO do not differ significantly from each other. The only exception is consumption growth in the December year-ahead and current-year forecasts, where, as shown in Table 3, WIFO seems to provide a marginally better forecast (involving a smaller forecast error). 21

Another way of comparing the forecasts is to perform an encompassing test, which tells us whether the forecast of one institute could be improved by using the other institute's forecast. The results, presented in Table 4, show that in most cases the forecast of one institute indeed encompasses the forecast of the other institute. In particular, this is always true for GDP forecasts. In addition, all current-year IHS forecasts (with one exception) encompass the respective WIFO forecasts. In total, the IHS forecast does not encompass the WIFO forecast in four out of 56 cases, while the WIFO forecast does not encompass the IHS forecast in 17 out of 56 cases.
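The Diebold-Mariano comparison of two institutes' forecasts can be sketched as below. This is a simplified version under stated assumptions: squared-error loss, no correction for autocorrelation in the loss differential, and a normal approximation for the p value. The paper does not spell out its exact specification, and with annual samples of roughly twenty observations a small-sample correction may be advisable. The function name and the example errors are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def diebold_mariano(errors_1, errors_2):
    """Return (DM statistic, two-sided p value) for equal squared-error loss.

    A positive statistic means the first set of forecast errors implies a
    larger average loss than the second.
    """
    d = [e1 * e1 - e2 * e2 for e1, e2 in zip(errors_1, errors_2)]  # loss differential
    n = len(d)
    dm = mean(d) / math.sqrt(variance(d) / n)  # sample variance, n - 1 denominator
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(dm)))
    return dm, p_value


# Made-up forecast errors of two forecasters for the same variable and horizon.
errors_a = [1.0, 2.0, 1.0, 2.0]
errors_b = [1.0, 1.0, 1.0, 1.0]
print(diebold_mariano(errors_a, errors_b))
```

The statistic is undefined when the loss differential is constant (zero variance), for example when both forecasters make identical errors in every period.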
More important than the small differences between the forecasts is the common feature that all forecasts improve considerably over time (see Fig. 2 and Tables 1 and 2). This improvement is most distinct for the inflation rate and the unemployment rate, where the forecast errors are almost zero in September and December of the current year t. This result is to be expected, since inflation and unemployment data are published monthly, in a very timely fashion, and in the December forecast almost all of the realizations are known. Also with regard to Theil's U2, the improvement of the forecasts over time is clearly visible. As mentioned above, in contrast to the RMSE the absolute value of Theil's U2 is meaningful. It should be below one, since only then is the forecast better than the naive no-change assumption. Theil's U2 is above unity only in one case: namely the first IHS unemployment forecast. Especially for current-year forecasts, this measure is clearly below 1, particularly for the forecasts of the inflation rate and the unemployment rate, but also for the GDP growth forecast. The improvement is least distinct for the consumption growth forecast. This might be related to the fact that consumption is rather smooth over time. Hence, the growth rates do not fluctuate much over time, rendering the no-change assumption that is the benchmark of Theil's U2 more difficult to beat.
Similarly to the other traditional measures, the difference between the two institutes with respect to getting the directional change right is rather small. Among all variables, inflation is assessed best: already at the beginning of the current year the directional change is projected correctly by IHS in more than 90% of all cases, and by WIFO in more than 80%. The directional change of GDP growth is forecast about equally well. The directions of import changes seem to be harder to predict: considering the March forecasts in year t, roughly 60% of all changes are anticipated correctly by IHS, and roughly 70% by WIFO.

Bias
The original and the modified Mincer-Zarnowitz tests show that in general the forecasts are neither too high nor too low (Tables 5, 6). Based on the original test, i.e., without taking the state of the business cycle into account, the growth forecasts of private consumption and exports and the forecasts of the unemployment rate are the only ones for which the null hypothesis of no systematic forecast errors has to be rejected, and then only in a small number of cases. If the recession dummy is included, only some consumption growth forecasts show signs of being biased. Overall, the forecasters of IHS and WIFO do not seem to make any systematic errors, even when the state of the economy in the business cycle is taken into account.
Summarizing all results based on traditional, one-dimensional accuracy measures, we can draw the following conclusions: (i) all forecasts improve considerably over time, which is what one would expect; (ii) the forecasts published by the two institutes IHS and WIFO do not differ significantly from each other, except in two out of 56 cases; (iii) in most cases the forecast of one institute encompasses the forecast of the other institute; and (iv) there are basically no systematic forecast errors for the forecasts published in year t, including when the state of the economy is taken into account.

The Mahalanobis distance
We assess the joint forecasts of three different groups of variables. The first group includes all variables considered (hereafter called All), i.e., growth rates of real GDP, private consumption, real gross fixed capital formation, exports, and imports, as well as the inflation rate and the unemployment rate. The second group includes real GDP growth, the inflation rate, and the unemployment rate (hereafter called Macro), and the third group includes growth rates of real GDP, real gross fixed capital formation, private consumption, exports, and imports (hereafter called Demand). Very often forecasts of the Macro group are reported to provide a quick overview of the economic outlook.

Table 7 reports the joint evaluation results of IHS and WIFO forecasts using the Mahalanobis distance, considering all variables (All), the Macro group, and the Demand group. The differences between the two institutes are rather small. In each group the maximum difference between the two institutes is observed for the second earliest forecast, i.e., the June forecast published in year t − 1. This goes along with what one might expect: forecasts are usually more divergent the earlier they are produced. With a decreasing forecast horizon, forecasts normally move closer to the realized values (i.e., the forecast error decreases), and hence, the difference between the two institutes also shrinks, as more information becomes available. Overall, this observation is borne out in Fig. 3, which plots the evaluation results for the three groups of variables. IHS forecasts show a slightly larger joint forecast error at early forecast dates. By contrast, both institutes show nearly identical forecast errors for the Macro group.

Table 5 Mincer-Zarnowitz test for unbiased forecasts. The table lists p values for testing the null hypothesis of no bias. Starred figures indicate that the null hypothesis of no bias is rejected at the 5% significance level.

The three graphs in Fig. 3 show that the joint forecasts clearly improve over time, and this is particularly true for the Macro group. The reason is probably the relatively high accuracy (small forecast errors) of the inflation and unemployment forecasts, already pointed out for the univariate results, which account for two of the three variables in that group. For the Macro group, the joint forecast errors implied by the latest forecasts are smaller than the errors implied by the first forecasts by a factor larger than ten. For the other two groups of variables (All and Demand), this factor is equal to or below two, and hence, the improvement over time is less pronounced (see Table 7).

[Table fragment; column headers not recoverable]
[...] | no (2) | no (2) | no (2) | no (1)
Jun t-1 | no (5) | no (5) | no (3) | no (2) | no (3) | no (1)
Sep t-1 | no (6) | no (5) | no (2) | no (1) | no (3) | no (2)
Dec t-1 | no (5) | no (4) | no (1) | no (2) | [...]

Bias
In order to examine whether joint forecasts are unbiased, a fairly large number of null hypotheses need to be tested. The concrete number (m × 3) depends on the number of variables considered in the joint forecast and is thus equal to 21, 9, and 15 in our case. Table 8 presents the results. Only for the most recent forecast, i.e., the December t forecast, are the joint forecasts unbiased for all the groups under consideration (All, Macro, Demand) for both Austrian research institutes. This means that in a VAR(1) system of the forecast errors implied by the variables under consideration, neither the constant terms nor the lagged errors are significant, and in addition none of the given errors is significantly Granger-caused by the other errors. Further, the September t forecasts are largely unbiased, and in the case of the IHS all forecasts are unbiased. If a joint forecast is not unbiased, we report the number of cases in which the null hypotheses are rejected. For all current-year forecasts, the maximum number of individual hypotheses rejected is three (IHS) or four (WIFO), out of a total of 21 cases (All). We provide more detailed results in Tables 1 to 4 in the Online Appendix in order to be able to assess the precise source(s) of the bias. For example, the single rejection for the WIFO September t forecasts in the groups All and Macro originates from the unemployment forecast error being Granger-caused by the other forecast errors (see Tables 2 and 3 in the Online Appendix). In fact, it is nearly always true that the source of the bias can be found in a given forecast error being Granger-caused by the other forecast errors.

Robustness checks
We perform two sets of robustness checks. First, we exclude the year 2009 from our analyses since during this year of the "Great Recession" GDP in Austria dropped by 3.8%. The declines of fixed capital formation and exports were even more severe, at 7.2% and 14.4%, respectively. None of the forecasts published during most of 2008 predicted this extreme economic downturn. Second, we use another benchmark with which we compare our forecasts, namely the first estimate of annual national accounts published in March of year t + 1. We use this release only for the robustness check, and not as the main benchmark, for two reasons. First, the annual accounts are based on a larger information set and are thus usually subject to smaller revisions than the first estimate of March. Second, the March release is produced by WIFO, i.e., one of the two forecasting institutes, on behalf of the Austrian statistical office, while the release published in late summer is produced by the statistical institute itself. Of course, inflation and unemployment figures are usually not subject to revision, unless methodological changes are implemented.
The results for both robustness checks are very similar to our main results. In particular, there are no systematic deviations, and thus, the conclusions remain unaffected. In contrast to our study, Baumgartner (2002a, b) uses the first estimate of the annual national accounts, provided by WIFO, as a benchmark. But he performs the analysis also with the realizations taken from the national accounts produced by Statistik Austria and finds, as we do, that there are hardly any differences in the results.

Comparison with European Commission forecasts
In this section, we compare the economic forecasts of IHS and WIFO with the forecasts for Austria made by the European Commission, which produces economic forecasts twice a year. 23 The EC publication dates, spring forecasts in May and autumn forecasts in November, differ from those of IHS and WIFO. As a consequence, the information available for national and international forecasters is not the same, which complicates a proper forecast comparison. If, for example, the national March issues are compared with the EC spring issues, then the EC has an information advantage and should, ceteris paribus, provide more accurate forecasts. If, on the other hand, the EC spring issues are compared with the national June issues, then IHS and WIFO have an information advantage. 24 If there is no systematic difference in the quality of forecasting across national and international institutions, then the forecasts should improve with a decreasing forecast horizon. Note that this section excludes inflation and unemployment from the analysis, since national and international forecasters use different definitions. 25

23 A few years ago the EC started to publish an additional winter forecast. The winter forecasts are not considered in our analysis due to the short time series.
24 Baumgartner (2002b) investigates the forecast accuracy of the two national institutes (IHS and WIFO) and OECD by grouping the national and international forecasts in two alternative ways (and performing the analysis twice). One grouping reflects an informational advantage for the national institutes; the other grouping an informational advantage for the international institute (OECD).
25 IHS and WIFO use the national definitions of inflation (CPI) and the unemployment rate (Claimant Count based on administrative data from the benefits system), while the EC uses harmonized measures (HICP and LFS, respectively).

Figure 4 shows the different measures of accuracy for the national and international forecasts of GDP growth for different forecast horizons. We have twelve forecast dates in total, eight from the national institutes and four from the European Commission.

Fig. 4 Measures of accuracy for different dates of forecasts and different forecasting institutions for GDP growth
As we can see from the graphs, the first three GDP growth forecasts that we consider, i.e., the year-ahead forecasts in March, May, and June, are roughly the same and do not clearly improve over time. Similarly, the last three forecasts, i.e., the current-year forecasts of September, November, and December, do not seem to get better over time. By contrast, the GDP growth forecasts do seem to get more accurate from the year-ahead forecasts in September through the current-year forecasts in September. This general observation holds more or less across all the different measures of accuracy that we evaluated. More formally, as shown in Table 9, the Diebold-Mariano test for two successive forecasts 26 confirms our impression. The year-ahead forecasts in September, November, and December are significantly better than their preceding counterparts (for both IHS and WIFO), and the same applies for the current-year September forecast (for both IHS and WIFO), the current-year June forecast (IHS), and the current-year November forecast (against the preceding WIFO forecast).

Table 9 note: The table lists p values for testing the null hypothesis that successive forecasts of GDP growth show the same accuracy against the alternative hypothesis that the more recent forecast is more accurate, where the loss function is the squared forecast error. Starred figures indicate that the null hypothesis is rejected at the 10% significance level and hence that the more recent forecast is more accurate.
We observe a similar pattern in the quality of forecasts for the remaining variables, i.e., for the growth rates of consumption, investment, exports, and imports. For those variables as well, the forecasts do not considerably improve over the period of the first forecasts (March to June year-ahead forecasts) or over that of the last forecasts (September to December current-year forecasts), while they usually do improve over the remaining period (year-ahead September forecasts to current-year September forecasts); see Fig. 5 and Table 9.
Overall, this comparison between the IHS, WIFO, and EC forecasts corroborates our findings that the accuracy of the forecast depends much more on the time of publication, i.e., on the data available when preparing the forecast, than on the question of which institution publishes the forecast.
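The one-sided comparison of two successive forecasts used in this section can be illustrated with a minimal sketch of a Diebold-Mariano test under squared-error loss. The sketch rests on assumptions that the text does not confirm: the loss differentials are treated as serially uncorrelated, the p value comes from a normal approximation without any small-sample correction, and the function name `dm_test_one_sided` and all numbers are invented for illustration.

```python
import math


def dm_test_one_sided(errors_earlier, errors_later):
    """One-sided Diebold-Mariano test with squared-error loss.

    H0: both forecasts are equally accurate.
    H1: the later forecast is more accurate (smaller expected loss).
    Assumes serially uncorrelated loss differentials and a normal
    approximation for the test statistic (illustrative simplifications).
    """
    if len(errors_earlier) != len(errors_later):
        raise ValueError("both error series must cover the same years")
    n = len(errors_earlier)
    if n < 2:
        raise ValueError("at least two observations are required")
    # Loss differential: positive when the later forecast has the smaller loss.
    d = [a * a - b * b for a, b in zip(errors_earlier, errors_later)]
    d_bar = sum(d) / n
    var_d = sum((x - d_bar) ** 2 for x in d) / (n - 1)
    if var_d == 0.0:
        raise ValueError("loss differential has zero variance")
    dm_stat = d_bar / math.sqrt(var_d / n)
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(dm_stat / math.sqrt(2.0))
    return dm_stat, p_value


# Invented forecast errors (percentage points) for eight target years.
errors_earlier = [1.2, -0.8, 1.5, -2.0, 0.9, -1.1, 1.7, -0.6]
errors_later = [0.4, -0.3, 0.6, -0.5, 0.2, -0.4, 0.5, -0.1]

dm_stat, p_value = dm_test_one_sided(errors_earlier, errors_later)
print(f"DM statistic: {dm_stat:.2f}, one-sided p value: {p_value:.4f}")
```

A small p value would be read as evidence that the more recent forecast is more accurate. With roughly 20 annual observations, as in this kind of evaluation, a small-sample correction may be preferable to the plain normal approximation used here.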

Effects of external assumptions
Macroeconomic forecasts are usually conditional on assumptions about the international economic environment, such as world trade, GDP growth for the main trade partners, the oil price, exchange rates, and monetary and fiscal policies. This is particularly true for small open economies like Austria's. 27 Forecast errors may then result from wrong external assumptions, a poor forecast model, or both. In addition, revi- [...]

26 (continued) [...] different from the Diebold-Mariano tests performed before, when the alternative hypothesis was that the two competing forecasts were different.
27 Note, for example, that in 2017 the external trade to GDP ratio of Austria amounted to 104%, while it was 27% in the USA.

In order to answer this question, we perform a regression analysis in the spirit of Keereman (2003), Fioramanti et al. (2016), and the European Commission (2016) to determine the influence of deviations of external assumptions on the forecast error of Austrian GDP growth. We consider the effect of an unexpected change in the growth rate of GDP of the European Union, the oil price, and the foreign exchange rate. A similar set of regressions is performed for the inflation forecast for Austria, with a view to answering the question whether and to what extent mistakes in external assumptions translate into inflation forecast errors. The analyses are based on 17 to 22 observations, depending on the forecasting horizon and the institute.

Table 10 summarizes the results with respect to prediction errors in the growth rate of GDP. We find a positive influence of EU GDP errors on the forecast error of the Austrian GDP, which is highly significant for the first five forecasts. The positive sign is what we would expect, since an overestimation of external growth should lead to a forecast of national growth that is higher than the realized rate. With regard to oil prices our expectation about the sign is ambiguous. Overestimated oil prices might lead, on the one hand, to an underestimation of growth due to overestimated import prices. On the other hand, overestimated oil prices might just reflect the overestimation of the state of the global economy. The latter is what almost all our results show, in particular when the coefficients are significant. We find little evidence that mistakes in the oil price [...]

Table 11 shows the corresponding results for the effect of external assumptions on the inflation forecasts. As anticipated, we find that unexpected changes in the oil price do indeed explain part of the inflation forecast error. If oil prices are assumed to rise more than they actually do, inflation is overestimated as well, and vice versa. However, for current-year forecasts the statistical significance either disappears (IHS) or becomes weaker (WIFO). In the case of the inflation forecast errors, roughly 50% to 60% of the variation can be explained by deviations in external assumptions for the first three forecasts. By contrast, we do not find any evidence for the propagation of mistakes in the external assumptions concerning GDP growth and exchange rates. Again, a robustness check excluding the crisis year 2009 yields only very minor changes in the analysis.
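The regression idea described above can be illustrated with a minimal sketch that regresses the GDP forecast error on the deviations of the three external assumptions by ordinary least squares. All data are invented, the function names `solve_linear_system` and `ols` are hypothetical, and errors are assumed to be defined as forecast minus realization. The sketch does not reproduce the exact specification or the significance tests behind Tables 10 and 11.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix (copy)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("regressors are (nearly) collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def ols(y, regressors):
    """OLS with intercept; returns (coefficients, R squared)."""
    n = len(y)
    rows = [[1.0] + [reg[t] for reg in regressors] for t in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    y_bar = sum(y) / n
    ss_res = sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y)
    return beta, 1.0 - ss_res / ss_tot


# Invented errors (forecast minus realization) for eight target years.
gdp_error = [0.6, -0.9, 0.3, 1.4, -0.2, 0.5, -1.1, 0.8]            # Austrian GDP growth, pp
eu_gdp_error = [0.5, -1.0, 0.2, 1.5, -0.3, 0.8, -1.2, 0.6]         # EU GDP growth, pp
oil_price_error = [10.0, -5.0, 3.0, -8.0, 12.0, 0.0, -6.0, 4.0]    # oil price, percent
exchange_rate_error = [2.0, -1.0, 5.0, 0.0, -3.0, 4.0, 1.0, -2.0]  # exchange rate, percent

beta, r_squared = ols(gdp_error, [eu_gdp_error, oil_price_error, exchange_rate_error])
names = ["constant", "EU GDP", "oil price", "exchange rate"]
for name, coefficient in zip(names, beta):
    print(f"{name:>13}: {coefficient: .3f}")
print(f"R squared: {r_squared:.2f}")
```

Under the assumed error definition, a positive coefficient on the EU GDP error corresponds to the case in which an overestimation of external growth goes along with an overestimation of Austrian growth. Running the same regression separately for each forecast date would show how the coefficients and R squared change with the forecast horizon.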
While other studies 28 consider the effects of external assumptions only for one current-year forecast and one year-ahead forecast, this analysis provides a more detailed view of the influence of the external assumptions over different forecast horizons. This is due to the eight forecasts under analysis, which reflect an increasing information set and a shrinking forecast horizon. For both target variables, GDP growth and inflation, we find that the impact of mistakes in the external assumptions is rather strong for year-ahead forecasts but cannot be observed, or only weakly observed, for current-year forecasts. This is true for both institutes. These findings are in line with the literature previously mentioned. In Keereman (2003), for example, for 10 out of 12 countries the unexpected change in US GDP shows a higher significance level (lower p value) in explaining the year-ahead GDP forecast error than in explaining the current-year forecast. In Fioramanti et al. (2016), for the year-ahead forecasts the GDP forecast error in the base model and almost all additional models can be explained by errors in the external assumptions with respect to world GDP, while this is not the case for the current-year forecasts. With regard to unexpected changes in the oil price, the case is less clear. However, among the six models presented in Fioramanti et al. (2016) the year-ahead forecasts tend to result in lower p values and higher R² than those in the current year. Our findings are also in line with the results of Fenz et al. (2019), who use a different methodology to examine the forecast errors for Austrian GDP growth. They find that the variance of forecast errors can be explained largely by global and euro area shocks (91%), and only to a small degree by national shocks (9%) for year-ahead forecasts. This picture is nearly reversed for current-year forecasts, where 42% of the variance can be explained by national shocks and 58% by global and euro area shocks.

28 Keereman (2003), Fioramanti et al. (2016), and European Commission (2016).

Our findings, as well as those in the related literature, suggest that the structure (decomposition) of the forecast error changes with the forecast horizon. In particular, assumptions with respect to the international environment seem to be much more important for longer (year-ahead) than for shorter (current-year) forecast horizons and additional factors not considered in our analysis are probably more important for current-year forecasts. One possible explanation for this change in structure could be that for the current-year forecasts data revisions play a more important role than for the year-ahead forecasts. This hypothesis is supported by the results in Fenz et al. (2019).
One implication of our analysis is that forecasters in small open economies like Austria's should focus more on making correct assumptions about the external environment, like GDP growth for important trade partners, oil prices, or US and EU monetary policy measures, if they want to reduce their forecast errors for GDP or inflation.

Conclusion
In this paper, we evaluate macroeconomic forecasts for Austria, published by the two leading Austrian economic research institutes, the Institute for Advanced Studies (IHS) and the Austrian Institute of Economic Research (WIFO). We evaluate the forecasts of growth rates of real GDP, private consumption, gross fixed capital formation, exports, and imports as well as the inflation rate and the unemployment rate. For each variable, we evaluate the year-ahead and the current-year forecasts published in March, June, September, and December. The analyses are based on traditional, univariate measures like the RMSE, Theil's inequality coefficient, and the mean directional accuracy, as well as on a more novel, multivariate measure, the Mahalanobis distance. The latter assesses jointly a group of variables and thereby takes the variances of and the correlations between these variables into account. Furthermore, we compare the forecasts of the two Austrian institutes with forecasts of the European Commission, taking into account the different publication dates. Finally, we examine how errors in the external assumptions affect forecast errors for GDP growth and inflation.
Our first finding is that the forecasts of the two Austrian economic research institutes are very similar. Considering both univariate and multivariate forecast evaluation measures, we find essentially no significant differences between the two institutes (for any variable, any group of variables, or any forecast horizon). The only exceptions are the December year-ahead and the December current-year forecast of consumption growth, where WIFO seems to outperform IHS. If we examine the question of whether one institute's forecast can be improved by using the other institute's forecast, we conclude that mostly this is not the case; i.e., the forecast of one institute usually encompasses the forecast of the other institute. However, WIFO forecasts fail to encompass IHS forecasts somewhat more often than the other way around.
Our second finding is that the forecasts improve significantly over time, which is what one would expect; however, the improvement is usually less pronounced in the first two and the last two forecasts. This is true both for the univariate measures and the Mahalanobis distance, and it particularly applies to the inflation rate and the unemployment rate, and to the Macro group (GDP, inflation, and unemployment). This pattern is probably due to the fact that the inflation and unemployment rates are published at a monthly frequency, in a very timely manner, and are usually not revised, so a lot of information is known to the forecaster toward the end of the year. The larger forecast errors at longer forecast horizons show that policy makers should be cautious when basing their budgetary planning on such forecasts. With respect to the Mahalanobis measure, a complementary analysis that precisely disentangles the univariate forecast errors from consistency errors would be helpful.
Our third finding relates to the unbiasedness of forecasts (univariate measures), where we also take explicit account of recessions, and the existence of systematic errors (Mahalanobis distance). We find that forecasts of GDP, investment, and import growth are always unbiased (for any given forecast horizon and for both institutes), and that all other forecasts are largely unbiased if they are published in the current year. In addition, we do not find any systematic errors in the forecasts published in December t (for any given group and for both institutes). Considering the September t forecasts, the IHS forecasts do not show any systematic errors (for any given group), while WIFO forecasts show errors in the group of all variables and in the Macro group.
With respect to the comparison of IHS and WIFO forecasts with those of the European Commission, we point out that any direct comparison may be flawed due to the different sets of information available to the national and international forecasters at the time of forecasting. We find, in general, an improvement in accuracy with a decreasing forecast horizon, except for the first three and the last three forecasts. This observation does not depend on whether a given forecast is produced by IHS, by WIFO, or by the EC. These results show that the forecast accuracy depends much more on the dataset available when preparing the forecast than on the question of which institute publishes the forecast. In this context, it would be useful to have methods that take explicit account of the different sets of information available to forecasters; this is left for future research.
Finally, we find that errors in external assumptions with respect to EU GDP growth translate into forecast errors for Austrian GDP growth, in particular for year-ahead forecasts. Similarly, mistakes in external assumptions with respect to the oil price are reflected in the forecast errors for inflation. All results regarding external assumptions are very similar across IHS and WIFO. This implies that forecasters in small open economies like Austria's should focus more on making correct assumptions about the external environment, like GDP growth for important trade partners, oil prices, or US and EU monetary policy measures, if they want to improve their forecasts, in particular for the year ahead. To what extent data revisions contribute to forecast errors is a question left for further research.