1 Introduction

German business cycle forecast reports offer quantitative point forecasts and qualitative text data for growth and inflation, among other variables. The qualitative texts describe forecasters’ views on the macroeconomic situation and its development, and they also express the forecasters’ expectations about future economic development. Using these narratives, the forecasters’ expectations can be objectified by applying textual analysis methods to generate sentiment indices. The key issue is to analyse whether the forecasters’ narratives contain additional information beyond the quantified forecasts.

The evaluation of German and international business cycle forecasts has traditionally focused on the analysis of quantitative point forecasts. A large number of existing studies have examined the accuracy and efficiency of German macroeconomic forecasts (see e.g. Heilemann and Stekler 2013; Fritsche and Tarassow 2017; Döpke et al. 2019, and the literature cited therein). Prior research suggests three key insights. First, macroeconomic forecasts for Germany are (mostly) unbiased, but inefficient (see e.g. Döpke et al. 2010; Krüger and Hoss 2012). Second, average forecast errors seem to have been stable over decades, showing neither an increasing nor a decreasing tendency (Heilemann and Stekler 2013). Third, no forecaster’s performance is uniformly superior (Döpke and Fritsche 2006), and there are no significant institutional differences in accuracy over a long time horizon (Döhrn and Schmidt 2011).Footnote 1

Recently, another forecast evaluation approach, which uses qualitative text as data, has become increasingly popular. In this context, textual analysis methods are applied to convert qualitative text data into quantitative scores. The generated indices are used for forecast evaluation tests. Two major strands of the literature can be identified.

One strand will be subsumed here under the term ‘elicited forecasts’, which was used by Jones et al. (2020). This concept applies a manual scoring procedure to quantify qualitative assessments about the future stance of the economy. Goldfarb et al. (2005) mapped newspaper articles published during the Great Depression into an index series using a scoring system to compare the quantified qualitative assessments with numerical forecasts and realized values. A series of forecast evaluation studies applied the developed scoring procedure of Goldfarb et al. (2005) in several contexts to generate elicited forecasts to evaluate them (see e.g. Lundquist and Stekler 2012; Stekler and Symington 2016; Mathy and Stekler 2018). The recent analysis of Jones et al. (2020) investigates the Bank of England’s growth forecasts using elicited forecasts over the period 2005–2015. The more general research question as to whether the text contains additional information for the numerical forecasts is similar to this work. Jones et al. (2020) find that the economic development in the UK is accurately represented by the elicited forecasts. Moreover, regression results suggest informational content of the text index in the sense that they can improve the Bank of England’s numerical growth nowcasts and one-quarter-ahead forecasts.

A second strand of the literature uses computational text analysis methods to generate text-based sentiment indices. Clements and Reade (2020) and Sharpe et al. (2020) are two seminal related studies. The latter study applies computational text analysis to quantify the ‘tonality’ (the degree of optimism versus pessimism) of the Federal Reserve Board’s Greenbooks and examines whether this measure has predictive power for the economic development over the period 1972–2009. The investigation shows some predictive power of the Greenbook tonality on Greenbook numerical GDP growth and unemployment forecasts, as well as on private GDP forecasts. The latter point implies that the sentiment index also covered policy-relevant information (Sharpe et al. 2020). Clements and Reade (2020) analyse whether the narratives in the Bank of England’s Inflation Reports contain useful information about the future course of GDP growth and inflation between 1997 and 2018. Encompassing tests show some informational content for predicting GDP forecast errors for one and two quarters ahead, but no evidence that sentiment indices are useful to predict forecast revisions. Both studies use the dictionary-based approach to generate sentiment indices, and both studies show that ‘an important element of economic forecasting is in the accompanying narrative’ (Sharpe et al. 2020, p. 31).

Considering German forecasters’ narratives, Fritsche and Puckelwald (2018) analyse the topics of German business cycle forecast reports using generative models. The authors find that textual expressions vary with the business cycle, which is in line with the hypothesis of adaptive expectations. However, a number of questions regarding German forecasters’ narratives remain to be addressed.

There is a broader and growing literature in (computational) textual analysis in economics, finance, and accounting (see e.g. Loughran and McDonald 2016; Gentzkow et al. 2019, and the literature cited therein). The following examples give a selective overview of the literature related to this paper. Shapiro et al. (2020) for the US and Lamla et al. (2020) for Germany use textual analysis tools to create news media sentiment indicators. Both studies provide evidence for a correlation between news media sentiment indicators and the business cycle and show that sentiment indicators can serve as predictors of the future stance of the economy. Another strand of the literature concerns the predictability of stock market activity. Tetlock (2007), Tetlock et al. (2008) and Garcia (2013) use a dictionary-based approach to generate sentiment indices via news coverage. Loughran and McDonald (2011, 2016) developed a finance-specific dictionary to improve the forecasting performance relative to existing linguistic dictionaries. Jegadeesh and Wu (2013) and Manela and Moreira (2017) apply text regression methods to predict stock market outcomes; Jegadeesh and Wu (2013) show that text regression-based sentiment indices are superior to sentiment indices based on the Loughran and McDonald (2011) dictionary in an out-of-sample forecast environment. The analysis of central bank communication is another topic in text mining. Jegadeesh and Wu (2017) find incremental information value in the Federal Open Market Committee meeting minutes. The authors use a generative model to quantify the tone and the topics of texts. Tillmann and Walter (2018) apply dictionary-based sentiment indices to analyse the tone of Bundesbank and ECB speeches. The authors find significant divergences between the tone of the two institutions. An additional topic is the measurement of policy uncertainty. Baker et al. (2016) developed the prominent economic policy uncertainty index (EPU) by analysing news coverage with a dictionary method. Using a (nonlinear) text regression method to construct an EPU for Belgium, Tobback et al. (2018) show that this approach improves the predictive power of the EPU.

This paper makes several contributions to the literature on forecast evaluation and textual analysis. First, German forecasters’ narratives are analysed using textual analysis methods. Second, previous studies have almost exclusively focused on dictionary methods to generate sentiment indices. To the best of the author’s knowledge, this paper is the first in forecast evaluation to apply (linear) text regression approaches, and additionally, it uses a recursive estimation technique. Third, the paper tests why forecasters’ narratives have predictive power. Although recent studies have discussed several explanatory hypotheses, the answer is still insufficiently explored.

The purpose of the paper is to analyse German forecasters’ narratives and the question as to whether the forecasters’ stories and expectations contain additional information relative to numerical forecasts. Based on 534 business cycle forecast reports covering 10 German institutions from 1993 to 2017, the paper creates sentiment indices using text mining techniques. Regression results suggest that some sentiment indices can reduce the absolute magnitude of the quantitative forecast errors for GDP growth and inflation forecasts. German forecasters’ narratives are informative for the accuracy of German business cycle forecasts. One explanation might be that forecasters’ narratives contain useful information about the future stance of the German economy. An in-sample and out-of-sample forecasting exercise tests whether the sentiment indices can predict the evolution of German economic activity. Forecasting results indicate weak in-sample predictive power and out-of-sample predictive power of the sentiment indices.

The following section explains the methodology used to convert qualitative text data into quantitative sentiment scores. Section 3 describes the employed text corpus and numerical data. Section 4 analyses the empirical results, and Sect. 5 concludes and discusses these results.

2 Methodology: sentiment analysis

There are various computational analysis methods to connect word counts to attributes to generate sentiment indices, e.g. dictionary-based methods, text regression methods, generative models, and word embeddings (Gentzkow et al. 2019). This paper uses dictionary-based methods and text regression methods to convert qualitative text data into quantitative indices.

Furthermore, qualitative measures can only be directly related to macro-variables, provided that they are appropriately scaled (Clements and Reade 2020, p. 1491). Hence, all weighted sentiment indices are standardized to have a mean of zero and a standard deviation of one. In order to avoid bias in the measure, all weighted sentiments are normalized by the total number of words per report to account for varying text lengths and numbers of documents per year (Fritsche and Puckelwald 2018).
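As a minimal sketch of the standardization step, the following Python snippet rescales a series of index values to mean zero and unit standard deviation. The input values are invented for illustration, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not specify which one is used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a sequence to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; the sample SD is equally plausible
    if sigma == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mu) / sigma for v in values]


# Hypothetical raw sentiment values, already normalized by the number of words per report
raw_index = [0.012, -0.004, 0.020, 0.001, -0.009]
z_index = standardize(raw_index)
```

After this transformation, a value of one means that a report's sentiment lies one standard deviation above the average report, which puts the differently constructed indices on a comparable scale.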

2.1 Dictionary-based method

Following Clements and Reade (2020) and Sharpe et al. (2020), the dictionary-based method is applied to develop sentiment indices. In fact, three well-established linguistic dictionaries are used to generate five different indices.

  • First, the word list is prepared by Bannier et al. (2018). This is the German equivalent of the English original dictionary provided by Loughran and McDonald (2016). The last-mentioned word list is well established for textual analysis in finance- and accounting-specific contexts. The word list prepared by Bannier et al. (2018) includes over 2200 positive and 10,000 negative word forms. The dictionary is binary coded for polarity in positive and negative terms.

  • Second, there is a forecast-specific German dictionary based on Sharpe et al. (2020). According to Di Fatta et al. (2015), words have different connotations and meanings in different contexts, and sentiment indices have to be adapted to the content to which they are applied. To this end, Sharpe et al. (2020) developed a forecast-specific word list which excludes words that have special meanings in an economic forecasting context. The word list contains 205 positive and 103 negative words (see Tables 8, 9) and is binary coded like the previous one.

  • Finally, there is the SentimentWortschatz (SentiWS) dictionary (Remus et al. 2010). The SentiWS dictionary contains a German-specific word list for sentiment analysis. The current version (v2.0) contains about 16,000 positive and 18,000 negative word forms, and unlike the other two dictionaries, it includes weights for polarity within the interval of \([-1; 1]\).

Two different score systems will be applied for the two binary dictionary-based sentiments (hereinafter called ‘Bannier’ and ‘Sharpe’). Sentiment score number one consists of the difference between positive word count, P, and negative word count, N, normalized by the total number of words, T, per report:

$$\begin{aligned} \hbox {Sentiment \,score}_1 = (P - N) / T \end{aligned}$$
(1)

The second sentiment score (polarity score) is defined as the quotient of the difference between positive and negative word counts and the sum of positive and negative words:

$$\begin{aligned} \hbox {Sentiment \,score}_{2} = (P - N) / (P + N) \end{aligned}$$
(2)

In contrast, the SentiWS index is a continuous score. The score of each word sums up over all words and is normalized by the total number of words per report.
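The two binary scores in Eqs. 1 and 2 can be sketched in a few lines of Python. The word lists and the tokenized report below are invented for illustration and are not taken from the Bannier or Sharpe dictionaries; returning zero when a report contains no words or no dictionary hits is also an assumption.

```python
def binary_scores(tokens, positive, negative):
    """Return (score_1, score_2) from Eqs. 1 and 2 for one tokenized report."""
    P = sum(1 for t in tokens if t in positive)  # positive word count
    N = sum(1 for t in tokens if t in negative)  # negative word count
    T = len(tokens)                              # total number of words
    score_1 = (P - N) / T if T else 0.0              # Eq. 1
    score_2 = (P - N) / (P + N) if (P + N) else 0.0  # Eq. 2 (polarity score)
    return score_1, score_2


# Toy word lists and a toy report, for illustration only
positive = {"aufschwung", "erholung"}
negative = {"rezession", "einbruch"}
tokens = ["aufschwung", "setzt", "sich", "fort", "trotz", "rezession", "erholung", "bleibt"]

s1, s2 = binary_scores(tokens, positive, negative)
```

With two positive hits and one negative hit in eight words, the first score relates the net tone to the length of the report, while the second relates it only to the words that carry sentiment.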

2.2 Automatic variable selection approach

The automatic variable selection approach is a promising text regression method to generate regression-based sentiment indices (e.g. Pröllochs et al. 2018). In contrast to the dictionary-based method, here the required dictionary is not given and will be recursively estimated. In fact, the estimated parameters will be updated by expanding the estimation windows by one observation in chronological order (see Sect. 2.3). Generally, text regression methods introduce a regularization penalty that reduces the complexity, number, and size of the predictors included in the model. Penalized linear models use each word in the text corpus as explanatory variables, shrink non-informative noise variables to zero, and select decisive variables (Pröllochs et al. 2015).

Regularization methods can serve as mathematical mechanisms to extract important terms, which is why they are a common tool for variable selection in data science (Pröllochs et al. 2018; Varian 2014). Given a standard multivariate regression with y (dependent variable) as a linear function of \(\beta _0\) (constant) and \(x_j\) (explanatory variables), a penalty term of the form:

$$\begin{aligned} \lambda \sum _{j=1}^{P} \left[ (1-\alpha ) \vert \beta _{j} \vert + \alpha \vert \beta _{j}^2 \vert \right] \end{aligned}$$
(3)

can be added (Varian 2014). Setting \(\alpha = 0\), the term Eq. 3 reduces to the linear \(l_1\)-norm penalty \(\lambda \sum _{j=1}^{P} \vert \beta _{j} \vert \), which represents the least absolute shrinkage and selection operator (LASSO) introduced by Tibshirani (1996). Formally, the LASSO estimator is given by (Pröllochs et al. 2015):

$$\begin{aligned} {\hat{\beta }}_\mathrm{LASSO} = {{\,\mathrm{arg\,min}\,}}_\beta \sum _{i=1}^{N} \left[ y_i - \beta _0 - \sum _{j=1}^{P} \beta _j x_{ij} \right] ^2 + \lambda \sum _{j=1}^{P} \vert \beta _{j} \vert \end{aligned}$$
(4)

where \(x_{ij}\) are the document terms (words \(j = 1, \ldots , P\)) for forecast report \(i = 1, \ldots , N\), and \(y_i\) represents the 12-month-ahead fixed horizon growth and inflation forecasts as response variables. If \(\lambda = 0\), the penalty reaches zero, and we get the classical OLS estimator by simply minimizing the residual sum of squares. The higher \(\lambda \), the larger the penalty shrinkage gets, with the result that more coefficients end up being zero. The optimal \(\lambda ^*\) is estimated by minimizing the mean squared error (MSE) (Dimpfl and Kleiman 2019):

$$\begin{aligned} \hbox {MSE}_\mathrm{CV} (\lambda ) = \frac{1}{K} \sum _{i=1}^{K} \frac{1}{n_{i}} \vert \vert y_{i}-X_{i}{\hat{\beta }}_\mathrm{LASSO}^{-i} \vert \vert _{2}^{2} \end{aligned}$$
(5)

using an established 10-fold cross-validation, where \(n_i\) is the size of the ith subsample. To this end, the data are split into K subsets, one part i is removed, the coefficients \({\hat{\beta }}_\mathrm{LASSO}^{-i}\) are estimated on the remaining parts, and the cross-validated \(\hbox {MSE}_\mathrm{CV} (\lambda )\) is calculated for any given value of \(\lambda \).
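The LASSO estimator and the cross-validated choice of \(\lambda \) can be illustrated with a small, dependency-free Python sketch. It uses cyclic coordinate descent with soft-thresholding, which is one standard way to solve the LASSO problem and not necessarily the solver used for the results in this paper. The data, the \(\lambda \) grid, and the rule that assigns observation i to fold i mod K are assumptions made for illustration.

```python
def soft_threshold(a, t):
    """Shrink a towards zero by t and return exactly 0.0 inside [-t, t]."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - b0 - sum_j b_j x_ij)^2 + lam * sum_j |b_j|."""
    n, p = len(X), len(X[0])
    x_bar = [sum(row[j] for row in X) / n for j in range(p)]
    y_bar = sum(y) / n
    Xc = [[row[j] - x_bar[j] for j in range(p)] for row in X]  # centred predictors
    yc = [v - y_bar for v in y]                                # centred response
    z = [sum(Xc[i][j] ** 2 for i in range(n)) for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue  # a constant column carries no information
            rho = 0.0
            for i in range(n):
                partial = yc[i] - sum(beta[k] * Xc[i][k] for k in range(p) if k != j)
                rho += Xc[i][j] * partial
            beta[j] = soft_threshold(rho, lam / 2.0) / z[j]
    beta0 = y_bar - sum(b * m for b, m in zip(beta, x_bar))
    return beta0, beta


def cv_mse(X, y, lam, K=10):
    """Cross-validated MSE as in Eq. 5; observation i is assigned to fold i mod K."""
    n = len(X)
    total = 0.0
    for k in range(K):
        test = [i for i in range(n) if i % K == k]
        train = [i for i in range(n) if i % K != k]
        b0, b = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
        sq_err = sum((y[i] - b0 - sum(bj * xj for bj, xj in zip(b, X[i]))) ** 2 for i in test)
        total += sq_err / len(test)
    return total / K


# Invented data: 20 "reports", two term weights, response driven mostly by the first term
X = [[float(i % 7), float((3 * i) % 5)] for i in range(20)]
y = [1.0 + 2.0 * x1 + 0.1 * x2 + 0.2 * ((7 * i) % 3 - 1) for i, (x1, x2) in enumerate(X)]

grid = [0.0, 1.0, 5.0, 20.0, 100.0]                 # hypothetical candidate values
cv_scores = {lam: cv_mse(X, y, lam) for lam in grid}
best_lam = min(grid, key=cv_scores.get)             # lambda* minimizing MSE_CV
coef_by_lam = {lam: lasso_cd(X, y, lam)[1] for lam in grid}
```

Inspecting `coef_by_lam` shows the behaviour described above: at \(\lambda = 0\) the routine reproduces the OLS fit, and as \(\lambda \) grows the coefficients are shrunk until they are set exactly to zero.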

In contrast, setting \(\alpha = 1\) shortens the term Eq. 3 to the quadratic \(l_2\)-norm penalty \(\lambda \sum _{j=1}^{P} \beta _{j}^2\), and the ridge estimator is implemented (Pröllochs et al. 2015):

$$\begin{aligned} {\hat{\beta }}_\mathrm{Ridge} = {{\,\mathrm{arg\,min}\,}}_\beta \sum _{i=1}^{N} \left[ y_i - \beta _0 - \sum _{j=1}^{P} \beta _j x_{ij} \right] ^2 + \lambda \sum _{j=1}^{P} \beta _{j}^2 \end{aligned}$$
(6)

Again, the tuning parameter \(\lambda \) is the regularization penalty. The quadratic \(l_2\)-norm penalty behaves similarly to the LASSO penalty: if \(\lambda \) approaches zero, we obtain the OLS coefficients; if \(\lambda \) moves towards infinity, the coefficients shrink towards zero. However, in contrast to the LASSO regularization, the ridge estimator does not explicitly set any coefficients equal to zero (Pröllochs et al. 2015).Footnote 2 Again, the optimal \(\lambda ^*\) is estimated by minimizing the MSE using 10-fold cross-validation.
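The different behaviour of the two penalties is easiest to see in the textbook special case of an orthonormal design, where both estimators have closed forms in terms of the OLS coefficient b: the LASSO solution is the soft-thresholded value sign(b) max(|b| - \(\lambda \)/2, 0), and the ridge solution is b / (1 + \(\lambda \)). The Python sketch below applies both rules. The orthonormality assumption does not hold for real term matrices, and the coefficient values are invented, so this is an illustration of the mechanism only.

```python
def lasso_shrink(b, lam):
    """LASSO solution for one coefficient under an orthonormal design."""
    t = lam / 2.0
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(b, lam):
    """Ridge solution for one coefficient under an orthonormal design."""
    return b / (1.0 + lam)  # proportional shrinkage, never exactly zero for b != 0


ols = [1.5, -0.25, 0.1]  # hypothetical OLS coefficients
lam = 0.6                # hypothetical penalty
lasso = [lasso_shrink(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]
```

In this example the two small coefficients drop out of the LASSO dictionary entirely, whereas ridge keeps all three words with reduced weights.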

Equations 4 and 6 are used to estimate the LASSO and ridge regression coefficients \({\hat{\beta }}_\mathrm{LASSO}\) and \({\hat{\beta }}_\mathrm{Ridge}\). The magnitudes of \({\hat{\beta }}_\mathrm{LASSO}\) and \({\hat{\beta }}_\mathrm{Ridge}\) serve as weights and as a measure of variable importance, specifying which variables (words) are included in the final dictionary (Pröllochs et al. 2015). A linear rule is then applied to calculate the sentiment score of document i. Again, the document’s score is defined as the continuous score normalized by the total number of words.
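A possible reading of this linear rule is sketched below in Python: words with non-zero coefficients form the dictionary, and a document's score is the coefficient-weighted sum of its term counts divided by its total word count. The vocabulary, coefficients, and counts are invented, and the use of raw counts (rather than tf-idf weights) is an assumption.

```python
def build_dictionary(vocabulary, coefficients):
    """Keep only the words whose estimated coefficient is non-zero."""
    return {word: w for word, w in zip(vocabulary, coefficients) if w != 0.0}


def document_score(term_counts, dictionary):
    """Coefficient-weighted sum of term counts, normalized by total words."""
    total_words = sum(term_counts.values())
    if total_words == 0:
        return 0.0
    raw = sum(dictionary.get(term, 0.0) * count for term, count in term_counts.items())
    return raw / total_words


# Hypothetical estimation output and a hypothetical report
vocabulary = ["aufschwung", "drastisch", "konjunktur"]
coefficients = [0.4, -0.6, 0.0]
dictionary = build_dictionary(vocabulary, coefficients)

report_counts = {"aufschwung": 2, "drastisch": 1, "konjunktur": 5}
score = document_score(report_counts, dictionary)
```

Words that were shrunk to zero, such as the third term here, still count towards the length of the document but do not contribute to its sentiment.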

2.3 Recursive estimation

To guarantee that the tests for forecast efficiency and predictive power use no information that was (hypothetically) unknown to the forecaster at time t, a recursive estimation technique is applied to the sentiment indices based on the automated variable selection approach. First, a sufficiently large text corpus is generated as a basis (pre-estimation corpus) using business cycle forecast reports from the period 1993–1998, including 74 observations. Second, based on the pre-estimation corpus, a recursive estimation approach is applied, expanding the estimation window by one observation per estimation step in chronological order. The following procedure is executed in each recursive estimation step: first, the extended text corpus is established and weighted; second, the optimal \(\lambda ^*\) is estimated by minimizing the MSE using 10-fold cross-validation; third, the LASSO and ridge estimators (Eqs. 4, 6) are used to estimate the respective dictionaries and weights (\({\hat{\beta }}_\mathrm{LASSO}\) and \({\hat{\beta }}_\mathrm{Ridge}\)); finally, the respective sentiment (document) score is calculated and stored in a common series.
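The expanding-window logic can be sketched as a short Python loop. The functions `estimate` and `score` are hypothetical placeholders for the weighting, cross-validation, and LASSO or ridge steps described above; the toy estimator used here is a simple no-intercept regression. Including the current report in its own estimation window is an assumption about how the procedure is implemented.

```python
def recursive_scores(reports, estimate, score, n_initial):
    """Expanding-window estimation over chronologically ordered reports."""
    series = []
    for t in range(n_initial, len(reports)):
        window = reports[: t + 1]      # pre-estimation corpus plus all reports up to t
        model = estimate(window)       # placeholder for weighting, CV and LASSO/ridge
        series.append(score(model, reports[t]))  # score only the newest report
    return series


# Toy stand-ins: each "report" has one text feature x and one numerical forecast y
reports = [{"x": float(i), "y": 2.0 * i} for i in range(1, 9)]


def estimate(window):
    sxx = sum(r["x"] ** 2 for r in window)
    sxy = sum(r["x"] * r["y"] for r in window)
    return sxy / sxx  # slope of a no-intercept regression


def score(beta, report):
    return beta * report["x"]


series = recursive_scores(reports, estimate, score, n_initial=5)
```

Because each score is computed from a model that has seen only reports published up to that date, later reports cannot influence earlier sentiment values.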

3 Corpus and data

3.1 The text corpus

The plain corpus includes business cycle forecast reports for Germany issued by 10 institutions with different institutional backgrounds. First, the corpus covers the six largest economic research institutes in Germany that are formally politically and economically independent. These comprise the five publicly funded institutes, the Ifo Institute Munich (Ifo), the Berlin Institute (DIW), the Essen Institute (RWI), the Halle Institute (IWH), the Kiel Institute (IfW), and the privately funded Hamburg Institute (HWWI).Footnote 3 Second, the corpus contains institutes that are funded by interest groups: the employer’s institute of the German economy located in Cologne (IW Köln), and the trade union’s macroeconomic policy institute (IMK). Third, the corpus includes the ‘joint diagnosis’ (GD), the economic projection of the leading research institutes as an institution within the process of economic policy advice. Fourth, the corpus covers a financial institution, the Bundesbank. The German central bank is another formally politically and economically independent public institution.

The entire corpus contains 534 documents.Footnote 4 A wider range of business cycle forecast reports for Germany exists, but reports from institutions other than those selected did not meet the defined criteria. For the selection, the following criteria were checked:

  • Business cycle forecast (sub-)section: Business cycle forecast reports are heterogeneous in size and content. Some reports are structured into different subsections like recent national or international economic development, business cycle forecasts, economic policy advice, or methodological explanations. Other reports are miscellaneous texts of various themes and cannot be split in a meaningful way. Therefore, business cycle reports should contain a clearly defined forecast (sub-)section.

  • Time range: The corpus covers business cycle forecast reports for Germany from 1993 to 2017 to circumvent the German reunification and possible misspecification for East and West Germany.

  • Forecasters’ experiences: Continuity and regularity of publication within the examined period ensure forecasters’ experience in the field of economic forecasting and a sufficient level of homogeneity in language across institutes.

  • Language homogeneity: The (relatively short) period of 25 years as well as forecasters’ experience assure a sufficient degree of homogeneity in language over time.

  • Quantitative forecast availability: To use a comparable sample for growth and inflation forecast analysis, only business cycle forecast reports with a calculable fixed horizon forecast for growth and inflation are used. The availability of numerical point forecasts of growth and inflation for the current and next year restricts the number of incorporated forecast reports (see Sect. 3.2).

  • Forecasting date: The forecasting date is distributed over the whole year, depending on the respective institutional practice and the frequency of publication. In most cases, the frequency of publication is bi-annual or higher.

  • Text availability: Another criterion was the public availability of business cycle forecast reports, which is why private institutions like banks are not included.

As a result, 534 business cycle forecast reports for Germany issued by 10 institutions are used for the creation of the corpus. In the first step of the textual analysis, data cleaning and linguistic pre-processing are applied to all texts. Specifically, line breaks, numbers, and words with fewer than four characters are eliminated; all text is converted to lower case; and stopwords (e.g. from German linguistic stopword lists, or names) as well as sparse terms, i.e. words that occur in less than 10% of the documents, are removed. With reference to Zipf’s law (Zipf 1949), the texts are weighted with their term frequency-inverse document frequency (tf-idf).Footnote 5 Zipf’s law for empirical language implies that a word’s frequency is inversely proportional to its rank. Consequently, the corpus is adjusted for this effect. Figure 1 shows the wordcloud of the weighted corpus. The wordcloud orders terms by frequency: the larger the word, the more often the term occurs. The wordcloud shows that the weighted corpus includes a lot of important forecast-specific vocabulary, for example ‘Anstieg’ (growth), ‘Prognose’ (forecast), and ‘Exporte’ (exports).Footnote 6

Fig. 1 Wordcloud of German business cycle forecast reports. Notes: Own illustration

Finally, Porter’s stemming algorithm (Porter et al. 1980) is used to truncate the different word forms to their base forms.Footnote 7
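The cleaning and weighting steps of this subsection (without stemming) can be sketched in plain Python as follows. The three example sentences and the stopword list are invented, and the tf-idf variant used here (relative term frequency times the natural logarithm of the inverse document share) is a common textbook definition that is assumed rather than taken from the paper.

```python
import math
import re
from collections import Counter


def preprocess(text, stopwords):
    """Lower-case, keep letter-only tokens, drop short words and stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # letters only, so numbers vanish
    return [t for t in tokens if len(t) >= 4 and t not in stopwords]


def tf_idf(docs, min_doc_share=0.10):
    """Weight each document's terms by tf-idf after removing sparse terms."""
    n_docs = len(docs)
    counts = [Counter(d) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    vocab = {t for t, df in doc_freq.items() if df / n_docs >= min_doc_share}
    weighted = []
    for c in counts:
        total = sum(c.values())
        weighted.append(
            {t: (c[t] / total) * math.log(n_docs / doc_freq[t]) for t in c if t in vocab}
        )
    return weighted


# Invented mini-corpus and stopword list, for illustration only
texts = [
    "Die Exporte steigen 2015 deutlich.",
    "Aber die Prognose für die Exporte bleibt stabil.",
    "Der Anstieg der Exporte stützt die Prognose.",
]
stopwords = {"aber"}

docs = [preprocess(t, stopwords) for t in texts]
weighted = tf_idf(docs)
```

A term that appears in every document, such as 'exporte' in this toy corpus, receives a weight of zero, which is how the weighting downplays very frequent vocabulary.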

3.2 The sample

The incorporated business cycle forecast reports for Germany typically contain numerical fixed event forecasts of growth and inflation for the current and next year. Depending on the forecast date, the forecast horizon of fixed event forecasts varies from one up to 11 months. Heilemann and Müller (2018) show in a forecast evaluation study for Germany that forecast accuracy decreases with increasing forecast horizon, and that differences in forecast accuracy are mainly determined by the different timings of the production of the forecasts.Footnote 8

Furthermore, uncertainty and cross-sectional dispersion of fixed event forecasts show a pronounced seasonal pattern (Dovern et al. 2012). Consequently, fixed horizon forecasts are used to reduce different forecast horizons within one quarter. The method of Dovern and Fritsche (2008), Heppke-Falk and Hüfner (2004) and Smant (2002)

$$\begin{aligned} {\hat{y}}^{12}_{i,t} = \frac{4-q+1}{4}{\tilde{y}}^{0}_{i,t} + \frac{q-1}{4}{\tilde{y}}^{1}_{i,t} \end{aligned}$$
(7)

is applied to construct 12-month-ahead fixed horizon forecasts for growth and inflation. Given the current-year (\({\tilde{y}}^{0}_{i,t}\)) and next-year (\({\tilde{y}}^{1}_{i,t}\)) fixed event forecasts, q denotes the quarter in which the forecast is made. The fixed horizon forecast is then approximated as a weighted average of the two fixed event forecasts, with weights given by the share of the forecast horizon that falls in each year. For example, considering the forecasts of the Berlin institute from September 2015, \({\tilde{y}}^{0}_{i,t} = 1.8\) and \({\tilde{y}}^{1}_{i,t} = 1.9\), q is equal to three and the 12-month-ahead fixed horizon forecast is \({\hat{y}}^{12}_{i,t} = 1.85\).
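Equation 7 translates directly into a short Python function. The function name and argument names are chosen for illustration; the numerical example is the DIW forecast from September 2015 given in the text.

```python
def fixed_horizon_forecast(current_year, next_year, quarter):
    """12-month-ahead fixed horizon forecast from two fixed event forecasts (Eq. 7)."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    w_current = (4 - quarter + 1) / 4  # share of the horizon in the current year
    w_next = (quarter - 1) / 4         # share of the horizon in the next year
    return w_current * current_year + w_next * next_year


# Worked example from the text: September 2015 lies in the third quarter
diw_sep_2015 = fixed_horizon_forecast(1.8, 1.9, quarter=3)
```

In the first quarter the fixed horizon forecast coincides with the current-year forecast, and the weight of the next-year forecast increases by one quarter with each subsequent quarter.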

Moreover, forecast narratives cannot distinguish between different forecast horizons within a quantitative textual analysis. All in all, nine different sentiment indices will be calculated for each forecasting report at time t.

Figure 2 depicts the different forecast horizons and the construction of the 12-month-ahead fixed horizon forecast and sentiment indices using a forecast report of the German Institute for Economic Research (DIW Berlin).

Fig. 2 Time line of forecast horizons and construction for DIW forecast report. Notes: Own illustration

In addition, seasonally adjusted and finally revised real GDP is used for realized GDP growth (quarterly data, source Federal Statistical Office 2019b), and the revised consumer price index is used for the actual inflation outcome (monthly data, source Federal Statistical Office 2019a).Footnote 9 Dovern et al. (2012) point out that the approximation error in the fixed horizon series in Eq. 7 could result in a correlation if the dependent variable and the regressors are constructed in the same way. To avoid this, the annualized cumulative percentage change from past quarter \(t-h\) to current quarter t is used for the realized values. Here, \(h=4\) denotes the forecasting horizon in quarters based on the 12-month-ahead fixed horizon forecasts.

The forecast error is defined as \(e_{t} = A_{t} - P_{t}\), i.e. the realized value in period t minus the forecast made in period \(t-j\). Hence, a positive forecast error represents an underestimation of the growth (inflation) rate, whereas a negative forecast error corresponds to an overestimation.

Table 1 Descriptive statistics on forecast accuracy in Germany, 1993–2017

Table 1 provides an overview of some standard error measures of forecast evaluation (see for example, Fildes and Stekler 2002) for the pooled data of the introduced sample. On the whole, the error measures correspond to previous forecast evaluation studies for Germany (Heilemann and Stekler 2013; Döpke et al. 2019). The ME is nearly zero, indicating unbiased forecasts. The MAE and RMSE of the growth forecasts are on average large compared to Heilemann and Stekler (2013) and Döpke et al. (2019) due to the forecasting error in the Great Recession 2008/2009.Footnote 10
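For reference, the three error measures mentioned here (ME, MAE, RMSE) can be computed as in the following Python sketch, using the definition of the forecast error as realized value minus forecast. The numbers are invented and do not correspond to Table 1.

```python
from math import sqrt


def error_measures(actual, forecast):
    """Return mean error, mean absolute error and root mean squared error."""
    errors = [a - p for a, p in zip(actual, forecast)]  # e_t = A_t - P_t
    n = len(errors)
    me = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    return me, mae, rmse


# Hypothetical realized values and forecasts (per cent)
actual = [2.0, 1.0, -1.0, 3.0]
forecast = [1.5, 1.5, 1.0, 2.0]
me, mae, rmse = error_measures(actual, forecast)
```

The example shows why a mean error close to zero does not imply accurate forecasts: positive and negative errors offset each other in the ME but not in the MAE or RMSE, and a single large miss, like the one in the third period, weighs most heavily on the RMSE.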

Considering the ability to forecast turning points, three directional analysis measures are included. First, referring to Diebold and Lopez (1996, p. 28) and Merton (1981), the information content of a forecast series is calculated.Footnote 11 The forecasts beat a pure coin-flip if the information content has a value above one. Second, a \(\chi ^2\)-test examines whether the forecasts are significantly better than chance, testing the null hypothesis of no information content of the forecasts under investigation. The results indicate that both growth and inflation forecasts have significant information content at conventional significance levels. Third, the area under the receiver operating characteristic curve (AUROC), a frequently used measure of the quality of directional forecasts (see, e.g. Berge and Jordà 2011; Pierdzioch and Rülke 2015; Liu and Moench 2016), is calculated. An AUROC \( < 0.5 \) indicates that forecasts are even worse than a pure coin-flip, and an AUROC \( = 0.5 \) indicates that forecasts are indistinguishable from a pure coin-flip because the ROC curve coincides with the \(45^{\circ }\) line. An AUROC \( > 0.5 \) and \( < 1 \) beats the coin-flip, whereas an AUROC \( = 1 \) represents perfect forecasts. For both growth and inflation forecasts, the AUROC again beats a pure coin-flip and indicates some directional accuracy.
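The AUROC can be computed without plotting the ROC curve, because it equals the share of (upturn, downturn) pairs in which the upturn received the higher forecast signal, with ties counted as one half. The Python sketch below uses this pairwise formulation. The coding of the labels (1 for a realized increase, 0 otherwise) and the example signals are assumptions made for illustration.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one observation of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical forecast signals and realized directions (1 = increase, 0 = decrease)
signals = [0.8, -0.3, -0.2, 0.5, -0.6]
realized_up = [1, 1, 0, 1, 0]
a = auroc(signals, realized_up)
```

A signal that carries no information, for example one that is identical in every period, yields 0.5 under this definition, matching the coin-flip benchmark described above.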

4 Empirical results

4.1 Sentiments’ characteristics

Table 2 gives an overview of sentiment characteristics.

Table 2 Overview dictionaries metrics

Based on dictionary metrics such as the number of positive and negative entries and standard statistical measures, Table 2 shows how differently the individual sentiment approaches work. The ridge estimation results confirm that the ridge estimator does not explicitly set coefficients equal to zero: the ridge approach selects many more words than its LASSO counterpart.

Tables 10, 11 and 12 list, for a full-sample example, the (stemmed) dictionaries and weights generated by the automated variable selection approach. Table 10 shows the estimated 71 words and their coefficients according to the LASSO regression with real GDP growth forecasts as the response variable (hereinafter ‘LASSO_GDP_P’). The term with the most positive weight is ‘upswing’ (‘Aufschwung’), which in German is also a synonym for ‘boom’ or ‘recovery’, whereas ‘drastic’ (‘drastisch’) is the word with the most negative coefficient. The list of plausible words and weights with respect to GDP development is long, e.g. ‘export dynamic’ (‘Exportdynamik’), ‘continuation’ (‘Fortsetzung’), ‘lively’ (‘schwungvoll’) with positive coefficients, or ‘deep’ (‘tief’), ‘layoffs’ (‘Entlassungen’), and ‘shrink’ (‘schrumpfen’) with negative coefficients. Nevertheless, the list contains a few outliers whose economic sense is not immediately clear, e.g. ‘a third’ (‘drittel’), or where the words have a non-intuitive weight, such as ‘recover’ (‘erholen’).Footnote 12

Similar patterns can be observed in the other text regression-based dictionaries. Table 11 lists the estimated 69 words and weights according to the LASSO regression with inflation forecasts as the response variable (hereinafter ‘LASSO_INF_P’). Table 12 lists ridge regression results for real GDP growth forecasts (hereinafter ‘Ridge_GDP_P’) and inflation forecasts (hereinafter ‘Ridge_INF_P’). Both tables list the top 30 estimated words with the largest positive and negative coefficients.

Figures 3 and 4 give a visual impression of the generated sentiment indices. The figures illustrate the sentiment values per business cycle forecast report aggregated over years and across institutes, in combination with the realized real GDP growth or inflation rate, respectively. Panels (a) to (i) present for each sentiment specification the aggregate sentiment value per year on the left axis (solid line), and the realized value of GDP growth or inflation, respectively, on the right axis (dashed line).

Considering each of the panels from (a) to (i) separately, we can conclude that each sentiment specification varies in its pattern. Concerning, for instance, the Great Recession in 2008–09, it can be seen that some sentiment indices are closer to the real development, e.g. LASSO_GDP forecast in Fig. 3, whereas some sentiment indices have a longer time lag, e.g. Sharpe 1 in Fig. 3. Other sentiment indices are even ahead of the real development, e.g. Sharpe 2 in Fig. 4. Yet other indices show a (partly) countercyclical behaviour. For example, Bannier1 and Bannier2 in Fig. 4 show this countercyclical behaviour, which could be explained by a huge time lag or an opposite polarity of terms.

In summary, the generated sentiment indices differ across patterns and in amplitude, as well as in terms of time lag and lead.

Fig. 3 Sentiment indices (solid line) and realized growth (dashed line), aggregate over years and institutes

Fig. 4 Sentiment indices (solid line) and realized growth (dashed line), aggregate over years and institutes

4.2 Forecast efficiency

Forecast efficiency analysis is used to test whether the narratives of German business cycle reports contain useful information for the numerical forecasts of German forecasters. More precisely, we test whether the sentiment indices can be used to improve the accuracy of the quantitative point forecasts. In particular, we test for weak and strong efficiency of forecasts by using the specification of Holden and Peel (1990):

$$\begin{aligned} e_{i,t} = \beta _{0,i} + \beta _1 e_{i,t-1} + \beta _2 \hbox {Sentiment}_{i,t} + u_{i,t}, \end{aligned}$$
(8)

and test the joint null hypothesis \(H_0 : \beta _{0,i} = \beta _1 = \beta _2 = 0\).

In Eq. 8, \(e_{i,t}\) is the forecast error of forecaster i at time t, \(\beta _{0,i}\) is the individual effect of institution i, \(e_{i,t-1}\) is the institution’s forecast error made in \(t-1\), \(\hbox {Sentiment}_{i,t}\) is the forecaster’s sentiment index at time t, an exogenous variable that is known to the forecasters on the forecasting date, and \(u_{i,t}\) is the error term. Forecasts are weakly efficient if the forecast errors are not autocorrelated, and forecasts are strongly efficient if there is no variable that helps to predict the forecast errors, including the lagged forecast error. Optimal forecasts should consider all information available at the date of the forecast. A fixed effects estimation approach is used to account for individual institutional effects, such as different forecast horizons during the quarter. According to Gaibulloev et al. (2014), panel-corrected standard errors (PCSE) suggested by Beck and Katz (1995) are reliable for panels with T>N and deal with unit heterogeneity and panel heteroscedasticity. In this setting, the standard test statistics are reliable and the Nickell bias (Nickell 1981) is negligible (see Gaibulloev et al. 2014, and the literature cited therein).Footnote 13 Estimates are corrected for serial and cross-sectional correlation. Comparable forecast evaluation studies have used this kind of robust standard errors (see, among others, Keane and Runkle 1990; Kauder et al. 2017; Döpke et al. 2019).
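As a minimal sketch of the point estimates behind Eq. 8, the fixed effects (within) estimator can be written in a few lines of plain Python. The function names `within_ols` and `simulate`, the dictionary data layout, and the toy coefficients are assumptions made for illustration only. The sketch does not compute the panel-corrected standard errors or the joint test, which would require a dedicated econometrics package.

```python
def within_ols(panel):
    """Within (fixed-effects) estimates of beta_1 and beta_2 in Eq. 8.

    `panel` maps an institution id to a time-ordered list of
    (forecast_error, sentiment) pairs (hypothetical data layout).
    """
    rows = []  # demeaned (e_t, e_{t-1}, sentiment_t) triples
    for series in panel.values():
        # Pair each error with its own lag; the first observation is lost.
        obs = [(series[t][0], series[t - 1][0], series[t][1])
               for t in range(1, len(series))]
        n = len(obs)
        means = [sum(o[k] for o in obs) / n for k in range(3)]
        # Subtracting institution means removes the individual effect.
        rows.extend(tuple(o[k] - means[k] for k in range(3)) for o in obs)

    # Normal equations for two regressors, solved by Cramer's rule.
    s11 = sum(x1 * x1 for _, x1, _ in rows)
    s22 = sum(x2 * x2 for _, _, x2 in rows)
    s12 = sum(x1 * x2 for _, x1, x2 in rows)
    s1y = sum(x1 * y for y, x1, _ in rows)
    s2y = sum(x2 * y for y, _, x2 in rows)
    det = s11 * s22 - s12 * s12
    beta1 = (s1y * s22 - s2y * s12) / det
    beta2 = (s2y * s11 - s1y * s12) / det
    return beta1, beta2


def simulate(alpha, sentiments, beta1=0.5, beta2=-0.3, e0=0.2):
    """Noise-free toy errors: e_t = alpha + beta1 * e_{t-1} + beta2 * s_t."""
    series, prev = [], e0
    for s in sentiments:
        e = alpha + beta1 * prev + beta2 * s
        series.append((e, s))
        prev = e
    return series


# Two hypothetical institutes with different individual effects.
panel = {
    "A": simulate(0.1, [0.3, -0.2, 0.5, 0.1, -0.4, 0.2]),
    "B": simulate(-0.2, [-0.1, 0.4, 0.0, -0.3, 0.6, -0.5]),
}
beta1, beta2 = within_ols(panel)
```

Because the toy data are generated without noise, the within estimator recovers the assumed coefficients up to rounding. With real forecast errors, nonzero estimates of \(\beta _1\) or \(\beta _2\) would point to a violation of weak or strong efficiency, respectively.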

Table 3 Tests for efficiency of forecasts—1999-2017

Table 3 presents the estimated parameters and the standard errors (in parentheses) of the individual coefficients and the p-value [in brackets] for the joint efficiency test. In almost all cases, the weak efficiency condition of no serial correlation of the forecast errors has to be rejected for GDP growth forecasts. Moreover, test results with sentiment indices indicate several significant influences of forecasters’ narratives on forecast accuracy. For both Sharpe sentiment indices, as well as for all text regression-based sentiment indices, the null of no correlation has to be rejected at a conventional significance level. The negative coefficients indicate that a higher sentiment value correlates with a higher GDP prediction, since smaller (or negative) forecast errors imply higher forecast values. In addition, all specifications reject the joint test on efficiency. However, it remains unclear whether the autocorrelated forecast error or the sentiment indices are the reason for the rejection of the joint tests.

Considering inflation forecasts, the lagged forecast error again generally has a significant influence on the forecast error of the following period at a conventional significance level. Moreover, we find some hints of explanatory power of the narratives for the numerical point forecast errors. Sharpe2 and the LASSO, as well as the ridge sentiment with inflation forecast as response variable, are significantly correlated with the forecast error. Both text regression-based sentiment indices are the only two out of nine specifications that also reject the joint efficiency hypothesis without having autocorrelated errors. The varying signs of the sentiment indices’ coefficients indicate sentiment indices with different polarity. Thus, terms related to rising inflation, e.g. the word ‘inflation’, could have both positive and negative weights, depending on the given dictionary (dictionary-based methods) and the response variable used (text regression methods).

The efficiency test results suggest that forecasters’ narratives have informational power for the forecast errors at the time when the forecasts were made, implying that the numerical forecasts do not make efficient use of all available information. Previous studies (e.g. Döpke et al. 2010, 2019) confirm that forecasts for Germany are not strongly (and in part not weakly) efficient because they do not incorporate all available information. However, these studies do not test the forecasters’ own narratives. Sentiment indices based on business cycle forecast reports seem informative for the accuracy of German business cycle forecasts.Footnote 14 Thus, forecasters’ narratives contain information which is not exhausted by numerical forecasts. One explanation might be that the forecasters’ narratives contain useful information about the future stance of the German economy.

4.3 Predictive power

To test whether the narratives of German business cycle forecast reports contain useful information for the future stance of the German economy, the paper applies an in-sample and an out-of-sample forecast exercise.

4.3.1 In-sample forecasting regressions

Following Estrella and Hardouvelis (1991), Stock and Watson (2003) and Ferreira (2018), single forecasting equations are used to predict actual GDP growth and the inflation rate. The in-sample and (pseudo) out-of-sample forecasting exercise tests whether text-based sentiment indices have predictive power for actual GDP growth and inflation. Similar methods were used to find predictors of economic activity (Estrella and Hardouvelis 1991) or predictors of business cycle fluctuations (Ferreira 2018). To this end, the sentiment indices are transformed by averaging all observations per quarter to build quarterly time series as explanatory variables. Hence, we get a quarterly time series with 100 observations from 1993Q1 to 2017Q4. The dependent variable in the basic forecasting regression is the annualized cumulative percentage change in real GDP or in the consumer price index, respectively. Following Estrella and Hardouvelis (1991) and Stock and Watson (2003):

$$\begin{aligned} {\hat{Y}}_{t|t+h} = (400/{h}) [\hbox {ln} ({Y_{t+h}}/{Y_t}) ] \end{aligned}$$
(9)

where \(Y_t\) and \(Y_{t+h}\) denote the level of real GDP (consumer price index) in period t and \(t+h\), \({\hat{Y}}_{t|t+h}\) is the annualized cumulative percentage change from current quarter t to future quarter \(t+h\), and \(h=4\) denotes the forecasting horizon in quarters. The single forecasting equation follows Ferreira (2018):

$$\begin{aligned} {\hat{Y}}_{t|t+h} = \alpha + \underbrace{\sum _{i=1}^p \rho _{i} {\hat{Y}}_{t-i}}_{\text {Lag. endog. var.}} + \underbrace{\sum _{j=0}^q \beta _{j} \hbox {SI}_{t-j}}_{\text {Sentiment indices}} + \underbrace{\sum _{m=1}^3 \sum _{j=0}^q \gamma _{j}^m \text {IN(m)}_{t-j}}_{\text {Control variables}} +\,\, \epsilon _{t+h} \end{aligned}$$
(10)

where \(\hbox {SI}_t\) denotes the respective sentiment index, and \(\text {IN(m)}\) represents German leading indicators as control variables. The control variables are also standardized by subtracting the mean from each variable and dividing by its standard deviation. The forecast horizon h is set to four quarters to capture the annualized cumulative percentage change (\({\hat{Y}}_{t|t+h}\)) of real GDP, or of the consumer price index, from current quarter t to future quarter \(t+h\). To keep the model parsimonious, the lag length p of the endogenous variable is set to one, and q is set equal to 0.
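The transformation in Eq. 9 and the standardization of the regressors can be illustrated with a short sketch. The function names and the toy index series are hypothetical, and the use of the population standard deviation (dividing by the number of observations) is an assumption, since the text does not specify which variant is applied.

```python
import math


def annualized_growth(levels, h=4):
    """Eq. 9: (400 / h) * ln(Y_{t+h} / Y_t) for every t with t + h in range."""
    return [(400.0 / h) * math.log(levels[t + h] / levels[t])
            for t in range(len(levels) - h)]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


# Toy quarterly index growing 0.5 per cent per quarter (illustrative numbers).
gdp = [100.0 * 1.005 ** t for t in range(8)]
growth = annualized_growth(gdp, h=4)

# Toy indicator values, standardized to mean zero and unit variance.
z = standardize([1.0, 2.0, 3.0])
```

With \(h=4\), the factor \(400/h\) equals 100, so the dependent variable is simply the log change over four quarters expressed in per cent. Note that the last h observations of the level series have no corresponding value of the dependent variable.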

The single forecast regression given in Eq. 10 reduces under these simplifying assumptions to a simple forecast equation, as suggested by Estrella and Hardouvelis (1991). According to Estrella and Hardouvelis (1991), the overlapping forecasting horizons induce a moving average error term of order \(h-1\), resulting in consistent but inefficient estimates. Therefore, Newey and West (1987)-corrected standard errors for heteroscedasticity and autocorrelation are applied with a lag length set equal to three (\(h-1\) with \(h=4\)), in line with Estrella and Hardouvelis (1991).Footnote 15

As control variables for the forecasting regressions, several established economic predictors for the German business cycle are introduced:Footnote 16

  • First, the term spread (long-term interest rate minus short-term interest rate) serves as a monetary control variable. The long-term interest rate is the yield on outstanding debt securities issued by residents with a mean residual maturity of more than nine and up to 10 years (monthly average, source Deutsche Bundesbank 2020). As the short-term interest rate, the EURIBOR 3-month funds money market rate is used (monthly average, source Deutsche Bundesbank 2020).

  • Second, total orders received by German industry serve as the industry control variable. We take the change over the previous month in calendar and seasonally adjusted orders at constant prices (source Deutsche Bundesbank 2020).

  • Third, the Ifo business climate index serves as a leading business cycle indicator (monthly data, source Ifo institute 2020).

Table 4 presents the in-sample forecasting regression results, including selected business cycle indicators as control variables as given by Eq. 10. While neither the lagged endogenous variable nor the Ifo business climate index is significantly different from zero, the order inflow and the term spread have a significant impact on the average GDP growth rate. All control variables have the expected sign and a notable magnitude, indicating a robust specification. Considering the generated sentiment indices, the coefficients are statistically significant in only three out of nine cases. The bag-of-words approach of Bannier1 and both text regression-based sentiments with inflation prediction as response variable (LASSO_INF_P, Ridge_INF_P) are statistically different from zero at conventional significance levels.

The performance of text regression-based sentiment indices with inflation forecasts, instead of GDP growth predictions, as response variables is noteworthy. It seems that this ‘wrong’ macroeconomic target variable captures the real GDP development as well.Footnote 17 This result may hint that GDP sub-aggregates, such as investment and consumption, could be promising response variables for text analysis tools to predict GDP growth.

Table 4 Forecasting equations including sentiment indices and control variables for Germany, GDP, 1999Q1 to 2017Q4

Table 5 presents results regarding inflation in-sample forecasting regressions. Both dictionary-based Bannier sentiment indices have a significant influence on the average growth rate of inflation over the next four quarters. Both sentiment indices are negatively correlated with the target variable.Footnote 18 However, most of the generated sentiment indices do not show a significant impact on the average growth rate of inflation over the next four quarters at a conventional significance level.

In brief, changes in the narratives have weak in-sample predictive power on the average growth rate of GDP and inflation over the next four quarters.

Table 5 Forecasting equations including sentiment indices and control variables for Germany, Inflation, 1993Q1 to 2017Q4

4.3.2 Out-of-sample forecasting performance

To evaluate the pseudo out-of-sample predictive power of the narratives, a reduced version of the forecasting model in Eq. 10 is used to predict the 12-month-ahead average growth rate of real GDP or inflation, respectively:

$$\begin{aligned} {\hat{Y}}_{t|t+h} = \alpha + \sum _{i=1}^p \rho _{i} {\hat{Y}}_{t-i} + \sum _{j=0}^q \beta _{j} SI_{t-j} + \epsilon _{t+h} \end{aligned}$$
(11)

Following Ferreira (2018), we include only the lagged endogenous variable in the forecasting model as an additional regressor. The training sample covers 56 observations for the period from 1999Q1 to 2013Q4. The test sample includes 20 observations for the period from 2014Q1 to 2017Q4, which meets the recommended value of 20 per cent of the full sample (Hyndman and Athanasopoulos 2018). The model is re-estimated at each iteration of the pseudo out-of-sample exercise before each one-step-ahead forecast is computed. The number of lags of the endogenous variable (p) and of the predictor variable \(SI_t\) (q) is obtained by minimizing the Akaike information criterion (AIC) at each forecasting period. An autoregressive model is used as a comparative benchmark model. The order of the autoregressive model is also determined by minimizing the AIC at each forecasting period. In order to evaluate the predictive ability of the narratives, two common forecast evaluation metrics are calculated in a first step. The relative MAE:

$$\begin{aligned} \text {Relative MAE} = \frac{\frac{1}{T}\sum _{t=1}^T \left| e_{t}^{\hbox {SI}(k)} \right| }{\frac{1}{T}\sum _{t=1}^T \left| e_{t}^{\hbox {AR}} \right| } \end{aligned}$$
(12)

with a linear loss function, and the relative MSE with quadratic loss:

$$\begin{aligned} \text {Relative MSE} = \frac{\frac{1}{T} \sum _{t=1}^T \left( e_{t}^{\hbox {SI}(k)}\right) ^2}{\frac{1}{T} \sum _{t=1}^T \left( e_{t}^{\hbox {AR}}\right) ^2 } \end{aligned}$$
(13)

are calculated by using the respective forecast error \(e_{t}\) of the model in Eq. 11 in relation to that of the benchmark autoregressive model. If the value of the relative measure is smaller than 1, the model with the sentiment index outperforms the benchmark model.
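For illustration, the two relative measures in Eqs. 12 and 13 can be computed as follows. The function names and the error values are hypothetical and are not taken from Table 6.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def relative_mae(e_si, e_ar):
    """Eq. 12: mean absolute error of the sentiment model over the benchmark."""
    return mean([abs(e) for e in e_si]) / mean([abs(e) for e in e_ar])


def relative_mse(e_si, e_ar):
    """Eq. 13: mean squared error of the sentiment model over the benchmark."""
    return mean([e * e for e in e_si]) / mean([e * e for e in e_ar])


# Illustrative forecast errors over three test periods.
e_si = [0.5, -1.0, 0.5]   # model with sentiment index
e_ar = [1.0, -1.0, 2.0]   # autoregressive benchmark

rel_mae = relative_mae(e_si, e_ar)   # 0.5
rel_mse = relative_mse(e_si, e_ar)   # 0.25
```

In this toy case both ratios are below 1, so the model with the sentiment index would be judged more accurate than the benchmark. The relative MSE is further from 1 because quadratic loss penalizes the large benchmark error in the last period more heavily.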

In a second step, a Diebold–Mariano test (Diebold and Mariano 1995; Harvey et al. 1997) is employed to test the out-of-sample forecasting performance. To this end, the null hypothesis of equal predictive accuracy (i.e. equal expected loss) between the forecasts with sentiment index and those without (benchmark model) is tested against the one-sided alternative hypothesis that the forecasts without sentiment index are less accurate:Footnote 19

$$\begin{aligned} H_0: L \left( e_t^\mathrm{AR} \right) = L\left( e_t^{\mathrm{SI}(k)} \right) \, \text {versus} \, H_1: L \left( e_t^\mathrm{AR} \right) > L \left( e_t^{\mathrm{SI}(k)} \right) \end{aligned}$$
(14)

where \(L(e_t)\) represents the respective linear loss \(L(e_t)=|e_t|\) or quadratic loss \(L(e_t)=e_t^2\). Again, the Newey and West (1987) procedure is applied to correct for autocorrelation, and the lag length is set equal to 3 (\(h-1\)) following Estrella and Hardouvelis (1991).
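The test in Eq. 14 can be sketched as follows, under stated assumptions. The sketch uses a Bartlett-kernel (Newey–West) long-run variance with \(h-1\) lags, takes the absolute error as the linear loss in line with Eq. 12, and approximates the p-value with the standard normal distribution. The small-sample correction of Harvey et al. (1997) is omitted for brevity, so the numbers need not match those reported in Table 6. The function name and the error series are hypothetical.

```python
import math


def diebold_mariano(e_ar, e_si, h=4, loss="quadratic"):
    """One-sided DM statistic for H1: the benchmark (AR) loss is larger.

    Returns the statistic and a normal-approximation p-value P(Z > stat).
    """
    f = (lambda e: e * e) if loss == "quadratic" else abs
    d = [f(a) - f(s) for a, s in zip(e_ar, e_si)]  # loss differential
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k):
        """Sample autocovariance of the loss differential at lag k."""
        return sum((d[t] - d_bar) * (d[t - k] - d_bar)
                   for t in range(k, n)) / n

    lags = h - 1
    # Bartlett weights 1 - k / (lags + 1) keep the variance non-negative.
    long_run_var = autocov(0) + 2.0 * sum(
        (1.0 - k / (lags + 1.0)) * autocov(k) for k in range(1, lags + 1))
    stat = d_bar / math.sqrt(long_run_var / n)
    p_value = 0.5 * math.erfc(stat / math.sqrt(2.0))
    return stat, p_value


# Illustrative errors: the benchmark errors are larger in absolute value.
e_ar = [1.0, -1.2, 0.8, -1.1, 0.9, -1.3, 1.0, -0.7, 1.2, -0.9]
e_si = [0.5, -0.4, 0.6, -0.3, 0.5, -0.6, 0.4, -0.5, 0.3, -0.6]

stat, p_value = diebold_mariano(e_ar, e_si, h=4, loss="quadratic")
stat_rev, p_rev = diebold_mariano(e_si, e_ar, h=4, loss="quadratic")
```

A positive statistic with a small p-value supports the alternative that the benchmark is less accurate. Swapping the two error series flips the sign of the statistic, which is why the one-sided formulation matters for the interpretation. If the loss differential were constant, the variance estimate would be zero and the statistic undefined.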

Table 6 Out of sample forecasting performance

Table 6 shows the pseudo out-of-sample forecasting performance results for real GDP growth and inflation. The first two columns present the relative forecast performance based on relative MAE and MSE measures. Considering GDP growth, two forecasting series with regression-based sentiment indices (LASSO_GDP_P, Ridge_INF_P) beat the benchmark series in both relative measures, MAE and MSE, whereas one forecasting series (LASSO_INF_P) outperforms the benchmark series at least in relative MAE. In contrast, no dictionary-based sentiment index outperforms the benchmark forecasts in relative forecast performance metrics. Statistical tests of whether the forecasting series with sentiment indices are more accurate than the benchmark forecasts without sentiment indices are given in columns three to six. The Diebold–Mariano tests for linear and quadratic losses do not reject the null hypothesis of equal predictive accuracy for all except one (linear: LASSO_GDP_P) forecasting series with sentiment indices at a conventional significance level. Thus, the generated sentiment indices do not seem to be a statistically powerful out-of-sample predictor for the average growth rate of GDP over the next four quarters.

Forecasting performance results for inflation are also given in Table 6. On average, the relative forecast performance of the sentiment series, measured by the relative MAE and MSE, is also weak. Again, two forecasting series (Sharpe2, Ridge_INF_P) outperform the benchmark series in relative MAE and relative MSE, whereas one forecast series with the LASSO_INF_P index beats the benchmark series in relative MSE. Considering Diebold–Mariano tests, the null hypothesis of equal forecast accuracy can only be rejected for the forecasting series with the Ridge_INF_P index, under both linear and quadratic loss.

To summarize, forecasters’ narratives in the form of sentiment indices have weak, at best, predictive ability regarding future GDP growth and inflation in a (pseudo) out-of-sample environment.

5 Discussion and conclusion

Based on 534 business cycle forecast reports covering 10 German institutions for the period 1993–2017, the paper analysed the information content of German forecasters’ narratives for German business cycle forecasts and macroeconomic development. In order to do that, textual analysis is used to convert qualitative text data into quantitative sentiment indices.

In a first step, computational textual analysis methods are used to transform forecasters’ expectations about the future macroeconomic development into nine sentiment indices.

Second, sentiment analysis shows that the generated sentiment indices vary in their behaviour, pattern, and amplitude. In addition, the sentiment indices differ in their temporal relationship to the realized macroeconomic development. Some sentiment indices move nearly in parallel with the realized value, others lag behind the real development, and a small number of exceptions (partly) lead the realized value.

Third, sentiment indices are used to test forecast efficiency for GDP growth and inflation forecasts. Using 12-month-ahead fixed horizon forecasts, fixed-effects panel regression results suggest several sentiment indices with informational content for GDP growth and inflation forecasts. German forecasters’ narratives can enhance the accuracy of German business cycle forecasts. Overall, the results are in line with the findings of Jones et al. (2020), Sharpe et al. (2020) and Clements and Reade (2020). The four-quarter forecast horizon is comparable with the results of Sharpe et al. (2020) for the Fed’s Greenbook, whereas findings for the UK show shorter forecast horizons (Jones et al. 2020; Clements and Reade 2020).

Fourth, a forecasting exercise analysed the predictive power of sentiment indices for realized growth and inflation. Such predictive power might explain why forecasters’ narratives have predictive power for forecast errors. However, the forecasting exercise finds modest evidence, at best, for this hypothesis. The results indicate weak in-sample and out-of-sample predictive power of the sentiment indices for the future stance of the economy. More sophisticated forecasting models, e.g. mixed-data sampling (MIDAS) regression models, could nevertheless improve the results.

There are several explanatory hypotheses as to why the narratives contain information that is not exhausted by numerical forecasts. One of these is information rigidity. Based on the hypothesis that forecast revisions have predictive power for forecast errors (Nordhaus 1987), Coibion and Gorodnichenko (2015) and Dovern et al. (2015) find some hints supporting this hypothesis using tests for numerical forecasts in an international setting. Kirchgässner and Müller (2006) also find some evidence that German forecasters are reluctant to revise numerical forecasts. In a similar vein, forecasters’ narratives could be adjusted faster than their numerical counterparts. However, the analysis of sticky point forecasts by Sharpe et al. (2020) finds only weak evidence, at best, for this hypothesis. Another explanatory approach for the predictive power of forecasters’ narratives is the ‘modal-forecast explanation’ (Sharpe et al. 2020, p. 5). This hypothesis is based on the concept that the sentiment indices are particularly informative about tail risks, whereas numerical forecasts do not balance the risks because they are modal rather than mean forecasts. The findings of Sharpe et al. (2020) suggest such an interpretation. An additional explanation could be that the forecast narrative offers a wider scope for individuality than the quantitative forecast. The numerical forecast is limited to a number, and the production of the forecasts also depends on the institutes’ hierarchy and other influencing factors (see e.g. Fritsche and Heilemann 2010, for the Joint Diagnosis). Thus, the forecast report may allow the forecaster a higher degree of freedom. A study of the general issue of why forecasters’ narratives have predictive power for forecast errors could form part of further research.

Last but not least, no single sentiment index or sentiment analysis approach is generally superior to the other methods. The forecast-specific dictionary (Sharpe et al. 2020) and text regression methods perform well in tests for forecast efficiency. Considering the predictive power for GDP growth and inflation, both dictionary-based approaches and text regression methods perform relatively weakly. However, the sentiment analysis could be improved in further research using more sophisticated text analysis and machine learning tools.