In this section, we set up a nowcasting model and explain its estimation. Algorithm 1 summarizes the whole procedure.
4.1 The MIDAS Model
Since unemployment rates are monthly statistics, it is not straightforward to develop a predictive model using daily data. As discussed in Sect. 2, such a situation is called “mixed data sampling”, or MIDAS for short. We employ the simplest variant of MIDAS models, the “bridge equation”. Ghysels and Marcellino (2018) [30] provide a detailed explanation of MIDAS models, and the notation used in this paper follows their book. Suppose \(y_t\) is a monthly (low frequency) outcome variable to be predicted and \(x^H_{n,t-i/d}\), \(n = 1, \ldots , N\), are daily (high frequency) feature variables. The two sampling frequencies are not directly compatible, so we need to “bridge” the high frequency data \(x^H_{n,t}\) to low frequency variables \(x^L_{n,t}\). That is,
$$\begin{aligned} x^L_{n,t} = \sum \limits _{i=0}^{30}\phi _ix^H_{n,t - i/30} \end{aligned}$$
(4)
where the \(\phi _i\) are positive scalars satisfying \(\sum \limits _{i=0}^{30}\phi _i=1\). Hereafter we assume every month has 31 days regardless of the month, and pad the first days with zeros for months with fewer than 31 days. For example, February in a non-leap year (28 days) becomes \((0,0,0, \text {1st day}, \cdots , \text {28th day})\). Then, with a suitable machine learning model f, one can forecast \(y_t\):
$$\begin{aligned} \hat{y}_t = f(x^L_{1, t}, \cdots , x^L_{N,t}, \mathbf {z}_t), \end{aligned}$$
(5)
where \(\mathbf {z}_t\) includes month and year.
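As an illustration, the bridging step of Eq. (4) can be sketched as follows. This is a minimal sketch, not the paper's implementation: the helper names and the equal-weight example are ours.

```python
import numpy as np

def pad_month(daily_values):
    """Left-pad a month's daily series with zeros so it has 31 entries,
    e.g. a 28-day February becomes (0, 0, 0, day 1, ..., day 28)."""
    d = len(daily_values)
    return np.concatenate([np.zeros(31 - d), np.asarray(daily_values, float)])

def bridge(daily_values, phi):
    """Aggregate one office's daily counts into a monthly value x^L_t (Eq. 4).
    phi must sum to 1; phi[0] weights the most recent day, so the padded
    series is reversed before taking the weighted sum."""
    padded = pad_month(daily_values)
    return float(np.dot(phi, padded[::-1]))

# Example: equal (linear-scheme) weights applied to a 28-day month
phi = np.full(31, 1 / 31)
feb = np.ones(28) * 31.0          # constant 31 visitors per day
print(round(bridge(feb, phi), 6))  # → 28.0 (3 padded zeros dilute the mean)
```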
4.2 Estimation of Parameters
We have two sets of parameters to estimate. One is the weight vector \(\mathbf {\phi }\), which transforms daily data into monthly data. The other is the set of parameters of the model f, which predicts y from x. In the MIDAS literature, the weight vector \(\mathbf {\phi }\) is chosen from several schemes [21]. Here we tried the linear scheme \(\phi _i = 1/31\) and the normalized beta scheme \(\phi _i = \frac{beta(i/30, \alpha , \beta )}{\sum \limits _{i = 0}^{30}beta(i/30,\alpha , \beta )}\), where \(beta(x, \alpha , \beta ) = \frac{x^{\alpha -1}(1-x)^{\beta -1}\Gamma (\alpha + \beta )}{\Gamma (\alpha )\Gamma (\beta )}\). We adopt the normalized beta scheme, as it outperforms the linear scheme. \(\beta \) governs the peak of the weights and \(\alpha \) governs their slope (see Fig. 2). Since the official monthly labor survey collects data during the last week of the month, it is reasonable to set \(\beta = 1\). Finally, \(\alpha \) is chosen by grid search according to the resulting RMSE and MAE.Footnote 1
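The normalized beta weights above can be computed directly; a minimal sketch (the function names and the choice \(\alpha = 2\) are illustrative, and the formula assumes \(\alpha > 1\) so the density is finite at \(x = 0\)):

```python
from math import gamma

def beta_density(x, a, b):
    """Unnormalized beta density beta(x, alpha, beta) from Sect. 4.2."""
    return x ** (a - 1) * (1 - x) ** (b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

def beta_weights(alpha, beta=1.0, days=31):
    """Normalized beta weights phi_i, evaluated at i/(days-1), i = 0..days-1."""
    vals = [beta_density(i / (days - 1), alpha, beta) for i in range(days)]
    total = sum(vals)
    return [v / total for v in vals]

phi = beta_weights(alpha=2.0)   # beta = 1, as in the paper
print(round(sum(phi), 10))      # → 1.0: weights are normalized
```

With \(\beta = 1\) the weights are monotone in \(x\), so all of the mass tilts toward one end of the month, consistent with the survey collecting data in a single week.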
For the forecasting model f, we need to consider that the number of employment service offices (544) is much larger than the number of data points (40 months). This means a standard MIDAS regression is not applicableFootnote 2. We adopt standard random forest and L1-regularized least squares (LASSO); more flexible regression models such as SVMs and neural nets are not suitable for our short time series data. Furthermore, when evaluating the model, the training data become even shorter, since our model should not learn from the future. We evaluate the model on data from May 2018 to Apr 2019, which leaves only 28 months of training data when evaluating at May 2018. Random forest outperforms LASSO in most cases, so we go with random forestFootnote 3.
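The expanding-window evaluation described above (never training on future months) can be sketched as follows. The data shapes, feature count, and hyperparameters here are illustrative stand-ins, not the paper's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-in data: 40 monthly observations and 5 bridged features
# (the real data has 544 offices; these sizes are illustrative only).
X = rng.normal(size=(40, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=40)

# Expanding-window evaluation over the last 12 months:
# at each month t, fit only on months strictly before t.
errors = []
for t in range(28, 40):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[:t], y[:t])
    pred = model.predict(X[t:t + 1])[0]
    errors.append(abs(pred - y[t]))

print(round(float(np.mean(errors)), 3))   # MAE over the evaluation window
```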
4.3 Imputation of Missing Data
Since our goal is to nowcast the unemployment rate as quickly as possible, ideally we want to estimate it on any day of the month. Suppose it is June 28th 2019, the day the official statistics for May are released (Fig. 3), and one wants to forecast the unemployment rate for June 2019. This is called “one month ahead prediction” since it predicts unemployment one month ahead. Then we need to impute missing GPS data for three days (the 28th, 29th and 30th). We use standard ARIMA models to impute the missing GPS data, run separately for each office, with the ARIMA parameters selected automatically by an off-the-shelf R package. At most five days of imputation suffices for one month ahead prediction.
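The imputation step can be illustrated with a simple autoregressive forecast. This is a minimal stand-in for the per-office ARIMA models described above, not the actual pipeline: the helper name, the AR(1) specification, and the toy series are ours.

```python
import numpy as np

def impute_ar1(series, horizon):
    """Forecast `horizon` missing days with an AR(1) model fitted by
    least squares -- a simplified stand-in for automatic ARIMA selection."""
    x = np.asarray(series, float)
    x0, x1 = x[:-1], x[1:]
    # OLS estimate of the lag-1 coefficient and intercept
    phi = np.dot(x0 - x0.mean(), x1 - x1.mean()) / np.dot(x0 - x0.mean(), x0 - x0.mean())
    c = x1.mean() - phi * x0.mean()
    preds, last = [], x[-1]
    for _ in range(horizon):
        last = c + phi * last      # iterate the one-step forecast
        preds.append(last)
    return preds

# Visitor counts observed through June 27th; impute the 28th-30th (3 days).
observed = [100, 104, 101, 107, 103, 108, 105, 110, 106, 112]
print(impute_ar1(observed, horizon=3))
```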
What if we want to conduct two month ahead prediction? If we are in the middle of the month (e.g. July 15th), we need to impute 16 days of GPS logs. We check how the imputation affects prediction performance in the experiment.
Alternatively, we can estimate the model without imputation, using only the available data. However, the number of available days of data changes day by day, which would require many predictive models to be estimated (p. 462 in [30]). Here we resort to a single predictive model with imputation for simplicity.
4.4 Feature Selection
As already discussed in Sect. 2, feature selection is another important task here. Although random forest automatically selects informative feature variables, heuristic feature selection is still beneficial. Since the number of visitors to each office is expected to correlate positively with the number of unemployed persons, offices with negative correlation should be dominated by noise. We first calculate the correlation between the data from each office and the official statistics on the number of unemployed persons over the training period, and then discard offices whose correlation is smaller than 0.3. This procedure, however, still leaves more than a hundred offices.
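The correlation screen can be sketched as follows; the office names and toy series are hypothetical, and only the 0.3 threshold comes from the text.

```python
import numpy as np

def select_offices(office_series, official_unemployed, threshold=0.3):
    """Keep offices whose visitor counts correlate with the official
    unemployment series at or above `threshold` over the training period.
    `office_series` maps an office name to its monthly visitor counts."""
    kept = {}
    for name, series in office_series.items():
        r = np.corrcoef(series, official_unemployed)[0, 1]
        if r >= threshold:
            kept[name] = series
    return kept

# Toy training-period data (office names are hypothetical)
official = np.array([10, 12, 11, 14, 13, 15])
offices = {
    "office_A": np.array([20, 24, 22, 28, 26, 30]),  # tracks official series
    "office_B": np.array([30, 28, 29, 25, 27, 24]),  # negatively correlated
}
print(sorted(select_offices(offices, official)))     # → ['office_A']
```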