1 Introduction

The unemployment rate is widely considered one of the most important macroeconomic indices. Most central governments place a high priority on employment, since unemployment causes many problems including poverty, crime, and social instability. The importance of unemployment statistics is not limited to the public sector. Market participants assess the macroeconomic environment through unemployment rates. Ordinary employees also benefit from knowing unemployment rates, since they must decide whether or not to quit their current jobs based on labor market conditions.

Despite its importance, the index has a serious problem. Official statistical authorities such as the U.S. Bureau of Labor Statistics (BLS) and Eurostat publish unemployment rates on a monthly basis, so we cannot detect acute changes in the labor market from these statistics. Furthermore, the statistics are usually published around a month after the end of the reference month. The delay is caused by the time spent distributing and collecting survey questionnaires and processing the data. For prompt macroeconomic policy intervention and efficient market functioning, important indices such as unemployment rates should be reported as quickly as possible. Catastrophic economic events, such as the financial crisis of 2007–2009, must be noticed by the government and by important decision makers in the private sector before they get seriously worse.

The need for accurate knowledge of the current economic status has led to a large literature on nowcasting [1, 2], which has also been put into practice. The Federal Reserve Bank of Atlanta continuously updates real-time estimates of GDP using monthly economic statistics on its “GDPNow” website [3, 4].

While official economic statistics rarely use high-frequency data (e.g., hourly, daily, or weekly), a massive volume of high-frequency data has become available. GPS log data is one instance. Many location-based apps, such as maps, entertainment, games, and fitness apps, collect users’ geo-location information if the users give permission. These data are primarily utilized to improve user experience as well as for advertising, recommendation, and business intelligence. However, there is a fast-growing literature on statistical analysis using collected GPS logs in a variety of areas, including prediction of demographics and preferences, home detection, travel mode detection, and population analysis, to name a few [5,6,7,8,9].

Recently, “alternative data”, i.e., non-traditional data, has been embraced outside academia. In the financial industry, more and more market participants have started using alternative data, including geo-location data, to make investment decisions [10]. Investors such as hedge funds predict sales from location data [11]. Rigorous consideration is needed in this field.

In this paper, we introduce GPS data to the nowcasting literature and develop a model that predicts current unemployment rates from GPS logs. Our evaluation shows that GPS data has substantial predictive power for the number of unemployed persons. In the following sections, we first briefly review the literature in Sect. 2, then explain our data in Sect. 3. Section 4 gives the details of our model and Sect. 5 evaluates it. Finally, Sect. 6 concludes.

2 Related Work

To the best of our knowledge, this is the first attempt to forecast unemployment rates with GPS data. Nowcasting of labor market statistics with alternative data has been actively studied since Varian and Choi [12] suggested the potential predictive power of search query data. The earliest attempts to forecast unemployment rates with search queries reveal the predictive power of query data for the labor market [13,14,15], and many studies have followed (e.g. [16,17,18]). While most of the papers utilize ARIMA-type models, Onorante and Koop [19] apply Dynamic Model Selection/Averaging and Scott and Varian [20] develop a Bayesian structural time series model.

The present work considers the mixed data sampling (MIDAS) scenario pioneered by Ghysels et al. [21], in which high-frequency data are used to forecast low-frequency data. The idea of MIDAS is to represent frequent data in a parsimonious way. A natural extension is the situation where high-dimensional (large p) and high-frequency predictive variables are present in a small sample (small N). Various models combining feature selection techniques with MIDAS have been proposed [22,23,24]. Recently, Uematsu and Tanaka [25] showed that a simple penalized regression without the MIDAS technique performs well for GDP forecasting with high-frequency data. These studies, however, focus on monthly official statistics as the high-frequency data and quarterly data (usually GDP) as the target. The present paper extends MIDAS to much higher-frequency alternative data.

Moreover, unlike existing models, our model is unique in its purely static form, which reveals the predictive power of the GPS data itself.

3 Data

In this section, we explain the data for the target (unemployment rates) and the predictor (GPS logs) in detail.

3.1 The Unemployment Rate

The unemployment rate is defined as “the number of unemployed persons as a percentage of the total number of persons in the labour force” [26]. In mathematical form,

$$\begin{aligned} u = y/l, \end{aligned}$$
(1)

where y and l denote the number of unemployed persons and the size of the labor force, respectively.

The number of unemployed persons and the number of persons in the labor force are usually surveyed by the government on a monthly basis. In Japan, the monthly Labor Force Survey takes this role. The survey collects information about the labor status of approximately 40,000 households during the last week of each month. To estimate the number of unemployed persons, we take advantage of the fact that they have strong incentives to go to public employment service offices. It is mandatory for unemployed Japanese workers to visit one of the public employment service offices to become eligible for unemployment insurance benefits. Furthermore, they have to visit the office at least once a month to maintain their eligibility [27]. We can thus presume that more visitors imply more unemployed persons.

Once we obtain the number of unemployed persons, we need the size of the labor force to divide it by. Unfortunately, finding clues to the labor force in the GPS data is not easy. However, the labor force is far less volatile, and thus its prediction error is relatively small. A simple ARIMA model produces accurate predictions, with an RMSE of 0.22 million and an MAE of 0.18 million against a mean labor force of 66 million.
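As a concrete sketch of this baseline (the pmdarima package and variable names are our assumptions; any automatic ARIMA routine would do):

```python
# Minimal sketch of the labor-force baseline, assuming a monthly
# pandas Series `labor_force` in millions of persons; pmdarima's
# auto_arima stands in for automatic order selection.
import pmdarima as pm

def forecast_labor_force(labor_force, horizon=1):
    """Fit an ARIMA with automatically selected orders and forecast ahead."""
    model = pm.auto_arima(labor_force, seasonal=True, m=12,
                          information_criterion="bic")
    return model.predict(n_periods=horizon)
```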

In short, we estimate the seasonally-adjusted unemployment rate \(u^{SA}_t\) as

$$\begin{aligned} \hat{u}^{SA} = \hat{y}^{SA}/\hat{l}^{SA}, \end{aligned}$$
(2)
$$\begin{aligned} \hat{y}^{SA} = \hat{y}^{\mathrm{GPS}}/s^U, \end{aligned}$$
(3)

where \(s^U\) is the seasonality index for unemployed persons. In the following sections, we first focus on the estimation of y rather than u. The resulting estimates of the unemployment rate are shown in Sect. 5.
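In code, Eqs. (2) and (3) amount to two divisions. A minimal sketch (variable names are ours):

```python
# Sketch of Eqs. (2)-(3): de-seasonalize the GPS-based estimate of the
# number of unemployed persons, then divide by the labor-force forecast.
def unemployment_rate_sa(y_gps: float, s_u: float, l_sa: float) -> float:
    """y_gps: GPS-based estimate of unemployed persons (raw)
    s_u:   seasonality index for unemployed persons
    l_sa:  seasonally adjusted labor-force forecast"""
    y_sa = y_gps / s_u   # Eq. (3)
    return y_sa / l_sa   # Eq. (2)
```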

3.2 The GPS Data

Throughout this paper, we heavily rely on GPS logs from smartphones. Many mobile apps collect users’ geographical location information to improve their services when the users give permission. We use a completely anonymized version of GPS data collected from January 2016 to April 2019 (40 months). The data consist of four columns: hashed id, latitude, longitude, and timestamp. We count the number of app users who possibly visit each employment service office on a daily basis. The resulting data consist of N (the number of offices) \(\times \) D (the number of days) data points. We decide that a person visits an employment service office when one or more logs are found within a specific area covering the office (Fig. 1). Since a mobile phone determines its location based on signals from GPS satellites, the accuracy deteriorates while a user is inside a building or surrounded by tall buildings, due to reflections of the signals (multi-path). Furthermore, the logs are recorded infrequently to reduce battery consumption. To circumvent the risk of failing to count a person inside the building due to the inaccuracy and infrequency inherent in GPS data, the areas need to have some buffer outside the building.

The areas are set based on the size of the offices. To obtain the size of the buildings, we applied OpenCV [29] to map images. The number of logs represents the number of visitors who have installed and given permission to specific app(s). The numbers are affected by whether the smartphone is turned on or off, whether the apps are active, and whether the GPS logs are accurate. Moreover, note that visitors are not always unemployed persons: visitors also include consultants at the office, HR staff from companies, and other related people. Nevertheless, the numbers are expected to contain some information about the number of unemployed persons. The counts are normalized by dividing by the total number of daily unique users to mitigate the effect of changes in data volume.
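A minimal sketch of this counting and normalization step, assuming buffered bounding boxes around each office (column names and the rectangular in-area test are illustrative simplifications of the areas in Fig. 1):

```python
# Sketch: daily visitor counts per employment service office,
# normalized by the total number of daily unique users.
import pandas as pd

def daily_office_counts(logs: pd.DataFrame, offices: pd.DataFrame) -> pd.DataFrame:
    """logs: columns [hashed_id, lat, lon, timestamp]
    offices: columns [office_id, lat_min, lat_max, lon_min, lon_max],
             buffered bounding boxes around each office building."""
    logs = logs.assign(date=pd.to_datetime(logs["timestamp"]).dt.date)
    total_users = logs.groupby("date")["hashed_id"].nunique()

    per_office = []
    for office in offices.itertuples():
        inside = logs[
            logs["lat"].between(office.lat_min, office.lat_max)
            & logs["lon"].between(office.lon_min, office.lon_max)
        ]
        # a person counts as a visitor if one or more logs fall in the area
        visits = inside.groupby("date")["hashed_id"].nunique()
        per_office.append(visits.rename(office.office_id))

    counts = pd.concat(per_office, axis=1).fillna(0)
    return counts.div(total_users, axis=0)  # normalize by daily unique users
```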

Fig. 1. Image of GPS data and an employment service office. Logs (red points) found in the green-colored area are counted. (Color figure online)

4 Nowcasting Model

In this section, we set up a nowcasting model and explain the estimation. Algorithm 1 summarizes the whole procedure.

Algorithm 1.

4.1 The MIDAS Model

Since unemployment rates are monthly statistics, it is not straightforward to develop a predictive model using daily data. As discussed in Sect. 2, such a situation is called “mixed data sampling”, or MIDAS in short. We employ the simplest variant of MIDAS models, the “bridge equation”. Ghysels and Marcellino (2018) [30] provide a detailed explanation of MIDAS models, and the notation used in this paper follows the book. Suppose \(y_t\) is a monthly (low frequency) outcome variable to be predicted and \(x^H_{n,t-i/d}\) are N daily (high frequency) feature variables. The two variables are not directly compatible with each other. We need to “bridge” the high frequency data \(x^H_{n,t}\) to low frequency \(x^L_t\). That is,

$$\begin{aligned} x^L_{n,t} = \sum \limits _{i=0}^{30}\phi _ix^H_{n,t - i/30} \end{aligned}$$
(4)

where the \(\phi _i\) are positive scalars satisfying \(\sum \limits _{i=0}^{30}\phi _i=1\). Hereafter we assume every month has 31 days regardless of the month, and pad zeros to the first \(d^*\) days of months with fewer than 31 days. For example, February in a non-leap year (28 days) goes like \((0,0,0, \text {1st day}, \cdots , \text {28th day})\). Then, with a suitable machine learning model f, one can forecast \(y_t\):

$$\begin{aligned} \hat{y}_t = f(x^L_{1, t}, \cdots , x^L_{N,t}, \mathbf {z}_t), \end{aligned}$$
(5)

where \(\mathbf {z}_t\) includes the month and year.
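A sketch of the bridging step in Eq. (4) with the zero-padding convention (names and array layout are our assumptions):

```python
# Sketch of Eq. (4): bridge one month of daily values into a single
# monthly value. Months shorter than 31 days are left-padded with
# zeros, e.g. February -> (0, 0, 0, day 1, ..., day 28).
import numpy as np

def bridge(daily_values: np.ndarray, phi: np.ndarray) -> float:
    """daily_values: chronological, length <= 31; phi: length 31, sums to 1."""
    padded = np.zeros(31)
    padded[31 - len(daily_values):] = daily_values  # left-pad short months
    # phi[0] weights the most recent day (i = 0 in Eq. (4)), so the
    # chronological vector is dotted with the reversed weights.
    return float(padded @ phi[::-1])
```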

4.2 Estimation of Parameters

We have two sets of parameters to be estimated. One is the vector \(\mathbf {\phi }\), which transforms daily data to monthly data. The other is the set of parameters of the model f, which predicts y from x. In the MIDAS literature, the weight vector \(\mathbf {\phi }\) is chosen from several options [21]. Here we tried the linear scheme \(\phi _i = 1/31\) and the normalized beta scheme \(\phi _i = \frac{beta(i/30, \alpha , \beta )}{\sum \limits _{j = 0}^{30}beta(j/30,\alpha , \beta )}\), where \(beta(x, \alpha , \beta ) = \frac{x^{\alpha -1}(1-x)^{\beta -1}\Gamma (\alpha + \beta )}{\Gamma (\alpha )\Gamma (\beta )}\). We go with the normalized beta scheme as it outperforms the linear scheme. \(\beta \) governs the peak of the weights and \(\alpha \) governs their slope (see Fig. 2). Since the official monthly labor survey collects data during the last week of the month, it is reasonable to set \(\beta = 1\). Finally, \(\alpha \) is chosen by grid search according to the resulting RMSE and MAE.Footnote 1
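The weighting scheme transcribes directly into code (a sketch; the endpoint guard noted in the comment is ours):

```python
# Sketch: normalized beta weights over the 31 grid points i/30,
# i = 0, ..., 30. The Gamma factors cancel after normalization but
# are kept to mirror the formula in the text.
import numpy as np
from math import gamma

def beta_weights(alpha: float, beta: float = 1.0) -> np.ndarray:
    x = np.arange(31) / 30.0
    dens = x ** (alpha - 1) * (1 - x) ** (beta - 1)  # endpoints safe for alpha, beta >= 1
    dens *= gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return dens / dens.sum()  # weights sum to one
```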

For the forecasting model f, we need to consider that the number of employment service offices (544) is much larger than the number of data points (40 months). This means a standard MIDAS regression is not applicableFootnote 2. We pick the standard random forest and L1-regularized least squares (LASSO). More flexible regression models such as SVMs and neural nets are not suitable for our short time series. Furthermore, when evaluating the model, the training data get even shorter, and our model must not learn from the future. We evaluate the model on data from May 2018 to April 2019, which leaves only 28 months to learn from when evaluating at May 2018. Since random forest outperforms LASSO in most cases, we go with random forestFootnote 3.
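A sketch of the rolling estimation with a random forest (hyperparameters and data layout are our assumptions):

```python
# Sketch: rolling-origin forecasts with a random forest. X holds the
# bridged monthly GPS features plus month/year variables; y holds the
# official number of unemployed persons. The model only ever sees
# months strictly before the forecast target.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rolling_forecast(X: pd.DataFrame, y: pd.Series, test_months) -> pd.Series:
    preds = {}
    for t in test_months:                  # e.g. May 2018 ... Apr 2019
        past = X.index < t                 # never learn from the future
        model = RandomForestRegressor(n_estimators=500, random_state=0)
        model.fit(X[past], y[past])
        preds[t] = model.predict(X.loc[[t]])[0]
    return pd.Series(preds)
```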

Fig. 2. Values of \(\phi \) by parameters.

4.3 Imputation of Missing Data

Since our goal is to nowcast the unemployment rate as quickly as possible, ideally we want to estimate the unemployment rate on any day of the month. Suppose we are on June 28th 2019, the day the official statistics for May are released (Fig. 3), and we want to forecast the unemployment rate for June 2019. This is called a “one month ahead prediction” since it predicts unemployment one month ahead. Then we need to impute the missing GPS data for three days (June 28th, 29th, and 30th). We use standard ARIMA models to impute the missing GPS data, run separately for each office, with parameters selected automatically by the auto.arima function of the R forecast package. At most five days of imputation suffices for one month ahead prediction.
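A sketch of the per-office tail imputation (pmdarima again stands in for the R routine; `counts` is the daily office-by-day matrix of Sect. 3.2):

```python
# Sketch: impute the last n_missing days of the month for every office
# with an automatically specified ARIMA, one model per office.
import pandas as pd
import pmdarima as pm

def impute_tail(counts: pd.DataFrame, n_missing: int) -> pd.DataFrame:
    """counts: rows are days, columns are offices."""
    filled = {}
    for office in counts.columns:
        model = pm.auto_arima(counts[office], seasonal=False)
        forecast = model.predict(n_periods=n_missing)
        filled[office] = list(counts[office]) + list(forecast)
    return pd.DataFrame(filled)
```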

What if we want to conduct a two month ahead prediction? If we are in the middle of the month (e.g., July 15th), then we need to impute 16 days of GPS logs. We check how the imputation affects prediction performance in the experiment.

Alternatively, we could estimate the model without imputation by using only the available data. However, the number of available days of data changes day by day, and a large number of predictive models would then have to be estimated (p. 462 in [30]). Here we resort to one predictive model with imputation for simplicity.

Fig. 3. Publication schedule and imputation.

4.4 Feature Selection

As already discussed in Sect. 2, feature selection is another important task here. Although random forest automatically selects informative feature variables, heuristic feature selection is still beneficial. Since the number of visitors to each office is expected to correlate positively with the number of unemployed persons, offices with negative correlation should be dominated by noise. We first calculate the correlation between the data from each office and the official statistics for the number of unemployed persons over the training period. Then we discard data from the offices whose correlation is smaller than 0.3. This procedure still leaves more than a hundred offices.
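The screen reduces to a per-column correlation in pandas (a sketch under our assumed data layout):

```python
# Sketch of the correlation screen: keep offices whose monthly visitor
# counts correlate with the official number of unemployed persons at
# 0.3 or above over the training period.
import pandas as pd

def screen_offices(X_train: pd.DataFrame, y_train: pd.Series,
                   threshold: float = 0.3) -> list:
    corr = X_train.corrwith(y_train)  # per-office correlation with target
    return corr[corr >= threshold].index.tolist()
```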

5 Evaluation

In this section, we evaluate our predictive model by comparing it with baseline models. We first examine the model for the number of unemployed persons and then the one for unemployment rates. Throughout the section we use the root mean square error (RMSE) and the mean absolute error (MAE) of the model over 12 months of rolling forecasts. RMSE is defined as

$$\begin{aligned} \mathrm{RMSE} = \left( \frac{1}{12}\sum \limits _{t=\mathrm{May\,2018}}^{\mathrm{Apr\,2019}} (y_t - \hat{E}[y_t|\hat{x}^L_{t}])^2 \right) ^{1/2} \quad \mathrm{for \, GPS} \end{aligned}$$
(6)
$$\begin{aligned} \mathrm{RMSE} = \left( \frac{1}{12}\sum \limits _{t=\mathrm{May\,2018}}^{\mathrm{Apr\,2019}} (y_t - \hat{E}[y_t|y_{t-lag}, \dots , y_1])^2 \right) ^{1/2} \quad \mathrm{for \, ARIMA} \end{aligned}$$
(7)

where \(y_t\) denotes the ground truth taken from the official statistics and lag indicates the number of steps of the forecast. Note that \(\hat{x}^L_t\) is estimated using only information available at \(t-lag\). MAE is the absolute-error analogue of RMSE.
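For concreteness, a minimal sketch of the two metrics over the 12 evaluation months (aligned arrays assumed):

```python
# Sketch of the evaluation metrics in Eqs. (6)-(7); y_true and y_pred
# are aligned arrays over the 12 rolling-forecast months.
import numpy as np

def rmse(y_true, y_pred) -> float:
    err = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(y_true, y_pred) -> float:
    err = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(err)))
```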

5.1 Nowcast for the Number of Unemployed Persons

Figure 4 shows the one-month-ahead (\(\hat{y}_{t|t-1}\)) forecasts by our model (GPS) and the ARIMA model, together with the ground truth. The specification of the ARIMA model is chosen automatically by auto.arima and takes seasonality into account. As we have already seen in Sect. 3.2, forecasting before the end of the month needs some imputation. The right panel shows the forecast without missing data, while the left shows the forecast with three days missing. Since there is a substantial delay in the official statistics, imputation is not necessary in most cases.

In general, the GPS model (green solid line) predicts the true values (blue dot-dashed line) well, with several exceptions (May 2018, Jan 2019, and Mar 2019). One of the striking features of the GPS model is the smoothness of its predictions. Compared to ARIMA (red dashed line), the predictions of the GPS model are far less volatile. In particular, ARIMA tends to mimic the level of the last month while the GPS model does not. This is reasonable because the GPS model is a simple static model and does NOT have an autoregressive component. The shape of the prediction by the GPS model may seem too smooth: it fails to predict some dips in the ground truth. However, economists want to see the trend of the economy rather than short-term fluctuations, which is why economists prefer moving-averaged indicators. The GPS model accurately predicts the downward trend in unemployment.

Figure 5 shows the two months ahead forecasts (\(\hat{y}_{t|t-2}\)). The right-hand panel shows the forecasts based on data with five days missing, while the left misses fifteen days of GPS logs. Compared with the one month ahead forecasts, the two months ahead forecasts show larger errors for several months (Feb 2019 and Oct 2018). However, the results are still much better than the ARIMA model.

Fig. 4. One month ahead forecast (\(y_{t|t-1}\)) by the proposed model (GPS), the ARIMA model, and the ground truth (true).

Fig. 5. Two month ahead forecast (\(y_{t|t-2}\)) by the proposed model (GPS), the ARIMA model, and the ground truth (true).

Table 1 summarizes the performance of the models. The GPS models outperform the ARIMA model for both one month ahead and two months ahead forecasts. Also, the length of the imputation period seems not to affect the performance: even the model trained on data with 15 days of imputation outperforms the ARIMA model. Moreover, the accuracy is almost the same for one and two month ahead forecasts. The only difference between the one month ahead and two month ahead GPS models is that the two month ahead models do not learn from the data of the month just before the target month (i.e., \(y_{t-1}\)). Learning from the month just before the target month might not be so important.

Table 1. RMSE and MAE of each model (million persons). The parameters of the ARIMA model are automatically chosen according to BIC.
Fig. 6. Forecasting of unemployment rates.

5.2 Forecasts for Unemployment Rates

Finally, we evaluate the predictive performance of our GPS model for unemployment rates. Unfortunately, we do not have a good predictive model for the labor force, so we resort to an ARIMA model for the prediction of the seasonally-adjusted labor force and estimate the unemployment rate. That is,

$$\begin{aligned} \hat{u}^{\mathrm{SA, GPS-ARIMA}} = \frac{\hat{y}^{\mathrm{GPS}} /s^{U}}{\hat{l}^{\mathrm{SA, ARIMA}}}, \end{aligned}$$
(8)

where \(s^U\) is the seasonality index. In Sect. 5.1, the GPS model has already beaten the ARIMA model, so this time we deploy another baseline: an ARIMA model that directly predicts seasonally adjusted unemployment rates. The results (Table 2) show that our GPS-ARIMA model is inferior to the ARIMA model for the one month prediction horizon (\(\hat{u}_{t|t-1}\)) but is competitive for the two month prediction horizon (\(\hat{u}_{t|t-2}\)). As shown in Fig. 6, the ups and downs of the ground truth are better predicted by ARIMA, while the absolute values are better predicted by GPS-ARIMA (e.g., Jul 2018, Nov 2018, Jan 2019).

The disappointing result is actually no surprise: the existing literature shows that the predictive power of alternative data is sometimes weak [16, 19]. Also, a better predictive model for the labor force could improve the results.

Table 2. Performance of the predictive models for unemployment rates. RMSEs/MAEs are inflated by 1,000; for example, an MAE of 1.0 implies a mean absolute error of 0.1 percentage points.

6 Conclusion

In this paper, we examined the usefulness of GPS log data for nowcasting unemployment rates. First, we showed that a model using GPS data without the lagged dependent variable outperforms a standard ARIMA model in predicting the number of unemployed persons. Then we found that a combination of the GPS and ARIMA models is competitive only for the longer prediction horizon when applied to unemployment rates. The predictive performance could be improved in several ways. First, as described in Sect. 2, various modern techniques for MIDAS and high dimensional data are available. Second, using GPS data as an independent variable in an autoregressive model is another good candidate. Third, more sophisticated treatment of the GPS logs is expected to improve the quality of the data: counting logs is simple, but the literature on GPS trajectories suggests many other techniques to improve accuracy. Nevertheless, we hope this paper presents a new idea for both the nowcasting of economic statistics and the utilization of GPS data.