Abstract
The unemployment rate is one of the most important macroeconomic indicators. Central governments and market participants rely heavily on it to assess the state of the economy. However, official unemployment statistics are released infrequently and with substantial delay. Predictions of official labor market statistics would therefore help these authorities as well as private companies and even workers. In this paper, we combine massive location data from smartphones with mixed data sampling (MIDAS) techniques to predict the current unemployment rate in Japan. We find that GPS data are very useful for predicting the status of the labor market.
Keywords
- GPS data
- MIDAS
- Mixed data sampling
- Location data
- Unemployment rate
- Time series analysis
- Macroeconomic policy
- Nowcasting
- Forecasting
1 Introduction
The unemployment rate is widely considered one of the most important macroeconomic indices. Most central governments put a very high priority on employment, since unemployment causes many problems including poverty, crime and social instability. The importance of unemployment statistics is not limited to the public sector. Market participants assess the macroeconomic environment using unemployment rates. Ordinary employees also benefit from knowing unemployment rates, since they need to decide whether or not to quit their current job based on labor market conditions.
Despite its importance, the index has a serious problem. Official statistics authorities such as the U.S. Bureau of Labor Statistics (BLS) and Eurostat publish unemployment rates on a monthly basis, so we cannot detect acute changes in the labor market from the statistics. Furthermore, the statistics are usually published around a month after the end of the reference month. The delay is caused by the time spent on distributing and collecting survey questionnaires and on data processing. For prompt macroeconomic policy intervention and efficient market functioning, important indices such as unemployment rates should be reported as quickly as possible. A catastrophic economic event such as the financial crisis of 2007–2009 must be noticed by the government and by important decision makers in the private sector before it gets seriously worse.
The need for accurate knowledge of the current economic status has led to a large literature on nowcasting [1, 2], and nowcasting has also been put into practice. The Federal Reserve Bank of Atlanta has been continuously updating real-time estimates of GDP using monthly economic statistics on its “GDPNow” website [3, 4].
While economic statistics rarely use high frequency data (e.g. hourly, daily, or weekly), a massive volume of such data has become available. GPS log data is one instance. Many location-based apps such as maps, entertainment, games, and fitness apps collect users’ geo-location information if the users give permission. These data are primarily used to improve user experience as well as for advertising, recommendation and business intelligence. However, there is a fast-growing literature on statistical analysis of collected GPS logs in a variety of areas, including prediction of demographics and preferences, home detection, mode detection, and population analysis, to name a few [5,6,7,8,9].
Recently, “alternative data”, or non-traditional data, has been embraced outside academia. In the financial industry, more and more market participants have started using alternative data, including geo-location data, to make investment decisions [10]. Investors such as hedge funds predict sales using location data [11]. Rigorous consideration is needed in the field.
In this paper, we introduce GPS data to the nowcasting literature and develop a model that predicts current unemployment rates from GPS logs. Our evaluation shows that GPS data have substantial predictive power for the number of unemployed persons. In the following sections, we first briefly review the literature in Sect. 2 and then explain our data in Sect. 3. Section 4 gives the details of our model and Sect. 5 evaluates it. Finally, Sect. 6 concludes.
2 Related Works
To the best of our knowledge, this is the first attempt to forecast unemployment rates with GPS data. Nowcasting of labor market statistics with alternative data has been actively studied since Choi and Varian [12] suggested the potential predictive power of search query data. The earliest attempts to forecast unemployment rates with search queries revealed the predictive power of query data for the labor market [13,14,15], and many studies followed (e.g. [16,17,18]). While most of these papers use ARIMA-type models, Onorante and Koop [19] apply Dynamic Model Selection/Averaging and Scott and Varian [20] develop the Bayesian structural time series model.
The present work considers the mixed data sampling (MIDAS) scenario pioneered by Ghysels et al. [21], in which high frequency data are used to forecast low frequency data. The idea of MIDAS is to represent the high frequency data in a parsimonious way. A natural extension is a situation where high dimensional (large p) and high frequency predictive variables are present in a small sample (small N). Various models that combine feature selection techniques with MIDAS have been proposed [22,23,24]. Recently, Uematsu and Tanaka [25] showed that a simple penalized regression without MIDAS techniques performs well for GDP forecasting with high frequency data. While this research focuses on monthly official statistics as the high frequency data and quarterly data (usually GDP) as the target, the present paper extends MIDAS to alternative data of much higher frequency.
Moreover, unlike existing models, our model takes a purely static form, which isolates the predictive power of the GPS data themselves.
3 Data
In this section, we explain the data for the target (unemployment rates) and the predictor (GPS logs) in detail.
3.1 The Unemployment Rate
The unemployment rate is defined as “the number of unemployed persons as a percentage of the total number of persons in the labour force” [26]. In mathematical form,
\[ u = \frac{y}{l} \times 100 \]
where y and l denote the number of unemployed persons and the size of the labor force, respectively.
The numbers of unemployed persons and of persons in the labor force are usually surveyed by the government on a monthly basis. In Japan, the monthly Labor Force Survey plays this role. The survey collects information about the labor status of approximately 40,000 households during the last week of each month. To estimate the number of unemployed persons, we take advantage of the fact that they have strong incentives to go to public employment service offices. Unemployed workers in Japan must visit a public employment service office to become eligible for unemployment insurance benefits. Furthermore, they have to visit the office at least once a month to maintain their eligibility [27]. We can therefore reasonably presume that more visitors imply more unemployed persons.
Once we have the number of unemployed persons, we need the size of the labor force to divide it by. Unfortunately, finding clues about the size of the labor force in the GPS data is not easy. However, the labor force is far less volatile, and thus its prediction error is relatively small. A simple ARIMA model produces accurate predictions, with an RMSE of 0.22 million and an MAE of 0.18 million against a mean labor force of 66 million.
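In short, with \(\hat{y}_t\) the predicted number of unemployed persons and \(\hat{l}^{SA}_t\) the predicted seasonally adjusted labor force, we estimate the seasonally adjusted unemployment rate \(u^{SA}_t\) as
\[ u^{SA}_t = \frac{\hat{y}_t / s^U_t}{\hat{l}^{SA}_t} \times 100 \]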
where \(s^U\) is the seasonality index for unemployed persons. In the following sections, we first focus on the estimation of y rather than u. The resulting estimates of the unemployment rate are shown in Sect. 5.
3.2 The GPS Data
Throughout this paper, we rely heavily on GPS logs from smartphones. Many mobile apps collect users’ geographical location information to improve their services when the users give permission. We use a completely anonymized version of GPS data collected from Jan 2016 to April 2019 (40 months). The data consist of four columns: hashed id, latitude, longitude and timestamp. We count the number of app users who possibly visit each employment service office on a daily basis. The resulting data consist of N (the number of offices) \(\times \) D (the number of days) data points. We decide that a person visits an employment service office when one or more logs are found within a specific area covering the office (Fig. 1). Since a mobile phone determines its location from the signals of GPS satellites, the accuracy deteriorates while a user is inside a building or surrounded by tall buildings, due to reflections of the signals (multi-path). Furthermore, the logs are recorded infrequently to reduce battery consumption. To reduce the risk of failing to count a person inside the building because of the inaccuracy and infrequency of GPS data, the areas need to include some buffer outside the building.
The areas are set based on the size of the offices. To obtain the size of the buildings, we applied OpenCV [29] to maps. The number of logs represents the number of visitors who have installed the specific app(s) and given permission. The numbers are affected by whether the smartphone is turned on, whether the apps are running, and whether the GPS logs are accurate. Moreover, note that visitors are not always unemployed persons. Visitors include consultants of the office, HR staff from companies and other related people. Nevertheless, the numbers are expected to carry some information about the number of unemployed persons. The counts are normalized by dividing by the total number of daily unique users to mitigate the effect of changes in data volume.
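The counting and normalization steps described above can be sketched as follows. This is only an illustration under stated assumptions: the rectangular office area, the buffer size `BUFFER_DEG`, the office name and all helper names are hypothetical and are not taken from the paper, which derives its areas from building sizes.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))  # Japan Standard Time, used to assign logs to days

# Hypothetical office footprints: (min_lat, min_lon, max_lat, max_lon) in degrees.
OFFICES = {
    "office_a": (35.6800, 139.7600, 35.6810, 139.7612),
}
BUFFER_DEG = 0.0002  # assumed buffer around each footprint (about 20 m in latitude)


def in_area(lat, lon, box, buffer_deg=BUFFER_DEG):
    """True if the point falls inside the footprint enlarged by the buffer."""
    min_lat, min_lon, max_lat, max_lon = box
    return (min_lat - buffer_deg <= lat <= max_lat + buffer_deg
            and min_lon - buffer_deg <= lon <= max_lon + buffer_deg)


def normalized_daily_visits(logs, offices=OFFICES):
    """logs: iterable of (hashed_id, latitude, longitude, unix_timestamp).

    Returns {(office, date): unique visitors / unique users active that day}.
    """
    visitors = defaultdict(set)  # (office, date) -> ids with at least one log in the area
    active = defaultdict(set)    # date -> all ids observed that day
    for uid, lat, lon, ts in logs:
        day = datetime.fromtimestamp(ts, tz=JST).date()
        active[day].add(uid)
        for name, box in offices.items():
            if in_area(lat, lon, box):
                visitors[(name, day)].add(uid)
    return {key: len(ids) / len(active[key[1]]) for key, ids in visitors.items()}
```

Using sets of hashed ids means a user with several logs inside the same area on one day is counted once, which matches the "one or more logs" rule.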
4 Nowcasting Model
In this section, we set up a nowcasting model and explain the estimation. Algorithm 1 summarizes the whole procedure.
4.1 The MIDAS Model
Since unemployment rates are monthly statistics, it is not straightforward to develop a predictive model using daily data. As discussed in Sect. 2, such a situation is called “mixed data sampling”, or MIDAS for short. We employ the simplest variant of MIDAS models, the “bridge equation”. Ghysels and Marcellino [30] provide a detailed explanation of MIDAS models, and the notation used in this paper follows their book. Suppose \(y_t\) is a monthly (low frequency) outcome variable to be predicted and \(x^H_{n,t-i/d}\), \(n = 1, \ldots , N\), are N daily (high frequency) feature variables. The two frequencies are not directly compatible, so we need to “bridge” the high frequency data \(x^H_{n,t}\) to low frequency data \(x^L_{n,t}\). That is,
\[ x^L_{n,t} = \sum \limits _{i=0}^{30} \phi _i \, x^H_{n,t-i/d} \]
where the \(\phi _i\) are positive scalars satisfying \(\sum \limits _{i=0}^{30}\phi _i=1\). Hereafter we treat every month as having 31 days. For months with fewer than 31 days, we pad zeros to the first \(d^*\) days, where \(d^*\) is the number of days by which the month falls short of 31. For example, February in a non-leap year (28 days) becomes \((0,0,0, \text {1st day}, \cdots , \text {28th day})\). Then, with a suitable machine learning model f, one can forecast \(y_t\) as
\[ \hat{y}_t = f(\mathbf {x}^L_t, \mathbf {z_t}) \]
where \(\mathbf {x}^L_t = (x^L_{1,t}, \ldots , x^L_{N,t})\) and \(\mathbf {z_t}\) includes month and year.
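A minimal sketch of the zero padding and the bridge aggregation for a single office and month is given below. It assumes that lag \(i = 0\) refers to the last day of the month, so the chronological vector is read from its end. The function names and the sample values are hypothetical.

```python
def pad_month(daily, length=31):
    """Left-pad a month's daily values with zeros so that it has `length` entries."""
    if len(daily) > length:
        raise ValueError("more daily values than the padded month length")
    return [0.0] * (length - len(daily)) + list(daily)


def bridge(daily, weights):
    """Weighted sum over lags: sum_i phi_i * x_{t - i/d}, with i = 0 the last day."""
    padded = pad_month(daily, len(weights))
    # `padded` is in chronological order, so lag i sits at position -(i + 1).
    return sum(w * padded[-(i + 1)] for i, w in enumerate(weights))


equal_weights = [1.0 / 31] * 31   # the linear scheme phi_i = 1/31
february = [0.02] * 28            # hypothetical normalized visit shares for 28 days
print(bridge(february, equal_weights))
```

With equal weights the three padded zeros pull the February aggregate down slightly relative to a plain 28-day mean, which is a side effect of the fixed 31-day convention.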
4.2 Estimation of Parameters
We have two sets of parameters to estimate. One is the vector \(\mathbf {\phi }\), which transforms daily data into monthly data. The other is the set of parameters in the model f, which predicts y from x. In the MIDAS literature, the weight vector \(\mathbf {\phi }\) is chosen from several options [21]. Here we tried the linear scheme (equal weights) \(\phi _i = 1/31\) and the normalized beta scheme \(\phi _i = \frac{beta(i/30, \alpha , \beta )}{\sum \limits _{i = 0}^{30}beta(i/30,\alpha , \beta )}\), where \(beta(x, \alpha , \beta ) = \frac{x^{\alpha -1}(1-x)^{\beta -1}\Gamma (\alpha + \beta )}{\Gamma (\alpha )\Gamma (\beta )}\). We go with the normalized beta scheme as it outperforms the linear scheme. \(\beta \) governs the peak of the weights and \(\alpha \) governs the slope of the weights (see Fig. 2). Since the official monthly labor survey collects data during the last week of the month, it is reasonable to set \(\beta = 1\). Finally, \(\alpha \) is chosen by grid search according to the resulting RMSE and MAE (see Note 1).
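The normalized beta weights can be computed directly from the formula above. The sketch below is written in plain Python rather than with the R package used in the paper, and it restricts itself to \(\alpha \ge 1\) and \(\beta \ge 1\) so that the density stays finite at the grid endpoints; that restriction is an assumption of this example, not of the paper.

```python
import math


def beta_pdf(x, alpha, beta):
    """Beta density x^(a-1) (1-x)^(b-1) Gamma(a+b) / (Gamma(a) Gamma(b))."""
    const = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return const * x ** (alpha - 1) * (1 - x) ** (beta - 1)


def normalized_beta_weights(alpha, beta, n_lags=31):
    """Weights phi_0 .. phi_{n_lags-1} on the grid i / (n_lags - 1), scaled to sum to one."""
    if alpha < 1 or beta < 1:
        # Assumption of this sketch: avoid the infinite density at x = 0 or x = 1.
        raise ValueError("this sketch only handles alpha >= 1 and beta >= 1")
    raw = [beta_pdf(i / (n_lags - 1), alpha, beta) for i in range(n_lags)]
    total = sum(raw)
    return [r / total for r in raw]


weights = normalized_beta_weights(alpha=2.0, beta=1.0)
print(len(weights), round(sum(weights), 12))
```

Setting \(\alpha = \beta = 1\) reproduces the equal weights \(1/31\), so a grid search over \(\alpha \) with \(\beta = 1\) nests the linear scheme as a special case.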
For the forecasting model f, we need to consider that the number of employment service offices (544) is much larger than the number of data points (40 months). This means that standard MIDAS regression is not applicable (see Note 2). We consider a standard random forest and L1-regularized least squares (LASSO). More flexible regression models such as SVMs and neural nets are not suitable for our short time series. Furthermore, when evaluating the model, the training data get even shorter, because the model must not learn from future data. We evaluate the model on data from May 2018 to Apr 2019, which leaves only 28 months for training when the model is evaluated at May 2018. Since the random forest outperforms LASSO in most cases, we go with the random forest (see Note 3).
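The constraint that the model must not learn from the future can be expressed as a set of train/test splits. The sketch below assumes an expanding training window (all months before the target) and months indexed from 0 for Jan 2016; the function name and the `horizon` argument are illustrative and not part of the paper.

```python
def expanding_window_splits(n_months, n_eval, horizon=1):
    """Yield (train_indices, test_index) pairs for the last n_eval months.

    With horizon h, the most recent month usable for training is test - h,
    so a two-month-ahead forecast never sees the month just before the target.
    """
    for test in range(n_months - n_eval, n_months):
        yield list(range(test - horizon + 1)), test


splits = list(expanding_window_splits(n_months=40, n_eval=12))
first_train, first_test = splits[0]
print(len(first_train), first_test)  # 28 28
```

Index 28 corresponds to May 2018, and the first split trains on the 28 months from Jan 2016 to Apr 2018, which agrees with the count given in the text.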
4.3 Imputation of Missing Data
Since our goal is to nowcast the unemployment rate as quickly as possible, ideally we want to be able to estimate it on any day of the month. Suppose we are on June 28th 2019, the day the official statistics for May are released (Fig. 3), and we want to forecast the unemployment rate for June 2019. This is called “one month ahead prediction”, since the target is one month ahead of the latest official release. We then need to impute the missing GPS data for three days (28th, 29th and 30th). We use standard ARIMA models to impute missing GPS data. The models are run separately for each office, and their parameters are selected automatically by a routine from an R package. Imputation of at most five days suffices for one month ahead prediction.
What if we want to conduct two month ahead prediction? In the middle of the month (e.g. July 15th), we need to impute 16 days of GPS logs. We check how the imputation affects prediction performance in the experiments.
Alternatively, we can estimate the model without imputation by using only the available data. However, the number of available days changes from day to day, so a large number of predictive models would need to be estimated (p. 462 in [30]). Here we resort to a single predictive model with imputation for simplicity.
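To make the imputation step concrete, the following sketch extends one office's daily series to the end of the month. It deliberately swaps in a simple AR(1) model fitted by least squares as a stand-in for the automatically selected ARIMA models used in the paper; the function names are hypothetical and the series needs at least two observations.

```python
def fit_ar1(series):
    """Least-squares fit of x_t = c + rho * x_{t-1}; returns (c, rho)."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    var_prev = sum((a - mean_prev) ** 2 for a in x_prev)
    if var_prev == 0:
        return mean_next, 0.0  # constant history: forecast its level
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    rho = cov / var_prev
    return mean_next - rho * mean_prev, rho


def impute_tail(observed, n_missing):
    """Append n_missing recursive one-step AR(1) forecasts to the observed series."""
    c, rho = fit_ar1(observed)
    filled = list(observed)
    for _ in range(n_missing):
        filled.append(c + rho * filled[-1])
    return filled


# June 28th example from the text: 27 observed days, 3 days (28th to 30th) to fill.
june_observed = [0.010 + 0.001 * (d % 7) for d in range(27)]  # hypothetical values
june_full = impute_tail(june_observed, n_missing=3)
print(len(june_full))  # 30
```

The completed 30-day series can then be zero-padded to 31 entries and aggregated with the weights \(\phi _i\) exactly like a fully observed month.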
4.4 Feature Selection
As already discussed in Sect. 2, feature selection is another important task here. Although a random forest automatically selects informative feature variables, heuristic feature selection still helps. Since the number of visitors to each office is expected to correlate positively with the number of unemployed persons, offices with weak or negative correlation are likely dominated by noise. We first calculate the correlation between the data from each office and the official statistics for the number of unemployed persons over the training period. Then we discard the data from offices with a correlation smaller than 0.3. Even after this procedure, more than a hundred offices remain.
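The correlation filter can be sketched in a few lines. The example assumes Pearson correlation and monthly aggregated office series restricted to the training period; the office names and values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def select_offices(monthly_by_office, unemployed, threshold=0.3):
    """Keep offices whose training-period correlation with the target is >= threshold."""
    return [name for name, series in monthly_by_office.items()
            if pearson(series, unemployed) >= threshold]


offices = {
    "office_a": [1.0, 2.0, 3.0, 4.0],   # moves with the target
    "office_b": [4.0, 3.0, 2.0, 1.0],   # moves against the target
    "office_c": [1.0, 1.0, 1.0, 1.0],   # constant, carries no signal
}
target = [10.0, 20.0, 30.0, 40.0]        # hypothetical unemployed counts
print(select_offices(offices, target))  # ['office_a']
```

Computing the correlations on training months only, and repeating the selection for every evaluation month, keeps the filter from using information from the forecast target.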
5 Evaluation
In this section, we evaluate our predictive model by comparing it with baseline models. We first examine the model for the number of unemployed persons and then the one for unemployment rates. Throughout the section we use the root mean square error (RMSE) and mean absolute error (MAE) of the model over 12 months of rolling forecasts. RMSE is defined as
\[ \mathrm {RMSE} = \sqrt{\frac{1}{T}\sum \limits _{t=1}^{T}\left( y_t - f(\hat{x}^L_{t|t-lag}, \mathbf {z_t})\right) ^2} \]
where \(y_t\) denotes the ground truth taken from the official statistics, \(T = 12\) is the number of evaluation months, and lag indicates the number of steps of the forecast. Note that \(\hat{x}^L\) is estimated using information available at \(t-lag\). MAE is the absolute error counterpart of RMSE.
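Both error measures are simple to compute from paired lists of official values and forecasts. The sketch below uses hypothetical numbers, not results from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean square error over paired observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error over paired observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


official = [1.70, 1.68, 1.65]    # hypothetical counts in millions
forecast = [1.72, 1.66, 1.65]
print(rmse(official, forecast), mae(official, forecast))
```

Because RMSE squares the errors, a single badly missed month raises it more than it raises MAE, which is why reporting both is informative for a 12-month evaluation window.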
5.1 Nowcast for the Number of Unemployed Persons
Figure 4 shows the one-month-ahead forecasts (\(\hat{y}_{t|t-1}\)) of our model (GPS) and the ARIMA model together with the ground truth. The specification of the ARIMA model is selected by an automated routine and takes seasonality into account. As we have already seen in Sect. 4.3, forecasting before the end of the month requires some imputation. The right panel shows the forecast without missing data while the left shows the forecast with three days missing. Since there is a substantial delay in the official statistics, imputation is not necessary in most cases.
In general, the GPS model (green solid line) predicts the true values (blue dot-dashed line) well, with several exceptions (May 2018, Jan 2019, and Mar 2019). One of the striking features of the GPS model is the smoothness of its predictions. Compared to ARIMA (red dashed line), the predictions of the GPS model are far less volatile. In particular, ARIMA tends to mimic the level of the previous month while the GPS model does not. This is reasonable because the GPS model is a simple static model and does not have an autoregressive component. The predictions of the GPS model may seem too smooth, and they miss some dips in the ground truth. However, economists are typically more interested in the trend of the economy than in short-term fluctuations, which is why they often prefer moving-average indicators. The GPS model accurately predicts the downward trend in unemployment.
Figure 5 shows the two-months-ahead forecasts (\(\hat{y}_{t|t-2}\)). The right panel shows the forecasts based on data with five days missing, while the left panel is based on data with fifteen days of GPS logs missing. Compared with the one-month-ahead forecasts, the two-months-ahead forecasts show larger errors for several months (Oct 2018 and Feb 2019). However, the results are much better than those of the ARIMA model.
Table 1 summarizes the performance of the models. The GPS models outperform the ARIMA model for both one-month-ahead and two-months-ahead forecasts. The length of the imputation period does not seem to affect the performance: even the model trained on data with 15 days of imputation outperforms the ARIMA model. The accuracy is also almost the same for one- and two-months-ahead forecasts. The only difference between the one- and two-months-ahead GPS models is that the latter do not learn from the month just before the target month (i.e. \(y_{t-1}\)). Learning from the month just before the target month might therefore not be so important.
5.2 Forecasts for Unemployment Rates
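Finally, we evaluate the predictive performance of our GPS model for unemployment rates. Unfortunately, we do not have a good predictive model for the labor force. We resort to an ARIMA model to predict the seasonally adjusted labor force \(\hat{l}^{SA}_t\) and combine it with the GPS-based prediction \(\hat{y}_t\) of the number of unemployed persons to estimate the unemployment rate. That is,
\[ u^{SA}_t = \frac{\hat{y}_t / s^U_t}{\hat{l}^{SA}_t} \times 100 \]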
where \(s^U\) is the seasonality index. In Sect. 5.1, the GPS model already beat the ARIMA model. This time we use another baseline: an ARIMA model that directly predicts seasonally adjusted unemployment rates. The results (Table 2) show that our GPS-ARIMA model is inferior to the ARIMA model for the one month prediction horizon (\(\hat{u}_{t|t-1}\)) but is competitive for the two month prediction horizon (\(\hat{u}_{t|t-2}\)). As shown in Fig. 6, the ups and downs of the ground truth are better predicted by ARIMA, while the levels are better predicted by GPS-ARIMA (e.g. Jul 2018, Nov 2018, Jan 2019).
The disappointing result is actually no surprise. The existing literature shows that the predictive power of alternative data is sometimes weak [16, 19]. Also, a better predictive model for the labor force could improve the results.
6 Conclusion
In this paper, we examined the usefulness of GPS log data for nowcasting unemployment rates. First, we showed that a model using GPS data without the lagged dependent variable outperforms a standard ARIMA model in predicting the number of unemployed persons. Then we found that a combination of the GPS and ARIMA models is competitive only for the longer prediction horizon when applied to unemployment rates. The predictive performance could be improved in several ways. First, as described in Sect. 2, various modern techniques for MIDAS and high dimensional data are available. Second, using GPS data as an independent variable in an autoregressive model is another good candidate. Third, more sophisticated treatment of the GPS logs is expected to improve the quality of the data. Counting logs is simple, but the literature on GPS trajectories suggests many other techniques to improve accuracy. Nevertheless, we hope the paper presents new ideas for both the nowcasting of economic statistics and the utilization of GPS data.
Notes
- 1. The weights are generated by an R package.
- 2. The R package does not have an implementation of regularization.
- 3. We used the R package ranger [28].
References
Giannone, D., Reichlin, L., Small, D.: Nowcasting: the real-time informational content of macroeconomic data. J. Monet. Econ. 55, 665–676 (2008). https://doi.org/10.1016/j.jmoneco.2008.05.010
Bańbura, M., Giannone, D., Modugno, M., Reichlin, L.: Now-casting and the real-time data flow. In: Handbook of Economic Forecasting, pp. 195–237. Elsevier (2013). https://doi.org/10.1016/B978-0-444-53683-9.00004-9
Federal Reserve Bank of Atlanta: GDPNow. https://www.frbatlanta.org/cqer/research/gdpnow.aspx. Accessed 11 June 2019
Higgins, P.C.: GDPNow: a model for GDP “Nowcasting”. SSRN Electron. J. (2014). https://doi.org/10.2139/ssrn.2580350
Sangaralingam, K., Verma, N., Ravi, A., Datta, A., Chugh, V.: Predicting age & gender of mobile users at scale - a distributed machine learning approach. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 1817–1826. IEEE, Seattle, WA, USA (2018). https://doi.org/10.1109/BigData.2018.8621942
Ravi, A., Sangaralingam, K., Datta, A.: Predicting consumer level brand preferences using persistent mobility patterns. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 1986–1991. IEEE, Seattle, WA, USA (2018). https://doi.org/10.1109/BigData.2018.8622225
Vanhoof, M., Reis, F., Ploetz, T., Smoreda, Z.: Assessing the quality of home detection from mobile phone data for official statistics. J. Off. Stat. 34, 935–960 (2018). https://doi.org/10.2478/jos-2018-0046
Siła-Nowicka, K., Vandrol, J., Oshan, T., Long, J.A., Demšar, U., Fotheringham, A.S.: Analysis of human mobility patterns from GPS trajectories and contextual information. Int. J. Geogr. Inf. Sci. 30, 881–906 (2016). https://doi.org/10.1080/13658816.2015.1100731
Shimosaka, M., Hayakawa, Y., Tsubouchi, K.: Spatiality preservable factored Poisson regression for large-scale fine-grained GPS-based population analysis. In: AAAI 2019, The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), January 2019
Opimas: Alternative Data - The New Frontier in Asset Management. http://www.opimas.com/research/217/detail/
Advan Research: Advan Location White Paper. https://www.advan.us/research.html
Choi, H., Varian, H.: Predicting the present with Google trends. Econ. Rec. 88, 2–9 (2012). https://doi.org/10.1111/j.1475-4932.2012.00809.x
Askitas, N., Zimmermann, K.F.: Google econometrics and unemployment forecasting. Appl. Econ. Q. (formerly: Konjunkturpolitik) 55, 107–120 (2009)
D’Amuri, F., Marcucci, J.: “Google It!” forecasting the US unemployment rate with a Google job search index. SSRN Electron. J. (2010). https://doi.org/10.2139/ssrn.1594132
Suhoy, T.: Query Indices and a 2008 Downturn: Israeli Data, vol. 34 (2009)
Pavlicek, J., Kristoufek, L.: Nowcasting unemployment rates with Google searches: evidence from the visegrad group countries. PLoS ONE 10, e0127084 (2015). https://doi.org/10.1371/journal.pone.0127084
Anvik, C., Gjelstad, K.: Just Google it. Forecasting Norwegian unemployment figures with web queries (2010)
Naccarato, A., Falorsi, S., Loriga, S., Pierini, A.: Combining official and Google trends data to forecast the Italian youth unemployment rate. Technol. Forecast. Soc. Chang. 130, 114–122 (2018)
Onorante, L., Koop, G.: Macroeconomic nowcasting using Google probabilities. In: Proceedings of the 1st International Conference on Advanced Research Methods and Analytics. Universitat Politècnica València (2016). https://doi.org/10.4995/CARMA2016.2016.4213
Scott, S.L., Varian, H.R.: Predicting the present with Bayesian structural time series. Int. J. Math. Model. Numer. Optim. 5, 4–23 (2014)
Ghysels, E., Sinko, A., Valkanov, R.: MIDAS regressions: further results and new directions. Econ. Rev. 26(1), 53–90 (2007)
Marsilli, C.: Variable selection in predictive MIDAS models. Banque de France Working Paper No. 520 (2014). https://doi.org/10.2139/ssrn.2531339
Siliverstovs, B.: Short-term forecasting with mixed-frequency data: a MIDASSO approach. Appl. Econ. 49(13), 1326–1343 (2017)
Mogliani, M.: Bayesian MIDAS penalized regressions: estimation, selection, and prediction. arXiv:1903.08025 (2019)
Uematsu, Y., Tanaka, S.: High-dimensional macroeconomic forecasting and variable selection via penalized regression. Econ. J. 22, 34–56 (2019). https://doi.org/10.1111/ectj.12117
International Labour Organization: Unemployment Rate. https://www.ilo.org/ilostat-files/Documents/description_UR_EN.pdf. Accessed 11 June 2019
Employment Security Bureau, Ministry of Health, Labour, and Welfare: Procedures of Employment Insurance in Japanese. https://www.hellowork.go.jp/insurance/insurance_procedure.html. Accessed 11 June 2019
Wright, M.N., Ziegler, A.: ranger: a fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. 77 (2017). https://doi.org/10.18637/jss.v077.i01
Bradski, G.: The OpenCV library. Dr. Dobb’s J. Softw. Tools 25, 120–125 (2000)
Ghysels, E., Marcellino, M.: Applied Economic Forecasting Using Time Series Methods. Oxford University Press, Oxford (2018)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2020 The Author(s)
About this paper
Cite this paper
Moriwaki, D. (2020). Nowcasting Unemployment Rates with Smartphone GPS Data. In: Tserpes, K., Renso, C., Matwin, S. (eds) Multiple-Aspect Analysis of Semantic Trajectories. MASTER 2019. Lecture Notes in Computer Science(), vol 11889. Springer, Cham. https://doi.org/10.1007/978-3-030-38081-6_3
DOI: https://doi.org/10.1007/978-3-030-38081-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38080-9
Online ISBN: 978-3-030-38081-6
eBook Packages: Computer Science, Computer Science (R0)