
Forecasting building permits with Google Trends

Empirical Economics

Abstract

We propose a useful way to predict building permits in the USA, exploiting rich data from web search queries. Our work is relevant because the time series of building permits is used as a leading indicator of economic activity in the construction sector. Nevertheless, new data on building permits are released with a lag of a few weeks, so an accurate nowcast of this leading indicator is desirable. In this paper, we show that models including Google search queries nowcast and forecast better than several competitive, non-naïve benchmarks. We show this with both in-sample and out-of-sample exercises. In addition, we show that these predictions are robust to different specifications, to the use of rolling or expanding windows and, in some cases, to the forecasting horizon. Since Google Trends data are free, our approach is a simple and inexpensive way to predict building permits in the USA.


Notes

  1. For example, Strauss (2013) finds that building permits outperform other standard leading indicators of overall economic activity, such as interest rates and oil prices, in most US states.

  2. In the USA, the federal agency in charge of collecting these data from granting government agencies is the US Census Bureau, which provides a monthly estimate through the Building Permits Survey. See the data section for more information.

  3. Aruoba and Diebold (2010), for example, stressed the importance of having higher-frequency, real-time data to monitor macroeconomic variables. Also, the term nowcasting, which was coined by Giannone et al. (2008), was introduced in the literature to refer to their methodology for updating forecasts of lower-frequency variables, such as quarterly GDP, as relevant higher-frequency information, such as monthly industrial production, becomes available.

  4. Naturally, Google Trends has also been used in other research areas, such as oil spending (Yu et al. 2019), youth unemployment (Naccarato et al. 2018) and macroeconomics, for example to forecast inflation and consumer confidence (Niesert et al. 2019), just to name a few. For a review of the use of Google Trends in research during the last decade, see Jun et al. (2018).

  5. We are only interested in the aggregate number of building permits in the USA, a series with no missing data.

  6. To see the exact dates of data releases, see https://www.census.gov/construction/bps/schedule.html.

  7. D’Amuri and Marcucci (2017) clarify this point. They present the following equations for calculating the Google search index (GI):

    • The search participation of a certain term on a day (d) and in a geographical location (r) is given by the number of searches for the term (\(V_{d,r}\)) divided by the total number of searches (\(T_{d,r}\)). Therefore, the daily relative searches of the term are \(S_{d,r} = \frac{V_{d,r}}{T_{d,r}}\).

    • The relative weekly searches of the term are calculated as a simple average of the daily searches: \(S_{T,r} = \frac{1}{7}\sum_{d=\text{Sunday}}^{\text{Saturday}} S_{d,r}\).

    Google also scales the index as follows: \(\text{GI}_{T,r} = \frac{100}{\max_{t}\left(S_{t,r}\right)} S_{T,r}\). D’Amuri and Marcucci (2017) interpret GI as the probability that a random user in location r searches Google for a particular term during week T.
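    The construction above can be sketched numerically. The daily volumes below are made-up illustrations, since Google only publishes the scaled index, never the raw counts \(V_{d,r}\) and \(T_{d,r}\):

```python
# Sketch of the Google Index (GI) construction described above, over two
# weeks for one location r. All raw counts are invented for illustration.
searches_for_term = [120, 95, 80, 110, 130, 90, 70,     # week 1, Sun..Sat
                     140, 100, 85, 115, 135, 95, 75]    # week 2, Sun..Sat
total_searches    = [10_000, 9_500, 9_000, 9_800, 10_200, 9_300, 8_800,
                     10_500, 9_700, 9_100, 9_900, 10_300, 9_400, 8_900]

# Daily relative searches: S_{d,r} = V_{d,r} / T_{d,r}
s_daily = [v / t for v, t in zip(searches_for_term, total_searches)]

# Weekly average over the seven days of each week: S_{T,r}
s_weekly = [sum(s_daily[i:i + 7]) / 7 for i in range(0, len(s_daily), 7)]

# Scaling: GI_{T,r} = 100 * S_{T,r} / max_t S_{t,r}, so the peak week is 100
gi = [100 * s / max(s_weekly) for s in s_weekly]
```

By construction, the week with the highest relative search share maps to 100 and every other week is expressed relative to that peak.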

  8. It is also important to mention that D’Amuri and Marcucci (2017) show that the effects of sampling errors in Google Trends are quite negligible when applied to unemployment data.

  9. Examples of this literature are, to name a few, Ginsberg et al. (2009), who select 45 queries out of 50 million search terms using out-of-sample goodness of fit for illness data, and Scott and Varian (2014), who use Bayesian methods to automatically select predictors of initial claims and retail sales.

  10. Of course, it is possible to use both simultaneously: for example, using the first method to narrow down the terms and then using judgment to discard terms that are most likely spurious. Examples of this approach are Fondeur and Karamé (2013), Choi and Varian (2012) and D’Amuri and Marcucci (2017).

  11. For our preferred search queries, we find high correlations between building permits and each of these variables, with and without seasonal adjustment. The lowest correlation is 0.86 for the query “new housing development” while the highest is 0.96 for the seasonally adjusted query “real estate exam.”

  12. See a complete list of requirements to become a realtor at https://www.kapre.com/resources/real-estate/how-to-become-a-real-estate-agent.

  13. Notice that estimates of the drift terms are removed from Table 7.

  14. The penalty for the number of parameters is much higher with BIC than with AIC in estimation windows of 50 observations.
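    This comparison is easy to verify: with k parameters and n observations, AIC penalizes 2k while BIC penalizes k ln(n), and ln(50) is roughly 3.9, almost twice the AIC per-parameter penalty. A minimal check, with an illustrative k:

```python
import math

# Per-parameter penalties of the two information criteria, for an
# estimation window of n = 50 observations and an illustrative k = 3.
n, k = 50, 3
aic_penalty = 2 * k           # AIC: 2k
bic_penalty = k * math.log(n) # BIC: k * ln(n); ln(50) ~ 3.9 > 2
```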

  15. Here, \(\gamma \left( L \right)\) and \(x_{t}\) are defined as in expression (4).

  16. Table 8 in “Appendix” shows estimates and diagnostic statistics of models (10) and (12) for building permits and the four Google search queries under consideration. We have again removed estimates of the drift terms. Our SARIMA specifications seem to offer a better representation of the data than models (3) and (5). In particular, all the coefficients shown in Table 8 are statistically significant at usual levels, the Schwarz criteria in Table 8 are lower than the comparable figures in Table 7, and the Durbin-Watson statistics are closer to 2. This last point indicates that the SARIMA specifications are more successful at removing the excess first-order autocorrelation in the errors than our simple specifications in (3) and (5). Finally, while the coefficients of determination show an important degree of heterogeneity, relative to our univariate linear specifications the SARIMA models tend to produce a higher coefficient of determination for the Google search queries and a slightly lower one for building permits. This is the only aspect in which the basic linear model in (3) seems slightly better than the SARIMA specification in (9) and (11).

  17. Notice that for the estimation of our models we use only R observations for both building permits and Google Trends. The extra Google Trends observation is used only in the generation of nowcasts and forecasts.

  18. When recursive or expanding windows are used instead, the size of the estimation window grows with the number of available observations for estimation. For instance, the first nowcast is constructed estimating the models with R observations, whereas the last nowcast is constructed estimating the models with T observations.
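    The two window schemes can be sketched as index ranges over a sample; the values of T and R below are illustrative, not those used in the paper:

```python
# Rolling vs expanding (recursive) estimation windows, assuming a sample
# of T observations and an initial estimation window of R observations.
T, R = 100, 60
n_forecasts = T - R  # one nowcast per out-of-sample period

# Rolling: fixed length R, the start of the window slides forward.
rolling = [(t - R, t) for t in range(R, T)]

# Expanding: the start stays fixed, so the window grows each period.
expanding = [(0, t) for t in range(R, T)]

# Both schemes use R observations for the first nowcast; the expanding
# scheme uses all available observations for the last one.
```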

  19. Simulation evidence by Clark and McCracken (2013) and Pincheira and West (2016) shows that normal critical values tend to work well when multistep-ahead forecasts are constructed using the iterative method, at least when the data-generating process is not very persistent. This is very important because in this paper we use the iterative method for the construction of multistep-ahead forecasts.
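    The iterative method referred to here can be illustrated with an AR(1); the coefficients below are made up for the sketch and are not estimates from the paper:

```python
# Iterative (plug-in) h-step-ahead forecasts for an AR(1):
#   y_t = c + phi * y_{t-1} + e_t
# Each one-step forecast is fed back in as the conditioning value for
# the next step, instead of fitting a separate model per horizon.
c, phi = 0.5, 0.8   # illustrative coefficients
y_last = 2.0        # illustrative last observed value

def iterative_forecast(y, horizon):
    """Return the forecast path y_{t+1}, ..., y_{t+horizon}."""
    forecasts = []
    for _ in range(horizon):
        y = c + phi * y
        forecasts.append(y)
    return forecasts

path = iterative_forecast(y_last, 3)
# Closed-form check: E[y_{t+h}] = c*(1 - phi**h)/(1 - phi) + phi**h * y_t
```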

  20. We put the word “lags” in quotation marks because expression (1) also includes contemporaneous terms of the search queries.

  21. Let us recall that in nested environments the CW test removes a term that should be zero in population under the null hypothesis, but that is not zero in finite samples. Tables 4, 13, 14, and 15 corroborate this prior as the corresponding t-statistics of the GW/DMW test are always lower than the comparable t-statistics of the CW test.

  22. The pairwise Pearson correlations of the 15 series for “real estate exam” fluctuate between 0.97 and 0.99. For “new construction,” all correlations are at least 0.99. For “new housing development,” the correlations are between 0.90 and 0.96. Finally, the correlations for “new home construction” fluctuate between 0.97 and 0.99.


Acknowledgements

We would like to thank two anonymous referees and participants of workshops at the Central Bank of Chile, Central Bank of Argentina, Universidad de Lima, Peru; Universidad de Santiago and Universidad de Talca, Chile. Erik Hurst, Yan Carrière-Swallow and Felipe Labbé have provided wonderful comments. We are also grateful to Rodrigo Cruz and Montserrat Martí for outstanding research assistance.

Corresponding author

Correspondence to David Coble.

Ethics declarations

Conflict of interest

Both authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.


Appendix

1.1 Additional figures and tables

See Fig. 3 and Tables 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16.

1.2 Between-days analysis

Over 15 days, we downloaded the series for the four search terms (real estate exam, new construction, new housing development and new home construction) using two different IPs and Google accounts. Four graphs are presented below, one for each Google index, showing the series downloaded for each term according to the IP and the day of download. The lines in the legend take the form query_ipi_j, where i and j denote the IP and the day, respectively.Footnote 22 For example, rex_ip1_2 represents the query for real estate exam, IP 1, day 2 (Figs. 4, 5, 6 and 7).

1.3 Intra-day analysis

Li (2018) raises some concerns about possible sampling errors if the series were downloaded at different moments during a day, from different computers (IPs) and Google accounts. To check the stability of the variables we use in this study, we carry out the present intra-day robustness analysis.

We conclude that all the series are highly robust. We find that for the same IP and Google account, each Google index downloaded on the same day is identical, consistent with what Li (2018) reports. However, the same series downloaded from different IPs, although very similar, are not exactly the same. We compute correlations of these series and find that they are almost equal to one. A summary of the correlation analysis can be found in Table 17.

We downloaded Google indices for the four queries (real estate exam, new construction, new housing development and new home construction) eight times a day for three days, using two different IPs and Google accounts. We then averaged the eight versions of each index (which are identical) for each IP address. Finally, we calculated the correlation of each term between the two IPs for each day.
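This averaging-then-correlating procedure can be sketched as follows; the index values below are invented stand-ins for the downloaded series, not actual Google Trends data:

```python
# Robustness check sketch: average the eight downloads per IP for one
# day, then correlate the two IP-level averages of the same query.
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Eight downloads per IP on one day; within an IP they are identical,
# while across IPs the series differ slightly (invented values).
ip1_downloads = [[55, 60, 58, 62, 65, 63]] * 8
ip2_downloads = [[54, 61, 57, 63, 64, 64]] * 8

avg_ip1 = [mean(col) for col in zip(*ip1_downloads)]
avg_ip2 = [mean(col) for col in zip(*ip2_downloads)]

r = pearson(avg_ip1, avg_ip2)  # high, but not exactly one
```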


Cite this article

Coble, D., Pincheira, P. Forecasting building permits with Google Trends. Empir Econ 61, 3315–3345 (2021). https://doi.org/10.1007/s00181-020-02011-1

