Abstract
Small-area population forecasting, such as forecasting age/gender groups at the level of US census tracts, is challenged by thorny issues including (1) small population sizes, (2) frequent and sometimes directionally opposing shifts in population dynamics between censuses, (3) data availability, and (4) the ongoing evolution of US census geographies. It is therefore not surprising that evaluation studies report wide-ranging forecast errors: within any given age/gender interval, reported errors range from lows of 10% to 20% to highs that sometimes exceed 100%. Despite the successes of machine learning, population forecasters have only recently begun to explore the possibilities it presents. Using 1990 and 2000 census data, we develop 10-year age/gender-structured 2010 population forecasts for 50,965 census tracts in the U.S. with a well-known machine learning technique: boosted regression trees. Using standard ex post facto measures of forecast error (MAPE, MALPE, and MAPE-R), we demonstrate that forecasts based on “out-of-the-box” boosted regression trees are more accurate and produce fewer and less extreme outliers than comparison forecasts produced by the Hamilton-Perry method (reported in Baker et al. in Population Res Policy Rev 40:1341–1354, 2021. https://doi.org/10.1007/s11113-020-09601-y).
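For readers unfamiliar with the error measures named above, the following is a minimal sketch of MAPE (mean absolute percent error, a measure of accuracy) and MALPE (mean algebraic percent error, a measure of bias) under their usual definitions. The function names and the four example counts are hypothetical and are not taken from the study; MAPE-R, which rescales a transformed error distribution, is not shown.

```python
def percent_errors(forecast, actual):
    """Signed percent errors, one per area or age/gender group."""
    return [100.0 * (f - a) / a for f, a in zip(forecast, actual)]


def mape(forecast, actual):
    """Mean absolute percent error: the average size of the errors."""
    errs = percent_errors(forecast, actual)
    return sum(abs(e) for e in errs) / len(errs)


def malpe(forecast, actual):
    """Mean algebraic percent error: the average signed error (bias)."""
    errs = percent_errors(forecast, actual)
    return sum(errs) / len(errs)


# Hypothetical counts for four age/gender groups (illustrative only).
actual = [100, 200, 50, 400]
forecast = [110, 180, 55, 400]

accuracy = mape(forecast, actual)
bias = malpe(forecast, actual)
```

Because positive and negative errors cancel in MALPE but not in MAPE, a forecast can show little bias while still being inaccurate, which is why the two measures are typically reported together.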
Data availability
Data utilized in this publication were obtained from: (1) https://nhgis.org/ and (2) via the U.S. Census Bureau’s Application Programming Interface (API), documented at https://www.census.gov/data/developers/about.html. Secondary posting of NHGIS data is precluded per their policy. Details necessary to re-extract these data are found in Table 1.
Notes
These forecasts excluded any census tract with one or more zeros in its age/gender groups in any of the three study years (1990, 2000, and 2010). These exclusions were made because cohort-change ratios are poorly suited to zeros: a ratio is undefined when its denominator is zero. In addition, the inclusion of zero populations exacerbated the impact of outlying errors on assessments of forecast accuracy.
This study also found that uncontrolled H-P projections are surprisingly accurate at the census tract level, but that forecast errors were reduced when projections by age/gender were controlled to a total population forecast in a census tract.
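The zero-denominator problem can be illustrated with a minimal sketch of cohort-change ratios, assuming 10-year age groups so that each cohort moves up one group per decade. The function names and the counts are hypothetical, and the sketch covers only closed cohorts; the youngest and open-ended groups, which the Hamilton-Perry method handles separately, are omitted.

```python
def cohort_change_ratios(pop_base, pop_launch, step=1):
    """Ratio of each cohort's launch-year count to its base-year count.

    pop_base[i] is the count in age group i at the earlier census;
    pop_launch[i + step] is the same cohort one census later.
    Returns None where the base-year count is zero (ratio undefined).
    """
    ratios = []
    for i in range(len(pop_base) - step):
        denom = pop_base[i]
        ratios.append(pop_launch[i + step] / denom if denom > 0 else None)
    return ratios


def has_zero_group(*years):
    """True if any age/gender group is zero in any of the given years."""
    return any(count == 0 for year in years for count in year)


# Hypothetical tract with four 10-year age groups (illustrative only).
pop_1990 = [120, 100, 0, 80]
pop_2000 = [110, 126, 95, 40]

ratios = cohort_change_ratios(pop_1990, pop_2000)
excluded = has_zero_group(pop_1990, pop_2000)
```

Here the third cohort has no defined ratio because its 1990 count is zero, so a tract like this one would be dropped under the exclusion rule described above.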
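One common way to control age/gender projections to a total is proportional scaling, sketched below. This is an assumption for illustration: the note does not state which adjustment was used, and the function name and numbers are hypothetical.

```python
def control_to_total(group_forecasts, control_total):
    """Scale group forecasts proportionally so they sum to control_total."""
    uncontrolled_total = sum(group_forecasts)
    if uncontrolled_total == 0:
        raise ValueError("cannot scale forecasts that sum to zero")
    factor = control_total / uncontrolled_total
    return [g * factor for g in group_forecasts]


# Hypothetical uncontrolled forecasts for three age/gender groups,
# and an independent total forecast for the same tract.
uncontrolled = [100.0, 300.0, 600.0]
controlled = control_to_total(uncontrolled, control_total=1100.0)
```

Proportional scaling preserves each group's share of the tract total while forcing the groups to sum to the independent total forecast.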
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Baker, J., Alcantara, A., Ruan, X. M., Ruiz, D., & Crouse, N. (2014a). Sub-County component estimates using administrative records: A case-study in New Mexico. In N. Hoque & L. Potter (Eds.), Emerging Techniques in Applied Demography (pp. 63–80). Springer.
Baker, J., Alcantara, A., Ruan, X. M., & Watkins, K. (2014b). Spatial weighting improves accuracy and reduces bias in small-area demographic forecasts of urban populations. Journal of Population Research, 31(4), 345–359.
Baker, J., Alcantara, A., Ruan, X. M., Watkins, K., & Vasan, S. (2013). A comparative evaluation of accuracy and bias in census tract-level age/sex-specific population estimates: Component I (net-migration) vs Component III (Hamilton-Perry). Population Research and Policy Review, 32(6), 919–942.
Baker, J., Swanson, D., & Tayman, J. (2021). The accuracy of Hamilton-Perry population projections for census tracts in the United States. Population Research and Policy Review, 40, 1341–1354. https://doi.org/10.1007/s11113-020-09601-y
Baker, J., Swanson, D. A., Tayman, J., & Tedrow, L. M. (2017). Cohort change ratios and their applications. Springer.
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849–15854.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6), 2350–2383.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification & regression trees. Wadsworth.
Chi, G., & Wang, D. (2017). Small-area population forecasting: A geographically weighted regression approach. In D. Swanson (Ed.), Frontiers in Applied Demography (pp. 449–471). Springer.
Fragoso, T. M., Bertoli, W., & Louzada, F. (2018). Bayesian model averaging: A systematic review and conceptual classification. International Statistical Review, 86(1), 1–28.
Freund, Y., & Schapire, R. (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771–780.
Friedman, J. (1999). Greedy function approximation: A gradient boosting machine. https://biostat.jhsph.edu/~mmccall/articles/friedman_1999.pdf.
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2), 337–407.
Hamilton, C. H., & Perry, J. (1962). A short-cut method for projecting population by age from one decennial census to another. Social Forces, 41, 163–170.
Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman & Hall.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). New York.
Hauer, M. (2019). Population projections for U.S. counties by age, sex, and race controlled to shared socioeconomic pathways. Scientific Data. https://www.nature.com/articles/sdata20195.pdf.
Jivetti, B., & Hoque, N. (Eds.). (2020). Population change and public policy. Springer.
Keyfitz, N. (1982). Choice of function for mortality analysis: Effective forecasting depends on a minimum parameter representation. Theoretical Population Biology, 21, 329–352.
Kintner, H., Merrick, T., Morrison, P., & Voss, P. (Eds.). (1997). Demographics: A casebook for business and government. Westview Press.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer.
Lunn, D. J., Simpson, S. N., Diamond, I., & Middleton, L. (1998). The accuracy of age-specific population estimates for small areas in Britain. Population Studies, 52(3), 327–344.
Mueller, J. T., & Santos-Lozada, A. R. (2022). The 2020 U.S. census differential privacy method introduces disproportionate discrepancies for rural and non-white populations. Population Research and Policy Review. https://doi.org/10.1007/s11113-022-09698-3
Pol, L., & Thomas, R. (1997). Demography for business decision-making. Praeger.
Pol, L., & Thomas, R. (2012). Demography of health care. Plenum.
Raftery, A., & Ševčíková, H. (2021). Probabilistic population forecasting: Short to very long-term. International Journal of Forecasting. https://doi.org/10.1016/j.ijforecast.2021.09.001
Rayer, S., & Smith, S. K. (2014). Population projections by age for Florida and its counties: Assessing accuracy and the impact of adjustments. Population Research and Policy Review, 33(5), 747–770.
Rees, P., Norman, P., & Brown, D. (2004). A framework for progressively improving small area population estimates. Journal of the Royal Statistical Society, 167(1), 5–36.
Ruggles, S., & Van Riper, D. (2021). The role of chance in the census bureau database reconstruction experiment. Population Research and Policy Review. https://doi.org/10.1007/s11113-021-09674-3
Schapire, R., & Freund, Y. (2014). Boosting: Foundations & algorithms. MIT Press.
Siegel, J. S. (2002). Applied demography: Applications to business, government, law and public policy. Academic Press.
Smith, S., & Shahidullah, M. (1995). An evaluation of population projection errors for census tracts. Journal of the American Statistical Association, 90(429), 64–71.
Smith, S. K., & Tayman, J. (2003). An evaluation of population projections by age. Demography, 40(4), 741–757.
Smith, S., Tayman, J., & Swanson, D. (2001). State and local population projections: Methodology and analysis. Kluwer Academic Publishers.
Smith, S., Tayman, J., & Swanson, D. (2013). A practitioner’s guide to state and local population projections. Springer.
Swanson, D., & Tayman, J. (2014). Measuring uncertainty in population forecasts: A new approach. In M. Marsili & G. Capacci (Eds.), Proceedings of the 6th EUROSTAT/UNECE Work Session on Demographic Projections (pp. 203–215). National Institute of Statistics.
Swanson, D., Bryan, T., & Sewell, R. (2021). The effect of the differential privacy disclosure avoidance system proposed by the census bureau on 2020 census products: Four case studies of census blocks in Alaska. PAA Affairs, https://www.populationassociation.org/blogs/paa-web1/2021/03/30/the-effect-of-the-differential-privacy-disclosure.
Swanson, D., & Coleman, C. (2007). On the MAPE-R as a measure of cross-sectional estimation & forecast accuracy. Journal of Economic and Social Measurement, 32(4), 219–233.
Swanson, D., & Pol, L. (2004). Contemporary developments in applied demography within the United States. Journal of Applied Social Science, 21(2), 26–56.
Swanson, D., & Tayman, J. (1999). On the validity of the MAPE as a measure of population forecast accuracy. Population Research and Policy Review, 18(4), 299–322.
Swanson, D., Tayman, J., & Barr, C. F. (2000). A note on the measurement of accuracy for subnational demographic estimates. Demography, 37(2), 193–202.
Swanson, D., Tayman, J., & Bryan, T. (2011). MAPE-R: A rescaled measure of accuracy for cross-sectional, sub-national forecasts. Journal of Population Research, 28, 225–243.
Tayman, J., Smith, S., & Rayer, S. (2011). Evaluating population forecast accuracy: A regression approach using county data. Population Research and Policy Review, 30(2), 235–262.
Tayman, J., Swanson, D., & Barr, C. F. (1999). In search of the ideal measure of accuracy for subnational demographic forecasts. Population Research and Policy Review, 18(5), 387–409.
Tibshirani, R., & Friedman, J. (2020). A pliable lasso. Journal of Computational and Graphical Statistics, 29(1), 215–225.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B, 63(2), 411–423.
Wilson, T. (2016). Evaluation of alternative cohort-component models for local area population forecasts. Population Research and Policy Review, 35, 241–261.
Wilson, T., Grossman, M., Alexander, M., Rees, P., & Temple, J. (2021). Methods for small area population forecasts: State-of-the-art and research needs. Population Research and Policy Review, Online First. https://doi.org/10.1007/s11113-021-09671-6
Wood, S. N. (2017). Generalized additive models: An introduction with R (2nd ed.). Boca Raton, FL.
Acknowledgements
We thank Tom Wilson, Irina Grossman, and two anonymous reviewers for their helpful comments on earlier drafts of this paper and, more generally, on the methods deployed therein. While we are grateful for this help, any remaining errors in logic or method remain our own.
Funding
The authors did not receive funding or any other form of support from any organization or individual for the submitted work.
Ethics declarations
Competing interests
The authors did not receive funding or any other form of support from any organization or individual for the submitted work.
About this article
Cite this article
Baker, J., Swanson, D. & Tayman, J. Boosted Regression Trees for Small-Area Population Forecasting. Popul Res Policy Rev 42, 51 (2023). https://doi.org/10.1007/s11113-023-09795-x
DOI: https://doi.org/10.1007/s11113-023-09795-x