Abstract
Machine learning (ML) is increasingly being used in official statistics, with a range of different applications. ML models focus on accurately predicting attributes of new, unlabeled cases, whereas classical statistical models focus on describing the relations between independent and dependent variables. There is already much experience with the sound use of classical statistical models in official statistics, but for ML models such experience is still being developed. Recent discussions concerning the quality aspects of using ML in official statistics have concentrated on its implications for existing quality frameworks. We are in favor of the use of ML in official statistics, but the question remains which factors need to be considered when using ML models in official statistics. As a means of raising awareness of these factors, we pose ten propositions on the (sensible) use of ML in official statistics.
References
Amaya A, Biemer PP, Kinyon D (2020) Total error in a big data world: adapting the TSE framework to big data. J Surv Stat Methodol 8:89–119. https://doi.org/10.1093/jssam/smz056
Autoriteit Persoonsgegevens (2020) Belastingdienst toeslagen. De verwerking van de nationaliteit van aanvragers van kinderopvangtoeslag (in Dutch). Tech. rep. https://autoriteitpersoonsgegevens.nl/uploads/imported/onderzoek_belastingdienst_kinderopvangtoeslag.pdf. Accessed 14 November 2023.
Baker R, Brick JM, Bates NA, Battaglia M, Couper MP, Dever JA, Gile KJ, Tourangeau R (2013) Summary report of the AAPOR task force on non-probability sampling. J Surv Stat Methodol 1(2):90–143. https://doi.org/10.1093/jssam/smt008
Beck M, Dumpert F, Feuerhake J (2018) Machine learning in official statistics. ArXiv. https://doi.org/10.48550/arXiv.1812.10422
Binder M, Moosbauer J, Thomas J, Bischl B (2020) Multi-objective hyperparameter tuning and feature selection using filter ensembles. Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pp 471–479. https://doi.org/10.1145/3377930.3389815
van den Brakel J, Bethlehem J (2008) Model-based estimation for official statistics. CBS discussion paper. Statistics Netherlands. https://www.cbs.nl/-/media/imported/documents/2008/10/200802x10pub.pdf?la=nl-nl. Accessed 14 November 2023.
Breiman L (2001) Statistical modeling: the two cultures. Stat Sci 16:199–215
Buelens B, Burger J, van den Brakel JA (2018) Comparing inference methods for non-probability samples. Int Stat Rev 86(2):322–343. https://doi.org/10.1111/insr.12253
Bughin J, Hazan E, Lund S, Dahlström P, Wiesinger A, Subramaniam A (2018) Skill shift: Automation and the future of the workforce. Tech. rep., McKinsey. https://www.mckinsey.com/featured-insights/future-of-work/skill-shift-automation-and-the-future-of-the-workforce. Accessed 14 November 2023.
Burger J, van der Laan J (2021) Predicting transitions into and out of poverty using machine learning. Proceedings of Statistics Canada Symposium 2021. https://www150.statcan.gc.ca/n1/en/pub/11-522-x/2021001/article/00003-eng.pdf?st=hvojNHXh. Accessed 14 November 2023.
Burger J, Meertens Q (2020) The algorithm versus the chimps: On the minima of classifier performance metrics. In: Cao L, Kosters W, Lijffijt J (eds) Proceedings of BNAIC/BeNeLearn. Leiden University, Leiden, pp 38–55 (available at https://bnaic.liacs.leidenuniv.nl)
CBS (2019) Cybercrime achterhalen in aangiften (in Dutch). https://www.cbs.nl/nl-nl/over-ons/innovatie/project/cybercrime-achterhalen-in-aangiften. Accessed 14 November 2023.
Chambers R (2006) Evaluation criteria for editing and imputation in Euredit vol 3. United Nations, Geneva, pp 17–27
Daas PJH, van der Doef S (2021) Detecting innovative companies via their website. SJI 36:1239–1251. https://doi.org/10.3233/SJI-200627
Das S, Mullick SS, Zelinka I (2022) On supervised class-imbalanced learning: An updated perspective and some key challenges. IEEE Transactions on Artificial Intelligence 3(6):973–993. https://doi.org/10.1109/TAI.2022.3160658
De Broe S, Struijs P, Daas P, van Delden A, Burger J, van den Brakel J, ten Bosch O, Zeelenberg K, Ypma W (2021) Updating the paradigm of official statistics: new quality criteria for integrating new data and methods in official statistics. SJI 37:343–360. https://doi.org/10.3233/SJI-200711
van Delden A, van Bemmel K (2012) Handling incompleteness after linkage to a population frame: incoherence in unit types, variables and periods. Tech. rep., Statistics Netherlands. https://www.cbs.nl/-/media/imported/documents/2012/26/2012-08-x10-pub.pdf?la=nl-nl. Accessed 14 November 2023
van Delden A, Windmeijer D (2021) Evaluating and improving a text classifier for subpopulations: the case of cyber crime. CBS discussion paper. Statistics Netherlands. https://www.cbs.nl/en-gb/background/2021/28/evaluating-and-improving-a-text-classifier-for-subpopulations-. Accessed 14 November 2023
ESS (2019) Quality assurance framework of the European Statistical System, version 2.0. Eurostat, Luxemburg. https://ec.europa.eu/eurostat/documents/64157/4392716/ESS-QAF-V1-2final.pdf/bbf5970c-1adf-46c8-afc3-58ce177a0646. Accessed 14 November 2023
Eurostat (2014) ESS Handbook for Quality Reports. Eurostat, Luxemburg. https://doi.org/10.2785/983454
Frank E, Hall M (2001) A simple approach to ordinal classification. European conference on machine learning. Springer, pp 145–156
Gardner M (1970) The fantastic combinations of John Conway’s new solitaire game “life”. Sci Am 223:120–123
Gevaert CM (2022) Explainable AI for earth observation: a review including societal and regulatory perspectives. Int J Appl Earth Obs Geoinformation 112:102869. https://doi.org/10.1016/j.jag.2022.102869
González P, Castaño A, Chawla NV, Coz JJD (2017) A review on quantification learning. ACM Comput Surv 50(5):article 74. https://doi.org/10.1145/3117807
Groves RM, Fowler FJ Jr, Couper M, Lepkowski JM, Singer E, Tourangeau R (2004) Survey Methodology. Wiley, New York
Guo X, van Stein B, Bäck T (2019) A new approach towards the combined algorithm selection and hyper-parameter optimization problem. 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pp 2042–2049. https://doi.org/10.1109/SSCI44817.2019.9003174
Han S, Yuan B, Liu W (2009) Rare class mining: progress and prospect. 2009 Chinese Conference on Pattern Recognition, pp 1–5. https://doi.org/10.1109/CCPR.2009.5344137
Hassani H, Saporta G, Silva ES (2014) Data mining and official statistics: the past, the present and the future. J Big Data 2:34–43. https://doi.org/10.1089/big.2013.0038
Hill AB (1965) The environment and disease: association or causation? Proc Royal Soc Med 58:295–300
Huang MH, Rust RT (2018) Artificial intelligence in service. J Serv Res 21(2):155–172. https://doi.org/10.1177/1094670517752459
Imbens GW, Rubin DB, Sacerdote BI (2001) Estimating the effect of unearned income on labor earnings, savings, and consumption: evidence from a survey of lottery players. Am Econ Rev 91(4):778–794. https://doi.org/10.1257/aer.91.4.778
Jean N, Burke M, Xie M, Davis WM, Lobell DB, Ermon S (2016) Combining satellite imagery and machine learning to predict poverty. Science 353(6301):790–794. https://doi.org/10.1126/science.aaf7894
Johnson JM, Khoshgoftaar TM (2019) Survey on deep learning with class imbalance. J Big Data 6(1):1–54. https://doi.org/10.1186/s40537-019-0192-5
Julien C (2021) Machine learning for official statistics. UNECE report. https://unece.org/sites/default/files/2022-09/ECECESSTAT20216.pdf. Accessed 14 November 2023
Klingwort J, Burger J (2023) A framework for population inference: combining machine learning, network analysis, and non-probability road sensor data. Comput Environ Urban Syst 103:101976. https://doi.org/10.1016/j.compenvurbsys.2023.101976
Kloos K, Meertens Q, Scholtus S, Karch J (2020) Comparing correction methods for misclassification bias. In: Cao L, Kosters W, Lijffijt J (eds) Proceedings of BNAIC/BeNeLearn. Leiden University, Leiden, pp 103–129 (available at https://bnaic.liacs.leidenuniv.nl)
Kotthoff L, Thornton C, Hoos HH, Hutter F, Leyton-Brown K (2017) Auto-WEKA 2.0: automatic model selection and hyperparameter optimization in WEKA. J Mach Learn Res 18(1):826–830.
Kühnemann H, van Delden A, Windmeijer HJM (2020) Exploring a knowledge-based approach to predicting NACE codes of enterprises based on web page texts. SJI 36:807–821. https://doi.org/10.3233/SJI-200675
Kumar P, Bhatnagar R, Gaur K, Bhatnagar A (2021) Classification of imbalanced data: review of methods and applications. IOP Conf. Series: Materials Science and Engineering, pp 1–8 (available at https://iopscience.iop.org/article/10.1088/1757-899X/1099/1/012077)
Lazer D, Kennedy R, King G, Vespignani A (2014) The parable of Google flu: traps in big data analysis. Science 343:1203–1205. https://doi.org/10.1126/science.1248506
Lemain-van der Nest M (2021) Named entity recognition: Identifying NER indicators in Dutch police reports. Master thesis, Computational Lexicology and Terminology Lab, Vrije Universiteit Amsterdam. http://www.cltl.nl/teaching/topics-for-ba-and-ma-thesis/masters-theses/. Accessed 14 November 2023
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems, pp 4768–4777
Marr D (1982) Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co. Inc., New York
McCullagh P, Nelder J (1989) Generalized Linear Models. Chapman and Hall, London
Measure A (2022) Six years of machine learning in the Bureau of Labor Statistics. In: Snijkers G (ed) Advances in Business Statistics. Wiley, New York
Meertens QA (2021) Misclassification Bias in Statistical Learning. PhD Thesis. University of Amsterdam, University of Leiden. SIKS Dissertation series 2021-10
Molnar C (2021) Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/. Accessed 14 November 2023
Naseem U, Razzak I, Khan KS, Prasad M (2021) A comprehensive survey on word representation models: from classical to state-of-the-art word representation language models. Acm Trans Asian Low-Resource Lang Inf Process 20:1–35. https://doi.org/10.1145/3434237
Parlementaire ondervragingscommissie Kinderopvangtoeslag (2020) Ongekend onrecht (in Dutch). Tech. rep., Tweede Kamer. https://www.tweedekamer.nl/sites/default/files/atoms/files/20201217_eindverslag_parlementaire_ondervragingscommissie_kinderopvangtoeslag.pdf. Accessed 14 November 2023
Peerlings DEW, van den Brakel JA, Baştürk N, Puts MJH (2022) Multivariate density estimation by neural networks. IEEE Trans Neural Netw Learning Syst. https://doi.org/10.1109/TNNLS.2022.3190220
Powers DMW (2011) Evaluation: from precision, recall and F‑measure to ROC, informedness, markedness & correlation. ArXiv. https://doi.org/10.48550/arXiv.2010.16061
Puts MJH, Daas PJH (2021) Machine learning from the perspective of official statistics. Surv Stat 84:12–17
Rao JNK (2021) On making valid inferences by integrating data from surveys and other sources. Sankhya 83-B:242–272. https://doi.org/10.1007/s13571-020-00227-w
Roscher R, Bohn B, Duarte M, Garcke J (2020) Explainable machine learning for scientific insights and discoveries. IEEE Access 8:42200–42216. https://doi.org/10.1109/ACCESS.2020.2976199
Rozkrut D, Świerkot Strużewska O, van Halderen G (2021) Mapping the United Nations fundamental principles of official statistics against new and big data sources. SJI 37:161–169. https://doi.org/10.3233/SJI-210789
Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592. https://doi.org/10.1093/biomet/63.3.581
Sande S, Zhang LC (2021) Design-unbiased statistical learning in survey sampling. Sankhya 83:714–744. https://doi.org/10.1007/s13171-020-00224-1
Schmitz B, Ponsen M (2022) Change detection of land use: a deep learning case-study. Proceedings of BNAIC/BeNeLearn 2022. https://bnaic2022.uantwerpen.be/wp-content/uploads/BNAICBeNeLearn_2022_submission_1578.pdf. Accessed 14 November 2023
Scholtus S, van Delden A (2020) On the accuracy of estimators based on a binary classifier. CBS discussion paper. Statistics Netherlands. https://www.cbs.nl/en-gb/background/2020/06/the-accuracy-of-estimators-based-on-a-binary-classifier. Accessed 14 November 2023
Sigrist F (2020) Gaussian process boosting. ArXiv. https://doi.org/10.48550/ARXIV.2004.02653
Sluiskes M (2021) Imputation of business survey data: A systematic comparison between ratio and random forest-based imputation methods. Master thesis, Leiden University Statistical Science for the Life and Behavioural Sciences
Steward M (2019) The actual difference between statistics and machine learning. https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3. Accessed 14 November 2023
Tharwat A (2020) Classification assessment methods. Appl Comput Informatics 17(1):168–192. https://doi.org/10.1016/j.aci.2018.08.003
Thornton C, Hutter F, Hoos HH, Leyton-Brown K (2013) Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’13), pp 847–855. https://doi.org/10.1145/2487575.2487629
Tollenaar N, Rokven J, Macro D, Beerthuizen M, van der Laan A (2019) Predictieve tekstmining in politieregistraties (in Dutch). Tech. rep., Cahiers 2019-02. Wetenschappelijk Onderzoek- en Documentatiecentrum. https://repository.wodc.nl/handle/20.500.12832/220. Accessed 14 November 2023
van der Velden B, Kuijf H, Gilhuijs K, Viergever M (2022) Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Med Image Anal. 79:102470. https://doi.org/10.1016/j.media.2022.102470
de Waal T (2016) Obtaining numerically consistent estimates from a mix of administrative data and surveys. SJI 32:231–243. https://doi.org/10.3233/SJI-150950
Weerts HJP, Mueller A, Vanschoren J (2020) Importance of tuning hyperparameters of machine learning algorithms. ArXiv. https://doi.org/10.48550/arXiv.2007.07588
Yung W, Tam SM, Buelens B, Chipman H, Dumpert F, Ascari G, Rocci F, Burger J, Choi I (2022) A quality framework for statistical algorithms. SJI 38:291–308. https://doi.org/10.3233/SJI-210875
Zhang LC (2012) Topics of statistical theory for register-based statistics and data integration. Statistica Neerlandica 66:41–63. https://doi.org/10.1111/j.1467-9574.2011.00508.x
Acknowledgements
We thank our colleagues Marc Ponsen and Jan van der Laan, the guest editor, and two anonymous referees for their helpful comments on an earlier version of this paper.
Ethics declarations
Conflict of interest
All authors declare that they have no conflicts of interest.
Appendix
1.1 Example: cybercrime statistics
In 2019 and 2020, Statistics Netherlands (in Dutch: CBS) was involved in a project that ultimately aims to produce official statistics on the proportion of potential felonies (criminal offenses) that are cyber-related, classified by subpopulations such as characteristics of the victims and of the potential perpetrators. These statistics would complement figures on cybercrime that are already available: yearly data on victims of cybercrime based on a CBS survey, and monthly data on pure cybercrime by region (https://data.politie.nl). Cyber-related crimes comprise two main types of cybercrime: pure cybercrime, where the computer is the target of the crime, such as DDoS attacks, and digital crime, where the computer is the tool to commit a crime, such as purchase fraud.
The source data from which we would like to derive the cyber-related crimes is a national police registration (Dutch: Basisvoorziening Handhaving, BVH) that holds both incidents and potential crimes, reported by individuals or found through police work. The dataset was limited to the potential crimes. In this registration, each incident (record) is assigned the one code that best describes the offense, including a dedicated code when the main incident concerns pure cybercrime. Other main codes can also involve cyber-related aspects, but this is not registered. Each record in the BVH contains a number of codes (on region, main crime type, and so on) and three text fields that give background information on the felony. Using those texts as input, CBS aimed to develop a text mining classifier; in 2019 a beta product was developed (CBS 2019), which was further developed by van Delden and Windmeijer (2021).
CBS obtained the 2016 BVH data from the police. Various random samples were labeled manually as cyber-related crime (yes/no) by different annotators, leading to a total sample of over 2000 records. Furthermore, CBS obtained a set of 5300 records that had been annotated by the police. That set was obtained by using keywords to select cases that potentially concerned cyber-related aspects; it therefore constitutes a selective sample.
A support vector machine (SVM) classifier was developed (CBS 2019) based on a bag-of-words approach. The text preparation step consisted of extracting the words in the text using regular expressions, lowercasing, and removing stop words from a predefined list. Furthermore, texts with 15 or fewer characters were removed, as well as texts with more than 10,000 characters. Next, the data was split: 300 units of the random sample were put in a test set (representative of the population) and the remaining units were put in a training set.
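The text preparation step can be sketched as follows. This is a stdlib-only illustration, not the CBS code; the stop-word list and example texts are our own placeholders.

```python
# Sketch of the text preparation: extract word tokens with a regular
# expression, lowercase them, drop stop words, and discard texts that
# are too short (15 or fewer characters) or too long (over 10,000).
import re

STOP_WORDS = {"de", "het", "een", "en", "van"}  # placeholder Dutch stop words

def prepare(text, max_chars=10_000):
    """Return the cleaned text, or None if it fails the length filter."""
    if not (15 < len(text) <= max_chars):
        return None
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)

print(prepare("Aangifte van online aankoopfraude via een webshop"))
# -> "aangifte online aankoopfraude via webshop"
print(prepare("te kort"))  # -> None: 15 or fewer characters
```

The cleaned strings would then be fed to a bag-of-words vectorizer and the SVM.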
Next, six-fold cross-validation on the training set was applied to tune the hyperparameters of the model. The options that were tested concerned the settings of the features (minimum document frequency, maximum document frequency, word n-gram range), a hyperparameter of the SVM model (the C-parameter), and different optimization settings (different score functions for the CV procedure and different weighting methods). The choice among the optimization settings was determined by making predictions on the test set. An overview is given in Table 6, with the selected values in the final column. After setting the hyperparameters, we also tested whether lemmatization of the words (yes/no) affected the results, but the differences were minimal. All further analyses were continued using these hyperparameters.
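The tuning step amounts to a grid search over these settings. The sketch below is a hedged illustration: the grid values are not those actually tested, and `cv_score` is a placeholder for training an SVM on five folds and scoring it on the sixth.

```python
# Grid search over illustrative hyperparameter settings. In the real
# procedure, cv_score would run six-fold cross-validation of the SVM.
from itertools import product

grid = {
    "min_df": [1, 2, 5],              # minimum document frequency
    "max_df": [0.5, 0.9, 1.0],        # maximum document frequency
    "ngram_range": [(1, 1), (1, 2)],  # word n-gram range
    "C": [0.1, 1.0, 10.0],            # SVM C-parameter
}

def cv_score(params, n_folds=6):
    """Placeholder: average score over n_folds train/validate splits."""
    return -abs(params["C"] - 1.0)  # dummy score for illustration only

best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=cv_score,
)
print(best)
```

In practice this loop is usually delegated to a library routine such as scikit-learn's `GridSearchCV`, which performs the cross-validation splits internally.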
1.2 On ML performance measures
Three typical problems with these performance measures are that they may ignore the performance of some classes, that they are sensitive to class imbalance (Tharwat 2020), and that performance at the micro-level may differ from performance for aggregates. The first problem occurs, for instance, with the F1 score, which is often used to evaluate how well the prediction of a class works in the case of a binary variable. Compare the two confusion matrices A and B in Table 7. Both matrices have 1000 test cases and no class imbalance (prevalence 0.5), and both models yield an identical F1 score (0.72). In matrix A, the number of false negatives and the number of false positives are identical, resulting in identical recall and precision. In matrix B, however, the number of false negatives is much smaller than the number of false positives, resulting in a high recall but a low precision. This imbalance is not reflected in their harmonic mean (the F1 score) but does yield a lower Matthews correlation coefficient (MCC).
The second problem, to some extent related to the first, is that many performance scores of classifiers are sensitive to the class imbalance in the data, that is, to the true proportions of the classes in the population. This is illustrated by matrices A and C in Table 7. The recall, precision, and F1 score of class 1 are the same, but class 1 is rarer in matrix C (prevalence 0.1) than in matrix A (prevalence 0.5), making it more difficult to classify. This is not reflected by the F1 score, but it is by the MCC: the number of true negatives is much greater in matrix C than in matrix A, but the F1 score does not take that cell into account.
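Both effects can be reproduced with a few lines of code. The confusion-matrix counts below are our own illustrative constructions following the same pattern (they are not the values of Table 7, which is not reproduced here).

```python
# F1 and MCC computed directly from confusion-matrix cells.
from math import sqrt

def f1(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    num = tp * tn - fp * fn
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

# Hypothetical matrices with 1000 test cases each:
A = dict(tp=360, fp=140, fn=140, tn=360)  # prevalence 0.5, FN = FP
B = dict(tp=450, fp=300, fn=50, tn=200)   # prevalence 0.5, few FN, many FP
C = dict(tp=72, fp=28, fn=28, tn=872)     # prevalence 0.1, same recall/precision as A

print(f1(**A), f1(**B), f1(**C))   # all three are 0.72
print(mcc(**A), mcc(**B), mcc(**C))  # 0.44, ~0.346, ~0.689: MCC separates them
```

F1 is identical across all three matrices, while MCC, which uses all four cells, distinguishes the balanced-error model A from the FP-heavy model B and from the rare-class setting C.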
Burger and Meertens (2020) propose to rescale the performance measures by applying a minimum–maximum normalization that accounts for two effects: a) it corrects for the effect of the class imbalance, and b) it compares the score function with guessing at random. A normalized performance score higher than 0 implies that the results are better than guessing at random, where the guessing probability accounts for the imbalance.
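A generic sketch of such a normalization is given below. The random-guessing baseline used here, the expected F1 of a classifier that predicts the positive class with probability equal to its prevalence, is our own illustration of the idea, not necessarily the metric minimum derived by Burger and Meertens (2020).

```python
# Min-max normalization of a performance score: 0 corresponds to the
# imbalance-aware random-guessing baseline, 1 to a perfect classifier.
def normalize(score, baseline, best=1.0):
    return (score - baseline) / (best - baseline)

def f1_random_guess(prev):
    """Expected F1 when guessing positive with probability prev on data
    whose positive-class prevalence is prev (illustrative baseline)."""
    tp, fp, fn = prev * prev, (1 - prev) * prev, prev * (1 - prev)
    return 2 * tp / (2 * tp + fp + fn)

prev = 0.1                        # rare positive class
baseline = f1_random_guess(prev)  # equals prev, i.e. 0.1, in this setup
print(normalize(0.72, baseline))  # > 0: better than guessing at random
```

With this rescaling, the same raw F1 of 0.72 is judged relative to how hard the class is to guess, which is the imbalance correction the text refers to.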
The third problem is that performance measures such as precision, recall, and the F1 score are defined at the micro-level, whereas in official statistics one may be interested in an accurate estimation of aggregates. Table 8 shows an example taken from Scholtus and van Delden (2020) comparing two trained ML algorithms. Model B is clearly more accurate at the micro-level than model A, but the bias of the estimated aggregate due to errors in the ML model is larger for model B than for model A. The formulas to derive this bias are also given in Scholtus and van Delden (2020). Especially in situations where the true proportion of a class is very small or very large, small changes in the confusion matrix can have a large effect on the bias. When one is interested in estimating a population proportion using ML, one should therefore also take the bias (and variance) at the population level into account, and correct the end result for bias if needed (Kloos et al. 2020). An interesting reference on correcting for misclassification bias in ML is the PhD thesis by Meertens (2021); a general reference to estimation in the context of ML is Sande and Zhang (2021).
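The micro/aggregate discrepancy is easy to demonstrate numerically: the bias of the estimated proportion depends only on the difference between false positives and false negatives, not on their sum. The counts below are our own hypothetical example of the pattern (not the values of Table 8).

```python
# Micro-level accuracy versus bias of the estimated class proportion.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def aggregate_bias(tp, fp, fn, tn):
    """Bias of the estimated positive-class proportion:
    (predicted positives - true positives) / n = (fp - fn) / n."""
    n = tp + fp + fn + tn
    return ((tp + fp) - (tp + fn)) / n

A = dict(tp=300, fp=200, fn=200, tn=300)  # many errors, but FP and FN cancel
B = dict(tp=480, fp=100, fn=20, tn=400)   # fewer errors, but they do not cancel

print(accuracy(**A), aggregate_bias(**A))  # 0.60, bias 0.0
print(accuracy(**B), aggregate_bias(**B))  # 0.88, bias 0.08
```

Model B classifies far more individual records correctly, yet overestimates the population proportion (0.58 instead of 0.50), while model A's errors cancel at the aggregate level; this is why bias correction at the population level is needed.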
Cite this article
van Delden, A., Burger, J. & Puts, M. Ten propositions on machine learning in official statistics. AStA Wirtsch Sozialstat Arch 17, 195–221 (2023). https://doi.org/10.1007/s11943-023-00330-0