Ten propositions on machine learning in official statistics

Abstract

Machine learning (ML) is increasingly being used in official statistics for a range of different applications. The main focus of ML models is to accurately predict attributes of new, unlabeled cases, whereas the focus of classical statistical models is to describe the relations between independent and dependent variables. There is already a lot of experience with the sound use of classical statistical models in official statistics, but for ML models this experience is still being developed. Recent discussions concerning the quality aspects of using ML in official statistics have concentrated on its implications for existing quality frameworks. We are in favor of the use of ML in official statistics, but the main question remains which factors need to be considered when ML models are used. As a means of raising awareness of these factors, we pose ten propositions regarding the (sensible) use of ML in official statistics.

Acknowledgements

We thank our colleagues Marc Ponsen and Jan van der Laan, the guest editor, and two anonymous referees for their helpful comments on an earlier version of this paper.

Corresponding author

Correspondence to Arnout van Delden.

Ethics declarations

Conflict of interest

All authors declare that they have no conflicts of interest.

Appendix

1.1 Example: cybercrime statistics

In 2019 and 2020, Statistics Netherlands (in Dutch: CBS) was involved in a project that ultimately aims to produce official statistics on the proportion of potential felonies (crime offenses) that are cyber-related, classified by subpopulations such as characteristics of the victims and of the potential perpetrators. These statistics would complement the figures on cybercrime that are already available: yearly data on victims of cybercrime based on a CBS survey and monthly data on pure cybercrime by region (https://data.politie.nl). Cyber-related crimes cover two main types of cybercrime: pure cybercrime, where the computer is the target of the crime, such as DDoS attacks, and digital crime, where the computer is the tool used to commit a crime, such as purchase fraud.

The source data that we would like to use to derive the cyber-related crimes is a national police registration (Dutch: Basisvoorziening Handhaving, BVH) that holds both incidents and potential crimes, reported by individuals or found through police work. The dataset was limited to the potential crimes. In this registration, each incident (record) receives one code that best describes the offence, including a specific code when the main incident concerns pure cybercrime. Other main codes can also concern cyber-related aspects, but this is not registered. Each record in the BVH contains a number of codes (on region, main crime type and so on) and three text fields that give background information on the felony. Using those texts as input, CBS aimed to develop a text mining classifier; in 2019 a beta-product was developed (CBS 2019), which was further developed by van Delden and Windmeijer (2021).

CBS obtained the 2016 BVH data from the police. Various random samples were labeled manually as cyber-related crime (yes/no) by different annotators, leading to a total sample of over 2000 records. Furthermore, CBS also obtained a set of 5300 records that were annotated by the police. That set was obtained by using keywords to select cases that potentially concerned cyber-related aspects; it was therefore a selective sample.

Table 5 Annotated data on the cyber-related aspect in police registration data
Table 6 Setting to optimize the fit of the SVM algorithm

A support vector machine (SVM) classifier was developed (CBS 2019) based on a bag-of-words approach. The text preparation step consisted of extracting the words in the text using regular expressions, lowercasing them, and removing words that occur in a list of stop words. Furthermore, texts with 15 or fewer characters were removed, as well as texts with more than 10,000 characters. Next, the data was split: 300 units of the random sample were put in a test set (representative of the population) and the remaining units were put in a training set.
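A minimal sketch of such a preparation pipeline is given below. It assumes a pandas DataFrame with a free-text column 'text' and a manual label column 'label'; the regular expression, the stop-word list and the column names are illustrative assumptions, not the exact CBS implementation.

```python
# Illustrative sketch of the text preparation and train/test split described
# above; the regex, stop-word list and column names are assumptions, not the
# exact CBS pipeline.
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

STOP_WORDS = {"de", "het", "een", "en", "van"}   # placeholder Dutch stop words

def clean(text: str) -> str:
    """Extract words with a regular expression, lowercase them, drop stop words."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the text field and remove texts that are too short or too long."""
    lengths = df["text"].str.len()
    df = df[(lengths > 15) & (lengths <= 10_000)].copy()
    df["clean_text"] = df["text"].map(clean)
    return df

def split(df: pd.DataFrame, n_test: int = 300, seed: int = 1):
    """Simplified split: n_test randomly chosen units form the test set, the rest
    the training set (in the actual setup the test units came from the random
    sample only)."""
    df = df.sample(frac=1.0, random_state=seed)
    return df.iloc[n_test:], df.iloc[:n_test]

# Bag-of-words features, fitted on the training texts only:
# train, test = split(prepare(df))
# X_train = CountVectorizer().fit_transform(train["clean_text"])
```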

Next, a six-fold cross-validation on the training set was applied to tune the hyperparameters of the model. The options that were tested concerned the settings of the features (minimum document frequency, maximum document frequency, word n-gram range), a hyperparameter of the SVM model (the C-parameter), and different optimization settings (different score functions for the CV procedure and weighting methods). The choice among the optimization settings was determined by making predictions on the test set. An overview is given in Table 6, and the selected values are given in the final column. After setting the hyperparameters, we also tested whether lemmatization of the words (yes/no) affected the results, but the differences were minimal. All further analyses were carried out using these hyperparameters.
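Such a tuning step can be sketched as a scikit-learn grid search, as below. The candidate values, the scoring function and the use of a linear SVM are placeholders for the options listed in Table 6, not the settings actually used; for simplicity the weighting options are put in the same grid here, whereas in the text the optimization settings were chosen using the test set.

```python
# Illustrative sketch of six-fold cross-validation for tuning a bag-of-words
# SVM; candidate values and scoring are placeholders, not the Table 6 settings.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("svm", LinearSVC()),
])

param_grid = {
    "bow__min_df": [1, 2, 5],                 # minimum document frequency
    "bow__max_df": [0.5, 0.8, 1.0],           # maximum document frequency
    "bow__ngram_range": [(1, 1), (1, 2)],     # word n-gram range
    "svm__C": [0.1, 1, 10],                   # C-parameter of the SVM
    "svm__class_weight": [None, "balanced"],  # weighting method
}

search = GridSearchCV(pipeline, param_grid, cv=6, scoring="f1")
# search.fit(train["clean_text"], train["label"])
# print(search.best_params_)
```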

1.2 On ML performance measures

Three typical problems with the performance measures are that some of them ignore the performance on some classes, that they are sensitive to imbalanced data (Tharwat 2020), and that performance at the micro-level can differ from performance for aggregates. The first problem occurs, for instance, with the F1 score, which is often used to evaluate how well the prediction of a class works in the case of a binary variable. Compare the two confusion matrices A and B in Table 7. Both matrices have 1000 test cases and no class imbalance (prevalence 0.5), and both models yield an identical F1 score (0.72). In matrix A, the number of false negatives and the number of false positives are identical, resulting in identical recall and precision. In matrix B, however, the number of false negatives is much smaller than the number of false positives, resulting in a high recall but a low precision. This difference is not reflected in their harmonic mean (the F1 score) but does yield a lower Matthews correlation coefficient (MCC).

Table 7 Blind spots of F1
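This first blind spot can be made concrete with two hypothetical confusion matrices that share the properties described for matrices A and B (1000 cases, prevalence 0.5, F1 = 0.72); the cell counts below are constructed for illustration and need not equal those in Table 7.

```python
# Hypothetical cell counts consistent with the description of matrices A and B
# (1000 cases, prevalence 0.5, F1 = 0.72); they need not equal those of Table 7.
from math import sqrt

def scores(tp, fn, fp, tn):
    """Precision, recall, F1 and MCC from the four confusion-matrix cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, f1, mcc

# Matrix A: false negatives equal false positives, so precision = recall = 0.72.
print(scores(tp=360, fn=140, fp=140, tn=360))   # (0.72, 0.72, 0.72, 0.44)
# Matrix B: far fewer false negatives than false positives (recall 0.90,
# precision 0.60), yet the same F1 of 0.72; the MCC drops to about 0.35.
print(scores(tp=450, fn=50, fp=300, tn=200))    # (0.60, 0.90, 0.72, ~0.35)
```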

The second problem, to some extent related to the first, is that many performance scores of classifiers are sensitive to the class imbalance in the data, that is, to the true proportion of the classes in the population. This is illustrated by matrices A and C in Table 7. The recall, precision, and F1 of class 1 are the same, but class 1 is rarer in matrix C (prevalence 0.1) than in matrix A (prevalence 0.5), making it more difficult to classify. This difference is not reflected by the F1 score, but it is by the MCC: the number of true negatives is much greater in matrix C than in matrix A, yet the F1 score does not take that cell into account.
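That the F1 score ignores the true-negative cell, while the MCC does not, can be checked directly: keeping the class-1 cells fixed and adding true negatives (i.e., lowering the prevalence of class 1) leaves precision, recall and F1 unchanged, while the MCC changes. The counts below are again illustrative, not those of Table 7.

```python
# Adding true negatives while keeping the class-1 cells fixed lowers the
# prevalence: F1 stays the same, the MCC does not. Counts are illustrative.
from math import sqrt

def f1_and_mcc(tp, fn, fp, tn):
    p, r = tp / (tp + fp), tp / (tp + fn)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 2 * p * r / (p + r), mcc

for tn in (72, 360, 872):   # prevalence 0.50, ~0.20 and 0.10 respectively
    print(tn, f1_and_mcc(tp=72, fn=28, fp=28, tn=tn))
# F1 is 0.72 in every case, while the MCC varies with the true-negative cell.
```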

Burger and Meertens (2020) propose to rescale the performance measures by applying a minimum-maximum normalization that accounts for two effects: (a) it corrects for the effect of the imbalance and (b) it compares the score with guessing at random. A normalized performance score above 0 implies that the results are better than guessing at random, where the guessing probability accounts for the imbalance.
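A generic way of writing such a normalization is sketched below; the choice of baseline term is our reading of the description above and not necessarily the exact formulation of Burger and Meertens (2020).

```latex
% Min-max style normalization of a performance score S. S_guess denotes the
% expected score under random guessing that respects the class imbalance and
% S_max the maximum attainable score; the exact baseline used by Burger and
% Meertens (2020) may differ.
\[
  S_{\mathrm{norm}} \;=\; \frac{S - S_{\mathrm{guess}}}{S_{\max} - S_{\mathrm{guess}}}
\]
% By construction, S_norm = 1 for a perfect classifier and S_norm > 0 exactly
% when the score exceeds that of imbalance-aware random guessing.
```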

The third problem is that performance measures such as precision, recall, and the F1 score are defined at the micro-level, whereas in official statistics one is often interested in an accurate estimation of aggregates. Table 8 shows an example taken from Scholtus and van Delden (2020) comparing two trained ML algorithms. Model B is clearly more accurate at the micro-level than model A, but the bias of the estimated aggregate due to errors in the ML model is larger for model B than for model A. The formulas to derive this bias are also given in Scholtus and van Delden (2020). Especially in situations where the true proportion of a class is very small or very large, small changes in the confusion matrix can have a large effect on the bias. When one is interested in estimating a population proportion by using ML, one should also take the bias (and variance) at the population level into account, and correct the end result for bias if needed (Kloos et al. 2020). An interesting reference on correcting for misclassification bias in ML is the PhD thesis by Meertens (2021), and a general reference on estimation in the context of ML is Sande and Zhang (2021).
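A simplified sketch of this third point: on a representative test set, the ML-based estimate of the class-1 proportion counts all predicted positives, so its bias reduces to the difference between false positives and false negatives divided by the number of cases. The more general formulas (and the variance) are given in Scholtus and van Delden (2020) and Kloos et al. (2020); the counts used below are illustrative, not those of Table 8.

```python
# Bias of the ML-estimated class-1 proportion implied by a confusion matrix on
# a representative test set; simplified sketch with illustrative counts, not
# those of Table 8.

def aggregate_bias(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    p_true = (tp + fn) / n      # true proportion of class 1
    p_hat = (tp + fp) / n       # proportion of class 1 as predicted by the model
    return p_hat - p_true       # equals (fp - fn) / n

# Model A: more micro-level errors (60), but they cancel at the aggregate level.
print(aggregate_bias(tp=70, fn=30, fp=30, tn=870))   # bias = 0.00
# Model B: fewer micro-level errors (40), yet a larger bias of the aggregate.
print(aggregate_bias(tp=95, fn=5, fp=35, tn=865))    # bias = +0.03
```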

Table 8 Bias of estimated proportion of class 1 estimated with machine learning model A versus model B (from Scholtus and van Delden 2020)

Cite this article

van Delden, A., Burger, J. & Puts, M. Ten propositions on machine learning in official statistics. AStA Wirtsch Sozialstat Arch 17, 195–221 (2023). https://doi.org/10.1007/s11943-023-00330-0
