1 Introduction

In recent years, National Statistical Institutes (NSIs) have been progressively moving towards the production of official statistics based on the combination of data from different sources, with the aim of reducing costs and response burden while delivering detailed and high-quality information [13]. In this view, new strategies for producing the required outputs need to be developed, exploiting as far as possible the integrated use of different data sources, including survey data and administrative sources. The production of such multi-source statistics is complex because they come in many different varieties, as data sources can be combined in many different ways. [6] provide an overview of statistical methods for combining multiple administrative and survey data sources, and [5] list eight basic situations of multi-source processes, providing practical guidelines for producers of multi-source statistics on problems that may be encountered and methods that can be applied to overcome such problems.

The study of methods for producing estimates on employment in Italy using different data sources represents one of the research projects addressed by the Italian National Statistical Institute (Istat). The relevant data are drawn from the Labour Force Survey (LFS), and from several administrative sources that Istat regularly acquires from external bodies.

Administrative data are defined as “(...) data derived from an administrative source, before any processing or validation by the NSIs” ([4], p. 20). Traditionally, as described by [16], they have been used as auxiliary sources of information in different phases of the production process, while survey data have been used as “primary” data, under the assumption that they provide correct measures of the target variables, i.e. that they are not affected by errors. A different way of thinking is based on the assumption that both survey and administrative data may, at least potentially, be affected by measurement errors, so that a more “symmetric” approach can be adopted to take into account deficiencies in the measurement processes. This approach starts by assuming that the target variables are latent (unobserved) and describes a model for the measurement processes through the distribution of the observed variables conditional on the latent ones. In this context, Latent Class Analysis is considered as a method to identify a latent categorical construct of interest using categorical observed variables, and it can be used to evaluate measurement errors [2]; Hidden Markov Models (HMMs) represent a potential extension when longitudinal data are available. Several applications of latent models in the field of research on employment can be found (see, for example, [1] and [9]). A proposal employed by Istat to estimate employment rates is based on an HMM that accounts for the inconsistencies in the measurement process of both surveys and administrative sources, according to [16] and [17]. The same method has been applied by [10] to determine the measurement error of the variable indicating whether a respondent has a permanent or a temporary job, by using both survey and register data.

Another approach to dealing with multi-source data is based on Machine Learning (ML) tools. The interest in ML from NSIs has been growing rapidly, although time is still needed before it can be used to its full potential. Just to give an example, there have been two international initiatives on ML for official statistics [15]: the UNECE High-Level Group on Modernisation of Official Statistics (HLG-MOS) Machine Learning Project (2019–2020) and the United Kingdom’s Office for National Statistics (ONS) – UNECE Machine Learning Group 2021, approved by the HLG-MOS. [18] describe the role of ML in analyses that can be conducted in a multi-source context, distinguishing three main settings: micro-, macro- and no linkage. [11] describe the five quality dimensions of statistical output that need to be used to identify the challenges for the application of ML, and enumerate the most important research topics that need to be studied to enable the successful application of ML in official statistics.

The present work describes the use of two ML techniques, decision trees and random forests, to predict the individual employment status. The final aim is to show how ML can be used to extract important information from the data for the purpose of estimating the target variable, and to learn more about the phenomenon. Even though the reference source on employment is the LFS, we also exploit administrative data (AD). To this end, we show how to construct a “new” response variable indicating situations where the data sources agree. With this approach, we do not focus only on estimating the individual employment status: we also consider the agreement between survey and administrative data, and we try to establish why these sources do not give the same information. The prediction of the LFS employment status can, however, always be derived indirectly from the predicted agreement between survey and administrative data. ML techniques have been applied using the R software.

The paper is organized as follows. Section 2 describes the context, characterized by the presence of multiple data sources. Section 3 discusses the application of ML procedures to predict the employment status. Section 4 concludes the work.

2 The context

The available data come from the Labour Force Survey and from administrative sources. The Italian LFS is a continuous survey carried out during every week of the year. Every year it involves more than 250,000 families residing in Italy (for a total of 600,000 individuals), distributed across approximately 1400 Italian municipalities. The LFS provides quarterly estimates for the main aggregates of the labour market (employment status, type of work, work experience, job search, etc.), stratified by gender, age and territory (up to a regional detail). The reference population is composed of all members of families residing in Italy, even if temporarily abroad.

The LFS represents the main source of statistical information on the Italian labour market; the information collected via the LFS is the basis for official estimates of employment. It also produces information on the main aggregates of the job offer: profession, sector of economic activity, hours worked, type and duration of contracts, training. The LFS is harmonized at the European level as established by the EU Regulation 2019/1700 of the European Parliament and the Council. Its main statistical aim is to classify the working-age population (15 years and over) into three mutually exclusive and exhaustive groups: employed, unemployed (the two together make up the so-called “labour force”) and economically inactive. This last category defines the population “outside the labour force”: for example students, retired people, and housewives. The classification criteria are based on definitions inspired by the International Labour Office (ILO) and implemented by the Community Regulations.

The employed category includes people between 15 and 89 who in the reference week: (i) have worked at least one hour for pay or profit, including unpaid family workers; (ii) are temporarily absent from work because they are on vacation, have flexible hours (vertical part time, hours recovery, etc.), are on sick leave or compulsory maternity/paternity leave, or are in professional training paid by the employer; (iii) are on parental leave and are receiving and/or are entitled to receive income or work-related benefits, regardless of the duration of the absence; (iv) are absent as seasonal workers but continue to carry out, on a regular basis, duties and tasks necessary to the continuation of the activity; (v) are temporarily absent for other reasons and the expected duration of the absence is not more than three months. These conditions do not necessarily require an employment contract and, thus, the category of employed persons recorded through the LFS also includes forms of irregular work. The employment condition in the LFS is completely independent of the opinion that the interviewees have of their own status. The main regular job is defined as the only job performed or, if there is more than one, the one with the greatest number of hours usually worked or the one that the individual considers more important (greater income, greater stability, etc.).

The LFS follows a quarterly rotation scheme in which families are interviewed for two consecutive quarters, excluded for two quarters and re-interviewed for another two quarters. Data are collected through a combination of Computer Assisted Personal and Telephone Interviews (CAPI, CATI). The sample design is based on space (selection of units) and time (selection of the survey period for each sample unit). For further details on the LFS contents, methodologies and organization in Istat see [7].

In the last decade, all European NSIs have started using Administrative Data (AD) in their statistical production processes. In Istat, the AD that may be relevant to labour statistics mainly come from social security and fiscal authorities. More specifically, data come from several different sources, such as Modello UniEMens (EMENS), DMAG, etc. for social security data and Modello 730, Modello Unico, Certificazione Unica, etc. for fiscal data.

Administrative sources on employment can be classified on the basis of different aspects: the administrative purpose of the source and, consequently, the information content on social security contributions and/or earned income; the availability of temporal information on employment; and the different forms of employment (dependent or self-employed). The quality and the informative power of each administrative source are different. Just to give an example, some sources have temporal details on the start and the end of an employment contract, while others may only detect an overall signal during the whole year. Furthermore, some statistical units are not covered by administrative information, for example irregular jobs and jobs whose salary does not exceed a given threshold. After preprocessing and harmonization of the information, data are organized in an information system with a linked employer-employee structure: the main unit of analysis is the employee job position, defined as a relationship established between an employer and a worker. From this data structure we may obtain information on the statistical unit of interest, i.e., the worker. Subsequently, for each subject, the main regular job activity and its characteristics, according to the ILO definitions, are derived: useful information from all labour AD sources is selected and harmonized to make it comparable at a monthly level, and a deterministic methodology to identify the main job for each individual and month is implemented, following the indications of subject matter experts. Data are treated differently according to the type of employment relationship, i.e. self-employed, employees, external workers.

Starting from the resident population in Italy, the available data on employment from LFS and administrative sources are linked at the individual level, and the information is harmonized using details at the month level. For LFS, the weekly information, if present, represents the monthly occupational status. For AD, we used the information on the same LFS week when a statistical unit is in the LFS sample, and a random week in the month in all other cases. The resulting dataset contains the monthly employment status measured by both LFS and administrative sources.

Table 1 graphically represents the informative context. For the entire resident population, AD provide information on the presence of each individual in at least one administrative source, in each month. LFS collects information on approximately 1.2% of the overall Italian population. For each individual there are at most two observations per year.

Table 1 Available employment data: Administrative (AD), Labour Force Survey (LFS)

Table 2 shows the number of LFS occurrences for each year after the harmonization process with AD, and the corresponding number of individuals. In the following, unless explicitly stated otherwise, we will use the term LFS respondents to refer to survey occurrences rather than to individuals. Table 3 shows the number of LFS respondents for each year and month.

Table 2 Distribution of LFS interviews and distinct individuals, per year. Years 2016–2019. Italy
Table 3 Distribution of LFS interviews, per year and month. Years 2016–2019. Italy

Information from LFS and AD does not always agree due to deficiencies in the definitions and in the data acquisition process. Just to give some examples, there could be temporal misalignment of the sources, especially for occasional jobs, structural lack of administrative information on irregular work, misalignment of employment definition in the available sources, etc.

Table 4 summarises the distribution of the employment status by LFS and AD, where employment according to AD is defined by the presence in at least one administrative source. The two measures disagree on about 6% of the interviews, and the values outside the main diagonal in the Table represent classification errors. The agreement between the measures is observed for each month, and then averaged over the year. This means that an individual who reported a working activity in a certain month in the LFS but who is absent from all administrative sources in the same month is considered non-concordant in that month, even if the same individual is present in an AD source at any other time in the year.

Table 4 Distribution of employment status by LFS and AD, per year. LFS interviews, years 2016–2019. Italy
Fig. 1 Distribution of employment status by LFS and AD, per year. LFS interviews, years 2016–2019. Italy

As shown in Fig. 1, the agreement between the two (sets of) sources remains fairly constant over time. The interviews that report employment in the LFS but correspond to individuals who are absent from all administrative sources (\(LFS=1,AD=0\)) in the same month are 69,905, corresponding to 41,271 individuals, while those that report non-employment in the LFS and correspond to individuals who are present in at least one administrative source (\(LFS=0,AD=1\)) in the same month are 54,822, corresponding to 32,635 individuals. The respondents for whom the two sources report the same information in the same month are 2,113,127: 719,149 (285,428 individuals) in the group (\(LFS=1,AD=1\)) and 1,393,978 (524,928 individuals) in the group (\(LFS=0,AD=0\)).
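As a quick consistency check on the figures above, the following minimal sketch recomputes the overall agreement and disagreement shares from the four interview counts quoted in the text. The counts are taken from the paragraph above; the variable names are purely illustrative and are not part of the original analysis, which was carried out in R.

```python
# Interview counts quoted in the text for 2016-2019, keyed by (LFS, AD).
counts = {
    (1, 1): 719_149,    # employed according to both sources
    (0, 0): 1_393_978,  # not employed according to both sources
    (1, 0): 69_905,     # employed in LFS, absent from all AD sources
    (0, 1): 54_822,     # not employed in LFS, present in at least one AD source
}

total = sum(counts.values())
agree = sum(n for (lfs, ad), n in counts.items() if lfs == ad)
disagree = total - agree

agreement_rate = agree / total
disagreement_rate = disagree / total

print(f"total interviews: {total:,}")
print(f"agreeing:         {agree:,} ({agreement_rate:.1%})")
print(f"disagreeing:      {disagree:,} ({disagreement_rate:.1%})")
```

The disagreement share obtained this way is a little below 6%, in line with the "about 6%" reported for Table 4.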

For individuals who have been interviewed twice in the year, Table 5 shows the distribution of the employment status measured by AD and LFS in the second interview, conditional on the employment status in the first one. The highest percentages are on the diagonal: for the groups (\(LFS=0,AD=0\)) and (\(LFS=1,AD=1\)) in the first interview, the percentage of units belonging to the same group in the second interview is higher than 95, while for the other two groups it is higher than 60 and lower than 70. Apart from the diagonal, other cells show a non-negligible percentage, namely (\(LFS=0,AD=1\)) or (\(LFS=1,AD=0\)) in the first interview and (\(LFS=0,AD=0\)) or (\(LFS=1,AD=1\)) in the second interview, indicating that when the two sources disagree in the first interview, they partly tend to agree in the second one. For the group (\(LFS=0,AD=1\)), the percentage of units changing the employment status to (\(LFS=1,AD=0\)) is very small. The same happens for the group (\(LFS=1,AD=0\)) in the first interview, changing to (\(LFS=0,AD=1\)) in the second interview.

Table 5 Distribution of employment status by AD and LFS in the second survey interview conditional to the first survey interview, per year. LFS interviews, years 2016–2019. Italy

Compared to the LFS respondents as a whole, the composition of the group with disagreement between LFS and AD, (\(AD=1,LFS=0\)) and (\(AD=0,LFS=1\)), is quite different with respect to all covariates (see Tables in Appendix A and Fig. 2), i.e. gender, geographical area, degree of education, Italian citizenship, age, employment and self-employment income, number of retirees. This group probably represents individuals who are more subject to changes in their labour career: it has a higher percentage of males, of individuals from the south and islands, of more highly educated individuals, of non-Italian citizens and of younger individuals, and a lower percentage of retirees. Furthermore, these individuals have a lower employment income but a higher self-employment income.

When the respondents with disagreement between LFS and AD are split, the group (\(LFS=1,AD=0\)) and the group (\(LFS=0,AD=1\)) do not show relevant differences by gender, while they show differences in the distribution by geographical area (see Fig. 2). In the group (\(LFS=0,AD=1\)), on average, there are more individuals from the south and islands, with a lower degree of education, Italian citizenship, older age, lower employment and self-employment income, and more retirees. The composition of this group may suggest a criticality in the LFS data acquisition process for some population groups.

Fig. 2 Disagreement between LFS and AD by geographical area. LFS interviews, year 2019. Italy

Table 6 Distribution of LFS interviews by AD source type, years 2016–2019. Italy

The former group, (\(LFS=1,AD=0\)), is somewhat more problematic. These individuals are in fact employed according to the LFS, but they are not present in any administrative source, at least in the month of the interview. Out of the 69,905 respondents who are employed according to the LFS and absent from all administrative sources in the same month, 42,935 (61.4%) are not present in any administrative source during the year. This could be due to the lack of administrative information on irregular work. Jobs whose salary does not exceed a certain threshold are also not totally covered by AD. For the other 26,970 units, the phenomenon may reflect delays in reporting an existing employment spell to the AD, as is frequent for part-time and occasional employees.

The two administrative sources that account for the highest percentages of individuals are: (i) the main source of the Italian social security authority (INPS) and (ii) the source covering agricultural labour, with more than 50% and 20% of occurrences, respectively. Note that individuals may be present in \(K \ge 2\) administrative sources during the year. Out of the 19 available administrative sources, we cannot draw any information from 3 (ENCPD, CPDEN, VOUCHER). A list of the AD we have used in this manuscript is reported in Appendix B.

The variable Source type (SOURCE, see the distribution reported in Table 6) classifies the different administrative sources into non-overlapping categories corresponding to situations where administrative information: 1) is not available at all, 2) refers to employees, 3) refers to self-employed at month level (social security data), 4) refers to self-employed only at year level (mainly fiscal data). A hierarchy of data sources has been defined to take into account situations where employment information is provided by multiple sources in a year and to connect this information to that from LFS. Just to give an example, if an individual is a fixed-term employee and has a VAT number, this is defined as \(\textit{SOURCE} = 4\). In fact, she will turn out to be a worker according to LFS. Apart from respondents not covered by any administrative source, the highest percentage of respondents are self-employed with administrative information at month level.

3 Predicting individual employment status

NSIs are increasingly using secondary data sources: these are usually not collected for statistical purposes but they may contain some relevant statistical information. In this context, Machine Learning techniques are becoming more and more important. These tools can be used in the processing of primary data as well, but they do not fit the traditional quality frameworks, and work is required to develop a framework that can handle both primary and secondary data sources [18]. At the same time, the prediction approach seems promising, but additional work is needed to generalize the performance from non-probability samples to statistical populations.

A supervised ML approach, using both decision tree (DT) and random forest (RF) procedures, has been applied to predict individual employment status as a function of individual features. Both DT and RF belong to the class of supervised learning techniques. DTs may be used for classification and regression: their goal is to create a path that, by means of simple decision rules inferred from the data features, can be fruitfully used to predict a qualitative (classification) or quantitative (regression) response. Each rule assigns an observation to a group based on the value of a specific feature, used as an input. One rule is applied after the other, sequentially, determining a hierarchy of groups: the hierarchy is called a tree and each group is called a node. RFs are an ensemble learning method for classification, regression and other tasks. RFs construct several DTs using different, randomly selected, training sets. A rule may then be used to produce a synthesis of the different DTs, e.g. by using the decision chosen by the majority of trees (mode in classification problems or mean in regression) as the final decision. Generally speaking, RFs are expected to produce more reliable predictions than single DTs, but they are not as easy to interpret.
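The majority-vote rule described above can be illustrated with a small, self-contained sketch. It is not the rpart or randomForest implementation used in the paper: the three "trees" below are hand-written one-rule classifiers rather than trees learned from bootstrap samples, and the feature names (`work_income`, `ad_sept`, `age`) are hypothetical stand-ins for the kind of covariates discussed in the text.

```python
from collections import Counter

# Three hypothetical one-rule "trees"; each maps a record to 1 (employed) or 0.
# In a real random forest these would be full trees grown on random resamples.
def rule_income(record):
    return 1 if record["work_income"] > 0 else 0

def rule_admin_signal(record):
    return 1 if record["ad_sept"] == 1 else 0

def rule_age(record):
    return 1 if 25 <= record["age"] < 65 else 0

TREES = [rule_income, rule_admin_signal, rule_age]

def majority_vote(trees, record):
    """Return the class predicted by most trees (the mode of the votes)."""
    votes = Counter(tree(record) for tree in trees)
    return votes.most_common(1)[0][0]

# Two made-up records, for illustration only.
person_a = {"work_income": 0, "ad_sept": 1, "age": 40}   # votes: 0, 1, 1
person_b = {"work_income": 0, "ad_sept": 0, "age": 70}   # votes: 0, 0, 0

prediction_a = majority_vote(TREES, person_a)
prediction_b = majority_vote(TREES, person_b)
```

With an odd number of trees and a binary response, as here, the vote can never tie; a real implementation would also need a tie-breaking rule.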

We use a supervised ML approach, as our data contain both auxiliary information and a labeled response, represented by the employment status measured by LFS. Since the variable of interest is qualitative/categorical, the machine learns to solve a classification problem.

Figure 3 schematically represents the phases of such an approach. In the learning phases we can distinguish (i) the training phase of the model, where the input is paired with the expected output and (ii) the validation/test phase, where we measure how well the model has been trained (depending on the size of data, the value to predict, the input, etc.) and evaluate model properties (mean error for numeric predictors, classification errors for classifiers, recall and precision, etc.). In particular, the validation phase compares models and selects the best performing one, while the test phase estimates the accuracy of the selected approach. The application phase applies the model to real-world data and gets the results. In this situation, there is usually no reference value: we may only speculate about the quality of the model output using the results obtained in the validation/test phase. In this work, we use data on respondents to LFS in 2016, 2017 and 2018 for model training and validation, and data on respondents to LFS in 2019 for model testing.
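The year-based split described above can be sketched as follows. This is only an illustration under stated assumptions: the records are synthetic, and the 25% validation share and the random assignment of 2016–2018 interviews to training versus validation are hypothetical choices, since the text does not specify how validation data were carved out of those years.

```python
import random

# Synthetic records: one dict per LFS interview, five interviews per year.
interviews = [{"id": i, "year": 2016 + i % 4} for i in range(20)]

TRAIN_YEARS = {2016, 2017, 2018}   # used for training and validation
TEST_YEAR = 2019                   # held out for the final test

def split_by_year(rows, validation_share=0.25, seed=0):
    """Return (train, validation, test); the test set is the held-out year."""
    development = [r for r in rows if r["year"] in TRAIN_YEARS]
    test = [r for r in rows if r["year"] == TEST_YEAR]
    rng = random.Random(seed)       # fixed seed for reproducibility
    rng.shuffle(development)
    n_val = int(len(development) * validation_share)
    return development[n_val:], development[:n_val], test

train, validation, test = split_by_year(interviews)
```

Holding out the most recent year entirely mimics the application setting, where the model is applied to data from a period it has never seen.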

Fig. 3 The phases of a machine learning scheme

Table 7 Distribution of agreement between LFS and AD, per year. LFS interviews, years 2016–2019. Italy

The original Istat project consists in producing data for the entire population through the integration of Italian census survey data, mainly LFS data, and administrative information. Since October 2018, the Census of Population and Housing has been conducted yearly, during the first week of October. Therefore, we will apply the proposed techniques to predict employment status in October.

As far as the response variable is concerned, we use two approaches. In the former, we apply ML models directly to the employment status as registered by LFS. In the latter, we apply the models to the variable indicating the cases where the data sources agree. In this way, we want to take into consideration that LFS too may not be error free, and that there are situations where AD and LFS measure a “different phenomenon” and both measures have the same relevance.

Table 7 shows the distribution of employment status by LFS and AD and the agreement between the sources by year. It is important to underline that both approaches provide information on the prediction of employment status from LFS for the whole population, which remains our variable of interest. The first approach is direct, in that we predict the LFS employment status, while the second approach is indirect: once the agreement between the two sources has been predicted, we may provide an estimate for the LFS employment status by employing the administrative information on the entire population. For example, when the predicted agreement between LFS and AD is equal to 1, the predicted LFS employment status has the same value as registered by AD.

The following individual covariates have been used as input variables in the ML procedure: gender, age class, residence (region), level of education, income class from work, income class from capital, income class from retirement, one binary indicator associated with Italian citizenship, type of pension (disability, age, etc.). We assume that these individual covariates help explain the individual employment dynamics. At this stage of the project, it is not possible to use additional data from LFS and AD on some job characteristics, such as employment sector or type of contract (part- and full-time): this work can be considered as a preliminary analysis that may be useful for better defining future developments. The four-category covariate SOURCE has been used to take into account the different administrative sources. Furthermore, even though we are not carrying out a longitudinal analysis, we used the administrative information from January (AD.1) to September (AD.9) as input for ML because we want to study the relationship between past information on employment and the variable of interest measured by LFS in October. For the application of decision trees and random forests, the R software [12] packages rpart [3] and randomForest [14] have been used. We have used the package ROSE [8] to deal with the imbalance in the classes of the response variable by generating synthetic balanced samples: an unbalanced dataset may in fact lead to bias in the prediction, favoring the most frequent class. Both decision trees and random forests are prediction models frequently used to classify large amounts of data; as a benchmark, we also estimated a standard logistic model with the same covariate set.
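To give an idea of what balancing the response classes involves, the sketch below uses plain random oversampling with replacement. This is deliberately simpler than, and not equivalent to, the ROSE package used in the paper, which generates new synthetic observations rather than duplicating existing ones; the function name, the `agree` label and the toy records are all hypothetical.

```python
import random

def oversample_minority(rows, label_key, seed=0):
    """Duplicate randomly chosen rows until every class is as large as the largest.

    Plain random oversampling: a stand-in for illustration, not the ROSE method.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy unbalanced sample: 8 interviews where the sources agree, 2 where they do not.
sample = [{"id": i, "agree": 1} for i in range(8)]
sample += [{"id": 8 + i, "agree": 0} for i in range(2)]

balanced = oversample_minority(sample, "agree")
n_agree = sum(1 for r in balanced if r["agree"] == 1)
n_disagree = sum(1 for r in balanced if r["agree"] == 0)
```

Balancing should be applied to the training data only, so that validation and test sets keep the class proportions observed in the population.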

To evaluate the effectiveness of the models, we examine the confusion matrix, together with classification errors and other measures, i.e. accuracy, precision, recall and F1 score. In predictive analysis with a binary response, a confusion matrix is a \(2 \times 2\) contingency table that summarizes the distribution of the observed response by the estimated one. Accuracy is defined as the proportion of correct classifications and tells how many times the ML model was correct overall; precision is the fraction of cases identified as positive that are truly positive and measures how good the model is at predicting a specific category; recall, also known as sensitivity or true positive rate, is the fraction of positives that are identified by the model as positive and tells how many times the model was able to detect a specific category. The F1 score is the harmonic mean of precision and recall. It ranges from 0 to 1, and the greater the F1 score, the better the model performance. It provides information on how precise the model is (how many of the instances it labels as positive are truly positive), as well as how robust it is (it does not miss a significant number of instances). It is usually used when there is an uneven class distribution.
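The measures defined above follow directly from the four cells of a binary confusion matrix. The sketch below computes them from counts of true positives (tp), false positives (fp), false negatives (fn) and true negatives (tn); the function name and the example counts are invented for illustration and are not taken from the paper's tables.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "classification_error": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The guards against empty denominators return 0.0 by convention when a measure is undefined, for example precision when the model never predicts the positive class.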

3.1 Results

In this section we summarize and comment on some results. We used data on LFS interviews in 2016, 2017 and 2018 for model training and validation, and data on LFS interviews in 2019 for model testing. As discussed, we applied the chosen methods to predict both the employment status measured by LFS and the agreement between LFS and AD. Since our aim is to predict the LFS employment status (in October), results are always reported in terms of LFS employment status prediction.

Tables 8 and 9 show the confusion matrices we obtained by applying the logistic model and the ML techniques to the employment status measured by LFS, by using either an unbalanced or a balanced dataset obtained by applying ROSE. Classification errors are highlighted in italics.

The same methods have been applied to the agreement between LFS and AD. Tables 10 and 11 show the results in terms of employment status measured by LFS: when the predicted agreement between LFS and AD is equal to 1, the predicted LFS variable has the same value as AD, while when the predicted agreement between LFS and AD is equal to 0, the predicted LFS variable has the opposite value to AD.
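The mapping between the agreement variable and the LFS employment status stated above is a simple deterministic rule, sketched here for binary 0/1 statuses. The function names are illustrative and not taken from the paper's R code.

```python
def agreement(lfs_status, ad_status):
    """Response variable of the indirect approach: 1 if the sources agree, else 0."""
    return 1 if lfs_status == ad_status else 0

def lfs_from_agreement(ad_status, predicted_agreement):
    """Derive the predicted LFS status from AD and the predicted agreement.

    Agreement 1 -> same value as AD; agreement 0 -> opposite value to AD.
    """
    return ad_status if predicted_agreement == 1 else 1 - ad_status

# Round trip over all four (LFS, AD) combinations: with a perfect agreement
# prediction, the LFS status is recovered exactly.
round_trip = {
    (lfs, ad): lfs_from_agreement(ad, agreement(lfs, ad))
    for lfs in (0, 1)
    for ad in (0, 1)
}
```

Because the rule is invertible given AD, any error in the predicted agreement translates one-to-one into an error in the predicted LFS status, which is why both approaches can be compared on the same confusion matrices.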

Table 8 LFS employment status. Unbalanced dataset. LFS interviews, years 2016–2018. Italy
Table 9 LFS employment status. Balanced dataset. LFS interviews, years 2016–2018. Italy
Table 10 Agreement between LFS and AD, observed vs predicted LFS employment status. Unbalanced dataset. LFS interviews, years 2016–2018. Italy
Table 11 Agreement between LFS and AD, observed vs predicted LFS employment status. Balanced dataset. LFS interviews, years 2016–2018. Italy

Tables 12 and 13 show the measures we used to globally evaluate the models.

Table 12 Model evaluation measures. LFS employment status. LFS interviews, years 2016–2018. Italy
Table 13 Model evaluation measures. Agreement between LFS and AD. LFS interviews, years 2016–2018. Italy

When using the LFS employment status as the target variable in the training and validation phase, the balancing of the sample does not substantially affect the results, which are good both in terms of classification error and performance measures, i.e. accuracy, precision, recall and F1. The balancing of the dataset seems to worsen the results when using the agreement between AD and LFS as target variable. This is probably due to the fact that the features associated with individual status that we use as covariates provide information that is mostly related to univariate responses rather than to the observed joint distribution. This may be further investigated by looking at the prediction performance on the univariate profiles and comparing it to the bivariate one, also in terms of variable importance. Generally, and as expected (since in the indirect approach the methods are not trained to directly predict the employment status), results obtained by using the employment status measured by LFS as target variable are slightly better than those obtained by using the agreement between AD and LFS. ML models show a better performance than the logit model.

Focusing on the most problematic cases, where individuals are reported as employed in the LFS but are not present in any administrative source in the month of the interview (\(LFS=1, AD=0\)), results are opposite to those obtained on all LFS interviews, as shown in Tables 14, 15, 16 and 17: a balanced dataset and the agreement between sources as target variable seem to lead to better predictions. In fact, the classification errors in Table 17 for decision trees and random forests, equal to 0.162 and 0.127, respectively, are lower than the classification errors obtained by using either the LFS employment status as target variable or the agreement between sources with an unbalanced dataset.

Table 14 LFS employment status. Unbalanced dataset. LFS interviews, with \(AD=0\), years 2016–2018. Italy
Table 15 LFS employment status. Balanced dataset. LFS interviews, with \(AD=0\), years 2016–2018. Italy
Table 16 Agreement between LFS and AD, observed vs predicted LFS employment status. Unbalanced dataset. LFS interviews, with \(AD=0\), years 2016–2018. Italy
Table 17 Agreement between LFS and AD, observed vs predicted LFS employment status. Balanced dataset. LFS interviews, with \(AD=0\), years 2016–2018. Italy

In the test phase too (see Tables 18, 19, 20, 21), we observed good results in terms of classification errors. Generally speaking, when using the agreement between AD and LFS we obtain far better results with unbalanced data than with balanced data, while balancing affects the results when using the employment status from LFS by realigning the correct decision rates for LFS=0 and LFS=1. For the unbalanced dataset, the use of the agreement between sources as target variable improves, albeit slightly, the quality of the results for both ML procedures, and decreases the quality of the results for the logistic model.

Table 18 LFS employment status. Unbalanced dataset. LFS interviews, year 2019. Italy
Table 19 LFS employment status. Balanced dataset. LFS interviews, year 2019. Italy
Table 20 Agreement between LFS and AD, observed vs predicted LFS employment status. Unbalanced dataset. LFS interviews, year 2019. Italy
Table 21 Agreement between LFS and AD, observed vs predicted LFS employment status. Balanced dataset. LFS interviews, year 2019. Italy

To summarise, when ML methods are used to predict the LFS employment status for the entire population, we suggest using the LFS employment status as the target variable. In this case, balancing affects the quality of the results only slightly. In contrast, when ML methods are used to predict the LFS employment status for a specific part of the population, such as the group of individuals for whom the information from AD and LFS disagrees, we suggest using both balancing and the agreement between sources as the target variable.
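To make the notion of balancing concrete, the sketch below balances a dataset by randomly undersampling the larger class. This is only one possible technique and is an assumption here: this section does not restate which balancing procedure was applied, and the function name, field names and toy records are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_undersampling(rows, target_key, seed=0):
    """Randomly undersample every class to the size of the smallest class.

    Assumption: undersampling stands in for whatever balancing
    technique is actually used in the study.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_key]].append(row)

    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 8 interviews with lfs=1 and 2 with lfs=0.
toy_rows = ([{"id": i, "lfs": 1} for i in range(8)]
            + [{"id": i, "lfs": 0} for i in range(8, 10)])
balanced_rows = balance_by_undersampling(toy_rows, "lfs")
class_counts = {c: sum(r["lfs"] == c for r in balanced_rows) for c in (0, 1)}
```

The same function can balance on the agreement indicator by passing a different `target_key`.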

In the following, we analyse the importance that the different ML methods give to the input variables.

Figures 4, 5, 6 and 7 show the variable importance in decision trees, and Figs. 8, 9, 10 and 11 show the variable importance in random forests. The results highlight the variables that are most closely linked to the response of interest: the employment status in the first case and the agreement between the two measures in the second. Results are similar for decision trees and random forests.

As already discussed, the covariates used in the ML methods are: gender, age in classes, residence (region), level of education, income from both employment and self-employment, income from capital, income from retirement, a binary indicator of Italian citizenship, type of pension (disability, age, etc.) and type of administrative source. All variables refer to October. The information on employment status from AD from January (AD.1) to September (AD.9) is also used as input.

Fig. 4 Unbalanced data, decision trees for LFS employment status. Variable importance
Fig. 5 Balanced data, decision trees for LFS employment status. Variable importance
Fig. 6 Unbalanced data, decision trees for agreement between LFS and AD. Variable importance
Fig. 7 Balanced data, decision trees for agreement between LFS and AD. Variable importance
Fig. 8 Unbalanced data, random forests for LFS employment status. Variable importance
Fig. 9 Balanced data, random forests for LFS employment status. Variable importance
Fig. 10 Unbalanced data, random forests for agreement between LFS and AD. Variable importance
Fig. 11 Balanced data, random forests for agreement between LFS and AD. Variable importance

The importance that the ML methods assign to the input variables differs depending on whether the target is the LFS employment status or the agreement between sources. Only income from capital and citizenship always have the lowest importance in all ML procedures, i.e. decision trees and random forests for LFS employment status and agreement between sources with balanced and unbalanced data.

When the target is the LFS employment status measured in October, the information from past months of administrative data is important, at least for the most recent months: even though we are not carrying out a longitudinal analysis, the relationship between past information on employment and the variable of interest measured by LFS in October is quite strong. For balanced data, the variable SOURCE, which indicates the type of administrative source, also plays an important role.

When the target is the agreement between sources, income from employment and age are the variables that show high importance, while the information from past months of administrative data is less important. In this case too, the variable SOURCE plays an important role.

4 Conclusions and future work

Istat has traditionally produced official statistics on employment using data from the Labour Force Survey. In recent years, Istat, like other national statistical institutes, has been investing in the development of methods for producing official statistics from several information sources. In this context, an important role is played by administrative data. The use of this type of data for the production of official statistics has a number of advantages, including cost reduction, a decrease in the burden on respondents and possible improvements in the relevance, accuracy and timeliness of the statistical output. However, there are also disadvantages, especially related to the transformation process that is needed to “adapt” the administrative data to statistical purposes.

In the field of employment, several administrative sources are regularly acquired by Istat from external bodies. These cover different categories of work and differ in level of detail and quality. LFS directly measures the employment status, while AD contain some signal on the individual working condition. Measurements from LFS and AD may disagree for different reasons, including the presence of measurement error. An important reason to study the informative content of AD is that LFS data can serve some macro-level longitudinal purposes but do not help in understanding the individual-specific evolution of employment. The latter can be studied by using administrative data, which are available over time at a very high level of detail.

This work describes the use of machine learning methods, in particular decision trees and random forests, to predict the LFS employment status and to understand the reasons for discrepancies between data from the LFS and administrative sources over a four-year window, from 2016 to 2019. We used two target variables in the ML procedures: the employment status measured by the LFS and the agreement between the LFS and AD measures. The latter can also be used to predict the LFS employment status indirectly: indeed, when we predict an agreement between LFS and AD, the predicted LFS employment status depends on the value recorded by the AD, which is available for the whole population. Data on respondents to LFS in 2016, 2017 and 2018 linked to AD have been used for training and validating the models, and data on respondents to LFS in 2019 linked to AD have been used for model testing.
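The indirect prediction rule described above can be sketched as follows. The sketch assumes that both statuses are coded as 0/1, so that a predicted agreement returns the AD value and a predicted disagreement returns its complement; the function name and the toy values are hypothetical.

```python
def lfs_from_agreement(ad_status, predicted_agreement):
    """Derive the predicted LFS employment status from the AD status
    and the predicted agreement between LFS and AD (binary coding assumed)."""
    if ad_status not in (0, 1):
        raise ValueError("ad_status must be 0 or 1")
    # Agreement: LFS equals AD. Disagreement: LFS is the other category.
    return ad_status if predicted_agreement else 1 - ad_status


# Hypothetical toy example: AD status and predicted agreement for four people.
ad_values = [1, 0, 0, 1]
agreement_flags = [True, True, False, False]
predicted_lfs = [lfs_from_agreement(a, g)
                 for a, g in zip(ad_values, agreement_flags)]
```

In the most problematic group (\(AD=0\)), a predicted disagreement therefore corresponds to a predicted LFS status of employed.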

We show that ML seems to be useful and promising in research on employment. ML procedures allowed us to understand the role of the auxiliary variables in potentially explaining the phenomenon of interest (the LFS employment status and the agreement between the sources), and to decrease the prediction errors compared to traditional methods, such as the logistic model. Furthermore, we show that the ML procedures applied to the agreement between LFS and AD can be used to improve the prediction of the LFS employment status in the most problematic cases where the sources do not agree, that is, for individuals who declare in the LFS that they are employed but are not present in any administrative source, at least in the month of the interview. This is probably because, in this case, ML procedures are trained to “learn” about the phenomenon of interest (agreement between sources) rather than about the employment status in general.

The work is still in progress: the information potential of ML methods in employment research is very high but needs a more thorough exploration. Results can be used to assess the quality of AD sources and to improve the quality of LFS in the data collection phase, for example in the sample allocation or in the questionnaire. For instance, the LFS design could be modified so that the sample reaches those groups of individuals for whom the two sources of information usually disagree. Furthermore, AD lack information on occasional work, such as jobs that were once covered by so-called vouchers, and these types of work appear to be relatively frequent in groups where the two measures do not agree.

Additional study is also needed to exploit in detail the informative content of administrative sources and LFS, also by using information on some job characteristics, such as employment sector or type of contract (part- and full-time). Furthermore, to make better use of the LFS content, we should use LFS sample weights in the ML procedures. Concerning ML tools, their potential needs to be exploited more fully. On the one hand, we need to investigate the role that covariates play in the models. Following a Reviewer’s suggestion, we have considered not only the effect of AD.1-AD.9, as in the case we refer to in the text, but also AD.1 together with the number of observed transitions recorded in the considered temporal window. There is no clear evidence of a substantial gain in explanatory power, but this may be worthy of further detailed analysis. On the other hand, we need to exploit the potential of balancing techniques, as suggested by the analysis of the agreement between sources: the settings that generally give the worst results give the best results in the most problematic cases. A possible approach is to use data from different months at the same time to address the problems linked to imbalance in the data.
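The alternative covariate mentioned above, the number of observed transitions in the monthly AD employment status, can be illustrated with a minimal sketch. It assumes that a transition is any change of status between two consecutive months from AD.1 to AD.9; this definition, the function name and the toy history are hypothetical and are not taken from the study.

```python
def count_transitions(monthly_status):
    """Count changes of status between consecutive months.

    Assumption: a transition is any month-to-month change in the
    0/1 employment indicator recorded in the administrative data.
    """
    return sum(prev != curr
               for prev, curr in zip(monthly_status, monthly_status[1:]))


# Hypothetical AD.1 ... AD.9 history for one individual (January to September).
ad_history = [0, 0, 1, 1, 1, 0, 0, 1, 1]
first_month_status = ad_history[0]             # AD.1
n_transitions = count_transitions(ad_history)  # changes at Mar, Jun and Aug
```

Under this definition, the pair (AD.1, number of transitions) summarises the nine monthly indicators with two covariates.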