Multi-source statistics on employment status in Italy, a machine learning approach

Varriale, Roberta; Alfò, Marco

doi:10.1007/s40300-023-00242-7

Open access
Published: 17 April 2023

Volume 81, pages 37–63, (2023)

METRON

Abstract

In recent decades, National Statistical Institutes have started to produce official statistics by exploiting multiple sources of information (multi-source statistics) rather than a single source, usually a statistical survey. In this context, one of the research projects addressed by the Italian National Statistical Institute (Istat) concerns methods for producing estimates on employment in Italy using survey data and administrative sources. The former are drawn from the Labour Force Survey conducted by Istat, the latter from several administrative sources that Istat regularly acquires from external bodies. We use machine learning methods to predict the individual employment status. This approach is based on decision tree and random forest techniques, which are frequently used to classify large amounts of data. We show how to construct a “new” response variable denoting agreement of the data sources: this approach is shown to maximise the information that machine learning can extract in some problematic cases. The methods have been applied using the R software.

1 Introduction

In recent years, National Statistical Institutes (NSIs) have been progressively moving to the production of official statistics based on the combination of data from different sources, with the aim of reducing costs and response burden while delivering detailed and high-quality information [13]. In this view, new strategies for producing the required outputs need to be developed, exploiting as far as possible the integrated use of different data sources, which include survey data and administrative sources. The production of such multi-source statistics is complex because data sources can be combined in many different ways, so the resulting statistics come in many different varieties. [6] provide an overview of statistical methods for combining multiple administrative and survey data sources, and [5] list eight basic situations of multi-source processes, providing practical guidelines for producers of multi-source statistics on problems that may be encountered and methods that can be applied to overcome such problems.

The study of methods for producing estimates on employment in Italy using different data sources represents one of the research projects addressed by the Italian National Statistical Institute (Istat). The relevant data are drawn from the Labour Force Survey (LFS), and from several administrative sources that Istat regularly acquires from external bodies.

Administrative data are defined as “(...) data derived from an administrative source, before any processing or validation by the NSIs” ([4], p. 20). Traditionally, as described by [16], they have been used as auxiliary sources of information in different phases of the production process, while survey data are used as “primary” data, following the assumption that they provide correct measures of the target variables, i.e., that they are not affected by errors. A different way of thinking is based on the assumption that both survey and administrative data may be, at least potentially, affected by measurement errors, so that a more “symmetric” approach can be adopted to take into account deficiencies in the measurement processes. This approach starts by assuming that the target variables are latent (unobserved) and describes a model for the measurement processes through the distribution of the observed variables conditional on the latent ones. In this context, Latent Class Analysis is considered as a method to identify a latent categorical construct of interest using categorical observed variables, and it can be used to evaluate measurement errors [2]; Hidden Markov Models (HMMs) represent a potential extension when longitudinal data are available. Several applications of latent models in employment research can be found (see, for example, [1] and [9]). A proposal employed by Istat to estimate employment rates is based on an HMM that accounts for the inconsistencies in the measurement process of both surveys and administrative sources, according to [16] and [17]. The same method has been applied by [10] to determine the measurement error of the variable measuring whether a respondent has a permanent or a temporary job, using both survey and register data.

Another approach to dealing with multi-source data is based on Machine Learning (ML) tools. The interest in ML from NSIs has been growing rapidly, although time is still needed before it can be used to its full potential. To give an example, there have been two international initiatives on ML for official statistics [15]: the UNECE High-Level Group on Modernisation of Official Statistics (HLG-MOS) Machine Learning Project (2019–2020) and the United Kingdom’s Office for National Statistics (ONS) – UNECE Machine Learning Group 2021, approved by the HLG-MOS. [18] describe the role of ML in analyses that can be conducted in a multi-source context, distinguishing three main settings: micro-, macro- and no linkage. [11] describe the five quality dimensions of statistical output that need to be used to identify the challenges for the application of ML, and enumerate the most important research topics that need to be studied to enable the successful application of ML for official statistics.

The present work describes the use of two ML techniques, decision trees and random forests, to predict the individual employment status. The final aim is to show how ML can be used to extract important information from the data for the purpose of estimating the target variable, and to learn more about the phenomenon. Even though the reference source on employment is the LFS, we also exploit administrative data (AD). To this purpose, we show how to construct a “new” response variable indicating situations where the data sources agree. With this approach, we do not focus only on estimating the individual employment status: we consider the agreement between survey and administrative data, and we try to establish why these sources do not give the same information. Given the agreement between survey and administrative data, the prediction of the LFS employment status can nevertheless always be derived indirectly. ML techniques have been applied using the R software.
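The construction of the agreement variable, and the indirect recovery of the LFS status from it, can be illustrated with a minimal sketch. It is written in Python purely for illustration (the analysis itself was carried out in R), it assumes a binary status (employed / not employed), and the function names are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple

def agreement(lfs_employed: bool, ad_employed: bool) -> bool:
    """'New' response variable: True when the two sources report the same status."""
    return lfs_employed == ad_employed

def lfs_from_agreement(ad_employed: bool, agrees: bool) -> bool:
    """Recover the LFS status from the AD status and the (predicted) agreement.

    Assumption: the status is binary, so disagreement means the LFS status
    is the opposite of the AD status.
    """
    return ad_employed if agrees else not ad_employed

# Illustrative (LFS, AD) pairs covering the four possible combinations.
records: List[Tuple[bool, bool]] = [
    (True, True), (True, False), (False, True), (False, False),
]

# Using the true agreement, the LFS status is recovered exactly; with a
# predicted agreement, the same rule yields a predicted LFS status.
recovered = [lfs_from_agreement(ad, agreement(lfs, ad)) for lfs, ad in records]
print(recovered)
```

In practice the agreement indicator would be the target of the classifier, and `lfs_from_agreement` would be applied to its predictions.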

The paper is organized as follows. Section 2 describes the context, characterized by the presence of multiple data sources. Section 3 discusses the application of ML procedures to predict the employment status. Section 4 concludes the work.

2 The context

Available data come from the Labour Force Survey and administrative sources. The Italian LFS is a continuous survey carried out during every week of the year. Every year it involves more than 250,000 families residing in Italy (for a total of 600,000 individuals), distributed across approximately 1400 Italian municipalities. The LFS provides quarterly estimates for the main aggregates of the labour market (employment status, type of work, work experience, job search, etc.), stratified by gender, age and territory (down to the regional level). The reference population is composed of all members of families residing in Italy, even if temporarily abroad.

The LFS represents the main source of statistical information on the Italian labour market; the information collected via the LFS is the basis for official estimates of employment. It also produces information on the main aggregates of labour supply: profession, sector of economic activity, hours worked, type and duration of contracts, and training. The LFS is harmonized at the European level as established by Regulation (EU) 2019/1700 of the European Parliament and of the Council. Its main statistical aim is to classify the working-age population (15 years and over) into three mutually exclusive and exhaustive groups: employed, unemployed (these two together make up the so-called “labour force”) and economically inactive. This last category defines the population “outside the labour force”: for example, students, retired people, and housewives. The classification criteria are based on definitions inspired by the International Labour Office (ILO) and implemented by the Community Regulations.

The employed category includes people aged between 15 and 89 who, in the reference week: (i) have worked at least one hour for pay or profit, including unpaid family workers; (ii) are temporarily absent from work because they are on vacation, on flexible hours (vertical part time, hours recovery, etc.), on sick leave, on compulsory maternity/paternity leave, or in professional training paid by the employer; (iii) are on parental leave and are receiving and/or are entitled to receive income or work-related benefits, regardless of the duration of the absence; (iv) are absent as seasonal workers but continue to carry out, on a regular basis, duties and tasks necessary to the continuation of the activity; (v) are temporarily absent for other reasons and the expected duration of the absence is not more than three months. These conditions do not necessarily require an employment contract and, thus, the employed category recorded through the LFS also includes forms of irregular work. The employment condition in the LFS is completely independent of the opinion that the interviewees have of their own status. The main regular job is defined as the only job performed or, if there is more than one, the one with the greatest number of hours usually worked or the one that the individual considers more important (greater income, greater stability, etc.).
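Conditions (i)–(v) can be read as a simple decision rule. The following Python sketch encodes them under stated assumptions: the record fields, the labels for the reasons of absence, and the function name are hypothetical and do not correspond to the actual LFS questionnaire variables.

```python
from dataclasses import dataclass
from typing import Optional

# Reasons of temporary absence covered by condition (ii); labels are illustrative.
STANDARD_ABSENCES = {
    "vacation",
    "flexible_hours",
    "sick_leave",
    "compulsory_maternity_paternity_leave",
    "employer_paid_training",
}

@dataclass
class ReferenceWeek:
    age: int
    hours_worked: float = 0.0                  # includes unpaid family work
    absence_reason: Optional[str] = None       # None when not absent
    entitled_to_work_income: bool = False      # relevant for parental leave
    keeps_regular_duties: bool = False         # relevant for seasonal workers
    expected_absence_months: Optional[float] = None

def is_employed(week: ReferenceWeek) -> bool:
    """Apply conditions (i)-(v) to one person's reference week."""
    if not 15 <= week.age <= 89:
        return False
    if week.hours_worked >= 1:                           # (i)
        return True
    if week.absence_reason in STANDARD_ABSENCES:         # (ii)
        return True
    if week.absence_reason == "parental_leave":          # (iii)
        return week.entitled_to_work_income
    if week.absence_reason == "seasonal":                # (iv)
        return week.keeps_regular_duties
    if week.absence_reason == "other":                   # (v)
        months = week.expected_absence_months
        return months is not None and months <= 3
    return False

print(is_employed(ReferenceWeek(age=30, hours_worked=1)))
print(is_employed(ReferenceWeek(age=45, absence_reason="other",
                                expected_absence_months=4)))
```

Note that no field refers to a contract, which mirrors the fact that irregular work also falls within the employed category.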

The LFS follows a quarterly rotation scheme in which families are interviewed for two consecutive quarters, excluded for two quarters and re-interviewed for another two quarters. Data are collected through a combination of Computer Assisted Personal Interviewing (CAPI) and Computer Assisted Telephone Interviewing (CATI). The sample design is based on space (selection of units) and time (selection of the survey period for each sample unit). For further details on the LFS contents, methodologies and organization in Istat, see [7].
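The rotation scheme can be made concrete with a short Python sketch. Quarters are indexed consecutively from an arbitrary origin (index 0 is assumed to be the first quarter of a calendar year); the function names are illustrative.

```python
from collections import Counter
from typing import List

# In for two quarters, out for two, in again for two.
ROTATION_OFFSETS = (0, 1, 4, 5)

def interview_quarters(entry_quarter: int) -> List[int]:
    """Quarter indices in which a family entering at `entry_quarter` is interviewed."""
    return [entry_quarter + offset for offset in ROTATION_OFFSETS]

def max_interviews_per_year(entry_quarter: int) -> int:
    """Largest number of interviews falling in a single calendar year."""
    per_year = Counter(quarter // 4 for quarter in interview_quarters(entry_quarter))
    return max(per_year.values())

for entry in range(4):
    print(entry, interview_quarters(entry), max_interviews_per_year(entry))
```

Whatever the entry quarter, no calendar year contains more than two interviews of the same family under this scheme.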

In the last decade, all European NSIs have started using Administrative Data (AD) in the statistical production process. In Istat, the AD relevant to labour statistics mainly come from social security and fiscal authorities. More specifically, data come from several different sources, such as Modello UniEMens (EMENS), DMAG, etc. for social security data, and Modello 730, Modello Unico, Certificazione Unica, etc. for fiscal data.

Administrative sources on employment can be classified on the basis of different aspects: the administrative purpose of the source and, consequently, its information content on social security contributions and/or earned income; the availability of temporal information on employment; and the different forms of employment (dependent or self-employed). The quality and the informative power of each administrative source are different. To give an example, some sources have temporal details on the start and the end of an employment contract, while others may only detect an overall signal during the whole year. Furthermore, some statistical units are not covered by administrative information, for example irregular jobs and jobs whose salary does not exceed a given threshold. After preprocessing and harmonization of the information, data are organized in an information system with a linked employer-employee structure: the main unit of analysis is the employee job position, defined as a relationship established between an employer and a worker. From this data structure we may obtain information on the statistical unit of interest, i.e., the worker. Subsequently, for each subject, the main regular job activity and its characteristics, according to the ILO definitions, are derived: useful information from all labour AD sources is selected and harmonized to make it comparable at a monthly level, and a deterministic methodology to identify the main job for each individual and month is implemented, following the indications of subject matter experts. Data are treated differently according to the type of employment relationship, i.e., self-employed, employees, external workers.

Starting from the resident population in Italy, the available data on employment from the LFS and administrative sources are linked at the individual level, and the information is harmonized at the monthly level. For the LFS, the weekly information, if present, represents the monthly occupational status. For AD, we use the information for the same week as the LFS when a statistical unit is in the LFS sample, and for a random week in the month in all other cases. The resulting dataset contains the monthly employment status measured by both the LFS and administrative sources.
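The week-selection rule used in the linkage can be sketched as follows. This is a simplified Python illustration, not the production procedure: the data structures, the identifiers, and the choice to treat a missing AD signal as “not employed” are assumptions made for the example.

```python
import random
from typing import Dict, List, Optional, Tuple

def link_month(
    population: List[str],
    lfs: Dict[str, Tuple[int, bool]],        # person -> (reference week, employed)
    ad_weekly: Dict[Tuple[str, int], bool],  # (person, week) -> employed in AD
    month_weeks: List[int],
    rng: random.Random,
) -> List[dict]:
    """Build one linked record per resident for a given month."""
    rows = []
    for person in population:
        lfs_record: Optional[Tuple[int, bool]] = lfs.get(person)
        if lfs_record is not None:
            week, lfs_status = lfs_record    # same week as the LFS interview
        else:
            week, lfs_status = rng.choice(month_weeks), None  # random week
        # Assumption: no AD signal for that week is read as "not employed".
        ad_status = ad_weekly.get((person, week), False)
        rows.append({"id": person, "week": week, "lfs": lfs_status, "ad": ad_status})
    return rows

population = ["A", "B", "C"]
lfs = {"A": (2, True)}                          # only A is in the LFS sample
ad_weekly = {("B", w): True for w in (1, 2, 3, 4)}
linked = link_month(population, lfs, ad_weekly, [1, 2, 3, 4], random.Random(0))
for row in linked:
    print(row)
```

Here unit A shows a disagreement between sources (employed in the LFS, no AD signal), which is the kind of case the agreement variable is meant to capture.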

Table 1 graphically represents the informative context. For the entire resident population, AD provide information on the presence of each individual in at least one administrative source, in each month. The LFS collects information on approximately 1.2% of the overall Italian population. For each individual there are at most two LFS observations per year.

Table 1 Available employment data: Administrative (AD), Labour Force Survey (LFS)
