
1.1 Why This Book?

Most population projections forecast the population using only demographic characteristics (age and sex), but the inclusion of additional dimensions such as education (Lutz et al. 2014) and sociocultural variables (Bélanger et al. 2019) is an emerging approach in the social sciences (Spielauer 2010). Indeed, in addition to providing a richer set of outputs, including additional dimensions provides more flexibility in the generation of policy-relevant alternative projection scenarios. Furthermore, it improves the overall quality of the projection, as more sources of heterogeneity are considered, which also allows for a more refined modeling of demographic events.

Traditional demographic projections using the cohort-component method can only provide outcomes related to the age and sex structure of a population. When extended to multistate and multiregional applications (Rogers 1980, 1995), more dimensions can also be added (such as region or education). Microsimulation is a powerful tool that can be used to create population projections when the number of dimensions becomes large. Such a model is very flexible and characterised by the stochastic simulation of individual life courses based on derived parameters and individual characteristics (Van Imhoff and Post 1998). Until the late 1990s, computing power was not sufficient to use microsimulation for very complex population projections. However, with a newer generation of powerful computers, some institutions around the world changed their projection methods to microsimulation (Caron-Malenfant et al. 2017).

Many microsimulation models are built using a language or software specifically designed for this purpose, such as ModGen, JAMSIM, Mic-Core, or OpenM++ (Bélanger and Sabourin 2017; Mannion et al. 2012; Zinn 2014). Using these tools requires specific and exhaustive prior knowledge, as they are complex and not user-friendly. Moreover, user guides and online support are in general limited, given the small number of users. Most of those tools are also not very flexible, as they are usually designed for a specific purpose and their functions cannot be modified or adapted easily for other purposes. This also keeps the user in the dark concerning what exactly happens when a function is called, sometimes leading to unexpected or awkward outcomes. Indeed, when using such tools, the assistance of a coding expert is generally required.

For those reasons, when a researcher needs to perform microsimulation, building one’s own model with common statistical software, such as SAS, Stata, or R, might be a good option. These programs are widely used among scholars and are taught in most social science departments, so many social scientists already have the required background in the coding language. Given the large number of users, online support can also be found easily when needed.

Microsimulation packages specifically designed for population projection already exist in R (Zinn 2014). This book is a step-by-step guide showing how to build a microsimulation model for demographic projections using the SAS language. For this book, we used SAS 9.4. The code we provide also works with other versions of SAS, such as SAS University Edition. The guide is designed for people with beginner to intermediate knowledge of SAS. The code we suggest is easy to understand so that it can be replicated or adapted for other purposes. It is, however, not necessarily the most efficient.

First, this book shows how to convert an existing multistate projection by age, sex, education and region into a microsimulation model framework. Two new dimensions are then added, the labour force participation and the sector of activity, and some examples of outputs and alternative scenarios that would not be possible with standard demographic methods are shown. Other chapters show how to adapt the model for other countries or other purposes.

The book is intended for people with a good background in demography, population dynamics, and quantitative analysis, who wish to extend their technical skills by learning how to use microsimulation in demography with SAS. The user needs to know the principles of population projection, as the book does not explain how to build demographic assumptions for the future. The demographic components of the microsimulation models constructed as examples in this book come from existing multistate projections, either from KC et al. (2018) or from Lutz et al. (2018), that forecast populations by region and educational attainment. We do however build assumptions for additional dimensions of the projection, labour force participation and sector of activity, which are modelled from various surveys.

For each chapter, all input files and code files used in this book can be found in the Chapter ESM (Electronic Supplementary Materials).

1.2 What Is Microsimulation? Why Use It?

Microsimulation is an alternative approach to the deterministic macro-level population projection models that use aggregate-level data, such as the cohort-component method, to project future population dynamics. In microsimulation, the modelling is based on individual-level data. Though microsimulation methods have been conceptualised for decades and used for other purposes (Orcutt 1957), their application for population forecasts is quite new. For an exhaustive description of microsimulation for population forecasting and its properties, compared to multistate cohort-component methods, see Van Imhoff and Post (1998).

A microsimulation model starts from a baseline population that consists of individual actors whose characteristics represent the composition of a given population across chosen dimensions. These individual actors are exposed to the risk of a set of events relevant to their state and specific to their own characteristics: death, births of children (which generate new actors inside the model), moving to a different region in a country, leaving the country, achieving a level of education, entering or exiting the labour market, and so on. International immigrants enter the model with a predetermined set of individual characteristics and are subjected to risks of the events mentioned above. Transitions between the states are determined stochastically with a random experiment (Monte Carlo method). Microsimulation thus allows not only for including a larger set of dimensions than the standard multistate population projection models, in which handling more than three or four dimensions becomes challenging, but also for handling competing risks easily.
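As a rough illustration of how competing risks can be resolved with a single random experiment, the following Python sketch draws one uniform number per individual and compares it with cumulative event probabilities. This is only one common way to do it and is not taken from the book's SAS code; the function name `draw_event`, the event names, and the probability values are all hypothetical.

```python
import random

def draw_event(probabilities, rng):
    """Return at most one event for one individual in one time step.

    `probabilities` maps an event name to its probability. The events are
    assumed to be mutually exclusive within the step, so the remainder
    (1 - sum of probabilities) is the chance that nothing happens.
    """
    u = rng.random()  # uniform draw in [0, 1)
    cumulative = 0.0
    for event, p in probabilities.items():
        cumulative += p
        if u < cumulative:
            return event
    return None  # no event in this step

# Hypothetical annual probabilities (illustrative values only).
risks = {"death": 0.01, "emigration": 0.02, "internal_move": 0.05}

rng = random.Random(2024)  # fixed seed so the run is reproducible
n = 100_000
counts = {"death": 0, "emigration": 0, "internal_move": 0, None: 0}
for _ in range(n):
    counts[draw_event(risks, rng)] += 1

# Each observed share should be close to the corresponding input probability.
for event, count in counts.items():
    print(event, count / n)
```

Because a single draw decides between all events, an individual can never both die and emigrate in the same step, which is the property that makes competing risks straightforward at the individual level.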

Figure 1.1 shows a simple example of how stochastic microsimulation works (using the Monte Carlo method), compared to the cohort-component method. Suppose 1000 women were aged 75–79 in 2015. If we assume a probability of dying of 5%, the cohort-component method will simply remove 5% of the cohort, and we will get 950 survivors aged 80–84 in 2020.

Fig. 1.1 Cohort-component method versus microsimulation

In microsimulation, we start with a dataset in which each row is an individual. In this example, we thus have 1000 rows representing the 1000 women aged 75–79 in 2015, all tagged as being alive. Some of them will die before 2020, about 5% according to our assumptions. We determine who will die with random experiments, which implies comparing the probability of dying (5%) with a random number drawn uniformly between 0 and 1. When the random number is lower than the probability of dying, the individual dies, and we switch the variable alive to 0. Out of 1000, we will get about 950 survivors. When the sample is small (for instance 10 individuals), the number of survivors in a single run could be far from the expected number: this is the Monte Carlo error resulting from the random experiment. In these cases, it might be useful to take the average of multiple simulations or increase the sample size, which would reduce the error.
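The random experiment described above can be sketched in a few lines. The example below is written in Python purely for illustration and is not the SAS code used in the book; the function name `simulate_survivors` and the seed are assumptions. It simulates the 1000 women with a 5% probability of dying, then repeats the experiment on a sample of only 10 individuals to show how averaging many runs reduces the Monte Carlo error.

```python
import random

def simulate_survivors(n_individuals, p_death, rng):
    """Simulate one period of mortality and return the number of survivors."""
    alive = [1] * n_individuals          # one entry per individual, all alive
    for i in range(n_individuals):
        if rng.random() < p_death:       # uniform draw below the probability
            alive[i] = 0                 # the individual dies
    return sum(alive)

rng = random.Random(12345)  # fixed seed so the run is reproducible

# Large sample: the result should be close to the expected 950 survivors.
large = simulate_survivors(1000, 0.05, rng)

# Small sample: a single run can be far from the expected 9.5 survivors,
# but the average over many runs gets close to it.
small_runs = [simulate_survivors(10, 0.05, rng) for _ in range(500)]
mean_small = sum(small_runs) / len(small_runs)

print("survivors out of 1000:", large)
print("range over 500 runs of 10:", min(small_runs), "to", max(small_runs))
print("average over 500 runs of 10:", mean_small)
```

Changing the seed changes the individual outcomes but not the overall pattern: single small runs scatter around the expectation, while large samples and averages stay close to it.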

If microsimulation gives similar results to the cohort-component methods, why choose this method? Spielauer (2010) describes three broad situations when microsimulation should be used, rather than the multistate cohort-component method:

  • When heterogeneity matters in the projection modeling or in the projection outcomes. The multistate cohort-component method can only handle a limited number of dimensions because the number of cells for the transition matrices corresponds to the multiplication of the number of categories of each dimension. In microsimulation, each additional dimension only adds a new column in the dataset. Suppose we have a 7-dimension model projecting age (20 age groups), sex (2 categories), education (6 categories), education of the mother (6 categories), region (70 categories), labour force participation (2 categories) and child parity (10 categories). The matrix for a multistate model would require more than 2 million cells (20*2*6*6*70*2*10 = 2,016,000). In microsimulation, the number of cells is the number of individuals in the sample multiplied by the number of dimensions. So if we have a sample of 100,000 individuals, the number of cells would be 700,000 (7 dimensions * 100,000). Then, if we want to add another dimension, for instance religion in 4 categories, the number of cells in the multistate model would be multiplied by 4 and would exceed 8 million, while in the microsimulation model, we just add one column to the data set and get 800,000 cells, which is much more manageable.

  • When behaviours can be better understood at the micro level than at the aggregate level. For instance, the number of years spent in a country is a major predictor for immigrants’ fertility, mortality or labour force participation. At the micro level, these predictors can easily be taken into account. Only one additional column is required for the variable “time spent in the country”, the value of which is incremented every year without any complex modeling. The variable can then be used in the modeling of other events, using, for instance, relative risks and logit parameters.

  • When individual histories matter. For instance, past life habits might have a big impact on mortality at older ages. Similarly, retirement pensions depend in many cases on past income and the number of years worked. Microsimulation can also easily keep a record of the birth history of women. Every time the birth event occurs, we can just increment a variable “number of births”, which can then be used once women get older to analyse their potential as caregivers.
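The cell-count arithmetic in the first situation above can be checked with a short calculation. The Python sketch below is illustrative only; the dictionary keys and function names are hypothetical labels, while the category counts and the sample size are those given in the text.

```python
from math import prod

# Number of categories per dimension, as listed in the text.
categories = {
    "age_group": 20,
    "sex": 2,
    "education": 6,
    "mother_education": 6,
    "region": 70,
    "labour_force": 2,
    "parity": 10,
}
sample_size = 100_000

def multistate_cells(cats):
    """Cells in a multistate matrix: product of the category counts."""
    return prod(cats.values())

def microsim_cells(cats, n_individuals):
    """Cells in a microsimulation dataset: one column per dimension."""
    return n_individuals * len(cats)

multistate_7 = multistate_cells(categories)            # 2,016,000
microsim_7 = microsim_cells(categories, sample_size)   # 700,000

# Adding religion in 4 categories multiplies the multistate matrix by 4,
# but only adds one column to the microsimulation dataset.
with_religion = {**categories, "religion": 4}
multistate_8 = multistate_cells(with_religion)            # 8,064,000
microsim_8 = microsim_cells(with_religion, sample_size)   # 800,000

print(multistate_7, microsim_7, multistate_8, microsim_8)
```

The multistate count grows multiplicatively with every new dimension, whereas the microsimulation count grows by one column of fixed length, which is why the gap widens so quickly.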

1.3 Examples of Demographic Projections Using Microsimulation

Many types of microsimulation models have been developed and used to address different types of research questions in various fields. For example, they have been used to evaluate the future performance of long-term programs such as pensions (Morrison 2017) and long-term care (Carrière et al. 2008), to simulate the potential impacts of prospective public policies or policy changes (Sutherland 2007), and to project life-time behaviours (e.g. saving) or complex dynamics (e.g. ageing) for policy analysis (Sundberg 2007). An exhaustive overview of microsimulation applications in social sciences and other areas can be found elsewhere (Li & O’Donoghue 2013; Spielauer 2010).

Recent developments in computing technology, as well as the rise in the number of micro-data sources needed to calculate the parameters of microsimulation, have made it easier to develop more complex models and have increased the level of interest in such models (Bélanger and Sabourin 2017). Those interested in reading about the different uses of specific microsimulation models and their specific methodological issues can browse the International Journal of Microsimulation, which is the official peer-reviewed journal of the International Microsimulation Association.

With regard to population projections that use microsimulation, Statistics Canada, the official statistical agency of Canada, is a pioneer. The agency has used microsimulation methods for its official projections for many years. This started in 2004 with the model PopSim (now DemoSim), which was designed to project the Canadian population in terms of various characteristics (Caron-Malenfant et al. 2017). The model is built using the ModGen language, and its most recent version begins with the microdata file of the National Household Survey of 2011. It projects dynamically and in continuous time, on the one hand, sociodemographic characteristics such as age, sex, education and labour force participation and, on the other hand, several ethnocultural variables, such as visible minority group, place of birth, generation status, and language.

As Canada is becoming more and more diverse with large inflows of international immigrants, the model includes explicitly the different behaviours of ethnocultural groups living in the country. Among other sources of heterogeneity, the model accounts for higher fertility for some ethnic groups (Black, Muslim, First Nations), as well as for recent immigrants, compared to those who have been living in the country longer. It accounts for the higher propensity of international immigrants to emigrate (return migration), as compared to the native population. The “healthy immigrant” effect is also implemented, which provides immigrants with lower probabilities of dying in the years following their arrival as a result of direct and indirect immigration selection (McDonald and Kennedy 2004). Domestic migration is also modulated by languages, as the French and English speakers that constitute the core of the Canadian population have very different mobility patterns.

Microsimulation is the only possible method for dynamically including such heterogeneity in sociodemographic behaviours, thus allowing for more accurate and more detailed projection outcomes. Statistics Canada has used the model to produce several reports on future Canadian populations, such as visible minority groups (Morency et al. 2017), aboriginal populations (Caron-Malenfant et al. 2015), and language speakers (Houle and Corbeil 2017), and to forecast labour force participation (Martel 2019).

DemoSim is built using several confidential data files that are not available to external researchers. From public microdata files, the Laboratoire de simulations démographiques (LSD) (Demographic Simulation Laboratory) of the Institut national de la recherche scientifique (National Institute for Scientific Research) proposed a framework for a lighter version of the microsimulation model that could project the population while accounting for several sociodemographic and ethnocultural variables, in order to study population changes in a context of relatively high immigration and low fertility (Bélanger et al. 2019). This framework has been adapted to produce several region-specific versions. For instance, the LSD framework was used to build a model for the United States (Van Hook et al. 2020), LSD-USA, from the anonymised public files of the 2015 American Community Survey and General Social Surveys (1995–2015). It projects the population of the USA to 2065 and includes dimensions such as race, generation, duration of stay, education and labour force participation. LSD-USA has been used to project the effect of several policy-oriented scenarios regarding immigration levels and educational attainment on the future workforce of the country (Van Hook et al. 2020).

From the LSD framework, the Center of Expertise on Population and Migration (CEPAM), a partnership between the International Institute for Applied Systems Analysis and the Joint Research Center of the European Commission, built a similar model called CEPAM-Mic (Bélanger et al. 2019). The base population and assumptions are built from different sources: public microdata files of European Labour Force Surveys and General Social Surveys on the one hand, and aggregated data from the Census 2011 and from a multistate cohort-component model on the other hand.

The CEPAM-Mic model can dynamically project the population for the EU28 member states in terms of several socioeconomic and ethnocultural dimensions, including education, labour force participation, employment, age at immigration, region of birth, duration of residence, education of the mother, religion and language. This model allows for the study of alternative scenarios of migration and their consequences on future populations and labour supply trends in the European Union. It has been used to assess policy-relevant scenarios with regard to sociocultural inequalities in education (Marois et al. 2019a) and integration of immigrants (Marois et al. 2019b), as well as to propose an innovative dependency ratio that takes into account the productivity of workers (Marois et al. 2020). CEPAM-Mic allows researchers to assess a large range of policy-relevant alternative scenarios and produce indicators showing that population aging is less daunting than it may seem when only age structure is considered.

Beyond ethnocultural and sociodemographic variables, other types of dimensions can also be implemented in microsimulation models for demographic projections. Starting from the CEPAM-Mic model mentioned above, the model ATHLOS-Mic implements a health module that refines projection outcomes (Marois and Aktas 2021). This module adds a health metric ranging from 0 to 100 and a set of risk factors (such as smoking and obesity) to the characteristics of individuals. Changes in risk factors are determined with logit regression parameters that take into account other risk factors. The value of the health metric, which is also used to modulate the probability of dying, is then determined from risk factors and other sociodemographic characteristics. This model thus allows researchers to assess the impact of policy-intervention scenarios on different outcomes, such as the number of years of life lost or the average health of the population.