6.1 A Flexible Model

In previous chapters, we translated a multistate model projecting education in India and its regions into a microsimulation model, then added two new dimensions, labour force participation and sector of activity, for which we showed examples of alternative scenarios.

The framework of the model can be easily adapted for other purposes. Adapting it for another country would only require us to change the population and parameters files accordingly, while changes in the code would be minor, such as changing the name of regions in the migration module, or if the number of categories of education is different, adapting the code to a larger or smaller number of categories accordingly. Depending on our needs, we could also change the modelling of certain events or add events.

In this chapter, we will adapt the model for another projection with different attributes. We will replicate the multistate projection for China from the Wittgenstein Centre (Lutz et al. 2018); we will add labour force participation, and we will calibrate the outcomes on the medium variant of the World Population Prospects (United Nations 2019). Compared to the projection for India we built in Chaps. 3 and 4, this projection model has the following differences:

  • It is not multiregional. We therefore need to turn off the domestic migration module and the shift from rural to urban areas;

  • The projection is not closed, which means there is international migration;

  • The time span is 2015 to 2100.

In this section, we will detail the required changes in the microsimulation model, beginning by naming the scenario. All changes made compared to the code from Chap. 4 are highlighted in yellow.

figure a

6.2 Updating Input Files

Before making further changes in the code, input files need to be updated, including the base population and parameters files. The simulation needs to start from a new micro dataset that matches the population of China in 2015 by age, sex and education. We created this new base population with the exact same procedure as in Chap. 2, but using a different input for the aggregated population (taken from Lutz et al. (2018)). The base population, POP_2015.csv, is created in the folder Chapter6_China/Population. The code and the input file can be found in the folder Chapter6_China/NewBasePop. Although we only build an example for China in this section, Lutz et al. (2018) provide data with a similar structure for all countries of the world.

Updated parameters are provided in the folder “Chapter6_China/Parameters”. For the demographic and education dimensions, parameters are extracted from the SSP2 scenario of Lutz et al. (2018). Because the projection is not multiregional, there are no files for domestic migration (dom_mig), rural to urban shifts (ur), or for the sector of activity (formal and formal_input), as these dimensions are not included in this adaptation of the model. The structure of other parameter files stays the same and updated parameters can be directly implemented in files. The region variable is also useless, as there is only one region. However, it’s better to keep it in the input file and fill it in with a single value, such as “China”, as illustrated in Fig. 6.1 showing an example for the parameter file fertility.csv. Otherwise, an additional change would be required in the microsimulation code.

Fig. 6.1
figure 1

Screenshot of the parameter file “fertility.csv” (opened with Excel)

For the parameter file for the labour force participation module (lfp.csv) and for the imputation of labour force participation in the base population (lfp_imput.csv), we used pooled data of waves 2010 to 2017 of the Chinese General Social Survey (CGSS) to re-estimate the logistic regression model predicting labour force participation from personal characteristics (see Eq. 4.1 from the labour force participation module presented in Chap. 4). However, the education variable in the CGSS has only 5 categories: no education, primary completed, lower secondary, upper secondary and postsecondary. Therefore, in the regression model, the category “no education” includes incomplete primary. The parameters are implemented accordingly in the parameter files.

The statistical model also doesn’t include the region, because the projection is not multiregional, nor the presence of children at home, because this survey is not suitable for the inclusion of this variable. Therefore, in the parameter files, the variable for the presence of a young kid at home (young_kid_p) and its interaction with education (young_kid_edu_p) are set to 0 (see Fig. 6.2, which highlights in yellow parameters that are switched to 0). Similarly, the number of categories for the variable “region” is reduced to only one, which is set to “China” and also takes the value 0. Adapting the parameter file in such a way, that is, setting to 0 parameters of variables that are not used in the modelling rather than removing them from the file, allows us to minimize changes in the code of the labour force participation module.

Fig. 6.2
figure 2

Screenshot of the parameter file lfp.csv (opened with Excel)

Other parameters are implemented from regression models. As compared to India, gender gaps are much less prominent in China, as rates of labour force participation for women are only 10 to 15% lower than those of men (see Fig. 6.3). Chinese women with postsecondary education indeed already have rates that are close to those of men. The gender gap exists mainly for individuals with lower education. For both males and females, rates are lower for the populations with no education or primary education for most of adulthood, but become higher past the age of 60, implying that retirement comes later for the minimally educated population.

Fig. 6.3
figure 3

Source National Sample Survey on Employment and Unemployment 2017/2018 (India); Chinese General Social Survey 2010–2015 (China)

Predicted labour force participation rate by age, sex and education, China.

6.3 Changing the Time Span of the Projection

The time span of this projection is 2015 to 2100 instead of 2010 to 2060. The base population is now pop_2015.csv in the population folder. We thus change the code to import accordingly:

figure b

The first thing to do is to change the values in the macro loop that calls the simulation. (As a reminder, the second parameter corresponds to the starting year of the last loop. Thus, 2095 implies that the last loop simulates 2095 to 2100.)

figure ap
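As a minimal sketch of such a loop, the driver below iterates over the starting year of each five-year step. The macro name run_projection and its parameter names are hypothetical; only the macro variable styr is mentioned in the text. The body merely writes the simulated period to the log, where the full model would run its modules.

```sas
/* Hypothetical driver macro: names are illustrative, not the book's code. */
%macro run_projection(first_year, last_start_year);
  %local styr;
  %do styr = &first_year %to &last_start_year %by 5;
    /* In the full model, the modules of one simulation step would run here. */
    %put NOTE: simulating period &styr to %eval(&styr + 5);
  %end;
%mend run_projection;

/* The second argument is the starting year of the last step (2095 to 2100). */
%run_projection(2015, 2095);
```

With these arguments the loop runs 17 steps, the first covering 2015 to 2020 and the last 2095 to 2100.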

Then, we need to adjust the code in the imputation of the base population section in order to select the proper file, which is pop_2015 instead of pop_2010.

figure c

Then, in the count of the population, we set the proper year (2015 instead of 2010) in the label for the output files.

figure d

6.4 Turning Off Modules

There are different ways to turn off a module. If we want to minimize change in the code, we might simply switch values in the parameters files of those modules to 0. Doing so, the event will still be simulated, but since all probabilities are 0, the event will never occur. This is the simpler option, but it has the inconvenience of loading the model with unnecessary calculations, which increases the running time of the simulation. If the sample is large, depending on the power of the computer, this might generate significant unnecessary delay. In the following example, we will show how to turn off the code of the module in order to remove unnecessary calculations from the simulation.

The first thing to do is to remove code lines importing the parameters files of those modules. Because there is no domestic migration, no rural to urban shifts and no sector of activity, the code lines importing parameters (dom_mig.csv, ur.csv, formal.csv and formal_input.csv) should not be read anymore. This can be done either simply by erasing the code, or, as we do in the code below, by embedding it between /*…*/. This latter method allows us to keep the code without executing it (the color is then changed to green) and can be useful if we want to reactivate this code later.

figure e

We also do the same for the code sorting these files.

figure f

6.4.1 Domestic Migration and Rural to Urban Reclassification

The model for China is multistate, but not multiregional. Therefore, we need to switch off the domestic migration and the rural to urban shifts modules and, consequently, adjust the outputs. Again, we simply embed the code in /*…*/.

figure g

Because this code is not read anymore, the population file “pop_dm” is not created. We therefore need to change the population file read in the next module accordingly, in our case, the fertility module. The fertility module thus needs to start from the last population file created, which is now “pop_edu” rather than “pop_dm”.

figure h

We then embed similarly the reclassification of rural to urban areas module in order to switch it off.

figure i

Again, because the temporary population file “pop_reclass” is not created anymore, we modify the starting population file of the next module (updating the characteristics) with the last one (“pop_birth3”).

figure j

The fertility module used the variable “region_fert”, which was created in the domestic migration module so that the fertility risk would have the right exposure. Since the module is deactivated, the variable is not created and is not useful anymore. Indeed, the region of the mother cannot change during the period. Therefore, the region in which she gives birth does not need to be tracked anymore. However, replacing it everywhere it was used with the variable “region” would be more complicated than simply creating elsewhere a new variable “region_fert” that is a duplicate of region. We add this line at the beginning of the fertility module, in the section adjusting age for exposure.

figure k

Also, in the fertility module, newborns had the possibility of domestic migration, which was assessed by comparing the region of birth and the region of residence at the end of the period. We need to remove this line of code.

figure l

We then need to withdraw the count of domestic migration and rural reclassification in the section generating the outputs. For this purpose, again we simply embed the corresponding code between /*…*/.

figure m

The section cleaning the population file for the next period removes temporary variables created to track demographic events. As there is no more domestic mobility and no rural to urban transitions, variables tracking those events need to be removed from the code.

figure n

Finally, in the section merging the population count to the components of growth, we remove datasets for the count of these removed components, inflow, outflow, gain_urban and loss_rural.

figure o

6.4.2 Sector of Activity

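The model for China we are building also does not include the sector of activity. We thus switch off the module in a similar way. However, this needs to be done in two separate blocks, because the module contains a subtitle (“Formal–Informal event”) that is itself written between /*…*/: comments of this kind cannot be nested, so the closing */ of the subtitle would end the enclosing comment prematurely.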

figure p
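The sketch below illustrates why two blocks are needed. The data steps inside the comments are placeholders (the dataset name formal1 is hypothetical and the steps do not reproduce the actual module); what matters is that the subtitle comment sits between two separate /*…*/ blocks rather than inside a single one.

```sas
/* First block: switched-off code that precedes the subtitle.
data formal1;
  set pop_lfp2;
run;
*/

/* Formal–Informal event */

/* Second block: switched-off code that follows the subtitle.
data formal2;
  set formal1;
run;
*/
```

Had a single /* been opened before the first data step and closed after the last one, the comment would have ended at the */ of the subtitle, and the second data step would have been executed.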

Since the population file formal2 is not created anymore, we then change the starting population file in the next step to make it correspond to the last one created, pop_lfp2.

figure q

The section generating outputs also needs to be adjusted. As a reminder, the variable “formal” was used to calculate the active population by adding those working in the formal sector to those working in the informal one. In the code producing the population count output, we replace the variable “formal” with the variable “labour” and label categories accordingly (changes highlighted in yellow).

figure r

Finally, in the section merging the population count and components of growth, the calculation of the active population from the variables “formal” and “informal” can be removed, as this variable is already included in the dataset.

figure s

The simulation can then be run. However, we need to adjust in a similar way the section imputing the base population. We embed the code for the imputation of the sector of activity in /*…*/, since this variable is not projected.

figure t

Then, in the population count, we change the variable “formal” to the variable “labour” and set the appropriate labels for categories.

figure u

Finally, we remove the code calculating the variable “active” from the variables “formal” and “informal”.

figure v

6.5 Building a Deterministic Module in a Microsimulation Model for International Migration

All events in the modules we have shown so far have been modelled stochastically, with random experiments. For some modules, however, we may want to use a deterministic approach. This could be appropriate if we want to perfectly calibrate an event in a multistate cohort-component projection, or if the sample size for some subgroups of interest is too small (as there is no Monte Carlo error in the deterministic approach). A microsimulation model can incorporate this kind of modelling. We will show how to model an event deterministically, using international migration as an example.

In the examples we presented in Chaps. 3, 4 and 5, the population was closed: KC et al. (2018) assumed no international migration. However, other scenarios or projections might require the implementation of international migration modules. In this example, international migration is implemented through two new modules, emigration and immigration, which are modelled independently. In the projections of Lutz et al. (2018) that we replicate, international migration is implemented after mortality and education events, but before fertility. International immigrants are therefore not in the population that is at risk of dying or changing their level of education in the period in which they arrive, but they can have children.

Although the deterministic approach has the advantage of having no Monte Carlo error, and though it works quite well for events such as mortality and international migration, it can be very complex for other events, such as fertility. It requires the duplication of all women of reproductive age and the consequent adjustment of the weight of the new births, based on the fertility rate. In doing so, the number of rows in the dataset quickly becomes very large (all women aged 15–49 are duplicated in each step). The same issue occurs for interregional migration.

6.5.1 Emigration

If implemented stochastically, the emigration event can be modelled in the same way as mortality, by simply comparing the emigration rate to a random number, flagging emigrants, and deleting them from the dataset at the beginning of the next period. When implemented deterministically, no observations are deleted: only sample weights change.

The parameters file for the emigration includes emigration rates (emx) by age, sex and education, as shown in Fig. 6.4. For this scenario, emigration rates are thus constant throughout the projection, but this can be changed by simply adding a column “year” to the parameters file.

Fig. 6.4
figure 4

Screenshot of the parameters file emig.csv (opened with Excel)

The international migration event occurs after the education event and is therefore implemented right after the education module. It starts from the temporary population file pop_edu. We merge the parameter file param.emig to the population file pop_edu in the same way we did for events modelled with a stochastic approach.

figure w

In the deterministic approach, the number of people who don’t emigrate is determined by adjusting weights according to emigration rates. We store the weight at time t in a temporary variable “old_weight”. For those who survived the mortality module (death = 0), the new weight at time t + 5 (we attribute a new value to variable “weight”) is then calculated by multiplying the weight at time t (old_weight) by one minus the emigration rate (1 − emx). In a temporary variable “nb_emig”, we calculate the difference between weights at time t and t + 5 (old_weight − weight). When we generate outputs, the sum of this variable will give the number of emigrants for the period.

figure x
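The weight adjustment can be sketched on toy data as follows. The variable names emx, death, weight, old_weight and nb_emig come from the text; the toy values, the merge keys (agegr, sex, edu), the work dataset emig standing in for param.emig, and the output name pop_emig are assumptions made for illustration.

```sas
/* Toy population after the education module (illustrative values only). */
data pop_edu;
  input id agegr sex edu death weight;
  datalines;
1 20 1 3 0 100
2 20 1 3 1 100
3 25 2 4 0 200
;
run;

/* Toy emigration rates by age group, sex and education. */
data emig;
  input agegr sex edu emx;
  datalines;
20 1 3 0.01
25 2 4 0.02
;
run;

proc sort data=pop_edu; by agegr sex edu; run;
proc sort data=emig;    by agegr sex edu; run;

/* Deterministic emigration: nobody is deleted, only weights shrink. */
data pop_emig;
  merge pop_edu (in=in_pop) emig;
  by agegr sex edu;
  if in_pop;                          /* keep individuals only             */
  old_weight = weight;                /* weight at time t                  */
  if death = 0 then
    weight = old_weight * (1 - emx);  /* survivors who do not emigrate     */
  nb_emig = old_weight - weight;      /* emigrants represented by this row */
run;
```

In this toy case, the first individual keeps 99 of the 100 persons it represented and contributes 1 emigrant, the deceased individual is left untouched, and the third contributes 4 emigrants.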

6.5.2 Immigration

The international immigration module doesn’t require any statistical calculation. Immigrants are simply implemented by merging an immigration microdata file to the population file of the period. This immigration file should have exactly the same structure as the population file. However, we need to set assumptions for the number and composition of the immigrants. For this, we use the aggregated number of immigrants in China by age, sex, education and year, taken from Lutz et al. (2018), which are generated from international migration flows, taking into account changes in the composition of the world population. From this file, we create the immigration microdata file using the same code as the one used to create the base population (see Chap. 2), but we add a variable “immig = 1” to keep track of immigrants. The output CSV file is labelled “immig.csv” and is stored in the subfolder “parameter”. The complete code and the aggregated CSV file can be found in the subfolder “ImmigrationFile” of this chapter.

The immigration file, from which we select only immigrants of the period (year = &styr), can then be concatenated to the population file of the projection, right after the emigration module in our example.

figure y
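A minimal sketch of this concatenation on toy data is given below. The names styr, immig and pop_immig appear in the text; pop_emig (the population after emigration), the toy values and the use of work datasets instead of the permanent libraries are assumptions.

```sas
%let styr = 2015;  /* starting year of the current step; set by the loop in the full model */

/* Toy resident population after the emigration step. */
data pop_emig;
  input id agegr sex edu weight;
  year = &styr;
  datalines;
1 20 1 3 99
2 25 2 4 196
;
run;

/* Toy immigration microdata: same structure, plus the flag immig = 1. */
data immig;
  input id year agegr sex edu weight;
  immig = 1;
  datalines;
9001 2015 20 1 3 10
9002 2020 30 2 4 10
;
run;

/* Append only the immigrants of the current period. */
data pop_immig;
  set pop_emig
      immig (where=(year = &styr));
run;
```

Residents have a missing value for immig in the resulting file, so a condition such as "if immig ne 1" selects them and leaves out the immigrants of the period.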

Since we inserted new modules in the middle of the simulation, we need to change the population file on which the next module (fertility) is built, which is now pop_immig instead of pop_edu.

figure z

Finally, in the assumptions taken from Lutz et al. (2018), the age of immigrants is their age at the end of the period. We therefore need to exclude immigrants of the period (if immig ne 1) when we increment age in the time module.

figure aa

6.5.3 Adjusting the Exposure in the Fertility Module

As mentioned previously, the age at immigration is the age at the end of the period. Since immigrants are subject to the fertility event, some of them could give birth before aging. For the rest of the population, “age” corresponds to age at the beginning of the period. We therefore exclude immigrants (if immig ne 1) when determining the age at birth.

figure ab

For immigrants, age at birth could be their age at the beginning of the period. Accordingly, we adjust it randomly for half of them.

figure ac

Immigrants and emigrants are both exposed to the fertility event for only part of the period. Therefore, some other minor changes are required in the code of the fertility module. The variable “immig” included in the immigration file allows us to track immigrants of the period. Assuming they arrive in the middle of the period, we add a condition similar to the one used for deaths that allows only half of immigrants to be exposed to the fertility event.

figure ad

For emigrants, however, we don’t have a variable tracking them, since they are calculated by weight adjustments. We therefore need to add one line in the section for creating a newborn in order to transfer randomly either the old weight (before emigration) or the new weight (after emigration) of the mother to the baby.

figure ae

6.6 Adjusting Outputs and the Population File for the Next Period

Three temporary variables were used in the new emigration and immigration modules, “old_weight” (which was the weight of individuals before emigration), “nb_emig” (used for compiling the number of emigrants) and “immig” (which tracks immigrants). These variables need to be removed, since for the next period, immigrants should not be counted as new immigrants and should be exposed to the complete risk of fertility. We simply add these three variables to the section for cleaning the population for the next period, next to the other temporary variables to drop.

figure af

Finally, we want to include immigrants and emigrants when generating outputs of the components of growth. Since immigrants are tracked with the variable “immig”, we can use it with a proc freq to sum up the number of immigrants by age, sex and education, as we do for other components such as deaths and births.

figure ag

Emigrants don’t have a variable tracking them, since they are computed by weight adjustment. To calculate their total number, we need to sum up the variable nb_emig that was calculated in the emigration module by subtracting the weight at the end of the period from the weight at the beginning of the period. This can be done with the tabulate procedure. The output table is stored in a dataset “emigrants”. In this dataset, we rename the column containing the sum of emigrants (nb_emig_sum) to “emigrants” and drop the unused variables “_table_”, “_page_” and “_type_”. The statement “var” is used to identify continuous variables, such as nb_emig, while the statement “class” identifies the categorical variables (year, agegr, sex, edu and region). The table is then built to have the same structure as other datasets for components: we want the sum of emigrants by year, age, sex, education and region.

figure ah
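The tabulate step described above can be sketched as follows. The procedure, the class and var statements and the renamed and dropped variables follow the text; the input dataset name pop_emig and its toy values are assumptions.

```sas
/* Toy input: one row per individual, with the emigrants it represents. */
data pop_emig;
  input year agegr sex edu region $ nb_emig;
  datalines;
2015 20 1 3 China 1.0
2015 20 1 3 China 0.5
2015 25 2 4 China 4.0
;
run;

/* Sum nb_emig by year, age group, sex, education and region. */
proc tabulate data=pop_emig
              out=emigrants (drop=_table_ _page_ _type_
                             rename=(nb_emig_sum=emigrants));
  class year agegr sex edu region;   /* categorical variables */
  var nb_emig;                       /* continuous variable   */
  table year*agegr*sex*edu*region, nb_emig*sum;
run;
```

The dataset “emigrants” then holds one row per combination of the class variables, with the summed emigrants in the column “emigrants”.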

Finally, in the section merging population count and components of growth, we add a dataset for the number of immigrants and emigrants.

figure ai

6.7 Calibrating Simulation Outcomes

For different reasons, we might want to calibrate the projection outcomes on other projections or estimates, either for a specific year or for the entire time span. For instance, the base population of the model we built for India is for 2010, but since then, more recent population estimates by age and sex have been released. We therefore might want to calibrate our outcomes on the more recent estimates, at some point of the projection, such as 2015 or 2020.

A calibration at the end of each period on the outcome of a simpler cohort-component model can also be performed to make two models comparable, or to facilitate the implementation of broader general assumptions. For instance, since our assumptions in the fertility module are age-, region- (for India), and education-specific fertility rates, the total fertility rate (TFR) of India or China will depend on the composition of population, which itself depends on other demographic components of the projection. In other words, we do not know a priori exactly what the forecasted TFR will be, as we do in a cohort-component model. When calibrating on such a model, the underlying assumptions should not be interpreted in terms of their absolute values, but rather in terms of the differences among subgroups.

As an example of this calibration, we will calibrate in this section our microsimulation projections of India and China on the projection outcomes of the medium variant scenario of the World Population Prospects (United Nations, 2019) by age, sex and year. Doing this will increase the consistency between the demographic assumptions, as they will come from the same source, and this will therefore make their outcomes more comparable. Table 6.1 summarizes the broad assumptions.

Table 6.1 Summary of demographic assumptions for different projections

For India, demographic assumptions for the whole country are about the same in both projections. The main difference between the two projections is the inclusion of the education and sub-regions dimensions in KC et al. (2018). The difference in the net migration assumptions has a negligible impact on the outcomes, as the numbers (about −500 k per year) are marginal compared to the total population size of the country. Consequently, both projections yield very similar results in terms of population size and age-sex composition.

As for China, the difference between the SSP2 scenario of Lutz et al. (2018), which we used to build the microsimulation model, and the medium variant of the World Population Prospects is much more appreciable: the present and future total fertility rates are higher by about 0.3 children per woman in the latter. Detailed explanations concerning the different assumptions for fertility can be found in Basten et al. (2014). Calibrating our microsimulation on the outcomes of the World Population Prospects will thus indirectly and uniformly increase the age- and education-specific fertility rates that we took from Lutz et al. (2018), so that the number of 0–4 year olds at the end of each period matches the number that would be given by a TFR of around 1.7–1.8, as assumed by the World Population Prospects. Because the modelling of education is kept, along with the differentials in demographic behaviours by educational attainment, the calibration allows the model to merge the education component of Lutz et al. (2018) and the labour force dimension we added with the official projection of the United Nations.

In this section, we will explain the code for calibration implemented in the model of China. The code calibrating the model of India is however the same and can be found in the file “Chapter 6—India calibrated.sas”, located in the folder “Chapter6—India calibrated” in which other files necessary for the projection are also located.

The file calibration.csv, located in the subfolder “param”, contains the population size by age, sex and year from the World Population Prospects (United Nations, 2019). Figure 6.5 shows an excerpt of the file. The column “pop_wanted” represents the population that will be used for the calibration. At the end of each period, we want our outcomes to match these targets. We will proceed by adjusting the individual weights accordingly. Note that the oldest age group in this file is 100 (for 100+), while in our projection, it goes to 125. We will need to take this into consideration when we calculate the adjustment factors.

Fig. 6.5
figure 5

Screenshot of the parameter file calibration.csv (opened with Excel)

The first thing to do is to import this file and sort it. This is done the same way as we did for all other parameter files, with the macro import and the sort procedure in the section for importing files.

figure aj

The calibration module is implemented directly in the simulation loop, once all demographic events are completed, i.e. between the time module and the labour force participation module. It thus starts from the temporary population file pop2. First, we make the categories of the age group variable match those of the calibration file. All age groups above 100 are thus reassigned to 100, which now includes all of the population aged 100 and above. Before doing this, we store the old age group variable into a temporary variable agegr2 that will be used later.

figure ak

Using this redefined age group variable, we then produce an output of the population count by age and sex with the FREQ procedure. We also include the variable “year”, since we want the year to appear in the output in order to allow us to match it with the appropriate year of the calibration file. We use the list option to have all counts in a single column, and the options nocol, nopercent and norow to remove all percentages from the output. With the out option, we store the resulting output table in the work library under the name “simul”, from which we drop the column “percent”.

figure al

The column “count” in the table “work.simul” thus represents the population by age and sex at the end of the period, as simulated without calibration. We now need to calculate the adjustment factor for individual weights. We create the temporary file “factor” by merging work.simul with the calibration file (param.calibration) by year, sex and agegr. As a reminder, in the file param.calibration, the column “pop_wanted” represents the population we want. The adjustment factor is then calculated by dividing pop_wanted by count. So if we simulate 1000 individuals for a specific group, while we want 1100, the adjustment factor is 1.1, meaning that the weights of all individuals in this specific group need to be increased by 10% in order to get the target population. As “pop_wanted” and “count” will not be used anymore, we can drop them with the drop statement.

figure am

We can now adjust the weights in the population file “work.pop2”. After sorting it correctly, we merge it by sex and age group with the file “factor” we just created. The variable “weight” can then be adjusted by multiplying it by the variable “factor”. With the variable agegr2 we created before, we also reassign the detailed age group categories for the population aged 100 and above. Variables “factor” and “agegr2” will not be used after this point and are therefore dropped from the file.

figure an
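The four calibration steps described above (collapsing the oldest age groups, counting, computing the factor and adjusting the weights) can be sketched end to end on toy data. Dataset and variable names follow the text where it gives them (pop2, simul, factor, pop_wanted, count, agegr2); the toy values, the work dataset calibration standing in for param.calibration, and the WEIGHT statement in the FREQ procedure are assumptions, since the text does not say how weights enter the count.

```sas
/* Toy simulated population at the end of a step (illustrative values). */
data pop2;
  input id year agegr sex weight;
  datalines;
1 2020 95 1 400
2 2020 100 1 300
3 2020 105 1 200
4 2020 110 1 100
;
run;

/* Toy calibration targets; the oldest group, 100, stands for 100+. */
data calibration;
  input year sex agegr pop_wanted;
  datalines;
2020 1 95 440
2020 1 100 660
;
run;

/* 1. Keep the detailed age group, then collapse everything above 100. */
data pop2;
  set pop2;
  agegr2 = agegr;
  if agegr > 100 then agegr = 100;
run;

/* 2. Weighted population count by year, sex and age group (assumed weighting). */
proc freq data=pop2;
  tables year*sex*agegr / list nocol nopercent norow out=simul (drop=percent);
  weight weight;
run;

/* 3. Adjustment factor = wanted population / simulated population. */
proc sort data=calibration; by year sex agegr; run;

data factor;
  merge simul (in=in_sim) calibration;
  by year sex agegr;
  if in_sim;
  factor = pop_wanted / count;
  drop pop_wanted count;
run;

/* 4. Apply the factor and restore the detailed age groups. */
proc sort data=pop2; by year sex agegr; run;

data pop2;
  merge pop2 (in=in_pop) factor;
  by year sex agegr;
  if in_pop;
  weight = weight * factor;
  agegr = agegr2;
  drop factor agegr2;
run;
```

In this toy case both groups are simulated about 9% too small (400 against 440, and 600 against 660 for the 100+ group), so every weight is multiplied by 1.1, and the three individuals aged 100 and above recover their detailed age groups afterwards. A group present in the simulation but absent from the calibration file would receive a missing factor, which this sketch does not guard against.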

The microsimulation now calibrates the population by age and sex on the projections of the World Population Prospects at the end of each period. To get consistent trends in the final outputs, we also need to calibrate the base population. For this purpose, we use the exact same code as described above, but replace the population file “work.pop2” with “pop.pop_2015” (highlighted in yellow). This code is implemented right before the macro simul, after importing and sorting files.

figure ao

Since the calibration module readjusts the population size by age and sex at every step, the components of demographic growth (births, deaths, migrants, etc.) produced in the outputs are not accurate anymore. Interpretations of these components should thus be made with caution.

6.8 Overview of Results

In Fig. 6.6, we compare the final age pyramid by education from the microsimulation before and after calibration with the one from the medium variant from the World Population Prospects (United Nations, 2019) and from the scenario SSP2 from the multistate model of Lutz et al. (2018). As expected, when not calibrated, demographic components of the microsimulation approximately replicate those of Lutz et al. (2018). The total population starts slowly declining in 2025, from 1.4 billion to 0.8 billion in 2100. The population is then very old, with modal age groups between 75 and 89, but highly educated, with almost all people of working age having at least an upper secondary education and more than half having postsecondary education.

Fig. 6.6
figure 6

Comparison of the projected age pyramid in 2100 by education in China from multistate models and from microsimulation

The calibrated microsimulation yields the same age-sex structure as the World Population Prospects. Both yield a much younger age structure in 2100 and a larger population, as compared to the non-calibrated scenario (which replicates the demographic growth of Lutz et al. (2018)). These differences are due to the underlying assumptions and not to the modelling. Indeed, the difference is mainly explained by different fertility assumptions. In Lutz et al. (2018), the yearly TFRs range from 1.35 to 1.5 between 2020 and 2100, while they range between 1.7 and 1.8 in the medium variant of the World Population Prospects.

We did a similar calibration for the model of India (see folder Chapter6_India calibrated) in order to make the microsimulation model of India comparable to that of China. In Fig. 6.7, we compare results for China and India. By 2025–2030, India will become the world’s largest country in terms of both population size and working-age population (15–64), exceeding China. Both the total and working-age populations will be growing continuously in India, while they will start declining in China. This outcome is well known from other multistate or other cohort-component projections (Lutz et al. 2018; UN 2017). The new outcome of our microsimulation projection model is the labour force participation dimension. We saw in Chap. 4 that India is deprived of many potential workers in its working-age population due to the very low labour force participation of women, while gender gaps are much smaller in China. Consequently, as shown in Fig. 6.7, China will remain the world leader in terms of actual workers for several more decades, until 2060, despite having a smaller working-age population from 2030. Indeed, the number of workers in China will also be declining, from 840 M in 2015 to 629 M in 2060, but the gap in labour force size in 2015 (840 M vs 458 M) between both countries is much larger than the gap for the working-age population (1022 M vs 848 M). Without a change in labour force participation rates, much more time is thus required for India to reach the level of China in terms of number of workers. This shows again the high stakes of including labour force participation as a source of heterogeneity in population projections.

Fig. 6.7
figure 7

Projected total population size, working-age population (15–64) size, and labour force size, India and China, 2015–2060