3.1 Mortality Event

The parameter files for the mortality module are survival_a (for adult mortality) and survival_k (for child mortality). As a reminder, parameters for adults and children are split into two files because they do not have the same predictors. However, both files have the same structure. In this example, survival ratios (sx_a or sx_k) by year, age, sex, region and education (for adults) or education of the mother (for kids) are used for the modelling of mortality. The files are structured such that all parameters are in one single column, and each line corresponds to one possible combination of year, education category, age group, sex and region, as shown below in the image of the first lines of the dataset survival_a. The value 0.990324 for sx_a represents the probability of surviving the next 5 years (up to 2015) for a male (sex = 0) in the age group 15–19 (agegr = 15) with no education (edu = e1) living in a rural area of Andhra Pradesh (region = AD_rural) in 2010. Note that in the dataset survival_k, agegr = -5 corresponds to the survival ratios for children that will be born during the period. Those will be used in the fertility module (Fig. 3.1).

Fig. 3.1
figure 1

Screenshot of the parameter file survival_a.csv (opened with Excel)

We need to allocate to all individuals of the base population of 2010 their corresponding survival ratio. To do so, we first merge the parameter file survival_a with the population of 2010 in a temporary dataset (pop_survival1) located in the temporary library “work”. We specify the variables to match on with the “by” statement, in this case: region, agegr, edu, sex and year. The option “in = in1” followed by “if in1” is used to keep only observations from the population file (in other words, if a specific survival ratio has no match in the population file, the line is deleted).

figure a
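A minimal sketch of what this merge might look like, assuming the base population is stored as pop.pop_2010 and the parameter file as param.survival_a (the exact code is shown in the figure above):

/* Both files must be sorted by the merge keys beforehand */
proc sort data=pop.pop_2010;
  by region agegr edu sex year;
run;
proc sort data=param.survival_a;
  by region agegr edu sex year;
run;

data work.pop_survival1;
  merge pop.pop_2010 (in=in1) param.survival_a;
  by region agegr edu sex year;
  if in1; /* keep only lines that exist in the population file */
run;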

Similarly, after sorting pop_survival1, we then create a second temporary dataset, pop_survival2, in which we add survival ratios for kids (using eduM, the education of the mother, in the merge statement rather than edu). In the base population, we do not know the education of the mother. Thus, for the cohorts aged 0–14 in 2010, no differentials by mother’s education are implemented. Those differentials will only be applied to new cohorts generated throughout the projection.

figure b

Depending on the age of the individual (under 15 or not), we attribute the appropriate sx (either sx_a from survival_a, or sx_k from survival_k).

figure c
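A sketch of this attribution, assuming it happens in its own data step (it may equally sit inside the data step of the next figure):

data work.pop_survival2;
  set work.pop_survival2;
  if agegr < 15 then sx = sx_k; /* child survival ratio */
  else sx = sx_a;               /* adult survival ratio */
run;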

Now that each individual has their own survival ratio, the temporary dataset is ready to simulate the mortality event with a random experiment. For this, we compare sx to a random number (from 0 to 1) following a uniform distribution, generated with “rand(‘uniform’)”. If sx is lower than the random number, then the individual will die during the period (death = 1). At this point, we do not yet delete the individual from the dataset, because we still need to consider them in the exposed population of other events that will be modelled in further steps, such as fertility.

figure d
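The random experiment itself might look as follows (a sketch; the actual code is in the figure above):

data work.pop_survival2;
  set work.pop_survival2;
  death = 0;
  /* the individual dies when the survival ratio is below the uniform draw */
  if sx < rand('uniform') then death = 1;
run;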

Once we know who will survive and who will die during the period, we can then remove the survival ratios with the drop statement, since they will not be used anymore (different sx will be used for further periods).

figure e

3.2 Education Module

The education variable has 6 categories. Transitions in education may only occur in one direction (from a lower level to the next upper level) and only at certain ages. Children below the age of 15 are classified in a specific category in the outputs. The education variable thus starts at the age of 15. Before that age, the education of the mother is used in events such as mortality. The parameters for the education module are stored in the file education.csv, the structure of which is shown in Fig. 3.2.

Fig. 3.2
figure 2

Screenshot of the parameter file education.csv (opened with Excel)

Transitions vary by sex, age, region, and year. There are two types of parameters, and they start with the prefixes “e” and “pe”, respectively. Parameters starting with “e” are only applied to the population aged 10–14 at time t and correspond to educational attainment at age 15–19 (time t + 5). For instance, among males (sex = 0) who were aged 10–14 (agegr = 10) in 2010 and living in a rural part of Andhra Pradesh (region = AD_rural), 13.9% will have complete primary education (e3) as their highest level of education at age 15–19 in 2015. The category e1 is omitted from the parameters, as it corresponds to 1 minus the sum of e2 to e6.

The second type of parameter (pe) consists of the probabilities of moving from level e3 to e4 (pe4), from e4 to e5 (pe5) and from e5 to e6 (pe6) between t and t + 5. These parameters are applied from the age groups 15–19 to 25–29 at time t (so 20–24 to 30–34 at time t + 5). In our example, a man in the age group 15–19 living in a rural part of Andhra Pradesh and having an upper secondary education in 2010 has a 36.1183% chance of completing postsecondary education by 2015.

There are no transitions between the other education categories, such as between incomplete primary (e2) and complete primary (e3): at the age of 15–19, most individuals who do not have at least complete primary education (e3) are already out of the education system, and their current educational attainment is the one they will keep for the rest of their lives.

The first step in the code of the education module is the merging of the parameter file (education) with the last population file resulting from the mortality module (work.pop_survival2) by the relevant variables (in our example, region, agegr, sex and year) into a new temporary population file (pop_edu) in which the education event will occur.

figure f

We then generate a uniform random variable that we store in the temporary variable “a”, and we create another temporary variable, “edu_new”, whose initial value is the same as the current level of education (edu). We then simulate changes in education. First, by comparing the random variable “a” to the parameters e2 to e6, we assign the educational attainment at age 15–19 at time t + 5 (that is, to the population in the age group 10–14 at time t). Since the education variable has many categories, we need to use cumulative probabilities in order to have only one alternative for each value of “a”. The default value of edu_new is “e1”, so if the random variable “a” is higher than the cumulative proportions of “e2” to “e6”, the category “e1” is kept.

figure g
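A sketch of this step, assuming everything happens in one data step (the exact code is in the figure above). Because successive if conditions overwrite edu_new, the last true comparison wins and each value of “a” maps to exactly one category:

data work.pop_edu;
  set work.pop_edu;
  a = rand('uniform');
  edu_new = edu; /* by default, education does not change */
  if agegr = 10 then do;
    edu_new = 'e1';
    if a < e2 + e3 + e4 + e5 + e6 then edu_new = 'e2';
    if a < e3 + e4 + e5 + e6 then edu_new = 'e3';
    if a < e4 + e5 + e6 then edu_new = 'e4';
    if a < e5 + e6 then edu_new = 'e5';
    if a < e6 then edu_new = 'e6';
  end;
run;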

For the population aged 15–19 to 25–29, we then simulate the transitions from category “e3” to “e4”, from “e4” to “e5” and from “e5” to “e6” using the parameters “pe4”, “pe5” and “pe6”, respectively, which we compare with the random variable “a”.

figure h
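These transitions, sketched in the same fashion (in the original code they likely sit in the same data step as the previous block, reusing the random variable “a”):

data work.pop_edu;
  set work.pop_edu;
  if 15 <= agegr <= 25 then do;
    if edu = 'e3' and a < pe4 then edu_new = 'e4';
    if edu = 'e4' and a < pe5 then edu_new = 'e5';
    if edu = 'e5' and a < pe6 then edu_new = 'e6';
  end;
run;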

The projection uses fertility rates by education as an input. To obtain the proper exposure for those rates, we create a variable “edu_fert” that takes, for half of the population, the education at the beginning of the period (at this point, still the variable edu) and, for the other half, the education at the end of the period (the variable edu_new).

figure i
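A sketch of this random split (edu_fert inherits edu for one random half of the population and edu_new for the other):

data work.pop_edu;
  set work.pop_edu;
  if rand('uniform') < 0.5 then edu_fert = edu;
  else edu_fert = edu_new;
run;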

Finally, we replace the value of edu with the value of edu_new. The variable edu is now the education level that will be reached at t + 5. We drop the parameters (“e2–e6” and “pe4–pe6”) and the temporary variables (“a” and “edu_new”).

figure j

3.3 Domestic Migration Module

Interregional (or domestic) migration can be modelled in different ways. In our example, we use rates in an origin–destination (O–D) matrix. For the other dimensions modelled up to now, the number of states was small and few transitions were possible. In the mortality module, there are only two states and only one transition is possible, from alive to dead. In the education module, there are 6 states and transitions are only possible in one direction (from a lower level to the next upper level). In the case of domestic mobility, we have in this example 70 regions, and it is possible for an individual to move to any of them at any stage of life.

The original parameters from the multistate projection are an O–D matrix in which each possible combination of age, sex, education, region of origin and region of destination has its own rate (Fig. 3.3). The probability that a male (sex = 0) between 30 and 34 years of age (agegr = 30) with no education (edu = e1) living in an urban area of Andaman & Nicobar Islands (region = AN_urban) will move to an urban area of Andhra Pradesh (AD_urban) is thus 0.7042%. Values in the diagonal are very high, as they represent the probability of staying in the origin region (89.6198% in this example). The matrix includes the education dimension, but in the reference scenario no differentials have been implemented. This dimension might however be used in a further extension of the assumptions.

Fig. 3.3
figure 3

Screenshot of the origin–destination file used to create the parameter file dom_mig (opened with Excel)

In the case of a multistate event such as this one, we need to rearrange the file in order to have cumulative probabilities for the different alternatives, as shown in Fig. 3.4. Otherwise, if we used the same type of structure as in the mortality module, a random number might fall below the rates of more than one destination region. For the case presented above, the cumulative probability corresponding to the mobility from AN_urban to AD_urban would thus be 90.324%, that is, the probability of staying in AN_urban (89.6198%) plus the probability of moving to AD_urban (0.7042%). For all cases, the value of the last option is thus 1. Once the file is structured in this way, only one region corresponds to any random number between 0 and 1.

Fig. 3.4
figure 4

Screenshot of the parameter file dom_mig.csv (opened with Excel)

For the code, in a new population file (pop_dm), we merge the parameter file (dom_mig) to the last population file (pop_edu). Each individual will therefore have a (cumulative) probability of staying in the origin region and a (cumulative) probability of moving to any other region.

figure k

We store the origin region in a new variable, “oldregion”, which will be used in subsequent steps. We generate a random number that we store in a temporary variable “a”. The random number needs to be stored because the same number must be used in different places when modelling the event. We then create an array called prob that lists the regions (the asterisk {*} is used to specify that the number of elements in the array is the number of arguments that follow).

figure l

Using the do statement on all elements stated in the array prob, we compare the random number “a” to the cumulative probability of every destination region i. When “a” is lower than the cumulative probability of region i but higher than that of the previous region (i−1), the name of the corresponding element in the array (which we retrieve with the vname function) is assigned to the temporary variable “newregion”. For the first region in the array (i = 1, AN_urban in our example), we just compare the cumulative probability to the random number a (i−1 doesn’t exist). Thus, for any random number generated, only one option is possible.

figure m
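A sketch combining the two previous paragraphs (the array declaration and the do loop). The length of newregion is an assumption, and the positional list AN_urban--WB_rural assumes the 70 destination variables are stored consecutively in the dataset; the figures above contain the actual code:

data work.pop_dm;
  set work.pop_dm;
  length newregion $ 12;
  oldregion = region;          /* keep the origin region */
  a = rand('uniform');         /* one draw reused for all comparisons */
  /* cumulative probabilities, one variable per destination region */
  array prob {*} AN_urban--WB_rural;
  do i = 1 to dim(prob);
    if i = 1 then do;
      if a < prob{i} then newregion = vname(prob{i});
    end;
    else if prob{i-1} <= a < prob{i} then newregion = vname(prob{i});
  end;
run;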

Once this is done, the variable “region” corresponds to the region of residence at time t, while the variable newregion corresponds to the region at time t + 5. As we did for the education module, we need to create a special variable in order to have the proper exposed population in the fertility module. We can assume that half of the migrants are exposed to the fertility rates of the destination, while the other half are exposed to those of the origin. We therefore create a variable “region_fert” that will be used as the region of exposure in the fertility module.

figure n

When the variables region and newregion are the same, the individual did not move. In order to count the number of migrants, we create a variable “dom_mig” that takes the value of 1 when “newregion” and “region” are different. We then replace the value of the variable “region” with the new region. The number of emigrants by region can therefore be obtained with a crosstable of dom_mig and oldregion, while the number of immigrants can be obtained with a crosstable of “dom_mig” and “region”. Note that in our example, only the population that survives until t + 5 (death = 0) can migrate, because the data used to estimate mobility come from the census question on previous residence, which by definition is asked of the surviving population only.

figure o
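A sketch of the flagging and the region update (only survivors move, as noted above), followed by the crosstables:

data work.pop_dm;
  set work.pop_dm;
  dom_mig = 0;
  if death = 0 and newregion ne region then dom_mig = 1;
  if death = 0 then region = newregion;
run;

/* emigrants by origin region, immigrants by destination region */
proc freq data=work.pop_dm;
  table dom_mig*oldregion dom_mig*region;
run;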

Finally, we drop the temporary variables (i, a and newregion) and the parameter variables (AN_urban–WB_rural), because they will not be used anymore.

figure p

In our example, rates are constant throughout the projection. To make them evolve, we can simply add a column “year” to the dom_mig file, fill in the rates for the future, and add “year” to the merging variables.

3.4 Fertility Module

The fertility module is a bit more complex since, when the event happens, a new individual (having its own characteristics) must be generated and added to the population file. It also needs to take into account that this new individual might not survive between birth and time t + 5.

Over a 5-year period, a cohort aged X at time t will be exposed to the fertility risks of age groups X and X + 5. The Lexis diagram below (Fig. 3.5) shows an example of the exposure between t and t + 5 for the cohort aged 15–19 at time t (the red lozenge). The cohort has half of its risk in the age group 15–19 (in orange) and half in 20–24 (in light blue). The rates for the age group 20–24 (green square) are thus applied to two cohorts.

Fig. 3.5
figure 5

Lexis diagram

To expose the population to the appropriate fertility rates, in a new temporary population file (pop_birth) created from the last population file (pop_dm), we create a new variable, “agegr_fert”, which corresponds to the age used for fertility. We assume half of the population will change age group before the middle of the period. Therefore, agegr_fert is the same as the age group for half of the population, while for the other half, it corresponds to the age group + 5. The split is done randomly by comparing a random number distributed uniformly between 0 and 1, generated with rand(‘uniform’), to 0.5.

figure q

Similar exposure-adjusted variables were created for education and the region of residence in their respective modules. After sorting the population file properly, in a new population file (pop_birth1), we then add the fertility rates from the parameter file param.fertility. Note that the link variables used are specific to the fertility module: agegr_fert, region_fert and edu_fert. Then, in a similar way, in a new population file, pop_birth2, we add the parameter file for the sex ratio at birth by region and year (param.srb).

figure r

We can then proceed to the fertility event. First, we reset the variable identifying the mothers of young kids, since children born during the previous period will change age group. The event is applied to surviving individuals (death = 0) and to half of the deaths, selected randomly, as we assume that deaths occur on average in the middle of the period. Then, using the Monte Carlo method, we simulate the fertility event: when a uniform random number between 0 and 1 is lower than the fertility rate, the individual gives birth during the period (identified with the variable young_kid).

figure s
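A sketch of this event; the name of the fertility-rate variable (fert) and the output dataset (pop_birth3) are assumptions, and sex = 1 denotes women, as in the rest of the model:

data work.pop_birth3;
  set work.pop_birth2;
  young_kid = 0; /* reset the flag inherited from the previous period */
  /* exposure: women surviving the period plus a random half of the deaths */
  if sex = 1 and (death = 0 or rand('uniform') < 0.5) then do;
    if rand('uniform') < fert then young_kid = 1; /* gives birth */
  end;
run;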

Once a woman gives birth, we generate a new individual to add to the population: her baby. For this purpose, we use the output statement, which duplicates the line of the mother. We then have to determine the characteristics of the baby. A flag variable birth is first created (birth = 1), in order to compile the number of births in a further step. We also create a variable stating the age of the mother, which can later be used in outputs. We assign the age group “−5” to the baby (as a reminder, −5 stands for those who are born between t and t + 5, while 0 is for those aged 0–4 at time t). We also define the year of birth in the variable “cohort” (2010 would thus stand for babies born between 2010 and 2014).

Then, with another random experiment, we determine whether the baby is a boy or a girl using the sex ratio at birth (SRB) parameter (the variable srb, which comes from the file srb.csv). The srb is expressed in this case as the number of baby girls per 1000 baby boys. Although this ratio is roughly the same in most countries (about 953 girls per 1000 boys), it is much lower in India because sex-selective abortions are very common in some regions (Retherford and Roy 2003). Thus, the SRB has region- and year-specific values, allowing us to build assumptions for its future evolution.

figure t
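A sketch of the baby’s creation, bringing together this paragraph and the two that follow (region of birth and education of the mother). The names age_moth and region_birth follow the text; the exact variables copied into them are assumptions, and the girl/boy draw uses the probability srb/(1000 + srb) implied by the definition of srb above:

data work.pop_birth3;
  set work.pop_birth3;
  output;                        /* write the mother's record */
  if young_kid = 1 then do;      /* duplicate the line for the baby */
    birth = 1;                   /* flag used later to count births */
    age_moth = agegr_fert;       /* age of the mother, kept for outputs */
    agegr = -5;                  /* born between t and t+5 */
    cohort = year;               /* year of birth */
    eduM = edu;                  /* education of the mother */
    edu = 'e1';                  /* the baby starts with no education */
    region_birth = region_fert;  /* region of birth = region of fertility exposure */
    if region_birth ne region then dom_mig = 1;
    /* girl with probability srb/(1000+srb), srb = girls per 1000 boys */
    if rand('uniform') < srb / (1000 + srb) then sex = 1;
    else sex = 0;
    death = 0;                   /* the baby's survival is simulated below */
    output;                      /* write the baby's record */
  end;
run;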

Since the line for the baby is a duplication of the line of the mother, the sample weight of the mother is automatically transferred to the baby. The region is also transferred but, as a reminder, since the domestic mobility module occurs beforehand, it now corresponds to the region at the end of the period and not the region at birth. The region at birth is the variable used for the exposure to fertility, region_fert, which we store in another variable, “region_birth”, to be used when generating the outputs of the components of growth. When this region of birth differs from the region of residence at the end of the period, the baby is flagged as a domestic migrant (dom_mig = 1).

We also keep track of the education of the mother in the variable eduM (which is used for the mortality until the age of 15), and set the education of the baby to e1 (no education).

The projection uses 5-year steps, so it is possible to have a fertility rate higher than 1 for a specific group. In such cases, the rate will always be higher than the random number generated for the fertility event, and some births would be missed, since such women have on average more than one birth over the period. A simple way to get the correct number of births in those cases is to adjust the population weight of the baby by multiplying it by the fertility rate (as would be done in a deterministic microsimulation; see Chap. 6). In the case of a woman with a weight of 1000 and a fertility rate of 1.2, the weight of the newborn would thus be 1200.

The last step of the fertility module is to simulate the survival of the baby between birth and the end of the period (t + 5). For this purpose, we again use the parameter file containing survival ratios for kids (survival_k). We merge it with our last population file (pop_birth2), by region, age, sex, education of the mother and year. We then simulate the mortality event for babies born between t and t + 5 (identified by agegr = −5), the same way we did in the mortality module for the population living at time t. Because the mortality event occurs before migration, the region at death is the one where the baby is born (region = region_birth). Finally, since we don’t need them anymore, we drop the temporary variables and parameters.

figure u

3.5 Reclassification of Rural to Urban Areas

With the rapid urbanisation of India, cities are spreading and absorbing some rural areas. Thus, some people’s environment can switch from rural to urban without migration. The multistate model provides rates of reclassification of rural areas by region, age, sex and education. Those rates are included in the parameter file “urbanisation”, in the column “ur”, as illustrated in Fig. 3.6. Rates are region-specific and constant throughout the projection. The number 0.040471 for the region AD_rural means that about 4% of the population living in a rural area of Andhra Pradesh is reclassified as living in an urban area every 5 years. Note that some rates are very high, such as the one for Dadra & Nagar Haveli (DN_rural), which is almost 67%. Regions with such high rates have small areas and rapid urbanisation. Consequently, the rural areas of such regions are expected to disappear within the next few decades.

Fig. 3.6
figure 6

Screenshot of the parameter file ur.csv (opened with Excel)

The reclassification event occurs after all other demographic events. After sorting the last population file (pop_birth3) by region, we merge it with the parameters file that includes the reclassification rates (param.ur) in a new temporary population file called pop_reclass. Each individual living in a rural area now has a probability of being reclassified as living in an urban area.

For the coding of the event, we first store the old region in another variable (oldregion2). This variable will be used later when generating the outputs of the projection. The event only applies to those who are still alive at this point of the projection (death = 0) and those living in a rural area, whom we can select using the substr function. The arguments of this function mean that we check whether the 5 characters starting at the 4th character of the variable “region” form the string “rural”.

figure v

Once the appropriate population is selected, the event is modelled with the Monte Carlo method, comparing a random number to the reclassification rate (ur). When the event happens, the function tranwrd is used to change the value of the region variable from rural to urban. The first argument of this function selects the variable to modify (region); the second identifies the string of characters to replace (‘rural’); and the third is the replacement string (‘urban’). We keep track of reclassified individuals in a new variable labelled “reclass”, which will be used when generating outputs. Finally, we drop the reclassification rates (ur) from the dataset, as they won’t be used anymore. A sketch of the whole event as described in the last two paragraphs:

data work.pop_reclass;
  set work.pop_reclass;
  oldregion2 = region;  /* region before reclassification, kept for outputs */
  reclass = 0;
  if death = 0 and substr(region, 4, 5) = 'rural' then do;
    if rand('uniform') < ur then do;
      region = tranwrd(region, 'rural', 'urban');
      reclass = 1;
    end;
  end;
  drop ur;
run;
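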

3.6 Preparing the Population File for the Next Step

We have simulated the changes in population characteristics for the first 5 years of the projection. All demographic events are completed. We now have to prepare the population file for the next step. First, we make our population age and we increment the year in a small “time module”. Starting from our last population file (pop_reclass), we create a new one (pop2) in which the year is increased by 5, as is the age of survivors (death = 0). For those who died between t and t + 5 (except babies born during the period), we assume that only half died before their birthday. This last assumption will only be useful for outputs related to the age distribution of deaths.

figure w
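A sketch of this small time module:

data work.pop2;
  set work.pop_reclass;
  year = year + 5;
  if death = 0 then agegr = agegr + 5;  /* survivors age by 5 years */
  /* a random half of the deaths (babies excepted) reached the next age group */
  else if agegr ne -5 and rand('uniform') < 0.5 then agegr = agegr + 5;
run;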

The population file pop2 represents the population at the end of the period, but it still includes the deaths, which are flagged with the variable death = 1. For the next period of the projection, those individuals need to be removed. From pop2, we create a new population file, “pop_2015”, which is the surviving population in 2015. This file is stored in the library pop because it is the final projected population for 2015. We delete the deaths from the dataset and we drop the variables that were used as flags for births, migrants and individuals reclassified from rural to urban areas.

figure x

3.7 Generating Outputs

The population file pop_2015 is the dataset corresponding to the surviving population in 2015. The file pop_reclass is the last file containing a record of demographic events (with variables birth, death, dom_mig and reclass). From these two files, we will generate projection outputs. Variables included in this step of the model may vary according to the needs of the user. The code for generating outputs has no impact on the projection.

First, from the population file pop_2015, we create a flat file, outputpop, that aggregates the population by sex, region, age group and education. The option noprint is used because the output is stored in a separate dataset (out=) and we don’t need to display the table in SAS. The option list is used to create one single table in which each column corresponds to a variable and each row is a specific combination of categories. The options norow, nocol, nopercent and nocum are stated in order to keep only frequencies in the table and to remove percentages and sums. In the output file, the column “count” is renamed “pop”, as it represents the population size.

figure y
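A sketch of this aggregation, assuming the sample-weight variable is called wgt (the exact statements appear in the figure above):

proc freq data=pop.pop_2015 noprint;
  weight wgt; /* assumed name of the sample-weight variable */
  table sex*region*agegr*edu / list norow nocol nopercent nocum
        out=work.outputpop (rename=(count=pop));
run;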

An excerpt of the resulting file is shown in Fig. 3.7. Remember that since microsimulation is based on a stochastic process using random experiments, results may differ slightly on each run.

Fig. 3.7
figure 7

Screenshot of the results file outputpop (opened with SAS)

In our example, we also want to generate outputs for the components of growth. We use the temporary file pop2 for that purpose, since it is the last one that keeps track of demographic events, including births, deaths, domestic migration (both inflows and outflows), and individuals reclassified from rural to urban areas. We use the FREQ procedure to create frequency tables for each component.

For the birth event, we select births (birth = 1) with the where statement. In the table statement, we split births by age of the mother, education of the mother, sex, and region of birth, and in the options, we specify that we want to keep only frequencies. We also rename the variables for the education of the mother (eduM = edu), the age of the mother (age_moth = age) and the region of birth (region_birth = region), because we will later merge this file with the population count. The resulting table is stored in the dataset births in the temporary library work. An excerpt of the table is shown in Fig. 3.8.

figure al
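A sketch of this table, with the renames described above (the renaming of the count column to births and the weight variable wgt are assumptions):

proc freq data=work.pop2 noprint;
  where birth = 1;
  weight wgt;
  table age_moth*eduM*sex*region_birth / list norow nocol nopercent nocum
        out=work.births (rename=(count=births age_moth=age
                                 eduM=edu region_birth=region));
run;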
Fig. 3.8
figure 8

Screenshot of the results file births (opened with SAS)

Using the flag variables identifying deaths, migrants and reclassified individuals, we generate similar frequency tables for deaths, inflows, outflows, and changes from rural to urban areas, though we replace the education and age of the mother with the education and age of the individual (and remove the rename statement accordingly). For outflows and individuals reclassified from rural to urban areas, we need to state the former region of residence (oldregion and oldregion2, respectively) rather than the current one, and rename the variable in the dataset so that it can later be merged with the other datasets. For reclassified individuals, we also need to exclude domestic migrants (dom_mig ne 1) from the tables in order to avoid double counts. For both rural-to-urban reclassification and domestic migration, because they are implemented after the mortality event, we also add death = 0 in order to avoid counting a baby that died between birth and the end of the period (the rest of the deceased population was already excluded in the modelling of the event).

figure z

We now have seven output datasets: one for the population count and six for the components of growth. We will now merge them into a single dataset. We first need to add a column “sex” taking the value of 1 (female) for everyone in the file birth2015, because we want births to be matched with the female population only.

figure aa

We can now merge all those files into a new dataset called “output2015”, stored in the “results” library. Using an array called “change” that includes all numeric variables (selected with _numeric_), we switch missing values to 0 and remove decimals with the round function.

figure ab
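A sketch of this final merge; the component dataset names besides outputpop are assumptions, and each file is assumed to be sorted by the merge keys:

data results.output2015;
  merge work.outputpop work.births2015 work.deaths2015
        work.inflows2015 work.outflows2015 work.reclass2015;
  by sex region agegr edu;
  array change {*} _numeric_;  /* all numeric columns of the merged file */
  do i = 1 to dim(change);
    if change{i} = . then change{i} = 0;  /* no event means 0, not missing */
    change{i} = round(change{i});         /* remove decimals */
  end;
  drop i;
run;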

As shown in Fig. 3.9, the resulting file output2015 now includes the population size by age, education, sex and region, as well as the components of growth for every specific sub-group.

Fig. 3.9
figure 9

Screenshot of the file output2015 (opened with SAS)

3.8 Cleaning the Workspace

During the simulation of events, we created many temporary datasets. Before moving on to the next period, we need to remove them from memory. The code below is used to delete all datasets stored in the work library.

figure ac
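A sketch of this cleaning step (the kill option deletes every dataset in the work library):

proc datasets library=work kill nolist;
run;
quit;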

3.9 Simulating for Next Periods

Up to now, we have simulated the population for one step, from time t to t + 5 (2010 to 2015 in our example). We started from a base population file (pop_2010) and ended up with another population file having the same structure for 2015 (pop_2015). This file is now the starting point for the next step, from 2015 to 2020. We could just reuse the same code, reading pop_2015 rather than pop_2010 in the first module (mortality), changing the name of the final population file in the module preparing the population for the next step (pop_2020 rather than pop_2015) and changing the years in the names of the output files.

Doing this, though easy, would require repeating many lines of code. For this kind of situation, where code is repeated with only small changes, SAS allows the user to create a macro that stores the repetitive code in a function in which the elements to change are identified as parameters. To run the code, we then call the macro and specify the appropriate parameters.

To create a macro, we need to insert the repeated code between the statements %macro name and %mend name. Parameters are declared in parentheses after the name of the macro and are referenced in the code by the same label with the prefix &.

In our example, we will create a macro called microsim with the parameters styr (for starting year) and endyr (for ending year). The code to repeat starts by sorting the initial population file. We thus declare the macro just before this code with %macro microsim(styr,endyr). Then, throughout the code, we change the labels that refer to the starting year (2010) to &styr and those that refer to the ending year (2015) to &endyr. We highlight them in yellow in the code below.

figure ad

We close the macro function with the statement %mend microsim right after the last line of the code that needs to be repeated, which is the procedure for cleaning temporary work files.

figure ae
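The overall skeleton of the macro would thus look like this (only the structure is shown; the full body is the code of all the modules above):

%macro microsim(styr, endyr);

  proc sort data=pop.pop_&styr;
    by region agegr edu sex year;
  run;

  /* ... mortality, education, migration, fertility, reclassification,
     time module and outputs, with &styr in place of 2010
     and &endyr in place of 2015 ... */

  proc datasets library=work kill nolist;
  run;
  quit;

%mend microsim;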

To run the microsimulation up to 2060, we could call the macro function microsim for every step from 2010 to 2060, for example:

figure af

Alternatively, since we repeat the macro while increasing the parameters by 5 at every step, we can embed a loop in a second macro (called loop) that automates the process. The parameters of this loop macro are the initial years of the first and the last steps of the projection. When calling the macro with 2010 and 2055 as parameters, the projection will run until 2060 (2055 being the initial year of the last step, which thus ends in 2060).

figure ag
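A sketch of this loop macro; the %do statement increments the starting year by 5 and %eval computes the matching ending year:

%macro loop(start, end);
  %do yr = &start %to &end %by 5;
    %microsim(&yr, %eval(&yr + 5));
  %end;
%mend loop;

/* runs every step from 2010-2015 to 2055-2060 */
%loop(2010, 2055);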

The output of each year is then stored as a dataset in SAS format (.sas7bdat) in the results library, as shown in Fig. 3.10.

Fig. 3.10
figure 10

Screenshot of files included in the results library (opened with SAS)

To facilitate the analysis of projection outcomes, we can concatenate them and export the resulting file to CSV. Because we want the initial population of 2010 to be included in the final dataset of results, we first need to create an output for the starting year (2010) having the same format as the outputs for projected years. We thus use the same code, but with “2010” instead of “&styr”.

figure ah

In a data step, we then concatenate all files starting with “outputpop” located in the library results by using the symbol “:” instead of the year in the dataset name. The resulting file, “outputTotal”, thus contains the population outcome for all years. Values for the components of growth refer to the period t − 5 to t, so they are empty for 2010, the starting year of the population.

figure ai
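A sketch of this concatenation (the colon after the prefix expands to every dataset in the results library whose name starts with “outputpop”):

data results.outputTotal;
  set results.outputpop:;
run;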

Using the export procedure, we can then export the dataset to a CSV file located in the output folder. Personalised tables can then easily be generated with pivot tables in Excel.

figure aj
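A sketch of the export step; the path to the output folder is hypothetical:

proc export data=results.outputTotal
  outfile="C:\microsim\output\outputTotal.csv"
  dbms=csv replace;
run;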

3.10 Validation of Results

The validation of a new model is a necessary step that must be taken before using its outcomes for purposes of analysis. As with any population projection model, the aim of the microsimulation model we built is not to predict what will happen, but rather to project the population under conditional assumptions. In our case, those assumptions are taken from another projection model. Therefore, the microsimulation model should reproduce what occurs with that multistate model. In other words, our validation needs to confirm that the demographic events are modelled correctly and that there is no error in the code. For this purpose, an external validation by error analysis is sufficient, as it is the most common procedure used to validate population projections (Grummer-Strawn and Espenshade 1991; Smith et al. 2002). If we had built our own assumptions, more sophisticated validation and sensitivity analyses might have been required. For specific examples, see National Research Council (1991) or Caswell and Sánchez Gassen (2015).

We compared our outcomes with those we want to reproduce (the multistate model of KC et al. (2018)). In order to see whether there was a systematic bias in the microsimulation model compared to the multistate model, we calculated the mean error and the mean absolute error in the projected population size in 2060 (Table 3.1). We disaggregated the population into the smallest possible comparable subgroups, these being each age-, sex-, education- and region-specific group. We also split the results according to the population size of the subgroup. Remember that since the microsimulation is stochastic, results may differ slightly between runs, and consequently, the difference from the multistate model results may also change slightly. We can see that the accuracy of the projection is relatively good, with a mean relative error of 1%. Because of the stochastic nature of the microsimulation, the smaller the population, the higher the error. Thus, although the mean absolute error for subgroups with a population size between 0 and 10,000 is much higher and reaches 4%, it corresponds to a gap of only −14 individuals on average (mean error).

Table 3.1 Error between the multistate and the microsimulation model

In Fig. 3.11 we compare the projected population by level of education between 2010 and 2060 from our microsimulation model with the outcome of the multistate model. The microsimulation leads to very similar results, with a sharp increase in the population between 2010 and 2060, from 1.2 billion to almost 1.8 billion. Most of the increase is projected to occur in the population with an upper secondary or postsecondary level of education, while the population with no education is projected to decline sharply.

Fig. 3.11
figure 11

Comparison of projected population size of India by educational attainment from multistate model and microsimulation, 2010–2060

In Fig. 3.12, we compare the projected age pyramid by education level in 2060 from each model. In Fig. 3.13, we show the population size by region in 2060. Again, we can observe that the microsimulation produces results similar to those of the multistate projection, though the discrepancy becomes larger for regions with smaller populations, given the Monte Carlo error.

Fig. 3.12
figure 12

Comparison of the projected age pyramid in 2060 by education in India from multistate model and microsimulation

Fig. 3.13
figure 13

Comparison of the projected population size in 2060 by region, India, from multistate model and microsimulation

Overall, the microsimulation replicates the multistate model outcomes quite well for broad aggregations, such as the population by age, sex and education, and for subgroups with relatively large populations. However, for more specific subgroups with very small populations, for instance men aged 95–99 with postsecondary education in an urban area of Dadra & Nagar Haveli, results may differ dramatically (183 in the multistate model vs. 0 in our microsimulation run). If we are interested in analysing outputs for those small subgroups, we could improve the accuracy of the results by increasing the sample size of the base population. Alternatively, we could implement some events, such as mortality, using a deterministic approach rather than a stochastic one (see Chap. 6 for details).