4.1 Adjusting the Workspace for the Addition of New Dimensions

In Chap. 3, we replicated in a microsimulation framework what a standard multistate model can do. In this chapter, we increase complexity by adding two dimensions: labour force participation and the sector of activity. While these variables can be derived from the outcome of a multistate model (e.g. by applying predefined participation rates to the resulting population), the microsimulation can implement them dynamically. In the example we present in this chapter, labour force participation and the sector of activity are calculated using parameters from statistical models. Predictors include age/cohort, sex, region of residence, education and a binary variable stating whether the individual is a woman who gave birth within the last 5 years. As this last predictor suggests, fertility assumptions have an impact on the labour force outcome.

The modules for labour force participation and the sector of activity reassess the individual outcome of these variables at the end of each period using personal characteristics as determinants. Because of the limits of the data available for the statistical models, these modules do not take into account the past labour force participation and sector of activity of the individual. In other words, what is modelled is the probability of being in the labour force rather than the probability of entering or leaving the labour market, or the probability of changing the sector of activity. Consequently, the model can project reliable cross-sectional values, but it does not allow for longitudinal analysis, as individual life courses may be inconsistent.

The code file “Chapter 4 – Adding new dimensions.sas” includes the complete code of the microsimulation with the two additional dimensions explained in this chapter. Below, we explain the differences from the file used in Chap. 3, which replicated a multistate model. First, we change the name of the scenario. In the support documents provided with this book (Chapter ESM), all necessary files can be found in the folder “chapter4”.

figure a

The parameter files for modules for labour force participation and the sector of activity were already imported in Chap. 2 (with the macro function import and the sort procedure). As a reminder, the code lines for this purpose were:

figure b

4.2 Labour Force Participation Module

Labour force participation rates (P) are estimated from a logit regression model. Logit models can be estimated with SAS using the LOGISTIC procedure. When modelling a binary variable such as labour force participation, logit models are preferred over linear models, as the predicted probability of a logit model always lies between 0 and 1. If the outcome has more than two categories, a multinomial or ordered logit may be more appropriate. The logit model used in the labour force participation module is based on data from the National Sample Survey on Employment and Unemployment 2017/2018 (population aged 15–74; n = 323,092). The model is described in Eq. 4.1:

$$ \begin{aligned} & {\text{logit}}\left( P \right) = \beta _{{s,0}} + \beta _{{s,1}} AGEGR + \beta _{{s,2}} AGEGR^{2} + \beta _{{s,3}} EDU + \\ & \beta _{{s,4}} REGION~ + ~\beta _{{s = F,5}} YOUNG\_KID + \beta _{{s = F,6}} POSTSEC*YOUNG\_KID + \beta _{{s,7}} EDU* \\ & AGEGR + \beta _{{s,8}} EDU*AGEGR^{2} \\ \end{aligned} $$
(4.1)

The logit of a probability corresponds to the natural logarithm of its odds. Therefore, the logit of the participation rate (P) is log(P/(1 − P)), and the rate P can be recovered from the parameters as follows:

$$ P = \frac{{\begin{array}{*{20}c} {\exp (\beta _{{s,0}} + \beta _{{s,1}} AGEGR + \beta _{{s,2}} AGEGR^{2} + \beta _{{s,3}} EDU + \beta _{{s,4}} REGION~ + ~\beta _{{s = F,5}} YOUNG\_KID + } \\ {\beta _{{s = F,6}} POSTSEC*YOUNG\_KID + \beta _{{s,7}} EDU*AGEGR + \beta _{{s,8}} EDU*AGEGR^{2} )} \\ \end{array} }}{{\begin{array}{*{20}c} {1 + \exp (\beta _{{s,0}} + \beta _{{s,1}} AGEGR + \beta _{{s,2}} AGEGR^{2} + \beta _{{s,3}} EDU + \beta _{{s,4}} REGION~ + ~\beta _{{s = F,5}} YOUNG\_KID + } \\ {\beta _{{s = F,6}} POSTSEC*YOUNG\_KID + \beta _{{s,7}} EDU*AGEGR + \beta _{{s,8}} EDU*AGEGR^{2} )} \\ \end{array} }} $$
(4.2)

Each sex has its own set of parameters and its own intercept. Within each sex, however, the slopes for age, education, and having a young kid are assumed to be the same in all regions. The age group enters as a quadratic function, which allows an inverted U-shape with lower participation rates for younger adults still in school and for the elderly. The interaction of age and education allows the model to take into account that the age pattern in labour force participation varies by educational attainment. It was not possible to estimate region-specific slopes because of the small number of respondents in many categories (such as highly educated people in a specific age range in smaller regions). However, each region has its own parameter (βs,4 in Eq. 4.1), which shifts its overall level of participation.
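As an illustration of how a sex-specific logit of this form could be fitted, the following is a minimal sketch rather than the estimation code behind the book's parameters. The dataset work.survey, its simulated values and the coding of agegr as the lower bound of the age group are assumptions made for the example; region and the young_kid terms are left out to keep it short.

```sas
/* Hypothetical survey extract: all values are simulated for illustration */
data work.survey;
  call streaminit(12345);
  length sex $1 edu $2;
  do id = 1 to 5000;
    if rand('uniform') < 0.5 then sex = 'F';
    else sex = 'M';
    agegr = 15 + 5 * floor(12 * rand('uniform'));    /* 15, 20, ..., 70 */
    edu = cats('e', 1 + floor(3 * rand('uniform'))); /* e1, e2 or e3    */
    /* made-up linear predictor with an inverted U-shape in age */
    eta = -6 + 0.3 * agegr - 0.0035 * agegr**2 - 1.5 * (sex = 'F');
    labour = (rand('uniform') < 1 / (1 + exp(-eta)));
    output;
  end;
  drop id eta;
run;

proc sort data=work.survey;
  by sex;
run;

/* One model per sex, so each sex gets its own intercept and slopes */
proc logistic data=work.survey;
  by sex;
  class edu (ref='e3') / param=ref;   /* e3 as reference category */
  model labour(event='1') = agegr agegr*agegr edu
                            edu*agegr edu*agegr*agegr;
run;
```

With reference coding, the reference category has no estimated coefficient, which is why it has to be written into the parameter file explicitly with a value of 0.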

The max-rescaled R-square is 0.5342 for the males’ model (percent concordant = 91.2) and 0.2374 for the females’ model (percent concordant = 76.6). Complete parameters can be found in the parameter file lfp. Figure 4.1 shows the predicted rates from the model by age and education for males and females (with no young kid). For males, rates are very high for everyone between 25 and 59. The education gap concerns mainly young and older adults, with lower rates for the more highly educated. In other words, more educated men enter the labour market later, since they stay in school longer, and they also retire earlier, probably because they have better jobs during their working lives and can afford an earlier retirement. For females, the pattern is very different. For all education categories and in all age groups, rates are much lower than for men, generally less than half. Furthermore, the effect of education seems to follow a U-shape, with higher rates for both the highest and lowest categories. These trends are similar to those observed by Kapsos et al. (2014).

Fig. 4.1 Predicted labour force participation rates from Eq. 4.1 by age and education, India

The parameter for the variable young_kid is −0.3236, implying that women who gave birth within the last 5 years are much less likely to work. Applied to a participation rate that would otherwise have been 42% (a logit of about −0.323), this parameter lowers the logit to about −0.646, that is, a rate of about 34.4%, a reduction of roughly 7.6 percentage points. The negative impact of having a young kid is moreover much larger for women with a postsecondary education than for other women, as the parameter for the interaction between these two variables is −0.3895. Finally, the parameters show strong heterogeneity among regions, and also higher participation rates in rural areas than in urban areas of the same region.

For other modules, assumptions are implemented directly as rates that were merged to individuals according to their characteristics. For the labour force participation module, we use regression parameters and therefore, the implementation method is different. Variable-specific parameters will first be merged one by one to the corresponding population. Then, using those parameters, we will calculate the individual probability of participating in the labour force.

To merge the parameter file with the population file, we need to structure it in a particular way, as shown in Fig. 4.2. Each discrete variable needs to have its own column, with its categories on different rows. The parameters corresponding to these categories are in another column, labelled ‘variable name’_p. Reference categories (such as edu = e3) also need to be included with a parameter of 0. Otherwise, a missing value would be used in the calculation of the rate, which would result in an error. Parameters for continuous variables and intercepts, which enter the calculation of the labour force participation rate of every individual, each have their own column, such as agegr_p and agegr2_p for the two parameters of the quadratic form of age, and agegr_edu_p and its squared-age counterpart for the interaction of the quadratic form of age with education.

Fig. 4.2 Screenshot of the parameter file lfp.csv (opened with Excel)

The labour force participation module is implemented once the demographic events are completed, when the age and year are those corresponding to the end of the period. The population file to be used is thus work.pop2 and the module needs to be written right after the “time module” and before cleaning the population file for the next period.

First, we need to merge the parameter file (param.lfp) with the population file (pop2). Using the merge statement as in previous modules would not be optimal, since it would require a separate merge for each variable of the logit model. We thus use a command in Structured Query Language (SQL), which is supported by SAS. We create a new population file (pop_lfp1) that links parameters from the file lfp to the individuals in the last population file (pop2, which we select with p.*). Parameters are selected one by one, with the appropriate variables (under the aliases t1 to t9). Each set of parameters is joined on its corresponding variables. For instance, parameters for education are joined on both sex and education, while parameters for the presence of a young kid and its interaction with education are joined on sex, education, and presence of a young kid, and so on for the other sets of parameters. In the code, we also specify “where not missing(‘name of the parameter’)” so that only rows holding a value for that parameter are joined, as we don’t want missing cells to be imported.

figure c
figure d
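The joining logic described above can be sketched on a reduced example. This is a hedged illustration, not the book's listing: the datasets work.lfp_demo and work.pop_demo, the three parameter sets and all values are made up, and only three of the sub-queries (t1 to t3) are shown.

```sas
/* Toy parameter file in the layout described for lfp.csv:
   one column per variable, one column per set of parameters.
   A single period is read as a missing value, also for character columns. */
data work.lfp_demo;
  length sex $1 edu $2;
  input sex $ edu $ intercept_p agegr_p edu_p;
  datalines;
F .  -2.0 .    .
M .  -1.0 .    .
F .  .    0.05 .
M .  .    0.08 .
F e1 .    .    0.4
F e2 .    .    0.1
F e3 .    .    0
M e1 .    .    -0.2
M e2 .    .    -0.1
M e3 .    .    0
;
run;

/* Toy population file */
data work.pop_demo;
  length sex $1 edu $2;
  input id sex $ edu $ agegr;
  datalines;
1 F e1 25
2 M e3 40
3 F e3 60
;
run;

proc sql;
  create table work.pop_lfp_demo as
  select p.*, t1.intercept_p, t2.agegr_p, t3.edu_p
  from work.pop_demo as p
  /* intercept: one value per sex */
  left join (select sex, intercept_p from work.lfp_demo
             where not missing(intercept_p)) as t1
    on p.sex = t1.sex
  /* slope for age: one value per sex */
  left join (select sex, agegr_p from work.lfp_demo
             where not missing(agegr_p)) as t2
    on p.sex = t2.sex
  /* education: one value per sex and education category */
  left join (select sex, edu, edu_p from work.lfp_demo
             where not missing(edu_p)) as t3
    on p.sex = t3.sex and p.edu = t3.edu
  order by p.id;
quit;
```

Each sub-query keeps only the rows that carry the parameter in question, so every individual ends up with exactly one value per parameter. The reference category e3 is matched to its explicit 0 rather than to a missing value.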

The population file pop_lfp1 now includes individual-specific parameters for the labour force participation module. Starting from this file, we create a new one (pop_lfp2) in which the labour force participation event occurs. At each step of the projection, the labour force variable is first reset to 0 (out of the labour force) for all individuals (labour = 0). We then calculate the individual-specific labour force participation rate for the population affected by the event (those aged between 15 and 74). In our example, we use logit regression parameters. The rate thus corresponds to the exponential of the sum of the parameters (multiplied by the value of the variable in the case of continuous variables) divided by 1 plus the exponential of that same sum.

figure e
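The calculation of the rate can be sketched as follows, assuming the parameters have already been merged to each individual. The dataset names, the variable lfp_rate, the reduced set of parameters and all values are assumptions made for the illustration.

```sas
/* Toy input: one row per individual, parameters already merged
   (all values are made up) */
data work.pop_lfp1_demo;
  input id age agegr intercept_p agegr_p agegr2_p edu_p;
  datalines;
1 10 10 -6.0 0.30 -0.0035 0.4
2 27 25 -6.0 0.30 -0.0035 0.4
3 42 40 -6.0 0.30 -0.0035 0
4 80 75 -6.0 0.30 -0.0035 0
;
run;

data work.pop_lfp2_demo;
  set work.pop_lfp1_demo;
  labour = 0;                       /* reset: out of the labour force */
  if 15 <= age <= 74 then do;
    /* sum of parameters; continuous variables are multiplied
       by the value of the variable */
    eta = intercept_p + agegr_p * agegr + agegr2_p * agegr**2 + edu_p;
    /* inverse logit, as in Eq. 4.2 */
    lfp_rate = exp(eta) / (1 + exp(eta));
  end;
run;
```

Individuals outside the 15 to 74 age range keep a missing rate and remain out of the labour force.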

Once each individual has a specific probability of participating in the labour force, we can proceed to the simulation of the event with the Monte Carlo method. When the rate is higher than the random number, we switch the labour force variable to 1. Finally, we drop parameters for labour force participation from the population file.

figure f
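The Monte Carlo step can be illustrated in isolation. In this sketch, every simulated individual is given the same made-up rate of 0.30 so that the simulated share can be compared with the rate; the use of the RAND function with a fixed seed is an assumption, not necessarily the random number generator used in the book's code.

```sas
data work.mc_demo;
  call streaminit(2015);            /* fixed seed for reproducibility */
  do id = 1 to 100000;
    lfp_rate = 0.30;                /* made-up rate shared by everyone */
    labour = 0;
    /* the event occurs when the rate exceeds the random draw */
    if lfp_rate > rand('uniform') then labour = 1;
    output;
  end;
run;

/* The mean of labour should be close to the rate of 0.30 */
proc means data=work.mc_demo mean;
  var labour;
run;
```

Because each draw is independent, the simulated share converges to the rate as the number of individuals grows, which is what allows individual outcomes to reproduce aggregate rates.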

4.3 Sector of Activity Module

In India, as in many developing countries, the informal sector (jobs that are not regulated or monitored by the government, including unpaid jobs) represents a large part of the economy. With the modernisation of the economy, urbanisation, globalisation, the demographic transition, and the expansion of educational attainment, the informal sector is likely to shrink and be replaced by formal jobs (Cáceres-Delpiano 2012; McCaig and Pavcnik 2015; Siggel 2010).

The sector of activity module is implemented in the same way as the labour force participation module, with logit regression parameters. However, covariates and their interactions differ. More than age, the cohort of birth has a major influence on whether or not an individual is likely to work in the formal sector (McCaig and Pavcnik 2015). Thus, the formalisation of an economy occurs in large part by the replacement of generations, through the mechanism of demographic metabolism (Lutz 2013). Accordingly, the modelling of the sector of activity (S) uses the cohort of birth as an individual determinant, while the age dimension is dismissed. Equation 4.3 describes the model:

$$ \begin{aligned} & {\text{logit}}\left( S \right) = SEX*(\beta _{0} + \beta _{1} COHORT + \beta _{2} EDU + \beta _{3} REGION~ + ~\beta _{4} YOUNG\_KID + \\ & \beta _{5} POST\_SEC*YOUNG\_KID + \beta _{6} REGION*COHORT) \\ \end{aligned} $$
(4.3)

The model is applied only to the active population of the National Sample Survey on Employment and Unemployment 2017/2018. The max-rescaled R-Square is 0.3080 for the males’ model (percent concordant = 78.3) and 0.5235 for the females’ model (percent concordant = 88.4). Complete parameters can be found in the parameters file formal. As for the labour force participation model, each sex has its own set of parameters and its own intercept.

The cohort is implemented as a continuous variable taking the value of 0 for the cohort born in 1940–1944, 1 for those born in 1945–1949 and so on. The cohort parameter thus captures the secular trend that can be extrapolated for future cohorts entering the labour market. The model also includes an interaction between the cohort and the region in order to take into account the regional disparity in the pace of development. To avoid inconsistencies for regions that already have a very high proportion of the active population working in the formal sector, we added the constraint that region-specific cohort trends need to be positive or equal to 0 (β1 + β6 ≥ 0).

In Fig. 4.3, we present the extrapolation for future cohorts of the arithmetic average (not weighted by region population) of region-specific cohort parameters. It shows that for both sexes, there is a sharply increasing trend in the proportion of workers in the formal sector. A bit more than 20% of the cohorts born in the 1950s work in the formal sector, compared to half of the cohorts born in the late 1990s. When trends are extrapolated, the proportion exceeds 60% for cohorts born after 2025. Despite having much lower labour force participation rates, women are slightly more likely to work in the formal sector, but the gap will gradually close. The model also accounts for strong regional differentials (not shown in the figure). Rates are in general much lower in rural regions than in urban ones, but the difference shrinks gradually over cohorts.

Fig. 4.3 Arithmetic average of region-specific cohort parameters for the sector of activity converted into rate (education = complete primary; no birth in the last 5 years)

In addition to cohort and region, the model also includes education, a parameter for women who have a young child (having given birth in the last 5 years) and its interaction with education. As shown in Table 4.1, which presents odds ratios for educational attainment, education emerges as a key determinant of having a formal job for both males and females, as a steep gradient in parameters is observed between the lowest degree and the highest. Active men with a postsecondary education have odds of working in the formal sector about 12 times those of men with no education (6.699/0.544 ≈ 12.3). This ratio is about 26 for women (17.921/0.689).

Table 4.1 Odds of working in the formal sector by level of educational attainment (exp(β2) from Eq. 4.3)

Finally, the model includes the negative effect, for women, of having a young kid at home on the probability of working in the formal sector (−0.845). The positive parameter (0.799) for the interaction of the variables YOUNG_KID and POST_SEC, however, suggests that this effect is much weaker for women with a postsecondary education.

As the sector of activity module also uses regression coefficients, the parameter file “formal” has the same format as the parameter file “lfp” (one column for each variable and one column for each set of parameters), as shown in Fig. 4.4. As a reminder, the cohort is implemented as a continuous variable and therefore does not require a category to link it to the corresponding population.

Fig. 4.4 Screenshot of the parameter file formal.csv (opened with Excel)

The code to implement parameters is also similar to the one used for the labour force participation module. Using a SQL command, in a new population file called “pop_formal1”, we join to the last population file (“pop_lfp2”) the parameters from the parameters file “formal” (which is stored in the library param), using the appropriate set of variables for each parameter.

figure g
figure h

In a new population file (pop_formal2), we can now simulate the event, which will split workers between the formal and the informal sector. First, in a temporary variable “cohort2”, we need to transform the cohort variable to make it correspond to the one used in the regression model. As a reminder, the cohort born in 1940–1944 has the value 0, while the cohort born in 1945–1949 has the value 1, and so on. Therefore, the cohort variable used in the sector of activity event should be (cohort − 1940)/5. Someone born in 2020 would thus have a value of 16.

figure i
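The transformation can be checked on a few values. This small sketch assumes that the variable cohort stores the first birth year of each five-year cohort; the dataset name is hypothetical.

```sas
data work.cohort_demo;
  input cohort;
  /* 1940 -> 0, 1945 -> 1, ..., 2020 -> 16 */
  cohort2 = (cohort - 1940) / 5;
  datalines;
1940
1945
1990
2020
;
run;

proc print data=work.cohort_demo noobs;
run;
```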

Because the sector of activity is modelled using a cross-sectional approach, we reset the variable to 0 (formal = 0, which corresponds to being out of the labour force). For those aged 15 to 74 and in the labour force (labour = 1, which is the outcome of the labour force module of the previous section), we set by default the variable formal to 1, signifying working in the informal sector. We then use parameters to calculate the probability of working in the formal sector (prob_form) and we proceed to the Monte Carlo experiment to select those who work in the formal sector (formal = 2). Finally, we drop parameters and temporary variables.

figure j
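The steps of the sector of activity event can be sketched on a toy file in which parameters are already merged. The dataset name, the reduced set of parameters and all values are made up for the illustration, and the RAND function stands in for whichever random number generator the original code uses.

```sas
data work.pop_formal_demo;
  if _n_ = 1 then call streaminit(4);   /* fixed seed, set once */
  input id age labour cohort2 intercept_p cohort_p edu_p;
  formal = 0;                           /* reset: out of the labour force */
  if 15 <= age <= 74 and labour = 1 then do;
    formal = 1;                         /* default: informal sector */
    eta = intercept_p + cohort_p * cohort2 + edu_p;
    prob_form = exp(eta) / (1 + exp(eta));
    /* Monte Carlo experiment: move to the formal sector */
    if prob_form > rand('uniform') then formal = 2;
  end;
  /* drop parameters and temporary variables */
  drop intercept_p cohort_p edu_p eta cohort2;
  datalines;
1 30 1 10 -2.0 0.15 0.5
2 45 0 7  -2.0 0.15 0
3 12 0 14 -2.0 0.15 0
4 60 1 4  -2.0 0.15 0
;
run;
```

Only individuals 1 and 4 are exposed to the event. The others keep formal = 0, either because they are outside the age range or because the labour force module left them inactive.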

Now, the last population file is work.pop_formal2. In the section for cleaning the population file for the next period, we thus need to replace pop2 (which was the last population file in Chap. 3) with this.

figure k

4.4 Including the New Dimensions in the Outputs

The population file pop_&endyr (pop_2015 for the first step of the projection) now includes the projected status of the labour force and the projected sector of activity. We now need to modify the code that generates the projection outputs to include these dimensions. First, in the code generating the population by some characteristics, we add the variable “formal” to the table.

figure l

The variable for labour force participation (labour) doesn’t need to be included, since it can be rebuilt from the variable “formal” (summing up those working in the formal sector and those working in the informal sector gives the active population, while the inactive have their own category).

After adding this new dimension to the outputpop table, each set of age-sex-region-education group is now divided into three categories, “inactive” (formal = 0), “working in the informal sector” (formal = 1) and “working in the formal sector” (formal = 2), as illustrated in Fig. 4.5, showing a screenshot of the file.

Fig. 4.5 Screenshot of outputpop before the transpose procedure

Before merging the population count with the components of growth, we need to transpose the output file “outputpop” that we just created so that the variable “pop” is spread over three columns, one for each category of the variable “formal”. We use the transpose procedure for this purpose. The variable to transpose, “pop”, is named in the var statement. The “by” statement lists the variables that identify each row of the new dataset. We specify the variable “formal” in the “id” statement to obtain one column for each category of this variable. Since a column name cannot begin with a digit, an underscore is added, so category 0 is labelled _0, 1 is labelled _1 and 2 is labelled _2. In the options of the out= dataset, we rename the new columns after their categories: “inactive” for _0, “informal” for _1, and “formal” for _2.

figure m
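A reduced version of this transposition might look as follows. The dataset names and counts are made up, and only age group and sex are used as by variables, whereas the actual output table has more dimensions.

```sas
/* Toy outputpop: one row per age group, sex and category of formal */
data work.outputpop_demo;
  length sex $1;
  input agegr sex $ formal pop;
  datalines;
10 F 0 500
30 F 0 300
30 F 1 80
30 F 2 40
30 M 0 50
30 M 1 250
30 M 2 120
;
run;

proc sort data=work.outputpop_demo;
  by agegr sex;
run;

proc transpose data=work.outputpop_demo
               out=work.outputpop_t (drop=_name_
                   rename=(_0=inactive _1=informal _2=formal));
  by agegr sex;     /* one output row per age group and sex */
  id formal;        /* numeric categories become columns _0, _1 and _2 */
  var pop;          /* values to spread across the new columns */
run;
```

In the result, the row for age group 10 has missing values in the informal and formal columns, since that group has no row for categories 1 and 2 in the input.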

An excerpt of the resulting dataset is shown in Fig. 4.6. Values are missing in the formal and informal columns for the age groups 0–14 and 75+ because, by default in the model, everyone in these age groups is inactive.

Fig. 4.6 Screenshot of outputpop after the transpose procedure

The merger with the components of growth outputs can then proceed. However, the population is still split among the inactive, the informal workers, and the formal workers. In the code that creates the final output file of the period (output_&endyr), we can rebuild the total population and the active population, right after changing the missing values into 0 and the rounding of outcomes. As highlighted in yellow in the code below, the active population thus corresponds to the sum of the formal and informal workers, while the total population (pop) corresponds to the sum of the inactive and active populations.

figure n
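The rebuilding of the active and total populations can be sketched on two toy rows. The dataset name and counts are hypothetical; the column names inactive, informal, formal, active and pop follow those used in the text.

```sas
data work.output_demo;
  input agegr inactive informal formal;
  array counts {3} inactive informal formal;
  do i = 1 to 3;
    if missing(counts{i}) then counts{i} = 0;  /* missing values become 0 */
    counts{i} = round(counts{i});              /* round the outcomes */
  end;
  active = formal + informal;   /* active = formal + informal workers */
  pop = inactive + active;      /* total = inactive + active */
  drop i;
  datalines;
10 500.4 . .
30 350.2 330.6 160.1
;
run;
```

Replacing missing values before the sums matters: in a data step, adding a missing value with the + operator returns a missing result, so the totals for the youngest and oldest age groups would otherwise be lost.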

Up to this point, labour force participation and the sector of activity have been projected from 2015 to 2060, but they are not included in the initial population of 2010. Since we want to be able to generate trends, we need to incorporate those variables in the initial population. Ideally, we would use observed values from a survey, such as the National Sample Survey on Employment and Unemployment 2009, but the variable suffered from methodological problems in this wave and is therefore not comparable (Kapsos et al. 2014). We will thus impute those variables in the initial population in a way similar to what we did for the projected years, using regression parameters from the National Sample Survey on Employment and Unemployment 2017/2018.

Because the initial population does not include a variable on the presence of a young child, we need to re-estimate the logit models without this variable. The resulting parameters are included in the parameter files lfp_imput.csv and formal_imput.csv, which were already imported and converted in Chap. 2. Parameters for men are exactly those used for the simulation, while those for women differ slightly because the presence of a young child is omitted from the model. The structure of these files is the same as that of the files used for the simulation: each variable has its own column with its categories on different rows, and the parameters corresponding to these categories are in another column.

We impute the labour force and sector of activity to the base population of 2010 the same way we did for the simulation, with SQL commands that first merge individuals to parameters corresponding to their characteristics. For the labour force, this is done in a temporary population dataset “lfp_imput”.

figure o
figure p

From this, the imputation is then done with a random experiment in a data step creating a new population dataset “lfp_imput2”.

figure q

The same is then done for the sector of activity. The resulting dataset “formal_imput2” includes the base population of 2010 with their imputed labour force participation and sector of activity.

figure r
figure s

Before concatenating the output files of the different periods together, we apply the same addition (highlighted in yellow) in the code of the imputed population of 2010 as we did for the simulated population, in order to have five columns for population count in the output: the total population, the active one, those working in the formal sector, those working in the informal one and those who are inactive.

figure t

The final output file exported in CSV (outputTotal.csv) now includes the population by age, sex, education, region, labour force status and sector of activity.

4.5 Overview of Results

The scenario produced in this chapter assumes constant parameters for labour force participation and sector of activity. At aggregated levels, this means that any change in those dimensions comes from changes in the population composition. In Fig. 4.7, we show the projection outcomes by labour force status and sector of activity.

Fig. 4.7 Projected population by labour force status and sector of activity (left) and labour force dependency ratio (right), India, 2010–2060

The total population of India is projected to grow by a bit more than 500 M from 2010 to 2060. About 40% (210 M) of this growth will be among the active population (formal + informal), which is likely to stabilize around 2045, passing from about 415 M in 2010 to 625 M. Accordingly, the labour force dependency ratio (the inactive population divided by the active one) will not change much over the next decades. According to this scenario, a small decline may first be seen as a result of the demographic dividend. The ratio will thus pass from about 1.93 in 2010 to 1.68 in 2040. Because of the population ageing that will further increase the share of elderly that are inactive, the ratio will then increase slightly, reaching 1.86 in 2060.
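Indicators of this kind can be derived from the final output file with a short aggregation. The sketch below uses a made-up table with a hypothetical year column and toy counts that are unrelated to the projection results; it only illustrates the definition of the labour force dependency ratio (inactive divided by active) and of the share of formal workers.

```sas
/* Toy extract in the spirit of outputTotal (all counts are made up) */
data work.outputtotal_demo;
  input year agegr inactive informal formal;
  datalines;
2010 10 500 0 0
2010 30 350 330 160
2060 10 450 0 0
2060 30 400 200 420
;
run;

proc sql;
  create table work.ratio_demo as
  select year,
         sum(informal) + sum(formal) as active,
         /* inactive population divided by the active one */
         sum(inactive) / (sum(informal) + sum(formal)) as dep_ratio,
         /* share of workers in the formal sector */
         sum(formal) / (sum(informal) + sum(formal)) as share_formal
  from work.outputtotal_demo
  group by year;
quit;
```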

The composition of the labour force will however change drastically. In 2010, about one quarter of workers worked in the formal sector. This proportion is likely to grow to 65% in 2060. This change is caused by a cohort effect. By demographic metabolism, the younger cohorts that are already much more likely to work in the formal sector will gradually replace the older ones.

As shown in Fig. 4.8, the education composition of the labour force is also projected to change drastically. The proportion of workers with a high school degree or above is likely to double, passing from about one-third in 2010 to about two-thirds in 2060. The proportion of women among workers, however, remains very low (about 20%). This is because of our assumption of constant parameters. It means that India does not use a large portion of its potential workforce, and therefore that the projected labour force size and the labour force dependency ratio could be much more favourable. In the next chapter, we will build an alternative scenario showing what India might gain from greater participation of women in the labour force.

Fig. 4.8 Projected change in the sex and education composition of the labour force, India, 2010–2060