Decomposition of Inequality of Opportunity in India: An Application of Data-Driven Machine Learning Approach

The Indian Journal of Labour Economics

Abstract

This paper introduces a novel measure of inequality of opportunity (IOp) in India by comparing both ex-ante and ex-post results, in line with Roemer’s (1998) equality of opportunity theory. The study uses data-driven machine learning algorithms, namely the conditional inference tree and the conditional inference forest, to measure ex-ante IOp, and a transformation tree to estimate ex-post IOp. The findings indicate that, according to the ex-ante approach, approximately 58–61 percent of overall income inequality can be attributed to variations in circumstances, while around 46 percent of overall income inequality is explained by differences in the degree of effort. The results from the tree-based analysis reveal that parents’ occupation, sector (rural–urban areas), and geographical region are the primary circumstances contributing to IOp, which is further confirmed by the Shapley decomposition exercise. Specifically, individuals who reside in rural areas in the eastern and central parts of the country, whose parents are employed in low-skilled or unskilled occupations and have below-secondary or no formal education, and who belong to marginalized social groups, exhibit significantly lower average income. Consequently, it is crucial to implement regional-level development policies that specifically target marginalized groups in order to foster a more equitable society and mitigate overall income inequality.


Fig. 1

Source: Authors' Calculations. Note: R, Rural; U, Urban; N, North; NE, Northeast; S, South; W, West; E, East; C, Central; Sec/HS, Secondary/Higher Secondary; GradAbv, Graduate and Above; NoEdu, Illiterate or No Formal Schooling; BS, Below Secondary; NR, Non Routine Cognitive; RC, Routine Cognitive; NRM, Non Routine Manual; RM, Routine Manual; M, Male; F, Female

Fig. 2

Source: Authors' Calculations

Fig. 3

Source: Authors' Calculations

Fig. 4

Source: Authors' Calculations. Note: R, Rural; U, Urban; N, North; NE, Northeast; S, South; W, West; E, East; C, Central; Sec/HS, Secondary/Higher Secondary; GradAbv, Graduate and Above; NoEdu, Illiterate or No Formal Schooling; BS, Below Secondary; NR, Non Routine Cognitive; RC, Routine Cognitive; NRM, Non Routine Manual; RM, Routine Manual; M, Male; F, Female

Fig. 5

Source: Authors' Calculations

Fig. 6

Source: Authors' Calculations

Notes

  1. It is widely used in computer graphics to model smooth curves (Farouki 2012). It outperforms competitors, such as kernel estimators, in approximating distribution functions (Lablanc 2012).

  2. A conditional distribution function (CDF) is a function of the form \({P}_{r}(Y=j|X={x}_{o})\), i.e., the probability that Y equals j for a given value of X (James et al. 2013, p. 37). A type-specific ECDF, as used in Brunori et al. (2023), describes the probability distribution of a random variable under given conditions; in the context of this paper, the ECDFs describe the probability distribution of the MPCI within a given circumstance type.
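
To make the idea of a type-specific ECDF in note 2 concrete, the following is a minimal sketch in Python using only the standard library. The function name `type_ecdfs` and the `(type_label, outcome)` record layout are illustrative assumptions, not taken from the paper or from Brunori et al. (2023).

```python
from bisect import bisect_right
from collections import defaultdict


def type_ecdfs(records):
    """Build one empirical CDF per circumstance type.

    records: iterable of (type_label, outcome) pairs (hypothetical layout).
    Returns a dict mapping each type label to a function F(y) that gives
    the share of that type's observations with outcome <= y.
    """
    by_type = defaultdict(list)
    for type_label, outcome in records:
        by_type[type_label].append(outcome)

    def make_ecdf(values):
        ordered = sorted(values)
        # bisect_right counts how many sorted values are <= y
        return lambda y: bisect_right(ordered, y) / len(ordered)

    return {label: make_ecdf(values) for label, values in by_type.items()}


# Toy data, invented purely for illustration
ecdfs = type_ecdfs([("A", 1), ("A", 2), ("A", 3), ("A", 4), ("B", 10)])
share_a_at_2 = ecdfs["A"](2)
```

Comparing these functions across types at the same outcome value shows how the distribution of the outcome shifts with circumstances.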

Acknowledgements

The authors are grateful to Prof S Madeshwaran, Prof Arup Maitra, Dr Pedro Salas-Rojo, and the conference audience at the 63rd ISLE conference for their valuable comments and suggestions on the draft paper.

Funding

This paper is the outcome of a study titled ‘Inequalitrees—A Novel Look at Socio-Economic Inequalities using Machine Learning Techniques and Integrated Data Sources’ funded by the Volkswagen Stiftung, Germany. Authors gratefully acknowledge the financial support received from the VW-Stiftung.

Author information

Corresponding author

Correspondence to Balwant Singh Mehta.

Ethics declarations

Conflict of interest

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Sample Selection and Construction of the Variables

Variable Selection

Six variables have been selected from the PLFS 2018–19. Three of them, namely sector, caste, and gender, are used in their existing form, while the other three, namely state, parents’ education, and parents’ occupation, have been recoded into new categories. Sector is categorized as rural or urban; gender as male or female; and caste as General Caste (GEN), Scheduled Caste (SC), Scheduled Tribe (ST), and Other Backward Classes (OBC). Transgender respondents were dropped from the gender variable before the analysis.

The state variable covers the 36 states/union territories of India, which have been grouped into the following six broad geographical regions:

  1. North: Jammu and Kashmir, Himachal Pradesh, Punjab, and Haryana

  2. East: Bihar, Jharkhand, Orissa, and West Bengal

  3. Central: Uttar Pradesh, Rajasthan, Madhya Pradesh, Uttarakhand, and Chhattisgarh

  4. North-East: Sikkim, Arunachal Pradesh, Assam, Nagaland, Meghalaya, Manipur, Mizoram, and Tripura

  5. South: Karnataka, Andhra Pradesh, Tamil Nadu, Pondicherry, Kerala, and Lakshadweep

  6. West: Gujarat, Daman and Diu, Dadra and Nagar Haveli, Maharashtra, and Goa
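
The regional grouping above can be sketched as a simple lookup table. This is an illustrative Python sketch, not the authors' code: the names `REGION_STATES`, `STATE_TO_REGION`, and `region_of` are assumptions, the state names are matched as plain strings (with the standard spelling "Gujarat"), and any state or union territory not named in the list returns `None` rather than being guessed.

```python
# Region groupings as listed in Appendix 1 (state names as plain strings)
REGION_STATES = {
    "North": ["Jammu and Kashmir", "Himachal Pradesh", "Punjab", "Haryana"],
    "East": ["Bihar", "Jharkhand", "Orissa", "West Bengal"],
    "Central": ["Uttar Pradesh", "Rajasthan", "Madhya Pradesh",
                "Uttarakhand", "Chhattisgarh"],
    "North-East": ["Sikkim", "Arunachal Pradesh", "Assam", "Nagaland",
                   "Meghalaya", "Manipur", "Mizoram", "Tripura"],
    "South": ["Karnataka", "Andhra Pradesh", "Tamil Nadu", "Pondicherry",
              "Kerala", "Lakshadweep"],
    "West": ["Gujarat", "Daman and Diu", "Dadra and Nagar Haveli",
             "Maharashtra", "Goa"],
}

# Invert the table so each state points to its region
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states
}


def region_of(state_name):
    """Return the region for a state name, or None if it is not listed."""
    return STATE_TO_REGION.get(state_name.strip())
```

In practice the survey identifies states by numeric codes, so a real recode would key this table on those codes instead of names.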

The education variable is classified into the following four broad categories:

  1. Illiterate or no education: (code 1: Illiterate)

  2. Below secondary: (code 2–7, literate to up to middle school)

  3. Secondary and above secondary: (code 8–10, secondary to higher secondary)

  4. Graduate and above: (code 12–13, graduate and post-graduate)
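
As a rough illustration of the education recode, the code ranges listed above can be written as a small function. The function name `education_category` is an assumption, and the short labels are borrowed from the figure notes (NoEdu, BS, Sec/HS, GradAbv). Codes outside the listed ranges, including code 11, which the text does not assign, return `None`.

```python
def education_category(code):
    """Map a PLFS general education code to one of four broad categories.

    Only the ranges stated in the appendix are mapped; anything else
    (for example code 11) is returned as None instead of being guessed.
    """
    if code == 1:
        return "NoEdu"      # illiterate or no education
    if 2 <= code <= 7:
        return "BS"         # below secondary
    if 8 <= code <= 10:
        return "Sec/HS"     # secondary to higher secondary
    if 12 <= code <= 13:
        return "GradAbv"    # graduate and above
    return None
```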

The occupation/skill level is classified into the following four broad categories using the NCO (National Classification of Occupations) at the one-digit level (as per OECD Employment Outlook 2014; NCO 2015, Ministry of Labour and Employment, Government of India):

  1. Unskilled or routine manual task: Typically involves the performance of simple and routine physical or manual tasks (NCO code 9: elementary or unskilled occupations, such as domestic helpers, cleaners, street vendors, and garbage collectors).

  2. Low-skilled or non-routine manual task: Typically involves the performance of tasks such as operating machinery and electronic equipment, driving vehicles, maintenance and repair of electrical and mechanical equipment, and manipulation, ordering, and storage of information (NCO code 4–8: low-skilled occupations, such as clerical jobs, service workers, shop and market sales workers, craft and related trade workers, etc.).

  3. Medium-skilled or non-routine cognitive task: Typically involves the performance of complex technical and practical tasks that require an extensive body of factual, technical, and procedural knowledge in a specialized field (NCO code 3: professional and technical associates).

  4. High-skilled or cognitive task: Typically involves the performance of tasks that require complex problem solving, decision making, and creativity based on an extensive body of theoretical and factual knowledge in a specialized field (NCO code 2: professionals and technicians).

The concept of skill level is not applied to NCO code 1 (legislators, managers, etc.), because the skills required to execute the tasks and duties of these occupations vary to such an extent that it was not feasible to link them to any of the four broad skill levels.
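
The occupation recode described above can be sketched in the same way. This is a hedged illustration: the function name `skill_level` and the label strings are assumptions, and taking the leading digit of a longer NCO code is an assumed convention for reaching the one-digit level. NCO code 1 and any unlisted digit return `None`, mirroring the exclusion described in the text.

```python
def skill_level(nco_code):
    """Map an NCO code to a broad skill category using its first digit.

    Accepts an int or a string such as "931"; only the leading digit is
    used (an assumed convention). Unlisted divisions return None.
    """
    text = str(nco_code).strip()
    if not text or not text[0].isdigit():
        return None
    division = int(text[0])
    if division == 9:
        return "Unskilled"       # routine manual tasks
    if 4 <= division <= 8:
        return "Low-skilled"     # non-routine manual tasks
    if division == 3:
        return "Medium-skilled"  # non-routine cognitive tasks
    if division == 2:
        return "High-skilled"    # cognitive tasks
    return None                  # includes NCO code 1 (legislators, managers)
```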

Sample Selection

The following multi-stage procedure has been adopted to select the sample.

In the first stage, the parent of each respondent was identified using the relation-to-head variable in the data. For an individual identified as self (code 1), the household member with code 7 (labeled Father/Mother/Father-in-Law/Mother-in-Law) was treated as the parent, and the first data set of children and parents was prepared.

In the second stage, unmarried children (code 5) and married children (code 3) were identified, and their parents were identified as the household heads labeled as self (code 1) in the data. The respondent labeled self was thus treated as the parent, and the second data set of children and parents was prepared.

In cases of duplicate records (or multiple parental information), we deleted the duplicate case after carefully examining the unit records. Once both files were cleaned, we merged them, along with the key variables discussed above.
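
The two linkage stages can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the field names `hh_id`, `person_id`, and `relation` are hypothetical, and where the authors resolved duplicates by inspecting unit records, the sketch only flags children linked to more than one parent so they can be reviewed.

```python
from collections import defaultdict

# Relation-to-head codes as quoted in the appendix
SELF, MARRIED_CHILD, UNMARRIED_CHILD, PARENT = 1, 3, 5, 7


def link_children_to_parents(persons):
    """Return (hh_id, child_id, parent_id) links from both stages.

    persons: list of dicts with hypothetical keys hh_id, person_id, relation.
    """
    by_household = defaultdict(list)
    for person in persons:
        by_household[person["hh_id"]].append(person)

    pairs = []
    for hh_id, members in by_household.items():
        heads = [m for m in members if m["relation"] == SELF]
        parents_of_head = [m for m in members if m["relation"] == PARENT]
        children_of_head = [m for m in members
                            if m["relation"] in (MARRIED_CHILD, UNMARRIED_CHILD)]
        # Stage 1: the head (self) is the child, code-7 members are parents
        for head in heads:
            for parent in parents_of_head:
                pairs.append((hh_id, head["person_id"], parent["person_id"]))
        # Stage 2: code-3/5 members are the children, the head is the parent
        for child in children_of_head:
            for head in heads:
                pairs.append((hh_id, child["person_id"], head["person_id"]))
    return pairs


def children_with_multiple_parents(pairs):
    """List (hh_id, child_id) keys that appear in more than one link."""
    counts = defaultdict(int)
    for hh_id, child_id, _parent_id in pairs:
        counts[(hh_id, child_id)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Toy household, invented for illustration
household = [
    {"hh_id": 1, "person_id": 1, "relation": SELF},
    {"hh_id": 1, "person_id": 2, "relation": PARENT},
    {"hh_id": 1, "person_id": 3, "relation": UNMARRIED_CHILD},
    {"hh_id": 1, "person_id": 4, "relation": PARENT},
]
links = link_children_to_parents(household)
flagged = children_with_multiple_parents(links)
```

Because code 7 covers parents and parents-in-law alike, a head with two code-7 members ends up in the flagged list, which is the kind of case the text describes resolving by hand.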

Appendix 2: Grid Search CV Process for Conditional Inference Tree and Conditional Inference Forest

In the Grid Search CV process, the data are divided into training and test sets. Different combinations of min-split (the minimum number of observations required to perform a split) and alpha values are tested, and the combination that yields the lowest root mean squared error (RMSE) on the test set is selected. The RMSE is a measure of the model's prediction accuracy. The Grid Search CV has been conducted for the conditional inference tree model with MPCI as the dependent variable. After evaluating various combinations, an alpha value of 0.07 and a min-split value of 10,000 provide the lowest RMSE. The robustness of the endogenously chosen alpha is examined by comparing the results with alpha values of 0.01 and 0.05, as given in Table A.2.1.

Table A.2.1 Ctree results MPCI

This comparison is done following the approach outlined by Salas-Rojo and Rodriguez (2022).

Similarly, after evaluating various combinations, an alpha value of 0.06 and 200 trees provide the lowest RMSE for the conditional inference forest model. The robustness of the endogenously chosen alpha is examined by comparing the results with alpha values of 0.01 and 0.05, as given in Table A.2.2.

Table A.2.2 Cforest results MPCI
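
The selection rule described in this appendix (try each alpha and min-split combination, keep the one with the lowest test-set RMSE) can be sketched in a few lines of Python. The model here is a deliberately simple stand-in and not a conditional inference tree: `fit_toy_split` is a hypothetical one-split learner on a single binary circumstance that splits only if the training sample has at least `min_split` observations and a normal-approximation test of the difference in group means has a p-value below `alpha`. All names and the toy data are assumptions for illustration.

```python
import math
from itertools import product
from statistics import NormalDist, fmean, variance


def fit_toy_split(train, alpha, min_split):
    """Toy stand-in for a tree: at most one split on a binary circumstance.

    train: list of (circumstance in {0, 1}, outcome) pairs.
    Returns a dict mapping each circumstance value to a predicted outcome.
    """
    overall = fmean(y for _, y in train)
    g0 = [y for c, y in train if c == 0]
    g1 = [y for c, y in train if c == 1]
    no_split = {0: overall, 1: overall}
    if len(train) < min_split or len(g0) < 2 or len(g1) < 2:
        return no_split
    se = math.sqrt(variance(g0) / len(g0) + variance(g1) / len(g1))
    if se == 0:
        p_value = 0.0 if fmean(g0) != fmean(g1) else 1.0
    else:
        z = (fmean(g0) - fmean(g1)) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Split only when the difference in means is significant at level alpha
    return {0: fmean(g0), 1: fmean(g1)} if p_value < alpha else no_split


def rmse(model, data):
    """Root mean squared error of the model's predictions on data."""
    return math.sqrt(fmean((y - model[c]) ** 2 for c, y in data))


def grid_search(train, test, alphas, min_splits, fit=fit_toy_split):
    """Return (test RMSE, alpha, min_split) for the best combination."""
    results = []
    for alpha, min_split in product(alphas, min_splits):
        model = fit(train, alpha, min_split)
        results.append((rmse(model, test), alpha, min_split))
    # Ties on RMSE fall to the smaller alpha, then the smaller min_split
    return min(results)


# Toy data, invented for illustration
train = [(0, 1.0), (0, 2.0), (0, 3.0), (1, 11.0), (1, 12.0), (1, 13.0)]
test = [(0, 2.0), (1, 12.0)]
best = grid_search(train, test, alphas=[0.01, 0.05], min_splits=[4, 100])
```

Replacing `fit` with a function that trains an actual conditional inference tree or forest would leave the search loop unchanged; only the model-fitting step is specific to the learner.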

Appendix 3: Plots for MPCI

Figure a

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mehta, B.S., Dhote, S. & Srivastava, R. Decomposition of Inequality of Opportunity in India: An Application of Data-Driven Machine Learning Approach. Ind. J. Labour Econ. 66, 439–469 (2023). https://doi.org/10.1007/s41027-023-00446-5

  • DOI: https://doi.org/10.1007/s41027-023-00446-5
