Measuring human capital in South Africa across socioeconomic subgroups using a latent-variable approach

Human capital is a complex concept to measure because it is an unobserved latent construct. Education is a fundamental dimension of human capital and thus an education-based approach is the one most widely used. However, the international literature recommends a latent-variable approach to measuring human capital. This study thus aims to measure human capital in South Africa, a country experiencing extreme earnings and education inequalities, using a latent-variable approach and the National Income Dynamics Study (NIDS) dataset. The findings are that parental education is associated with the largest amount of variance in (latent) human capital, while the health indicator captures the least variance. Furthermore, the (latent) human capital variable provides a valuable measure to profile the distribution of human capital by socioeconomic subgroups.


Introduction
Human capital (HC) has long been recognized as invaluable to national wealth. Alfred Marshall (1920:564) states: "the most valuable of all capital is that invested in human beings". At the microeconomic level this translates to the HC of each household [1] (HH) or individual being the most valuable national asset. Thinking about the concept of HC dates back as far as the seventeenth century, when Sir William Petty (1623-1687) was one of the first economists to try to define and measure it (Folloni and Vittadini, 2010). Before the twentieth century, research regarding HC was centered largely on how to enhance national wealth accumulation (Folloni and Vittadini, 2010). Controversy shrouded earlier discussions regarding HC, as it was believed morally wrong to value humans as resources contributing to wealth (Folloni and Vittadini, 2010). A shift in thought occurred with the introduction of microeconomic theories at the University of Chicago that regarded HC as a rational investment choice (Schultz, 1959; Becker, 1962; Mincer, 1958). These authors were thereafter considered the "founders" of microeconomic HC theory. The "new" theories were concerned with the "activities that influence future real income through the embedding of resources in people", that is, through investments in HC (Becker, 1962:9). However, the definition of what HC is, and therefore how it should be measured, has continued to be refined. Laroche et al. (1999, in Le et al., 2005) added the innate abilities of individuals to the definition. The OECD (2001:18) reflected these advances, stating HC to be "the knowledge, skills, competencies and attributes embodied in individuals that facilitate the creation of personal, social and economic well-being". Thus, HC is today viewed as multifaceted and not directly observed (Folloni and Vittadini, 2010:267).
Historically, three standard approaches to measuring HC have emerged (Le et al., 2005). These are: an income-based measure, originally modeled by Farr (1853); a cost-based measure (Eisner, 1989); and, lastly, various education-based methods (Barro and Lee, 1993; Wößmann, 2003). More recently, a latent-variable approach to measuring HC has been introduced and extended to measure HC in monetary terms (Dagum and Slottje, 2000). This approach was suggested because HC is statistically defined as a multidimensional non-observed latent variable (Dagum and Slottje, 2000). This study thus aims to measure HC availability in South Africa by applying a latent-variable approach, in a Structural Equation Modelling (SEM) framework, using as reflective indicators education, father's and mother's education, school quintile and health. Following the latent HC estimation, the results are compared with the education-based approach, the most widely used to measure HC. Furthermore, considering the persistent inequalities in South Africa, the study examines the distribution of HC, estimated using both measurement approaches, according to different characteristics of the population, namely gender, race, geographical area and income quintile. The value of estimating HC and the distribution thereof, using the different approaches, is thus explored.
In the next section, literature on the importance of HC as a determinant of socioeconomic inequality in South Africa is discussed, followed by an examination of the approaches used previously to measure HC. Thereafter the methodology, including the data, analytical approach and variables used in this paper, is explained. In Sect. 5 the results, including the distribution of the latent HC estimate by subgroup, are presented. The final section draws conclusions.

Human Capital as a determinant of inequality in South Africa
South Africa has one of the highest levels of wealth and income inequality in the world (Orthofer, 2017). The Gini coefficients for wealth and income were estimated as 0.95 in 2017 (Orthofer, 2017) and 0.63 in 2014 (World Bank, 2019) respectively, without significant decreases in the inequality estimates over the 26 years since the end of apartheid (Chatterjee et al., 2020). Since the end of apartheid, most of the research on income distribution in South Africa has focused on inter-racial income distribution (Van der Berg and Louw, 2004). Van der Berg (2014) found that inequality amongst racial groups has increased. Several factors have contributed to this, most importantly the unequal distribution of skills and inequality in educational outcomes, which Spaull (2015) highlighted as driven by inequalities in schooling quality. Furthermore, findings indicate that income inequality in South Africa is generated in the labour market (Leibbrandt et al., 2010; Hundenborn et al., 2016).
The impact of HC in South Africa on poverty and the labour market has been researched by authors including Lam (1999), Leibbrandt et al. (2010), Van der Berg (2010), Pellicer and Ranchhod (2012), Branson et al. (2013) and Wanka (2014). Lam (1999) examined educational inequality, including its intergenerational transmission, and its impact on earnings in South Africa and Brazil. Lam (1999) found that, firstly, schooling plays a large role in explaining earnings inequality in both countries and, secondly, educational inequality strongly influences intergenerational transmission. Leibbrandt et al. (2010) noted the importance of education and health policies in South Africa in equipping the labour force, particularly those disadvantaged by apartheid, with the skills required for active participation in a productive workforce and so in earning the income necessary to address poverty and inequality. This is aligned with HC theory regarding investments in HC, including education, on-the-job training and health, which are rational choices towards increasing earnings. Van der Berg (2010) furthermore showed that, rather than access to political power or wealth, wage inequality was a major driver of income inequality in South Africa and was itself the result of unequal HC distribution (measured as the education attainment and the quality thereof). He thus concluded that improved education quality was required both for future economic growth and to reduce poverty and income inequality. In studying HC accumulation and inequality traps in South Africa, through assessing theoretical frameworks relating inequality and decisions to invest in post-secondary education, Pellicer and Ranchhod (2012) concluded that government policies are needed to provide quality education that will lead to accumulating the necessary HC to reduce inequality in future.
Pellicer and Ranchhod (2012) furthermore highlighted that in South Africa high income inequality leads to low levels of skill accumulation perpetuating income inequality. This provides grounds for an additional conceptual perspective on the relationship between education and earnings, namely the theory of social reproduction (Chiappa and Mejias, 2019).
In the context of HC, the social reproduction perspective proposes that the social stratification order is reinforced by the education system (Bourdieu and Passeron, 1977). The idea of social reproduction stems from Marx's statement that "every social process of production is, at the same time, a process of reproduction" (Marx, 1977:711). Social reproduction theory thus highlights the intergenerational transmission of socioeconomic status (SES), given the accumulated disadvantages accrued during childhood by being born into a family with a lower SES and parents with low levels of education (Breen and Johnson, 2005). The OECD (2014) indicates that the combined impacts of social background and HC effects reproduce SES and thus a circular relationship between SES and HC is perpetuated. It is this reproduction of social order, highlighted by Bowles and Gintis (1975), that is missing from microeconomic HC theory. This furthermore highlights the potential for endogeneity when studying HC and SES. Spaull (2015), for example, states that a child's future socioeconomic outcomes (for example, employment, income and education) in South Africa can be predicted with a certain level of accuracy by the age of seven. Children from poorer homes do not acquire basic skills, which limits their future productivity and ensures they remain perpetually disadvantaged. Wealthy individuals, HHs and nations are expected to spend more on education and consequently acquire more HC (Chani et al., 2014). This suggests that an unequal distribution of income or wealth may worsen HC inequality and vice versa (Chani et al., 2014). This would furthermore explain why in South Africa, and in other countries with high levels of inequality, talented citizens are unlikely to reach their full potential, which negatively impacts much-needed social and economic development (Keeton, 2014).

An alternative approach to measuring HC
Despite education being a fundamental element of HC, Le et al. (2005) note that education measures are simply a proxy for HC and do not measure the multidimensional aspects of HC directly (United Nations, 2016). When critiquing the education-based method, Le et al. (2005) highlighted that the different proxies originally used at a national level were: adult literacy rates, school enrollment ratios, and average years of schooling. These education proxies were, however, noted as misspecifying HC on theoretical grounds (Wößmann, 2003) and as failing to account for the multidimensional nature of HC (Folloni and Vittadini, 2010). The most used proxy, namely average years of schooling, measured using individual workers' highest education attainment (Wößmann, 2003), was critiqued for not accounting for returns to education or for the quality of the education received. Given the almost non-comparable differences in education quality in South Africa (Spaull, 2019), there is sound motivation for moving beyond the education-based method of measuring HC. Education attainment as a proxy for HC is thus subject to proxy measurement error, which Wendelspiess Chávez Juárez (2015:119) defined "as a deviation of the observed variable with respect to the underlying concept we actually want to measure".
Given that HC cannot be observed directly and has been statistically defined as a multidimensional composite variable, it needs to be measured in a latent variable framework (Folloni and Vittadini, 2010). Dagum and Slottje (2000) described household HC as the "multidimensional nonobservable construct generated by personal ability, home and social environments, and investments in the education of the HH head and spouse whose effects are indirectly measurable by means of the present value of a flow of earned income throughout an individual life span" (Dagum et al., 2007:581). Dagum and Slottje (2000) were among the first to formalize the latent-variable approach to measuring HC in a breakthrough study that introduced an actuarial mathematical approach to convert the composite construct HC to monetary values. Dagum and Slottje (2000) used age of HH head, years of schooling of HH head and spouse, years of full-time work of HH head and spouse, marital status, gender of HH spouse, region of residence, number of children, total wealth and total debt as indicators of personal ability, home and social environments. The monetary value of the latent HC was then used to calculate the distribution of HH HC in the United States of America (USA). This method was repeated by Dagum et al. (2003), who studied the relationships between wealth, debt and HC. Authors including Dagum et al. (2007), Vittadini and Lovaglio (2007) and Lovaglio (2008) extended the method developed by Dagum and Slottje (2000) to ensure estimates are consistent with the economic definition of HC suggested by Dagum et al. (2007:581). These authors studied HH HC in the Italian and USA contexts. Vittadini et al. (2003) noted that there are numerous ways in which a latent variable can be defined, each of which will have implications for the statistical model used.
For example, HC may be estimated by means of Factor Analysis if the latent HC is identified as a factor representing the shared variance in the observed reflective indicators. Ultimately the model specified should reconcile with the meaning and definition of HC applied in the study and with empirical evidence (Coltman et al., 2008). Laverde-Rojas et al. (2019) analyse the benefits of measuring HC as a latent variable in a SEM framework in comparison to an education-based approach, by comparing their predictive power for economic growth. They find that years of education has had reduced predictive power since the 1990s, which they attribute to schooling having become homogenized, with almost universal enrolment. This could be an answer to the question raised by Castello-Clement and Domenech (2017), who investigated why the reduction in education attainment inequality over the years has not been accompanied by falling income inequality.
It is therefore evident that policymakers would benefit from applying a multidimensional measure of HC to help guide policies to ensure socioeconomic development. The variables used to measure latent HC have varied slightly between studies. When selecting variables to estimate HC at a disaggregated level it is important to consider data availability, economic definition, and context. From previous studies (Dagum et al., 2007; Vittadini and Lovaglio, 2007; Lovaglio, 2008, 2010), the variables in the model should be consistent with the definition of HC and fall into the following broad areas: education indicators, demographic indicators, on-the-job training and participation, adult skills and family characteristics. These variables are consistent with HC and social reproduction theory. Income is an additional variable which should be considered in accordance with the income-based approach (Vittadini et al., 2003).
This study thus estimates HC on a microeconomic level as a latent construct, in a SEM framework, and uses the results to analyse socioeconomic inequalities. The meaning of HC as defined by the OECD (2001) was used to develop the conceptual path model in this study and HC is thus defined as a non-observable construct, developed in home and social environments, that best determines socioeconomic wellbeing. Furthermore, the socioeconomic context, data availability and variables suggested by Dagum et al. (2007), Vittadini and Lovaglio (2007) and Lovaglio (2008, 2010) guided the development of the conceptual model.

Data and sample
The study analysed HC in South Africa, a country with high levels of socioeconomic inequalities. Primary data from the National Income Dynamics Study (NIDS) dataset was employed. The NIDS is a panel dataset compiled biennially by the South African Labour and Development Research Unit (SALDRU) at the University of Cape Town (UCT). The first wave of data was released in 2008 (NIDS, 2018). The initial wave was administered to a nationally representative sample of approximately 26 776 individuals from 7 300 HHs, but grew with each wave as HHs expanded, to a total of approximately 37 368 individuals and 13 719 HHs in wave 5 (Brophy et al., 2018). The most recent wave, Wave 5, conducted in 2017 and released in 2018, is used in this study (Brophy et al., 2018) to conduct a cross-sectional analysis of South Africa in 2017, with panel weights applied to account for panel attrition (Zizzamia et al., 2018). The two main reasons for using the NIDS dataset were, firstly, the comprehensive nature of the socioeconomic data collected, and, secondly, that this research forms part of a larger study, which aims to analyse HC using different measurement approaches and the impact of HC on inequality. Using the same dataset makes the different HC measurements comparable. The sample was restricted to employed individuals in order to analyse the distribution of HC in the labour market (the market in which HC is employed and income inequality in South Africa is generated [Leibbrandt et al., 2010]). Only employed individuals without missing observations for the variables in the model were included, which restricted the sample to 2 771 individuals. Post-stratification weights were applied to allow for population estimates (SALDRU, 2017).

Analytical approach
This study employs a SEM framework, a multivariate statistical analysis method, which combines factor analysis and multiple regression (Bollen, 2002). The analytical approach includes three steps: (1) Exploratory data analysis; (2) Confirmatory Factor Analysis (CFA) (Measurement model); and (3) Analysis of the profile of HC by SES indicators, estimated using a latent-variable (HC_L) and education-based approach (HC_E). The first two steps were conducted to establish the model and estimate HC_L per person in the sample. A sensitivity analysis is included to confirm the robustness of the results.
Step 3 was included to explore the distribution of HC using the education-based and latent-variable approaches, and the HC profile by SES indicators, namely, income, gender, race, and geographical area.

Conceptual measurement model and variables
Different path models were explored to estimate the latent variable HC within a SEM framework, and it was identified that the latent variable HC is best estimated in South Africa by applying a reflective CFA measurement model. Coltman et al. (2008) noted that in a measurement model observed indicators can be specified as causal in nature (here HC would be estimated in a formative model specifying the indicators as causes of HC) or reflective in nature (here HC would be estimated in a reflective model specifying that the observed indicators are representations of HC). Jarvis et al. (2003) identified that a reflective model is motivated if the indicators all represent the underlying construct, and are thus potential proxies of HC, and are correlated (see the correlation matrix in Appendix 1). Furthermore, the indicators should be interchangeable, and dropping one indicator should not alter the conceptual meaning of the latent construct (Jarvis et al., 2003).
The reflective indicators selected in the measurement model of HC are parental education, education attainment, school quintile and self-perceived health status. The observed indicators are consistent with the OECD (2001) definition of HC, empirical evidence in South Africa (Van der Berg, 2010; Pellicer and Ranchhod, 2012; Spaull, 2015), HC and social reproduction theory, as well as the variables suggested by authors aiming to account for the multidimensional nature of HC internationally (Dagum and Slottje, 2000; Grossman, 2000; Dagum et al., 2003; Vittadini et al., 2003; Dagum et al., 2007; Vittadini and Lovaglio, 2007; Lovaglio, 2008, 2010). These authors highlight that the variables considered should broadly fall under the headings of education, work experience, indicators linked to career progression, health and, lastly, income as an outcome of HC. The variables included in this study are detailed in Table 1.
The measurement model of latent HC (HC_L) can be written as a set of five regression equations, as captured in Eqs. 1 to 5, or illustrated in the conceptual measurement path model presented in Fig. 1 (StataCorp, 2013).
Variables representing the reflective indicators for estimating HC_L:
Education (edu) (Categorical): 1. Primary or less; 2. Incomplete secondary; 3. Matric**; 4. Some tertiary; 5. Bachelor or higher
Variables employed in the analysis of the distribution of HC:
Gender (Dummy): 1. Male; 0. Other
Branson and Leibbrandt (2013).
** Matric is equivalent to achieving a secondary school leaving certificate, which gives access to tertiary education, namely International Standard Classification of Education (ISCED) levels 5 or 6.
*** Schools in South Africa are classified into quintiles, from the poorest to the least poor. Schools receive money from government according to quintile, with Quintile 1 schools receiving the highest allocation per learner, while Quintile 5 schools receive the lowest.

In Eqs. 1 to 5, HC_Li is the estimated latent HC; edu_i is education; fthedu_j is father's education; mthedu_k is mother's education; schoolquin_l is the school quintile; health_m is health; α is the intercept associated with the latent variable; β_i, γ_j, δ_k, ρ_l and τ_m are the path coefficients for education, father's education, mother's education, school quintile and health respectively; and ε is the error term associated with the respective equations. The path diagram is illustrated in Fig. 1.
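As a reading aid, the five measurement equations of a one-factor reflective model can be sketched as follows, using the symbols defined in the text. The numbered intercepts and error terms (α_1 to α_5, ε_1 to ε_5) and the suppressed observation index on HC_L are notational assumptions made for this sketch, not the authors' typesetting.

```latex
\begin{align}
edu_i        &= \alpha_1 + \beta_i\,  HC_L + \varepsilon_1 \tag{1}\\
fthedu_j     &= \alpha_2 + \gamma_j\, HC_L + \varepsilon_2 \tag{2}\\
mthedu_k     &= \alpha_3 + \delta_k\, HC_L + \varepsilon_3 \tag{3}\\
schoolquin_l &= \alpha_4 + \rho_l\,   HC_L + \varepsilon_4 \tag{4}\\
health_m     &= \alpha_5 + \tau_m\,   HC_L + \varepsilon_5 \tag{5}
\end{align}
```

In this form each observed indicator is regressed on the single latent variable, which is what distinguishes a reflective specification from a formative one, where the arrows would run from the indicators to HC_L.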

Equations 1 to 5 are used to estimate the path coefficients between the specified unobserved latent variable (HC_L) and the observed indicators in the data. Thereafter HC_L per observation is predicted to analyse the distribution of HC in South Africa.
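One common way to predict a latent score per observation from a fitted one-factor model is regression scoring. The expression below is a sketch of that technique under the assumption that it is the method applied; the symbols Λ (5 × 1 vector of loadings), φ (variance of HC_L), Θ (error covariance matrix), Σ (model-implied covariance matrix), μ (model-implied mean vector) and x (vector of observed indicators for one person) are notation introduced here.

```latex
\Sigma = \Lambda\,\phi\,\Lambda^{\top} + \Theta,
\qquad
\widehat{HC}_{L} = \phi\,\Lambda^{\top}\,\Sigma^{-1}\,(x - \mu)
```

Because the deviation x − μ has mean zero, scores predicted this way are centred on zero, so both negative and positive values are to be expected.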
It should be noted that Fig. 1 does not imply causality; rather, by accounting for the variance and covariance of the observed variables (Jarvis et al., 2003), the reflective CFA model estimates how well the specified observed indicators represent HC_L (Bagozzi, 2011).
With regard to the observed indicators in the model, the inclusion of parental education is supported by social reproduction theory and the empirical evidence of the intergenerational transmission of socioeconomic outcomes in South Africa (Pellicer and Ranchhod, 2012). Parental education furthermore represents the intergenerational transfer of cognitive and non-cognitive skills and health outcomes (Lundborg et al., 2018). School quintile is included because empirical evidence suggests that in South Africa the school attended plays an important role in educational attainment. This is because high-fee-paying schools are associated with better education quality (HC theory), as well as the reproduction of education outcomes where only the wealthy are able to afford to send their children to better functioning fee-paying schools (social reproduction theory). Education attainment and health are included in accordance with HC theory, which specifies these as investments in HC (Becker, 1964) that facilitate social mobility. Becker (1964:3) included health in the definition of HC, stating that HC is the "knowledge, information, ideas, skills, and health of individuals". Empirical evidence from South Africa has found that higher education attainment and better health have positive impacts on socioeconomic wellbeing (van Broekhuizen, 2011; Branson and Leibbrandt, 2013; Nwosu, 2015). Self-reported health status is used to proxy actual health. Nwosu (2015) identified the arguments for and against using subjective self-perceived health status variables. The arguments in support are the ability to capture various dimensions of health (Bound, 1991) and future mortality (Ardington and Gasealahwe, 2012). However, nonrandom measurement errors of self-perceived health indicators highlight potential bias (Kreider, 1999).
The conclusion reached by Nwosu (2015) was that, in capturing health and labour market participation relationships, self-perceived health responses may be more effective than reported individual illness or symptom responses (Bound, 1991). Furthermore, Ardington and Gasealahwe (2012) stated that, despite some measurement error, the health variables in the NIDS dataset are useful for the analysis of health and SES relationships.
Following the estimation of the CFA measurement model, using the statistical programme Stata (sem package), the goodness of fit analysis and confirmation of the measurement model, HC_L is predicted for further analysis (Coltman et al., 2008; StataCorp, 2013). The results from the reflective model will highlight potential areas requiring additional investments to reduce HC and productivity gaps in South Africa.
i) Exploratory Data Analysis.
The number of observations, mean and standard deviation of the variables relevant to the model and analysis of the distribution of HC_L are presented in Table 2. Included in the exploratory data analysis, a log-linear earnings function, specified in Eq. 6, is estimated to explore the associations between the observed HC reflective and socioeconomic indicators and labour income, as per microeconomic HC theory (Schultz, 1959; Becker, 1962; Mincer, 1958). The partial regression coefficients from the log-linear earnings function specified in Eq. 6 are presented in Table 2 [2].
The log-linear earnings function is:

$$\log Y_{ijklm} = \alpha + \beta_i\, edu_i + \gamma_j\, fthedu_j + \delta_k\, mthedu_k + \rho_l\, schoolquin_l + \tau_m\, health_m + \varepsilon_{ijklm} \qquad (6)$$

where logY is the log of net monthly labour income per capita in ZAR; edu, fthedu, mthedu, schoolquin and health are the observed study variables; β_i, γ_j, δ_k, ρ_l and τ_m are the respective coefficients; and ε is a vector of residuals per observation (error term). Eq. 6 is consistent with log-linear earnings functions adopted by economists to estimate the private returns to investments in HC as well as differences in earnings as per socioeconomic indicators (van Broekhuizen, 2011; Branson and Leibbrandt, 2013). The partial regression coefficients are thus included in the descriptive statistics.

The data were tested to identify their suitability for structure detection and for factor and SEM analysis. The Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy (MSA) test and Bartlett's sphericity test were therefore performed, with the results discussed in Sect. 5.2. The KMO indicates the proportion of variance in the variables that might be caused by underlying factors, with a result above 0.6 verifying the sample to be acceptable (Pett et al., 2003). Bartlett's sphericity test examines the correlations among the items, hypothesizing that the correlation matrix is an identity matrix. Rejecting the null hypothesis would indicate that the variables are related and thus suitable for factor and SEM analysis (Treiblmaier and Filzmoser, 2010), and the measurement model may be specified (Loehlin and Beaujean, 2017).
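The two suitability tests can be sketched from a correlation matrix alone. The example below is a minimal, standard-library illustration of the textbook formulas (Bartlett's chi-square from the log-determinant, and the overall KMO from correlations and anti-image partial correlations); it is not the routine used in the study. The function names, the 3 × 3 matrix and the sample size of 500 are hypothetical.

```python
import math

def invert_with_det(matrix):
    """Gauss-Jordan inverse and determinant of a small square matrix."""
    n = len(matrix)
    # Augment the matrix with the identity.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot_row][col]) < 1e-12:
            raise ValueError("matrix is singular")
        if pivot_row != col:
            aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
            det = -det
        pivot = aug[col][col]
        det *= pivot
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], det

def bartlett_sphericity(corr, n_obs):
    """Chi-square statistic and degrees of freedom for H0: corr is identity."""
    p = len(corr)
    _, det = invert_with_det(corr)
    chi2 = -(n_obs - 1 - (2 * p + 5) / 6) * math.log(det)
    return chi2, p * (p - 1) // 2

def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    p = len(corr)
    inv, _ = invert_with_det(corr)
    sum_r2 = sum_q2 = 0.0
    for i in range(p):
        for j in range(p):
            if i != j:
                # Anti-image partial correlation between items i and j.
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_q2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_q2)

# Hypothetical correlation matrix and sample size (illustrative values only).
corr = [[1.0, 0.6, 0.3],
        [0.6, 1.0, 0.2],
        [0.3, 0.2, 1.0]]
chi2, dof = bartlett_sphericity(corr, n_obs=500)
print(f"Bartlett chi2 = {chi2:.2f} on {dof} df; KMO = {kmo(corr):.3f}")
```

A large chi-square relative to its degrees of freedom rejects the identity-matrix hypothesis, and a KMO above the 0.6 benchmark cited in the text would indicate acceptable sampling adequacy.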
ii) Confirmatory Factor Analysis (The measurement model).
CFA is a way of analysing specified measurement models (StataCorp, 2013) and the validity of the relationships between the observed indicators (manifestations) and proposed constructs (Loehlin and Beaujean, 2017). Thus, for the purpose of obtaining a reliable measure of HC_L, a CFA measurement model was estimated as presented in the conceptual model (Fig. 1). The statistical significance of the relationships between the suggested indicators and the proposed HC_L was estimated in order to judge the reliability of the observed indicators. The goodness of fit of the model was assessed using the coefficient of determination (CD), comparative fit index (CFI), Tucker-Lewis index (TLI), standardized root mean squared residual (SRMR) and root mean square error of approximation (RMSEA). The indices were benchmarked as follows: CFI values ≥ 0.90 (Bentler, 1990); TLI values ≥ 0.90 (Hu and Bentler, 1999); SRMR values ≤ 0.06 (Hu and Bentler, 1999) and RMSEA values ≤ 0.08 (Browne and Cudek, 1993). Furthermore, a CD (R²) close to 1 would indicate that the model fitted the data well. Following model confirmation, a continuous variable HC_L was predicted [3], which is factor score [4] extraction conditional on the value of the observed variables (StataCorp, 2013). The statistical programme Stata (sem package) was used. The path coefficients and goodness of fit results are presented in Table 3.

Furthermore, a sensitivity analysis was conducted, using two approaches to test whether the estimated measure of HC_L is robust. Firstly, the reflective CFA approach is retained but the specified model in Fig. 1 is adjusted in two ways: (1) the health observed indicator is removed to predict the latent variable referred to as HC_h; and (2) the father's educational attainment observed indicator is removed to predict the latent variable referred to as HC_f. The results are compared by presenting goodness of fit results and standardized coefficients of the reflective indicators in Table 4.

Notes to Table 2: Statistically significant (**p < 0.01; *** p < 0.05; **** p < 0.10). ****To explore the study variables, the partial regression coefficients are estimated as per Eq. 6, in which the dependent variable is the log of labour income per month. Source: NIDS wave 5; authors' calculations. Post-stratified weights applied.
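The goodness-of-fit benchmarks listed in this section can be applied mechanically once the indices are available. The sketch below encodes those four cut-offs and flags which ones a model meets; the Tucker-Lewis index is abbreviated TLI here, the CD is left out because the text gives it no numeric cut-off, and the index values passed in are hypothetical, not results from the study.

```python
# Benchmarks as stated in the text: direction of the comparison and cut-off.
FIT_BENCHMARKS = {
    "CFI": (">=", 0.90),
    "TLI": (">=", 0.90),
    "SRMR": ("<=", 0.06),
    "RMSEA": ("<=", 0.08),
}

def check_fit(indices):
    """Return {index name: True/False} for each supplied fit index."""
    results = {}
    for name, value in indices.items():
        direction, cutoff = FIT_BENCHMARKS[name]
        if direction == ">=":
            results[name] = value >= cutoff
        else:
            results[name] = value <= cutoff
    return results

# Hypothetical fit statistics, for illustration only.
example = {"CFI": 0.95, "TLI": 0.89, "SRMR": 0.06, "RMSEA": 0.09}
for name, passed in check_fit(example).items():
    print(f"{name}: {'meets' if passed else 'misses'} the benchmark")
```

The same check can be repeated for each sensitivity model (with health removed, or with father's education removed) so that the comparison in the sensitivity table is made against one fixed set of cut-offs.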
Secondly, an alternative dimension reduction technique (DRT), namely Principal Component Analysis (PCA), was used to assess whether the results are sensitive to the estimation approach. PCA is a statistical method to reduce the number of variables into a few principal components, while maintaining as much of the variance as possible to reproduce the data structure for further analysis (Mooi et al., 2018). Thus, PCA is an alternative approach to measuring a composite HC (referred to as HC_p) to confirm the findings on HC_L in South Africa. The first principal component (PC1), which accounts for the largest possible variance in the dataset, is used to predict HC_p. The factor weights per component are presented in Table 5.

Firstly, to map the concentration distribution of HC, kernel density (k-density) curves [5] for HC_L, HC_h, HC_f, HC_p and HC_E are illustrated in Fig. 2. Thereafter, a histogram (Fig. 3) is presented to compare the frequency distribution of HC_L and HC_E. Finally, Tables 6 and 7 present the profile of HC_L and HC_E, respectively, by socioeconomic indicator. The analysis by socioeconomic indicator is motivated by the Sustainable Development Goals, which highlight the importance of studying the outcomes of 'the poorest, most vulnerable and those furthest behind' (United Nations, 2015:37). In South Africa, these indicators include race, gender, geographical region and labour income (StatsSA, 2019). The figures were constructed using the statistical program R, version 4.0.3.

[3] HC_L was predicted as a continuous variable per person in the sample using z-transformations and thus has a mean of 0. Negative and positive HC_L values are expected.
[4] The factor scores are estimated by means of a linear regression, using the fitted model's mean vector and variance matrix.
[5] K-density curves illustrate the density of a variable from the observations of the variable.
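The PCA-based robustness check described in this section rests on extracting the first principal component of the indicators' correlation matrix. The sketch below illustrates that step with plain power iteration, which is a swapped-in numerical method chosen to keep the example dependency-free; it is not the routine used in the study. The function names and the 3 × 3 correlation matrix are hypothetical.

```python
import math

def first_principal_component(corr, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric correlation
    matrix, found by power iteration."""
    p = len(corr)
    # Asymmetric start vector so it is unlikely to be orthogonal to PC1.
    vec = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    eigenvalue = sum(vec[i] * corr[i][j] * vec[j]
                     for i in range(p) for j in range(p))
    return eigenvalue, vec

def pc1_score(z_row, weights):
    """Composite score: weighted sum of one person's standardized indicators."""
    return sum(w * z for w, z in zip(weights, z_row))

# Hypothetical correlation matrix for three indicators (illustrative only).
corr = [[1.0, 0.6, 0.3],
        [0.6, 1.0, 0.2],
        [0.3, 0.2, 1.0]]
eigenvalue, weights = first_principal_component(corr)
share = eigenvalue / len(corr)  # proportion of total variance captured by PC1
print(f"PC1 variance share: {share:.3f}")
print("PC1 weights:", [round(w, 3) for w in weights])
print("Score for z = (1.0, 0.5, -0.2):", round(pc1_score([1.0, 0.5, -0.2], weights), 3))
```

Unlike the reflective CFA model, PC1 is a weighted sum of the indicators with no separate error term per indicator, which is why agreement between the two composites is a useful robustness signal rather than a foregone conclusion.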

Descriptive Statistics
The descriptive statistics for the sample datasets, including the number of observations, mean (for continuous variables) and percentage distribution of observations per category (for the categorical and dummy variables), are presented in Table 2. Furthermore, the partial regression coefficients estimated from Eq. 6 are presented.
The results indicate that the majority of the sample have less than matric, have good health or better, attend no-fee schools, have parents with primary schooling or less, are females, African and reside in urban areas. It is estimated that educational attainment has improved over the generations, with over 50% of parents obtaining primary schooling or less, compared to 9.93% of the earlier generation. This is potentially a result of the reduction in variance in schooling curricula, educational attainment and almost universal schooling enrolment, which Laverde-Rojas et al. (2019) referred to as "educational homogenization". However, higher levels of educational attainment (a Bachelors degree or more) remain low, at 11.19% of employed individuals. This is concerning, given that returns to education are high and increasing for degree graduates in South Africa (van Broekhuizen;2011;Branson and Leibbrandt; and income inequality in South Africa originates in the labour market. It is expected that without increases in higher levels of education attainment, income inequality will persist. The increased returns to higher levels of education is supported by the partial regression coefficients presented in Table 2. Having a Bachelors degree or more is correlated with earnings 198% higher in comparison to those with primary schooling or less. Only one of the parental education estimated coefficients, and none of the school quintile, geographical area and self-perceived health regression coefficients, were statistically significant. However, while controlling for all other model variables, the gender and race results were consistent with expectations, including: males are expected to earn on average 49% more than females; and Whites, Asian/Indians and Coloureds are expected to earn 60%, 57% and 18% respectively more than Africans. Comparing males and females in the same jobs, StatsSA (2019) found males earned on average 30% more than females. 
The finding of a wage gap of 49% is thus larger than the StatsSA (2019) finding and should be read as an estimate specific to the specification of Eq. 6 and the sample studied.
Fig. 3 Distribution of HC L and HC E. Source: NIDS wave 5; authors' calculations. Post-stratified weights applied. Note: Compulsory schooling in South Africa is from grade 1 to grade 9 (Republic of South Africa, 1996)
The results from Bartlett's sphericity and KMO tests confirmed sample data adequacy and sufficient correlation among the proposed variables for SEM analysis. Bartlett's sphericity test was highly significant (χ² = 3204.67, p = 0.0000) and the KMO test coefficient was 0.739; thus the factor analysis could be conducted.
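As an illustration of what the two adequacy checks compute, the following is a minimal sketch using NumPy on synthetic data. It is not the authors' code and does not use the NIDS data; the function names, the simulated one-factor dataset and its loadings are hypothetical. Bartlett's statistic is computed from the log-determinant of the correlation matrix, and the overall KMO measure compares squared correlations with squared partial correlations.

```python
import numpy as np

def bartlett_sphericity(data):
    """Bartlett's test statistic and degrees of freedom for an (n x p) data matrix."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # slogdet is numerically safer than log(det(corr)) when variables are highly correlated
    _, logdet = np.linalg.slogdet(corr)
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return statistic, dof

def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations are the scaled, sign-flipped entries of the inverse correlation matrix
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    q2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + q2)

# Synthetic stand-in data: one common factor driving five indicators (hypothetical loadings)
rng = np.random.default_rng(0)
loadings = np.array([0.8, 0.8, 0.5, 0.5, 0.2])
factor = rng.normal(size=(500, 1))
noise = rng.normal(size=(500, 5)) * np.sqrt(1 - loadings ** 2)
data = factor * loadings + noise

statistic, dof = bartlett_sphericity(data)
kmo = kmo_overall(data)
print(f"Bartlett chi-square = {statistic:.2f} on {dof} df; KMO = {kmo:.3f}")
```

A large Bartlett statistic relative to its degrees of freedom rejects the hypothesis that the indicators are uncorrelated, and a KMO value closer to 1 indicates that the correlations are mostly shared rather than pairwise-specific.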

Confirmatory factor analysis (CFA) measurement model
The CFA measurement model has the objective of achieving a reliable measure of HC. The model was specified as standardized, thus the variance of HC L is set to 1. All the reflective indicators are positive and statistically significant indicators of HC L , as presented in Table 3. The fathers' and mothers' education coefficient estimates are both above 0.8 and are thus important proxies for latent HC. The education attainment and school quintile indicators are both significantly associated with HC L , with coefficients of 0.541 and 0.513 respectively. Health was included in the model and although it was estimated as having the lowest factor loading (0.158), its contribution to HC L was statistically significant. The health indicator was retained in the model, since HC theory suggests investments in health are representative of HC, and given the positive impact of health on labour force participation (Nwosu, 2015), educational attainment and labour productivity (Suhrcke et al., 2006). Nwosu (2015), using the NIDS dataset in a cross-sectional analysis, found statistically significant impacts of health (using different measures 6 ) on labour force participation of between 20% and 33%. When applying the post-stratified weights, the goodness of fit tests were restricted to residuals. The CD value of 0.837 indicates that 83.7% of the variance in the observed indicators is captured in the HC L construct. Providing a broader range of goodness of fit test results, the unweighted estimates, illustrated in Appendix 2, confirm the goodness of fit of the model and the estimated path-coefficients. The equation level goodness of fit confirmed the importance of father's and mother's education, with R 2 values of 0.69 and 0.68 respectively. Thus 69% and 68% of the variance in father's education and mother's education, respectively, is captured in the HC L construct. Appendix 2 indicates the Chi-square goodness of fit estimate was 25.396 with a p-value of 0.000.
The null hypothesis for the Chi-square test is that the model fits the observed data perfectly, thus rejecting the null translates to poor model fit. However, it is noted that the Chi-square test is sensitive to sample size, with larger sample sizes decreasing the p-value where there may be only a trivial misfit (Babyak & Green, 2010). It is therefore recommended that chi-square test results be considered in combination with the CFI, TLI, SRMR, RMSEA and other fit tests (Alavi et al., 2020).
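Two pieces of arithmetic behind the fit statistics discussed above can be sketched as follows. This is an illustrative calculation, not the authors' output. The first part uses the standard result that, in a standardized one-factor model, an indicator's equation-level R 2 equals its squared loading. The second part uses the usual RMSEA point-estimate formula to show why a significant chi-square can coincide with a small misfit in a large sample. The degrees of freedom (5) and the use of 2770 observations for the unweighted model are assumptions made for illustration; neither is reported in this section, and some software uses N rather than N - 1 in the denominator.

```python
import math

# --- Equation-level R^2 in a standardized one-factor model ---
# Standardized loadings reported in the text (Table 3).
loadings = {
    "education attainment": 0.541,
    "school quintile": 0.513,
    "self-perceived health": 0.158,
}
# With a single standardized factor, the share of an indicator's variance
# captured by the factor is the squared loading.
r_squared = {name: lam ** 2 for name, lam in loadings.items()}

# Going the other way: the reported R^2 values imply the parental loadings.
reported_r2 = {"father's education": 0.69, "mother's education": 0.68}
implied_loadings = {name: math.sqrt(r2) for name, r2 in reported_r2.items()}

# --- Chi-square and sample size ---
def rmsea(chi2, dof, n):
    """RMSEA point estimate from a model chi-square, its df and the sample size."""
    return math.sqrt(max(chi2 - dof, 0.0) / (dof * (n - 1)))

def implied_chi2(rmsea_value, dof, n):
    """Chi-square that the same per-df misfit would produce at sample size n."""
    return dof + rmsea_value ** 2 * dof * (n - 1)

CHI2 = 25.396         # reported in the text (Appendix 2)
N = 2770              # ASSUMPTION: observations reported for the predicted HC L
DOF = 5               # ASSUMPTION: not stated in this section
CRITICAL_5DF = 11.07  # 5% critical value of the chi-square distribution with 5 df

fit = rmsea(CHI2, DOF, N)
chi2_small_sample = implied_chi2(fit, DOF, 200)

for name, value in r_squared.items():
    print(f"{name}: R^2 = {value:.3f}")
for name, value in implied_loadings.items():
    print(f"{name}: implied loading = {value:.3f}")
print(f"RMSEA under the stated assumptions = {fit:.4f}")
print(f"Same misfit with 200 observations: chi-square = {chi2_small_sample:.2f}")
```

Under these assumptions the same degree of misfit would fall below the 5% critical value in a sample of 200, which is the sample-size sensitivity that Babyak & Green (2010) describe.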
The sensitivity analysis results presented in Table 4 highlight how removing the health indicator improved the goodness of fit, which is particularly evident when considering the unweighted goodness of fit tests. However, the standardized coefficients show only negligible changes in comparison to the full model. Removing father's education attainment also yields a good model fit (albeit with a lower CD and greater SRMR), and the standardised coefficient estimates remain comparable with the HC L estimates, with only negligible differences.
The factor scores presented in Table 5, estimated using PCA, indicate variance estimates comparable to those identified using the CFA measurement model to predict HC L , as indicated in Table 3. The results in Tables 3 and 5 reveal that parental education is responsible for most of the variance and that health contributes the least to the predicted HC estimates. PC1 is the only component with an eigenvalue over 1 (estimated as 2.41582). It is retained as the component used to predict HC p . Furthermore, the proportion of variance explained by PC1 is double that of PC2.
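The retention rule described above (keep components whose eigenvalue exceeds 1) can be sketched on synthetic data as follows. This is a minimal NumPy illustration, not the authors' procedure or data; the simulated indicators and their loadings are hypothetical. PCA is run on the correlation matrix of standardized indicators, so the eigenvalues sum to the number of indicators and each eigenvalue divided by that sum is the proportion of variance explained.

```python
import numpy as np

# Synthetic stand-in for five observed indicators driven by one common factor
rng = np.random.default_rng(1)
loadings = np.array([0.85, 0.85, 0.55, 0.5, 0.15])  # hypothetical values
factor = rng.normal(size=(1000, 1))
data = factor * loadings + rng.normal(size=(1000, 5)) * np.sqrt(1 - loadings ** 2)

# Standardize each indicator so that PCA operates on the correlation matrix
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
corr = np.corrcoef(z, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()   # proportion of variance per component
retained = eigvals > 1.0              # eigenvalue-greater-than-one rule
scores = z @ eigvecs[:, 0]            # first-component score per observation

for i, (ev, share, keep) in enumerate(zip(eigvals, explained, retained), start=1):
    print(f"PC{i}: eigenvalue = {ev:.3f}, variance explained = {share:.1%}, retained = {keep}")
```

The sign of a principal component is arbitrary, so in practice the first-component scores may need to be multiplied by -1 so that higher scores correspond to higher levels of the underlying indicators.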
Thus, considering all the goodness of fit tests and sensitivity analysis estimates, the conclusion is that the model has a good fit and the method used is robust. HC L per individual is therefore estimated for analysis by gender, race, geographical area and income in South Africa.

Human capital frequency distribution by subgroup in South Africa
From 2770 observations, the predicted HC L has an estimated standard deviation of 0.59, a minimum value of -0.68 and a maximum value of 1.48. The estimated concentration of individual HC L is presented in Fig. 2 using k-density curves. Numerous individuals are estimated as having low levels of HC L . The sensitivity analysis illustrates the robustness of the results on the concentration of HC L , given that lower levels of HC continue to dominate the k-density estimations of HC when applying the adjusted CFA measurement models and the PCA estimation approach. This is expected, given that a latent measure of correlated indicators maximises common variance and thus tends to show greater variance than a single proxy indicator. However, Wendelspiess Chávez Juárez (2015) identified that when measuring the inequality of a latent construct, factor analysis using multiple indicators is more accurate than a single proxy, which is subject to proxy measurement error and is downward biased. Parental education is estimated as having a greater impact on the variance of the latent HC in comparison to education attainment (see Table 3); thus the variance of HC L will be less aligned with education attainment and more aligned with parental education. Regardless of the measurement method applied, few individuals are indicated as having high levels of HC.
The histogram presented in Fig. 3 furthermore illustrates a distinct difference in the distribution of HC using the two measurement approaches. A large number of individuals are estimated as having low levels of HC L but moderate levels of HC E .
The distribution profiles of HC L and of HC E by race, gender, geographical area and income quintile are presented in Tables 6 and 7, respectively. In Table 6 the HC L quintiles are calculated by dividing the population into five groups of equal number according to the value of their HC L . The findings illustrate that the individuals in the sample with the top 20% of HC L are most commonly in income quintile 5, White, male and resident in urban areas. The subgroup profile of HC E is presented in Table 7 considering five education categories. Individuals in the top HC E category are predominantly income quintile 5 earners, White, female and resident in urban areas. It should be noted that HC L and HC E estimates are not directly comparable by subgroup using quintile distributions given data limitations. Thus, the subgroup distribution of HC L is presented by quintile and that of HC E using five education categories, which is the standard approach to estimating rates of return to education in South Africa (Van Broekhuizen, 2011 and Leibbrandt, 2013).
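The quintile construction described above can be sketched as follows. This is a minimal, unweighted illustration on simulated scores, not the authors' procedure: the paper applies post-stratified weights, which this sketch omits, and the function names, the simulated scores and the urban/rural labels are hypothetical. Observations are ranked by score, split into five groups of equal number, and the percentage of each subgroup falling in each quintile is then tabulated.

```python
import random
from collections import Counter, defaultdict

def assign_quintiles(values):
    """Rank observations and split them into five equal-sized groups (1 = lowest, 5 = highest)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    quintiles = [0] * n
    for rank, idx in enumerate(order):
        quintiles[idx] = rank * 5 // n + 1
    return quintiles

def shares_by_group(quintiles, groups):
    """Percentage of each subgroup that falls in each quintile (each row sums to 100)."""
    counts = defaultdict(Counter)
    for q, g in zip(quintiles, groups):
        counts[g][q] += 1
    return {
        g: {q: 100.0 * c[q] / sum(c.values()) for q in range(1, 6)}
        for g, c in counts.items()
    }

# Simulated stand-in data: scores and an unrelated urban/rural label
random.seed(0)
scores = [random.gauss(0.0, 0.59) for _ in range(1000)]
areas = [random.choice(["urban", "rural"]) for _ in range(1000)]

quintiles = assign_quintiles(scores)
shares = shares_by_group(quintiles, areas)
for area, row in sorted(shares.items()):
    print(area, {q: round(share, 1) for q, share in row.items()})
```

Because the simulated label is unrelated to the score, each subgroup here sits close to 20% per quintile; departures from that benchmark are what a subgroup profile by quintile is designed to reveal.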
Considering the distribution of HC by geographical area, it is observed that only 5.5% of rural compared to 24.9% of urban employed residents are in the 20% of the sample with the highest HC L . In contrast, 35.3% of rural dwellers are in the lowest HC L quintile (compared with only 15% of urban dwellers). The HC E categories confirm the dominance of higher levels of HC in urban areas; however, HC L illustrates larger differences between urban and rural areas when comparing quintile 5 levels of HC L with the highest HC E category, namely Bachelors degree or more, as evident in Tables 6 and 7.
Considering the distribution of HC by gender, males are slightly more likely than females to have quintile 5 levels of HC L . This is in contrast to HC by education attainment, where females are more dominant in the two highest HC E categories. This is evident with 13.3% of females having obtained Bachelors degrees or more, in comparison to 9.4% of males. The differences between the HC estimates by gender do not adequately explain the large differences in labour income earnings between genders. While controlling for several factors, the earnings partial regression coefficient for gender illustrates that males' incomes are 49% higher than females'. Possible reasons for the large differences in earnings between genders include women shouldering a disproportionate burden of care work in the HH (Casale & Posel, 2005 and Magadla et al., 2019), gender stereotyping, and a wage ceiling, given that women are restricted from participating in the highest paid positions in the economy (Mosomi, 2019).
The racial HC L distribution differences are extreme; 0% of Whites but 50% of Africans fall in the two lowest HC L quintiles. The estimates confirm how the unequal development of HC during the apartheid era continues to disadvantage other race groups (particularly Africans) in comparison to Whites. The disparities between race groups are however not as distinct when considering HC E , with 57.9% of Whites in the top two HC E categories in comparison to 76.4% in the top HC L quintile alone. The comparable measures for Africans are 29.6% for the top two categories of HC E and only 11.1% in the top category of HC E .
Both measures clearly demonstrate the importance of HC in explaining income distribution in South Africa. 66.8% of income quintile 1 and 56.7% of income quintile 2 have the two lowest quintiles of HC L . The corresponding shares for HC E are 74.6% and 60.8%. In contrast, 75.8% and 75.4% of income quintile 5 have the highest quintile of HC L and HC E respectively. Human capital therefore clearly matters. This is expected, given other findings of higher returns to education for higher levels of education in South Africa (Leibbrandt, 2012 and van Broekhuizen, 2011). The CFA measurement model specified thus identifies the continued disparities between racial groups, given that parental educational attainment and school quality are incorporated as indicators of HC L . This provides evidence that Whites will continue to remain advantaged and Africans disadvantaged unless the latter are afforded the opportunity to acquire better quality education and improved education attainment at the higher levels. Racial differences in schooling quality remain evident in South Africa and without high quality education for all citizens, the cycle of poverty and economic inequalities will persist.
Considering the profile of HC by income quintile identifies the benefit of profiling individuals by population quintile, which cannot be accurately presented when using the HC E categories. For example, it is expected that the concentration of income quintile 1 individuals will steadily decrease from the lowest to the highest HC E category, as per the HC L quintile profiling. However, this is not the case; there is an increase from HC E category 1 to category 2. Furthermore, for income quintile 4 it is expected that there should be a steady increase in the percentage of employed individuals from education category 1 to 5 (as per the HC L quintiles). However, there is a reduction in the percentage of individuals in education category 5 compared to education category 4. These unexpected outcomes are likely to be a result of the unequal number of individuals in the different education categories. Education category 2 has the highest percentage of individuals in the sample at 36.2%, in comparison to education category 5, which has the lowest percentage at 10.14%. To investigate the relationship between HC L and earnings, further research on the returns to HC L is recommended.

Conclusions
Human capital is a complex concept to measure, given that it is unobserved and that there is no clear definition or specified set of indicators to capture its meaning. Education has thus become generally accepted as a proxy for human capital, mainly because there is extensive data available. However, a latent-variable approach has been proposed, since human capital is in fact a multidimensional unobserved construct and a single proxy carries the potential for measurement error. The authors employed a reflective confirmatory factor analysis measurement model to estimate the latent construct human capital, and the model was identified as having a good fit and being robust.
The observed indicators included in the model were selected based on empirical evidence of high returns to education attainment, inequalities in education quality, intergenerational transmission of poverty and inequality and the contribution of health as a factor impacting education attainment and labour market outcomes. The indicators were furthermore in accordance with social reproduction theory and microeconomic human capital theory, and sensitivity analyses were conducted to analyse the robustness of the results. Human capital profiles by (latent) human capital quintiles and (education) human capital categories are presented.
Valuable insights are gained from estimating human capital in a confirmatory factor analysis measurement model and presenting the (latent) human capital and (education) human capital profiles by socioeconomic status indicators. Firstly, it was found that in South Africa parental education is associated with the largest amount of variance in human capital, while the health indicator captures the least. (Latent) human capital shows a greater concentration of individuals at lower levels of human capital in comparison to (education) human capital. The differences arise because of the inclusion of parental education and school quality in (latent) human capital, which captures the socioeconomic status of HHs. Furthermore, when studying human capital frequency distribution by population subgroup in South Africa, the (latent) human capital variable allows for analysis by population quintiles, which human capital measured using an education-based approach does not. It is identified that the individuals in the sample with the top 20% of (latent) human capital are most commonly income quintile 5 earners, White, male and resident in urban areas. The subgroup profile of (education) human capital, presented using five education categories, estimates individuals in the top (education) human capital category to be predominantly income quintile 5 earners, White, female and resident in urban areas. However, although the results do not appear to differ significantly, the (latent) human capital measure accentuates distribution differences. For example, when comparing racial distributional differences, the (latent) human capital estimates emphasise that the unequal development of HC between Whites and Africans during the apartheid era continues; this is also evident when studying (education) human capital, but is not as accentuated.
Thus, it is suggested that without improvements in schooling quality, valuable education attainment and parental education for those historically disadvantaged and subject to poor service delivery will remain low and thus inequalities in human capital and income will persist. In addition, the results provide evidence consistent with social reproduction theory, with the education system in South Africa perpetuating social stratification. The sensitivity analysis confirmed the robustness of the results, with model fit improved when removing health as an observed indicator.
Despite the reflective model providing insights into the analysis of human capital, the process of measuring human capital using a latent-variable approach in this study highlights why the education-based approach to measuring human capital persists in South Africa. The reasons include data availability, simplicity of the approach and that the education-based human capital variable presents expected distributional differences between subgroups, with income quintile 5 earners, Whites, and residents in urban areas experiencing higher levels of human capital in comparison to lower income quintile earners, Africans, and residents in rural areas. The latent-variable approach however allows for population quintile analysis and multiple indicators to be incorporated so as to account for the intergenerational transfer of poverty and inequality and a schooling system which perpetuates this cycle. For future research, policymakers from additional countries could benefit from exploring a structural equation modelling framework to analyse human capital as a latent variable. Furthermore, research focused on understanding health as an indicator of human capital is also recommended.

Declarations
Competing interests The authors declare they have no competing interests related to this research.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.