1 Introduction

The ‘returns to schooling (education)’ concept, as developed by Mincer (1974), was subsequently theoretically enriched and empirically tested in various contexts, and has contributed to the evaluation of the economic role of education, labor market conditions and human capital productivity. Although the model in its basic form has certain conceptual flaws (in particular, it ignores bias potentially caused by unobservable factors that can influence both schooling and earnings), methods have been proposed to overcome them, such as the use of panel data, which allows one to control explicitly for unobserved heterogeneity under the assumption that it is time-invariant. The aim of this study is to estimate the returns to schooling in Kazakhstan by treating the large repeated cross-sectional data collected by the Household Budget Survey in 2002–2016 as a synthetic or pseudo-panel. The approach proposed by Deaton (1985) adopts a pseudo-panel of cohort means, where we define a ‘cohort’ as a group of people of the same gender born in the same year who are assumed to share some common, unobserved characteristics.

There are no assessments available for the returns to education in Kazakhstan during the Soviet period, but they are believed to be low due to wage levelling, wage ‘grids’, and the centralized allocation of the labor force. However, according to a few post-Soviet examinations, they soared with the transition. In Kazakhstan, whose independence can be roughly divided into two sub-periods—the severe crisis of the 1990s and the oil boom of the 2000s (Fig. 1)—the later economic growth might additionally have contributed to the increase in returns via several channels. Demand for education consistently grew during the period of independence, with the number of university students increasing from 287,367 in 1990/91 to 542,458 in 2018/19, and the number of college students (technical and vocational education and training) from 247,650 to 489,818 for the same years, respectively (with a corresponding net increase in population of around 1.5 million).Footnote 1 Kazakhstan’s educational institutions expanded accordingly: from 55 HEIs in 1990/91 to 124 in 2018/19, and from 247 colleges to 769 for the corresponding years.Footnote 2 On the other hand, with this nearly twofold increase in the inflow of educated people, it might be reasonable to predict a decrease in returns to schooling over time. Additionally, the period under consideration represents the oil boom decade when GDP per capita grew from 1658.00 USD in 2002 to 13,890.80 USD in 2013,Footnote 3 but with signs of recession starting in 2014 when world commodity prices plummeted, dragging down an economy that was (and still is) highly dependent on oil and gas exports, which might also have intriguing effects.

Fig. 1 GDP per capita (constant 2010 USD). Source: World Bank Data

With the pseudo-panel approach, we found the returns to schooling to be relatively high (7–13% with the fixed effects and 8–11% with Mundlak random effects, depending on a set of additional control variables) and essentially identical to simple OLS estimates obtained from individual data (8–12% for men and 10–13% for women), which are in turn very similar to the only previous examination that used the instrumental variables approach (Arabsheibani and Mussurov, 2007). Though the results for schooling are robust across models regardless of controlling for cohort heterogeneity, with the Mundlak model the cohort effect (between-estimator) turned out to be highly significant and negative: while an increase in cohorts’ average schooling over time increases their wages, less educated cohorts earn more than more educated ones. More educated cohorts in the sample are the younger individuals whose school-leaving age fell roughly within the recession of the 1990s, suggesting the business cycle impact interpretation: cohorts entering the labor market during a recession and facing a lack of jobs apparently end up getting more education and lower lifetime wages.

The study additionally uncovers other curious results. First, though real wages rocketed during the observed period (by about 500–600% for men and 300–400% for women), the returns to schooling dropped (by about 4–5% and 2–3% for men and women, respectively). Second, the rapid growth in real wages over the period could only partially be explained by the changes in the working population’s observed characteristics, including education, by about 30% for men and 40% for women, leaving the remaining part likely due to the oil boom growth. Third, despite females earning less, their returns to schooling were consistently higher for all models. The latter could probably be explained by gender differences in the labor force allocation between industries and sectors, with men mainly employed in market-oriented, riskier, but better paid industries with predominantly private ownership that probably value education less than the public sector and those industries absorbing the female labor force, where a certain level of schooling is often formally required and, indeed, rewarded. This, in turn, complies with the higher level of education amongst women compared to men due to their rational decisions under the prevailing labor market conditions.

The paper is organized as follows. Section 2 discusses the theoretical framework, the pseudo-panel methodology and its possible drawbacks, and briefly reviews its previous applications worldwide. It also details some of the very few research efforts to examine the returns to schooling in Kazakhstan and the region. Section 3 describes the sampling methodology and the questionnaire, and states the data limitations and the resulting caveats for the interpretation of the results. It also presents descriptive statistics and visualizes the most important individual-level variables disaggregated by gender, as well as the cohort-level data. Section 4 discusses the main findings from the estimated models and their possible interpretations in the context of the Kazakhstani labor market, which are summarized in the conclusions.

2 Theoretical framework and previous examinations

A definition of returns to schooling was given by Mincer in his seminal work as “a full quantitative accounting of the effects of the distribution of investment in human capital on observed earnings inequality” (Mincer 1974, p. 43). Mincer’s earning function, in its attempt to explain the extent to which earnings depend on schooling or education, is still widely used in many empirical studies as a key concept for the analysis of private returns. As Heckman et al. (2003, p. 1) note, “Mincer’s model of earnings... is the framework used to estimate returns to schooling, returns to schooling quality, and to measure the impact of work experience on male-female wage gaps”. Comprehensive reviews of existing empirical applications are given by Harmon and Walker (2001) and Card (1999, 2001).

In its basic form, Mincer’s model suggests the log of earnings (or wages) to be linearly dependent on either years of schooling or a level of education attained by an individual and other relevant control variables, such as their experience (practically, often substituted by its proxy, age, normally both in linear and quadratic terms to allow for diminishing returns to experience), gender, region and others. Depending on what is used as an explanatory variable—schooling or level of education attained—the model estimates the returns to either schooling or credentials. The debate in Labor Economics with regard to this topic has given rise to a number of hypotheses, among which the ‘sheepskin effect’ might be considered as potentially promising for testing in Kazakhstan, where the current education system has been widely criticized by society as adding little value in terms of human capital productivity due to overall low-profile staff, outdated content and learning facilities, and poor links to industry. The concept suggests that completing a degree provides better returns than the same years of schooling with no degree awarded (Hungerford and Solon 1987) and echoes human capital signalling theory, indicating education’s filtering and signalling role: in a market of asymmetric information with employers having limited access to information on potential employees and no opportunity to conduct formal tests for productivity, they can only rely on information regarding their level of education as a signal of potential productivity, which is the main channel leading from education to the labor market returns rather than the value added by education (Spence 1978; Arrow 1973; Stiglitz 1975).

The biggest challenge with Mincer’s specification, as discussed in the academic literature, is that it treats schooling (or education) as exogenous, ignoring any possible endogeneity caused by potential correlation of unobservable factors influencing wages (such as inner ability, motivation or family background) with schooling (education). This strong assumption generates omitted variable bias—so-called ‘ability bias’ (Griliches 1977)—and methods to deal with it have been proposed and empirically tested. One such is the fixed effects model, the implementation of which requires panel data. Generally, the whole idea behind the use of panel data is motivated by the possibility of being able to solve the omitted variable problem (Wooldridge 2010).

Using panel data (repeated observations of the same individuals over time) allows the unobservable variable(s) influencing wages to be held constant while obtaining the partial effects of the observable explanatory variables. With the wage equation:

$$ y_{it} = \beta_0 + \boldsymbol{x}_{it}\boldsymbol{\beta} + c_i + u_{it}, \quad t = 1, \ldots, T \tag{1} $$

where

  • $y_{it}$: individual $i$’s wage observed in period $t$

  • $\boldsymbol{x}_{it}$: observable variables reflecting various factors influencing $i$’s wage in period $t$

  • $c_i$: time-constant unobserved component

  • $u_{it}$: an idiosyncratic error term,

this might be achieved through either first differencing or the within (time-demeaning) transformation, both of which eliminate the unobserved component (Wooldridge 2010):

$$ \Delta y_{it} = \Delta \boldsymbol{x}_{it}\boldsymbol{\beta} + \Delta u_{it} \tag{2} $$
$$ y_{it} - \bar{y}_i = (\boldsymbol{x}_{it} - \bar{\boldsymbol{x}}_i)\boldsymbol{\beta} + (u_{it} - \bar{u}_i) \tag{3} $$

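Both transformations remove the time-constant component: Eq. (2) gives the first-difference estimator and Eq. (3) the within estimator, and both are commonly treated as variants of the fixed effects model.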
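To make the two transformations concrete, the following minimal Python sketch applies them to a small made-up panel. The data, the value of `BETA` and all function names are illustrative assumptions and are not taken from the study; the idiosyncratic error is set to zero so that the recovery of the slope can be checked exactly.

```python
# Toy panel: 3 individuals, 4 periods. All numbers are made up for illustration.
BETA = 0.10                      # assumed "true" return per unit of x
c = [1.0, 2.0, 4.0]              # time-constant unobserved component c_i
x = [[8, 9, 11, 12],             # x_it, deliberately higher where c_i is higher
     [10, 12, 13, 15],
     [13, 14, 16, 17]]
# u_it is set to zero so that exact recovery of BETA can be checked.
y = [[c[i] + BETA * x[i][t] for t in range(4)] for i in range(3)]


def flatten(rows):
    """Stack all individuals' observations into one list."""
    return [v for row in rows for v in row]


def ols_slope(xs, ys):
    """Slope from OLS of ys on xs with an intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sum((a - mx) ** 2 for a in xs)


def slope_no_intercept(xs, ys):
    """Slope from OLS through the origin, matching the form of Eqs. (2) and (3)."""
    return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)


def demean(rows):
    """Within transformation: subtract each individual's own time mean."""
    return [[v - sum(row) / len(row) for v in row] for row in rows]


def first_difference(rows):
    """Period-to-period changes within each individual."""
    return [[row[t] - row[t - 1] for t in range(1, len(row))] for row in rows]


pooled = ols_slope(flatten(x), flatten(y))  # ignores c_i
within = slope_no_intercept(flatten(demean(x)), flatten(demean(y)))
fd = slope_no_intercept(flatten(first_difference(x)), flatten(first_difference(y)))

print(f"pooled OLS: {pooled:.3f}  within: {within:.3f}  first difference: {fd:.3f}")
```

In this toy setting the pooled slope is pushed well above `BETA` because the unobserved component is positively correlated with the regressor, whereas both transformed regressions return `BETA`.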

In reality, especially with regard to developing countries, genuine micro-level panel data are rarely available. Deaton (1985) proposed the use of a time series of independent repeated cross-sections as a synthetic or pseudo-panel. In particular, he “considers the possibility of tracking ‘cohorts’”, “with a ‘cohort’ defined as a group with fixed membership”, assuming that its members share some common characteristics, whilst the use of intra-cohort means represents an alternative to that of individual data (Deaton 1985, p. 109).

This approach has been employed in a number of empirical research efforts (Dickerson et al. 2001; Brunello and Comi 2004; Warunsiri and McNown 2010; Himaz and Aturupane 2016; Bhattacharya and Sato 2017), with the most common treatment of cohorts being those of age and gender groups, as initially proposed by Deaton (1985) and as replicated in this study. We argue that although unobserved heterogeneity might not be fully eliminated (since ability or parental background are determined at an individual, not group, level), it will at least be in part. By this, we assume the economic and social conditions witnessed and shared by people of the same generation can potentially have a similar effect once we account for gender, or the so-called ‘cohort effect’. In our case, this includes labor market conditions (demand and supply, institutional framework including labor market policies, and so on), content and quality of education, educational policies and reforms, and possible external shocks, which might be particularly pronounced in the country during the transition from the communist regime to the market economy. Interestingly, synthetic panel data might even hold some advantages over the genuine panel data, particularly while estimating the returns to schooling. The schooling variable in genuine panel data usually varies only incrementally, where one normally observes an increase in schooling only once for a particular individual. By contrast, this could be rather variable in a pseudo-panel.

Some issues arise with the pseudo-panel methodology, potentially the most serious of which is the errors-in-variables problem caused by averaging observations at the cohort level: the observed cohort means are only sample estimates of the true cohort means, which in turn might create attenuation bias and additional noise. However, Verbeek and Nijman (1992) argue that with a large enough cohort (where by ‘large enough’ they assume 100 or more individual observations per cohort), the sampling error can be disregarded, and estimates may thus be considered to be unbiased. On the other hand, increasing cohort size results in a decrease in the number of cohorts (which is the number of observations in a pseudo-panel); this, in turn, reduces precision. Thus, empirically, there is always a trade-off between the number of cohorts and their size.

Another problem is heteroscedasticity, which arises with variations in cohort size. The efficient estimator is achieved by weighting each observation by the square root of the cohort size (or any other appropriate weight), as validated by Deaton (1985).
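The weighting can be illustrated with a short Python sketch. It assumes a single regressor and uses made-up cohort means and cell sizes; the function names are hypothetical. It checks that multiplying the outcome, the regressor and the constant by the square root of the cell size and then running OLS gives the same coefficients as weighted least squares with the cell sizes as weights.

```python
import math

# Hypothetical cohort-level data: mean schooling, mean log wage, cell size.
mean_s = [10.5, 11.0, 11.8, 12.4]
mean_logw = [9.1, 9.3, 9.2, 9.6]
n = [120, 800, 450, 2000]


def wls(x, y, w):
    """Weighted least squares of y on x with an intercept; returns (a, b)."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b


def ols_scaled(x, y, w):
    """OLS after multiplying y, x and the constant by sqrt(w); returns (a, b)."""
    z0 = [math.sqrt(wi) for wi in w]                   # scaled constant
    z1 = [math.sqrt(wi) * xi for wi, xi in zip(w, x)]  # scaled regressor
    ys = [math.sqrt(wi) * yi for wi, yi in zip(w, y)]  # scaled outcome
    s00 = sum(a * a for a in z0)
    s01 = sum(a * b for a, b in zip(z0, z1))
    s11 = sum(b * b for b in z1)
    r0 = sum(a * v for a, v in zip(z0, ys))
    r1 = sum(b * v for b, v in zip(z1, ys))
    det = s00 * s11 - s01 * s01
    # Solve the 2x2 normal equations by Cramer's rule.
    return (r0 * s11 - r1 * s01) / det, (s00 * r1 - s01 * r0) / det


a_wls, b_wls = wls(mean_s, mean_logw, n)
a_ols, b_ols = ols_scaled(mean_s, mean_logw, n)
_, b_unweighted = wls(mean_s, mean_logw, [1] * len(n))

print(f"weighted slope: {b_wls:.4f}, unweighted slope: {b_unweighted:.4f}")
```

With unequal cell sizes the weighted and unweighted slopes generally differ, because large cells, whose means are measured more precisely, receive more influence.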

With Deaton’s (1985) synthetic panel approach, one can adopt any method allowed with the genuine panel data, such as the fixed effects or the random effects methods, with the latter being more efficient since it utilizes both the within- and between-group variations; however, it implies a strong assumption of no correlation between explanatory and unobserved variables. Mundlak (1978) suggested a technique justifying the use of the random effects model in situations when one might expect endogeneity. Mundlak’s (1978) correlated random effects model is essentially the random effects model with added group (cohort) means of the variable(s) which are believed to be endogenous, varying within the group and over time. This ‘within-between’ estimator is based on the decomposition of the unobserved component from the model (1) as:

$$ c_i = \psi + \bar{\boldsymbol{x}}_i\boldsymbol{\xi} + a_i, \quad E(a_i \mid \boldsymbol{x}_i) = 0, \tag{4} $$

which includes correlated (with explanatory variable(s)) and uncorrelated components. Further, substituting Eq. (4) into the wage equation (1) allows one to reach strict exogeneity:

$$ E(y_{it} \mid \boldsymbol{x}_i) = E(y_{it} \mid \boldsymbol{x}_{it}, \bar{\boldsymbol{x}}_i) = \beta_0 + \psi + \boldsymbol{x}_{it}\boldsymbol{\beta} + \bar{\boldsymbol{x}}_i\boldsymbol{\xi}, \quad t = 1, \ldots, T \tag{5} $$

In addition, the degree of statistical significance of the group-mean estimates serves as a test for endogeneity (Mundlak test).
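A minimal Python sketch of the Mundlak device follows, under the simplifying assumptions of one regressor, a balanced toy panel and plain pooled OLS (rather than the feasible GLS used by a random effects estimator). All data and helper names are made up. It illustrates a known algebraic property: once the group mean is added as a regressor, the coefficient on the original variable coincides with the within (fixed effects) estimate, while the coefficient on the group mean captures the correlated part of the unobserved component.

```python
def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination with pivoting."""
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= factor * M[col][j]
    z = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, k))
        z[i] = (M[i][k] - tail) / M[i][i]
    return z


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


# Toy data (made up): 3 groups, 4 periods, c_i correlated with x_it.
x = [[8, 9, 11, 12], [10, 12, 13, 15], [13, 14, 16, 17]]
c = [1.0, 2.0, 4.0]
u = [[0.05, -0.02, 0.01, -0.04], [-0.03, 0.04, 0.00, -0.01], [0.02, -0.05, 0.03, 0.00]]
y = [[c[i] + 0.10 * x[i][t] + u[i][t] for t in range(4)] for i in range(3)]

xbar = [sum(row) / len(row) for row in x]
ybar = [sum(row) / len(row) for row in y]

# Mundlak regression: constant, x_it and the group mean of x.
X = [[1.0, x[i][t], xbar[i]] for i in range(3) for t in range(4)]
Y = [y[i][t] for i in range(3) for t in range(4)]
const, beta, xi = ols(X, Y)

# Within (fixed effects) estimator for comparison.
num = sum((x[i][t] - xbar[i]) * (y[i][t] - ybar[i]) for i in range(3) for t in range(4))
den = sum((x[i][t] - xbar[i]) ** 2 for i in range(3) for t in range(4))
beta_within = num / den

print(f"beta (Mundlak): {beta:.4f}, beta (within): {beta_within:.4f}, xi: {xi:.4f}")
```

A formal Mundlak test would additionally require the standard error of the group-mean coefficient, which this sketch does not compute.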

Over the last few decades, a variety of empirical studies have appeared that attempt to establish the causal effect of schooling on earnings. In his famous review of the studies evaluating the returns to schooling in a number of developed countries’ databases, Card (2001) implied the returns to schooling found from these studies to be around 7% for OLS estimations and around 9% for instrumental variables (IV) estimations. Overall, the studies employing quasi-experimental designs tend to find higher returns compared with OLS estimations: “average returns to schooling from simple regression methods are around 6% internationally but over 9% from these alternative methods” (Harmon and Walker 2001, p. 6). This seems not to be the case for pseudo-panel estimations, where the empirical results worldwide are mixed with pseudo-panel models providing both higher and lower outcomes than OLS. However, overall, examinations in developing countries generally show somewhat higher returns coefficients, probably reflecting diminishing returns to education due to the accumulation of human capital in advanced economies as the average level of schooling grows (Psacharopoulos and Patrinos 2004).

In the Soviet and post-Soviet economies, research studies on the returns to schooling have been limited. As Fleisher et al. (2005) remark, prior to the late 1980s reforms, the returns in the USSR were less than 5%, which is explained by the “wage compression imposed by the [wage] grids” (Fleisher et al. 2005, p. 352), as compatible with the communist ideology of equality and the favoritism of the working class. Kapelyushnikov (2008) reports even lower estimates—at least by the end of the Soviet era, returns to schooling were not more than 1–2%. However, this changed in the post-reform period. According to Fleisher et al. (2005, p. 352), returns in transition economies tended “to rise almost immediately following reform, albeit at different speeds”. There is very little empirical evidence pertaining to Kazakhstan in this regard. Barro and Lee (2010) estimated the rate of returns for an additional year of schooling worldwide, finding it to be a little more than 8% for ‘Europe and Central Asia’. Arabsheibani and Mussurov (2007)—having used OLS and IV methodologies (with spouse education and smoking habits as instruments)—indicated that the returns to schooling in Kazakhstan have increased with the transition, with OLS estimations of 8% for men and 11.5% for women and 2SLS estimations of 11% for married men and 13.7% for married women.

This study attempts to partially fill the gap in the empirical analysis of the returns to schooling in Kazakhstan by making use of the pseudo-panel technique in conjunction with national statistics data.

3 Data and methodology

The study analyzes the Household Budget Survey data from the Kazakhstan Committee on Statistics for 2002–2016. The methodology of the survey, first introduced in 2002, has changed several times. Until 2010, data were recorded monthly, whilst from 2011 onwards the survey has been conducted on a quarterly basis. To achieve comparability, the data for 2002–2010 have been aggregated to a quarterly level. According to the current methodology (MNE 2015), the survey is designed in the form of rotating repeated cross-sections, with one-third of the 12,000 participating households being replaced at the end of each year.Footnote 4

A two-stage stratified random sample design has been adopted for sampling. In the first stage, the population is stratified into 30 strata: the country’s 14 provinces (‘oblasts’), each split into urban and rural places of residence, plus the two biggest cities (the current capital and the previous capital, the latter of which remains the main financial and business centre in the country), which are considered separately. A total of 400 territorial units are selected as the primary sampling units (PSUs) with a probability proportional to the stratum size (number of households per stratum). In the second stage, 30 households per PSU are randomly selected for interview from a register of dwellings, which gives 400 × 30 = 12,000 households (the distribution of the PSUs by strata is found in Appendix 1).

In some years (2006, 2007, 2008 and 2010), the sample consists of fewer observations. The final dataset used for estimations is the pooled quarterly data comprising 588,100 employee-individuals. Unemployed, economically inactive and self-employed respondents, as well as employees who had reached the official state retirement age (63 for men and 58 for women), were filtered out in order to ensure accuracy and better comparability.

The survey consists of questions about employment, household spending and savings, and individual incomes. The question regarding employment changed in 2015: before 2015, the respondents were asked if they had worked for at least 1 h during the past 7 days and received monetary payment or payment in kind, which allowed them to be considered as employed in accordance with the ILO approach; from 2015 onwards, they have been asked if they have worked at least 1 h in the past 30 days. Whether either wording introduces inaccuracy remains unclear, as no data on hours worked by the individual are recorded. By using wage data aggregated quarterly (not hourly) and with no information on full- or part-time employment, we violate two conditions set by Griliches (1977), thus requiring additional caution when interpreting schooling and experience estimates.

The dependent variable in all models is the natural logarithm of the real wage from employment; thus, other earnings (income from self-employment, benefits, property income and other incomes) are excluded from the analysis. Wages for 2003–2016 are adjusted by the CPI officially reported by the Committee on Statistics, with 2002 as the base year.
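As a rough illustration of how such a dependent variable can be constructed, the Python sketch below deflates a nominal wage to base-year prices and takes the natural logarithm. The CPI levels are hypothetical placeholders (they are not the official series), and the function name and the index form of the CPI are assumptions made for this example only.

```python
import math

# Hypothetical CPI index levels with 2002 = 100; NOT the official series.
cpi = {2002: 100.0, 2003: 106.0, 2004: 113.0}


def log_real_wage(nominal_wage, year, base_year=2002):
    """Natural log of the wage expressed in base-year prices."""
    deflator = cpi[year] / cpi[base_year]
    return math.log(nominal_wage / deflator)


# A nominal wage of 106 in 2003 has the same real value as 100 in 2002.
print(log_real_wage(106.0, 2003), log_real_wage(100.0, 2002))
```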

Schooling is a derivative variable transformed from the levels of education attained that are recorded in the surveyFootnote 5:

  • no education: 0 years of schooling;

  • primary education: 4 years of compulsory schooling (from the age of 6–7);

  • basic secondary education: 9 years of compulsory schooling;

  • general secondary education and TVET: minimum of 11 years of schooling (9 years of compulsory schooling plus either 2 years of university-preparatory secondary school or 2–3 years of specialized technical and vocational training);

  • higher education: minimum of 15 years of schooling (11 years of secondary schooling plus a Bachelor’s or ‘specialist’ degree requiring a minimum duration of 4 years);Footnote 6

  • postgraduate education (Master’s degree): minimum of 16 years of schooling;Footnote 7

  • academic degree (‘uchenaya stepen’): (1) Ph.D. and (2) ‘Candidate of Science’ and ‘Doctor of Science’ degrees from the previously existing Soviet system, with a minimum of 18 years of schooling.Footnote 8
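The mapping listed above can be written down as a simple lookup table. In the Python sketch below, the category labels are hypothetical stand-ins for the survey’s own codes, which are not reproduced here; only the numbers of years follow the list.

```python
# Minimum years of schooling assigned to each attained level of education.
# The string labels are illustrative; the survey's actual codes may differ.
YEARS_OF_SCHOOLING = {
    "no education": 0,
    "primary": 4,
    "basic secondary": 9,
    "general secondary or TVET": 11,
    "higher": 15,
    "postgraduate (Master's)": 16,
    "academic degree": 18,
}


def schooling_years(level):
    """Return the years of schooling implied by an attained level."""
    return YEARS_OF_SCHOOLING[level]


print(schooling_years("higher") - schooling_years("general secondary or TVET"))
```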

Therefore, with the given variable, we estimate the average rate of return to credentials rather than the returns to schooling as such. We accept this approximation because, as Harmon and Walker note when comparing returns to schooling (linear specification) with returns to credentials (allowing for nonlinearities between the completion of different qualifications) computed for the same sample of individuals, “a linear form seems to be a reasonable approximation” (Harmon and Walker 2001, p. 31). However, without direct data on years of schooling, we cannot directly test the sheepskin effect.

Descriptive characteristics of the sample by gender are given below (descriptive statistics can be found in Appendix 2). The following figures summarize the gender subsamples and shed some light on the differences in the character of employment between the genders.

Fig. 2 Proportions of the respondents by attained level of education in the corresponding year. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Figure 2 describes the distribution of the respondents’ highest attained levels of education, combined into four broader groups, by year of observation. For both genders, the majority of respondents had attained a general secondary education or TVET. The share of respondents with a degree in higher education grew until 2011 and was consistently higher for females in each year (mean schooling in the pooled data is 0.62 years higher for females than for males). This corresponds to the official aggregated statistics, which report the share of people having attained at least a higher education as 35.7% of the employed population in 2016 (31.7% and 40.3% for men and women, respectively) (MNE 2017). The number of people with no schooling and with postgraduate degrees (including academic degrees) is very low in both subsamples.

Fig. 3 Proportions of the respondents employed by the private sector in the corresponding year. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Fig. 4 Distribution of 2011–2016 respondents by industry. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

As can be seen from Fig. 3, men are mostly employed by the private sector and this share increases over time, while women are approximately evenly distributed between public and private sector employment. It is noticeable that females’ employment by sector is almost static. This is also reflected in the industry of employment (more precisely, the ‘type of economic activity’, which we further refer to as an ‘industry’): Fig. 4 shows the number of employees in different industries for respondents in 2011–2016 only, since the earlier data do not record industry. The leading industries for male workers are those with primarily private ownership (construction, transportation, mining and quarrying, agriculture), while nearly 30% of working females in the sample are employed in education (with the majority in public secondary education) (Fig. 4).

Figures 5, 6, 7, 8 and 9 document average real wages by the respondents’ selected characteristics used as explanatory variables in different specifications, computed separately from the pooled data for each gender.Footnote 9 Region is aggregated into four geographical groups plus a ‘metropolis’ category, which comprises the two largest cities and has the highest wages, followed by the western regions specializing in oil and gas exports. Sector of employment is derived from the categories listed in the questionnaire: ‘private company employee’, ‘farm-worker’ and ‘those employed by individuals’ (the latter group mainly consisting of shadow sector wage-earners) are combined into the ‘private sector’, while ‘public company employee’ constitutes the ‘public sector’. As seen from Fig. 5, higher educational attainments consistently provide higher wages, on average. Excluding the highest and lowest levels of education (which both have very few observations), the log transformed average real wages demonstrate a somewhat parallel pattern for the two largest groups in each of the genders. There is a gender gap in almost every category for every variable observed.

Fig. 5 Mean real wages by attained level of education and year. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Fig. 6 Mean real wages by region and year. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Fig. 7 Mean real wages by residence and year. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Fig. 8 Mean real wages by sector of employment and year. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Fig. 9 Mean real wages by industry, pooled data, 2011–2016. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

The pseudo-panel was designed from individual data based on the respondents’ recorded years of birth and gender. The youngest and oldest cohorts are dropped due to either having only a few or no observations for the particular year(s). The final cohorts are:

  • Male cohorts: 1954–1986, one per each year of birth;

  • Female cohorts: 1959–1986, one per each year of birth.

The final number of cohorts is 61 (33 male cohorts and 28 female cohorts). The cohort size is sufficiently large, with mean numbers of observations of 9272 and 10,075 per cohort for the male and female cohorts, respectively, and smallest cohort sizes of 6352 (male) and 7156 (female); see Appendices 3 and 4. However, the size of the cohort varies substantially over the years; therefore, in accordance with Deaton’s approach, observations are weighted in the pseudo-panel data by the square root of the corresponding cohort size in any given year of observation.
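The construction described above can be sketched in a few lines of Python. The individual records below are made up, the field layout and the function name are assumptions, and only one variable (schooling) is averaged; the sketch groups individuals into gender by birth-year cohorts observed in each survey year, and stores the cell mean, the cell size and its square root as the weight.

```python
from collections import defaultdict
import math

# Made-up individual records: (gender, birth_year, survey_year, schooling_years).
records = [
    ("m", 1970, 2002, 11), ("m", 1970, 2002, 15), ("m", 1970, 2003, 11),
    ("f", 1975, 2002, 15), ("f", 1975, 2002, 11), ("f", 1975, 2002, 16),
]


def build_pseudo_panel(rows):
    """Collapse individual records into cohort-by-year cells."""
    cells = defaultdict(list)
    for gender, birth_year, survey_year, schooling in rows:
        cells[(gender, birth_year, survey_year)].append(schooling)
    panel = {}
    for key, values in cells.items():
        size = len(values)
        panel[key] = {
            "mean_schooling": sum(values) / size,
            "n": size,
            "weight": math.sqrt(size),  # square root of the cell size
        }
    return panel


panel = build_pseudo_panel(records)

# Cohort count implied by the birth-year ranges given in the text.
n_cohorts = len(range(1954, 1987)) + len(range(1959, 1987))
print(n_cohorts, panel[("m", 1970, 2002)])
```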

Since the data used for this analysis have a rather long time dimension, observed over the period of rapid economic growth the country experienced with the oil boom, one might be concerned about possible non-stationarity potentially leading to spurious regression (Kao 1999; Phillips and Moon 2000). Indeed, although it is unlikely that this can be tested properly with only 15 time periods, the growth of real wages over the years is easily detected visually (Figs. 5, 6, 7, 8). We cannot disregard this possibility; however:

  (1) we rely on the fact that the number of cross-section units (n) is still larger than the number of time series periods (T), allowing one to interpret them as “a set of cross-sections... [rather than] a set of time series” (Smith and Fuertes 2010, p. 8), and it is believed that the non-stationarity problem normally arises in the panel data context when n ≤ T (Phillips and Moon 2000; Baltagi and Kao 2001; Smith and Fuertes 2010);

  (2) according to Kao (1999, p. 2), even with non-stationary data in the panel regression, the “LSDV estimator... is consistent for its true value”, though the standard errors could be biased.

Thus, in this study we employ ‘classical’ micro-panel methods (Chamberlain 1984; Wooldridge 2010), additionally detrending the data by explicitly controlling for year dummy variables.

Descriptive statistics on cohort data are given in Appendices 5 and 6. Figure 10 demonstrates the decomposition of real wages by cohort. Each line represents the evolution of the particular cohort’s mean real wages over time; every third cohort is plotted to keep the plot less ‘busy’.

Fig. 10 Decomposition of real wages by cohort and age effect for every third cohort. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

As surmised, the youngest cohorts earn the lowest wages, and the age-wage profile has a somewhat inverted U-shape for both males and females. The picture additionally reflects the recessions of 2008 and 2014 due to the world market commodity bust when all cohorts’ real wages dropped slightly. This is also evident from Fig. 11, which shows the year effect, where each dot represents the cohort’s mean real wage observed for each year and the line represents the mean of each year’s means.

Fig. 11 Decomposition of real wages by cohort, year effect. Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

In this analysis, we start with the ‘classical’ Mincerian specification, with the age in linear and quadratic terms as the only control variables, further augmenting with additional controls: region, residence and sector of employment (private vs. public):

$$ \log w = \beta_0 + \beta_1 S + \beta_2 A + \beta_3 A^{2} + \boldsymbol{X}\boldsymbol{\beta} + \epsilon \tag{6} $$

where

  • w—real wage

  • S—years of schooling

  • A—age

  • X—additional control variables

  • ϵ—composite error term.

For the pseudo-panel models, we apply the log and squared transformations to the individual-level data before taking cohort means.
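The order of operations matters because neither the logarithm nor the square commutes with averaging: averaging Eq. (6) over the members of a cohort yields the cohort mean of log wages and the cohort mean of squared age, not the log of the mean wage or the square of the mean age. The small Python check below uses made-up wages and ages for a single cohort cell.

```python
import math

# Made-up wages and ages for the members of one cohort cell.
wages = [50.0, 100.0, 400.0]
ages = [30, 40, 50]

mean_of_logs = sum(math.log(w) for w in wages) / len(wages)  # transform, then average
log_of_mean = math.log(sum(wages) / len(wages))              # average, then transform

mean_of_squares = sum(a ** 2 for a in ages) / len(ages)      # transform, then average
square_of_mean = (sum(ages) / len(ages)) ** 2                # average, then transform

print(mean_of_logs, log_of_mean)
print(mean_of_squares, square_of_mean)
```

By Jensen’s inequality the mean of the logs never exceeds the log of the mean, and the mean of the squares is never below the square of the mean, so transforming after averaging would systematically distort both variables.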

When estimating models, we follow three slightly different approaches. First, we start with the basic OLS models which do not control for the cohort effect and which, therefore, likely suffer from omitted variable bias. Considering the long series of the repeated cross sections, we further concentrate on the first and the final years and elaborate on the changes that took place over the period in question by decomposing the wage equation using the Blinder–Oaxaca technique. This allows us to observe the year effects and, further, we additionally focus on the gender differences in the returns to schooling computed from the individual data. Second, we estimate the fixed effects and the Mundlak random effects models on the pseudo-panel data with cohort means treated as individual observations. Finally, to grasp the returns’ variations across cohorts, we estimate the OLS model, separately controlling for cohort dummies and their interaction with schooling for individual gender subsamples.
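For readers unfamiliar with the Blinder–Oaxaca technique, a minimal twofold version can be sketched as follows. This is an illustration under stated assumptions rather than the specification estimated in the study: it uses a single regressor (schooling), made-up data for two survey years, and the earlier year’s coefficients as the reference, and all function and variable names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def fit(x, y):
    """OLS of y on x with an intercept; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


def oaxaca_twofold(x0, y0, x1, y1):
    """Split mean(y1) - mean(y0) into explained and unexplained parts.

    Group 0 coefficients serve as the reference (an assumption of this sketch).
    """
    a0, b0 = fit(x0, y0)
    a1, b1 = fit(x1, y1)
    gap = mean(y1) - mean(y0)
    explained = (mean(x1) - mean(x0)) * b0            # change in characteristics
    unexplained = (a1 - a0) + mean(x1) * (b1 - b0)    # change in coefficients
    return gap, explained, unexplained


# Made-up schooling (years) and log real wages for two survey years.
s_2002, logw_2002 = [9, 11, 11, 15], [8.0, 8.3, 8.2, 8.8]
s_2016, logw_2016 = [11, 11, 15, 16], [9.9, 10.0, 10.3, 10.4]

gap, explained, unexplained = oaxaca_twofold(s_2002, logw_2002, s_2016, logw_2016)
share_explained = explained / gap

print(f"gap: {gap:.3f}, explained share: {share_explained:.2%}")
```

Because an OLS line with an intercept passes through the sample means, the explained and unexplained parts add up exactly to the raw gap in mean log wages.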

4 Outcomes and discussion

4.1 Year effects

Tables 1 and 2 report the outcomes of the basic Mincer model computed from the pooled individual data, with additional control variables introduced step by step, separately for each gender. The returns to an additional year of schooling vary from 7.75 to 11.50% for men and from 10.12 to 12.73% for women, depending on specification.

Table 1 Returns to schooling estimated from pooled individual data by OLS, men.
Table 2 Returns to schooling estimated from pooled individual data by OLS, women.

To examine the dynamics over time, the same models are estimated independently for each year’s subsample. Tables 3 and 4 document detailed outcomes for the first and the final years for men and women, respectively, while Appendix 7 details the descriptive statistics for these 2 years’ subsamples.

Table 3 Returns to schooling estimated from individual data for 2002 and 2016, men.
Table 4 Returns to schooling estimated from individual data for 2002 and 2016, women.

For both genders, regional and residency disparities narrowed over the analyzed period, and employment in the private sector became more lucrative, especially for men. The returns to schooling decreased over the analyzed period for both genders, though the difference between the 2016 and 2002 estimates is slightly larger for men. This is also seen in Fig. 12, which plots the schooling coefficients with their confidence intervals from the models estimated on each year’s subsample. To test whether the difference is statistically significant, we run the models on the pooled data with an interaction between schooling and year (reported in Appendices 8 and 9), separately for each gender. The estimated differences in the returns in 2016 versus 2002 range from −4.23 to −5.28 percentage points for men and from −2.43 to −2.71 percentage points for women. For comparison, the differences in real wages in 2016 compared to 2002, as computed from the same models, were 516–610% and 365–396% for men and women, respectively. Wages increased dramatically with the oil boom, while the returns to schooling dropped. This might be related to the corresponding increasing trend in years of schooling observed in the descriptive statistics.

Fig. 12

Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Returns to schooling with 95% confidence intervals independently computed for each year of observation from models with schooling, age, age squared and: a with no additional control variables; b region; c region and residence; d region, residence and sector of employment.

To identify the extent to which the observed characteristics contributed to the change in the real wages of employees between 2002 and 2016, we follow Lassibille and Gomez (1998) and apply the Blinder–Oaxaca decomposition to the wage equation for these 2 years for each of the gender subsamples (Blinder 1973; Oaxaca 1973). The results presented in Tables 5 and 6 allow us to examine which part of the wage difference between the 2 years can be explained by the changes in covariates and which part is unexplained and should be attributed to the “differences in the pay structure” (Lassibille and Gomez 1998, p. 7) or other structural changes across the period under consideration. 2002 is set as the reference year.

Table 5 Blinder–Oaxaca decomposition of log real wage differential between 2002 and 2016, men.
Table 6 Blinder–Oaxaca decomposition of log real wage differential between 2002 and 2016, women.

The mean log real wage of male employees is 10.15 in 2002 and 11.50 in 2016. Of the difference of approximately 1.35 log points, only 0.43 log points (about 32%) is explained by the change in the observed characteristics, while 0.91 log points (about 68%) remains unexplained. Although log real wages are lower for females in both years (9.92 in 2002 and 11.25 in 2016), the cross-year difference is nearly identical at 1.33 log points, and the model explains a slightly larger share of it, almost 40%. The main contributor to the unexplained part for both genders is the constant term, which likely picks up the oil boom-driven differences in economic conditions over time. Among the observed employees’ characteristics, age in both linear and quadratic terms contributes the most, as is also shown by the variable-by-variable decomposition of the unexplained differentials plotted in Figs. 13 and 14. This might be partially attributed to the observed difference in age between the year subsamples for each gender (the average age is 34 in 2002 versus 45 in 2016 for males, and 33 versus 44 for females; see Appendix 7).

However, a more interesting and meaningful pattern appears with the schooling coefficient’s unexplained part. Since the 2016 sample is more educated (mean schooling is 11.96 years for males and 12.53 for females, versus 11.74 and 12.37 in 2002, respectively), schooling contributes to the explained part of the observed wage differentials. At the same time, taking 2002 as the baseline year, schooling is significantly ‘underestimated’ in 2016 for both genders, that is, it is rewarded at a lower rate than the 2002 pay structure would imply. Other coefficients (except for west region residency) have different patterns. For example, employment in the private sector provided a higher wage premium for men in 2016 than it was otherwise ‘expected’ to provide, a difference across years that is likely to be explained by structural changes in the economy during the same period. On the other hand, the unexplained portion for the coefficients of nominal variables (region, residence or sector) depends on the choice of the omitted category, so its interpretation might be biased and not meaningful (Jann 2008), unlike those for schooling and age, which have a natural zero point.
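The split into explained and unexplained parts can be illustrated with a minimal sketch. It assumes one common twofold variant, in which the change in mean characteristics is priced at the reference-year (2002) coefficients; the exact variant used for Tables 5 and 6 may differ. It uses a single regressor (schooling) and made-up numbers; the function names and data are hypothetical:

```python
from statistics import fmean

def ols_line(x, y):
    """Intercept and slope of a one-regressor OLS fit of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def twofold_decomposition(x_ref, y_ref, x_new, y_new):
    """Split mean(y_new) - mean(y_ref) into explained and unexplained
    parts, using the reference-year coefficients as the benchmark."""
    a_ref, b_ref = ols_line(x_ref, y_ref)
    a_new, b_new = ols_line(x_new, y_new)
    gap = fmean(y_new) - fmean(y_ref)
    # Explained: change in mean schooling, priced at reference-year returns.
    explained = (fmean(x_new) - fmean(x_ref)) * b_ref
    # Unexplained: change in intercept and slope, evaluated at new-year means.
    unexplained = (a_new - a_ref) + fmean(x_new) * (b_new - b_ref)
    return gap, explained, unexplained

# Made-up schooling (years) and log real wages for the two survey years.
s_2002 = [8, 10, 11, 12, 15]
w_2002 = [9.6, 9.9, 10.0, 10.2, 10.6]
s_2016 = [9, 11, 12, 13, 16]
w_2016 = [11.1, 11.3, 11.4, 11.5, 11.8]

gap, explained, unexplained = twofold_decomposition(s_2002, w_2002, s_2016, w_2016)
explained_share = explained / gap
```

Because an OLS fit with an intercept passes through the sample means, the two parts sum exactly to the raw gap in mean log wages. The shares reported in the text are computed in the same way, for example 0.43/1.35 ≈ 32% for men.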

Fig. 13

Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Blinder–Oaxaca decomposition—unexplained portion in wage differential between 2002 and 2016, men (with 95% confidence intervals, 2002 is the reference year).

Fig. 14

Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Blinder–Oaxaca decomposition—unexplained portion in wage differential between 2002 and 2016, women (with 95% confidence intervals, 2002 is the reference year).

4.2 Gender effects

The results presented above reveal systematically higher returns to schooling for females. This is confirmed by the models run on the pooled data with a schooling-gender interaction term, performed as a robustness exercise (Appendix 10): the premium for each additional year of schooling is about 0.9–1.3 percentage points lower for male respondents than for female respondents, depending on specification. This is often the case in developing countries, and is usually attributed to “lower base levels of education of females compared to males” (Warunsiri and McNown 2010). However, this interpretation is hardly appropriate for Kazakhstan, where educational gender disparities were low (or absent) during the Soviet period and where females now have a higher overall level of education, as may be noted from the analyzed database. Along with the overall lower female earnings, this suggests that the observed returns gap might be attributed to the sector-industry allocation of the labor force: as mentioned above, females are mostly employed in (less risky but worse-paid) industries that are primarily in public ownership, which might ‘value’ education more than the others in the sense that the jobs concentrated in these industries require a degree and provide better degree premiums. This observation might be in line with the frequently observed worldwide “gender segregation in occupations, industries, firms, and jobs” (Garcia-Aracil 2007, p. 431; Meng 2004; Bielby and Baron 1986).

To test this, we run OLS on the individual data for each gender and year subsample, with the basic Mincerian specification and the interaction term of the years of schooling with the sector of employment; the interaction term estimates are shown in Fig. 15. Although for both genders employees in the public sector are better educated than those in the private sector (the average schooling is 12.50 for men and 12.88 for women in the public sector versus 11.74 for men and 12.19 for women in the private sector),Footnote 10 the returns to schooling in the private sector are significantly higher for women for almost all of the years observed, and higher for men until 2011, though with a clearly decreasing trend over time. The latter, along with the observed shift of the male respondents towards the private sector, likely contributes to the decreasing returns to schooling for males. However, the observed trend does not explain the higher returns to schooling amongst females.
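One way to write a specification of this kind, as a sketch in the notation of Eq. (6): P is a hypothetical indicator equal to 1 for private-sector employment (public is the reference), δ and γ are coefficient names introduced here for illustration, and the inclusion of the sector main effect is an assumption.

$$\begin{aligned} \log w = \beta _0+\beta _1S+\beta _2A+\beta _3A^{2}+\delta P+\gamma \,(S\times P)+\epsilon \end{aligned}$$

Under this form, β1 is the return to schooling in the public sector and β1 + γ is the return in the private sector, so a positive γ corresponds to higher returns in the private sector.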

Fig. 15

Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Returns to schooling in the private sector with 95% confidence intervals computed for each year of observation and gender independently from models with schooling, age, age squared, year dummy and schooling with sector of employment interaction term (reference sector: public).

Hypothesizing that the gap in the schooling returns premium could be better explained by industries rather than by private versus public ownership, we estimate the same specifications with the industry (instead of the sector) of employment, which is recorded in the survey from 2011 onwards; the associated results are presented in Fig. 16,Footnote 11 which additionally plots the number of employees in each industry. We drop the least populated industry (“Activities of households as employers”) and assign “Education” as the reference category. Additionally, we show the average years of schooling for each industry and gender in Fig. 17.

Fig. 16

Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Returns to schooling in industry with 95% confidence intervals computed for each gender independently from models with schooling, age, age squared, year dummy and schooling with industry of employment interaction term (reference industry: Education), pooled data, 2011–2016.

Fig. 17

Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Average years of schooling, pooled data, 2011–2016.

Indeed, in the majority of industries, females’ returns to schooling are greater than or equal to those of males, although this is not the case for the main female industries of employment. Moreover, for both genders the main industries of employment (according to Fig. 4) are among those with both the lowest (gender-specific) returns to schooling and the lowest levels of education attained. The only exception is “Education” for females, which provides relatively high returns to schooling and demands more schooling for both genders. At the same time, the industries with the highest wages for each of the genders, particularly “Mining and quarrying” and “Manufacturing”, are predominantly occupied by men. The exception is “Financial and insurance activities”, where more females than males are observed. Unlike “Mining and quarrying” and “Manufacturing”, this industry ‘requires’ more schooling (the highest average schooling in the sample for both genders). This suggests that, possibly, for females a higher attained level of education serves as a pass to at least some of the best-paid industries within the mining-oriented economy. On the other hand, they could alternatively choose less demanding (in terms of schooling and probably competition for jobs) but worse-paid economic activities, such as “Education”, “Human health and social work” or “Wholesale and retail trade”. The first two might additionally provide relatively better levels of social security.

Interestingly, the experience (age) coefficients reported in Tables 1 and 2 are nearly half as large for females as for males; as a result, the age turning point differs by gender: about 41 for males and 47 for females. This might again reflect differences in the nature of employment and social security across genders. The gender returns gap might additionally explain females’ higher schooling, which may be considered their rational decision based on their expectations about future employment opportunities.
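The turning points quoted above follow from the quadratic age profile in Eq. (6). As a short derivation in the notation of that equation, assuming β2 > 0 and β3 < 0 (as an inverted U-shape requires) and writing A* for the turning point, a symbol introduced here only for illustration:

$$\begin{aligned} \frac{\partial \log w}{\partial A} = \beta _2 + 2\beta _3 A = 0 \quad \Longrightarrow \quad A^{*} = -\frac{\beta _2}{2\beta _3} \end{aligned}$$

Because A* depends on the ratio of the two age coefficients rather than on their levels, smaller coefficients for females are compatible with a later turning point.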

4.3 Cohort effects: pseudo-panel models

The results of the pseudo-panel models are presented in Tables 7 and 8. Observations are weighted by cohort size, which varies with both cohort and year.Footnote 12 Both models additionally include year dummies to capture year fixed effects, as well as the other control variables, computed as the proportions of respondents belonging to a certain group (for example, with rural versus urban residence) within a cohort in each observed year.

Table 7 Returns to schooling estimated from pseudo-panel, fixed effects.
Table 8 Returns to schooling estimated from pseudo-panel, Mundlak random effects.

Interestingly, the fixed effects and the Mundlak random effects models produce results that are nearly identical to those of the individual data models, which are also very similar to those of the previous examinations that used the IV approach (Arabsheibani and Mussurov 2007). Attempts to control for cohort effects do not noticeably change the estimates of the returns to schooling, only slightly decreasing the Mundlak model’s estimates. At the same time, the estimate for the cohort mean of schooling in the Mundlak model is highly significant, suggesting an unobserved cohort effect (from the comparison of FE and RE). It is also negative, which should be interpreted as follows: although the increase in a cohort’s average level of schooling over time resulted in an increase in its average level of earnings, the cohorts with higher average levels of schooling earned less in comparison to those with lower levels.
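The two readings of the schooling terms can be illustrated with a minimal sketch of the Mundlak device, that is, adding each cohort's mean schooling as an extra regressor. The sketch makes several simplifying assumptions that are not the paper's setup: a balanced toy panel, made-up numbers, plain unweighted OLS instead of the weighted random effects estimator, and hypothetical helper names (`solve`, `ols`).

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Toy balanced pseudo-panel: mean schooling of three birth cohorts in three years.
schooling = {
    1960: [10.0, 10.2, 10.4],
    1970: [11.0, 11.3, 11.6],
    1980: [12.0, 12.5, 13.0],
}

rows, log_wage = [], []
within_num = within_den = 0.0
for cohort, values in schooling.items():
    s_bar = sum(values) / len(values)
    for s in values:
        # Assumed data-generating process: within-cohort return of 0.10 and
        # a cohort effect that falls with the cohort's mean schooling.
        w = 12.0 + 0.10 * s - 0.25 * s_bar
        rows.append([1.0, s, s_bar])  # constant, schooling, cohort mean
        log_wage.append(w)
        # Within (fixed effects) estimator: deviations from cohort means.
        within_num += (s - s_bar) * w
        within_den += (s - s_bar) ** 2

const, beta_schooling, gamma_mean = ols(rows, log_wage)
beta_within = within_num / within_den
```

In this toy setting the coefficient on schooling coincides with the within (fixed effects) estimate, while the coefficient on the cohort mean is negative: cohorts gain as their own schooling rises over time, yet cohorts with higher average schooling earn less.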

The figures in Appendices 11 and 12 show that the more educated cohorts are the younger cohorts in the sample,Footnote 13 and thus the negative relation could imply an age (experience) effect. It might also indicate a business cycle impact; the cohorts who entered the labor market during the 1990s recession and faced a lack of jobs apparently ended up getting more education and—because of worse economic conditions—lower lifetime wages, and vice versa (Betts and McFarland 1995; Dellas et al. 1996; Kahn 2010; Clark 2011; Oreopoulos et al. 2012; Liu et al. 2016). The other possible explanation could be a widely perceived difference in the quality of education and its link to the labor market between the Soviet and post-Soviet eras; however, this could not be further tested with the data at hand.

While the OLS estimates reported in the previous sections have the expected signs and magnitudes, some of the estimates for the additional control variables from the pseudo-panel models look rather counterintuitive. This might reflect an errors-in-variables bias generated by errors in the survey data, as noted by Griliches (1977, p. 12): “in cross-sectional household interview data all of the variables are subject to some error. Even if errors are small, their effect will be magnified as more variables are added to the equation in an attempt to control for ‘other possible sources of bias’”. Regardless, the schooling coefficients are fairly robust and consistent in all models.

4.4 Cohort effects: individual level models

Finally, to detail how the returns to schooling vary across cohorts, we set up individual-level OLS models controlling for the set of cohort dummies and an interaction term between schooling and cohort dummies, as well as the other control variables, separately for each gender. The coefficients for the interaction terms with their confidence intervals for four specifications are shown in Figs. 18 and 19. The baseline category is the first (the oldest) cohort: the 1954 cohort for men and the 1959 cohort for women.
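How the interaction coefficients relate to cohort-specific returns can be sketched as follows. The sketch assumes a model with only cohort dummies and a full set of schooling-by-cohort interactions, in which case the OLS fit is equivalent to separate regressions by cohort; with shared controls such as age this equivalence no longer holds exactly. The data are made up and the helper name `slope` is hypothetical:

```python
from statistics import fmean

def slope(x, y):
    """OLS slope of y on x with an intercept."""
    mx, my = fmean(x), fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

# Made-up (schooling, log wage) observations by birth cohort.
data = {
    1954: ([8, 10, 12, 15], [9.8, 10.0, 10.2, 10.5]),
    1970: ([9, 11, 12, 16], [9.9, 10.1, 10.2, 10.6]),
    1985: ([10, 12, 14, 16], [10.0, 10.1, 10.2, 10.3]),
}

baseline = 1954  # the oldest cohort serves as the reference category
returns = {cohort: slope(s, w) for cohort, (s, w) in data.items()}

# Interaction coefficient for a cohort = its own return minus the baseline return.
interaction = {cohort: r - returns[baseline] for cohort, r in returns.items()}
```

A negative entry in `interaction` means that the cohort's return to schooling is below that of the baseline cohort.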

Fig. 18

Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Male cohorts: returns to schooling for each cohort with 95% confidence intervals from model with schooling, age, age squared, cohort, schooling*cohort and: a with no additional control variables; b region; c region and residence; d region, residence and sector of employment.

Fig. 19

Source: Household Budget Survey, the Committee on Statistics of the Republic of Kazakhstan

Female cohorts: returns to schooling for each cohort with 95% confidence intervals from model with schooling, age, age squared, cohort, schooling*cohort and: a with no additional control variables; b region; c region and residence; d region, residence and sector of employment.

For men, apart from a few exceptions and the youngest cohorts (born after 1983), the difference in the returns across cohorts seems to be statistically insignificant. The minor fluctuations observed in the returns look somewhat similar to business cycle fluctuations and might reflect the external conditions (economic conditions, unemployment rate, labor market conditions and policies, skills mismatch, external shocks) that cohorts face when entering the labor market and over their life cycle, which accordingly affect their returns. However, for females this variation appears to be somewhat more systematic, with higher rates for the older cohorts and a downward trend towards the younger ones, namely those who entered the labor market after the collapse of the Soviet Union. This might mirror the different nature of employment for the genders, as referred to above: particular industries could be more sensitive to over-education or to perceived differences in Soviet-type and post-Soviet-type education. On the other hand, this could reflect (or be contaminated with) the age effect, and disentangling them is hardly feasible.

5 Conclusion

This paper represents the first attempt to employ a long series of repeated cross-sectional data from the Kazakhstani National Statistics to estimate the returns to schooling. The few previous examinations found that, with the transition, the returns immediately increased from the very low rates typical of the Soviet era to internationally comparable rates.

Using the Household Budget Survey data for 2002–2016, we find a three- to sixfold growth in real wages over this period. However, this dramatic growth in wages in real terms is only partially explained by the improvement in the observed labor force characteristics (education, residency and the sector of employment): by around 30% for working men and 40% for working women. The unexplained part can probably be attributed to the external economic background, namely fast economic growth due to the oil boom, with GDP growing by 6% annually on average.Footnote 14 This growth in GDP could have increased the demand for a more educated labor force; on the other hand, with the associated dramatic increase in schooling, one might expect a corresponding decrease in the returns to schooling over the years.

Across the various models developed, our results reveal the returns to be robust across models and specifications, statistically significant and relatively high, though decreasing over the analyzed years despite the economic boom, with the magnitude of this decrease being larger for men. Although the estimates do not change much with the pseudo-panel data, the comparison between the fixed effects and the Mundlak random effects models suggests the presence of a cohort effect. We tend to consider the negative sign of the estimate for the mean of schooling as an effect of the business cycle phase that cohorts face at their school-leaving age: a downward swing with a lack of jobs leaves them with little choice but to get more schooling which, however, in turn provides them with lower returns (compared to older cohorts).

We find, at least with the data at hand, gender differences in the nature of employment: while men prefer employment in the private sector, with higher wages and apparently more uncertainty, women seem to be inclined towards worse-paid but more secure industries and the public sector. This trait represents a likely explanation for the higher rates of the returns to schooling observed for females. We argue that some particular jobs, mainly in the public sector, formally require some level of schooling and, accordingly, value schooling to a greater degree than the private sector, possibly due to the wage grid still being persistent to some extent. These jobs tend to be occupied by females, which, in turn, reasonably explains their higher levels of education. At the other extreme, the females employed in the best-paid industries, which are heavily dominated by male employees, have systematically higher levels of schooling (than males), which seems to serve as a pass to at least some of the best-paid industries for women within the mining-oriented economy.

Finally, we uncovered a downward trend in the returns across cohorts, with younger cohorts demonstrating lower rates of return (confirming the Mundlak model outcomes). This might reflect a decrease in returns to schooling due to a labor market glut, but it might also reflect an age effect, and untangling these two effects is hardly feasible. This feature is more pronounced for females, which might again point to gender differences in the nature of employment. At the same time, for males, variations in the rates of return across cohorts can likely be explained by external economic conditions, such as labor market oscillations and the business cycle. Overall, the mechanisms determining the returns to schooling in Kazakhstan are probably more complicated and require more sophisticated and detailed data to fully establish.