There are at least three possible sources of bias associated with the OLS estimates of the returns to FLS. First, there may be unobserved heterogeneity affecting both FLS and earnings (Chiswick and Miller 1995). Second, there may be reverse causality as individuals earning more can invest more in improving their FLS (Chen et al. 2014). Third, self-reported FLS may suffer from misclassification errors (Dustmann and Van Soest 2001, 2002). Although it seems difficult to eliminate all possible sources of endogeneity, I adopted several approaches to address potential bias.
The first and most important problem is the potential bias resulting from the endogeneity of FLS. If individuals studying foreign languages are on average more able, skilled or motivated, the OLS estimate is likely to overstate the wage premium, as I do not control for unobservable abilities and skills. The wage premium estimated in this way will reflect not only returns to FLS, but also returns to other skills and abilities that were useful for studying foreign languages and are useful at work. The standard approach to address unobserved heterogeneity is to use OLS and include control variables that proxy for cognitive skills (Toomet 2011; Fabo et al. 2017; Grin 2001; Fry and Lowell 2003; Ginsburgh and Prieto-Rodriguez 2011; Di Paolo and Tansel 2015). However, the results are rather descriptive in this case and cannot be interpreted causally. Other methods used include propensity score matching (Saiz and Zoido 2005), panel data methods (Saiz and Zoido 2005; Lang and Siniver 2009) and instrumental variables (Wang et al. 2017). Typically, the wage returns to FLS are identified by using foreign language proficiency as the key independent variable.
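The direction of this bias can be illustrated with a stylised case that is assumed here purely for exposition and is not the specification estimated in this paper: a single FLS indicator and a single unobserved ability term $A_{i}$ (a hypothetical symbol), with $u_{i}$ and $v_{i}$ uncorrelated with $FLS_{i}$:

$$\ln (w_{i}) = \beta_{0} + \beta_{1} FLS_{i} + \gamma A_{i} + u_{i}, \qquad A_{i} = \delta_{0} + \delta_{1} FLS_{i} + v_{i}$$

If $A_{i}$ is omitted from the regression, the OLS slope converges to

$$\operatorname{plim} \hat{\beta}_{1}^{OLS} = \beta_{1} + \gamma \delta_{1}$$

so that the estimated premium exceeds $\beta_{1}$ whenever ability raises wages ($\gamma > 0$) and is positively associated with FLS ($\delta_{1} > 0$).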
As I did not have any variable in my database that could serve as an instrument, I used OLS estimation and included a full set of control variables in the specification to reduce the concern that selection may bias the estimates. For this purpose, I used a unique property of the Human Capital Balance survey database: the information on the respondents’ 12 additional skills. Considering that the development of skills, both linguistic and non-linguistic, is determined by abilities and motivation, the inclusion of a wide set of respondent skills in the wage equation may, to a certain degree, reduce the endogeneity bias. However, I treat this approach as an exercise and do not claim that the endogeneity bias will necessarily be reduced in this way.
The second problem is potential reverse causality, because individuals with higher wages can more easily afford private language courses. The share of adults (aged 18–64) in my sample who learned foreign languages within 12 months before the survey is very small (0.4%), while all young people in Poland have to learn two foreign languages at school up to the age of 18. As the share of adults learning foreign languages is marginal, I argue that the estimates should not be substantially biased as a result of reverse causality. Nevertheless, I address this issue by running a robustness check in Sect. 6.
The third problem is the misreporting and rounding of FLS, as this skill is self-assessed by respondents. In order to minimize the bias that may be caused by misreporting, I based my measure of FLS on the key language production skill (i.e. speaking skill) only. Though I had information on reading and listening comprehension, I decided not to use these measures to eliminate possible bias resulting from the overestimation of these skills (see Footnote 3). Regarding rounding error, I believe that this error is not large because respondents in my sample were asked to assess their FLS on a relatively wide scale that has six levels. Obviously, I am aware that despite my efforts to minimize misreporting and rounding errors, they are present to some extent in my sample and they may bias my estimates.
My identification strategy is to use OLS to estimate the following wage equation:
$$\ln (w_{i} ) = \beta_{0} + FLS_{i} \beta_{1} + S_{i} \beta_{2} + X_{i} \beta_{3} + \varepsilon_{i} \tag{1}$$
where the dependent variable $\ln (w_{i})$ is the natural logarithm of hourly net earnings (see Footnote 4), $FLS_{i}$ is a vector of variables representing the level of foreign language skills, $S_{i}$ covers the map of respondents’ skills, and the vector $X_{i}$ contains other factors that may have an impact on earnings.
Vector $FLS_{i}$ consists of three binary variables representing the following levels of language skills: elementary, intermediate and advanced. I created these variables based on the assessment of the key language production skill, speaking. As the level of this skill was rated by respondents from 1 to 6, I defined FLS as:
- Elementary, if the reported level was 1 or 2;
- Intermediate, if the reported level was 3 or 4;
- Advanced, if the reported level was 5 or 6.
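The coding rule above can be sketched in a few lines of Python. This is only an illustration: the function name `fls_dummies` is hypothetical, and the treatment of respondents without any foreign language skills (coded here as `None` or `0`) as the omitted reference category is an assumption rather than something stated in the text.

```python
from typing import Dict, Optional


def fls_dummies(speaking_level: Optional[int]) -> Dict[str, int]:
    """Map a self-assessed speaking level (1-6) to three binary indicators.

    Assumption: respondents with no foreign language skills (None or 0)
    form the reference category, so all three indicators equal 0.
    """
    if speaking_level is None or speaking_level == 0:
        return {"elementary": 0, "intermediate": 0, "advanced": 0}
    if not 1 <= speaking_level <= 6:
        raise ValueError("speaking level must be between 1 and 6")
    return {
        "elementary": int(speaking_level <= 2),        # levels 1-2
        "intermediate": int(3 <= speaking_level <= 4),  # levels 3-4
        "advanced": int(speaking_level >= 5),           # levels 5-6
    }
```

For example, `fls_dummies(4)` sets only the intermediate indicator to 1, so exactly one indicator is switched on for every respondent who reports a level between 1 and 6.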
The map of skills ($S_{i}$) covers the 12 skills listed in the previous section. The model includes a separate variable for each skill, taking values from 1 to 5. Furthermore, the model also includes other control variables ($X_{i}$) that represent respondents’ features (gender, age, education level), as well as the characteristics of the local labour market (place of residence, region, year of survey). All independent variables are listed in Table 8 in the Appendix.
In order to eliminate outliers, I deleted 0.2% of the upper and lower extreme values from the distribution of the hourly net earnings. The linear regression model was estimated using OLS with heteroscedasticity-robust standard errors.
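The trimming and estimation steps can be sketched as follows, using NumPy only and synthetic data. Several details are assumptions made for illustration: that 0.2% is removed from each tail, that the robust covariance is the HC1 variant, and all names (`trim_tails`, `ols_robust`) and simulated coefficients. The skill and control variables of Eq. (1) are left out to keep the example short.

```python
import numpy as np


def trim_tails(values, share=0.002):
    """Boolean mask that keeps observations between the `share` and
    `1 - share` quantiles (assumption: 0.2% is cut from each tail)."""
    lo, hi = np.quantile(values, [share, 1.0 - share])
    return (values >= lo) & (values <= hi)


def ols_robust(y, X):
    """OLS coefficients with HC1 heteroscedasticity-robust standard errors.

    X must already contain a column of ones for the intercept.
    """
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich estimator: sum of e_i^2 * x_i x_i'
    meat = (X * resid[:, None] ** 2).T @ X
    cov = xtx_inv @ meat @ xtx_inv * n / (n - k)
    return beta, np.sqrt(np.diag(cov))


# Synthetic data: speaking level 0 (no skills, assumed reference) to 6.
rng = np.random.default_rng(0)
n_obs = 5000
level = rng.integers(0, 7, size=n_obs)
elementary = ((level >= 1) & (level <= 2)).astype(float)
intermediate = ((level >= 3) & (level <= 4)).astype(float)
advanced = (level >= 5).astype(float)

# Hypothetical premiums of 0.05, 0.10 and 0.20 log points.
log_wage = (2.5 + 0.05 * elementary + 0.10 * intermediate
            + 0.20 * advanced + rng.normal(0.0, 0.4, size=n_obs))
wage = np.exp(log_wage)

keep = trim_tails(wage)  # drop extreme hourly earnings
X = np.column_stack([np.ones(n_obs), elementary, intermediate, advanced])
beta, se = ols_robust(np.log(wage[keep]), X[keep])
print("coefficients:", beta)
print("robust s.e.: ", se)
```

Because the dependent variable is in logarithms, a coefficient $\beta$ on a binary indicator corresponds to a wage premium of $100\,(e^{\beta} - 1)$ percent relative to the reference category.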