Text analysis of job offers for mismatch of educational characteristics to labour market demands

Beręsewicz, Maciej; Cherniaiev, Herman; Mantaj, Andrzej; Pater, Robert

doi:10.1007/s11135-023-01707-7

Open access
Published: 21 July 2023

Volume 58, pages 1799–1825, (2024)
Quality & Quantity

Maciej Beręsewicz^1,4,
Herman Cherniaiev²,
Andrzej Mantaj² &
…
Robert Pater ORCID: orcid.org/0000-0001-7619-9843^2,3

Abstract

Nowadays, the traditional ways of job seeking have become less popular than digital methods. Recruitment websites are more attractive to job seekers since they provide easy, convenient access to a greater number of job vacancies. The biggest disadvantage, however, is that job vacancies published online are often unstructured and confusing. Studies related to online job vacancies are usually restricted to a short duration and a small number of recruitment websites. Such studies frequently use proxies for skills and occupations, or aggregate them into wider groups. The aim of our research is to provide full educational characteristics of job vacancies in Poland and calculate a complete list of educational mismatches. We introduce an approach that includes stages of source selection; data collection; and extraction of occupations, qualifications, and skills. We describe difficulties with data scraping and ways to overcome them. Thanks to our large dataset, we are able to determine and describe the labour demand. We also show the results of a survey that estimates educational traits of the labour supply. To measure mismatch between education and labour supply and demand, we use structural compliance indices. The paper also offers a case study for chosen occupational groups. Our findings reveal the greatest mismatch is in education and job-related skills, with the least mismatch occurring between geographic regions.

1 Introduction

There is an increasing need for a detailed analysis of companies’ demand for workers; that is, for occupations, skills (competencies) and qualifications (Deming 2017; Deming and Kahn 2018; Hershbein and Kahn 2018). Current surveys conducted by statistical offices do not contain information on company demands for future workers’ skills or qualifications. One may consider online job offers as supporting or alternative data to job market surveys. However, these data sources are unstructured, so relevant information should be extracted from this data to conduct quality assessments, to determine representativeness of the data, and to develop an estimation process.

The purpose of this research is to provide complete educational characteristics of labour demand with job vacancies. Educational characteristics include occupations, qualifications, and skills. Skills are divided into transversal and job-related categories. Initially, extracting occupational data is especially important because official statistics use the International Standard Classification of Occupations (ISCO, International Labour Office 2012). Educational terminology is based on qualifications grouped by fields of study. To compare labour demand with educational characteristics, this standardised qualification terminology must be used. Finally, development of the global economy brings greater attention to skills, as people’s knowledge and skills become more specialised. For this reason, extraction of job skill data is extremely important for research.

We use online job offers as rich sources of data on labour demand. Although these data are comprehensive, online job databases often contain extraneous and unstructured information. To better structure this information, we use labour market and educational terminology with official classifications.

Our main contribution is the proposal of an approach for analysing online job offers in the context of their detailed structural information. We use information from job offers to extract educational characteristics of job openings. This information is then used to calculate educational mismatches between labour supply characteristics and labour demand as observed in job offers. We show that such detailed information may be used to provide estimates of skill (competence), qualification, occupation, and regional mismatch between labour supply and demand. We provide a robust procedure that can be applied to different websites with minimum comparability. The procedure consists of choosing online job portals, collecting data from them, extracting relevant information, and then performing representativeness corrections based on this information. We present some challenges related to this research and propose solutions to them. Through the use of large dictionaries, we use detailed classifications and explicit measures of skills instead of proxies to more accurately describe labour demand. We then apply this method to the Polish job market and compare the results with ongoing representative surveys.

These results are especially important for economists, the educational sector, and labour market institutions, such as the Organisation for Economic Co-operation and Development Skills Strategy (OECD 2011). Such detailed information may be used to adjust labour market and educational policy, especially policies directed toward reducing structural unemployment.

The article is organised as follows. Section 2 reviews the related literature on collecting data from online job websites and using them to analyse labour market matching. Section 3 describes our data sources, collection, and preparation. Section 4 shows the measures of mismatch that we use. Section 5 presents our results. Section 6 discusses the results and provides our conclusions. Appendix 1 includes the data collection algorithm. Appendix 2 contains the representativeness analysis.

2 Related literature

Our approach relates to previously conducted online job offer research. This method is based on collecting online job offers and analysing them using data mining techniques. Most of the previous research of this type was devoted to a specific topic, was not conducted periodically, or was based on a single job website. Although such research may prove valuable, our method may be compared to others that address data from many online job sites on a regular basis. One such study was conducted by the Australian Bureau of Statistics and New Zealand’s Department of Labour (Wall and Fale 2010). This study analysed online job vacancies from selected websites according to occupation, industry, and region. The educational characteristics of offered jobs were not emphasised, as the index was mainly used to track time series changes of vacancies by occupation, industry, and region.

The longest history of analysing job offers has been maintained by the US Conference Board, currently publishing the Help Wanted OnLine index^{Footnote 1} (see e.g., Barnichon 2010). The index is disaggregated to local labour markets. A more detailed analysis of skills has been performed in cooperation with Burning Glass Technologies. A relatively large number of articles has been written based on these data in recent years. However, they contain little information about the procedure of collecting, cleaning, and classifying data, as their data processing methods are largely unknown. Hershbein and Kahn (2018) show that the representativeness of Burning Glass Technologies data is stable over time across groups of occupations. Acemoglu et al. (2020), Blair and Deming (2020), Deming and Noray (2020), Forsythe et al. (2020), Modestino et al. (2022), and Kudlyak et al. (2022) study the requirements of employers in these job offers and their changes. Among the variables they extract are: wages, level of education, industry, work experience, occupational groups, and skills across local labour markets.

Cedefop (2019a; b) and Colombo et al. (2018) conduct the largest study in Europe, collecting online job offers from all of the European Union member states. While the dataset is large, Cedefop pays limited attention to the composition of websites from which the information is collected. The approach is rather focused on obtaining large numbers of job offers. This leads to a potential overuse of job portals that aggregate job offers from other sources. Such websites may not last long or may change their sources frequently, which leads to instability of results. Also, a problem of language treatment arises, wherein different languages may require dedicated approaches. Finally, the data collection and treatment of Cedefop makes it difficult to analyse and correct various biases of online data, which leads to data representativeness problems (Beręsewicz and Pater 2020).

Lovaglio et al. (2020) used Cedefop data to compare the characteristics of online job offers and survey-based data on vacancies. They show selected time series properties of online job offers. We use online job offer data to show the usefulness of their structural properties at disaggregated levels. In our research, we aim to provide the most thorough educational characteristics of vacancies, while minimising other potential information obtained from online job offers.

3 Data

3.1 Sources

As there are many websites with job offers on the Internet, estimating the number of websites with job offers in all of Poland with reasonable accuracy would be an unrealistic task. Websites with job offers are also heterogeneous; they can be divided into types. In particular, these types include:

Country-specific specialised websites.
Locally specialised websites, most often encompassing a city, a community (NUTS-5 region according to the European Union Nomenclature of Territorial Units for Statistics), or a voivodeship (NUTS-2 region).
Specialised websites with job offers, such as financial occupations or information technology (IT) jobs.
Websites of Local Labour Offices (LLOs) operating in the NUTS-4 region, and official Public Information Bulletins (PIBs) containing most of the job offers in the public sector.
Employer websites.
Internet forums and social media groups (for example, Facebook groups).
Websites that aggregate job offers from other portals.

Information obtained from the artefakt.pl website indicates that 97.3% of internet users in Poland use the Google search engine. We used Google Trends to find the most popular internet websites with job offers. Google Trends is an index of the volume of Google queries by geographic location and category (Choi and Varian 2012). This technique was suggested by Askitas and Zimmermann (2015) for social sciences. Most often, after entering a search for local job offers, Google search shows links to countrywide websites. This is connected to the use of positioning techniques by website administrators. Countrywide websites often contain paid advertisements of job offers.

Job offers from local websites often contain a short description with more detailed job descriptions (e.g., employee requirements, skills, detailed qualifications) shown less often than for national websites. Also, employers using such sites know less about the professional terminology of the labour market and related education. Such offers may contain grammatical errors and may be less structured. They more often contain job offers for people without higher education. Local websites are more popular in smaller or medium-sized cities than in major cities. They often contain various local advertisements in addition to job offers. Some sites also assume the role of local information portals that additionally enable job posting.

A significant share of branch websites with job offers allow posting of job offers only after completing a registration process. These websites often allow free advertising. Fees are often charged for placing promoted job offers. To a large extent, they also contain branch articles, guides, and news about a selected branch.

Queries regarding the pages of Local Labour Offices (LLOs) and Public Information Bulletins (PIBs) constituted a relatively large part of all job searches. However, as a source of information about job offers they are less than half as popular as websites with national coverage. LLO and PIB websites are more popular in smaller and medium-sized areas than in large cities.

Job offers posted in Facebook groups are a separate category of information sources for employment seekers. These groups can be public or private. Access to public groups is transparent, and interaction with people posting job advertisements is permitted after joining the group. Private groups have limited access to job offers, as a person must be admitted to the group to view them. Groups on Facebook enable much greater interaction between job posters and job seekers. They usually have a regional specification, such as work in Kraków, or a specialisation, such as jobs for computer graphic designers.

Internet job seekers less frequently query the websites of a potential employer or specialised websites, especially in smaller cities. Specialty websites, local websites, and social networking sites occasionally also include jobs. Job descriptions here are unclear as to whether they relate to any formal contract. Websites that aggregate job offers from other websites contain the most job offers. However, they do not monitor which offers are obsolete or are still valid. The information on these websites is organised in various ways, since they refer to several differently-structured websites. Moreover, collecting data automatically from job aggregators is difficult because such websites usually block web crawlers. Two possible solutions to this limitation are to limit the number of data requests or to use proxy servers for web scraping.

We classified websites according to the frequency of search queries and average numbers of job offers they contained. We also used data from a media tracking website (Wirtualne Media^{Footnote 2}) on most popular websites according to registered users, coverage, and page views. Finally, we chose twenty-five additional websites (see Table 6 in Appendix 2). These included mostly national websites, but also a few sites that aggregate job offers and websites that contain local subsites (e.g., the OLX portal). These websites cover both national and local job offers, bigger cities and small towns, and ensure a sufficient quantity of job offers. We excluded other websites for reasons including insufficient coverage of job offers containing the necessary detailed information (such as required skills), or information that was not readily extractable (e.g., Facebook groups).

3.2 Data collection procedure

Collecting online data from multiple sources manually is extremely impractical, especially when it must be repeated cyclically. To aggregate job offers automatically, we developed web scraping applications (data collection tools).^{Footnote 3} The biggest advantage of web scraping is that once the application is ready, it can be used multiple times without much additional interaction. However, the application needs to be designed for each website separately, taking into consideration the website’s features such as its structure, type (dynamic or static), the technologies used, limitations, etc. Moreover, even a small change in a website may cause a critical error in the scraping tool, so monitoring of the applications is also an important part of the data collection process.

Since the collected job offers were written mostly in Polish and English, we limited the analysis to these languages. At the initial stage of website analysis, a problem with the encoding of non-ASCII characters (including Polish letters) on Windows operating systems was identified. The issue was resolved by saving all information obtained from the various portals in the 8-bit Unicode Transformation Format (UTF-8), the most common text encoding, thus ensuring consistent data formatting.

Before the programming platform was created, a thorough analysis of websites with job offers was carried out in order to recognise the structures and features of such websites. Some websites load page content dynamically during scrolling, while others require the user to be signed into a website-specific account to view job offers; specific software was required to address these situations. All of these features have a significant impact on the design of the application and on its ability to keep working in the future. We noted that websites are characterised by:

Unique user interface (UI) design.
The necessity of authorization (user account creation and login).
Different paging mechanisms (e.g., selected websites provide a range of subpages, while other websites read subpages dynamically).
Various naming conventions for regions, although every website used NUTS-2 classification.
Various interaction models (e.g., selection of information after completing the form, selection of data based on shared selection lists, and dynamic loading of data using buttons and text fields).
Limited website availability at times.
Data inconsistency (e.g., non-existent links, expired job offers, non-existent web pages), or incorrectly listed voivodeships of presented job offers.
Restrictions on website traffic from the same IP addresses; if a page is downloaded or refreshed too often, the visitor is redirected to an authorisation page that uses a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) mechanism before the page can be reloaded.
Encrypted HTTPS traffic (Hypertext Transfer Protocol Secure) required for selected portals.
Encoding of links and job titles in the URL (Uniform Resource Locator) standard, forcing a conversion to the UTF-8 standard.
Technical issues (e.g., HyperText Markup Language, HTML, syntax errors, skipped tags, incorrect tag parameters, CSS file and/or JavaScript errors).
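
The URL-encoding item in the list above can be illustrated with a short sketch. It is written in Python for brevity and is not the authors' code (their system was implemented in Java); the function name `decode_url_text` and the sample path are hypothetical. It uses the standard-library `urllib.parse.unquote` to turn percent-encoded links or job titles back into UTF-8 text.

```python
from urllib.parse import unquote


def decode_url_text(encoded: str) -> str:
    """Decode percent-encoded text (as found in links) into a UTF-8 string.

    Invalid byte sequences are replaced rather than raising, because
    scraped links are not guaranteed to be well formed (an assumption).
    """
    return unquote(encoded, encoding="utf-8", errors="replace")


# Hypothetical link fragment containing Polish letters.
print(decode_url_text("oferta/kierownik-sprzeda%C5%BCy-krak%C3%B3w"))
```

Decoding at the point of collection keeps every stored title in the same encoding, which matters later when titles are compared with dictionary phrases.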

Based on the observed limitations, specific requirements, and available technologies, we decided to develop an automated web crawler that resembles human activity. To meet the requirements for the processing of website data using different technologies while taking into account the limitations of each site, and to ensure interaction with each of the services, after the initial testing period, the target system architecture was determined as follows:

The application shall be implemented using a high-level programming language such as Python, Java, or R; Java was selected due to the ease of migration between Windows, Linux, and macOS systems.
The application shall use selected frameworks providing access to Web API (Application Programming Interface); for this purpose, Chrome Devkit was selected.
The framework provides access to the DevKit data structure, ensuring two-way communication. Selenium WebDriver was selected for this purpose.
Behaviour scenarios should be developed for each website to ensure appropriate interaction with the selected website and to save data in a uniform format.

The developed system works automatically, based on behavioural patterns defined for each of the processed portals. Each pattern contains the following information:

The home page of the portal;
The voivodeship naming scheme;
Interaction with the website to select job offers from the indicated region;
A scheme for detecting, on individual subpages, the links that lead to full job offers; we used regular-expression pattern matching for this purpose;
A scheme for downloading the full content of the job offer for each link collected previously.
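
As an illustration of the link-detection scheme in a behavioural pattern, the following Python sketch filters the links on a listing page with a portal-specific regular expression. It is a simplified assumption of how such a scheme might look, not the authors' Java implementation; the function name, the `link_pattern` parameter, and the example URLs are hypothetical.

```python
import re
from typing import List
from urllib.parse import urljoin

# Naive href extractor; a real crawler would rely on the browser's DOM.
_HREF = re.compile(r"""href=["']([^"']+)["']""")


def extract_offer_links(html: str, base_url: str, link_pattern: str) -> List[str]:
    """Return absolute, de-duplicated offer links found in a listing page.

    `link_pattern` is the portal-specific regular expression that
    distinguishes links to full job offers from all other links.
    """
    links: List[str] = []
    for href in _HREF.findall(html):
        if re.search(link_pattern, href):
            absolute = urljoin(base_url, href)
            if absolute not in links:  # keep first occurrence only
                links.append(absolute)
    return links
```

Keeping the pattern as data rather than code means that a change in a portal's link format only requires editing that portal's pattern.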

The data collection algorithm is run on a weekly basis. This high frequency of data collection allows us to obtain job offers with a short life cycle (job offers that expire soon after publishing). The algorithm is also designed to provide data from portals that for various reasons are not always available, or from which data are difficult to download. The steps of the algorithm for downloading data from one portal are described below (see also Fig. 2 in Appendix 1).

The first step of data collection is to ensure that the website responds (step 1 in Fig. 2). For various reasons, the website may be unavailable and its pages may not load. In this case, a one-day delay is advised (step 2), after which the website is checked again (step 3). If the website is still unavailable, a revision of the website is required (step 4); further data collection is not possible, so the cycle ends. There are multiple reasons why a website may not respond: it may have been closed by its owners, or our IP address may have been blocked due to a large number of requests per unit of time (usually per minute or hour).
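
Steps 1 to 4 amount to a check, a single delayed re-check, and a hand-off to manual revision. A minimal Python sketch of that control flow is given below; it is an illustration rather than the authors' Java code, and the callables `is_up` and `wait_one_day` are hypothetical stand-ins for the real availability probe and scheduler.

```python
from typing import Callable


def check_availability(is_up: Callable[[], bool],
                       wait_one_day: Callable[[], None]) -> str:
    """Decide whether a collection cycle can start for one portal.

    Returns "collect" if the portal responds (possibly after one retry)
    and "revise" if it is still unavailable after the one-day delay.
    """
    if is_up():          # step 1: does the website respond?
        return "collect"
    wait_one_day()       # step 2: one-day delay
    if is_up():          # step 3: check again
        return "collect"
    return "revise"      # step 4: manual revision, cycle ends
```

Passing the probe and the delay in as functions keeps the decision logic testable without real network access or real waiting.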

If the website responds, the data collection process begins. In steps 5.1–5.3 we collect all links of job offers for each region. The steps of automated link collection are as follows:

1.
Load the main portal page.
2.
Load the page interaction pattern and then download information to interactively select the voivodeship.
3.
Determine the number of website subpages on the basis of web page content analysis, using the developed HTML parser and processing algorithm.

When all links have been collected, we need to ensure that the process has finished successfully (step 6). Even small changes in website structure may cause an error. If an error occurs, the application needs to be checked and fixed (step 7). This step requires user interaction. Usually, such changes are small and do not require a major redesign of the data collection tool; however, some changes are significant (e.g., a change from a static to a dynamic website) and may require a major redesign of the application architecture. After the application is fixed (step 8), steps 5 and 6 are repeated.

In the next step (step 9), for each link collected in step 5, the algorithm downloads full information about job offers. We do not scrape multiple job offers from a single data source simultaneously, as this would overload the website and may result in IP blocking. Multiple applications may download data from many sources in parallel, but as with multiple job offers, the applications must not poll the websites too often, or else they run the risk of having access from the requestor IP address blocked, or of having queries to the site detected as a DDoS (distributed denial-of-service) attack. Information about daily visits to a website can be obtained with the Alexa and Similar Web^{Footnote 4} tools in order to estimate the number of requests permitted, so as not to overload a website’s hardware. The solution to this problem is to use delays between repeated website visits. A mechanism to delay sending requests to websites has been introduced to the application, with the ability to select a separate delay time for each website. Implementation as Java applications and the use of additional libraries available for various hardware platforms allow the system to be run on Windows and Linux-based systems.
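
The per-website delay mechanism can be sketched as a small throttle that remembers when each site was last requested. The Python code below is an assumption about one possible design, not the authors' Java implementation; the class name `SiteThrottle`, the delay values, and the injectable `clock` and `sleep` parameters are all hypothetical.

```python
import time
from typing import Callable, Dict


class SiteThrottle:
    """Enforce a minimum delay between consecutive requests to each site."""

    def __init__(self, delays: Dict[str, float], default_delay: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._delays = delays            # site -> delay in seconds
        self._default = default_delay    # used for sites without an entry
        self._clock = clock
        self._sleep = sleep
        self._last_request: Dict[str, float] = {}

    def wait(self, site: str) -> float:
        """Block until `site` may be requested again; return seconds slept."""
        delay = self._delays.get(site, self._default)
        now = self._clock()
        last = self._last_request.get(site)
        pause = 0.0 if last is None else max(0.0, last + delay - now)
        if pause > 0:
            self._sleep(pause)
        # Record the moment at which the request is actually allowed.
        self._last_request[site] = now + pause
        return pause
```

A crawler would call `throttle.wait(site)` immediately before each request. Because the state is kept per site, a slow portal does not hold back requests to other portals.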

The process of data collection for each link is as follows:

1.
Check whether a job offer was already downloaded in the previous cycle (step 10). Since the cycle repeats every week, some links which have been already downloaded may be retrieved with the new links. As it is pointless to download these links again, such job offers must be ignored while data collection continues to another link.
2.
If a job offer was not previously downloaded, the application downloads it (step 11).
3.
The application must verify whether the full data of a job offer was collected successfully (step 12). If so, it moves to the next job offer; otherwise, the job offer needs to be downloaded again.
4.
In the case of download failure, the application checks whether it is the first attempt to redownload the job offer’s data (step 13). There is a defined maximum of two attempts to redownload a link. In some cases, an error may occur and the data downloading process stops. This may be due to a job offer unavailable for downloading (access closed by owners) or a broken/invalid link. In such cases, a job offer will be unavailable even after repeated attempts. In step 14, the application tries to download the data once again after a small delay (up to 2 min). If the downloading error still occurs, this job offer link is dropped (step 15), and the application continues to download the next link.
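
The per-link procedure (steps 10 to 15) can be summarised in a short Python sketch. It is an illustration under stated assumptions rather than the authors' Java code: `fetch` is a hypothetical function that returns the page content or `None` on failure, `pause` stands for the short delay before a retry, and the default of two retries follows the stated maximum of two attempts to redownload a link.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple


def download_offers(links: Iterable[str],
                    already_downloaded: Set[str],
                    fetch: Callable[[str], Optional[str]],
                    pause: Callable[[], None] = lambda: None,
                    max_retries: int = 2) -> Tuple[Dict[str, str], List[str]]:
    """Download each new link, retrying failures a limited number of times.

    Returns the collected pages and the list of links that were dropped.
    """
    collected: Dict[str, str] = {}
    dropped: List[str] = []
    for link in links:
        if link in already_downloaded:      # step 10: seen in a previous cycle
            continue
        content = fetch(link)               # step 11: first download attempt
        retries = 0
        while content is None and retries < max_retries:  # steps 12-14
            pause()                         # short delay before retrying
            content = fetch(link)
            retries += 1
        if content is None:
            dropped.append(link)            # step 15: give up on this link
        else:
            collected[link] = content
    return collected, dropped
```

Links are processed one at a time, in line with the decision not to request several offers from the same source simultaneously.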

The number of download iterations performed by the algorithm equals the total number of links ($i=1,\dots ,N$, where $N$ is the total number of links). In steps 9 and 16, the application ensures that all information about a job offer has been collected.

The last step (step 17) is the data export. At this point, we collected the full information about web pages with job offers. This was important for several reasons. Very often, the information published on websites ceased to be available after some time, which would have prevented access to historical data. Services change the structure of their portal with varying frequency, making it impossible to develop a single interaction and parsing pattern for all sites. By using interaction patterns in the application, we can easily make changes as well as save the behavioural patterns for historical data. Portals are characterised by different methods of accessibility. There are technical issues and many other situations that also may prevent access to content. By using an iterative data processing algorithm, the system retrieves as much data as possible at each iteration, and repeats iterations until a complete data set is obtained.

To prepare gathered information for analysis, we applied a parsing process to the data. The data stored in the standard HTML format were converted into plain text. Parsing removes such information as font size, colour, and other formatting tags unrelated to the intended job data. After this step, we could proceed to the analysis.
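
A minimal sketch of this HTML-to-text conversion is shown below, using Python's standard-library `html.parser`. It is an illustration only, not the parser developed by the authors; the class and function names are hypothetical, and skipping the contents of `script` and `style` elements is an added assumption.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags, attributes, scripts and styles."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the fragments and normalise runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Convert one stored job offer page into plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Text fragments are joined with a space, so a word split by inline markup (for example, bold applied to part of a word) would be separated; a production parser would need to treat inline and block elements differently.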

3.3 Cleaning and extraction of relevant information

After the data has been collected and stored in HTML format (each file representing one job offer), it needs to be processed. Raw data cannot be analysed, so we extract the following information:

The name of the online job board that contains the job offer.
Original link to the job offer.
Title of the job offer.
Location (NUTS region).
Date of publication.
Type of employment contract.
Position level.
Offered remuneration.
Job type (full- or part-time).
Additional job benefits.
Full content of the job offer.

During the text extraction, it was found that some Polish letters may be incorrectly encoded. Due to situations such as this, it is crucial to control the encoding process. While analysing remuneration in job offers, it is important to convert any hourly wages to monthly wages, as monthly wages are mostly used in Poland. The currency type also must be checked for consistency. Websites may use various formats of publication and expiration dates. To aggregate job offers by date, a common format must be decided upon for conversion (such as YYYY-MM-DD). Since job offers were collected in Polish and English, the data needed to be written in one common format. Examples include voivodeship names and decimal numbers, which in Polish are written with a ‘,’ separator and in English with a ‘.’ separator. Other information can be extracted from the job offer text using dedicated natural language processing techniques.
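
The normalisation rules described above (decimal separators, hourly-to-monthly wages, and a common date format) can be sketched as follows. This Python code is an illustration, not the authors' implementation. The constant of 168 working hours per month and the list of accepted date formats are assumptions made for the example, and thousands separators other than spaces are deliberately not handled.

```python
from datetime import datetime

# ASSUMPTION: 168 working hours per month (21 days x 8 hours); not from the source.
HOURS_PER_MONTH = 168

# ASSUMPTION: input date formats that the portals might use.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def parse_decimal(text: str) -> float:
    """Parse a number written with either ',' (Polish) or '.' (English)."""
    cleaned = text.replace("\u00a0", "").replace(" ", "")
    return float(cleaned.replace(",", "."))


def to_monthly(amount: float, period: str) -> float:
    """Convert a wage to a monthly amount; `period` is 'hour' or 'month'."""
    if period == "hour":
        return amount * HOURS_PER_MONTH
    if period == "month":
        return amount
    raise ValueError(f"unknown wage period: {period!r}")


def normalise_date(text: str) -> str:
    """Return the date as YYYY-MM-DD, trying each known input format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")
```

For example, an hourly wage written as `25,50` would be parsed as 25.5 and converted to 4284.0 per month under the assumed 168 hours.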

The main purpose of collecting and processing data obtained from job offers was to identify occupations, qualifications and skills based on the contents of job offers provided by employers, on the basis of publicly available website information. We used official classifications for this purpose. Direct comparison of benchmark data included in dictionaries of classifications with actual job offers did not produce the expected results. The level of recognition of certain phrases from classifications in text written by employers was very low. This inaccuracy of exact matching of the content entered by employers to the expected dictionary resulted from different conditions:

Differing sentence forms;
Differing word forms;
Meaningless words (‘stop words’, which prevent further lexical analysis);
The use of abbreviations;
The use of synonyms;
Incomplete matching;
Typographical errors;
Polish-language words written using only English characters;
Excessive whitespace characters (spaces, tabs, newline characters, etc.);
HTML entities in the text that encode special characters (e.g., &gt;, &lt;, and &amp; for >, <, and &, respectively).

The removal of excess whitespaces was performed using an appropriate regular expression, which for the entire text string finds any sequence of consecutive whitespaces and replaces the sequence with a single space. In addition, punctuation marks and other non-alphanumeric characters were removed. Special HTML tags remained unchanged.
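
A regular-expression sketch of this cleaning step is given below in Python. It is an approximation rather than the authors' code: punctuation is replaced with a space (instead of being deleted outright) so that neighbouring words are not glued together, punctuation is handled before whitespace so that the result contains only single spaces, and the exception for special HTML tags is not reproduced. The function name `clean_text` is hypothetical.

```python
import re

# Any character that is neither a word character nor whitespace, plus "_".
_NON_ALNUM = re.compile(r"[^\w\s]|_")
# Any run of consecutive whitespace (spaces, tabs, newlines, ...).
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove punctuation and collapse whitespace runs to single spaces."""
    without_punctuation = _NON_ALNUM.sub(" ", text)
    return _WHITESPACE.sub(" ", without_punctuation).strip()


print(clean_text("Kierownik  sprzedaży,\t(B2B)\n\nKraków!"))
```

In Python 3, `\w` matches Unicode letters by default, so Polish characters such as `ż` and `ó` are preserved.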

Of the issues listed, the second (differing word forms) decreased matching accuracy the most. The Polish language contains many word variants. It has seven cases for nouns, and each of them may take a different word form. The forms of the verb “to be” can differ for every grammatical person. These word variants are not exceptional; they are common in the Polish language.

In order to improve the level of recognition of job offer texts and the share of classified job offers, we decided to preprocess the actual job titles and content of job offers into a form that allowed for much more accurate matching of the content to the dictionary template. We used a morphological library (Morfologik-stemming-1.9.0) to unify the word forms to be analysed. The library has extensive possibilities for analysing Polish words. It contains 4,800,432 words with different variations. Some words were excluded from the processing mechanism due to sentence semantics.

The application uses a mechanism for converting words from any form to their basic form (lemmatisation); that is, verbs to their infinitive forms and nouns to their nominative singular forms. We lemmatised the text of the dictionaries and of the job offers (including titles).

At this stage, our most important task was to find specific educational traits within the text of job offers. Because job offers can contain various words to describe the same trait (occupation, qualification, skill, etc.), the algorithm must address the situation in which the trait is mentioned in the sentence, but with different words than those in the dictionary. To solve this problem, we first prepared a list of occupations, qualifications, and skills using the European Skills, Competences, Qualifications, and Occupations (ESCO) classification (European Commission 2020), and the International Standard Classification of Education (ISCED-F 2013) across fields of education and qualifications (UNESCO 2013). The advantage of ESCO as the classification of skills is that it supplies a dictionary of over 13,000 transversal and job-related skills, which is significantly more than alternative classifications (Pater et al. 2019). It also uses the International Labour Organisation ISCO classification of occupations, expanding the 4-digit codes of groups of occupations into 6-digit codes of occupations. This supplies almost 3,000 occupations. ISCED-F 2013 is the most commonly used qualification classification. In this case, alternative classifications such as ESCO provide qualification titles in excessive detail containing the exact source of qualification, such as, “Bachelor degree in Primary School Education. Department of Primary Education. Faculty of Education of Florina. University of Western Macedonia”. Companies expect future workers to finish a specific faculty, in this case ‘education’, but usually do not specify an exact school or university a person must have a degree from.

These official classifications formed only basic dictionaries, because companies can (and often do) use many synonyms, casual terms, and abbreviations to describe the occupations, qualifications and skills they seek; thus, we built dictionaries of such synonyms. We supplemented the ESCO dictionary of transversal skills with synonyms from online dictionaries. The basic job-related skills dictionary was supplemented with synonyms also provided by ESCO. The dictionary of educational fields and qualifications was left unchanged because of its specificity. We supplemented the ESCO dictionary of occupations with synonyms provided by the Statistics Poland agency, and with synonyms of vocational occupations provided by the Educational Research Institute in Warsaw. The dictionary was also supplemented with the most common job titles in the Central Job Offers Database, to which Local Labour Offices (LLOs) submit job offers. This database contains job offers for which a Local Labour Office clerk assigned an ESCO occupation code. Individual assignment of a code by a qualified LLO clerk ensured, to some extent, that the coding was correct. To further increase this probability, we used only the most common of these job titles.

In the dictionary of occupations, we encoded as ‘000000’ the job offers that did not indicate an occupation, but instead a specific workplace, such as seasonal work, casual work, or occasional work. We encoded as ‘999999’ any job that was not perceived as employment by official statistics; these included internships, contracts for specific work, and contracts of mandate.

For all classifications, we used numeric codes. We sorted the dictionaries from the most specific phrases to the most general; for example, ‘sales manager’ occurred before ‘manager’. The matching algorithm tried to match each dictionary phrase with the job offer text in that order. In the case of managerial occupations, the algorithm first tried to match all specific types of managers with the job offer text; only if this failed was the word ‘manager’ checked. The exceptions were code ‘999999’, which was searched first, and code ‘000000’, which was searched last. This ensured that a job not considered as employment would not be counted as valid, but would be noted in the database; and that some short-term jobs, mostly without high requirements, would also be counted, but as a separate category. Sometimes two or more identical phrases were assigned different codes. In such a case, the trait was counted as 1/n for each code, where n is the number of codes sharing the phrase, so that the fractions always summed to 1, meaning one occurrence.
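
The search order and the fractional counting described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: specificity is approximated by the number of words in a phrase, the search stops at the first phrase found, the occupation codes `OCC-A`, `OCC-B`, `OCC-C1`, and `OCC-C2` are hypothetical placeholders, and the function names are invented for the example.

```python
from collections import defaultdict

NOT_EMPLOYMENT = "999999"  # searched first
WORKPLACE_ONLY = "000000"  # searched last

# Hypothetical dictionary of (lemmatised phrase, code) pairs.
ENTRIES = [
    ("manager", "OCC-A"),
    ("sales manager", "OCC-B"),
    ("analyst", "OCC-C1"),
    ("analyst", "OCC-C2"),  # identical phrase listed under a second code
    ("internship", NOT_EMPLOYMENT),
    ("seasonal work", WORKPLACE_ONLY),
]


def order_entries(entries):
    """Sort entries: 999999 first, 000000 last, and in between from the most
    specific to the most general phrase (assumed here: more words first)."""
    def key(entry):
        phrase, code = entry
        rank = 0 if code == NOT_EMPLOYMENT else 2 if code == WORKPLACE_ONLY else 1
        return (rank, -len(phrase.split()), -len(phrase))
    return sorted(entries, key=key)


def match_offer(text, entries):
    """Return {code: weight} for the first dictionary phrase found in the text."""
    codes_by_phrase = defaultdict(list)
    for phrase, code in entries:
        codes_by_phrase[phrase].append(code)
    padded = f" {text} "  # pad so that only whole words are matched
    for phrase, _ in order_entries(entries):
        if f" {phrase} " in padded:
            codes = codes_by_phrase[phrase]
            # A phrase shared by n codes counts as 1/n for each of them.
            return {code: 1 / len(codes) for code in codes}
    return {}


print(match_offer("experienced sales manager wanted", ENTRIES))
print(match_offer("junior analyst", ENTRIES))
```

In this sketch an offer mentioning both ‘internship’ and ‘manager’ is assigned ‘999999’, because that code is searched before any occupation phrase.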

After finishing the dictionaries, we analysed matching results, checking the results individually in a random sample of 1,000 job offers. During this procedure we created a dictionary of exceptions. This dictionary contains mostly single words with ambiguous meanings. Whenever possible, these words were substituted with phrases that clarified the context. We also found and corrected mistakes in the morphological library itself. As a result, for the purposes of the study we collected 40 million job offers during 2017–2019, based on which we extracted relevant educational profiles required by employers from potential job seekers. To calculate degree of mismatch, we needed respective information on the labour supply, which is described in the next section.

3.4 Labour supply survey

The main challenge in designing a method of continuous mismatch monitoring in the labour market, especially at a detailed level, was to prepare a tool that allows for quick aggregation of data from large representative samples of people of working age. Moreover, the chosen technique should be compatible with another equally important source of information; in this case, online job advertisements. The choice of the Internet as a source of job advertisements (labour demand), as well as the means of conducting labour supply research among people of working age, results from the fact that in 2016, 93.7% of enterprises (including 93.2% with broadband) and 80.4% of households (of which 75.7% used broadband) had internet access. Considering also the range of mobile internet services, a significant portion of Polish society is within the reach of the Internet and uses it.

In order to calculate labour market mismatch, we needed the characteristics of labour supply (job seekers) to compare them to the demand for labour (observed through job offers). The labour supply data was obtained from a CAWI (Computer-Assisted Web Interview) survey of Poles aged 18–65. The study was conducted on a nationwide random-quota sample of N = 16,119, where quotas for gender, age and size of the place of residence were consistent with those in the Polish population (Table 1). The questionnaire included a division of respondents into people currently working and not working. Even though some respondents were not looking for a job at the time of the survey, they still could have been considered as part of the population we aimed to study; that is, of people potentially interested in finding a job. The survey was conducted during an unprecedented boom in both the Polish economy and its job vacancy market, with many employers having unfilled work positions. The market was thus full of employment possibilities, a perfect situation for this study. The questionnaire consisted of four main sections: demographics, work situation, qualifications, and competencies.

Table 1 The structure of the sample for the population of Poles aged 18–65

Since we used online job offers to measure vacancies, the survey of potential job seekers included only internet users. The CAWI survey was conducted during September and October 2017.^{Footnote 5}

4 Mismatch indices

The scale of educational mismatch in the labour market at the aggregate level is usually measured with structural compliance indices (Schioppa 1991). We decided to use four such measures, ${V}_{1}$ to ${V}_{4}$, together with the Pearson correlation coefficient. A common feature of these measures is an uneven response to differences between individual components of the considered structures. Two of them, namely ${V}_{2}$ and ${V}_{3}$, react strongly to changes in components with low shares, while another (${V}_{4}$) more strongly differentiates changes in components with high shares. Of all the measures, ${V}_{1}$ is the most resistant to differences in the structures of supply and demand. Apart from the correlation coefficient, all of the presented indices are measures of differentiation, which means that the higher the index, the more different the structures are, and hence the greater the labour market mismatch. An index value of 0 means a full match, while a value of 1 means a total mismatch. The indices, excluding the correlation coefficient, have two properties: their range of variation is $\langle 0,1\rangle$, and they meet the symmetry condition. At the stage of data preparation, components for which both the demand share $\alpha$ and the supply share $\beta$ were equal to 0 were excluded from the analysis.

The ${V}_{1}$ measure is often used in the economic literature (Jackman and Roper 1987), which makes the results of this analysis comparable with previous studies. However, those analyses were conducted in less detail than those presented in this article. This measure takes the following form:

$$V_{1} = \frac{\sum_{i = 1}^{k} \left| \alpha_{i} - \beta_{i} \right|}{2} \tag{1}$$

where ${\alpha }_{i}$ is the share of job offers requiring component $i$ of a given educational trait, representing the demand side of the market, and ${\beta }_{i}$ is the share of surveyed respondents (potential job seekers) who possess it, representing the supply side. The index $i=1,\dots ,k$ runs over the components of a given educational trait, such as occupation, field of education, transversal skill, job-related skill, or region. ${V}_{1}$ was calculated separately for each educational trait. For occupations, for example, $i$ represents each occupation, and ${V}_{1}$ is calculated over $k=2{,}747$ occupations.
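
As an illustration of Eq. (1), the following minimal sketch computes ${V}_{1}$ for two share vectors. The function name `v1` and the three-component example shares are hypothetical and are not taken from the study's data.

```python
def v1(alpha, beta):
    """Half the sum of absolute differences between demand and supply shares."""
    if len(alpha) != len(beta):
        raise ValueError("alpha and beta must have the same length")
    return sum(abs(a - b) for a, b in zip(alpha, beta)) / 2


# Hypothetical shares over k = 3 components; each vector sums to 1.
demand = [0.5, 0.3, 0.2]
supply = [0.2, 0.3, 0.5]
print(v1(demand, supply))  # approximately 0.3
```

With shares that sum to 1 on both sides, identical structures give 0 and structures with no overlap give 1, which matches the stated range of the index.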

Another measure of educational mismatch was built on the basis of Clark’s divergence coefficient (Clark 1952):

$$V_{2} = \left[ \frac{1}{k}\sum_{i = 1}^{k} \left( \frac{\alpha_{i} - \beta_{i}}{\alpha_{i} + \beta_{i}} \right)^{2} \right]^{\frac{1}{2}} \tag{2}$$

This measure “prefers” differences between components with relatively low shares. This means that it should be used in particular cases where both compared structures are characterised by a relatively even distribution of shares, or when the differences between the components have smaller values. In the case of educational mismatch, it can be particularly useful for comparing selected parts of the structure when they contain relatively small values. It has limited use for entire structures.
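
A minimal sketch of Eq. (2) is given below. The function name `v2` and the example values are hypothetical; components with both shares equal to 0 are dropped before the calculation, in line with the data preparation rule stated earlier, so $k$ here is the number of retained components.

```python
import math


def v2(alpha, beta):
    """Mismatch index based on Clark's divergence coefficient."""
    # Drop components where both shares are zero (the ratio is undefined there).
    pairs = [(a, b) for a, b in zip(alpha, beta) if a + b > 0]
    if not pairs:
        raise ValueError("no component with a non-zero share")
    total = sum(((a - b) / (a + b)) ** 2 for a, b in pairs)
    return math.sqrt(total / len(pairs))


# The same absolute gap of 0.01 weighs far more between low shares:
low = ((0.02 - 0.01) / (0.02 + 0.01)) ** 2   # roughly 0.11
high = ((0.51 - 0.50) / (0.51 + 0.50)) ** 2  # roughly 0.0001
print(low > high)
```

The two squared terms show why the measure reacts so strongly to components with low shares: each difference is scaled by the sum of the two shares it compares.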

The next mismatch measure is based on the Canberra distance (Lance and Williams 1966). The values of the original distance lie within the range $\langle 0,k\rangle$, so we used a modified version, divided by $k$, which takes values from the range $\langle 0,1\rangle$. This measure takes the following form:

$$V_{3} = \frac{1}{k}\sum_{i = 1}^{k} \frac{\left| \alpha_{i} - \beta_{i} \right|}{\alpha_{i} + \beta_{i}} \tag{3}$$

As in the case of Clark’s divergence coefficient, the Canberra distance is best used to compare structures with relatively small differences between their components. In addition, it is particularly sensitive to values close to 0.
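
Eq. (3) can be sketched in the same way. The function name `v3` and the example shares are hypothetical; as before, components with both shares equal to 0 are excluded, so $k$ is the number of retained components.

```python
def v3(alpha, beta):
    """Canberra distance divided by the number of components, so it lies in [0, 1]."""
    # Drop components where both shares are zero (the ratio is undefined there).
    pairs = [(a, b) for a, b in zip(alpha, beta) if a + b > 0]
    if not pairs:
        raise ValueError("no component with a non-zero share")
    return sum(abs(a - b) / (a + b) for a, b in pairs) / len(pairs)


# Hypothetical shares over three components.
print(v3([0.5, 0.3, 0.2], [0.2, 0.3, 0.5]))  # approximately 0.286
```

A component present on only one side of the market contributes the maximum value of 1 to the sum, however small its share, which is the sensitivity to values close to 0 noted above.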

The last mismatch index takes the form:

$$V_{4} = \left[ \frac{1}{2}\sum_{i = 1}^{k} \left| \alpha_{i}^{2} - \beta_{i}^{2} \right| \right]^{\frac{1}{2}} \tag{4}$$

A special feature of this measure is that it gives more weight to differences between the components with the highest shares. In our analysis, this limits its usefulness for occupational and job-related skills mismatches, because their components (shares of jobs and workers having a skill or occupation) take especially small values.
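
A minimal sketch of Eq. (4) follows. The function name `v4` and the example numbers are hypothetical and serve only to show how squaring the shares shifts the weight towards components with high shares.

```python
import math


def v4(alpha, beta):
    """Square root of half the sum of absolute differences of squared shares."""
    if len(alpha) != len(beta):
        raise ValueError("alpha and beta must have the same length")
    return math.sqrt(sum(abs(a * a - b * b) for a, b in zip(alpha, beta)) / 2)


# The same absolute gap of 0.01 weighs more between high shares:
high = abs(0.51 ** 2 - 0.50 ** 2)  # roughly 0.0101
low = abs(0.02 ** 2 - 0.01 ** 2)   # roughly 0.0003
print(high > low)
```

Because $\left|\alpha_i^2-\beta_i^2\right| = \left|\alpha_i-\beta_i\right|(\alpha_i+\beta_i)$, each difference is multiplied by the sum of the two shares, whereas ${V}_{2}$ and ${V}_{3}$ divide by that sum.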

5 Results

5.1 General

Table 2 summarises the results of educational mismatch indices across different educational traits. For indices ${V}_{1}$ to ${V}_{4}$, greater values mean greater mismatch, while in the case of correlation, greater values mean more similar structures, hence smaller mismatch.^{Footnote 6}

Table 2 Summary of educational mismatch measures in the labour market

What educational features result in structural mismatch between supply and demand in the Polish labour market? All measures consistently indicate that the greatest match occurs across NUTS-2 regions (voivodeships). Regional (spatial) matching thus seems to be greater than educational matching in Poland, at least at the level of voivodeships. The results also indicate a relatively high match between supply and demand for labour across fields of education, although here the mismatch measures differ from one another much more than in the case of spatial matching. Matching by transversal skills is weaker according to most of the mismatch indices. The largest mismatch occurs for job-related skills, followed by occupations. This was true for all mismatch indices with the exception of ${V}_{4}$. However, ${V}_{4}$ gives more weight to components with high shares, while the shares of job offers and potential job seekers with particular occupations or job-related skills were low. Therefore, this index was not considered appropriate for these educational features.

The observed educational mismatch decreases as the level of data aggregation rises. Job-related skills are very specific (at a low level of aggregation) and several thousand were identified among the respondents. Employers recognise specific skills at this level of detail. They expect future employees to be able to perform such specific tasks in addition to having transversal skills (general preparation for starting a job). For features measured on larger aggregates (i.e., fields of education and voivodeships), the mismatch was smaller. Since the most important skills for employers are often specialised skills, this can be an indication for educational policy to place further emphasis on narrower educational paths (teaching individual skills), not only on traditional long-term paths (four- to five-year studies). This may be particularly important for post-graduate education. In Poland, post-graduate studies usually last one to two years, while potential job seekers may have greater needs for shorter courses and training in particular skills. Shorter post-graduate education paths appear to increase the skill mobility of Poles due to the relatively short amount of time required as a student and the smaller financial burden when compared to traditional, longer education paths. Another reason for greater job-related skills mismatch was a lack of employer knowledge about the job-related skills of potential job seekers. Employers were only able to identify such skills to a limited extent. In itself, this can be a reason for skill mismatch. This information should be enhanced to improve educational matching. While companies frequently use skills terminology, job seekers are mainly able to use the terminology of fields of education and occupations. Skills recognition is scarce among job seekers.

5.2 Case study of ICT occupations

The purpose of this section is to show that the methodology we use can provide deep, insightful information on labour market mismatch. In the literature, the general matching/mismatch measures are called the “black box” (Petrongolo and Pissarides 2001). We do not know the reasons for mismatch, just its aggregate measure. By having the detailed characteristics of labour supply and demand, we can “open the black box” (Marinescu and Wolthoff 2020). This section shows one specific group of occupations in more detail.

We analysed the Information and Communications Technologies (ICT) occupations across NUTS-2 regions (voivodeships, Table 3). As much as 51.9% of the total supply of ICT professions came from the five regions with the largest share of the labour supply; the share of demand was 68.4%. There is a greater concentration of ICT labour demand than supply. The greatest difference between high demand and low supply was found in the Mazowieckie voivodeship. The Małopolskie voivodeship also ranked higher in labour demand than supply. In the Lubelskie, Kujawsko-pomorskie, and Podkarpackie voivodeships, supply share was significantly higher than the share of demand. The results indicate excess labour demand in well-developed regions and excess supply in poorly developed ones.

Table 3 The share of demand and supply for ICT occupations across NUTS-2 regions in Poland

The analysed job offer data show that Polish employers demanded employees from 140 out of 159 classified ICT occupations. Among these 140 occupations, 32.1% (n = 45) were identified on both the supply and demand sides, while 67.9% (n = 95) were identified only on the demand side. No occupations occurring only on the supply side were identified. The share of job offers for representatives of ICT occupations was 6.6% of all online job offers, while the percentage of people with a learned occupation in the field of ICT on the supply side was 4.4%.

In terms of major occupational groups (1-digit ISCO coding), the vast majority of occupational demand came from group 2 (“Professionals”, 48.6%) and group 3 (“Technicians and associate professionals”, 33.6%), with the least demand from group 4 (“Clerical support workers”, < 0.1%). The greatest share of job offers was for the occupational group with 3-digit ISCO code 251 (“Software and Applications Developers and Analysts”, 17.9%) while the lowest was from group 422 (“Client Information Workers”, 0.7%).

The ten ICT occupations with the highest number of job offers constituted 56.4% of the total number of job offers for ICT specialists, while the share of the ten most-learned occupations constituted about 78.2% of the total job offers (Fig. 1).

We found that most occupations with the highest demand (70% of occupations) belonged to group 25, “Information and Communications Technology Professionals”, meaning that IT professionals were the most demanded ICT occupations. The rest of the most-demanded ICT occupations came from groups 13 (“Production and specialised services managers”), 21 (“Science and engineering professionals”), and 52 (“Sales workers”). The supply side was more evenly distributed among different groups of occupations, with 80% of the most frequently learned ICT occupations coming from four different 2-digit occupation groups.

In order to find the most mismatched occupations, we calculated differences between the supply and demand shares of identified occupations. Table 4 shows occupations with the largest surplus of relative demand over relative supply (those for which the share of job offers was greater than the share of people having a particular learned occupation), and surplus of supply over demand. IT specialists and sales consultants had the highest surplus of relative demand over relative supply. The supply share was higher than the demand share for technicians (vocational education representatives) and communications technology specialists.
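
The ranking by difference in shares can be sketched as below. The occupation names and share values are hypothetical placeholders rather than figures from Table 4, and the function name `share_gaps` is invented for the example.

```python
def share_gaps(demand_share, supply_share):
    """Rank occupations by demand share minus supply share, largest surplus
    of relative demand first and largest surplus of relative supply last."""
    occupations = set(demand_share) | set(supply_share)
    gaps = {
        occ: demand_share.get(occ, 0.0) - supply_share.get(occ, 0.0)
        for occ in occupations
    }
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical shares of job offers and of people with a learned occupation.
demand = {"software developer": 0.30, "ICT technician": 0.05}
supply = {"software developer": 0.10, "ICT technician": 0.25,
          "telecom specialist": 0.05}

for occupation, gap in share_gaps(demand, supply):
    print(f"{occupation}: {gap:+.2f}")
```

Occupations missing on one side are treated as having a share of 0 there, so they still appear in the ranking.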

Table 4 The highest surplus of demand and supply shares for ICT occupations in Poland

An important aspect of analysing the demand for skills is to examine their distribution across occupations. In other words, it is important to understand which skills are most demanded for the occupations of interest. Such an analysis was possible as a result of the holistic approach we used in this study. For this purpose, we chose the five most demanded job-related skills appearing in job offers for the eight most-demanded occupations (Table 5). For example, the SQL (Structured Query Language) skill appeared in job offers for five out of the eight most demanded occupations; while Java, database management, customer relations, problem-solving, process design, project management, and customer service appeared in two out of eight occupations. The remaining skills were unique to particular occupations.
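
Counting in how many occupations a skill appears among the most demanded skills can be sketched as follows. The occupations and skill lists are hypothetical and do not reproduce Table 5; the function name `skills_shared_across` is invented for the example.

```python
from collections import Counter


def skills_shared_across(top_skills):
    """Count, for each skill, the number of occupations whose list of most
    demanded skills contains it."""
    counts = Counter()
    for skills in top_skills.values():
        counts.update(set(skills))  # set(): count a skill once per occupation
    return counts


# Hypothetical top skills per occupation.
top_skills = {
    "developer": ["SQL", "Java"],
    "analyst": ["SQL", "project management"],
    "administrator": ["SQL", "Java"],
}

for skill, n in skills_shared_across(top_skills).most_common():
    print(f"{skill}: {n} of {len(top_skills)} occupations")
```

Skills with a count of 1 are unique to a single occupation in this sketch, while higher counts point to skills that are shared across occupations.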

Table 5 The most demanded job-related skills across the most demanded ICT occupations in Poland

6 Conclusions

We proposed an approach for analysing online job offers in the context of demanded occupations, qualifications, and skills in the labour market, with the intent to provide full educational characteristics of unmet labour demand using online job vacancies. The presented procedure shows how to address selected problems of analysing many online job boards, and how to conduct regular vacancy market research based on internet sources. We applied this method to the Polish job market to study the labour market mismatch from the point of view of educational traits. Our approach is universal in the sense that it can be applied to any country and any labour market. This is confirmed by the fact that we did not assume any specific online job board or job offer structure prior to data collection. Contrary to most previous works, we showed a thorough procedure for online data collection, transformation, information extraction, and application. We also indicated some country specificity that should be considered when different languages are present in the data. We obtained these results through lexical lemmatisation, which helped us to increase the matching frequency between dictionaries and job offer texts. All our dictionaries contained Polish and English phrases. The treatment of language may have a large impact on the efficiency of the algorithms when they are applied to many countries. We argue that each language should be treated uniquely, to some extent, to obtain the highest quality of information extraction.

We connected educational terminology based on qualifications and labour market occupational terminology (used by official classifications) to the “market” terminology (used by entrepreneurs) based on direct measures of skills, and provided a cross-analysis of educational traits. Unlike most previous research, which used only proxies for measures of skills, this study used explicit measures.

Official classifications enabled us to better organise unstructured data collected online. For this, we used detailed ESCO and ISCED-F classifications, unlike Cedefop (2018; 2019a; 2019b), which used only ESCO. An alternative approach based on the O*Net database provides greater aggregations for skills. Instead, we focused on whole educational characteristics at low levels of aggregation.

Cedefop uses an expert method to choose websites (Cedefop 2019a), grading them according to ten specific criteria. We supplemented this method with an automatic one based on Google Trends, and used 25 websites that were chosen on this basis. Cedefop uses the 4-digit ESCO codes, while we used the 6-digit level of codes, since this format better suits our method and gives more accurate results. Cedefop (2018), whose results also include company sector and company size, analyses more traits in job offers than we did. We instead focused on providing full educational characteristics.

Our results showed that educational characteristics matter more for determining mismatch than spatial features. We found job-related skills to be more important in finding a labour market match than occupations or fields of education. It was also worth dividing the skills into job-related and transversal groups. Even though job-related skills defined the demand for ICT occupations to a larger extent than transversal skills, the latter were also important. Differences in transversal skills were the third-most important source of mismatch in the Polish labour market. This general conclusion confirms previous findings of the importance of these skills for financial occupations (Costantino and Rodzinka 2022) and for ICT occupations (Pater et al. 2022).

We did not aim to provide the best possible solution for matching, text analysis, and duplicate (non-unique entity) identification. These steps require dedicated approaches. Despite using a deterministic occupation identification method (matching the occupation classification to job offers), we still obtained a high degree of matching (93%). The algorithm could be further improved by applying newer techniques, for example those of Zhao et al. (2021) for duplicate detection and of Rybak et al. (2020) and Mroczkowski et al. (2021) for natural language processing, and by including a treatment of classification error. We aimed to provide a general approach to analyse online job offers, to extract educational information from them, and to apply the results to the economic problem of labour market mismatch. We also suggested some improvements in existing approaches to online data analysis. The next step of the analysis will be to further investigate workplace characteristics, namely: working hours, mode of work (remote, traditional or hybrid work), and contract types. This information will provide more detailed estimates of labour market demand. Job vacancy surveys mainly measure standard work contracts, while business-to-business contracts or contracts for specific work are becoming more popular and may greatly change the estimates of labour demand. Using detailed data from online job offers, we may be able to analyse the scale of those contracts.

Notes

1. https://www.conference-board.org/topics/help-wanted-online
2. www.wirtualnemedia.pl
3. We used this method only for scientific and non-commercial purposes. We collected only publicly available data. During the web scraping process, we ensured that the source of the data was unchanged. We show only aggregate results, without the possibility of identifying any single economic entity.
4. www.alexa.com and www.similarweb.com
5. The survey was conducted by the research panel of Ariadna (https://panelariadna.pl/), using the questionnaire we delivered. We chose Ariadna since it seemed to ensure the highest quality of research in Poland, based on the following criteria: having a PKJPA Polish certificate of the quality of CAWI surveys, the size of its Internet panel of respondents, the share of its registered users, and experience (in terms of time and the number of large surveys conducted). Ariadna also fulfilled the “28 Questions to Help Buyers of Online Samples” (ESOMAR 2015). The novelty of our approach is a method of continuous collection of information on labour demand (job offers). In order to calculate mismatch indices, we also needed the data on potential job seekers (CAWI survey). The CAWI survey was conducted at the beginning of the period during which we collected job offers. The set of information that we collected with the CAWI survey was unique. No continuously collected database contained such information. Because the sample and scope of information collected within it was vast, and the survey was very costly, it was impossible to repeat the survey. Hence, our data consists of a time series for labour demand and static (one-period data) information on job seekers. While this might be a limitation, we think that the characteristics of people do not change profoundly over two years. That is why we think that the comparisons we make represent the true mismatch between the two sides of the labour market.
6. Since we do not provide confidence intervals for mismatch measures, the results do not allow making comparisons across regions. However, such analysis is beyond the scope of this article.

References

Acemoglu, D., Autor, D., Hazell, J., & Restrepo, P.: AI and jobs: Evidence from online vacancies. National Bureau of Economic Research Working Paper No. 28257 (2020)
Askitas, N., Zimmermann, K.: The internet as a data source for advancement in social sciences. Int. J. Manpower (2015). https://doi.org/10.1108/IJM-02-2015-0029
Barnichon, R.: Building a composite help-wanted index. Econom. Lett. (2010). https://doi.org/10.1016/j.econlet.2010.08.029
Beręsewicz, M., Pater, R.: Inferring job vacancies from online job advertisements. Publ. Office European Union, Luxembourg (2020). https://doi.org/10.2785/963837
Blair, P.Q., Deming, D.J.: Structural increases in demand for skill after the great recession. AEA Papers Proc. 110, 362–365 (2020)
Cedefop: Mapping the landscape of online job vacancies. Background country report. https://www.cedefop.europa.eu/files/rlmi_-_mapping_online_vacancies_poland.pdf (2018). Accessed 05 September 2020
Cedefop: The online job vacancy market in the EU. Driving forces and emerging trends. Cedefop research paper (2019a). https://doi.org/10.2801/16675
Cedefop: Online job vacancies and skills analysis: a Cedefop pan-European approach. Cedefop Research Paper (2019b). https://doi.org/10.2801/097022
Choi, H., Varian, H.: Predicting the present with Google trends. Econom. Rec. (2012). https://doi.org/10.1111/j.1475-4932.2012.00809.x
Clark, P.J.: An extension of the coefficient of divergence for use with multiple characters. Copeia (1952). https://doi.org/10.2307/1438532
Colombo, E., Mercorio, F., Mezzanzanica, M.: Applying machine learning tools on web vacancies for labour market and skill analysis. Paper presented at the Technology Policy Institute Conference on The Economics and Policy Implications of AI, Washington, United States. https://techpolicyinstitute.org/wp-content/uploads/2018/02/Colombo_paper.pdf (2018). Accessed 05 February 2021
Costantino, L., Rodzinka, J.: The role of soft skills in employability in the financial industry. Financ. Internet Quart. 18(1), 44–55 (2022)
Deming, D.: The growing importance of social skills on the labour market. Quart. J. Econom. (2017). https://doi.org/10.1093/qje/qjx022
Deming, D., Kahn, L.: Skill requirements across firms and labour markets: evidence from job postings for professionals. J. Labor. Econom. (2018). https://doi.org/10.1086/694106
Deming, D.J., Noray, K.: Earnings dynamics, changing job skills, and STEM careers. Quart. J. Econom. 135(4), 1965–2005 (2020)
ESOMAR: 28 Questions to help buyers of online samples. https://www.esomar.org/uploads/public/knowledge-and-standards/documents/ESOMAR-28-Questions-to-Help-Buyers-of-Online-Samples-September-2012.pdf (2015). Accessed 20 March 2023
European Commission: European skills, competences, qualifications and occupations. https://ec.europa.eu/esco/portal/ (2020). Accessed 20 November 2020
Forsythe, E., Kahn, L.B., Lange, F., Wiczer, D.: Labor demand in the time of COVID-19: evidence from vacancy postings and UI claims. J. Public Econom. 189, 104238 (2020)
Hershbein, B., Kahn, L.: Do recessions accelerate routine-biased technological change? Evidence from vacancy postings. Am. Econom. Rev. (2018). https://doi.org/10.1257/aer.20161570
International Labour Office: International standard classification of occupations 2008 (ISCO-08): Structure, group definitions and correspondence tables. ILO, Geneva, Switzerland. https://www.ilo.org/wcmsp5/groups/public/---dgreports/---dcomm/---publ/documents/publication/wcms_172572.pdf (2012). Accessed 05 November 2020
Jackman, R., Roper, S.: Structural unemployment. Oxford Bull Econom Statist. (1987). https://doi.org/10.1111/j.1468-0084.1987.mp49001002.x
Kudlyak, M., Tasci, M., & Tüzemen, D.: Minimum Wage Increases and Vacancies. IZA Discussion Paper No. 15254 (2022)
Lance, G., Williams, W.: Computer programs for hierarchical polythetic classification (“similarity analysis”). Comput. J. (1966). https://doi.org/10.1093/comjnl/9.1.60
Lovaglio, P.G., Mezzanzanica, M., Colombo, E.: Comparing time series characteristics of official and web job vacancy data. Qual. Quantity (2020). https://doi.org/10.1007/s11135-019-00940-3
Marinescu, I., Wolthoff, R.: Opening the black box of the matching function: the power of words. J. Labor Econom. 38(2), 535–568 (2020)
Modestino, A.S., Shoag, D., Ballance, J.: Upskilling: do employers demand greater skill when workers are plentiful? Rev. Econom. Statist. 102(4), 793–805 (2022)
Mroczkowski, R., Rybak, P., Wróblewska, A., & Gawlik, I.: HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pp. 1–10, Association for Computational Linguistics (2021).
OECD: Towards an OECD skills strategy. OECD Publishing. http://www.oecd.org/education/47769000.pdf (2011). Accessed 20 January 2021
Pater, R., Cherniaiev, H., Kozak, M.: A dream job? Skill demand and skill mismatch in ICT. J. Ed. Work 32(6–7), 641–665 (2022). https://doi.org/10.1080/13639080.2022.2128187
Pater, R., Szkoła, J., Kozak, M.: A method for measuring detailed demand for workers’ competences. Economics: The Open-Access, Open-Assessment E-Journal (2019). https://doi.org/10.5018/economics-ejournal.ja.2019-27
Petrongolo, B., Pissarides, C.A.: Looking into the black box: a survey of the matching function. J. Econom. Literat. 39(2), 390–431 (2001)
Rybak, P., Mroczkowski, R., Tracz, J., & Gawlik, I.: KLEJ: Comprehensive Benchmark for Polish Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1191–1201, Association for Computational Linguistics (2020)
Schioppa, F.P.: Mismatch and labour mobility. Cambridge University Press, Cambridge (1991). https://doi.org/10.1017/CBO9780511599316
UNESCO: ISCED Fields of education and training 2013 (ISCED-F 2013). UNESCO Institute for Statistics, Montreal, Canada (2013). https://doi.org/10.15220/978-92-9189-150-4-en
Wall, V., Fale, A.: Job vacancy monitoring in New Zealand and jobs online. Paper presented at the Joint Annual NZAE & LEANZ conference, Auckland, New Zealand. https://www.nzae.org.nz/wp-content/uploads/2011/08/Wall_and_Fale__Job_Vacancy_Monitoring_in_NZ_and_Jobs_Online.pdf (2010). Accessed 07 February 2021
Zhao, Y., Chen, H., & Mason, C. M.: A Framework for Duplicate Detection from Online Job Postings. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 249–256, Association for Computing Machinery, Essendon (2021). https://doi.org/10.1145/3486622.3493928

Funding

The research was funded by the Polish Ministry of Science and Higher Education within the Programme DIALOG, Grant No. DIALOG 0127/2016 “Horizontal educational mismatch: a new method of measurement with application to Poland.”

Author information

Authors and Affiliations

Department of Statistics, Poznań University of Economics and Business, Poznań, Poland
Maciej Beręsewicz
Department of Economics and Finance, University of Information Technology and Management in Rzeszów, Rzeszów, Poland
Herman Cherniaiev, Andrzej Mantaj & Robert Pater
Educational Research Institute, Warsaw, Poland
Robert Pater
Statistical Office in Poznań, Poznań, Poland
Maciej Beręsewicz

Corresponding author

Correspondence to Robert Pater.

Ethics declarations

Conflict of interest

The authors have no other competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: The procedure of data collection

The list of websites from which we collected information is provided in Table 6.

Table 6 List of Polish job boards

Appendix 2: Representativeness

The aim of this section is to analyse the representativeness of websites with job offers, that is, to determine to what extent online data differ from the data collected by Statistics Poland. In Poland, the main source of information on labour demand is the Labour Demand Survey, a randomised quarterly survey. It provides information on realised and unrealised demand, employed persons, vacancies, and newly created jobs. The survey population comprises entities of the national economy with one or more employees, across all types of economic activity. For the purposes of the Labour Demand Survey, job vacancies are defined as jobs that have become vacant as a result of the movement of employed persons, or newly created jobs, that meet three conditions simultaneously:

Positions on the reporting date are vacant;
The employer is trying to find people willing to take up the position;
If a suitable person is found, the employer is ready to accept them immediately.

In the case of online data sources, we rely on the definitions used in the regulations of the websites. For example, the OLX portal uses the following definition: “an advertisement prepared by the User regarding a sale (invitation to conclude a sales contract), exchange, work, offered services, etc., posted on the Website, under the conditions provided in the Regulations.”

Using the method we developed (see Sect. 2), we classified 93% of all collected advertisements by occupation or occupational group, or flagged them as offers of an internship or apprenticeship. An occupational group rather than an occupation was assigned where the employer did not provide sufficient information in the job offer to identify the specific occupation (i.e., at the 6-digit level of the ESCO classification). Occupation codes were attached to the data set, whose records therefore differ in their level of detail (Table 7).

Table 7 The level of detail of the classified occupations

72% of offers were classified to the 6th digit, while only 2% were classified at the one-digit (most general) level of main occupation.
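
A breakdown of this kind can be tabulated from the occupation codes themselves. The following is a minimal sketch, assuming each classified offer carries its code as a digit string whose length equals the level of detail; the function name and the sample codes are hypothetical and do not come from the study's data.

```python
from collections import Counter


def detail_level_shares(codes):
    """Return the percentage of codes at each level of detail (number of digits)."""
    if not codes:
        return {}
    counts = Counter(len(code) for code in codes)
    total = len(codes)
    return {level: 100.0 * n / total for level, n in sorted(counts.items())}


# Hypothetical occupation codes, for illustration only.
sample_codes = ["251401", "2514", "2", "522301", "52"]
shares = detail_level_shares(sample_codes)
```

Keeping codes as strings rather than integers preserves their length, which is what carries the level of detail in this sketch.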

This study excludes job offers for internships and apprenticeships, as well as offers for farmers, gardeners, foresters, and fishermen, which are represented to a smaller extent in the Labour Demand Survey. The following measures of representativeness were used:

Absolute difference
$$RB = \hat{\theta }_{net,k} - \hat{\theta }_{GUS,k} ,$$
(5)
Absolute relative difference
$$ARW = \frac{\left| \hat{\theta }_{net,k} - \hat{\theta }_{GUS,k} \right|}{\hat{\theta }_{GUS,k}} ,$$
(6)

where $\hat{\theta }_{GUS,k}$ is the estimated percentage of vacancies in occupational group $k \in \{1,2,3,4,5,7,8,9\}$ according to the Labour Demand Survey, and $\hat{\theta }_{net,k}$ is the estimated percentage of vacancies in occupational group $k$ based on online data. RB values greater than 0 indicate that internet portals over-represent job offers in a given group, and values less than 0 indicate under-representation. The difference is expressed in percentage points.
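
The two measures in Eqs. 5 and 6 can be computed directly from the two percentage distributions. The following is a minimal sketch, assuming the shares are given in percent per occupational group and that the absolute relative difference is the absolute value of the difference divided by the survey share; the function name and the sample figures are hypothetical and are not results from the study.

```python
import math


def representativeness(net_share, gus_share):
    """Return {group: (RB, ARW)} for every group present in gus_share.

    RB  = net - gus          (signed, in percentage points)
    ARW = |net - gus| / gus  (assumed reading of Eq. 6)
    """
    result = {}
    for group, gus in gus_share.items():
        net = net_share[group]
        rb = net - gus
        # ARW is undefined when the survey share is zero.
        arw = abs(rb) / gus if gus != 0 else math.nan
        result[group] = (rb, arw)
    return result


# Hypothetical percentage shares for two occupational groups.
net = {2: 40.0, 9: 5.0}
gus = {2: 25.0, 9: 10.0}
measures = representativeness(net, gus)
```

Because ARW is scaled by the survey share, a gap of a few percentage points weighs more heavily for a small occupational group than for a large one.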

The analysis focuses on the comparison of percentages, because the online data may contain repeated or outdated offers, which would inflate the number of job offers. Because some specific job offers recur particularly often, this may also affect the percentage structure, but to a lesser extent.
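
The reasoning for comparing percentages rather than counts can be illustrated with a toy example. The following is a sketch only, assuming each offer is an (offer_id, occupational_group) pair and that duplicates share an identical ID, which is a simplification of real duplicate detection; all names and data are hypothetical.

```python
from collections import Counter


def deduplicate(offers):
    """Keep the first occurrence of each offer ID."""
    seen = set()
    unique = []
    for offer_id, group in offers:
        if offer_id not in seen:
            seen.add(offer_id)
            unique.append((offer_id, group))
    return unique


def group_shares(offers):
    """Percentage of offers per occupational group."""
    counts = Counter(group for _, group in offers)
    total = sum(counts.values())
    return {group: 100.0 * n / total for group, n in counts.items()}


# Hypothetical scraped offers: some IDs appear twice.
raw = [("a", 2), ("a", 2), ("b", 2), ("b", 2), ("c", 2),
       ("d", 9), ("d", 9), ("e", 9)]
unique = deduplicate(raw)

raw_shares = group_shares(raw)
unique_shares = group_shares(unique)
```

In this toy data the raw count exceeds the deduplicated count by 60% (8 versus 5 offers), while the share of group 2 moves by only 2.5 percentage points (62.5% versus 60%).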

The comparison of the percentages shows that the greatest differences exist in the cases of specialists and of machine and device operators (Fig. 3). The absolute difference distribution is shown in Figs. 4 and 5. The absolute relative difference is shown in Fig. 6.

This section aimed to examine the representativeness of online data sources in terms of job vacancies, using the Labour Demand Survey as a reference source. The most important conclusions are as follows:

The greatest differences between the online data and the Labour Demand Survey were observed for specialists, industrial workers, and craftsmen, as well as operators and assemblers of machines and devices.
Internet portals systematically overrepresent the number of specialists and underrepresent the demand for workers performing simple tasks.
In the case of office workers, the distribution of job offers did not differ between the Internet and the Statistics Poland survey.
The analysis of individual online data sources showed that the websites differing least from the Labour Demand Survey were Gazetapraca, Gumtree, and Infopraca. These portals carry job offers for a wide range of occupations and do not specialise in any particular one.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Beręsewicz, M., Cherniaiev, H., Mantaj, A. et al. Text analysis of job offers for mismatch of educational characteristics to labour market demands. Qual Quant 58, 1799–1825 (2024). https://doi.org/10.1007/s11135-023-01707-7

Accepted: 24 June 2023
Published: 21 July 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s11135-023-01707-7
