1 Introduction

There is an increasing need for a detailed analysis of companies’ demand for workers; that is, for occupations, skills (competencies) and qualifications (Deming 2017; Deming and Kahn 2018; Hershbein and Kahn 2018). Current surveys conducted by statistical offices do not contain information on company demands for future workers’ skills or qualifications. One may consider online job offers as supporting or alternative data to job market surveys. However, these data sources are unstructured, so relevant information should be extracted from this data to conduct quality assessments, to determine representativeness of the data, and to develop an estimation process.

The purpose of this research is to provide complete educational characteristics of labour demand with job vacancies. Educational characteristics include occupations, qualifications, and skills. Skills are divided into transversal and job-related categories. Initially, extracting occupational data is especially important because official statistics use the International Standard Classification of Occupations (ISCO, International Labour Office 2012). Educational terminology is based on qualifications grouped by fields of study. To compare labour demand with educational characteristics, this standardised qualification terminology must be used. Finally, development of the global economy brings greater attention to skills, as people’s knowledge and skills become more specialised. For this reason, extraction of job skill data is extremely important for research.

We use online job offers as rich sources of data on labour demand. Although these data are comprehensive, online job databases often contain extraneous and unstructured information. To better structure this information, we use labour market and educational terminology with official classifications.

Our main contribution is the proposal of an approach for analysing online job offers in the context of their detailed structural information. We use information from job offers to extract educational characteristics of job openings. This information is then used to calculate educational mismatches between labour supply characteristics and labour demand as observed in job offers. We show that such detailed information may be used to provide estimates of skill (competence), qualification, occupation, and regional mismatch between labour supply and demand. We provide a robust procedure that can be applied to different websites with minimum comparability. The procedure consists of choosing online job portals, collecting data from them, extracting relevant information, and then performing representativeness corrections based on this information. We present some challenges related to this research and propose solutions to them. Through the use of large dictionaries, we use detailed classifications and explicit measures of skills instead of proxies to more accurately describe labour demand. We then apply this method to the Polish job market and compare the results with ongoing representative surveys.

These results are especially important for economists, the educational sector, and labour market institutions, such as the Organisation for Economic Co-operation and Development Skills Strategy (OECD 2011). Such detailed information may be used to adjust labour market and educational policy, especially policies directed toward reducing structural unemployment.

The article is organised as follows. Section 2 reviews the related literature on collecting data from online job websites and using them to analyse labour market matching. Section 3 describes our data sources, collection, and preparation. Section 4 shows the measures of mismatch that we use. Section 5 presents our results. Section 6 discusses the results and provides our conclusions. Appendix 1 presents the data collection algorithm. Appendix 2 contains the representativeness analysis.

2 Related literature

Our approach relates to previously conducted online job offer research. This method is based on collecting online job offers and analysing them using data mining techniques. Most previous research of this type was devoted to a specific topic, was not conducted periodically, or was based on a single job website. Although such research may prove valuable, our method may be compared to others that draw on data from many online job sites on a regular basis. One such study was conducted by the Australian Bureau of Statistics and New Zealand’s Department of Labour (Wall and Fale 2010). This study analysed online job vacancies from selected websites according to occupation, industry, and region. The educational characteristics of offered jobs were not emphasised, as the index was mainly used to track time series changes of vacancies by occupation, industry, and region.

The longest history of analysing job offers has been maintained by the US Conference Board, currently publishing the Help Wanted OnLine index (see e.g., Barnichon 2010). The index is disaggregated to local labour markets. A more detailed analysis of skills has been performed in cooperation with Burning Glass Technologies. A relatively large number of articles has been written based on these data in recent years. However, they contain little information about the procedure of collecting, cleaning, and classifying data, as their data processing methods are largely unknown. Hershbein and Kahn (2018) show that the representativeness of Burning Glass Technologies data is stable over time across groups of occupations. Acemoglu et al. (2020), Blair and Deming (2020), Deming and Noray (2020), Forsythe et al. (2020), Modestino et al. (2022), and Kudlyak et al. (2022) study the requirements of employers in these job offers and their changes. Among the variables they extract are: wages, level of education, industry, work experience, occupational groups, and skills across local labour markets.

Cedefop (2019a; b) and Colombo et al. (2018) conduct the largest study in Europe, collecting online job offers from all of the European Union member states. While the dataset is large, Cedefop pays limited attention to the composition of websites from which the information is collected. The approach is rather focused on obtaining large numbers of job offers. This leads to a potential overuse of job portals that aggregate job offers from other sources. Such websites may not last long or may change their sources frequently, which leads to instability of results. Also, a problem of language treatment arises, wherein different languages may require dedicated approaches. Finally, the data collection and treatment of Cedefop makes it difficult to analyse and correct various biases of online data, which leads to data representativeness problems (Beręsewicz and Pater 2020).

Lovaglio et al. (2020) used Cedefop data to compare the characteristics of online job offers and survey-based data on vacancies. They show selected time series properties of online job offers. We use online job offer data to show the usefulness of their structural properties at disaggregated levels. In our research, we aim to provide the most thorough educational characteristics of vacancies, while minimising other potential information obtained from online job offers.

3 Data

3.1 Sources

As there are many websites with job offers on the Internet, estimating the number of websites with job offers in all of Poland with reasonable accuracy would be an unrealistic task. Websites with job offers are also heterogeneous; they can be divided into types. In particular, these types include:

  • Country-specific specialised websites.

  • Locally specialised websites, most often encompassing a city, a community (NUTS-5 region according to the European Union Nomenclature of Territorial Units for Statistics), or a voivodeship (NUTS-2 region).

  • Specialised websites with job offers, such as financial occupations or information technology (IT) jobs.

  • Websites of Local Labour Offices (LLOs) operating in the NUTS-4 region, and official Public Information Bulletins (PIBs) containing most of the job offers in the public sector.

  • Employer websites.

  • Internet forums and social media groups (for example, Facebook groups).

  • Websites that aggregate job offers from other portals.

Information obtained from the artefakt.pl website indicates that 97.3% of internet users in Poland use the Google search engine. We used Google Trends to find the most popular internet websites with job offers. Google Trends is an index of the volume of Google queries by geographic location and category (Choi and Varian 2012). This technique was suggested by Askitas and Zimmermann (2015) for social sciences. Most often, after entering a search for local job offers, Google search shows links to countrywide websites. This is connected to the use of positioning techniques by website administrators. Countrywide websites often contain paid advertisements of job offers.

Job offers from local websites often contain a short description with more detailed job descriptions (e.g., employee requirements, skills, detailed qualifications) shown less often than for national websites. Also, employers using such sites know less about the professional terminology of the labour market and related education. Such offers may contain grammatical errors and may be less structured. They more often contain job offers for people without higher education. Local websites are more popular in smaller or medium-sized cities than in major cities. They often contain various local advertisements in addition to job offers. Some sites also assume the role of local information portals that additionally enable job posting.

A significant share of branch websites with job offers allow posting of job offers only after completing a registration process. These websites often allow free advertising. Fees are often charged for placing promoted job offers. To a large extent, they also contain branch articles, guides, and news about a selected branch.

Queries regarding the pages of Local Labour Offices (LLOs) and Public Information Bulletins (PIBs) constituted a relatively large part of all employee searches. However, as a source of information about job offers, they are less than half as popular as websites with national coverage. LLO and PIB websites are more popular in smaller and medium-sized areas than in large cities.

Job offers posted in Facebook groups are a separate category of information sources for employment seekers. These groups can be public or private. Access to public groups is transparent, and interaction with people posting job advertisements is permitted after joining the group. Private groups have limited access to job offers, as a person must be admitted to the group to view them. Groups on Facebook enable much greater interaction between job posters and job seekers. They usually have a regional specification, such as work in Kraków, or a specialisation, such as jobs for computer graphic designers.

Internet job seekers less frequently query the websites of a potential employer or specialised websites, especially in smaller cities. Specialty websites, local websites, and social networking sites occasionally also include jobs. Job descriptions here are unclear as to whether they relate to any formal contract. Websites that aggregate job offers from other websites contain the most job offers. However, they do not monitor which offers are obsolete or are still valid. The information on these websites is organised in various ways, since they refer to several differently-structured websites. Moreover, collecting data automatically from job aggregators is difficult because such websites usually block web crawlers. Two possible solutions to this limitation are to limit the number of data requests or to use proxy servers for web scraping.

We classified websites according to the frequency of search queries and the average number of job offers they contained. We also used data from a media tracking website (Wirtualne Media) on the most popular websites according to registered users, coverage, and page views. Finally, we chose twenty-five additional websites (see Table 6 in Appendix 2). These included mostly national websites, but also a few sites that aggregate job offers and websites that contain local subsites (e.g., the OLX portal). These websites cover both national and local job offers, bigger cities and small towns, and ensure a sufficient quantity of job offers. We excluded other websites for reasons including insufficient coverage of job offers containing the necessary detailed information (such as required skills), or information that was not readily extractable (e.g., Facebook groups).

3.2 Data collection procedure

Collecting online data from multiple sources manually is extremely impractical, especially when it must be repeated cyclically. To aggregate job offers automatically, we developed web scraping applications (data collection tools). The biggest advantage of web scraping is that once the application is ready, it can be used multiple times without much additional interaction. However, the application needs to be designed for each website separately, taking into consideration the website’s features such as its structure, type (dynamic or static), the technologies used, limitations, etc. Moreover, even a small change in a website may cause a critical error in the scraping tool, so monitoring of applications is also an important part of the data collection process.

Since the job offers were written mostly in Polish and English, we limited the analysis to these languages. At the initial stage of website analysis, a problem with the encoding of non-ASCII characters (including Polish letters) on Windows operating systems was identified. The issue was resolved by saving all information obtained from the various portals in the 8-bit Unicode Transformation Format (UTF-8), which is the most common type of text encoding, thus ensuring consistency of data formatting.

Before the programming platform was created, a thorough analysis of websites with job offers was carried out in order to recognise the structures and features of such websites. Some websites load page content dynamically during scrolling, while others require the user to be signed into a website-specific account to view job offers; specific software was required to address these situations. All of these features have a significant impact on the application design process and on the application's reliable operation in the future. We noted that websites are characterised by:

  • Unique user interface (UI) design.

  • The necessity of authorization (user account creation and login).

  • Different paging mechanisms (e.g., selected websites provide a range of subpages, while other websites read subpages dynamically).

  • Various naming conventions for regions, although every website used NUTS-2 classification.

  • Various interaction models (e.g., selection of information after completing the form, selection of data based on shared selection lists, and dynamic loading of data using buttons and text fields).

  • Limited website availability at times.

  • Data inconsistency (e.g., non-existent links, expired job offers, non-existent web pages), or incorrectly listed voivodeships of presented job offers.

  • Restrictions on website traffic from the same IP address; if a page is downloaded or refreshed too often, the visitor is redirected to an authorisation page that uses a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) mechanism before the page can be reloaded.

  • Encrypted HTTPS traffic (Hypertext Transfer Protocol Secure) required for selected portals.

  • Encoding of links and job titles in the URL (Uniform Resource Locator) standard, forcing a conversion to the UTF-8 standard.

  • Technical issues (e.g., HyperText Markup Language, HTML, syntax errors, skipped tags, incorrect tag parameters, CSS file and/or JavaScript errors).
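The URL-encoding issue in the list above can be illustrated with a short sketch. Python is used here purely for illustration (the study's application was written in Java), and the sample link and the function name `decode_link` are hypothetical.

```python
from urllib.parse import quote, unquote


def decode_link(encoded: str) -> str:
    """Turn a percent-encoded URL fragment back into readable UTF-8 text."""
    return unquote(encoded, encoding="utf-8")


# Hypothetical job-offer link whose title contains Polish letters.
encoded = "praca/ksi%C4%99gowy-%C5%82%C3%B3d%C5%BA"
print(decode_link(encoded))

# Encoding and decoding are inverse operations for such titles.
assert decode_link(quote("księgowy")) == "księgowy"
```

Decoding links at collection time means that the stored titles and the offer text share one encoding, which simplifies later dictionary matching.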

Based on the observed limitations, specific requirements, and available technologies, we decided to develop an automated web crawler that resembles human activity. To meet the requirements for the processing of website data using different technologies while taking into account the limitations of each site, and to ensure interaction with each of the services, after the initial testing period, the target system architecture was determined as follows:

  • The application shall be implemented using a high-level programming language such as Python, Java, or R; Java was selected due to the ease of migration between Windows, Linux, and MacOS systems.

  • The application shall use selected frameworks providing access to Web API (Application Programming Interface); for this purpose, Chrome Devkit was selected.

  • The application shall use a framework that provides access to the DevKit data structure and ensures two-way communication; Selenium WebDriver was selected for this purpose.

  • Behaviour scenarios should be developed for each website to ensure appropriate interaction with the selected website and to save data in a uniform format.

The developed system works automatically, based on behavioural patterns defined for each of the processed portals. Each pattern contains the following information:

  • The home page of the portal;

  • The voivodeship naming scheme;

  • Interaction with the website to select job offers from the indicated region;

  • A scheme for detecting links to pages with links to full offers on individual subpages; we used regular pattern matching for this purpose;

  • A scheme for downloading the full content of the job offer for each link collected previously.
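A behavioural pattern of this kind can be thought of as a small record per portal. The sketch below is a simplified illustration in Python rather than the Java implementation used in the study; the class `PortalPattern`, its field names, the example address, and the link expression are all assumptions, and the interaction and download schemes are omitted.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PortalPattern:
    """Hypothetical container for one portal's behavioural pattern."""
    home_page: str
    voivodeship_names: Dict[str, str]  # canonical name -> label used by the portal
    link_regex: str                    # expression that detects links to full offers

    def find_offer_links(self, html: str) -> List[str]:
        """Return the unique offer links found in a listing page, in page order."""
        return list(dict.fromkeys(re.findall(self.link_regex, html)))


# Illustrative values only; no real portal is described here.
pattern = PortalPattern(
    home_page="https://jobs.example.org",
    voivodeship_names={"małopolskie": "malopolskie"},
    link_regex=r'href="(/oferta/\d+)"',
)
listing = '<a href="/oferta/101">A</a><a href="/oferta/102">B</a><a href="/oferta/101">A</a>'
print(pattern.find_offer_links(listing))
```

Keeping the portal-specific details in data rather than in code is one way to let a single collection loop serve many differently structured websites.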

The data collection algorithm is run on a weekly basis. This high frequency of data collection allows us to obtain job offers with a short life cycle (job offers that expire soon after publishing). The algorithm is also designed to provide data from portals that for various reasons are not always available, or from which data are difficult to download. The steps of the algorithm for downloading data from one portal are described below (see also Fig. 2 in Appendix 1).

The first step of data collection is to ensure that the website responds (step 1 in Fig. 2). A website may be unavailable for various reasons, in which case its pages do not load. A one-day delay is then advised (step 2), after which the website is checked again (step 3). If the website is still unavailable, revision of the website is required (step 4); further data collection is not possible, so the cycle ends. There are multiple reasons why a website may not respond: it may have been closed by its owners, or our IP address may have been blocked because of a large number of requests per unit of time (usually per minute or hour).
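The availability check in steps 1 to 4 amounts to a short decision rule. The following Python sketch is illustrative only; the function name and the two injected callables (`is_up`, `wait_one_day`) are assumptions that stand in for a real HTTP probe and a real scheduler.

```python
from typing import Callable


def availability_decision(is_up: Callable[[], bool],
                          wait_one_day: Callable[[], None]) -> str:
    """Probe the site, wait a day and probe again, otherwise flag it for revision."""
    if is_up():          # step 1: does the website respond?
        return "collect"
    wait_one_day()       # step 2: one-day delay
    if is_up():          # step 3: check again the next day
        return "collect"
    return "revise"      # step 4: manual revision needed, the cycle ends


# Usage with stand-in callables: the site fails once, then responds.
answers = iter([False, True])
print(availability_decision(lambda: next(answers), lambda: None))
```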

If the website responds, the data collection process begins. In steps 5.1–5.3 we collect all links of job offers for each region. The steps of automated link collection are as follows:

  1. Load the main portal page.

  2. Load the page interaction pattern and then download information to interactively select the voivodeship.

  3. Determine the number of website subpages on the basis of web page content analysis, using the developed HTML parser and processing algorithm.
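One way to determine the number of subpages is to read the pagination links of a listing page. The sketch below uses Python's standard HTML parser and assumes, hypothetically, that subpages are addressed by a `page` query parameter; real portals differ, and the parser developed for the study is not reproduced here.

```python
import re
from html.parser import HTMLParser


class PageCounter(HTMLParser):
    """Track the highest page number that appears in pagination links."""

    def __init__(self, page_param: str = "page"):
        super().__init__()
        self._pattern = re.compile(rf"[?&]{re.escape(page_param)}=(\d+)")
        self.max_page = 1  # a listing without pagination links has one page

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = self._pattern.search(href)
        if match:
            self.max_page = max(self.max_page, int(match.group(1)))


def count_subpages(html: str) -> int:
    """Return the number of subpages implied by the pagination links."""
    counter = PageCounter()
    counter.feed(html)
    return counter.max_page


listing = '<a href="/oferty?page=2">2</a> <a href="/oferty?page=7">last</a>'
print(count_subpages(listing))
```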

When all links have been collected, we need to ensure that the process has finished successfully (step 6). Even small changes in website structure may cause an error. If an error occurred, the application needs to be checked and fixed (step 7). This step requires user interaction. Usually, such changes are small and do not require a major redesign of the data collection tool; however, some changes may be significant (e.g., changing from a static to a dynamic website), so such changes may require a major redesign of application architecture. After the application is fixed (step 8), steps 5 and 6 are repeated.

In the next step (step 9), for each link collected in step 5, the algorithm downloads full information about job offers. We do not scrape multiple job offers from a single data source simultaneously, as this would overload the website and may result in IP blocking. Multiple applications may download data from many sources in parallel, but as with multiple job offers, the applications must not poll the websites too often, or else they run the risk of having access from the requesting IP address blocked, or of having queries to the site detected as a distributed denial-of-service (DDoS) attack. Information about daily visits to a website can be obtained with the Alexa and Similar Web tools in order to estimate the number of requests permitted, so as not to overload a website’s hardware. The solution to this problem is to use delays between repeated website visits. A mechanism to delay sending requests to websites has been introduced to the application, with the ability to select a separate delay time for each website. Implementing the application in Java and using additional libraries available for various hardware platforms allows the system to be run on Windows and Linux-based systems.
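A per-website delay mechanism of the kind described can be sketched as follows. This is an illustrative Python version, not the study's Java code; the class name, the delay values, and the site names are assumptions, and the clock and sleep functions are injectable so that the logic can be exercised without real waiting.

```python
import time
from typing import Callable, Dict


class PoliteScheduler:
    """Enforce a minimum gap between consecutive requests to the same site."""

    def __init__(self, delays_s: Dict[str, float], default_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.delays_s = delays_s      # separate delay time for each website
        self.default_s = default_s    # used for sites without an explicit delay
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}

    def wait_turn(self, site: str) -> float:
        """Sleep just long enough to respect the site's delay; return seconds slept."""
        delay = self.delays_s.get(site, self.default_s)
        now = self._clock()
        last = self._last.get(site)
        pause = 0.0 if last is None else max(0.0, last + delay - now)
        if pause > 0:
            self._sleep(pause)
        self._last[site] = now + pause
        return pause


# Hypothetical configuration: five seconds between requests to one portal.
scheduler = PoliteScheduler({"jobs.example.org": 5.0})
scheduler.wait_turn("jobs.example.org")  # first request proceeds immediately
```

Because the last request time is tracked per site, applications collecting from different portals do not slow each other down, while repeated requests to one portal stay spaced apart.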

The process of data collection for each link is as follows:

  1. Check whether a job offer was already downloaded in the previous cycle (step 10). Since the cycle repeats every week, some links which have been already downloaded may be retrieved with the new links. As it is pointless to download these links again, such job offers must be ignored while data collection continues to another link.

  2. If a job offer was not previously downloaded, the application downloads it (step 11).

  3. The application must verify whether the full data of a job offer was collected successfully (step 12). If so, it moves to the next job offer; otherwise, the job offer needs to be downloaded again.

  4. In the case of download failure, the application checks whether it is the first attempt to redownload the job offer’s data (step 13). There is a defined maximum of two attempts to redownload a link. In some cases, an error may occur and the data downloading process stops. This may be due to a job offer unavailable for downloading (access closed by owners) or a broken/invalid link. In such cases, a job offer will be unavailable even after repeated attempts. In step 14, the application tries to download the data once again after a small delay (up to 2 min). If the downloading error still occurs, this job offer link is dropped (step 15), and the application continues to download the next link.
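The per-link procedure above can be summarised in a short loop. The Python sketch below is a simplified reading of steps 10 to 15 rather than the authors' implementation; `fetch` and `pause` are hypothetical callables standing in for the real download and delay routines, and a failed download is represented by `None`.

```python
from typing import Callable, Dict, Iterable, Optional, Set


def download_offers(links: Iterable[str],
                    already_downloaded: Set[str],
                    fetch: Callable[[str], Optional[str]],
                    pause: Callable[[], None],
                    max_retries: int = 2) -> Dict[str, str]:
    """Download each new link, retrying a failed download up to max_retries times."""
    collected: Dict[str, str] = {}
    for link in links:
        if link in already_downloaded:   # step 10: skip offers from earlier cycles
            continue
        content = fetch(link)            # step 11: first download attempt
        retries = 0
        while content is None and retries < max_retries:  # steps 12-14
            pause()                      # short delay before trying again
            content = fetch(link)
            retries += 1
        if content is not None:
            collected[link] = content
        # otherwise the link is dropped (step 15) and the loop moves on
    return collected
```

Bounding the number of retries keeps a single broken link from stalling the weekly cycle.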

The number of download iterations performed by the algorithm equals the total number of links (\(i=1,\dots ,N\), where \(N\) is the total number of links). In steps 9 and 16, the application ensures that all information about a job offer has been collected.

The last step (step 17) is the data export. At this point, we collected the full information about web pages with job offers. This was important for several reasons. Very often, the information published on websites ceased to be available after some time, which would have prevented access to historical data. Services change the structure of their portal with varying frequency, making it impossible to develop a single interaction and parsing pattern for all sites. By using interaction patterns in the application, we can easily make changes as well as save the behavioural patterns for historical data. Portals are characterised by different methods of accessibility. There are technical issues and many other situations that also may prevent access to content. By using an iterative data processing algorithm, the system retrieves as much data as possible at each iteration, and repeats iterations until a complete data set is obtained.

To prepare gathered information for analysis, we applied a parsing process to the data. The data stored in the standard HTML format were converted into plain text. Parsing removes such information as font size, colour, and other formatting tags unrelated to the intended job data. After this step, we could proceed to the analysis.
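Converting a stored HTML page to plain text can be sketched with Python's standard HTML parser. This is an illustration under simplifying assumptions (only script and style contents are skipped, and text fragments are joined with single spaces); the class and function names are hypothetical and the parser used in the study may behave differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text; discard tags and the contents of script and style."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one plain string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = '<body><p style="font-size:12px">Księgowy</p><b>Kraków</b></body>'
print(html_to_text(page))
```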

3.3 Cleaning and extraction of relevant information

After the data has been collected and stored in HTML format (each file representing one job offer), it needs to be processed. Raw data cannot be analysed, so we extract the following information:

  • The name of the online job board that contains the job offer.

  • Original link to the job offer.

  • Title of the job offer.

  • Location (NUTS region).

  • Date of publication.

  • Type of employment contract.

  • Position level.

  • Offered remuneration.

  • Job type (full- or part-time).

  • Additional job benefits.

  • Full content of the job offer.

During the text extraction, it was found that some Polish letters may be incorrectly encoded. Due to situations such as this, it is crucial to control the encoding process. While analysing remuneration in job offers, it is important to convert any hourly wages to monthly wages, as monthly wages are mostly used in Poland. The currency type also must be checked for consistency. Websites may use various formats of publication and expiration dates. To aggregate job offers by date, a common format must be decided upon for conversion (such as YYYY-MM-DD). Since job offers were collected in Polish and English, the data needed to be written in one common format. Examples include voivodeship names and decimal numbers, which in Polish are written with a ‘,’ separator and in English with a ‘.’ separator. Other information can be extracted from the job offer text using dedicated natural language processing techniques.
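The normalisation steps described here (decimal separators, wage periods, and date formats) can be sketched as below. The figure of 168 working hours per month and the list of accepted date formats are assumptions made for illustration; the source does not state which conversion factor or formats were used. The decimal parser also assumes that spaces are the only thousands separators.

```python
from datetime import datetime

HOURS_PER_MONTH = 168  # assumption: 8 hours x 21 working days, not taken from the study


def parse_decimal(text: str) -> float:
    """Parse a number written with a Polish ',' or an English '.' decimal separator."""
    cleaned = text.replace("\u00a0", "").replace(" ", "")
    return float(cleaned.replace(",", "."))


def to_monthly(amount: float, period: str) -> float:
    """Convert an hourly wage to a monthly one; monthly wages pass through."""
    if period == "hour":
        return amount * HOURS_PER_MONTH
    if period == "month":
        return amount
    raise ValueError(f"unknown wage period: {period}")


def to_iso_date(text: str, formats=("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")) -> str:
    """Rewrite a date in any of the assumed formats as YYYY-MM-DD."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text}")


print(to_monthly(parse_decimal("25,50"), "hour"), to_iso_date("31.12.2018"))
```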

The main purpose of collecting and processing data obtained from job offers was to identify occupations, qualifications and skills based on the contents of job offers provided by employers, on the basis of publicly available website information. We used official classifications for this purpose. Direct comparison of benchmark data included in dictionaries of classifications with actual job offers did not produce the expected results. The level of recognition of certain phrases from classifications in text written by employers was very low. This inaccuracy of exact matching of the content entered by employers to the expected dictionary resulted from different conditions:

  • Differing sentence forms;

  • Differing word forms;

  • Meaningless words (‘stop words’, which prevent further lexical analysis);

  • The use of abbreviations;

  • The use of synonyms;

  • Incomplete matching;

  • Typographical errors;

  • Polish-language words written using only English characters;

  • Excessive whitespace characters (spaces, tabs, newline characters, etc.);

  • HTML character entities left in the text to denote special characters (e.g., &gt;, &lt;, and &amp; for >, <, and &, respectively).

Excess whitespace was removed using a regular expression that finds every sequence of consecutive whitespace characters in the text string and replaces it with a single space. In addition, punctuation marks and other non-alphanumeric characters were removed. HTML character entities remained unchanged.
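The whitespace and punctuation clean-up can be sketched with two regular expressions. This Python version is illustrative only: it replaces removed characters with a space so that adjacent words are not merged, which is a design choice of the sketch, and it does not attempt to reproduce the handling of HTML character entities.

```python
import re


def clean_text(text: str) -> str:
    """Drop punctuation and collapse runs of whitespace into single spaces."""
    # \w is Unicode-aware for str patterns, so Polish letters such as 'ł' are kept.
    no_punct = re.sub(r"[^\w\s]", " ", text)
    no_punct = no_punct.replace("_", " ")  # '_' matches \w but is not alphanumeric
    return re.sub(r"\s+", " ", no_punct).strip()


print(clean_text("  Księgowy/księgowa \t(pełny etat),\n Kraków!! "))
```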

Of the issues listed, the second (differing word forms) decreased matching accuracy the most. The Polish language contains many word variants. It has seven cases for nouns, and each of them may take a different word form. The forms of the verb “to be” can differ for every person. Such word variants are not exceptional; they are common in the Polish language.

In order to improve the level of recognition of job offer texts and the share of classified job offers, we decided to preprocess the actual job titles and content of job offers into a form that allowed for much more accurate matching of the content to the dictionary template. We used a morphological library (Morfologik-stemming-1.9.0) to unify the word forms to be analysed. The library has extensive possibilities for analysing Polish words. It contains 4,800,432 words with different variations. Some words were excluded from the processing mechanism due to sentence semantics.

The application uses a mechanism for converting words from any form to the basic form (lemmatisation); that is, verbs to their infinitive forms and nouns to their nominative singular forms. We lemmatised the text of dictionaries and job offers (with titles).
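The idea of lemmatisation can be shown with a toy lookup table. The sketch below is not the Morfologik library used in the study (which is a Java library with a full morphological dictionary); the four hand-made entries and the function name are assumptions chosen only to show how inflected forms collapse to one base form.

```python
from typing import Dict, List

# Tiny hand-made form -> lemma table; a real system queries a morphological dictionary.
LEMMAS: Dict[str, str] = {
    "kierownika": "kierownik",
    "kierownikiem": "kierownik",
    "sprzedaży": "sprzedaż",
    "szukamy": "szukać",
}


def lemmatise(tokens: List[str], lemmas: Dict[str, str] = LEMMAS) -> List[str]:
    """Map each token to its base form; unknown tokens are only lower-cased."""
    return [lemmas.get(token.lower(), token.lower()) for token in tokens]


print(lemmatise("Szukamy kierownika sprzedaży".split()))
```

Applying the same function to both the dictionaries and the job offer text is what makes phrases comparable regardless of grammatical case.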

At this stage, our most important task was to find specific educational traits within the text of job offers. Because job offers can contain various words to describe the same trait (occupation, qualification, skill, etc.), the algorithm must address the situation in which the trait is mentioned in the sentence, but with different words than those in the dictionary. To solve this problem, we first prepared a list of occupations, qualifications, and skills using the European Skills, Competences, Qualifications, and Occupations (ESCO) classification (European Commission 2020), and the International Standard Classification of Education (ISCED-F 2013) across fields of education and qualifications (UNESCO 2013). The advantage of ESCO as the classification of skills is that it supplies a dictionary of over 13,000 transversal and job-related skills, which is significantly more than alternative classifications (Pater et al. 2019). It also uses the International Labour Organisation ISCO classification of occupations, expanding the 4-digit codes of groups of occupations into 6-digit codes of occupations. This supplies almost 3,000 occupations. ISCED-F 2013 is the most commonly used qualification classification. In this case, alternative classifications such as ESCO provide qualification titles in excessive detail containing the exact source of qualification, such as, “Bachelor degree in Primary School Education. Department of Primary Education. Faculty of Education of Florina. University of Western Macedonia”. Companies expect future workers to finish a specific faculty, in this case ‘education’, but usually do not specify an exact school or university a person must have a degree from.

These official classifications formed only basic dictionaries, because companies can (and often do) use many synonyms, casual terms, and abbreviations to describe the occupations, qualifications and skills they seek; thus, we built dictionaries of such synonyms. We supplemented the ESCO dictionary of transversal skills with synonyms from online dictionaries. The basic job-related skills dictionary was supplemented with synonyms also provided by ESCO. The dictionary of educational fields and qualifications was left unchanged because of its specificity. We supplemented the ESCO dictionary of occupations with synonyms provided by the Statistics Poland agency, and with synonyms of vocational occupations provided by the Educational Research Institute in Warsaw. The dictionary was also supplemented with the most common job titles in the Central Job Offers Database, which is compiled from submissions by Local Labour Offices. This database contains job offers for which a Local Labour Office clerk assigned an ESCO occupation code. Individual assignment of a code by a qualified LLO clerk ensured, to some extent, that the coding was correct. To further increase the probability of correct coding, we used only the most common of these job titles.

In the dictionary of occupations, we encoded as ‘000000’ the job offers that did not indicate an occupation, but instead a specific workplace, such as seasonal work, casual work, or occasional work. We encoded as ‘999999’ any job that was not perceived as employment by official statistics; these included internships, contracts for specific work, and contracts of mandate.

For all classifications, we used numeric codes. We sorted the dictionaries from the most specific phrases to the most general; for example, ‘sales manager’ occurred before ‘manager’. The matching algorithm tried to match each dictionary phrase with the job offer text in that order. In the case of managerial occupations, the algorithm first tried to match all specific types of managers with the job offer text; only if this failed was the word ‘manager’ checked. The exceptions were the codes ‘999999’, which was searched first, and ‘000000’, which was searched last. This ensured that a job not considered as employment would not be calculated as valid, but would be noted in the database; and that some short-term jobs, mostly without high requirements, would also be counted, but as a separate category. Sometimes two or more identical phrases carried different codes. In such a case, the trait was counted as 1/n for each code, where n is the number of codes sharing the phrase, so that the fractions always summed to 1, meaning one occurrence.
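The matching order and the 1/n rule can be sketched in a few lines of Python. This is an interpretation under stated assumptions, not the authors' algorithm: specificity is approximated by the number of words in a phrase, matching is done on whole space-delimited phrases, and all six-digit codes other than ‘999999’ and ‘000000’ are invented placeholders rather than real classification codes.

```python
from typing import Dict, List, Tuple

NOT_EMPLOYMENT = "999999"  # searched first
UNSPECIFIED = "000000"     # searched last


def order_entries(entries: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (phrase, code) pairs: 999999 first, 000000 last, longer phrases earlier."""
    def key(entry: Tuple[str, str]):
        phrase, code = entry
        group = 0 if code == NOT_EMPLOYMENT else 2 if code == UNSPECIFIED else 1
        return (group, -len(phrase.split()), -len(phrase))
    return sorted(entries, key=key)


def match_codes(text: str, entries: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return fractional counts (1/n per code) for the first phrase found in the text."""
    padded = f" {text} "
    ordered = order_entries(entries)
    for phrase, _ in ordered:
        if f" {phrase} " in padded:
            codes = sorted({c for p, c in ordered if p == phrase})
            return {code: 1 / len(codes) for code in codes}
    return {}


# Placeholder dictionary; 'analyst' appears under two different invented codes.
entries = [("manager", "111111"), ("sales manager", "222222"),
           ("analyst", "333333"), ("analyst", "444444"),
           ("internship", NOT_EMPLOYMENT), ("seasonal work", UNSPECIFIED)]
print(match_codes("senior sales manager warsaw", entries))
print(match_codes("data analyst", entries))
```

Splitting one occurrence across the n codes that share a phrase keeps totals comparable: each matched offer still contributes exactly one occurrence in aggregate.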

After finishing the dictionaries, we analysed the matching results by checking them individually in a random sample of 1,000 job offers. During this procedure we created a dictionary of exceptions. This dictionary contains mostly single words with ambiguous meanings. Whenever possible, these words were substituted with phrases that clarified the context. We also found and corrected mistakes in the morphological library itself. In total, for the purposes of the study we collected 40 million job offers during 2017–2019, from which we extracted the educational profiles that employers required of potential job seekers. To calculate the degree of mismatch, we needed corresponding information on the labour supply, which is described in the next section.

3.4 Labour supply survey

The main challenge in designing a method of continuous mismatch monitoring in the labour market, especially at a detailed level, was to prepare a tool that allows for quick aggregation of data from large representative samples of people of working age. Moreover, the chosen technique should be compatible with another equally important source of information; in this case, online job advertisements. We chose the Internet both as the source of job advertisements (labour demand) and as the means of conducting labour supply research among people of working age because, in 2016, 93.7% of enterprises (including 93.2% with broadband) and 80.4% of households (of which 75.7% used broadband) had internet access. Considering the range of mobile internet services, a significant portion of Polish society is within the reach of the Internet and uses it.

In order to calculate labour market mismatch, we needed the characteristics of labour supply (job seekers) to compare them to the demand for labour (observed through job offers). The labour supply data was obtained from a CAWI (Computer-Assisted Web Interview) survey of Poles aged 18–65. The study was conducted on a nationwide random-quota sample of N = 16,119, where quotas for gender, age and size of the place of residence were consistent with those in the Polish population (Table 1). The questionnaire included a division of respondents into people currently working and not working. Even though some respondents were not looking for a job at the time of the survey, they still could have been considered as part of the population we aimed to study; that is, of people potentially interested in finding a job. The survey was conducted during an unprecedented boom in both the Polish economy and its job vacancy market, with many employers having unfilled work positions. The market was thus full of employment possibilities, a perfect situation for this study. The questionnaire consisted of four main sections: demographics, work situation, qualifications, and competencies.

Table 1 The structure of the sample for the population of Poles aged 18–65

Since we used online job offers to measure vacancies, the survey of potential job seekers included only internet users. The CAWI survey was conducted during September and October 2017 (see Footnote 5).

4 Mismatch indices

The scale of educational mismatch in the labour market at the aggregate level is usually measured with the use of structural compliance indices (Schioppa 1991). We decided to use four such measures together with the Pearson correlation coefficient. A common feature of the four measures is an uneven response to differences between individual components of the considered structures. Two of them, namely \({V}_{2}\) and \({V}_{3}\), react strongly to changes in components with low shares, while another (\({V}_{4}\)) more strongly differentiates changes in components with high shares. Of all the measures, \({V}_{1}\) is the most resistant to differences in the structures of supply and demand. Apart from the correlation coefficient, all of the presented indices are measures of differentiation, which means that the higher the index, the more different the structures are, and hence the greater the labour market mismatch. An index value of 0 means a full match, while a value of 1 means a total mismatch. The indices, excluding the correlation coefficient, have two properties: their range of variation is \(\langle 0,1\rangle\), and they meet the symmetry condition. At the stage of data preparation, components for which both \(\alpha_{i}\) and \(\beta_{i}\) were equal to 0 were excluded from the analysis.

The \({V}_{1}\) measure is often used in the economic literature (Jackman and Roper 1987), which makes the results of this analysis comparable with previous studies. However, those analyses were conducted in less detail than those presented in this article. This measure takes the following form:

$$V_{1} = \frac{1}{2}\sum_{i=1}^{k} \left| \alpha_{i} - \beta_{i} \right|$$
(1)

where \(\alpha_{i}\) is the share of job offers that require component \(i\) of a given educational trait, and represents the demand side of the market; \(\beta_{i}\) is the share of surveyed respondents (potential job seekers) who possess that component, and represents the supply side of the market; and \(i=1,\dots ,k\) indexes the components of a given educational trait, such as occupation, field of education, transversal skill, job-related skill, or region. \({V}_{1}\) was calculated separately for each educational trait. For occupations, for example, \(i\) indexes individual occupations, and \({V}_{1}\) is calculated over \(k = 2{,}747\) occupations.
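As a small illustration of Eq. (1), the following Python sketch converts raw counts into shares and computes \({V}_{1}\). The helper names `to_shares` and `v1` and the three-component counts are hypothetical and serve only to show the arithmetic.

```python
def to_shares(counts):
    """Convert raw counts (job offers or respondents) into shares summing to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return [c / total for c in counts]


def v1(alpha, beta):
    """Eq. (1): half the sum of absolute differences between the two structures."""
    if len(alpha) != len(beta):
        raise ValueError("alpha and beta must have the same number of components")
    return sum(abs(a - b) for a, b in zip(alpha, beta)) / 2


# Illustrative counts for three components of one trait (not data from the study).
alpha = to_shares([50, 30, 20])  # demand side: job offers
beta = to_shares([20, 30, 50])   # supply side: respondents
print(round(v1(alpha, beta), 3))  # (0.3 + 0.0 + 0.3) / 2 = 0.3
```

Identical structures give 0 and structures with no component in common give 1, which matches the range stated for the indices.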

Another measure of educational mismatch was built on the basis of Clark’s divergence coefficient (Clark 1952):

$$V_{2} = \left[ \frac{1}{k}\sum_{i=1}^{k} \left( \frac{\alpha_{i} - \beta_{i}}{\alpha_{i} + \beta_{i}} \right)^{2} \right]^{\frac{1}{2}}$$
(2)

This measure “prefers” differences between components with relatively low shares. This means that it should be used in particular cases where both compared structures are characterised by a relatively even distribution of shares, or when the differences between the components have smaller values. In the case of educational mismatch, it can be particularly useful for comparing selected parts of the structure when they contain relatively small values. It has limited use for entire structures.

The next mismatch measure is based on the Canberra distance (Lance and Williams 1966). The values of the original measure lie within the range \(\langle 0,k\rangle\). We therefore used a modified version, which takes values in the range \(\langle 0,1\rangle\). This measure takes the following form:

$$V_{3} = \frac{1}{k}\sum_{i=1}^{k} \frac{\left| \alpha_{i} - \beta_{i} \right|}{\alpha_{i} + \beta_{i}}$$
(3)

As in the case of Clark’s divergence coefficient, the Canberra distance is best used to compare structures with relatively small differences between their components. In addition, it is particularly sensitive to values close to 0.
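Eqs. (2) and (3) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that \(k\) counts only the components that remain after dropping those with \(\alpha_{i} = \beta_{i} = 0\), since the ratio is undefined for them; the text states that such components were excluded but does not spell out how \(k\) was counted.

```python
import math


def _nonzero_pairs(alpha, beta):
    """Keep only components where at least one share is positive."""
    if len(alpha) != len(beta):
        raise ValueError("alpha and beta must have the same number of components")
    pairs = [(a, b) for a, b in zip(alpha, beta) if a + b > 0]
    if not pairs:
        raise ValueError("no component with a positive share")
    return pairs


def v2(alpha, beta):
    """Eq. (2): index based on Clark's divergence coefficient."""
    pairs = _nonzero_pairs(alpha, beta)
    total = sum(((a - b) / (a + b)) ** 2 for a, b in pairs)
    return math.sqrt(total / len(pairs))


def v3(alpha, beta):
    """Eq. (3): Canberra distance divided by k, so that it lies in [0, 1]."""
    pairs = _nonzero_pairs(alpha, beta)
    return sum(abs(a - b) / (a + b) for a, b in pairs) / len(pairs)


# Illustrative shares; the last component is absent on both sides and is dropped.
alpha = [0.5, 0.3, 0.2, 0.0]
beta = [0.2, 0.3, 0.5, 0.0]
print(round(v2(alpha, beta), 4), round(v3(alpha, beta), 4))
```

Because each term is a ratio of a difference to a sum, every term is at most 1, which is why dividing by \(k\) bounds both indices by 1. The same ratio explains the sensitivity to small shares: moving one percentage point between components with shares of 0.98 and 0.02 changes the low-share term by about a third, while the high-share term barely moves.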

The last mismatch index takes the form:

$$V_{4} = \left[ \frac{1}{2}\sum_{i=1}^{k} \left| \alpha_{i}^{2} - \beta_{i}^{2} \right| \right]^{\frac{1}{2}}$$
(4)

A special feature of this measure is that it prefers changes between the components with the highest shares. In our analysis, this limits its usefulness for occupational and job-related skills mismatches, because their components (shares of jobs and workers having a skill or occupation) take especially small values.
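The preference of Eq. (4) for high-share components can be illustrated with a short Python sketch. The function name `v4` and the four-component structures are hypothetical; the two demand structures are constructed so that the same amount of share (0.04) is moved either between the two largest or between the two smallest components.

```python
import math


def v4(alpha, beta):
    """Eq. (4): square root of half the sum of absolute differences of squared shares."""
    if len(alpha) != len(beta):
        raise ValueError("alpha and beta must have the same number of components")
    return math.sqrt(sum(abs(a * a - b * b) for a, b in zip(alpha, beta)) / 2)


# Illustrative supply structure and two demand structures (not data from the study).
beta = [0.60, 0.30, 0.05, 0.05]
alpha_high = [0.64, 0.26, 0.05, 0.05]  # 0.04 moved between the two largest components
alpha_low = [0.60, 0.30, 0.09, 0.01]   # 0.04 moved between the two smallest components

print(round(v4(alpha_high, beta), 4))
print(round(v4(alpha_low, beta), 4))
```

Both demand structures differ from the supply structure by the same sum of absolute differences, yet the index is about three times larger when the shift occurs between the high-share components, because squaring the shares magnifies differences among large values.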

5 Results

5.1 General

Table 2 summarises the results of educational mismatch indices across different educational traits. For indices \({V}_{1}\) to \({V}_{4}\), greater values mean greater mismatch, while in the case of correlation, greater values mean more similar structures, hence smaller mismatch (see Footnote 6).

Table 2 Summary of educational mismatch measures in the labour market

Which educational features produce structural mismatch between supply and demand in the Polish labour market? All measures consistently indicate that the greatest match occurs across NUTS-2 regions (voivodeships). Regional (spatial) matching thus seems to be greater than educational matching in Poland, at least at the NUTS-2 level. The results also indicate a relatively high match between supply and demand for labour across fields of education, although in this case the differences between mismatch measures are much greater than in the case of spatial matching. According to most of the mismatch indices, matching by transversal skills is weaker. The largest mismatch occurs for job-related skills, followed by occupations. This was true for all mismatch indices with the exception of \({V}_{4}\). However, \({V}_{4}\) gives more weight to components with high shares, while the shares of job offers and potential job seekers with particular occupations or job-related skills were low. Therefore, this index was not considered appropriate for these educational features.

The observed educational mismatch decreases as the level of data aggregation increases. Job-related skills are very specific (at a low level of aggregation), and several thousand of them were identified among the respondents. Employers recognise specific skills at this level of detail. They expect future employees to be able to perform such specific tasks in addition to having transversal skills (general preparation for starting a job). For features measured on larger aggregates (i.e., fields of education and voivodeships), the mismatch was smaller. Since the skills most important to employers are often specialised, this may be an indication for educational policy to place further emphasis on narrower educational paths (teaching individual skills), and not only on traditional long-term paths (four- to five-year studies). This may be particularly important for post-graduate education. In Poland, post-graduate studies usually last one to two years, while potential job seekers may have a greater need for shorter courses and training in particular skills. Shorter post-graduate education paths appear to increase the skill mobility of Poles, because they require less time as a student and impose a smaller financial burden than traditional, longer education paths. Another reason for the greater job-related skills mismatch was employers’ lack of knowledge about the job-related skills of potential job seekers. Employers were only able to identify such skills to a limited extent. In itself, this can be a reason for skill mismatch. This information should be enhanced to improve educational matching. While companies frequently use skills terminology, job seekers mainly use the terminology of fields of education and occupations; skills recognition among job seekers is scarce.

5.2 Case study of ICT occupations

The purpose of this section is to show that the methodology we use can provide deep, insightful information on labour market mismatch. In the literature, the general matching/mismatch measures are called the “black box” (Petrongolo and Pissarides 2001). We do not know the reasons for mismatch, just its aggregate measure. By having the detailed characteristics of labour supply and demand, we can “open the black box” (Marinescu and Wolthoff 2020). This section shows one specific group of occupations in more detail.

We analysed the Information and Communications Technologies (ICT) occupations across NUTS-2 regions (voivodeships, Table 3). As much as 51.9% of the total supply of ICT professions came from the five regions with the largest share of the labour supply; the share of demand was 68.4%. There is a greater concentration of ICT labour demand than supply. The greatest difference between high demand and low supply was found in the Mazowieckie voivodeship. The Małopolskie voivodeship also ranked higher in labour demand than supply. In the Lubelskie, Kujawsko-pomorskie, and Podkarpackie voivodeships, supply share was significantly higher than the share of demand. The results indicate excess labour demand in well-developed regions and excess supply in poorly developed ones.

Table 3 The share of demand and supply for ICT occupations across NUTS-2 regions in Poland

The analysed job offer data show that Polish employers demanded employees from 140 out of 159 classified ICT occupations. Of these 140 occupations, 45 (32.1%) were identified on both the supply and demand sides, while 95 (67.9%) were identified only on the demand side. No occupations occurred only on the supply side. The share of job offers for representatives of ICT occupations was 6.6% of all online job offers, while the percentage of people with a learned occupation in the field of ICT on the supply side was 4.4%.

In terms of major occupational groups (1-digit ISCO coding), the vast majority of occupational demand came from group 2 (“Professionals”, 48.6%) and group 3 (“Technicians and associate professionals”, 33.6%), with the least demand from group 4 (“Clerical support workers”, < 0.1%). The greatest share of job offers was for the occupational group with 3-digit ISCO code 251 (“Software and Applications Developers and Analysts”, 17.9%) while the lowest was from group 422 (“Client Information Workers”, 0.7%).

The ten ICT occupations with the highest number of job offers constituted 56.4% of the total number of job offers for ICT specialists, while the share of the ten most-learned occupations constituted about 78.2% of the total job offers (Fig. 1).

Fig. 1 The share of demand and supply for the ten most frequent ICT occupations in Poland

We found that most occupations with the highest demand (70% of occupations) belonged to group 25, “Information and Communications Technology Professionals”, meaning that IT professionals were the most demanded ICT occupations. The rest of the most-demanded ICT occupations came from groups 13 (“Production and specialised services managers”), 21 (“Science and engineering professionals”), and 52 (“Sales workers”). The supply side was more evenly distributed among different groups of occupations, with 80% of the most frequently learned ICT occupations coming from four different 2-digit occupation groups.

In order to find the most mismatched occupations, we calculated differences between the supply and demand shares of identified occupations. Table 4 shows occupations with the largest surplus of relative demand over relative supply (those for which the share of job offers was greater than the share of people having a particular learned occupation), and surplus of supply over demand. IT specialists and sales consultants had the highest surplus of relative demand over relative supply. The supply share was higher than the demand share for technicians (vocational education representatives) and communications technology specialists.
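The ranking by share difference can be sketched in a few lines of Python. The function name `share_gaps`, the occupation labels, and the counts are purely illustrative and are not taken from Table 4; a positive gap denotes a surplus of relative demand, a negative gap a surplus of relative supply.

```python
def share_gaps(demand_counts, supply_counts):
    """Return (occupation, demand share - supply share), largest demand surplus first."""
    demand_total = sum(demand_counts.values())
    supply_total = sum(supply_counts.values())
    occupations = set(demand_counts) | set(supply_counts)
    gaps = {
        occ: demand_counts.get(occ, 0) / demand_total
        - supply_counts.get(occ, 0) / supply_total
        for occ in occupations
    }
    # Sort by gap (descending), then by name so that ties are ordered consistently.
    return sorted(gaps.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical counts of job offers (demand) and respondents (supply).
demand = {"IT specialist": 60, "sales consultant": 30, "IT technician": 10}
supply = {"IT specialist": 20, "sales consultant": 20, "IT technician": 60}

for occupation, gap in share_gaps(demand, supply):
    print(f"{occupation}: {gap:+.2f}")
```

Because both sides are expressed as shares, the gaps sum to zero: any surplus of relative demand in some occupations is offset by a surplus of relative supply in others.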

Table 4 The highest surplus of demand and supply shares for ICT occupations in Poland

An important aspect of analysing the demand for skills is to examine their distribution across occupations. In other words, it is important to understand which skills are most demanded for the occupations of interest. Such an analysis was possible as a result of the holistic approach we used in this study. For this purpose, we chose the five most demanded job-related skills appearing in job offers for the eight most-demanded occupations (Table 5). For example, the SQL (Structured Query Language) skill appeared in job offers for five out of the eight most demanded occupations; while Java, database management, customer relations, problem-solving, process design, project management, and customer service appeared in two out of eight occupations. The remaining skills were unique to particular occupations.
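A cross-analysis of this kind can be sketched in Python as follows. The function names, the occupations, and the skill counts are hypothetical; the sketch uses the top two skills for three occupations only to keep the example small, whereas the analysis above uses the top five skills for eight occupations.

```python
from collections import Counter


def top_skills(skill_counts, n):
    """Return the n most frequent skills, breaking ties alphabetically."""
    ranked = sorted(skill_counts.items(), key=lambda item: (-item[1], item[0]))
    return [skill for skill, _ in ranked[:n]]


def skill_spread(skills_by_occupation, n=5):
    """Count in how many occupations each skill appears among the top n skills."""
    spread = Counter()
    for skill_counts in skills_by_occupation.values():
        spread.update(top_skills(skill_counts, n))
    return spread


# Hypothetical number of job offers mentioning each skill, per occupation.
skills_by_occupation = {
    "developer": {"SQL": 9, "Java": 7, "Git": 2},
    "analyst": {"SQL": 8, "Excel": 5, "Java": 1},
    "administrator": {"Linux": 6, "SQL": 4, "Bash": 3},
}

spread = skill_spread(skills_by_occupation, n=2)
print(spread.most_common())
```

Skills with a count equal to the number of occupations are shared across all of them, while skills with a count of 1 are unique to a single occupation in the sense used above.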

Table 5 The most demanded job-related skills across the most demanded ICT occupations in Poland

6 Conclusions

We proposed an approach for analysing online job offers in the context of demanded occupations, qualifications, and skills in the labour market, with the intent to provide full educational characteristics of unmet labour demand using online job vacancies. The presented procedure shows how to address selected problems of analysing many online job boards, and how to conduct regular vacancy market research based on internet sources. We applied this method to the Polish job market to study labour market mismatch from the point of view of educational traits. Our approach is universal in the sense that it can be applied to any country and any labour market: we did not assume any specific online job board or job offer structure prior to data collection. In contrast to most previous works, we showed a thorough procedure for online data collection, transformation, information extraction, and application. We also indicated some country specificity that should be considered when different languages are present in the data. We obtained these results through lexical lemmatisation, which helped us to increase the matching frequency between dictionaries and job offer texts. All our dictionaries contained Polish and English phrases. This may have a large impact on the efficiency of the algorithms when they are applied to many countries. We argue that each language should be treated uniquely, to some extent, to obtain the highest quality of information extraction.

We connected educational terminology based on qualifications and labour market occupational terminology (used by official classifications) to the “market” terminology (used by entrepreneurs) based on direct measures of skills, and provided a cross-analysis of educational traits. Unlike most previous research, which used only proxies for measures of skills, this study used explicit measures.

Official classifications enabled us to better organise unstructured data collected online. For this, we used detailed ESCO and ISCED-F classifications, unlike Cedefop (2018; 2019a; 2019b), which used only ESCO. An alternative approach based on the O*NET database provides greater aggregations for skills. Instead, we focused on full educational characteristics at low levels of aggregation.

Cedefop uses an expert method to choose websites (Cedefop 2019a), grading them according to ten specific criteria. We supplemented this method with an automatic one based on Google Trends, and used 25 websites chosen on the basis of this criterion. Cedefop uses 4-digit ESCO codes, while we used 6-digit codes, since this level better suits our method and gives more accurate results. Cedefop (2018), whose results also include company sector and company size, analyses more traits in job offers than we did. We instead focused on providing full educational characteristics.

Our results showed that educational characteristics matter more for determining mismatch than spatial features. We found job-related skills to be more important in finding a labour market match than occupations or fields of education. It was also worth dividing the skills into job-related and transversal groups. Even though job-related skills defined the demand for ICT occupations to a larger extent than transversal skills, the latter were also important. Differences in transversal skills were the third-most important source of mismatch in the Polish labour market. This general conclusion confirms previous findings of the importance of these skills for financial occupations (Constantino and Rodzinka 2022) and for ICT occupations (Pater et al. 2022).

We did not aim to provide the best possible solution for matching, text analysis, and duplicate (non-unique entity) identification. These steps require dedicated approaches. Despite using a deterministic occupation identification method (matching the occupation classification against job offers), we still obtained a high degree of matching (93%). Applying newer techniques, for example those of Zhao et al. (2021) for duplicate detection and of Rybak et al. (2020) and Mroczkowski et al. (2021) for natural language processing, and including a treatment of classification error, should further improve the algorithm. We aimed to provide a general approach to analyse online job offers, to extract educational information from them, and to apply the results to the economic problem of labour market mismatch. We also suggested some improvements in existing approaches to online data analysis. The next step of the analysis will be to further investigate workplace characteristics, namely: working hours, mode of work (remote, traditional or hybrid work), and contract types. This information will provide more detailed estimates of labour market demand. Job vacancy surveys mainly measure standard work contracts, while business-to-business contracts or contracts for specific work are becoming more popular and may greatly change the estimates of labour demand. Using detailed data from online job offers, we may be able to analyse the scale of those contracts.