1 Introduction

We live in a world where huge amounts of data on almost every aspect of our lives are produced and collected. Capturing, understanding and fully exploiting nontraditional data, through advanced analytics, machine learning and artificial intelligence, might yield benefits for policy makers. For example, the dynamic changes in labour markets driven by digitalization, automation, robotization and the green transition require an adequate response from key players to mitigate skills shortages. A timely understanding of emerging issues across sectors and regions would allow policy makers to better manage strands of education and training (E&T) policy and better design labour market policies. The relevance of these issues is underlined by the related policy questions in a recently published European Commission policy report (European Commission, Joint Research Centre, 2022).

Below we discuss how the analysis of nontraditional data might make it possible to address various questions for which the granularity of traditional data was not adequate, such as:

  • What are the most in-demand skills brought about by the recent changes in labour markets?

  • Is there a gap between the supply and demand side with respect to skills?

  • What is the best possible way to reskill and upskill individuals building on their current skill sets?

  • Are “new skills” concentrated in specific economic sectors or regions?

Against this background, the European Centre for the Development of Vocational Training (Cedefop) has taken up the challenge of integrating into its work on skills intelligence big data collected via web sourcing of online job advertisements. Footnote 1 Drawing on Cedefop’s work and expertise, in this chapter we focus on presenting existing research that based its analysis on nontraditional data or applied data science or AI-based analytical approaches to better understand ongoing changes in skills.

2 Existing Literature

Traditionally, labour market intelligence (LMI) was based on information collected via well-established surveys or administrative data. In a rapidly changing labour market, job seekers, teachers and trainers search for timely and more fine-grained information to support their decisions. As society increasingly relies on the Internet for its day-to-day needs, the way the employer-employee job matching process occurs naturally changes as well.

With the growing number of employers who use websites to reach potential candidates, and the increasing number of users searching for jobs online, analysis based on information extracted from online job advertisements (OJAs) has become one of the most promising approaches for addressing some of the most relevant questions that new labour market trends are posing (Colombo et al., 2018). The potential of labour market systems based on big data lies in giving access to a greater variety of data sources, producing information beyond the reach of traditional surveys. This allows for labour market comparisons at regional level and for subpopulations, as well as at the level of skills Footnote 2 rather than occupations. Yet these systems are not free from shortcomings: for example, they are not suitable for long-term projections, have limits in representativeness related to coverage or completeness, and are subject to missing data resulting from inconsistencies in unstructured text [see Naughtin et al. (2017) and Cedefop et al. (2021)].

Although these examples come mainly from one-off and exploratory research projects, which are very often based on a single data source, they still confirm the usefulness of online job advertisements in a variety of analyses. The potential of this source of information for drawing conclusions about labour market trends across multiple dimensions, such as occupation, geography, level of education and type of contract, was confirmed in the study by Tkalec et al. (2020). Moreover, there have been various efforts to identify skills for emerging jobs, for example, skill requirements for business and data analytics positions (Verma et al., 2019), for ICT and statistician positions (Lovaglio et al., 2018) and for software engineering jobs (Gurcan & Cagiltay, 2019; Papoutsoglou et al., 2019). A few studies focused on skills identification in specific sectors, namely IT (Ternikov & Aleksandrova, 2020), tourism (Marrero-Rodríguez et al., 2020) and manufacturing (Leigh et al., 2020), or on skills requested in specific occupations: computer scientist positions (Grüger & Schneider, 2019), various types of analyst positions (Nasir et al., 2020) or skills requested in public health jobs (Watts et al., 2019). Alekseeva et al. (2019) searched online job advertisement data for terms related to artificial intelligence (AI) to understand which professions demand these skills. Job information collected over time allows the identification of trends in the skill set requirements of different industries, as done by Prüfer and Prüfer (2019), who provided insights into the dynamics of demand for entrepreneurial skills in the Netherlands and also identified professions for which entrepreneurial skills are particularly important. Fabo et al. (2017) analysed how important foreign language skills are in the labour markets of Central and Eastern Europe. Pater et al. (2019) analysed the demand for transversal skills on the Polish labour market. Dawson et al. (2021a) used longitudinal job advertisement data to analyse changes in journalists’ skills and to understand changes in the situation of this occupational group on the Australian labour market.

Although a growing number of employers use the web to advertise job openings, this data is still criticized for being skewed toward employers seeking more highly skilled professionals or those more exposed to the Internet (Carnevale et al., 2014; Kureková et al., 2015b). Nevertheless, Beblavý et al. (2016), using Slovak job portals, and Kureková et al. (2015a), using Czech, Irish and Danish data from a publicly administered cross-European job search portal, delivered evidence on skills requested specifically in low- and medium-skilled occupations. Wardrip et al. (2015) focused on understanding employers’ educational preferences by studying medium-skilled job advertisements.

The extraction of information at the level of skills allows different types of required skills (e.g. soft, transversal, digital) to be quantified and therefore supports a better understanding of how soft and hard skills influence each other, as analysed by Borner et al. (2018).

The potential of extending studies on labour market polarization by including information on the relevance of specific skills and skill bundles was indicated in a few studies (Alabdulkareem et al., 2018; Salvatori, 2018; Xu et al., 2021). A skills-based approach to studying possible transitions from lower-wage into better-paying occupations based on online job advertisement data was explored by Demaria et al. (2020). The rich structure of neural language models encourages researchers to attempt building more sophisticated models, e.g. ones that predict wages from job postings’ text [see Bana (2021)].

Building online job advertisement databases over extended periods of time allows a longitudinal perspective to be introduced into the analysis. For example, Adams-Prassl et al. (2020) used job advertisement data to gain more insight into the determinants of employers’ demand for flexible work arrangements. Blair and Deming (2020) analysed changes in demand for skills in the USA, indicating that the increase in demand for graduates with a bachelor’s degree is structural rather than cyclical in nature. Shandra (2020) analysed trends in the skills requirements of internship positions. Das et al. (2020) explored how occupational task demands have changed over the past decade due to AI innovation. Acemoglu et al. (2020) studied AI effects on the labour market, indicating an increase in demand for AI skills after 2014. Recently, job advertisement data were used to study the impact of the social distancing measures introduced during the Covid-19 pandemic on the labour market in the EU (Pouliakas & Branka, 2020). A similar analysis of the Covid-19 impact on labour market demand in the USA gave insights at state level, for essential and non-essential sectors, and for teleworkable and non-teleworkable occupations (Forsythe et al., 2020).

Using real-time labour market data can also bring valuable insight into the reasons for the low employability of graduates and help learners make informed decisions about acquiring the skills requested by employers. Persaud (2020) combined information extracted from job postings and from programmes offered by universities and colleges, identifying what skills employers are seeking for big data analytics professions and to what extent these competencies are acquired by students. Universities may use AI-based analytics to map competences from job adverts and compare them with curricula and course descriptions to better design their future education offer (Ketamo et al., 2019). Borner et al. (2018) systematically analysed the interplay of job advertisement content, courses and degrees offered, and publication records to understand skills gaps, proposing not only a methodology but also visualizations to ease data-driven decision-making by less tech-savvy stakeholder groups. Brüning and Mangeol (2020), using job posting data, analysed geographical differences in demand for graduates’ skills in the USA. They tried to determine which skills employers look for when searching for graduates who did not follow a vocational career pathway. They also looked at how open employers in need of ICT specialists are to hiring graduates from other fields of study (ibidem).

Although there is little evidence that individuals change their job search behaviour when receiving supporting information from job recommendation tools (Hensvik et al., 2020), or that such tools effectively decrease skills mismatches on the labour market, recent advances in artificial intelligence have spurred research contributions in the areas of career pathway planning, curriculum planning, job transition tools and software supporting job search. Advances in the extraction of skills information from online job advertisements have led to proposals for job search tools that allow recommendations to be filtered by skill set and company attributes (e.g. size, revenue) (Muthyala et al., 2017), and for models that can be used to build job prediction applications based on descriptions of user knowledge and skills (Van Huynh et al., 2020). Solutions are also being developed (based on vacancy information) that, given a starting set of skills, recommend job options which are not only matched with individual skill sets (Giabelli et al., 2020b, 2021) but also aligned with the career ambitions and personal interests of the job seeker (Sadro & Klenk, 2021).

Aggregating data sources on skills with job advertisement data and the education offer (e.g. from a local university or a database of online courses) allows solutions to be built with more personalized career information, advice and guidance. Such recommendation tools have a matching solution with built-in information from existing sources on the education offer, providing job seekers with information about potential career opportunities together with information on which courses to take to acquire missing skills. At the same time, these tools account for the time that a job seeker, or a person interested in changing profession, has available to learn new skills (Sadro & Klenk, 2021). Recommendation tools powered by labour market information could be personalized even further, for example, to allow job opportunities to be filtered to match individuals’ health requirements (Sadro & Klenk, 2021) or the commuting expectations of the job seeker (Berg, 2018; Sadro & Klenk, 2021).

Networking websites for matching workers with employers could serve as a source of information giving more insight into the demand for, but also the supply of, skills. From the demand perspective, such data allows additional company metadata to be retrieved to investigate the relationship between company characteristics and workers’ skills (Chang et al., 2019). Information extracted from workers’ career profiles allows the analysis to be extended to differentiate skills from entry- to middle- to top-level jobs. Such data was used to test the effectiveness of a proposed framework for predicting career trajectories, with a built-in time variable that accounts for different lengths of workers’ experience (Wang, 2021). Networking websites may also allow checking which job advertisements were visited more frequently, information that could be used, e.g., to improve analysis of labour market tightness (Adrjan & Lydon, 2019).

Information about users’ skills from online CV profiles could also be used as input for career guidance tools, as proposed by Ghosh et al. (2020), to support people in deciding which skills to acquire to achieve their career goals. The analysis of big data on real career changes allows a better understanding of the possibilities for intersectoral mobility (International Labour Organization, 2020). Natural language processing (NLP) solutions were applied to find overlapping skills between occupations, on the basis of which potential job transitions were established (Kanders et al., 2020). Dawson et al. (2021b) built a job transition recommender system, gaining more insight into similar sets of skills by combining information from longitudinal datasets of real-time job advertisements with occupational transitions from a household survey. In an ongoing project (Cedefop, 2020), an analysis based on labour market transition data extracted from more than 10 million anonymized CVs from across EU member states was carried out to feed a recommendation tool. Footnote 3 This tool will support job seekers by providing them with information on occupations alternative to their own. Allowing workers to identify the skills whose acquisition would yield the highest utility gains could translate into improved employment outcomes and increased productivity. Sun (2021) presented a new data-driven skill recommendation tool based on a deep reinforcement learning solution that also accounts for learning difficulty. Stephany (2021), combining information about freelancers’ skills and wages, calculated the marginal gains of learning a new skill. Insights from this study could help design individual reskilling pathways and increase individuals’ employability.
There is an ongoing feasibility study Footnote 4 which aims to explore the potential of information extracted from work platforms that play an intermediary role on the labour market for better understanding the interplay between workers’ skills, tasks and occupations.

3 Computational Guidelines

The growing body of knowledge on the labour market generated from online sources has translated into increasing interest in taking advantage of skills intelligence for policy making. In 2014, Cedefop started building a pan-European system to collect and classify online job advertisement data. The initial phase included only five EU countries, but with time the project was scaled up and extended to the whole EU, covering all 27 Member States plus the UK and all 24 official languages of the EU (Cedefop, 2019). This positive experience led Cedefop to join efforts with Eurostat (and the creation of its Web Intelligence Hub) in developing a well-documented data production system that integrates a big data element into the production of official statistics (Descy et al., 2019). Yet the retrieval of good-quality, robust information from online data sources to deliver labour market analysis efficiently remains a challenging task. The key challenges identified in using online job advertisements (OJAs) for skills and labour market analysis are representativeness, completeness, maturity, simplification, duplication and the status of vacancies (International Labour Organization, 2020). The computational challenges of building reliable time series from information collected from online data sources can be grouped into four areas:

  • Data ingestion

  • Deduplication

  • Classification of occupations and skills

  • Representativeness

When the focus is on the data ingestion and landscaping part, source stability is one of the main technical problems, with a direct impact on the representativeness of the collected information and the reliability of further analysis. Firstly, some sources might be blocked from data collection, not allowing extraction of information, and prior agreements with website owners will be needed to access the information. Secondly, some websites may be unavailable during data extraction because of technical problems. Thirdly, online sources have a natural lifecycle: new websites may appear while existing ones close or rebrand. It has been shown that including a website containing a large volume of spurious and anonymous job postings can lead to discrepancies with official vacancy statistics (International Labour Organization, 2020). To ensure the stability of data sources, the added value of tools like the analytic hierarchy process for ranking online sources along various dimensions, including information coverage, update frequency, popularity, and expert assessment and validation, is being explored. Footnote 5

The challenges with deduplication relate to the fact that the same job advertisement commonly appears in various sources on the Internet. This can happen either intentionally, when an employer publishes an OJA on more than one portal, or unintentionally due to the activities of aggregators, i.e. portals that automatically crawl other websites with a view to republishing OJAs. Very often, the content of such job advertisements is almost identical, differing only in a small portion of the text (e.g. the date of release). There are several ways to identify near-duplicate job advertisements and avoid counting the same information multiple times, e.g. bag-of-words, shingling and hashing techniques (Lecocq, 2015). In the deduplication process, several fields of the job advertisement (e.g. job title, name of employer, sector) are compared to determine whether it is a duplicate. Footnote 6 Metadata derived from job portals (e.g. reference ID, page URL) is another way to help identify duplicate advertisements. In addition, machine learning algorithms could be used to remove irrelevant content, e.g. training offers.
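The shingling-and-hashing idea can be illustrated with a minimal Python sketch. The adverts, the shingle size and the similarity threshold below are illustrative choices only, not a description of any production deduplication pipeline:

```python
import hashlib

def shingles(text, k=5):
    """Split a normalized advert text into overlapping word k-shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def hashed_shingles(text, k=5):
    """Hash each shingle so adverts can be compared as small integer sets."""
    return {int(hashlib.md5(s.encode()).hexdigest()[:8], 16) for s in shingles(text, k)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(ads, threshold=0.7, k=5):
    """Return index pairs of adverts whose shingle overlap exceeds the threshold."""
    sets = [hashed_shingles(ad, k) for ad in ads]
    return [(i, j) for i in range(len(ads)) for j in range(i + 1, len(ads))
            if jaccard(sets[i], sets[j]) >= threshold]
```

Two adverts that differ only in the release date share almost all their shingles and are flagged, while unrelated adverts score near zero. At scale, the pairwise comparison is typically replaced by MinHash or locality-sensitive hashing so that millions of adverts can be screened without computing every pair.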

In the next phase of data processing, the challenges related to the classification of occupations and skills emerge from the fact that the information is extracted from unstructured fields of job advertisements. For example, employers might tend to leave tacitly expected requirements unstated, explicitly mentioning only a few skills from the full list of required ones in online job advertisements. Similarly, candidates building their online career profiles may signal only selected skills they have; for example, indicating “Hadoop” and “Java” could also imply workers’ expertise in “MapReduce” (Muthyala et al., 2017). Sometimes the same word may have different meanings depending on the context: e.g. philosophy as a field of study or as the company philosophy (informal written guidelines on how people should perform and conduct themselves at work); Java could come either from a job advertisement searching for an IT specialist or for a coffee maker.

In general, two approaches are used for information extraction from unstructured text: cluster analysis and classification (Ternikov & Aleksandrova, 2020). For example, Zhao et al. (2015) developed a system for skill entity recognition and normalization based on information from résumés, while Djumalieva and Sleeman (2018) used online job advertisement data and employed machine learning methods, such as word embeddings, network community detection algorithms and consensus clustering, to build a general skills taxonomy. In a similar way, Khaouja et al. (2019) created a taxonomy of soft skills by combining DBpedia and word embeddings and evaluated the similarity of concepts with cosine distance. Moreover, social network analysis was used to build a hierarchy of terms.

The unavailability of high-quality training datasets was believed to constrain advances in the use of AI for extracting information from unstructured text. Yet it has been observed that solutions based on structured, fully semantic ontological approaches or taxonomies extract meaningful information from online data better than applications based exclusively on machine learning (International Labour Organization, 2020; Sadro & Klenk, 2021). Nevertheless, taxonomy-based extraction processes are not free from deficiencies, as the quality of the extracted information tends to be only as good as the underlying taxonomies (Cedefop et al., 2021). Plaimauer (2018), studying matches between taxonomy terms and the language used in vacancies published on the Austrian labour market, showed that 56% of the taxonomy terms never appeared in job advertisements. She also observed that longer terms were identified less frequently in vacancy descriptions. Grammatical cases in some languages are challenging for natural language processing tools, which often leads to misinterpretation of recognized skills (Ketamo et al., 2019).

The mapping of unstructured text (e.g. job titles, skills) to existing taxonomies (e.g. ISCO, the International Standard Classification of Occupations) is usually done in a few steps, with pipelines built for separate languages (Boselli et al., 2017). First, the text needs to be extracted from the body of the job advert; this can be done with a bag-of-words or Word2Vec approach Footnote 7 (Boselli et al., 2017). In both cases, the usual preprocessing steps are applied to the text. Footnote 8 Bag-of-words extraction leads to the creation of sets of n consecutive words (so-called n-grams); usually unigrams or bigrams are analysed (ibidem). Word2Vec extraction replaces each word in a title with a corresponding vector in an n-dimensional space. This approach requires huge text corpora to produce meaningful vectors (ibidem); domain-specific corpora can significantly improve the quality of the obtained word embeddings. In the next step of the classification pipeline, machine learning techniques (e.g. decision trees, naïve Bayes, k-nearest neighbours (k-NN), support vector machines (SVM), convolutional neural networks) are applied to match the text with the “closest” code. Similarity is judged based on the value of one of the existing similarity indexes (e.g. cosine, Motyka, Ruzicka, Jaccard, Levenshtein distance, Sørensen-Dice index).
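The simplest instance of this pipeline, unigram extraction followed by a 1-nearest-neighbour match under the Sørensen-Dice index, can be sketched as follows. The three-entry "taxonomy" with its ISCO-like codes is a hypothetical stand-in for a full classification, and real pipelines add the preprocessing and trained classifiers described above:

```python
def ngrams(text, n=1):
    """Lowercased word n-grams of a (preprocessed) job title, as a set."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def dice(a, b):
    """Sørensen-Dice index between two n-gram sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# Hypothetical mini-taxonomy: ISCO-like codes with labelled example titles.
TAXONOMY = {
    "2511": "software developer",
    "2521": "database designer and administrator",
    "5131": "waiter",
}

def closest_code(title, taxonomy=TAXONOMY, n=1):
    """1-nearest-neighbour match: the taxonomy entry whose label has the
    highest Dice overlap of word n-grams with the raw title."""
    grams = ngrams(title, n)
    return max(taxonomy, key=lambda code: dice(grams, ngrams(taxonomy[code], n)))
```

Even this crude matcher absorbs some of the extraneous wording of advertised titles ("senior", "ideal candidate") because the overlapping n-grams dominate the score; swapping in bigrams, embeddings or a supervised classifier changes only the representation and the distance, not the overall structure of the pipeline.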

Evaluating the quality of the obtained matches (e.g. between job titles and an occupation classifier) is not an easy task, although the matching problem itself is not new: AI solutions were previously developed for coding open answers on job titles provided by respondents in survey data (Schierholz & Schonlau, 2020).

Yet the main difference between job titles provided by individual workers and job titles mentioned in online job advertisements is that the latter include more extraneous information (e.g. “ideal candidate”, “involve regular travel”) and tend to be more difficult to parse (Turrell et al., 2019). One way to validate that an occupation classifier is generating meaningful predictions is to check the implied occupational hierarchies (Bana et al., 2021). For example, a classifier that confuses a high-skilled profession with a low-skilled one would be judged as performing worse than one that categorizes the occupation into a more general category within the appropriate hierarchical occupational group. Nevertheless, Malandri et al. (2021a), who applied a word embedding approach to job advertisement data, identified mismatches between the taxonomy and real market examples. In particular, analysing the market for ICT occupations, they showed that although in the ESCO taxonomy a data engineer and a data scientist belong to the same occupation group, these are not similar occupations in the real labour market (ibidem). Previous studies show that the accuracy of extracted information varies from field to field and also with the level of detail: the accuracy rate of six-digit occupation coding was about 10 percentage points lower than that for major groups at the two-digit ISCO level (Carnevale et al., 2014). A similar trade-off between granularity and accuracy was observed by Turrell et al. (2019), who decided to use a three-digit occupation classification. Yet, using supervised algorithms, it has proven possible, at least for English, to achieve good performance (over 80%) in classifying textual job vacancies gathered from online advertisements with respect to the fourth level of the ISCO taxonomy (Boselli et al., 2017).
Nevertheless, less than 85% of titles were correctly classified in a matching exercise of job titles advertised on Dutch websites against the ISCO-08 ontology (Tijdens & Kaandorp, 2019). A manual check of the unclassified terms showed that job titles in vacancies could be either more specialized than the terms in the ontology or vice versa. However, some wrong classifications also occurred, despite high reliability scores, for titles that included similar words, e.g. campaign manager versus camping manager (ibidem). Another challenge in matching against an occupation classifier is that job advertisements sometimes have generic, meaningless job titles or no title at all. Therefore, it is also important to design and train classifiers that can suggest a job title by taking into account the content of the entire job description, as does, for example, the Job-Oriented Asymmetrical Pairing System (JOAPS) proposed by Bernard et al. (2020).

Overall, the main disadvantage of classifying unstructured information with taxonomies is that they are not forward-looking: revisions often lean on expert panels and surveys, so taxonomies are updated with information on emerging skills and/or occupations only with substantial delays. AI solutions were introduced to update the ESCO taxonomy with information on occupations; however, detailed information on the applied procedure was not provided in the official reports (European Commission, 2021a, b). A tool with the capacity to automatically enrich the standard occupation and skills taxonomy with terms representing new occupations was proposed by Giabelli et al. (2020b, 2021). This tool identifies the most suitable terms to be added to the taxonomy on the basis of four measures: Generality, Adequacy, Specificity and Comparability (GASC) (for formal definitions of these measures, see Giabelli et al. (2020a)). Very often, inconsistencies between the terminology used by job seekers and the jargon employers use to describe the same skills are the reason solution developers struggle to match information from different data holders (Sadro & Klenk, 2021). One way to overcome the problem is to apply AI and advanced linguistic understanding to build a platform which “translates” the jargon of job advertisements into simpler language for job seekers (Sadro & Klenk, 2021). Revealed comparative advantage (RCA) Footnote 9 was used by Giabelli et al. (2020a) as a measure of the importance of a skill for an individual job, enriching the ESCO taxonomy with real labour-market-derived information about skill relevance and skill similarity. Footnote 10 Another AI-based methodology for refining taxonomies was proposed by Malandri et al. (2021b). The novelty of this approach is the automation of the process, which is carried out without the involvement of experts.
It is based on a domain-independent metric called hierarchical semantic similarity, applied to judge the semantic similarity between new terms and taxonomic elements; its value is then used to evaluate the embeddings obtained from a domain-specific corpus and, eventually, suggestions on which new terms should be assigned to a different concept are made by comparing these evaluations. Chiarello et al. (2021) proposed a methodology that can be used to improve a taxonomy. The innovation of this approach lies in the use of natural language processing tools for knowledge extraction from scientific papers. The extracted terms are then linked with existing ones, allowing the identification of those not previously included in the taxonomy.
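Returning to the RCA measure mentioned above: in its usual formulation, the RCA of a skill for an occupation is the skill's share of mentions within that occupation's adverts divided by its share of mentions across all adverts, with values above 1 marking over-represented skills. A minimal sketch, with invented skill-mention counts (the exact formulation used by Giabelli et al. may differ in detail):

```python
def rca(counts, occupation, skill):
    """Revealed comparative advantage of a skill for an occupation: the
    skill's share within the occupation's adverts divided by its share in
    the whole corpus. RCA > 1 marks an over-represented skill."""
    occ_total = sum(counts[occupation].values())
    occ_share = counts[occupation].get(skill, 0) / occ_total
    all_total = sum(sum(skills.values()) for skills in counts.values())
    all_mentions = sum(skills.get(skill, 0) for skills in counts.values())
    return occ_share / (all_mentions / all_total)

# Hypothetical skill-mention counts per occupation, e.g. aggregated from OJAs.
counts = {
    "data engineer": {"python": 80, "sql": 90, "teamwork": 30},
    "hr officer":    {"teamwork": 70, "sql": 10},
}
```

With these invented counts, "python" gets an RCA above 1 for the data engineer (it is concentrated in that occupation's adverts), while "teamwork", although frequent overall, is distinctive only for the HR officer.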

As a final step before starting to analyse online data, it is crucial to explore its representativeness. The sources of potential bias in online job advertisements are multiple (Beręsewicz & Pater, 2021). Moreover, the population of job vacancies and its structure are practically unknown, and for non-probability samples traditional weighting cannot be used as an adjustment method (Kureková et al., 2015b). Researchers who recognize the problem of online data representativeness very often present the results of their analysis together with information from other data sources, i.e. representative surveys and registry data [e.g. Colombo et al. (2019)]. Beręsewicz et al. (2021) suggest combining traditional calibration with a LASSO-assisted approach to correct representation error in online data.
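The basic calibration idea can be illustrated with plain post-stratification: each stratum of the online sample is rescaled so that the weighted sample reproduces a benchmark structure known from a representative survey. This is only the simplest ingredient of the approaches discussed above (the LASSO-assisted method additionally selects which auxiliary variables to calibrate on), and the strata and shares below are invented:

```python
def poststratify(sample_counts, benchmark_shares):
    """Post-stratification weights: scale each stratum so that the weighted
    online sample matches the benchmark distribution of that variable."""
    n = sum(sample_counts.values())
    return {g: benchmark_shares[g] * n / sample_counts[g] for g in sample_counts}

# Invented example: online adverts over-represent high-skilled vacancies.
weights = poststratify(
    sample_counts={"high_skilled": 80, "low_skilled": 20},
    benchmark_shares={"high_skilled": 0.5, "low_skilled": 0.5},
)
```

After weighting, the 80 over-represented high-skilled adverts count as 50 and the 20 under-represented low-skilled adverts also count as 50, matching the benchmark; the hard part in practice is that the benchmark vacancy population itself is only approximately known.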

4 The Way Forward

The aim of this chapter was to map the diversity of existing research projects that used big data and artificial intelligence approaches to study the topic of changing job skills in a changing world. It also tried to summarize the computational challenges related to the extraction of information from online, unstructured data, and the other issues that analysts using such data may struggle with. Based on the existing evidence, suggestions were made for the design of future research projects, which in this very lively research area may already have been addressed but were not identified in our mapping exercise.

Having said that, firstly, more focus is needed on understanding the quality of applied classification methods. Although one ongoing project investigating the quality of job title classifiers was identified [see Bana et al. (2021)], projects focusing on the design and testing of alternative approaches, with outputs allowing the quality of existing solutions to be understood and improved, would be welcome.

Secondly, projects which aim at delivering comparable information across countries (e.g. the Skills Online Vacancy Analysis Tool for Europe, OVATE Footnote 11) would benefit from further research into the role of language characteristics in the extraction capacity of taxonomies and the quality of these extractions. For example, analysis of the number of skills extracted in each OVATE language pipeline shows huge variation across languages. In general, versions of the ESCO taxonomy Footnote 12 translated from English into the other languages used in EU countries yield a lower number of extracted skills, but the reasons behind this are not known. Research projects following the approach presented in Sostero and Fernández-Macías (2021), or similar approaches with other existing ontologies used as a benchmark or applied to languages other than English, would be highly welcome.

Thirdly, the use of artificial intelligence approaches to identify new or emerging skills not yet included in taxonomies is another research area that requires more investment and knowledge building. The ongoing research tendered by Cedefop/Eurostat may bring more understanding to this discussion, but other possibilities should also be explored, e.g. identifying gaps by merging taxonomy terms with information extracted from academic journals [see Chiarello et al. (2021)].

Furthermore, researchers using the results of online data analysis to inform policy makers should, to be transparent about potential biases, include an explanatory methodological note on the representativeness of their data. AI-based approaches to correcting representation error in online data are also a developing field in the research discussion [see Beręsewicz et al. (2021)].

Lastly, the various recommendation tools appearing on the market to help job seekers are based on the similarities or overlap between the skills of two occupations, and these similarities are very often calculated with techniques similar to those used for classifying unstructured text with existing taxonomies [see Amdur et al. (2016) and Domeniconi et al. (2016)]. It would be worth researching and evaluating the quality of existing solutions and the transitions they suggest to job seekers, especially as some recommendation tools do not account for the hierarchical structure of skills or the duration of learning time.