Introduction

Over the last decades, the concept of competence has gained relevance not only in the workplace (Smirnov et al. 2016) but also in the academic field (Fazel-Zarandi 2013; Paquette 2007), where knowing the competencies required for a profession is of great importance for updating professional profiles (e.g., job advertisements) and curricula (e.g., academic profiles) (Paquette 2016). In this paper, the term “academic profile” refers to the academic competencies (formation) and, more generally, to the online profiles of job seekers, students, academic websites, etc. In particular, comparing academic profiles with job advertisements makes it possible to identify which ones have similar competencies, and which competencies should be added so that an academic profile (e.g., a student’s LinkedIn page, an online resume, or a graduate student website) can be matched to a job opportunity on a job listing website such as Monster or Indeed.

On the other hand, the developments in Web technologies and AI techniques to build the Semantic Web have allowed a new set of applications with important implications for Web-based education (Aguilar et al. 2015). One possible utilization consists in the characterization of the competencies based on the professional profiles and curricula. However, this comparison presents difficulties, mainly due to the way in which competencies are expressed in both contexts (Paquette et al. 2012). For example, in the academic context they are manifested as learning outcomes (Worsley and Blikstein 2018), while in the work context the competencies are presented as functions, knowledge areas, or skill levels in specific subjects (Rácz et al. 2018; Rosa et al. 2015). Consequently, there is a problem of understanding the meaning of competencies, such that one competence can be similar to another, even though the same words are not used to express them.

Based on the definition that competence is something “that can demonstrate the application of a generic skill on some knowledge” (Paquette et al. 2012; Paquette 2007), the purpose of the present work is to develop a comparison scheme between competencies that overcomes the problem of ambiguity among them, using text similarity measures combined with thesauri. To do this, two lexical measures, the Levenshtein distance and Dice’s coefficient, are used to determine the levels of coincidence between the knowledge and skill topics of the academic profiles and job advertisements. Then, according to a threshold, those with the greatest lexical similarity are chosen. Next, we use the taxonomic structure of the thesaurus to obtain a measure of semantic similarity of knowledge and skill topics, inspired by the Ant Colony Optimization algorithm. First, the levels of coincidence of the topics in the thesaurus are identified; then, the highest similarity is determined through the analysis of the ancestors, siblings and children of each of these topics (González-Eras and Aguilar 2015; Mendonza et al. 2015; González-Eras et al. 2017; Guevara et al. 2017). As a result, a similarity is obtained from the analysis of the competencies of each profile, which considers not only topic characteristics but also their context.

There are research efforts in which semantic representations and similarity measures are used to compare processes or models, and thus solve ambiguity problems in textual expressions caused by the use of synonyms, homonyms, or different levels of abstraction in the description of entities or concepts. In (Ehrig et al. 2007), the similarity between business process models represented as Petri nets is obtained with two measures: the first establishes the lexical distance between pairs of concepts and uses a dictionary to determine synonyms, and the second is a structural measure that recognizes homonyms between concepts by comparing their position in each model. Likewise, in (Dijkman et al. 2011) the similarity between processes is obtained through the edit distance between process names, and then a weighting is performed over the intersection of common names and the synonyms of the names that are not common. In (Van Dongen et al. 2013), two kinds of similarity measures are used to compare business process models: process-name measures, which measure the similarity between words, and structural measures, which, in addition to aligning process names, also measure the relationships between them. As for the comparison of competencies, in the work of (Malzahn et al. 2013) similarity measures are used to compare entities and competencies in professional profiles, first according to their edit distance (Levenshtein), then with the support of thesauri (GermaNet) and dictionaries (Wortschatz) to detect synonyms, and finally through semantic measures to align concepts according to their frequency.

For the present work, we propose the implementation of similarity algorithms that align the knowledge and skill topics found in academic profiles and job advertisements against the topics present in a competence thesaurus. First, a lexical measure compares them letter by letter; once the thesaurus topic of greatest similarity is found, a measure of structural similarity verifies whether they have the same ancestors, siblings and children within the thesaurus taxonomy (González-Eras and Aguilar 2015). This establishes a semantic measure for the competence alignment between academic profiles and job advertisements, based on the similarity of their knowledge and skill topics. If our approach works well, the metrics will indicate the curricular topics with which the job listings are most aligned.

This article is structured as follows: first, the characterization of competencies is carried out in the context of academic profiles and job advertisements, followed by the similarity in the case of competencies; then the architecture of the proposal is addressed, and then the experimentation, the analysis of results, and the conclusions of this work.

Characterization of the Competence Concept

The context used throughout the article to explain the different concepts is the domain of Computer Science. The objective is to compare academic profiles and job advertisements according to their competencies. For this, functional or specific competencies are analyzed, based on the concept that a competence is defined by “the ability with which a professional develops in a specific area of knowledge” (Paquette et al. 2012). Thus, we understand skill and knowledge as the constituent elements of competence, since knowledge includes the set of topics or issues that are part of a profession and are necessary to function in it (De Leenheer et al. 2010), while skill represents the capacity to use knowledge to act successfully in the development of an activity (Beckers 2011; Blanco-González et al. 2011).

In common practice, the competencies representation has been carried out through linguistic declarations, which do not formally describe the domains of knowledge or skill, in addition to not being suitable for computational processes; which makes it difficult to compare competencies in job advertisements and academic profiles. In addition, for each type of profile, the sentence structure containing the competencies is different. Table 1 shows examples of the text structures found in the profiles. As we can see, the expressions highlighted in red represent skills, while the expressions highlighted in blue represent knowledge.

Table 1 Examples of competences found in profiles

Although these statements demonstrate the presence of skills and knowledge, sentences have different lengths, present more than one verb to denote skill levels, and use different words to express the same knowledge. Consequently, comparing profiles based on these statements implies the alignment of knowledge topics, to establish similarities between ambiguous topics; and the alignment of skill topics, in order to select those skills that represent the competence.

In general, the ambiguities that can be found in knowledge topics correspond to synonymic relationships, where two topics have the same meaning, although they are written differently. For example, the topics “parallel processing” and “distributed computing” are similar terms since they share the same knowledge area. There are also cases of hyponymy / hypernymy, where a topic has a hierarchical semantic relationship with another topic, for example, “systems” with “operating systems” or “distributed systems”; and the relations of meronymy where the topics share the same hierarchical level, as is the case of “programming languages” with “Java language” and “PHP language” (Lundqvist et al. 2011). In the same way, the topics of skill: “demonstrate”, “indicate” and “expose” share a semantic relationship because they are, according to dictionaries and thesauri, synonyms (Ortiz Sánchez 2016).

Semantic Sources

One way to resolve cases of textual ambiguity is to align text units against semantic structures, thus obtaining their similarity (Harispe et al. 2013). For competencies, the semantic sources normally used are thesauri, taxonomies and dictionaries in the same language and domain of knowledge. For this reason, two of these semantic structures were used for our research: the DISCO II thesaurus (Footnote 1) and a thesaurus based on the BLOOM taxonomy proposed in (Ortiz Sánchez 2016).

The DISCO II thesaurus is an international standard used in the creation of competence profiles in the labor and educational fields, which has a Spanish version. Furthermore, the Computer Science area includes statements that paraphrase competencies, which contain knowledge elements that represent learning outcomes (Müller-Riedlhuber 2009). It is a controlled vocabulary, and the existing relationships between terms are of three types: 1. Semantic equivalences, which represent synonyms, 2. Hierarchical relations, which establish relations of hypernymy/hyponymy and meronymy, and 3. Relationships by association, which specify any other contextual, semantic or use relationship (Reichhold et al. 2012; Müller-Riedlhuber 2017).

The alignment of a knowledge topic against the DISCO II thesaurus requires a similarity between the topic and a taxonomic level of the thesaurus tree. Figure 1 presents three cases of similarity with the DISCO II thesaurus, where the topics belong to the same subtree within the thesaurus and, therefore, have the same upper hierarchical level in the tree. Thus, for example, topics such as “network computing” and “parallel computing”, besides having a lexical similarity (through the word “computing”), have a relationship of meronymy because they share the same subtree within the thesaurus (case 1). This is also the case for “Geoinformatics” and “Geographic data processing”, which have a synonymy relation (case 2), and for “database analysis” and “data modeling”, which have a relation of hypernymy/hyponymy because these topics are part of the subtree corresponding to “knowledge of databases” (case 3). Consequently, to achieve the alignment of two knowledge topics, the first step is to find, for each topic, the topic of the tree with which its lexical similarity is high, and the second step is to determine the degree of similarity between the subtrees of each one of them.

Fig. 1 Different types of topic similarity according to the DISCO II thesaurus, in cases of meronymy (1), synonymy (2) and hypernymy/hyponymy (3)

To perform the alignment of skills topics, there is a thesaurus of synonyms built on the basis of Bloom’s taxonomy (Anderson et al. 2001), proposed in (Ortiz Sánchez 2016), which contains 6 cognitive levels (knowledge, understanding, application, analysis, synthesis, evaluation), 255 verbs associated with each cognitive level, and approximately 800 synonyms related to each verb. The relationships between the verbs of this thesaurus correspond to the belonging to a cognitive level, either by its inclusion in the set of related verbs, or in the set of synonyms.

In the same way, the alignment of a skill topic against the BLOOM thesaurus requires obtaining a similarity of the topic with the taxonomic levels of the thesaurus. Figure 2 presents two cases of similarity of skill topics in the alignments with the BLOOM thesaurus. As we can see, there is a synonymic relation between “articulate” and “compose”: although they belong to different groups of related verbs (assemble and write), they are under the same cognitive level, “Synthesis”, which determines that both skill topics are similar (case 1). In the same way, there is a relation of similarity between “program” and “develop”, because they are under the cognitive level “Application”, although they do not belong to the same set of related verbs or synonyms (case 2). In summary, the alignment of two skill topics is obtained by first finding the group of the thesaurus in which each topic is found, and then determining whether they have the same cognitive level, or whether one skill topic, being at a higher cognitive level, covers the other.

Fig. 2 Cases of synonymy for different skill topics according to the BLOOM thesaurus

Architecture

Figure 3 shows the general architecture for determining competencies in job advertisements, which consists of the following phases: the first phase performs the alignment of topics against thesauri, obtaining as a result measures of similarity (lexical and semantic); and, the second phase corresponds to the alignment of the profiles based on the similarity measures obtained in the first phase. There is an initial step, for the pre-processing of the texts from the Web (see Fig. 4), in order to obtain the knowledge and skill topics, which is based on a linguistic analysis that uses linguistic patterns formed by sequences of words with specific grammatical categories, defined in (González-Eras and Aguilar 2015).

Fig. 3 Profile alignment architecture

Fig. 4 Phase of obtaining knowledge and skill topics

The pre-processing is based on the approach defined in (González-Eras and Aguilar 2015), which uses the concept of competence and its elements (skills and knowledge), as applied in each domain (academic and professional), to describe them. It then defines logical descriptions to characterize their patterns. These patterns are used during the pre-processing step of our architecture (Fig. 4). In particular, the HTML is first cleaned: tags (headers, numbers, dates, metadata) are deleted, leaving only the text inside <p></p> tags. Then, a morpho-lexical analysis is carried out to extract sentences and words (tokens), which are normalized (capitalization, lemmatization, etc.) (Manning et al. 2009) and labeled with a grammatical category (Faria et al. 2014). Finally, in the analysis of patterns, the patterns are applied to the text in order to recognize the skill and knowledge topics (González-Eras and Aguilar 2015). The whole process is supported by NLP tools that offer libraries for the development of each step in different languages (González-Eras and Aguilar 2019). Some examples of this process are found in section A of the “Experimentation” section (see Figs. 5 and 6).
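The cleaning and tokenization steps described above can be sketched as follows. This is only a minimal illustration using the Python standard library: the class `ParagraphExtractor` and the functions `extract_paragraphs` and `tokenize` are hypothetical names, the sample HTML is invented, and lemmatization and grammatical tagging are left out because they depend on the NLP tool chosen.

```python
import re
from html.parser import HTMLParser
from typing import List


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p>...</p> and ignores everything else."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._in_paragraph = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace left by nested inline tags.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html: str) -> List[str]:
    """Return the text of every <p> element in the given HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def tokenize(sentence: str) -> List[str]:
    """Lower-case the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


# Invented sample page, for illustration only.
sample_html = (
    "<h1>Title</h1><p>Design  <b>relational</b> databases.</p>"
    "<div>2019-01-01</div><p>Programar en Java</p>"
)
paragraphs = extract_paragraphs(sample_html)
tokens = tokenize(paragraphs[0])
```

The grammatical patterns would then be applied to the tagged tokens, a step that is not covered by this sketch.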

Fig. 5 Example of the data processing of the experiment

Fig. 6 Excerpt from the experimental dataset

The alignment of topics against thesauri is carried out using the lexical and semantic similarities defined below, which determine the similarity between the skills and knowledge extracted from the academic profiles and job advertisements and the topics of these thesauri (see Fig. 3). Finally, the topics of the academic profiles and job advertisements that were aligned with the thesauri are aligned with each other. These phases are based on a set of definitions and algorithms, which are given in the remainder of this section.

Statement 1

Let X = {(id1,p1,c),···, (idn,pn,c)} be a valid collection of professional profiles (job advertisements), where idi is a profile identifier, pi is a set of multidimensional features and c has a value as defined in Eq. (1).

$$ c=\begin{cases}1 & \text{if } {p}_i \text{ is an academic profile}\\ 2 & \text{if } {p}_i \text{ is a job profile}\end{cases} $$
(1)

Definition 1

A profile pi is a collection of phrases described according to Eq. (2),

$$ {p}_i=\left\{{F}_1,\dots, {F}_n\right\} $$
(2)

Definition 2

A phrase Fi is a collection of topics described according to Eq. (3),

$$ {F}_i=\left\{\left({H}_1,{C}_1\right),\dots, \left({H}_n,{C}_n\right)\right\} $$
(3)

Where a profile pi can have one or more phrases and each phrase Fi can contain one or more knowledge topics Ci or skill topics Hi. Table 2 presents an example of this structure.

Table 2 Example of statement 1 structure
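The structure given by Statement 1 and Definitions 1 and 2 could be represented, for instance, as follows. This is a minimal sketch: the class names `Profile` and `Phrase`, the field names and the sample topics are illustrative assumptions and are not taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ACADEMIC, JOB = 1, 2  # the two values of c in Eq. (1)


@dataclass
class Phrase:
    """A phrase F_i: a collection of (skill topic H, knowledge topic C) pairs, Eq. (3)."""
    topics: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Profile:
    """One element (id_i, p_i, c) of the collection X in Statement 1."""
    profile_id: str
    kind: int  # ACADEMIC or JOB, Eq. (1)
    phrases: List[Phrase] = field(default_factory=list)  # p_i, Eq. (2)

    def knowledge_topics(self) -> List[str]:
        """All knowledge topics C of the profile, in order of appearance."""
        return [c for phrase in self.phrases for _, c in phrase.topics]

    def skill_topics(self) -> List[str]:
        """All skill topics H of the profile, in order of appearance."""
        return [h for phrase in self.phrases for h, _ in phrase.topics]


# Hypothetical sample values, for illustration only.
example = Profile(
    profile_id="id1",
    kind=ACADEMIC,
    phrases=[Phrase([("design", "databases"), ("program", "operating systems")])],
)
```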

Likewise, for the DISCO and BLOOM thesauri the following definitions are made, which correspond to the structures presented in Figs. 1 and 2:

Statement 2

Let D = {(C’1,n1),···, (C’n,nn)} be a set of knowledge topics organized hierarchically, where a topic C′ belongs to a level n.

Definition 3

A topic C’i may have an associated set of phrases F’, described according to Eq. (4),

$$ {C}_i^{\prime}=\left\{{F}_1^{\prime},\dots, {F}_n^{\prime}\right\} $$
(4)

Definition 4

A phrase F’i is a collection of topics described according to Eq. (5),

$$ {F}_i^{\prime}=\left\{\left({H}_1^{\prime},{C}_1^{\prime}\right),\dots, \left({H}_n^{\prime},{C}_n^{\prime}\right)\right\} $$
(5)

Where a topic C’i may have zero or more competence phrases F’i, and each one of them may contain one or more knowledge topics C’i or skill topics H’i.

Statement 3

Let B = {(TH1,n1),···,(THn,nn)} be a set of hierarchically organized skills topics, where a TH topic belongs to a level n.

On the other hand, for the implementation of the architecture, it is necessary to make the following definitions:

Definition 5

A topic TH has an associated set of related verbs V, described according to Eq. (6),

$$ {TH}_i=\left\{{V}_1,\dots, {V}_n\right\} $$
(6)

Definition 6

A verb Vi has a collection of synonyms S, described according to Eq. (7),

$$ {V}_i=\left\{{S}_1,\dots, {S}_n\right\} $$
(7)

A process of similarity analysis, which is defined by the following stages, uses these definitions:

Lexical Similarity

The calculation of the lexical similarity of knowledge topics is explained in Table 3. With this algorithm, a lexical similarity is established between each knowledge topic Ci (from the profiles) and each topic C’i belonging to the DISCO thesaurus. For this, two measures are considered: the first one, called Dislex(C,C′), uses the Levenshtein measure to determine the edit distance between the topics considered (Levenshtein 1966); the second one, called Simlex(C,C′), uses Dice’s coefficient to determine the similarity between topics according to the similarity of their character pairs (Alqadah and Bhatnagar 2011). These measures are described in the following definitions.

Table 3 Pseudocode of the calculation of the lexical similarity

Definition 7

The edit distance between two topics C and C′ is given by the number of character changes that must be made so that topic C becomes topic C′ (see Eq. (8)) (Levenshtein 1966),

$$ {Dis}_{lex}\left(C,{C}^{\prime}\right)=\begin{cases}\max \left(C,{C}^{\prime}\right) & \text{if } {Dis}_{lex}\left(C,{C}^{\prime}\right)=0\\ \min \left(C,{C}^{\prime}\right) & \text{if } {Dis}_{lex}\left(C,{C}^{\prime}\right)>0\end{cases} $$
(8)

Then, the value of the measure is maximum when the number of changes is zero (C and C’ are equal), and is minimum otherwise.

Definition 8

The lexical similarity between two topics C and C’ is twice the number of pairs of characters that are common to both topics, divided by the sum of the number of pairs of characters in the two topics (see Eq. (9)),

$$ {Sim}_{lex}\left(C,{C}^{\prime}\right)=\frac{2\times \left| pairs(C)\cap pairs\left({C}^{\prime}\right)\right|}{\left| pairs(C)\right| +\left| pairs\left({C}^{\prime}\right)\right|} $$
(9)

Then, for each pair of topics, their characters are compared and a similarity value between zero and one is obtained, where zero represents no similarity and one represents high similarity (Alqadah and Bhatnagar 2011).
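The two lexical measures can be illustrated with a short sketch. It assumes the standard edit distance (insertions, deletions and substitutions) for Definition 7, and for Definition 8 it assumes that the character pairs are the set of adjacent characters of the lower-cased topic, spaces included; the function names are hypothetical and this is not necessarily the exact procedure of Table 3.

```python
from typing import Set


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_pairs(topic: str) -> Set[str]:
    """Set of adjacent character pairs of the lower-cased topic (an assumption)."""
    text = topic.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_similarity(c: str, c_prime: str) -> float:
    """Dice's coefficient over character pairs, following Eq. (9)."""
    pairs_c, pairs_c_prime = char_pairs(c), char_pairs(c_prime)
    total = len(pairs_c) + len(pairs_c_prime)
    if total == 0:
        return 0.0  # assumption: topics too short to form a pair are not similar
    return 2 * len(pairs_c & pairs_c_prime) / total


dis_lex = levenshtein("systems", "operating systems")
sim_lex = dice_similarity("systems", "operating systems")
```

Under these assumptions, “systems” and “operating systems” are ten edits apart, while their Dice similarity stays above 0.4 because every character pair of the shorter topic also occurs in the longer one.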

Statement 4

A distance threshold UD equal to four is established, which corresponds to the maximum edit distance that can exist between C and C’ for them to be considered similar. The UD value was defined by observing the result of the similarity calculation in 100 cases, based on the work done in (Dijkman et al. 2011). Table 4 presents an example of the calculation of the lexical similarity of the topics of the dataset, based on the two measures mentioned. As shown, the use of the two lexical measures increases the coverage of similar topics within the dataset. For example, the topics “operating systems” and “distributed systems” would have a low similarity in relation to the topic “systems” (Dislex > 4) if only the lexical distance were taken into account as a comparative measure. The same happens with the topics “database” and “statistical database”.

Table 4 Example of the calculation of the lexical similarity of dataset topics

Statement 5

A UL threshold equal to 0.4 is established, which corresponds to the minimum lexical similarity that may exist between C and C’ to consider that they have a similarity. The UL value was defined by observing the similarity calculation in 100 cases, based on the work done in (Dijkman et al. 2011; Van Dongen et al. 2013).
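How the two thresholds act together can be sketched as below. The text does not spell out the combination rule, so this sketch assumes that a pair of topics is kept when either measure passes its threshold, which is one reading of the remark that using both measures increases coverage; the function name and the `require_both` switch are hypothetical.

```python
UD = 4    # edit-distance threshold of Statement 4
UL = 0.4  # lexical-similarity threshold of Statement 5


def is_lexically_similar(dis_lex: int, sim_lex: float,
                         ud: int = UD, ul: float = UL,
                         require_both: bool = False) -> bool:
    """Decide whether a topic pair passes the lexical filter.

    dis_lex and sim_lex are the precomputed values of Dis_lex and Sim_lex.
    Assumption: by default one passing measure is enough; set
    require_both=True for the stricter reading in which both must pass.
    """
    close_enough = dis_lex <= ud
    similar_enough = sim_lex >= ul
    if require_both:
        return close_enough and similar_enough
    return close_enough or similar_enough
```

With the default rule, a pair with a large edit distance but a Dice similarity of at least 0.4 is still retained, which matches the behavior described for “systems” and “operating systems”.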

Statement 6

Let P’ = {(c,id1,C1,H1,n),···, (c,idn,Cn,Hn,n)} be a valid topic dataset, where c indicates the type of profile according to Eq. (1), idi is the identifier of the profile, Ci represents the knowledge topic of the profile, Hi is the skill related to the topic, and n is the level of the tree D where the maximum similarity value of the topic is found, which fulfills the thresholds defined in Statements 4 and 5. Table 11 presents an example of this structure, which is the result of the lexical similarity phase.

Semantic Similarity

Semantic similarity calculation of profile topics is explained in the macro algorithm of Table 5.

Table 5 Pseudocode of the calculation of the semantic similarity

The procedure begins with a pairwise alignment analysis of the topics selected by their lexical similarity against the DISCO II thesaurus (Müller-Riedlhuber 2009), by means of the scheme proposed in (González-Eras and Aguilar 2015), in which, for each pair of topics C and C’, the similarity of their ancestors, siblings and children in the thesaurus is verified (Mendonza et al. 2015).

Structural Comparison with DISCO Thesaurus

Definition 9

The semantic similarity of two topics C and C’ is given by the sum of the similarities of ancestors, siblings and children of topic C, divided by 3. Then, for each pair of topics a similarity value is obtained in the range of zero to one, considering that zero represents no similarity and one represents high similarity (see Eq. (10)).

$$ {Sim}_{sem}\left(C,{C}^{\prime}\right)=\frac{SA\left(C,{C}^{\prime}\right)+ SD\left(C,{C}^{\prime}\right)+ SS\left(C,{C}^{\prime}\right)}{3} $$
(10)

Each of these measures is defined below:

Definition 10

The similarity of topics C and C’ will be proportional to the similarity of their ancestor concepts. In this case, the average of the maximum similarities of each ancestor of topics C and C’ is considered (see Eq. (11)),

$$ SA\left(C,{C}^{\prime}\right)=\frac{1}{n}\sum \limits_{i=1}^n\max \left( Sim\left({Anc}_i(C),{Anc}_1\left({C}^{\prime}\right)\right),\dots, Sim\left({Anc}_i(C),{Anc}_n\left({C}^{\prime}\right)\right)\right) $$
(11)

Where:

Anci(C):

ancestor i of topic C.

Sim(Anci(C),Ancj(C’)):

measure of similarity between ancestors of topics C and C’, according to Definition 8.

n:

maximum level of lexical similarity between C and C’.

Definition 11

The similarity of topics C and C’ will be proportional to the similarity of their siblings. In this case, the average of the maximum similarities of the siblings of topics C and C’ is considered (see Eq. (12)),

$$ SS\left(C,{C}^{\prime}\right)=\frac{1}{n}\sum \limits_{i=1}^n\max \left( Sim\left({Sin}_i,{Sin}_1^{\prime}\right),\dots, Sim\left({Sin}_i,{Sin}_n^{\prime}\right)\right) $$
(12)

Where:

Sini:

corresponds to the i-th sibling of topic C.

Sin’j:

corresponds to the j-th sibling of topic C’.

Sim(Sini,Sin’j):

the measure of similarity between the siblings of topics C and C’ according to Definition 8.

n:

maximum level of lexical similarity between C and C’.

Definition 12

The similarity between two topics C and C’ will be proportional to the similarity of their direct descendants. In this case, the average of the maximum similarities of the children of topic C with the children of topic C’ is considered (see Eq. (13)),

$$ SD\left(C,{C}^{\prime}\right)=\frac{1}{n}\sum \limits_{i=1}^n\max \left( Sim\left({Des}_i,{Des}_1^{\prime}\right),\dots, Sim\left({Des}_i,{Des}_n^{\prime}\right)\right) $$
(13)

Where:

Desi:

corresponds to the i-th child of topic C.

Des’j:

corresponds to the j-th child of topic C’.

Sim (Desi, Des’j):

the measure of similarity between the children of topics C and C’ according to Definition 8.

n:

maximum level of lexical similarity between C and C’.
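Definitions 9 to 12 can be sketched on a toy tree as follows. Several choices here are assumptions rather than details of the original algorithm: the thesaurus is stored as a child-to-parent mapping, each average is taken over the relatives of C, an empty set of relatives contributes zero, and the pairwise measure `sim` is passed in as a parameter (Definition 8 would be used in practice; a trivial exact-match stand-in is used here so the example stays short). The tree only borrows topic names from case 3 of Fig. 1 and is not an excerpt of DISCO II.

```python
from typing import Callable, Dict, List, Optional

Tree = Dict[str, Optional[str]]  # topic -> parent topic (None for the root)
Sim = Callable[[str, str], float]


def ancestors(tree: Tree, topic: str) -> List[str]:
    """All topics on the path from the parent of `topic` up to the root."""
    result = []
    parent = tree.get(topic)
    while parent is not None:
        result.append(parent)
        parent = tree.get(parent)
    return result


def children(tree: Tree, topic: str) -> List[str]:
    return [t for t, parent in tree.items() if parent == topic]


def siblings(tree: Tree, topic: str) -> List[str]:
    parent = tree.get(topic)
    if parent is None:
        return []
    return [t for t in children(tree, parent) if t != topic]


def group_similarity(group_c: List[str], group_c2: List[str], sim: Sim) -> float:
    """Average, over the relatives of C, of the best match among the relatives of C'."""
    if not group_c or not group_c2:
        return 0.0  # assumption: nothing to compare means no similarity
    return sum(max(sim(a, b) for b in group_c2) for a in group_c) / len(group_c)


def semantic_similarity(tree: Tree, c: str, c2: str, sim: Sim) -> float:
    """Eq. (10): mean of the ancestor (SA), descendant (SD) and sibling (SS) terms."""
    sa = group_similarity(ancestors(tree, c), ancestors(tree, c2), sim)
    sd = group_similarity(children(tree, c), children(tree, c2), sim)
    ss = group_similarity(siblings(tree, c), siblings(tree, c2), sim)
    return (sa + sd + ss) / 3


def exact_match(a: str, b: str) -> float:
    """Stand-in for Definition 8, used only to keep the example self-contained."""
    return 1.0 if a == b else 0.0


toy_tree: Tree = {
    "computer science": None,
    "knowledge of databases": "computer science",
    "database analysis": "knowledge of databases",
    "data modeling": "knowledge of databases",
}
score = semantic_similarity(toy_tree, "database analysis", "data modeling", exact_match)
```

In this toy case the two topics share all their ancestors, so the ancestor term is one, while the sibling and descendant terms are zero under the exact-match stand-in, giving a score of one third.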

Statement 7

It is considered that md represents the root topic of the subtree of the DISCO thesaurus where topics C and C’ reach a maximum similarity value.

Structural Comparison with the Bloom Thesaurus

The structural comparison of skill topics consists of comparing them with the skill topics of the BLOOM thesaurus of Statement 3, which implies identifying the cognitive level that is the root of the subtree where the compared topics are located. For this purpose, the following statement is made:

Statement 8

It is considered that mb represents the cognitive level that is the root of the subtree of the BLOOM thesaurus, where skill topics H and H′ are found. Table 6 presents examples of semantic similarity on skill topics of the profiles. As we can see, there is a similarity relation between “innovate and advise” (lines 3 and 4), since both topics belong to the same subtree of the BLOOM thesaurus, whose root corresponds to cognitive level “Apply”, so that a relation of synonymy is established between them. For the other topics, there is a relation of non-similarity, because they do not share the same subtree in the thesaurus.

Table 6 Comparison of skill topics with the BLOOM Thesaurus
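The comparison against the BLOOM thesaurus can be sketched as a lookup of the cognitive level under which each skill topic falls. The dictionary below is a hypothetical toy excerpt built around the examples of Fig. 2: the groups “assemble” and “write” come from the text, whereas the groups “apply” and “use” are invented placeholders, and the function names are illustrative.

```python
from typing import Dict, List, Optional

# Toy excerpt: cognitive level -> related verb -> synonyms (partly hypothetical).
BLOOM: Dict[str, Dict[str, List[str]]] = {
    "Synthesis": {"assemble": ["articulate"], "write": ["compose"]},
    "Application": {"apply": ["program"], "use": ["develop"]},
}


def cognitive_level(skill: str,
                    thesaurus: Dict[str, Dict[str, List[str]]] = BLOOM) -> Optional[str]:
    """Return the cognitive level (mb) whose subtree contains the skill topic, if any."""
    skill = skill.lower()
    for level, verbs in thesaurus.items():
        for verb, synonyms in verbs.items():
            if skill == verb or skill in synonyms:
                return level
    return None


def same_cognitive_level(h: str, h_prime: str,
                         thesaurus: Dict[str, Dict[str, List[str]]] = BLOOM) -> bool:
    """Two skill topics are treated as similar when they share a cognitive level."""
    level = cognitive_level(h, thesaurus)
    return level is not None and level == cognitive_level(h_prime, thesaurus)
```

A skill topic that is absent from the thesaurus yields no level and is therefore never considered similar to another topic in this sketch.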

For results registration of topic semantic similarity, the following statements are made:

Statement 9

Let TC = {(c, id1, C1, Ms1, md1), ···, (c, idn, Cn, Msn, mdn)} be a valid knowledge topic dataset, where c indicates the profile type according to Eq. (1), idi is the profile identifier, Ci represents the profile knowledge topic, Ms is the maximum measure of semantic similarity Simsem(Ci,C’i), obtained according to Eq. (10), and md is the root topic of the DISCO thesaurus subtree according to Statement 7. Table 12 presents an example of this structure, which is the result of the semantic similarity phase of knowledge topics.

Statement 10

Let TH = {(c, id1, H1, Ms1, mb1), ···, (c, idn, Hn, Msn, mbn)} be a valid skill topic dataset, where c indicates the profile type according to Eq. (1), idi is the profile identifier, Hi represents the profile skill topic, Ms is the maximum measure of semantic similarity Simsem(Hi,H’i), obtained according to Eq. (10), and mb is the root topic of the BLOOM thesaurus subtree according to Statement 8. Table 13 presents an example of this structure, which is the result of the semantic similarity phase of skill topics.

Alignment

The profile alignment process is performed according to the root topic of the thesaurus subtree in which the similarity relation of the knowledge and skill topics was found in the semantic similarity phase. Table 7 explains the process in the following macro algorithm:

Table 7 Alignment phase pseudocode

The alignment of the profiles begins with the filtering of knowledge and skill topics using the Us threshold, which selects those topics with a greater measure of semantic similarity according to Eq. (10). Then, for each one of the remaining knowledge and skill topics, its frequency in the profiles is calculated. With this frequency, the position of the profiles is established according to the root topic of the thesaurus tree to which they belong (Jones et al. 2000). These measures are described in the following definitions.

Statement 11

A threshold Us equal to 0.45 is established, which corresponds to the minimum measure that must exist between C and C′ to consider that they have a semantic similarity. The value of Us was defined by observing the result of the similarity calculation in 100 cases, based on the work done in (González-Eras and Aguilar 2015). Table 8 presents examples of the calculation of the semantic similarity on the knowledge topics of the profiles. As shown, the calculation process yields those pairs of topics that are semantically similar according to the structure of the DISCO thesaurus tree, associating this similarity with the root topic of the subtree that they share in the thesaurus. The similarity threshold (Us = 0.45) allows detecting the knowledge topics that present a greater semantic similarity.

Table 8 Instance Comparison using the DISCO Thesaurus

Definition 14

The relevance value of a profile consists of the position of the profile within the collection of profiles analyzed, based on the knowledge and skill topics it contains. In particular, the relevance value of a profile idi according to the thesaurus topic mdi (knowledge or skill) is given by Eq. (14). This is a measure used in information retrieval, known as Okapi BM25, which orders the documents by relevance as a function of the topics they contain (Robertson and Zaragoza 2009). The classical TF-IDF metric takes into account the frequency of occurrence of a topic within the collection of documents (Jones et al. 2000), but Okapi BM25 is more sensitive because it also takes into account the length of the documents (Yuanhua and Zhailk 2011).

$$ Score\left({id}_i,{md}_i\right)=\sum \limits_{i=1}^n IDF\left({md}_i\right)\cdot \frac{f\left({md}_i,{id}_i\right)\cdot \left({k}_1+1\right)}{f\left({md}_i,{id}_i\right)+{k}_1\cdot \left(1-b+b\cdot \frac{\left|D\right|}{avgdl}\right)} $$
(14)

Where:

f(mdi, idi):

is the frequency of topic mdi in the idi profile, according to Definition 15.

| D |:

is the number of topics (length of profile idi).

avgdl:

is the average length of the profiles that make up the collection.

k1 and b:

adjustable parameters of the function Score(idi,mdi) (Footnote 2), which tune it to the specific characteristics of the set of profiles (frequency of topics and length of the document, respectively) (Robertson and Zaragoza 2009).

IDF (mdi):

is the weight given to topic mdi in the collection of profiles, according to Definition 16.

n:

number of profiles in the collection.

Definition 15

The frequency of appearance of a topic is the number of knowledge topics in the profile whose thesaurus sub-tree has that topic as its root. The frequency of occurrence of the topic mdi in idi is given by Eq. (15),

$$ f\left({md}_i,{id}_i\right)=\sum \limits_{i=1}^n{md}_i $$
(15)

Where n represents the total of md topics found in the idi profile (Manning et al. 2009).

Definition 16

The weight IDF(mdi) of a topic mdi is given by its inverse frequency in relation to the collection of profiles, as presented in Eq. (16),

$$ IDF\left({md}_i\right)=\log \frac{N-n\left({md}_i\right)+\delta }{n\left({md}_i\right)+\delta } $$
(16)

Where N is the number of profiles in the collection, n(mdi) is the number of profiles that contain the topic mdi, and δ is a parameter of adjustment to the weight given to a topic, according to the characteristics of its frequency in the collection of profiles and the length of the documents (Yuanhua and Zhailk 2011).
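The ranking defined by Eqs. (14) to (16) can be sketched as follows. The sketch computes one term of Eq. (14) per profile and root topic, represents each profile as the list of root topics found in it, and uses k1 = 1.2, b = 0.75 and δ = 0.5, which are common defaults in the information retrieval literature and not necessarily the values used in this work; the function names and the sample collection are hypothetical.

```python
import math
from typing import Dict, List

Profiles = Dict[str, List[str]]  # profile id -> root topics found in the profile


def idf(topic: str, profiles: Profiles, delta: float = 0.5) -> float:
    """Eq. (16): inverse frequency of the topic over the collection of profiles."""
    n_total = len(profiles)
    n_containing = sum(1 for topics in profiles.values() if topic in topics)
    return math.log((n_total - n_containing + delta) / (n_containing + delta))


def bm25_score(profile_id: str, topic: str, profiles: Profiles,
               k1: float = 1.2, b: float = 0.75, delta: float = 0.5) -> float:
    """One Okapi BM25 term of Eq. (14) for a profile and a root topic."""
    topics = profiles[profile_id]
    frequency = topics.count(topic)  # f(md, id), Eq. (15)
    avgdl = sum(len(t) for t in profiles.values()) / len(profiles)
    if avgdl == 0:
        return 0.0  # empty collection of topics: nothing to rank
    length_norm = 1 - b + b * len(topics) / avgdl
    return idf(topic, profiles, delta) * frequency * (k1 + 1) / (frequency + k1 * length_norm)


def rank_profiles(topic: str, profiles: Profiles) -> List[str]:
    """Profile ids ordered from most to least relevant for the topic."""
    return sorted(profiles, key=lambda pid: bm25_score(pid, topic, profiles), reverse=True)


# Hypothetical collection, for illustration only.
collection: Profiles = {
    "id1": ["databases", "databases", "programming"],
    "id2": ["databases", "networks", "programming"],
    "id3": ["networks", "programming", "programming"],
    "id4": ["networks", "networks", "programming"],
    "id5": ["programming", "networks", "networks"],
}
ranking = rank_profiles("databases", collection)
```

Note that with this IDF form a topic that occurs in more than half of the profiles receives a negative weight, so the choice of δ and the treatment of very frequent topics would need to follow the actual configuration of the experiment.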

Experimentation

Processing of Experimental Data

The general architecture for determining competencies of the “Architecture” section has been automated. In this section, we study its behavior. For the experiment, 35 documents in Spanish were taken as input: 20 academic profiles, obtained from university portals (id1, ..., id20), and 15 job advertisements, obtained from internet employment portals (id21, ..., id35). From each profile, text extracts were selected from sections such as description, objectives, competencies, skills and knowledge. These sentences contain elements of competence, such as skills and knowledge, examples of which are shown in Table 1.

The first step is the pre-processing of the texts to obtain the knowledge and skill topics, according to the procedure indicated in the general architecture (see “Characterization of the Competence Concept” section). It starts with a linguistic analysis to recognize the instances of knowledge and skill, based on linguistic patterns (González-Eras and Aguilar 2015). Figure 5 presents an example of the analysis for the first sentence of the profiles represented in Table 2, where the knowledge instances are recognized according to patterns formed by noun, preposition or adjective sequences ([NC], [NC-SP-NC], [NC-NC], [NC-AQ]), while skill instances are recognized by patterns with verb, noun, preposition or conjunction sequences ([VMN], [NC-SP], [NC-CC-NC]) (see Footnote 3) (González-Eras 2017).
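The pattern-based recognition described above can be sketched as a scan over part-of-speech tag sequences. The sketch below is only illustrative: the tagged sentence is written by hand (in practice the tags would come from a POS tagger), the function name is hypothetical, and the choice of trying longer patterns first is an assumption, not something stated in the text.

```python
# Tag patterns quoted in the text; longer patterns are tried first (assumption).
KNOWLEDGE_PATTERNS = [("NC",), ("NC", "SP", "NC"), ("NC", "NC"), ("NC", "AQ")]
SKILL_PATTERNS = [("VMN",), ("NC", "SP"), ("NC", "CC", "NC")]

def find_instances(tagged, patterns):
    """Return the word sequences whose tag sequence matches one of the patterns."""
    ordered = sorted(patterns, key=len, reverse=True)
    found, i = [], 0
    while i < len(tagged):
        for pattern in ordered:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                found.append(" ".join(word for word, _ in window))
                i += len(pattern)  # skip past the matched words
                break
        else:
            i += 1  # no pattern starts here; move to the next token
    return found

# Hand-tagged, hypothetical example sentence ("to design computer systems").
tagged_sentence = [("diseñar", "VMN"), ("sistemas", "NC"), ("informáticos", "AQ")]

print(find_instances(tagged_sentence, KNOWLEDGE_PATTERNS))
print(find_instances(tagged_sentence, SKILL_PATTERNS))
```

Here the noun-adjective sequence is picked up as a knowledge instance and the infinitive as a skill instance.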

As a result, 93 instances of knowledge and 70 instances of skill were detected in the academic profiles, while in the job advertisements 204 instances of knowledge and 96 of skills were detected. In Fig. 6, an extract of the structure of the dataset is presented, which was organized, as indicated in statement 1 and definitions 1 and 2.

Semantic Sources

For the present experiment, Table 9 presents the root topics of the sub-trees of the DISCO thesaurus against which the profiles are aligned, which correspond to sub-areas of Computer Science. In the same way, Table 10 presents the root topics of the BLOOM thesaurus sub-trees, which correspond to the cognitive levels defined in Bloom’s taxonomy.

Table 9 Definition of the roots of the subtrees of the thesaurus Disco II
Table 10 Definition of the topics of the subtrees of the thesaurus Bloom
Table 11 Calculation of the lexical similarity for topics of the profiles
Table 12 Calculation of the semantic similarity for the knowledge topics of disco thesaurus
Table 13 Calculation of semantic similarity for topics of skill with Bloom
Table 14 Calculation of the alignment of profiles and topics of knowledge
Table 15 Calculation of the alignment of profiles and topics of skill
Table 16 Results of the alignment of profiles to the DISCO II thesaurus
Table 17 Results of the alignment of profiles to the BLOOM thesaurus

Profiles Alignment

Phase 1: Lexical Similarity

In this phase, the lexical similarity between the knowledge topics of the profiles and the DISCO Thesaurus is sought through the similarity measures presented in Eqs. (8) and (9) of the macro algorithm of Table 3.

It is observed that the distance calculated between the instances C and C′, according to Definitions 7 and 8 and the UL threshold (Statement 5), allows identifying the topic in the DISCO thesaurus with which each topic of the profiles has the greatest lexical similarity. This is important for the next phase, because it tells us the level of the thesaurus where this topic is located, and thus the subtree on which the taxonomic measure will be calculated. Table 11 shows an extract of the result of the calculation of the lexical similarity for the instances of the dataset of Fig. 5, according to Statement 6. As is seen in the table, using the measures together with the threshold allows identifying the subtree most likely to relate to the context of the topic. For example, “statistical databases - databases” and “Java – Java Language” share the same context, so there is a semantic similarity between these topics.
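As an illustration of this phase, the following sketch computes a normalized edit-distance similarity and a Dice coefficient over character bigrams, and keeps the best thesaurus candidate above a threshold. It is not the formulation of Eqs. (8) and (9), which are defined earlier in the paper: averaging the two measures, the bigram tokenization, the threshold value and all function names are assumptions made for the example.

```python
from collections import Counter

def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def dice_similarity(a, b):
    """Dice coefficient over character bigrams (tokenization is an assumption)."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2 * sum((grams_a & grams_b).values()) / total

def best_match(topic, thesaurus_topics, threshold=0.3):
    """Return (thesaurus topic, score) with the highest combined similarity,
    or None if no candidate reaches the (hypothetical) threshold."""
    topic = topic.lower()
    scored = []
    for candidate in thesaurus_topics:
        c = candidate.lower()
        # Assumption: the two measures are simply averaged.
        score = (levenshtein_similarity(topic, c) + dice_similarity(topic, c)) / 2
        scored.append((candidate, score))
    best = max(scored, key=lambda pair: pair[1], default=None)
    return best if best is not None and best[1] >= threshold else None

print(best_match("Java", ["Java language", "databases"]))
```

Combining the two measures softens the penalty that the edit distance alone gives to pairs of very different lengths, such as “Java” and “Java language”.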

Phase 2: Semantic Similarity

In this phase, we look for the semantic similarity between the knowledge topics and the DISCO thesaurus according to Eq. (10), and for the similarity between the skill topics and the BLOOM thesaurus according to Statement 8. For that, the macro algorithm of Table 5 is invoked.

It is observed that the measure calculated between the instances C and C′, according to Definition 9, allows identifying the general topic of the DISCO thesaurus with which each topic of the profiles has the greatest similarity. Table 12 shows an extract of the result of the calculation of the semantic similarity for the topics of the dataset of Fig. 5, according to Statement 9. For example, computer applications, computer projects, SQL language, Java language and Java have a relation of similarity with the general thesaurus topic “programming”, which confirms that they belong to the same context, in this case Programming. It is worth noting that, although the values of Ms are close to the threshold (Us > 0.45), there is a fairly clear similarity between these topics. In addition, the similarity value of the topic “Java” shows that it exists verbatim in the DISCO thesaurus, which is why the topic “Java language” obtains a lower similarity value (0.58).

In the same way, Table 13 presents the calculation of the similarity measure for the instances H and H′, using Statement 8. As is shown, according to the calculated measure, the administrator, direct, interaction and planning instances have a semantic relationship with the general topic “Synthesis” of the BLOOM thesaurus. In the case of the skill topics “develop” and “program”, there is evidently a relation of synonymy within the context of “Application”.
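The idea of locating the general (root) topic of a sub-tree and scoring two topics by their position in the taxonomy can be sketched as follows. This is not the ACO-inspired measure used in this work: a Wu-Palmer-style score is swapped in as a simple stand-in, and the miniature taxonomy, topic names and function names are hypothetical.

```python
# Hypothetical miniature taxonomy: topic -> parent (None marks a sub-tree root).
PARENT = {
    "programming": None,
    "programming languages": "programming",
    "java": "programming languages",
    "sql": "programming languages",
    "databases": None,
    "relational databases": "databases",
}

def path_to_root(topic):
    """List of topics from the given topic up to the root of its sub-tree."""
    path = [topic]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def root_of(topic):
    """General topic (sub-tree root) under which a topic is located."""
    return path_to_root(topic)[-1]

def wu_palmer(a, b):
    """Stand-in taxonomic similarity: 2 * depth(lca) / (depth(a) + depth(b)),
    with the root at depth 1; 0.0 when the topics share no ancestor."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    for node in path_a:  # walking upward, the first hit is the lowest common ancestor
        if node in ancestors_b:
            return 2 * len(path_to_root(node)) / (len(path_a) + len(path_b))
    return 0.0

print(root_of("java"))
print(wu_palmer("java", "sql"))
print(wu_palmer("java", "relational databases"))
```

Topics in different sub-trees score zero here, which mirrors the role of the sub-tree as the context in which similarity is evaluated.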

Phase 3: Alignment

In this phase, using the previous similarity measures, we determine the alignment of the profiles of the collection (their knowledge and skill topics) with the thesauri. With the aligned topics, we establish the topics around which the documents (academic profiles and job advertisements) are related to each other, and those in which they have no relationship. For that, the algorithm of Table 7 is invoked. An example of this process on 3 documents (id1, id2 and id21) is given below. The results of the alignment phase for the entire collection of documents are presented in the “Discussion about the Obtained Results” section.

Table 14 and Fig. 7 show the result of the alignment of the profiles based on the knowledge topics, according to Definition 14, considering k1 = 1.2, b = 0.75 and δ = 1. It is observed that the profiles id2 and id21 are aligned around the topics Tc6 and Tc7 (programming and knowledge of databases, respectively): id21 presents the greater relevance value for the topic “programming” (0.62 versus 0.59), while id2 has the higher relevance value for the topic “knowledge of databases” (0.41 versus 0.18). There is also an alignment between id1 and id2 around the topic Tc3 (fields of specialization in IT), where id1 has the highest relevance value (0.29 against 0.16). These results indicate that the academic profile id2 covers the requirements of the job advertisement id21, whereas id1 does not present any alignment with id21. In addition, the high relevance value that id21 reaches in the topic Tc6 (programming) gives a first notion of feedback from the work context to the academic context, emphasizing the importance that companies give to this topic in their job offers.

Fig. 7 Alignment of profiles id1, id2 and id21 according to knowledge topics

On the other hand, the relevance value of id1 in the topic Tc1 (installation and IT configuration) exceeds the value of 1 (1.33), because the number of profiles containing the topic Tc1 within the collection (n(mdi)) is low, with respect to the other topics (Tc1 is presented in 3 academic profiles and 1 job advertisement). Consequently, the relevance equation gives it a greater weight (IDF(mdi) total of 0.923).
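The reading of the alignment made above can be reproduced with a small sketch that takes the relevance values quoted in the text and lists the topics two documents share. Only the values mentioned in this section are included (Table 14 may contain more entries), and the function names are hypothetical.

```python
# Relevance values quoted in the text for the knowledge topics.
relevance = {
    "id1": {"Tc1": 1.33, "Tc3": 0.29},
    "id2": {"Tc3": 0.16, "Tc6": 0.59, "Tc7": 0.41},
    "id21": {"Tc6": 0.62, "Tc7": 0.18},
}

def shared_topics(doc_a, doc_b):
    """Topics on which both documents have a non-zero relevance value."""
    topics_a = {t for t, v in relevance[doc_a].items() if v > 0}
    topics_b = {t for t, v in relevance[doc_b].items() if v > 0}
    return sorted(topics_a & topics_b)

def uncovered_topics(job_ad, academic_profile):
    """Topics required by the job advertisement that the profile does not cover."""
    covered = set(shared_topics(job_ad, academic_profile))
    return sorted(t for t in relevance[job_ad] if t not in covered)

print(shared_topics("id2", "id21"))     # topics where id2 and id21 align
print(shared_topics("id1", "id21"))     # no common topic
print(uncovered_topics("id21", "id1"))  # requirements of id21 not met by id1
```

An empty intersection, as for id1 and id21, corresponds to the case where the academic profile does not align with the job advertisement.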

In the same way, Table 15 and Fig. 8 present the alignment of the profiles based on the skill topics, according to Definition 14, considering k1 = 1.2, b = 0.75 and δ = 1.

Fig. 8 Alignment of profiles id1, id2 and id21 according to skill topics

It is observed that the profiles id1, id2 and id21 are aligned around the topic Th3 (application), with id1 presenting the greatest relevance value (0.46 against 0.41 and 0.12), which indicates that the academic profiles give great importance to the application of knowledge. There is also an alignment between id1, id2 and id21 around the topic Th5 (creation), where the three documents have very close relevance values (0.10, 0.11 and 0.10, respectively), which indicates that both academic profiles cover id21 in terms of the ability to create knowledge. In the same way, id21 and id1 are aligned around the topic Th6 (evaluation), highlighting this skill as a requirement of the job context that is also present in the academic profile id1, but at a lower level (0.30 against 0.18, respectively). Finally, we identify that Th4 (synthesis) is a skill requested by companies that is not considered in the academic profiles id1 and id2.

Discussion about the Obtained Results

Figures 9 and 10 present the results of the alignment of academic profiles and job advertisements (id1, …, id35) with the knowledge topics of the DISCO II thesaurus (Tc1, …, Tc15). As is seen, on average the documents of the collection focus on the topics “software development” (Tc2), “IT specialization fields” (Tc3), “IT analysis” (Tc5), “programming” (Tc6), “knowledge of databases” (Tc7) and “operating systems” (Tc8). Some topics, like “IT project management” (Tc9), “IT administration” (Tc10), or “network technology” (Tc15), have a high alignment with one or several of the documents, but in general, their averages in the collection are low. Overall, the topics with the highest average alignment in the documents of the collection are those in the interval Tc1 to Tc8. For the other topics, the average is lower or there is no alignment, as in the case of Tc4 (IT consulting).

Fig. 9 Alignment of the documents with the knowledge topics (Tc1 to Tc7) of the DISCO II thesaurus

Fig. 10 Alignment of the documents with the knowledge topics (Tc8 to Tc15) of the DISCO II thesaurus

Table 16 presents the alignment values of the profiles (documents) for the topics with the highest average alignment. In summary, the collection of academic profiles and job advertisements presents a tendency towards the first 8 topics of the DISCO II thesaurus. For example, for the topic Tc2, the documents have the following alignment values, from highest to lowest (see Table 16): id6 (0.78), id3 (0.66), id1 (0.54), id8 and id27 (0.5), id11 and id25 (0.45), id24 (0.41).

Figure 11 presents the results of the alignment of academic profiles and job advertisements according to the skill topics of the BLOOM thesaurus. As is seen, the documents of the collection focus on the topics “application” (Th3), “synthesis” (Th4), “creation” (Th5) and “evaluation” (Th6), and the topics with the highest averages are Th3 and Th6. For the other topics, the average is smaller or there is no alignment, as in the case of Th1 (knowledge) and Th2 (understanding). Table 17 presents the relevance value for the topics with the highest average alignment in the documents.

Fig. 11 Alignment of the profile collection according to skill topics

We can say, then, that the collection of academic profiles and job advertisements presents a tendency towards the topics Th3, Th4, Th5 and Th6 of the BLOOM thesaurus, establishing feedback between them according to their relevance values. For example, for the topic Th3, the profiles have the following values, from highest to lowest (see Table 17): id6 and id18 (0.55), id17 and id10 (0.47), id1 (0.46), id2 and id14 (0.41), id20 (0.37), id19 (0.33), id12 (0.30), id4 (0.25), id27 (0.15), id16 and id25 (0.14), id21 (0.12), id28 (0.10), id34 (0.08) and id33 (0.06).

With the results obtained in this phase, we can establish which academic profiles and job advertisements are aligned to the same topic of the thesauri, and how strong these alignments are. In addition, feedback between them can be provided; for example, which academic profiles cover the knowledge topics required by job advertisements, or which universities have their academic profiles aligned to a work topic. In the same way, we can establish which skills the job advertisements require and which academic profiles can cover them. These results can be used in different contexts, such as the planning of professional careers and the recruitment of personnel, among other domains.

Comparison with Other Works

Lexical similarity measures, such as the edit distance (Levenshtein 1966), have been used in other works to find the similarity between concepts (Alqadah and Bhatnagar 2011) or process names (Dijkman et al. 2011), or, in the context of learning analytics approaches, to identify correlations between multimodal learning data, using different similarity metrics, such as the temporal similarity or the temporally relaxed similarity (Worsley and Blikstein 2018). In the same way, in the field of competencies, they have been used to establish hierarchical relationships between knowledge topics (Malzahn et al. 2013), and to classify documents using Dice’s coefficient as a measure of similarity (Gomaa and Fahmy 2013); other works use them to find the best contractors to perform business process tasks, comparing the candidates’ skills and knowledge with those tasks (Pawełoszek 2017; Sanchez et al. 2018). In the present work, we use the lexical distance and Dice’s coefficient to find the lexical similarity between knowledge topics and the topics of a thesaurus, where the combination of the measures allows handling the limitations caused by the length of the topics when only the edit distance is used (Kalmukov 2013).

With the use of thesauri and structural similarity measures, problems of ambiguity between the topics can be solved. For this, the taxonomy of the thesaurus is used to find common levels among them (Harispe et al. 2013), associating them with the same context, in this case an area of knowledge. There are works where this concept is applied to align competencies, creating semantic networks (Malzahn et al. 2013; Sánchez et al. 2015) and graphs (Rácz et al. 2018). In others, it is used to determine suitable candidates to meet specific requirements, and to align their profiles with job offers according to a thesaurus (WordNet) (Montuschi et al. 2015). Also, the combination of structural similarity measures and thesauri allows the design of generic skills and accreditation requirements for university careers (Gluga et al. 2013) and smart learning environments (Paquette 2016). In the present work, a similarity measure based on the adaptation of ant colony optimization (ACO) algorithms is used (González-Eras and Aguilar 2015) to find semantic similarities between the topics found in job advertisements and the taxonomies of the DISCO II and BLOOM thesauri, with the purpose of solving ambiguity problems. There are several works that use linked data vocabularies for the representation of professional offers or in the educational context (Smirnov et al. 2016; Faria et al. 2014; Sateli et al. 2017), but they have not been used for the representation of the skills and knowledge of competencies.

Regarding the calculation of the relevance of profiles according to the topics of the thesauri, Eq. (14) is a variation of the probabilistic model proposed in (Yuanhua and Zhailk 2011), sensitive to the frequency of the topics aligned to the thesauri and to the length of the documents, expressed in topics (Definition 14). There are works where this model is used to identify profiles of authors according to their characteristics (Weren et al. 2014); in others, the Okapi model is used for recommendation, where a weight is established to manage the frequency of the topics based on the characteristics of the collection (Nishioka et al. 2015). In our work, we consider δ values of 0.5 and 1 for the collections of academic profiles and job advertisements, respectively, because the two collections differ in their frequency distribution and in the number of thesaurus topics used. This weight represents the importance given to a topic in Eq. (16), according to its frequency and the length of the documents. With the relevance equation proposed in Definition 14, these differences can be handled to obtain results similar to those of other investigations.

Rosa et al. (2015) present MultCComp, a multi-temporal context-aware system for competence management, which considers the workers’ present and past contexts to help them develop their competencies. Also, Rau (2017) models the knowledge components of students’ competencies, based on the hypothesis that knowledge-component models that describe both content knowledge and representational competencies should be more accurate than knowledge-component models that describe only content knowledge. That work concludes that students can learn abstract content knowledge only if they have a prerequisite level of representational competencies, and that educational technologies should use adaptive knowledge-component models that capture the representational competencies the student has not yet mastered. In our work, we propose managing competencies through similarity algorithms that align the knowledge and skill topics found in academic profiles and job advertisements against the topics present in a competence thesaurus. This establishes a semantic measure for the competence alignment between academic profiles and job advertisements, based on the similarity of their knowledge and skill topics.

There are Applicant Tracking Systems, like Jobscan, which analyze resumes and job descriptions, recognizing job titles, education levels and skills, to establish a ranking and give recommendations. Also, there are Automated Resume Screening applications, such as Ideal, which select candidates according to their experience and skills, among other things. In addition, there are patents based on natural language processing techniques for analyzing candidates’ resumes (Dane 2012). These tools differ from our model in terms of the profiles, the text processing techniques, and the knowledge bases used to align the data. Our model uses linguistic patterns for the recognition and alignment of professional profiles based on competencies. In general, all these applications contribute to improving the candidate recruitment process, but none is oriented towards labor competence analysis to match academic profiles.

Conclusions and Future Work

The present work presents a model of alignment of academic profiles and job advertisements based on competencies, combining measures of lexical and semantic similarity, and the adaptation of a measure of relevance to the frequency of the topics and the length of the profiles. The obtained results allow determining the similarity of the profiles against the knowledge and skill topics of the DISCO II and BLOOM thesauri, and in this way to establish the relevance of the profile alignments based on a ranking measure.

The difference between our work and others lies in the analysis of job advertisements in Spanish through the combined use of lexical and semantic measures. This combination allows obtaining values on profiles in Spanish similar to those of other works that analyze profiles in other languages. Another novelty is the use of the Okapi BM25 measure for the alignment of profiles, instead of the traditional TF-IDF algorithm, with a modification that makes this measure sensitive to the relationship between topic frequency and document length expressed in topics. In addition, the use of two thesauri (DISCO II and BLOOM) strengthens the alignment process by using the topics contained in them. In this way, through the measures used, we have achieved the alignment of academic profiles and job advertisements, detecting the academic topics with which the job offers are most aligned, and giving feedback on those topics of job offers that did not align with any academic profile.

Our proposal can be applied in academic contexts, for the development of academic management systems and decision-making based on competencies, such as semantic search engines for educational resources, automatic creation of academic profiles of careers and subjects according to work requirements, and evaluation of tasks, exams and courses based on professional competencies (González-Eras et al. 2017; Guevara et al. 2017; González-Eras and Aguilar 2019; Sánchez et al. 2015). Also, the model is applicable in the recruitment context, aligning candidate resumes and job advertisements according to their knowledge and skills, or for the automatic definition of staff training plans, according to the current competencies of the employees and the skills and knowledge required in their job positions.

The flexibility of the proposed model allows it to be applied not only to the context of Computer Science, but also to other areas of knowledge for which knowledge bases in Spanish are available. In addition, our proposal can be used in other languages and domains, through the use of the appropriate NLP tools and knowledge bases, according to the required language and context, respectively. For instance, in our case, we have used Python and the Stanford CoreNLP libraries for Spanish as NLP tools (Manning et al. 2014). For the semantic analysis of the domain, we have used DISCO II, which is a multilingual thesaurus for different contexts (such as education, labor market, etc.) that offers the mapping of competencies in several languages.

The next steps are aimed at testing the scheme on a larger corpus of profiles, improving the mechanisms for detecting topics in the profiles, and analyzing other groups of profiles and job offers. All of the above will allow us to observe their alignment around the topics of the DISCO II and BLOOM thesauri. Likewise, it is necessary to initiate experiments to replicate the proposed model with other knowledge bases in the domain of Computer Science, as is the case of ACM.

This work is part of an architecture for the analysis of job advertisements, which includes a phase that characterizes them according to competencies, representing linguistic and semantic aspects through descriptive and dialectical logic, which allows the recognition of the knowledge and skill topics in the documents. Future work will concentrate on completing the feedback phase, based on classification and clustering of topics, and on the definition of learning analytics tasks (Sanchez et al. 2018) and intelligent recommender systems (Aguilar et al. 2017) of educational resources based on this architecture. Finally, future work will analyze the relationship between Applicant Tracking Systems, Automated Resume Screening applications and our approach, in order to extend them with labor competence analysis and academic profile capabilities.