1 Introduction

Among the various pieces of information that a user profile in an adaptive system may include is the user’s competence in a specific knowledge domain. In this article, we propose a system able to implicitly assess the user’s expertise in a particular topic based on her publications on it (e.g., scientific papers) that are available through online bibliographic databases, such as Scopus, Google Scholar, and ResearchGate. The proposed system takes as input a candidate user u and a specific knowledge area ka and returns a score(u, ka) expressing the level of competence of u in ka. This task is performed through two different approaches, both relying on a graph-based model. The first approach (content-based) considers the text content, whereas the second one (collaborative) analyzes the relationships among the same content in terms of co-citations. Specifically, the content-based approach retrieves the most relevant documents for a given knowledge area ka, extracts the most significant entities, and stores them in a graph database. Then, it performs the same operations on the documents generated by u on ka and builds a second graph. Finally, the similarity between the two graphs is computed in order to estimate score(u, ka). The collaborative approach likewise involves the collection of documents related to the topic ka, but takes into account only the co-citations among them and, therefore, their authors. In this case, score(u, ka) is computed through a version of the well-known Hyperlink Induced Topic Search (HITS) algorithm [16], which considers the incoming and outgoing edges among nodes.

2 The Proposed System

Nowadays, the increasing availability of online material has led to the need for adaptive systems for its personalized selection [6, 7, 10], based on the target user’s characteristics. Those systems can take into account the personality [8, 17], the context [4, 5], as well as the effective nature [11,12,13,14,15] and the temporal dynamics [1, 3, 9] of users’ interests. Some adaptive systems also consider information on the user’s expertise in specific knowledge areas. Such information may be obtained through the so-called expertise retrieval systems [2]. Approaches to expertise retrieval can be categorized into two main classes, inherited from Information Retrieval techniques: the first one is based on the information content (content-based), while the second one is independent of it (collaborative). Content-based approaches take advantage of the information extracted from the domain of the individual’s knowledge to create a profile of her experiences, in which the relevance of her documents to the specific field is evaluated. In collaborative systems, by contrast, the user’s expertise is assessed based on the authority inferred by analyzing her social network. Both of these approaches have been implemented within the proposed system.

Fig. 1. Content-based approach schema.

Content-Based Approach. Figure 1 depicts the diagram of the overall content-based approach. Specifically, the first step consists in extracting a set of documents related to the subject from the knowledge database. A topic annotator is used to extract the entities that characterize those documents. Such entities are stored in a graph database along with information about authors, abstracts, affiliations, tags, and categories. When a user has to be profiled, the system performs similar steps, but considers only the information regarding her own content. Figure 2 illustrates a snapshot of the graph database with regard to the content-based approach. Note the different types of nodes, such as authors, papers, abstracts, entities, and categories.

Fig. 2. Snapshot of the graph database in the content-based approach.

Once the domain is defined, different strategies can be applied to evaluate a user’s expertise. More specifically, the following four strategies have been implemented in the proposed system:

  • Occurrences. The first method analyzes occurrences by comparing the keywords extracted from the user’s profile with those extracted from the domain stored within the graph. The score is the ratio between the cardinality of their intersection and the cardinality of the set of keywords that characterize the domain, as expressed in Eq. 1, where \(KW_{ka}\) identifies the set of keywords describing the knowledge area and \(KW_{u}\) denotes the set of keywords related to the topic used by the user. This ratio expresses the expertise level of the user u in that specific knowledge area.

    $$\begin{aligned} score(u,ka) = \frac{|KW_{u} \bigcap KW_{ka}|}{|KW_{ka}|} \end{aligned}$$
    (1)
  • Weighted Occurrences. The second method is a variant of the first one that also considers the weight associated with each entity identified within the domain and the user’s profile, according to how relevant that entity is to the topic under examination. Such weight is calculated by estimating the distance between the Wikipedia page associated with the extracted term and the page related to the domain of interest, namely, the number of levels of pages and categories that separate them. The weight is also stored inside the edge that links the tag to its abstract within the graph. The method can be described through the following equation:

    $$\begin{aligned} score_{weight}(u,ka) = \frac{|WeightedKW_{u} \bigcap WeightedKW_{ka}|}{|WeightedKW_{ka}|} \end{aligned}$$
    (2)
  • Log-Entity. This method relies on the comparison, through the cosine-similarity metric, between the vector representing the candidate user and the one representing the topic. For the weighting function, a version of the TF-IDF model, well known in Information Retrieval, has been employed. In particular, the vector representing the user is weighted as follows:

    $$\begin{aligned} u= \left\langle \bigg ( e_1, \log \big (\frac{|D_{u}|}{|d : e_1 \in d|}\big ) \cdot w_{e_1,t}\bigg ),\ldots ,\bigg ( e_n, \log \big (\frac{|D_{u}|}{|d : e_n \in d|}\big ) \cdot w_{e_n,t}\bigg ) \right\rangle \end{aligned}$$
    (3)

    while the vector representing the domain is weighted as follows:

    $$\begin{aligned} ka = \left\langle \bigg ( e_1, \log \big (\frac{|D|}{|d : e_1 \in d|}\big ) \cdot w_{e_1,t}\bigg ),\ldots ,\bigg ( e_n, \log \big (\frac{|D|}{|d : e_n \in d|}\big ) \cdot w_{e_n,t}\bigg ) \right\rangle \end{aligned}$$
    (4)

    The vectors so obtained are then compared using the cosine-similarity metric. The resulting values, which lie between 0 and 1, describe the user’s expertise level in that specific knowledge area.

  • Entity Frequency. Like the previous method, this one relies on the computation of the cosine similarity between vectors, but it differs in how the vectors are weighted. In this case, the vector describing the user’s profile associates each entity belonging to the profile with the fraction of the user’s documents that contain that entity.

    $$\begin{aligned} u = \left\langle \Bigg (e_1, \bigg (\frac{|d : e_1 \in d_{u}|}{|D_{u}|}\bigg )\Bigg ),\ldots ,\Bigg (e_n, \bigg (\frac{|d : e_n \in d_{u}|}{|D_{u}|}\bigg )\Bigg ) \right\rangle \end{aligned}$$
    (5)

    The weighting of the vector related to the knowledge area takes place analogously and is described as follows:

    $$\begin{aligned} ka = \left\langle \Bigg (e_1, \bigg (\frac{|d : e_1 \in d|}{|D|}\bigg )\Bigg ),\ldots ,\Bigg (e_n, \bigg (\frac{|d : e_n \in d|}{|D|}\bigg )\Bigg ) \right\rangle \end{aligned}$$
    (6)

    The two vectors are then compared through the cosine-similarity technique, which returns a score expressing the user’s expertise in that specific subject.
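
As an illustration of the Occurrences and Entity Frequency strategies listed above, the following minimal Python sketch computes Eq. 1 on keyword sets and compares two entity-frequency vectors (Eqs. 5 and 6) with cosine similarity. It is only a sketch under stated assumptions: the function names, the representation of each document as a set of entity strings, and the toy data are all hypothetical and are not taken from the system described here, which stores its entities in a graph database.

```python
import math
from collections import Counter


def occurrences_score(kw_user, kw_area):
    """Eq. 1: share of the area's keywords that also appear in the user's profile."""
    kw_user, kw_area = set(kw_user), set(kw_area)
    if not kw_area:
        return 0.0
    return len(kw_user & kw_area) / len(kw_area)


def entity_frequency_vector(documents):
    """Eqs. 5-6: map each entity to the fraction of documents that contain it.

    Each document is assumed to be an iterable of entity strings.
    """
    if not documents:
        return {}
    counts = Counter(entity for doc in documents for entity in set(doc))
    return {entity: n / len(documents) for entity, n in counts.items()}


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(e, 0.0) for e, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): entities extracted from domain and user documents.
area_docs = [
    {"graph", "ranking"},
    {"graph", "profile"},
    {"ranking", "hits"},
    {"graph", "hits"},
]
user_docs = [{"graph", "ranking"}, {"graph"}]

kw_area = set().union(*area_docs)
kw_user = set().union(*user_docs)

occ = occurrences_score(kw_user, kw_area)
ef = cosine_similarity(
    entity_frequency_vector(user_docs),
    entity_frequency_vector(area_docs),
)
print(f"occurrences score: {occ:.3f}")
print(f"entity-frequency score: {ef:.3f}")
```

Restricting the domain vector to its top-n most frequent entities, as done in the experimental evaluation, would amount to filtering the dictionary returned by `entity_frequency_vector` before computing the similarity.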

Collaborative Approach. The collaborative approach analyzes information concerning co-citations among documents related to a particular topic. More specifically, a graph containing documents and their co-citations is built. Such a graph is then analyzed via the HITS algorithm, which calculates the authority score A(p) and the hub score H(p) for each node p within the graph. Once the ranking of documents, sorted by their authority value, is obtained, the ranking of the authors corresponding to those documents is generated. Since several documents can be written by the same author, the authority value is assigned to the user according to Eq. 7, which allows us to balance the weight given to the sum of the authority values of all the documents produced by the author against the weight given to the maximum authority value among the user’s documents:

$$\begin{aligned} Authority(u) = A \cdot \lambda + B \cdot (1-\lambda ) \end{aligned}$$
(7)

The \(\lambda \) parameter takes a value between 0 and 1. A and B are, respectively, the sum of the authority values and the maximum authority value of the documents written by the candidate user. Figure 3 shows the diagram of the overall collaborative approach. In this approach, unlike the previous one, the edges of the graph database only represent the co-citations among documents.

Fig. 3. Collaborative approach schema.
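
The collaborative scoring described above can be sketched in Python as follows. This is a minimal illustration, not the system's implementation: it uses the textbook power-iteration form of HITS (the system relies on an unspecified version of the algorithm), it assumes that each edge is a directed pair (citing document, cited document), and the names `hits`, `author_authority`, and `lam`, as well as the default value 0.5 and the toy graph, are hypothetical.

```python
import math


def hits(edges, iterations=50):
    """Textbook HITS by power iteration on directed (source, target) edges.

    Returns two dicts: authority score A(p) and hub score H(p) per node.
    """
    nodes = {node for edge in edges for node in edge}
    authority = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of p: sum of the hub scores of the nodes pointing to p.
        authority = {n: 0.0 for n in nodes}
        for src, dst in edges:
            authority[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in authority.values())) or 1.0
        authority = {n: v / norm for n, v in authority.items()}
        # Hub of p: sum of the authority scores of the nodes p points to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += authority[dst]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return authority, hub


def author_authority(doc_authority, author_docs, lam=0.5):
    """Eq. 7: lam * (sum of authorities) + (1 - lam) * (maximum authority).

    Documents missing from the graph contribute 0, so an author with no
    document in the graph gets a score of 0.
    """
    scores = [doc_authority.get(doc, 0.0) for doc in author_docs]
    if not scores:
        return 0.0
    return lam * sum(scores) + (1.0 - lam) * max(scores)


# Toy citation graph (hypothetical): d1, d2, and d4 cite d3; d1 also cites d2.
edges = [("d1", "d3"), ("d2", "d3"), ("d4", "d3"), ("d1", "d2")]
authority, hub = hits(edges)

print(author_authority(authority, ["d2", "d3"]))  # author found in the graph
print(author_authority(authority, ["d9"]))        # author absent from the graph
```

Setting `lam` to 1 reduces the score to the sum of the authority values, which rewards prolific authors, whereas setting it to 0 keeps only the author's single most authoritative document.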

2.1 Experimental Evaluation

To evaluate the performance of our system, we carried out some experimental tests on six candidate users using both approaches. Those candidates were selected so that \(u_2\), \(u_3\), and \(u_5\) could be regarded as actual experts in the knowledge area of interest, while the other candidates were less experienced. As to the content-based approach, we obtained the results shown in Table 1. In particular, the first two columns show the results obtained when the candidates were evaluated through the occurrences of terms and when those occurrences were subsequently multiplied by the weight of each entity with respect to the subject, based on the ontology extracted from Wikipedia. The third column shows the results obtained by comparing, by means of the cosine similarity, the vectors weighted through the Log-Entity method, in which the occurrences of each entity within the user’s production are multiplied by the relevance value of that entity with respect to the topic under consideration. The last columns show the results obtained with the Entity Frequency method while varying the reference domain, that is, taking into account the first n elements of the list of entities in descending order of frequency within the graph.

Table 1. Experimental results of the content-based approach

Table 2 shows the results obtained for the same candidate users through the collaborative approach. It reports the maximum authority value obtained by a document produced by the candidate user, the sum of the authority values of all the documents in the graph associated with the candidate user u, and the value given by Eq. 7 with \(\lambda =2\).

Table 2. Experimental results of the collaborative approach

The obtained data allow us to make some interesting observations. The content-based method based on occurrences, whether unweighted or weighted by the relevance of the entities within the context, does not seem to produce the expected results. The Entity Frequency method, especially in its filtered version (i.e., based on the extraction of the top-n entities belonging to the domain), instead shows satisfactory results. The candidate users, assessed on the basis of the content they generated, all obtained positive but clearly differentiated values, and the score gap between the users known to us as experts and the other candidates suggests that this method is accurate. The results appear especially reliable in the version with \(n=10\). Finally, the scores obtained through the collaborative approach show that the algorithm based on the HITS implementation provides rather trustworthy evaluations of expert candidates, but only if the documents of the candidate u are found within the dataset (i.e., the graph built on the co-citations among the different documents). For instance, the collaborative approach was not able to assign a value to the expertise of the candidate \(u_1\), which is therefore set equal to 0.

3 Conclusions

In this article, we have described a system for the implicit assessment of a user’s expertise in a specific knowledge area. The development of two main approaches allows us to use either one or both of them, thus enabling the system to overcome their individual weaknesses. The experimental results show that the content-based approach performs better in some situations, while the collaborative one is to be preferred in others. Hence, the best results may come from an integrated solution. The heterogeneous structure of the graph database chosen for the system implementation makes it possible to answer complex queries over the different kinds of stored information.

Among the possible future developments, we would like to increase the number of knowledge bases (i.e., available documents) to enhance the reliability of the system output. As for the experimental evaluation, we plan to test our system on other domains and allow testers to provide explicit feedback on the received results.