A network approach to expertise retrieval based on path similarity and credit allocation

With the increasing availability of online scholarly databases, publication records can be easily extracted and analysed. Researchers can promptly keep abreast of others' scientific production and, in principle, can select new collaborators and build new research teams. A critical factor one should consider when contemplating new potential collaborations is the possibility of unambiguously defining the expertise of other researchers. While some organisations have established database systems to enable their members to manually produce a profile, maintaining such systems is time-consuming and costly. Therefore, there has been a growing interest in retrieving expertise through automated approaches. Indeed, the identification of researchers' expertise is of great value in many applications, such as identifying qualified experts to supervise new researchers, assigning manuscripts to reviewers, and forming a qualified team. Here, we propose a network-based approach to the construction of authors' expertise profiles. Using the MEDLINE corpus as an example, we show that our method can be applied to a number of widely used data sets and outperforms other methods traditionally used for expertise identification.


Introduction
The increasing complexity of research problems calls for innovative solutions which combine knowledge from different scientific disciplines (Van Rijnsoever and Hessels 2011). Therefore, many researchers become involved in interdisciplinary projects, thus collaborating with people with a variety of expertise. When facing the task of finding collaborators, scholars need to answer two inter-related questions: 1) How to identify an expert, i.e., how to find someone who is competent in a given field; and 2) how to profile an expert, i.e., how to identify the fields in which a given scholar is an expert. In general, both questions jointly describe the objective of expertise retrieval (Balog et al. 2012). Indeed figuring out the research area associated with an individual represents a challenging research problem. Search engines such as Google Scholar or DBLP are of great help for finding documents (Hertzum and Pejtersen 2000). However, these engines only return scientific documents, not the specific expertise of people. Even in an academic environment, researchers still have to rely on their social networks to identify the expertise of others (Hofmann et al. 2010).
Identifying experts is crucial for academic groups when they need to involve a collaborator with specific expertise. In organisational settings, knowing the expertise of relevant researchers facilitates the assignment of important roles and jobs. For example, conference organisers may search for moderators, session chairs and keynote speakers with the proper expertise. And universities may want to recruit researchers with expertise in a particular fast-developing area to improve their reputation. A good method for expertise retrieval is therefore fundamental to provide the necessary knowledge for such activities.
However, expertise retrieval is challenging for many reasons. First, expertise is a relatively abstract concept, and there is currently no consensus on how to define it. Besides, expertise is a particular kind of knowledge stored in one's mind, and thus hard to identify. The only way to access people's expertise is through their works, e.g., documents, books, articles. Second, experts' names are often ambiguous. A single name may belong to multiple people, and the name of the same expert can vary across databases. Indeed, name disambiguation has recently become a specific and independent area of enquiry, and many studies have been carried out in this field. Finally, it is difficult to evaluate the strength of the association between an expert and the works he or she has been involved in, especially because an increasing amount of scientific production is co-authored by multiple individuals. These challenges have made expertise retrieval a multi-faceted research area. In particular, since we learn about researchers' expertise mainly from their publications, the task of expertise retrieval has mainly been articulated into identifying the knowledge areas/topics in the text corpus and assigning them to the researchers (Silva et al. 2018).
Inspired by previous approaches to dealing with credit allocation (Shen and Barabási 2014) and by recent studies on finding node similarity in heterogeneous information networks (HINs), we formalise the topics/expertise extracted from a given scientific publication as credit to be assigned to the co-authors of the publication, and propose a new method to allocate them to the co-authors based on their publication histories. Traditional approaches to the identification of the knowledge areas within the text corpus use topic-modelling methods such as Latent Dirichlet Allocation (LDA) based on controlled vocabulary from well-known classification systems such as the Medical Subject Headings (MeSH) in MEDLINE and the topic tags in Microsoft Academic Graph (MAG).
Our work focuses on the process of evaluating the degree of each co-author's contribution to a collaborative work. We propose a new method for properly assigning the expertise to each co-author according to his or her contribution. Our method differs from traditional ones where the contribution of authors is assumed to be equal or assessed simply based on the order of authors in the byline. Moreover, our method can deal with large-scale data sets, and produces results that vary dynamically as the data set is updated over time. Unlike some citation-based approaches to the assessment of contributions, which require a certain time to account for the citations that accumulate over time, our method is experience-based and the update of authors' expertise is determined once the new records are added into the data set.
The rest of the article is organised as follows. In Section 2 we review strengths and limitations of existing literature on expertise identification, and motivate our work. In Section 3 we introduce the data used in our study. In Section 4 and Section 5 we present our new method and different selection strategies. In Section 6, we provide some extensions to account for weights and time. In Section 7 we report results obtained using the MEDLINE corpus and various examples. Section 8 summarises the findings of this work and outlines their implications for research and practice.

Literature review
Previous work on expert profiling has primarily focused on identifying and ranking topics for a given expert (Balog et al. 2007; Serdyukov et al. 2011). However, only a few studies have considered the temporal aspects of expertise. The work by Tsatsaronis et al. (2011) was one of the first studies to focus on the evolution of authors' expertise over time. Their work was based on co-authorship information, and proposed evolution indices to measure the dynamics of authors' expertise. Inspired by their work, Rybak et al. (2014) constructed temporal hierarchical expertise profiles using topic models. Typically, the underlying question of expert profiling is: What topics does a person know about? (Balog et al. 2007; Rybak et al. 2014). Indeed, the word "topic" is commonly used in the various definitions of expertise because the traditional approaches to expertise profiling rely on topic models and Natural Language Processing (NLP) techniques (Van Gysel et al. 2016). The main purpose of using those models is to classify documents into a number of topics and find a better match between authors and topics according to the topics extracted from their documents. As most of these machine learning algorithms are unsupervised, the topics are simply collections of words and thus not always appropriate for identifying expertise (Silva et al. 2018).
Since the main focus of expertise retrieval tasks is on the analysis of the documents, NLP techniques have commonly been applied. Traditional approaches to the expert profiling tasks are based on the LDA algorithm. LDA is a generative statistical model, first proposed in 2003, which considers each document as a mixture of a small number of topics and according to which the presence of each word is attributable to one of the topics of the document (Blei et al. 2003). LDA is a powerful tool to analyse documents and pinpoint topics, but it was not designed to address the task of identifying expertise. With plain LDA, there is no better solution than to treat an author as a bigger document by combining all the documents he or she has published. To include authorship information, Rosen-Zvi et al. (2004) extended LDA and proposed the author-topic model for identifying the interests of authors. To make LDA suitable for different tasks in various contexts, many extensions have been proposed over the years. Some examples are the Author-Conference Topic model (Tang et al. 2008), the Author-Conference Topic-Connection model (Wang et al. 2012), and the Author-Topic over Time model (Xu et al. 2014). Some of these have been applied in practice as part of the search engine Aminer (Tang 2016).
However, classic LDA algorithms have several characteristics that are not ideal for such tasks. First, LDA requires a manual choice of the topic number. But one can hardly tell whether the choice is good or not since the performance of an LDA model is evaluated by perplexity, a metric proposed by Blei et al. (2003). Therefore it is difficult to decide and evaluate the number of topics. When such number is too large or too small, the research areas (corresponding to the topics) provided by LDA may become too general or too specific (Berendsen et al. 2013). Second, since LDA is an unsupervised learning algorithm, topics generated from LDA are just distributions of words without labels which can be hard to interpret. Additionally, the academic research areas are always connected and have a hierarchical structure. However, LDA generates independent topics without any kind of relationships between them (Silva et al. 2018).
While most studies are concerned with better solutions to address the flaws of topic models, few have highlighted the importance of author-document connections in the tasks of expertise retrieval. Duan et al. (2012) first integrated community discovery with topic modelling, and proposed the Mutual Enhanced Infinite Community-Topic model, which finds communities and the topics they discuss in text-augmented social networks. Lately, more studies have started using information networks to avoid the problems of the LDA models. Gerlach et al. (2018) represent the data as a bipartite network of words and documents and convert the task into finding communities in such a network. Some different approaches that focus on topic modelling using HINs have been proposed (Sun et al. 2009b). Subsequently, a pioneering algorithm called RankClus was designed. It uses a generative model that operates on bipartite topologies and simultaneously clusters and ranks nodes in a HIN (Sun et al. 2009a). More recently, different community detection methods, such as generative models and modularity optimisation, have been applied to the creation of hierarchical expert profiles (Silva et al. 2018; Wang et al. 2015).
Despite the efforts of many scholars to find better ways for extracting individuals' interests from the works they produced, most studies have paid little attention to the unequal contributions of authors in collaborative works. Authors that publish with other co-authors in several fields can be associated with multiple topics found in their publications. Identifying the expert on a specific field associated with a paper requires the identification of the different contributions of authors in collaborative works, and therefore identifying one or more people as experts bears a resemblance to a credit allocation problem.
In the last decade, as the complexity and interdisciplinarity of modern research have steadily risen, collaborations among researchers have been playing an increasingly important role (Newman 2004). The multidisciplinary nature of research requires expertise from different scientific fields (Lawrence 2007). In turn, as a result of the increasing size of the newly formed scientific groups, the scientific credit system has come under mounting pressure (Koopman et al. 2010). As a matter of fact, the interdisciplinarity of modern science not only endangers the current credit allocation system, but also poses more obstacles to expertise retrieval. In such interdisciplinary collaborations, authors from different fields work together to produce one result (e.g., an article), but each author contributes only partly to the publication. It can therefore be difficult to quantitatively discern the individual co-authors' contributions to a multi-authored publication (Bao and Zhai 2017). Most topic models for expertise retrieval cannot solve this problem, and new approaches to allocating scientific credit to co-authors are therefore required.
Current approaches to credit allocation fall into several major categories. The first and classic one is to view each author as the sole author contributing a copy of the same publication. The second is to distribute the contribution to all co-authors evenly, and the third is to distribute it according to the order in the publication byline or to the role of the co-authors (Hirsch 2005, 2007; Stallings et al. 2013). The first two categories are obviously biased to some degree, and the third is based on tacit, discipline-specific conventions which may not be easily accepted by others. Recently, scholars have been working on allocating credit based on the specific contribution of each author (Foulkes and Neylon 1996; Tscharntke et al. 2007). Shen and Barabási (2014) proposed a new method which focuses on co-citations. This method is based on the intuition that the more an author appears in co-cited papers, the more credit he or she should receive. They managed to capture the contribution of co-authors as perceived by the scientific community, and successfully tested their method on the Nobel Prize publications. Considering that the novelty of a paper and the attention paid to it tend to fade with time, Bao and Zhai (2017) extended their idea and proposed a dynamic credit allocation algorithm.
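The three conventional categories described above can be sketched on a single byline. This is only an illustration: the function names are hypothetical, and the order-based rule shown here (a share proportional to 1/k for the k-th author) is an assumed example of a byline-based scheme, not a rule taken from the works cited.

```python
def full_count(authors):
    """Category 1: every author is treated as the sole author of the paper."""
    return {a: 1.0 for a in authors}


def fractional_count(authors):
    """Category 2: the credit of one paper is split evenly among co-authors."""
    share = 1.0 / len(authors)
    return {a: share for a in authors}


def harmonic_count(authors):
    """Category 3 (assumed rule): the k-th author's share is proportional to 1/k."""
    weights = [1.0 / rank for rank in range(1, len(authors) + 1)]
    total = sum(weights)
    return {a: w / total for a, w in zip(authors, weights)}


byline = ["first", "second", "third"]  # hypothetical byline, in publication order
for scheme in (full_count, fractional_count, harmonic_count):
    print(scheme.__name__, scheme(byline))
```

None of these schemes looks at what the co-authors have worked on before, which is the information the method proposed in this article relies on.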
As science can be regarded as a complex, self-organising and evolving network of scholars, projects, papers and ideas (Fortunato et al. 2018), another way to deal with the unequal contributions of multiple authors to collaborative works is to use the similarity between a node representing a given topic and a node representing a given author to assess the contribution that the author made to the focal document with respect to the topic. Information networks are networks consisting of data items linked in some way. The best known example is the World Wide Web where the nodes are web pages consisting of texts, pictures or other information, and the links are hyperlinks that allow us to navigate from one page to another. There are some networks which could be considered information networks and also have social connotations. Examples include the networks of email communication, and online social networks such as Twitter and Facebook (Xiong et al. 2015).
An information network is defined as a directed graph $G = (V, E)$ with an object type mapping function $\varphi : V \to A$ and a link type mapping function $\psi : E \to R$, where each object $v \in V$ belongs to one particular object type $\varphi(v) \in A$, and each link $e \in E$ belongs to a particular relation $\psi(e) \in R$. Unlike the traditional network definition, we explicitly distinguish object types and relationship types in the network. Notice that, if there exists a relation from type $A$ to type $B$, denoted as $A \xrightarrow{R} B$, the inverse relation $R^{-1}$ holds naturally for $B \xrightarrow{R^{-1}} A$. Most of the time, $R$ and its inverse $R^{-1}$ are not equal, unless the two types are the same and $R$ is symmetric. When the number of object types $|A| > 1$ or the number of relation types $|R| > 1$, the network is called a heterogeneous information network (HIN); otherwise, it is a homogeneous information network. In real-world networks, multiple-typed objects are often interconnected, forming HINs (Shi et al. 2012). A bibliographic information network is a typical HIN, containing objects from several types of entities. The most common entities are papers ($P$), venues (conferences/journals) ($V$), authors ($A$), affiliations ($\mathit{aff}$), and terms ($T$). The DBLP and ACM data in Fig. 1 are a typical example. There are links connecting different-typed objects, and the link types are defined by the relations between two object types. For a bibliographic network, links can exist between nodes of the same or different types. For example, there are links between authors and papers denoting the "write" or "written-by" relations, and links between papers denoting "cite" and "cited-by" relations. In a heterogeneous network, two objects can be connected via different paths. For example, two authors can be connected via the "author-paper-author" path, the "author-paper-venue-paper-author" path, and so forth. Formally, these paths are called meta-paths.
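In a graph $T_G = (A, R)$, where $A$ is the set of node types and $R$ is the set of relation types, a meta path $P$ is a path denoted in the form of $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which defines a composite relation $R_1 \circ R_2 \circ \cdots \circ R_l$ between types $A_1$ and $A_{l+1}$, where $\circ$ denotes the composition operator on relations.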
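To make the notion of a meta path concrete, the following minimal sketch builds a toy bibliographic HIN and lists the path instances that follow a given sequence of node types. All node names and the helper `meta_path_instances` are hypothetical, and each link is stored together with its inverse relation.

```python
from collections import defaultdict

# Toy bibliographic HIN (hypothetical data): every node has exactly one type.
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P", "v1": "V"}
links = [
    ("a1", "p1"), ("a2", "p1"),   # a1 and a2 co-wrote p1
    ("a2", "p2"), ("a3", "p2"),   # a2 and a3 co-wrote p2
    ("p1", "v1"), ("p2", "v1"),   # both papers appeared in venue v1
]

# Store each relation with its inverse, e.g. "write" and "written-by".
neighbours = defaultdict(set)
for u, v in links:
    neighbours[u].add(v)
    neighbours[v].add(u)


def meta_path_instances(start, type_sequence):
    """Return all node paths from `start` whose node types follow `type_sequence`."""
    if node_type[start] != type_sequence[0]:
        return []
    paths = [[start]]
    for wanted in type_sequence[1:]:
        paths = [
            path + [nxt]
            for path in paths
            for nxt in sorted(neighbours[path[-1]])
            if node_type[nxt] == wanted
        ]
    return paths


apa = meta_path_instances("a1", "APA")       # author-paper-author
apvpa = meta_path_instances("a1", "APVPA")   # author-paper-venue-paper-author
print(sorted({path[-1] for path in apa}))
print(sorted({path[-1] for path in apvpa}))
```

In this toy network the longer meta path reaches an author (a3) who never co-authored with a1 but published in the same venue, which is why different meta paths carry different semantics.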
Similarity search is a primitive operation in large-scale HINs that consist of multi-typed, interconnected objects, such as the bibliographic networks and social media networks. Traditional similarity measures (e.g., cosine similarity) are computed between vector representations of features, using numerical data types (Nguyen and Bai 2010). In information networks, however, the interconnections between objects are sometimes more important than the features of the objects themselves.
To capture the information contained in the links, Lin et al. (2006) proposed a link-based similarity measure, PageSim, and applied it to the identification of similar web pages. PageSim only works on networks with one type of node (i.e., homogeneous information networks), but many networks are heterogeneous. Considering the semantics in meta paths constituted by different-typed objects, Sun et al. (2011) first proposed the path-based similarity measure PathSim to evaluate the similarity of same-typed objects based on symmetric paths. Following their work, Yao et al. (2014) extended PathSim by incorporating richer information, such as transitive similarity, temporal dynamics, and supportive attributes. A path-based similarity join method, JoinSim, was proposed to return the top-k similar pairs of objects based on user-specified join paths (Begum et al. 2016). Wang et al. (2016) defined a meta-path-based relation similarity measure, RelSim, to examine the similarity between relation instances in schema-rich HINs. In order to evaluate the relevance of different-typed objects, Shi et al. (2014) proposed HeteSim to measure the relevance of any object pair under an arbitrary meta path. To overcome the high computational and memory requirements of HeteSim, Meng et al. (2014) proposed the AvgSim measure, which evaluates the similarity scores through two random walk processes, one along the given meta path and one along the reverse meta path.
The idea of node similarity can be useful in expertise retrieval because, if we can measure the similarity between a given author and a field, we can assess the author's expertise in that field. HeteSim has been designed to evaluate the relevance of different-typed objects, and thus has the potential to be applied to the task of expertise retrieval. However, this task needs to explicitly account for the uneven contribution of various authors to collaborative efforts, and therefore cannot be carried out merely by applying simple measures of similarity between nodes. For this reason, we decided to draw on HeteSim, and propose a properly adjusted method for capturing authors' expertise in evolving networks.
As a result of the increasing interest in extracting relevant topics from scientific publications, many widely used online data sets provide external controlled vocabulary to classify publications. Some examples are the MeSH classification system in MEDLINE and the topic tags in MAG. Those systems have used a variety of techniques to improve the reliability of the classifications, and some scholars have started to use them as ground truth or baseline in their works (AlShebli et al. 2018). Our method simplifies the process of topic extraction from documents by using the MEDLINE corpus as an example, and focuses on how to allocate expertise to co-authors that unevenly contribute to collaborative efforts.
The method for collective credit allocation in science developed by Shen and Barabási (2014) is conceptually similar to our method. Yet, it differs from ours in one important aspect: it focuses on the process of appropriately allocating the credit of a given paper to each of the co-authors. It uses the co-citations to the given paper and other papers published by the co-authors to determine the proportion to be assigned to each co-author of the paper. If more papers have cited at the same time the focal paper and other papers published by a given co-author, a larger proportion of the credit will be allocated to this co-author, indicating that a larger contribution was made by this co-author to the work. However, at the time when a paper is published and therefore has no citations, contributions to this paper are equally allocated across co-authors. Moreover, because the citations vary over the years, so does the credit allocated to each co-author by this method. Clearly, one shortcoming of this method lies in the fact that the contribution of an author to a paper should be unambiguously defined once the paper is published, and should therefore be assessed according to the experience or background of each co-author rather than based on future citations.

MEDLINE (Medical Literature Analysis and Retrieval System Online) is a bibliographic database of life sciences and biomedical information, maintained and curated by the US National Library of Medicine. It includes bibliographic information on articles from academic journals covering medicine, nursing, pharmacy, dentistry, veterinary medicine, and healthcare. The database contains records from more than 5,000 selected journals covering biomedicine and health from 1948 to the present. The database is freely accessible via the PubMed interface.
In addition, PubMed provides an online scientific publication search engine that associates each paper with several MeSH terms. These terms are similar to keywords of papers, except that a controlled vocabulary is used to classify publications. Since the MeSH terms of a paper are not given by the authors, they are not subject to the authors' subjective biases and can be considered as labels which indicate the major topics discussed in the paper. PubMed also constructed tree structures for MeSH terms so that one can look up the research field of each MeSH term.
In particular, in PubMed, each MeSH term has one MeSH Unique ID (starting with the letter 'D' followed by 6 digits) and at least one MeSH Tree ID (starting with a letter followed by digits separated by dots). For example, the MeSH Tree ID of 'Anatomic Landmarks' is 'A01.111' and its MeSH Unique ID is 'D059925'. The first letter of the MeSH Tree ID of a MeSH term indicates which one of the 16 categories the MeSH term belongs to. However, the MeSH terms in the raw data are indexed by the MeSH Unique ID rather than the MeSH Tree ID. To map each MeSH Unique ID to the corresponding MeSH Tree ID, we downloaded detailed information about each MeSH Unique ID and used regular expressions (regex) to match each MeSH Unique ID with the corresponding MeSH Tree ID. MeSH Tree IDs can have different depths (the depth of a node is the number of edges from the node to the tree's root node). Some MeSH IDs have corresponding MeSH Tree IDs of depth five (e.g., 'A15.378.316.378'), others only have a depth of two (e.g., 'B02'). To ensure that all MeSH IDs can be mapped to MeSH Tree IDs of the same depth, we converted all MeSH Tree IDs to depth two by cutting the numbers after the first point. As a result, all MeSH IDs have been mapped to 127 MeSH Tree IDs of depth two.
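The truncation step can be sketched as follows. The regular expression, the function names and the lookup table are assumptions made for illustration (the second Unique ID is a made-up placeholder); the actual mapping is built from the downloaded MeSH records.

```python
import re

# Assumed shape of a MeSH Tree ID: one letter, then digit groups separated by dots.
TREE_ID = re.compile(r"^[A-Z]\d+(?:\.\d+)*$")


def to_depth_two(tree_id):
    """Cut a MeSH Tree ID after its first component, e.g. 'A15.378.316.378' -> 'A15'."""
    if not TREE_ID.match(tree_id):
        raise ValueError(f"not a MeSH Tree ID: {tree_id!r}")
    return tree_id.split(".", 1)[0]


def map_unique_to_depth_two(unique_to_tree_ids):
    """Map each MeSH Unique ID to the set of depth-two Tree IDs it falls under."""
    return {
        uid: {to_depth_two(t) for t in tree_ids}
        for uid, tree_ids in unique_to_tree_ids.items()
    }


# Hypothetical lookup table: 'D059925' comes from the example in the text,
# 'D999999' is a placeholder with two tree positions.
raw = {"D059925": ["A01.111"], "D999999": ["A15.378.316.378", "B02"]}
print(map_unique_to_depth_two(raw))
```

Because a term can sit at several positions in the tree, the sketch maps a Unique ID to a set of depth-two categories rather than to a single one.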
To disambiguate authors' names we used the Author-ity data set provided by Torvik. The data set provides the disambiguated authors' names appearing in the MEDLINE data set up to the year 2008. In our work, we used the first decade of publications in MEDLINE, from 1948 to 1957, to test the method we developed and to compare it with a baseline (BL) method.
4 HeteAlloc: An algorithm based on path similarity

The method
Based on the idea described above, the task of expertise profiling can be transformed into a dynamic MeSH term allocation problem: given a time $T$, an author $A$ and a MeSH term $M$, what is the expertise of author $A$ on MeSH term $M$ at time $T$? To answer this question, we have developed a method based on the idea of credit allocation, using the author-paper and paper-MeSH connections. Notice that what we care about is the effort devoted by an author to a MeSH term (measured by the number of papers published with that MeSH term, or possibly by the reputation or impact factor of the journals, research venues and outlets where these papers have appeared), rather than the reputation of the author (measured by the citations received).
Problem description. We focus on a subset of the HIN which contains three types of nodes: Papers, Authors and MeSH terms. A simple example of this HIN is shown in Fig. 2. In this network, the MeSH terms are indexed by MeSH Tree IDs, and the links between papers and MeSH terms show which MeSH terms the papers are associated with. Our problem is how to allocate credit to single authors. The input is the link lists of every year from 1948 to 1957, and the output is a vector for each author with a value for each of the 127 MeSH categories indicating the author's expertise in those MeSH categories.
We developed a dynamic credit allocation algorithm based on path similarity, which we shall call HeteAlloc. Based on the HIN with three types of nodes (i.e., authors, papers and MeSH terms), our task is to assign the credit of each MeSH term in a paper to the corresponding authors, and to use the whole publication history of authors to find their expertise. Our method calculates the similarity between an author and a MeSH term, and assigns a value to each author based on this similarity. It is based on Heterogeneous Similarity (HeteSim), a measure of the relatedness of heterogeneous objects based on an arbitrary search path. The properties of HeteSim (e.g., symmetry and self-maximum) make it suitable for a number of applications. We define HeteSim as follows.
HeteSim: Given a relevance path $P = R_1 \circ R_2 \circ \cdots \circ R_l$, the HeteSim score between two objects $s$ and $t$ ($s \in R_1.S$ and $t \in R_l.T$) is
$$\mathrm{HeteSim}(s, t \mid R_1 \circ R_2 \circ \cdots \circ R_l) = \frac{1}{|O(s|R_1)|\,|I(t|R_l)|} \sum_{i=1}^{|O(s|R_1)|} \sum_{j=1}^{|I(t|R_l)|} \mathrm{HeteSim}\big(O_i(s|R_1), I_j(t|R_l) \mid R_2 \circ \cdots \circ R_{l-1}\big), \tag{1}$$
where $O(s|R_1)$ is the set of out-neighbours of $s$ based on relation $R_1$, and $I(t|R_l)$ is the set of in-neighbours of $t$ based on relation $R_l$.
Transition probability matrix. The adjacency matrix $W_{AB}$ is defined for all links from nodes of type $A$ to nodes of type $B$. The transition probability matrix $U_{AB}$ is the matrix obtained by normalising $W_{AB}$ along its row vectors.
Reachable probability matrix. Given a network $G = (V, E)$ following a network schema $S = (A, R)$, a reachable probability matrix $PM$ for a path $P = (A_1 A_2 \cdots A_{l+1})$ is defined as $PM_P = U_{A_1 A_2} U_{A_2 A_3} \cdots U_{A_l A_{l+1}}$. $PM_P(i, j)$ represents the probability of object $i \in A_1$ reaching object $j \in A_{l+1}$ under the path $P$.
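The two definitions above can be illustrated with a small sketch that row-normalises two adjacency matrices and multiplies them along the path Author-Paper-MeSH. The matrices are hypothetical toy data and the helper names are not part of the original method.

```python
def row_normalise(W):
    """Transition probability matrix U: each non-empty row of W divided by its sum."""
    U = []
    for row in W:
        total = sum(row)
        U.append([x / total if total else 0.0 for x in row])
    return U


def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Toy adjacency matrices (hypothetical): 2 authors, 3 papers, 2 MeSH categories.
W_AP = [[1, 1, 0],
        [0, 1, 1]]
W_PM = [[1, 0],
        [1, 1],
        [0, 1]]

U_AP = row_normalise(W_AP)
U_PM = row_normalise(W_PM)

# Reachable probability matrix for the path Author -> Paper -> MeSH.
PM_APM = matmul(U_AP, U_PM)
for row in PM_APM:
    print(row)
```

Each row of the resulting matrix sums to one, as expected for the probability distribution of a random walker that starts from an author and follows the path.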
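Using the reachable probability matrices (Ramage et al. 2009), the HeteSim between two nodes $a$ and $b$ can be written in matrix form as
$$\mathrm{HeteSim}(a, b \mid P) = PM_{P_L}(a, :)\, PM'_{P_R^{-1}}(b, :), \tag{2}$$
where $PM$ is the reachable probability matrix, $P_L$ and $P_R$ are the left and right halves of the path $P$, and $PM_P(a, :)$ refers to the $a$-th row of $PM_P$. Finally, Equation 3 provides the normalised version of HeteSim, which ensures that the similarity between a node and itself is equal to one:
$$\mathrm{HeteSim}(a, b \mid P) = \frac{PM_{P_L}(a, :)\, PM'_{P_R^{-1}}(b, :)}{\|PM_{P_L}(a, :)\|\, \|PM_{P_R^{-1}}(b, :)\|}, \tag{3}$$
where $\|\cdot\|$ denotes the Euclidean norm, so that the score is the cosine of the two row vectors.
HeteSim in MeSH term assignment. The definition of HeteSim in Equation 3 can be directly applied to our network. For a node $a_0$ of type author ($A$) and a node $m_0$ of type MeSH ($M$), the HeteSim between $a_0$ and $m_0$ is
$$\mathrm{HeteSim}(a_0, m_0 \mid a_0 \in A, m_0 \in M) = \frac{M_{AP}(a_0, :)\, M'_{MP}(m_0, :)}{\|M_{AP}(a_0, :)\|\, \|M_{MP}(m_0, :)\|}, \tag{4}$$
where $M_{AP}$ and $M_{MP}$ are the adjacency matrices between Author nodes and Paper nodes and between MeSH nodes and Paper nodes, respectively. In Equation 4, the adjacency matrix is used instead of the reachable probability matrix to make our method more interpretable. It can be shown that, in an unweighted network, the formalisation of HeteSim using the adjacency matrix is the same as the formalisation of HeteSim based on the reachable probability matrix. Note that $M_{MP} = M'_{PM}$, and that the matrix product of $M_{AP}$ and $M_{PM}$ is the weighted reachable matrix between node type Author and node type MeSH. Formally, we have
$$M_{AP}(a_0, :)\, M_{PM}(:, m_0) = N_{\text{papers published by author } a_0 \text{ which include the MeSH term } m_0}.$$
In the same way,
$$\|M_{AP}(a_0, :)\|^2 = N_{\text{papers published by author } a_0}, \qquad \|M_{MP}(m_0, :)\|^2 = N_{\text{papers which include the MeSH term } m_0},$$
and Equation 4 can be interpreted as
$$\mathrm{HeteSim}(a_0, m_0 \mid a_0 \in A, m_0 \in M) = \frac{N_{\text{papers published by author } a_0 \text{ which include the MeSH term } m_0}}{\sqrt{N_{\text{papers published by author } a_0} \cdot N_{\text{papers which include the MeSH term } m_0}}}.$$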
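The adjacency-based reading of HeteSim can be sketched in a few lines. The sketch assumes that the normalised score is the cosine of the two binary rows (the text describes HeteSim as based on the cosine of two vectors); the toy matrices and the function name are hypothetical.

```python
from math import sqrt

# Binary adjacency rows over papers p0..p3 (hypothetical toy data).
M_AP = {"a0": [1, 1, 0, 0], "a1": [0, 1, 1, 0]}   # author -> papers
M_MP = {"m0": [1, 1, 1, 0], "m1": [0, 0, 1, 1]}   # MeSH term -> papers


def hetesim(author_row, mesh_row):
    """Cosine of an author's paper vector and a MeSH term's paper vector."""
    shared = sum(a * m for a, m in zip(author_row, mesh_row))  # author's papers with the term
    n_author = sum(author_row)                                 # papers of the author
    n_mesh = sum(mesh_row)                                     # papers with the term
    if n_author == 0 or n_mesh == 0:
        return 0.0
    return shared / sqrt(n_author * n_mesh)


scores = {(a, m): hetesim(M_AP[a], M_MP[m]) for a in M_AP for m in M_MP}
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

Under this assumption, a vector compared with itself scores exactly one, which is the self-maximum property mentioned in the text.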
Though HeteSim is quite suitable for our task, it has some disadvantages. The most important one is that HeteSim is, in a sense, a "global" measure. When the similarity between an author and a MeSH term is calculated, all papers are taken into consideration, even those which have no connection with the target author. For example, if someone publishes a paper with a MeSH term $M_1$, the similarity of all other authors with $M_1$ will decrease even if none of them has ever worked with him or her. As a matter of fact, the original HeteSim measures the contribution of each author to the total knowledge (limited to the data set) of a MeSH term. However, the expertise we want to examine refers to the MeSH term where an author conducted most of his or her work. In a real-world situation, one can only contribute to several hundred papers at most. If we compare this fraction of papers with the tremendous overall number of papers available in online databases, the similarity will be very small and the original HeteSim will perform poorly.
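The "global" behaviour described above is easy to reproduce on toy data. The sketch below again assumes a cosine-type normalisation and uses hypothetical paper identifiers; it only shows the direction of the effect, not the magnitudes observed in MEDLINE.

```python
from math import sqrt


def hetesim(author_papers, mesh_papers):
    """Cosine similarity between two sets of paper ids (binary vectors in disguise)."""
    if not author_papers or not mesh_papers:
        return 0.0
    shared = len(author_papers & mesh_papers)
    return shared / sqrt(len(author_papers) * len(mesh_papers))


author = {"p1", "p2"}                  # papers of the target author (hypothetical ids)
mesh_before = {"p1", "p2", "p3"}       # papers currently tagged with M1
before = hetesim(author, mesh_before)

# A thousand papers on M1 by unrelated researchers enter the data set.
mesh_after = mesh_before | {f"x{i}" for i in range(1000)}
after = hetesim(author, mesh_after)

print(round(before, 3), round(after, 3))
```

The author's own record is unchanged, yet the score collapses, which is the motivation for restricting the calculation to a subset of papers.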
Modification of HeteSim (HeteAlloc). To address this shortcoming of HeteSim, here we propose a modified version, namely HeteAlloc. The underlying idea is to limit the calculation to a subset of papers, which can be selected according to the context. Formally, we have
$$\mathrm{HeteAlloc}(a, m) = \frac{M_{AP}(a, :)\, \big(M_{MP}(m, :) \odot M_{sub}(a, :)\big)'}{\|M_{AP}(a, :)\|\, \|M_{MP}(m, :) \odot M_{sub}(a, :)\|},$$
where the operation $\odot$ is the element-wise product, and $M_{sub}$ is the subset selection matrix with
$$M_{sub}[a, n] = \begin{cases} 1 & \text{if the } n\text{-th paper is in the selected subset of target author } a, \\ 0 & \text{otherwise.} \end{cases}$$
Like the original HeteSim, our method is based on the cosine of two vectors. As Pirotte et al. (2007) pointed out, the angle between the node vectors is a much more predictive measure than the distance between the nodes. The only difference is that the second vector is filtered by a row of the subset selection matrix. The selection of the subset is the essential part of our method, and requires a considerable amount of effort towards the design and computation of the matrix multiplication.
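A minimal sketch of the filtering idea, assuming the cosine form described above: the MeSH row is multiplied element-wise by the author's row of the subset selection matrix before the cosine is taken. The toy rows and the function name are hypothetical.

```python
from math import sqrt


def hetealloc(author_row, mesh_row, subset_row):
    """Cosine of the author's row and the MeSH row filtered by the subset row."""
    filtered = [m * s for m, s in zip(mesh_row, subset_row)]   # element-wise product
    shared = sum(a * f for a, f in zip(author_row, filtered))
    norm_author = sqrt(sum(a * a for a in author_row))
    norm_filtered = sqrt(sum(f * f for f in filtered))
    if norm_author == 0 or norm_filtered == 0:
        return 0.0
    return shared / (norm_author * norm_filtered)


# Columns are papers p0..p4 (hypothetical toy data).
M_AP_a = [1, 1, 0, 0, 0]    # the target author wrote p0 and p1
M_MP_m = [1, 1, 1, 0, 1]    # the target MeSH term appears in p0, p1, p2 and p4
M_sub_a = [1, 1, 1, 0, 0]   # selected subset for this author: p0, p1 and p2

local_score = hetealloc(M_AP_a, M_MP_m, M_sub_a)
global_score = hetealloc(M_AP_a, M_MP_m, [1] * 5)   # no filtering: plain HeteSim
print(round(local_score, 3), round(global_score, 3))
```

With an all-ones subset row the function falls back to the unfiltered cosine, so the subset row is the only place where the selection strategies discussed next differ.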
In what follows, we shall present three subset selection strategies, and then show how to compute the measure, discuss the advantages and disadvantages of each strategy, and finally provide interpretations.

Subset of co-authors' papers.
The basic idea of this strategy is that only those who have co-authored with the focal author should be entitled to influence the assignment of his or her expertise. The HeteSim measure should therefore be limited to the subset of papers published either by our target author or by those who have co-authored with this author. To find the subset, we provide the following definition.
Binary Reachable Matrix of Path Length $i$: Given relation $A \xrightarrow{R} B$ and the adjacency matrix $W_{AB}$ between type $A$ and type $B$, the Binary Reachable Matrix of Path Length $i$ from $A$ to $B$ following meta-path $(AB)^i$ is
$$RM^{(i)}_{AB}[a, b] = \begin{cases} 1 & \text{if } M^{(i)}_{AB}[a, b] > 0, \\ 0 & \text{otherwise,} \end{cases}$$
where $M^{(i)}_{AB} = W_{AB} \cdot (W_{BA} \cdot W_{AB})^{(i-1)}$. The selected subset, $RM^{(2)}_{AP}$, follows the meta-path 'APAP', which, for each author, creates the subset of papers published by the author or his/her co-authors. To be more specific, the $n$-th row of $RM^{(2)}_{AP}$ is a vector where the $k$-th value is 1 if, for the $n$-th author, paper $k$ is included in the subset. To this end, we define HeteAlloc as
$$\mathrm{HeteAlloc}(a, m) = \frac{M_{AP}(a, :)\, \big(M_{MP}(m, :) \odot RM^{(2)}_{AP}(a, :)\big)'}{\|M_{AP}(a, :)\|\, \|M_{MP}(m, :) \odot RM^{(2)}_{AP}(a, :)\|},$$
which can be interpreted as
$$\mathrm{HeteAlloc}(a, m) = \frac{N_{\text{papers of } a \text{ which include } m}}{\sqrt{N_{\text{papers of } a} \cdot N_{\text{papers of } a\text{'s co-authors which include } m}}}.$$
The advantage of this selection strategy is that the similarity between an author and any MeSH term will not be influenced by an irrelevant global change of the data set. The subset matrix is constant for all target MeSH terms. However, this selection does not reflect the specific MeSH term on which an author has collaborated with another author, and simply includes the papers of all co-authors in the subset.
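As an illustration of the definition above, the following sketch builds the binary reachable matrix for the meta-path 'APAP' from a toy author-paper adjacency matrix. The helper names (`matmul`, `transpose`, `binary_reachable`) and the toy data are assumptions made for this example; real data would call for a sparse-matrix library.

```python
def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(col) for col in zip(*x)]

def binary_reachable(w_ab, length):
    """1 where W_AB (W_BA W_AB)^(length-1) is positive, 0 elsewhere."""
    w_ba = transpose(w_ab)
    m = w_ab
    for _ in range(length - 1):
        m = matmul(matmul(m, w_ba), w_ab)
    return [[1 if value > 0 else 0 for value in row] for row in m]

# Toy network: author 0 wrote papers 0 and 1, author 1 wrote papers 1 and 2,
# and author 2 wrote paper 3 alone.
w_ap = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
rm2_ap = binary_reachable(w_ap, 2)
```

Authors 0 and 1 share paper 1, so each row of `rm2_ap` for these two authors covers papers 0 to 2, while the row of author 2 contains only paper 3.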
Subset of co-authors' papers in a target MeSH term.
The basic idea of this strategy is to add the target MeSH term as another constraint for selecting the subset. The subset includes all papers published by the target author and by the authors who have co-authored with him or her in the target MeSH term. Since this subset varies according to MeSH terms, we use the reachable vector of a and m to replace $RM_{sub}$. Equation 16 can be interpreted as
$$\mathrm{HeteAlloc}(a, m) = \frac{N_{\text{papers of } a \text{ which include } m}}{\sqrt{N_{\text{papers of } a} \cdot N_{\text{papers of } a\text{'s co-authors which include } m}}}.$$
The advantage of this selection strategy is that the similarity between an author and any MeSH term will not be influenced by any irrelevant global changes of the data set. The similarity is MeSH-sensitive, and the subset vector can filter out co-authors who have no experience with the target MeSH term. However, this selection will lead to a low score for those who have worked with very experienced authors.
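One literal reading of this MeSH-sensitive selection can be sketched as follows. The function `subset_in_term`, the rule used to decide who counts as a co-author "in the target MeSH term" (sharing at least one paper with the target author that carries the term), and the toy matrices are all assumptions made for illustration.

```python
def subset_in_term(w_ap, w_mp, a, m):
    """Mask of papers written by author a or by authors who share with a
    at least one paper tagged with MeSH term m."""
    n_papers = len(w_ap[a])
    # Papers of a that carry the target term.
    shared = [p for p in range(n_papers) if w_ap[a][p] and w_mp[m][p]]
    partners = {a}
    for b, row in enumerate(w_ap):
        if any(row[p] for p in shared):
            partners.add(b)
    return [1 if any(w_ap[b][p] for b in partners) else 0
            for p in range(n_papers)]

# Toy network: author 0 co-authored paper 0 (term 0) with author 1
# and paper 1 (term 1) with author 2.
w_ap = [
    [1, 1, 0, 0],  # author 0
    [1, 0, 1, 0],  # author 1
    [0, 1, 0, 1],  # author 2
]
w_mp = [
    [1, 0, 1, 0],  # term 0
    [0, 1, 0, 1],  # term 1
]
mask_term0 = subset_in_term(w_ap, w_mp, 0, 0)
mask_term1 = subset_in_term(w_ap, w_mp, 0, 1)
```

Unlike the term-independent subset of all co-authors' papers, the mask here changes with the target term: paper 3 is excluded for term 0 and paper 2 is excluded for term 1.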

Subset of all papers published by the co-authors of the focal paper.
For each paper p, the subset includes all papers published by the co-authors of p. For each pair of author a and MeSH term m, the calculation is conducted for every paper p of author a which includes the MeSH term m, and the average or the sum over these papers is used as the final score. The sum can be considered as a method for credit allocation and the average as a similarity measure. Here we shall use the sum as an example: HeteAlloc(a, m) is the sum, over these papers, of the measure computed with the subset vector $V^{(a,p)}_{sub} = W_{AP}(a, :) \, W_{PA} \cdot W_{AP}(p, :)$.
Equation 21 can be interpreted as:
$$\mathrm{HeteAlloc}(a, m) = \sum_{p \,\in\, \text{all papers of } a} \frac{N_{\text{papers of } a \text{ which include } m}}{\sqrt{N_{\text{papers of } a} \cdot N_{\text{papers of co-authors of paper } p}}}.$$
This similarity avoids a significant decrease when the target author co-authors with a more experienced one in the target MeSH term. The similarity retains the property of having a MeSH-sensitive subset. Notice that this method works better when applied to calculate the absolute value of expertise.

Weighted version of HeteAlloc
The formalisation above is based on an unweighted network. Yet, one may want to capture the concentration of an author's effort on a specific topic (MeSH term). For example, let us suppose that all papers of author A1 contain only one MeSH term, M1, and all papers of another author A2 contain two MeSH terms, M1 and M2. In this case, one may argue that A1 concentrates more than A2 on M1, since A1 has worked exclusively on this topic while A2 has also worked on the additional topic M2. According to this idea, we propose a weighted version of HeteAlloc which accounts for the weights of the links between papers and MeSH terms. The weight of a link between a paper and a MeSH term is inversely proportional to the number of MeSH terms associated with the paper. HeteAlloc can be applied to a weighted network by using $U_{MP}$ instead of $M_{MP}$, where $U_{MP}$ is a normalised matrix of $M_{MP}$ along the column vector.
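The weighting can be illustrated with a small sketch that normalises a MeSH-by-paper matrix along its columns, so that each link from a paper to a MeSH term receives weight 1/k when the paper carries k terms. The function name `normalise_columns` and the toy matrix, which mirrors the A1/A2 example above, are assumptions made for illustration.

```python
def normalise_columns(m_mp):
    """Divide each column (paper) of a MeSH-by-paper matrix by its sum."""
    n_terms, n_papers = len(m_mp), len(m_mp[0])
    totals = [sum(m_mp[t][p] for t in range(n_terms)) for p in range(n_papers)]
    return [[m_mp[t][p] / totals[p] if totals[p] else 0.0
             for p in range(n_papers)] for t in range(n_terms)]

# Papers 0 and 1 (by A1) carry only M1; papers 2 and 3 (by A2) carry M1 and M2.
m_mp = [
    [1, 1, 1, 1],  # M1
    [0, 0, 1, 1],  # M2
]
u_mp = normalise_columns(m_mp)
```

In `u_mp`, the links from A1's papers to M1 keep weight 1, whereas those from A2's papers are halved, which is how the weighted version rewards concentration on a single term.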
The weighted HeteAlloc can capture authors' concentration on specific topics and identify the authors whose papers are more focused on smaller MeSH sets. However, this characteristic is not necessarily an advantage, but simply a different strategy to deal with the number of MeSH terms in a paper. There may exist different views about the similarity between an author and a given MeSH term. For example, one may believe that an author is entirely devoted to a given research topic if each of his or her papers contains the corresponding MeSH term. In this case, the similarity between the author and the MeSH term would be equal to one (i.e., the idea behind the unweighted version). However, others may believe that the similarity between the author and the MeSH term should never be equal to one unless an author's work is exclusively about this MeSH term (i.e., the idea behind the weighted version). The decision should be made after careful examination of the context, and should also be based on the assumptions made by potential users of the method (e.g., researchers or funding agencies).
Here we shall provide our personal recommendation and blueprint. For smaller numbers of MeSH terms, the weighted version will work better, since it is not common for researchers to work on completely different MeSH terms (say, Finance and Chemistry). However, when the division of topics is too fragmented and most papers have many MeSH terms, the weighted version may not perform well, and the unweighted version would be recommended.

Iterative calculations over the years
The original HeteSim is designed for a "static" measurement of similarity. However, authors keep publishing papers over the years, and their expertise may change over time. When expertise is measured at year T, only the papers published before this year should be considered. To make our method HeteAlloc applicable to dynamic calculation, we distinguish the links connecting Author and Paper between the experience/history links before year T and the update links at year T. This can be done by using two adjacency matrices: $M_{update}$ and $M_{experience}$. Since it is difficult to identify the time ordering of publications within year T, we assume that papers of year T were published at the same time. The formalisation of HeteAlloc needs to be modified and the calculation, based on the modified measure, can be conducted iteratively over the years.
We shall refer to the modified algorithm as DynamicHeteAlloc (DHA). For each paper, we add $I_{nn}[p_i, :]$ to $M_{experience}[a, :]$ in Equation 26 to include the current paper in the experience paper set, so as to avoid the case where $M_{experience}$ is a zero matrix.
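The separation of links by year can be sketched as below. This is only an illustration of the bookkeeping, not of the full dynamic measure: the function names `split_links` and `experience_with_current`, the `paper_year` list, and the toy data are assumptions introduced here.

```python
def split_links(w_ap, paper_year, year_t):
    """Split author-paper links into experience links (papers before year_t)
    and update links (papers published in year_t)."""
    experience = [[v if paper_year[p] < year_t else 0
                   for p, v in enumerate(row)] for row in w_ap]
    update = [[v if paper_year[p] == year_t else 0
               for p, v in enumerate(row)] for row in w_ap]
    return experience, update

def experience_with_current(experience_row, paper_index):
    """Add the unit row of the current paper to an author's experience row."""
    row = list(experience_row)
    row[paper_index] += 1
    return row

# Three papers published in years 1, 2 and 3; author 2 is new in year 3.
w_ap = [
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
]
paper_year = [1, 2, 3]
experience, update = split_links(w_ap, paper_year, 3)
newcomer_row = experience_with_current(experience[2], 2)
```

The last step shows why the current paper is added: the new author has no experience links before year 3, and including the paper under evaluation keeps the experience vector from being all zeros.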
According to the formalisation of DHA, we have implemented Algorithm 1. An example of this method using illustrative networks is provided in the Appendix. The results are given in the form of expertise matrices, where the value corresponding to row i and column j indicates the expertise of Author i on MeSH term j. In the example, we use the publication lists of 4 authors from year 1 to year 10 and calculate the expertise matrices for each author at each year. We also show the result using the (BL) method, which equally attributes every MeSH term of a paper to all co-authors. In this case, the expertise of a focal author is computed through the cumulative counts of MeSH terms associated with all publications of the author. Thus, in the expertise matrix calculated using the (BL) method for a year t, the value in row i and column j is equal to the number of papers published by Author i with MeSH term j before year t.
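The (BL) counting rule described above is simple enough to sketch directly. The representation of papers as `(year, authors, terms)` tuples and the function name `baseline_expertise` are assumptions made for this example.

```python
def baseline_expertise(papers, n_authors, n_terms, year_t):
    """Entry [i][j] counts the papers of author i tagged with MeSH term j
    that were published before year_t."""
    expertise = [[0] * n_terms for _ in range(n_authors)]
    for year, authors, terms in papers:
        if year < year_t:
            for a in authors:
                for t in terms:
                    expertise[a][t] += 1  # every co-author gets full credit
    return expertise

# (year, author indices, MeSH term indices)
papers = [
    (1, [0], [0]),
    (1, [0, 1], [0, 1]),
    (2, [1], [1]),
]
at_year_2 = baseline_expertise(papers, 2, 2, 2)
at_year_3 = baseline_expertise(papers, 2, 2, 3)
```

Note that the co-authored paper credits both of its MeSH terms to both authors, regardless of who had prior experience with which term.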

Results
To compare the performance of different selections of subsets on HIN, we have calculated the similarity between all pairs extracted from the pair set {(a, m) | a ∈ Author, m ∈ MeSH} based on three small examples of networks using the (BL) method mentioned above, the original HeteSim, the HeteAlloc with the subset of co-authors' papers (HA1), the HeteAlloc with the subset of co-authors' papers in a target MeSH term (HA2), the HeteAlloc with the subset of all papers published by the co-authors of the focal paper (HA3), and the corresponding weighted versions of HA1, HA2 and HA3 (i.e., WHA1, WHA2 and WHA3).
In the first example in Fig. 3, BL, HA2 and HA3 perform well (see Table 1; the similarities characterised by better performance have been highlighted in bold). These methods can uncover the difference between (A1, M1) and (A1, M2). To be more specific, A1 published two papers with M1 and just one paper with M2, so the similarity between A1 and M1 should be higher than that between A1 and M2. Since each paper contains only one MeSH term, the weighted versions in this example degenerate to the unweighted ones.
In the second example network in Fig. 4, HA3 performs well. It shows that author A1 is more experienced than A3 in M1. To be more specific, A1 published a paper with M1 alone and another with a very experienced author, A2. A3 published a paper with M1 alone and another paper with M2 alone. The similarity between A1 and M1 should be greater than that between A3 and M1. Compared to other methods, only HA3 gives a higher similarity for (A1, M1), and a higher score for the expert A2 with M1. Since each paper contains only one MeSH term, the weighted versions in this example again degenerate to the unweighted ones.
For the third example shown in Fig. 5, the weighted methods differentiate between Sim(A1, M1) and Sim(A2, M1), while the unweighted methods are unable to distinguish between them. To be more specific, both A1 and A2 published two papers with M1, and the only difference between A1 and A2 in M1 is that paper P3 published by A2 contains M2 as well. As mentioned in the section on the weighted version of HeteAlloc, the weighted version can capture the concentration of research efforts in some MeSH terms, and is biased in favour of the authors whose papers are more concentrated on a smaller MeSH set.
In what follows, we will use the third selection strategy and perform a comparison between our method (DHA) and the (BL) method applied to the MEDLINE data set.
As in our data set most publications are associated with multiple MeSH terms, we chose to use the unweighted version of our method.
The outputs of both methods are vectors associated with authors, representing their expertise in each topic (i.e., MeSH term). To compare the two methods, for each author we consider the following measures: (1) the ratio between the maximum and minimum values of the author's expertise; (2) the author's maximum normalised expertise (i.e., obtained by dividing all values in a vector by its norm); and (3) the normalised maximum expertise of authors that have published more than 10 papers at the time of the assessment of expertise (i.e., criterion 2 applied only to the subset of productive authors). Moreover, for every year, we calculate the mean and standard deviation of the values produced by the above assessment measures, and compare them between methods.
The results reported in Table 4 show that the mean and standard deviation of the ratio between the maximum and minimum values of authors' expertise obtained with the DHA method are higher than those obtained with the BL method. This suggests that DHA can better distinguish authors according to their expertise areas, whereas BL considers all authors involved in works relevant to multiple topics as interdisciplinary authors (i.e., with the same expertise on all MeSH terms, thus producing smaller ratios of maximum to minimum values of expertise). The results based on normalised maximum expertise of DHA are similar to those of BL when all authors are considered, but they differ when the methods are applied only to the restricted subset of productive authors. This suggests that our method has the potential to identify authors' main areas of expertise precisely when they are most likely to work in multiple areas. Figure 6 shows the frequency of productive authors with normalised maximum expertise ranging from 0 to 1.
The (BL) method shows no authors with maximum expertise higher than 0.9, which suggests that there is no researcher dedicated to one single area and the maximum expertise of most authors lies in the middle. However, the results obtained with our method clearly highlight its ability to identify specialised authors that preferentially focus on one area (i.e., with high maximum expertise) and at the same time interdisciplinary authors whose work spans different areas (i.e., those with low maximum expertise).
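The assessment measures used in this comparison can be sketched as follows. The sketch assumes the Euclidean norm for the normalisation and the population standard deviation, neither of which is specified in the text; the function names, the handling of a zero minimum, and the two toy expertise vectors are likewise illustrative assumptions and not results from the MEDLINE data.

```python
import math
import statistics

def max_min_ratio(expertise):
    """Measure (1): ratio between the largest and smallest expertise values."""
    smallest = min(expertise)
    # A zero minimum is mapped to infinity in this sketch.
    return max(expertise) / smallest if smallest > 0 else math.inf

def normalised_max(expertise):
    """Measure (2): largest value after dividing the vector by its norm."""
    norm = math.sqrt(sum(v * v for v in expertise))
    return max(expertise) / norm if norm else 0.0

# Hypothetical expertise vectors over three MeSH terms.
authors = {
    "specialist": [9.0, 1.0, 1.0],
    "generalist": [3.0, 3.0, 3.0],
}
ratios = {name: max_min_ratio(v) for name, v in authors.items()}
peaks = {name: normalised_max(v) for name, v in authors.items()}
mean_ratio = statistics.mean(ratios.values())
sd_ratio = statistics.pstdev(ratios.values())
```

A specialised author yields a normalised maximum close to 1, while an author with evenly spread expertise over k terms yields a value of 1/sqrt(k), which is why this measure separates the two profiles.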

Conclusions
In this work, we have proposed a new method based on path similarity and a number of subset selection strategies to identify authors' expertise. Our method differs from previous works as it assigns expertise to a focal author by accounting for co-authors' contributions to the works they were involved with. We have shown that our method can be applied to the HIN constructed from the MEDLINE corpus. However, the applicability of our method is not limited to just one data set. Indeed, if we replace MeSH terms by the topic tags in MAG, our method can be directly applied to MAG. In this case, it can retrieve authors' expertise based on topics as classified in MAG, and it can be suitably adjusted to reflect the depth and granularity required by users. In more general cases, users can generate their own topics from documents using topic modelling or other methods. By linking the generated topics and the corresponding documents, users can produce networks similar to those shown in Fig. 2, and they can then apply our method by selecting an appropriate subset. Our work can also be used to complement standard approaches, for example in conjunction with topic modelling for documents or by using topic classification systems.
The lack of a ground truth does not enable a definitive validation of our method. While this represents a limitation of our work, it also opens up new avenues for future work. For example, to mitigate this limitation, we could check the Contributor Roles Taxonomy (CRediT) author statement available from several journals to identify which author was involved in which part of the research. However, CRediT statements are self-declared and not verifiable, which again highlights the need for methods such as the one we proposed in this article. Moreover, the CRediT author statements are not detailed enough to unambiguously indicate which specific expertise (e.g., MeSH term) should be associated with which author. Another possibility is to handpick some very interdisciplinary papers (i.e., with many MeSH terms). By reading the CVs of the authors or searching for relevant information about them, we might be able to infer the MeSH terms associated with each author, and then compare our prior knowledge with the results obtained using our method. This test represents a "sanity check", and an example is given in the Appendix.
Our method has a number of important applications for research and practice. Understanding the composition of a team and being able to associate each co-author of a paper to one or several fields of expertise can spur new studies of the interdisciplinarity of research teams. For example, our method will enable us to distinguish between interdisciplinary papers co-authored by researchers with overlapping expertise, and equally interdisciplinary papers in which the co-authors have non-overlapping research profiles. This, in turn, could shed further light on the impact of team diversity on scientific success and knowledge creation. Moreover, being able to identify expertise facilitates a comparative assessment of two equally interdisciplinary studies, one pursued by an individual and the other by a group of researchers. In particular, our method enables us to distinguish between research solely pursued by one individual scholar with a highly interdisciplinary background and research pursued by an interdisciplinary group comprising several highly specialised scholars. This variation in type and sources of interdisciplinarity is likely to be a critical nuance with non-trivial implications for innovation, research performance, and the long-term impact of publications.
Our method also has practical implications for funding agencies, research institutions and scientists. First, it can assist funding agencies in the identification of appropriate reviewers with the right competence to evaluate research proposals. In turn, it may also assist reviewers in uncovering possible gaps between the proposed research and the combined expertise of the pool of applicants. Second, our method can help research institutions to develop effective recruitment policies targeted at strengthening specific research fields or at developing new and fast-developing areas that require a prompt investment of resources. Finally, the identification of special expertise can help scientists in identifying potential collaborators and shaping successful research groups.

A Appendix
A.1 Example of DHA using illustrative networks
Here we show how our method works in full using illustrative networks, and we then compare the results with those obtained using the BL method. Figure 7 shows the illustrative networks from year 1 to year 5 (identical networks for five years). Figure 8 shows the illustrative networks from year 6 to year 10 (identical networks for five years). Up to year 5, the four authors worked separately. A1 worked on M2 and M3 equally. A2 mainly worked on M1 and had some works related to M3. A3 mainly worked on M2 and had some works related to M3. A4 worked on M1 and M3 equally. From year 6, they started to collaborate. Specifically, A1 and A2 collaborated on papers related to M2 and M3, A2 and A3 collaborated on M1 and M2, and A3 and A4 collaborated on M1 and M3. The publication lists can be found in Tables 5 and 6. Based on their experience, it is not likely for A2 to have made many contributions on M2 in P1 from year 6 to year 10, since he or she did not have any previous experience in that MeSH category. Similarly, it is not likely for A3 to have made many contributions on M1 in P2 from year 6 to year 10. But they may acquire some experience from those collaborations. Thus, a good method should be able to allocate the credit of those collaborative works to the collaborators with the corresponding experience.
Equations 28-33 list the expertise matrices given by BL and DHA, respectively. The results are similar between year 1 and year 5 and begin to diverge from year 6.
At the end of year 5, both methods suggest that all four authors had similar expertise on M3, whereas A2 and A3 were experts on M1 and M2, respectively. BL simply counts the number of papers each author published on every MeSH term, and adds them together. Following this idea, A2 gained the same amount of credit as A1 on M2 from P1 and as A3 from P2 from year 6 to year 10, although A2 never worked on M2 before year 6. As a result, at the end of year 10, A2 was recognised as an expert on M2, with the same expertise as A3.
However, under most circumstances, the contribution each scholar makes to the joint work is likely to relate to the specific topics or fields in which his or her expertise lies. Specifically, it is more reasonable to think that during the collaboration on P2 from year 6 to year 10, A2 contributed on M1 and A3 contributed on M2, based on their expertise. Therefore, A2 should gain the credit for M1 and A3 should gain the credit for M2. The results obtained using DHA match this expectation: A2 is an expert on M1 and A3 is an expert on M2.
The results are given in Table 9. Upon publication of this paper, Stanislas de Sèze obtains 0.762 on B01, 0.371 on C05 and 0.106 on C10, since he was the most experienced author in these three categories. Similarly, D. Hioco obtains 0.315 on D01 and 0.265 on A12; A. Lichtwitz obtains 0.193 on D01 and 0.211 on C19. However, M. Delaville does not achieve a high score, as he was not the most experienced author in any of these categories. As for the new author, he gains some experience in nearly every category, especially those in which no one had much experience. In this example, he obtained 0.535 on D23, 0.424 on G02 and 0.366 on G03. In general, our method returns a reasonable result which meets our expectation.

A.3 Summary
In Appendix A.1, we showed how our method works in full using illustrative networks, and then compared the results with those obtained with the BL method. In this example, four authors with their publication lists of 10 years are given. By checking the publication history of those authors, we can confirm that the second and the third authors are experts in different topics. Our method was able to correctly identify the expertise of each author. However, the BL method gave a result according to which the research profiles of the two authors were the same. This example and the comparison between methods thus showed that our method outperformed the BL one. In Appendix A.2, we gave an example of a handpicked paper, and provided the results obtained using our method. We showed that our method correctly assigned expertise to the most experienced author on most MeSH terms, and that authors would not acquire much experience in categories that they were not familiar with. The result showed that our method was able to add appropriate value to the co-authors' expertise vectors and update them so that they could better represent the evolution of co-authors' expertise.
Despite the lack of ground truth data to definitively validate the performance of our method, the examples in the Appendix provide some possible ways to test our method. The results showed that our method can provide a reasonable assessment of authors' expertise.