Identifying Key Opinion Leaders in Evolving Co-authorship Networks—A Descriptive Study of a Proxy Variable for Betweenness Centrality
Many researchers identify influentials in a network by their betweenness centrality. Whereas betweenness centrality can be calculated in small, static, connected networks, its calculation in complex, large, evolving networks frequently causes some problems. Hence, we propose a proxy variable for a node’s betweenness centrality that can be calculated in large, evolving networks. We illustrate our approach using the example of Key Opinion Leader (KOL) identification in an evolving co-authorship network of researchers who have published articles about PCSK9 (a protein that regulates cholesterol levels).
1 Introduction
The analysis of complex networks has become one of the main research topics in contemporary computer science, and the analysis of evolving networks has received particular attention (see [1] for a literature review). In this context, one main research question has been how to identify the most important nodes (hubs, influentials) in a network. One of the most prominent measures of a node’s importance in a network is its betweenness centrality.
Whereas a node’s betweenness centrality can be calculated in small, static, connected networks, its calculation in complex, large, evolving networks is often problematic for two reasons. First, betweenness centrality can be calculated in connected graphs only. However, the early evolutionary stages of a network are often characterised by few edges and nodes. Consequently, the corresponding graph consists of many disconnected components (compare ), and betweenness centrality is either undefined (at the whole-network level) or can only be calculated separately for each of the disconnected components. Second, calculating betweenness centrality is computationally too costly to allow for dynamic analyses in large networks (compare [10, 18] for the execution times of betweenness centrality calculations on commodity machines).¹
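To make the computational cost concrete, the following is a minimal sketch of an exact betweenness calculation for an unweighted, undirected graph, using Brandes' algorithm. It is an illustration only and not the implementation used for the analyses in this paper; the function name `betweenness_centrality` and the adjacency-dictionary input format are assumptions made for this example. The sketch runs one breadth-first search per node, so its running time grows with the product of the number of nodes and the number of edges. Pairs of nodes in different components contribute nothing, so the resulting scores are effectively per-component values.

```python
from collections import deque


def betweenness_centrality(adj):
    """Exact betweenness for an unweighted, undirected graph (Brandes' algorithm).

    adj maps each node to the set of its neighbours (hypothetical input format).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # w is seen for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting with the nodes farthest from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair (s, t) is visited from both ends.
    return {v: b / 2.0 for v, b in bc.items()}
```

In an evolving network, this whole computation would have to be repeated for every snapshot, which is what makes a cheaper proxy attractive.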
Nevertheless, many application scenarios require some knowledge about the nodes’ betweenness centrality in the (early) evolutionary stages of a network. Therefore, in this paper we propose a proxy variable for a node’s betweenness centrality that can also be calculated in the early evolutionary stages of a network and that allows for dynamic analyses.
We illustrate our approach using the example of key opinion leader (KOL) identification in PCSK9 research (see Sect. 3, Data). KOLs are physicians and researchers who influence the treatments prescribed by their peers. Pharmaceutical enterprises spend considerable time and effort identifying KOLs and maintaining a good relationship with them. However, to best build relationships, pharmaceutical enterprises have to identify KOLs in the early stages of the emergence of a new research field and track their importance over time. We suppose that KOLs can be identified through their embeddedness in a co-authorship network. In the network, authors serve as nodes, and a tie is assumed between two authors who have co-authored a publication (compare, for example, [20, 21]).
To summarise, our paper has two research objectives. Our main objective is to answer the research question of whether there is a proxy variable for the KOLs’ betweenness centrality that can also be calculated in the (early) evolutionary stages of a co-authorship network. To answer this question, however, we first have to identify which researchers (KOLs) have the highest betweenness centrality in the PCSK9 co-authorship network.
The remainder of this paper is structured as follows. The next section reviews the related literature. Section 3, Data, introduces the dataset used for our analyses. Section 4 presents our analyses. The last section, Discussion, addresses the theoretical and managerial implications of our work, notes the limitations of this study, and points to further research.
2 Related Work
In this literature review, we focus on two streams of research. The first is the literature on evolutionary network analysis in computer science. Because an extensive review of this kind of work can be found in a recent paper by Aggarwal and Subbian [1], a further review is beyond the scope of this paper.
The second stream is research that analyses scientific co-authorship by means of network analysis (e.g. [20, 21, 23]). In this context, papers that examine the evolution of a co-authorship network are of particular interest. For example, in a seminal paper, Barabási et al. analyse the small-world properties of an evolving co-authorship network (i.e. they examine whether the network has a larger clustering coefficient than expected for a random network and a small average separation, or shortest-path length).
Backstrom et al. [2] examine how communities/groups develop and evolve in networks using data from DBLP.² They are particularly interested in determining who will join which community in the future and how people and topics move between communities.
Franceschet [11] also uses data from DBLP for his analyses, in which he distinguishes between the author-paper affiliation network and the (author) collaboration network. The author-paper affiliation network is a bipartite graph with two types of nodes, authors and papers, and an edge from an author to a paper if the author has written the paper. In contrast, the “collaboration network is an undirected graph obtained from the projection of the author paper affiliation network on the author set of nodes. Nodes of the collaboration network represent authors and there is an edge between two authors if they have collaborated in at least one paper” (p. 1995). Like Franceschet [11], we focus on the authors’ collaboration network in this paper.
“Although the collaboration network is a coarser representation with respect to the affiliation network ... [it] is highly informative since many collaboration patterns can be captured by analyzing this form of representation” ([11, p. 1995]). For example, Franceschet [11] analyses the temporal evolution of the connectivity of the collaboration network, the distribution of the number of scholar collaborators, network clustering, the average separation distance among scholars, as well as assortativity by the number of collaborators. He finds that the network is a widely connected small world. Furthermore, he finds the distribution of collaboration among scholars to be highly skewed and concentrated (i.e. there are a few highly productive collaborators responsible for a relatively high share of collaborations). However, he finds the network to be resilient to the removal of these highly productive collaborators.
Liu and Xia  examine the structure and evolution of the co-authorship network in the interdisciplinary field of “evolution of cooperation”. They illustrate how small clusters evolve into a giant component that can be considered as a small-world network.
Whereas most of the studies presented above focus on network topology and macro-level network properties (such as diameter, distance, components, and clustering coefficient), Yan and Ding take a different approach by studying micro-level network properties (i.e. centrality measures). Specifically, they examine how authors’ centrality measures change over time. Yan and Ding’s study is probably the one most closely related to our research.
Finally, Yang et al. [31] infer a node’s future centrality in a network by a Node Prominence Profile (NPP). They base the NPP on the principles of preferential attachment and triadic closure. However, Yang et al. [31] focus on degree centrality. In this work, in contrast, we intend to infer a node’s betweenness centrality from a single, easily available proxy variable.
3 Data
To analyse KOLs in an evolving complex network, we analysed the co-authorship network of researchers who have published an article about PCSK9.³ In this network, authors serve as nodes, and a tie is assumed between two authors who have co-authored at least one publication. To obtain this network, we searched the PubMed/MEDLINE database⁴ for all articles that contain the term “PCSK9” in any search field. PubMed/MEDLINE is a comprehensive scholarly bibliographic database maintained by the United States National Library of Medicine that has also been used in comparable research (e.g. ).
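The construction of the network from the retrieved articles can be sketched as follows. This is a minimal illustration, assuming that each article is available as a plain list of author names; the function name `build_coauthorship_network` and the example names are hypothetical and not part of the original analysis.

```python
from itertools import combinations


def build_coauthorship_network(papers):
    """Return a dict mapping each author to the set of her or his co-authors.

    papers is an iterable of author-name lists, one list per article
    (assumed input format).
    """
    adj = {}
    for authors in papers:
        unique = sorted(set(authors))  # ignore duplicate names on one article
        for a in unique:
            adj.setdefault(a, set())  # single-author articles yield isolated nodes
        for a, b in combinations(unique, 2):
            # One tie per pair, no matter how many articles they share.
            adj[a].add(b)
            adj[b].add(a)
    return adj


# Hypothetical example with three articles.
example = build_coauthorship_network([
    ["Author A", "Author B", "Author C"],
    ["Author C", "Author D"],
    ["Author E"],
])
```

Because ties are stored in sets, the resulting network is unweighted: repeated collaboration between the same two authors does not strengthen their tie.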
We chose the PCSK9 network because we wanted to analyse a network with only several thousand nodes, which ensures high data quality and a clear boundary specification of the network (compare  for additional advantages of studying a collaboration network of this size). In total, 952 articles were retrieved from the PubMed/MEDLINE database. These articles were reported to have been written by 4213 authors. (Two of the articles did not provide any authorship information.) Since PubMed did not provide a unique researcher ID for many authors, we manually checked the data for inconsistencies. Indeed, many names that our system recognised as belonging to different individuals in fact belonged to the same individual. In most of these cases, an author used her or his initial in some publications but not in others (e.g. we assumed that the names “Abdiche, Yasmina” and “Abdiche, Yasmina N” belong to the same individual; compare  for this procedure). After data cleansing, 3905 authors remained in the database.
On average, each of these 3905 authors has written 1.742 articles about PCSK9. This number seems rather low. For example, Yan and Ding find in their co-authorship analyses that the average author in the field of “library and information science” has written 2.4 articles. Newman finds numbers between 2.55 and 11.6 articles for different research domains. The low number in the case of PCSK9 research can be explained by the fact that PCSK9 is a rather new, narrow research field whose authors also publish papers on other research subjects.
In contrast, the average PCSK9 article has been written by 7.14 authors. This number seems rather high. In comparison, Yan and Ding report that the average article in the field of library and information science has 1.8 authors. Newman finds numbers between 1.99 (for the field of theoretical high-energy physics) and 8.96 authors per paper (for the field of high-energy physics in general). The high number of 7.14 in the case of biomedical research can be explained by the fact that experimental biomedical research requires a large group of collaborators (similar to experimental high-energy physics). For example, large-scale clinical trials are conducted by more than 100 people. Accordingly, the PubMed/MEDLINE database contains a single PCSK9 paper with 186 authors. (However, this number is still low compared with experimental high-energy physics, for which Newman reports a single paper with 1,681 authors.)
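The data were checked manually, but a simple heuristic can pre-select the name pairs that deserve a closer look. The sketch below is an assumption about how such a pre-selection could work, not a description of the procedure actually used: it flags two names as candidates when they share a surname and the given-name tokens of one are a prefix of the other's (as with “Abdiche, Yasmina” and “Abdiche, Yasmina N”). The function names and the “Surname, Given names” input format are hypothetical.

```python
def split_name(name):
    """Split 'Surname, Given names' into a lower-cased surname and given-name tokens."""
    surname, _, given = name.partition(",")
    return surname.strip().lower(), [t.lower() for t in given.split()]


def candidate_duplicates(names):
    """Return pairs of names that may denote the same person (for manual review)."""
    by_surname = {}
    for name in sorted(set(names)):
        surname, given = split_name(name)
        by_surname.setdefault(surname, []).append((given, name))
    pairs = []
    for entries in by_surname.values():
        for i, (g1, n1) in enumerate(entries):
            for g2, n2 in entries[i + 1:]:
                shorter, longer = sorted((g1, g2), key=len)
                # Candidate if the shorter given name is a prefix of the longer one.
                if shorter and longer[:len(shorter)] == shorter:
                    pairs.append((n1, n2))
    return pairs
```

Such a heuristic only proposes candidates; two different people can share a surname and first name, so each flagged pair still needs a manual decision.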
Table 1. Number of articles per year
Using this data sample, we calculated a series of network statistics (including betweenness centrality) for all authors in the network using the R libraries “sna” and “igraph”, as well as the software Gephi 0.8.2 beta.⁵ During the analyses, we aggregated the network from year to year, which means we assumed that a tie between two authors who co-authored a paper in the past endured until the year of analysis (and was not dissolved).⁶ Consequently, we studied the cumulative network structure in one-year intervals (i.e. the network from 1993 to 2003, the network from 1993 to 2004, the network from 1993 to 2005, and so on) (compare ). This approach is also taken by most comparable studies (e.g. ).
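The year-by-year aggregation can be sketched as follows. The code is a minimal illustration under the assumption that each article is given as a pair of publication year and author list; the function name `cumulative_networks` and the example data are hypothetical. Each snapshot contains all nodes and ties that appeared up to and including the respective year, so ties are never removed.

```python
from itertools import combinations


def cumulative_networks(papers, first_year, last_year):
    """Return {year: (nodes, edges)} for the cumulative co-authorship network.

    papers is an iterable of (year, [author names]) pairs (assumed format).
    Edges are stored as sorted name pairs so that each tie appears once.
    """
    by_year = {}
    for year, authors in papers:
        by_year.setdefault(year, []).append(authors)

    nodes, edges = set(), set()
    snapshots = {}
    for year in range(first_year, last_year + 1):
        for authors in by_year.get(year, []):
            unique = sorted(set(authors))
            nodes.update(unique)
            edges.update(combinations(unique, 2))
        # Years without publications simply carry the previous network forward.
        snapshots[year] = (frozenset(nodes), frozenset(edges))
    return snapshots


# Hypothetical example: three articles published in consecutive years.
snaps = cumulative_networks(
    [(2003, ["A", "B"]), (2004, ["B", "C"]), (2005, ["A", "C", "D"])],
    2003, 2005,
)
```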
Table 2. Top authors by betweenness centrality
Seidah Nabil G
Wasserman Scott M
Rader Daniel J
Hovingh G Kees
Humphries Steve E
Jukema J Wouter
Park Sahng Wook
Horton Jay D
Thompson John R
Konrad Robert J
Stein Evan A
Davis Harry R Jr
Ballantyne Christie M
Kastelein John J P
Table 2 shows the top 25 authors of this main component by their betweenness centrality. Interviews with managers responsible for PCSK9 from the pharmaceutical industry confirmed that most of these authors are among the most influential people in PCSK9 research.¹⁰
The main research question of this paper was whether these authors can be identified in the early evolutionary stages of a network by taking a variable as a proxy for the authors’ betweenness centrality in the future.
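Table 3. Spearman correlation between the number of an author’s unclosed triads and this author’s betweenness centrality in 2015

Spearman’s rho is calculated as
\[\rho = \frac{\sum _{i}(r_{i}-\bar{r})(s_{i}-\bar{s})}{\sqrt{\sum _{i}(r_{i}-\bar{r})^{2}\,\sum _{i}(s_{i}-\bar{s})^{2}}},\]
where \(r_{i}\) is the rank of author i’s betweenness centrality, \(s_{i}\) is the rank of author i’s number of unclosed triads, \(\bar{r}\) is the mean rank of all betweenness centrality scores, and \(\bar{s}\) is the mean rank of all numbers of unclosed triads.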
Since there was only a single publication by two authors between 1993 and 2002, a calculation of Spearman’s rho for these years is not meaningful. Also the correlation coefficient for 2003 should be viewed with caution as there were only two publications about PCSK9 in 2003. However, for the remaining years, Spearman’s rho indicates a strong to very strong correlation between the author’s betweenness centrality in 2015 and the number of the author’s unclosed triads in the respective years.
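For readers who want to reproduce such a coefficient, the following sketch computes Spearman's rho as the Pearson correlation of the two rank vectors. It is an illustration only; the function names are hypothetical, and the use of mean ranks for tied values is an assumption, since the treatment of ties is not stated in the text.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive their mean rank (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y.

    Raises ZeroDivisionError if all values of x (or of y) are identical.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Here `x` would hold the authors' numbers of unclosed triads in a given year and `y` their betweenness centrality in 2015, in the same author order.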
Table 4. Spearman correlation between an author’s degree centrality and this author’s betweenness centrality in 2015
By comparing Tables 3 and 4, one can see that the correlation coefficients between an author’s number of unclosed triads and this author’s betweenness centrality are much higher than those between an author’s degree centrality and this author’s betweenness centrality.¹² Hence, we can conclude that the number of an author’s unclosed triads is a better proxy variable for this author’s betweenness centrality than her or his degree centrality.
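The two candidate proxies can be computed with a few lines of code. The sketch below assumes that an unclosed triad of a node is a pair of its neighbours that are not linked to each other, which is our reading of the “forbidden triad” idea; the function names and the adjacency-dictionary format are hypothetical. Unlike betweenness centrality, both quantities depend only on a node's immediate neighbourhood, so they can also be computed in a disconnected network.

```python
from itertools import combinations


def unclosed_triads(adj):
    """For each node, count the pairs of its neighbours that are not linked.

    adj maps each node to the set of its neighbours (assumed input format).
    """
    counts = {}
    for v, neighbours in adj.items():
        counts[v] = sum(
            1 for a, b in combinations(list(neighbours), 2) if b not in adj[a]
        )
    return counts


def degree_centrality(adj):
    """Unnormalised degree: the number of co-authors of each node."""
    return {v: len(neighbours) for v, neighbours in adj.items()}


# Hypothetical example: A has co-authored with B, C and D; only B and C
# have also co-authored with each other.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
triads = unclosed_triads(adj)
```

In this example, node A has three pairs of neighbours, of which only B and C are linked, leaving two unclosed triads.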
Although we found that the number of an author’s unclosed triads is a good proxy for her or his betweenness centrality in the PCSK9 co-authorship network, this might possibly be a peculiarity of our dataset.
Table 5. Spearman correlation between the actor’s betweenness centrality and the actor’s number of unclosed triads (rows: scale-free networks with \(\gamma =2\); scale-free networks with \(\gamma =3\))
However, we also wanted to ensure that the number of an actor’s unclosed triads is a good proxy variable for her or his betweenness centrality not only in these artificial networks but also in real-world networks. Therefore, we examined the correlation between a node’s betweenness centrality and the suggested proxy variable using four well-known real-world networks that stem from different domains and have different sizes: (1) Jeong and colleagues’ “protein interaction network” (\(\rho =0.971\)), (2) Watts and Strogatz’s “power grid” (\(\rho =0.858\)), (3) Padgett’s “Florentine families network” (\(\rho =0.854\)), and (4) Knuth’s [16] “Les Miserables dataset” (\(\rho =0.980\)). In these four real-world networks, too, the strong correlations indicate that the suggested proxy variable is a good indicator of a node’s betweenness centrality. Nevertheless, based on these data, we cannot infer whether the accuracy of the proxy variable changes with the size of the network.
For the final step of our analyses, we examined what percentages of the top nodes in the co-authorship network and the four real world networks are correctly identified as such using the proposed heuristic. Figure 2 depicts the number of nodes selected as top nodes (in percent) on the x-axis. The y-axis depicts how many nodes are correctly identified as top nodes using the proposed approach.
In general, across all networks, the proposed approach identifies a sufficiently large number of top nodes for a variety of application scenarios (such as, for example, KOL identification for marketing campaigns in the co-authorship network).
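The evaluation behind Figure 2 can be sketched as a top-k overlap. The code below reflects our reading of the procedure and is not taken from the original analysis scripts: for a given share of nodes, it selects the top nodes by betweenness centrality and by the proxy variable and reports which fraction of the former also appears among the latter. The function name `top_k_overlap` and the score dictionaries are hypothetical.

```python
def top_k_overlap(true_scores, proxy_scores, fraction):
    """Share of the true top nodes that the proxy also places in its top set.

    true_scores and proxy_scores map node -> score; fraction is the share of
    nodes selected as top nodes (e.g. 0.05 for the top 5 percent).
    Ties at the cut-off are broken arbitrarily (by sort order).
    """
    k = max(1, int(fraction * len(true_scores)))
    top_true = set(sorted(true_scores, key=true_scores.get, reverse=True)[:k])
    top_proxy = set(sorted(proxy_scores, key=proxy_scores.get, reverse=True)[:k])
    return len(top_true & top_proxy) / k


# Hypothetical scores for four nodes.
betweenness = {"A": 10.0, "B": 5.0, "C": 1.0, "D": 0.0}
proxy = {"A": 7, "B": 1, "C": 3, "D": 0}
share_top_half = top_k_overlap(betweenness, proxy, 0.5)
```

With these hypothetical scores, the proxy recovers node A but picks C instead of B, so half of the top two nodes are identified correctly.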
In this paper, we pursued two research objectives. First, we identified KOLs in PCSK9 research by their embeddedness in a co-authorship network. Specifically, we identified them by using their betweenness centrality in the co-authorship network. Second, we proposed a proxy variable for the betweenness centrality of these nodes (i.e. the number of an author’s unclosed triads).¹⁴
We think that both points in themselves are important contributions to practice and literature. Pharmaceutical enterprises spend considerable time and effort identifying KOLs. In this paper, we illustrated an easy and cheap alternative to identify KOLs on the basis of co-authorship data. The proposed method can also be easily conducted with search terms other than “PCSK9”.
Furthermore, the proposed proxy variable may serve as an indicator of the nodes’ betweenness centrality in a variety of settings where betweenness centrality cannot feasibly be calculated. Since the collaboration network of scientists is a prototypical example of a complex evolving network, our findings also seem applicable to a variety of other networks.
Of course, as with any empirical study, this study is subject to some limitations. We do not consider most of these limitations to void the results, so long as readers remain aware of them as they draw their conclusions. In fact, the limitations suggest some future research. There are four specific limitations to discuss.
First, since PubMed/MEDLINE did not provide unique researcher IDs for all researchers, some of them may not have been distinguished correctly. Different researchers might share the same name and be merged into a single node, or an author might change her or his name (e.g. after marriage) and be recognised as two different nodes in the network. However, we manually checked the data for inconsistencies and think that the remaining error is of the order of a few percent (compare ). Therefore, we do not think that this methodological limitation significantly affects our results (compare also ). Nevertheless, future research should conduct related analyses with a dataset in which each author has a unique researcher ID.
Second, by selecting all articles about PCSK9 research indexed by the PubMed database, we obtained a clear boundary specification of the network. However, this boundary specification is rather artificial (compare ). For example, some of the authors might have collaborated on other articles, and we neglected these co-authorship links in our analyses. Therefore, future research should analyse co-authorship networks with a different boundary specification.
Third, we neglected the fact that KOLs retire and stop publishing papers . Hence, we might possibly have identified some KOLs in our analyses that are not active anymore. Future research could collect additional data on this aspect and explicitly consider the retirement of KOLs in the analyses.
Fourth, during our analyses, we focused on the author collaboration network (compare ) and used the author-paper affiliation network (i.e. a bipartite graph) for calculating some descriptive statistics only (such as the number of authors per paper, or the number of papers per author). Future research could analyse the author-paper affiliation network in more detail.
Our hope is that our research will assist others in conducting these types of studies and form the basis for substantial future research into identifying KOLs in co-authorship networks, as well as the use of the number of unclosed triads as a proxy variable for betweenness centrality.
² DBLP is a database of computer science publications; http://dblp.org, accessed on July 22nd 2015.
³ PCSK9 is a protein that regulates LDL cholesterol levels. By blocking PCSK9, cholesterol levels can be lowered substantially. Hence, drugs can be developed that reduce the risk of cardiovascular diseases by blocking PCSK9.
⁴ http://www.ncbi.nlm.nih.gov/pubmed, accessed on June 4th 2015.
⁵ http://gephi.org, accessed on July 14th 2015.
⁶ In the analyses, we left out the years 1994–2002, since no papers about PCSK9 were published then.
⁷ Although Freeman proposed a standardised measure of betweenness centrality that can theoretically be used for comparing centrality scores between components of different size, we think that it is, for example, not meaningful to compare the maximal betweenness centrality of a node in a component with three actors to that of a node in the main component of a co-authorship network.
⁸ The main component of a network is also sometimes referred to as the “giant component”.
⁹ There were no other meaningful big components in the network. For example, the second (third) biggest component in the network comprised 1.23 % (0.69 %) of all authors.
¹⁰ These influential people include basic researchers as well as researchers conducting clinical trials. Hence, some context knowledge is helpful for reading the tables.
¹¹ We suppose that node A will have a high betweenness centrality in the final network, although node B and node C are more likely to co-author a paper in the future than two random nodes if both have co-authored a paper with node A. In the literature, this fact has been termed the “forbidden triad”.
¹² This is true for all years except 2004 and 2005. However, there were only very few publications in these years (14 and 18 respectively, compare Table 1), and the high correlation coefficients between an author’s degree centrality and this author’s betweenness centrality for those two years can be explained by chance. Furthermore, for the years 2004 and 2005, the differences between the two sets of correlation coefficients (number of unclosed triads and betweenness centrality vs. degree centrality and betweenness centrality) are not very large (0.4428995 vs. 0.5012240 and 0.4435424 vs. 0.4919271).
¹³ The “Florentine families network” is a very small network (with only 16 nodes).
¹⁴ Encouraged by a literature review and interviews with marketing managers from the pharmaceutical industry, we assumed that authors with a high betweenness centrality have a high influence as well. Although we think that this is a reasonable assumption for co-authorship networks, we want to be clear that structural importance and dynamic influence of nodes do not necessarily have to be the same.
This work was supported by a fellowship within the FITweltweit programme of the German Academic Exchange Service (DAAD).
References
- 1. Aggarwal, C., Subbian, K.: Evolutionary network analysis: a survey. ACM Comput. Surv. 47(1), 10 (2014). doi:10.1145/2601412
- 2. Backstrom, L., Huttenlocher, D., Kleinberg, J., Lan, X.: Group formation in large social networks: membership, growth, and evolution. In: Paper presented at the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA
- 3. Bader, D.A., Kintali, S., Madduri, K., Mihail, M.: Approximating betweenness centrality. In: Algorithms and Models for the Web-Graph, pp. 124–137. Springer (2007)
- 9. Csárdi, G., Nepusz, T.: The igraph software package for complex network research. InterJournal, Complex Syst. 1695 (2006)
- 10. Ediger, D., Jiang, K., Riedy, J., Bader, D., Corley, C., Farber, R., Reynolds, W.N.: Massive social network analysis: mining Twitter for social good. In: Paper presented at the 39th International Conference on Parallel Processing (ICPP), San Diego, CA
- 16. Knuth, D.E.: The Stanford GraphBase: A Platform for Combinatorial Computing. Addison-Wesley, Reading (1993)
- 18. McLaughlin, A., Bader, D.A.: Scalable and high performance betweenness centrality on the GPU. In: Paper presented at the International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, Louisiana
- 19. Morstatter, F., Pfeffer, J., Liu, H., Carley, K.M.: Is the sample good enough? Comparing data from Twitter’s streaming API with Twitter’s firehose. In: Paper presented at the International AAAI Conference on Web and Social Media, Cambridge, Massachusetts
- 20. Newman, M.E.: Scientific collaboration networks—I. Network construction and fundamental results. Phys. Rev. E 64, 016131 (2001)
- 21. Newman, M.E.: Scientific collaboration networks—II. Shortest paths, weighted networks, and centrality. Phys. Rev. E 64, 016132 (2001)
- 24. Shi, X., Bonner, M., Adamic, L.A., Gilbert, A.C.: The very small world of the well-connected. In: Paper presented at the Nineteenth ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, USA
- 27. Vidgen, R., Henneberg, S., Naude, P.: What sort of community is the European Conference on Information Systems? A social network analysis 1993–2005. Eur. J. Inf. Syst. 16(1), 5–19 (2007)
- 31. Yang, Y., Dong, Y., Chawla, N.V.: Predicting node degree centrality with the node prominence profile. Sci. Rep. 4(7236) (2014). doi:10.1038/srep07236