Background

MEDLINE is the premier bibliographic database of the U.S National Library of Medicine (NLM), containing over 22 million biomedicine-related entries. With approximately 750,000 new citations added in 2014, it is essential to identify literature of the highest quality for priority reading [1]. High citation rates (in addition to journal impact factor and circulation rates) are proposed to be predictive of article quality [2], thus in turn, scientific importance. Factors such as bias towards review articles and variable bibliographic lengths however suggest that such methods are not always optimal [3].

Citation counts give no weighting towards articles of greater importance. Naturally, definition of such importance is a subjective task. In a static system of inter-article referencing, we observe that a citation by an article from a low distribution journal has equivalence to a citation from a large-scale systematic review. Perhaps a weighting approach would favour articles of greater perceived ‘scientific gravity’, however this may neglect the emerging relevance of an article’s spread through the scientific community. Therefore a method of objectively weighting literature importance would be highly beneficial.

The PageRank algorithm, originally used for link analysis by the search engine, Google [4], provides one such method of ranking by importance. The concept, originally applied to web pages, proposes that a web page itself carries a greater importance if linked to by other high importance pages. Thus for a closed system of total web pages online, a system of merit can be constructed based on assigning a relative weighting (as a proportion of the entire database) to each web page.

Much as web pages are interconnected through hyperlinks, scientific articles are themselves linked via their citations. As such, this study seeks to investigate PageRank-based bibliometrics as an alternative to citation counts alone.

Methods

The PubMed Central open access subset (PMC-OAS) represents a more liberally-licenced part of the PubMed Central collection [5], freely available online. Contributing journals provide selected full text articles in eXtensible Markup Language (XML) format, specifically for data mining purposes.

PMC-OAS was here chosen, both due to ease of accessibility, though also as a training corpus allowing concept validation prior to expansion to the entirety of MEDLINE. With over 600,000 unique manuscripts included, the dataset amounts to some 40Gb uncompressed [6]. Data parsing and computation was performed in three steps (Fig. 1).

Fig. 1
figure 1

Methodology flowchart. Flowchart representing the major steps of data manipulation, as outlined in Methods

XML parsing

With data ingestion going beyond the capability of traditional desktop computing, on-demand cloud-computing infrastructure was leveraged to parallelise metadata extraction. This commodity cluster environment represents a readily-available, low-cost method of scaling up ‘embarrassingly parallel’ computational tasks [7].

XML parsing was performed in parallel on four compute nodes (2Gb RAM, 2 virtual CPU cores) using a hand-written Python [8] parser in under two hours (Appendix 1). PubMed identification (PMID) numbers of ‘outbound’ citations were extracted from each article’s reference list and used as reference keys for every citation vertex in the graph of article nodes.

PageRank computation

PageRank computation was performed on a single compute node (specifications as previous) using an open source C++ based implementation of the algorithm [9]. The algorithm can be summarised as per Fig. 2, where pi represents the set of all unique PMIDs in the citation network (and PR(pi) its individual PageRank), d is the dampening factor (d = 0.85 here), N is the total number of unique PMIDs, M(pi) represents the set of all inbound citations to pi, PR(pj) represents the PageRank values of all inbound citations to pi and L(pj) is the number of outbound citations of pj.

Fig. 2
figure 2

PageRank algorithm. PageRank algorithm representation. Set of unique PMIDs in citation network [pi], individual PageRank [PR(pi)], dampening factor [d = 0.85], total number of unique PMIDs [N], set of all inbound citations to pi [M(pi)], PageRank values of all inbound citations to pi [PR(pj)] and number of outbound citations of pj [L(pj)]

A dampening factor was originally introduced in PageRank to model an imaginary surfer randomly clicking on links, that will eventually stop clicking. 0.85 suggests an 85 % probability that at any step, this imaginary surfer will continue to click. Due to the recursive nature of the algorithm, a convergence value (epsilon) of 0.00001 was used to guarantee precision. The algorithm was used as per the reference implementation except where otherwise described.

Inverted citation index creation

MapReduce, a programming model for large corpus processing, also developed at Google, was used to create an ‘inverted citation index’. This distributed computational approach allows near linear scalability with increasing cluster size [10], thus facilitating a route for future corpus expansion. The inverted citation index generates a list of ‘inbound’ citations for each article node in the graph, with a corresponding total citation count.

The high-level programming language, Pig [11] was used as a layer on top of MapReduce for near-natural language manipulation of the dataset. A Pig script was written to facilitate numeric comparison between derived citation count and calculated PageRank (Appendix 2).

Statistical analysis

Statistical analysis was performed using IBM SPSS version 21.0.0.0 [12].

Results

The PageRank algorithm processed and ranked a total of 6293819 unique PMIDs as graph nodes, with 24626354 vertices, representing corresponding outbound citations. A random, 5 % sample of the data was taken (using SPSS randomisation) for statistical analysis. This figure comfortably exceeds the sample size calculation (n = 385 required, Raosoft [13]), detailed in Appendix 3.

PageRank is shown to be a surrogate of literature importance

A statistically significant correlation between PageRank and citation count was observed (P < 0.01) with a high correlation coefficient (R = 0.905). Simple linear regression was performed, obtaining R2 = 0.819 with the fitted regression line being statistically significant (P < 0.01), illustrated in Fig. 3.

Fig. 3
figure 3

PageRank versus citation count. Scatter plot of PageRank versus citation count for random, 5 % sample of data. R = 0.905 (P < 0.01), R2 = 0.819 (P < 0.01)

As such, given the current role of citation count as a marker of literature importance, we demonstrate PageRank to be a similar such surrogate due to high degree of correlation. In light of this finding, we suggest that novel rankings would likely remain broadly similar and thus suggest that implementation of PageRank into the ranking of biomedical literature is feasible.

Top of the corpus comparison

If the putative benefits of PageRank in quantifying importance are to be observed, it must be through outliers from those otherwise highly correlated with citation count. Such outliers may have been preferentially weighted by the algorithm, based on perceived importance. Due to the training subset size, it would be infeasible to account for such examples, however a top of corpus comparison allows some speculative inspection.

The top ten ranking articles of the corpus were compared by descending PageRank (Table 1). This table size was chosen for illustrational ease as graphical whole corpus analysis, aside from regression testing, was outside the scope of this research. From inspection, citation count decrement order matches that of PageRank (as expected from the high degree of correlation), with the exception of citation 11846609 (†), a method article with a lower relative PageRank ranking to its citation count.

Table 1 Top of the corpus comparison

Whilst this represents a single example, we hypothesize that a method article is likely to be widely cited by those utilising its techniques, however this gives little information about the importance of such implementers. As such, we suggest that this correlation outlier has been proportionally ‘down-ranked’ by the PageRank algorithm in relation to the rest of the comparative head.

Whilst further work is required to validate such claims, we suggest this finding may build upon the notion of PageRank’s potential benefits in outweighing citation count alone. If the method is truly able to better weight those articles with higher importance rather than mass citation, we propose that its implementation into the ranking of biomedical literature may be warranted.

Discussion

PageRank can be trivially calculated on commodity cluster hardware

The use of on-demand cloud computing infrastructure for data extraction and computation allows for scalability with increasing corpus size. In the event of increasing article burden, additional XML parsing nodes could be employed with linear cost and throughput. Despite the uncompressed corpus totalling approximately 40Gb, the fully citation-extracted form was <500 Mb. Therefore, we suggest that growth by an order of magnitude (in the range of entire MEDLINE database size) could still be stored on a single commodity hard drive.

Whilst the PageRank calculation was performed on a single node, expansion beyond 2Gb of RAM on a single computer is becoming cheaper and widely available [14]. The use of MapReduce for inverted citation network creation allows near-linear scalability, similar to XML parsing, and can thus be trivially re-evaluated as the corpus grows. PMC-OAS is updated daily, thus all metrics can be recalculated in a matter of minutes (minus the cost of data parsing), as required by the maintainer.

Expanding automated XML processing to MEDLINE as a whole is problematic

The PMC-OAS full-text articles are freely available in XML format, facilitating automated citation extraction. Unfortunately, the vast majority of MEDLINE articles are not open access, meaning that full-text access in not trivially available without bulk licencing programmes. Furthermore, the lack of XML-based metadata in non-open access articles limits the capability for rapid citation network generation.

Efforts have been made to parse bibliographic data from papers [15, 16], however attempts are limited by paid access to such articles in addition to the efficiency of extraction from a variety of article distribution file formats. We thus identify expansion beyond this 600,000-article training corpus as a major barrier to non-proprietary bibliometrics.

Articles appearing in PMC-OAS, referenced articles, which were not included in the corpus. This means that the latter’s PMID appeared in the citation network and thus received a PageRank. However, due to the limited inclusion set of this work, the PageRank (and thus relative ordering) is by no means final and would inevitably change should expansion to the whole of MEDLINE be feasible.

Other methods of importance quantification

Thus far, importance analysis has been derived from article citation networks alone. However, importance is a non-static entity, with the impact of papers going beyond that of, who cites who. Indeed, importance of a particular work may be represented by its spread through the scientific community, rather than an ‘acknowledgement-based’ system of the traditional publishing model. Social media may provide a real-time window into this community dissemination.

Altmetrics, the use of the social web for insight into article impact [17], has previously shown promise in correlation with citation count and may therefore add to bibliometrics through real-time importance weighting [18]. Consideration of social impact is beyond the scope of this research, though provides an exciting avenue for further exploration, perhaps in conjunction with PageRank.

Conclusions

PageRank is a novel method for determining the importance of biomedical literature. The possibility of commodity cluster hardware use and value re-calculation following corpus expansion suggests that curation of an open access citation network is not beyond the limits of a single maintainer. Whilst further work will inevitably be required to expand the network beyond the XML data-mining corpus of the PubMed Central open access subset, the 600,000-article training corpus provides a starting platform for PageRank’s addition to existing importance ranking methods.