The relationship between usage and citations in an open access mega journal

How does usage of an article relate to the number of citations it accrues? Does the timeframe in which an article is used (and how much that article is used) have an effect on when and how much that article is cited? What role does an article's subject area play in the relationship between usage and citations? This paper aims to answer these questions through an observational study of usage and citation data collected about a multidisciplinary, open access mega journal, Scientific Reports. We find that while the direct correlation between usage and citations is only moderate at best, the relationship between how early and how much an article is used and how early it is cited is much clearer. What is more, we find that when an article is cited earlier it is also cited more often, leading to the assertion that if an article is more highly accessed early on, it is more likely to be cited earlier and more often. As Scientific Reports is a multidisciplinary journal covering all natural and clinical sciences, this study was also able to look at the differences across subject areas and found some interesting variations when comparing the major subject areas covered by the journal (i.e. biological, Earth, physical and health sciences).


Introduction
Understanding the impact of published research is of the utmost importance to the success and continued development of the scientific dialogue. Without knowing if a published research article (or group of articles) has been impactful, it becomes difficult to best direct the development of science (Taylor 2011; Hicks 2011; Grgić 2015). While many metrics have been developed to help understand the impact of published research articles (Bollen et al. 2009), the main focus has historically been on the number of citations articles receive, with the use of citations for this purpose dating as far back as 1927 (Gross 1927). Beyond citations, the other major metric used to understand the impact of published articles is the level of usage they receive (also variously referred to as 'views', 'reads' or 'downloads'). Usage data is most often provided by the publisher of the research article. With the move towards online article publication, data on article usage (online usage at least) became much more accessible and so more readily available for assessing impact (Duy et al. 2006; Armbruster 2010). Usage had previously been taken into account when deciding on the reach or impact of an article, but before journals became available online (either wholly, or alongside a continuing print version), this was based only on the physical circulation of a journal (Peritz 1995; Buffardi & Nichols 1981). Print circulation, however, provided only an indication of usage, as it did not track individual uses of articles. The move to effectively tracking online usage provided a much greater opportunity to understand and assess article impact through usage.
Although citations and usage are two of the primary metrics used to assess impact, an agreed understanding of how each helps in presenting the impact of published articles has not yet been fully reached. While there have been a number of studies and deliberations on the relationship between these two metrics, these have not produced a definitive answer (Perneger 2004; Bollen et al. 2005; Moed 2005; Brody et al. 2006; McDonald 2007; Chu & Krichel 2007; Garfield 2011; Guerrero-Bote & Moya-Anegón 2014). Many of the studies carried out on this topic are also restricted to assessing specific subject areas (Schloegl & Gorraiz 2010; Nieder et al. 2013; Gorraiz 2014) or geographical regions (Vaughan et al. 2017; Chi & Glänzel 2017; Wical & Vandenbark 2015), and even in these cases the results of different studies often do not provide a consistent indication as to the relationship between citations and usage. Furthermore, although there has been research into the impact open access makes on this relationship (Antelman 2004; Davis et al. 2008; Gargouri et al. 2010; Davis 2011; Davis & Walters 2011; Wang et al. 2015) and one study looked at the relationship between tweets and citations and usage (de Winter 2015), there have not been (as far as we know) any studies utilizing data from an open access 'mega journal' to assess the relationship between usage and citations on a large scale and across multiple subject areas.
With this in mind, we set out to further study the relationship between online article usage and citations, using data from the world's largest open access mega journal, Scientific Reports (Davis 2017). As Scientific Reports covers all areas of the natural and clinical sciences and publishes a large volume of articles, it provides an excellent subject on which to carry out a large-scale study of the relationship between usage and citations, while also allowing us to look at additional breakdowns of this relationship (namely the influence of time and subject areas). This study analyses more than 6000 articles with the aim of understanding, through an observational analysis of historical data, the relationship between usage and citations, including their correlation, the influence of timeframe on both citations and usage, and the differences between subject areas.

Initial data collection
To generate the dataset on which we carried out the analysis of citations and usage data, we extracted article information (Digital Object Identifiers and Publication Date) for all articles published in Scientific Reports between 2012 and 2014 from the journal's publishing platform. As Scientific Reports is a multidisciplinary journal and has a high rate of publication, focusing on this time period ensured we had a sizeable group of articles (7,381) that had been published across an assessable period of time (2 years) and also covered a broad range of subject areas. Scientific Reports only publishes two types of article content, 'Research' and 'Amendments and Corrections'; as we were only interested in understanding the relationship between citations and usage for published research, we removed all 'Amendments and Corrections' from our record counts (n = 159), leaving us with just the 'Research' articles published in this period (n = 7222).
For each article, we extracted its citation counts for the first two years since publication and its usage statistics for the first year since publication. We decided to use these different timeframes as we know from a number of previous studies (e.g. Perneger 2004, Moed 2005, Brody et al. 2006) that there is a delay in the time it takes for articles to accrue citations whereas usage data starts being generated immediately. To extract each article's citation counts we utilized the API for the Scopus database (https://www.scopus.com/home.uri), which enabled us to collect monthly citation counts over two years for each article, giving us 24 monthly citation counts per article. To extract each article's monthly usage data for the first year since publication we utilized WebTrends (https://www.webtrends.com), the web tracking system used by Scientific Reports' publisher (Nature Publishing Group) to monitor the journal's web activity. From this system we extracted three different monthly usage counts for each article (HTML views, PDF downloads and combined HTML and PDF usage); we will explore these different usage counts in the next section. As our aim was to focus on the potential correlation (or at least associations) between citations and usage, we only required cited articles, and so we removed all articles that had not been cited at the time of extraction (n = 373). We further removed those articles that were outliers in terms of usage (i.e. those that had more than 100,000 views), as there was a very small number of them (n = 6) and these had the potential to greatly skew the results of any analysis carried out. The resultant primary dataset was therefore made up of 6841 articles published in 2012-2014, which had been cited within two years of being published and which cover all 68 subject areas Scientific Reports lists on its website (https://www.nature.com/srep/browse-subjects).
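The two filtering steps described above can be illustrated with a minimal Python sketch. The `Article` record, its field names and the choice of combined HTML+PDF usage as the quantity compared against the 100,000 threshold are assumptions made for illustration; the original extraction code is not described in this paper.

```python
from dataclasses import dataclass
from typing import List

# Threshold taken from the text: articles with more than 100,000 views are outliers.
USAGE_OUTLIER_THRESHOLD = 100_000


@dataclass
class Article:
    """Hypothetical record layout; field names are illustrative only."""
    doi: str
    monthly_citations: List[int]  # 24 monthly citation counts
    monthly_usage: List[int]      # 12 monthly HTML+PDF usage counts (assumed measure)


def build_primary_dataset(articles: List[Article]) -> List[Article]:
    """Keep cited articles whose first-year usage is not an outlier."""
    kept = []
    for article in articles:
        if sum(article.monthly_citations) == 0:
            continue  # uncited articles are dropped
        if sum(article.monthly_usage) > USAGE_OUTLIER_THRESHOLD:
            continue  # usage outliers are dropped
        kept.append(article)
    return kept
```

Applied to records shaped like this, the function returns only the articles that would enter the primary dataset.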

Citation counts and definition of 'usage'
In gathering the data, we faced a potential difficulty in regard to "usage". While Scopus provided us with a single monthly citation count per article, meaning that our primary dataset had clear citation data for each article, WebTrends provided us with three different types of usage data per article, namely a count for HTML views per article, a count for PDF downloads per article and a count for combined HTML views and PDF downloads per article (hereafter referred to as 'HTML+PDF usage'). To identify which of these three usage measurements was the most appropriate for our analysis, we first set about seeing if there was a difference in the way in which each of the usage counts was distributed and how these compared to the distribution of citations. Additionally, previous studies have focused specifically on PDF downloads' relationship with citations because, as one study puts it, "…they measure at least the intention to use the downloaded material" (Gorraiz et al. 2014). So, beyond our main aim of investigating the relationship between usage and citations, we were also tangentially interested in the question of whether such an assertion was correct, and if different types of usage measurements had different relationships with citations. Therefore, we looked at the distribution of total citations over the first 24 months of an article's life, as well as the distribution of the totals for each of the different usage measurements (HTML views, PDF downloads and HTML+PDF usage) in the first 12 months after an article had been published. We found that the distributions of all elements (citations, HTML views, PDF downloads, HTML+PDF usage) were not normal. The skewed distributions for citations and HTML+PDF usage can be seen in Figure 1 and Figure 2; the distributions of HTML views and PDF downloads are very similarly skewed, as can be seen in the specifics of each of these distributions detailed in the next paragraph.
For citations the total counts ranged from 1 to 196, with a mean of 8.9 and a median of 6, which signals a very skewed distribution, with 90% of the articles having fewer than 19 citations and 70% having fewer than 10. The distribution of HTML+PDF usage was also very skewed, ranging from 207 to 99,365, with a mean of 2,746 and a median of 1,686; 90% of the articles have fewer than 5,081 counts and 70% have fewer than 2,523. HTML views were similarly distributed, ranging from 0 to 80,990, with a mean of 1,607 and a median of 1,035; 90% of the articles have fewer than 2,843 counts and 70% have fewer than 1,501. The same holds for PDF downloads, which range from 0 to 83,510, with a mean of 1,139 and a median of 548; 90% of the articles have fewer than 2,051 counts and 70% have fewer than 866.
In order to detect any difference between each of the measurements of usage, we decided to include all of them in our following analysis. Therefore, where relevant, our study will be broken into three different sets of analysis in regard to usage:
1. HTML views vs citations
2. PDF downloads vs citations
3. HTML+PDF usage vs citations

Definition of subject area groups
To enable us to carry out an analysis of the relationship between citations and usage at a subject level, we also broke down our dataset into subject area groups using Scientific Reports' standardized structure for subject tagging. The journal's tagging system is based on author-selected keywords (hereafter referred to as 'subject area tags'), which are chosen at the point of submission and inserted into the article's XML at publication; these also appear on the article page online. There are hundreds of subject area tags available, but all of those contained in our dataset fall under four top-level subject areas: biological sciences, Earth & environmental sciences, health sciences, and physical sciences.
Articles can have as many subject area tags associated with them as the article's author(s) choose, as these tags are not curated by the journal's Editorial staff. This presents a problem when aiming to assess the relationship between citations and usage at a subject level, as the subject areas chosen may not always be the most appropriate to the topic of the article, due to human error or a misunderstanding of tags. To minimize the effect of this, we only grouped articles based on the four top-level subject areas, based on the reasoning that even if an author selected some irrelevant subject tags, the primary subject area(s) the article relates to would be captured in the top-level subject areas.
Even after we reduced the number of subject areas to the four top-level areas, articles still appeared in multiple subject areas. As there would be double counted articles if we simply carried out an analysis of each area, we looked at the distribution of every tagging combination across the four top-level subject areas (see Table 1).

Table 1 (columns: Row #, Biological sciences, Earth & Environmental sciences, Health sciences, Physical sciences)

We identified 614 articles that covered three or more of the four top-level subject areas. These highly multidisciplinary articles (rows 5, 8 and 12-14 in Table 1) were not distinct enough to include in a meaningful subject analysis and so we removed them from the sample.
Furthermore, our analysis of the remaining multidisciplinary articles found that there was too high a degree of variation in the different subject tags to make any meaningful assumption that articles in these groups represented a distinct subject group (as opposed to simply 'multidisciplinary' research), and therefore we removed articles that contained any of the following contrasting tagging combinations:
• 'Biological sciences' and 'Physical sciences' (row 2 in Table 1, 1342 articles)
• 'Biological sciences' and 'Earth and environmental sciences' (row 6 in Table 1, 294 articles)
• 'Health sciences' and 'Physical sciences' (row 11 in Table 1, 27 articles)
We were then left with six groups of articles (rows 1, 3, 4, 7, 9 and 10 in Table 1); of these, the 36 articles that had only 'Earth and environmental sciences' tags (and no other subject area tags) and the 106 articles that had only 'health sciences' tags (and no other subject area tags) were comparatively too few (row 10 and row 9 in Table 1, respectively) to lead to any definite conclusions.
Because the size of the remaining groups varied considerably (ranging between 36 articles and 2,432 articles), and the topics covered by pairs of the top-level subject areas (namely 'biological sciences' and 'health sciences'; and 'Earth and environmental sciences' and 'physical sciences') were reasonably similar, we settled on combining the six article groups into two distinct subject area groups:
• 'Bio-Health' class (1812 articles), which contains:
○ all articles with 'biological sciences' tags and without 'physical sciences' tags, 'Earth and environmental sciences' tags or 'health sciences' tags (row 4 in Table 1, 850 articles)
○ all articles with 'health sciences' tags and without 'physical sciences' tags, 'Earth and environmental sciences' tags or 'biological sciences' tags (row 9 in Table 1, 106 articles)
○ all articles with both 'biological sciences' and 'health sciences' tags and without 'physical sciences' tags or 'Earth and environmental sciences' tags (row 3 in Table 1, 856 articles)
• 'Phys-Earth' class (2752 articles), which contains:
○ all articles with 'physical sciences' tags and without 'biological sciences' tags, 'health sciences' tags or 'Earth and environmental sciences' tags (row 1 in Table 1, 2432 articles)
○ all articles with 'Earth and environmental sciences' tags and without 'biological sciences' tags, 'health sciences' tags or 'physical sciences' tags (row 10 in Table 1, 36 articles)
○ all articles with both 'physical sciences' and 'Earth and environmental sciences' tags and without 'biological sciences' tags or 'health sciences' tags (row 7 in Table 1, 284 articles)
These final two subject area groups ('Bio-Health' and 'Phys-Earth') have no overlap with each other and so provide two distinct subject area groups on which solid conclusions can be drawn in regard to their impact on the relationship between usage and citations.
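The grouping rules above amount to a simple set test: an article belongs to a group only if all of its top-level subject areas fall within that group's pair of areas. The following Python sketch shows one way to express this. The function name and the string labels are illustrative assumptions, and the treatment of a 'health sciences' plus 'Earth and environmental sciences' combination (excluded here) is also an assumption, as that combination is not discussed in the text.

```python
from typing import Optional, Set

# Top-level areas that make up each combined subject area group.
BIO_HEALTH_AREAS = {"biological sciences", "health sciences"}
PHYS_EARTH_AREAS = {"physical sciences", "Earth and environmental sciences"}


def subject_group(top_level_areas: Set[str]) -> Optional[str]:
    """Return 'Bio-Health', 'Phys-Earth', or None if the article is excluded."""
    if not top_level_areas:
        return None  # untagged articles cannot be grouped
    if top_level_areas <= BIO_HEALTH_AREAS:
        return "Bio-Health"
    if top_level_areas <= PHYS_EARTH_AREAS:
        return "Phys-Earth"
    # Any combination spanning both groups (including three or more areas) is excluded.
    return None
```

For example, `subject_group({"biological sciences", "health sciences"})` returns `'Bio-Health'`, while `subject_group({"biological sciences", "physical sciences"})` returns `None`, mirroring the removal of row 2 in Table 1.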

Overlap of 'top cited' and 'top usage'
We began our analysis with the following question: are the top cited articles (in our dataset) also those with the top usage?
To find out the answer to this we identified the articles that were in the top 10% of cited articles (those articles with 19 citations or more, which are hereafter referred to as 'Top cited') for the dataset as a whole and for each subject area group. We also identified those articles which were in the top 10% for article usage (broken down by each of our three usage measurement groups, hereafter referred to as 'Top HTML views', 'Top PDF downloads' and 'Top HTML+PDF usage', or 'Top usage' collectively). We carried out an analysis of each of these subsets to calculate the overlap between 'Top cited' and 'Top usage' for all articles in the dataset, as well as specifically for the 'Bio-Health' and 'Phys-Earth' subject area groups. As can be seen in Table 2, the number of 'Top cited' articles identified for this overlap analysis was 637 in the 'All articles' group, 178 when looking at the 'Bio-Health' group and 259 for the 'Phys-Earth' group. The discrepancy in totals between the subject area groups and all articles is due to the fact that 200 of the 'Top cited' articles fell into subject area groups that were not included in our subject-level analysis.
Based on the number of articles in the overlap between the top cited and the top usage articles, according to each of the measures considered (HTML views, PDF downloads, and combined HTML+PDF usage), we calculated each overlap as a proportion of the number of articles in the top cited group, and obtained 95% confidence intervals using the 1-sample proportions test with continuity correction in the statistical programming language R (R Core Team 2017). The results show that the confidence intervals overlap in all cases, leading us to the conclusion that there is no significant difference between the proportions in the different groups, both across the subject areas considered and the usage metrics analyzed.
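As an illustration of the kind of interval described above, the Python sketch below computes a Wilson score interval with a fixed 0.5/n continuity correction for a single proportion. This is a re-implementation under stated assumptions, not the R code used in the study: the function name is hypothetical, and R's own routine may treat edge cases (such as how the correction is capped) differently. The counts in the usage note are invented for illustration only.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def proportion_ci(successes: int, n: int, conf_level: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval with a 0.5/n continuity correction (illustrative)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf((1 + conf_level) / 2)
    p_hat = successes / n
    z22n = z * z / (2 * n)

    def bound(p_c: float, sign: int) -> float:
        # p_c is the continuity-corrected estimate; sign selects upper (+1) or lower (-1).
        half_width = z * sqrt(p_c * (1 - p_c) / n + z22n / (2 * n))
        return (p_c + z22n + sign * half_width) / (1 + 2 * z22n)

    p_upper = p_hat + 0.5 / n
    p_lower = p_hat - 0.5 / n
    upper = 1.0 if p_upper >= 1 else bound(p_upper, +1)
    lower = 0.0 if p_lower <= 0 else bound(p_lower, -1)
    return lower, upper
```

With a hypothetical overlap of 250 articles out of the 637 'Top cited' articles, `proportion_ci(250, 637)` returns the lower and upper bounds of the 95% interval around 250/637. Two groups whose intervals overlap would then be read, as in the text, as showing no clear difference in proportions.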
Through this analysis we can see that, when using 'HTML+PDF usage' one could likely, in most cases, identify a closer relationship between citations and usage than when simply using 'HTML views' or 'PDF downloads' (although the difference between the different usage types is very small). While there is a noticeable overlap between the articles that are most used and those that are top cited, this overlap does not include all or even the majority of articles, and so citations and usage cannot be said to be interchangeable.
This led us to conduct a more in-depth analysis aimed at identifying the relationship between these metrics, by looking at the correlation of total citations with total usage, as described in the next section.

Correlations of Citations and Usage
As we found that there was a reasonable but not major overlap between the articles in our dataset that are most cited and those that have the top usage, we expanded the scope of our analysis to better understand the relationship between citations and usage. To do this we broadened our focus to look at the total numbers for citations and usage (as opposed to looking only at those articles that scored highest for these metrics), with the aim of utilizing the totals to identify the overall correlation between these groups. With our understanding of the different ways and time frames in which citations and usage metrics are accrued on articles, we calculated total counts as follows: Usage = total number of usage counts (according to all three usage groups) at the end of the first year after publication; Citations = total number of citations at the end of the second year after publication.
Articles take some time to accrue citations (the mean month of first citation is 6.303), whereas usage metrics begin to accrue immediately after publication. Therefore, we decided to consider citations over the first two years since publication and usage counts over the first year since publication. This has two main benefits. First, it ensures that there is enough data in both distributions analyzed in the correlation (usage and citations) to enable robust correlation results. Second, by using this approach we also created the basis for understanding the role of time in the relationship between usage and citations. Starting from the selected timeframes (a year for usage and two years for citations), we went on to refine these in a more detailed analysis, reported in a later section of this article, 'Role of time in the impact of usage on citations'.

Correlation tests on total usage and total citations
To find the overall correlation between total usage and total citations, we ran a Spearman correlation test on the total number of citations at the end of the second year after publication and the total number of 'HTML+PDF usage' at the end of the first year since publication for each article. We also ran the same test using 'HTML views' and then again using 'PDF downloads'. The tests on the two subject area groups ('Bio-Health' and 'Phys-Earth') produced the results shown in Table 4 and Table 5, all statistically significant at the 0.05 level. From Tables 3-5 we can see that there is a statistically significant, moderate correlation between usage and citations. We can also see that in most cases this correlation tends to be slightly stronger when just looking at 'PDF downloads' of an article as opposed to 'HTML views', although the difference is small, with the exception of 'Bio-Health' articles.
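For readers unfamiliar with the Spearman coefficient used in these tests, the following Python sketch computes it as the Pearson correlation of average ranks, which handles the tied counts that are common in citation data. The function names are illustrative assumptions, and the sketch returns only the coefficient; it does not compute the significance level reported in the tables.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)
```

Here `x` would hold each article's first-year usage total and `y` its two-year citation total. Because only ranks are used, the coefficient is insensitive to the heavy skew seen in both distributions.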

We see clearly that the correlation between usage and citations is markedly more pronounced in 'Phys-Earth' articles than in 'Bio-Health' articles (0.53 vs 0.43), and we hypothesize that this may be down to the fact that 'Bio-Health' articles tend to display a higher degree of variance in usage data than 'Phys-Earth' articles. This level of correlation means that in most cases it is safe to assume that when an article's usage (no matter its subject area) increases, so will its citations, albeit at a later date.
But at which point in those first 12 months of usage can one start to comfortably make this assumption? To answer this, we moved on to assess the change in correlation between total citations and total usage over the first 12 months after an article has been published.

Correlations by month for all article groups
To understand how the correlation between total usage and total citations changes over the first 12 months after an article is published, we ran another set of Spearman correlation tests. We ran a Spearman test between the total citation counts at the end of the second year and the cumulative 'HTML+PDF usage' at the end of each of the months from month 2 to month 12; we thus produced 33 test results across the three groups of 'All articles', 'Bio-Health' and 'Phys-Earth'. As would be expected, due to the cumulative nature of the counts being used in these correlation tests, the coefficients across all groups increase as the number of months increases, but it is notable that there is a period of increase across all groups which then levels off. For the 'All articles' group, the coefficient increases steadily over the first 7 months after publication, levels out momentarily at months 8 and 9, increases to 0.49 at month 10 and stays at this level until the end of the 12-month period (as can be seen in Figure 3). From this levelling off at 10 months, at a coefficient value of 0.49, we can reasonably make the assertion that 10 months after publication it is possible to utilize usage to (at least partially) understand the level of citations an article will receive in comparison to other articles of the same age. As the correlation is only moderate, one cannot predict the number of citations at the end of the 2nd year after publication purely based on the usage accrued at 10 months; however, one could reasonably state that an article with higher usage at 10 months is likely to have more citations after 24 months than an article with lower usage in the first 10 months. The increase of coefficients for 'Bio-Health' articles levels off much sooner than for the other two groups (as seen in Figure 4).
The correlation coefficient for this group of articles at the beginning of the time frame is lower than the other groups and does not reach the same level as the other groups (meaning that the correlation between usage and citations is weaker for these articles), but it does level off at a much earlier time. After an article has been available for six months, the coefficient levels off at 0.43 and stays at this level until month 12. This means that assumptions about the relationship between usage and citations for 'Bio-Health' articles can be made by month six, but as the coefficient is lower for this group (and so the correlation slightly weaker than moderate), the confidence one can have in making these assumptions is lower than with the other groups of articles.

Fig. 5 Distribution of correlation coefficients between total citations and cumulative usage in months 2-12 after publication ('Phys-Earth' group)

Finally, for 'Phys-Earth' articles we see a similar increase in coefficient levels as in the other groups, albeit starting from a higher value (0.42 at month 2). As with the 'All articles' group, the coefficients of the 'Phys-Earth' articles level off between month 8 and month 9 and then again between month 10 and month 11, with a slight increase at month 12, ending with a coefficient of 0.53. This final coefficient is considerably higher than that of the 'Bio-Health' group, which may suggest that we need to look at usage counts beyond the first year before the rate of increase of this correlation levels off completely. This higher coefficient at month 12 enables us to make a more comfortable assertion that the rate of usage in the first year for an article will relate to the level of citations that article will have two years after publication.
Based on these correlation tests of citation and usage (for all articles and the different usage and subject area groups within our dataset), we can confidently say that there is a moderate correlation between the level of usage in the first 12 months since publication and the number of citations 24 months after publication. Moreover, general assumptions can be drawn ten months after publication about the level of citations an article will have after two years (in relation to similarly-aged articles); this can likely start by month 6-7 for 'Bio-Health' articles, but can be drawn with more confidence for 'Phys-Earth' articles.
So far the two approaches we have taken to explore the relationship between citations and usage (comparing 'top cited' and 'top usage' articles, and looking at the correlation between total citations and total usage) have shown that there is a relationship between these two metrics, but this relationship has a time aspect that varies depending on an article's subject areas and we can only be moderately confident in the effect one will have on the other (primarily the effect usage will have on citations, as usage is accrued earlier than citations). Therefore, this paper will close by exploring if we can infer a more precise relationship between the two metrics by looking closer at the role time plays in their accrual.

Role of time in the impact of usage on citations
In the final section of this paper we will explore the role time plays in regard to the relationship between usage and citations. To do this we will look at the following questions:
• How is usage accrued in the first 12 months?
• How does usage relate to the time of first citation of an article?
• How does the time of first citation of an article relate to the total number of citations at the end of the first 24 months since publication?

Usage levels and time of first citation
We calculated the mean month of first citation for all articles, which is 6.30 (6.89 for 'Bio-Health' and 5.99 for 'Phys-Earth'), and the median month of first citation, which is 5 (6 for 'Bio-Health' and 5 for 'Phys-Earth'). To better understand the role time plays in the relationship between usage and citations, and as the usage metrics used in our analyses are cumulative, we decided to reduce the timeframe for analyzing how usage relates to time of first citation to just the usage accrued in the first six months. By narrowing the time window in this way, we were able to reduce the potential noise created by looking at the full 12 months, in particular the possible impact of a 'double effect' resulting from an increase in usage following an article's additional exposure from being cited. We defined 'top used' articles in each group as those articles that appear in the top 10% of all articles according to their usage counts in the first six months. We then compared this group with all other articles with respect to the month of their first citation. Figure 6 displays the distribution of the month of first citation for these two groups of articles.

Fig. 6 Histogram of the distribution of month of first citation for 'top used' articles and for all other articles
As shown in Table 7, the mean month of first citation for top used articles is 5.25 and the median is 4, while the mean month of first citation for all other articles is 6.42 and the median is 5. This suggests that when an article is among the most used in the first six months, it is also more likely to be cited earlier.
The mean month of first citation for 'Phys-Earth' articles is 5.99, and for highly downloaded articles it is 5.29. The mean month of first citation for 'Bio-Health' articles is 6.89, and for highly downloaded articles it is 5.95. So, 'Phys-Earth' articles have more downloads and are cited sooner than 'Bio-Health' articles.

Figure 6 (which is very similar to those for the 'Bio-Health' and 'Phys-Earth' groups) shows that the articles which were 'highly used' in the first 6 months seem to be more likely to have their first citation occur earlier, which is also confirmed by the box and whiskers plot in Figure 7. In order to test this statistically, we performed a one-sided Wilcoxon rank sum test with continuity correction between the two distributions. The results (W = 1811500, p < 0.0001) show a statistically significant difference, with the top used articles having a lower median month of first citation compared to all other articles (the pseudo-median of the difference between the two is estimated to be -0.99). Therefore, it can be reasonably stated that if an article is 'highly used' in the first 6 months after publication, then it is more likely to be cited earlier than an article that is not used as much in the first 6 months. Very similar results hold for the 'Bio-Health' (W = 127180, p = 0.0007, estimate of pseudo-median of the difference = -1.00) and 'Phys-Earth' articles (W = 300860, p = 0.0005, estimate of pseudo-median of the difference = -0.99).
We can conclude that if an article is 'highly used' in the first 6 months then it is more likely to be cited earlier, but what effect does being cited earlier have on overall citations in the first 24 months?

Time of first citation and total citations
To find the effect that being an early cited article has on an article's total citations after 24 months, we defined 'early cited' articles as those articles whose first citation occurs before the median month of first citation (which is 5 for all articles, 6 for 'Bio-Health' and 5 for 'Phys-Earth'). We performed a Wilcoxon test on the distribution of total citations of early cited articles and the distribution of total citations of all other articles. The results (W = 8071200, p < 0.0001) are statistically significant and show that the early cited articles have a median of total citations that is higher than that of all other articles (the pseudo-median of the difference between the two is estimated to be 4.00). The results are very similar in the case of 'Bio-Health' (W = 590660, p < 0.0001, estimate of pseudo-median of the difference = 2.99) and 'Phys-Earth' articles (W = 1303400, p < 0.0001, estimate of pseudo-median of the difference = 4.00). Through this analysis, we found that, across the board, when an article is cited earlier it is cited more over the first 24 months after publication (as can be seen from Table 7). Because very similar results were obtained for the 'Bio-Health' and 'Phys-Earth' groups, we can safely assume that the results derived from this analysis for all groups are suitably robust.
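The 'early cited' definition above can be sketched as a simple split on the median month of first citation. This is a minimal illustration only: the function name, the tuple layout and the sample records are hypothetical and are not taken from the study's dataset.

```python
from statistics import median


def split_early_cited(articles):
    """Split total citation counts into 'early cited' and all other articles.

    articles: list of (month_of_first_citation, total_citations) tuples.
    An article counts as 'early cited' when its first citation occurs
    strictly before the median month of first citation.
    """
    cutoff = median([month for month, _ in articles])
    early = [total for month, total in articles if month < cutoff]
    others = [total for month, total in articles if month >= cutoff]
    return cutoff, early, others


# Hypothetical records: (month of first citation, total citations at 24 months).
sample = [(2, 10), (3, 12), (5, 6), (7, 4), (9, 1)]
cutoff_month, early_totals, other_totals = split_early_cited(sample)
```

The two resulting lists of total citations are the distributions that a rank sum test would then compare. Note that, because the cut is strict, articles first cited exactly in the median month fall into the 'all other articles' group, so the 'early cited' group can contain fewer than half of the articles.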
In the 'All articles' group we found that the mean number of total citations of all 6,841 articles is 8.9 citations 24 months after publication, with a median of 6 (across a range of 1-196). However, of the 2,964 'early cited' articles in this group, the mean number of total citations is 12.1 with a median of 9. The difference between the mean number of total citations of all articles and 'early cited' articles in this group represents an average uplift of 36%. Therefore, we can comfortably assert that when an article is 'early cited' it is much more likely to have higher total citations 24 months after publication.
Likewise, in the 'Bio-Health' group we found that the mean number of total citations for all 1,812 articles in this group is 6.7, with a median of 5 (and a range of 1-74), while the mean number of total citations for the 856 'early cited' articles is 9.0 with a median of 7. For the 'Bio-Health' group, this represents a 34% uplift in the average number of citations for 'early cited' articles. So once again we find that those articles that are cited early have a greater number of total citations at the end of their first two years of publication. Finally, and perhaps unsurprisingly, we found a very similar story for the 2,752 'Phys-Earth' articles. In this group the overall mean number of total citations is 10.6, with a median of 7 and a range of 1-196, whereas the mean number of total citations for the 1,240 'early cited' articles increases to 14.1 and the median is 10, representing an average uplift in total citations of 33%.
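The uplift percentages quoted for the three groups can be reproduced from the reported means with a short calculation. The sketch below is illustrative only; the function and variable names are hypothetical, while the means are those reported in the text. Note that the baseline in each case is the mean over all articles in the group, which includes the 'early cited' articles themselves.

```python
def uplift_percent(baseline_mean, subgroup_mean):
    """Percentage by which subgroup_mean exceeds baseline_mean."""
    return (subgroup_mean / baseline_mean - 1) * 100


# (mean total citations of all articles, mean total citations of 'early cited' articles)
reported_means = {
    "All articles": (8.9, 12.1),
    "Bio-Health": (6.7, 9.0),
    "Phys-Earth": (10.6, 14.1),
}

# Round to whole percentages, as quoted in the text.
uplifts = {
    group: round(uplift_percent(all_mean, early_mean))
    for group, (all_mean, early_mean) in reported_means.items()
}
```

Rounded to whole numbers, this gives 36%, 34% and 33% for the three groups respectively, matching the figures in the text.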
Based on these results we are confidently able to state that if an article is 'early cited' then it is more likely to be cited more overall (after two years), and it is likely to have a citation uplift of approximately 30% over those articles that are not 'early cited'.

Inferring the impact of usage on total citations
Throughout this section we have looked at the role time plays in the relationship between usage and citations, both from the point of view of how usage impacts when an article is first cited, as well as how the time of first citation of an article relates to the total number of cites that article has after 24 months.
At a simplified level the analysis in this section has shown that when an article is downloaded more in the first six months it is much more likely to be cited earlier and when an article is cited earlier it is much more likely to be cited more in total. We can therefore infer from this that when an article is downloaded more in the first six months it is more likely to be cited more overall within its first two years.

Discussion and conclusion
Throughout this paper we have aimed to understand the relationship between usage and citations in the first two years of a published article's life. In exploring this relationship, we have looked at the correlation between the citations and usage in articles published in a multidisciplinary, open access mega journal through the lenses of types of usage, subject areas and the role of time.

Types of usage
In relation to types of usage, this paper concludes that for this type of analysis one can very reasonably use HTML+PDF usage as the best proxy for article usage. While there is some variation between this metric and either HTML views or PDF downloads on their own, we have shown that it is more reliable to use HTML+PDF usage to identify a relationship between 'top cited' and 'top usage' articles and the difference between the three metrics is mostly negligible when looking at correlations between usage and citations. Therefore, this paper proposes that research looking at the usage of articles and its relationship with citations can successfully and reliably use the HTML+PDF usage metric for its analysis.

Subject areas
From a subject area perspective, we have found that for this dataset there appears to be a clear difference in the relationship between citations and usage when looking at different subject areas. While there is no statistically significant difference between the subject groups in the overlap between 'top cited' and 'top usage' articles, there is a noticeably higher correlation between the total usage and total citations of articles in the 'Phys-Earth' group when compared to the 'Bio-Health' group, although both still only showed a moderate correlation (0.53 and 0.43, respectively). Finally, we found that although 'Phys-Earth' articles are always more likely to be cited earlier on average than 'Bio-Health' articles and cited more on average in total after 24 months, 'Bio-Health' articles see a greater benefit from being 'highly used', as well as from being 'early cited'.

The role of time
To better understand the role time plays in the relationship between citations and usage, we devised a variety of experiments. First, we looked at the effect that being highly used within the first six months has on when citations accrue, and found that the more highly used an article is, the more likely it is to be cited earlier than average. We then moved on to analyze the impact of early citations on the overall number of citations an article will accrue in the 24-month period after publication: here we found a clear uplift (of ~30% on average) in overall citations for those articles that were cited early.
In this paper we have examined a number of aspects in the relationship between usage and citations. There is a greater amount of analysis and research that the results of this paper point towards: primarily, understanding more clearly the relationship between time and citations, and better understanding what role being a mega journal plays in the results we have found and conclusions we have drawn. We intend to explore these elements in future research projects and hope that the findings we have presented here are used as a basis for experimentation and analysis by others.