Introduction

The growing demand for accountability in the use of public and private funds renders it increasingly important to measure the use and impact of scientific publications. Academic libraries must present quantitative data demonstrating the value of journal subscriptions in order to justify the budget allocated to them, but journal use is difficult to observe directly in its full magnitude and, therefore, to quantify objectively (Chew et al., 2016). Various methods have been proposed for this purpose, ranging from measuring downloads of electronic subscriptions (Duan & Xiong, 2017; Fernández-Ramos et al., 2019) to analysing bibliographic references in researchers’ scientific output (Peñaflor & Aliwalas, 2022; Vaaler, 2018; Wilson & Tenopir, 2008). It is also common to combine these methods with cost indicators (Gumpenberger et al., 2012; Kurtz & Bollen, 2010). Despite the utility of each of these methods, they present limitations when employed in isolation and provide only a partial view of the use and usefulness of collections.

As regards the analysis of downloads and citations to evaluate collections, Ivanov et al. (2020) view these as complementary indicators of a journal’s intellectual value (identifying the frequency with which its articles are cited) and of a publication’s usefulness (identifying the frequency with which a journal’s articles are consulted and downloaded). Martin et al. (2016) stress, however, that the two metrics are not the same and therefore not comparable, because downloading an article requires less effort than citing an article. Thus, the number of downloads of a widely used title is likely to be much higher than the number of citations of even a frequently cited article (Chu & Krichel, 2007; Gorraiz et al., 2014; Moed, 2005; Wan et al., 2010; Watson, 2009).

It should also be borne in mind that not all content is downloaded for research purposes; some may be downloaded purely for information by professionals, or for learning purposes by teachers or students (Gorraiz et al., 2014; Martin et al., 2016). As a logical consequence, articles are often downloaded many times but remain uncited. Thus, on the one hand, downloads may not be a perfect proxy for overall usage, but they at least measure the intention to use the downloaded material. On the other hand, many citations are included in reference lists without prior reading of the cited document, and citations may merely measure impact within the “publish or perish” community (Gorraiz et al., 2014). Accordingly, usage metrics can be regarded as complementary to citation metrics (Bollen et al., 2005; Chi & Glänzel, 2017; Hitchcock et al., 2003; O’Leary, 2008).

Furthermore, although both metrics change over time, their evolution is not necessarily parallel, because citation of an article usually occurs some time after its download due to the interval between consulting it and citing it in a subsequent publication (Wan et al., 2010; Watson, 2009). Meanwhile, Vogl et al. (2018) have suggested that because citations increase over time, they may be the best indicator of an article’s quality. In contrast, downloads and other alternative metrics have a shorter half-life, tending to stagnate after publication, and therefore measure immediate influence. However, the two measures influence each other: early downloads are a predictor of citations and give an idea of a paper’s potential (Bollen et al., 2005; McDonald, 2007), while citations in turn drive later downloads (Moed, 2005; Watson, 2009). These circumstances condition the correlations between the two variables, which will not always be high, as pointed out by Coats (2008), who highlighted the lack of consensus about the value of an article.

Martin et al. (2016) have observed that although the literature abounds in studies analysing the use of journal subscriptions on the basis of either download data or citations in the scientific output of researchers, far fewer studies have combined both types of data. However, this type of analysis is highly important because it provides a more complete picture of the usefulness of collections in institutions and minimises the limitations and partial view offered by the isolated use of citation or download data.

Thus, in the context of an institution or a group of institutions, a joint analysis of download and citation data would help determine whether there is a relationship between these two variables in such a way that one predicts the other. Examples of this combined use of data include Wical and Vandenbark’s (2014) study at the University of Wisconsin-Eau Claire and Faulkner’s (2021) study at the Psychology Department of California State University. In both cases, the authors indicated that the results would be used to make decisions regarding journal subscriptions.

Several studies in the specialised literature have analysed this relationship, but the results obtained have been mixed. Before the existence of standardised usage statistics (COUNTER), Tsay (1998) compared the use of journals in a medical library with citations by researchers at the institution over the same period, and found a statistically significant relationship between frequency of use and the number of medical science journal citations. Another early study suggesting a correlation between citations and other measures of journal usage was the one conducted by Blecic (1999) in the health sciences, at the University of Illinois (Chicago). Similarly, after reviewing COUNTER statistics for the California Institute of Technology, McDonald (2007) reported that the use of online journals was a significant variable in predicting citation patterns. Other studies in which positive correlations have been found were those conducted by Feyereisen and Spoiden (2009) in the Department of Psychology and Education Science at the University of Louvain, and by Gumpenberger et al. (2012) at the University of Vienna.

More recently, Wood-Doughty et al. (2019) analysed this association at the ten universities belonging to the University of California System, studying the scientific output of their researchers between 2010 and 2016. They found a positive correlation between the two variables, but with small differences depending on subject area. Other studies that have reported positive correlations include Rodríguez-Bravo et al. (2021), who analysed scientific production on Library and Information Science at universities in Castile and León (Spain), and De Groote et al. (2013), who analysed scientific production in medicine at the University of Illinois (Chicago). In contrast, studies by Gao (2016) at the University of Houston School of Communication, Ke and Bronicki (2015), also at the University of Houston but in the field of psychology, and Fernández-Ramos et al. (2022) in the same field and limited to the university library consortium of Castile and León (Spain), found no significant correlation between citations and downloads.

Besides the reasons given by Vogl et al. (2018), other explanations for this disparity in the results of studies analysing the relationship between citations and downloads include the characteristics of each institution and its users, with very different citation patterns depending on the discipline, and the method employed in each study. The correlation between citation and usage data depends on the discipline’s publication output as documented in previous studies focused on particular disciplines, journals (Coats, 2008; Moed, 2005; O’Leary, 2008; Watson, 2009) or platforms (Bollen et al., 2005; Brody et al., 2006; Chu & Krichel, 2007; Wan et al., 2010).

Moreover, although relationships have generally been measured using the same method, not all studies have examined the totality of downloads and citations or used the same sampling technique. Thus, for example, in a study conducted at the Galter Health Sciences Library in Chicago analysing the correlation between downloads and citations of dermatology publications issued between 2007 and 2016, Pastva et al. (2018) found that the results obtained when including all the most frequently cited journals differed from those obtained when journals from other disciplines were excluded. Similarly, in the studies by Rodríguez-Bravo et al. (2021) and Fernández-Ramos et al. (2022), the correlation coefficient increased significantly when only discipline-specific journals were included.

Meanwhile, a recent analysis of downloads of journals included in the main big deals (ScienceDirect, Emerald, Springer and Wiley) subscribed to by the Castile and León consortium found that downloads of bundle titles had risen in recent years (2012–2018) (Fernández-Ramos et al., 2019), despite a parallel decline in the number of teaching staff and students over the same period and despite the proliferation of open access journals, repositories, academic social networks and platforms such as Sci-Hub, which are opening new and increasingly important avenues of access to scientific information for the academic community. However, the same study also found that only a limited number of bundle titles was being used in the consortium universities, with a small number of titles accounting for the majority of downloads.

We believe that the rise in downloads reported in that study reflects the convenient, transparent and direct access that researchers have to subscribed resources, and that the still significant use of subscribed journals is likely to lead to increased consultation and citation of articles from these journals. This strongly suggests a need to determine whether there is a relationship between downloads and citation in scientific output. Thus, the aim of the present study was to ascertain the degree of relationship between downloads of journals subscribed to by the seven universities that make up two consortia of university libraries in Spain (Castile and León and Galicia) and the citation of these journals in the bibliographic references provided in scientific production by these universities’ researchers, limiting the study to articles indexed in Scopus over the period 2010–2017.

Methods

We used an observational and quantitative method to achieve the proposed objectives. Thus, we obtained data on downloads of scientific journals subscribed to by the university libraries included in the study; we searched Scopus for the scientific output from these seven universities, downloading and normalising all relevant bibliographic records from Scopus; we extracted and analysed the bibliographic references included in these records; and we compared downloads of the subscribed journals against their citation in the bibliographic references given in scientific production by researchers at the universities included in the study. Figure 1 depicts the stages included in the research.

Fig. 1 Research stages

Download Data Collection

Download data were obtained and standardised for journals included in the Emerald, ScienceDirect, Springer and Wiley bundles subscribed to by the two consortia included in this study, for the study period 2010–2016. This information was provided to us by the participating libraries based on the COUNTER Journal Report 1 (JR1), disaggregated by year, university and provider.
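
By way of illustration, the following sketch (in Python with pandas) shows how a set of JR1 exports could be consolidated into a single table of downloads per journal, year and university. The file layout and the column names ('Journal', 'ISSN', 'Total') are assumptions made for the example, since actual JR1 files vary by provider and COUNTER release.

```python
import pandas as pd
from pathlib import Path

# Hypothetical layout: one JR1 export per university/provider/year,
# named e.g. "USAL_Wiley_2014.csv", with at least 'Journal', 'ISSN'
# and a 'Total' download-count column.
def load_jr1_reports(folder: str) -> pd.DataFrame:
    frames = []
    for path in Path(folder).glob("*.csv"):
        university, provider, year = path.stem.split("_")
        df = pd.read_csv(path)
        df["university"], df["provider"], df["year"] = university, provider, int(year)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# Consolidate: one row per journal, disaggregated by university and year.
reports = load_jr1_reports("jr1_reports")
downloads = (reports
             .groupby(["university", "year", "Journal", "ISSN"])["Total"]
             .sum()
             .reset_index(name="downloads"))
```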

Search and Download of Scientific Output

We searched Scopus and downloaded indexed scientific production published between 2010 and 2017 and written by researchers from the seven public universities that make up the two consortia of Castile and León and Galicia (see Table 1). Given the time lag between downloading an article and subsequently citing it, an additional year was included for scientific production. The search was conducted in July 2018 for each of the seven universities, using the university name in the “Affiliation” field. Records were downloaded in .csv format and then imported into Excel and analysed as described below.
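
As an illustration of this step, the sketch below loads several Scopus CSV exports into a single table using Python and pandas rather than Excel. The column names ('Year', 'EID') are those used in standard Scopus exports, but the file names are hypothetical.

```python
import pandas as pd

# One hypothetical export file per university search.
files = ["scopus_usal.csv", "scopus_usc.csv"]  # etc., one per university

records = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Keep records in the study window. For a pooled analysis, articles
# co-authored across several of the universities (retrieved once per
# export) are deduplicated on the unique Scopus identifier 'EID';
# per-university analyses would instead keep one copy per university.
records = (records[records["Year"].between(2010, 2017)]
           .drop_duplicates(subset="EID"))
```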

Table 1 Universities included in the study

Analysis of Bibliographic References

Bibliographic references were extracted from the “References” field of each downloaded record, and a database was created in which the references corresponding to journal articles were standardised and cleaned. This process was semi-automated, using an algorithm designed to identify references to journal articles by analysing the structure of each bibliographic reference and locating the journal title. However, the lack of standardisation made it necessary to manually check for errors in the references (e.g. modifying references written in Chicago style) and for ambiguities in some journal names (journals with abbreviated or expanded titles, subtitles, or words preceding the journal name). Subsequently, bibliographic references were counted for each journal.
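
The exact algorithm is not reproduced here, but the following simplified sketch illustrates the kind of structure-based matching described, assuming that references within the Scopus 'References' field are separated by semicolons and that journal-article references follow the pattern "Authors, Article title (year) Journal Title, volume, pages".

```python
import re
from collections import Counter

# Heuristic: a journal-article reference contains "(year)" followed by
# the journal title, which ends at the next comma (before the volume).
JOURNAL_RE = re.compile(r"\((?:19|20)\d{2}\)\s*([^,]+),")

def journal_titles(references_field: str) -> list[str]:
    """Return the lower-cased journal titles found in one References field."""
    titles = []
    for ref in references_field.split("; "):
        match = JOURNAL_RE.search(ref)
        if match:
            titles.append(match.group(1).strip().lower())
    return titles

refs = ("Doe J., An example article (2014) Journal of Examples, 7 (2), pp. 1-10; "
        "Roe A., Another one (2015) Acta Exemplorum, 3, pp. 11-20")
print(Counter(journal_titles(refs)))
# Counter({'journal of examples': 1, 'acta exemplorum': 1})
```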

Analysis of the Relationship Between Citations and Downloads

Once the bibliographic references corresponding to journals had been identified, these were matched against the list of journals included in the four big deals mentioned in stage one of the research, for each of the universities. This enabled us to select those references that corresponded to subscribed and cited journals. We used this information to create a table containing the citation and download data for the subscribed journals that had been cited in the scientific output from these seven universities between 2010 and 2017.
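
A minimal sketch of this matching step, with toy data standing in for one university's citation counts and subscription list, might look as follows; real titles would first require the normalisation described above.

```python
import pandas as pd

# Toy stand-ins: per-journal citation counts extracted from the
# references, and the big-deal subscription list with download totals.
citations = pd.DataFrame({
    "journal": ["journal of examples", "acta exemplorum"],
    "citations": [12, 3],
})
subscribed = pd.DataFrame({
    "journal": ["journal of examples", "uncited journal", "acta exemplorum"],
    "downloads": [340, 25, 98],
})

# A left join on the subscription list keeps every subscribed journal;
# titles that were never cited receive a citation count of zero.
merged = subscribed.merge(citations, on="journal", how="left")
merged["citations"] = merged["citations"].fillna(0).astype(int)
print(merged)
```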

These data were then used to calculate the percentage of subscribed journals that had been cited in the study period and the volume of citations corresponding to subscribed journals. To analyse the relationship between citations and downloads (of cited and subscribed journals), we generated scatter plots of the two variables. These plots enabled us to identify a series of outliers that might distort the results, and we eliminated all those assigned an anomaly index over 100 by SPSS (v 26). Once these values (which represented less than 0.1% of the total) had been removed, Pearson’s correlation coefficients were calculated to test the correlation between citations and downloads for each of the seven universities. These coefficients were obtained separately for two conditions: using data for all subscribed journals at each university, and using only data for subscribed journals that had been cited at least once.
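
The sketch below reproduces the two correlation conditions using scipy in place of SPSS, with toy data; the SPSS anomaly-index filter is not replicated here and would precede this step.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy data standing in for one university's per-journal totals.
citations = np.array([0, 2, 5, 11, 40, 7, 0, 23, 1, 0])
downloads = np.array([10, 60, 150, 300, 900, 220, 5, 640, 45, 2])

# Condition 1: all subscribed journals.
r_all, p_all = pearsonr(citations, downloads)

# Condition 2: only journals cited at least once.
cited = citations > 0
r_cited, p_cited = pearsonr(citations[cited], downloads[cited])

print(f"all journals: r = {r_all:.2f} (p = {p_all:.3f}); "
      f"cited only: r = {r_cited:.2f} (p = {p_cited:.3f})")
```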

Discussion of Results

Scientific Production and Bibliographical References

Table 2 shows the universities’ scientific production indexed in Scopus. As can be seen, there was a sustained increase in the volume of publications over the study period, albeit with some differences in the universities’ scientific output, which were mainly due to disparities in university size in terms of the number of students and—above all—researchers at each university (Table 3). Thus, the University of Santiago de Compostela was the most productive, while the universities of Burgos and León, which had the fewest students and teaching staff, presented the lowest level of scientific output.

Table 2 Scientific production at the universities included in the study
Table 3 Number of researchers

As expected, the bibliographical references cited in publications mainly corresponded to scientific journal articles. Although some differences were detected between universities, ranging from 73.97 to 80.97% of references (Table 4), they were not particularly significant. These small variations might be due to greater specialisation in one or another subject area at each university. It is well known that not all disciplines present the same citation patterns and that some disciplines primarily use scientific journals, as in the case of the health sciences (Larivière et al., 2006; Tucker, 2013), whereas others rely more heavily on books and book chapters, as in the case of the humanities (Arakaki, 2018; Ezema & Asogwa, 2014), or on conference proceedings, as in the case of engineering (Zhang, 2018).

Table 4 Distribution of bibliographic references

One of the most striking results of the study was the limited percentage of subscribed journals that were cited in the researchers’ scientific production, as can be seen in Table 5. Although there were differences between universities, ranging from 15% at the University of Burgos to more than 50% at the University of Santiago de Compostela, in general we found that a high percentage of the subscribed scientific journals were not cited by researchers in their publications over a period as long as eight years. These results are in line with those of other studies, such as Fernández-Ramos et al. (2022) and Shu et al. (2018), and also agree with studies reporting that many of the journals subscribed to through big deals are rarely if ever downloaded (Fernández-Ramos et al., 2019; Srivastava & Kumar, 2018; Zhu & Xiang, 2016).

Table 5 Use of subscribed journals

Predictably, citation data for subscribed journals are closely related to volume of scientific output. As can be seen in Fig. 2, there was a strong correlation between the two variables: the higher the scientific output, the higher the percentage of subscribed journals that were cited, since the more articles published, the greater the chances of citing any of the subscribed journals (and other non-subscribed journals).

Fig. 2 Relationship between volume of scientific production and percentage of subscribed journals cited

Relationship Between Citations and Downloads

Most downloads of subscribed journals corresponded to journals that had been cited at least once in the study period, as can be seen in Table 6, which shows percentages of around 90% for most universities. The exception was the University of Burgos (71.78%), which, as can be seen in Table 2, was the university with the lowest scientific production.

Table 6 Downloads corresponding to cited journals

Our results showed a strong correlation between citations and downloads in the universities analysed; however, as in the case of the number of journals cited, this correlation was not the same for all universities, being greater in the case of universities with a higher scientific output. Table 7 shows the Pearson’s correlation coefficients for each of the universities analysed, giving the correlations between citations and downloads separately for analyses that included (1) all subscribed journals, and (2) only journals that had been cited at least once. We found a slightly higher correlation when all subscribed journals were included than when only cited subscribed journals were included. The probable explanation for this finding is that many journals are neither cited nor downloaded.

Table 7 Correlations between citations and downloads according to Pearson’s correlation coefficient

The figures below show the dispersion of citation and download values for the subscribed journals with at least one citation at each of the universities included in the study, ranked from lowest to highest correlation between citations and downloads. In these figures, a logarithmic transformation has been applied to both variables in order to better illustrate the correlations between them (Figs. 3, 4, 5, 6, 7, 8 and 9).
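
For readers wishing to reproduce this kind of plot, the sketch below generates an illustrative log-log scatter plot with matplotlib; the data are synthetic, whereas in the actual figures each point corresponds to one cited, subscribed journal.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: skewed, positively related downloads and citations,
# mimicking the long-tailed distributions typical of journal usage.
rng = np.random.default_rng(42)
downloads = rng.lognormal(mean=5.0, sigma=1.5, size=300)
citations = downloads ** 0.8 * rng.lognormal(mean=-2.0, sigma=0.6, size=300)

fig, ax = plt.subplots()
ax.scatter(downloads, citations, s=12, alpha=0.5)
ax.set_xscale("log")  # logarithmic transformation of both axes,
ax.set_yscale("log")  # as applied in Figs. 3-9
ax.set_xlabel("Downloads (log scale)")
ax.set_ylabel("Citations (log scale)")
ax.set_title("Citations vs. downloads (illustrative)")
plt.show()
```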

Fig. 3 Relationship between citations and downloads at UBU

Fig. 4 Relationship between citations and downloads at USAL

Fig. 5 Relationship between citations and downloads at UVA

Fig. 6 Relationship between citations and downloads at UDC

Fig. 7 Relationship between citations and downloads at UVIGO

Fig. 8 Relationship between citations and downloads at ULE

Fig. 9 Relationship between citations and downloads at USC

These results are consistent with previous studies showing a similar positive correlation between downloads and citations, the former being a variable that can predict the values of the latter (Feyereisen & Spoiden, 2009; Gumpenberger et al., 2012; McDonald, 2007; Rodríguez-Bravo et al., 2021; Wood-Doughty et al., 2019). However, other studies have failed to find significant correlations between the two variables, as in the case of Gao (2016) and Ke and Bronicki (2015).

Conclusions

The results of this study confirm a relationship between the size of the universities analysed and the volume of their scientific production, which increased over the study period. Likewise, they confirm the importance of scientific journals as a fundamental vehicle for the transmission of knowledge, as evidenced by the finding that more than 73% of the references analysed in this study corresponded to this type of document, in line with the results found in other studies (Fernández-Ramos et al., 2022). This importance of scientific journals has recently been highlighted by Kim et al. (2020) and Herman et al. (2020). The latter indicate that journals are the only product that still consistently fulfil all the functions traditionally attributed to them—recording, curation, evaluation, distribution and archiving—and that they remain necessary to institutionalise and confidently add a scholarly contribution to the body of knowledge. It should also be noted that, in the case of Spanish researchers, the current evaluation system influences document type, marginalising monographs or book chapters in favour of journal articles (Osca-Lluch et al., 2019).

The citation of subscribed scientific journals reached a moderate percentage in most universities, the highest being more than 50% at the University of Santiago de Compostela. It is important to highlight the strong correlation found between citation of subscribed journals and the volume of scientific output. According to Shu et al. (2018), researchers only cite a fraction of the journals subscribed to by their libraries, and that fraction is decreasing, reducing the value of subscribed journal bundles, especially when the university is small, as is the case for some universities in this study. However, citations of journals included in subscription bundles confirm that the publishers distribute and facilitate access to quality, useful content. Thus, they give visibility to the journals they distribute and promote their reading and subsequent citation, although the use of subscribed journals in scientific production varies considerably depending on discipline, as previous studies have found (Fernández-Ramos et al., 2022; Mongeon et al., 2021; Rodríguez-Bravo et al., 2021). Moreover, it should be kept in mind that articles are often downloaded many times but remain uncited, because not all content is downloaded for research purposes (Gorraiz et al., 2014; Martin et al., 2016).

It should be borne in mind that the present analysis was limited to four electronic subscription bundles, not to all the subscriptions maintained by the universities studied, although these bundles included three of the main big deals (ScienceDirect, Springer and Wiley), and one of them, ScienceDirect, contains the most widely used content, as reported in various studies, including some conducted in the consortium of Castile and León (Fernández-Ramos et al., 2019). Despite the increase in downloads noted in previous studies, this volume of downloads does not strictly parallel the volume of citations. Previous studies (Fernández-Ramos et al., 2022; Rodríguez-Bravo et al., 2021) have found that, besides the journals distributed as part of a big deal, widely used journals also include those from other commercial publishers such as Taylor & Francis, prestigious institutional publishers, and publishers that offer open access content. However, we found a significant presence of the most frequently downloaded journals among the cited journals, with high percentages (around 90%) in almost all universities. This result agrees with other studies that found a stronger correlation when comparing the most downloaded articles with the most cited ones (Chu & Krichel, 2007; Gumpenberger et al., 2012; O’Leary, 2008).

One of the main findings of this study is the high correlation between citations and downloads generally observed in the universities included in the analysis. All of these universities presented a Pearson correlation coefficient above 0.5, with the exception of the university with the lowest scientific production, the University of Burgos. In general terms, a higher correlation was observed in universities with a higher output. This finding supports the idea, reported elsewhere in the literature (Tenopir & King, 2000), that researchers are the main users of scientific journal articles and that they use them primarily for research purposes. We therefore conclude that our results indicate that download values can predict future citation values, which highlights the usefulness of download data when making decisions about collection management in academic libraries (Gumpenberger et al., 2012).

These results should be viewed in light of the particularities of the data analysed (Scopus as the source for the analysis of scientific production, and four big deals as the source of journal download data) and the following limitations. On the one hand, citation and download data for a given period of time were considered in conjunction, which only allows an approximation to reality, since the date of download of a cited article is uncertain, although it is generally close to the date of citation. Time delays between download and citation show large variability among users, due to differences in the amount of time they need to prepare a manuscript and to differences in publication delays among the journals selected for publication (Moed, 2005). As pointed out by Brody et al. (2006), the time delay may range anywhere from 3 months to 1–2 years or even longer. In addition, downloads and citations show different obsolescence functions (Ding et al., 2021; Moed & Halevi, 2016), and there are disciplinary differences in obsolescence characteristics between citations and downloads using synchronic and diachronic counts (Gorraiz et al., 2014). The correlation between citations and downloads is also discipline-dependent (McGillivray & Astell, 2019; Moed, 2005; Moed & Halevi, 2016; Wan et al., 2010).

On the other hand, regarding downloads, COUNTER JR1 does not cover downloads of other versions of published papers (such as preprints or postprints in repositories or on academic social networks) or downloads of open access articles made from outside the university domain (Gorraiz & Gumpenberger, 2010; Mongeon et al., 2021). Furthermore, errors may have occurred in the standardisation of journal titles, which may have resulted in duplicate journals. In this respect, it is worth highlighting the intrinsic difficulty of analyses such as the present one, owing to the time required for manual data cleaning and standardisation (Belter & Kaske, 2016; Mongeon et al., 2021; Rodríguez-Bravo et al., 2021). It is also worth noting the existence of outliers, which corresponded to extreme cases of journals that were frequently cited but rarely downloaded. In some cases, this may have been because the subscription had been discontinued at some point, or because the journals had changed their names but continued to be cited. Such cases would require an in-depth analysis of each of these journals.