This study has built on and extended previous scientometric research inquiring into the sizes of ASEBDs. It is novel insofar as it improves query-based methods for assessing ASEBDs and establishes those methods as adequate, fast predictors of the sizes of most ASEBDs. The methods used made it possible to assess a multitude of different ASEBDs and compare their sizes. The process delivered not only size information but also insights into the diverse query functionalities of ASEBDs that underpin the daily scientific enquiries of many researchers.
Size
We obtained a QHC from ten of the 12 ASEBDs examined. Based on these QHC data we can assume that Google Scholar, with 389 million records, provides by far the greatest volume of scholarly information. Our maximum QHC seems plausible when compared with similar multidisciplinary search engines such as Microsoft Academic, which as of January 2018 covered more than 170 million records and is considered considerably smaller than Google Scholar, at a ratio of 1:2.17 (Orduña-Malea et al. 2015). If we apply the same ratio between the two search engines in January 2018, Google Scholar would amount to roughly 372 million records, a number close to our QHC of 389 million. Nevertheless, it is important to bear in mind that this size comparison might be flawed, as Microsoft Academic has been relaunched since the Orduña-Malea et al. (2015) research was conducted. This relaunch could have significantly impacted the structure and size of Microsoft Academic (Hug and Braendle 2017) and its performance in retrieving search results with high precision and recall (Thelwall 2018).
Comparing previous research findings with our QHC results, we found that Google Scholar's maximum QHC of 389 million records in January 2018 amounts to an increase of 121% over the 176 million previously estimated by Orduña-Malea et al. (2015) in May 2014. The QHC estimations of both our study and that of Orduña-Malea et al. (2015) include articles, citations, and patents indexed on Google Scholar and thus can reasonably be compared. This size difference most likely stems from two factors: time difference and method difference. With regard to time difference, if we exactly replicate the query that resulted in the 176 million hits obtained by Orduña-Malea et al. (2015) (<1-site:ssstfsffsdfasdfsf.com> and wide year range), we arrive at a QHC of 247 million records. This indicates that in 44 months Google Scholar increased its size by 40%, an average growth rate of 1.6 million records per month. This monthly growth rate exceeds Microsoft Academic's current growth rate of 1.3 million records per month by only a reasonable margin (Hug and Braendle 2017). Given this plausible increase in records over 44 months, it seems logical to assume that the QHC method used in May 2014 still produced comparable results in January 2018.
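As a back-of-envelope check, these growth figures follow directly from the two QHCs (figures in millions of records):

\[
\frac{247 - 176}{176} \approx 40\%, \qquad \frac{247 - 176}{44\ \text{months}} \approx 1.6\ \text{million records per month}.
\]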
With regard to method difference, as with all databases we iteratively tried other queries to identify a maximum QHC for Google Scholar. Indeed, two queries (asterisk and time span) resulted in significantly higher QHCs, indicating that as of January 2018 Google Scholar's size was 389 million records. Accordingly, we believe the second factor accounting for QHC differences between May 2014 (176 million) and January 2018 (389 million) is a difference in query method, and we consider it plausible that Google Scholar's QHC of 389 million is considerably higher than previously estimated. The question is whether Orduña-Malea et al. would also have obtained a higher maximum QHC had they used these same query methods in 2014.
Further, the most recent comparative data on Google Scholar's size stem from the work of Delgado López-Cózar et al. (2018), which estimates its size at 331 million records in March 2017. In comparison, our QHCs 10 months later indicate an increase of Google Scholar's total size (including articles, patents, and citations) of 18%. Because Delgado López-Cózar et al. (2018) used a different estimation method that summed Google Scholar's yearly QHCs into an overall total, we cannot compare our results directly: we know from previous research (Orduña-Malea et al. 2015) that year-by-year queries might lead to slightly lower total QHCs than wide year ranges. If one assumes the same percentage difference for the 331 million records obtained by year-by-year estimation, one can calculate a hypothetical 343 million for wide-year-range estimation in March 2017. The remaining difference of 46 million records ought to stem in part from an expansion of Google Scholar's database within these 10 months and in part from method difference. Applying the previously calculated monthly growth rate of 1.6 million records would leave 30 million records attributable to method difference, indicating that we found a specific absurd query variation that leads to a higher QHC. Hence, these findings suggest it is worthwhile employing an iterative methodology to estimate Google Scholar's maximum QHC.
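In summary, the decomposition argued above is (figures in millions of records; 343 is the hypothetical wide-year-range estimate for March 2017):

\[
389 - 343 = 46, \qquad 46 - \underbrace{10 \times 1.6}_{\text{growth}\ \approx\ 16} \approx 30\ \text{million attributable to method difference}.
\]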
WorldWideScience seems to have the second largest QHC with 323 million records. However, its QHCs have to be considered highly unstable, as identical queries result in entirely different QHCs if performed only seconds apart. Its QHCs are also significantly lower than the official size data. We therefore assume QHCs inadequately reflect WorldWideScience's total size. Further, according to Ortega (2014), WorldWideScience offers "more quantity than quality", as the system is assumed to produce "a large amount of duplicated results and is very time consuming". These downsides make it significantly less user-friendly than, for example, Google Scholar. Nevertheless, one significant advantage of WorldWideScience is its capacity to access data from the deep web, which cannot be harvested by search engines such as Google Scholar (Ortega 2014).
Our analysis of 19 databases provided by ProQuest revealed that its 280 million records place it among the most comprehensive ASEBDs. The scope of ProQuest, as with EbscoHost and Web of Science, would probably be even greater if all available scientific databases could be accessed. Hence, for these providers our QHCs ought to be seen as indicative of their minimum total size, assuming that unrestricted access would result in even higher QHCs. Nevertheless, our QHCs are indicative of their dimensions relative to other providers. For example, ProQuest is one of the largest ASEBDs, while EbscoHost and Web of Science, both with more than 100 million records, are similar in size to BASE, yet considerably larger than CiteSeerX, Q-Sensei Scholar, Scopus, and Semantic Scholar. In the end, the total size of these providers is theoretical; users can only access a portion of the total volume due to user-specific access restrictions, in contrast to search engines such as Google Scholar that provide access to all indexed resources. In this regard, this study is to our knowledge the first to offer a size measure for EbscoHost and ProQuest. The scope of Web of Science has been estimated before, predominantly for its popular product, the Core Collection (Orduña-Malea et al. 2015; Martín-Martín et al. 2015; Orduña-Malea, Ayllón, et al. 2014).
With 118 million records and the greatest portion of its content open access (Ortega 2014), BASE is a search engine especially valuable for users without access to paywalled content. Conversely, the focus on open access content means that large portions of the academic web are not represented. Nevertheless, if the user is aware of this shortcoming, BASE is one of the most valuable multidisciplinary academic databases, especially given its responsiveness and filtering options. The remaining ASEBDs, Q-Sensei Scholar (plausible QHC, 55 million records), Semantic Scholar (official data of 40 million records), and CiteSeerX (plausible QHC, 8 million records), are smaller than their counterparts, yet all of them draw legitimacy from having a distinct vision of how an academic search engine should function (Ortega 2014).
The ASEBDs in our sample without QHC data (AMiner and Microsoft Academic) provide updated information on their sizes themselves. While these sources provide large sets of resources—232 million in the case of AMiner and 171 million in that of Microsoft Academic—these systems are extremely difficult (and sometimes impossible) to access through systematic query-based data retrieval, a capability necessary for systematic literature reviews, for example.
Queries
The results show that ASEBDs are diverse in their functionality and features, so their analysis requires an overarching comparative methodology. Most of the query variations employed successfully retrieved a maximum QHC in at least one case. This shows, first, that academic search services function differently and, second, underlines the validity of our broad iterative approach of testing a multitude of query variations. We found that employing "absurd" or "direct" queries (Orduña-Malea et al. 2015) is not absurd after all, as we could produce plausible QHCs for seven ASEBDs: BASE, CiteSeerX, EbscoHost, ProQuest, Q-Sensei Scholar, Scopus, and Web of Science. Specifically, the results show that for most ASEBDs, queries with varying symbols were most effective in retrieving a maximum QHC.
The only ASEBD in our sample where QHCs exactly matched official size information was BASE. In some cases, the resulting QHC was higher than the number provided by the ASEBD operator, illustrating the problem that size statements are frequently outdated. In two cases (Q-Sensei Scholar and Web of Science Core Collection), official numbers were only slightly higher than maximum QHCs, indicating that not all of a provider's records can effectively be accessed via query, or at least not via the queries tested in this study.
Despite the QHC proving a relevant tool for assessing the sizes of most ASEBDs, it was not suitable in all cases. In fact, for four search engines in our sample (AMiner, Microsoft Academic, Semantic Scholar, and WorldWideScience), the QHC proved inadequate to a greater or lesser degree. AMiner and Microsoft Academic did not report QHCs but provided up-to-date size information on their websites. Queries on Semantic Scholar and WorldWideScience returned variable results and could not be verified. It remains uncertain whether the outdated official size information for these two search engines correctly indicates the volume of records actually accessible to the user.
We found that Google Scholar's QHCs for identical queries seemed reliable and precise at some points in time and unreliable and imprecise at others. This issue was identified by Jacsó as early as 2008 and again by Orduña-Malea et al. (2015). To examine Google Scholar's query results, we made an effort to discern patterns of reliability and precision. The current analysis benefits from that of Orduña-Malea et al. (2015), which found that introducing a "non-existent_site" limiter produced more plausible and stable results. We confirmed their findings insofar as Google Scholar produces significantly fewer results with straightforward queries that do not use any other limiters. Following our iterative approach, we did not just replicate the queries of Orduña-Malea et al. (2015) but also tested different search strings to verify whether the QHC was indeed the maximum value. We found that the "non-existent_site" produced the same results, while changes to the "common_term" altered the QHC significantly. Keeping the "non-existent_site" constant, we identified differences in the QHCs as we changed the terms from "1" to "a" or "the" or to other symbols. Queries with more than 30 s loading time resulted in a time-out notification. To reduce the server load, we limited the length of queries. The process of iteration revealed a set of characters that produced the maximum QHCs (see Table 3). It also made it possible to record a maximum QHC of 389 million for the time span 1700–2099. The fact that we received this maximum QHC with two methods (asterisk or time span only) could indicate that the QHC results are valid. Without the "non-existent_site" operator, however, the same query produced a QHC of only 710,000.
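To illustrate the iterative procedure described above, the following minimal Python sketch enumerates query variations of the kind we tested (a common term or symbol, an optional exclusion of a non-existent site, and an optional wide year range) and keeps the maximum QHC. The function get_qhc is a hypothetical placeholder: Google Scholar offers no official API, so in practice the hit count must be read from the results page, subject to the rate limits and time-outs noted above.

```python
from itertools import product

# Query building blocks of the kind tested in this study.
# The non-existent domain is arbitrary; it only needs to match no record.
TERMS = ["1", "a", "the", "*"]                    # common terms and symbols
LIMITERS = ["", " -site:ssstfsffsdfasdfsf.com"]   # with/without "absurd" exclusion
YEAR_RANGES = [None, (1700, 2099)]                # optional wide year range

def get_qhc(query, year_range=None):
    """Placeholder: return the hit count reported for `query`.

    A real implementation would fetch the results page (respecting rate
    limits and the ~30 s time-out we observed) and parse the reported
    hit count. Hypothetical here, since no official API exists.
    """
    raise NotImplementedError

def max_qhc():
    """Iterate over all query variations and record the maximum QHC."""
    best = (0, "", None)
    for term, limiter, years in product(TERMS, LIMITERS, YEAR_RANGES):
        query = term + limiter
        try:
            count = get_qhc(query, year_range=years)
        except NotImplementedError:
            continue  # placeholder not wired to a live fetcher
        if count > best[0]:
            best = (count, query, years)
    return best
```

In this study, the asterisk and the wide time-span variants were the two that reached the 389 million maximum.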
The exact workings of Google Scholar's database remain a mystery. While our results remained stable during the examination period, we verified them a few months later and found considerable differences. Our findings on the lack of stability and reliability of Google Scholar's reported QHCs are in line with earlier research (Martín-Martín et al. 2017; Mingers and Meyer 2017; Aguillo 2012; Orduña-Malea and Delgado López-Cózar 2014; Jacsó 2005, 2008, 2012; Orduna-Malea et al. 2017). Despite these irregularities, employing the same method as four years earlier (Orduña-Malea et al. 2015), we could replicate a reasonable QHC for Google Scholar. This could indicate that "absurd queries" can be a valid instrument for assessing and replicating Google Scholar's QHC over long periods of time. The current difficulties in replicating QHC results notwithstanding, our findings indicate that QHC methods can be reliable estimators of Google Scholar's size. Compared with other major databases, Google Scholar seems to provide a multidisciplinary database outperforming the coverage of competitors such as Web of Science and Scopus (Martín-Martín, Orduna-Malea, and Delgado López-Cózar 2018; Martín-Martín, Orduna-Malea, Thelwall, et al. 2018).
While some variation in QHCs seems to be commonplace among popular search engines such as Bing or Google (Wilkinson and Thelwall 2013), it should not happen in the scientific context, where study outcomes depend on the resources available in databases. Whenever QHC variations occur, the question remains whether they stem from actual variations in available records or mere counting errors by the search system. The former would be particularly problematic in the academic context, where accuracy and replicability are important criteria. These problems seem to be confined to search engines: all of the bibliographic databases and aggregators we examined—EbscoHost, ProQuest, Scopus, and Web of Science—provided plausible QHC results. This is not surprising, given that these services access stable, curated databases over which they have extensive control.
Further, this study highlights another important issue in academic document searching. While EbscoHost, ProQuest, and Web of Science seem to provide plausible QHC results, the scope of these services is often not clear to the user, as the volume of retrieved information depends on the specific settings of the user accessing them. In these three cases, academic institutions subscribe to different databases hosted by these providers; what a user captures therefore varies according to the subscriptions held. Users' search scope might be suboptimal owing to limited institutional access, yet those users might not be aware of this limitation. Inexperienced users might think that these bibliographic databases and aggregators consist of a single, unitary database. The significant volume of academic research that mentions ProQuest or EbscoHost as its search frame, without stating the specific databases accessed, is indicative of this issue. In such cases the exact scope of the search remains unclear to readers and reviewers, which is especially worrying where research-synthesis studies are concerned. For reasons of scientific rigour, we suggest researchers be educated on accurately reporting search scope.
Limitations and future research
This research found that the QHC offers a consistent methodology and seems a valid predictor of the sizes of most ASEBDs in our sample. Nevertheless, we point out four limitations that at the same time provide avenues for future research.
First, following earlier research (Orduña-Malea et al. 2015; Khabsa and Giles 2014), some queries employed in this study focus on records that at least in part use the English alphabet or English terms, that is, queries using "the" and word combinations from the Oxford word list. While this procedure seemingly focuses on English-language documents only, it is rarely the case that non-English documents use non-English letters or type exclusively. The word "a", for example, is used in multiple languages; further, a significant number of Chinese documents include some translation of the title or abstract, or use single keywords or letters, which makes these documents identifiable via English-based query methods. To provide an alternative to language-based queries, we employed queries that work irrespective of language, such as digits and ANSI symbols. For many ASEBDs, these non-language-based queries proved successful in producing maximum QHCs. Building on these language issues in queries, we suggest future research assess search systems comparatively with regard to the scope of each language's coverage. Longitudinal analyses might prove particularly productive in quantifying the development of English versus non-English scientific publication activity.
Second, the actual number of records a database contains might never be known with absolute certainty. While our method of using QHCs was compared against size numbers from official and research sources, the assessment of size is ultimately always based on information provided by the ASEBD itself. While it is possible to expose irregularities in this information through plausibility-checking methods such as those employed in this study, validating the numbers is another question: doing so with absolute certainty is most likely impossible without access to the full dataset. While we know that Google Scholar's metrics are problematic to some degree, we can never be sure that BASE's, for example, are not as well. BASE is a search engine that updates and publishes information on its knowledge stock in real time, but being sure that information is accurate would involve downloading all records and counting them, which is not only impractical but in most (if not all) cases impossible. For example, Google Scholar limits visible records to a maximum of 1,000 and Web of Science sets a limit of 100,000. Such lack of transparency means researchers have to work with the information that is available, while constantly challenging the validity of the numbers concerned. This study has tried to minimize these limitations by triangulating multiple query variations and comparative size numbers, and accordingly we believe the QHCs reported in this study are a good proxy for the actual database sizes available to users.
Third, QHCs reflect the number of all indexed records in a database, not the number of unique records indexed. This means duplicates, incorrect links, and incorrectly indexed records are all included in the size metrics provided by ASEBDs. Hence, the number of unique records contained by ASEBDs, especially by larger multidisciplinary search engines with automated curation processes, is likely to be systematically exaggerated by QHCs. Scopus, for example, is estimated to contain 12.6% duplicates (Valderrama-Zurián et al. 2015), and Google Scholar is assumed to list up to 10% erroneous, undated records (Orduña-Malea et al. 2015). These estimates show that duplicates constitute a significant proportion of total records in both search engines and other database types. As the ratio of unique records to duplicates or other erroneous records differs among ASEBDs, this factor is likely to affect their comparative size if assessed in terms of unique records. Hence, this limitation shows it is important to consider the types of records behind the size numbers.
Fourth, the size of a database is only one of multiple criteria that need to be assessed jointly to get an overall picture. For users, a balance of these criteria, weighted according to their unique preferences and requirements, influences the choice of which database best fits a given task. To assess databases further, especially concerning their suitability for academia, research would need to consider aggregate measures consisting of multiple variables such as relevance, objectivity, functional scope, and the user interface.