Introduction

Academic search engines and bibliographic databases (ASEBDs) are now the standard place from which to access up-to-date scientific publications. These services make an ever-increasing stock of scientific knowledge accessible to scientists by filtering the most relevant information. Students and scholars start their web searches with ASEBDs, which provide the lens through which they view science and conduct their investigations (Haines et al. 2010).

In the late 1990s, the rise of the internet saw ASEBDs become relevant and increasingly replace traditional offline systems of information retrieval (for an overview see Table 1). Existing data providers and publishers such as ProQuest, Ebsco, Thomson Reuters, and Elsevier entered the online realm to offer their information services. Nevertheless, only in the early 2000s did innovations in data access transform access to scientific information. Large crawler-based search engines such as Google Scholar, Microsoft Academic, and Scirus started to make huge volumes of scholarly data readily accessible to anyone at no cost (Ortega 2014). Google Scholar became the number one go-to information source in academia (van Noorden 2014) and is often used due to its convenience and users' familiarity with the search system (Georgas 2014; Jamali and Asadi 2010; Duke and Asher 2012). While not all documents were available in full-text form, Google Scholar could build up a significant resource of publicly available documents covering a large array of disciplines and languages. Google Scholar seems unrivalled in the efficient and effective provision of scholarly documents online. Yet Microsoft Academic, after discontinuing its service, relaunched its academic search engine in 2017 to compete with Google Scholar once again (Harzing and Alakangas 2017). Besides Google Scholar and Microsoft Academic, however, there are many other large multidisciplinary search engines, bibliographic databases, and information services that try to convince academic users of the value of their unique information offering.

Table 1 Overview of the characteristics of 12 large multidisciplinary ASEBDs

Search system scope

While academic users have a choice of which service to use, it is often unclear which search system serves them best. There are multiple criteria for evaluating the quality of search systems, such as relevance, objectivity, or accuracy (Jansen and Spink 2003; Brophy and Bawden 2005; Eastman and Jansen 2003). In this study we concentrate on one criterion, the scope of a search system in terms of its size, reflecting the number of resources accessible to a specific user (Lawrence and Giles 1999; Grigas et al. 2016; Hawking et al. 2001). The results an academic user obtains with a query are influenced, among other quality criteria, by the limits of the data available on the specific search engine or bibliographic database. Provided relevance ranking keeps information overload in check, a larger scope brings better search results than a smaller one.

In addition to academic users, other groups are interested in knowing the sizes of academic search systems. Information specialists at research institutions or libraries, for example, need to know the sizes of multiple search systems at a particular point in time to compare them, and the size of a single search system at multiple points in time to assess its performance and stability longitudinally. Knowing the scope of a given search system is therefore worthwhile not only for academic users, but also for information specialists.

Nevertheless, the growth in the ASEBD offering not only improved the way scholars accessed information, but also created drawbacks in transparency of scope (Halevi et al. 2017; Shariff et al. 2013; Aguillo 2012). Google Scholar's scope in particular remains a mystery and a source of speculation. Because Google Scholar aims to index the entire universe of scholarly information, estimating its size has attracted numerous academic works. Knowing Google Scholar's size and growth might be indicative of the size and growth of scholarly data as a whole (Orduña-Malea et al. 2015; Halevi et al. 2017): "[p]erhaps even Google Scholar does not know this "number"… a number that approximately represents the online scientific heritage circulating at present" (Orduña-Malea, Ayllón, et al. 2014, p. 29). Researchers remain frustrated over Google Scholar's secrecy: "its secretiveness about every aspect of Google Scholar is on par with that of the North Korean government. The database is getting bigger and bigger but in the wrong way, through hoarding giga collections of irrelevant and/or non-scholarly content" (Jacsó 2012, p. 466). Google Scholar encourages scholarly research on its coverage to address such criticism, as shown on its FAQ pages: "all such questions [on search coverage] are best answered by searching for a statistical sample of papers that has the property of interest—journal, author, protein, etc. Many coverage comparisons are available if you search for [allintitle:"google scholar"], but some of them are more statistically valid than others". This suggestion illustrates that Google Scholar acknowledges the validity of some of the scientometric methods by which it is examined.

Research on Google Scholar's size has a long tradition and is considered by some to be the "golden fleece" (Orduña-Malea et al. 2015). Indeed, just two years after Google Scholar's launch in late 2004, Mayr and Walter (2007) took up the challenge of being the first to assess its coverage. The study concluded that Google Scholar's coverage of Thomson Scientific journal lists, the Directory of Open Access Journals, and journals from the SOLIS database was 78.5%. Later, Aguillo (2012) found that Google Scholar might list a total of more than 86 million records. Two years later, Khabsa and Giles (2014) estimated that close to 100 million records were listed. Utilizing query hit count (QHC) methodology, Orduña-Malea et al. (2015) concluded that its size must extend beyond all previous estimates and that Google Scholar is likely to contain 176 million documents, including articles, citations, and patents. Nevertheless, due to the opacity of Google Scholar's technical functionality "all methods [of assessing its coverage] show great inconsistencies, limitations and uncertainties" (Orduña-Malea et al. 2015, p. 947). In the face of these challenges, the question remains whether Google itself is merely unwilling to report its size or is in fact incapable of doing so. This work intends to shed more light on how large Google Scholar actually is and how it compares to other large multidisciplinary ASEBDs.

While Google Scholar is one of the most popular academic search engines, it is not the only one relevant for scientific enquiries (Orduña-Malea, Martín-Martín, et al. 2014). As the number of search engines and bibliographic databases increases, so does the competitive pressure to provide useful information, and the features and functionality offered for accessing search results diversify. Hence, as ASEBDs became important gatekeepers of secondary information and their role in science grew, research became increasingly interested in investigating them. Since the millennium, research on the size of web search engines and other information search systems has featured in scientometrics, informetrics, bibliometrics, webometrics, and altmetrics journals (Orduña-Malea et al. 2015; Orduña-Malea and Delgado López-Cózar 2014; Hood and Wilson 2001; Thelwall 2008, 2009; Bar-Ilan 2008). Nevertheless, research efforts have not kept pace with the increase in ASEBDs, which all differ in scope and functionality. Currently there is no study that assesses and compares the major ASEBDs, a considerable gap in research that this study aims to fill.

Monitoring a larger set of ASEBDs requires a method capable of covering all of them, even though ASEBDs evidently differ in qualities such as functionality, scope, data handling, and syntax. Previous studies assessed the size of ASEBDs with a variety of methods (Ortega 2014; Khan et al. 2017). These estimates of ASEBDs' sizes were predominantly performed for databases where this information was not officially reported. ASEBDs were assessed using queries against multiple journal lists (Mayr and Walter 2007), the overlap between ASEBDs (Khabsa and Giles 2014), queries of top-level domains (Aguillo 2012), and blank or "absurd" queries that return QHCs (Orduña-Malea et al. 2015; Orduña-Malea, Ayllón, et al. 2014). So far, studies have examined ASEBDs individually (Aguillo 2012; Halevi et al. 2017; Mayr and Walter 2007; Orduña-Malea et al. 2015; Hug and Braendle 2017; Harzing 2014) or compared them in pairs or multiples (Meho and Yang 2007; Shultz 2007; Chadegani et al. 2013; Khabsa and Giles 2014; de Winter et al. 2014). What has been missing, however, is an up-to-date comparative overview of the sizes of the most popular ASEBDs. One reason for this shortcoming is that the different estimation methods employed have made comparing the sizes of ASEBDs difficult.

Study objective

This study's aim is to estimate ASEBD sizes with a method that is applicable to most systems. We reasoned that all ASEBDs with a focus on the user would provide some form of query function. Hence, the goal of our analysis was to retrieve the maximum quantity of records of a given ASEBD with one single query. We investigated scope in terms of what information is actually available to the user, rather than the theoretically indexed knowledge. Even when databases might contain more articles in theory, the inaccessibility of these articles makes them irrelevant for the user. Hence the value of an information system in terms of scope lies in the knowledge stock it makes accessible through queries, not the stock it has theoretically stored or indexed on its servers but fails to list through query-based methods. To assess the quantity of knowledge actually accessible to users, we use the same tools available to the user. This means straightforward queries are assumed to retrieve the datasets that are effectively available to searchers. While this query technique introduces a query bias, as datasets that are not reached through regular queries might be systematically disregarded, this limitation is the same one the user faces. Hence, queries define the line between what data can and cannot be retrieved by the regular user (Bharat and Broder 1998). Nevertheless, it is noteworthy that accessible records are not necessarily unique records. Indeed, search systems often include a significant portion of duplicates, and indexing or other cataloguing errors seemingly boost the total size of the system while not providing any new information to the user (Jacsó 2008; Valderrama-Zurián et al. 2015). Acknowledging the difficulty of assessing multiple multidisciplinary ASEBDs that vary in functionality, this study tackles the need for up-to-date information on search system scope.

Method and data

Building on previous scientometric research, this study introduces an iterative method to compare the sizes of widely used multidisciplinary ASEBDs. These query-based size estimates are then assessed to discern their plausibility by comparing them to the official size information given by the database providers and the size information reported by other scientific studies.

Selection of search engines and bibliographic databases

We based our selection of academic search engines on the work of Ortega (2014), which presents a comprehensive guide to the landscape of academic search engines up until 2014. At that point the available search engines were: AMiner, Bielefeld Academic Search Engine (BASE), CiteSeerX, Google Scholar, Microsoft Academic, Q-Sensei Scholar, Scirus and WorldWideScience. Of these eight search engines, Scirus could not be analysed as its service terminated in 2014. To this sample of seven we added a search engine that went online after Ortega's contribution (Semantic Scholar) as well as four large multidisciplinary bibliographic databases and aggregators (EbscoHost, ProQuest, Scopus, and Web of Science). Hence, this study analyses 12 ASEBDs. Their main characteristics such as "owner", "year of launch" and "coverage" are described in Table 1.

As ASEBDs are heterogeneous in their functionality and data input formats, this study had to find a common method to access them. Previously, researchers had been interested in the characteristics of single ASEBDs or in comparisons of a few, applying a multitude of methods including webometric analysis (Aguillo 2012), capture/recapture methods (Khabsa and Giles 2014), citation analysis (Meho and Yang 2007; Hug and Braendle 2017), and search result comparison (Shultz 2007). However, as these methods are not practically applicable to most ASEBDs in a similar fashion, we introduced an iterative method to test the features of the ASEBDs in our sample. This research builds on previous methodology developed and employed by Vaughan and Thelwall (2004) and Orduña-Malea et al. (2015) and advances their methods for finding ASEBD metrics. We implement an iterative element to identify the maximum QHC, that is, we iterate towards a query that returns the maximum number of hits for a given search system.

Any given query of a search system is assumed to retrieve a set of records and not retrieve others that lie outside the query's scope. The sum of retrieved and non-retrieved records amounts to the search system's coverage, or its overall size. Recall denotes a search system's capability to retrieve all relevant records for a query (Croft et al. 2015). Our measure of QHCs denotes the number of retrieved records, while the total size of the database remains known only to the database provider. A given query retrieves either all records or a fraction of them. QHCs therefore provide an estimate of a search system's minimum size, that is, the smallest number of records it can be expected to contain: an ASEBD covers at least this number of resources, and possibly more.
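This lower-bound logic can be stated compactly (a minimal formulation with notation introduced here for clarity, not taken from the cited literature):

\[
\hat{N}_{\min} \;=\; \max_{q \in Q} \mathrm{QHC}(q) \;\le\; N_{\mathrm{total}},
\]

where \(Q\) is the set of tested query variations, \(\mathrm{QHC}(q)\) is the hit count returned for query \(q\), and \(N_{\mathrm{total}}\) is the unknown true number of accessible records. The best estimate reported for each ASEBD is therefore the largest hit count observed across all query variations.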

Accordingly, we included different resource formats and qualities, an approach similar to that of Orduña-Malea et al. (2015). Hence, QHCs reflect the scope of scholarly search engines and bibliographic databases as a determinant of their overall usefulness for scholarly work, but they do not indicate which database contains the most of a particular academic resource type, such as peer-reviewed articles. All ASEBDs analysed in this study were accessed between January 2018 and August 2018.

Search strategies and equations

Utilizing an iterative approach to find best estimates of the size of ASEBDs extends the methodologies used in scientometrics and information science. We followed the methodology employed recently by a number of studies in information metrics where ASEBD size is determined through queries with different search string designs (Halevi et al. 2017; Orduña-Malea et al. 2015; Orduña-Malea, Ayllón, et al. 2014). Earlier, this method was used to evaluate the scope of non-academic search engines (Vaughan and Thelwall 2004). In this study we build on these previous experiences and combine them with an iterative methodology that is, through variation of search strings, geared towards maximizing QHCs. In information metrics Orduña-Malea et al. (2015) already experimented with “direct queries” that searched with a specific filter and “absurd queries” that contained arbitrary characters. The logic of the “direct queries” was to utilize filter functions without including a search string, while the logic of “absurd queries” was to retrieve data with variations of a search string. With the latter, the idea was to select the most universal characters such as “1” or “a”, as almost any serious document would feature those characters at some point. Orduña-Malea, Ayllón, et al. (2014) note in relation to “absurd queries” that the method is “more accurate than it seems at first because the search engine is forced to check the entire database to answer the query, as the time responses are suggesting […] the final figures provided seem logical and coherent, and close to those achieved by other methods. […]” “Surprisingly, even though all methods seem invalid for various and diverse reasons, the external method and internal method based on absurd query (with all variants considered) return similar results despite being of a different nature, reinforcing the validity of the estimation performed” (Orduña-Malea et al. 2015, p. 947).

Following the motto "anything might work", we iteratively tested five categories of search string variations to formulate "direct queries" and "absurd queries" for each database: single characters, digits, terms, ANSI symbols, and also their cross-combinations and queries with wide date ranges. The query variations we utilized are outlined in Table 2. The reasoning was that almost all listed publications would contain at least one of these variations and therefore would be identified through these query-based methods. In particular, we expected that most records would be written in English (Khabsa and Giles 2014) and that all of these would contain at least one of the most frequently used English words in their text. While this introduces a language bias, it is not uncommon to focus on English articles, as the largest ASEBDs seem to do (Orduña-Malea et al. 2015). Accordingly, we consulted the 2008 Oxford Word List (Oxford University Press 2008) and interlinked sets of the top 100, top 50, top 25 or fewer most utilized English words with Boolean operators. To mitigate this language bias we also tested whether non-English-based variations, such as digits, year ranges, and ANSI symbols, were capable of retrieving the maximum QHC. Whenever more than one character, digit, symbol, or term was used as input, the query was separated with Boolean "OR" operators. Furthermore, we performed queries selecting exhaustive time spans in the expectation of covering the entire data set underlying the ASEBD. When all methods failed to produce a plausible QHC (as in the case of Q-Sensei Scholar) we tried queries with facets provided by the database. All queries were tested with and without quotation marks. Queries were performed with Google Chrome in incognito mode and were tested under different paywall restrictions (i.e., university subscriptions) and locations (IP addresses). The exact composition of queries and the utilized preferences for each of the ASEBDs are illustrated in detail in "Appendix 2".
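To illustrate the iteration, the following sketch shows how such query variations could be generated programmatically. It is a minimal illustration only: the word list is a placeholder for the 2008 Oxford Word List, the symbols are examples, and the Boolean OR syntax would have to be adapted to each ASEBD's query language.

```python
from itertools import islice
from string import ascii_lowercase, digits

# Placeholder for the 2008 Oxford Word List used in the study; in practice the
# actual top-100 list of most frequently used English words would be loaded here.
TOP_WORDS = ["the", "a", "and", "to", "in", "is", "of", "it", "was", "I"]

ANSI_SYMBOLS = ["*", "?", "%", "#", "&", "@"]  # example symbols for "absurd" queries


def or_query(tokens, quote=False):
    """Join tokens with Boolean OR, optionally wrapping each token in quotation marks."""
    fmt = '"{}"' if quote else "{}"
    return " OR ".join(fmt.format(t) for t in tokens)


def candidate_queries():
    """Yield categories of query variations analogous to those tested iteratively:
    single characters, digits, symbols, interlinked word sets, and combinations."""
    yield from ascii_lowercase                                    # single characters
    yield from digits                                             # single digits
    yield from ANSI_SYMBOLS                                       # single symbols
    for n in (100, 50, 25, 10):                                   # top-n word sets
        yield or_query(list(islice(TOP_WORDS, n)))
    yield or_query(digits)                                        # all digits combined
    yield or_query(list(ascii_lowercase) + list(digits) + TOP_WORDS)  # cross-combination


if __name__ == "__main__":
    for query in candidate_queries():
        print(query)
```

Each candidate string would then be submitted to the ASEBD's search interface and the reported hit count recorded, with the largest count across all variations kept as the size estimate.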

Table 2 Query methods

Google Scholar

Google Scholar presents a special case among ASEBDs in that it is both one of the most frequently used and one of the least understood and validated. This is why we dedicated particular effort to iterating towards a valid, stable method for obtaining a good estimate of Google Scholar's size. We started from the methodology of Orduña-Malea et al. (2015, p. 937) who collected hit count data through absurd queries: "[…]we ran test queries using the following syntax: <common_term -site:non-existent_site> The idea behind this is to query the number of occurrences of a very common term (likely to appear in almost all written records), and to filter out its appearances in a non-existent web site, which means that we are implicitly selecting every existing site. For example: <a -site:ssstfsffsdfasdfsf.com>, or <1 -site:ssstfsffsdfasdfsf.com>. The reason for including a term before the "-site" command is that this command does not work on its own". In this study we tested the same proposed query, "1 -site:ssstfsffsdfasdfsf.com", and altered the search string (different "common_term" and different "non-existent_sites") and the time frame (different time spans) according to our defined test search strings (see "Appendix 2").
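As an illustration, the sketch below assembles such absurd-query URLs. It assumes Google Scholar's public scholar.google.com/scholar endpoint with its q, as_ylo, and as_yhi parameters (an assumption about the interface, not a documented API); retrieving and parsing the reported hit count from the result page is deliberately left out.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"  # public search endpoint (assumed)
NONEXISTENT_SITE = "ssstfsffsdfasdfsf.com"       # nonsense domain from Orduña-Malea et al. (2015)


def absurd_query_url(common_term="1", year_from=None, year_to=None):
    """Build an absurd-query URL of the form <common_term -site:non-existent_site>,
    optionally restricted to a wide year range via the assumed as_ylo/as_yhi parameters."""
    params = {"q": f"{common_term} -site:{NONEXISTENT_SITE}"}
    if year_from is not None:
        params["as_ylo"] = year_from
    if year_to is not None:
        params["as_yhi"] = year_to
    return f"{BASE_URL}?{urlencode(params)}"


# Variations tested: different common terms or symbols combined with a wide time span.
for term in ["1", "a", "the", "*"]:
    print(absurd_query_url(term, year_from=1700, year_to=2099))
```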

Plausibility assessment

By varying queries iteratively, we received a different QHC size estimate for every query. We took the maximum QHC value as the best estimate of the total number of a database's records. In order to validate the QHCs obtained we performed two plausibility checks. First, we collected official size statements provided by the ASEBD operators themselves. Second, we collected size estimates from other research studies that had previously examined ASEBD sizes using methods similar to or different from ours. We then compared the maximum QHCs with the size information from the ASEBDs themselves or from research conducted on their sizes. The plausibility check was straightforward for most ASEBDs: when our maximum QHC was within plausible range of these comparative numbers we considered the QHC plausible. The plausible range was primarily determined by taking into account the time difference between the size statements. For subscription-based ASEBDs that provide access to multiple distinct bibliographic databases (i.e., EbscoHost, ProQuest, Web of Science) we also retrieved QHC data for a specific database where comparative size information was available from official sources. This way we could assess if and to what degree QHC data matched the official size statement of the provider. If the QHC was plausible for a single database, we reasoned QHCs would also be similarly plausible for multiple databases.
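The plausibility check can be sketched as a simple comparison. The growth allowance and tolerance below are illustrative values introduced here, not the exact thresholds applied in the study, and the figures in the example are invented.

```python
def is_plausible(max_qhc, comparison_size, months_apart=0,
                 monthly_growth=0.0, tolerance=0.15):
    """Judge whether a maximum QHC lies within a plausible range of an official or
    previously published size figure, after allowing for growth over the time
    difference between the two measurements."""
    expected = comparison_size + months_apart * monthly_growth
    return abs(max_qhc - expected) <= tolerance * expected


# Invented figures for illustration: an outdated official statement plus assumed
# growth can still render a higher QHC plausible.
print(is_plausible(max_qhc=104_000_000, comparison_size=95_000_000,
                   months_apart=12, monthly_growth=500_000))
```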

Results

Our analysis reveals the query hit counts of the ASEBDs. We found that QHCs varied significantly, from 8,401,126 hits for the smallest system (CiteSeerX) to 389,000,000 hits for the largest (Google Scholar). The results show that, based on QHCs, Google Scholar, WorldWideScience, and ProQuest (a selection of 19 databases, see "Appendix 1") are by far the largest systems providing scholarly information, each containing about 300 million records or more. This leading group is followed by BASE, Web of Science (a selection of ten databases, see "Appendix 1"), and EbscoHost (a selection of 25 databases, see "Appendix 1"), each containing more than 100 million records; somewhat smaller ASEBDs are Scopus, Web of Science (Core Collection), and Q-Sensei Scholar, each containing around 60 million records. In the case of providers linking multiple databases—EbscoHost, ProQuest, and Web of Science—it is important to consider that the QHC reflects a selection of databases, and therefore, their QHCs are likely to be higher when all available databases of a provider are selected at once.

Two of the 12 ASEBDs—AMiner and Microsoft Academic—did not report numbers suitable for query-based size estimation. AMiner only reports QHCs of up to 1000 hits, making it impossible to retrieve actual QHC data. Similarly, Microsoft Academic has not reported result sets exceeding 5000 records since its relaunch in 2017. Earlier studies on Microsoft Academic were still able to report size numbers via simple queries (e.g., Orduña-Malea, Ayllón, et al. 2014; Orduña-Malea, Martín-Martín, et al. 2014).

We found that most of the query variations we employed proved successful in retrieving the maximum QHC for at least some ASEBDs, but no single query method returned the highest QHC for all databases. Most ASEBDs returned their highest QHCs via "direct queries" with wide time spans (five times) or via symbol queries (five times). The asterisk (*) was the most successful symbol in retrieving maximum QHCs. In three cases—Google Scholar, ProQuest, and Web of Science—two methods simultaneously produced the same maximum QHC. Neither the single terms "a" and "the" nor character and number combinations provided a maximum QHC in our analysis. Only for one database (Scopus) did a combination of words prove successful in retrieving a maximum QHC, signifying that for this database alone a longer search string actually meant more retrieved records. In this case we therefore iteratively expanded the search string to see if the maximum QHC could be further increased. Indeed, a combination of the top 100 terms, all digits, and the English alphabet increased the QHC by almost 2% to 72 million records. To exclude a potential language bias, we additionally expanded the query with Russian and Chinese characters, but could not find any difference in the maximum QHC. Table 3 presents the detailed outcomes.

Table 3 QHCs of search engines and bibliographic databases

Results of plausibility assessment

While maximum QHCs did in some cases diverge considerably from comparative measures, they were not necessarily implausible. In the case of CiteSeerX, for example, the official numbers were outdated and reported 17% fewer records than the QHC. We therefore assumed that the QHC probably reflected the search engine's size at the time of measurement. We also found that official size statements were frequently outdated or entirely unavailable for other databases. The plausibility assessment for all ASEBDs in our sample can be found in Table 4.

Table 4 Plausibility assessment of QHCs

Overall, when comparison was possible, we found that QHCs were a plausible and therefore valid instrument for assessing the sizes of ASEBDs. Plausibility checks allowed us to conclude that QHC data was plausible for seven out of ten ASEBDs: Bielefeld Academic Search Engine (BASE), CiteSeerX, EbscoHost, Q-Sensei Scholar, ProQuest, Scopus, and Web of Science. In the case of BASE, the QHC exactly matched the official size information. Q-Sensei Scholar was an exception in that the maximum QHC was not identified through a query but through the selection of multiple facets. For this database we identified the maximum QHC by selecting all "year" or "type" facets. The resulting QHC fell short of the updated official size information by less than 1%.

For EbscoHost, ProQuest, and Web of Science—which all adopt a subscription model—we found that the QHC depended significantly on which databases were searched. We found that QHCs for single databases were perfectly plausible (EbscoHost's ERIC, ProQuest Dissertations and Theses Global, and Web of Science's Core Collection). Hence, we reasoned that the QHCs were also plausible for multiple databases. Nevertheless, the QHCs for a joint search of all available scholarly databases fell short of official size numbers. This discrepancy can be explained by the limited set of databases accessed: firstly, not all databases from these information services provide scientific content, and some were thus excluded from our search; secondly, we could not access all available databases ourselves because we lacked the necessary subscriptions. Therefore, the resulting QHCs reflect the volume of records available according to the unique scope determined by the searcher. Hence, for EbscoHost, ProQuest, and Web of Science maximum QHCs do not reflect the total, objective size of the service, but the aggregated size of the selected databases of that provider. The databases we selected are listed in "Appendix 1".

Only two QHCs were implausible: Semantic Scholar and WorldWideScience. We found that these two ASEBDs also provided inconsistent QHCs during the data retrieval process. Their QHCs both differed significantly from official size information and varied considerably when queries were repeated. Having presented the results of the QHC plausibility assessment for nine ASEBDs, we turn to the remaining search engine, Google Scholar, which seems to produce questionable QHCs owing to their lack of stability over query variations. Our QHC indicates that Google Scholar incorporated 389 million records in January 2018.

Discussion

This study has built on and extended previous scientometric research inquiring into the sizes of ASEBDs. It is novel in so far as it improves query-based methods for assessing ASEBDs and establishes those methods as adequate, fast predictors of the sizes of most ASEBDs. The methods used made it possible to assess a multitude of different ASEBDs and compare their sizes. The process not only delivered size information but also insights into the diverse query functionalities of ASEBDs, which form the basis of many researchers' daily scientific enquiries.

Size

We obtained a QHC from ten of the 12 ASEBDs examined. Based on this QHC data we can assume that Google Scholar, with 389 million records, provides by far the greatest volume of scholarly information. Our maximum QHC seems plausible when compared to a similar multidisciplinary search engine, Microsoft Academic, which as of January 2018 covered more than 170 million records and is considered, with a ratio of 1:2.17, considerably smaller than Google Scholar (Orduña-Malea et al. 2015). If we apply the same ratio between the two search engines in January 2018, Google Scholar would amount to roughly 372 million records, a number close to our QHC of 389 million. Nevertheless, it is important to bear in mind that this size comparison might be flawed as Microsoft Academic has relaunched since the Orduña-Malea et al. (2015) research was conducted. This relaunch could have significantly impacted the structure and size of Microsoft Academic (Hug and Braendle 2017) and its performance in retrieving search results with high precision and recall (Thelwall 2018).
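Using the roughly 171 million Microsoft Academic records reported for January 2018, the extrapolation is a simple proportionality:

\[
\underbrace{\sim\!171\ \text{million}}_{\text{Microsoft Academic}} \times\; 2.17 \;\approx\; 372\ \text{million records}.
\]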

Comparing previous research findings with our QHC results, we found that Google Scholar's maximum QHC of 389 million records in January 2018 amounts to an increase of 121% compared to the 176 million records previously estimated by Orduña-Malea et al. (2015) in May 2014. The QHC estimations of both our study and that of Orduña-Malea et al. (2015) include articles, citations, and patents indexed on Google Scholar and thus can reasonably be compared. This size difference most likely stems from two factors: time difference and method difference. With regard to time difference, if we exactly replicate the query that resulted in the 176 million hits obtained by Orduña-Malea et al. (2015) (<1 -site:ssstfsffsdfasdfsf.com> and a wide year range), we arrive at a QHC of 247 million records. This indicates that in 44 months Google Scholar increased in size by 40%, an average growth rate of 1.6 million records per month. This monthly growth rate exceeds Microsoft Academic's current monthly growth rate of 1.3 million records only by a reasonable margin (Hug and Braendle 2017). Given this plausible increase in records over 44 months, it seems logical to assume that the QHC method that produced comparable results in May 2014 also did so in January 2018.
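The implied growth follows directly from the two replicated measurements (figures in millions of records):

\[
\frac{247 - 176}{176} \approx 40\%, \qquad \frac{247 - 176}{44\ \text{months}} \approx 1.6\ \text{million records per month}.
\]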

With regard to method difference, as with all databases, we iteratively tried other queries to identify a maximum QHC for Google Scholar. Indeed, two queries (asterisk and time span) resulted in significantly higher QHCs, indicating that as of January 2018 Google Scholar's size was 389 million records. Accordingly, we believe that the second factor accounting for the QHC difference between May 2014 (176 million) and January 2018 (389 million) is a difference in query method. We reason that it is plausible to assume that Google Scholar's size, at a QHC of 389 million, is considerably higher than previously estimated. The question is whether Orduña-Malea et al. would also have obtained a higher maximum QHC had they used these same query methods in 2014.

Further, the most recent comparative data on Google Scholar's size stems from the work of Delgado López-Cózar et al. (2018), which estimates its size at 331 million records in March 2017. In comparison, our QHC 10 months later indicates an increase in Google Scholar's total size (including articles, patents, and citations) of 18%. As Delgado López-Cózar et al. (2018) used a different estimation method that summed Google Scholar's yearly QHCs into an overall total, we cannot compare our results directly: we know from previous research (Orduña-Malea et al. 2015) that year-by-year queries can lead to slightly lower total QHCs than wide year ranges. If one assumes the same percentage difference for the 331 million records obtained by year-by-year estimation, one can calculate a hypothetical 343 million for wide-year-range estimation in March 2017. The remaining difference of 46 million records ought to stem partly from an expansion of Google Scholar's database within these 10 months and partly from method difference. Applying the previously calculated monthly growth rate of 1.6 million records would leave 30 million records attributable to method difference, indicating that we found a specific absurd query variation that leads to a higher QHC. Hence, these findings suggest it is worthwhile employing an iterative methodology to estimate Google Scholar's maximum QHC.
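Written out, the decomposition of the gap into growth and method effects is (figures in millions of records):

\[
389 - 343 = 46, \qquad 46 - \underbrace{10 \times 1.6}_{\text{growth over 10 months}} = 30\ \text{million attributable to method difference}.
\]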

WorldWideScience seems to have the second largest QHC with 323 million records. However, its QHCs have to be considered highly unstable, as identical queries result in entirely different QHCs if performed only seconds apart. Its QHCs are also significantly lower than the official size data. We therefore assume its QHCs inadequately reflect WorldWideScience's total size. Further, according to Ortega (2014) WorldWideScience offers "more quantity than quality" as the system is assumed to produce "a large amount of duplicated results and is very time consuming". These downsides make it significantly less user-friendly than, for example, Google Scholar. Nevertheless, one significant advantage of WorldWideScience is its capacity to access data from the deep web, which cannot be harvested by search engines such as Google Scholar (Ortega 2014).

Our analysis of 19 databases provided by ProQuest revealed that its 280 million records place it among the most comprehensive ASEBDs. The scope of ProQuest, similar to EbscoHost and Web of Science, is probably even greater if all available scientific databases could be accessed. Hence, for these providers our QHCs ought to be seen as indicative of their minimum total size, assuming that unrestricted access would result in even higher QHCs. Nevertheless, our QHCs are indicative of their dimensions relative to other providers. For example, ProQuest is one of the largest ASEBDs, and EbscoHost and Web of Science, both with more than 100 million records, are similar in size to BASE, yet considerably larger than CiteSeerX, Q-Sensei Scholar, Scopus, and Semantic Scholar. In the end the total size of these providers is theoretical; users can only access a portion of the total volume owing to user-specific access restrictions, in contrast to search engines such as Google Scholar that provide access to all indexed resources. In this regard this study is, to our knowledge, the first to offer a size measure for EbscoHost and ProQuest. The scope of Web of Science has been estimated before, predominantly for its popular product, the Core Collection (Orduña-Malea et al. 2015; Martín-Martín et al. 2015; Orduña-Malea, Ayllón, et al. 2014).

With 118 million records and the greatest portion of its content being open access (Ortega 2014), BASE is a search engine especially valuable for users without access to paywalled content. Conversely, the focus on open access content means that large portions of the academic web are not represented. Nevertheless, if the user is aware of this shortcoming, BASE is one of the most valuable multidisciplinary academic databases, especially given its responsiveness and filtering options. The remaining ASEBDs, Q-Sensei Scholar (plausible QHC, 55 million records), Semantic Scholar (official data of 40 million records), and CiteSeerX (plausible QHC, 8 million records), are smaller than their counterparts, yet all of them draw legitimacy from having a distinct vision of how an academic search engine should function (Ortega 2014).

The ASEBDs in our sample without QHC data (AMiner and Microsoft Academic) provide updated information on their sizes themselves. While these sources provide large sets of resources—232 million in the case of AMiner and 171 million in that of Microsoft Academic—these systems are extremely difficult (and sometimes impossible) to access through systematic query-based data retrieval, which is necessary, for example, for systematic literature reviews.

Queries

The results show that ASEBDs are diverse in their functionality and features, so their analysis requires an overarching comparative methodology. Most of the query variations employed successfully retrieved a maximum QHC in at least one case. This shows, first, that academic search services function differently and, second, underlines the validity of our broad iterative approach of testing a multitude of query variations. We found that employing "absurd" or "direct" queries (Orduña-Malea et al. 2015) is not absurd after all, as we could produce plausible QHCs for seven ASEBDs: BASE, CiteSeerX, EbscoHost, ProQuest, Q-Sensei Scholar, Scopus, and Web of Science. Specifically, the results show that for most ASEBDs, queries with varying symbols were the most effective in retrieving a maximum QHC.

The only ASEBD in our sample where the QHC exactly matched official size information was BASE. In some cases, the resulting QHC was higher than the number provided by the ASEBD operator, illustrating the problem that size statements are frequently outdated. In two cases (Q-Sensei Scholar and Web of Science Core Collection) official numbers were only slightly higher than the maximum QHCs, indicating that not all of a provider's records can effectively be accessed via query, or at least not via the queries tested in this study.

Despite the QHC proving a relevant tool for assessing the sizes of most ASEBDs, it was not suitable in all cases. In fact, for four search engines in our sample (AMiner, Microsoft Academic, Semantic Scholar, and WorldWideScience), the QHC proved to be inadequate to a greater or lesser degree. AMiner and Microsoft Academic did not report usable QHCs, although they provide up-to-date size information on their websites. Queries on Semantic Scholar and WorldWideScience returned variable results and could not be verified. It remains uncertain whether the outdated official size information for these two search engines correctly indicates the volume of records actually accessible to the user.

We found that Google Scholar's QHC for identical queries seemed reliable and precise at some points in time and unreliable and imprecise at others. This issue was identified by Jacsó as early as 2008 and again by Orduña-Malea et al. (2015). To examine Google Scholar's query results, we made an effort to discern patterns of reliability and precision. The current analysis benefits from that of Orduña-Malea et al. (2015), which found that the introduction of a "non-existent_site" limiter produced more plausible and stable results. We confirmed their findings in so far as Google Scholar produces significantly fewer results with straightforward queries that do not use any other limiters. Following our iterative approach, we did not however just replicate the queries of Orduña-Malea et al. (2015) but also tested different search strings to verify whether the QHC was indeed the maximum value. We found that changing the "non-existent_site" produced the same results, while changes to the "common_term" altered the QHC significantly. Keeping the "non-existent_site" the same, we identified differences in the QHCs as we changed the terms from "1" to "a" or "the" or to other symbols. Queries taking more than 30 seconds to load resulted in a timeout notification. To reduce the server load, we limited the length of queries. The process of iteration revealed a set of characters that produced the maximum QHCs (see Table 3). It also made it possible to record a maximum QHC of 389 million for the time span 1700–2099. The fact that we received this maximum QHC with two methods (asterisk or time span only) could indicate that the QHC results are valid. Without the "non-existent_site" operator, the same query however produced a QHC of only 710,000.

The exact workings of Google Scholar's database remain a mystery. While our results remained stable during the examination period, we verified them a few months later and found considerable differences. Our findings on the lack of stability and reliability of Google Scholar's reported QHCs are in line with earlier research (Martín-Martín et al. 2017; Mingers and Meyer 2017; Aguillo 2012; Orduña-Malea and Delgado López-Cózar 2014; Jacsó 2005, 2008, 2012; Orduna-Malea et al. 2017). Despite these irregularities, by employing the identical method used 4 years earlier (Orduña-Malea et al. 2015), we could replicate a reasonable QHC for Google Scholar. This could indicate that "absurd queries" can be a valid instrument to assess and replicate Google Scholar's QHC over long periods of time. The current difficulties in replicating QHC results notwithstanding, our findings indicate that QHC methods can be reliable estimators of Google Scholar's size. Compared to other major databases, Google Scholar seems to provide a multidisciplinary database whose coverage outperforms that of competitors such as Web of Science and Scopus (Martín-Martín, Orduna-Malea, and Delgado López-Cózar 2018; Martín-Martín, Orduna-Malea, Thelwall, et al. 2018).

While some variation in QHCs seems to be commonplace among popular search engines such as Bing or Google (Wilkinson and Thelwall 2013), it should not occur in the scientific context, where study outcomes depend on the resources available in databases. Whenever QHC variations occur, the question remains whether they stem from actual variations in available records or mere counting errors by the search system. The former would be particularly problematic in the academic context, where accuracy and replicability are important criteria. These problems seem to be confined to search engines. We found that all of the bibliographic databases and aggregators we examined—EbscoHost, ProQuest, Scopus, and Web of Science—provide plausible QHC results. This is not surprising given that these services access a stable and curated database over which they have extensive control.

Further, this study highlights another important issue in academic document searching. While EbscoHost, ProQuest, and Web of Science seem to provide plausible QHC results, the scope of these services is often not clear for the user, as the volume of retrieved information depends on the specific settings of the user accessing it. In these three cases, academic institutions subscribe to different databases that are hosted by these providers. Therefore, what a user captures varies according to the subscriptions held. Users’ search scope might be suboptimal owing to limited institutional access, but those users might also not be aware of this limitation. Inexperienced users might think that these bibliographic databases and aggregators in fact only consist of a single, unitary database. The significant volume of academic research that mentions ProQuest or EbscoHost as its search frame, without stating the specific databases accessed, is indicative of this issue. In such cases the exact scope of the search remains unclear to readers and reviewers, which is especially worrying when research-synthesis studies are concerned. For reasons of scientific rigour, we suggest researchers should be educated on the issues around accurately reporting search scope.

Limitations and future research

This research found that the QHC measure offers a consistent methodology and seems a valid predictor of the sizes of most ASEBDs in our sample. Nevertheless, we point out four limitations that at the same time provide avenues for future research.

First, following earlier research (Orduña-Malea et al. 2015; Khabsa and Giles 2014), some queries employed in this study focus on records that at least in part use the English alphabet or English terms, that is, queries using "the" or word combinations from the Oxford word list. While this procedure seemingly focuses on English documents only, it is rarely the case that non-English documents use exclusively non-English letters or script. The word "a", for example, is used in many different languages, and a significant number of Chinese documents include some translation of the title or abstract, or use single keywords or letters, which makes these documents identifiable via English-based query methods. To provide an alternative to language-based queries, we also employed queries that work irrespective of language, such as digits and ANSI symbols. For many ASEBDs, these non-language-based queries proved successful in providing maximum QHCs. Building on these language issues in queries, we suggest future research assess search systems comparatively with regard to the scope of each language's coverage. Longitudinal analyses might prove particularly productive in quantifying the development of English versus non-English scientific publication activity.

Second, the actual number of records a database contains might never be known with absolute certainty. While our QHCs were compared against size numbers from official and research sources, the assessment of size is ultimately always based on some information provided by the ASEBD itself. While it is possible to expose irregularities in this information through plausibility-checking methods such as those employed in this study, validating the numbers is another question. Validating the accuracy of this information with absolute certainty is most likely impossible without access to the full dataset. While we know that Google Scholar's metrics are problematic to some degree, we can never be sure that BASE's, for example, are not also. The latter is a search engine that updates and publishes information on its knowledge stock in real time, but being sure that this information is accurate would involve downloading and counting all records, which is not only impractical but in most (if not all) cases impossible. For example, Google Scholar limits visible records to a maximum of 1000 and Web of Science sets a limit of 100,000. Such lack of transparency means researchers have to work with the information that is available, while constantly challenging the validity of the numbers concerned. This study has tried to minimize these limitations by triangulating data from multiple query variations and comparative size numbers, and accordingly, we believe that the QHCs reported in this study are a good proxy of the actual database sizes available to users.

Third, QHCs reflect the number of all indexed records in a database, not the number of unique records indexed. This means duplicates, incorrect links, or incorrectly indexed records are all included in the size metrics provided by ASEBDs. Hence, the number of unique records contained by ASEBDs, especially by larger multidisciplinary search engines with automated curation processes, is likely to be systematically exaggerated by QHCs. It is estimated, for example, that Scopus contains 12.6% duplicates (Valderrama-Zurián et al. 2015) and Google Scholar is assumed to list up to 10% erroneous, undated records (Orduña-Malea et al. 2015). These estimates show that duplicates constitute a significant proportion of total records in both search engines and other database types. As the ratio of unique records to duplicates or other erroneous records differs among ASEBDs, this factor is likely to affect their comparative size if assessed in terms of unique records. Hence, this limitation shows it is important to consider the types of records behind the size numbers.

Fourth, the size of a database is only one of multiple criteria that need to be assessed jointly to get an overall picture. For users a balance of these criteria, weighted for their unique preferences and requirements, influences the choice over which database best fits a given task. To assess databases further, especially concerning their suitability for academia, research would need to consider aggregate measures consisting of multiple variables such as relevance, objectivity, functional scope, or the user interface.

Conclusion

We conclude that the QHC measure is in most cases adequate to discern the sizes of ASEBDs. The iterative method used in this study has proven useful for obtaining plausible, up-to-date information on the sizes of eight of the 12 ASEBDs examined: BASE, CiteSeerX, EbscoHost, Google Scholar, ProQuest, Q-Sensei Scholar, Scopus, and Web of Science. While BASE and Q-Sensei Scholar provide updated size information on their websites, the other six ASEBDs do not, making QHCs relevant and necessary size predictors. For the ASEBDs where comparative numbers were entirely missing, our study is the first to introduce such size numbers.

Specifically, we found that it is plausible to assume that Google Scholar's size has so far been underestimated by between 8% (compared to Delgado López-Cózar et al. 2018) and 55% (compared to Orduña-Malea et al. 2015) owing to method difference, not time difference. It is certainly the most comprehensive academic search engine; nevertheless, it remains unclear why Google Scholar does not report its size. Given the unstable nature of Google Scholar's QHC, it might be that Google itself either has difficulties accurately assessing its size or does not want to acknowledge that its size fluctuates significantly. Perhaps it is important to Google to convey to those searching for information that it offers a structured, reliable, and stable source of knowledge. If Google maintains its policy of offering no information, scientometric estimation will have to remain the sole source of information on its size.

For all ASEBDs for which QHCs have been shown to function plausibly, they provide a simple and quick insight into ASEBD scope, particularly compared with other scientometric methods that require more statistics and data manipulation. The method presented here is replicable and permits anyone to quickly obtain updated information that can be tailored to specific categories of content, provided a specific ASEBD supports such filters. For example, Web of Science can be searched via the “:” operator; the resulting records can then be refined according to document type, organization, content category, et cetera. Using that approach makes the volume of the available content easily divisible and researchable. For the exceptions where QHCs are not plausible, other scientometric methods might bring more satisfactory results.

Furthermore, our methodology of QHC-based size estimation will prove useful for longitudinal analysis of ASEBD growth and time series monitoring. The method also makes it possible to compute the time lags between the publication and the indexing of items on the respective ASEBD, which allows the enquirer to assess the freshness (Croft et al. 2015) of the ASEBD's underlying data. The simplicity of the QHC method, which requires no statistical calculations, reduces the workload tremendously, a quality that should prove critical for further application (Prins et al. 2016). Monitoring ASEBDs is all the more necessary in times of exponential growth of information and scientific output (Bornmann and Mutz 2014). Ideally, such metrics would not only be recorded from time to time but monitored continuously. The QHC method makes it easy to replicate and update size metrics, thus increasing data relevance. While we have focused only on large multidisciplinary ASEBDs, the QHC method can also be used to check the sizes of other (and particularly smaller) information systems such as repositories, digital libraries, library catalogues, bibliographic databases, and journal platforms.
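As a sketch of such longitudinal monitoring, the snippet below records maximum QHCs over time and derives an average monthly growth rate. The snapshot function and CSV format are hypothetical, and the example figures are the Google Scholar values discussed above (176 million in May 2014 and the replicated 247 million in January 2018).

```python
import csv
import datetime as dt


def record_snapshot(asebd_name, max_qhc, path="qhc_timeseries.csv"):
    """Append today's maximum QHC for an ASEBD to a CSV time series (hypothetical
    format), which can later be used to track growth or indexing time lags."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([dt.date.today().isoformat(), asebd_name, max_qhc])


def monthly_growth(series):
    """Average growth per month from a date-sorted list of (date, qhc) tuples."""
    (d0, q0), (d1, q1) = series[0], series[-1]
    months = (d1 - d0).days / 30.44  # average month length in days
    return (q1 - q0) / months


# Google Scholar figures discussed above imply roughly 1.6 million new records per month.
series = [(dt.date(2014, 5, 1), 176_000_000), (dt.date(2018, 1, 1), 247_000_000)]
print(round(monthly_growth(series) / 1e6, 1), "million records per month")
```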