Introduction

Earth science data come from a variety of sources, including satellites, aerial surveys, models, ground-based monitoring stations, and ocean buoys, each contributing to a wealth of information on Earth's environmental and geophysical systems. NASA’s vast amount of Earth science data are archived, distributed and managed by Distributed Active Archive Centers (DAACs) as a part of the Earth Observing System Data and Information System (EOSDIS). NASA’s open science policy grants unrestricted access to Earth science data as noted in the Scientific Information Policy for NASA’s Science Mission Directorate (SPD-41a, 2022). However, these data archives are faced with critical challenges in the era of open science which include data provenance, accessibility, transparency, reproducibility, and accountability. Data archives maintain data metrics such as citation frequency of the data in peer reviewed journals and reports, data access and usage patterns, and scientific applications of the data. Usage metrics offer quantitative insights into a dataset's impact and relevance, promoting transparency, reproducibility, and accountability in scientific research, which are the pillars of open science.

There is a need to integrate persistent digital identifiers, like DOIs, for datasets, coupled with the establishment of rigorous citation protocols that mandate referencing these datasets with DOIs in scholarly outputs. Such practices are pivotal for ensuring the datasets' continued relevance, accessibility, versioning, and verifiability in the scientific ecosystem, thereby supporting the advancement of data-driven research and innovation. Introduced in the late 1990s (Parsons, 2019), dataset citation is akin to citing a research article, acknowledging the dataset creators, enabling automated access to data, and generating usage metrics (CODATA-ICSTI, 2013; Costas, 2013; Kratz, 2015). The push towards open science, enhanced data sharing, and reproducibility has accelerated the practice of citing datasets (Robinson-Garcia, 2016; Wilkinson, 2016) and its importance (CODATA-ICSTI, 2013). The Data Citation Index, launched by the WoS in 2012, facilitated the exploration of datasets and the assessment of their scholarly impact (Pavlech, 2016). Additionally, DataCite (Brase, 2009; Hirsch, 2024) emerged as a pivotal bibliometric tool for uncovering datasets and mapping the connections between datasets and scholarly documents (Robinson-Garcia, 2017). In 2022, the Data Citation Index encompassed over 12 million datasets and their associated literature citations, while DataCite had records for 16 million datasets, with platforms like Scopus, GS, and Crossref also establishing links between document citations and datasets.

Citing datasets in scholarly works has been in practice for over three decades, but it has been challenging to link datasets to research articles because citation rates of datasets are low in the research literature (Mooney, 2012; Park, 2017; Robinson-Garcia, 2016; Silvello, 2018; Vannan, 2020; Zhao, 2018). Robinson-Garcia (2016) and Peters (2016) reported that over 85% of datasets covered by WoS's Data Citation Index were uncited. An exhaustive search for dataset citations is necessary to maximize output of bibliographic records citing the datasets. Since usage of datasets may not be fully captured in selected journals covered by WoS and Scopus, bibliometric sources with more exhaustive coverage such as GS must be considered. Additionally, as Crossref increases its records reference coverage, Crossref and DataCite should be evaluated as emerging bibliometric sources for retrieving dataset citing documents.

Numerous studies have examined how GS and various academic databases cover document citation counts, with GS often emerging as the most comprehensive source (Harzing, 2016, 2019; Moed, 2016; Martín-Martín, 2018; Thelwall, 2018; Delgado López-Cózar, 2018; Chapman, 2019; Levine-Clark, 2021; Martín-Martín, 2021). Collectively, these studies highlight the critical role of database selection in academic research, with GS consistently offering broad citation coverage, though attention to detail is essential for accurate citation analysis. Some of these studies (Martín-Martín, 2021) included Crossref, the largest document DOI registry, which covers the overwhelming majority of documents and aims to hold comprehensive metadata for each record; those studies consistently reported that Crossref significantly underperformed in cross-document linkage. Because publishers have recently begun depositing references into Crossref following the Initiative for Open Citations (I4OC, 2024), Crossref is becoming a promising source for citation counts and is thus included in our work.

While considerable research has focused on the connections between documents across bibliographic databases, the relationships between documents and datasets remain notably underexamined. This study explores the extent of dataset citation coverage in journal and conference articles, reports, and book records retrieved from various bibliographic sources, and examines the time trends of that coverage in more depth. The temporal analysis of citations provides insights into whether emerging sources such as Crossref and DataCite show improvement, and helps to evaluate the fitness of Crossref as a source for citation linkages. Here we address the intersection and uniqueness, extent of coverage, and temporal trend of citations retrieved from WoS, Scopus, Crossref, GS, and DataCite. The outcome of this study can be used to identify the most reliable bibliographic source and to recommend the combination of sources that maximizes the retrieval of dataset-citing documents.

Methodology

NASA Earth science datasets

The study utilized 11,000 DOIs of the datasets available to the public through NASA's EOSDIS data archive centers (Behnke, 2019). These datasets contain remote sensing, models, and ground observation data covering a variety of Earth science disciplines: atmospheric chemistry and dynamics, hydrology, water and energy cycles, land processes, and others. These datasets significantly vary in quantities of files and file sizes, file formats, and available services. The datasets are open to the public and widely used in various applications and research. To ensure dataset accessibility, findability, and provenance, DAACs are required to register DOIs for each publicly available version of the datasets. This DOI registration policy was initiated in 2012 and currently the majority of EOSDIS public datasets have assigned DOIs. These dataset DOIs are registered with the DataCite registry using one of the three prefixes: 10.5067, 10.3334, and 10.7229. The 11,000 dataset DOIs used in this study represent about 0.06% of the total dataset records registered with DataCite, which stands at around 16 million as of the end of 2023 (Hirsch, 2024). Although the study focuses on EOSDIS data, its results can also be applied to other Earth science data archives.

Bibliographic sources

This study retrieves bibliographic records from WoS (Birkle, 2020), Scopus (Burnham, 2006), GS (Van Noorden, 2014), the Crossref OpenCitations Index (COCI, 2024; Heibi, 2019), and DataCite. These bibliographic sources differ in the number of records they contain, the coverage of publication titles and sources, and the cost of accessing them, as well as the search results attributes needed for citation coverage analysis.

WoS is a database of citations chosen through field expert reviews (De Bellis, 2009), containing mainly English publications but also including sources from Chinese, Korean, Russian, South, and Latin American languages. By 2020, its main collection held 70 million records, totaling 150 million overall (Birkle, 2020). Scopus, part of Elsevier's citation index database, selects content through independent subject experts and advisory boards (Burnham, 2006). By 2020, Scopus covered 76 million records (Baas, 2020), emphasizing international publications and proceedings more than WoS (Mongeon, 2016). While Scopus better covers art and social sciences, science and engineering receive similar coverage in both databases. It is worth mentioning that a study by Van Eck (2019) found issues with missing and incorrect references in both WoS and Scopus, with the number of incorrect references decreasing after 2002, remaining below 1.5% according to time trend analysis.

Crossref is the DOI registry for various research outputs, covering journals, conference papers, reports, pre-prints, datasets, and more (Hendricks, 2020). With 13,000 members from 120 countries, Crossref is an international organization that contained 134 million metadata records in early 2022. The COCI was created using the OpenCitation architecture (Peroni, 2020) in 2018 and by early 2022 it reached 72 million records. Although COCI has the lowest citation coverage compared to major citation databases (Harzing, 2019; Martin-Martin, 2021), the Initiative for Open Citations (I4OC, 2024) was founded in 2017 to encourage publishers to deposit document references to Crossref, which should positively impact COCI's coverage in the future.

In contrast to citation databases, GS indexes documents rather than their sources (Van Noorden, 2014; Prins, 2016). GS's undisclosed number of records reflects its ongoing improvement in indexing and searching, estimated at around 300–389 million (Delgado López-Cózar, 2018; Gusenbauer, 2019). GS indexes scholarly documents by crawling various sources, including academic databases, search engines, publishers, universities, and social media. GS indexes all documents from sources considered academic, which may result in including non-scholar content, such as book reviews, low-impact documents, and duplicates, leading to inflated citation counts (Delgado López-Cózar, 2018; Halevi, 2017; Prins, 2016; Martin-Martin, 2018). Despite these issues, GS offers comprehensive coverage, including dissertations, books, and conference proceedings not covered by WoS or Scopus, valuable for assessing scholarly impact in certain fields (Martin-Martin, 2018). Yet, its limited metadata poses challenges for bibliometric research (Chapman, 2019; Halevi, 2017; Martin-Martin, 2018).

DataCite is the largest registry of dataset DOIs, founded in 2009 (Brase, 2009). The total number of records reached over 16 million in 2019 (Dudek, 2019) with a total number of datasets only reaching 16 million by the end of 2023 (Hirsch, 2024). By providing the DOIs of the documents citing the datasets along with the metadata describing them, DataCite is a novel bibliometric source (Robinson-Garcia, 2017). Another initiative, Scholarly Link eXchange (Scholix) aims to link datasets with articles citing them and presents the potential to be a bibliometric source (Cousijn, 2019; Khan, 2020).

Collecting citation data

The process of retrieving records of documents citing the datasets was automated using available Application Programming Interfaces (APIs). Availability of API, document coverage, and essential bibliographic metadata for the considered bibliometric sources are summarized in Table 1.

Table 1 Comparison of API availability, document coverage, and bibliographic metadata for bibliographic sources used in this study

Free-of-charge APIs are offered by COCI, Scopus, and DataCite. The Scopus API was queried for all records linked to each dataset DOI. This linkage is created not only for the dataset DOIs appearing in the document references but also for the dataset DOIs appearing in data statements. The DataCite API was queried for the citations of each dataset DOI. To find dataset-citing document records in COCI, the Crossref Event data API was queried by the NASA Earth science dataset DOI prefixes 10.5067, 10.3334, and 10.7229, and the API returned the document records referencing DOIs with these prefixes. To examine the references of document records found across all bibliographic sources, the Crossref API was queried with the DOI of each document.

Despite the absence of a free API from GS, a subscription service provided by an external company, SerpApi (SerpApi, 2024), was utilized to access GS via API. This study differs from prior studies that manually searched or code-scraped GS web pages (Martin-Martin, 2018, 2021; Harzing, 2019) by utilizing an API for GS citation search. The limited document metadata returned by a GS search, which lacks a document identifier such as a DOI or ISBN, is inadequate for identifying the document. To overcome this limitation and obtain bibliometric metadata for GS results, a Zotero translation server (Corporation for Digital Scholarship, 2022) was utilized to convert each GS result's URL into document metadata. This method is similar to that of Chapman (2019), who used the RefWorks citation manager to collect bibliometric data based on the document's URL. For untranslatable content, Crossref was queried with the record title, first author, and year of publication, and the Jaro-Winkler similarity (Winkler, 1990) with a threshold of 0.95 was used to decide whether the returned title matched. Additionally, the publication years were compared.
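The title-matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the title normalization (lowercasing and collapsing whitespace), the use of `>=` against the 0.95 threshold, and the requirement that publication years be exactly equal are all assumptions. Only the Jaro-Winkler measure and the 0.95 threshold come from the text.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, from 0.0 (no match) to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def titles_match(title_a, year_a, title_b, year_b, threshold=0.95):
    """Assumed decision rule: similar titles and identical publication years."""
    def norm(title):
        return " ".join(title.lower().split())  # assumed normalization
    return year_a == year_b and jaro_winkler(norm(title_a), norm(title_b)) >= threshold
```

A high threshold such as 0.95 tolerates small typographic differences between a GS title and a Crossref title while rejecting titles that merely share a few words.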

The WoS Expanded API required a paid license to be used, which the authors did not have access to during the study. The WoS Cited References Web interface was searched for NASA’s Earth science dataset DOI prefixes: 10.5067, 10.3334, and 10.7229. Obtained document records and their references were processed to link the records to individual dataset DOIs.

Figure 1 presents a flowchart describing the citation collection and processing. Once the document citations have been collected and mapped to the dataset DOIs they undergo further processing to retain the citations with the desired types and remove duplicates.

Fig. 1 Document records collection and processing flowchart

Processing citation data

While previous studies have used a variety of matching procedures, including titles, publication years, and author names (Martin-Martin, 2018; Thelwall, 2018; Visser, 2021), our study used only DOIs of the document records for matching. Since DOIs became widely available, they have been assigned not only to journal articles and proceedings, but also to other scholarly content types such as dissertations, reports, books, and book chapters. While it is possible for a document to have more than one DOI, the fraction of such documents is negligible. In this study, the overwhelming majority of scholarly content published after 2012 has been assigned DOIs, making them a reliable and efficient means of matching citations. Since the document DOI was used as the primary means for document matching, the study is limited to scholarly documents that have DOIs.
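DOI-based matching of this kind can be sketched as below. The sketch assumes that DOIs arrive in mixed forms (bare, as `doi:` strings, or as resolver URLs) and relies on the fact that DOI names are case-insensitive. The helper names and the example DOIs under the `10.1000` prefix are hypothetical, and the study's actual normalization rules are not described in the text.

```python
def normalize_doi(raw: str) -> str:
    """Reduce a DOI string to a bare, lowercase form suitable as a match key."""
    doi = raw.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/",
                   "https://dx.doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def merge_by_doi(records_by_source: dict) -> dict:
    """Map each normalized document DOI to the set of sources that returned it."""
    merged = {}
    for source, dois in records_by_source.items():
        for raw in dois:
            merged.setdefault(normalize_doi(raw), set()).add(source)
    return merged


# Hypothetical input: the same document reported by two sources in different forms.
merged = merge_by_doi({
    "GS": ["https://doi.org/10.1000/ABC"],
    "WoS": ["doi:10.1000/abc", "10.1000/xyz"],
})
```

Keying on the normalized DOI removes duplicates across sources in one pass and records which sources found each document, which is the information the intersection analysis needs.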

To ensure the retention of quality content, the resulting document citations were processed to retain documents of only certain types such as journal and proceedings articles, books and book chapters, dissertations, monographs, and reports. The study excluded non-scholarly and duplicated content types such as discussion papers, preprints, webpage articles, software, and links to PDF documents. The crosslinks between dataset DOIs, though rare, were also excluded. The document types were obtained through Crossref API queries with Crossref providing types for 99.9% of the collected citations.
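The type filter can be sketched as a simple allow-list over the `type` field that Crossref returns for a work. The mapping from the document types named above to specific Crossref type identifiers is an assumption made for illustration, and the example DOIs are hypothetical; the study does not list the exact identifiers it retained.

```python
# Assumed mapping of the retained document types to Crossref type identifiers.
RETAINED_TYPES = frozenset({
    "journal-article",
    "proceedings-article",
    "book",
    "book-chapter",
    "dissertation",
    "monograph",
    "report",
})


def keep_record(record: dict) -> bool:
    """Keep a record only if its Crossref type is on the allow-list."""
    return record.get("type") in RETAINED_TYPES


records = [
    {"DOI": "10.1000/a", "type": "journal-article"},
    {"DOI": "10.1000/b", "type": "posted-content"},  # preprints are excluded
    {"DOI": "10.1000/c", "type": "dataset"},         # dataset crosslinks are excluded
    {"DOI": "10.1000/d"},                            # no type returned
]
retained = [r["DOI"] for r in records if keep_record(r)]
```

An allow-list drops records with an unknown or missing type by default, which is the conservative choice when the goal is to retain only recognized scholarly content.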

Metadata from Crossref was used to enrich citations of resulting documents with journals and publishers' names, and article research area subjects were obtained for 99% of documents. A resulting set of document citations included: document DOI, publication year, title, journal and publisher names, subjects, list of bibliographic sources the citation was obtained from, and a list of dataset DOIs the citation is linked to. Additionally, Crossref references were retrieved for 94% of the resulting document citations.

Results

As of the beginning of 2023, the total count of publication records citing EOSDIS dataset DOIs collected from all five bibliographic sources was 17,095. Yearly counts of collected bibliographic records along with EOSDIS unique dataset citation counts are displayed in Fig. 2, where each year is the year in which a document record was published. Citation data in Fig. 2 start in 2012 since the total citation count before 2012 is under 100 due to the very limited number of EOSDIS dataset DOIs registered prior to 2012. The number of publications citing NASA Earth science datasets has increased significantly each year since 2012, when registration became an established practice (Wanchoo, 2017). The counts of unique dataset DOIs cited each year show steady growth in citing the dataset DOIs. The total number of unique dataset DOIs found in all publications is around 3,000, which is approximately 27% of all registered dataset DOIs. A likely reason for this share is that most dataset DOIs were registered in the few years following 2012, and it takes time to publish works that utilize the new datasets.

Fig. 2 Yearly count of unique document records collected from all bibliographic sources and unique EOSDIS dataset DOIs cited in those documents

To characterize the science disciplines of EOSDIS dataset-citing publications, the document record metadata is presented in Table 2, which displays the top ten popular publication subjects out of the total 227 retrieved from Crossref for each publication. Crossref was able to return subjects for 95% of all collected publications. It should be noted that each publication may have one or more subjects. As Table 2 indicates, the majority of publications cover Earth and environmental science areas, with a significant skew towards studies concerning atmospheric and general Earth sciences.

Table 2 The ten most occurring subjects of EOSDIS dataset citing documents retrieved from Crossref

Figure 3 displays the counts and percentages of document records contributed by each bibliographic source to the total count of 17,095. GS contributes the most publications (70%) and DataCite the fewest (19%). WoS contributes 8 percentage points more than Scopus, and Scopus contributes 12 percentage points more than Crossref.

Fig. 3 Contribution of each bibliographic source to the total number of collected publications citing EOSDIS dataset DOIs

Citation intersection between bibliographic sources

This section examines the intersections and uniqueness of the dataset-citing publications obtained from WoS, Scopus, Crossref, GS, and DataCite. To this end, we analyze the intersections of citations from two, three, and four bibliographic sources. As was mentioned earlier, the intersection between sources is influenced by the availability of the document references and texts, the source coverage of the publications, and the bibliographic source citation index system. While the obtained data do not explain the reasons for intersections between the sources, they provide important information about source coverage, the range of intersections, and the contribution of unique citations by individual sources in source combinations. Figure 4 presents the citation intersection and coverage of all pairs of sources, where the intersection between two sources is calculated as Source1 ⋂ Source2 and the two-source citation coverage is calculated as (Source1 ⋃ Source2) / (Source1 ⋃ … ⋃ Source5). Note that SourceN is the set of all document DOIs obtained from that source, and a ratio of sets denotes the ratio of their record counts. The Venn circles in Fig. 4 show the percentages of source-unique citations and of the source intersection, calculated as (Source1 − (Source1 ⋂ Source2)) / (Source1 ⋃ Source2) and (Source1 ⋂ Source2) / (Source1 ⋃ Source2), correspondingly. For example, Fig. 4a shows that GS and WoS together cover 88% of total collected document records. The WoS and GS unique and joint fractions in their coverage are distributed in the following way: 40% GS unique, 22% WoS unique, and 38% of citations found in both GS and WoS.
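The two-source quantities defined above map directly onto set operations. The following sketch uses toy DOI sets; the function name, the dictionary keys, and the placeholder identifiers `d1` to `d12` are assumptions for illustration and do not reproduce the study's data.

```python
def pair_stats(a: set, b: set, all_records: set) -> dict:
    """Two-source coverage and Venn fractions for two sets of document DOIs.

    Assumes both the pair union and all_records are non-empty.
    """
    union = a | b
    return {
        # Share of all collected records that the pair finds together.
        "coverage": len(union) / len(all_records),
        # Venn fractions, all relative to the pair's own union.
        "unique_a": len(a - b) / len(union),
        "unique_b": len(b - a) / len(union),
        "shared": len(a & b) / len(union),
    }


# Toy example: 12 collected records, two sources with partial overlap.
all_records = {f"d{i}" for i in range(1, 13)}
source_a = {f"d{i}" for i in range(1, 8)}    # d1 .. d7
source_b = {f"d{i}" for i in range(5, 11)}   # d5 .. d10
stats = pair_stats(source_a, source_b, all_records)
```

Because the three Venn fractions share the pair union as their denominator, they always sum to one, which matches the 40%, 22%, and 38% split quoted for GS and WoS.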

Fig. 4 Two-source Venn diagrams with intersections and percentages of two-source publications coverage for GS, WoS, Scopus, Crossref, and DataCite collected from EOSDIS dataset DOI search

The overlap of citations for each pair of sources is calculated as (Source1 ⋂ Source2) / Source1, which is the fraction of the citations of Source1 that is also found in Source2. A larger overlap indicates a more redundant source. For example, as shown in Fig. 4a, 50% of GS citations overlap with WoS and 65% of WoS citations overlap with GS, meaning that WoS is more redundant than GS when these two sources are considered.
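The overlap measure is asymmetric, which a short sketch makes concrete. The function name and the toy DOI sets are assumptions for illustration only.

```python
def overlap(source1: set, source2: set) -> float:
    """Fraction of source1's records that are also found in source2."""
    return len(source1 & source2) / len(source1)


# Toy sets: the same two shared records weigh differently for each source.
larger = {"d1", "d2", "d3", "d4"}
smaller = {"d3", "d4", "d5"}

larger_in_smaller = overlap(larger, smaller)   # 2 of 4 records
smaller_in_larger = overlap(smaller, larger)   # 2 of 3 records
```

The shared records are the same in both directions; only the denominator changes. The smaller source therefore shows the larger overlap and is the more redundant member of the pair.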

As demonstrated in Fig. 4a, the best coverage is provided by the combined GS and WoS search, yielding 88% of all records, compared to 70% and 53% of records returned by these sources individually (as shown in Fig. 3). This two-source coverage is followed by GS coverage with Scopus (81%), Crossref (80%) and DataCite (76%). Without GS, the largest two-source coverage is provided by the WoS and Scopus combination at 72%. An important observation from Fig. 4 is that in all two-source coverages except for those with DataCite, the fraction of unique citations carried by each source is quite significant, and the overlap of two sources is not as large as was reported in document-to-document citation coverage studies. For instance, while Martin-Martin (2021) reports GS overlap with other sources to be at least 90%, our results show that for GS two-source coverage, other sources add a significant fraction of unique citations: WoS (22%), Scopus (15%) and Crossref (14%). A similar observation is apparent in Fig. 4e for the intersection between WoS and Scopus with only 36% of overlapping citations, 49% of citations unique to Scopus and 57% of citations unique to WoS.
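The individual and combined coverage figures quoted for GS and WoS are linked by inclusion-exclusion: the share of records found by both sources equals the sum of the individual shares minus the combined share. The sketch below applies this to the rounded percentages from the text; the function name is an assumption, and because the inputs are rounded, the derived values can only approximate the figures.

```python
def implied_intersection(share_a: int, share_b: int, share_union: int) -> int:
    """Inclusion-exclusion on percentages of the total record count."""
    return share_a + share_b - share_union


gs, wos, gs_or_wos = 70, 53, 88            # rounded percentages quoted in the text
both = implied_intersection(gs, wos, gs_or_wos)

gs_found_in_wos = both / gs                # overlap of GS with WoS
wos_found_in_gs = both / wos               # overlap of WoS with GS
shared_in_union = both / gs_or_wos         # joint fraction of the pair's union
```

With these inputs the implied joint share is 35% of all records, which gives a GS overlap of 50% and a WoS overlap of about 66%, close to the 50% and 65% read from Fig. 4a.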

Triple-source coverage fractions, intersections, and unique records found in each bibliographic source are presented in Fig. 5. Citation coverages of the GS-WoS-Scopus and GS-WoS-Crossref triples are the most comprehensive, each equal to 94%. As demonstrated in Fig. 5a, b, Scopus and Crossref each add the same fraction of citations, 7%, to the GS-WoS source pair. Figure 5c demonstrates that DataCite is an even less significant addition to the GS-WoS pair, adding 4% of citations. Among the triple-source combinations without GS, the biggest three-source coverage is 80%, as shown in Fig. 4g, and it is achieved by the WoS-Scopus-Crossref triple with each source contributing a significant amount of unique citations: WoS 25%, Scopus 16%, and Crossref 10%.

Fig. 5 Three-source Venn diagrams with percentages of three-source document citation coverage from the total records for EOSDIS dataset DOIs collected from all bibliographic sources. Numbers in Venn circles are the fractions of citations from the triple-source citation union

Coverage by four-source combinations is not shown separately. Instead, Fig. 6 presents the fraction of citations that is unique to each single source, that is, the fraction that source adds to the combination of the other four sources.

Fig. 6 Fractions of unique EOSDIS dataset-citing documents collected from a single bibliographic source

The unique citation fraction of a single source is calculated as (Source1 − (Source2 ⋃ Source3 ⋃ Source4 ⋃ Source5)) / (Source1 ⋃ … ⋃ Source5). As demonstrated in Fig. 6, GS and WoS are both significant sources of citations not found by any other source, contributing 19% and 10%, correspondingly. As with triple-source combinations, the contribution of unique citations from Scopus, Crossref, and DataCite is low to insignificant: 4% Scopus, 3% Crossref, and 1% DataCite.
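The unique citation fraction generalizes to any number of sources. The sketch below uses three toy sources instead of five; the function name, source labels, and DOI placeholders are assumptions for illustration.

```python
def unique_fraction(name: str, sources: dict) -> float:
    """Share of all collected records that only the named source found."""
    others = set().union(*(s for n, s in sources.items() if n != name))
    everything = set().union(*sources.values())
    return len(sources[name] - others) / len(everything)


# Toy example with three sources and five distinct records overall.
sources = {
    "A": {"d1", "d2", "d3"},
    "B": {"d3", "d4"},
    "C": {"d4", "d5"},
}
unique = {name: unique_fraction(name, sources) for name in sources}
```

Here source B contributes nothing unique because each of its records is also found by A or C, which mirrors how a source with sizable coverage can still add little to a combination.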

Temporal trends of citation intersection

The temporal trends of citation counts obtained from bibliographic sources are examined based on the year those documents were published. For this, we look at the yearly publication counts obtained from each bibliographic source for the last five years at the time of writing, from 2018 until 2022. Documents published before 2018 are not used for this analysis because the counts of the documents are significantly lower, as shown earlier in Fig. 2.

As depicted in Fig. 7, the counts of publications obtained from all sources increase yearly, but with varying rates. GS and WoS exhibit a higher rate of increase compared to other sources. While GS returns the highest citation counts for each year, the citation counts ratio of WoS and Scopus changes over the years. WoS underperforms Scopus for 2018–2020, but in 2021 and 2022, it outperforms Scopus significantly, and in 2022, it reaches the same level as GS. Crossref underperforms GS, WoS, and Scopus until 2022 when it reaches Scopus. DataCite significantly underperforms all sources until 2022, but in 2022, it experiences a fivefold increase in publication counts compared to 2021 and starts approaching Crossref.

Fig. 7 Yearly counts and trends of publications citing EOSDIS dataset DOIs obtained from bibliographic sources

To analyze the temporal trends of source contribution to the total count of publications, the fraction of publications contributed by each source yearly was examined. As shown in Fig. 8, GS contributes around 70% of all publications each year, which is consistent with the GS overall citation contribution in Fig. 3. Contrary to the yearly consistent performance of GS, other sources exhibit either upward or downward trends. WoS citation contribution shows the strongest upward trend, changing from 37% in 2018 to 68% in 2022. Both Crossref and DataCite also have upward trends, with similar rates of publication growth, though not as strong as WoS. In contrast, Scopus displays a downward trend, with 58% of publications in 2018 reduced to 43% in 2022. The temporal behavior of Scopus and WoS can explain the disagreement with prior works that reported Scopus outperforming WoS. Indeed, Scopus outperformed WoS until 2020, as previous studies reported, but in 2021, WoS started outperforming Scopus.
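The yearly contribution fractions can be computed by counting, for each publication year, how many of that year's unique records each source returned. The sketch below assumes each record is a pair of publication year and the set of sources that found it; this record layout, the function name, and the toy values are assumptions for illustration.

```python
from collections import Counter, defaultdict


def yearly_source_fractions(records) -> dict:
    """For each source, the fraction of each year's records that it returned."""
    totals = Counter()                 # unique records per publication year
    per_source = defaultdict(Counter)  # records per source per year
    for year, found_by in records:
        totals[year] += 1
        for source in set(found_by):
            per_source[source][year] += 1
    return {
        source: {year: counts[year] / totals[year] for year in sorted(totals)}
        for source, counts in per_source.items()
    }


# Toy records: (publication year, sources that returned the record).
records = [
    (2021, {"GS", "WoS"}),
    (2021, {"GS"}),
    (2022, {"WoS"}),
    (2022, {"GS", "WoS"}),
]
fractions = yearly_source_fractions(records)
```

Since a record can be returned by several sources, the fractions for a given year do not sum to one across sources.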

Fig. 8 Yearly fractions and trends of publications citing EOSDIS dataset DOIs obtained from each bibliographic source

For all collected publication records, Fig. 4 in the previous section shows the overlap between sources as the fraction of one source's citations that is also found by the other source. In this section we look further at these overlaps by studying their temporal trends. As a reminder from the previous section, the fraction of the citations of Source1 found in Source2 is calculated as (Source1 ⋂ Source2) / Source1. Using that formula, Figs. 9–13 illustrate the intersection of EOSDIS dataset-citing document records from one source with the other sources and the temporal trends of this intersection.
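Applying the overlap formula year by year only requires grouping the records of Source1 by publication year. The sketch below assumes each source is a mapping from document DOI to publication year; this layout, the function name, and the toy values are assumptions for illustration.

```python
from collections import defaultdict


def yearly_overlap(source1: dict, source2: dict) -> dict:
    """Per publication year, the fraction of source1 records also in source2."""
    total = defaultdict(int)
    found = defaultdict(int)
    for doi, year in source1.items():
        total[year] += 1
        if doi in source2:
            found[year] += 1
    return {year: found[year] / total[year] for year in sorted(total)}


# Toy sources: document DOI placeholder -> publication year.
source1 = {"d1": 2021, "d2": 2021, "d3": 2022}
source2 = {"d1": 2021, "d3": 2022, "d9": 2022}
trend = yearly_overlap(source1, source2)
```

Records that appear only in the second source, such as `d9` here, do not enter the calculation because the denominator is restricted to the first source.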

Fig. 9 Yearly GS publications intersections with other sources for EOSDIS dataset citing records

The data for the temporal trends can be compared with the Fig. 4 intersection averages over all years. The average fraction of GS records found by WoS over all years, (GS ⋂ WoS) / GS, is equal to 50% as seen in Fig. 4a. When interpreting the intersection temporal trends in Fig. 9, we can see that in 2018, 35% of records found in GS could be found in WoS. This fraction steadily increases each year, becoming 63% in 2022. Similar to this, Fig. 4c shows that while the average fraction of GS citations that is found in Crossref, (GS ⋂ Crossref) / GS, is equal to 31%, the temporal trend of this fraction changes from 33% in 2018 to 45% in 2022 with a minimum of 23% in 2020. These temporal trends show that while the overlap of GS citations with WoS, Crossref, and DataCite grew over the years and reached its peak in 2022, the overlap of GS with Scopus decreased over the years from 57% in 2018 to 49% in 2022.

Figure 10 reveals another trend: over the years 2018–2022, approximately 65% of WoS citations for EOSDIS datasets are persistently found by GS. Perhaps this is due to GS indexing consistently having access to a certain proportion of references in publications indexed by WoS.

Fig. 10 Yearly WoS publications intersections with other sources for EOSDIS dataset citing records

Figure 11 shows that while fractions of Scopus publications found in GS and WoS in 2018 were 69% and 39%, respectively, those fractions grew substantially to 80% for both sources in 2022. These data resonate with the Fig. 5a triple Venn diagram for all time intersections between WoS, GS, and Scopus, where the fraction of unique Scopus publications in union of these three sources is only 7%.

Fig. 11 Yearly Scopus publications intersections with other sources for EOSDIS dataset citing records

Figure 12 shows that the trends for the intersections of Crossref with WoS and GS are very similar to those of Scopus. Comparing Fig. 12 results with Fig. 5b, we see that the fraction of unique Crossref publications in the all-time union of Crossref, WoS, and GS is the same as that of Scopus and equal to 7%, even though the highest fractions of Crossref citations found in these sources are 73% and 74%, correspondingly.

Fig. 12 Yearly Crossref publications intersections with other sources for EOSDIS dataset citing records

As demonstrated in Fig. 13, the intersection between DataCite and Crossref is the highest, with 100% of all DataCite citations being found by Crossref in 2022. This is in line with the fact that DataCite obtains its dataset-to-document linkage from Crossref. Unless the community adds document citations to DataCite, or DataCite acquires citations from other bibliographic sources, the coverage of DataCite by Crossref should stay 100%.

Fig. 13 Yearly DataCite publications intersections with other sources for EOSDIS dataset citing records

In conclusion, the common trend observed in Figs. 9–13 is that the intersection of EOSDIS dataset citing records collected from any source with other sources tends to increase over time, except for the intersection with Scopus. The highest rate of intersection growth is observed for intersections of WoS with other sources.

Dataset citation coverage by Crossref

In 2017, calls for Open Citations led publishers to deposit the references of their publications into the document DOI registry, primarily Crossref, making them openly available. Our study aimed to evaluate the current state of references available in Crossref for document records found in various bibliographic sources. Out of 17,095 EOSDIS dataset-citing publications, 99% had a record in Crossref. Out of the 16,920 Crossref records, 94% had references available in Crossref. Despite this high percentage, the Crossref Event data interface returns only 33% of all publications, which raises the question of why the interface returns so few records when references are so widely available.

To address this discrepancy, we examined the references of each of the 16,119 Crossref records, searching for the presence of EOSDIS dataset DOIs. We found that 9,792 records, or 57% of all 17,095 collected publications, contained EOSDIS dataset DOIs in their references, which is significantly larger than the 33% of citations returned by the Crossref Event data interface, as shown in Fig. 3. This indicates that the Event data interface of Crossref does not capture all records that reference dataset DOIs, highlighting the need for alternative methods for discovering these references.
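A reference scan of this kind can be sketched as follows. The sketch assumes `work` is the `message` object of a Crossref REST API works response, in which `reference` is a list of entries that may carry a `DOI` field and an `unstructured` citation string. Scanning the unstructured text with a regular expression is an added assumption, not a step the text describes, and the example DOIs are hypothetical.

```python
import re

EOSDIS_PREFIXES = ("10.5067/", "10.3334/", "10.7229/")
# Assumed pattern for an EOSDIS dataset DOI embedded in free citation text.
DOI_IN_TEXT = re.compile(r"10\.(?:5067|3334|7229)/\S+")


def dataset_dois_in_references(work: dict) -> set:
    """Collect EOSDIS dataset DOIs found in a work's deposited references."""
    found = set()
    for ref in work.get("reference", []):
        doi = (ref.get("DOI") or "").lower()
        if doi.startswith(EOSDIS_PREFIXES):
            found.add(doi)
        text = (ref.get("unstructured") or "").lower()
        for match in DOI_IN_TEXT.findall(text):
            found.add(match.rstrip(".,;)"))  # drop trailing punctuation
    return found


# Hypothetical record with one structured and one unstructured dataset reference.
work = {
    "reference": [
        {"key": "r1", "DOI": "10.5067/EXAMPLE/DATA001"},
        {"key": "r2", "unstructured": "Some dataset, https://doi.org/10.3334/EXAMPLE/42."},
        {"key": "r3", "DOI": "10.1000/xyz"},
    ]
}
dataset_dois = dataset_dois_in_references(work)
```

A record counts as dataset-citing when the returned set is non-empty. Lowercasing keeps the comparison consistent with the case-insensitive nature of DOI names.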

Figure 14 compares the two-source intersections and coverage of Crossref with GS, WoS, and Scopus, based on data collected from both the Crossref Event data API and the content of Crossref references for document records from all five sources. The figure shows that if the Event data API had identified all dataset DOIs present in its document references, the intersection between Crossref and other sources would have grown significantly.

Fig. 14 Two-source Venn diagrams and two-source citation coverage for intersections of Crossref’s EOSDIS dataset citing records with GS, WoS, or Scopus based on (a) Crossref Event data API results and (b) content of Crossref records’ references

Comparing columns (a) and (b) in Fig. 14, the fraction of GS, WoS, and Scopus citations found by Crossref grows from 32%, 36%, and 40% when using the Event Data API to 52%, 59%, and 68%, respectively, when considering the reference content of Crossref's publication records. This indicates that the Crossref Event Data API misses records with dataset DOIs in their references that amount to at least 20 percentage points of each source's citations.
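The gap between the two columns can be checked with a short calculation. The sketch below uses only the percentages quoted in the preceding paragraph; the variable names are illustrative.

```python
sources = ["GS", "WoS", "Scopus"]
api_share = [32, 36, 40]        # % of each source's citations found via the Event Data API, column (a)
reference_share = [52, 59, 68]  # % found via deposited references, column (b)

# Difference in percentage points between column (b) and column (a).
gains = {s: b - a for s, a, b in zip(sources, api_share, reference_share)}
print(gains)                # {'GS': 20, 'WoS': 23, 'Scopus': 28}
print(min(gains.values()))  # 20
```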

The findings in Fig. 14 reveal that even if Crossref's Event Data API were to detect all citations that have dataset DOIs in their references, a significant portion of document citations would still be missed. Specifically, Crossref finds dataset DOI references for only 52%, 59%, and 68% of GS, WoS, and Scopus records, respectively, even though 93.6%, 97.5%, and 95.0% of the citations found by GS, WoS, and Scopus, respectively, have references available in Crossref. To explain this discrepancy, several cases were examined. First, although the document text may contain an EOSDIS dataset DOI in its references, the dataset DOI may not be present in the references deposited in Crossref, which prevents Crossref from detecting the dataset reference record. Second, for some datasets, WoS can derive the dataset DOI from the dataset citation even if the DOI does not appear in the paper references. In those cases, WoS returns the document citation along with references containing dataset DOIs. Lastly, GS and Scopus returned records whose references did not contain dataset DOIs or citations, indicating that the dataset DOIs were located elsewhere within the paper.

To better understand the reasons behind the discrepancy between Crossref and the other sources' results, a thorough examination of paper references and texts is necessary. However, we believe that the major reason for the difference is the absence of dataset DOIs in Crossref record references when they are present in the original paper. For the WoS intersection, all records obtained from WoS were checked for dataset DOIs in their references, and only those containing them were retained. Even so, 97.5% of WoS records have references in Crossref, but only 59% of them intersect with Crossref. This discrepancy cannot be explained by WoS adding DOIs to references that do not have them.

Because publishers deposit references into Crossref, we grouped the document counts obtained from the bibliographic sources by publisher to determine whether citation counts differ by publisher.

Figure 15 displays the counts of EOSDIS dataset-citing publications collected from bibliographic sources, grouped by major publisher. It shows data for the eight largest publishers, which represent 80% of the records returned by all bibliographic sources analyzed in this study. The Crossref results shown in the figure reflect the counts of publications that have references with dataset DOIs in Crossref, not the counts obtained from the Crossref Event Data search. The ratio between the Crossref and Total counts in Fig. 15 can be considered an indicator of the share of document records that have EOSDIS dataset DOIs in the references deposited by the publisher into Crossref. By this measure, Copernicus has the highest share of document records with references containing EOSDIS dataset DOIs. For Elsevier, the largest publisher among the collected publications, Fig. 15 indicates that more than 2000 document records deposited in Crossref do not have dataset DOIs in their references. Meanwhile, the high GS results for Elsevier suggest that dataset DOIs are present in the paper references or other document parts. The WoS results for Elsevier cover half of the total records, indicating that the publications' references contain dataset DOIs that are missed in Crossref. The low Scopus record counts for Elsevier likely indicate that Scopus does not obtain all publication references containing dataset DOIs.

Fig. 15

Unique counts of publications citing EOSDIS datasets for each bibliographic source, grouped by major publisher

Conclusions

This study can only be compared to previous research examining document-to-document linkages. As demonstrated in our study, certain bibliographic sources may lack access to dataset DOIs, even when they are referenced in publications. Moreover, these sources may not consistently provide dataset-to-document linkages, as illustrated by the case of Crossref. It is important to consider these factors when comparing our findings with those of earlier studies. The results of this study agree with previous findings (Moed, 2016; Delgado López-Cózar, 2018; Martin-Martin, 2018, 2021; Harzing, 2019; Levine-Clark, 2021) that GS outperforms other bibliographic sources. While previous works (Martin-Martin, 2018, 2021; Harzing, 2019; Levine-Clark, 2021) show that Scopus outperforms WoS in the quantity of detected citation links, this study has determined that on average WoS significantly outperforms Scopus and that the overlap between Scopus and WoS is not as large. However, our analysis of temporal trends shows that from 2018 to 2020 Scopus did outperform WoS, which agrees with prior works. The performance of Crossref generally agrees with the previous studies, as Crossref underperforms GS, WoS, and Scopus.

Among the five bibliometric sources considered, GS provides the largest fraction of total citations (70%) and the most unique citations (19%) compared to the other sources. Combining GS and WoS provides access to 88% of all citations, while the combination of GS, WoS, and either Scopus or Crossref provides access to 94% of all citations. Although the number of citations obtained from each source increases from 2018 to 2022, their contribution to the total citation count varies. WoS shows a high growth rate, while the rates are lower for GS, Crossref, and DataCite. Interestingly, the contribution of citations from Scopus decreases over time.
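Coverage and unique-citation fractions of this kind follow from set operations on the document DOIs returned by each source. The sketch below is illustrative only: the source sets contain hypothetical placeholder identifiers, not data from this study, and the function names are assumptions.

```python
from itertools import combinations


def coverage(sources, names):
    """Fraction of all citations found by the union of the named sources."""
    total = set().union(*sources.values())
    combined = set().union(*(sources[n] for n in names))
    return len(combined) / len(total)


def unique_share(sources, name):
    """Fraction of all citations found by one source and by no other."""
    total = set().union(*sources.values())
    others = set().union(*(v for k, v in sources.items() if k != name))
    return len(sources[name] - others) / len(total)


def best_pair(sources):
    """Pair of sources whose union covers the most citations."""
    return max(combinations(sorted(sources), 2),
               key=lambda pair: coverage(sources, pair))


# Hypothetical toy data: each set holds document identifiers from one source.
sources = {
    "GS": {"d1", "d2", "d3", "d4", "d5", "d6"},
    "WoS": {"d4", "d5", "d6", "d7", "d8"},
    "Scopus": {"d5", "d6", "d9"},
    "Crossref": {"d1", "d5", "d10"},
}
print(coverage(sources, ["GS"]))         # 0.6
print(coverage(sources, ["GS", "WoS"]))  # 0.8
print(unique_share(sources, "GS"))       # 0.2
print(best_pair(sources))                # ('GS', 'WoS')
```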

Crossref was examined as a potential source for obtaining linked content and was found to contain records for 99% of all citations and 95% of all references. However, the low fraction of document citations returned by Crossref (33%) can be attributed to the underperforming Crossref Event Data API and the lack of dataset DOIs in references deposited by publishers, with Elsevier being responsible for the majority of references without dataset DOIs.

Based on the findings of this study, researchers seeking a comprehensive list of research publications related to a particular dataset should follow the recommendations listed below.

Recommendation 1. Although GS yields the most document citations, it is not recommended as the sole source, as WoS, Scopus, or Crossref each individually account for at least 14% more document citations linked to the datasets. The optimal combination of two bibliographic sources for the most extensive coverage is suggested to be GS and WoS. If a third source is deemed necessary, the combination of GS, WoS, and Scopus or GS, WoS, and Crossref provides approximately equal coverage of document citations.

Recommendation 2. To collect citations using GS, additional processing is required to identify research documents and remove duplicates. This process involves using Zotero translator software to retrieve the document DOI from the GS result URL and then using Crossref to determine the document type.
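The deduplication and document-type step might look like the following minimal sketch. It assumes the document DOIs have already been retrieved from the GS results, and it takes the type lookup as a caller-supplied function (here a hypothetical dictionary; in practice it could wrap a query to Crossref and read the record's `type` field). The set of types treated as research documents is an assumption, not the selection used in this study.

```python
# Assumed selection of Crossref type values treated as research documents.
RESEARCH_TYPES = {"journal-article", "proceedings-article", "book-chapter"}

URL_PREFIXES = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(doi):
    """Lowercase a DOI and strip a resolver or 'doi:' prefix if present."""
    doi = doi.strip().lower()
    for prefix in URL_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def filter_research_documents(dois, lookup_type, allowed=RESEARCH_TYPES):
    """Drop duplicate DOIs and keep those whose document type is allowed."""
    seen = set()
    kept = []
    for raw in dois:
        doi = normalize_doi(raw)
        if not doi or doi in seen:
            continue
        seen.add(doi)
        if lookup_type(doi) in allowed:
            kept.append(doi)
    return kept


# Hypothetical lookup table standing in for a Crossref type query.
types = {
    "10.1234/paper.a": "journal-article",
    "10.1234/thesis.b": "dissertation",
    "10.1234/conf.c": "proceedings-article",
}
gs_dois = ["https://doi.org/10.1234/Paper.A", "10.1234/paper.a",
           "doi:10.1234/thesis.b", "10.1234/conf.c", ""]
print(filter_research_documents(gs_dois, types.get))
# ['10.1234/paper.a', '10.1234/conf.c']
```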

Recommendation 3. Collecting citations with WoS is a semi-manual process unless one has access to the WoS Extended API. In this study, the process was simplified by using only three dataset DOI prefixes to search the WoS web interface. It should be noted, however, that the time required for web interface scraping is directly proportional to the number of common DOI prefixes used.

Recommendation 4. Accessing WoS, and accessing GS through SerpAPI, incurs subscription costs. Combined with the manual effort and additional result processing they require, these interfaces are the most time-consuming and expensive to use. However, these costs must be weighed against the benefits of their extensive citation coverage.

Recommendation 5. Using Scopus, Crossref, and DataCite is a straightforward and cost-free process through their APIs. However, it should be noted that the combination of these three sources only provides 62% coverage of all citations.

Recommendation 6. Crossref has the potential to be a cost-free bibliographic source, offering a straightforward API that can provide access to a significant number of document citations, provided that its Event Data API is improved and citations are completed with dataset DOIs. Thus, the state of Crossref should be re-evaluated in the future to estimate its capability to meet these requirements.

Recommendation 7. Similar to Crossref, DataCite is a potential cost-free bibliographic source that offers a straightforward API. Furthermore, DataCite has the added advantage of allowing for the addition of linked citations that do not contain DOIs in the references, potentially enabling it to provide even larger quantities of dataset-to-document linkage than Crossref. Therefore, DataCite should also be evaluated for the improvement of citation linkages.

In summary, this study underscores the significance of ongoing endeavors to enhance the accuracy and comprehensiveness of dataset citations. It offers valuable insights into the performance of different bibliographic sources concerning the dataset citation indices.