1 Introduction

The National Research Data Infrastructure (NFDI) is an association of 26 consortia from all scientific disciplines with a common commitment to the FAIR principles [1] and to other central goals of the NFDI, such as making the data holdings of publicly funded research findable and accessible. However, the FAIR principles address only machine actionability; they do not take user needs into account, nor do they deal with the differences between subject-specific data philosophies, which give quite different answers to seemingly simple questions such as: What even is research data, and what is not? Who has it? And what does ‘rich’ metadata description mean?

In the evolving landscape of National Research Data Infrastructures (NFDI), the integration of diverse research data and their metadata is still challenging. While some have suggested the need for a singular, unified search system to navigate this heterogeneous landscape, there are still open questions regarding the usefulness and feasibility of this idea. Scientists are interested in finding proper research data irrespective of whether the data source is associated with NFDI or not. Thus, we need comprehensive search and harvesting approaches incorporating high-quality data sources across NFDI consortia that fit to the respective disciplinary community approaches.

While some consortia, such as NFDI4DS and FAIRMat, strive to offer a unified access point to what they consider the relevant research data, others, such as KonsortSWD, pursue a strategy of opening their metadata to harvesting services [2] in a decentralized way. Many harvesting services for research data already exist, such as OpenAIRE, BASE and the critically discussed Google Dataset Search. However, it remains a challenge to address the very different user needs regarding search and harvesting across the different services.

The working group on Search and Harvesting, a collaborative effort across NFDI consortia, has dedicated substantial effort to understanding these challenges. By focusing on user and service requirements, analyzing the data sources landscape, and developing tailored recommendations, the group aims to address the specific needs of different data types, including spatial and sensitive data, regarding search and harvesting.

This paper presents the gaps and challenges in search and harvesting identified by the NFDI consortia and structured by the Search and Harvesting Working Group. The paper advocates for a strategy centered on international collaboration, a deep understanding of user needs, and an acknowledgement of the heterogeneous nature of research data infrastructures, as found in the NFDI and beyond. We close the paper with recommendations for next steps to improve search and harvesting practices in the NFDI, so that researchers in Germany and beyond will be supported to find and ultimately re-use research data relevant to them.

2 Related Work

2.1 Search Services for Research Data

Zuiderwijk et al. [3] define the term “search” as an interaction between tools and users: “A non-linear process containing many feedback loops.” Platforms and services should be seen as linked, as they mirror each other in terms of users, metadata and technology. Users jump from one system to the next with apparent ease, but also out of necessity, as none of the platforms or services fully fulfils their requirements, neither in the open infrastructure (like OpenAIRE and BASE) nor in the closed one (despite several high-profile efforts, including Google’s [4, 5]). Instead, small, often community-driven data services thrive as increasing numbers of researchers are willing to share and re-use data [6].

2.2 Data Search

Research looking at the “socio-technical practice” of data search [7] and observations of data seekers [8] reveal the strong interconnectivity of activities involved in literature search, web search, and personal networks. Additionally, these practices depend on the features of the different data services and data sources. This is also the first aspect addressed in “The Principles of Open Scholarly Infrastructure” [9]: “… research transcends disciplines, geography, institutions, and stakeholders. The infrastructure that supports it needs to do the same.” A knowledge infrastructure plays an important role in this process because it “mediates exchanges between creators and consumers by both enabling and restricting the use of that data” [10].

While there are no comprehensive studies on the data discovery process, interviews [7, 11, 12, 13], observations [8] and surveys [14, 15] indicate that the most common strategies for launching data discovery are the following: people (contacting colleagues, visiting conferences), the use of generic and domain-agnostic search engines, querying domain data repositories, and literature review (as illustrated in Fig. 1). These strategies are not mutually exclusive, and many users seem to follow several of them, depending on the context.

Fig. 1

In a globally distributed, multidisciplinary survey with nearly 1,700 responding researchers, conducted by Gregory et al. in 2020 [11], most researchers reported searching for data via the literature (75%), a common web search engine like Google (59%), or directly in a repository or discovery service they already know (41%), rather than first identifying the discovery service that best matches their use case. Searches via the literature and personal networks also implicitly rely on web search, as direct links are usually not available, despite current efforts to increase the use of PIDs in data citation.

3 Landscape of NFDI Data Sources

Varied conditions prevail within the NFDI consortia, encompassing user needs for data searches as well as the harvesting processes of data sources and services. Efficient harvesting of metadata necessitates a thorough understanding of the resource types, standards, and protocols used by the diverse data sources to achieve cross-domain discoverability and accessibility of data. The highly diverse data sources distributed across the NFDI consortia exhibit varying levels of maturity concerning metadata disclosure. For instance, federal authorities or smaller project repositories often hold data crucial for research, yet external access is hampered because they are hard to discover and lack open interfaces for harvesting metadata. To gain a deeper understanding of the current landscape and to propose concrete improvements, it is imperative to compile a comprehensive overview of the maturity levels, standards, and specific needs across the data sources of the NFDI consortia. Moreover, the maturity of the technologies employed for operational harvesting and search services largely depends on the resources and means available to the data sources. To illustrate the heterogeneity of data search needs and of metadata provision and management approaches, we showcase selected approaches from various NFDI consortia.

The NFDI4Earth [16] consortium supports research data management for the Earth System Sciences (ESS) community and thus implements solutions for the search and harvesting of distributed and linked spatio-temporal data. The ESS community is broad and includes, for instance, climate researchers, geologists, and hydrography scientists. Even within the different ESS sub-disciplines, researchers have different needs depending on their use cases, e.g., for map-based filter options or filters for specific variables. Moreover, they publish and manage their data in different repositories or data journals. NFDI4Earth has therefore collected information on more than 100 potential data sources for the Earth System Sciences and has developed a knowledge hub that manages the related harmonised metadata. Researchers in other disciplines, e.g., humanities scholars and social scientists, are more interested in search filters for texts, images, or similar data types. One approach, suited for data with a textual representation and already pursued in the NFDI consortium Text+ [17], is to allow unified access to research data in different data repositories through a common search-and-retrieve protocol based on the “Search/Retrieve via URL” (SRU) and “Contextual Query Language” (CQL) standards (see Sect. 4.3 for more details).

The NFDI consortium KonsortSWD (social sciences) follows a decentralized approach. All research data centres are registered through the registration agency da|ra and have DOIs assigned to their data. However, only core metadata is transmitted that way; more detailed information is shown on the landing page of each individual data centre, and measures are taken to ensure that these pages are easily findable, e.g. by providing metadata in web standards and making sure they are indexed by web search engines [2].
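As a minimal sketch of this decentralized approach, the following Python snippet illustrates how a landing page could embed schema.org/Dataset markup as JSON-LD so that generic web search engines can index the dataset description; all field values (DOI, names, license) are placeholders and not actual da|ra or KonsortSWD metadata.

```python
# Minimal sketch: schema.org/Dataset markup serialized as JSON-LD for a landing page.
# All values below are placeholders, not real records.
import json

dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example survey data (hypothetical)",
    "description": "Illustrative dataset description for a landing page.",
    "identifier": "https://doi.org/10.0000/example-doi",  # placeholder DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Person", "name": "Jane Doe"},
}

# The serialized JSON-LD would be placed inside a
# <script type="application/ld+json"> tag of the dataset's landing page.
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(dataset_jsonld, indent=2)
    + "</script>"
)
print(script_tag)
```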

While our initial efforts involved compiling lists of available resources, additional resources and efforts are required to initiate a meaningful dialogue with harvesters and data providers across the NFDI to overcome identified gaps and challenges.

On the search side, the process depends on the demands, expertise and context of the data seeker. There are common, but also quite specific needs, as discussed, e.g., in the 2022 review article by Bardi et al. [18], which introduces the term “open ecosystem for data discovery”, defined as the entirety of e‑infrastructures and services supporting researchers in discovering data using open standards. This ecosystem consists of interdependent actors and loosely coupled resources and services. Fig. 2 shows schematically which obstacles and decisions occur throughout the discovery process.

Fig. 2: Phases and stages during the data discovery journey

Depending on the context of the user, there are various needs, and depending on social and technical aspects, data seekers use different strategies to navigate the ecosystem and eventually access and reuse the requested data. Depending on the expectations regarding the quality and relevance of the results, a search via well-known commercial search engines may be sufficient in some cases. However, scientific research that demands high quality and relevance of the search results calls for catalogues and search services that follow the FAIR principles and agreed standards.

Such services are essential for exploiting the full potential of very large digital data resources across different domains of research. In addition, a distributed network of data repositories requires not only a suitable metadata infrastructure but also advanced federated search protocols for accessing the research data themselves.

4 Gaps and Challenges

For the landscaping effort, we followed a three-step method. First, we started with a literature review of existing discovery and harvesting approaches. We then performed a landscape analysis of existing services with particular respect to NFDI aspects and requirements, e.g., metadata harmonization. Additionally, we conducted expert interviews and then worked towards a synthesis in the form of a collection of gaps and challenges. By summarizing and structuring these, we derived three main categories: users, metadata, and technology, which we describe in more detail in the subsections below.

Fig. 3 illustrates the identified challenges in a layered structure. The lowest layer represents the task of Harvesting from different sources with the same or similar metadata records, which, by and large, can be regarded as solved. At the next layer, a significant challenge arises in Harmonizing metadata across diverse data sources and various metadata formats. This is followed by the task of De-duplication and relation enrichment. In addition to these challenges, factors such as the user interface (UI) (e.g. missing filter options) and Metrics (e.g. views, downloads and citations) can also cause difficulties. Finally, the most significant challenge is solving problems arising from the Heterogeneity of user needs.

Fig. 3: Concept layer with challenges of Search & Harvesting

4.1 Users

Inexperienced users: For literature search, researchers usually receive a foundational introduction during their education. This basic knowledge is then deepened over the research career through frequent use. For data search, however, there is still a lack of proper education and training materials [8]. Data literacy has been identified as a core skill for researchers and data management professionals [19], yet data search is often not, or only marginally, part of the data literacy curriculum. This situation highlights the need for more comprehensive training in data search techniques as part of broader data literacy education [20].

Inappropriate and invisible entry points: Again, in comparison to literature search, users are often not aware of relevant entry points for their data search [8]. Instead, they often rely on 1) general web search or 2) literature review to find the names of datasets and then try to find these on the Web [7]. As a result, datasets are not found, especially when they are published in disciplinary data repositories that provide only a small number of datasets. Moreover, common web-based search engines lack specific search and filter options. To increase visibility and discoverability, entry points should offer faceted search filters as well as search user interfaces adapted to the research domain and its needs, for example filtering by spatial coverage via a map-based UI for georeferenced data.
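To illustrate the kind of domain-adapted filter mentioned above, the following sketch reduces a map-based UI filter to a bounding-box intersection test. The record fields, titles and extents are hypothetical and assume that spatial coverage is given as WGS84 coordinates.

```python
# Minimal sketch of a spatial-coverage filter behind a map-based UI.
# Records and extents are hypothetical; coordinates are [west, south, east, north].
from dataclasses import dataclass


@dataclass
class BBox:
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap unless they are completely separated on one axis.
        return not (self.east < other.west or other.east < self.west
                    or self.north < other.south or other.north < self.south)


def filter_by_extent(records: list[dict], view: BBox) -> list[dict]:
    """Keep only records whose spatial extent intersects the current map view."""
    return [r for r in records if BBox(*r["bbox"]).intersects(view)]


records = [
    {"title": "River gauge dataset (hypothetical)", "bbox": [8.0, 50.0, 15.0, 54.0]},
    {"title": "Alpine glacier survey (hypothetical)", "bbox": [6.0, 45.5, 13.0, 47.5]},
]
print(filter_by_extent(records, BBox(9.0, 47.0, 12.0, 53.0)))
```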

However, the mere existence of these data services does not mean that users will find them by themselves. Web traffic shows stark differences in user numbers between different data services. Harvesting services and meta-search engines often have trouble attracting users unless they also provide original data or use other means of marketing, such as sponsored links, aggressive search engine optimization (SEO), or traditional marketing within the community. Services that offer original data and download functionality can achieve web traffic a hundred or a thousand times higher, despite often having a much smaller potential user base than the large multi-disciplinary harvesters.

Missing feedback mechanisms and detailed user statistics: Research data search providers often lack knowledge about their users’ needs and about problems with their services, due to inadequate feedback mechanisms for evaluating user data and reporting missing features. This hampers the systematic improvement of their search functions. Gregory et al. [21] provide an overview of user needs and practices by discipline; however, only a fraction of systems use systematic user research or even tracking. Data providers may use surveys [7, 22], interviews [23], interaction log analysis [24, 25], analytics [26] or observation [8] to gain more insight, usually discovering areas for improvement in their systems.

4.2 Metadata

In general, the metadata should be as FAIR as possible. At a minimum, data identifiers (PID/DOI) must be provided, and, depending on the use case, the metadata should have the highest degree of correctness, completeness, and interoperability that can be reached within the interested communities. However, what “complete” or “correct” means depends on the specific research questions in the related scientific communities, which cannot be known a priori. Hence, we need to strive for an optimum of flexibility on the one hand, while ensuring efficient search and harvesting of metadata on the other. The subsequent sections discuss three aspects that help with the re-utilisation of data: references to related data, granularity of the data, and information about the provenance of the data.

4.2.1 Missing links

While approaches to linking different resource types at the metadata level, e.g. datasets and related web services, are well known (Crossref, Scholix, Linked Data, …), the search and harvesting services of the overall ecosystem are often not linked to each other, or the links are not visible or usable in the user interfaces. However, research data needs context information for re-use and machine-actionable processing, e.g. from documentation, relevant research articles or tools. Missing links thus pose both interoperability and UI design challenges. One approach to overcoming this issue would be a persistent identifier (PID) graph service that interlinks resources across disciplines and data sources by means of PIDs.
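As an illustration of the PID graph idea, the following sketch stores links between placeholder PIDs as triples with DataCite/Scholix-style relation types and resolves the direct neighbours of a resource. It is not an existing service, only a sketch of the underlying data structure.

```python
# Minimal sketch of a PID graph: (subject PID, relation, object PID) triples,
# e.g. as they could be harvested from DataCite or Scholix. PIDs are placeholders.
from collections import defaultdict

links = [
    ("doi:10.0000/dataset-a", "IsSupplementTo", "doi:10.0000/article-x"),
    ("doi:10.0000/dataset-a", "IsDerivedFrom", "doi:10.0000/dataset-raw"),
    ("doi:10.0000/article-x", "Cites", "doi:10.0000/dataset-b"),
]

graph = defaultdict(list)
for subject, relation, obj in links:
    graph[subject].append((relation, obj))


def related(pid: str) -> list[tuple[str, str]]:
    """Return the resources directly linked to a PID, with their relation types."""
    return graph.get(pid, [])


print(related("doi:10.0000/dataset-a"))
```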

4.2.2 Granularity issues

Granularity has two aspects: modularity with regard to the metadata fields and the structure of the data records themselves:

“Common” metadata properties such as title, description, author, etc. (see for instance the Dublin Core schema) are not sufficient for many disciplines, and the granularity level of metadata for a dataset is likewise often not sufficient for an efficient search. Several disciplines have therefore developed metadata schemata that allow referencing specific information and describing hierarchically structured data in correspondingly discipline-specific, hierarchically structured metadata. In the context of search and harvesting, however, it still needs to be investigated to what extent such highly structured metadata can be presented and/or linked with other metadata. Since it still takes major effort to provide harmonized metadata across the NFDI consortia that fits researchers’ needs, agreed definitions for resource types, authors, metadata licenses, and spatial and temporal information can be used as a starting point to develop strategies that foster search and harvesting for high-quality, highly structured and detailed metadata.
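The following sketch, with hypothetical field names, illustrates this granularity problem: a hierarchically structured, discipline-specific record loses its individually searchable measurement details when it is mapped to flat Dublin Core elements for cross-domain harvesting.

```python
# Minimal sketch (hypothetical field names): a hierarchically structured,
# discipline-specific record and its lossy mapping to flat Dublin Core elements.
discipline_record = {
    "title": "Soil moisture campaign 2021 (hypothetical)",
    "creators": [{"name": "Jane Doe", "orcid": "0000-0000-0000-0000"}],
    "measurements": [  # sub-dataset granularity with per-variable detail
        {"variable": "soil_moisture", "unit": "m3/m3", "sensor": "TDR probe"},
        {"variable": "soil_temperature", "unit": "degC", "sensor": "PT100"},
    ],
}


def to_dublin_core(record: dict) -> dict:
    """Map to flat Dublin Core terms; the structured measurement details collapse
    into a free-text description and are no longer individually searchable."""
    return {
        "dc:title": record["title"],
        "dc:creator": [c["name"] for c in record["creators"]],
        "dc:description": "; ".join(
            f"{m['variable']} [{m['unit']}]" for m in record["measurements"]
        ),
    }


print(to_dublin_core(discipline_record))
```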

Even if the search has been successful in identifying relevant data, there is often only a collection of data from which the desired part must first be extracted or, conversely, the accessible data records are too granular and do not cover the required range. Ideally, the repository offers the functionality to choose which granularity level to search on and to navigate between levels.

Metadata records may be exposed in more than one repository. Harvesting similar records from different data sources thus raises the question: How can one effectively handle duplicate metadata records when harvesting from different sources?

Real-world examples show that metadata for records with similar content are exposed via more than one endpoint. For example, the record of the OpenAIRE Research Graph Dump [27] is findable at the original deposit repository, Zenodo [28], as well as in an institutional repository at Bielefeld University Library [29]. Contrary to what one might expect, the metadata records in the same metadata format are not equal. Some of these differences are cosmetic, but tangible differences can be observed as well, such as different granularity of author identification.
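A minimal sketch of how such duplicates could be detected is shown below. The record fields are assumptions, and the grouping key is a normalized DOI; a real aggregator would apply far more elaborate matching and merge policies.

```python
# Minimal sketch: group harvested records by normalized PID so that near-identical
# copies from different endpoints can be detected and merged. Records are hypothetical.
from collections import defaultdict


def normalize_doi(doi: str) -> str:
    """Strip the resolver prefix and lowercase, so syntactic variants compare equal."""
    return doi.lower().removeprefix("https://doi.org/").strip()


def group_duplicates(records: list[dict]) -> dict[str, list[dict]]:
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[normalize_doi(record["doi"])].append(record)
    return groups


records = [
    {"doi": "10.1234/abcd", "source": "zenodo", "creators": ["Doe, J. (with ORCID)"]},
    {"doi": "https://doi.org/10.1234/ABCD", "source": "institutional", "creators": ["Doe, J."]},
]

for doi, copies in group_duplicates(records).items():
    if len(copies) > 1:
        # A merge policy is needed here, e.g. prefer the copy with richer author identifiers.
        print(doi, "->", [c["source"] for c in copies])
```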

4.2.3 Provenance

The provision of provenance information together with the metadata is essential for the reuse of data and the evaluation of data quality. However, it is still often lacking, completely or in part, as some metadata schemata only offer options to describe one part of the provenance chain, such as a single processing step. Moreover, when input data and final data products are published in different sources, we still lack services that can gather, summarise and visualise provenance information across those different services.
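As a sketch of how provenance information could be gathered across services, the following snippet records a two-step provenance chain using PROV-O relation names and placeholder PIDs, and follows the derivation links back to the original input.

```python
# Minimal sketch: provenance triples with PROV-O relation names and placeholder PIDs.
provenance = [
    ("doi:10.0000/final-product", "prov:wasDerivedFrom", "doi:10.0000/intermediate"),
    ("doi:10.0000/intermediate", "prov:wasDerivedFrom", "doi:10.0000/raw-input"),
    ("doi:10.0000/intermediate", "prov:wasGeneratedBy", "activity:quality-control"),
]


def provenance_chain(entity: str, triples: list[tuple[str, str, str]]) -> list[str]:
    """Follow prov:wasDerivedFrom links back to the original input data."""
    chain = [entity]
    current = entity
    while True:
        parents = [o for s, p, o in triples
                   if s == current and p == "prov:wasDerivedFrom"]
        if not parents:
            return chain
        current = parents[0]  # assumes a single derivation path for simplicity
        chain.append(current)


print(provenance_chain("doi:10.0000/final-product", provenance))
```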

4.3 Technology

This section outlines the technologies used and discusses existing issues and challenges in searching and harvesting research data. Fig. 2 shows the ecosystem of research data search. For over twenty years, bibliographic metadata has been exchanged between repositories on the internet using the “Open Archives Initiative Protocol for Metadata Harvesting” (OAI-PMH v2 [30]). This protocol is used to harvest metadata on research articles, data, and software, as well as metadata on projects, instruments, conferences, services, and other entities. For each entity, the persistent identifier is a key element of the metadata. A comprehensive overview of use cases for persistent identifiers is provided by the DFG-funded project PID Network Deutschland [31].
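A minimal harvesting sketch is shown below; the endpoint URL is a placeholder, and a production harvester would additionally handle resumption tokens, sets, incremental (from/until) harvesting, and error conditions.

```python
# Minimal sketch: a single OAI-PMH ListRecords request that collects dc:title values.
# The endpoint URL in the commented call is a placeholder.
import requests
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def harvest_titles(base_url: str, metadata_prefix: str = "oai_dc") -> list[str]:
    response = requests.get(
        base_url,
        params={"verb": "ListRecords", "metadataPrefix": metadata_prefix},
        timeout=30,
    )
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # A full harvester would also follow the resumptionToken for paging
    # and support selective harvesting via 'from', 'until' and 'set'.
    return [title.text for title in root.iter(f"{DC}title")]


# Hypothetical endpoint; any OAI-PMH v2 repository exposing oai_dc would work:
# print(harvest_titles("https://repository.example.org/oai"))
```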

The challenge is to make the repositories and their metadata available for search. This can be achieved in various ways. One approach is to use aggregators on a community basis, such as B2FIND [32] and BASE [33] or on an international level, such as OpenAIRE and OpenAlex [34]. These aggregators harvest the metadata and make it available for searching.

Research communities have developed research infrastructures with services that operate their own Application Programming Interface (API) endpoints and/or standardised SPARQL endpoints to search and (re-)use datasets. SPARQL, the SPARQL Protocol and RDF Query Language, combines a protocol and a query language for graphs in the Resource Description Framework (RDF). Alternatively, GraphQL, a standardised query language, can also be used.
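The following sketch shows a SPARQL request against a hypothetical endpoint, assuming that datasets are described with the DCAT vocabulary; the endpoint URL is a placeholder.

```python
# Minimal sketch: querying a (hypothetical) SPARQL endpoint for DCAT datasets
# via the standard SPARQL protocol.
import requests

DATASET_QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
} LIMIT 10
"""


def query_datasets(endpoint: str) -> list[tuple[str, str]]:
    """Send the query and return (dataset IRI, title) pairs from the JSON results."""
    response = requests.get(
        endpoint,
        params={"query": DATASET_QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [(b["dataset"]["value"], b["title"]["value"]) for b in bindings]


# Hypothetical endpoint:
# print(query_datasets("https://sparql.example.org/query"))
```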

Another technique, mentioned in Sect. 3, is SRU/CQL and derived protocols such as the Federated Content Search (FCS): each data repository that participates in this infrastructure provides an endpoint for accessing its research data, which is registered in a common service registry. The endpoint receives standardised queries from the client, translates them into its local query language, and returns the retrieved search results in standardised formats. Such federated search functionality enables users of the infrastructure to search research data not only in one individual data repository but across the whole network of repositories, thereby offering the full range of content available in the infrastructure and making it accessible to text and data mining (TDM) services and complex automatic processing workflows in the FCS resource inventory.
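A sketch of a single SRU searchRetrieve request with a CQL query is given below. The endpoint URL and the CQL index used in the example are placeholders; a federated client such as the FCS would fan the same query out to all registered endpoints and merge the standardised results.

```python
# Minimal sketch: one SRU 1.2 searchRetrieve request carrying a CQL query.
# Endpoint URL and CQL index in the commented call are placeholders.
import requests
import xml.etree.ElementTree as ET


def sru_search(endpoint: str, cql_query: str, maximum_records: int = 10) -> ET.Element:
    response = requests.get(
        endpoint,
        params={
            "operation": "searchRetrieve",
            "version": "1.2",
            "query": cql_query,
            "maximumRecords": maximum_records,
        },
        timeout=30,
    )
    response.raise_for_status()
    return ET.fromstring(response.content)


# A federated client would send the same CQL query to every registered endpoint
# and merge the returned records, e.g.:
# records = sru_search("https://repository.example.org/sru", 'text = "Goethe"')
```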

Harvesting via OAI-PMH is generally no longer challenging, as long as everyone uses the appropriate technology. However, handling the harvested metadata can still be problematic, as shown in Sect. 4.2.

An illustrative example is the deduplication of research articles, a process that has been technologically tested and is currently used in similar or extended forms. Atzori [35] introduces and explains the use of GDup for managing the deduplication of research articles. This technology has been implemented in initiatives such as the successive phases of the European Commission-funded OpenAIRE project and has been operational for an extended period. The current OpenAIRE graph contains more than 230 million records including duplicates vs. 176 million without.

Current research on information retrieval for the identification of research datasets applicable across diverse research disciplines remains constrained. This limitation arises from the diversity of domain-specific user requirements, in particular regarding which metadata fields are relevant. One suggested solution is setting up separate user interfaces for the different domains, but even that does not solve the issue of the underlying metadata traditions. Even the best technology cannot show information that was never fed into the system in the first place.

5 Conclusion

In the previous sections, we identified gaps and challenges on the technical levels of metadata and of the technologies used for search and harvesting. In addition, we identified the needs of users who wish to access and reuse data, distinguishing between requirements from the content side (comprehensive and rich data description; which scientific tasks need to be supported) and requirements from the IT side (re-use of metadata for interoperable search and harvesting in distributed repositories), in accordance with the charter of our working group [36]. Our goal is to initiate a meaningful dialogue with harvesters and data providers across the NFDI. To this end, we need comprehensive strategies for data providers to provide high-quality metadata, for data users to specify their needs and to provide feedback on search results and evaluated data, and for repository providers to provide suitable functionalities across repositories. For the NFDI, and as a starting point for the next activities of our working group, we conclude with the following recommendations for the above-mentioned roles:

Landscape synthesis. Although we have already collected information on several aspects of the search and harvesting landscape, an up-to-date landscape synthesis, in particular with respect to metadata, is still an ongoing task. To increase the visibility of NFDI data sources and to enable users to perform cross-disciplinary data searches, we recommend providing a cross-consortial synthesis of the lists of data sources collected by the consortia. The compilation should be automated as much as possible and use existing sources. It should also identify and collect best practices as well as search and harvesting failures that need to be resolved for better metadata quality. On this basis, specific measures to improve search and harvesting in the NFDI can be proposed, e.g. linking SPARQL and OAI-PMH environments, or working towards agreed definitions for resource types or for the description of spatial or temporal information. This work would also contribute to harmonising metadata standards and disambiguating metadata and terminologies.

Feedback for metadata practices. Only recently have FAIR and metadata quality checker tools become available that provide data providers and aggregators with a way to measure the quality of their metadata. However, they often fail to demonstrate the negative consequences of poor metadata practices, such as diminished findability in large systems, and the positive consequences of good ones. Data providers should be advised and supported to better handle the search and harvesting chain, from the initial implementation of harvesting endpoints and robust metadata ingestion procedures to the ongoing improvement of metadata quality and completeness.

Communication between data providers and aggregators. By providing a common perspective on search and harvesting issues and approaches, the needs and interests of the NFDI data sources can be highlighted more effectively within the international community. Conversely, NFDI’s coverage can be used to communicate the needs and expectations of aggregators back to the data providers. The exchange between data providers and aggregators should be fostered.

In addition, we urge all consortia to conduct user research, to educate their target groups in data search, and to make sure that all their services and data are easily discoverable. There are still many gaps in our understanding of what users need and of how to incorporate these needs into the harvesting landscape.