This section describes the technical implementation of the DDRS (Footnote 22) within the Humanities at Scale project. It is important to distinguish between an ideal concept of the service and its actual implementation during the project. The latter has to take into account the available resources and time as well as the institutional context.
As a reminder: the DDRS assists the user in identifying suitable research data repositories for the individual case on the basis of only a few criteria, such as the formats of the research dataset, language, affiliation or certain indispensable functions (Footnote 23). The result of this step is a ranked list of repositories which the user can use as it is. The questions leading to the result list are not mandatory, but the quality of the result improves as more questions are answered. After the result list is displayed, the user can decide to proceed to the second functionality layer of the DDRS, the structured description of the specific research dataset. The aim of this step is to obtain, as easily and conveniently as possible, a structured and coherent data description which serves as the basis for initiating the ingest process with the repository. At this stage, the DDRS serves only as a communication handler on behalf of the user, forwarding his or her ingest request to the appropriate contact person.
Figure 8 provides an overview of the simple infrastructure which has been set up within the project. The result is a functional demonstrator, flexible enough to be developed further or enhanced with additional functionalities. This result serves as a proof of concept for the idea and is meant to highlight the community’s demand for such a service.
As basic infrastructure for this stage of the DDRS, a virtual machine (VM) accessible via the internet is sufficient. The VM contains all necessary applications and will initially be accessible via an IP address (Footnote 24).
It was decided that the branding of the service would stay close to DARIAH’s, including the logo of the project in which the DDRS was created (Humanities at Scale) and the logo of the underlying service which provides the data (re3data). The URL was also branded as DARIAH (https://ddrs-dev.dariah.eu/), keeping in mind that the service is in a demonstration phase.
The DDRS infrastructure model (see Fig. 8 above) illustrates the basic infrastructure layer and several components facilitating the use of the DDRS functionalities for the user. The following components are part of this infrastructure:
A web server hosts the components described below.
A simple website provides the user with explanatory information on the service, practices for research data in the humanities and further information sources, and displays the results of the user requests for layer 1 (repository identification via a search) and layer 2 (data description).
A simple questionnaire suggests to the user a ranked list of suitable research data repositories for the specific use case. The questionnaire is designed so that the questions can easily be adjusted via the administration section. This is necessary as the database used for the requests, initially re3data, will likely change over time. For example, new research funder mandates could be reflected in the metadata, and the DDRS would have to take this into account.
A web form describes the specific research dataset in a structured way (this can be implemented in a similar way to the questionnaire). Like the questionnaire, the form is designed in a flexible way to allow further adjustments to the research data criteria that are to be described by the user. This will likely be the case as the research data practices in the humanities develop and new standards emerge. The current implementation is GDPR compliant as the user data is submitted only to the selected repository contacts. Once sent, the user-specific data is no longer available to the DDRS.
Currently (Footnote 25) the DDRS sends queries directly from the server to the Elasticsearch server of re3data. A request API conducts the requests to identify the repositories. The API sends one or more requests to the re3data database, either filter by filter or all in one, and finally displays a list of repositories fulfilling the respective criteria. On the basis of early tests of the re3data API, the data quality and performance seem to be sufficient for our purpose, and the requests do not seem to affect the re3data API’s general performance.
A database is used to enrich the request results from re3data with contact details. This enrichment is necessary as the DDRS not only suggests suitable repositories but also points the user to a specific point of contact to facilitate the ingest of the individual research data. This requires a contact person with expertise in humanities research data, but this information is not available through the re3data database as re3data is a non-disciplinary service.
A forwarding component, basically a mail server. This component mails the completed data description form to the relevant repositories.
A usage statistics component, currently Matomo. At this point it is not clear what kind of data could be collected by this service in the future. If the DDRS has a considerable user uptake in the future the usage statistics could become a valuable asset to be used for further added value services and to demonstrate the value of the service.
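The forwarding component listed above can be pictured as a small function that turns a completed data description form into an e-mail. The following is a minimal sketch using only the Python standard library, not the project's actual code: the function name `build_ingest_request`, the form fields and all addresses are hypothetical. The message is built in memory and nothing is written to a database, in line with the statement that submitted data is not kept by the DDRS.

```python
from email.message import EmailMessage


def build_ingest_request(description: dict, contact_email: str, sender: str) -> EmailMessage:
    """Turn a completed data description form into a plain-text e-mail.

    All field names are hypothetical; the real form may differ.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = contact_email
    msg["Subject"] = "Ingest request: " + description.get("title", "untitled dataset")
    # One "field: value" line per form entry, in the order the form supplied them.
    body = "\n".join(f"{field}: {value}" for field, value in description.items())
    msg.set_content(body)
    return msg


# Hypothetical form content and addresses, for illustration only.
form = {"title": "Letters corpus", "formats": "TEI-XML", "language": "German"}
message = build_ingest_request(form, "contact@repository.example", "ddrs@service.example")
print(message["Subject"])
```

Handing the message to a mail server (for example with `smtplib.SMTP.send_message`) is left out here, since the mail setup of the DDRS is not described in this section.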
Regarding the quality of the search results one has to consider, first of all, the limitations of the current approach which relies heavily on re3data’s database.
Initially the design of the DDRS relied on an include-exclude table, which meant that the DDRS could select the search results only by applying the filters given by the re3data metadata schema v2.2 with its 39 main properties and related sub-properties (Footnote 26). The DDRS now includes an additional database containing information on the points of contact for forwarding the ingest request. The re3data schema contains only information on technical points of contact for the repositories, not on research data managers or information specialists. This additional database relies on re3data’s external persistent identifiers in order to keep the information always bound to the same repository; contact information can only be connected to a single repository within re3data.
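The enrichment step described above amounts to a lookup keyed by the re3data persistent identifier. The sketch below illustrates the idea with in-memory dictionaries. The record layout, the field names (`id`, `contact_name`, `contact_email`) and the contact entry itself are assumptions made for illustration; only the identifier format follows re3data.

```python
# Hypothetical excerpt of the supplementary contact database,
# keyed by re3data persistent identifier.
CONTACTS = {
    "r3d100010677": {
        "contact_name": "Research data team",
        "contact_email": "rdm@repository.example",
    },
}


def enrich(results: list[dict], contacts: dict[str, dict]) -> list[dict]:
    """Attach contact details to each search result that has an entry.

    Results without a contact entry are passed through unchanged.
    """
    enriched = []
    for repo in results:
        record = dict(repo)  # copy, so the original search result stays untouched
        record.update(contacts.get(repo["id"], {}))
        enriched.append(record)
    return enriched


# Hypothetical search results; the second identifier is a made-up placeholder.
search_results = [
    {"id": "r3d100010677", "name": "Repository A"},
    {"id": "r3d-placeholder", "name": "Repository B"},
]
enriched = enrich(search_results, CONTACTS)
print(enriched[0].get("contact_email"))
```

Keying on the persistent identifier rather than on the repository name means a renamed repository keeps its contact entry.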
The DDRS supplementary database also includes a selected set of research data repositories of generic, national or European provenance. This ensures that a user will always receive a result list, even if the filtering of re3data returns zero results. Although an empty result makes sense from re3data’s perspective, it is not helpful with regard to the DDRS use case: our aim is to equip each user with a selection of suitable research data repositories. To avoid a zero result upon filtering, the DDRS database has therefore been supplemented with a set of generic research data repositories suitable for humanities data at the national or European level.
Despite these limitations, the decision was taken to use the re3data database. To our understanding, re3data has the potential to grow in data quantity and usage and is, to this end, a better choice than setting up a separate database exclusively for the DDRS. Our assessment of the future development of re3data also assumes a further enhancement of its schema. With more and more established practices and growing use of research data management infrastructures in the humanities, additional properties reflecting this growth will enrich the schema and database. The current concept of the DDRS permits the integration of other databases, but not easily, as it would require access to their Elasticsearch servers or to the APIs they provide.
The following remarks describe in a more technical way how the DDRS retrieves information from re3data, starting with a result list after filtering for two country affiliations (Germany, France).
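The zero-result safeguard can be sketched in a few lines. This sketch assumes that the generic repositories are shown only when the filtered result is empty; the text does not say whether they are also appended to non-empty results. The function name and the repository labels are hypothetical.

```python
def with_fallback(filtered: list[str], generic: list[str]) -> list[str]:
    """Return the filtered results, or the curated generic set if nothing matched."""
    return list(filtered) if filtered else list(generic)


# Hypothetical labels standing in for the curated generic repositories.
GENERIC = ["Generic national repository", "Generic European repository"]

print(with_fallback([], GENERIC))
print(with_fallback(["Specialised repository"], GENERIC))
```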
Figure 9 shows a snippet of the search result of re3data’s Elasticsearch server for the following query (it is not possible to provide the full URL as this is not a public API):
http://….../_search?q=institutions.country.raw:DEU AND subjects.text:11 Humanities
The search requests re3data to deliver all repositories with German affiliation that are included in the DFG subject “11 Humanities”. The aforementioned integration of additional sources like the DDRS supplementary database (or even completely different sources) is a challenge of information science rather than of technology. Merging different data sources into one result for the user requires a mapping on the side of the DDRS to ensure that additional properties are associated with the correct repository. The merge relies on re3data’s external persistent identifiers, the ones used in their public API, such as “r3d100010677”.
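A query of the kind shown above can be assembled from the selected filters. The following sketch assumes a placeholder host, because the real endpoint is not public, and passes the filter values through verbatim as in the query above. Whether multi-word values such as “11 Humanities” need quoting depends on the server's query parser, which is not documented here. The function name `build_search_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder host: the real re3data Elasticsearch endpoint is not public.
BASE_URL = "http://search.example.invalid/_search"


def build_search_url(filters: dict[str, str], operator: str = "AND") -> str:
    """Join 'field:value' clauses with a boolean operator and URL-encode them."""
    clauses = [f"{field}:{value}" for field, value in filters.items()]
    query = f" {operator} ".join(clauses)
    return BASE_URL + "?" + urlencode({"q": query})


url = build_search_url(
    {"institutions.country.raw": "DEU", "subjects.text": "11 Humanities"}
)
# Decode the query again to check that nothing was lost in the encoding.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded)
```

Encoding the query with `urlencode` avoids sending raw spaces and colons in the URL; the server receives the same query string after decoding.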
Presentation of search results to the user
Technically there are three concepts available for the information retrieval:
Simultaneous retrievals: for each filter, two requests are sent to the re3data Elasticsearch server (one request to get the query’s result and one request to retrieve the information of the saved generic repositories) and the result is displayed immediately to the user. The questionnaire used for the repository identification is in this case used as a kind of live search. With each filter applied, the number of repositories returned is reduced and the user can decide after each filter to browse the results or apply another filter.
Consolidated retrieval: the user answers all questions necessary for the repository identification in a row and, after this, a request to re3data is sent and the result is displayed. The main difference between the consolidated and the simultaneous approach is that the user doesn’t see a “filter history”. The user only receives the results and, in some cases, this may be only one repository or none. In terms of usability the simultaneous approach is therefore the better choice.
DDRS-ranked results: multiple API retrievals of re3data are stored in the session and ranked for the user as a list. This concept is able to combine aspects of the two other concepts, but it is technically more elaborate and possibly not useful in all cases.
In practice, a hybrid solution has been implemented: a combination of simultaneous retrieval and enrichment by the DDRS database. As the number of questions has been condensed, a consolidated retrieval is currently not necessary. This could change if the initial questionnaire were extended with more questions via the administration section.
A simple example illustrating the search principles using the public API: the user searches for repositories using ARK (Footnote 27) as PIDs:
and ends up with 24 results (Footnote 28). But the user also wants to include the repositories using DOI as PIDs in the search, as the research data only needs a PID, not necessarily one or the other:
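The difference between the simultaneous and the consolidated concept can be illustrated with a small in-memory simulation. This is only a sketch: the DDRS sends each filter to the re3data server rather than filtering a local list, and the repository records and filter labels below are invented for the example.

```python
from typing import Callable

Repo = dict
Predicate = Callable[[Repo], bool]


def live_search(
    repos: list[Repo], filters: list[tuple[str, Predicate]]
) -> tuple[list[Repo], list[tuple[str, int]]]:
    """Apply filters one at a time and record the result count after each.

    The recorded counts are the "filter history" that a consolidated
    retrieval would not show to the user.
    """
    history = []
    remaining = list(repos)
    for label, keep in filters:
        remaining = [repo for repo in remaining if keep(repo)]
        history.append((label, len(remaining)))
    return remaining, history


# Invented sample records, for illustration only.
REPOS = [
    {"name": "A", "country": "DEU", "pids": {"DOI"}},
    {"name": "B", "country": "DEU", "pids": {"ARK"}},
    {"name": "C", "country": "FRA", "pids": {"DOI"}},
]
FILTERS = [
    ("country is DEU", lambda repo: repo["country"] == "DEU"),
    ("PID system is DOI", lambda repo: "DOI" in repo["pids"]),
]

final, history = live_search(REPOS, FILTERS)
for label, count in history:
    print(f"{label}: {count} repositories left")
```

A consolidated retrieval corresponds to showing only `final`; the simultaneous concept additionally exposes every entry of `history` as the user proceeds.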
and ends up with 761 results. After applying the filter for both PID systems at once:
only 13 results remain. However, this last result is confusing, as one would like to have all the results using ARK and all results using DOI, not only the repositories using both ARK and DOI. Therefore, using the public API, the DDRS would be forced to launch multiple simple queries in order to retrieve meaningful repositories for users. This is a technical reason for liaising with the re3data team in order to find a solution for this issue. Re3data kindly provided the team with a full Elasticsearch server on their private network, which allows the DDRS to make complex queries more easily, as seen below.
http://….../_search?q=pidSystems.text.raw:ARK OR pidSystems.text.raw:DOI
This provides 772 repositories (761 using DOI and 24 using ARK, of which 13 use both), a result that is more useful to someone looking for a repository using PIDs in general.
This issue may also be more complex when other filters are applied, for instance specific technical functionalities or metadata requirements of the repositories. The third concept would add a ranking mechanism to the results. In other words, the user checks five filters and the results compliant with all five filters would appear at the top, the results compliant with only one filter at the bottom of the list. Additionally, the ranking concept could be enhanced by weighting the criteria: for example, the availability of a specific author identification system, such as ORCID (Footnote 29), could count as more important than the national affiliation of the repository. This weighted ranking is more sophisticated than the simple ranking and requires a more complex questionnaire approach than the concept currently allows. The current design of the DDRS does not include this option due to the limited number of humanities-specific research data repositories. This may change in the future.
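The counts quoted above follow from the difference between an AND query (intersection) and an OR query (union). The sketch below reproduces the arithmetic with synthetic integer identifiers chosen only so that the set sizes match the reported numbers; they do not correspond to real repositories.

```python
# Synthetic identifiers sized to match the reported counts.
doi = set(range(0, 761))    # 761 repositories using DOI
ark = set(range(748, 772))  # 24 repositories using ARK

both = doi & ark    # what the combined public-API filter returns (AND)
either = doi | ark  # what the user actually wants (OR)


def union_count(a: int, b: int, overlap: int) -> int:
    """Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|."""
    return a + b - overlap


print(len(both), len(either), union_count(len(doi), len(ark), len(both)))
```

Without an OR query, the DDRS would have to send the two simple queries separately and merge the results itself, removing the 13 duplicates by identifier.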
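The simple and the weighted ranking described above can be sketched with one scoring function. This is an illustration of the idea only, since the current DDRS does not implement weighted ranking: the criterion names, the weights and the repository records are all hypothetical.

```python
from typing import Optional


def rank(
    repos: list[dict], selected: list[str], weights: Optional[dict[str, float]] = None
) -> list[dict]:
    """Order repositories by how many selected criteria they satisfy.

    Without weights every criterion counts 1.0 (simple ranking); with
    weights, important criteria contribute more (weighted ranking).
    """
    weights = weights or {}

    def score(repo: dict) -> float:
        return sum(
            weights.get(criterion, 1.0)
            for criterion in selected
            if criterion in repo["features"]
        )

    # sorted() is stable, so repositories with equal scores keep their input order.
    return sorted(repos, key=score, reverse=True)


# Hypothetical records: Y matches only the country, X only ORCID, Z both.
REPOS = [
    {"name": "Y", "features": {"country:DEU"}},
    {"name": "X", "features": {"ORCID"}},
    {"name": "Z", "features": {"ORCID", "country:DEU"}},
]
CRITERIA = ["ORCID", "country:DEU"]

simple = [repo["name"] for repo in rank(REPOS, CRITERIA)]
weighted = [repo["name"] for repo in rank(REPOS, CRITERIA, {"ORCID": 3.0})]
print(simple, weighted)
```

In the simple ranking X and Y tie and keep their input order; once ORCID is weighted more heavily than national affiliation, X moves ahead of Y.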