1 Motivation

In the past 15 years, the database research community has experienced a great shift in best practices for managing research data, as highlighted in a recent overview [1]: influential conferences like VLDB now require research data (or artifacts) to be available during the review process (and not only after publication). For double-anonymous reviews (e.g., SIGMOD’24), authors go to great lengths to anonymize their research data, either creating pseudonymous GitHub repositories or using services such as Anonymous GitHub (Footnote 1).

After paper acceptance, a growing number of authors make artifacts available or even subject their artifacts to independent reproducibility checks [1]. Even without independent certification, many research groups actively share their artifacts through institutional websites or third-party platforms like GitHub and Zenodo. Researchers thus invest considerable time and effort to disseminate their research data, and thereby their results and ideas.

However, this evident progress in best practices mainly concerns the creation and dissemination of research data. Yet over time, the question of how and where data is stored and who has access can become problematic. A rookie mistake is that of a PhD student who publishes a team’s research data under his or her private GitHub account and then graduates. Another scenario is that of a researcher switching institutions, yet still requiring access to data stored on premise with the former institution. A major problem can arise when distributed teams dissolve – not only as a consequence of disagreements within the team but also when projects are completed. In these cases, it often remains unclear who will retain access and use of the data.

This article contends that the current state of research data storage is lacking, particularly when viewed from a legal perspective. To approach the issue, we explore rights in the source information, the dynamic interplay between stakeholders, and potential “ownership” in the accumulated, structured, and analyzed data. We also discuss how these relationships evolve over time. In particular, we pinpoint specific challenges and show that existing research data storage systems are not equipped to meet the requirements of all parties involved.

In response, we make a case for the database research community to design research data storage systems with legal considerations in mind from the very start, rather than as an afterthought.

Scope. While we discuss examples under European Union law, most of the rules discussed also apply – with some reservations – in most other nations, including the United States.

We specifically focus on research data in the context of “computational reproducibility” [2], which involves data as well as code, but excludes physical artifacts such as tissue samples or heavy machinery. In this article, the highly ambiguous term “database” will refer to the specific database instance.

Structure. In Sect. 2, we break down different kinds of research data common to data management research (such as code, schema, and data instances) and discuss their legal ownership. In Sect. 3, we introduce the various stakeholders involved in the research process. Sect. 4 describes the legal relationships between these stakeholders and typical changes. We dedicate Sect. 5 to an outlook that combines the legal perspective with opportunities for the database research community to contribute.

2 “Ownership” of Research Data and Databases

Different parties involved in the research process might have a legally protected interest in research data. The following example illustrates the complexity of even common scenarios.

Example 1

A research group consisting of researchers A (a professor), B (a researcher from another institution), and C (a post-graduate) is formed at University Y. Their research is funded by funding organization I. Car manufacturer M provides internal data from its manufacturing process under a non-disclosure agreement for research purposes. Students S and T write code and design the user interface.

A, B, and C will receive copyright protection if the organization or structure of the data is deemed original. Y or I will be granted a sui generis right if their investment is considered substantial. Even S and T will receive copyright protection if their work is a sufficient creative act. These legal positions do not eliminate trade secret protection for M, which must be upheld by all persons involved. The resulting database can only be published (or otherwise used) if all parties involved agree to the relevant terms (or their part can be excluded).

Below, we systematically introduce the legal background required for understanding the ownership relationships in the above example.

2.1 Research Data

The question of ownership of research data is often exclusively discussed in regard to (scientific) ethics: Plagiarism and the inability to provide proper evidence of research data and methodology are both unacceptable in all scientific disciplines. However, ethical requirements are not identical to the applicable legal framework.

A legal definition of research data can be found in Art. 2 (9) of the Open-Data-Directive (Footnote 2). The directive applies the term to “documents in a digital form, other than scientific publications, which are collected or produced in the course of scientific research activities and are used as evidence in the research process, or are commonly accepted in the research community as necessary to validate research findings and results”. The directive remains intentionally open with regard to access to such data: It supports policies that provide access “as open as possible, as closed as necessary” but does not require them. Only data covering publicly funded research that was (already) made publicly available through an institutional or subject-based repository by researchers, research performing organizations or research funding organizations must remain available (usually for free) to any interested party, including commercial entities (Art. 10 Open-Data-Directive). However, even in that (limited) scope, legitimate commercial interests, knowledge transfer activities, and pre-existing intellectual property rights will take precedence [3, 4]. Thus, not all research data has to be made available to the public as such [5], and even where it is made available, specific permissions to read, change, delete, and append the dataset have to be determined. To assign such rights, one has to identify the person(s) or entity able to allow or disallow specific usage scenarios.

2.2 Content

As a general rule, information as such does not belong to anyone. “Ownership” (understood as the need for consent or justification by law) is limited to mainly three areas of law:

  • Data protection (privacy) laws, e.g., the GDPR (Footnote 3), prevent the distribution and publication of content linked to a specific (living) human without consent or other justification [6].

  • Copyright might attach to original content included in any table, independently from the protection of the database instance as a whole [7].

  • Finally, any record or the database instance as a whole may be subject to trade secret protection (Footnote 4). Trade secret protection applies to any information that is not generally known or readily accessible, is subject to reasonable protective measures (e.g., non-disclosure agreements [NDAs] and technical intrusion prevention), and is of commercial value due to its secrecy [8]. Trade secrets include, inter alia, inventions not (yet) protected by patents as well as economic know-how (e.g., customer lists, costs of production) and even morally questionable activities intentionally hidden from the public. Such secrets may be held by persons providing data but also by researchers or institutions working with data (in light of future commercial exploitation or even in competition regarding funding, qualified personnel, or collaborative projects) [9, 10].

2.3 Code, Structure, Interfaces, and Instance

Independent of any possible rights to specific content included in database tables, the collection of data as such, as well as the specific implementation, might themselves be legally protected [11].

Copyright will protect any computer programs underlying the database system (as well as stored procedures and functions), the selection of contents (i.e., any filter applied), and the database schema (and views), provided that they show sufficient originality. Additionally, any interface (whether it is an API, a GUI, or a web-service) might be subject to copyright if it is considered an original creation [12]. In a collaborative project, these rights might belong to different legal entities, including individual researchers and developers, user interface designers, or even student assistants not part of the research team.

In the EU, a specific “sui generis right” (Footnote 5) in databases might cumulatively apply to the database instance as such [13]. This right exists independently and does not affect the aforementioned copyright positions or any rights in the content. The sui generis right is a European specialty without any corresponding legal position in most other countries. Unlike copyright, it does not reward creative quality, but instead (only) requires “qualitatively and/or quantitatively a substantial investment in either the obtaining, verification or presentation of the contents” and prohibits “extraction and/or re-utilization of the whole or of a substantial part of the contents of that database”, evaluated qualitatively and/or quantitatively. Thus, unlike copyright, the sui generis right is often held by the research institutions providing the funding and infrastructure – and not by the researchers as individuals.

We refer to our earlier contribution in Datenbank-Spektrum [11], where we discuss the “sui generis right” in greater detail.

2.4 License Agreements

Since no position supersedes the other, the aforementioned rights in content, selection, structure, software, user interface and the database instance exist side-by-side. Therefore, use of research data will require the agreement of any person who was granted any of these rights. Consent to use data under specific conditions is the core of contractual (license) agreements. Often, these requirements are passed on – e.g., someone may allow use of a copyrighted text as part of a research project under the condition that he or she is attributed as the author of the text. In that case, the designers of the database cannot allow users to remove such references to the original author and must impose similar requirements down the line (to the storage providers and eventual end-users). The same interdependence applies to the rights vested in user-interface designers, developers, and database engineers who provide separable parts and must agree on common terms of use [14].
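The interplay of side-by-side rights and passed-on conditions can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which each rights holder consents to a set of uses and attaches a set of obligations; the class `Grant`, the function `effective_terms`, and all holder names and use labels are hypothetical and serve only as an illustration, not as a legal assessment.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    """Consent given by one rights holder (hypothetical, simplified model)."""
    holder: str                   # e.g., "text author", "UI designer"
    permitted_uses: frozenset     # uses this holder agrees to
    obligations: frozenset = field(default_factory=frozenset)  # e.g., {"attribution"}


def effective_terms(grants):
    """Combine rights that exist side-by-side.

    A use is only possible if every rights holder consents (intersection),
    while obligations accumulate and must be passed down the line (union).
    """
    if not grants:
        return frozenset(), frozenset()
    permitted = frozenset.intersection(*(g.permitted_uses for g in grants))
    obligations = frozenset().union(*(g.obligations for g in grants))
    return permitted, obligations


grants = [
    Grant("text author", frozenset({"research", "publication"}), frozenset({"attribution"})),
    Grant("UI designer", frozenset({"research", "publication", "commercial"})),
    Grant("data provider", frozenset({"research"}), frozenset({"non-disclosure"})),
]
permitted, obligations = effective_terms(grants)
print(sorted(permitted))    # ['research']
print(sorted(obligations))  # ['attribution', 'non-disclosure']
```

In this simplified model, the most restrictive holder determines which uses remain possible, while every obligation imposed anywhere in the chain reaches the eventual end-user.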

Most persons storing or using data do not care about copyright, trade secrets, or the sui generis right, but instead focus on license agreements [15]. Anyone entering into such an agreement accepts the duty to use the data or database in accordance with its requirements. These agreements may be very permissive (e.g., Open Data licenses) or extremely restrictive (e.g., non-disclosure agreements), but almost uniformly impose deterring sanctions (i.e., fines to be paid in case of violation, exclusion from further use, etc.). Consequently, the EU Open Data Directive in its narrow applicable scope (Sect. 2.1) requires not only findable, accessible, interoperable and re-usable data but also tries to prevent unfair license terms [16].

The absolute rights granted by copyright law (and the sui generis right) remain important insofar as licensing terms should be determined exclusively by persons whose consent is actually required (i.e., those who have rights in either the data or the database instance). They also provide fallback solutions against outside entities that gain access to research data in good faith, usually via someone acting in violation of his or her contractual duties. Since these outside entities never agreed to any license agreement, they are neither bound by its duties nor subject to the penalties imposed by the contract. Legal claims to damages and injunctions against such entities are therefore only viable under copyright law, the sui generis right, or trade-secret law.

3 Stakeholders

Next, we introduce the stakeholders in the management of research data.

3.1 Data Providers

As stated above, certain data might be subject to protection under privacy laws, copyright, or trade secrecy protection. The use of such data might be privileged under specific exemptions granted by law, but in most cases, consent is required. Often such consent is expressed in formalised license-agreements, which will be binding for any steps building upon such protected content.

3.2 Researchers as Collectors and Organizers of Research Data

Any research project depends on individuals creating, collecting and/or organizing the data – the researchers. These are often but not necessarily part of the same research institution. Although, in the best case, a research group remains stable throughout the whole project, numerous external developments might cause persons to leave the group or join the group later on. Furthermore, the affiliation to an institution might change at any time. Researchers are subject to the specific constitutional guarantee of freedom of research (Footnote 6), which explains their exemptions from certain legal obligations, e.g., in Art. 89 (2) GDPR. However, this specific position is problematic in their relationship to the research institutions involved: Unlike private employers, public research institutions have to respect academic freedom and cannot single-handedly direct activities or determine license terms.

“Researchers” in the aforementioned context must be distinguished from “third-parties” seeking access and use of the data (see Sect. 3.6) – which may include independent researchers, who may merge, enrich, filter, or modify the database (and thereby possibly acquire rights in their fork of the original database).

3.3 Research Institutions

Research is usually funded by institutions, who often provide necessary infrastructure (e.g., workplaces, hardware, software licenses). These include universities, but also public and private organizations. Since funding is an essential requirement for research, such an institution has a significant influence, at least with regard to specific projects. Often, third-party funding is made subject to specific research data policies, including, but not limited to, providing data under an Open Access license. If researchers do not agree to those terms, they will not be granted the necessary funds – thus providing strong leverage. However, research institutions have significantly less influence if research is not funded under a specific agreement relating to a well-determined project but is considered part of more general funding.

The conflict between terms imposed by institutions and individual freedom of research comes to a head when a researcher switches institutions and plans to take data with them – thereby “stealing” any potential benefits of a project from the entity that originally funded the research. And yet, it seems unacceptable for institutions to prevent researchers from accessing data they gathered, organized, and stored during their former academic affiliation. We will discuss these conflicts in Sect. 4.2.

3.4 Support Personnel

Apart from the researchers themselves, persons not directly involved in the specific research might provide technical services, e.g., in database modeling, coding, UI design, and thereby have a significant impact on the resulting database. These supporting persons are not necessarily formal members of the research group and might not even be researchers themselves; they may work in-house (e.g., student assistants) or be independently contracted. However, they might legally provide original creations and therefore gain copyright in user interface, software, or database design.

Contracts often require these support persons to assign any rights or provide an exclusive license to their works to their client (usually the research institutions, not the researchers), which alleviates the risk of their involvement in later disputes. This is possible because they are not protected by individual freedom of research and therefore able to grant exclusive licenses much like in any other business constellation.

3.5 Infrastructure Providers, esp. Data Repositories

The management of research data is often not viable in small-scale local storage systems. Instead, requirements regarding security and integrity, as well as computational power for analysis, indexing, and the volume of data to be stored, will often require larger, centralized storage. Such repositories could be provided by the research institution itself, a third-party funding institution, or even private entities. While these entities have no rights to the data themselves, they are essential for accessing and modifying them. If a service ceases to operate or unexpectedly changes the data, research will be severely hampered. While a local backup might seem like a solution, this is not always feasible.

Contractual agreements with infrastructure providers have high relevance with regard to potential personal data – Art. 28 GDPR requires a special written agreement covering the relationship and responsibilities of a processor (i.e., the infrastructure provider) acting on behalf of a controller (i.e., the researchers or their research institution). Comparable duties may arise due to trade-secret protection. Contracts often also provide for indemnification against third-party claims in the case of data breaches.

3.6 Interested Third Parties and the Public at Large

Occasionally, persons not part of the research group will require (read-only) access to the data, even before eventual publication. This includes reviewers, but also colleagues carrying out related research who are granted an early preview of the data. To ensure compliance with data protection (privacy) rules or non-disclosure agreements (relating to trade secrets), such access might be limited or restricted. In most cases the ability to delete, append, or modify data is not desired; further restrictions (e.g., to certain quotas) may apply and even the data itself might be pre-processed (filtered, anonymized). While one might rely on a code of honor, contractual agreements and technical measures should at least be considered in those cases. The licenses provided for public access do not apply in this early stage – thus, special precautions might be required.
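As a rough illustration of such technical measures, the following sketch combines a read-only permission set with a download quota. The class `AccessPolicy`, its fields, and the quota figures are assumptions made for this example; they do not describe any existing repository.

```python
from dataclasses import dataclass


@dataclass
class AccessPolicy:
    """Hypothetical per-person policy for pre-publication access."""
    allowed_actions: frozenset   # e.g., {"read"} for a reviewer
    download_quota_mb: int       # upper bound on data volume that may be read
    used_mb: int = 0             # volume already consumed

    def authorize(self, action, size_mb=0):
        """Return True and record usage if the request stays within the policy."""
        if action not in self.allowed_actions:
            return False         # e.g., delete, append, modify are not granted
        if self.used_mb + size_mb > self.download_quota_mb:
            return False         # quota would be exceeded
        self.used_mb += size_mb
        return True


reviewer = AccessPolicy(frozenset({"read"}), download_quota_mb=500)
print(reviewer.authorize("read", 200))   # True
print(reviewer.authorize("delete"))      # False
print(reviewer.authorize("read", 400))   # False, since 200 + 400 exceeds 500
```

A technical check of this kind complements, but does not replace, the contractual agreement that defines what a reviewer or colleague is allowed to do with the data.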

While not all research data is open-access by definition, in many cases, data will eventually be made available to at least the relevant scientific community, if not to the public at large (subject to licensing terms). This imposes additional challenges with regard to potential filtering or pseudonymization/anonymization of data to ensure compliance with data protection (privacy) law, and the protection of potential trade secrets included in the data. The general rules are comparable to those applied to third parties discussed above.

In addition, license agreements often require the attribution of the data to its source (even if neither the database instance nor its structure is legally protected). Often, efforts are made to restrict the use (e.g., prevent commercial exploitation or military use or avoid any use of data within a context deemed morally questionable by the researchers). Further requirements may ensure technical integrity, e.g., bandwidth limits. Finally, licenses might require any derivative databases to subject themselves to comparable open licenses (often referred to as a “viral” license, infecting other data like a virus). Even though enforcement of these license terms might be problematic, they serve to ensure the free flow of information [17].

3.7 Overview

Figure 1 visually summarizes our discussion. Data may be subject to legal ownership as early as the data-provider level (far left); the resulting restrictions are passed on to any users through license conditions (or limitations to data protection consent). Researchers and/or research institutions will acquire additional copyright, sui generis right or trade secret protection for their work (e.g., by selecting data, adding meta-data or joining different data sets). These rights, too, are passed on through licenses. As a result, data repositories (second column from right) receive data that is already double-burdened, which they further restrict with the terms of use or technical requirements of their platform. The interested third parties (far right) are thus subject to triple-burdened terms of use. However, it is uncertain whether these terms adequately reflect the desires of researchers, such as facilitated preliminary access for reviewers, restrictions on commercial use, or attribution (far right, lower half), since many data repositories do not allow for individual license terms but are limited to common license types [18].

Fig. 1 The major stakeholders in research data storage. Handling consent and licenses is a cross-cutting concern that affects all stakeholders (shown horizontally). Research institutions play a central role, since they enter the contract with the providers of data repositories. Security, compliance, and availability (bottom) are cross-cutting concerns.

The lower third of the figure points out that the ostensibly uniform group of “researchers” actually comprises different entities: we distinguish individual researchers, support staff, and the research organizations (which usually bear the costs and the organizational burden). We also emphasize that the organization usually has stronger negotiating power, as it often commissions the repositories (and can enforce its own contract terms alongside the license).

At the very bottom (as a “foundation”, so to speak), the figure outlines general requirements in which all stakeholders are (or should be) fundamentally interested. Yet in practice, responsibility for these lies primarily with the repositories: data security (access control, abuse prevention, authenticity assurance, e.g., through hash values), legal compliance (avoidance of content inciting hate, pornography, glorification of violence, etc.), and availability (redundant storage, permanent storage, efficient storage).

4 Relationships

As outlined, research data storage is subject to a nexus of contracts: Since different persons may legally own the contents, the database, the software, and user interface employed, most of the aforementioned entities rely on their individual agreements among each other. In this section, we examine the original relationship and how the consequences change over time. Although an exhaustive exploration of all possible scenarios is beyond the scope of this paper, most situations can be reduced to a number of fundamental cases.

4.1 Original Nexus of Contracts

The natural spider sitting at the center of the web of contracts would be the Research Institution. It usually is party to any contract with external infrastructure providers and any support personnel, and employs (most of) the researchers. This becomes troublesome in the case of inter-institutional research: Who will be the leading institution – and what power does that institution have over researchers at other institutions? Furthermore, the interests of research institutions will not always align with the interests of researchers. For instance, the imperative of cost efficiency may prompt the institution to terminate contracts or impose restrictions, such as bandwidth limitations or performance constraints, thereby impacting access to vital data.

In addition, researchers may move to another institution and thus no longer be bound by their former legal relationship. Finally, giving too much power to the institution might actually complicate research, as any change in the membership of the research group would formally (though hopefully not in practice) be decided not by the researchers but by the institution that drafts and signs the relevant contractual agreements. In addition, the researchers would be unable to define or change the terms of use.

Due to the aforementioned constitutional protection of scientific research, one might be tempted to assign the responsibility to the researchers themselves. Yet without legal advice, such agreements are unlikely to cover all eventualities. In addition, the researchers would lack full control over the operation of the storage system. Accountability would also require them to at least outline and supervise technical and organizational measures to protect personal data and trade secrets. While model contracts (providing some leeway to cater for individual requirements) might lessen some of those burdens, a significant risk of omitting essential rules remains.

Finally, infrastructure providers might employ their market power to enforce pre-formed contracts with standard clauses on anyone using their services [19]. There is little room for negotiation when contractual terms are provided on a take-it-or-leave-it basis. The Digital Markets Act (Footnote 7) expressly allows providers of “cloud computing services” (defined as any “digital service that enables access to a scalable and elastic pool of shareable computing resources”) to be treated as gatekeepers subject to strict regulation if they provide an important gateway for business users to reach end users, have a significant impact on the internal market, and enjoy a stable and strong position. However, whether those requirements are met has to be determined in advance by the European Commission on a case-by-case basis (e.g., possibly one day with regard to Amazon Web Services (Footnote 8)); the Act therefore does not provide a general solution.

In summary, in many cases contracts on the use of data will be subject to a predefined framework set by a research institution or infrastructure provider. This framework will, however, require adaptation if more than one institution or any third party is involved. In addition, individual researchers should at least be informed about the core terms and be given the opportunity to amend or change them for their specific project. If external infrastructure providers are employed, certain minimum standards are required under the GDPR; further requirements might be imposed due to third-party funding. Beyond that, the terms are often non-negotiable.

4.2 Changes in Affiliation

If a researcher leaves the original research institution and joins a new workplace (possibly in the private sector), the fate of the accumulated research data might be uncertain unless provided for in a contract. Under traditional German copyright law (Footnote 9), the researcher as an individual retains any rights to content created at a former institution. The research institution only receives a non-exclusive license to make use of copyrighted material generated in the course of employment, but may not prohibit any use (or further licenses) by its former researcher. However, there is neither an EU law nor a globally accepted rule on ownership in research results and data [20, 21]. Furthermore, as mentioned above, the data itself is often not subject to copyright protection, and the rights in the database design may be held by another researcher (or even by a support person who provided the research institution with a valid exclusive license). Thus, the research data may well be out of reach for the former researcher.

As long as the volume of data remains manageable, the leaving researcher might be granted access to the data via a (possibly encrypted) storage medium (e.g., a hard disk or even a flash drive). Such a process seems preferable in the case of trade secrets or privacy-relevant data, which cannot be transferred to or even over public services. However, in many scenarios, this will not be viable due to the sheer volume of data (and possibly concerns regarding integrity during transport). In fact, the data collection and analysis process may not even be completed at the time the researcher switches affiliations.

Duplicate storage at both the previous and the new research institutions would potentially lead to either a fork in the dataset or the need for (possibly repeated) synchronization. On the other hand, the previous institution might deny (write-)access to the data to unaffiliated researchers. In the worst case, the previous institution might even try to cut costs by simply abandoning the storage and/or computing capacities used for the project. In the best-case scenario, a change to another institution would lead to an agreement between both institutions regarding access to and storage of the research data (much like the aforementioned scenario about research projects conducted at multiple institutions right from the start). Although agreements between the researcher as an individual and the (former) institution might appear to be a possible solution, this imposes a huge responsibility on the individual researcher (who often lacks negotiation power). Similarly, terms in the original agreement between the researcher and the institution (or general research data policy) will often turn out to be overly abstract and unspecific.

In a shared research project, other researchers might have an interest in the data – both those originally part of the project (at the original institution) and potentially researchers at the new place of employment. However, adding new members to an existing research group requires unanimous consent, unless all researchers agreed on another rule in advance (such as a majority vote or a decision by a specific leading member). Therefore, even though affiliation changes, neither researchers nor institutions may unilaterally add additional members to the project. In general, there is furthermore no formal way to exclude a former member against their will – unless some rules are put in place in advance or the persons involved are willing to go to court about it.
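The decision rule for admitting new members can be written down as a short sketch. It assumes that each current member casts a yes/no vote and that the group may have agreed in advance on one of the alternative rules mentioned above; the function `may_admit` and the rule labels are hypothetical names chosen for this illustration.

```python
def may_admit(votes, rule="unanimous", lead=None):
    """Decide whether a new member may join an existing research group.

    votes: dict mapping each current member to True (consents) or False.
    rule:  "unanimous" (the default absent prior agreement),
           "majority", or "lead" (only if agreed on in advance).
    lead:  the deciding member, required for rule == "lead".
    """
    if not votes:
        raise ValueError("a group needs at least one current member")
    if rule == "unanimous":
        return all(votes.values())
    if rule == "majority":
        return 2 * sum(votes.values()) > len(votes)
    if rule == "lead":
        if lead not in votes:
            raise ValueError("lead must be a current member")
        return votes[lead]
    raise ValueError(f"unknown rule: {rule}")


votes = {"A": True, "B": True, "C": False}
print(may_admit(votes))                         # False: C does not consent
print(may_admit(votes, rule="majority"))        # True: 2 of 3 consent
print(may_admit(votes, rule="lead", lead="A"))  # True: A decides
```

Note that institutions do not appear as voters in this sketch, which mirrors the point that neither researchers nor institutions may unilaterally add members.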

As mentioned above, research data may also be stored in a repository managed by a third party. These infrastructure providers typically act on the basis of a contract with the original research institution. Even if the entire research group or at least a significant part of its members change affiliation, the new institution will not automatically become part of the contract. This is problematic due to obligations arising from privacy protection (i.e., the GDPR) and trade secrets law, as the new institution might become liable vis-a-vis third parties for research conducted on its behalf. Thus, an assignment of the contract or at least joining the contract seems necessary (not precluding a decision on the running costs). Granting access to the researchers is still subject to possible secrecy-requirements and licensing by third-party data providers. Again, those contracts do not automatically transfer if they were signed by the institution. Therefore, the researcher might be unable to access such data, requiring Non-Disclosure Agreements or an expansion of the license. Many of these issues can be avoided by appropriate clauses in the original agreements with third parties; however, few institutions currently provide for such contract terms by default. On a technical level, an appropriate permission and authentication mechanism must be in place.

4.3 Changes in Data and Data Use

Nowadays, storing a well-defined database and the accompanying framework for analysis, etc. for public access is a common and not overly complex task, especially for professional providers. However, research data must also be administered before that final phase. As long as the research is still ongoing, the requirements, e.g. the volume of data and the processing power for analysis, can change, despite meticulous planning. Even worse from a legal perspective is the possibility that personal data might get mixed into the dataset or formerly pseudonymized/anonymized data might become de-anonymizable through later additions. Similarly, datasets might unintentionally be contaminated by content subject to trade secrecy or unlicensed copyrighted content. While some filtering might occur automatically, precautions regarding access, logging, and intrusion prevention are necessary at any stage.
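How later additions can undermine anonymization may be illustrated with a small check borrowed from the k-anonymity model, which is our choice for this sketch and only one of several possible measures. The function `smallest_group`, the attribute names, and the records are invented for illustration.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same values for
    the given quasi-identifying attributes (the "k" in k-anonymity).
    A value of 1 means at least one record is unique and may be re-identifiable."""
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Invented example records; "employer" is assumed to be added later on.
records = [
    {"zip": "12345", "age_band": "30-39", "employer": "M"},
    {"zip": "12345", "age_band": "30-39", "employer": "Y"},
    {"zip": "12345", "age_band": "40-49", "employer": "M"},
    {"zip": "12345", "age_band": "40-49", "employer": "M"},
]
print(smallest_group(records, ["zip", "age_band"]))              # 2
print(smallest_group(records, ["zip", "age_band", "employer"]))  # 1
```

Re-running such a check whenever attributes or records are added is one conceivable way to notice that a dataset has drifted out of its original protection level.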

Changes in data (and potential liability arising from such changes) must therefore be taken into account in the early contract design phase, at least when third-party infrastructure providers are involved. Yet even when information is stored on premise by an institution, the design of the storage systems employed has to adapt to changing technical or legal requirements. In the worst case, this might lead to overly restrictive systems that prohibit as much as possible in order to limit the danger that data might unexpectedly turn out to need special protection. A better solution would provide for periodic reviews and potential adaptations and/or scalable models for security and access control.

4.4 Third Party Access

Before eventual publication, researchers (and third parties providing data for the research project) usually want to limit access to a database. However, they might want to or have to grant access to third parties not part of the original research group (see Sect. 3.6). Specifically, at the time of review and before publication of papers, data is often not yet opened to the public and thereby not subject to a potential future Open Data license. In this phase, such third parties should ideally be subject to specific agreements detailing their permissions; such limiting agreements might also be formed implicitly (e.g., without a written agreement or a click-through banner, merely by accessing a dataset that includes a README file). However, mere reliance on good faith or a code of honor is insufficient both under trade secret law and under the GDPR (if applicable).

As long as data is stored on-premise at the institution with which the researchers are affiliated, one-sided restrictions are unlikely. Instead, the institution will respect the freedom of research, which covers in particular both collaborative research and the publication of the eventual results (including the review process). However, precautions such as anonymization/pseudonymization of datasets, access control, logging, or even watermarking of downloaded data might still be required for other reasons: Personal data or trade secrets included in the database instance require a higher protection standard; similarly, the researchers themselves might prefer caution over mere trust.
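Of the precautions listed above, logging is the easiest to illustrate. The following sketch shows one possible way to keep an access log whose entries cannot be silently altered afterwards: each entry stores the SHA-256 hash of its predecessor, so any later modification breaks the chain. The class `AccessLog` and its field names are assumptions made for this example; watermarking and access control are not covered.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def _digest(body: dict) -> str:
    """Hash a log entry in a canonical (key-sorted) JSON encoding."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AccessLog:
    """Append-only log in which every entry commits to the previous one."""

    def __init__(self):
        self.entries = []

    def record(self, user, dataset, action, when=None):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {
            "user": user,
            "dataset": dataset,
            "action": action,
            "time": (when or datetime.now(timezone.utc)).isoformat(),
            "prev": prev,
        }
        entry = dict(body, hash=_digest(body))
        self.entries.append(entry)
        return entry

    def verify(self):
        """Return True only if no entry was changed, removed, or reordered."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

A hash chain only makes tampering detectable; it does not prevent the operator of the log from replacing the entire chain. For evidentiary purposes, the latest hash would therefore have to be deposited with an independent party.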

However, on-premise storage will generally not be possible for publications subject to double-blind review, as the place of storage would reveal the affiliation of the researchers (and their identity). Similar concerns apply to logging access (and registration of reviewers) at the institution where research is conducted. Unless data is stored in a distributed network or on a legally independent, central platform managed by a large number of institutions, these concerns can be eliminated neither technically nor legally. To that extent, the review process itself requires that any data is made available via third-party repositories, at least for the time of review. Simple technical supervision becomes practically impossible in those cases. Further, legal sanctions (damages, fines, etc.) in the case of a data leak (or other abuse of data) become harder to enforce. Therefore, a trusted entity that is independent of both the researcher and the reviewer should provide for supervision. This trusted entity should keep all identifying data hidden from all parties until it becomes necessary to attribute a (proven) breach.
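The role of such a trusted entity can be sketched as an identity escrow: authors and reviewers interact only under random pseudonyms, and the mapping to real identities is released solely once a breach has been established. The class `IdentityEscrow` and the boolean `breach_proven` are hypothetical simplifications; in practice, the proof of a breach would be a legal determination rather than a flag.

```python
import secrets


class IdentityEscrow:
    """Held by a party independent of both researchers and reviewers."""

    def __init__(self):
        # pseudonym -> real identity; never exposed directly
        self._identities = {}

    def register(self, identity: str) -> str:
        """Store the real identity and hand out an unlinkable pseudonym."""
        pseudonym = "p-" + secrets.token_hex(8)
        self._identities[pseudonym] = identity
        return pseudonym

    def reveal(self, pseudonym: str, breach_proven: bool) -> str:
        """Disclose the identity only to attribute a proven breach."""
        if not breach_proven:
            raise PermissionError("identity stays hidden without a proven breach")
        return self._identities[pseudonym]
```

The pseudonym is drawn at random rather than derived from the identity (e.g., by hashing an e-mail address), since derived pseudonyms could be re-identified by anyone able to guess the input.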

4.5 Integrity and Security

As mentioned above, both data protection law and trade secret law require technical and organizational measures to prevent abuse of the data stored. In fact, both areas of law are based on a scaling system depending on the sensitivity of the data – the higher the importance of the data, the more effort will be required. Such precautions include intrusion prevention, etc. This also relates to the issue of ensuring integrity of the data against other influences, like media failure, errors in data transmission, etc. Especially in the case of distributed storage, ensuring synchronization is of high importance.
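As a minimal illustration of integrity protection against media failure or transmission errors, the following sketch records a SHA-256 checksum for every file of a dataset and later reports all files whose content no longer matches. The function names (`build_manifest`, `changed_files`) are chosen for this example; the sketch detects accidental corruption and out-of-sync replicas, but it does not by itself protect against a deliberate attacker who can also rewrite the manifest.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large datasets need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file below `root` (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that were modified, added, or removed since the manifest."""
    current = build_manifest(root)
    return sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
```

In a distributed setting, comparing the manifests of two replicas in the same way would reveal which files are out of sync.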

While this sounds simple in theory, the actual practice is subject not only to an ever-evolving state of the art but also to the steady risk of liability in case of insufficient precautions. Again, this gives rise to oversensitivity by institutions and third parties, causing them to simply refuse to store certain types of data (especially “special categories of personal data” protected under Art. 9 (1) GDPR, including “data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation”). This might severely hamper certain areas of research, since these constraints may reduce the number of available third-party providers, and the institutions themselves might be reluctant to shoulder the additional costs imposed by securely storing such data. Yet, at least theoretically, freedom of research would prevail unless the costs are truly insurmountable while the research itself is considered almost worthless to society.

4.6 Restrictions on Use of Data

Research data frequently possesses value not solely for academic endeavours but also for practical applications. As mentioned above, the data as such is usually not owned by anyone – yet many potential parties might compete for its possible commercial exploitation. Further, researchers might want to prevent a certain use of their data (e.g., for military purposes).

There is a broad scope of licenses available, but many public repositories require or at least recommend the “CC-0”-licenseFootnote 10 – which precludes any potential claims by the persons publishing the data (including attribution of the source). “CC-BY”Footnote 11 at least requires mentioning the original license, the source, and the persons claiming authorship, and “CC-BY-SA”Footnote 12 additionally requires sharing any adapted (e.g., reorganized, modified, or filtered) data under the same license. “CC-BY-ND”Footnote 13 would prevent any modification and require re-distribution as is, and “CC-BY-NC”Footnote 14, “CC-BY-NC-SA”Footnote 15 and “CC-BY-NC-ND”Footnote 16 prohibit any redistribution for commercial purposes. The latter restriction is extremely unclear [22] and it is recommended to seek an individual license beyond the “NC”-terms for any use which might carry even a hint of commercialization.
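The license elements described above can be summarized as four yes/no properties per license. The following sketch encodes them and selects the licenses matching a research team's preferences. The keys use the official Creative Commons short names (e.g., "CC BY-NC-SA"); the type `LicenseTerms` and the function `matching_licenses` are introduced for illustration only, and the sketch does not resolve the legal uncertainty surrounding the "NC" element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    attribution: bool      # BY: source and authors must be credited
    share_alike: bool      # SA: adaptations must use the same license
    no_derivatives: bool   # ND: only unmodified redistribution
    non_commercial: bool   # NC: no commercial purposes


CC_TERMS = {
    "CC0":         LicenseTerms(False, False, False, False),
    "CC BY":       LicenseTerms(True,  False, False, False),
    "CC BY-SA":    LicenseTerms(True,  True,  False, False),
    "CC BY-ND":    LicenseTerms(True,  False, True,  False),
    "CC BY-NC":    LicenseTerms(True,  False, False, True),
    "CC BY-NC-SA": LicenseTerms(True,  True,  False, True),
    "CC BY-NC-ND": LicenseTerms(True,  False, True,  True),
}


def matching_licenses(*, attribution: bool, commercial_use: bool,
                      modification: bool) -> list:
    """Return the licenses consistent with the stated preferences."""
    return [
        name
        for name, terms in CC_TERMS.items()
        if terms.attribution == attribution
        and terms.non_commercial == (not commercial_use)
        and terms.no_derivatives == (not modification)
    ]


if __name__ == "__main__":
    # A team that wants credit, allows adaptations, but no commercial use:
    print(matching_licenses(attribution=True, commercial_use=False,
                            modification=True))
```

Where a repository restricts the admissible licenses, the result would additionally have to be intersected with the repository's own list.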

While CC-licenses are extremely common, they are by no means the only choice – even among “Open” data licenses; e.g., the Community Data License Agreement (CDLA)Footnote 17, the Open Data Commons Open Database License (ODbL)Footnote 18, the Open Data Commons Attribution License (ODC-By)Footnote 19 or the Open Data Commons Public Domain Dedication and License (PDDL)Footnote 20 specifically deal with databases separate from their contents. Details of the specific terms are beyond the scope of this article.

Although third parties may be excluded through license agreements, the different legal positions (Sect. 2) held by researchers and research institutions would call for a unanimous decision. Since this is often impractical, that decision should be made as early as possible. As mentioned before, some providers impose specific requirements on the license and limit the choice.

As a side note: Infrastructure providers may not use the data without agreement by their customers (i.e., the research institutions). Storage contracts do not allow such use (and do not even allow for data mining of the information stored). Although storage providers might try to rely on the statutory rules allowing for text-and-data mining, these rules only apply to copyright law as applied to content, code, or structure, but not to any personal data included.

4.7 Termination of Storage

One final issue is the fact that the storage of data in an accessible form might impose costs that someone has to pay [23]. Once the funding of a research project runs out, there might be interest in simply switching off the respective servers. This also reduces compliance costs: Once data is no longer accessible, there is no risk of a breach of privacy or trade secrecy. Third-party funding of research projects is usually subject to minimum storage periods. However, research carried out without a special funding agreement will only be subject to data policies by the research institution (if any).

At first glance, things seem even simpler when data is stored at external infrastructure providers. They are bound by their respective contractual agreement, which usually includes a specific term of service (potentially renewable). Yet private entities might enter bankruptcy or merge with (foreign) companies. Some precautions may be taken in a contract – and national bankruptcy laws might protect the person storing data on another entity’s infrastructure. However, distributed and global storage networks (often including subsidiaries and even independent third parties) will hinder the enforcement of contractual duties. All in all, keeping a local backup seems advisable in any case.

5 Discussion and Outlook

Even when research data is eventually openly shared, its storage gives rise to a nexus of contracts involving multiple parties. These relationships not only include the researchers among themselves but also apply to the research institutions and possible external infrastructure providers.

At least when research data includes personal data or trade secrets, great attention must be paid to security against third-party attacks. Research ethics (though usually not the law) also require precautions to ensure integrity and redundancy in the event of failure. The risk of failing these requirements lies primarily with the entity acting as a “data controller” – usually the research institution (not the individual researchers acting on its behalf). However, delegation to infrastructure providers does not exempt the original controller from responsibility. Instead, the contractual agreement has to not only provide for the necessary technical and organizational measures but also put in place an appropriate supervision mechanism.

Although many of the issues that arise with regard to the storage of research data could be resolved by contract terms, many research institutions lack appropriate model contracts. Instead, the responsibility is often delegated to the researchers themselves. Furthermore, external storage providers often have superior negotiating power and lack interest in adapting their terms to specific research projects.

Without an initial agreement that covers potential issues, any change will cause significant conflict. This applies to changes in membership, third-party access, and termination of storage. Furthermore, any potential commercial use by researchers, institutions, or third parties might be illegal and the risk of integrity and security issues remains unassigned.

Currently, neither law nor technology provides a workable (or fair) default solution for the aforementioned issues. At least for the review process, it is essential to decouple research data management from personal affiliations – which necessitates a third-party provider. However, existing repositories are often focused on the later stage of providing access to data that has already been cleaned and/or anonymized: Many well-accepted providers limit the choice of licenses [18]. The requirements imposed by Art. 28 GDPR when processing personal data and especially appropriate measures for the protection of trade secrets often require additional agreements or are left to the persons storing the dataFootnote 21. Authentication and logging (and access to those data in case of potential legal disputes) are not yet common services. Free services are often provided by established research institutions – but nevertheless with limited warranties with regard to potential termination of the storage agreement.

Ultimately, it is up to the research community to define potential templates to resolve or at least alleviate the aforementioned issues, which may then be translated into legal text and made available. This would reduce the cost of determining the appropriate terms, reduce legal uncertainty, and provide custom-tailored solutions for each individual project. Once those terms are clear, technology has to follow suit, both with regard to the storage and with regard to permissions, views, and potential limitations. Discrepancies between what is allowed (legally) and what is achievable (technically) should be avoided as much as possible.

Specifically, from the perspective of database systems research, this requires that we design a comprehensive system architecture for managing research data that is aligned with legal considerations from the start, rather than as an afterthought.