1 Introduction

The General Data Protection Regulation (GDPR) is an EU regulation that defines personal data lifecycle requirements at a European level. Its scope ranges from data subjects’ rights and fines to policies and business processes. For example, fines for non-compliance can reach 20 million euros, or up to 4% of the company’s yearly global turnover [1]. Thus, entities working with information systems (IS) must pay close attention to the legal requirements, as non-compliance might cause them (on top of fines) a loss of reputation and increased human and monetary spending.

Even though five years have passed since the GDPR entered into application, companies are still regularly fined. For instance, in January 2023, Meta received a €390,000,000 fine from the Irish Data Protection Commission [2]. According to [3], EU-wide, more than 1000 fines had been issued between July 2018 and March 2022 for violations of the GDPR. The average fine was around €1,500,000, and the highest, at the time of writing, was a €746,000,000 fine imposed on Amazon Luxembourg in July 2021.

As such, it is clear that the GDPR is, at best, partially implemented by many private actors despite the strong financial incentive to achieve compliance. In this context, the need to include GDPR and privacy requirements in the software development life cycle (SDLC) becomes imperative. However, handling requirements emanating from regulations and legal documents is challenging [4, 5]. This situation arises from the conflicting nature of legal prose versus requirements and the SDLC [6, 7]. Legal prose tends to be vague so that it can endure over time and be applied in multiple contexts. Conversely, requirements engineering (RE) aims at precise requirements [8, 9].

To address this issue, requirements engineering researchers have proposed, over recent years, various artifacts to tackle diverse aspects of GDPR compliance—either from a technical point of view or following a wider interdisciplinary approach. This paper aims to map the approaches RE has proposed for GDPR compliance and to understand the current state of the art through a systematic approach. Through a systematic mapping study, the objective is to identify what has been proposed as well as future areas of research, venues, and research methodology. Consequently, the results provide the reader with an exhaustive list of the artifacts proposed in RE for GDPR compliance. The expected outcome is a list of diverse artifacts, ranging from frameworks to extensions of conceptual languages.

2 Background and related work

RE has a longstanding history with regulatory data protection requirements. In its early stages, the RE community investigated and proposed safety requirements for critical systems. Multiple lines of research have concluded that software engineers find regulatory requirements—including data protection regulations—challenging to understand and translate into the information system [4, 6, 10,11,12,13,14].

2.1 Privacy versus data protection regulatory requirements

There are several challenges with privacy requirements. Firstly, the definition of privacy may vary due to cultural elements, personal preferences, and conceptualization [15]. Given this situation, privacy can be a vague term that may encompass different issues [15]. Therefore, regulations dealing with personal data prefer to regulate information privacy or data protection—that is to say, issues regarding allowing data subjects to determine by themselves which type of personal data will be shared, how, and when, including its life cycle, in line with the definition of privacy in [16].Footnote 1 In other words, as a generalization, it is the individual who chooses how their data should be processed rather than the system itself.

Privacy requirements also differ from data protection regulatory requirements because the latter set requirements in one specific aspect of the former. Regulatory data protection requirements emanate from a regulation or legal body. Therefore, regulatory requirements come from a specific type of document(s) and may signal specific requirements. For example, a regulation can mandate organizational requirements—such as appointing a data protection officer (DPO) or identifying the entity responsible for data processing, as the GDPR does [1]—which might not be present as elements in a privacy ontology. Using Glinz’s [9] taxonomy on requirements, regulatory requirements for data protection could be labeled as constraints. Even if the stakeholders do not agree on a requirement set by the regulation—for example, appointing a DPO—this requirement is a restriction imposed by the regulation, in our example the GDPR [1].

Furthermore, working with regulatory requirements for data protection requires specific knowledge and expertise about that regulatory body [4], whereas the same is not necessarily the case for privacy requirements. Regulatory requirements can reference other pieces of regulation and evolve over time [4, 7]. For example, the GDPR entered into force in 2018 and interacts with the Privacy and Electronic Communications Directive (ePrivacy Directive) as well as with the Digital Services Act (DSA) and Digital Markets Act (DMA), which entered into force in recent months. Consequently, knowledge of these and other policies is required for regulatory requirements for data protection in Europe, which is not the case for all privacy requirements.Footnote 2

All in all, privacy requirements are not necessarily the same as regulatory requirements for data protection. The two have different origins, expectations, and specifications, among others. While stakeholders might have different conceptualizations of privacy, regulatory data protection requirements set out requirements independently of the privacy conceptualization, even if the wording allows for different interpretations.

2.2 Privacy requirements engineering

From an RE perspective, privacy is a well-established area of research [17]. Various research papers focus on privacy requirements, including legal concerns, each emphasizing different levels or topics [18]. Several reviews—systematic or not—have been conducted on the subject.

Kalloniatis et al. [17] researched the management and elicitation of privacy requirements, taking a holistic view of the subject. Their research does not follow a systematic approach but reviews well-known frameworks and approaches for privacy requirements [17]. In addition, they highlight the importance of including security and privacy requirements from the early phases of the software development lifecycle, in line with established academic work [4, 5, 17].

Morales-Trujillo et al. [19] share the results of a systematic mapping study on the privacy-by-design paradigm in software engineering. They report an increased interest in the subject in 2018, which they relate to the entry into force of the GDPR [19]. In addition, most papers propose models; however, most contributions are in their initial stages and need further development [19].

Netto et al. [20] carried out a systematic literature review of privacy requirements engineering, focused on the years 2000 to 2016. They found that most of the requirements engineering literature focuses on the elicitation of privacy requirements, followed by their analysis [20]. Furthermore, they highlight that the language used in legal texts is very different from that of requirements engineering, which complicates the work between the two domains, and that there is a lack of modeling languages that can bridge them [20].

Recently, [21] published a systematic literature review on privacy requirements and their perception among IT practitioners, understanding privacy requirements broadly. They provided a list of requirements elicitation techniques, methods, and frameworks published until 2021 [21]. They conclude that the tools and frameworks most used in academia do not align with those used in the private sector or by practitioners [21].

None of the previous works focuses specifically on compliance with regulatory data protection requirements. Some acknowledge the subject and discuss the implications of the corresponding regulation. For example, both [19, 21] point out that there seems to be a peak of published papers related to privacy requirements in 2018, which they relate to the entry into force of the GDPR. However, they do not focus on compliance with a specific regulation or the GDPR.

Several proposals are consistently mentioned and studied throughout these papers as ways to tackle privacy requirements. Some examples are:

  • LINDDUN is a privacy threat modeling framework based on data flow diagrams that allows the analyst to elicit and model privacy threats from the early stages of the SDLC [22, 23]. By including privacy concerns from the beginning of the SDLC, the idea is to help software developers build PbD software [23]. One of the latest developments is LINDDUN GO, a lightweight and gamified approach to the framework [24].

  • Privacy safeguard (PriS) is an organizational goal-oriented framework that helps analyze business processes from a privacy perspective [18, 25, 26]. Based on Enterprise Knowledge Development, “PriS provides a set of concepts for modeling privacy requirements in the organization domain and a systematic way-of-working for translating these requirements into system models” [25]. It identifies eight privacy goals—authentication, authorization, identification, data protection, anonymity, pseudonymity, unlinkability, and unobservability—and seven privacy-process patterns that help to achieve those goals [25, 26]. Through a methodology consisting of four steps, PriS allows the practitioner to elicit privacy goals, analyze and understand their impact, and identify which patterns and techniques may best support achieving them [25, 26].

  • The role-based access control (RBAC) approach is proposed by [5]. Through a goal-driven approach, their framework helps model privacy requirements from the early phases of role engineering to bridge the gap between “high-level privacy requirements and low-level access control policies” [5]. Furthermore, their framework helps model and analyze competing security and privacy requirements [5].

  • Spiekermann and Cranor [27] suggest that privacy requirements should be tackled from an architectural (privacy-by-architecture) and policy (privacy-by-policy) point of view, taking a hybrid approach. Using the FIPPs and privacy reflections as a starting point, they identify that privacy can be divided into three spheres: the user, joint, and recipient spheres. “The ‘user sphere’ encompasses a user’s device [...] The ‘recipient sphere’ is a company-centric sphere of data control that involves back-end infrastructure and data sharing networks” [27], while the joint sphere denotes the services that companies provide to users [27]. Privacy requirements are essential in all three spheres and are divided into data transfer, storage, and processing [27]. Accordingly, they propose that a hybrid approach of privacy-by-policy—which focuses on choice and notice—and privacy-by-architecture—which focuses on data minimization, anonymization, and PETs—“satisfies business needs while minimizing privacy risk” [27].

Other methods and frameworks have also been proposed for privacy requirements engineering.

Across all these proposals, the conceptual model of privacy is not necessarily the same; each places emphasis on different characteristics of privacy. As previously mentioned, these proposals do not focus primarily on regulatory data protection compliance or the GDPR. Indeed, some of them discuss and touch on regulatory data protection, but it is not their main focus. Hence, they fit privacy requirements better than regulatory data protection requirements.

2.3 Regulatory data protection requirements engineering

Data protection and regulatory requirements have long been studied in RE [7, 28]. Multiple frameworks, tools, methodologies, and artifacts have been proposed, either for specific regulatory regimes (or applied to specific laws) or for data protection regulatory requirements in general.

[28] carried out a systematic literature mapping on modeling for regulatory compliance. The authors compared how goal-based and non-goal-based approaches differ with respect to legal and regulatory compliance, highlighting their respective benefits and drawbacks. Their research found that compliance modeling and compliance checking were the most popular modeling topics, followed by analysis [28]. Furthermore, healthcare was the domain that received the most attention, with the Health Insurance Portability and Accountability Act (HIPAA) being the most popular legal document.

[29] identified the critical factors in implementing the GDPR in organizations through a systematic literature review. In broad terms, they suggest that few papers discuss the GDPR [29]. They mention several benefits of implementing the GDPR in an organization, including but not limited to better data management, cost reduction, and better reputation [29]. They concluded that the main challenges are that the GDPR is a complex regulation—in line with what [6] indicates on regulatory data protection requirements engineering—and that there is a lack of people with expertise on the subject [29]. In addition, finding data protection expertise is difficult and expensive, and implementing the GDPR is time-consuming and costly in financial and human resources [29].

[30] carried out a systematic mapping study on automated GDPR compliance using natural language processing (NLP) tools in RE. In particular, they researched which “NLP approaches are useful for RE and for which RE activity?” [30]. They gathered papers up to 2021, with compliance itself being out of their scope [30]. They identified NLP for RE as an ongoing trend.

From an ontological perspective, some proposals either fully tackle GDPR requirements or include parts of them. A non-exhaustive list follows:

  • PrOnto [31] proposes an ontology for GDPR requirements based on legal reasoning. It does not focus on privacy but on legal data protection aspects to check compliance. Another stated goal of PrOnto is to help with legal reasoning and the “web of data and information retrieval” [31]. The methodology used to develop the ontology is the “methodology for building Legal Ontology” (MeLOn), which is frequently used in the legal domain to create ontologies [31]. For example, it has a variety of classes to represent regulatory data protection requirements, such as the obligation, rights, or purpose classes. The authors have since extended this work to propose the DAPRECO knowledge base [32].

  • COPri is a privacy ontology proposed by Gharib et al. [33] that includes some aspects of the GDPR, even though its main purpose is not to be a GDPR ontology. It includes elements that go beyond the scope of the GDPR and does not use legal reasoning.

  • Similarly, LIoPY also includes legal aspects in its ontology [34], although it focuses on IoT instruments. It seeks to include specific attributes of regulatory data protection requirements, such as consent or choice, in privacy policies [34]. It does not use legal reasoning for the ontology.

  • The GDPRov family comprises the GDPRov, GDPRtEXT, and GConsent proposals, which are combined ontologies that tackle specific legal requirements of the GDPR [35,36,37]. GDPRov is an OWL2 ontology that focuses on specific elements of the GDPR, namely the “acquisition, usage, storage, deletion, and sharing of consent and data lifecycles” [35]. It focuses on processes like data deletion and access, consent management, and personal data [35]. GConsent [37] is more specific, focusing solely on GDPR consent requirements. The approach of these ontologies is similar to PrOnto in that the GDPR plays a fundamental role, although they do not explicitly state that they use legal reasoning.
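To make the flavor of these ontology-based approaches concrete, the sketch below models a single consent fact as subject-predicate-object triples, loosely in the spirit of GConsent's focus on GDPR consent. All class and property names here ("ex:Consent", "ex:givenBy", and so on) are illustrative placeholders, not the actual vocabulary of any of the ontologies above.

```python
# Sketch: a consent fact modeled as RDF-style subject-predicate-object
# triples. Names such as "ex:Consent" are invented placeholders, not the
# real GConsent (or PrOnto) vocabulary.

triples = {
    ("consent42", "rdf:type",      "ex:Consent"),
    ("consent42", "ex:givenBy",    "dataSubject7"),
    ("consent42", "ex:forPurpose", "newsletter"),
    ("consent42", "ex:status",     "given"),
    ("consent42", "ex:givenAt",    "2023-01-15"),
}

def objects_of(predicate):
    """Return all objects for a predicate (a toy stand-in for a SPARQL query)."""
    return {o for (s, p, o) in triples if p == predicate}

print(objects_of("ex:status"))  # which consent states are recorded
```

A compliance check then becomes a query over such triples (e.g., is there a recorded consent with status "given" for each processing purpose?), which is essentially what these ontologies enable at scale.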

In this manner, our work differs from other works in RE, as it focuses solely on data protection requirements, specifically the GDPR. Previous work has focused either on privacy requirements—which, as discussed, differ from data protection—or on compliance requirements in general. Meanwhile, data protection is becoming more standardized: the OECD privacy principles [38] and the GDPR have become the de facto international standards for data protection legal instruments. Furthermore, this research aims at studying what has been done in RE to achieve or support compliance, not GDPR requirements from a general perspective.

3 Research method

This research follows the guidelines from Petersen et al. [39,40,41] to gather, analyze, and produce the systematic mapping study. A mapping study’s objective is to discover trends in a specific area, whereas a systematic literature review tries to answer a research question [40]. In this manner, a mapping does not necessarily need to find all the research articles that may answer a question, nor have one, but instead obtain a good representative sample of the area of interest [39, 40, 42]. Given this approach, mapping studies do not require a quality assessment [39].

Following what Petersen et al. [39] indicate, the approach of this study is a “‘thematic analysis’ that counts papers related to specific themes or categories”. At the same time, this mapping study contains a few research questions closer to a systematic literature review than a mapping, as they cannot be answered by reading the abstract alone. However, mappings and reviews can be considered a continuum, each benefiting from the research strategies of the other [39]. Therefore, the researcher does not necessarily have to restrict themselves to reading only one part of an article when doing a mapping study [39].

3.1 Objective and mapping studies

This systematic mapping study was planned following the guidelines from Petersen et al. [39,40,41], as seen in Fig. 1. In particular, the paper-gathering sample plan is based on Petersen et al. [39].

Fig. 1 Systematic mapping process by Petersen et al. [39]

This mapping study aims to discover the trends of what initiatives have been proposed in requirement engineering to achieve GDPR compliance. Its main objective is to summarize and disseminate the current state of affairs of GDPR regulatory requirements in RE.

To discover the trends and fulfill the objective of the mapping study—as mapping studies do not necessarily answer a question [40]—the following sub-questions were chosen:

RQ1: When, where, and in what type of venue has the research been published? (i.e., type of venue)

RQ2: Are the authors of multiple disciplines?

RQ3: What type of research is it?

RQ4: On what stage of the RE process does the paper focus?

RQ5: What compliance elements of the GDPR does the research article focus on?

RQ6.1: What type of proposal is the paper?

RQ6.2: If a modeling language extension is proposed, of which language is it an extension?
This research understands initiatives in a broad manner, as the artifacts or treatments described by Wieringa et al. [43]. Research is understood as investigations that tackle knowledge questions in the domain [43]. We are also interested in knowledge questions, as they act as guiding elements for research.

3.2 Search planning

To create the search string and define the exclusion and inclusion criteria, we followed the PICO (Population, Intervention, Comparison, and Outcomes) approach per Kitchenham and Charters [41]. Although the PICO approach is recommended for systematic reviews rather than mappings, it does help identify keywords, as Petersen et al. [40] did in their mapping.

  • Population: Our population of interest is the GDPR.

  • Intervention: “The intervention is the software methodology/tool/technology/procedure that addresses a specific issue” [41], which in our case is requirements engineering, more precisely compliance.

  • Comparison: We compare what has been proposed in RE, understanding proposals flexibly (so that knowledge questions can be included). Following a similar strategy to Petersen et al. [40], we do not empirically compare the proposals, as this study aims to discover trends, not to conduct a systematic literature review.

  • Outcomes: As this research is a mapping study, as indicated by [40], this item does not necessarily apply. However, the outcome is a systematic list of proposals from RE for GDPR compliance.

As a result, we identified keywords with PICO. Overall, taking a similar approach to Petersen et al. [40], there are four groups of words to be searched:

  • Set 1: Searching elements related to the GDPR, such as data protection regulation.

  • Set 2: The scope is within requirements engineering.

  • Set 3: The requirements need to be linked with compliance or adherence.

  • Set 4: The requirements must come from a legal or regulatory document.

Hence, a list of synonyms was identified, as provided in Table 1. We built the search string based on these identified synonyms.

Table 1 Keywords and synonyms
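As an illustration of how such keyword sets translate into a boolean query, the following sketch ORs the synonyms within each set and ANDs the sets together. The synonym lists shown are abbreviated stand-ins, not the exact contents of Table 1.

```python
def build_query(keyword_sets):
    """OR the synonyms within a set; AND the sets together."""
    clauses = ["(" + " OR ".join(f'"{s}"' for s in synonyms) + ")"
               for synonyms in keyword_sets]
    return " AND ".join(clauses)

# Illustrative (abbreviated) stand-ins for the four sets of Table 1.
keyword_sets = [
    ["GDPR", "general data protection regulation"],  # Set 1: the regulation
    ["requirements engineering", "requirement"],     # Set 2: RE scope
    ["compliance", "adherence"],                     # Set 3: compliance link
    ["regulation", "law", "legal"],                  # Set 4: legal origin
]

print(build_query(keyword_sets))
```

Database-specific syntax (field restrictions, wildcards) would then be layered on top of such a base string, as done per platform in Table 3.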

3.3 Exclusion and inclusion criteria

For the exclusion of research articles, the following criteria were applied (Table 2).

Table 2 Inclusion and exclusion criteria

These criteria were chosen so that the selected articles align with the study’s objective and PICO. Our interest in excluding research published in unranked venues is to perform a quality check, although this is not necessary for mapping studies [40]. Table 2 summarizes the inclusion/exclusion criteria.

Our primary concern is requirements engineering, and we exclude all papers whose main focus is not requirements engineering. Although many research papers talk about requirements, they are excluded if they do not specifically address achieving compliance with GDPR requirements. Similarly, if a specific framework was applied and only tangentially showed compliance with the GDPR, it would also be excluded, as its primary concern was not achieving GDPR compliance. The reason behind this decision is that, although these types of papers may be helpful for several reasons, requirements engineering and GDPR compliance are not their main focus. All technology that falls within the scope of the GDPR must comply with it; hence, briefly claiming compliance and/or discussing GDPR regulatory requirements does not necessarily make a research article interesting for our research objective.

Other exclusion criteria are articles that are not primary studies, have not been peer-reviewed, or are grey literature. The focus of this study, as the research questions show, is what initiatives have been proposed for improving compliance with the GDPR. Hence, although some opinion articles may be interesting, they do not fall within the scope of our research questions.

3.4 Search string

The search string was carefully designed to include synonyms of the identified keywords, allowing us to answer all the questions. Complementary to the PICO approach, we iteratively tested different strings to find the optimal solution, seeking a trade-off between a sample too big to analyze (thousands of papers) and an extremely limited one (a dozen). Furthermore, we verified that the most important papers were still present with the new strings, following a test-retest approach similar to Kitchenham [44].

We decided not to include “privacy” in the search string, as the focus of this study is protecting personal data under the GDPR framework, not “privacy” in broad terms. This decision is: (1) based on the definition of privacy; and (2) due to the test-retest approach.

Firstly, defining privacy is challenging, as there is no clear-cut definition, and it encompasses a wide range of issues [15, 45]; this discussion is shared in Sect. 2. Our second reason for not using “privacy” in the search string is the test-retest approach: when the term was included, the search results in the databases increased enormously and included articles outside the scope of this research—about cryptography, the cloud, legal texts, philosophical texts, and formal code verification, among others. As a result, we decided not to use the word “privacy”, since the articles it surfaced were not of interest. Even without this synonym, we still found the research articles of interest and those identified as essential in the domain.

The databases used to obtain the articles for this research were IEEE, Scopus (which indexes ScienceDirect and Springer), and ACM. These databases were chosen for their notoriety and importance within the field of computer science.

Law research databases were not selected: they were preliminarily tested with our inclusion/exclusion criteria, no papers with an RE focus were found, and hence they were deemed out of scope. At this phase, we queried the HeinOnline and JSTOR databases to identify papers of interest, experimenting with strings with and without the requirements engineering keyword. When querying JSTOR with the keyword “requirements engineering”, for example, we got only 2 papers, both outside the scope of this research, as they did not discuss requirements engineering. Therefore, we decided not to query law databases; the possible impacts are further discussed in Sect. 6.

We designed a specific search string for each database (see Table 3), following the flexibilities or restrictions of each platform. The queries (and subsequent data extraction) were carried out in January 2023.

Table 3 Search string per database

3.5 Data extraction

Fig. 2 Selection of sample papers for the mapping study

Once publications were selected or excluded according to the previously mentioned criteria, the remaining publications were analyzed with the data extraction form shown in Table 4. We conducted the data extraction and reviewed each other’s work, which increases the study’s validity and provides more robustness.

Table 4 Extraction form

To tackle RQ3, we adopt the taxonomy proposed by [46] and [43]. Wieringa [46] has proposed that the engineering cycle can be divided into two main areas of research: knowledge questions and design research. Knowledge questions motivate research that seeks to answer a question about the world [46]. Design research proposes an artifact that contributes to solving, improving, or affecting an environment for a specific problem [47].

Similarly, Wieringa et al. [43] proposed a study classification taxonomy for papers, which we followed in this research. The types of research defined by this taxonomy are:

  • Validation research: “This paper investigates the properties of a solution proposal that has not yet been implemented in the RE practice. [...] The investigation uses a thorough, methodologically sound research set up” [43].

  • Evaluation research: “Techniques are implemented in practice, and the technique is evaluated. That means, it is shown how the technique is implemented in practice...” [39].

  • Solution proposal: this type of article presents a solution to a defined problem [43]. “The solution can be novel or a significant extension of an existing technique” [39].

  • Conceptual proposal: “These papers sketch a new way of looking at things, a new conceptual framework” [43]. Following [42], we prefer to name this type of research conceptual proposals rather than philosophical papers [43], as it aligns better with the objective of the mapping study.

  • Experience papers: the authors share their experience on a subject, where the focus is on the how rather than the what [39, 43].

  • Opinion papers: the authors present their opinion on some subject, without methodology [43].

We left out opinion papers, as they do not propose or carry out primary research.

Table 5 Evaluation and validation research category, proposed by Petersen et al. [40], based on [43]

Along these lines, we also classified the research method followed in papers categorized as validation or evaluation [46]. Although several types of research methods exist, and labeling each precisely would be impractical, we followed the proposal made by Petersen et al. [40], based on Wieringa et al. [43], as shared in Table 5. In this manner, it is possible to see which research methods seem predominant in the field and identify trends.
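These classification facets feed directly into the counting step of the mapping. As a minimal sketch of the "thematic analysis that counts papers", tallying extracted records per facet is enough; the records and field names below are invented for illustration and only loosely mirror the extraction form in Table 4.

```python
from collections import Counter

# Toy extracted records; field names and values are illustrative
# placeholders, not actual data from this study.
records = [
    {"type": "solution proposal",   "re_phase": "elicitation"},
    {"type": "validation research", "re_phase": "specification"},
    {"type": "solution proposal",   "re_phase": "elicitation"},
    {"type": "evaluation research", "re_phase": "management"},
]

# One Counter per facet reveals the predominant categories.
by_type = Counter(r["type"] for r in records)
by_phase = Counter(r["re_phase"] for r in records)

print(by_type.most_common())   # research-type trend in the sample
print(by_phase.most_common())  # RE-phase trend
```

Cross-tabulating two facets (e.g., research type versus RE phase) in the same way yields the bubble plots commonly used to report mapping results.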

To define which process of the RE cycle a paper focuses on, we used the taxonomy provided by Pohl and Rupp [8] and added an extra element presented by Sommerville [48]. In particular, we divided the RE cycle into the following five categories:

  • Elicitation: Refers to the activity that seeks to gather input from the different stakeholders who may be affected by the IS [8]. This ranges from users’ expectations and goals to requirements set by legal documents. For this mapping study, the analysis process was also included in the elicitation phase [48].

  • Specification: “... is the process of writing down the user and system requirements in a requirements document. Ideally, the user and system requirements should be clear, unambiguous, easy to understand, complete, and consistent” [48].

  • Verification and validation: Requirements need to be validated (for example, do they meet the objectives of the stakeholders?) and verified (are all the elements included?) [8]. In the GDPR context, certain articles specify what an IS should and should not include; for example, privacy policies must fulfill a list of requirements to be considered valid.

  • Management: Requirements might evolve over time, change, or be reprioritized; management is linked to all the other RE activities [8].

  • Documentation: Requirements need to be traced, and their specification should be written in a document to keep track of them [8, 48].

Regarding the classification of GDPR topics, despite our best efforts, we could not find a taxonomy or classification scheme for papers that encompasses the whole regulation. Given that the GDPR is a regulation with 99 articles and 173 recitals, classifying the research papers per article would have been neither practical nor useful. Hence, to classify papers by the area of the GDPR they discuss, the authors of this article devised a classification scheme of topics they have found to be commonly discussed, grouped into sets. This situation, cataloged as “emerging classification” by Petersen et al. [40], is common among mapping studies: according to Petersen et al. [40], 40 of the 55 studies reviewed in their mapping used an emerging classification.

The classification we created for this research is as follows:

  • GDPR principles: These are the guiding principles of the regulation: (1) lawfulness, fairness, and transparency, (2) purpose limitation, (3) data minimization, (4) accuracy, (5) storage limitation, (6) integrity and confidentiality (security), and (7) accountability [1].

  • Legal basis for processing (except consent): This is the lawful basis on which a controller can process data. The GDPR defines these bases in Art.6 [1].

  • Consent: Consent is a legal basis for data processing, part of the previous set. It is mentioned throughout the GDPR and stands out as a legal basis with precise requirements, and is thus classified in its own set. It is defined in Art.4(11); Art.6 sets it as a lawful basis, Art.7 defines its conditions, Art.8 sets the rules of consent for children, and Art.9 defines how consent is to be gathered for special categories of data, among other articles and recitals [1, 49, 50]. Consent has a particular type of governance [49] that has sparked an area of research.

  • Data transfers to 3rd countries or international organizations: This is usually abbreviated as data transfer to 3rd countries. Chapter V (Art.44–50) of the GDPR [1] sets the requirements for data transfers to 3rd countries and international organizations. The different mechanisms for doing so must fulfill a set of requirements, such as security, agreements, and contracts, among others [1, 49].

  • Identification of actors: Organizations must identify the actors involved in an information system to define their duties and requirements. Accordingly, they must identify the data protection officer (DPO; Art.37), who the processor and controller are (Art.26–29, for example), whether the processor or controller is outside the European Union (Art.27), and who the data subject is, among others.

  • Duties of actors: The obligations of processors and controllers are stipulated in Chapter IV of the GDPR [1].

  • Data subject rights: These are data subjects’ rights over their personal data, as stated in Chapter III of the GDPR [1]. Among them are the rights: (1) to be informed, (2) to access, (3) to rectification, (4) to erasure, (5) to restrict processing, (6) to data portability, (7) to object, and (8) concerning automated decision making and profiling. These rights impose several requirements on the information system.

  • Privacy policies: Privacy policies are related to the transparency requirement of the GDPR [1]. It is expressed explicitly under Chapter III, as part of the data subjects’ rights, in Art.12–14 of [1], and also relates to the right to be informed. The objective of a privacy policy is to inform the data subject about the data governance model.

  • Privacy-by-Design-and-Default (PbD &D): Relates mainly to Art.25 of the GDPR [1]. The idea of PbD &D is that privacy elements and requirements should be considered from the design stage of an IS [51]. In other words, privacy requirements are dealt with from the early stages of the SDLC, should not conflict with other requirements of the IS, should be the default setting of the IS, and should be user-centric [51].

  • Security requirements: Refers mainly to the requirements set in Art.32 of the GDPR [1], among others (such as Recitals 49 and 83). The idea is that technical and organizational measures (TOMs) should be in place to secure the data processing, particularly based on the risk of such processing [1, 49, 50, 52].

  • Other: A category created to keep track of elements outside this list, recording which element each paper focused on.

  • Specific articles: A category used when a paper discusses a specific article outside the scope of the proposed classification.

  • General: A category created for articles that would either: (a) discuss GDPR compliance in general terms, without referencing articles, reflecting on compliance at a high level; or (b) focus on GDPR compliance as a whole while also addressing some specific articles and issues, with GDPR compliance as their main purpose.

In this classification, we created the “Other” and “Specific Articles” categories to avoid missing topics that could be labeled but did not fall into any of the proposed sets. “Specific Articles” records whether a research article focuses only on a specific article not covered by the proposed sets (GDPR principles, legal basis, consent, data transfer, identification of actors, duties of actors, data subject rights, privacy policies, PbD &D, or security); this way, we can track which specific article the paper focuses on. Meanwhile, “Other” captures another area of interest that does not touch a specific article of the GDPR (such as a record of processing activities in a generic manner). We do not claim that this classification scheme is final or robust, but it was the method we chose to fulfill this study’s objective [40]. To the best of our knowledge, there is no widely accepted taxonomy or classification of interest areas for the GDPR.

To give more robustness to this classification, we decided to record all GDPR articles mentioned in the sampled papers, regardless of whether the paper focused on that article or not. In this way, several GDPR articles could be mentioned throughout a paper without being the main interest of the research article. To illustrate, a paper could focus on consent but mention Articles 4–8, 25, and 44, even though not all of these articles are directly linked to consent.
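The coding protocol described above—one set of topic labels per paper, plus a record of every GDPR article the paper mentions—can be sketched as a simple data structure. This is our illustration only; the field names and sample records are hypothetical, not the authors’ actual extraction form or data:

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class PaperRecord:
    """Hypothetical coding record for one sampled paper."""
    paper_id: str
    topics: set = field(default_factory=set)              # classification sets, e.g. {"Consent"}
    mentioned_articles: set = field(default_factory=set)  # every GDPR article cited anywhere

def article_mentions(records):
    """Tally how many papers mention each GDPR article (as done for Table 13)."""
    counts = Counter()
    for rec in records:
        counts.update(rec.mentioned_articles)  # each article counted once per paper
    return counts

# Illustrative records only: a paper focused on consent may still
# mention articles not directly linked to consent.
sample = [
    PaperRecord("P01", {"Consent"}, {4, 5, 6, 7, 8, 25, 44}),
    PaperRecord("P02", {"Privacy policies"}, {12, 13, 14}),
    PaperRecord("P03", {"Consent", "Privacy policies"}, {6, 7, 12}),
]

print(article_mentions(sample)[7])  # papers mentioning Art.7 -> 2
```

Keeping topics and mentions as separate fields preserves the distinction the study relies on: a mention of Art.4 is recorded even when Art.4 is not the paper’s focus.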

4 Results from the mapping

Fig. 3 Number of publications per year

Table 6 Search results per database

The total number of studies per database is presented in Table 6. The search returned a total of 402 papers; the final sample consists of 90 papers. The selection process is described in Fig. 2, and the list of papers is provided in “Appendix 1”.

4.1 RQ1: When, where and in what type of venue has the paper been published?

4.1.1 When: year of publication

Looking at the number of publications per year in Fig. 3, most publications appeared after 2018, the year the GDPR came into force. Although there have been publications since the year the GDPR was signed (2016), the highest number of publications is in 2019 (25 publications, 27.8%), followed by 2020 (20 publications, 22.2%) and 2022 (20 publications, 22.2%).

4.1.2 Where: venue published

Table 7 Venues that have published more than one article on the topic

We see no clear trend concerning where the sample papers are published. No specific venue has more than five publications (shown in Table 17 in “Appendix 2”). Table 7 shows the frequency of the venues that have published more than one paper. As can be seen, only seven venues have published more than three papers and, more precisely, only three venues have published more than four papers on the subject. The central themes of the first two venues are privacy and security (Information and Computer Security and the ESORICs workshops), while the third venue’s main topic is requirements engineering (RE).

4.1.3 In what: type of venue

Fig. 4 Number of publications per venue

Dividing the sample by the type of venue where the research has been published, we obtain the following: 44 articles (48.9%) have been published in conferences, 33 (36.7%) in journals, and 13 (14.4%) in workshops, as shown in Fig. 4.

4.2 RQ2: Are the authors from multiple disciplines?

Table 8 Research articles that have at least one author not affiliated with computer science or informatics

As mentioned above, each paper author was reviewed to check their affiliations. If the affiliation was not in the paper, it was searched for through a search engine. After this review, we found 20 articles, representing 22% of the paper sample (as shown in Table 8), in which at least one of the authors is not affiliated with a department of computer science or business informatics.

Besides computer science and business informatics, authors came from the following areas: Humanities and Arts, Business Administration and Management, Law, Archival and Library and Information Science, Medicine, Philosophy, Health Science, and Economics (Footnote 3).

4.3 RQ3: What type of research is it?

Using the taxonomy presented in Sect. 3 and Table 9, it can be seen that 11 papers (12%) were classified as knowledge questions only—i.e., not related to the design science process. Focusing on the design science research categories, a significant percentage of the papers include some validation or evaluation (44 articles, corresponding to 48%).

Table 9 Type of research, based on [43]

At the same time, a high number of papers could not be placed under a single category. Forty papers (44%) correspond to exactly one category; the most frequent single categorization is “knowledge question”, followed by “solution proposal”. In comparison, 39 papers (43%) were categorized under two categories, with “solution proposal” plus “proposal validation” and “solution proposal” plus “proposal evaluation” being the two most popular combinations of research type. Finally, the remaining 11 papers (12%) were classified under three categories (such as “knowledge question”, “solution proposal”, and “validation proposal”).

Table 10 Validation and evaluation research methods

When we review which research methods have been applied for validation or evaluation, case studies, industrial case studies, prototypes, and controlled experiments with practitioners are the most popular (Table 10).

4.4 RQ4: On what stage of the RE process does the paper focus?

As this is a mapping study, we followed a high-level classification of the RE process. It was difficult to classify the majority of papers into a single category within the RE cycle, as most would fall under the scope of two or more. In the end, 44 items (48.9%) were classified under a single category, 45 (50%) under two or more categories, and 1 (1.1%) could not be classified under any part of the requirements lifecycle (Footnote 4). The largest number of articles is found under the elicitation category, followed by verification and validation. The classification is presented in Table 11.

Table 11 RE process frequency in papers

To analyze which areas of interest exist in the RE process, we grouped the data based on whether the papers touched on or discussed each area. The total number of individual occurrences is 154 since—as previously mentioned—a paper can have more than one area of interest in the RE process.

47 papers address the requirements elicitation stage, followed by specification with 35 papers, and verification and validation with 33 papers. As presented in Sects. 1 and 2, the focus on elicitation may be related to the fact that GDPR requirements must be interpreted, as they emanate from legal prose. This task of interpreting regulation into software requirements calls for a particular set of skills. A more detailed reflection is presented in Sect. 5.

4.5 RQ5: What compliance elements of the GDPR does the research article focus on?

Quite a few articles touch on only one aspect of the GDPR and seek to contribute only to it. Table 12 shows the areas of interest in the GDPR per year and the total number of research articles that discussed each topic. To gather more insight for this research question, we kept track of which GDPR articles were mentioned in the papers, as previously stated. Consequently, for example, if a research article mentioned Art.4, this would be recorded as a mention.

Table 12 GDPR subjects of interest per year

57 papers refer to the GDPR in a general manner, either mentioning some articles for exemplification purposes without focusing on them, or discussing the GDPR without referencing specific articles. Hence, the category “General GDPR” is the most popular topic. In second place is consent, with 27 papers discussing the subject, followed by privacy policies and, afterward, security elements.

Consent is a popular topic in the paper sample. A significant percentage of the sample acknowledges it as a topic directly in the scope of the paper (27 papers, see Table 14). 31 papers also mention consent somewhere (measured by references to Art.7). Similarly, Art.6, which deals with the legal bases for processing—including consent—is mentioned 37 times. Many of the proposals relate specifically to consent management. By comparison, only 12 papers focus on legal bases other than consent, yet Art.6 is mentioned more frequently, as shown in Table 13.

Table 13 Number of papers mentioning each GDPR article

The third most popular topic is privacy policies, the main focus of 21 papers in the sample. Privacy policy requirements are specified mainly—but not solely—in Art.12–14 of [1]. Art.12 is mentioned 19 times, Art.13 29 times, and Art.14 30 times.

To identify which other elements may have been of interest, these were coded as “Other”, specifying which element they focused on, as seen in Table 14. In this category, the record of processing activities was the focus of four papers.

Table 14 Other GDPR topics of interest, with their references

4.6 RQ6.1: What type of proposal is the paper?

Frameworks and conceptual frameworks are the two most common proposal types, with 22 and 18 papers classified as such, respectively, as shown in Table 15. Given the extent and scope of the GDPR, a framework seems a logical proposal. Tools are the third most common type of proposal. Table 15 shows the frequency of the other types of proposals, with no clear tendency from the fourth position onwards.

As specified in Sect. 3, “Knowledge question” articles and some “Experience papers” do not provide a type of proposal. Hence, they were not categorized for this question.

Table 15 Frequency and percentage of type of proposal

4.7 RQ6.2: If a modeling language extension is proposed, which language is it an extension of?

Table 16 Which modeling language each proposal is based on

Reviewing the proposed modeling languages and proposed extensions of existing modeling languages, we see that STS-ml [54] has been the basis for three investigations. The other languages are: Secure Tropos [55], a goal-oriented language [56], the Unified Modeling Language [57], and a process reference model, as shown in Table 16.

5 Discussion

Some trends and research gaps have been identified and analyzed using the information from the mapping study.

5.1 RQ1: When, where and in what type of venue has the paper been published?

The number of publications per year shows that the topic of requirements compliance for the GDPR remains attractive. Most papers were published in 2019, one year after the regulation entered into force, and there is a drop in 2021, which the COVID-19 pandemic could potentially explain. It would be interesting to check whether publications in other areas dropped in the same manner.

On the other hand, it is interesting to note that prior to 2018 (the year the GDPR came into force), only five papers in the sample had been published. This means that organizations had limited tools, artifacts, methodologies, or approaches to handle the regulatory requirements of the GDPR. Even in 2018, only seven articles were published. Consequently, organizations did not have much work available from an RE perspective.

Regarding the venues where the research was published, there is no clear trend of preference. Indeed, although there are three top venues, none has published more than five articles. A possible explanation might be that, given the importance of the GDPR for organizations, the regulation applies to a wide range of areas.

Similarly, there does not seem to be a significant difference between conferences and journals. However, there are considerably fewer publications in workshops.

5.2 RQ2: Are the authors from multiple disciplines?

Around 22% of the articles selected in the mapping study were written by multidisciplinary teams with at least one author not related to computer science. This is a significant portion of the articles and, given that the GDPR is a legal text related to ethical matters, not surprising.

However, it would be interesting to compare the approaches and interpretations of GDPR articles between research conducted by interdisciplinary and monodisciplinary teams. The interpretation of legal texts and their translation into requirements is a challenging task, usually difficult for software engineers to carry out [5, 10]. For example, does research conducted by teams that include lawyers consider consent or PbD &D in a different way? What are the approaches to specific technologies (such as blockchain) and their compliance? These questions could be addressed in future research.

5.3 RQ3: What type of research is it?

The papers sampled used a wide variety of research methods, with most proposals presenting some type of validation or evaluation. 26 solution proposals also include either an evaluation or a validation. 7 solution proposals additionally address a knowledge question and include an evaluation, while 4 solution proposals address a knowledge question and include a validation. Similarly, 11 papers present knowledge questions alone, and 10 are solution proposals alone. These metrics show that the community is interested in grounding its work in real life and in providing evidence of how proposals interact with stakeholders or the context. In other words, the community seeks the application of its proposals, avoiding leaving them as purely theoretical, and shares their validation and evaluation accordingly.

5.4 RQ4: On what stage of the RE process does the paper focus?

Overall, there is interest in the requirements elicitation stage. As the GDPR is a new regulation, understanding, interpreting, and extracting requirements from this law can be regarded as a new activity. Moreover, since the GDPR is not straightforward about what it expects, interpretations of how to comply will be context-dependent.

The specification stage attracted the second most interest, with verification and validation a close third. GDPR requirements are context-dependent: for instance, gathering consent for a smartphone game for children is not the same as for a website. As a consequence, specifications may vary significantly and affect the design of the IS. On the other side, the verification and validation stage is addressed through the problem of compliance verification. More specifically, there is an interest in validating requirements, such as privacy policies, by addressing questions such as: does the privacy policy meet all the GDPR requirements? Tools for verifying and validating GDPR requirements would be useful for organizations and regulators alike: organizations could use them to check their requirements, while regulators would be able to audit organizations faster.

Because it was almost impossible to frame research papers into just one part of the RE cycle, and because some papers did not specify which area they were interested in, some papers may address two or more RE processes.

Future research could focus on benchmarking and comparing the different proposals, which is outside the scope of this research but would be interesting future work. For example, for the verification and validation tools, which areas of the GDPR do they focus on? If they focus on privacy policies, what do they verify and validate? Do they use AI? What tools are proposed for requirements management? Are the proposed specifications GDPR-compliant from a legal reasoning standpoint?

5.5 RQ5: What compliance elements of the GDPR does the research article focus on?

The most significant area of interest for the GDPR was compliance with the regulation at a general level. Although some of these research papers discuss specific elements of the regulation, others barely discuss them. As this is a mapping study, we did not check for paper quality. However, discussion of, or reference to, specific elements of data protection regulation should be considered in quality checks in future systematic literature reviews. Given the complexity and extensiveness of the regulation, future research should verify claims of GDPR compliance against the law and other regulatory texts.

Consent was the second topic of interest in the selected papers. One possible explanation is the requirements for valid consent (Art.7 [1]). The fact that consent must be free, specific, informed, and unambiguous puts stringent requirements on organizations. How can “specific” be translated into a specification? How can it be proved that consent was informed? Would just ticking a box make consent informed?

These requirements have less to do with the business model and more with how consent is gathered. In other words, they can relate to human–computer interaction, requirements traceability, documentation, and others. Another aspect in which consent differs from the other legal bases in the GDPR is that the user has complete control over this legal basis and may withdraw it at will. Since this legal basis is in the user’s control, a process for withdrawing consent must be implemented, along with keeping a record of it, among other requirements [1]. This could explain why consent has attracted so much interest compared to other legal bases, such as legitimate interest, contract fulfillment, or public interest. Specific topics that papers discuss for consent management include user interfaces, privacy policies, the semantic web, and consent “receipts”, among others.

Interesting future research could analyze users’ acceptance of the proposals that deal with them directly. Furthermore, comparisons and benchmarking of the different proposals would also be interesting. Given the amount of research related to consent, the different proposals should be analyzed to identify in which contexts they can be implemented.

On the other hand, other areas have attracted considerably less interest. For example, the GDPR articles about data transfers to third countries impose strict requirements—from security elements to contractual agreements—that do not seem as popular in the RE literature as consent or privacy policies. For instance, given the usage of cloud services, controllers must verify that a cloud service is compliant with the GDPR if they use it to process data in any way. Similarly, the GDPR principles have not received much attention in the RE literature, even though they are the guiding ideas of the regulation.

5.6 RQ6.1: What type of proposal is the paper?

The literature provides a number of conceptual frameworks on regulatory requirements for the GDPR. Given the novelty of the GDPR, the number of conceptual frameworks aiming to provide a high-level view of the world [43] is not surprising. The new elements of the GDPR probably sparked a strong interest in proposing new conceptual frameworks, as new ontologies and taxonomies are necessary. Similarly, different frameworks have been proposed for achieving compliance with the GDPR, each focusing on a different aspect of the regulation.

GDPR compliance tools have been proposed too. It would be interesting to research which technologies and frameworks these tools work with. For example, do they use artificial intelligence (AI)? If they do, how were the models trained? Did a lawyer provide advice? If a conceptual model was used in a specific part of the tool’s creation, did this conceptual model include legal reasoning? What is the degree of reliability of these tools?

5.7 RQ6.2: If a modeling language extension is proposed, which language is it an extension of?

The modeling languages (or extensions thereof) used in the selected papers are predominantly goal-oriented (such as GRL, SecTro, and STS-ml). For instance, both SecTro and STS-ml use primitives from the i* framework [54], and both place a strong emphasis on security requirements. In future research, the use of these modeling languages in industrial environments should be studied empirically, gathering feedback from practitioners.

6 Threats to validity

As with most systematic mapping studies, there are reliability and validity concerns over the results obtained. As reflected by [42], secondary research can have significant problems with reliability. Even if the same classification for papers is used in two different secondary studies, the same research article may be classified differently [42]. This situation can have multiple causes, such as the authors’ expertise and background, the need for a more concrete classification, or even unclear writing by the authors of the research article being judged [42]. We cannot rule out that this bias is present in this mapping. However, we made our best effort to use clear and well-defined classification schemes with known research methodologies, with the first two authors constantly comparing their results. By having two researchers review each other’s work, we aimed to improve the reliability of the classification of papers and the results of this mapping. Furthermore, we have tried to provide as much detail as possible on our process to support the reliability of the study.

The application of the inclusion/exclusion criteria and the data extraction were conducted using the same pairing strategy. The papers were divided in two so that each researcher could apply the inclusion/exclusion criteria and extract information. Once these tasks were completed, the researchers reviewed each other’s work. If disagreements arose, a meeting was held to discuss each point and come to a conclusion. This approach is based on the recommendations of Petersen et al. [39] and Wohlin et al. [42] for reliability issues.

The study uses well-known and accepted classification schemes and methodologies to gather results. By using well-known, accepted guidelines and methodologies, we aim to reduce bias [39].

One of the biggest threats to this research is the classification scheme used for the areas of interest in the GDPR. In our preliminary research, we could not find a classification scheme for the different GDPR or data protection requirements that would fulfill the objective of this research. In this phase, we unsuccessfully queried different legal databases, such as HeinOnline, in search of GDPR or regulatory data protection requirement schemes. From a computer science perspective, GDPR ontologies exist, such as [31], but they were unsuitable for our research objective. We could also have used the different chapters of the GDPR to classify the requirements, but this approach would have lost valuable information we were interested in; in addition, elements—such as consent—may appear in several articles and chapters of the GDPR.

To face this challenge, we decided to create a classification scheme based on different GDPR guidelines [49, 50, 52, 58], ontologies and vocabularies [31, 59], and the GDPR itself [1]. We proceeded with this approach as previous literature had reported it in other systematic studies, making it a valid option [40]. Indeed, [40] found that 40 out of 55 papers in their sample used this approach, denominating it an “emerging classification”. The classification scheme used in this article has not been proposed or validated elsewhere; thus, it may suffer from low reliability and validity. To re-emphasize, its sole purpose is to help us analyze and gather data for this mapping study.

The content validity of our classification scheme may be low, as we cannot be sure that the classification captures the whole nature of the GDPR—or at least the areas of interest in the GDPR for software engineering. For that reason, the classification may contain bias. Furthermore, it may be inadequate for capturing all the relationships and aspects of the GDPR, being either too narrow or too broad. To mitigate these threats, we also provide the GDPR chapters where each element can be found. Moreover, we discussed the classification scheme with different data protection lawyers, albeit in a non-systematic manner. This classification scheme should be taken cautiously, and its content validity checked.

In this light, the only alternative approach available to classify the interest in the GDPR was categorizing per article. This approach would have been impractical for a mapping study, as it would provide too many details. Furthermore, it would require a precise and detailed analysis of each paper, which falls under the scope of a systematic literature review.

Thus, to provide more validity and mitigate threats, the classification scheme included “Other” and “Specific article” in case the classification sets proved insufficient. As seen in Sect. 4, the record of processing activities was a category not proposed by our classification that appeared in 4 papers. Hence, if our classification scheme did not capture an important aspect, we could still record it with these elements. This gave us more flexibility in the classification and helped mitigate the risk of having missed important GDPR aspects.

All in all, given the threat arising from the use of a custom classification scheme, future work could focus on creating a classification scheme for regulatory data protection requirements. A possible first step could be the creation of an international regulatory data protection ontology.

On another topic, some databases’ search strings only covered abstracts, keywords, and titles. Therefore, some papers could be missing from the sample gathered. Some well-known papers were added via snowballing, but this method is not enough to alleviate the inherent threats of search strings [42]. Likewise, as this is a mapping study, the inclusion/exclusion criteria were applied based on the abstracts and metadata of the population of papers found. How much to read or not to read is subjective and dependent on the authors [60].

However, we do not consider this element a big threat to the validity of our research. As the literature suggests, mapping studies should aim at having a good representative sample of the study area rather than all the papers [39, 40, 42]. Hence, even if our sample misses some papers, this is an accepted feature among mapping studies, and one that differentiates them from literature reviews.

7 Conclusion

GDPR compliance has been an area of interest for the requirements engineering community since the GDPR was signed in 2016 and enacted in 2018. The GDPR entails functional and non-functional requirements, ranging from transparency to specific system functionalities such as retention periods. As a result, many requirements researchers have investigated and proposed different artifacts to help organizations achieve compliance with the GDPR.

From a chronological point of view, before the GDPR was enacted, only 12 papers from our sample had been published, meaning that practitioners had limited tools at the moment organizations were expected to comply. From that date to the present, there has been growing and ongoing interest in requirements for GDPR compliance, and the trend seems set to continue, with new proposals appearing. With the advent of new technological regulations around the world and in the EU, it seems that the RE domain will continue studying the subject.

Most papers (57 out of 90) in the sample discussed GDPR compliance as a whole, therefore not focusing on specific elements to achieve GDPR compliance. Based on topics of interest, consent emerged as the subject attracting the most interest, followed by privacy policies. In both cases, these GDPR articles have very specific requirements that organizations must follow to show compliance. The creation of tools using AI for checking privacy policies seems to be a growing area that future research should focus on. Regarding the methodology used for validation or evaluation, the sample shows no trend, and a diversity of methods is used.

As the GDPR is a legal text that may require knowledge of other regulations or laws, interdisciplinary research is essential. Lawyers and policy experts may have different mental models than software engineers over what the law entails and how to translate it into requirements. For example, erasure may imply deleting the data or rendering it unreadable [61]. In the sample, 20 papers (22%) have at least one author not affiliated with computer science or business informatics, which leads us to deduce that they conducted interdisciplinary research. This result is encouraging, as it means that 22% of the selected papers may be cataloged as interdisciplinary. It could be interesting for future studies to compare how the analysis of GDPR requirements differs between papers with at least one author unrelated to computer science/informatics and those where all the authors are from this discipline.

The most popular publication venue type is conferences, with 44 papers of the sample published there. Journals follow with 33 papers, and workshops with 13. Even with 48.9% of papers published in conferences, no conference or venue has published more than 5 papers, as seen in Table 17; the venues that have published more than one paper on the subject are listed in Table 7. Therefore, the publication venues are still dispersed. This situation could be explained by the fact that GDPR requirements affect a range of disciplines, not only requirements engineering.

All in all, requirements engineering for GDPR compliance seems to be a well-established study area with ongoing interest. Given the range of proposals for different matters, future studies could focus on comparing the different proposals. Furthermore, as new technological regulations enter into force, research could focus on whether the proposals or artifacts for the GDPR can be reused when regulations share common elements, and which lessons can be learned.