1 Introduction

In terms of information security, insider threat refers to the risk posed by an organization’s employees, partners, or customers to the organization’s information [1]. Data leakage is the disclosure of information to unauthorized entities or individuals [2], commonly caused by an intentional or unintentional insider threat [3], [4], [5]. Data leakage protection (DLP) systems, or DLPS, are designed primarily to monitor data flow in an organization and apply predefined measures on terminal devices or networks within the organization [2]. The measures range from logging activities and sending alerts to end users and administrators to quarantining data or blocking it altogether. DLP tools can monitor data at rest and in motion to detect sensitive information [3], [6].

In both corporate and hospital environments, the security of classified information is vital. The cost to companies of lacking DLP technologies is estimated at over $200 per employee per year, and the human factor accounts for 35% of the causes of security breaches, including malicious and unintentional activities of both employees and third parties [7]. Not all sectors are equally affected by the costs of data leakage; the most sensitive are the healthcare and banking sectors, due to the large volume of personal data they handle [8]. The Spanish report [9] identifies several causes of data leakage in the healthcare sector, among which malicious insider threats and unintentional employee actions stand out. Motivated by the above, the main objective of this work is to survey the literature to identify existing techniques for protecting against data leakage and the methods used to address the insider threat.

Studies similar to this one include: a review of the functions of DRM products that were popular and available on the market in 2011, with a quantitative evaluation of the impact of their use [10]; an analysis of the existing digital forensics and incident management literature, aimed at closing knowledge gaps in incident management in the cloud environment [11]; a systematic review that outlines lines of research on blockchain technology applied to eHealth [12]; an examination of the state of the art in security, privacy, and big data protection research [13]; a survey of sensitive data leakage prevention and anti-theft technologies for protecting the information security of e-government users [14]; a study of monitoring strategies for confidential documents based on a virtual file system (VFS) [15]; and a systematic literature review focused on management functions in information security [16]. The recent study [17] presents a review focused on the mobile agent model for data leakage prevention. That review only considered papers published in the journal “Communications and Network” and conference papers published between 2009 and 2019, and it analyzed mobile agent-based distributed intrusion prevention and detection systems in terms of their design, capabilities, and shortcomings. Other studies review blockchain strategies for secure and shareable computing, examining the state of blockchain security in the literature from the point of view of information system security issues, classified into three levels: process level, data level, and infrastructure level [18]; or survey the literature to analyze how blockchain systems can overcome potential cybersecurity barriers to achieve intelligence in Industry 4.0 [19].

Research on data protection has increased with the introduction of telecommuting due to the pandemic and the resulting need to move data to external devices and networks. Related reviews on data protection exist, but no recent study groups the work developed in the last ten years on DLP tools with special attention to the techniques used in those tools and the methods to combat the insider threat. The main contributions of this article are the following: (1) it highlights the most used techniques in DLP tools; (2) it summarizes the methods found in the literature to face the insider threat, with the aim of promoting more secure protection against data leaks; and (3) it exposes the limitations, advances, and applications of DLPS, in order to encourage the development of new tools.

This paper addresses the following research questions:

RQ1. What techniques or technologies are used as DLP tools?

This question is addressed in Sections 2 and 5: Section 2 presents the main tools found, and Section 5 analyzes their frequency of use in the relevant studies.

RQ2. How is the insider threat addressed in the DLP tools found in the literature?

The answer to this question is presented in Section 3, which summarizes how the insider threat is addressed in the literature analyzed.

RQ3. What are the limitations, advances, and applications of DLPS in different fields?

This question is answered in Section 5.3, where the main advances and applications of DLP systems in the period studied are presented.

The rest of the document is organized as follows: Sect. 2 describes the main techniques and technologies used in DLP. Section 3 presents the methods to address insider threats found in the literature, and Sect. 4 describes the methodology followed for the literature review. Section 5 discusses the results obtained and the main limitations, advances, and applications of DLPS. Finally, Sect. 6 concludes the article and presents future work.

2 Techniques and Technologies in DLP

Several studies propose novel DLPS that integrate different techniques and technologies to protect confidential information as effectively as possible. This section gives an overview of the techniques and technologies most used in the papers relevant to this study.

2.1 Overview of techniques most commonly used in DLPS

2.1.1 Intelligent documents

This technique consists of encapsulating within the document both the data it contains and the security mechanisms that control the use of such data [20], [21]. The security mechanisms can govern operations such as deleting, editing, or reading content, and can specify which users are authorized to perform each operation. This technique makes it possible to record where, when, and by whom the content of the document is accessed [22]. It is generally used in DRM systems and is very useful in combination with DLPS.
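As an illustration of the encapsulation idea, the following minimal sketch bundles content, a per-user usage policy, and an access log in one object. It is not taken from [20], [21] or [22]: the class `IntelligentDocument`, its fields, and the operation names are hypothetical, and a real intelligent document would also protect the content cryptographically.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IntelligentDocument:
    """Hypothetical container bundling content, usage policy, and an access log."""

    content: str
    # Maps each user to the operations that user is authorized to perform.
    policy: dict[str, set[str]]
    # Each entry records (timestamp, user, location, operation, allowed).
    access_log: list[tuple[str, str, str, str, bool]] = field(default_factory=list)

    def _check(self, user: str, operation: str, location: str) -> None:
        allowed = operation in self.policy.get(user, set())
        timestamp = datetime.now(timezone.utc).isoformat()
        # Every attempt is logged, whether or not it is allowed.
        self.access_log.append((timestamp, user, location, operation, allowed))
        if not allowed:
            raise PermissionError(f"{user} is not authorized to {operation}")

    def read(self, user: str, location: str) -> str:
        self._check(user, "read", location)
        return self.content

    def edit(self, user: str, location: str, new_content: str) -> None:
        self._check(user, "edit", location)
        self.content = new_content


doc = IntelligentDocument("quarterly figures", {"alice": {"read", "edit"}, "bob": {"read"}})
print(doc.read("bob", "office-pc"))
```

Because the policy and the log travel with the content, the record of where, when, and by whom the document was accessed does not depend on an external server in this sketch.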

2.1.2 Encryption

The most widely used technique in DLPS is cryptography, since it is the main basis of security: it converts data from a readable format to an encrypted format. Following [3], any encryption algorithm can be viewed as a mutating substitution algorithm, in which the substitution unit is the “block” and the substitution table is not fixed (and therefore mutating). The robustness of the algorithm is given by this mutability, which prevents statistical attacks [3].

2.1.3 Hash

A widely used DLPS approach is exact file hash matching. This method verifies outbound traffic by comparing the hash values of the intercepted traffic with those of existing sensitive data [2]. If the values match, the system reports a leak. The problem with this approach is that any modification of the original document may result in a completely different hash value, in which case the system does not detect the confidential document [20].
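The behavior described above can be sketched with Python's standard `hashlib` module. This is a minimal illustration, not the implementation of any surveyed system: the function names and the sample document are hypothetical, and SHA-256 is an assumed choice of hash function.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def build_index(sensitive_documents: list[bytes]) -> set[str]:
    """Precompute the hash of every known sensitive document."""
    return {sha256_hex(document) for document in sensitive_documents}


def is_leak(outbound_payload: bytes, index: set[str]) -> bool:
    """Flag the payload only if its hash exactly matches a sensitive document."""
    return sha256_hex(outbound_payload) in index


sensitive = b"Patient 1234: diagnosis confidential"
index = build_index([sensitive])

print(is_leak(sensitive, index))         # an exact copy is detected
print(is_leak(sensitive + b" ", index))  # one extra byte evades detection
```

The second call shows the limitation: appending a single space changes the digest, so the modified copy no longer matches the index.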

2.1.4 Virtual file system (VFS)

A VFS is an abstraction layer on top of a real file system (RFS), that is, an intermediate layer between system calls and the RFS driver [15]. It also provides the ability to perform operations before and after reads, writes, and other file operations. In exchange for this intermediate “translation” between the applications and the actual file system, some of the original RFS performance is lost.
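The idea of an intermediate layer that acts before and after file operations can be sketched in user space as follows. This is only a conceptual model under stated assumptions: `InMemoryFS` stands in for the RFS, `HookedFS` and `toy_transform` are hypothetical names, and the XOR transform is a placeholder for a real cipher, not actual encryption.

```python
from typing import Callable

Hook = Callable[[str, bytes], bytes]


class InMemoryFS:
    """Stand-in for the real file system (RFS); hypothetical, for illustration."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read(self, path: str) -> bytes:
        return self._files[path]


class HookedFS:
    """Intermediate layer that runs a hook before each write and after each read."""

    def __init__(self, backend: InMemoryFS, pre_write: Hook, post_read: Hook) -> None:
        self._backend = backend
        self._pre_write = pre_write
        self._post_read = post_read

    def write(self, path: str, data: bytes) -> None:
        # The hook sees the data before it reaches the backing file system.
        self._backend.write(path, self._pre_write(path, data))

    def read(self, path: str) -> bytes:
        # The hook sees the stored data before the application does.
        return self._post_read(path, self._backend.read(path))


def toy_transform(path: str, data: bytes) -> bytes:
    # Placeholder only: XOR with a constant is NOT real encryption.
    return bytes(byte ^ 0x5A for byte in data)


rfs = InMemoryFS()
vfs = HookedFS(rfs, pre_write=toy_transform, post_read=toy_transform)
vfs.write("/docs/report.txt", b"classified")
print(rfs.read("/docs/report.txt") != b"classified")  # stored form differs
print(vfs.read("/docs/report.txt"))                   # application sees the original
```

Each extra call through the hooks also illustrates where the performance cost of the intermediate layer comes from.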

2.1.5 Challenges or context-based keys

Challenges replace a stored key with a calculated key, eliminating the security problem of key storage and distribution [3], [21], [22]. In turn, this technique allows the user to be identified through biometric data and the computer to be located through nearby Wi-Fi signals or GPS, among other benefits.
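A calculated key of this kind can be sketched with the standard `hashlib.pbkdf2_hmac` function. The sketch is illustrative and is not the scheme of [3], [21] or [22]: the function `derive_context_key`, the context field names, the salt, and the iteration count are assumptions, and it presumes that noisy inputs such as biometric readings or Wi-Fi scans have already been normalized into stable strings.

```python
import hashlib
import json


def derive_context_key(context: dict[str, str], salt: bytes) -> bytes:
    """Derive a 32-byte key from context values; the key itself is never stored."""
    # Canonical encoding so that the same context always yields the same bytes.
    material = json.dumps(context, sort_keys=True).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", material, salt, 100_000, dklen=32)


salt = b"per-document-salt"  # hypothetical value stored alongside the ciphertext
office = {"user": "alice-template-id", "wifi": "corp-ap-17", "device": "laptop-042"}
cafe = {"user": "alice-template-id", "wifi": "public-ap-3", "device": "laptop-042"}

key_at_office = derive_context_key(office, salt)
key_at_cafe = derive_context_key(cafe, salt)
print(key_at_office == key_at_cafe)  # a different context yields a different key
```

In such a design the content stays encrypted unless every context value matches, because the correct key can only be recomputed in the expected context.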

2.1.6 Minifilters

Minifilters are low-level drivers that run in Windows kernel mode and perform value-added functions (backup, encryption, monitoring, etc.) on file system operations (read, write, metadata modification, etc.) [23], [24], [25].

2.1.7 Biometric information

This technique is widely used in DLPS to identify the user accessing the information and thus try to ensure that it is a legitimate user with permissions to access the information [26], [27], [28].

2.1.8 Hypervisor

Hypervisor-based memory introspection looks for the presence of sensitive raw data in memory on both client and server machines, transcending the dependency on pre-existing security perimeters. This solution has a high computational cost, since a hypervisor-based tool deploys one or more virtual machines to monitor system calls, which consumes considerable hardware resources such as memory and processing [29].

2.2 DRM for document protection

Digital Rights Management (DRM) refers to a set of policies, techniques, and tools that guide the proper use of digital content. A DRM system ensures that only intended recipients can view sensitive files regardless of their location, thus extending data protection beyond the boundaries controlled by DLP systems, so that an organization is always in control of its information [30], [31].

The integration of DLPS and DRM policies ensures that vulnerabilities are minimized and that an organization can immediately deny access to any file, regardless of its location [6]. In [31], [32] and [33], the enterprise digital rights management (eDRM) system is presented, which provides persistent protection for documents using cryptographic methods and includes document protection features that are easy for the enterprise to use. In [34] the authors show the importance of DRM solutions for preventing unauthorized users, inside or outside the boundaries of the organization, from reading an accidentally sent document. They also point out the limitations of these solutions: they do not support certain types of documents, they cannot prevent a file from propagating further on the external network once it has leaked, and they cannot stop an expert hacker from attempting to decrypt the file’s content. In [35] DRM systems are compared with the proposed DLPS (UC4Win). In [36] the authors describe some of the problems faced by DRM systems as a document security solution: they are difficult or impossible to apply to the organization’s IT infrastructure, and they rely on certain plugins and these plugins may be used.

2.3 DLPS in the literature

Table 1 summarizes the contributions of the works found in the literature focused on the development and implementation of DLPS, as well as the techniques and technologies employed.

Table 1 Summary of relevant papers

3 Methods to address the insider threat

A main concern in information security in recent times is the internal threat posed by employees, partners, and collaborators of the organizations that originate confidential information. One of the main measures adopted in the literature is usage control, which goes beyond access control [35] by restricting the operations that enable leakage of confidential information and regulating how that information is used.

The authors of [54] highlight the importance of strengthening the security of confidential document management systems against the threat that company employees pose to confidential information. To address this situation, they propose a security model for confidential documents together with a distribution control strategy. The security model stores the content encrypted with a symmetric encryption algorithm, ensuring that only the authorized user is able to decrypt it. It also stores access control information that determines the user’s degree of authority over the confidential information, records each operation that the user performs, and uses a hash function to ensure the integrity of the content. To control the distribution of confidential information, a client-server strategy is used in which a client cannot distribute confidential documentation without permission from the server, which applies the control policies defined by the administrator. A monitor installed on the client’s computer allows the server to control the operations performed by the user and prohibit unauthorized ones.

In the study [35] a DLPS based on usage control and dynamic data flow monitoring (UC4Win) is presented. This system can monitor process calls to the Windows API in order to prevent or modify data flows that pose a threat of confidential information leakage.

In [55] a scheme is proposed in which middleware enforces mandatory kernel-level encryption on write operations and decryption on open operations, to ensure that data remain encrypted in memory. In addition, usage control policies are established, such as read-only, save, export, write, backup, and print rights. For access control, a method of mutual authentication and key agreement between client and server is proposed, using the SM2 algorithm for its management.

In [26] an approach is presented to control the use of confidential documentation by capturing biometric signals from the users who interact with the object (document) and correlating this information with the content they access. The system stores only the correlation between the two, not the biometric information itself. In this way, when a loss of information occurs, the organization can determine which user accessed the information, while minimizing the risk of an attack on the biometric data.

The authors of [23] propose a DLPS based on Windows file system minifilters that controls the use of classified documentation by controlling OS I/O operations. The proposed system blocks I/O operations from any external storage device. In addition, to restrict the movement of classified information, any process that issues a read request on the path where the classified information is stored is added to a blacklist, and subsequent write attempts from that process are blocked.

The authors of [56] propose a Document Semantic Signature (DSS) approach to address the insider threat. To obtain the DSS, the content of a document is extracted and summarized, and the DSS is updated dynamically whenever the information is modified. The DLPS monitors newly generated information and tracks its transfer or exfiltration by comparing its DSS with the DSS of sensitive information. The study takes into account the possibility that an employee with access to confidential information changes the content using synonyms to evade a DLPS that relies on keyword-based leak detection, and the proposed system addresses this problem. The system was tested with a public dataset, achieving encouraging results.

In the study conducted by the authors of [57], a prototype of an anti-leakage system based on the enterprise cloud is presented. The system uses keyword-based content monitoring and filtering techniques. Once keywords that represent confidential information are detected in a document to be sent, the user and the network administrator are alerted to the possible data leakage, and the incident is recorded in a log.
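Keyword-based monitoring of this kind can be sketched in a few lines. The sketch below is not the prototype of [57]: the class `KeywordMonitor`, its log format, and the tokenization rule are hypothetical, and alerting is reduced to returning the matched keywords to the caller.

```python
import re
from dataclasses import dataclass, field


@dataclass
class KeywordMonitor:
    """Hypothetical keyword filter that logs every outgoing text that matches."""

    keywords: set[str]
    log: list[dict] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Matching is case-insensitive, so keywords are normalized once.
        self.keywords = {keyword.lower() for keyword in self.keywords}

    def inspect(self, user: str, text: str) -> set[str]:
        """Return the sensitive keywords found; a non-empty result means alert."""
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        hits = self.keywords & words
        if hits:
            self.log.append({"user": user, "keywords": sorted(hits)})
        return hits


monitor = KeywordMonitor({"Confidential", "salary"})
print(monitor.inspect("bob", "Attached: CONFIDENTIAL salary table"))
print(monitor.inspect("bob", "Attached: secret wage table"))  # synonyms are missed
```

The second call returns an empty set, which is the synonym-substitution evasion that the DSS approach of [56] is designed to address.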

In [27] eye tracking technology is proposed for information protection. This technology captures user behavior information such as gaze location, gaze tracking, and points of interest. In information security, it can be used to identify the user interacting with confidential information through biometric eye data, and to obtain metadata about the user who creates, sends, modifies, or receives confidential information, for use when the DLP system detects a conflict. In addition, it can help improve the security and integrity of documents by revealing which parts of a document are of greatest interest to the user.

The authors of [25] propose, as a solution to the internal threat, a free DLPS that uses machine learning to detect confidential information leaving through the USB ports and blocks the copy operation. For this purpose, it integrates modules in kernel space (minifilters). The system is developed for the Windows OS, as it is the most widely deployed in business environments.

The study [1] focuses on the insider threat intentionally caused by an employee. To address it, the authors propose “Efficient DLP-Visor”, a context-based DLPS. The system is a thin hypervisor that intercepts calls in kernel space, which makes it possible to detect data leaks even when the employee in question is the system administrator. The system works as follows: the administrator sets a file system path where sensitive information is stored; the DLPS marks any process that opens or reads a document from that path as critical; any file written by a critical process is marked as sensitive; and any process that receives information from a critical process is likewise marked. The DLPS tracks critical processes by capturing kernel mode calls and blocks the relevant operations of those processes.
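The tracking rules described above can be modeled in user space as a small propagation table. This sketch is not the hypervisor of [1]: the class `TaintTracker` and its method names are hypothetical, events are supplied by the caller instead of being intercepted in kernel space, and two points are assumptions of the sketch, namely that a receiving process becomes critical and that reading a file already marked as sensitive also makes a process critical.

```python
class TaintTracker:
    """Hypothetical model of critical-process and sensitive-file tracking."""

    def __init__(self, sensitive_root: str) -> None:
        # Path set by the administrator where sensitive information is stored.
        self.sensitive_root = sensitive_root.rstrip("/") + "/"
        self.critical_processes: set[int] = set()
        self.sensitive_files: set[str] = set()

    def _is_sensitive(self, path: str) -> bool:
        # Assumption: files written by critical processes count as sensitive too.
        return path.startswith(self.sensitive_root) or path in self.sensitive_files

    def on_read(self, pid: int, path: str) -> None:
        if self._is_sensitive(path):
            self.critical_processes.add(pid)

    def on_write(self, pid: int, path: str) -> None:
        if pid in self.critical_processes:
            self.sensitive_files.add(path)

    def on_ipc(self, sender_pid: int, receiver_pid: int) -> None:
        # Assumption: receiving data from a critical process makes the receiver critical.
        if sender_pid in self.critical_processes:
            self.critical_processes.add(receiver_pid)

    def allow_export(self, pid: int) -> bool:
        """Hypothetical policy: critical processes may not send data outside."""
        return pid not in self.critical_processes


tracker = TaintTracker("/secure")
tracker.on_read(10, "/secure/plan.docx")   # process 10 becomes critical
tracker.on_write(10, "/tmp/copy.docx")     # the copy becomes sensitive
tracker.on_ipc(10, 20)                     # process 20 becomes critical
print(tracker.allow_export(20))
```

The point of the model is that the mark follows the data: the copy in `/tmp` and the process that received the data are both covered even though neither touches the protected path directly.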

In [3] and [21] a DLPS for the protection of confidential information is proposed. The proposed system provides access control by deriving the encryption key from a combination of parameters, which can include biometric identification of the user accessing the information, geographic location, electronic fingerprint of the device, and date and time, among others. Although this proposal does not specifically provide usage control, it is robust because several parameters are required to generate the decryption key. Since the key is never stored, the content remains encrypted as long as the established criteria are not met.

The results also show that DRM tools are based on controlling copies of protected information, which makes them valuable for usage control and for protecting information from the threat posed by collaborators and partners. The proposal of [32] and [33] allows an information system to be implemented independently of the servers containing the control policies, which conventional systems needed to access. It controls the use of and access to the document through a license (an XML document separate from the confidential information) containing the security rules and the configuration of the security modules needed to manage the document. The rules are encrypted by means of public and private keys stored and known by the user.

The authors of [58] analyze three models of traditional document security management and expose the limitations of each. To try to overcome them, they propose a system based on storage in the private enterprise cloud, with document authorization and encryption performed in a virtual machine that encrypts every document written to it. Thin clients connect to a common terminal in the virtual machine, which guarantees that all written documents are encrypted. Decryption must go through the same encryption system, which guarantees that the user leaves a trace of the operation carried out. External users need an electronic certificate to decrypt a document.

In [36] the main problems of different solutions for information protection within an organization are identified, among which DLP and DRM solutions are described. A solution based on active documents and DRM is proposed that allows the control of document usage, mainly copy, paste, cut, delete, and print operations, inside and outside the organization of origin. The transfer channels considered in this work were removable storage, e-mails, and shared folders. This work does not implement the system, but proposes an idea on how to solve the problem of data leakage with active documents.

The persistent concern of organizations and enterprises about the internal threat to data leakage protection has attracted the interest of the research community, which has sought ways to counter it. The recent study [59] presents CITD, a system for detecting insider threats based on machine learning and on the behavior of workers according to their role. The system was tested in three real organizations with the aim of reducing false positives and improving the tool.

4 Methodology

This paper utilizes the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method [60] in a literature review to analyze existing techniques and technologies for DLP focused on electronic document and classified information leakage. Three stages of PRISMA application are shown in this study: literature search; selection of relevant articles; and data extraction.

4.1 Literature search

For this research, the search focused on articles related to the techniques and technologies used for DLP published in impact journals, conference proceedings, and book sections from 2011 to April 2022, mainly in scientific databases such as Google Scholar, Science Direct, IEEE Xplore, Web of Science, Scopus, and ACM Digital Library. These databases cover relevant scientific information in multiple engineering fields and give access to articles published in scientific and academic journals, repositories, archives, and other collections of scientific texts.

Fig. 1 Search criteria in different databases

The following keywords were used for the literature search: “Security” AND (“DLP” OR (“Data” AND (“Leak” OR “Loss”) AND (“Prevention” OR “Protection”))). These terms were searched in Abstract/Title/Keywords from 2011 to 2022. Figure 1 shows the search strategy used in this research; the search criteria are those provided by the search engine of each scientific database.

4.2 Study selection and relevant papers

Once the terms had been entered in the search engines of the databases, the articles to be analyzed were selected by reading the titles of the results obtained (158 in this case). Entries repeated in more than one database were eliminated (56 articles). Selection criteria were then applied to the abstracts of the remaining 102 articles to identify those to be analyzed in full: (1) studies presenting novel proposals of techniques and technologies for DLP; and (2) studies analyzing techniques and technologies for DLP. This yielded 65 articles for complete analysis. From these, studies aimed at malware and rootkit protection systems, image cryptography, and steganography were eliminated, as were reviews of techniques and technologies, since they are works related to this one but not relevant to the analysis. A total of 42 articles remained for analysis.

The procedure described is shown in the PRISMA diagram in Fig. 2, which depicts the paper selection process and how, out of a total of 158 papers found, 42 were relevant for analysis in this paper.

Fig. 2 PRISMA Methodology

5 Discussion of results

This section discusses the results obtained after applying the above methodology. The relevant studies are classified according to year and type of publication, and the number of relevant publications per year and their origin are analyzed to determine where the topic is most widely disseminated. We then analyze the use of the main techniques in DLPS and discuss their limitations, advances, and applications according to the reviewed literature.

5.1 Classification according to year and type of publication

Figure 3 shows the frequency of publications by year of the relevant studies found during the period 2011–2022. The year 2019 has the highest number of publications in this period. In general, the number of papers published per year ranges from 1 to 8, with a statistical mode of 3 and a mean of approximately 4; the years 2011 and 2019 exceeded the mean. Approximately 60% of the articles relevant to this study were found in the period between 2017 and 2022, which shows a significant interest in recent years in the security of sensitive digital information. Figure 4 shows the number of published papers according to their origin: 60% of the relevant papers come from conferences and approximately 30% from journals, demonstrating the deep interest of the academic field in protection against data leakage.

Fig. 3 Frequency of papers published per year

Fig. 4 Paper / Article source

5.2 Analysis of the use of the main techniques and technologies in DLPS

Figure 5 shows the most frequently used techniques in the literature. Cryptography is the most used, present in 40% of the articles studied, while ML is present in 12%, which evidences the progress of DLPS in using this technique to classify sensitive documentation. Others, such as hypervisors, biometric information capture, and intelligent documents, are present in 10% of the 42 papers relevant to this study. The literature shows that these techniques and technologies are widely used in combination with each other. For example, in systems that use minifilters and VFS or middleware, documents are often encrypted for storage in memory. Likewise, when active documents are used, hash algorithms are incorporated to guarantee the integrity of the information, along with ML to classify the information according to its degree of confidentiality so that security and access policies can be applied accordingly. In DLPS, combining these and other complementary techniques can substantially strengthen the security of confidential information.

Fig. 5 Percentage of use of the most frequent techniques and technologies

5.3 Limitations, advances, and applications

Limitations that have emerged over the years include the almost complete dependence on the quality of the security policies used and on the precise definition of the data to be protected, as well as the over-approximations that dynamic monitoring of the data flow requires [35]. In [36] four challenges facing document security are identified, one of them being human negligence. DLPS are not able to overcome this challenge, since they rely on user names, passwords, and security policies to secure information, without taking into account that the users who carry out the security policy of an organization may themselves be the source of the data leakage; a user name and password are therefore not enough. The tracking of documents that are unmarked [37] or not classified as confidential also represented a major limitation of DLPS at the time.

Some of these problems have already been addressed by incorporating new techniques and technologies into DLPS. One is ML for document classification and threat detection: the recent study [61] proposes a multilayer framework for insider threat detection based on a hybrid method composed of two predictive models, with an accuracy level higher than 97%. Another application of ML in data protection is network intrusion detection systems, as seen in studies [62], [63], [64]. Other additions are DRM systems for tracking sensitive information outside the organization, biometric information for user identification, and context-based keys to determine the date, place, and time of information access. An important advance is the incorporation of blockchain to protect the DLPS logs where information about detected anomalies is stored: these logs are stored in the Hyperledger Fabric ledger in real time, which prevents authorized users from manipulating them to eliminate evidence of data leakage [65].
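The tamper evidence that a ledger gives to DLPS logs can be illustrated with a simple hash chain. This is a deliberately reduced stand-in and not Hyperledger Fabric or the system of [65]: there is no network, consensus, or replication, and the function names and event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first entry


def entry_hash(previous_hash: str, event: dict) -> str:
    """Hash an event together with the hash of the entry before it."""
    payload = previous_hash + json.dumps(event, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(chain: list[dict], event: dict) -> None:
    """Append a log event that is linked to the previous entry."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": previous_hash,
                  "hash": entry_hash(previous_hash, event)})


def verify(chain: list[dict]) -> bool:
    """Recompute every link; any edited or removed entry breaks the chain."""
    previous_hash = GENESIS
    for entry in chain:
        if entry["prev"] != previous_hash:
            return False
        if entry["hash"] != entry_hash(previous_hash, entry["event"]):
            return False
        previous_hash = entry["hash"]
    return True


log: list[dict] = []
append_event(log, {"user": "bob", "action": "usb-copy-blocked"})
append_event(log, {"user": "bob", "action": "mail-attachment-flagged"})
print(verify(log))

log[0]["event"]["user"] = "nobody"  # an insider edits the evidence
print(verify(log))
```

A local chain like this only makes tampering detectable. Someone who controls the whole file could recompute every hash, which is why a replicated ledger is used in practice so that no single authorized user holds the only copy.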

In terms of DLPS applications, the studies reviewed focus on the security of sensitive information at the enterprise level, and most of the trends and developments therefore lean toward this area. However, the authors of [21] propose a DLP solution using context-based encryption to prevent information leakage in drones. In the poster [66] the authors propose a data leak detection tool for a health information system based on memory introspection. A recent study proposes a blockchain-based architecture that allows the secure transfer of electronic health records between different health care systems, verifying the integrity and consistency of requests and responses to electronic health records [67].

6 Conclusions

This research consists of a literature survey in which a total of 42 relevant studies were obtained. The survey made it possible to answer three research questions that met the objective proposed in this study. A deep interest in addressing the insider threat was detected in more than 40% of the analyzed studies. The DLPS that address it most directly combine access control with usage control of confidential information, by controlling the operations that allow data leakage (copying, opening, writing, and reading), together with privacy policies, and use DRM in the case of partners and collaborators. These tools mainly use biometric information capture and interception of calls in kernel space by means of hypervisors, VFS, middleware, and minifilters, as well as security policies encapsulated in documents. Among the techniques and technologies analyzed, encryption was the most used, appearing in 40% of the studies.

Significant progress is seen in DLP tools with the incorporation of techniques such as ML for the classification of sensitive information and the detection of anomalous activity, in addition to blockchain for the protection of DLPS records. No article was found in the literature that provides the open access code of a DLPS for reuse and improvement by other researchers. Few studies focused on data security in the healthcare sector, and only one applying DLP to the Internet of Things (IoT) was found in the search results. For this reason, we propose as future lines of work to carry out studies on the security and protection of the electronic health record, as well as the development and implementation of a DLPS focused on the insider threat. Drawing on the experience of the works found, this DLPS should be lightweight and unobtrusive, should not make access to information depend on user data and saved passwords, and should offer free access to its source code so that other researchers can adapt it to their needs and provide validations and improvements. To this end, we propose to study the techniques and technologies that allow the development of virtual file systems, for the implementation of a secure file system as a DLP tool, as well as lightweight encryption and decryption algorithms suitable for the needs of a virtual file system. Another line of research for DLPS is its application to IoT, since this technology is advancing every day and most IoT devices collect large amounts of personal data.