INTRODUCTION

According to the InfoWatch expert analytics center, 9.93 billion personal data (PD) and payment data records were stolen in the first three quarters of 2020. Compared with the same period of 2019, the numbers of leaks and of compromised records worldwide fell by 7.4% and 1.4%, respectively. Compared with the 69.5 million PD and payment data records leaked in the first three quarters of 2019, the corresponding figure for the same period of 2020 was 29.2% lower. The reduction in the number of leaks registered (discovered) worldwide stems mainly from the impact of COVID-19 on private businesses and state-run enterprises: in hurriedly restructuring their workflows and moving a large share of employees to remote work, many companies may have weakened control over their information assets, so a large portion of incidents went unregistered. For the statistics on registered data leaks by source, see Fig. 1.

Fig. 1. Statistics of data leaks by source.

At the same time, the number of leaks detected in Russia continued to grow despite the pandemic, which correlates with an intermittent growth in the number of requests to purchase or trial products for leak control and for monitoring employee activities. For the distribution of insiders across categories of corporate personnel, see Fig. 2.

Fig. 2. Distribution of intruders by categories.

At the same time, major incidents, each leaking at least one million PD and payment data records, became much more numerous: their number grew from six in January–September 2019 to fifteen in the first nine months of 2020 [1].

As noted in [2], insider threats are among the most dangerous for many organizations, including public institutions: malicious actions are taken by trusted persons inside the organization and cause major damage.

The first part of this article considers a model of information security threats in corporate data transfer networks and formulates the study objective. The second part presents an algorithm for detecting encrypted files and the results of practical tests conducted to estimate the accuracy of file classification by the developed algorithm.

1. A MODEL OF INFORMATION SECURITY THREATS CAUSED BY INSIDERS IN CORPORATE DATA TRANSFER NETWORKS

To evaluate the actions of insiders, a model of information security threats in corporate data transfer networks has been developed, and it is presented in Fig. 3.

Fig. 3. Model of information security threats from insiders.

According to this model, an insider can be an ordinary or a privileged corporate employee, or covertly installed malware. When transferring confidential information beyond the corporate perimeter, the insider can encrypt it to carry it past data leak detection and prevention (DLP) tools. A leak can occur because available DLP systems detect encrypted data with low accuracy, owing to its statistical similarity to data of other classes such as compressed images, archives, and video and audio files [3, 4].

Available DLP systems analyze the service information that accompanies data transfer or search for various signatures and regular expressions directly in the data. Several works point out that such systems cannot detect encrypted or compressed confidential data [5, 6].

Several researchers note that there are no efficient and accurate methods for classifying high-entropy sources, such as the outputs of data encryption and compression algorithms [7, 8].

Since encryption and compression tools are readily available to the insider, the classification of encrypted and compressed data is a relevant task. There are several reasons why the approaches considered above are not reliable solutions for detecting encrypted data transfers.

First, traffic analysis methods are inapplicable because confidential data should never leave the controlled corporate perimeter at all; tools are needed that analyze data before they are sent outward, for example, after they are uploaded to an email server.

Second, neural networks mainly consider file headers, whose magic bytes have high discriminative ability.

Third, the considered entropy-based and other approaches also operate on file headers with magic bytes.

Since data leak detection and prevention tools deal with insiders, the files and data that an insider intends to send beyond the controlled corporate perimeter are usually openly accessible. This is why content-based methods of data analysis should be considered. For the results of analyzing the subject domain, see Table 1.

Table 1. Results of the analysis of content classifier studies

Despite the broad diversity of methods for classifying encrypted and compressed data, all of them share one common flaw: they rely on analyzing the signatures contained in file headers.
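As a simple illustration of this header dependence, the sketch below shows what such a signature check typically looks like; the magic-byte constants are the standard ones for these formats, while the function itself is a hypothetical example. An encrypted stream, or a compressed file with its header stripped, matches no signature and passes undetected.

```python
# Sketch of a signature-based check of the kind used by existing classifiers.
MAGIC_BYTES = {
    b"PK\x03\x04": "zip",           # ZIP local file header
    b"\x1f\x8b": "gz",              # gzip
    b"Rar!\x1a\x07": "rar",         # RAR
    b"7z\xbc\xaf\x27\x1c": "7z",    # 7-Zip
    b"\xfd7zXZ\x00": "xz",          # XZ
    b"BZh": "bz2",                  # bzip2
}

def detect_by_signature(data: bytes) -> str | None:
    """Return a format label if the file header matches a known signature."""
    for magic, label in MAGIC_BYTES.items():
        if data.startswith(magic):
            return label
    return None  # headerless or encrypted data evades this check entirely
```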

Thus, the goal of this study is to develop an encrypted data detection algorithm that separates, with a high degree of accuracy, encrypted and compressed files from the open-access files that circulate in corporate networks, including office documents, images, and text data. The developed algorithm must not rely on digital signatures or context information.

2. FILE CLASSIFICATION ALGORITHM

For the encrypted file detection algorithm, see Fig. 4.

Fig. 4. Encrypted file detection algorithm.

The first step in ensuring correct operation of the algorithm is to identify high-entropy files with near-uniform byte distributions, that is, encrypted and compressed files. Modules 1–7 separate potentially dangerous data from data legitimately used in corporate networks.

According to [17], if a file's entropy exceeds 6.5, the file potentially contains encrypted or compressed sequences. Thus, module 1 calculates the entropy of the analyzed data. If this threshold is exceeded, the file is forwarded to module 3; otherwise, the analysis is finished.
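A minimal sketch of this entropy gate (modules 1 and 2) is given below; the function names are illustrative, and the threshold of 6.5 bits per byte is the value cited from [17].

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 6.5  # bits per byte; threshold taken from [17]

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits (0..8)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def passes_entropy_gate(data: bytes) -> bool:
    """Modules 1-2: forward the file to the classifier only if high-entropy."""
    return byte_entropy(data) > ENTROPY_THRESHOLD
```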

In module 3, the file is analyzed by a trained classifier based on the extra trees algorithm. This stage involves training the classifier, tuning its hyperparameters, and determining the most significant features of the pseudo-random sequence (PRS) model that allow encrypted/compressed data to be classified as accurately as possible by separating them from the other classes.
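The sketch below illustrates this stage under stated assumptions: it uses the scikit-learn implementation of extra trees, the training data are random placeholders standing in for PRS-model feature vectors, and the hyperparameter grid is illustrative rather than the authors' exact setup.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder sample: random stand-ins for PRS-model feature vectors and
# the labels of the four classes, so the sketch stays runnable.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 12))
y_train = rng.integers(0, 4, 1000)

# Illustrative hyperparameter grid (not the authors' exact configuration).
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 20],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(ExtraTreesClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
clf = search.best_estimator_

# Rank features by impurity-based importance to select the most significant.
ranking = np.argsort(clf.feature_importances_)[::-1]
```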

The features determined in module 3 are transferred to module 4, where their values for the analyzed file are calculated by the PRS model [18].

At stages 5 and 6, an iterative traversal of the tree nodes is performed. The process terminates when a terminal (leaf) node is reached that contains the class label to be assigned to the analyzed file. If the file is recognized as a pseudo-random sequence, the algorithm continues running; otherwise, it terminates.

In module 8, the analyzed file is fed to the input of a random forest classifier. At this stage, the classifier is trained, its hyperparameters are tuned, and the key features are computed that make it possible to distinguish encrypted sequences from compressed ones; in other words, a binary classification is performed. The resulting features are transferred to module 9, where their values for the analyzed file are calculated.
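A hedged sketch of this binary stage, again on placeholder data, could look as follows; it uses scikit-learn's random forest, and the names and parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder binary sample: 0 = compressed, 1 = encrypted; in practice the
# feature vectors would come from the PRS model, as in the previous sketch.
rng = np.random.default_rng(1)
X_bin, y_bin = rng.random((500, 12)), rng.integers(0, 2, 500)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_bin, y_bin)

def final_label(features) -> str:
    """Modules 10-12: assign the encrypted or compressed label to the file."""
    return "encrypted" if rf.predict([features])[0] == 1 else "compressed"
```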

The analyzed file is classified in modules 10–12; as a result, the file is assigned the label of encrypted or compressed data.

The extra trees algorithm was chosen on the basis of the conducted experiments. For the results of estimating the accuracy of the multiclass classification of encrypted/compressed data (AES, Camellia, DES, RC4, GOST 34.12 "Grasshopper"; zip, rar, 7z, gz, xz, bz2), images (jpg), text (txt), and tabular MS Office data (xls), see Table 2.

Table 2. Accuracy estimation of four data classes by various machine learning algorithms

The conclusion from these findings is that, according to both the accuracy metric and the other metrics, all of the tested algorithms exhibit a high level of accuracy. Since it was fairly difficult to choose a specific algorithm with the help of the metrics alone, we considered the running-time characteristics of the algorithms. The extra trees classifier has the shortest training time. Because the training time is directly proportional to the file classification time, this algorithm was chosen as the best one.

The chosen classifier was evaluated in experiments on the previously generated sample consisting of files of the four classes. Cross-validation was conducted with 10 splits of the data into subsamples. For the results of the check, see Table 3.
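Such a check can be reproduced, for example, with scikit-learn's 10-fold cross-validation; in the sketch below the sample is a random placeholder standing in for the actual file corpus.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Placeholder four-class sample in place of the generated file corpus.
rng = np.random.default_rng(2)
X, y = rng.random((2000, 12)), rng.integers(0, 4, 2000)

# 10-fold cross-validation of the extra trees classifier.
scores = cross_val_score(ExtraTreesClassifier(random_state=0), X, y,
                         cv=10, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```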

Table 3. Accuracy estimation of four data classes by the extra trees algorithm

The average classification accuracy achieved by the resulting classifier, as measured by the various metrics, was 0.99.

For the boundaries separating the classes according to the two key features, see Fig. 5.

Fig. 5. Separating boundaries among the four classes.

Analysis of Fig. 5 shows that the extra trees classifier separates encrypted/compressed, text, graphic, and tabular files very accurately.

The importance of features in the multiclass classification conducted by the extra trees algorithm is shown in Fig. 6.

Fig. 6. Importance of features when using the extra trees algorithm.

The importance of the features estimated according to Shapley values is presented in Fig. 7. These values are calculated as

$$\omega_i(p) = \sum\limits_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left( p(S \cup \{i\}) - p(S) \right),$$

where \(p(S \cup \{i\})\) is the model prediction with the ith feature included, \(p(S)\) is the model prediction without it, n is the total number of features, N is the set of all features, and S is a subset of N that does not contain the ith feature.

Fig. 7. Feature weights defined according to Shapley values.

The Shapley value of each feature is calculated for each file in the data sample; then the absolute values are summed, which defines the weight of each feature.
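A possible implementation of this weighting with the shap package is sketched below, assuming the classic API in which shap_values returns one (samples, features) array per class for a multiclass tree model; clf and X stand for the trained extra trees classifier and the feature matrix from the earlier sketches.

```python
import numpy as np
import shap  # SHAP package with an explainer for tree ensembles

# clf and X are assumed to come from the training sketches above.
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)  # list: one array per class (classic API)

# Sum absolute Shapley values over files and classes to weight each feature,
# then order the features from most to least significant.
weights = sum(np.abs(sv).sum(axis=0) for sv in shap_values)
most_significant = np.argsort(weights)[::-1]
```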

The feature weights calculated from Shapley values make it possible to select the most significant features for separating high-entropy data from data legitimately used in the system and to shrink the analyzed feature space by discarding the features with the lowest weights.

CONCLUSIONS

The main contributions of this work are summarized below.

1. Several works on information security were analyzed to examine the application of data classification methods and machine learning algorithms. Conclusions were drawn about the weaknesses of the existing approaches, and requirements were formulated for the developed approach to classifying encrypted and compressed data before they are transferred to an external network.

2. A model of pseudo-random sequences (PRSs) produced by data encryption and compression algorithms was proposed; it differs from its counterparts in that it considers the distribution of binary subsequences of length N bits.

3. Limitations for practical use were formulated: to achieve the maximal accuracy of PRS classification, the analyzed data chunks must be fairly large, at least 600 KB. When chunks of about 50 KB are used, the fraction of correct answers (the accuracy metric) is 0.81. The strength of the suggested PRS classification method is that the PRS model does not take into consideration file headers or the magic bytes of compressed PRSs.

The developed approach has shown a high accuracy of 0.97 in classifying encrypted and compressed sequences; it can be used to improve existing DLP systems or be adopted on email servers for analyzing email attachments before they are sent beyond the corporate perimeter.