## Abstract

Log data is a well-known source for anomaly detection in cyber security. Accordingly, a large number of approaches based on self-learning algorithms have been proposed in the past. Most of these approaches focus on numeric features extracted from logs, since these variables are convenient to use with commonly known machine learning techniques. However, system log data frequently involves multiple categorical features that provide further insights into the state of a computer system and thus have the potential to improve detection accuracy. Unfortunately, it is non-trivial to derive useful correlation rules from the vast number of possible values of all available categorical variables. Therefore, we propose the Variable Correlation Detector (VCD) that employs a sequence of selection constraints to efficiently disclose pairs of variables with correlating values. The approach also comprises an online mode that continuously updates the identified variable correlations to account for system evolution and applies statistical tests on conditional occurrence probabilities for anomaly detection. Our evaluations show that the VCD can be readily adjusted to fit properties of the data at hand and discloses associated variables with high accuracy. Our experiments with real log data indicate that the VCD is capable of detecting attacks such as scans and brute-force intrusions with higher accuracy than existing detectors.

## Notes

1. https://tools.kali.org/password-attacks/hydra, accessed: 2021-04-21.
2. https://cirt.net/Nikto2, accessed: 2021-04-21.

## Acknowledgements

This work was partly funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU H2020 project GUARD (833456).

## A Appendix

### A.1 Threshold Parameter Selection

The filtering steps for correlations between variables and values presented in Sect. 4 make use of threshold parameters \(\theta _{1}\) to \(\theta _{8}\) to narrow down the search space and select only those correlations that are likely to positively contribute to the detection of anomalies. This section investigates the influence of these threshold parameters on the resulting correlations and thereby supports the manual parameter selection process, in particular by relating each parameter to specific properties of the data at hand. In the following, we first explain the generation of synthetic data for this evaluation and then describe our experiments.

**Data.** To measure the influence of thresholds on the correlation selection, it is necessary to control properties of the input data. Therefore, we generate synthetic data for our experiments. We use three variables \(V_1\), \(V_2\), and \(V_3\), of which only \(V_1\) and \(V_2\) correlate with varying strength, and monitor the correlations found by the VCD for different threshold settings. We use values \(\mathcal {V}_i = \left\{ 0, 1, ..., x \right\} , x \in \mathbb {N}\) for each variable and compute their occurrence probabilities as normalized geometric series. Equation 15 shows how the probabilities for values in \(V_1\) and \(V_3\) are computed, where \(p_i = 1\) means that all values are equally likely to occur, and lower values mean that one or more values are dominating the probability distribution. Equation 16 shows how the conditional probabilities of values in \(V_2\) given values from \(V_1\) are computed. Thereby, \(\rho \) specifies the correlation strength, i.e., larger values for \(\rho \) indicate that the same values co-occur more frequently with each other, and \(\zeta \) is a damping factor that reduces the correlation strength for larger \(v_{i, j}\), i.e., higher values for \(\zeta \) cause more co-occurrences between different values.
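To make the notion of a normalized geometric series concrete, the following minimal sketch computes occurrence probabilities for the values \(0, 1, ..., x\). Equation 15 is not reproduced here, so the weighting \(p^j\) for value \(j\) is an assumption, and the function name `geometric_value_probabilities` is hypothetical. The sketch only illustrates the stated behaviour: \(p = 1\) yields a uniform distribution, and lower \(p\) lets the first values dominate.

```python
def geometric_value_probabilities(x: int, p: float) -> list[float]:
    """Occurrence probabilities for the values 0..x as a normalized geometric series.

    Assumption: value j receives weight p**j before normalization.
    The exact form of the paper's Eq. 15 may differ.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    weights = [p ** j for j in range(x + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# p = 1 gives equal probabilities; p = 0.7 skews towards small values.
uniform = geometric_value_probabilities(9, 1.0)
skewed = geometric_value_probabilities(9, 0.7)
```

Under this assumption, sampling \(V_1\) and \(V_3\) amounts to drawing from these probabilities, for example with `random.choices(range(x + 1), weights=skewed, k=10000)`.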

Figure 4 shows the co-occurrences of values from \(V_1\) and \(V_2\) for a sample configuration of \(x = 9\), \(p_1 = 0.7\), \(\rho = 0.9\), and \(\zeta = 0.4\). Due to the relatively strong correlation factor, most values in \(V_1\) occur with the same value of \(V_2\). The figure also shows that higher values of \(V_1\) co-occur with more values of \(V_2\) due to the damping factor, e.g., while \(v_{1, 1}\) only occurs with four different values of \(V_2\), \(v_{1, 9}\) occurs with each value of \(V_2\) at least once.

To evaluate the accuracy of the correlation selection procedure, we generate a ground truth of expected value correlations that contains all \(v_{1, j} \leadsto v_{2, l}\) and \(v_{2, l} \leadsto v_{1, j}\) that occur at least once in the data. We count correlations selected by the VCD and present in the ground truth as true positives (TP), correlations selected by the VCD but not present in the ground truth as false positives (FP), correlations in the ground truth that are missed by the VCD as false negatives (FN), and all other correlations as true negatives (TN). We use the F-score \(F_1 = TP / \left( TP + 0.5 \cdot \left( FP + FN \right) \right) \) to measure the accuracy in the following experiments.
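The counting scheme and the F-score above can be sketched as follows. Representing each value correlation as a hashable item in a set is an assumption made for illustration, and the function names are hypothetical. Returning 0 when no correlations are selected or expected is likewise a convention chosen here, not one stated in the text.

```python
def confusion_counts(selected: set, ground_truth: set) -> tuple[int, int, int]:
    """Return (TP, FP, FN) for selected correlations versus the ground truth."""
    tp = len(selected & ground_truth)   # selected and expected
    fp = len(selected - ground_truth)   # selected but not expected
    fn = len(ground_truth - selected)   # expected but missed
    return tp, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = TP / (TP + 0.5 * (FP + FN)); defined as 0.0 for an empty denominator."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator > 0 else 0.0


# Hypothetical example: correlations encoded as (source value, target value) pairs.
truth = {("v1_0", "v2_0"), ("v1_1", "v2_1"), ("v1_2", "v2_2")}
found = {("v1_0", "v2_0"), ("v1_1", "v2_1"), ("v1_1", "v2_5")}
score = f1_score(*confusion_counts(found, truth))
```

TN does not enter the F-score, which is why the sketch does not count it.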

**Results.** We first experiment with \(\theta _{7}\), which is essential for selecting correlations that represent actual dependencies between the values and do not spuriously emerge from skewed value probability distributions. To analyze the relationship between \(\theta _{7}\) and the correlation strength, we increase \(\theta _{7}\) in steps of 0.05 and \(\rho \) in steps of 0.1 in the range \(\left[ 0, 1 \right] \) while leaving \(p_1 = 0.7, p_3 = 0.7, \zeta =0.4\) constant. For each combination, we generate 10 data samples with 10000 events each as outlined above and then compute the average F-score of these simulation runs. The results visualized in Fig. 5a show that weaker correlation strengths require \(\theta _{7}\) to be sufficiently low to select all correct correlations and achieve the highest possible F-score of 1. However, setting \(\theta _{7}\) to 0 causes a decrease of the F-score independent of the correlation strength. The reason for this is that correlations involving \(V_3\) are not checked for dependency and are thus incorrectly selected, which increases the number of FP. We therefore conclude that \(\theta _{7}\) should be set to a low, but non-zero value, e.g., 0.05. Note that the selection of \(\theta _{7}\) is not affected by \(\zeta \), since additional value co-occurrences have only little influence on the sum of variances as long as they are not dominating the distribution.
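The sweep over \(\theta _{7}\) and \(\rho \) can be outlined as a simple grid loop. This is only a sketch of the experimental protocol: it assumes that the range \(\left[ 0, 1 \right] \) applies to both parameters, and `run_once` is a hypothetical callable that generates one synthetic sample, runs the correlation selection, and returns the F-score. It is not part of any published interface.

```python
from statistics import mean
from typing import Callable


def grid(steps: int) -> list[float]:
    """Evenly spaced values 0, 1/steps, ..., 1, built from integers to avoid float drift."""
    return [round(k / steps, 2) for k in range(steps + 1)]


def sweep(
    run_once: Callable[[float, float, int], float],
    n_samples: int = 10,
) -> dict[tuple[float, float], float]:
    """Average F-score per (theta7, rho) pair over n_samples simulation runs."""
    results = {}
    for theta7 in grid(20):      # steps of 0.05
        for rho in grid(10):     # steps of 0.1
            # run_once(theta7, rho, seed) is a placeholder for one simulation run.
            scores = [run_once(theta7, rho, seed) for seed in range(n_samples)]
            results[(theta7, rho)] = mean(scores)
    return results
```

With these step sizes the grid has 21 × 11 = 231 parameter combinations, so the experiment comprises 2310 simulation runs in total.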

Threshold \(\theta _{5}\) on the other hand relies on the total number of co-occurrences for a given value and is thus influenced by \(\zeta \) in addition to \(\rho \). Figure 5b shows the F-score for various combinations of \(\theta _{5}\) and \(\zeta \), while \(\rho = 1\) is fixed. As expected, increasing values for \(\zeta \) yield lower F-scores for a given \(\theta _{5}\), because the number of distinct co-occurring values for any given value increases quickly (cf. Fig. 4). Accordingly, it is necessary to set \(\theta _{5} \ge 1\) for \(\zeta > 0.5\) to select any correlations. For \(\zeta \le 0.5\), \(\theta _{5}\) effectively steers the allowed number of distinct co-occurrences, e.g., for \(\theta _{5} = 0.5\) at most 5 co-occurring values are allowed since \(\left| \mathcal {V}_i \right| = 10, \forall i\).
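The arithmetic behind this example can be written out as a short sketch. It assumes that \(\theta _{5}\) acts as an inclusive upper bound on the fraction of distinct co-occurring values, which matches the example of at most 5 values for \(\theta _{5} = 0.5\) but is not spelled out in this section. The function name is hypothetical.

```python
import math


def max_distinct_cooccurrences(theta5: float, n_values: int) -> int:
    """Largest number of distinct co-occurring values allowed by theta5.

    Assumption: a value passes if (distinct co-occurrences / n_values) <= theta5.
    A small tolerance guards against floating-point products such as 28.999999...
    """
    return math.floor(theta5 * n_values + 1e-9)


# With |V_i| = 10: theta5 = 0.5 allows up to 5 values, theta5 = 1 allows all 10.
limit_half = max_distinct_cooccurrences(0.5, 10)
limit_full = max_distinct_cooccurrences(1.0, 10)
```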

We argue that the influence of other thresholds is trivial and therefore omit the plots for brevity. Table 4 shows a summary of all thresholds and the data properties with the highest influence on their selection. Note that \(\theta _{8}\) is most influenced by \(\theta _{5}\) and \(\theta _{6}\) rather than a property of the input data, because these thresholds regulate the generation of value correlations that affect the selection criterion involving \(\theta _{8}\). The table also provides default values that we identified as useful during our experiments and are used in the evaluations in Sect. 5.

These results indicate that the large number of parameters does not impede practical application of the VCD, since the thresholds are mostly independent of each other and allow the correlation selection constraints to be configured specifically to counteract otherwise problematic properties of the data. For example, a high number of correlations involving many distinct values (i.e., \(\left| \mathcal {V} \right| \) is large) or weakly correlated variables (i.e., \(\rho \) is low) should be addressed by adjusting \(\theta _{1}\) and \(\theta _{7}\) accordingly to reduce the total number of correlations that are considered for anomaly detection, as shown in Sect. 5.1.

## Copyright information

© 2021 Springer Nature Switzerland AG

## About this paper

### Cite this paper

Landauer, M., Höld, G., Wurzenberger, M., Skopik, F., Rauber, A. (2021). Iterative Selection of Categorical Variables for Log Data Anomaly Detection. In: Bertino, E., Shulman, H., Waidner, M. (eds) Computer Security – ESORICS 2021. ESORICS 2021. Lecture Notes in Computer Science(), vol 12972. Springer, Cham. https://doi.org/10.1007/978-3-030-88418-5_36

DOI: https://doi.org/10.1007/978-3-030-88418-5_36

Publisher Name: Springer, Cham

Print ISBN: 978-3-030-88417-8

Online ISBN: 978-3-030-88418-5

eBook Packages: Computer Science (R0)