1 Introduction

Nowadays, industrial and IT systems produce and collect a tremendous amount of logs during their everyday operation. The vast majority of the lines in these logs are generated as part of the normal operation: they provide some information about a request, a service or any important event occurring in the system. Error messages indicating failures are also multiplexed into the stream of log lines.

The lines of the log files represent an important, rich source of information for a domain expert helping to deduce the state of the system, to identify suspicious activity or to identify the first signs of future anomalies. Due to cost implications, such domain experts are not available in the majority of IT systems. To reduce the cost and to enable real-time processing, several machine learning based methods appeared in the past decade for supporting the automatic analysis of system logs.

The goal of these log analytic systems is to identify operation anomalies, log line sequences that need attention and possibly human intervention. This task is much more than just detecting error messages and failures, as many important anomalies only manifest themselves in the order or in the timing of normal looking log lines.

Machine learning-based log analytics has come a long way in the past years, with deep learning-based solutions dominating recently. These deep learning-based models are given the sequence of log lines, represented either by template keys or by numerical vectors holding semantic information. In the case of unsupervised methods, the model learns the usual patterns in the log sequence, and the output is usually the distribution of the expected log line/word/word piece, given by the final softmax layer. Anomalies are indicated when the model assigns a low probability to the observed log line. These methods perform well, but have a serious flaw: log messages never seen before are not handled appropriately, since the softmax layer providing the output cannot be extended at inference time.

In this paper we introduce a method that has two important features: (1) it is template-less, thus in the inference phase it can handle log lines that it has never seen before, (2) it is unsupervised, i.e., it does not need labeled data for training the model. There are methods that implement one of these features, but to the best of our knowledge, not the combination of both.

The rest of the paper is organized as follows. The related solutions available in the literature are introduced in Sect. 2. Our proposed method is described in Sect. 3. Numerical examples are presented in Sect. 4 to demonstrate the performance of the procedure. Finally, Sect. 5 concludes the paper.

2 Anomaly detection in log sequences

2.1 Grouping of log lines

The great difficulty of log anomaly detection is that the outputs of potentially independent services are multiplexed into a single stream of log lines. It is hopeless to detect anomalies accurately in the raw, multiplexed log stream. The quality of pre-processing has a fundamental impact on the accuracy of the results (“garbage in \(\rightarrow \) garbage out”). It is usually easy to de-multiplex the log stream to get the separate output of the various services.

In some cases it is also possible to identify log lines that belong to the same session. A session (also called group in this paper) has a well-defined beginning and end, and a certain number of log lines between them. E.g., in the HDFS log the log lines corresponding to the same block id can be considered as a group, or in some OpenStack logs the operations on the same virtual machine also make up a group. If such a grouping is possible, then the normal/anomalous rating can be assigned to groups rather than to individual log lines, enabling the better detection of sequence anomalies.
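As an illustration of session-based grouping, the following minimal sketch collects HDFS-style log lines by the block identifier they mention. The regular expression, the function name `group_by_block` and the sample lines are assumptions made for this example, not part of the described method.

```python
import re
from collections import defaultdict

# Assumed shape of HDFS block identifiers, e.g. "blk_-1608999687919862906".
BLOCK_ID = re.compile(r"blk_-?\d+")

def group_by_block(lines):
    """Map each block id to the ordered list of log lines that mention it."""
    groups = defaultdict(list)
    for line in lines:
        # A line may mention several blocks; it joins each matching group once.
        for block_id in dict.fromkeys(BLOCK_ID.findall(line)):
            groups[block_id].append(line)
    return dict(groups)

# Hypothetical, shortened log lines used only to exercise the function.
sample = [
    "Receiving block blk_-1608 src: /10.0.0.1:5000 dest: /10.0.0.2:50010",
    "Receiving block blk_7503 src: /10.0.0.3:5001 dest: /10.0.0.4:50010",
    "PacketResponder 1 for block blk_-1608 terminating",
]
groups = group_by_block(sample)
```

Each value of `groups` is then one session in the sense used in this paper.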

2.2 Availability of labels

Several log anomaly detection methods assume that there is a labeled data set available for training, and formulate the problem as a supervised machine learning problem. In practical applications, however, such a data set cannot be obtained due to the large human effort required. Another issue with this approach is that such models do not react well to new kinds of anomalies that were not seen during the training phase.

In case of unsupervised log anomaly detection, the model can detect rare log lines or unusual sequences, that do not follow the regular pattern. Based on the model’s advice the anomalies can be identified with human aid.

Besides the supervised and unsupervised approaches, there is a third approach that falls in between the two, where models are trained using normal log sequences. Finding training data without anomalies requires a kind of “supervision”; still, in the literature this family of methods is also classified as unsupervised. Compared to the fully supervised methods, these models have the advantage that they do not learn negative samples explicitly, hence they provide a more reliable response to log lines that were not included in the training set.

2.3 Representation of log lines

To feed the log lines into the model, they have to be represented appropriately. There are two dominant solutions for this purpose:

  • Using log keys (also called “log templates”). In this case the log files are pre-processed using a log parser (Drain [1] being a popular choice) that extracts regular expression-like templates from the set of log lines and represents each log line by an integer number (the identifier, or the key, of the log template). The input of the anomaly scoring model is then a sequence of integer numbers.

  • Semantic solutions use text embedding techniques (from natural language processing) to create a numeric vector representing the semantic meaning of a log line. Word2vec [2], its contrastive variant [3] and GloVe [4] are popular choices for this purpose. Most recently, BERT models trained on general English text [5] were also used for log line embedding, being able to handle out-of-vocabulary (OOV) issues at inference time.

In general, template based solutions are simpler and faster (yielding models with fewer parameters), but they cannot take advantage of the semantic content of the log lines and they cannot cope with OOV situations.

2.4 Neural network architectures

In recent years, it became obvious that the deep learning-based solutions – especially sequence models like LSTM and Transformer Encoders – can be used for anomaly detection of log sequences effectively. As for the ideal neural network architecture, there is no consensus. We list four different possibilities below. We are going to show later on that all of them are capable of solving the anomaly detection problem.

Predict-next approach

In this case, a certain number of previous log lines are fed into the sequence model, based on which it predicts the next log line. If the model assigns a low probability to the actual, observed log line, an anomaly is detected. This is the only model discussed in this paper that can be used also when there is no grouping (no sessions) available.

Bi-directional prediction approach

This is the bi-directional extension of the previous model, that is applicable only to grouped log data. In this case, every log line of a group is predicted both based on the previous and based on the following log lines. If the actual log entry does not follow from the future and the past entries, it is considered to be an anomaly.

Masked Language Model based solutions

An MLM is trained to reproduce its input sequence on the output side such that some elements of the input sequence are masked out. The model has to understand the meaning of the sequence in order to reconstruct the missing elements. MLMs require grouped log sequences. The reconstruction loss of a masked out log line indicates how unexpected it is, hence it can be used as an anomaly score.

Auto-encoder based solutions

Log sequence anomalies can be detected based on sequence auto-encoders, too. In this case, the log lines of a group are fed into a sequence encoder, that summarizes the whole sequence into a single vector. Then, at the decoder side the input sequence is reconstructed. The anomaly scores again can be derived from the reconstruction loss.

LSTM vs. Transformer

All the neural network architectures listed above and used in this paper are sequence models. Currently, the most widely applied sequence models are LSTMs (Long Short-Term Memory, [6]), with Transformers [7] gaining popularity quickly. We found that these two models perform similarly, but LSTMs yield slightly better results in all of the cases we studied. The choice of the sequence model and its hyper-parameter tuning are, however, outside the scope of this paper.

All of the above mentioned models are reconstruction based. Their output is a set of log lines: either the next one in the sequence, or the ones masked out at the input, or the ones generated by the decoder of the auto-encoder.

2.5 Related work

One of the first deep learning-based unsupervised methods for log anomaly detection, DeepLog [8], relies on log parsers and follows the predict-next approach. The input of the model consists of the past log keys up to a certain window size (one-hot encoded), the output is the next log key identifier, and the loss function in the training phase is cross-entropy.

To improve performance, LogAnomaly [9] introduces a distributional lexical-contrast embedding method called template2vec to make use of the semantic information of the log lines. The method follows the bi-directional prediction approach: the input of the model is the sequence of template vectors, the output is the next log key, and the cross-entropy loss is used for training.

The BERT models, introduced for various NLP tasks, also inspired several algorithms for log anomaly detection. LogBERT [10] proposes an architecture similar to transformer encoders and introduces an MLM-based approach to find anomalies in logs. The model input is the sequence of log key embeddings (the optimal vectors are learned during the training of the model), and the output is a distribution over all possible log keys. The loss function is the cross-entropy at the masked out positions of the sequence. The LAnoBERT [11] algorithm uses a BERT model, trained on the log file itself, to obtain the vector embeddings of the log lines. The log sequence (consisting of word pieces created by the tokenizer) is the input of the model. The anomaly detection follows the MLM approach, and the loss function is cross-entropy.

It is also worth listing the most notable supervised methods, although the method presented in this paper is unsupervised. In LogRobust [12] the words of log lines are represented by semantic vectorization, and the embedding of a log line is obtained by the TF-IDF based aggregation of the word embeddings. This method uses a bi-directional LSTM model to classify log lines (or group of log lines) as anomalous or normal.

HitAnomaly [13], similarly, relies on semantic vectorization, but instead of domain specific embedding it uses off-the-shelf BERT encoding of words, aggregates them to lines using a transformer, and uses another transformer for creating log sequence embedding vectors. Finally, a standard classification problem is solved with these sequence embeddings as input.

LogSy [14] uses a spherical loss function to produce anomaly scores, and needs abnormal log lines as auxiliary data to move the score of normal samples closer to each other. Its central element is a transformer model, whose input is the sequence of log template identifiers and the output is the anomaly score.

In NeuralLog [15] the log lines are transformed to numeric vectors with an off-the-shelf BERT model, then a transformer is fed by the sequence of these vectors and performs the classification.

For a comprehensive review and comparison of anomaly detection methods for logs, see survey papers [16] and [17].

3 The proposed method

All the unsupervised methods listed in Sect. 2.5 are reconstruction based: they provide the distribution of the expected log line at a given position of the log sequence, and indicate an anomaly if the observed log line has a low probability according to this distribution. A serious flaw of this approach is that it only works if all possible log keys are known at training time. This condition is too restrictive for many practical applications, since log messages that correspond to rare events may not be present in the training set. Furthermore, version changes of services may also introduce new log messages never seen before.

In this paper we present a fully template-less procedure. Instead of relying on log parsers, we represent the log lines based on their BERT embedding, using the off-the-shelf BERT model pre-trained with general English text. This choice makes it possible to exploit the semantic information in the log lines and also eliminates the out-of-vocabulary problem, since all log lines, even those not seen during the training phase, have a valid, fixed size BERT vector representation. The tokenizer of BERT can cope with words that are not part of the standard vocabulary, too. The novelty of the presented procedure is that the output of the neural network model is not the distribution of the expected log line, but the semantic vector representation of the top-K most probable log lines. We show that this approach, together with a new loss function, performs competitively, while having the benefit of handling new kinds of log lines which makes it suitable for online processing of logs.

Not having to know every possible log line at training time makes our method more suitable for practical use. However, the semantic approach and the use of BERT to embed log lines come at a price: our models are larger than template-based ones, resulting in slower training and slightly higher memory consumption. We show through numerical examples that this is not a limitation in practice.

3.1 Overview of the solution

The structure of log files can be very diverse, and pre-processing them for anomaly detection is a difficult problem that is hard to automate. Processing the raw logs usually leads to suboptimal results. A domain expert is needed to examine the structure of the log and split it into multiple sequences wherever possible: into the logs of the separate services (as we do later with the Thunderbird data set), into the logs of separate nodes of the supercomputer (as in the BGL case, discussed later), or into the log lines corresponding to the same HDFS block (as we do with the HDFS data set later). In log analysis, feeding the models only with related log lines is critical for their successful application.

We assume that the lines of the log file are grouped already, using some domain knowledge (e.g., by the block number in case of the HDFS log). If this is not possible, then groups can be formed from overlapping windows consisting of subsequent log lines.
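When no natural grouping exists, the window-based fallback can be sketched as follows. The function name, the `step` parameter and the handling of logs shorter than one window are assumptions of this sketch.

```python
def sliding_windows(lines, size, step=1):
    """Return overlapping windows of `size` consecutive log lines."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not lines:
        return []
    if len(lines) <= size:
        # Assumption: a log shorter than one window forms a single group.
        return [list(lines)]
    return [list(lines[i:i + size])
            for i in range(0, len(lines) - size + 1, step)]

windows = sliding_windows(["a", "b", "c", "d", "e"], size=3)
```

With `step=1` consecutive windows share `size - 1` lines, so every line is seen in several contexts; a larger step reduces the overlap and the number of groups.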

Let us denote the set of groups by \({\mathcal {L}}=\{g^{(1)}, g^{(2)}, \dots \}\), where \(g^{(i)}\) is a vector of log lines corresponding to group i. More formally, we have that \(g^{(i) } = [x_1^{(i)}, \dots , x_{N_i}^{(i)}]\), where \(x_j^{(i)}\) represents the jth log line of group i and \(N_i\) is the number of log lines in that group. Each log line consists of a certain amount of words, \(x_j^{(i)} = [w_{j,1}^{(i)}, \dots , w_{j,L_j^{(i)}}^{(i)}]\), where \(w_{j,m}^{(i)}\) is the textual representation of the mth word in the jth log line of group i.

The aim of the presented procedure is to identify rare groups in \({\mathcal {L}}\), which are possible indications of anomalies. In the training phase the parameters of the model are trained to successfully predict every log line based on other log lines of the group. In the inference phase the model assigns an anomaly score to the given group based on the accuracy of the predictions. The “normal” or “anomaly” label for the group is obtained by thresholding this score.

Fig. 1 Steps of the proposed method

The main steps of the algorithm are shown in Fig. 1. The variable-length input text of the log lines is transformed to a fixed-dimensional numerical embedding vector. The embedding vectors of the log lines are given to a neural network as input, which predicts the embedding vector of each log line using the other log lines of the group. Instead of one, K predictions are provided to improve the robustness of the solution (described later in more detail). If the embedding of the actual log line is close enough to the closest predicted one, the log line is labeled as normal; otherwise it is labeled as an anomaly. The line-wise anomaly scores can be aggregated for the group, if necessary.

3.2 Representation of the log lines

Although this method does not need a log parser, some kind of pre-processing is still necessary. We found that it is beneficial to eliminate the most varying – mainly numerical – attributes from the log lines, as they play no role in sequence anomalies (detecting attribute anomalies is out of the scope of this paper). We detect the most common attributes using a set of regular expressions and replace them with some general text, according to Table 1.

Table 1 Attributes to replace in the pre-processing step
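The replacement step can be sketched with a small list of regular expressions. The patterns and placeholder words below are hypothetical stand-ins chosen for illustration; the attribute list actually used is the one given in Table 1.

```python
import re

# Hypothetical (pattern, placeholder) pairs, not the entries of Table 1.
# Order matters: specific patterns are applied before the generic number rule.
REPLACEMENTS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "IPADDR"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "HEXNUM"),
    (re.compile(r"\b\d+\b"), "NUM"),
]

def mask_attributes(line):
    """Replace highly variable attributes of a log line with generic words."""
    for pattern, placeholder in REPLACEMENTS:
        line = pattern.sub(placeholder, line)
    return line

masked = mask_attributes("Received 4096 bytes from 10.250.19.102:50010 at 0x1f")
```

Replacing attributes with ordinary words rather than deleting them keeps the sentence structure of the log line intact for the subsequent embedding step.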

After replacing the given attributes, the log lines are converted from the textual representation to vectors having semantic information. For this task we use the pre-trained, uncased BERT model of [5] to transform \(x_j^{(i)}\) (the textual representation of the jth log line of group i) to numerical (row) vector \(\hat{x}_j^{(i)}\), such that the outputs of the last hidden layer of the BERT model corresponding to words \(w_{j,m}^{(i)}\) are summed up for \(m=1,\dots ,L_j^{(i)}\). The pre-processing steps are summarized by Fig. 2.

Fig. 2 Pre-processing steps to obtain the vector embedding of log lines

3.3 Supported neural network models for anomaly detection

The sequence of numerical vectors representing the log lines is the input of the model. As listed in Sect. 2.4, several architectures can be used for log anomaly detection (Fig. 3). Our approach is compatible with all of them, and later in Sect. 4 we study the performance of all of them.

Fig. 3 Neural network architectures for log anomaly detection

The input of these models is as follows:

Predict-next

The input is the sequence of numerical vectors \(\hat{x}_1^{(i)},\dots ,\hat{x}_j^{(i)}\), prepended by a special start-of-sequence symbol. After feeding \(\hat{x}_j^{(i)}\) to the model, the expected output is the vector representing \(\hat{x}_{j+1}^{(i)}\), or, at the end of the sequence, the vector corresponding to the end-of-sequence symbol.
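A minimal sketch of how the (history, target) training pairs of the predict-next model can be formed from one group. Strings stand in for the embedding vectors, and the symbols `SOS` and `EOS` are placeholders chosen for this example.

```python
SOS, EOS = "<sos>", "<eos>"  # placeholders; in a real model these are vectors

def predict_next_pairs(group):
    """Return (history, target) pairs for one group of log-line embeddings."""
    inputs = [SOS] + list(group)
    targets = list(group) + [EOS]
    # After feeding inputs[0..j], the expected output is targets[j].
    return [(inputs[: j + 1], targets[j]) for j in range(len(inputs))]

pairs = predict_next_pairs(["x1", "x2", "x3"])
```

A group of N lines yields N + 1 pairs, the last one asking the model to predict the end of the sequence.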

Bi-directional prediction

These models have two inputs: the forward and a reverse sequence of numerical vectors, prepended by the start-of-sequence symbol and appended by the end-of-sequence symbol. The forward and backward models are coupled such that \(\hat{x}_{j}^{(i)}\) is predicted from \(\hat{x}_{1}^{(i)},\dots ,\hat{x}_{j-1}^{(i)}\) by the forward model and it is predicted from \(\hat{x}_{N_i}^{(i)},\dots ,\hat{x}_{j+1}^{(i)}\) by the backward model.
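The two contexts used for a given position can be sketched as below. Positions are 0-based here, strings stand in for the embedding vectors, and the start-of-sequence and end-of-sequence symbols are omitted for brevity; the function name is an assumption.

```python
def bidirectional_contexts(group, j):
    """Contexts used to predict the line at 0-based position j of a group."""
    forward = list(group[:j])              # x_1 ... x_{j-1}, original order
    backward = list(group[j + 1:])[::-1]   # x_N ... x_{j+1}, reversed order
    return forward, backward

fwd, bwd = bidirectional_contexts(["x1", "x2", "x3", "x4", "x5"], 2)
```

Neither context contains the line being predicted, so the model cannot simply copy its input.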

Masked Language Model

The input is the sequence of numerical vectors with 20% of the vectors replaced by a special mask symbol, complemented by the start-of-sequence and end-of-sequence symbols. The model has to predict those \(\hat{x}_{j}^{(i)}\) vectors that were masked out at the input side.
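The masking step can be sketched as follows. Strings stand in for the embedding vectors; the `MASK` placeholder, the rounding rule and the choice to mask at least one position are assumptions of this sketch.

```python
import random

MASK = "<mask>"  # placeholder; in a real model this is a special vector

def mask_sequence(group, ratio=0.2, rng=None):
    """Mask a fraction of the positions; return masked input and targets."""
    if not group:
        return [], {}
    rng = rng or random.Random()
    # Assumption: always mask at least one position, even in short groups.
    n_masked = max(1, round(ratio * len(group)))
    positions = set(rng.sample(range(len(group)), n_masked))
    masked = [MASK if i in positions else x for i, x in enumerate(group)]
    targets = {i: group[i] for i in sorted(positions)}
    return masked, targets

masked, targets = mask_sequence([f"x{i}" for i in range(10)],
                                rng=random.Random(0))
```

Only the positions listed in `targets` contribute to the training loss; the remaining positions serve as context.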

Autoencoder

As before, the input of the model is the whole sequence of numerical vectors \(\hat{x}_1^{(i)},\dots ,\hat{x}_{N_i}^{(i)}\) with the start-of-sequence and end-of-sequence symbols, and the expected output is the same sequence.

3.4 Adding a Top-K layer to the models

All previously published unsupervised log anomaly detection methods use a softmax layer to provide the output of the model: if the probability of the observed log line is lower than a threshold, an anomaly is indicated.

Such a softmax layer is not an option in a fully template-less solution, as the set of possible log lines or words is not known in advance. This is a fundamental difficulty to be solved. Making the models output the numerical vector of the expected/predicted log line is also a bad idea, since log sequences can often continue in a number of different ways. To illustrate the problem, Fig. 4 depicts the embeddings of log lines projected to the two-dimensional plane. Given a previous history, the sequence can continue either with a log line denoted by number #1 in the figure, or with a log line number #2, both being non-anomalous options. To minimize the loss, however, the model gives an output in between these two options, close to vectors that have nothing to do with the two possible options, which leads to bad performance.
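The effect can be made concrete under a simplifying assumption that differs from the setting of this paper: suppose that a single vector z is predicted, that the squared error is minimized, and that the sequence continues with embedding a with probability p and with embedding b otherwise (a, b and p are symbols introduced only for this illustration). The expected loss J(z) and its minimizer \(z^{*}\) are then

$$\begin{aligned} J(z)&= p\,\Vert z-a\Vert ^2 + (1-p)\,\Vert z-b\Vert ^2, \\ \nabla _z J(z)&= 2p\,(z-a) + 2(1-p)\,(z-b) = 0 \quad \Longrightarrow \quad z^{*} = p\,a + (1-p)\,b. \end{aligned}$$

The optimum is thus a point on the segment between the two valid continuations, and it coincides with neither of them whenever \(0<p<1\) and \(a\ne b\).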

Fig. 4 The problem of predicting a single numeric vector

To address this issue, we add an extra layer, called Top-K layer to the aforementioned neural network architectures, enabling the model to emit not only one, but K predictions for a log line simultaneously. This extra layer is just a dense linear layer representing K matrix multiplications, each creating a prediction from the last hidden layer of the model.
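A dependency-free sketch of such a layer is given below: K weight matrices map one hidden vector of size H to K predictions of size E. The random initialization, the omission of bias terms and all names are assumptions of this sketch; in a deep learning framework the same effect can be obtained with a single dense layer of K·E outputs reshaped to \(K\times E\).

```python
import random

def make_top_k_layer(hidden_size, embed_size, k, seed=0):
    """Create K independent weight matrices of shape hidden_size x embed_size."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(embed_size)]
             for _ in range(hidden_size)]
            for _ in range(k)]

def top_k_layer(hidden, weights):
    """Map one hidden vector to K predicted embeddings (a K x E matrix)."""
    return [[sum(h * w_row[e] for h, w_row in zip(hidden, w))
             for e in range(len(w[0]))]
            for w in weights]

weights = make_top_k_layer(hidden_size=4, embed_size=3, k=2)
z = top_k_layer([1.0, 0.0, -1.0, 0.5], weights)  # 2 predictions of size 3
```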

3.5 The top-K loss function

Since our models have a non-standard output, we also need to introduce a new loss function, called top-K loss. The idea is to compare the K predicted semantic vectors of a log line with the semantic vector of the actual log line, and to measure the cosine similarity with the closest prediction.

Assume that size E row vector y holds the semantic vector of the target log line, and the rows of matrix z are the top-K predictions of the model (the size of matrix z is \(K\times E\)). First we calculate the cosine similarity of y with all the predictions, yielding vector u, defined by

$$\begin{aligned} u_k=\frac{y \cdot z_k^T}{{\Vert y\Vert \cdot \Vert z_{k}\Vert }}, \end{aligned}$$
(1)

where \(z_{k}\) is the kth row of matrix z (i.e., the kth prediction). Next, we compute the sum of the predictions weighted by their softmaxed cosine similarity to the target, with the softmax taken over the K components of u, leading to vector w as

$$\begin{aligned} w = \sum _{k=1}^K \text {softmax}(u_k)\cdot z_{k}. \end{aligned}$$
(2)

Finally, the loss \(\ell \) is given by the cosine similarity of this weighted sum and the target vector, according to

$$\begin{aligned} \ell = \frac{y \cdot w^T}{{\Vert y\Vert \cdot \Vert w\Vert }}. \end{aligned}$$
(3)
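Equations (1)–(3) can be traced step by step with the short sketch below, written without any numerical library. It returns the quantity of Eq. (3) as printed, which is a similarity (larger means a better prediction); using, e.g., one minus this value as the quantity to minimize is an assumption of this sketch. Vectors are assumed to be non-zero, and all function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_similarity(y, z):
    """Eq. (3) for target y (size E) and predictions z (K rows of size E)."""
    u = [cosine(y, z_k) for z_k in z]                       # Eq. (1)
    weights = softmax(u)                                    # over the K entries
    w = [sum(wt * z_k[e] for wt, z_k in zip(weights, z))    # Eq. (2)
         for e in range(len(y))]
    return cosine(y, w)                                     # Eq. (3)

y = [1.0, 0.0]
ell = top_k_similarity(y, [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case one prediction matches the target exactly and the other is orthogonal to it. The softmax weighting pulls w towards the matching prediction, so the result is higher than the similarity between y and the plain average of the two predictions.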

Measures other than cosine similarity can also be used in the top-K loss function; nevertheless, cosine similarity is the most common choice in NLP for comparing high-dimensional embeddings. We also performed tests with the mean squared error and obtained slightly inferior results.

3.6 Obtaining the anomaly score

The prediction loss for an individual log line can directly be used as an anomaly score, since higher prediction loss means that based on the history the log line does not follow the expected behavior.

In case of group data it is also beneficial to obtain a single anomaly score to characterize how anomalous the whole group is. If the loss of log line j of group i is denoted by \(\ell _j^{(i)}\), the aggregated anomaly score for group i is calculated by

$$\begin{aligned} \ell ^{(i)} = \max _j \ell _j^{(i)}, \end{aligned}$$
(4)

thus the “worst” line in the group determines the score of the group.

All the architectures considered, except the sequence autoencoder, inherently provide an explanation for the anomaly. The position of the worst prediction helps to localize the anomaly, and the top-K predictions help to identify what kind of log lines were expected instead of the observed one (by listing the log lines with embedding closest to the top-K predictions).
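The aggregation of Eq. (4), the thresholding and the localization of the worst line can be sketched as follows. The function names and the strict "greater than" comparison against the threshold are assumptions of this sketch.

```python
def group_score(line_losses):
    """Eq. (4): the largest line-wise loss determines the group score."""
    return max(line_losses)

def label_group(line_losses, threshold):
    """Label a group by thresholding its aggregated anomaly score."""
    return "anomaly" if group_score(line_losses) > threshold else "normal"

def worst_line(line_losses):
    """0-based position of the line with the highest loss."""
    return max(range(len(line_losses)), key=line_losses.__getitem__)

losses = [0.1, 0.7, 0.3]
label = label_group(losses, threshold=0.5)
position = worst_line(losses)
```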

3.7 Taking advantage of multiple models

In Sect. 3.3 we have introduced four neural network architectures for anomaly detection. It makes sense to use all of them to detect anomalies in logs. There are several viable possibilities to aggregate the results of the models:

  • Taking the maximum of the anomaly scores: In this case those log lines/groups are marked as anomalous that were found to be suspicious by at least one model. Following this route improves the recall of the model.

  • Taking the minimum of the anomaly scores: This way logs are marked as anomalous only if all models agree. This approach improves the precision of the results.
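Both aggregation policies can be sketched in a few lines. The sketch assumes that the scores of the different models are on a comparable scale, which the text does not state; the function name and the toy scores are hypothetical.

```python
def combine_scores(scores_per_model, policy="max"):
    """Combine per-item anomaly scores of several models element-wise."""
    if policy not in ("max", "min"):
        raise ValueError("policy must be 'max' or 'min'")
    reduce_fn = max if policy == "max" else min
    # zip(*...) pairs up the scores that the models gave to the same item.
    return [reduce_fn(scores) for scores in zip(*scores_per_model)]

model_a = [0.1, 0.9, 0.4]
model_b = [0.3, 0.8, 0.1]
recall_oriented = combine_scores([model_a, model_b], policy="max")
precision_oriented = combine_scores([model_a, model_b], policy="min")
```

After thresholding, the "max" combination flags an item as soon as any single model scores it above the threshold, while the "min" combination requires every model to do so.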

4 Numerical examples

To investigate the efficiency of the proposed method, referred to as TeleDAL in the sequel, we present case studies on the HDFS, the BGL and the Thunderbird data sets, which are used as a benchmark for log anomaly detection in the literature.

Our results are compared against two well-known, template-based solutions, DeepLog [8] and LogAnomaly [9]. We have to emphasize that the results presented in this section are not directly comparable to the results published in the literature, due to the following reasons:

  • There are many implementations available for these algorithms with slight differences;

  • The random train-validation-test split can be different, leading to different metrics;

  • TeleDAL is the only regression-based method in this comparison – we believe that this approach (being able to handle log lines that were never seen before) has benefits that make it preferable even if the accuracy is a tiny bit lower (although still up to the level of the state-of-the-art).

4.1 The HDFS data set

This is the most common benchmark data set for log anomaly detection. In fact, this is the only data set that is long enough, has a large number of groups, and has high quality, group-wise anomaly labels available.

The log lines of the HDFS data set can be separated into groups according to the block number identifier. In this case study the log lines of a certain amount of non-anomalous blocks were used as the training set, and a certain amount of blocks served as the validation and the test set. The test set contains lines of both normal and anomalous blocks, according to Table 2.

Table 2 The composition of the training and test sets for the HDFS data set

We have implemented all the architectures presented in Sect. 3.3 in Keras/TensorFlow. The hyper-parameters of the models were not optimized; they were set to yield approximately the same number of trainable parameters, to ensure a fair comparison. We only optimized the number of training epochs on the validation set. (The performance of the methods can be improved by hyper-parameter tuning, but the model behavior can be studied well in this setting, too.) The input of the models is the sequence of size 768 vectors generated by BERT. The next layer is a dense layer without activation that reduces the vector size to 32. The remaining layers (except for the last) are model specific. In the “predict-next” model there is a two-layer LSTM with 64 hidden states; in the “bidirectional prediction” models there are two single-layer LSTMs (a forward and a reverse one); in the masked language model there is a bi-directional LSTM; finally, in the sequence autoencoder both the encoder and the decoder are single-layer LSTMs with 64 hidden states and the size of the bottleneck is 32.

The following methods are involved in the case study:

  • A very simple parser-based baseline, that indicates anomaly if and only if a log sequence contains template identifiers not seen in the training set.

  • DeepLog: it is the classification-based variant of our predict-next model, relying on log parser. Compared to our predict-next model the difference is that it is fed by the template identifiers instead of the semantic vectors, and the model output is the distribution of the possible next template (instead of the semantic vectors of the top-K candidates).

  • LogAnomaly: it is a classification-based method where the neural network has two kinds of input: the log sequence itself (given by embeddings, which we made trainable, contrary to the original paper), and the count vectors of log templates along a sliding window. The output of this model is the distribution of the possible next template.

  • LogBERT: it is also a classification-based method that applies a transformer structure introduced by BERT to detect anomalies in logs. We have used the publicly available implementation of the authors with the default parameters and with the same training, validation and test data used across this numerical example.

  • The results of the four presented models, with top-K output, trained with top-K loss function, for \(K=1,\ldots ,8\).

  • The combined models, as described in Sect. 3.7, with max and min aggregation policy.

For every method and K parameter setting the precision, recall and F1 scores are recorded with the best possible thresholds leading to the highest F1 scores.
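The threshold selection used for reporting can be sketched as a sweep over the observed scores. Labels are booleans with `True` meaning anomalous; flagging items whose score is at least the threshold, and restricting the candidates to the observed scores, are assumptions of this sketch.

```python
def precision_recall_f1(scores, labels, threshold):
    """Metrics when items with score >= threshold are flagged as anomalous."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def best_threshold(scores, labels):
    """Observed score that yields the highest F1 when used as threshold."""
    return max(sorted(set(scores)),
               key=lambda t: precision_recall_f1(scores, labels, t)[2])

scores = [0.1, 0.2, 0.8, 0.9]
labels = [False, False, True, True]
threshold = best_threshold(scores, labels)
```

Choosing the threshold on the test labels in this way gives an upper bound on the achievable F1 score, which makes the methods comparable independently of any particular thresholding rule.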

Table 3 Results for the HDFS data set

The results are summarized in Table 3 and Fig. 5. For simplicity, the table contains the results only for \(K=2,4,6,8\), which is enough for drawing the conclusions. Both the table and the figure confirm that allowing the models to have multiple outputs via the Top-K layer is effective: increasing the K parameter increases the F1 score significantly. With the Top-K layer, F1 scores similar to those of the template-based DeepLog can be achieved, while having the benefits of the template-less operation, that is, these models do not have to know all the possible log templates in advance.

Comparing the four architectures, the good performance of the “Predict-next” approach is worth mentioning. This is the only architecture of the four that does not assume the existence of grouping and does not benefit from future log lines, hence it can process a session-less stream of log lines, too. In the case of the autoencoder-based solution we found that the size of the bottleneck vector is critical: if the bottleneck is not tight enough, the model learns to encode the whole input sequence in this vector in an almost lossless way, irrespective of K. In our experiments we have set it to 8. Since the different lines of the HDFS log cannot be encoded using \(8\times 32\) bits (8 single precision floating point numbers), the model had to learn the relations between the log lines to minimize the loss. In this case increasing K had a positive impact on the F1 score. The MLM model performed very well in this comparison: it achieved high F1 scores even with relatively low K values. In general, increasing K beyond a certain value (\(K=5\) in this particular case) makes all models perform equally well.

Fig. 5 The effect of parameter K in the HDFS data set

Section 3.7 described a way to combine the results of the different models. Using the “maximum” operator on the scores of the models increases the recall (log lines are tagged as anomalous when at least one model thinks so), while the “minimum” operator increases the precision of the combined output (log lines are tagged as anomalous only when all models agree). This behavior is demonstrated on the HDFS data set in Fig. 6 and in Table 3. The practical relevance of combining models lies in the fact that operators typically prefer to get notified only when the presence of an anomaly is almost certain, calling for models with high precision.

Fig. 6 Precision and recall of the basic and combined models

Increasing parameter K also increases the number of parameters of the model, which has an impact on the training time. Figure 7 depicts the execution time of one epoch with various K parameters and three different model architectures (using an NVIDIA GeForce GTX 1660 Super GPU with the batch size set to 128). The MLM architecture is not included in the comparison since it is transformer-based and has significantly higher computational demand. The results confirm the intuition: the training time increases linearly with K. The extra memory demand of the top-K layer is not significant; it is the size \(K\times E\) matrix denoted by z in Sect. 3.5.

Fig. 7 Time of one epoch of training with the HDFS data set

4.2 The BGL data set

The BGL dataset was collected from a BlueGene/L supercomputer system in 2005 [18]. After filtering out the invalid lines, it contains 4630958 entries generated by 69245 nodes. This dataset is labeled but, in contrast to HDFS, the labels are associated with individual log lines rather than with log sequences. The number of anomalous lines is 348091, which represents \(7.5\%\) of the log file.

We perform two experiments with this data set:

  • Finding anomalous nodes: In this experiment the entire log sequence of a node (and not its individual lines) is qualified as anomalous or as normal. Nodes having at least one anomalous log line are tagged as anomalous.

  • Finding anomalous lines: In this case the log file of every node is processed using a sliding window of size 50, and individual lines are tagged as normal or as anomalous. A log line is tagged as normal if, based on the past 50 log lines, the prediction error of the TeleDAL model for that line stays below a threshold (given that the lines in the window are normal, too). It is tagged as an anomaly if the prediction error is above the threshold.
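The two labeling schemes above can be sketched as follows. This is only an illustration: `score_fn` is a hypothetical stand-in for the prediction error of the trained model, and the toy scoring function and the window of size 3 are chosen for readability, not taken from the experiments.

```python
def label_nodes(line_labels_by_node):
    """A node is anomalous if at least one of its log lines is anomalous."""
    return {node: any(labels) for node, labels in line_labels_by_node.items()}

def tag_lines(lines, score_fn, threshold, window=50):
    """Tag every line that has a full history window.

    score_fn(history, line) is a hypothetical callable returning the
    prediction error of the model for `line` given the previous lines.
    Returns {line_index: True if anomalous}.
    """
    tags = {}
    for i in range(window, len(lines)):
        history = lines[i - window:i]
        tags[i] = score_fn(history, lines[i]) > threshold
    return tags

# Toy stand-in for the model: a line is "predictable" if it occurs in its history.
def toy_score(history, line):
    return 0.0 if line in history else 1.0

node_labels = label_nodes({"n1": [0, 0, 1], "n2": [0, 0]})
line_tags = tag_lines(["a", "b", "a", "b", "x", "a"], toy_score,
                      threshold=0.5, window=3)
print(node_labels)
print(line_tags)
```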

For simplicity, all results in this section are obtained with the “Predict-Next” architecture with a 2-layer LSTM having 128 internal states. The threshold of the anomaly score has been determined using the validation set. The results are depicted in Fig. 8 (for finding anomalous nodes) and Fig. 9 (for finding anomalous lines). They confirm the effectiveness of the Top-K layer and the competitiveness of the regression-based approach.
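One possible way to determine such a threshold is sketched below. The selection criterion is an assumption made for illustration (the candidate that maximizes the F1 score on the validation set); the function names and the toy validation data are hypothetical.

```python
import numpy as np

def f1_score(pred, truth):
    """F1 score of boolean predictions against boolean labels."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def pick_threshold(val_scores, val_labels):
    """Return the candidate threshold with the best validation F1 (assumed criterion)."""
    val_scores = np.asarray(val_scores, dtype=float)
    best_t, best_f1 = None, -1.0
    for t in np.unique(val_scores):          # every observed score is a candidate
        f1 = f1_score(val_scores >= t, val_labels)
        if f1 > best_f1:                     # keep the first best candidate
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Toy validation scores and labels (illustration only).
best_t, best_f1 = pick_threshold([0.1, 0.2, 0.8, 0.9, 0.3], [0, 0, 1, 1, 0])
print(best_t, best_f1)
```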

Fig. 8 The effect of parameter K in case of the BGL data set, for anomaly detection of nodes

Comparing the two figures shows that the anomaly detection problem is easier for entire nodes than for log sequences (based on a sliding window): with a high enough K value the F1 score is close to \(100\%\). For log sequences the results are also very close to those of the classification-based methods, while retaining all the benefits of the regression-based approach. Setting the K parameter to 5 (or greater) seems to be an optimal choice, just like for the HDFS data set.

Fig. 9 The effect of parameter K in case of the BGL data set, for anomaly detection of log lines

4.3 The Thunderbird-sshd data set

Thunderbird is an open dataset of logs collected from the Thunderbird supercomputer system at Sandia National Labs (SNL) in Albuquerque [18]. It consists of over 211 million lines covering 244 days of operation. This log is frequently used as a benchmark data set in log analysis research. Its first column is the alert category, due to which it is considered to be a labeled data set. A closer look, however, reveals that the alert tagging is far from perfect. Only four of the 140 different operating system services have alert tags associated with their log lines (kernel, server, pbs_mon and check-disks), while all services sometimes produce error messages (including obvious fatal errors) that are not labeled as alerts. Hence, in this study we do not consider this dataset as labeled, and evaluate the proposed method in a fully unsupervised setting. Due to the lack of labels, the results of the proposed method are evaluated differently in this section: instead of F1 scores we rely on statistical and interpretation-based analysis.

4.3.1 Data preparation

The Thunderbird data set is very rich: it contains logs of many operating system services running on 5864 processors. Performing log analysis on the whole, multiplexed, unfiltered data set does not make sense, since related log lines can be far away from each other, and no machine learning model can be expected to observe sequence anomalies in such complex data. In this study we extract a particular log stream from the whole data set as follows:

  • We keep only the log lines produced by the “worker” nodes, which can be identified based on their names: they start with two lowercase letters followed by some decimal digits. This way we can get rid of nodes serving special purposes, including admin nodes (tbird-admin1, aadmin1, tsqe1, #2116#, etc.), which behave differently and can mislead the machine learning model. As a result of the filtering, the log lines of 4514 worker nodes were kept.

  • We keep only the log lines of a particular service. Without this step, obtaining a useful model is not a realistic expectation. It is possible to train a model with commendable metrics, but the generality of that model would be questionable. There are many services to choose from; we decided to go with the sshd service.

As a result of the filtering, we obtained more than 9 million lines of sshd log messages produced by 4514 worker nodes of the supercomputer, which we call the “thunderbird-sshd” log in the sequel.
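The two filtering steps can be sketched as follows. The sketch assumes that every log line has already been parsed into a (node, service, message) tuple with a normalized service name; this record layout and the example records are hypothetical, and only the naming rule for worker nodes is taken from the description above.

```python
import re

# Worker nodes: two lowercase letters followed by decimal digits, nothing else.
WORKER_RE = re.compile(r"[a-z]{2}[0-9]+")

def is_worker(node):
    """True if the node name follows the worker naming rule."""
    return WORKER_RE.fullmatch(node) is not None

def filter_stream(records, service="sshd"):
    """Keep the records of worker nodes that belong to the given service.

    records: iterable of (node, service, message) tuples (assumed layout).
    """
    return [r for r in records if is_worker(r[0]) and r[1] == service]

# Hypothetical records for illustration.
records = [
    ("an123", "sshd", "session opened"),
    ("an123", "kernel", "cpu throttled"),
    ("tbird-admin1", "sshd", "session opened"),
    ("aadmin1", "sshd", "session closed"),
    ("#2116#", "sshd", "session closed"),
]
kept = filter_stream(records)
print(kept)
```

Using a full match rather than a prefix match matters here: a name such as aadmin1 also starts with two lowercase letters, but is not followed by digits only.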

4.3.2 Evaluating the results

To get rid of the varying and numerical attributes, it is enough to define only three regular expressions:

  • Numbers are replaced by string “number”,

  • IP addresses are replaced by “address”,

  • Identifiers starting and ending with # character are replaced by “name”.

After these replacements the number of unique lines went down to 40, for which the BERT embedding vectors were obtained according to Sect. 3.2. We emphasize that the model is not aware of the number of unique lines: any new log line can be encoded and processed.
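A minimal sketch of such a normalization step is given below. The concrete patterns are assumptions: the exact regular expressions are not listed in this section. Two design choices are worth noting. IP addresses are replaced before plain numbers, otherwise their octets would be rewritten first; and numbers are matched only as whole tokens, so that a digit attached to a word (as in “RSA1”) is left intact.

```python
import re

# Assumed patterns; applied in this order.
RULES = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "address"),  # IPv4 addresses
    (re.compile(r"#[^#\s]+#"), "name"),                       # identifiers like #2116#
    (re.compile(r"\b\d+\b"), "number"),                       # standalone numbers
]

def normalize(line):
    """Replace the varying attributes of a log line by fixed placeholder words."""
    for pattern, placeholder in RULES:
        line = pattern.sub(placeholder, line)
    return line

# Hypothetical example line.
example = normalize("Failed password for #12# from 10.0.0.1 port 4242 ssh2")
print(example)
print(normalize("RSA1 key generation succeeded"))
```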

Sessions and grouping, as present in the HDFS logs, are not available in the thunderbird-sshd log, which is a continuous stream of log lines. Consequently, of the four neural network architectures introduced in Sect. 3.3, only the predict-next architecture turned out to be applicable. For this study we use a neural network model consisting of an embedding layer, which reduces the huge BERT-originated vectors to size 32, and a 2-layer LSTM with 64 states. The last layer is the Top-K layer (Sect. 3.4) with \(K=12\). The input of the model is the sequence of semantic vectors of the previous 30 sshd log lines, and the target is the next one. The log sequences of different nodes are not multiplexed: the historical log lines are always taken from the same node of the system. The training set consists of the log lines of 2500 nodes, while the logs of the remaining nodes serve as the test set. The model was trained for 15 epochs with the loss function defined in Sect. 3.5.
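The construction of the training samples can be sketched as follows: a minimal illustration, assuming that the semantic vectors of every node are available as an array in time order. The function names and the toy dimensions are hypothetical; only the history length of 30 and the per-node separation come from the description above.

```python
import numpy as np

def make_windows(vectors, history=30):
    """Build (input, target) pairs from the line vectors of ONE node.

    vectors: array of shape (n_lines, dim), in time order.
    Returns X of shape (n, history, dim) and y of shape (n, dim).
    """
    vectors = np.asarray(vectors)
    n = len(vectors) - history
    if n <= 0:  # node with too few lines: no sample
        dim = vectors.shape[1]
        return np.empty((0, history, dim)), np.empty((0, dim))
    X = np.stack([vectors[i:i + history] for i in range(n)])
    y = vectors[history:]
    return X, y

def build_dataset(vectors_by_node, history=30):
    """Concatenate the per-node windows; a window never spans two nodes."""
    pairs = [make_windows(v, history) for v in vectors_by_node.values()]
    return (np.concatenate([p[0] for p in pairs]),
            np.concatenate([p[1] for p in pairs]))

# Toy data: two nodes with 35 and 32 lines of 4-dimensional vectors.
toy = {"an1": np.arange(35 * 4).reshape(35, 4),
       "an2": np.zeros((32, 4))}
X, y = build_dataset(toy)
print(X.shape, y.shape)
```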

Fig. 10 Log lines with highest Top-K loss

Since the labels are not used, evaluating the model results is difficult. Figure 10 lists the log lines that (in at least one of their occurrences) had the highest Top-K cosine similarity loss. Even without domain knowledge it is clear that all of them are error messages.
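To make the quantity behind this ranking concrete, the following sketch computes a per-line loss of this kind. It is not the definition of Sect. 3.5: it assumes that the Top-K layer emits K candidate vectors of size E (the matrix z) and that the loss is one minus the best cosine similarity between the candidates and the vector of the observed line. The function name and the toy vectors are hypothetical.

```python
import numpy as np

def top_k_cosine_loss(z, target):
    """Assumed form of the loss: 1 - max cosine similarity over K candidates.

    z: array of shape (K, E), the candidate vectors of the Top-K layer.
    target: array of shape (E,), the semantic vector of the observed line.
    """
    z = np.asarray(z, dtype=float)
    target = np.asarray(target, dtype=float)
    sims = z @ target / (np.linalg.norm(z, axis=1) * np.linalg.norm(target))
    return 1.0 - float(sims.max())

# Toy example with K = 2 candidates in E = 2 dimensions.
z = [[1.0, 0.0], [0.0, 1.0]]
loss_hit = top_k_cosine_loss(z, [0.0, 2.0])    # one candidate points at the target
loss_miss = top_k_cosine_loss(z, [-1.0, 0.0])  # no candidate is close
print(loss_hit, loss_miss)
```

Under this assumption a line receives a low loss as soon as any one of the K candidates matches it, which is what allows several different continuations of the same context to be treated as normal.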

Fig. 11 Mean top-K losses vs. the occurrences

It is a natural expectation in anomaly detection that the model assigns a higher loss to less frequently occurring log lines. According to Fig. 11, the result of the model satisfies this expectation. The points in the plot represent the mean losses of the 40 different log lines, while the size of each point is proportional to the variance. (Depending on its context, the same log line can be considered either normal or anomalous.) In the figure, log lines occurring more frequently have a lower loss.
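The per-line statistics behind such a plot can be obtained as in the following sketch. It assumes that the evaluation produced one (normalized line, loss) pair per occurrence; the function name and the toy values are hypothetical, and the population variance is used as one possible choice.

```python
from collections import defaultdict
from statistics import mean, pvariance

def loss_stats(events):
    """Aggregate losses per unique log line.

    events: iterable of (line, loss) pairs, one per occurrence.
    Returns {line: (occurrences, mean loss, variance of the loss)}.
    """
    grouped = defaultdict(list)
    for line, loss in events:
        grouped[line].append(loss)
    return {line: (len(v), mean(v), pvariance(v)) for line, v in grouped.items()}

# Toy losses for illustration.
stats = loss_stats([("session opened", 0.1),
                    ("session opened", 0.3),
                    ("fatal error", 0.9)])
for line, (count, avg, var) in stats.items():
    print(f"{line}: n={count}, mean={avg:.2f}, var={var:.3f}")
```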

Fig. 12 Top-K losses of various log lines

Figure 12 tells more about the variability of the losses. It is clear that rare lines always have higher losses. It can also be observed that frequently occurring log lines might have a high loss when their context is unusual.

Next, we investigate the anomaly scores assigned to the message ‘RSA1 key generation succeeded’, which is a status message (message #15). It does not look like an error indication, yet the model loss can be high in some rare cases. According to Fig. 13, the message before ‘RSA1 key generation succeeded’ is ‘RSA key generation succeeded’ in \(84\%\) of the log sequences, and the model loss is low in these cases. Only in \(0.1\%\) of the cases is ‘RSA1 key generation succeeded’ preceded by the message ‘succeeded’, accompanied by a high loss. This behavior confirms that the model is able to take the context around a message into consideration, hence sequence anomalies are also detected properly. Investigating the log file reveals that these messages are generated during the boot process of a node. It could be important feedback to the operator that a very small fraction of the nodes do not generate all kinds of keys generated by the majority of the nodes.
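This kind of context analysis can be reproduced with a few lines of code. The sketch below assumes that the log of one node is available as a time-ordered list of (message, loss) pairs; the function name, the short message labels and the loss values are hypothetical.

```python
from collections import defaultdict

def preceding_stats(sequence, target):
    """Relate the loss of `target` to the message that precedes it.

    sequence: time-ordered list of (message, loss) pairs of one node.
    Returns {preceding message: (share of occurrences, mean loss of target)}.
    """
    losses = defaultdict(list)
    for (prev_msg, _), (msg, loss) in zip(sequence, sequence[1:]):
        if msg == target:
            losses[prev_msg].append(loss)
    total = sum(len(v) for v in losses.values())
    return {prev: (len(v) / total, sum(v) / len(v)) for prev, v in losses.items()}

# Toy sequence: "T" usually follows "A" (low loss) and rarely "B" (high loss).
seq = [("A", 0.1), ("T", 0.2), ("A", 0.1), ("T", 0.4), ("B", 0.1), ("T", 0.9)]
context = preceding_stats(seq, "T")
for prev, (share, avg) in context.items():
    print(f"preceded by {prev}: share={share:.2f}, mean loss={avg:.2f}")
```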

Fig. 13 Relation between the anomaly score and the preceding message for message #15

5 Conclusion

Formulating anomaly detection in log files as a regression problem is not a trivial task. This approach has an important benefit: by getting rid of the log parser, the model can also cope with unseen log lines, which is an essential requirement in online applications. On the other hand, regression in machine learning is always more involved than classification, even more so when the model output is a vector with semantic meaning. Introducing the Top-K layer and the corresponding Top-K loss function made it possible to obtain state-of-the-art results in the field of log anomaly detection without relying on template parsers. We also investigated the performance of the proposed approach with four different neural network architectures, and provided a solution for combining their results to obtain higher precision or recall.