1 Introduction

Cyber attacks have become a prevalent and severe threat against society, including its infrastructures, economy, and citizens' privacy. According to a 2017 report by Symantec, cyber attacks in 2016 included multi-million dollar virtual bank heists as well as overt attempts to disrupt the U.S. election process; according to another 2017 report by NetDiligence, the average cyber breach costs $394K, and companies with revenues greater than $2B suffer an average breach cost of $3.2M.

Given the severe consequences of cyber attacks, cyber defense capabilities need to be substantially improved. One approach to improving cyber defense is to forecast or predict cyber attacks, similar to how weather forecasting has benefited society in mitigating natural hazards. The prediction capability can guide defenders to achieve a cost-effective, if not optimal, allocation of defense resources [1–4]. For example, the defender may need to allocate more resources for deep packet inspection [5] to accommodate a predicted high cyber attack rate. Moreover, researchers have studied how to use a Bayesian method to predict the increase or decrease of cyber attacks [6], how to use a hidden Markov model to predict the increase or decrease of Bot agents [7], how to use a seasonal ARIMA model to predict cyber attacks [8], how to use a FARIMA model to predict cyber attack rates when the time series data exhibit long-range dependence [1], how to use a FARIMA+GARCH model to achieve even more accurate predictions by further accommodating the extreme values exhibited by the time series data [9], how to use a marked point process to model extreme cyber attack rates while considering both the magnitudes and the inter-arrival times of the time series [10], how to use a vine copula model to quantify the effectiveness of cyber defense early-warning mechanisms [11], and how to use a vine copula model to predict multivariate time series of cybersecurity attacks while accommodating the high-dimensional dependence between the time series [12]. We refer the reader to two recent surveys on the use of statistical methods in cyber incident and attack detection and prediction [13, 14].

A particular kind of cyber threat data is the time series of cyber attacks observed by cyber defense instruments known as honeypots, which passively monitor incoming Internet connections. Such datasets exhibit rich phenomena, including long-range dependence (LRD) and high nonlinearity [1, 9].

It is worth mentioning that the usefulness of prediction capabilities in the context of cyber defense ultimately depends on the degree of prediction accuracy, a situation similar to the usefulness of weather forecasting. Cyber defense practitioners should be made fully aware of this factor. Although prediction accuracy can be assured by leveraging large amounts of data, as is indeed the case for weather forecasting, collecting large amounts of cyber attack data may be challenging. Nevertheless, understanding the usefulness of prediction capabilities in the context of cyber security is a problem of high importance that has yet to be thoroughly investigated.

1.1 Our contributions

The contribution of the present paper is two-fold. First, we propose a novel bi-directional recurrent neural network with long short-term memory framework, or BRNN-LSTM for short, to accommodate the statistical properties exhibited by cyber attack rate time series data. The framework gives users the flexibility to choose the number of LSTM layers that are incorporated into the BRNN structure. Second, we use real-world cyber attack rate datasets to show that BRNN-LSTM can achieve a substantially higher prediction accuracy than statistical prediction models, including the one proposed in the literature [9] and the ones studied in the present paper for comparison purposes.

1.2 Related work

Statistical methods have been widely used in data-driven cyber security research, such as intrusion detection [15–18]. However, deep learning has not received the due amount of attention in the context of cyber security [13, 14]. This is true despite the fact that deep learning has been tremendously successful in other application domains [19–21] and has started to be employed in the cyber security domain, including adversarial malware detection [22, 23] and vulnerability detection [24, 25].

In the context of vulnerability detection, supervised machine learning methods, including logistic regression, neural networks, and random forests, have been proposed for this purpose [26, 27]. These models are trained using large-scale vulnerability data. However, unlike deep learning models that can work directly on raw data, these models require the data to be preprocessed to extract features. There are also other approaches to detecting vulnerabilities. For example, an architectural approach to pinpointing memory-based vulnerabilities has been proposed in [28]; it consists of an online attack detector and an offline vulnerability locator that are linked by a record-and-replay mechanism. Specifically, it records the execution history of a program while simultaneously monitoring its execution for attacks. If an attack is detected by the online detector, the execution history is replayed by the offline locator to locate the vulnerability that is being exploited. For more discussion on vulnerability detection, please refer to [24, 25, 27, 28] and the references therein.

In the context of time series analytics, various statistical approaches have been developed. For example, ARIMA, Holt-Winters, and GARCH models are among the most popular statistical approaches for analyzing time series data [1, 8, 9, 29]. Other statistical models, such as Gaussian mixture models, hidden Markov models, and state space models, have been developed to analyze time series data with uncertainties and/or unobservable factors [17, 30]. Recently, it was discovered that deep learning is very effective in time series prediction. For example, deep learning has been employed to predict financial data, which contains noise and volatility [21]. In the context of transportation applications, deep learning has been used to predict passenger demand for on-demand ride services [31]. In particular, it has been found that deep learning can achieve a higher accuracy than statistical time series models (e.g., ARMA and Holt-Winters models) in predicting transportation traffic [32–34]. It is further argued in [32] that a particular class of deep learning models, known as feed-forward neural networks, are the best predictors when taking into account both prediction precision and model complexity. In [34], the prediction performance of the deep learning approach and that of the statistical ARIMA approach are compared against each other; it is shown that the deep learning approach can significantly (by more than 80%) reduce the error rate when compared with the ARIMA models.

The rest of the paper is organized as follows. In the “Preliminaries” section, we review some concepts of deep learning that are related to the deep learning framework we will propose in this paper. In the “Framework” section, we present the framework we propose for predicting cyber attack rates. In the “Empirical study” section, we present our experiments on applying the framework to a dataset of cyber attack rates and compare the resulting prediction accuracy with the accuracy of the statistical approach reported in the literature. In the “Conclusion” section, we conclude the present paper with future research directions.

In order to improve the readability of the paper, we summarize the main notations used in the present paper in Table 1.

Table 1 Summary of notations

2 Preliminaries

In this section, we review three deep learning concepts that are related to the present work: recurrent neural network (RNN), bi-directional RNN, and long short-term memory (LSTM).

2.1 RNN

Figure 1 highlights the standard RNN structure, which updates its hidden layers according to the information received from the input layer and the activation from the previous forward propagation. Compared with feed-forward neural networks, RNNs can accommodate the temporal information embedded in the sequence of input data (see, e.g., [35, 36]). Intuitively, this explains why RNNs are suitable for natural language processing and time series analysis (see, e.g., [36–39]). This observation motivates us to leverage the RNN as a starting point in designing our framework, which will be presented later.

Fig. 1 A standard unfolded RNN structure at time t

As highlighted in Fig. 1, the computing process at each time step of an RNN is

$$h_{t}=\sigma(W_{x} \cdot x_{t}+W_{h} \cdot h_{t-1}+b_{h}),$$

where \(W_x \in \mathbb{R}^{m \times n}\) is the weight matrix connecting the input layer and the hidden layer, with \(m\) being the size of the input and \(n\) being the size of the hidden layer, \(W_h \in \mathbb{R}^{n \times n}\) is the weight matrix between two consecutive hidden states \(h_{t-1}\) and \(h_t\), \(b_h\) is the bias vector of the hidden layer, and \(\sigma\) is the activation function used to generate the hidden state. As a result, the network output can be described by

$$y_{t}=\sigma(W_{y} \cdot h_{t}+b_{y}),$$

where \(W_y \in \mathbb{R}^{n}\) is the weight vector connecting the hidden layer and the output layer, \(b_y\) is the bias of the output layer, and \(\sigma\) is the activation function of the output layer.
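To make these computations concrete, the following minimal numpy sketch (an illustration, not code from the original work) implements one RNN time step; the use of tanh for the hidden state, sigmoid for the output, and the particular matrix shapes are assumptions chosen so that the products are well defined.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev, W_x, W_h, b_h, W_y, b_y):
    """One RNN time step: update the hidden state, then emit an output.

    x_t: input vector of size m; h_prev: previous hidden state of size n.
    Shapes here: W_x is (n, m), W_h is (n, n), W_y is (n,), so that the
    matrix-vector products below are valid.
    """
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)  # hidden-state update
    y_t = sigmoid(W_y @ h_t + b_y)                 # network output
    return h_t, y_t
```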

2.2 Bi-directional RNN

A uni-directional RNN is an RNN that takes only one sequence as input. A uni-directional RNN cannot take full advantage of the input data, in the sense that it learns information only from the "past." In order to overcome this issue, the concept of the bi-directional RNN was introduced, enabling an RNN to learn from both the past and the future [40]. Technically speaking, a bi-directional RNN is essentially two uni-directional RNNs combined together, where one learns from the "past" and the other learns from the "future"; the outputs of the two uni-directional RNNs are merged to compute the final output.
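The following sketch illustrates this construction on top of the `rnn_step` idea above; concatenating the two per-step hidden states is one common merge (summation is another), and all names here are illustrative.

```python
import numpy as np

def brnn_forward(xs, params_f, params_b, h0_f, h0_b):
    """Bi-directional pass over a sequence xs of input vectors.

    params_f and params_b are (W_x, W_h, b_h) triples for the forward
    and backward directions; outputs are merged by concatenation.
    """
    def scan(seq, params, h0):
        W_x, W_h, b_h = params
        h, states = h0, []
        for x in seq:
            h = np.tanh(W_x @ x + W_h @ h + b_h)
            states.append(h)
        return states

    h_fwd = scan(xs, params_f, h0_f)              # learns from the "past"
    h_bwd = scan(xs[::-1], params_b, h0_b)[::-1]  # learns from the "future"
    return [np.concatenate([hf, hb]) for hf, hb in zip(h_fwd, h_bwd)]
```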

2.3 LSTM

The training process of RNNs can suffer from the gradient vanishing/exploding problem [41], which can be alleviated by another RNN structure known as LSTM [42]. An LSTM is composed of units called memory blocks, each of which contains some memory cells with self-connections, which store (or remember) the temporal state of the network, and some special multiplicative units called gates. Each memory block contains an input gate, which controls the flow of input activations into the memory cell; an output gate, which controls the output flow of cell activations into the rest of the network; and a forget gate, which controls what information is retained in or discarded from the cell state.

As highlighted in Fig. 2, the activation at step \(t\), namely \(h_t\), is computed based on four pieces of gate input, namely the information gate \(i_t\), the forget gate \(f_t\), the output gate \(o_t\), and the cell gate \(c_t\) [43]. Specifically, the information gate input at step \(t\) is

$$i_{t} = \sigma\left(U_{i}\cdot h_{t-1}+W_{i}\cdot \mathbf{x}_{t}+b_{i}\right), $$

where \(\sigma(\cdot)\) is a sigmoid activation function, \(b_i\) is the bias, \(\mathbf{x}_t\) is the input vector at step \(t\), and \(W_i\) and \(U_i\) are weight matrices. The forget gate input and the output gate input are respectively computed as

$$\begin{array}{@{}rcl@{}} f_{t} &=& \sigma\left(U_{f}\cdot h_{t-1}+W_{f}\cdot \mathbf{x}_{t}+b_{f}\right), \\ o_{t} &=& \sigma\left(U_{o}\cdot h_{t-1}+W_{o}\cdot \mathbf{x}_{t}+b_{o}\right), \end{array} $$
Fig. 2 LSTM block at step \(t\) with information gate \(i_t\), forget gate \(f_t\), output gate \(o_t\), and cell gate \(c_t\)

where \(U_f\), \(U_o\), \(W_f\), and \(W_o\) are weight matrices, and \(b_f\) and \(b_o\) are biases. The cell gate input is computed as

$$c_{t} = f_{t}\cdot c_{t-1} + i_{t}\cdot k_{t} \quad\text{with}\quad k_{t} = \tanh\left(U_{k}\cdot h_{t-1}+W_{k}\cdot \mathbf{x}_{t}+b_{k}\right), $$

where \(\tanh\) is the hyperbolic tangent function, \(U_k\) and \(W_k\) are weight matrices, and \(b_k\) is the bias. The activation at step \(t\) is computed as

$$ h_{t} = o_{t} \cdot \tanh(c_{t}). $$

Intuitively, the key component of the LSTM is the cell state, which flows throughout the network. Given inputs \(h_{t-1}\) and \(\mathbf{x}_t\), the forget gate \(f_t\) decides what information to discard from the previous cell state \(c_{t-1}\): it takes \(h_{t-1}\) and \(\mathbf{x}_t\) as input and uses the sigmoid activation function \(\sigma(\cdot)\) to generate a number between 0 and 1 for each value in the cell state \(c_{t-1}\). The information gate \(i_t\) determines what new information is stored in the current cell state \(c_t\) via two steps: a set of candidate values \(k_t\) is computed based on the current input; the information gate \(i_t\) then uses \(\sigma(\cdot)\) to decide which candidate values will be stored in \(c_t\). The cell gate then computes \(c_t\). Finally, \(h_t\) is computed based on \(c_t\) and \(o_t\), where the latter is the information from the output gate.
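Putting the gate equations together, one LSTM step can be sketched in numpy as follows; the dictionary `p` is merely an illustrative packaging of the weight matrices and biases defined above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds the weights U_*, W_* and biases b_*."""
    i_t = sigmoid(p["U_i"] @ h_prev + p["W_i"] @ x_t + p["b_i"])  # information gate
    f_t = sigmoid(p["U_f"] @ h_prev + p["W_f"] @ x_t + p["b_f"])  # forget gate
    o_t = sigmoid(p["U_o"] @ h_prev + p["W_o"] @ x_t + p["b_o"])  # output gate
    k_t = np.tanh(p["U_k"] @ h_prev + p["W_k"] @ x_t + p["b_k"])  # candidate values
    c_t = f_t * c_prev + i_t * k_t  # cell state: forget old, store new
    h_t = o_t * np.tanh(c_t)        # activation at step t
    return h_t, c_t
```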

3 The bi-directional RNN with LSTM framework

The framework we propose for predicting cyber attack rates is called bi-directional RNN with LSTM, or BRNN-LSTM for short; it incorporates LSTM layers into a bi-directional RNN. BRNN-LSTM has three components: an input layer, a number of hidden layers, and an output layer, where each hidden layer is replaced with an LSTM cell. The same sequential input, denoted by \(\mathbf{x}_t=\{x_0,\ldots,x_t\}\), is passed to the two states of the LSTM layers, the forward state and the backward state; there is no connection between the two states. The outputs from the two states are then combined to predict a target value at each step. Figure 3 highlights the structure of BRNN-LSTM with three LSTM layers.

Fig. 3 BRNN-LSTM with three LSTM layers
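As an illustration, a stacked bidirectional LSTM of this shape can be assembled in a few lines with Keras; the layer width, lookback window, regularization placement, and scalar output below are our assumptions for the sketch, not the authors' exact configuration.

```python
import tensorflow as tf

def build_brnn_lstm(n_layers=3, units=32, window=10, lam=1e-3):
    """Hypothetical BRNN-LSTM: stacked bidirectional LSTM layers
    followed by a scalar prediction; sizes are illustrative."""
    reg = tf.keras.regularizers.l2(lam)  # L2 penalty, cf. Eq. (1) below
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(window, 1)))  # univariate attack-rate window
    for k in range(n_layers):
        model.add(tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units,
                                 return_sequences=(k < n_layers - 1),
                                 kernel_regularizer=reg,
                                 recurrent_regularizer=reg)))
    model.add(tf.keras.layers.Dense(1))         # predicted attack rate
    model.compile(optimizer="sgd", loss="mse")  # mini-batch gradient descent on MSE
    return model
```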

For training a BRNN-LSTM model, we propose using the following objective function:

$$ J = \frac{1}{2m} \cdot \sum\limits^{m}_{i = 1}(\hat{y}_{i}-y_{i})^{2}+\frac{\lambda}{2} \left(||\mathbf{W}||_{2}^{2}+||\mathbf{U}||_{2}^{2}\right), $$
(1)

where \(m\) is the size of the input, \(\hat{y}_i\) and \(y_i\) are respectively the output of the network and the observed value at step \(i\), \(\mathbf{W}=\{W_f,W_i,W_k,W_o\}\) and \(\mathbf{U}=\{U_f,U_i,U_k,U_o\}\) are the collections of weight matrices, \(||\cdot||_2^2\) denotes the squared \(L_2\) norm of a weight matrix, and \(\lambda\) is a user-defined penalty parameter. Note that the second term in Eq. (1) is a penalty term for avoiding overfitting. The optimization problem is defined as

$$\Theta^{*}=\arg\min_{\boldsymbol{\Theta}} J,$$

where \(\Theta=(\mathbf{W},\mathbf{U})\) denotes the model parameters; the problem can be solved using the gradient descent method [42, 44].
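The objective itself is straightforward to transcribe; the sketch below assumes the matrices in \(\mathbf{W}\) and \(\mathbf{U}\) are collected in a single list, and uses \(\lambda=0.001\), the value adopted later in the "Empirical study" section.

```python
import numpy as np

def objective(y_hat, y, weights, lam=1e-3):
    """Regularized loss of Eq. (1): half the MSE plus an L2 penalty
    on all weight matrices in W and U."""
    m = len(y)
    mse_term = np.sum((np.asarray(y_hat) - np.asarray(y)) ** 2) / (2 * m)
    l2_term = (lam / 2) * sum(np.sum(W ** 2) for W in weights)
    return mse_term + l2_term
```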

4 Empirical study

4.1 Accuracy metrics

Let \((y_1,\ldots,y_N)\) be the observed values and \((\hat{y}_1,\ldots,\hat{y}_N)\) the predicted values. In order to evaluate the accuracy of the BRNN-LSTM framework, we propose using the following widely used metrics [1, 9, 45]; a minimal implementation is sketched after the list.

  • Mean square error (MSE): \(\text{MSE}=\sum_{i=1}^{N} \left(y_{i}-\hat y_{i}\right)^{2}/N\).

  • Mean absolute deviation (MAD): \(\text{MAD}=\sum_{i=1}^{N} \left|y_{i}-\hat y_{i}\right|/N\).

  • Percent mean absolute deviation (PMAD): \(\text{PMAD}=\sum_{i=1}^{N} \left|y_{i}-\hat y_{i}\right|/\sum_{i=1}^{N} |y_{i}|\).

  • Mean absolute percentage error (MAPE): \(\text{MAPE}=\sum_{i=1}^{N} \left|(y_{i}-\hat y_{i})/y_{i}\right|/N\).
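These metrics are straightforward to compute; a minimal numpy implementation (assuming nonzero observed values, which MAPE requires) is given below.

```python
import numpy as np

def accuracy_metrics(y, y_hat):
    """MSE, MAD, PMAD, and MAPE as defined above."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err = y - y_hat
    return {
        "MSE":  np.mean(err ** 2),
        "MAD":  np.mean(np.abs(err)),
        "PMAD": np.sum(np.abs(err)) / np.sum(np.abs(y)),
        "MAPE": np.mean(np.abs(err / y)),  # requires y_i != 0
    }
```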

4.2 Data collection

The dataset we analyze is the same as the one analyzed in [1]. The dataset was collected by a low-interaction honeypot consisting of 166 consecutive IP addresses during five periods of time between 2010 and 2011. These five periods span 1,123, 421, 1,375, 528, and 1,920 h, respectively, each of which is represented by a separate dataset. The honeypot runs the following four honeypot programs: Dionaea, Mwcollector, Amun, and Nepenthes [46], which run vulnerable services such as SMB (with the Microsoft Windows Server Service Buffer Overflow vulnerability MS06-040 and the Workstation Service vulnerability MS06-070), NetBIOS, HTTP, MySQL, and SSH. A honeypot computer runs multiple honeypot programs, each of which monitors (i.e., is associated with) one IP address. A dedicated computer collects the raw network traffic coming to the honeypot as pcap files. Honeypot-captured data are treated as cyber attacks because no legitimate services are associated with the honeypot computers. We refer to [1] for more details about the honeypot instrument.

4.3 Data preprocessing

As in [1] and many other analyses, we treat flows (rather than packets) as attacks, noting that flows can be based on the TCP or UDP protocol. A TCP flow is uniquely identified by an attacker's IP address, the port used by the attacker to wage the attack, a victim IP address (belonging to the honeypot), and the port of the victim IP address under attack. An unfinished TCP handshake is also treated as a flow or attack, because the failure may be attributed to the connection being dropped when the port in question is busy. Also as in [1], the preprocessing contains the following steps. First, we disregard the cyber attacks waged against non-production (i.e., unassigned) ports, namely any ports not associated with the honeypot programs, because these TCP connections are often dropped. Since low-interaction honeypot programs do not collect adequate traffic information that would allow us to determine specific attacks, we only consider the attack rate, i.e., the number of attacks (rather than specific types of attacks). Second, the following two widely used parameters [47] are used to preprocess network traffic flows that do not end with the FIN flag (meaning that these flows are terminated unsafely) or the RST flag (meaning that these flows are terminated unnaturally): 60 s for the flow timeout time (meaning that an attack or flow expires after being idle for 60 s) and 300 s for the flow lifetime (meaning that an attack or flow does not span more than 5 min, or 300 s).
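As an illustration of the timeout/lifetime rules, the following sketch groups time-ordered packet records into flows; the record format, key structure, and function name are hypothetical, since the paper does not publish its preprocessing code.

```python
def packets_to_flows(packets, timeout=60, lifetime=300):
    """Group packet records into flows keyed by the 4-tuple
    (attacker IP, attacker port, victim IP, victim port).

    packets: iterable of (timestamp, key) pairs sorted by timestamp
    (a hypothetical record format). A flow expires after 60 s of
    inactivity or once its total duration exceeds 300 s.
    """
    flows, active = [], {}  # active: key -> (first_seen, last_seen)
    for ts, key in packets:
        if key in active:
            first, last = active[key]
            if ts - last > timeout or ts - first > lifetime:
                flows.append((key, first, last))  # close the expired flow
                active[key] = (ts, ts)            # and start a new one
            else:
                active[key] = (first, ts)         # extend the current flow
        else:
            active[key] = (ts, ts)
    flows.extend((k, first, last) for k, (first, last) in active.items())
    return flows
```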

For each period or dataset, the data is represented by \(\{(t,x_t)\}\) for \(t=0,1,2,\ldots\), where \(x_t\) is the number of attacks (i.e., the attack rate) observed by the honeypot at time \(t\). Unlike [1], we further preprocess the derived attack rate time series by normalizing the attack rates into the interval (0,1]. Then, small data batches (periods) are selected based on a pre-defined mini-batch size. For prediction purposes, we split each time series into an in-sample part (for model training) and an out-of-sample part (for prediction). As in [1], we set the last 120 h of each period as the out-of-sample part for evaluating prediction accuracy.
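A minimal sketch of this preprocessing follows; dividing by the maximum is one way to map the (positive) counts into (0,1], as the paper does not spell out the exact normalization, and the lookback window used to build training examples is a placeholder.

```python
import numpy as np

def prepare_series(x, out_of_sample=120, window=10):
    """Normalize attack rates and split a series into in-sample
    windows (X, y) and an out-of-sample tail for evaluation."""
    x = np.asarray(x, dtype=float)
    x_norm = x / x.max()  # lies in (0, 1] when all counts are positive
    train, test = x_norm[:-out_of_sample], x_norm[-out_of_sample:]
    # sliding windows: predict the next value from the previous `window` values
    X = np.stack([train[i:i + window] for i in range(len(train) - window)])
    y = train[window:]
    return X, y, test
```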

4.4 Model training and selection

In the training process, we use the mini-batch gradient descent method to minimize the objective function described in Eq. (1). We use 10,000 iterations to train a network and set the penalty parameter λ = 0.001, because other values do not lead to significantly better results. For each dataset, we use Algorithm 1 to compute the fitted values with varying model parameters. We select the model that achieves the minimum MSE.

Table 2 describes the selected model and its MSE for each dataset. We observe that the selected models for different datasets may use a different batch size r and a different number l of LSTM layers. For datasets I, IV, and V, the selected batch size is 20; for datasets II and III, the selected batch size is 30 and 40, respectively. As for the number of LSTM layers, datasets I and IV prefer 4 layers, datasets II and V prefer 2 layers, and dataset III prefers 3 layers.

Table 2 Parameters (r,l) of selected model and MSE for each dataset

Figure 4 plots the fit of the selected model for each dataset. We observe that the selected models achieve satisfactory fitting accuracy. In particular, the extreme values are fitted well in every dataset.

Fig. 4 BRNN-LSTM fitting results of cyber attack rates in the five datasets (black line: observed values; red circles: fitted values)

4.5 Prediction accuracy

We use Algorithm 2 to predict the cyber attack rates corresponding to the out-of-sample parts, which allows us to calculate the prediction accuracy.

Table 3 describes the prediction results in terms of the accuracy metrics mentioned above. Based on the PMAD and MAPE metrics, BRNN-LSTM achieves a remarkable prediction accuracy for datasets I, II, III, and V, because the prediction errors are less than 5%. However, for dataset IV, the prediction accuracy in the PMAD metric is around 17% and in the MAPE metric around 27%. Fortunately, BRNN-LSTM can easily be calibrated to improve its prediction accuracy via a rolling approach, as follows. For period IV, we re-estimate the model parameters in Θ via Algorithm 1 after observing 20 more data points; the corresponding prediction accuracy, indicated by "IV*" in Table 3, is much better than the original prediction accuracy. For example, the rolling approach reduces the PMAD metric to 10% and the MAPE metric to 13%.
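The rolling scheme can be sketched as follows, under an assumed model interface (a `fit` function returning an object with a one-step `predict` method); in the paper itself, the re-estimation is performed by Algorithm 1.

```python
def rolling_predict(series, fit, horizon=120, refit_every=20):
    """One-step-ahead prediction over the out-of-sample tail,
    re-estimating the model after every `refit_every` new points."""
    history = list(series[:-horizon])
    model, preds = fit(history), []
    for k, obs in enumerate(series[-horizon:]):
        if k > 0 and k % refit_every == 0:
            model = fit(history)           # re-estimate the parameters in Theta
        preds.append(model.predict(history))
        history.append(obs)                # reveal the newly observed value
    return preds
```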

Table 3 Parameters of selected models and prediction accuracy metrics of these selected models, where IV* indicates the rolling approach for dataset IV

Figure 5 plots the prediction results. We observe that the predicted values match the observed values well, but some observed values are still missed by BRNN-LSTM. For example, for dataset III, the extreme value is missed and some observed values are over-predicted. Nevertheless, we conclude that the prediction accuracy is satisfactory.

Fig. 5 Prediction accuracy of BRNN-LSTM (black line: observed values; red circles: predicted values)

4.6 Model comparisons

In order to further evaluate the prediction accuracy of the proposed framework, we now compare it with other popular models.

4.6.1 ARIMA

The first model we consider (as a benchmark) is the AutoRegressive Integrated Moving Average or ARIMA (p,d,q), which is perhaps the most well-known model in time series analysis [29, 30]. The ARIMA model is described as

$$\phi(B)(1-B)^{d} Y_{t}=\theta(B) e_{t}, $$

where \(B\) is the backshift operator, and \(\phi(B)\) and \(\theta(B)\) are respectively the AR and MA characteristic polynomials evaluated at \(B\). In order to select the ARIMA model for prediction purposes, we use the AIC criterion while allowing the orders \(p\) and \(q\) to vary from 0 to 5 and \(d\) to vary from 0 to 2.
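This selection procedure amounts to a grid search over (p, d, q) scored by AIC; the sketch below uses statsmodels for illustration and is not the authors' implementation.

```python
import itertools
from statsmodels.tsa.arima.model import ARIMA

def select_arima(y, max_p=5, max_d=2, max_q=5):
    """Grid-search ARIMA(p, d, q) by AIC over the ranges in the text."""
    best_aic, best_fit = float("inf"), None
    for p, d, q in itertools.product(range(max_p + 1),
                                     range(max_d + 1),
                                     range(max_q + 1)):
        try:
            fit = ARIMA(y, order=(p, d, q)).fit()
        except Exception:  # some orders may fail to converge
            continue
        if fit.aic < best_aic:
            best_aic, best_fit = fit.aic, fit
    return best_fit
```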

4.6.2 ARMA+GARCH

The second model we consider further incorporates the Generalized AutoRegressive Conditional Heteroscedastic or GARCH model, which is widely used in financial time series applications. We use GARCH(1,1) to model the conditional variance and the ARMA model to accommodate the conditional mean. This leads to the following ARMA+GARCH model:

$$Y_{t}=\mathrm{E}(Y_{t}|\mathfrak{F}_{t-1})+\epsilon_{t}, $$

where \(\mathrm{E}(\cdot|\cdot)\) is the conditional expectation function, \(\mathfrak{F}_{t-1}\) is the historical information up to time \(t-1\), and \(\epsilon_t\) is the innovation of the time series. Since the mean part is modeled as ARMA(\(p\),\(q\)), the model can be rewritten as

$$ Y_{t}= \mu+\sum\limits_{k=1}^{p} \phi_{k} Y_{t-k} +\sum\limits_{l=1}^{q} \theta_{l} \epsilon_{t-l} +\epsilon_{t}, $$
(2)

where \(\epsilon_{t}=\sigma_{t} Z_{t}\) with \(Z_t\) being i.i.d. innovations. For the standard GARCH(1,1) model, we have

$$ \sigma_{t}^{2}=w+ \alpha_{1} \epsilon^{2}_{t-1}+ \beta_{1} \sigma^{2}_{t-1}, $$
(3)

where \(\sigma^{2}_{t}\) is the conditional variance and \(w\) is the intercept. After some preliminary analysis, we set the order of the ARMA part to (1,1), as a higher order does not provide significantly better predictions.
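One practical way to estimate such a model is the two-step sketch below: fit the ARMA(1,1) mean first, then fit a GARCH(1,1) to its residuals. This is an approximation; joint maximum-likelihood estimation of Eqs. (2)-(3), as offered by dedicated packages, is the more rigorous route.

```python
from statsmodels.tsa.arima.model import ARIMA
from arch import arch_model

def fit_arma_garch(y):
    """Two-step ARMA(1,1)+GARCH(1,1) sketch for Eqs. (2)-(3)."""
    mean_fit = ARIMA(y, order=(1, 0, 1)).fit()  # ARMA(1,1) conditional mean
    resid = mean_fit.resid                      # innovations epsilon_t
    vol_fit = arch_model(resid, mean="Zero",
                         vol="GARCH", p=1, q=1).fit(disp="off")
    return mean_fit, vol_fit
```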

4.6.3 Hybrid model

The third model we consider is based on the recently developed hybrid approach, which is a two-step procedure [48, 49]. The hybrid model first extracts the linear relationship using an ARIMA model and then uses a nonlinear approach to capture the nonlinear relationship. The nonlinear step can be regarded as a prediction of the error term. The resulting hybrid model is written as

$$Y_{t}=L_{t}+N_{t}, $$

where \(L_t\) is the linear part and \(N_t\) is the nonlinear part. Since \(L_t\) is modeled by an ARIMA model, the residuals at time \(t\) are

$$e_{t}=Y_{t}-\hat Y_{t},$$

where \(\hat Y_{t}\) is the fitted value. The residuals are modeled by a nonlinear model, which utilizes the lag information. We consider the following three types of hybrid models:

$$\begin{aligned} \text{H1}: \quad N_{t}&=f(e_{t-1},e_{t-2},\ldots,e_{t-n})+\epsilon_{t}, \\ \text{H2}: \quad N_{t}&=f(e_{t-1},e_{t-2},\ldots,e_{t-n},y_{t-1},y_{t-2},\ldots,y_{t-m})+\epsilon_{t},\\ \text{H3}: \quad N_{t}&=f(y_{t-1},y_{t-2},\ldots,y_{t-n})+\epsilon_{t}, \end{aligned}$$

where \(\epsilon_t\) is the random error at time \(t\) and \(f\) is a nonlinear function. For the nonlinear function \(f\), we consider the following three popular machine learning approaches [50]: random forest (RF) [49], support vector machine (SVM) [51], and artificial neural network (ANN) [48, 52].

In order to achieve the best prediction accuracy, we examine a number of models. For the linear ARIMA(\(p\),\(d\),\(q\)) part, we use the AIC criterion to select models in the training process, where \(p\) and \(q\) vary from 0 to 5 and \(d\) varies from 0 to 1. For the nonlinear model, we vary the lag parameter from 1 to 12. All of the models are trained using 10-fold cross-validation. For RF, we set the number of trees to 1,000; for SVM, we consider the following kernel functions: linear, polynomial, radial basis, and sigmoid; for ANN, we set the number of hidden layers to one while varying the number of hidden nodes from 1 to 10.
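As one concrete instance, an H1-type hybrid with an ARIMA linear part and a random-forest nonlinear part might be fitted as sketched below; the order (0,1,2) and 4 lags mirror the configuration later selected for dataset IV, and the code is illustrative rather than the authors' implementation.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.ensemble import RandomForestRegressor

def fit_hybrid_h1(y, order=(0, 1, 2), n_lags=4, n_trees=1000):
    """H1 hybrid: ARIMA captures the linear part L_t; a random forest
    predicts the residual (nonlinear) part N_t from its own lags."""
    linear = ARIMA(y, order=order).fit()
    e = np.asarray(linear.resid)
    # lagged residual features e_{t-1}, ..., e_{t-n} -> target e_t
    X = np.stack([e[i:i + n_lags] for i in range(len(e) - n_lags)])
    target = e[n_lags:]
    nonlinear = RandomForestRegressor(n_estimators=n_trees).fit(X, target)
    return linear, nonlinear  # forecast = ARIMA forecast + RF residual forecast
```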

4.6.4 Comparison

We select the model with the highest prediction accuracy in terms of the MSE metric, computed from the predicted values and the out-of-sample data. For dataset I, the best prediction model is ARIMA(2,1,1)+ANN+H3 with 5 lags and 8 hidden nodes. For dataset II, the best prediction model is ARIMA(3,1,1)+"linear SVM"+H2 with 6 lags. For dataset III, the best prediction model is ARIMA(3,0,1)+"radial SVM"+H3 with 8 lags. For dataset IV, the best prediction model is ARIMA(0,1,2)+"radial SVM"+H1 with 4 lags. For dataset V, the best prediction model is ARIMA+"radial SVM"+H3 with 7 lags.

Table 4 summarizes the one-step ahead rolling prediction accuracy. Considering the MSE metric, we observe that the ARIMA model has the worst prediction accuracy for datasets I–IV, and the hybrid model outperforms the ARMA+GARCH model for every dataset; we also observe that the ARIMA model has the smallest MSE for dataset V. Considering the MAD metric, we observe that the hybrid model outperforms the other two models for datasets I, III, and IV, but the ARMA+GARCH model outperforms the other two models for dataset II; we also observe that the ARIMA model has the smallest MAD for dataset V. Considering the PMAD and MAPE metrics, we observe that the hybrid model outperforms the other two models for datasets I, III, IV, and V, and the ARMA+GARCH model is slightly better than the hybrid model for dataset II; we also observe that all of the models have the worst prediction accuracy for datasets IV and V, which coincides with the conclusion drawn in [9], namely, that the PMADs of the one-step ahead rolling prediction of the FARIMA+GARCH model are respectively 0.138, 0.121, 0.140, 0.339, and 0.378 for the five datasets. By comparing Tables 3 and 4, we draw the following insight:

Table 4 Prediction accuracy of the selected model with respect to each dataset

Insight 1

The BRNN-LSTM framework achieves a higher prediction accuracy than the FARIMA+GARCH model proposed in [9] and the ARIMA, ARMA+GARCH, and hybrid models considered above.

5 Conclusion

We proposed a BRNN-LSTM framework for predicting cyber attack rates. The framework can accommodate complex phenomena exhibited by the datasets, including long-range dependence and high nonlinearity. Using five real-world datasets, we showed that the framework significantly outperforms the other prediction approaches in terms of prediction accuracy, which confirms that LSTM cells can indeed accommodate the long-memory behavior of cyber attack rates. Among these five datasets, we found that only dataset IV requires re-training the model in order to achieve a better prediction accuracy. We compared the prediction accuracy of BRNN-LSTM with that of other prediction approaches, which use rolling predictions (i.e., re-building the prediction model after observing a new value). We hope the present work will inspire more research in deploying deep learning for prediction tasks in the cybersecurity domain.