Introduction

With ongoing technical advances, data streams are now produced at large scale. Typical sources and applications include sensor networks, spam filtering models, traffic management, and intrusion prediction [1]. A data stream \(S\) is potentially unbounded, and its sequential instances arrive at high speed. The major challenge in data stream learning is concept drift: the concept underlying the data changes dynamically over time. Concept drift is common in real-time applications. For instance, in recommender systems (RS), user priorities may change with trends, finances, and various other external factors. Likewise, climate detection models change with seasonal changes in the environment. Such changes tend to degrade classification performance. Hence, a classifier must be able to detect these changes and adapt to them. The main aim of this work is to develop a classifier learning module that mines streaming data effectively in dynamic environments.

Concept drift can be classified by speed into sudden and gradual drift, as shown in Fig. 1. Sudden concept drift is a massive change between the underlying class distribution and the incoming samples within a short time. Gradual concept drift takes longer and appears as a slowly growing difference between the underlying class distributions of previous and new instances. Regardless of the type of change, a learner has to be able to observe and track it. In real-time data streams, earlier concepts can also reappear in the future, which gives a distinct type of drift called recurring concepts. For instance, the news a user wants to read may change quickly: people may have different interests on weekends, mornings, and evenings, and a user may explore astrology articles at the new year and economic articles each quarter. In such cases, classifiers that were employed in the past may be applicable again in the future. However, traditional works on drift prediction ignore this phenomenon and treat each recurring concept as a new one. Drift prediction captures the changes in data streams and updates the prediction approach to maintain high accuracy.

Fig. 1 Types of concept drift: a sudden, b gradual, and c recurrent

Overview of concept drift

Concept drift occurs when the target concept changes within a limited time period. Assume two target concepts A and B and a sequence of samples \(I = \left\{ i_{1}, i_{2}, \ldots, i_{n} \right\}\). Prior to instance \(i_{d}\), the target concept does not change and remains A. After a further \(\Delta x\) instances, the concept is stable again as a different concept B. Hence, the concept drifts between samples \(i_{d + 1}\) and \(i_{d + \Delta x}\), replacing target concept A with B. Depending on the speed of the drift (\(\Delta x\)), the change is classed as gradual when the transition between the two concepts is slow, while in abrupt drift the change occurs suddenly.

Concept drift models fall into three categories: window-based, weight-based, and ensembles of classification models. Window-based methods select samples from a dynamic sliding window, while weight-based methods weight the samples and remove older ones according to their weights. Ensemble classification builds several classification models and combines them into a final, effective classifier. Concept drift handling methods are also classified by the number of samples considered in each training step: an online approach updates the classifier after every instance, while a batch approach waits to receive a large number of instances before starting the learning process. A batch learner receives the data stream and divides it into batches. The following strategies are used for dealing with a stream of batches. Full-memory: the learner trains on all previous samples (batches). No-memory: the learner trains on the current batch only. Window-of-fixed-size n: the learner trains on the n most recent batches. Here, a window-based model with a fixed window size (n = 10) is applied.
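The window-of-fixed-size strategy can be illustrated with a short sketch. The class name `BatchWindow` and the toy batches are hypothetical and only serve the illustration; the window size of 10 follows the text.

```python
from collections import deque


class BatchWindow:
    """Keeps only the n most recent batches (window-of-fixed-size strategy)."""

    def __init__(self, n=10):
        # A deque with maxlen drops the oldest batch automatically.
        self.batches = deque(maxlen=n)

    def add(self, batch):
        self.batches.append(list(batch))

    def training_set(self):
        # Flatten the retained batches into one list of samples.
        return [sample for batch in self.batches for sample in batch]


window = BatchWindow(n=10)
for b in range(12):                       # 12 toy batches of 3 samples each
    window.add([(b, k) for k in range(3)])

print(len(window.batches))                # 10
print(window.training_set()[0])           # (2, 0): batches 0 and 1 were dropped
```

Setting `n=1` gives the no-memory strategy, and an unbounded `deque` (no `maxlen`) gives the full-memory strategy.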

Problem formulation

Assume the input data stream is gathered from \(n\) sources \(\left( {\text{So}}_{i} \right)\), denoted \({\text{So}}_{1}, {\text{So}}_{2}, {\text{So}}_{3}, \ldots, {\text{So}}_{n}\). A source \(i\) produces \(k\) streams \(\left( {\text{So}}_{ik} \right)\), i.e., \({\text{So}}_{i1}, {\text{So}}_{i2}, \ldots, {\text{So}}_{ik}\). The samples from these sources make up the complete streaming data, \(\bigcup_{i} {\text{So}}_{i} = {\text{So}}\). The central task of the data preprocessing method is to determine the size of the reservoir \(S_{\text{R}}\) for the stream data \({\text{So}}\) from the \(n\) sources. Two factors are significant when determining a statistically sound reservoir size for the complete stream data. The degree of disparity in stream data is the difference in the number of samples contributed by each source. A maximum degree of disparity leads to the minimum confidence interval that is possible for estimating the correct value [2].

$$ \left| S_{\text{R}} \right| = \frac{N}{1 + Ne^{2}}, $$
(1)

where, in Eq. (1), \(\left| S_{\text{R}} \right|\) is the overall sample size, \(N\) is the overall population, and \(e\) is the confidence interval. For low confidence intervals, the data sampling method selects a larger number of instances; otherwise, a smaller number of samples is sufficient to represent the complete stream data. Once sampling is applied, stream data classification is treated as a two-class problem. Assume an online ensemble classifier \(\Theta\) that receives a new instance \(x_{t}\) at time \(t\) and predicts the class label \(y_{t}^{\prime}\). After the prediction is made, the classifier receives the desired label \(y_{t}\) of \(x_{t}\). Both the predicted and the desired label take values in \(\left\{ 1, -1 \right\}\). The result of the ensemble classifier \(\Theta\) falls into one of four classes, namely,

  1. True positive if \(y_{t} = y_{t}^{\prime } = 1\)

  2. True negative if \(y_{t} = y_{t}^{\prime } = - 1\)

  3. False positive if \(y_{t} = - 1;y_{t}^{\prime } = 1\)

  4. False negative if \(y_{t} = 1;y_{t}^{\prime } = - 1\)

Based on the above measures, the ensemble classification accuracy is evaluated for minority and majority class instances. The imbalance factor is quantified by the probability of occurrence of the minority class. Because samples are unevenly distributed between the majority and minority classes, classifier performance degrades.
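As a small worked illustration of this section, the sketch below evaluates the reservoir size of Eq. (1) and assigns one of the four outcomes to a prediction. The function names, the rounding up to a whole sample, and the numeric values (N = 1000, e = 0.05) are assumptions made for the example and are not taken from the experiments.

```python
import math


def reservoir_size(population, e):
    """Eq. (1): |S_R| = N / (1 + N e^2); rounding up is an assumption."""
    return math.ceil(population / (1 + population * e ** 2))


def outcome(y_true, y_pred):
    """Classify one prediction with labels in {1, -1} into TP, TN, FP or FN."""
    if y_true == y_pred:
        return "TP" if y_true == 1 else "TN"
    return "FP" if y_pred == 1 else "FN"


# Illustrative values only: N = 1000 samples, e = 0.05.
print(reservoir_size(1000, 0.05))   # 1000 / (1 + 2.5) = 285.7..., so 286
print(outcome(1, 1), outcome(-1, -1), outcome(-1, 1), outcome(1, -1))
# TP TN FP FN
```

A smaller `e` increases the reservoir size, which matches the statement that low confidence intervals require more instances.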

Paper contributions

Learning from data streams (incremental learning) has attracted significant attention from the research community owing to its open issues and real-time applications. Concept drift detection is needed when changes in the data distribution make the current prediction method inaccurate. A stream data classifier with no concept drift adaptation is not suitable for classifying imbalanced class distributions. Therefore, this paper designs a novel class imbalance with concept drift detection (CIDD) model using Adadelta optimizer-based deep neural networks (ADODNN), named the CIDD-ADODNN model, to classify highly imbalanced data streams. The proposed model uses the adaptive synthetic (ADASYN) technique to handle class-imbalanced data. In addition, the adaptive sliding window (ADWIN) technique is applied to recognize concept drift in the streaming data. Finally, the ADODNN model is used for classification. To validate the classification results of the CIDD-ADODNN model, three streaming datasets are used, namely the intrusion detection (NSL KDDCup) dataset, the Spam dataset, and the Chess dataset.

In short, the contributions of the paper are as follows:

  • Develop a new CIDD-ADODNN model to classify highly imbalanced data streams.

  • Employ the ADASYN technique for handling class-imbalanced data and the ADWIN technique for recognizing concept drift in the streaming data.

  • Utilize the ADODNN model for the classification process.

  • Validate the performance of the CIDD-ADODNN model on three streaming datasets.

Literature survey

Big data streaming domains mostly suffer from problems such as class imbalance and concept drift. Classical sampling models address these problems with two kinds of modules: resampling and similarity methodologies. Resampling is one of the effective schemes at the data level. Some resampling approaches manage the data distribution by applying deterministic frameworks [3]. Notable approaches select instances from the continuously incoming data stream by sampling with or without replacement. Sampling with replacement is applied when a fixed sample size is required, while sampling without replacement can be used in other applications. The former does not ensure sample adequacy without redundancy, and the latter is not applicable to sub-streams that follow diverse patterns.

In Wu et al. [4], the Dynamic Feature Group Weighting framework with Importance Sampling (DFGW-IS) aims to resolve the issues of concept drift and class imbalance. The weighted ensemble is trained on feature groups that are extracted randomly. It assumes that the minority classes remain the same; however, minority class instances in the previous window may be dissimilar to those in the present window. Additionally, solutions that address irregular class distribution by reusing past samples do not carry over well to concept drift. The sampling approach of Cervellera and Macciò [5] uses recursive binary partitioning of the input samples and selects instances that represent the entire stream. Its greedy optimality and explicit error bound help in managing problems related to concept drift.

The adaptive sampling approach [6] for imbalanced data streams relies on repeatable and scalable prediction approaches. A predictive model is built directly if the data are only slightly imbalanced. If the data are heavily imbalanced, a data scan is triggered to collect enough minority instances. The major limitations of this model are that it requires an accurately sized reservoir and does not consider worst-case optimality. To overcome these problems, stream sampling and continuous random sampling make use of overlap independence. By integrating density and distance metrics, DENDIS [7] retains the semantic coherence of the sample.

The G-means Update Ensemble (GUE) in [8] tries to resolve the aforementioned issues. It employs oversampling to manage the imbalanced class distribution and applies weighting frameworks to handle concept drift. However, a static threshold measure is not suitable for resolving the imbalanced class distribution. The Gradual Resampling Ensemble (GRE) method was developed by Ren et al. [9] to overcome these problems. It applies a resampling scheme to previously received minority class instances to amplify the present minority class. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used to identify disjuncts and eliminate their influence on the similarity estimation, which helps GRE adapt to new samples. Similarly, efficient learning from nonstationary imbalanced data streams was proposed in Meenakshi et al. [10]. It tries to limit misclassified samples in two-class problems. It forms several blocks of chunks; training is done per chunk, while testing is performed by the classification model. However, it experiences severe problems with multiclass classification.

In Iwashita et al. [11], spiking neural networks are introduced to learn data streams online. The major objective of this work is to reduce the neuron repository size by exploiting data reduction models and the learning capability of compressed neurons. The Knowledge-Maximized Ensemble (KME) in [12] unifies online and chunk-based ensemble classification models to resolve different concept drift problems. The use of unsupervised learning techniques and stored recurrent models increases the knowledge available for stream data mining (DM) and, as a result, improves the accuracy of data classification. Though several works exist in the literature, the classification of drifting streams is considerably affected by class-imbalanced data. Sampling approaches are commonly employed to process the continuously incoming data stream with an adequate sample count. The chosen samples form a statistical inference that supports imbalanced class distributions. A stream data classifier with no concept drift adaptation is not suitable for classifying imbalanced class distributions.

The proposed CIDD-ADODNN model

The working principle of the presented CIDD-ADODNN model is depicted in Fig. 2. First, data preprocessing transforms the raw streaming data into a compatible format for further processing. Next, the ADASYN technique is applied to handle class imbalance. Then, a drift detection technique called ADWIN is employed to detect the existence of concept drift. Finally, the ADODNN model is applied to determine the class label of the streaming data; it incorporates the ADO to tune the hyperparameters of the DNN model.

Fig. 2 Working process of CIDD-ADODNN model

Data preprocessing

At the initial stage, the raw streaming data are preprocessed in three steps: format conversion, data transformation, and chunk generation. First, the online streaming data in any raw format are converted into the required .csv format. Second, data transformation converts the categorical values in the data to numerical values. Third, the streaming dataset, whatever its size, is divided into a number of chunks for further processing.
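The second and third preprocessing steps can be sketched as follows. This is a minimal illustration under stated assumptions: the integer-code encoding, the helper names `encode_categoricals` and `chunks`, the chunk size, and the toy rows are all hypothetical, since the text does not specify how categorical values are mapped or how large the chunks are.

```python
def encode_categoricals(rows):
    """Replace every non-numeric value by an integer code, assigned per column."""
    codes = {}        # column index -> {category: code}
    encoded = []
    for row in rows:
        out = []
        for col, value in enumerate(row):
            try:
                out.append(float(value))
            except ValueError:
                mapping = codes.setdefault(col, {})
                # A new category gets the next free code in its column.
                out.append(float(mapping.setdefault(value, len(mapping))))
        encoded.append(out)
    return encoded


def chunks(rows, size):
    """Yield consecutive chunks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Toy rows as they might be read from a .csv file (all fields are strings).
raw = [["0", "tcp", "1.5"], ["1", "udp", "2"], ["2", "tcp", "3"]]
numeric = encode_categoricals(raw)
print(numeric)                                   # [[0.0, 0.0, 1.5], [1.0, 1.0, 2.0], [2.0, 0.0, 3.0]]
print([len(c) for c in chunks(numeric, 2)])      # [2, 1]
```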

ADASYN based class imbalance data handling

The ADASYN model receives the preprocessed data as input and handles the class imbalance. It uses a weighted distribution over the minority class instances according to their level of learning difficulty [13], and generates synthetic instances for the minority class based on this distribution. ADASYN follows the success of synthetic approaches for learning from imbalanced data sets such as the synthetic minority oversampling technique (SMOTE), SMOTEBoost, and DataBoost-IM. Its objective is two-fold: reducing the bias and learning adaptively. The algorithm for the two-class classification problem is defined below:

Input: Training data set \(D_{\text{tr}}\) with \(m\) instances \(\left\{ x_{i}, y_{i} \right\}\), \(i = 1, \ldots, m\), in which \(x_{i}\) is a sample in the \(n\)-dimensional feature space \(X\) and \(y_{i} \in Y = \left\{ 1, -1 \right\}\) is the class identity label associated with \(x_{i}\). Define \(m_{\text{s}}\) and \(m_{\text{l}}\) as the number of minority class samples and the number of majority class samples, respectively. Hence, \(m_{\text{s}} \le m_{\text{l}}\) and \(m_{\text{s}} + m_{\text{l}} = m\).

Procedure

(1) Estimate the degree of class imbalance:

$$ d = m_{{\text{s}}} /m_{{\text{l}}} , $$
(2)

where \(d \in (0, 1]\).

(2) When \(d < d_{\text{th}}\) (where \(d_{\text{th}}\) is a preset threshold for the maximum tolerated degree of class imbalance ratio):

(a) Estimate the number of synthetic data samples that have to be generated for the minority class:

$$ G = \left( {m_{{\text{l}}} - m_{{\text{s}}} } \right){\text{~}} \times \beta , $$
(3)

where \(\beta \in [0, 1]\) is a parameter that specifies the desired balance level after the synthetic data are generated. \(\beta = 1\) means that a completely balanced data set is obtained after the generation process.

(b) For each example \(x_{i}\) in the minority class, find the \(K\) nearest neighbors (kNN) based on Euclidean distance in the \(n\)-dimensional space and compute the ratio \(r_{i}\) defined as:

$$ r_{i} = \Delta_{i} / K, \quad i = 1, \ldots, m_{\text{s}}, $$
(4)

where \(\Delta_{i}\) is the number of samples among the kNN of \(x_{i}\) that belong to the majority class; hence \(r_{i} \in \left[ 0, 1 \right]\).

(c) Normalize \(r_{i}\) according to \(\hat{r}_{i} = r_{i} / \sum_{i = 1}^{m_{\text{s}}} r_{i}\), so that \(\hat{r}_{i}\) is a density distribution \(\left( \sum_{i} \hat{r}_{i} = 1 \right)\).

(d) Estimate the number of synthetic data samples to be generated for each minority sample \(x_{i}\):

$$ g_{i} = \hat{r}_{i} \times G, $$
(5)

where \(G\) is the total number of synthetic data samples to be generated for the minority class, as defined in Eq. (3).

(e) For each minority class data sample \(x_{i}\), generate \(g_{i}\) synthetic data samples according to the following steps.

Loop from 1 to \(g_{i}\):

(i) Randomly select one minority data sample, \(x_{zi}\), from the kNN of \(x_{i}\).

(ii) Produce the synthetic data instance:

$$ s_{i} = x_{i} + \left( {x_{{zi}} - x_{i} } \right) \times \lambda ~, $$
(6)

where \(\left( x_{zi} - x_{i} \right)\) is the difference vector in the \(n\)-dimensional space, and \(\lambda\) is a random number, \(\lambda \in \left[ 0, 1 \right]\).

End Loop.
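The procedure above can be sketched in NumPy as follows. This is an illustrative sketch, not the implementation used in the experiments. The function name `adasyn`, the default values of `K` and `d_th`, the rounding of \(g_{i}\) to an integer, and the choice to draw \(x_{zi}\) from the nearest minority neighbours are assumptions; the minority class is assumed to be labelled +1.

```python
import numpy as np


def adasyn(X, y, beta=1.0, K=5, d_th=0.75, seed=0):
    """Return synthetic minority samples following steps (1) to (2e)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == 1)          # minority assumed labelled +1
    X_min = X[min_idx]
    m_s, m_l = len(min_idx), int(np.sum(y == -1))
    empty = np.empty((0, X.shape[1]))
    if m_s < 2 or m_l == 0 or m_s / m_l >= d_th:   # Eq. (2): d = m_s / m_l
        return empty
    G = (m_l - m_s) * beta                         # Eq. (3)

    # Distances from every minority sample to every sample; exclude itself.
    dist = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    dist[np.arange(m_s), min_idx] = np.inf
    k_all = min(K, len(X) - 1)
    knn = np.argsort(dist, axis=1)[:, :k_all]
    r = np.sum(y[knn] == -1, axis=1) / k_all       # Eq. (4)
    if r.sum() == 0:                               # no minority sample is "hard"
        return empty
    g = np.rint(r / r.sum() * G).astype(int)       # step (c) and Eq. (5)

    # Assumption: x_zi is drawn from the nearest minority neighbours of x_i.
    k_min = min(K, m_s - 1)
    knn_min = np.argsort(dist[:, min_idx], axis=1)[:, :k_min]
    synthetic = []
    for i in range(m_s):
        for _ in range(g[i]):
            x_z = X_min[rng.choice(knn_min[i])]
            lam = rng.random()                     # lambda in [0, 1)
            synthetic.append(X_min[i] + (x_z - X_min[i]) * lam)   # Eq. (6)
    return np.array(synthetic).reshape(-1, X.shape[1])


# Toy data: 20 majority and 5 minority points in two dimensions.
demo_rng = np.random.default_rng(1)
X_demo = np.vstack([demo_rng.normal(0, 1, (20, 2)), demo_rng.normal(1, 1, (5, 2))])
y_demo = np.array([-1] * 20 + [1] * 5)
S = adasyn(X_demo, y_demo)
print(S.shape)
```

Because each \(g_{i}\) is rounded separately, the number of generated samples can differ slightly from \(G\). Minority samples surrounded by more majority neighbours receive a larger share of the synthetic data, which is the adaptive part of the method.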

ADWIN-based drift detection

Applying the ADASYN model balances the dataset, after which drift detection is carried out using the ADWIN technique [14]. In this study, a window-based approach with a fixed window size (n = 10) is employed for drift detection.

Bifet [15] presented the ADWIN technique, which is suitable for data streams with sudden drift. It maintains a sliding window \(W\) of the most recently read samples. The main principle of ADWIN is as follows: when two large enough sub-windows of \(W\) exhibit distinct enough averages, the expected values are concluded to differ and the older portion of the window is dropped. The statistical hypothesis is: “the average \(\mu _{t}\) remains constant in \(W\) with confidence \(\delta\)”. The pseudo-code of ADWIN is shown in Algorithm 1. The key part of the algorithm is the definition of \(\varepsilon _{\text{cut}}\) and the test in which it is used. Let \(n\) be the size of \(W\), and let \(n_{0}\) and \(n_{1}\) be the sizes of \(W_{0}\) and \(W_{1}\), so that \(n = n_{0} + n_{1}\). Let \(\mu _{\hat{W}_{0}}\) and \(\mu _{\hat{W}_{1}}\) be the averages of the values in \(W_{0}\) and \(W_{1}\), and \(\mu _{W_{0}}\) and \(\mu _{W_{1}}\) their expected values. The value of \(\varepsilon _{\text{cut}}\) is then given by:

$$ \varepsilon _{\text{cut}} = \sqrt{\frac{1}{2m} \cdot \ln \frac{4}{\delta ^{\prime}}}, $$
(7)

where \(m = \frac{1}{{1/n_{0} + 1/n_{1} }}\), and \(\delta ^{\prime} =\) \(\frac{\delta }{n}.\)

The statistical test in the pseudo-code checks whether the averages of the two sub-windows differ by more than the threshold \(\varepsilon _{\text{cut}}\). The threshold is derived from the Hoeffding bound and provides a formal guarantee on the behavior of the underlying classifier. The phrase “holds for each split of \(W\) into \(W = W_{0} \cdot W_{1}\)” means that every pair \(W_{0}\), \(W_{1}\) obtained by dividing \(W\) into two portions has to be checked. Researchers have therefore presented improved models that identify the optimal cut-point efficiently. The originally presented ADWIN models are lossless learners, hence the window \(W\) grows without bound when there is no drift. This is easily remedied by adding a parameter that limits the maximal window size.
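A simplified version of this test can be sketched as follows. It is a didactic sketch rather than the bucket-based ADWIN of the cited work: it stores the whole window in a list and checks every split directly. The class name `SimpleAdwin`, the default \(\delta = 0.002\), and the optional `max_size` cap are assumptions, and the threshold uses the natural logarithm that follows from the Hoeffding bound.

```python
import math


class SimpleAdwin:
    """Naive ADWIN-style detector: checks every split of the window."""

    def __init__(self, delta=0.002, max_size=None):
        self.delta = delta
        self.max_size = max_size      # optional cap on the window size
        self.window = []

    def _eps_cut(self, n0, n1):
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        delta_prime = self.delta / (n0 + n1)
        return math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))

    def update(self, value):
        """Add a value in [0, 1]; return True if the window was shrunk."""
        self.window.append(value)
        if self.max_size and len(self.window) > self.max_size:
            self.window.pop(0)
        drift = False
        shrunk = True
        while shrunk and len(self.window) > 1:
            shrunk = False
            total = sum(self.window)
            head = 0.0
            for n0 in range(1, len(self.window)):
                head += self.window[n0 - 1]
                n1 = len(self.window) - n0
                gap = abs(head / n0 - (total - head) / n1)
                if gap > self._eps_cut(n0, n1):
                    self.window.pop(0)        # drop the oldest value, re-check
                    drift = shrunk = True
                    break
        return drift


# Toy stream: the mean jumps from 0 to 1 after 200 values (sudden drift).
detector = SimpleAdwin()
flags = [detector.update(v) for v in [0.0] * 200 + [1.0] * 50]
print(any(flags[:200]), any(flags[200:]))     # False True
```

Checking every split costs time proportional to the window length at each update, which is why practical implementations compress the window into exponentially sized buckets.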

Algorithm 1 Pseudo-code of ADWIN

ADODNN-based classification

Once the ADWIN technique has identified concept drift, the trained model is updated and then classification is carried out, which can significantly improve the classification results. When no concept drift exists, classification using ADODNN is performed directly, without the model update. The ADODNN determines the actual class labels of the applied data, and the application of ADO helps to attain improved classification performance.

Here, a DNN-based model using stacked autoencoders (SAE) is presented for classifying data with concept drift. The DNN classifier is built from an SAE and a softmax layer [16]. The dataset comprises attributes and class variables. Figure 3 illustrates the structure of the DNN model. The attributes are fed as input to the input layer. The DNN is built from two SAE layers, so the network is composed of two hidden layers of neurons. Additionally, a softmax layer is attached to the final hidden layer to perform the classification task. Hence, the output layer provides the probabilities of the class labels for the applied record.

Fig. 3 Structure of DNN

Suppose \(N\) input vectors \(\left\{ x_{(1)}, x_{(2)}, \ldots, x_{(N)} \right\}\) are considered for training the AE. The AE is trained to reconstruct its input as:

$$ x^{\prime} = f_{{\text{D}}} \left( {W^{\prime},{\text{~}}b^{\prime};f_{{\text{E}}} \left( {W,{\text{~}}b;x} \right)} \right), $$
(8)

which is represented as:

$$ x^{\prime} = f_{{{\text{AE}}}} \left( {W,{\text{~}}b,{\text{~}}W^{\prime},{\text{~}}b^{\prime};x} \right)~, $$
(9)

where \(f_{\text{AE}}\) is the function that maps the input to the output of the AE.

The AE is then trained by minimizing an appropriate objective function, given by the total error function:

$$ E_{{{\text{Total}}}} = E_{{{\text{MSE}}}} + ~E_{{{\text{Reg}}}} + ~E_{{{\text{sparsity}}}} , $$
(10)

where \(E_{\text{MSE}}\), \(E_{\text{Reg}}\), and \(E_{\text{sparsity}}\) are the mean square error (MSE), the regularization factor, and the sparsity factor, respectively. The MSE, \(E_{\text{MSE}}\), is determined by:

$$ E_{\text{MSE}} = \frac{1}{N}\sum\limits_{i = 1}^{N} e_{i}^{2}, $$
(11)

where \(e_{i}\) is the error, i.e., the difference between the original output \(x\left( i \right)\) and the observed output \(x^{\prime}\left( i \right)\). The error \(e_{i}\) is determined as:

$$ e_{i} = ~\left| {\left| {x\left( i \right) - x^{\prime}\left( i \right)} \right|} \right| $$
(12)

Deep networks can learn individual points of the training data, which results in overfitting. To resolve this problem, a regularization factor \(E_{\text{Reg}}\) is included in the objective function and estimated using the given expression:

$$ E_{\text{Reg}} = \frac{\lambda }{2}\left( \sum\limits_{i = 1}^{C} \left\| w_{i} \right\| + \sum\limits_{i = 1}^{D} \left\| w_{i}^{\prime } \right\| \right), $$
(13)

where \(\lambda\) is the regularization term of the method. A sparsity constraint enables the method to learn the essential features from the data. The sparsity factor \(E_{\text{sparsity}}\) is evaluated by:

$$ E_{\text{sparsity}} = \beta \sum\limits_{j = 1}^{C} {\text{KL}}\left( \rho \| \rho _{j} \right), $$
(14)

where \(\beta\) denotes the sparsity weight term and \({\text{KL}}\left( \rho \| \rho _{j} \right)\) is the Kullback–Leibler divergence given by:

$$ {\text{KL}}\left( \rho \| \rho _{j} \right) = \rho \log \frac{\rho }{\rho _{j}} + \left( 1 - \rho \right)\log \frac{\left( 1 - \rho \right)}{\left( 1 - \rho _{j} \right)}, $$
(15)

where \(\rho\) is the sparsity constant and \(\rho _{j}\) is the average activation value of the \(j\)th neuron, measured by:

$$ \rho _{j} = \frac{1}{T}\mathop \sum \limits_{{i = 1}}^{T} f^{j} \left( {x_{{\left( i \right)}} } \right)~, $$
(16)

where \(f^{j} \left( x_{(i)} \right)\) denotes the activation of the \(j\)th neuron in the hidden layer of the AE for input \(x_{(i)}\). An SAE is built by cascading the encoder layers of AEs. Recalling the AE mapping in Eq. (9), the SAE is described as:

$$ f_{{{\text{SAE}}}} = f_{E}^{1} \circ f_{E}^{2} \circ f_{E}^{3} \cdots \circ f_{E}^{L} ~, $$
(17)

where \(f_{\text{SAE}}\) denotes the SAE function. Each layer of the SAE employs only the encoder function; the decoder function is absent from every layer.
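The total error of Eq. (10) can be evaluated numerically as in the sketch below. It is illustrative only: the function names, the hyperparameter values, the reading of \(w_{i}\) as a row of the weight matrix, and the clipping of \(\rho _{j}\) for numerical safety are assumptions. The Kullback–Leibler term uses the standard form for two Bernoulli distributions, with a logarithm in both terms.

```python
import numpy as np


def kl_divergence(rho, rho_j):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_j), per neuron."""
    return rho * np.log(rho / rho_j) + (1 - rho) * np.log((1 - rho) / (1 - rho_j))


def total_error(x, x_rec, hidden, W, W_dec, lam=1e-4, beta=3.0, rho=0.05):
    """E_Total = E_MSE + E_Reg + E_sparsity for one batch.

    x, x_rec : (N, n) inputs and reconstructions
    hidden   : (N, C) hidden activations in (0, 1)
    W, W_dec : encoder and decoder weight matrices
    """
    e = np.linalg.norm(x - x_rec, axis=1)                  # Eq. (12)
    e_mse = np.mean(e ** 2)                                # Eq. (11)
    # Assumption: w_i is the i-th row of the corresponding weight matrix.
    e_reg = lam / 2 * (np.linalg.norm(W, axis=1).sum()
                       + np.linalg.norm(W_dec, axis=1).sum())   # Eq. (13)
    rho_j = np.clip(hidden.mean(axis=0), 1e-8, 1 - 1e-8)   # Eq. (16)
    e_sparsity = beta * np.sum(kl_divergence(rho, rho_j))  # Eq. (14)
    return e_mse + e_reg + e_sparsity


# Perfect reconstruction, zero weights, and activations equal to rho give zero error.
x = np.ones((4, 3))
hidden = np.full((4, 2), 0.05)
zero_case = total_error(x, x, hidden, np.zeros((2, 3)), np.zeros((3, 2)))

# Activations far from the sparsity constant are penalised.
dense_case = total_error(x, x, np.full((4, 2), 0.5), np.zeros((2, 3)), np.zeros((3, 2)))
print(zero_case, dense_case)
```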

The softmax classifier is a multiclass classifier that extends logistic regression (LR) to categorize several classes, and it is trained with supervised learning. In multiclass problems, the softmax classifier estimates the probability of each class for the data being classified, so the probabilities over all classes sum to 1. It computes the class probabilities by exponentiation followed by normalization. The softmax function \(f_{\text{SC}}\) is connected to the SAE. Once the layers are trained, the subsequent process of training the whole model is named fine-tuning. It is the last step in the classification process and is applied to enhance the model performance. To reduce the classification error, the network is fine-tuned with supervised learning: using the training data set, the complete network is trained in the same way as a multilayer perceptron (MLP). Here, only the encoder portion of each AE is used.
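The exponentiation and normalization performed by the softmax layer can be shown in a few lines. The input scores are arbitrary illustrative values, and subtracting the maximum is a common numerical-stability step that is not mentioned in the text.

```python
import numpy as np


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)   # stability only
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```

The predicted class label is the index of the largest probability, here the first class.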

ADO-based parameter tuning

Deep learning (DL) optimizers use a predefined learning rate by default [17]. In practice, however, training DL models is a non-convex problem. To determine an effective learning rate for the DNN model, ADO is applied, which computes the learning rate so as to attain maximum classification performance. Adadelta was developed by Zeiler [18]. Its main aim is to circumvent Adagrad's weakness, the drastic reduction of the learning rate caused by accumulating all previous squared gradients in the denominator. Adadelta computes the learning rate from the recent gradients within a limited time window. Adadelta also applies an acceleration term that takes previous updates into account. The Adadelta update rule is given below:

  • The gradient \(E^{\left( t \right)}\) is computed.

    $$\begin{aligned} E^{\left( t \right)} & = \frac{\partial l\left( \hat{X}^{\left( t \right)} \right)}{\partial \hat{X}^{\left( t \right)}} \\ & = \left( 1 - H \right) \odot \left( \hat{X}^{\left( t \right)} \cdot \left( \left( \hat{X}^{\left( t \right)} \right)^{T} \cdot \hat{X}^{\left( t \right)} + \varepsilon \times I \right)^{- 0.5} \right), \\ \end{aligned} $$
    (18)

  • The local average \(\tilde{G}^{\left( t \right)}\) of the squared gradient \(\left( E^{\left( t \right)} \right)^{2}\) is determined.

  • A new term accumulating the updates is estimated (momentum: acceleration term).

    $$ Z^{\left( t \right)} = \rho \times Z^{\left( t - 1 \right)} + \left( 1 - \rho \right) \times \left( W^{\left( t - 1 \right)} \right)^{2}, $$
    (19)

  • Finally, the update expression is applied as below.

    $$ W^{\left( t \right)} = \frac{\sqrt{Z^{\left( t \right)} + \varepsilon \times I}}{\alpha \sqrt{\tilde{G}^{\left( t \right)} + \varepsilon \times I}} \odot E^{\left( t \right)}, $$
    (20)
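For comparison, the standard Adadelta rule as published by Zeiler can be sketched in NumPy as follows. This sketch follows the commonly used formulation, not the exact notation above: `g2` plays the role of \(\tilde{G}\), `dx2` the role of \(Z\), and `update` the role of \(W\), and the factor \(\alpha\) is omitted. The quadratic test function, the decay \(\rho = 0.95\), and \(\varepsilon = 10^{-6}\) are illustrative assumptions.

```python
import numpy as np


def adadelta_step(params, grad, state, rho=0.95, eps=1e-6):
    """One Adadelta step; `state` holds the two running averages."""
    # Running average of squared gradients.
    state["g2"] = rho * state["g2"] + (1 - rho) * grad ** 2
    # Step size is the ratio of the RMS of past updates to the RMS of gradients.
    update = np.sqrt(state["dx2"] + eps) / np.sqrt(state["g2"] + eps) * grad
    # Running average of squared updates (the acceleration term).
    state["dx2"] = rho * state["dx2"] + (1 - rho) * update ** 2
    return params - update


# Toy problem: minimise f(x) = sum(x^2), whose gradient is 2x.
x = np.array([3.0, -2.0])
state = {"g2": np.zeros_like(x), "dx2": np.zeros_like(x)}
start_loss = float(np.sum(x ** 2))
for _ in range(500):
    x = adadelta_step(x, 2 * x, state)
end_loss = float(np.sum(x ** 2))
print(start_loss, end_loss)
```

No global learning rate appears in the update, which is the property the text relies on when it uses ADO to set the learning rate of the DNN.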

Performance validation

To examine the detection performance of the CIDD-ADODNN model, a series of simulations was carried out using three benchmark datasets, namely KDDCup99 [19], Spam [20], and Chess [21]. The details of the datasets are given in Table 1. The KDDCup99 dataset includes 42 features and 125,973 instances. The Spam dataset contains 58 features and 4601 instances. The Chess dataset comprises nine features and 503 instances. For experimentation, tenfold cross-validation is used to split each dataset into training and testing sets. Figures 4, 5 and 6 visualize the frequency distribution of the instances under distinct attributes for the three datasets. In addition, snapshots generated during the simulations are provided in the “Appendix”.

Table 1 Dataset description
Fig. 4 Visualization of KDDCup99 dataset

Fig. 5 Visualization of Spam dataset

Fig. 6 Visualization of Chess dataset

Table 2 provides the outcome of the ADASYN technique for class balancing. The table values show that the initial 125,973 instances in the KDDCup99 dataset are balanced into a set of 134,689 instances. Similarly, on the Spam dataset, the original 4601 instances are balanced into a set of 5457 instances. On the Chess dataset, the available 503 instances are increased to 616 instances by balancing.

Table 2 Result analysis of class imbalancement

Figure 7 shows the ROC curves generated by the ADODNN and CIDD-ADODNN models on the KDDCup99 dataset. Figure 7a shows that the ADODNN model attained an ROC of 0.95. Figure 7b illustrates that the CIDD-ADODNN model achieved a higher ROC of 0.98.

Fig. 7 ROC analysis on KDDCup99 dataset. a ADODNN, b CIDD-ADODNN

Figure 8 depicts the ROC curves generated by the ADODNN and CIDD-ADODNN models on the Spam dataset. Figure 8a illustrates that the ADODNN model attained an ROC of 0.95. Figure 8b shows that the CIDD-ADODNN model achieved a higher ROC of 0.98.

Fig. 8 ROC analysis on Spam dataset. a ADODNN, b CIDD-ADODNN

Figure 9 shows the ROC curves generated by the ADODNN and CIDD-ADODNN models on the Chess dataset. Figure 9a shows that the ADODNN model attained an ROC of 0.67. Figure 9b illustrates that the CIDD-ADODNN model achieved a higher ROC of 0.85.

Fig. 9 ROC analysis on Chess dataset. a ADODNN, b CIDD-ADODNN

Table 3 tabulates the classification results attained by the ADODNN and CIDD-ADODNN models on the three datasets. Figure 10 portrays the results obtained by the ADODNN and CIDD-ADODNN models on the test KDDCup99 dataset. The figure shows that the ADODNN model resulted in a precision of 0.9311, recall of 0.9330, specificity of 0.9207, accuracy of 0.9273, and F score of 0.9320. The CIDD-ADODNN model exhibited considerably better outcomes than the ADODNN model on every measure, with a precision of 0.9628, recall of 0.9552, specificity of 0.9631, accuracy of 0.9592, and F score of 0.9590.
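The five measures reported here follow from the four outcome counts defined in the problem formulation. The sketch below uses the standard definitions, which are assumed to be the ones applied in the experiments; the confusion counts in the usage lines are made-up illustrative values. As a consistency check, the F score is recomputed from the reported ADODNN precision and recall on KDDCup99.

```python
def metrics(tp, tn, fp, fn):
    """Standard measures computed from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f_score": f_score(precision, recall),
    }


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Illustrative counts, not taken from the paper.
example = metrics(tp=8, tn=7, fp=2, fn=3)
print(example["precision"], example["specificity"])

# Reported ADODNN precision and recall on KDDCup99.
print(f"{f_score(0.9311, 0.9330):.4f}")   # 0.9320
```

The recomputed value agrees with the reported F score of 0.9320.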

Table 3 Result analysis of proposed methods on applied three dataset
Fig. 10 Result analysis of CIDD-ADODNN method on KDDCup99 dataset

Figure 11 presents the results attained by the ADODNN and CIDD-ADODNN methods on the test Spam dataset. The figure shows that the ADODNN scheme resulted in a precision of 0.9346, recall of 0.8917, specificity of 0.9040, accuracy of 0.8965, and F score of 0.9126. The CIDD-ADODNN approach achieved better results than the ADODNN model on every measure except precision, with a precision of 0.9272, recall of 0.9408, specificity of 0.9228, accuracy of 0.9320, and F score of 0.9340. Figure 12 displays the results attained by the ADODNN and CIDD-ADODNN approaches on the test Chess dataset. The figure shows that the ADODNN model attained a precision of 0.6296, recall of 0.6010, specificity of 0.7705, accuracy of 0.7038, and F score of 0.6150. The CIDD-ADODNN scheme achieved better results than the ADODNN model on every measure except specificity, with a precision of 0.7515, recall of 0.7974, specificity of 0.7311, accuracy of 0.7646, and F score of 0.7738.

Fig. 11 Result analysis of CIDD-ADODNN method on Spam dataset

Fig. 12 Result analysis of CIDD-ADODNN method on Chess dataset
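The F scores quoted above can be cross-checked against the quoted precision and recall values. The following is a minimal Python sketch, assuming the F score is the usual harmonic mean of precision and recall (the F1 measure); the function name f_score and the dictionary layout are illustrative choices, not part of the original experiments.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1), assumed here as the F score."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F score) as quoted in the text for each dataset/model.
reported = {
    ("KDDCup99", "ADODNN"): (0.9311, 0.9330, 0.9320),
    ("KDDCup99", "CIDD-ADODNN"): (0.9628, 0.9552, 0.9590),
    ("Spam", "ADODNN"): (0.9346, 0.8917, 0.9126),
    ("Spam", "CIDD-ADODNN"): (0.9272, 0.9408, 0.9340),
    ("Chess", "ADODNN"): (0.6296, 0.6010, 0.6150),
    ("Chess", "CIDD-ADODNN"): (0.7515, 0.7974, 0.7738),
}

for (dataset, model), (p, r, f_reported) in reported.items():
    f = f_score(p, r)
    # Inputs are rounded to four decimals, so allow a small tolerance.
    status = "consistent" if abs(f - f_reported) < 1e-4 else "check"
    print(f"{dataset:9s} {model:12s} computed={f:.4f} reported={f_reported:.4f} {status}")
```

Under this assumption, the recomputed values agree with the reported F scores to within the rounding of the inputs.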

Table 4 and Fig. 13 present a detailed comparative analysis of the CIDD-ADODNN model on the KDDCup99 test dataset [22]. The Gradient Boosting and Naïve Bayesian models show the weakest performance, with accuracy values of 0.843 and 0.896, respectively. The Gaussian process and OC-SVM models attain slightly higher accuracy values of 0.911 and 0.918, respectively, followed by the DNN-SVM model with an accuracy of 0.92. The presented ADODNN and CIDD-ADODNN models outperform all of them, with accuracy values of 0.927 and 0.959, respectively.

Table 4 Performance evaluation of proposed method with recent methods on KDDCup99 dataset

Fig. 13 Comparative analysis of CIDD-ADODNN model on KDDCup99 dataset

Table 5 and Fig. 14 present a detailed comparative analysis of the CIDD-ADODNN model on the Spam test dataset [23,24,25]. The HELF and KNN models show the weakest performance, with accuracy values of 0.750 and 0.818, respectively. The GA and Adaboost models attain slightly higher accuracy values of 0.840 and 0.870, respectively.

Table 5 Performance evaluation of proposed method with recent methods on Spam dataset

Fig. 14 Comparative analysis of CIDD-ADODNN model on Spam dataset

Similarly, the NB approach attains an accuracy of 0.881, followed by the Flexible Bayes model with an accuracy of 0.888. The proposed ADODNN and CIDD-ADODNN schemes achieve the highest accuracy values of 0.896 and 0.932, respectively.

Table 6 and Fig. 15 present a detailed comparative analysis of the CIDD-ADODNN model on the Chess test dataset [26]. The ZeroR and SVM models show the weakest performance, with accuracy values of 0.390 and 0.420, respectively. The LR and OneR methods attain moderate accuracy values of 0.549 and 0.598, respectively, and the MLP scheme reaches an accuracy of 0.647. The proposed ADODNN and CIDD-ADODNN approaches achieve the highest accuracy values of 0.703 and 0.764, respectively.

Table 6 Performance evaluation of proposed method with recent methods on Chess dataset

Fig. 15 Comparative analysis of CIDD-ADODNN model on Chess dataset

From the detailed experimental analysis, it is evident that the CIDD-ADODNN model achieves effective outcomes on all the applied datasets. In particular, it obtains a maximum accuracy of 0.9592, 0.9320, and 0.7646 on the KDDCup99, Spam, and Chess datasets, respectively. This is attributed to the following reasons: effective handling of the class imbalance problem, accurate drift detection, and a proficient hyperparameter tuning process. Therefore, the CIDD-ADODNN model has been found to be an effective tool for classifying highly imbalanced streaming data.
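To put the accuracy comparison in concrete terms, the short Python sketch below computes the margin of CIDD-ADODNN over the strongest compared method on each dataset. It uses only the three-decimal accuracy values quoted in the comparative analysis; the helper name margin_over_best_baseline and the data layout are hypothetical.

```python
# Accuracy values as quoted in the comparative analysis (Tables 4 to 6).
accuracy = {
    "KDDCup99": {"Gradient Boosting": 0.843, "Naive Bayesian": 0.896,
                 "Gaussian process": 0.911, "OC-SVM": 0.918, "DNN-SVM": 0.920,
                 "ADODNN": 0.927, "CIDD-ADODNN": 0.959},
    "Spam": {"HELF": 0.750, "KNN": 0.818, "GA": 0.840, "Adaboost": 0.870,
             "NB": 0.881, "Flexible Bayes": 0.888,
             "ADODNN": 0.896, "CIDD-ADODNN": 0.932},
    "Chess": {"ZeroR": 0.390, "SVM": 0.420, "LR": 0.549, "OneR": 0.598,
              "MLP": 0.647, "ADODNN": 0.703, "CIDD-ADODNN": 0.764},
}
PROPOSED = {"ADODNN", "CIDD-ADODNN"}


def margin_over_best_baseline(scores, model="CIDD-ADODNN"):
    """Return (best compared method, accuracy margin of `model` over it)."""
    baselines = {name: acc for name, acc in scores.items() if name not in PROPOSED}
    best = max(baselines, key=baselines.get)
    return best, round(scores[model] - baselines[best], 3)


for dataset, scores in accuracy.items():
    best, margin = margin_over_best_baseline(scores)
    gain = round(scores["CIDD-ADODNN"] - scores["ADODNN"], 3)
    print(f"{dataset}: +{margin} over {best}, +{gain} over ADODNN")
```

The margins are simple differences of the reported accuracies and carry no statistical significance on their own.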

Conclusion

This paper has designed a novel CIDD-ADODNN model for the classification of highly imbalanced streaming data. First, the raw streaming data is preprocessed in three stages: format conversion, data transformation, and chunk generation. The ADASYN model receives the preprocessed data as input and applies a weighted distribution over the minority class instances according to their level of learning difficulty. ADASYN thereby balances the dataset, after which drift detection is carried out using the ADWIN technique. To determine an effective learning rate for the DNN model, ADO is applied, which selects the learning rate so as to maximize classification performance. To validate the classification results of the CIDD-ADODNN model, a comprehensive set of experiments was carried out. The simulation results verified the superior performance of the presented model, which obtained a maximum accuracy of 0.9592, 0.9320, and 0.7646 on the KDDCup99, Spam, and Chess datasets, respectively. In future, the performance of the CIDD-ADODNN model can be improved using feature selection and clustering techniques.
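As an illustration of the difficulty-based weighting summarised above, the following is a minimal Python sketch of the allocation step in the standard ADASYN formulation. It is not the implementation used in this work: it assumes Euclidean distance and a k-nearest-neighbour difficulty ratio, and the function name adasyn_allocation, the parameters k and beta, and the toy data are hypothetical. Only the number of synthetic samples per minority instance is computed; the interpolation that creates the samples is omitted.

```python
import math


def adasyn_allocation(X, y, minority_label, k=2, beta=1.0):
    """Return {minority index: number of synthetic samples to create}."""
    minority = [i for i, label in enumerate(y) if label == minority_label]
    n_minority = len(minority)
    n_majority = len(y) - n_minority
    # Total synthetic samples needed; beta=1.0 targets a fully balanced set.
    total_to_generate = (n_majority - n_minority) * beta

    ratios = []
    for i in minority:
        # k nearest neighbours of sample i among all other samples.
        others = [j for j in range(len(X)) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        # Share of majority-class neighbours: a proxy for learning difficulty.
        ratios.append(sum(y[j] != minority_label for j in neighbours) / k)

    total_ratio = sum(ratios)
    if total_ratio == 0:
        # No minority sample borders the majority class; nothing to weight.
        return {i: 0 for i in minority}
    # Normalise the ratios into a distribution and scale to the budget.
    return {i: round(r / total_ratio * total_to_generate)
            for i, r in zip(minority, ratios)}


# Toy 1-D data: seven majority points (label 0), three minority points (label 1).
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,),
     (2.4,), (10.0,), (10.3,)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(adasyn_allocation(X, y, minority_label=1))
```

In this toy case, the minority point lying inside the majority region (index 7) receives the largest share of synthetic samples, which is the behaviour the weighting is meant to produce.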