Deep learning framework for handling concept drift and class imbalanced complex decision-making on streaming data

In recent times, data science has become popular for supporting and improving the decision-making process. Owing to the wide accessibility of streaming-data applications, class imbalance and concept drift have become crucial learning problems. Deep learning (DL) models have proven useful for the classification of drifting data streams. This paper presents an effective class imbalance with concept drift detection (CIDD) using Adadelta optimizer-based deep neural networks (ADODNN), named the CIDD-ADODNN model, for the classification of highly imbalanced streaming data. The presented model involves four processes, namely preprocessing, class imbalance handling, concept drift detection, and classification. The proposed model uses the adaptive synthetic (ADASYN) technique for handling class-imbalanced data, which utilizes a weighted distribution for the different minority class examples based on their level of difficulty in learning. Next, a drift detection technique called adaptive sliding window (ADWIN) is employed to detect the existence of concept drift. Besides, the ADODNN model is utilized for the classification process. To increase the classification performance of the DNN model, an ADO-based hyperparameter tuning process takes place to determine the optimal parameters of the DNN. The performance of the presented model is evaluated using three streaming datasets, namely the intrusion detection (NSL KDDCup) dataset, the Spam dataset, and the Chess dataset. A detailed comparative results analysis takes place, and the simulation results verified the superior performance of the presented model, which obtained a maximum accuracy of 0.9592, 0.9320, and 0.7646 on the applied KDDCup, Spam, and Chess datasets, respectively.


Introduction
With progressive technical advancements, numerous data streams are produced at a rapid pace in recent times, for example in sensor networks, spam filtering models, traffic management, and intrusion prediction [1]. A data stream S is potentially unbounded, and sequential instances arrive continuously at high speed. The major challenge in data stream learning is concept drift: the concept underlying the data may drift in a dynamic fashion. Concept drift commonly exists in real-time applications. For instance, in recommender systems (RS), user preferences might change in response to trends, finances, and various other external factors. Likewise, climate detection models must be modified according to seasonal changes in the environment. Such changes tend to degrade the classification process. Hence, a classifier applied in this setting must be able to examine and adapt to these alterations. The main theme of this work is to develop a classifier learning module that effectively mines streaming data in dynamic environments.
Concept drift can be classified on the basis of speed into sudden and gradual drifts, as shown in Fig. 1. Sudden concept drift is characterized by a massive change in the underlying class distribution of the incoming samples within a short time. Gradual concept drift, in contrast, unfolds slowly and represents the change in the underlying class distributions between previous and new instances. Regardless of the type of change, it must be possible to observe and track the changes. In real-time data streams, a past concept may reappear in the future, an exclusive type of drift called recurring concepts. For instance, the news-reading preference of a user might change abruptly: people may read differently on weekends, mornings, and evenings; additionally, a user may explore astrology articles at new year and economic articles each quarter. Hence, classifiers that were employed in the past might also be applicable in the future. However, traditional works on drift prediction ignore this phenomenon and treat each recurring concept as a new one. Drift prediction therefore captures the changes in data streams and upgrades the prediction approach to maintain high accuracy.

Overview of concept drift
Concept drift exists when the target concept is modified within a limited time period. Assume two target concepts A and B, and a sequence of samples I = i_1, i_2, …, i_n. Prior to instance i_d, the target concept is not modified and remains A. Afterwards, over an interval Δx, the data settle on a different stable concept B. Hence, the concept is drifting between the samples i_{d+1} and i_{d+Δx}, replacing target concept A with B. Based on the duration of the drift (Δx), the modification is classified as gradual, in which the transition between the two concepts is slow, or abrupt, in which the change occurs suddenly.
Concept drift models are defined in three distinct ways: window-related, weight-related, and ensembles of classification models. The window-related methods select samples from a dynamic sliding window, whereas the weight-based methods weight the samples and remove older ones according to their weights. Third, ensemble classification trains different classification models and integrates them into a final, effective classifier. The sample count is considered in the training phase. Concept drift handling methodologies are further classified into the online approach, which updates the classifier after receiving each instance, and the batch approach, which waits to receive a large number of instances before starting the learning process. In the latter, the learning approach receives the stream of data and divides it into batches. A few models are used for dealing with streams of batches, as follows: full-memory, where the learner uses all historical training samples (batches); no-memory, which uses only the current batch for training; and window-of-fixed-size n, which uses the n most recent batches. Here, the window-based model with a fixed window size (n = 10) has been applied.
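The window-of-fixed-size strategy described above (the variant used here, with n = 10) can be illustrated with a minimal Python sketch; the class and method names are our own, not from the paper:

```python
from collections import deque

class FixedWindowLearner:
    """Window-of-fixed-size strategy: keep only the n most recent batches.

    Older batches are evicted automatically, so the training data reflect
    the current concept rather than the full stream history."""

    def __init__(self, n=10):
        self.window = deque(maxlen=n)

    def receive_batch(self, batch):
        # once the deque is full, appending evicts the oldest batch
        self.window.append(batch)

    def training_data(self):
        # flatten the retained batches into one training set
        return [x for batch in self.window for x in batch]

learner = FixedWindowLearner(n=10)
for t in range(12):             # a stream of 12 one-element batches
    learner.receive_batch([t])
print(learner.training_data())  # batches 0 and 1 have been forgotten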

Problem formulation
Assume an input data stream gathered from n sources So_i, referred to as So_1, So_2, So_3, …, So_n. A source i produces k streams So_ik, i.e., So_i1, So_i2, …, So_ik. The samples from these sources make up the complete streaming data ∪So_i = So. The central premise of the data preprocessing method is to declare a storage reservoir S_R for the stream data So from the n sources. Two factors are significant for estimating the statistical reservoir size for the complete stream data. The degree of disparity in the stream data reflects the difference in the number of samples contributed by each source: a maximum degree of disparity leads to the minimum confidence interval that makes it possible to estimate the correct value [2].
The reservoir size follows the Slovin-type relation

|S_R| = N / (1 + N·e²)  (1)

where in Eq. (1), |S_R| implies the overall sample size, N shows the overall population, and e indicates the confidence interval. For low confidence intervals, the data sampling method selects a maximum number of instances; otherwise, a minimum number of samples is sufficient to represent the complete stream data. Once the sampling process is applied, the two-class problem remains constant in stream data classification. Assume an online ensemble classifier Θ that receives a novel instance x_t at time t, and the detected class label is y′_t. After the prediction is computed, the classifier receives the desired label of x_t as y_t, where the predicted and desired labels take values in {1, −1}. The result of the ensemble classifier Θ is divided into four categories, namely true positives, true negatives, false positives, and false negatives. Based on these measures, the ensemble classification accuracy is evaluated for minority and majority class instances, and the imbalance factor is quantified with the help of the occurrence probability of the minority classes. Because of the imbalanced distribution of samples between majority and minority class instances, the classifier performance degrades.
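Assuming Eq. (1) is the standard Slovin-type relation |S_R| = N / (1 + N·e²), which matches the description of |S_R|, N, and e above, the reservoir size can be computed as follows (function name is ours):

```python
import math

def reservoir_size(N, e):
    """Slovin-style sample size: |S_R| = N / (1 + N * e**2).

    N: population (total stream instances); e: confidence interval."""
    return math.ceil(N / (1 + N * e * e))

# a tighter confidence interval demands a larger reservoir
print(reservoir_size(125_973, 0.05))  # 399
print(reservoir_size(125_973, 0.01))  # 9265
```

This matches the remark in the text: the smaller e is, the more instances the sampling method must retain.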

Paper contributions
Learning from data streams (incremental learning) has significantly attracted the research community owing to its several challenges and real-time applications. Concept drift detection is a strategy for detecting when changes in the data distribution render the current prediction method inaccurate. A stream data classifier without concept drift adaptation is not suitable for classifying imbalanced class distributions. Therefore, this paper designs a novel class imbalance with concept drift detection (CIDD) using Adadelta optimizer-based deep neural networks (ADODNN), named the CIDD-ADODNN model, to classify highly imbalanced data streams. The proposed model uses the adaptive synthetic (ADASYN) technique for handling class-imbalanced data. In addition, the adaptive sliding window (ADWIN) technique is applied for the recognition of concept drift in the applied streaming data. At last, the ADODNN model is utilized for the classification process. To validate the classification results of the CIDD-ADODNN model, three streaming datasets are used, namely the intrusion detection (NSL KDDCup) dataset, the Spam dataset, and the Chess dataset.
In short, the contributions of the paper are listed as follows: • Develop a new CIDD-ADODNN model to classify highly imbalanced data streams. • Employ the ADASYN technique for handling class-imbalanced data and the ADWIN technique for the recognition of concept drift in the applied streaming data. • Utilize the ADODNN model for the classification process. • Validate the performance of the CIDD-ADODNN model on three streaming datasets.

Literature survey
Big data streaming domains mostly suffer from problems such as class imbalance and concept drift. Classical sampling models make use of two modules to overcome these problems: resampling and similarity methodologies. Resampling is one of the effective schemes at the data level; some resampling approaches manage the data distribution by applying deterministic frameworks [3]. Remarkable approaches select instances from the continuously incoming data stream using sampling with and without replacement. Sampling with replacement is applied when a fixed sample size is required, while sampling without replacement can be utilized otherwise. However, the traditional approaches cannot guarantee sample adequacy without redundancy, and the secondary technique is not applicable to sub-streams that exhibit diverse patterns. In Wu et al. [4], the Dynamic Feature Group Weighting framework with Importance Sampling (DFGW-IS) aims to resolve the issues of concept drift and class imbalance: a weighted ensemble is trained on randomly extracted feature groups. The minority classes remain the same; however, the minority class instances in the previous window may be dissimilar to the present ones. Additionally, solutions that handle irregular class distributions by applying classical sampling do not address concept drift significantly. Hence, the sampling approach in Cervellera and Macciò [5] uses recursive binary partitioning across the input samples and decides which instances represent the entire stream; its greedy optimality and explicit error bound are applicable to managing the problems related to concept drift.
The adaptive sampling approach [6] on irregular data streams operates under repeatable and scalable prediction approaches. A predictive method is developed when the data are only minimally imbalanced; if the data are heavily imbalanced, it activates a data scan to gather enough minority instances. The major constraints of this model are that it is implemented with an exact reservoir and does not assume worst-case optimality. To overcome these problems, stream sampling and continuous random sampling make use of overlap independence. By integrating density and distance metrics, the DENDIS approach from [7] retains the semantic coherence of the stream.
The G-means Update Ensemble (GUE) in [8] tries to resolve the aforementioned issues. To manage the imbalanced class distribution, it employs an oversampling operation and applies weighting frameworks to handle the concept drift; however, a static threshold measure is not sufficient to resolve the imbalanced class distribution. The Gradual Resampling Ensemble (GRE) method was developed by Ren et al. [9] to overcome these problems. It exploits a resampling scheme for previously received minority classes and amplifies the present minority class labels. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is utilized for identifying small disjuncts and eliminating their influence on resemblance estimation, which helps GRE exploit the novel samples. Similarly, efficient learning of nonstationary imbalanced data streams was projected in Meenakshi et al. [10]. It tries to limit the misclassified samples with the aid of a two-class formulation: it develops several blocks of chunks, with a chunk used for training while testing is processed by the classification model. However, severe problems are experienced in multiclass classification.
In Iwashita et al. [11], spiking neural networks are introduced to learn from data streams online. The major objective of this work is to reduce the neuron repository size and to exploit the benefits of data reduction models and compressed neuron learning capability. The Knowledge-Maximized Ensemble (KME) in [12] unifies online and chunk-based ensemble classification models to resolve different concept drift problems. The application of unsupervised learning techniques and saved recurrent models enhances the knowledge applied in stream data mining (DM) and, as a result, improves the accuracy of data classification. Although several works exist in the literature, the classification of drifting streams is considerably affected by class-imbalanced data. Sampling approaches are commonly employed for processing the continuously incoming data stream with an adequate sample count, and the chosen samples construct a statistical inference supporting the imbalanced class distribution. Still, a stream data classifier without concept drift adaptation is not suitable for classifying imbalanced class distributions.

The proposed CIDD-ADODNN model
The working principle of the presented CIDD-ADODNN model is depicted in Fig. 2. Primarily, data preprocessing takes place to transform the raw streaming data into a compatible format for further processing. Next, the ADASYN technique is applied for handling the class imbalance. Then, a drift detection technique called ADWIN is employed to detect the existence of concept drift. At last, the ADODNN model is applied to determine the class label of the streaming data, incorporating ADO to tune the hyperparameters of the DNN model.

Data preprocessing
At the initial stage, preprocessing of the raw streaming data takes place in three steps: format conversion, data transformation, and chunk generation. First, the online streaming data in any raw format are converted into the required .csv format. Second, the data transformation process alters the categorical values in the data into numerical values. Third, the streaming dataset of any size is divided into a number of chunks for further processing.
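A minimal sketch of the transformation and chunking steps (the function name and the per-column integer encoding are our illustrative choices, not the paper's implementation):

```python
from collections import defaultdict

def transform_and_chunk(rows, chunk_size):
    """Map categorical strings to per-column integer codes, then split into chunks."""
    codes = defaultdict(dict)            # column index -> {value: code}
    numeric = []
    for row in rows:
        enc = []
        for j, v in enumerate(row):
            try:
                enc.append(float(v))     # already numeric
            except ValueError:           # categorical: assign next free code
                enc.append(float(codes[j].setdefault(v, len(codes[j]))))
        numeric.append(enc)
    # fixed-size chunks for downstream processing
    return [numeric[i:i + chunk_size] for i in range(0, len(numeric), chunk_size)]

stream = [["0.1", "tcp", "1"], ["0.4", "udp", "0"], ["0.2", "tcp", "1"]]
chunks = transform_and_chunk(stream, chunk_size=2)
print(chunks)  # [[[0.1, 0.0, 1.0], [0.4, 1.0, 0.0]], [[0.2, 0.0, 1.0]]]
```

Repeated categorical values (here "tcp") map to the same code, so the transformation is consistent across chunks.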

ADASYN based class imbalance data handling
The ADASYN model receives the preprocessed data as input and handles the class imbalance. It makes use of a weighted distribution for the different minority class instances according to their level of difficulty in learning [13], and generates distinct synthetic instances for the minority classes based on that distribution. Popular synthetic sampling models such as the synthetic minority oversampling technique (SMOTE), SMOTEBoost, and DataBoost-IM have been introduced for learning from imbalanced data sets. The objective of ADASYN is two-fold: reducing the bias introduced by the class imbalance and adaptively shifting the classification decision boundary toward the difficult examples.

Procedure
(1) Estimate the degree of class imbalance:

d = m_s / m_l  (2)

where m_s and m_l denote the number of minority and majority class samples, respectively, so that d ∈ (0, 1].
(2) When d < d_th (d_th is a preset threshold for the maximum tolerated degree of class imbalance):
(a) Estimate the number of synthetic data samples to be produced for the minority class:

G = (m_l − m_s) × β  (3)

where β ∈ [0, 1] is a parameter that specifies the required balance level after the synthetic data are generated; β = 1 means a fully balanced data set is obtained after the generation process.
(b) For each example x_i in the minority class, identify its K nearest neighbors (kNN) based on the Euclidean distance in the n-dimensional space and estimate the ratio r_i = Δ_i / K, where Δ_i is the number of those K neighbors that belong to the majority class.
(c) Normalize the ratios as r̂_i = r_i / Σ_i r_i, so that the r̂_i form a density distribution.
(d) Compute the number of synthetic samples to be generated for each minority example as g_i = r̂_i × G, where G is the overall number of synthetic data samples to be generated for the minority class as described in Eq. (3).
(e) For each minority class data sample x_i, produce g_i synthetic data samples on the basis of the given steps.
Create a Loop from 1 to g_i: (i) Randomly select one minority data sample x_zi from the K nearest neighbors of x_i.
(ii) Produce the synthetic data instance s_i = x_i + (x_zi − x_i) × λ, where (x_zi − x_i) is the difference vector in the n-dimensional space and λ is a random value, λ ∈ [0, 1].
End Loop.
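The procedure above can be sketched in NumPy as follows. This is a simplified illustration under our own variable names; production code would rather use an established implementation such as imbalanced-learn's `ADASYN`:

```python
import numpy as np

def adasyn(X_min, X_maj, beta=1.0, K=5, rng=None):
    """Simplified ADASYN sketch following the procedure above."""
    if rng is None:
        rng = np.random.default_rng(0)
    m_s, m_l = len(X_min), len(X_maj)
    G = int((m_l - m_s) * beta)                 # samples to generate, Eq. (3)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([0] * m_s + [1] * m_l)    # 1 marks a majority sample
    r = np.empty(m_s)
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:K + 1]             # K nearest neighbours (excluding self)
        r[i] = is_maj[nn].sum() / K             # fraction of majority neighbours
    r_hat = r / r.sum() if r.sum() > 0 else np.full(m_s, 1 / m_s)
    g = np.rint(r_hat * G).astype(int)          # per-example counts g_i
    synth = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_min - x, axis=1)
        nn = np.argsort(d)[1:K + 1]             # neighbours within the minority class
        for _ in range(g[i]):
            x_z = X_min[rng.choice(nn)]
            lam = rng.random()
            synth.append(x + lam * (x_z - x))   # s_i = x_i + λ (x_zi − x_i)
    return np.array(synth)
```

With β = 1, roughly m_l − m_s synthetic minority points are produced, and more of them are placed near minority examples surrounded by majority neighbours, i.e., the hard-to-learn ones.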

ADWIN-based drift detection
The application of the ADASYN model balances the dataset effectively, after which the drift detection process is executed using the ADWIN technique [14]. In this study, a window-based approach is employed for drift detection with a window of fixed size (n = 10). Bifet [15] presented the ADWIN technique, which is suitable for data streams with sudden drift. It applies a sliding window W over the most recently read samples. The major principle of ADWIN is the following: when two large sub-windows of W exhibit distinct enough averages, the expected values have changed, and the older portion of the window is dropped. The statistical hypothesis states that "the average μ_t is constant in W with confidence δ". The pseudo-code of ADWIN is shown in Algorithm 1.
The major part of the algorithm is the definition of ε_cut and how it is computed. Assume n is the size of W, and n_0 and n_1 are the sizes of W_0 and W_1, so that n = n_0 + n_1. Let μ̂_W0 and μ̂_W1 be the averages of the values in W_0 and W_1, and μ_W0 and μ_W1 their expected values. The value of ε_cut is given by

ε_cut = sqrt( (1 / (2m)) · ln(4 / δ′) )

where m = 1 / (1/n_0 + 1/n_1) and δ′ = δ/n. The statistical test represented in the pseudo-code verifies whether the averages of the two sub-windows differ by more than the threshold ε_cut. The threshold is derived from the Hoeffding bound, which provides formal guarantees on the base classifier's performance. The phrase "holds for each split of W into W = W_0 · W_1" means that every pair W_0 and W_1 obtained by dividing W into two parts has to be verified. Researchers have therefore presented an improved model to identify the optimal cut-point efficiently. The originally presented ADWIN model is a lossless learner, so the window size W grows indefinitely when there is no drift; this is remedied simply by adding a parameter that bounds the maximal window size.
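The cut test can be sketched directly from these definitions. This illustrative check examines a single split only; the real algorithm tries every split of W and uses exponential histograms for efficiency:

```python
import math

def adwin_cut(w0, w1, delta=0.002):
    """Hoeffding-bound test from ADWIN: report drift if the sub-window
    means of a split W = W0 . W1 differ by more than eps_cut."""
    n0, n1 = len(w0), len(w1)
    m = 1.0 / (1.0 / n0 + 1.0 / n1)             # m = 1/(1/n0 + 1/n1)
    dp = delta / (n0 + n1)                      # delta' = delta / n
    eps_cut = math.sqrt((1.0 / (2 * m)) * math.log(4.0 / dp))
    mu0 = sum(w0) / n0                          # empirical sub-window means
    mu1 = sum(w1) / n1
    return abs(mu0 - mu1) >= eps_cut

stable = adwin_cut([0.1] * 200, [0.12] * 200)   # small shift: no drift signalled
drift = adwin_cut([0.1] * 200, [0.9] * 200)     # large shift: drift signalled
print(stable, drift)
```

Note how the bound adapts to the split sizes: small sub-windows give a large ε_cut, so only substantial mean differences trigger a cut.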
Suppose N input vectors x^(1), x^(2), …, x^(N) are considered for training the AE. The AE is trained to reconstruct its input, which is represented as x̂^(i) = f_AE(x^(i)).

ADODNN-based classification
Once the ADWIN technique identifies a concept drift, the trained model is updated and the classification process is then executed, which significantly improves the classification results. When no concept drift exists, the classification using ADODNN is performed straightaway, without the model-updating step. The ADODNN has the ability to determine the actual class labels of the applied data, and the application of ADO helps to attain improved classification performance.
Here, a DNN-based model built from stacked autoencoders (SAE) is presented for concept drift classification to enhance the estimation measures. The DNN classifier for the concept drift dataset is developed using SAE and a softmax layer [16]. The dataset comprises attributes and class variables, which are defined in the following. Figure 3 illustrates the structure of the DNN model. The attributes are fed to the input layer. The DNN is built from two layers of SAE: the network is composed of two hidden layers of neurons, and a softmax layer is attached to the final hidden layer to perform the classification task. Hence, the output layer provides the probabilities of the class labels for the applied record.
The AE thus realizes a mapping x̂ = f_AE(x), where f_AE implies the function that maps the input into the output of the AE.
Afterwards, the AE undergoes training by minimizing an appropriate objective function, given by the total error

E = E_MSE + E_Reg + E_sparsity

where E_MSE, E_Reg, and E_sparsity imply the mean square error (MSE), the regularization factor, and the sparsity factor, respectively. The MSE is determined by

E_MSE = (1/N) Σ_{i=1}^{N} e_i²

where e_i is the error, i.e., the difference between the original output x(i) and the observed output x′(i), determined as e_i = x(i) − x′(i). Deep networks tend to memorize the training data, which results in overfitting. To resolve this problem, the regularization factor E_Reg is included in the objective function, estimated as

E_Reg = λ Σ w²

where λ is the regularization term of the method and the sum runs over the network weights w. The sparsity constraint enables the method to learn the essential features from the data. The sparsity factor E_sparsity is evaluated as

E_sparsity = β Σ_j KL(ρ ‖ ρ̂_j)

where β denotes the sparsity weight term and KL(ρ ‖ ρ̂_j) defines the Kullback–Leibler divergence

KL(ρ ‖ ρ̂_j) = ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j))

where ρ is the sparsity constant and ρ̂_j the average activation value of the jth neuron, measured by

ρ̂_j = (1/N) Σ_{i=1}^{N} f_j(x^(i))

where f_j(x^(i)) signifies the activation of the jth neuron in the hidden layer of the AE. The SAE is obtained by cascading the encoder layers of AEs: recalling the mapping of the AE, the SAE is described as

x̂ = f_SAE(x) = f_AE^(L)( … f_AE^(1)(x) … )

where f_SAE denotes the SAE function. In every layer of the SAE, only the encoder function is employed; the decoder function is absent in each layer.
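Putting the three terms together, a sketch of the objective for one AE layer can be written in NumPy (the function name is ours; λ, β, and ρ are the hyperparameters defined above):

```python
import numpy as np

def sae_loss(X, X_hat, A, weights, lam=1e-4, beta=3.0, rho=0.05):
    """Total objective E = E_MSE + E_Reg + E_sparsity for one AE layer.

    X, X_hat : inputs and their reconstructions, shape (N, d)
    A        : hidden-layer activations, shape (N, h)
    weights  : list of weight matrices (for the L2 penalty)"""
    N = len(X)
    e = X - X_hat                                           # e_i = x(i) - x'(i)
    E_mse = np.sum(e ** 2) / N                              # mean squared error
    E_reg = lam * sum(np.sum(W ** 2) for W in weights)      # L2 regularization
    rho_hat = np.clip(A.mean(axis=0), 1e-8, 1 - 1e-8)       # average activation per unit
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    E_sparsity = beta * kl.sum()                            # KL sparsity penalty
    return E_mse + E_reg + E_sparsity
```

When the reconstruction is perfect and every hidden unit's average activation equals ρ, only the weight-decay term remains, which makes the role of each term easy to verify in isolation.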
The softmax classifier is a multiclass classifier that extends logistic regression (LR) to classification over several classes using supervised learning. In multiclass problems, the softmax classifier evaluates the probability of each class for the given data, and the probabilities over all classes sum to 1; it performs exponentiation and normalization to find the class probabilities. The function f_SC is connected to the SAE. Once the layers are trained, the subsequent step of training the model is called fine-tuning. It is the last step of the classification process and is applied to enhance the model performance: to reduce the classification error, the model is fine-tuned with supervised learning. Using the training data set, the complete network is trained in the same way as a multilayer perceptron (MLP); here, only the encoder portion of each AE is used.
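The exponentiate-and-normalize step can be written compactly (a generic softmax sketch, not code from the paper):

```python
import numpy as np

def softmax(z):
    """Map raw class scores to probabilities that sum to 1."""
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.round(3), p.sum())     # largest score gets the largest probability
```

Subtracting the maximum score leaves the result unchanged mathematically but prevents overflow for large logits.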

ADO-based parameter tuning
DL-based optimizers have a predefined learning rate by default [17], but in practical cases, DL models pose non-convex problems. To determine an effective learning rate for the DNN model, ADO is applied, which computes the learning rate so as to attain maximum classification performance. Adadelta was developed by Zeiler [18]. The main aim of this model is to circumvent Adagrad's weakness, the drastic reduction in learning rate caused by the accumulation of all previously squared gradients in the denominator. Adadelta instead measures the learning rate using the recent gradients processed within a limited time period, and it applies acceleration by considering previous updates. The Adadelta update rule is given below:
• The local running average of the squared gradients is determined: E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t².
• A second accumulator of squared updates provides the acceleration (momentum-like) term: E[Δθ²]_t = γ E[Δθ²]_{t−1} + (1 − γ) Δθ_t².
• Finally, the update expression is applied: Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) · g_t and θ_{t+1} = θ_t + Δθ_t, where RMS[x]_t = sqrt(E[x²]_t + ε).
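The Adadelta update can be sketched as a single function (illustrative; γ and ε follow common defaults, and the quadratic toy problem is our own example):

```python
import math

def adadelta_step(theta, grad, state, gamma=0.95, eps=1e-6):
    """One Adadelta update; note there is no global learning rate."""
    Eg2, Edx2 = state
    Eg2 = gamma * Eg2 + (1 - gamma) * grad * grad              # E[g^2]_t
    dx = -math.sqrt(Edx2 + eps) / math.sqrt(Eg2 + eps) * grad  # -(RMS[dx]/RMS[g]) g_t
    Edx2 = gamma * Edx2 + (1 - gamma) * dx * dx                # E[dx^2]_t
    return theta + dx, (Eg2, Edx2)

# toy run: minimise f(theta) = theta**2 (gradient 2*theta) starting from theta = 3
theta, state = 3.0, (0.0, 0.0)
for _ in range(500):
    theta, state = adadelta_step(theta, 2 * theta, state)
print(theta)  # theta has moved toward the minimum at 0
```

Because both accumulators decay with factor γ, only recent gradients influence the effective step size, which is exactly the property the text attributes to Adadelta.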

Performance validation
To examine the detection performance of the CIDD-ADODNN model, a series of simulations was carried out using three benchmark datasets. First, the intrusion detection (NSL KDDCup) dataset contains a total of 125,973 instances. Second, the Spam dataset comprises a total of 4601 instances. Third, the Chess dataset comprises nine features with a total of 503 instances. For experimentation, tenfold cross-validation is used to split the dataset into training and testing sets. Figures 4, 5 and 6 visualize the frequency distribution of the instances under distinct attributes on the three applied datasets. Besides, the snapshots generated at the time of simulation are provided in "Appendix". Table 2 provides the outcome of the ADASYN technique for class imbalance handling. The table values denote that the initial 125,973 instances in the KDDCup99 dataset are balanced into a set of 134,689 instances. Similarly, on the applied Spam dataset, the actual 4601 instances are balanced into a set of 5457 instances. Third, on the Chess dataset, the available 503 instances are increased to 616 instances by balancing. Figure 7 shows the ROC curves generated by the ADODNN and CIDD-ADODNN models on the applied KDDCup'99 dataset. Figure 7a depicts that the ADODNN model has resulted in a maximum ROC of 0.95, and Fig. 7b illustrates that the CIDD-ADODNN model has accomplished effective outcomes with a high ROC of 0.98. Figure 8 depicts the ROC curves generated by the ADODNN and CIDD-ADODNN models on the applied Spam dataset. Figure 8a illustrates that the ADODNN model has resulted in the highest ROC of 0.95; likewise, Fig. 8b shows that the CIDD-ADODNN model has also accomplished effective results with a high ROC of 0.98. Figure 9 demonstrates the ROC curves generated by the ADODNN and CIDD-ADODNN methodologies on the applied Chess dataset. Figure 9a showcases that the ADODNN model has resulted in a ROC of 0.67. Likewise, Fig.
9b illustrates that the CIDD-ADODNN model has accomplished a better ROC. Table 3 tabulates the classification results attained by the ADODNN and CIDD-ADODNN models on the three applied datasets, and Fig. 10 visualizes them. Table 4 and Fig. 13 present a detailed comparative results analysis of the CIDD-ADODNN model on the test KDDCup99 dataset [22]. The resultant values report that the Gradient Boosting and Naïve Bayesian models depicted inferior performance, obtaining minimum accuracy values of 0.843 and 0.896, respectively. Besides, the Gaussian process and OC-SVM models depicted slightly higher accuracy values of 0.911 and 0.918, respectively. Next, the DNN-SVM model accomplished a manageable accuracy of 0.92. However, the presented ADODNN and CIDD-ADODNN models exhibited superior performance, obtaining higher accuracies of 0.927 and 0.959, respectively. Table 5 and Fig. 14 present the comparative analysis on the test Spam dataset. The NB approach depicted a reasonable result with an accuracy value of 0.881, and the Flexible Bayes model accomplished a manageable accuracy of 0.888. But the proposed ADODNN and CIDD-ADODNN schemes achieved supreme performance, gaining maximum accuracies of 0.896 and 0.932, respectively. Table 6 and Fig. 15 give a detailed comparative results analysis of the CIDD-ADODNN model on the test Chess dataset [26]. The resultant values show that the ZeroR and SVM models depicted poor performance, accomplishing minimal accuracy values (0.390 for ZeroR). From the detailed experimental analysis, it is evident that the CIDD-ADODNN model has accomplished an effective outcome on all the applied datasets. In particular, the presented CIDD-ADODNN model obtained a maximum accuracy of 0.9592, 0.9320, and 0.7646 on the applied KDDCup, Spam, and Chess datasets, respectively. This is due to the following reasons: effective handling of class imbalance problems, accurate drift detection, and a proficient hyperparameter tuning process.
Therefore, the CIDD-ADODNN model has been found to be an effective tool for classifying highly imbalanced streaming data.

Conclusion
This paper has designed a novel CIDD-ADODNN model for the classification of highly imbalanced streaming data. Primarily, preprocessing of the raw streaming data takes place in three steps: format conversion, data transformation, and chunk generation. The ADASYN model receives the preprocessed data as input and makes use of a weighted distribution for the different minority class instances based on their level of difficulty in learning. The application of the ADASYN model balances the dataset effectively, and the drift detection process is then executed by the ADWIN technique. To determine an effective learning rate for the DNN model, ADO is applied, which computes the learning rate so as to attain maximum classification performance. To validate the classification results of the CIDD-ADODNN model, a comprehensive set of experiments was carried out. The simulation results verified the superior performance of the presented model, which obtained a maximum accuracy of 0.9592, 0.9320, and 0.7646 on the applied KDDCup, Spam, and Chess datasets, respectively. In the future, the performance of the CIDD-ADODNN model can be improved using feature selection and clustering techniques.