# Research on detection and integration classification based on concept drift of data stream


## Abstract

As a new type of data, the data stream is massive, high-speed, ordered, and continuous, and it is widely found in sensor networks, mobile communication, financial transactions, network traffic analysis, and other fields. However, the inherent problem of concept drift poses a great challenge to data stream mining. Therefore, this paper proposes a dual detection mechanism to judge whether concept drift has occurred and, on this basis, carries out integration classification of the data stream. The system periodically checks the data stream using the classification error as an index and uses the highly discriminative essential emerging patterns (eEP) to build the integrated classifiers, solving classification mining problems in a dynamic data stream environment. Experiments show that the proposed algorithm obtains better classification results while coping effectively with concept change.

## Keywords

Data stream, Concept drift detection mechanism, Essential emerging pattern, Integration classification

## Abbreviations

- eEP: Essential emerging pattern
- WUDCDD: Weighted classification and update algorithm of data stream based on concept drift detection

## 1 Introduction

With the continuous advancement of information technology and the rapid development of computer networks, the real world has generated a large number of data streams, such as weather monitoring data, stock trading data, and network access logs. As time goes on, the amount of data keeps growing and the data distribution becomes unstable, which easily gives rise to concept drift. Consequently, identifying data streams whose concepts change in a timely manner and classifying them accurately has become a research hotspot in data mining.

In recent years, the problem of concept drift has attracted more and more scholars’ attention, and increasingly reasonable solutions have been proposed. In general, the mainstream algorithms for dealing with concept drift fall into two types: direct algorithms and indirect algorithms. The earliest popular algorithms use detection metrics to judge concept drift directly, most commonly entropy values [1] and error rates; monitoring these metrics can reveal concept changes and even estimate the degree of drift.

In addition to the above, other scholars judge drift indirectly through the classification process. In 2001, Street proposed SEA (Streaming Ensemble Algorithm) [2], which introduced integration learning to the classification of data streams with concept drift for the first time. This method responds rapidly to concept changes and was shown to handle data streams of any size. In 2003, the DWM (Dynamic Weighted Majority) algorithm was proposed in [3]; it dynamically adjusts the weight of each base classifier in the ensemble and effectively tracks abrupt concept drifts. Sun et al. [4, 5] proposed an online integration classification algorithm that updates the weights of the base classifiers online and adds or deletes base classifiers according to these weights, thus solving the classification problem of dynamic data streams while adapting to concept drift.

Based on the research progress [6, 7] of related scholars, this paper first proposes a dual detection mechanism based on the classification error to monitor concept drift, mainly through a comprehensive two-dimensional judgment using the Mahalanobis distance and the *μ* value of the data stream samples. Second, against the background of concept drift, a classification algorithm [8, 9] based on EP is proposed to improve the accuracy of the overall integrated classifiers. Finally, drift detection is achieved while the performance of the classifier itself is adjusted. The remainder of this paper is organized as follows. Section 2 presents a mechanism for detecting concept drift. Section 3 introduces an integration classification algorithm based on emerging patterns, and in Section 4, we propose our major algorithm. Sections 5 and 6 describe the experimental dataset and analyze the experimental results of the proposed algorithm. Finally, Section 7 concludes the paper.

## 2 A dual concept drift detection mechanism based on error rate

### 2.1 Mahalanobis distance detection standard based on error rate

For high-dimensional datasets, the Mahalanobis distance has a significant advantage over the Euclidean distance: it takes the correlation between different attributes of the dataset into account and is independent of the measurement scale.

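Suppose a dataset **A** = (*a*_{1}, *a*_{2}, …, *a*_{i}, …, *a*_{n}) with *a*_{i} ≠ *a*_{j}; then the Mahalanobis distance between *a*_{i} and *a*_{j} is defined as

\[ {D}_M\left({a}_i,{a}_j\right)=\sqrt{{\left({a}_i-{a}_j\right)}^T{S}^{-1}\left({a}_i-{a}_j\right)} \qquad (1) \]

where the covariance matrix *S* is

\[ S=E\left[\left(\mathbf{A}-\mu \right){\left(\mathbf{A}-\mu \right)}^T\right] \qquad (2) \]

Among them, *μ*_{i} = *E*(*a*_{i}) is used to represent the expectation value of each vector, and *μ* = (*μ*_{1}, *μ*_{2}, …, *μ*_{n})^{T}.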

For a data stream *A* = (*A*_{1}, *A*_{2}, …, *A*_{i}, …, *A*_{n}, …)^{T}, the stream is processed sequentially in blocks for the convenience of the experiment. Here *A*_{i} represents the *i*th data block and err_{i} is the classification error rate on this block, that is, the average classification error rate over all data in the block. Then the Mahalanobis distance can be represented by a set of mean values *μ* = (*μ*_{1}, *μ*_{2}, …, *μ*_{n})^{T} and a covariance matrix *S*, as shown in Eq. (3):

\[ {D}_M(A)=\sqrt{{\left(A-\mu \right)}^T{S}^{-1}\left(A-\mu \right)} \qquad (3) \]

After calculation, the degree of error rate change on each data block is obtained, which indirectly reflects the similarity of adjacent data blocks; comparing it with the experimental threshold indicates whether drift has actually occurred. The further *D*_{M}(*A*) exceeds the threshold, the greater the possibility of concept drift, and the warning state is entered at this time.
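The distance check can be illustrated with a minimal sketch, assuming that each data block contributes a single scalar error rate. In that one-dimensional case the covariance matrix is just the variance, so the Mahalanobis distance reduces to the absolute deviation from the mean divided by the standard deviation. The function name `mahalanobis_1d`, the sample error rates, and the threshold value are illustrative assumptions, not values from the paper.

```python
import math
import statistics


def mahalanobis_1d(history, err):
    """Mahalanobis distance of a scalar error rate from a reference history.

    In one dimension the covariance matrix S is the variance, so
    sqrt((x - mu)^T S^-1 (x - mu)) becomes |x - mu| / sigma.
    """
    mu = statistics.mean(history)
    var = statistics.variance(history)  # sample variance, needs >= 2 values
    if var == 0.0:
        # Degenerate history: any deviation at all is infinitely far away.
        return 0.0 if err == mu else math.inf
    return abs(err - mu) / math.sqrt(var)


# Error rates on earlier blocks while the concept was stable (made-up values).
stable_errors = [0.10, 0.12, 0.11, 0.09, 0.10]
epsilon = 3.0  # hypothetical warning threshold

for err in (0.11, 0.25):
    d = mahalanobis_1d(stable_errors, err)
    print(f"err={err:.2f}  D_M={d:.2f}  warning={d > epsilon}")
```

With these made-up numbers, an error rate of 0.11 stays well below the threshold, while 0.25 is far above it and would trigger the warning state.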

### 2.2 *μ* detection standard based on error rate

The principle of the *μ* test in statistics is as follows. Let *X* be an arbitrary population whose first- and second-order moments exist, recorded as EX = *μ* and DX = *σ*^2 (*σ* is unknown). A one-sided hypothesis on *X* is set up: the null hypothesis *H*_{0}: *μ* ≤ *μ*_{0} (*μ*_{0} is a constant) and the alternative hypothesis *H*_{1}: *μ* > *μ*_{0}. The test level *α* is 0.05 or 0.01, and the sample mean \( \overline{X} \) is the quantity to be tested. When the number of samples *n* is large, the statistic is \( U=\frac{\overline{X}-{\mu}_0}{S/\sqrt{n}} \), where \( \overline{X} \) is the average of the samples and *S* is the standard deviation of the samples. When *μ* = *μ*_{0}, the statistic *U* approximately obeys the standard normal distribution *N*(0, 1). According to the given significance level *α*, there is a critical value *μ*_{α} that satisfies *P*{*U* > *μ*_{α}} ≈ *α*.

Let *X* have *n* samples, of which *m* are misclassified. The sample mean is then \( \overline{X}=m/n \) and the sample variance is \( {S}^2=\overline{X}\left(1-\overline{X}\right) \). At this point, the statistic *U* can be written in the following form:

\[ U=\frac{\overline{X}-{\mu}_0}{\sqrt{\overline{X}\left(1-\overline{X}\right)/n}} \]

The *μ* test in the data stream environment is applied on top of a given classification model. Because of the particularity of data streams, the quantity tested is the classification error rate on each data block, and *μ*_{0} is initialized as the average of the classification error rates on the first *i* data blocks, when the data distribution is stable. Therefore, the statistic *U* can be expressed as \( U=\frac{\mathrm{err}-{\mu}_0}{\sqrt{\mathrm{err}\left(1-\mathrm{err}\right)/n}} \). After each data block arrives, the change in the value of *U* is monitored. When *U* ≥ *μ*_{α}, the classification error rate is considered to have risen significantly and concept drift is judged to have occurred. Otherwise, the concepts in the current data stream remain stable.
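The test can be written as a short sketch, assuming that the critical value *μ*_{α} is the upper-*α* quantile of the standard normal distribution. The function names and the example numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist


def u_statistic(err, mu0, n):
    """U = (err - mu0) / sqrt(err * (1 - err) / n) for a block of n samples."""
    spread = err * (1.0 - err) / n
    if spread == 0.0:
        # err is exactly 0 or 1, so the estimated standard error vanishes.
        if err == mu0:
            return 0.0
        return math.inf if err > mu0 else -math.inf
    return (err - mu0) / math.sqrt(spread)


def error_rate_rose(err, mu0, n, alpha=0.05):
    """One-sided test: reject H0 (mu <= mu0) when U >= the critical value."""
    critical = NormalDist().inv_cdf(1.0 - alpha)  # about 1.645 for alpha = 0.05
    return u_statistic(err, mu0, n) >= critical


# Baseline error 0.10 from stable blocks, blocks of 1000 samples (made-up values).
print(error_rate_rose(0.105, 0.10, 1000))  # small rise, not significant
print(error_rate_rose(0.15, 0.10, 1000))   # large rise, significant
```

Note that the block size *n* matters: the same rise in error rate is more significant on a larger block because the standard error shrinks.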

The dual detection mechanism proposed in this section classifies each data block with the classifiers and measures the corresponding error rate. The classification error rate is then evaluated along two different dimensions, the Mahalanobis distance and the *μ* test. Concept drift is concluded only when both criteria are met at the same time. The workflow of the dual concept drift detection mechanism is as follows.

**Input:** Dataset *A*; data block length *L*; threshold *ε*; significance level *α*.

**Output:** Classification error rate err_{i} on the *i*th data block; Mahalanobis distance *D*_{M}(*A*); *μ* test statistic *U*; the judgment of whether concept drift occurs.

**Process:**

- 1: Data preprocessing ← data blocks *A*_{1}, *A*_{2}, …, *A*_{i}, *A*_{i + 1}, …, *A*_{n}, …
- 2: Initialization: err_{i} ← 0, *D*_{M}(*A*) ← 0, *U* ← 0.
- 3: For the arriving data block *A*_{i}:
- 4: Enter the dual concept drift detection mechanism;
- 5: Apply the basic classification algorithm based on eEP to learn, and return err_{i};
- 6: Enter the Mahalanobis distance detection part;
- 7: Calculate the Mahalanobis distance by Eq. (3);
- 8: If *D*_{M}(*A*) > *ε*, a warning appears, marked as Re1;
- 9: Enter the *μ* hypothesis test module;
- 10: Obtain the statistic of the current data block by the formula \( U=\frac{\mathrm{err}-{\mu}_0}{\sqrt{\mathrm{err}\left(1-\mathrm{err}\right)/n}} \);
- 11: If *U* ≥ *μ*_{α}, the null hypothesis of the *μ* test is rejected, denoted as Re2;
- 12: Take the intersection of the detection results of the two parts, Result = Re1 ∩ Re2;
- 13: If Result holds, the system determines that concept drift has occurred.
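The workflow above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the error rate of each block is a single scalar (so the Mahalanobis distance is one-dimensional), the reference statistics come from a fixed list of error rates recorded while the stream was stable, and the critical value is the upper-*α* quantile of *N*(0, 1). The name `detect_drift` and the default threshold are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def detect_drift(stable_errors, err, n, epsilon=3.0, alpha=0.05):
    """Return (drift, distance, u) for the error rate err of a block of n samples."""
    mu0 = statistics.mean(stable_errors)
    sigma = statistics.stdev(stable_errors)

    # Re1: one-dimensional Mahalanobis distance compared with the threshold.
    if sigma > 0:
        distance = abs(err - mu0) / sigma
    else:
        distance = 0.0 if err == mu0 else math.inf
    re1 = distance > epsilon

    # Re2: one-sided mu test; U >= critical value rejects H0.
    std_err = math.sqrt(err * (1.0 - err) / n)
    if std_err > 0:
        u = (err - mu0) / std_err
    else:
        u = 0.0 if err == mu0 else math.copysign(math.inf, err - mu0)
    re2 = u >= NormalDist().inv_cdf(1.0 - alpha)

    # Result = Re1 ∩ Re2: drift is reported only when both checks fire.
    return re1 and re2, distance, u


stable = [0.10, 0.12, 0.11, 0.09, 0.10]  # made-up error rates
for err in (0.11, 0.13, 0.25):
    drift, d, u = detect_drift(stable, err, n=1000)
    print(f"err={err:.2f}  D_M={d:.2f}  U={u:.2f}  drift={drift}")
```

In this toy run the error rate 0.13 passes the *μ* test but not the distance check, so no drift is reported; requiring both conditions is what makes the mechanism conservative.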

## 3 Integration classification algorithm based on EP

### 3.1 Basic concepts

Suppose the training dataset DB consists of *n* samples, each of which has *m* attributes. The *n* samples are divided into *K* categories *C*_{1}, *C*_{2}, …, *C*_{K}. A pair consisting of an attribute name and its corresponding attribute value constitutes a data item. Let *I* = {*i*_{1}, *i*_{2}, …, *i*_{n}} denote the set of all data items; then any subset *X* of *I* is called an item set.

Definition 1: Suppose *D* is a subset of the training set DB. The support of item set *X* on *D* is recorded as Sup_{D}(*X*) and defined as Sup_{D}(*X*) = Count_{D}(*X*)/∣*D*∣, where Count_{D}(*X*) is the number of samples of *D* that contain *X* and ∣*D*∣ is the total number of samples of *D*.

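Definition 2: Given two different datasets *D* and *D*^{′}, the change in the support of the item set *X* from *D*^{′} to *D* is the growth rate, marked as \( {\mathrm{GR}}_{D^{\prime}\to D}(X) \):

\[ {\mathrm{GR}}_{D^{\prime}\to D}(X)=\begin{cases}0, & {\mathrm{Sup}}_{D^{\prime }}(X)=0\ \mathrm{and}\ {\mathrm{Sup}}_D(X)=0\\ \infty, & {\mathrm{Sup}}_{D^{\prime }}(X)=0\ \mathrm{and}\ {\mathrm{Sup}}_D(X)\ne 0\\ \dfrac{{\mathrm{Sup}}_D(X)}{{\mathrm{Sup}}_{D^{\prime }}(X)}, & \mathrm{otherwise}\end{cases} \]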

Definition 3: Given a growth rate threshold *ρ* > 1, if the growth rate of the item set *X* from *D*^{′} to *D* satisfies \( {\mathrm{GR}}_{D^{\prime}\to D}(X)\ge \rho \), then *X* is called an emerging pattern (EP) from *D*^{′} to *D*, or simply an EP of *D*, and its growth rate is abbreviated as GR_{D}(*X*).

Definition 4: If the item set *X* satisfies:

- 1) *X* is an EP of *D*;
- 2) The support of *X* in *D* is not less than the minimum support threshold *ξ*;
- 3) No proper subset of *X* meets conditions 1 and 2;

then *X* is called an essential emerging pattern (eEP), which is the basic EP.
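Definitions 1 to 4 can be made concrete with a small sketch. It assumes that samples and item sets are Python sets of `"attribute=value"` strings, and that the growth rate follows the usual convention: the ratio of supports, 0 when both supports are 0, and infinite when only the support in *D*^{′} is 0. Condition 3 is read as "no non-empty proper subset satisfies both conditions 1 and 2". All names and the toy data are illustrative.

```python
import math
from itertools import combinations


def support(itemset, dataset):
    """Fraction of samples in dataset (a list of sets) that contain itemset."""
    if not dataset:
        return 0.0
    return sum(1 for sample in dataset if itemset <= sample) / len(dataset)


def growth_rate(itemset, d_prime, d):
    """Growth rate of itemset from d_prime to d."""
    sup_prime, sup = support(itemset, d_prime), support(itemset, d)
    if sup_prime == 0.0:
        return 0.0 if sup == 0.0 else math.inf
    return sup / sup_prime


def is_frequent_ep(itemset, d_prime, d, rho, xi):
    """Conditions 1 and 2: growth rate >= rho and support in d >= xi."""
    return growth_rate(itemset, d_prime, d) >= rho and support(itemset, d) >= xi


def is_eep(itemset, d_prime, d, rho, xi):
    """Condition 3 added: no non-empty proper subset meets conditions 1 and 2."""
    if not is_frequent_ep(itemset, d_prime, d, rho, xi):
        return False
    items = sorted(itemset)
    for size in range(1, len(items)):
        for subset in combinations(items, size):
            if is_frequent_ep(frozenset(subset), d_prime, d, rho, xi):
                return False
    return True


# Toy data: d holds samples of the target class, d_prime the remaining samples.
d = [{"a=1", "b=1"}, {"a=1", "b=1"}, {"a=1", "b=2"}, {"a=2", "b=1"}]
d_prime = [{"a=1", "b=2"}, {"a=2", "b=2"}, {"a=2", "b=1"}, {"a=2", "b=2"}]

pair = frozenset({"a=1", "b=1"})
print(growth_rate(frozenset({"a=1"}), d_prime, d))  # support 3/4 versus 1/4
print(is_eep(pair, d_prime, d, rho=2, xi=0.5))      # subset {a=1} already qualifies
print(is_eep(pair, d_prime, d, rho=4, xi=0.5))      # no subset reaches rho = 4
```

The exhaustive subset check is exponential in the pattern length and is only meant to mirror the definition; a practical miner would prune the search.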

### 3.2 Using eEP to establish base classifier

For large databases, especially high-dimensional datasets, eEP has clear advantages over EP in time and space complexity. Moreover, an eEP is the shortest EP, which greatly reduces the redundancy of EPs in classification.

Consider classifying a sample *S* using eEPs. Let *D*_{i} be the set of *C*_{i} class training samples, \( {D}_i^{\prime } \) be the set of non-*C*_{i} class training samples, and *X* be an eEP of the *C*_{i} class. If *X* does not appear in *S*, *X* gives no evidence as to whether *S* belongs to the *C*_{i} class. If *X* appears in *S*, then *X* indicates that *S* belongs to the *C*_{i} class with probability \( \frac{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)}{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)+1} \) and that *S* does not belong to the *C*_{i} class with probability \( \frac{1}{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)+1} \). If \( \mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)=\infty \), then \( \frac{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)}{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)+1}=1 \) and \( \frac{1}{\mathrm{GR}\left(X,{D}_i^{\prime },{D}_i\right)+1}=0 \).

At the same time, the eEPs of the non-*C*_{i} class also contribute to determining whether *S* belongs to the *C*_{i} class. Let *Y* be an eEP of the non-*C*_{i} class that appears in *S*. If the growth rate of *Y* is large, the effect of *Y* on determining that *S* belongs to the *C*_{i} class is negligible. However, when the growth rate of *Y* is not too large (such as \( \mathrm{GR}\left(Y,{D}_i,{D}_i^{\prime}\right)<5 \)), *Y* has a considerable influence on determining that *S* belongs to the *C*_{i} class. In general, we take the probability that *S* belongs to the *C*_{i} class, given *Y*, to be \( \frac{1}{\mathrm{GR}\left(Y,{D}_i,{D}_i^{\prime}\right)+1} \).
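The two per-pattern probabilities can be computed directly from a growth rate, as in the following sketch. The function names are illustrative, and the infinite growth rate is handled explicitly so that the limits 1 and 0 are returned rather than a not-a-number value.

```python
import math


def prob_from_own_eep(gr):
    """Probability that S is in C_i given a matching eEP X of C_i: GR / (GR + 1)."""
    return 1.0 if math.isinf(gr) else gr / (gr + 1.0)


def prob_from_opposing_eep(gr):
    """Probability that S is in C_i given a matching eEP Y of non-C_i: 1 / (GR + 1)."""
    return 0.0 if math.isinf(gr) else 1.0 / (gr + 1.0)


print(prob_from_own_eep(3.0))            # 0.75
print(prob_from_own_eep(math.inf))       # 1.0
print(prob_from_opposing_eep(4.0))       # 0.2
print(prob_from_opposing_eep(math.inf))  # 0.0
```

The opposing pattern with growth rate 4 still leaves a probability of 0.2 for the *C*_{i} class, which is why eEPs with modest growth rates cannot be ignored.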

In order to classify the sample *S*, it is necessary to consider the effects of the eEPs of the *C*_{i} class and non-*C*_{i} class. Therefore, the concept of membership is introduced, and the possibility that *S* belongs to the *C*_{i} class is called the membership of *S* to *C*_{i}, denoted as Bel(*S*).

For *i* = 1, 2, …, *K*, let PS(*S*, *C*_{i}) = {*X* | *X* is an eEP of *D*_{i}, and *X* appears in *S*} and NS(*S*, *C*_{i}) = {*Y* | *Y* is an eEP of \( {D}_i^{\prime } \), and *Y* appears in *S*}. The membership value of *S* belonging to the *C*_{i} class is calculated by:

The membership of *S* in each class is calculated by the above formula, and *S* is then assigned to the class with the largest degree of membership. If the class with the highest degree of membership is not unique, the class is determined by a majority voting strategy.

### 3.3 Integrate base classifier based on eEP

Considering the temporality and fluidity of data stream, the research in this paper is carried out in the sliding window. Suppose SW is a fixed-size sliding window, *K* is the number of basic windows in the sliding window. BW is the basic window, labeled as bw, and its length is |BW|. The trained base classifier of basic window bw_{i} is *E*_{i}.

For a sample of the form (*x*, *c*), where *c* is the true class label, the classification error of *E*_{i} is 1 − \( {f}_c^i(x) \), where \( {f}_c^i(x) \) is the probability, given by *E*_{i}, that *x* belongs to class *c*. Therefore, the mean square error of *E*_{i} on the data *D* is

\[ {\mathrm{MSE}}_i=\frac{1}{\mid D\mid}\sum \limits_{\left(x,c\right)\in D}{\left(1-{f}_c^i(x)\right)}^2 \]

The mean square error of a classifier that makes random predictions is \( {\mathrm{MSE}}_r={\sum}_cp(c){\left(1-p(c)\right)}^2 \), where *p*(*c*) is the proportion of class *c*. MSE_{r} is used as the threshold for weighting the classifier. To simplify the calculation, the weight *w*_{i} is calculated using the following formula.

\[ {w}_i={\mathrm{MSE}}_r-{\mathrm{MSE}}_i \]
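The weighting step can be sketched as follows. The sketch assumes that a base classifier is a function returning a dictionary of class probabilities, that the mean square error is averaged over the labelled samples of the current basic window, and that the weight takes the common accuracy-weighted form *w*_{i} = MSE_{r} − MSE_{i}; these are assumptions made for illustration, as are all names and numbers.

```python
from collections import Counter


def mse_of_classifier(predict_proba, block):
    """Mean of (1 - f_c(x))^2 over labelled samples (x, c) in the block."""
    return sum((1.0 - predict_proba(x).get(c, 0.0)) ** 2 for x, c in block) / len(block)


def mse_random(block):
    """MSE_r = sum_c p(c) * (1 - p(c))^2, with p(c) the class frequency in the block."""
    counts = Counter(c for _, c in block)
    total = len(block)
    return sum((k / total) * (1.0 - k / total) ** 2 for k in counts.values())


def weight(predict_proba, block):
    """Assumed form w_i = MSE_r - MSE_i; no better than random gives w_i <= 0."""
    return mse_random(block) - mse_of_classifier(predict_proba, block)


# Toy block of (sample, label) pairs and a toy probabilistic classifier.
block = [(0, "pos"), (1, "pos"), (2, "neg"), (3, "neg")]


def toy_classifier(x):
    return {"pos": 0.9, "neg": 0.1} if x < 2 else {"pos": 0.2, "neg": 0.8}


print(mse_random(block))
print(mse_of_classifier(toy_classifier, block))
print(weight(toy_classifier, block))
```

Under this form, a classifier that is confident and correct on the newest data receives a weight close to MSE_{r}, while one that performs like random guessing receives a weight near zero.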

The integration algorithm is as follows:

**Input**: Sup, GR; *K*, the total number of base classifiers; *D*, the data contained in the basic window bw_{k + 1}; *E*, the set of *K* base classifiers before adjusting weights.

**Output**: the top *K* base classifiers with the highest weights in *E* ∪ {*E*_{k + 1}}.

- (1) Initialize *K*, Sup, GR;
- (2) While (bw_{k + 1} arrives) {
- (3) Train(*D*, Sup, GR); // train base classifier *E*_{k + 1}
- (4) Calculate the error rate of *E*_{k + 1} on *D* (10-fold cross-validation);
- (5)
- (6) for (*E*_{i} ∈ *E*) {
- (7) *E*_{i} ← Train(*E*_{i}, *D*);
- (8) Calculate the MSE_{i} of *E*_{i} on *D*; // mean square error formula
- (9) Calculate the weight *w*_{i} corresponding to *E*_{i}; // weight formula
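The output of the algorithm, the *K* highest-weighted classifiers among the existing ensemble and the newly trained one, can be sketched as below. Classifiers are represented by plain names and the weights are supplied by a lookup function; both are placeholders for illustration rather than the paper's data structures.

```python
import heapq


def update_ensemble(ensemble, candidate, weight_of, k):
    """Return the k highest-weighted classifiers among ensemble + [candidate]."""
    pool = list(ensemble) + [candidate]
    scored = [(weight_of(clf), clf) for clf in pool]
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [clf for _, clf in best]


# Made-up weights measured on the newest basic window.
weights = {"E1": 0.20, "E2": 0.05, "E3": 0.15, "E4": 0.18}
print(update_ensemble(["E1", "E2", "E3"], "E4", weights.get, k=3))
# ['E1', 'E4', 'E3']
```

Here the new classifier `E4` replaces the weakest member `E2`, so the ensemble size stays at *K* = 3.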

## 4 Integration system under the environment of data stream with concept drift

- (1) Building an integration classifier

The system trains *K* base classifiers to form the integrated classifier *E*. When the sliding window reaches the (*K* + 1)th basic window, the base classifier *E*_{k + 1} is trained and the classification error rate of each base classifier *E*_{i} is calculated. The classifiers are then weighted by the method proposed in Section 3.3, and the *K* base classifiers with the highest weights are selected as the output.

- (2) Concept drift detection

If the Mahalanobis distance of the classification error rate exceeds the threshold *ε*, it is judged that a concept change is highly probable, and the warning state is entered. On this basis, the hypothesis test is then carried out. If the classification error rate on the new data block has increased significantly, the system concludes that concept drift has occurred.

- (3) Updating classifiers

This part integrates the classifiers by weighting each base classifier, where the weight of each base classifier is derived from its classification error rate. If the concept drift detection module determines that concept drift has occurred, the data block in the current window is used as a training set and each base classifier is relearned. The weights of the relearned base classifiers are then compared, and old base classifiers are selectively eliminated or retained while the total number of base classifiers remains unchanged, so that the updated system is better suited to the current data stream environment.
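One common way to integrate weighted base classifiers is a weighted vote, sketched below. The paper does not spell out the combination rule in this section, so the rule, the clipping of non-positive weights, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine base-classifier labels; each vote counts with the classifier's weight."""
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += max(w, 0.0)  # non-positive weights contribute nothing
    return max(totals, key=totals.get)


# Three base classifiers vote on one sample (made-up labels and weights).
print(weighted_vote(["pos", "neg", "neg"], [0.30, 0.10, 0.15]))  # pos
```

In this example the single high-weight classifier outvotes two low-weight ones, which is the intended effect of weighting by recent performance.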

## 5 Experimental results and discussion

### 5.1 Dataset

A hyperplane in *d*-dimensional space is a set of points *x* that satisfy the following condition:

\[ \sum \limits_{i=1}^d{w}_i{x}_i={w}_0 \]

Here *x*_{i} is the *i*th coordinate of point *x*. Samples that satisfy \( \sum \limits_{i=1}^d{w}_i{x}_i>{w}_0 \) are marked as positive samples, and samples that satisfy \( \sum \limits_{i=1}^d{w}_i{x}_i<{w}_0 \) are marked as negative samples. To simulate time-varying concepts, we change the orientation of the hyperplane smoothly by adjusting the corresponding weights *w*_{i}, which makes the moving hyperplane well suited to simulating concept drift. In the experiment, the training set size is 10,000, the test set size is 1000, there are 10 dimensions in total, the number of different values in each dimension is 4, and the noise rate is 5%.
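A generator for such a stream might look like the following sketch. Several details are assumptions rather than the paper's setup: the four attribute levels, the choice *w*_{0} = 0.5 Σ *w*_{i} to keep the classes roughly balanced, the labelling of boundary points as negative, and the way drift is applied by adding a fixed step to the first few weights.

```python
import random

LEVELS = (0.125, 0.375, 0.625, 0.875)  # assumed four distinct values per dimension


def make_block(weights, n, noise=0.05, rng=random):
    """Sample n labelled points; label is 1 when sum(w_i * x_i) > w_0, else 0."""
    w0 = 0.5 * sum(weights)  # assumed threshold that roughly balances the classes
    block = []
    for _ in range(n):
        x = [rng.choice(LEVELS) for _ in weights]
        label = 1 if sum(w * xi for w, xi in zip(weights, x)) > w0 else 0
        if rng.random() < noise:
            label = 1 - label  # class noise: flip the label
        block.append((x, label))
    return block


def drift(weights, step, dims):
    """Shift the weights of the first `dims` dimensions to rotate the hyperplane."""
    return [w + step if i < dims else w for i, w in enumerate(weights)]


rng = random.Random(0)
weights = [1.0] * 10
before = make_block(weights, 1000, rng=rng)
weights = drift(weights, step=0.1, dims=4)  # concept changes in 4 dimensions
after = make_block(weights, 1000, rng=rng)
print(len(before), len(after))
```

Calling `drift` repeatedly with a small step between blocks gives a gradual change, while a single large step imitates an abrupt one.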

## 6 Results and discussion

In this section, we compare the accuracy of the proposed algorithm WUDCDD, *G*_{K} (a single classifier trained on a sliding window of size *K*), and EC4.5 (an ensemble of C4.5 base classifiers) under different conditions. The accuracy is compared from four aspects: (1) the influence of the size of the basic window on classification accuracy, (2) the influence of the size of the sliding window on accuracy, (3) the influence of the number of drifting dimensions on accuracy, and (4) the influence of the dimension of the data stream on accuracy.

It can be seen from Fig. 1 that the proposed algorithm is more effective than the corresponding single classifier *G*_{K}. When |BW| ≤ 250 × 3, the accuracy of WUDCDD is higher than that of EC4.5. When 250 × 3 ≤ |BW| ≤ 250 × 6, it is comparable to that of EC4.5. In the range [250 × 4, 250 × 6], the accuracy of WUDCDD declines less noticeably than that of EC4.5, because we incrementally update each model before calculating the weight of each base classifier, so the ensemble adapts better to concept drift.

When the basic window is small, each algorithm has better classification performance, because the window contains less concept drift and the data distribution is more stable. However, if the window is too small, the accuracy is reduced because there is not enough data to train the base classifier; in that case, the performance of the base classifier can be improved by reducing the support threshold. When the window is too large, it is difficult to detect whether drift occurs, which also affects performance.

Figure 2 shows that as the sliding window grows, the accuracy of WUDCDD and EC4.5 increases continuously and exceeds that of *G*_{K}, because *G*_{K} does not adapt well to concept drift. Moreover, the performance of WUDCDD and EC4.5 increases rapidly at first, and then the gain gradually diminishes: because drift is already detected well, adding more base classifiers has only a weak effect on classification performance. When |SW| < 8, WUDCDD is slightly better than EC4.5, because its base classifier performs better than C4.5. When *K* > 8, the performance of the two is close.

*G*_{K} is most affected because it has no mechanism for handling drift. When the number of varying dimensions is in the range [2, 4], the performance difference between WUDCDD and EC4.5 is very small. When the number of varying dimensions is in the range [4, 8], the accuracy of EC4.5 decreases more noticeably. This is because EC4.5 does not incrementally adjust its decision trees as new data arrive, whereas WUDCDD always maintains the most discriminating eEPs and constantly adjusts the mined EPs to reflect the characteristics of the data.

*K* = 6. The experimental results are shown in Fig. 4.

According to the trend of the curve in the figure, the accuracy decreases as the dimension of the dataset increases. This is because more dimensions lead to a larger number of eEPs, while their support and growth rate generally decrease, which reduces the discrimination of the eEPs and weakens the classification ability. So the accuracy of WUDCDD decreases. As the number of dimensions increases, the number of classification rules of EC4.5 also increases, which likewise lowers its accuracy. When the accuracy of WUDCDD drops, it can be adjusted by lowering the support threshold.

## 7 Conclusions

How to train models from massive data to effectively predict future data streams has become a hot topic. Traditional data classification algorithms cannot be applied directly to the data stream environment. Therefore, this paper introduces the eEP classification algorithm into the field of data stream classification and proposes an algorithm for detection and integration classification on data streams with concept drift. Comparison with the other two algorithms shows that the proposed algorithm adapts better to data streams with concept drift and achieves classification accuracy comparable to or better than that of the integration algorithm based on C4.5. Finally, the experimental results indicate that the update strategy of the algorithm in the sliding window needs further research and improvement before it can be applied to more specific fields of data mining.

## Notes

### Acknowledgements

Not applicable

### Funding

This paper is supported by Natural Youth Science Foundation of China (61401310) and Tianjin Science Foundation (18JCYBJC86400).

### Availability of data and materials

All data generated or analyzed during this study are included in this published article.

### Authors’ contributions

BJZ analyzed and proposed the significance of current data mining and data analysis and carried out the experimental verification. YDC analyzed the experimental data and was a major contributor to the writing of the manuscript. The final draft was read and approved by both authors.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

- 1. P. Vorburger, A. Bernstein, in *International Conference on Data Mining*. Entropy-based concept shift detection (2006), pp. 1113–1118
- 2. W.N. Street, in *ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*. A streaming ensemble algorithm (SEA) for large-scale classification (2001), pp. 377–382
- 3. J.Z. Kolter, M.A. Maloof, in *Proceedings of the 3rd IEEE International Conference on Data Mining*. Dynamic weighted majority: A new ensemble method for tracking concept drift (2003), pp. 123–130
- 4. Y. Sun, G.J. Mao, X. Liu, C.N. Liu, Concept drift mining in data stream based on multi-classifier. J. Autom. **34**(1), 93–97 (2008)
- 5. L.L. Minku, A.P. White, X. Yao, The impact of diversity on online ensemble learning in the presence of concept drift. IEEE Trans. Knowl. Data Eng. **22**(5), 730–742 (2010)
- 6. K. Nishida, K. Yamauchi, in *Proceedings of the International Conference on Discovery Science*. Detecting concept drift using statistical testing (Springer-Verlag, Berlin Heidelberg, 2007), pp. 264–269
- 7. R. Elwell, R. Polikar, Incremental learning of concept drift in nonstationary environments. IEEE Trans. Neural Netw. **22**(10), 1517–1531 (2011)
- 8. M. Fan, M.X. Liu, H.L. Zhao, A classification algorithm based on basic exposure mode. Comput. Sci. **31**(11), 211–214 (2004)
- 9. L. Duan, C.J. Tang, N. Yang, C. Gou, Research and application progress of contrast mining based on revealing mode. J. Comput. Appl. **32**(02), 304–308 (2012)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.