
Mining in Anticipation for Concept Change: Proactive-Reactive Prediction in Data Streams

Original Article · Data Mining and Knowledge Discovery

Abstract

Prediction in streaming data is an important activity in modern society. Two major challenges posed by data streams are that (1) the data may grow without limit, so it is difficult to retain a long history of raw data; and (2) the underlying concept of the data may change over time. The novelty of this paper is fourfold. First, it uses a measure of conceptual equivalence to organize the data history into a history of concepts. This contrasts with the common practice of keeping only recent raw data. The concept history is compact while still retaining essential information for learning. Second, it learns concept-transition patterns from the concept history and anticipates what the concept will be in the case of a concept change. It then proactively prepares a prediction model for the future change. This contrasts with the conventional methodology that passively waits until the change happens. Third, it incorporates proactive and reactive predictions. If the anticipation turns out to be correct, a proper prediction model can be launched instantly upon the concept change. If not, the system promptly resorts to a reactive mode, adapting a prediction model to the new data. Finally, an efficient and effective system, RePro, is proposed to implement these ideas. It carries out prediction at two levels: a general level of predicting each oncoming concept and a specific level of predicting each instance's class. Experiments compare RePro with representative existing prediction methods on benchmark data sets that represent diverse scenarios of concept change. The empirical evidence offers inspiring insights and demonstrates that the proposed methodology is an advisable solution to prediction in data streams.
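To make the workflow concrete, here is a minimal Python sketch of the proactive-reactive idea, written from the description above alone. It is not the authors' RePro implementation: the names ConceptHistory, train_model, conceptually_equivalent, and change_detected are placeholders assumed for illustration, and the windowed loop glosses over the trigger-detection and stable-learning details of the actual system.

```python
# Minimal sketch of the proactive-reactive idea described in the abstract.
# NOT the authors' RePro implementation: all names and the windowed loop
# are simplifications assumed for illustration only.

from collections import defaultdict


class ConceptHistory:
    """Keeps previously seen concepts (models) plus counts of concept transitions."""

    def __init__(self):
        self.models = {}                                          # concept id -> model
        self.transitions = defaultdict(lambda: defaultdict(int))  # counts of i -> j
        self.current = None                                       # id of the active concept

    def record(self, concept_id, model):
        """Store the model and the transition from the previous concept to this one."""
        self.models[concept_id] = model
        if self.current is not None and self.current != concept_id:
            self.transitions[self.current][concept_id] += 1
        self.current = concept_id

    def anticipate_next(self):
        """Proactive step: the historically most frequent successor of the current concept."""
        successors = self.transitions.get(self.current, {})
        return max(successors, key=successors.get) if successors else None


def predict_stream(windows, history, train_model, conceptually_equivalent, change_detected):
    """General level: decide which concept governs each window.
    Specific level: use that concept's model to classify the window's instances."""
    model, concept_id = None, None
    for window in windows:                                 # window = list of (x, y) pairs
        if model is None or change_detected(model, window):
            guess = history.anticipate_next()
            if guess is not None and conceptually_equivalent(history.models[guess], window):
                concept_id, model = guess, history.models[guess]          # proactive hit
            else:                                                         # reactive fallback
                model = train_model(window)
                concept_id = next((cid for cid, old in history.models.items()
                                   if conceptually_equivalent(old, window)),
                                  len(history.models))
                model = history.models.get(concept_id, model)
            history.record(concept_id, model)
        yield [model.predict(x) for x, _ in window]
```

In this sketch the transition table stands in for the learned concept-transition patterns: when a change is detected, the historically most frequent successor of the current concept is tried first (proactive), and a fresh model is learned only if that anticipation does not fit the new data (reactive).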


References

  • Aggarwal, C.C., Han, J., Wang, J., and Yu, P.S. 2003. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases, pp. 81–92.

  • Blake, C.L. and Merz, C.J. 2005. UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Department of Information and Computer Science, University of California, Irvine.

  • Ganti, V., Gehrke, J., and Ramakrishnan, R. 2001. DEMON: Mining and monitoring evolving data. IEEE Transactions on Knowledge and Data Engineering, 13:50–63.


  • Gehrke, J., Ganti, V., Ramakrishnan, R., and Loh, W.Y. 1999. BOAT: Optimistic decision tree construction. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 169–180.

  • Harries, M.B. and Horn, K. 1996. Learning stable concepts in a changing world. In PRICAI Workshops, pp. 106–122.

  • Hulten, G., Spencer, L., and Domingos, P. 2001. Mining time-changing data streams. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 97–106.

  • Jain, R. 1991. The art of computer systems performance analysis: Techniques for experimental design, measurement, simulation, and modeling. Wiley-Interscience, NY. Winner of the 1991 'Best Advanced How-To Book, Systems' award from the Computer Press Association.

  • Keogh, E. and Kasetty, S. 2002. On the need for time series data mining benchmarks: A survey and empirical demonstration. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 102–111.

  • Kolter, J.Z. and Maloof, M.A. 2003. Dynamic weighted majority: A new ensemble method for tracking concept drift. In Proceedings of the 3rd International IEEE Conference on Data Mining, pp. 123–130.

  • Lanquillon, C. and Renz, I. 1999. Adaptive information filtering: Detecting changes in text streams. In Proceedings of the 8th International Conference on Information and Knowledge Management, pp. 538–544.

  • Quinlan, J.R. 1993. C4.5: Programs for machine learning. Morgan Kaufmann Publishers.

  • Salganicoff, M. 1997. Tolerating concept and sampling shift in lazy learning using prediction error context switching. Artificial Intelligence Review, 11:133–155.


  • Stanley, K.O. 2003. Learning concept drift with a committee of decision trees. Technical Report AI-03-302, Department of Computer Sciences, University of Texas at Austin.

  • Street, W.N. and Kim, Y. 2001. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 377–382.

  • Tsymbal, A. 2004. The problem of concept drift: Definitions and related work. Technical Report TCD-CS-2004-15, Computer Science Department, Trinity College Dublin.

  • Wang, H., Fan, W., Yu, P.S., and Han, J. 2003. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235.

  • Widmer, G. and Kubat, M. 1996. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23:69–101.


  • Yang, Y., Wu, X., and Zhu, X. 2004. Dealing with predictive-but-unpredictable attributes in noisy data sources. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 471–483.


Author information

Correspondence to Ying Yang.

Additional information

A preliminary and shorter version of this paper was published in the Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2005), pp. 710–715.

The literature sometimes conflicts when describing these modes; for example, what some papers call concept shift is what other papers call concept drift. The definitions here are clarified to the best of the authors' understanding.

The value in each cell can be a frequency as well as a probability; the latter can be approximated from the former.
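As a toy illustration of this footnote (the numbers and names below are assumed, not taken from the paper), frequencies in such a table can be normalized row by row into probabilities:

```python
# Toy example (assumed values): row-wise normalization of transition frequencies.
freq = {"A": {"B": 6, "C": 2}, "B": {"A": 4}}      # observed concept-transition counts
prob = {src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in freq.items()}
print(prob["A"])                                   # {'B': 0.75, 'C': 0.25}
```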

If the concept changes so fast that learning cannot catch up with it, prediction will be unreliable. This also applies to human learning.

For example, C4.5rules (Quinlan, 1993) can achieve 100% classification accuracy on the whole data set.

If an attribute value has fewer than 500 instances, all of its instances are sampled without replacement.

If a data set has only nominal attributes, two nominal attributes will be selected. If a data set has only numeric attributes, two numeric attributes will be selected.

One cannot manipulate these degrees in the hyperplane or network-intrusion data, for which no results are presented.

The sample size is chosen to avoid observation noise caused by high classification variance.

These error rates may sometimes be higher than those reported in the original work (Hulten et al., 2001) because the original work used a much larger data size. Many more instances arrive after the new classifier becomes stable and hence can be classified correctly. This longer existence of each concept relieves CVFDT's dilemma and lowers its average error rate.

Please note that for DWCE, the optimal version, whose buffer size equals 10% of its window size, was used on the 3 artificial data streams. However, its prohibitively high time demand makes DWCE intractable for the large number (36) of real-world data streams tested here, so a compromise version whose buffer size is half of its window size is used instead. The results are sufficient to verify that DWCE trades time for accuracy: it can improve prediction accuracy over WCE, but is often too slow to be useful for on-line prediction.


Cite this article

Yang, Y., Wu, X. & Zhu, X. Mining in Anticipation for Concept Change: Proactive-Reactive Prediction in Data Streams. Data Min Knowl Disc 13, 261–289 (2006). https://doi.org/10.1007/s10618-006-0050-x
