Summary
Knowledge discovery from infinite data streams is an important and difficult task. It poses two challenges: the overwhelming volume of streaming data, and concept drift. In this chapter, we introduce a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, and naive Bayes, from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency of learning the model and the accuracy of classification. Our empirical study shows that the proposed methods have a substantial advantage over single-classifier approaches in prediction accuracy, and that the ensemble framework is effective for a variety of classification models.
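The chunk-based weighted ensemble described above can be sketched in a few lines. The sketch below is illustrative only: the class names (`WeightedEnsemble`, `MajorityClass`) are ours, and for simplicity it weights each member by its raw accuracy on the newest chunk, whereas the chapter's method derives weights from expected classification error (e.g., relative to a random classifier). The trivial majority-class base learner stands in for C4.5, RIPPER, or naive Bayes.

```python
class MajorityClass:
    """Trivial stand-in base learner: predicts the most frequent label in its chunk."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label] * len(X)


class WeightedEnsemble:
    """Keep at most k classifiers trained on past chunks, weighted by
    their accuracy on the most recent chunk (a simple proxy for the
    expected accuracy used in the chapter)."""

    def __init__(self, base_learner, k=5):
        self.base_learner = base_learner
        self.k = k
        self.members = []  # list of (model, weight) pairs

    def update(self, X, y):
        """Process a new chunk: train a new model, re-weight all
        candidates on this chunk, and retain the top-k."""
        new_model = self.base_learner().fit(X, y)
        candidates = [m for m, _ in self.members] + [new_model]

        def accuracy(model):
            preds = model.predict(X)
            return sum(p == t for p, t in zip(preds, y)) / len(y)

        scored = [(m, accuracy(m)) for m in candidates]
        scored.sort(key=lambda mw: mw[1], reverse=True)
        self.members = scored[: self.k]

    def predict(self, X):
        """Weighted vote of the ensemble members."""
        votes = [dict() for _ in X]
        for model, weight in self.members:
            for tally, pred in zip(votes, model.predict(X)):
                tally[pred] = tally.get(pred, 0.0) + weight
        return [max(tally, key=tally.get) for tally in votes]
```

Because each arriving chunk re-weights every retained classifier, models trained before a concept drift lose influence as soon as their accuracy on recent data degrades, which is the key mechanism the chapter relies on.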
References
Babcock B., Babu S., Datar M., Motwani R., and Widom J., Models and issues in data stream systems, In ACM Symposium on Principles of Database Systems (PODS), 2002.
Babu S. and Widom J., Continuous queries over data streams. SIGMOD Record, 30:109–120, 2001.
Bauer, E. and Kohavi, R., An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999.
Chen Y., Dong G., Han J., Wah B. W., and Wang J., Multi-dimensional regression analysis of time-series data streams. In Proc. of Very Large Database (VLDB), Hong Kong, China, 2002.
Cohen W., Fast effective rule induction. In Int’l Conf. on Machine Learning (ICML), pages 115–123, 1995.
Domingos P., and Hulten G., Mining high-speed data streams. In Int’l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 71–80, Boston, MA, 2000. ACM Press.
Fan W., Wang H., Yu P., and Lo S., Progressive modeling. In Int’l Conf. Data Mining (ICDM), 2002.
Fan W., Wang H., Yu P., and Lo S., Inductive learning in less than one sequential scan, In Int’l Joint Conf. on Artificial Intelligence, 2003.
Fan W., Wang H., Yu P., and Stolfo S., A framework for scalable cost-sensitive learning based on combining probabilities and benefits. In SIAM Int’l Conf. on Data Mining (SDM), 2002.
Fan W., Chu F., Wang H., and Yu P. S., Pruning and dynamic scheduling of cost-sensitive ensembles, In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI), 2002.
Freund Y., and Schapire R. E., Experiments with a new boosting algorithm, In Int’l Conf. on Machine Learning (ICML), pages 148–156, 1996.
Gao L. and Wang X., Continually evaluating similarity-based pattern queries on a streaming time series, In Int’l Conf. Management of Data (SIGMOD), Madison, Wisconsin, June 2002.
Gehrke J., Ganti V., Ramakrishnan R., and Loh W., BOAT– optimistic decision tree construction, In Int’l Conf. Management of Data (SIGMOD), 1999.
Greenwald M., and Khanna S., Space-efficient online computation of quantile summaries, In Int’l Conf. Management of Data (SIGMOD), pages 58–66, Santa Barbara, CA, May 2001.
Guha S., Mishra N., Motwani R., and O’Callaghan L., Clustering data streams, In IEEE Symposium on Foundations of Computer Science (FOCS), pages 359–366, 2000.
Hall L., Bowyer K., Kegelmeyer W., Moore T., and Chao C., Distributed learning on very large data sets, In Workshop on Distributed and Parallel Knowledge Discovery, 2000.
Hulten G., Spencer L., and Domingos P., Mining time-changing data streams, In Int’l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 97–106, San Francisco, CA, 2001. ACM Press.
Quinlan J. R., C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Rokach L., Mining manufacturing data using genetic algorithm-based feature set decomposition, Int. J. Intelligent Systems Technologies and Applications, 4(1):57-78, 2008.
Shafer J., Agrawal R., and Mehta M., SPRINT: A scalable parallel classifier for data mining, In Proc. of Very Large Database (VLDB), 1996.
Stolfo S., Fan W., Lee W., Prodromidis A., and Chan P., Credit card fraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
Street W. N. and Kim Y. S., A streaming ensemble algorithm (SEA) for large-scale classification. In Int’l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2001.
Tumer K. and Ghosh J., Error correlation and error reduction in ensemble classifiers, Connection Science, 8(3-4):385–403, 1996.
Utgoff, P. E., Incremental induction of decision trees, Machine Learning, 4:161–186, 1989.
Wang H., Fan W., Yu P. S., and Han J., Mining concept-drifting data streams using ensemble classifiers, In Int’l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2003.
Acknowledgements
We thank Wei Fan of IBM T. J. Watson Research Center for providing us with a revised version of the C4.5 decision tree classifier and running some experiments.
Copyright information
© 2009 Springer Science+Business Media, LLC
Cite this chapter
Wang, H., Yu, P.S., Han, J. (2009). Mining Concept-Drifting Data Streams. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_40
Print ISBN: 978-0-387-09822-7
Online ISBN: 978-0-387-09823-4