Clustering Based Active Learning for Evolving Data Streams

Ienco, Dino; Bifet, Albert; Žliobaitė, Indrė; Pfahringer, Bernhard

doi:10.1007/978-3-642-40897-7_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8140)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Data labeling is an expensive and time-consuming task, so choosing which instances to label is becoming increasingly important. In the active learning setting, a classifier is trained by requesting labels for only a small fraction of all instances. While many works address this issue in non-streaming scenarios, few consider the data stream setting. In this paper we propose a new active learning approach for evolving data streams that uses a pre-clustering step to select the most informative instances for labeling. We consider a batch-incremental setting: when a new batch arrives, we first cluster the examples and then select the best instances to train the learner. Clustering makes it possible to cover the whole data space and avoids oversampling examples from only a few areas. We compare our method with state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance achieved by our proposal. Experiments on parameter sensitivity are also reported.
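
The batch-incremental idea in the abstract (cluster each incoming batch, then pick instances to label from across the clusters) can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or which informativeness criterion the paper uses, so everything here is an assumption made for illustration: plain k-means for the clustering, a caller-supplied `uncertainty` function as the informativeness score, a round-robin pass over clusters to spend the labeling budget, and the hypothetical names `kmeans` and `select_for_labeling`.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering step).

    Returns one cluster index per point.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def select_for_labeling(batch, uncertainty, k, budget, seed=0):
    """Pick up to `budget` indices from `batch`, spread across clusters.

    `uncertainty` is a hypothetical scoring function: higher means the
    current learner is less sure about that instance.
    """
    assign = kmeans(batch, k, seed=seed)
    clusters = {}
    for idx, c in enumerate(assign):
        clusters.setdefault(c, []).append(idx)
    # Within each cluster, order instances from most to least uncertain.
    queues = [
        sorted(members, key=lambda i: uncertainty(batch[i]), reverse=True)
        for members in clusters.values()
    ]
    chosen = []
    # Round-robin over clusters so that no single region absorbs the budget.
    while len(chosen) < budget and any(queues):
        for q in queues:
            if q and len(chosen) < budget:
                chosen.append(q.pop(0))
    return chosen
```

In a streaming loop, one would call `select_for_labeling` on each new batch, request labels only for the returned indices, and update the learner with those labeled instances. The round-robin pass is one simple way to get the coverage property the abstract describes; it is not claimed to be the authors' selection rule.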

Author information

Authors and Affiliations

Irstea, UMR TETIS, Montpellier, France
Dino Ienco
LIRMM, Montpellier, France
Dino Ienco
Yahoo! Research Barcelona, Catalonia, Spain
Albert Bifet
Aalto University and Helsinki Institute for Information Technology, Finland
Indrė Žliobaitė
University of Waikato, Hamilton, New Zealand
Bernhard Pfahringer

Editor information

Editors and Affiliations

TU Darmstadt, Germany
Johannes Fürnkranz
Philipps-Universität Marburg, Germany
Eyke Hüllermeier
The Institute of Statistical Mathematics, Tokyo, Japan
Tomoyuki Higuchi

Cite this paper

Ienco, D., Bifet, A., Žliobaitė, I., Pfahringer, B. (2013). Clustering Based Active Learning for Evolving Data Streams. In: Fürnkranz, J., Hüllermeier, E., Higuchi, T. (eds) Discovery Science. DS 2013. Lecture Notes in Computer Science, vol 8140. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40897-7_6

DOI: https://doi.org/10.1007/978-3-642-40897-7_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40896-0
Online ISBN: 978-3-642-40897-7
eBook Packages: Computer Science, Computer Science (R0)
