Concurrent Semi-supervised Learning with Active Learning of Data Streams

Nguyen, Hai-Long; Ng, Wee-Keong; Woon, Yew-Kwong

doi:10.1007/978-3-642-37574-3_5

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 7790)

Abstract

Conventional stream mining algorithms focus on stand-alone mining tasks. Given the single-pass nature of data streams, it makes sense to maximize throughput by performing multiple complementary mining tasks concurrently. We investigate the potential of concurrent semi-supervised learning on data streams and propose an incremental algorithm called CSL-Stream (Concurrent Semi-supervised Learning of Data Streams) that performs clustering and classification at the same time. Experiments using common synthetic and real datasets show that CSL-Stream outperforms prominent clustering and classification algorithms (D-Stream and SmSCluster) in terms of accuracy, speed and scalability. Moreover, enhanced with a novel active learning technique, CSL-Stream only requires a small number of queries to work well with very sparsely labeled datasets. The success of CSL-Stream paves the way for a new research direction in understanding latent commonalities among various data mining tasks in order to exploit the power of concurrent stream mining.
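
The abstract does not describe the internals of CSL-Stream, so the following is only a rough illustration of the general idea: a single pass over a stream updates one shared summary that serves both clustering and classification, and labels are requested only when the learner is unsure. Everything here is an assumption made for illustration and not the authors' algorithm: the grid summary, the class name `GridStreamLearner`, the function `run_stream`, and the parameters `cell_width`, `dense_threshold`, `budget` and `min_confidence`. Density fading for evolving streams is left out for brevity.

```python
from collections import defaultdict
import math


class GridStreamLearner:
    """Toy single-pass learner (hypothetical): one grid summary is shared
    by clustering and classification."""

    def __init__(self, cell_width=1.0, dense_threshold=2.0):
        self.cell_width = cell_width
        self.dense_threshold = dense_threshold  # points needed for a dense cell
        self.density = defaultdict(float)       # cell -> point count
        self.labels = defaultdict(lambda: defaultdict(float))  # cell -> class -> count

    def _cell(self, x):
        # Map a point to the integer coordinates of its grid cell.
        return tuple(math.floor(v / self.cell_width) for v in x)

    def update(self, x, label=None):
        """Absorb one point; the label is optional (semi-supervised)."""
        c = self._cell(x)
        self.density[c] += 1.0
        if label is not None:
            self.labels[c][label] += 1.0

    def predict(self, x):
        """Return (label, confidence) from the label counts of the point's cell."""
        counts = self.labels.get(self._cell(x))
        if not counts:
            return None, 0.0
        best = max(counts, key=counts.get)
        return best, counts[best] / sum(counts.values())

    def clusters(self):
        """Group dense cells that touch along one axis into clusters."""
        dense = {c for c, d in self.density.items() if d >= self.dense_threshold}
        seen, result = set(), []
        for start in dense:
            if start in seen:
                continue
            seen.add(start)
            stack, component = [start], set()
            while stack:
                c = stack.pop()
                component.add(c)
                for i in range(len(c)):
                    for step in (-1, 1):
                        n = c[:i] + (c[i] + step,) + c[i + 1:]
                        if n in dense and n not in seen:
                            seen.add(n)
                            stack.append(n)
            result.append(component)
        return result


def run_stream(learner, stream, oracle, budget, min_confidence=0.75):
    """Process points one at a time; query the oracle only when the current
    prediction is below min_confidence and the query budget is not spent."""
    queries = 0
    for x in stream:
        _, confidence = learner.predict(x)
        label = None
        if confidence < min_confidence and queries < budget:
            label = oracle(x)
            queries += 1
        learner.update(x, label)
    return queries


if __name__ == "__main__":
    points = [(0.1, 0.2), (0.3, 0.4), (5.1, 5.2), (0.5, 0.6), (5.3, 5.4), (5.5, 5.6)]
    learner = GridStreamLearner()
    used = run_stream(learner, points, lambda x: "a" if x[0] < 2 else "b", budget=10)
    print(used, len(learner.clusters()), learner.predict((0.9, 0.9)))
```

In this toy setting each point touches the summary exactly once, and a label is requested only for the first point that lands in a previously unlabeled region, which mirrors the abstract's point that few queries may suffice on sparsely labeled data.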

Author information

Authors and Affiliations

Nanyang Technological University, Singapore
Hai-Long Nguyen & Wee-Keong Ng
EADS Innovation Works South Asia, Singapore
Yew-Kwong Woon

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, 118 route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
Institute for Application Oriented Knowledge Processing, 4020, Linz, Austria
Josef Küng
FAW, University of Linz, Altenbergerstraße 69, 4040, Linz, Austria
Roland Wagner
ICAR-CNR, University of Calabria, via P. Bucci 41C, 87036, Rende (CS), Italy
Alfredo Cuzzocrea
Hewlett-Packard Laboratories, 1501 Page Mill Road, 94304, Palo Alto, CA, USA
Umeshwar Dayal

About this chapter

Cite this chapter

Nguyen, HL., Ng, WK., Woon, YK. (2013). Concurrent Semi-supervised Learning with Active Learning of Data Streams. In: Hameurlain, A., Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII. Lecture Notes in Computer Science, vol 7790. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37574-3_5

DOI: https://doi.org/10.1007/978-3-642-37574-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37573-6
Online ISBN: 978-3-642-37574-3
eBook Packages: Computer Science (R0)
