Concurrent Semi-supervised Learning with Active Learning of Data Streams

Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 7790)

Abstract

Conventional stream mining algorithms focus on stand-alone mining tasks. Given the single-pass nature of data streams, it makes sense to maximize throughput by performing multiple complementary mining tasks concurrently. We investigate the potential of concurrent semi-supervised learning on data streams and propose an incremental algorithm called CSL-Stream (Concurrent Semi-supervised Learning of Data Streams) that performs clustering and classification at the same time. Experiments using common synthetic and real datasets show that CSL-Stream outperforms prominent clustering and classification algorithms (D-Stream and SmSCluster) in terms of accuracy, speed and scalability. Moreover, enhanced with a novel active learning technique, CSL-Stream only requires a small number of queries to work well with very sparsely labeled datasets. The success of CSL-Stream paves the way for a new research direction in understanding latent commonalities among various data mining tasks in order to exploit the power of concurrent stream mining.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Nguyen, HL., Ng, WK., Woon, YK. (2013). Concurrent Semi-supervised Learning with Active Learning of Data Streams. In: Hameurlain, A., Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII. Lecture Notes in Computer Science, vol 7790. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37574-3_5

  • DOI: https://doi.org/10.1007/978-3-642-37574-3_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37573-6

  • Online ISBN: 978-3-642-37574-3

  • eBook Packages: Computer Science, Computer Science (R0)
