Concurrent Semi-supervised Learning of Data Streams

  • Hai-Long Nguyen
  • Wee-Keong Ng
  • Yew-Kwong Woon
  • Duc H. Tran
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6862)

Abstract

Conventional stream mining algorithms focus on single and stand-alone mining tasks. Given the single-pass nature of data streams, it makes sense to maximize throughput by performing multiple complementary mining tasks concurrently. We investigate the potential of concurrent semi-supervised learning on data streams and propose an incremental algorithm called CSL-Stream (Concurrent Semi-supervised Learning of Data Streams) that performs clustering and classification at the same time. Experiments using common synthetic and real datasets show that CSL-Stream outperforms prominent clustering and classification algorithms (D-Stream and SmSCluster) in terms of accuracy, speed and scalability. The success of CSL-Stream paves the way for a new research direction in understanding latent commonalities among various data mining tasks in order to exploit the power of concurrent stream mining.

Keywords

Data Stream, Dense Node, Tree Node, Fading Model, Concept Drift
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Hai-Long Nguyen (1)
  • Wee-Keong Ng (1)
  • Yew-Kwong Woon (2)
  • Duc H. Tran (1)
  1. Nanyang Technological University, Singapore
  2. EADS Innovation Works, Singapore
