Exploring Early Classification Strategies of Streaming Data with Delayed Attributes

  • Mónica Millán-Giraldo
  • J. Salvador Sánchez
  • V. Javier Traver
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5863)

Abstract

In contrast to traditional machine learning algorithms, where all data are available in batch mode, the new paradigm of streaming data poses additional difficulties, since data samples arrive in a sequence and many hard decisions have to be made on-line. The problem addressed here consists of classifying streaming data which not only are unlabeled, but also have a number l of attributes arriving after some time delay τ. In this context, the main issues are what to do when the unlabeled incomplete samples and, later on, their missing attributes arrive; when and how to classify these incoming samples; and when and how to update the training set. Three different strategies (for l = 1 and constant τ) are explored and evaluated in terms of the accumulated classification error. The results reveal that the proposed on-line strategies, despite their simplicity, may outperform classifiers using only the original, labeled-and-complete samples as a fixed training set. In other words, learning is possible by properly tapping into the unlabeled, incomplete samples, and their delayed attributes. The many research issues identified include a better understanding of the link between the inherent properties of the data set and the design of the most suitable on-line classification strategy.
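The abstract does not spell out the three strategies, so the following is only an illustrative sketch of the problem setting, not the authors' method. It assumes l = 1 (the last attribute is the delayed one), a constant integer delay τ measured in arrival steps, and a 1-nearest-neighbour classifier. The names `nearest_label` and `run_stream` are hypothetical. Each incoming sample is classified at once on its available attributes; when its delayed attribute arrives, the completed sample is re-labelled and appended to the training set.

```python
from collections import deque
import math


def nearest_label(train, x, dims):
    """Return the 1-NN label of x, comparing only the attribute indices in dims."""
    best = min(
        train,
        key=lambda s: math.dist([s[0][d] for d in dims], [x[d] for d in dims]),
    )
    return best[1]


def run_stream(train, stream, tau):
    """Hypothetical early-classification loop (assumes l = 1, constant tau).

    train:  list of (attributes, label) pairs, complete and labeled.
    stream: list of (partial_attributes, delayed_value) pairs; the delayed
            value of the sample seen at step t becomes visible at step t + tau.
    Returns the labels assigned on arrival and the grown training set.
    """
    train = list(train)
    n_attrs = len(train[0][0])
    early_dims = range(n_attrs - 1)  # attributes available on arrival
    full_dims = range(n_attrs)       # all attributes, once the delay has passed
    pending = deque()                # samples still waiting for their attribute
    early_labels = []

    for t, (partial, delayed) in enumerate(stream):
        # The delayed attribute of the sample from step t - tau arrives now:
        # complete that sample, re-label it, and add it to the training set.
        if pending and pending[0][0] + tau <= t:
            _, old_partial, old_delayed = pending.popleft()
            full = tuple(old_partial) + (old_delayed,)
            train.append((full, nearest_label(train, full, full_dims)))

        # Early decision: classify the incomplete sample on what is available.
        early_labels.append(nearest_label(train, partial, early_dims))
        pending.append((t, partial, delayed))

    return early_labels, train
```

Because the appended labels are the classifier's own predictions, this self-training loop can reinforce its mistakes; whether and when to update the training set is exactly one of the open design questions the abstract raises.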

Keywords

  • Data mining
  • Streaming data
  • On-line classification
  • Missing attributes


Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Mónica Millán-Giraldo (1)
  • J. Salvador Sánchez (1)
  • V. Javier Traver (1)
  1. Dept. Llenguatges i Sistemes Informàtics, Universitat Jaume I, Castelló de la Plana, Spain
