On Dynamic Feature Weighting for Feature Drifting Data Streams

  • Jean Paul Barddal
  • Heitor Murilo Gomes
  • Fabrício Enembreck
  • Bernhard Pfahringer
  • Albert Bifet
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9852)

Abstract

The ubiquity of data streams has encouraged the development of new incremental and adaptive learning algorithms. Data stream learners must be fast and memory-bounded but, above all, able to adapt to changes in the data distribution, a phenomenon named concept drift. Recently, several works have shown the impact of a type of drift that has so far been nearly neglected: feature drift. Feature drifts occur whenever a subset of features becomes, or ceases to be, relevant to the learning task. In this paper we (i) provide insights into how the relevance of features can be tracked as a stream progresses, using the information-theoretic Symmetrical Uncertainty measure; and (ii) show how this relevance can be used to boost two learning schemes: Naive Bayes and k-Nearest Neighbor. Furthermore, we investigate the use of these two new dynamically weighted learners as prediction models in the leaves of the Hoeffding Adaptive Tree classifier. Results show improvements in accuracy (an average of 10.69 % for k-Nearest Neighbor, 6.23 % for Naive Bayes and 4.42 % for Hoeffding Adaptive Trees) on both synthetic and real-world datasets, at the expense of a bounded increase in both memory consumption and processing time.
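
The idea of weighting features by Symmetrical Uncertainty can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: features are discrete, relevance is estimated over a fixed window of recent labelled instances, and the weights feed a simple weighted mismatch distance of the kind a k-Nearest Neighbor learner could use. All function names are hypothetical and this is not the authors' implementation. It uses the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)).

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(xs), entropy(ys)
    if h_x + h_y == 0.0:
        return 0.0  # both variables are constant, so nothing can be shared
    h_xy = entropy(list(zip(xs, ys)))      # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy  # I(X; Y)
    return 2.0 * mutual_information / (h_x + h_y)


def feature_weights(window):
    """Assumed input: a list of (features, label) pairs from a recent window.

    Returns one SU-based weight per feature.
    """
    labels = [label for _, label in window]
    n_features = len(window[0][0])
    return [
        symmetrical_uncertainty([features[i] for features, _ in window], labels)
        for i in range(n_features)
    ]


def weighted_distance(a, b, weights):
    """Weighted mismatch count: irrelevant features contribute little."""
    return sum(w * (x != y) for x, y, w in zip(a, b, weights))


# Feature 0 copies the label; feature 1 is independent of it.
window = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
weights = feature_weights(window)
```

Recomputing `feature_weights` as the window slides lets the weights follow a feature drift: a feature that stops carrying information about the class sees its weight fall towards 0 and stops influencing the distance.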

Keywords

Data Stream, Concept Drift, Split Node, Relevant Subset, Unlabeled Instance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Jean Paul Barddal (1)
  • Heitor Murilo Gomes (1)
  • Fabrício Enembreck (1)
  • Bernhard Pfahringer (2)
  • Albert Bifet (3)

  1. Graduate Program in Informatics (PPGIa), Pontifícia Universidade Católica do Paraná, Curitiba, Brazil
  2. Department of Computer Science, University of Waikato, Hamilton, New Zealand
  3. Computer Science and Networks Department (INFRES), Institut Mines-Télécom, Télécom ParisTech, Université Paris-Saclay, Paris, France
