Incremental Optimization Mechanism for Constructing a Balanced Very Fast Decision Tree for Big Data

  • Hang Yang
  • Simon Fong


Big data is a popular topic that attracts considerable attention from researchers all over the world. How to mine valuable information from such huge volumes of data remains an open problem. Decision trees are among the most widely used techniques for this task, but imperfect data streams lead to tree size explosion and degraded accuracy. Overfitting and imbalanced class distributions further reduce the performance of the original decision tree algorithm for stream mining. In this chapter, we propose an Optimized Very Fast Decision Tree (OVFDT) that possesses an optimized node-splitting control mechanism using the Hoeffding bound. Accuracy, tree size, and learning time are the significant factors influencing the algorithm’s performance. Naturally, a larger tree takes longer to compute. OVFDT is a pioneering model equipped with an incremental optimization mechanism that seeks a balance between accuracy and tree size for data stream mining. OVFDT operates incrementally using a test-then-train approach. Two new methods of functional tree leaves are proposed to improve the accuracy with which the tree model predicts new data stream instances in the testing phase. The optimized node-splitting mechanism controls the growth of the tree model in the training phase. Experiments show that OVFDT obtains an optimal tree structure on both numeric and nominal datasets.



The authors are thankful for the financial support from the research grants “Temporal Data Stream Mining by Using Incrementally Optimized Very Fast Decision Forest (iOVFDF)”, Grant no. MYRG2015-00128-FST offered by the University of Macau, FST, and RDAO, and “A scalable data stream mining methodology: stream-based holistic analytics and reasoning in parallel”, Grant no. FDCT-126/2014/A3, offered by FDCT Macau.



Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Hang Yang (1)
  • Simon Fong (2)
  1. China Southern Power Grid, Guangzhou, China
  2. Department of Computer and Information Science, University of Macau, Macau SAR, Zhuhai Shi, China
