Critical parameter analysis of Vertical Hoeffding Tree for optimized performance using SAMOA

  • Bakshi Rohit Prasad
  • Sonali Agarwal
Original Article


Streaming classification of big data is a stream data mining method that learns from continuous, ordered sequences of data arriving from diverse sources while using limited computing and storage resources. SAMOA (Scalable Advanced Massive Online Analysis) is a machine learning framework for distributed data mining over streaming data. The Vertical Hoeffding Tree (VHT) in SAMOA is a variant of the very fast decision tree (VFDT) used for distributed classification of data streams. The performance of VHT depends on several critical parameters, such as tie-threshold, grace value, confidence, and split criterion. Although VHT is widely accepted as an efficient streaming classifier, the distribution of incoming data instances over the underlying classes varies from dataset to dataset, so the performance of VHT varies as well. Achieving optimal performance from a stream classifier such as VHT on different datasets is therefore challenging, and no single fixed set of critical parameter values can be preconfigured for all types of datasets. This work explores the capabilities of the VHT streaming classifier of SAMOA using benchmark performance statistics: classification accuracy, kappa, and kappa temporal. It experimentally identifies suitable values of the critical parameters of VHT that yield optimized performance on different datasets. The study is therefore useful for developing streaming classifiers that achieve optimum performance through parameter tuning at run time.
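
To make the role of these parameters concrete, the following is a minimal sketch of the standard Hoeffding-bound split test that VFDT-style trees rely on, together with the kappa and kappa temporal statistics. It is not SAMOA's implementation: the function names, the numeric parameter values, and the choice to score kappa temporal from the second instance onward are assumptions made for illustration only.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_seen, n_classes,
                 delta=1e-7, tie_threshold=0.05, grace_period=200):
    """Decide whether a leaf should split (illustrative parameter values).

    delta         -- confidence parameter (allowed error probability)
    tie_threshold -- split anyway once epsilon drops below this value
    grace_period  -- only test for a split every `grace_period` instances
    """
    if n_seen == 0 or n_seen % grace_period != 0:
        return False
    # For information gain, the range of the criterion is log2(#classes).
    eps = hoeffding_bound(math.log2(n_classes), delta, n_seen)
    return (best_gain - second_gain) > eps or eps < tie_threshold


def kappa(y_true, y_pred):
    """Kappa: (p0 - pc) / (1 - pc), with pc the chance agreement."""
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pc = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (p0 - pc) / (1.0 - pc) if pc < 1.0 else 0.0


def kappa_temporal(y_true, y_pred):
    """Kappa temporal: compare against a classifier that repeats the
    previous true label. Assumption: scoring starts at the second
    instance, since the first has no predecessor."""
    truth, preds, prev = y_true[1:], y_pred[1:], y_true[:-1]
    n = len(truth)
    p0 = sum(t == p for t, p in zip(truth, preds)) / n
    p_per = sum(t == q for t, q in zip(truth, prev)) / n
    return (p0 - p_per) / (1.0 - p_per) if p_per < 1.0 else 0.0
```

In this sketch, a smaller `delta` or a larger `grace_period` makes the tree split more cautiously, while a larger `tie_threshold` forces earlier splits between near-equal attributes. This is the kind of trade-off that has to be tuned per dataset.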


Streaming data mining; Streaming data classification; Vertical Hoeffding Tree; VHT; Massive online analysis; SAMOA; VFDT



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Indian Institute of Information Technology Allahabad, Allahabad, India