Classifying peer-to-peer applications using imbalanced concept-adapting very fast decision tree on IP data stream

Peer-to-Peer Networking and Applications

Abstract

Peer-to-Peer (P2P) applications generate large volumes of streaming data in which new communities of peers regularly join and existing communities regularly leave. Classification techniques must therefore account for concept drift and update the model incrementally. The Concept-adapting Very Fast Decision Tree (CVFDT) is a well-known stream mining technique that can be applied to P2P traffic. However, we observe that P2P traffic data is class imbalanced: only about 30% of examples can be labeled as “P2P”, which biases trained models (e.g., decision trees) towards the majority class (i.e., “NonP2P”). In this paper, we propose a new technique, the imbalanced CVFDT (iCVFDT), which integrates CVFDT with an efficient resampling technique to address class imbalance. The iCVFDT classification technique retains the advantages of CVFDT (such as stability) while not being sensitive to imbalanced data. We captured Internet traffic at a main gateway and prepared a real data stream of 3.5 million examples, to which the iCVFDT classification technique was applied. The experimental results demonstrate a significant improvement in the performance of iCVFDT over CVFDT.
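The abstract does not specify which resampling technique iCVFDT uses, so the following is only an illustrative sketch of one generic option: random undersampling of the majority class on a stream, driven by running class counts. The function name `rebalance_stream` and its parameters are hypothetical and are not taken from the paper. The rebalanced examples could then be passed to any incremental learner such as CVFDT.

```python
import random
from collections import Counter
from typing import Hashable, Iterable, Iterator, Tuple

Example = Tuple[tuple, Hashable]  # (features, label); layout is an assumption


def rebalance_stream(
    stream: Iterable[Example],
    minority_label: Hashable = "P2P",
    seed: int = 0,
) -> Iterator[Example]:
    """Yield a roughly class-balanced view of a labeled stream.

    Generic random undersampling, NOT the authors' iCVFDT procedure:
    every minority example is kept, and each majority example is kept
    with probability (minority seen so far) / (majority seen so far).
    """
    rng = random.Random(seed)
    seen = Counter()
    for features, label in stream:
        seen[label] += 1
        if label == minority_label:
            yield features, label
            continue
        minority = seen[minority_label]
        majority = sum(seen.values()) - minority  # >= 1 at this point
        keep_prob = min(1.0, minority / majority)
        if rng.random() < keep_prob:
            yield features, label


# Synthetic stream with about 30% "P2P" examples, mirroring the ratio
# reported in the abstract; the features are placeholders.
stream = (((i,), "P2P" if i % 10 < 3 else "NonP2P") for i in range(10_000))
kept = Counter(label for _, label in rebalance_stream(stream))
print(kept)
```

Because the keep probability is recomputed from running counts, the sampler needs no prior knowledge of the class ratio and follows it if the ratio drifts slowly. A windowed count would react faster to abrupt drift, at the cost of noisier estimates.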

Notes

  1. Streaming video has recently grown at a fast rate and now accounts for more Internet traffic than P2P. This does not mean that P2P traffic is declining; it is simply growing more slowly than video traffic.

  2. http://www.iana.org/

  3. “length” denotes the length field after “>” in Table 1.

Acknowledgments

This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fundamental Research Funds for the Central Universities, and the National Natural Science Foundation of China under Grants 61103119 and 60970067.

Author information

Corresponding author

Correspondence to Bijan Raahemi.

Appendix

A.1 The CVFDTGrow procedure

figure f

A.2 The ForgetExample procedure

figure g

A.3 The CheckSplitValidity procedure

figure h

About this article

Cite this article

Zhong, W., Raahemi, B. & Liu, J. Classifying peer-to-peer applications using imbalanced concept-adapting very fast decision tree on IP data stream. Peer-to-Peer Netw. Appl. 6, 233–246 (2013). https://doi.org/10.1007/s12083-012-0147-5

  • DOI: https://doi.org/10.1007/s12083-012-0147-5
