Abstract
Peer-to-Peer (P2P) applications generate large volumes of streaming data in which new communities of peers regularly join and existing communities regularly leave. Classification techniques for such data must therefore account for concept drift and update the model incrementally. The Concept-adapting Very Fast Decision Tree (CVFDT) is a well-known stream mining technique that can be applied to P2P traffic. However, we observe that P2P traffic data is class-imbalanced: only about 30% of examples can be labeled as “P2P”, which biases trained models (e.g., decision trees) towards the majority class (i.e., “NonP2P”). In this paper, we propose a new technique, the imbalanced CVFDT (iCVFDT), which integrates CVFDT with an efficient resampling technique to address class imbalance. The iCVFDT classification technique retains the advantages of CVFDT (such as stability) while not being sensitive to imbalanced data. We captured Internet traffic at a main gateway and prepared a real data stream of 3.5 million examples, to which we applied the iCVFDT classification technique. The experimental results demonstrate a significant improvement in the performance of iCVFDT over that of CVFDT.
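The abstract does not say which resampling technique iCVFDT uses. Purely as an illustration of how resampling can counter a roughly 30/70 class split on a stream, the sketch below assumes random under-sampling of the majority class, driven by running class counts. The function name `rebalance_stream` and its parameters are hypothetical and are not taken from the paper; the output would be fed to an incremental learner such as CVFDT, which is not shown.

```python
import random
from typing import Iterable, Iterator, Tuple

Example = Tuple[dict, str]  # (features, class label)


def rebalance_stream(
    stream: Iterable[Example],
    minority_label: str = "P2P",   # assumption: "P2P" is the minority class
    seed: int = 0,
) -> Iterator[Example]:
    """Yield a roughly class-balanced sub-stream.

    Every minority example is kept. A majority example is kept with
    probability equal to the running minority/majority count ratio
    (capped at 1), so both classes appear about equally often downstream.
    This is an illustrative under-sampler, not the paper's technique.
    """
    rng = random.Random(seed)
    seen_minority = 0
    seen_majority = 0
    for features, label in stream:
        if label == minority_label:
            seen_minority += 1
            yield features, label
        else:
            seen_majority += 1
            keep_prob = min(1.0, seen_minority / seen_majority)
            if rng.random() < keep_prob:
                yield features, label
```

Because the keep probability is computed from cumulative counts, it tracks the class ratio only slowly when the ratio drifts; a windowed count would react faster at the cost of more variance.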
Notes
Recently, streaming video has been growing at a fast rate and now accounts for more traffic on the Internet than P2P traffic. However, this does not mean that P2P traffic is declining; it is simply growing at a slower rate than video traffic.
“length” denotes the length field after “>” in Table 1.
Acknowledgments
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), and by the Fundamental Research Funds for the Central Universities and the National Natural Science Foundation of China under Grants 61103119 and 60970067.
Appendix
A.1 The CVFDTGrow procedure
A.2 The ForgetExample procedure
A.3 The CheckSplitValidity procedure
Cite this article
Zhong, W., Raahemi, B. & Liu, J. Classifying peer-to-peer applications using imbalanced concept-adapting very fast decision tree on IP data stream. Peer-to-Peer Netw. Appl. 6, 233–246 (2013). https://doi.org/10.1007/s12083-012-0147-5
DOI: https://doi.org/10.1007/s12083-012-0147-5