Abstract
The popularity of peer-to-peer (P2P) networks can be attributed to their inherent advantages, such as resource utilization, scalability, and better response. At the same time, modern networks have become highly complex and need better approaches for managing and monitoring traffic. The use of machine learning (ML) techniques has become inevitable because of their inherent advantages, but an ML-based model needs a reliable dataset for training and testing. This paper addresses the lack of a comprehensive labelled dataset with which researchers can evaluate their ML-based solutions. The proposed SAMPARK dataset is constructed by capturing traces while running various P2P and non-P2P applications in real time. The generated dataset consists of normal traffic patterns and 24 attributes comprising basic, flow-based, and packet-based general features. The major contribution of this work is an exclusive dataset for addressing important issues in P2P networks, such as selfish peers and flash crowds, since no existing dataset has been constructed explicitly for these problems. The SAMPARK dataset is validated through statistical analysis of its probability distribution and feature correlation. The statistical evaluation shows that the dataset exhibits non-linearity and non-normality. The correlation among features without labels and with labels is determined using Pearson’s Correlation Coefficient (PCC) and Gain Ratio (GR), and the acceptable rates are 84% and 68%, respectively. The effectiveness of the dataset is demonstrated by applying machine learning methods. The dataset is labelled using a port-based technique, and performance is measured by Accuracy and False Alarm Rate (FAR) for the ML models developed to identify P2P traffic and selfish peers. A comparative analysis with the UNIBS dataset is also carried out. The highest accuracy, achieved with the RF technique on the SAMPARK dataset, is 99.13%, which is better than that obtained on the UNIBS dataset. The experimental results also demonstrate the usefulness and efficacy of the proposed SAMPARK dataset for various analyses of P2P networks.
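As a rough illustration of the labelling and evaluation steps named in the abstract, the following minimal Python sketch labels flows by port and computes Accuracy and FAR from predicted labels. It is not the authors' implementation. The port table `P2P_PORTS`, the function names, and the label strings are hypothetical, and FAR is assumed here to be the false positive rate FP / (FP + TN); the paper may define it differently.

```python
# Minimal sketch, not the authors' code. All names below are illustrative.

# Assumed port table: commonly cited default ports for BitTorrent (6881-6889),
# eMule (4662) and Gnutella (6346). The ports used for SAMPARK are not given
# in the abstract.
P2P_PORTS = set(range(6881, 6890)) | {4662, 6346}


def label_by_port(src_port, dst_port):
    """Label a flow as P2P if either endpoint uses a port in the table."""
    if src_port in P2P_PORTS or dst_port in P2P_PORTS:
        return "P2P"
    return "Non-P2P"


def accuracy_and_far(y_true, y_pred, positive="P2P"):
    """Return (accuracy, FAR) for two equal-length label sequences.

    Assumption: FAR is the false positive rate FP / (FP + TN).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    negatives = fp + tn
    far = fp / negatives if negatives else 0.0
    return accuracy, far


if __name__ == "__main__":
    # Toy flows as (source port, destination port); values are made up.
    flows = [(51000, 6881), (443, 52000), (4662, 50010), (53000, 80)]
    truth = [label_by_port(s, d) for s, d in flows]
    predicted = ["P2P", "Non-P2P", "Non-P2P", "P2P"]  # stand-in for model output
    print(accuracy_and_far(truth, predicted))
```

Port-based labels are cheap to produce but can mislabel applications that use dynamic or non-default ports, so a table like the one above should be treated as a starting point rather than ground truth.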
Ethics declarations
- No funding was received for conducting this study.
- Md. Sarfaraj Alam Ansari, Kunwar Pal, Prajjval Govil, Mahesh Chandra Govil and Lalit Kumar Awasthi declare that they have no conflict of interest.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ansari, M., Pal, K., Govil, P. et al. A statistical analysis of SAMPARK dataset for peer-to-peer traffic and selfish-peer identification. Multimed Tools Appl 82, 8507–8535 (2023). https://doi.org/10.1007/s11042-022-13556-x
DOI: https://doi.org/10.1007/s11042-022-13556-x