A statistical analysis of SAMPARK dataset for peer-to-peer traffic and selfish-peer identification

  • 1207: Innovations in Multimedia Information Processing & Retrieval
Multimedia Tools and Applications

Abstract

The popularity of peer-to-peer (P2P) networks can be attributed to their inherent advantages, such as resource utilization, scalability, and better response. At the same time, modern networks have become highly complex and need better approaches for traffic management and monitoring. The use of machine learning (ML) techniques has become inevitable because of their inherent advantages, but an ML-based model needs a reliable dataset for training and testing. This paper addresses the unavailability of a comprehensive labelled dataset that would enable researchers to evaluate their ML-based solutions. The proposed SAMPARK dataset is constructed by capturing traces while running various P2P and non-P2P applications in real time. The generated dataset consists of normal traffic patterns and 24 attributes that comprise basic, flow-based, and packet-based general features. The major contribution of this work is an exclusive dataset that addresses important issues in P2P networks, such as selfish peers and flash crowds, since no existing dataset has been constructed explicitly to address these problems. The SAMPARK dataset is validated through statistical analysis of its probability distribution and feature correlation. The statistical evaluation shows that the dataset exhibits non-linearity and non-normality. The correlation rates among features, without and with labels, are determined using Pearson's Correlation Coefficient (PCC) and Gain Ratio (GR); the acceptable rates are 84% and 68%, respectively. The effectiveness of the dataset is demonstrated by applying machine learning methods. The dataset is labelled using a port-based technique, and performance is determined by calculating Accuracy and False Alarm Rate (FAR) for the various ML models developed to identify P2P traffic and selfish peers. A comparative analysis is also carried out with the UNIBS dataset. The highest accuracy, achieved by the RF technique on the SAMPARK dataset, is 99.13%, which is better than the result on the UNIBS dataset. The experimental results also demonstrate the usefulness and efficacy of the proposed SAMPARK dataset for various analyses of P2P networks.
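
The measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the Gain Ratio follows the common definition (information gain divided by the feature's own entropy) for discrete features, and FAR is taken here as the false positive rate FP / (FP + TN), which is an assumption, since the abstract does not state the exact formula used in the paper.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of a discrete feature about the labels,
    normalised by the entropy of the feature itself (split information)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return (entropy(labels) - conditional) / split_info


def accuracy_and_far(y_true, y_pred, positive=1):
    """Accuracy and false alarm rate for binary labels.
    Assumption: FAR = FP / (FP + TN), i.e. the false positive rate."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, far
```

In this sketch, `pearson` would be applied to pairs of numeric feature columns (the unlabelled case), while `gain_ratio` relates a discretised feature to the class label (the labelled case); continuous features would first need to be binned, a step not shown here.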




Author information

Corresponding author

Correspondence to Md. Sarfaraj Alam Ansari.

Ethics declarations

- No funding was received for conducting this study.

- Md. Sarfaraj Alam Ansari, Kunwar Pal, Prajjval Govil, Mahesh Chandra Govil and Lalit Kumar Awasthi declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ansari, M., Pal, K., Govil, P. et al. A statistical analysis of SAMPARK dataset for peer-to-peer traffic and selfish-peer identification. Multimed Tools Appl 82, 8507–8535 (2023). https://doi.org/10.1007/s11042-022-13556-x


  • DOI: https://doi.org/10.1007/s11042-022-13556-x
