Detection of Web site visitors based on fuzzy rough sets

Abstract

Despite the emergence of Web 2.0 applications and the growing demand for well-behaved Web robots, malicious robots can pose serious, even irreparable, risks to Web sites. Regardless of their behavior, Web robots may consume bandwidth and degrade the performance of Web servers. Although many notable studies have tried to characterize and classify Web visitors, little attention has been paid to feature selection, that is, to dynamically choosing the attributes used to describe Web sessions. In addition, it is practically important to rely on an accurate clustering technique that can handle a large number of samples in a reasonable amount of time. Therefore, this paper proposes a new algorithm, fuzzy rough set–Web robot detection (FRS-WRD), based on fuzzy rough set theory, to better characterize and cluster the Web visitors of three real Web sites. External evaluations show that, compared with state-of-the-art algorithms, FRS-WRD achieves better results: a G-mean of 95%, a Jaccard coefficient of 88%, an entropy of 0.36, and a purity of 96%. Moreover, according to the confusion matrices, it better detects malicious Web visitors.
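For readers unfamiliar with the external evaluation measures named in the abstract, the following is a minimal sketch of how purity, entropy, the Jaccard coefficient, and G-mean are commonly computed from cluster assignments and ground-truth labels. It is not the authors' implementation. The exact definitions used in the paper are not given here, so the sketch assumes textbook forms: base-2, size-weighted cluster entropy; the pair-counting Jaccard coefficient; and G-mean as the geometric mean of per-class recall after mapping each cluster to its majority class. All function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2, prod


def contingency(clusters, labels):
    """Map each cluster id to a Counter of the true labels it contains."""
    table = {}
    for c, t in zip(clusters, labels):
        table.setdefault(c, Counter())[t] += 1
    return table


def purity(clusters, labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    table = contingency(clusters, labels)
    return sum(max(cnt.values()) for cnt in table.values()) / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the per-cluster label entropy (base 2, assumed)."""
    table = contingency(clusters, labels)
    n = len(labels)
    total = 0.0
    for cnt in table.values():
        size = sum(cnt.values())
        h = -sum((v / size) * log2(v / size) for v in cnt.values())
        total += (size / n) * h
    return total


def jaccard(clusters, labels):
    """Pair-counting Jaccard coefficient: TP / (TP + FP + FN) over sample pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = labels[i] == labels[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def g_mean(clusters, labels):
    """Geometric mean of per-class recall, with clusters mapped to majority class."""
    table = contingency(clusters, labels)
    majority = {c: cnt.most_common(1)[0][0] for c, cnt in table.items()}
    hits = Counter()
    for c, t in zip(clusters, labels):
        if majority[c] == t:
            hits[t] += 1
    totals = Counter(labels)
    recalls = [hits[t] / totals[t] for t in totals]
    return prod(recalls) ** (1 / len(recalls))


if __name__ == "__main__":
    # Toy data (hypothetical): six sessions, one robot placed in the human cluster.
    labels = ["human", "human", "human", "robot", "robot", "robot"]
    clusters = [0, 0, 0, 0, 1, 1]
    for name, fn in [("purity", purity), ("entropy", entropy),
                     ("jaccard", jaccard), ("g_mean", g_mean)]:
        print(f"{name}: {fn(clusters, labels):.4f}")
```

Higher purity, Jaccard, and G-mean indicate better agreement with the ground truth, whereas lower entropy is better. The pairwise loop in `jaccard` is quadratic in the number of sessions, so for large logs a contingency-table formulation would be preferable.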

Author information

Corresponding author

Correspondence to Javad Hamidzadeh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Appendix

This section summarizes all primary features used in this paper. These attributes have been proposed in related works and shown to be helpful in separating humans from Web robots. The index column of Table 5 indicates whether the corresponding attribute takes higher values for Web robots (R) or for human users (H).

Cite this article

Hamidzadeh, J., Zabihimayvan, M. & Sadeghi, R. Detection of Web site visitors based on fuzzy rough sets. Soft Comput 22, 2175–2188 (2018). https://doi.org/10.1007/s00500-016-2476-4

Keywords

  • Web robot detection
  • Fuzzy rough set
  • Clustering