Performance Evaluation of Large Data Clustering Techniques on Web Robot Session Data

  • Dilip Singh Sisodia
  • Rahul Borkar
  • Hari Shrawgi
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 748)

Abstract

Web robots are scripts that automatically surf the Web's server structure to locate and index information. These robots are sometimes used maliciously and can create a myriad of problems in the functioning of servers. Such automated programs are difficult to trace because they mask their identities. A web log file, which comprises server requests, can be used to identify these robots through clustering techniques. These log files contain a massive amount of data, so large data clustering algorithms are used to partition the requests into robotic sessions and human sessions. In this paper, we compare the primary large data clustering techniques. To cluster the HTTP requests, we implemented BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies, a hierarchical clustering technique), DBSCAN (Density-Based Spatial Clustering of Applications with Noise, a density-based clustering technique), and CLIQUE (Clustering in Quest, a grid-based method) using the open-source ELKI and JBIRCH Java packages. The performance of the three algorithms is compared using internal validation measures: Dunn's index, the Davies-Bouldin (DB) index, and the average silhouette index. The study found that four clusters give the best validation measures.
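As an illustration of two of the internal validation measures named above, the following is a minimal sketch and not the authors' implementation. It computes Dunn's index and the Davies-Bouldin index in plain Python on toy 2-D points that stand in for session feature vectors. The data, the function names, and the use of Euclidean distance are all assumptions made for this example; the average silhouette index is omitted for brevity.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest cluster diameter.

    Higher values indicate compact, well-separated clusters.
    """
    # Closest pair of points that lie in two different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Farthest pair of points that lie in the same cluster.
    max_diameter = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


def davies_bouldin_index(clusters):
    """Mean, over clusters, of the worst scatter-to-separation ratio.

    Lower values indicate better clustering.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its cluster centroid.
    scatter = [
        sum(dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst_ratios) / k


# Hypothetical toy data: two tight groups of points (not from the paper).
good_partition = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(5, 5), (5, 6), (6, 5), (6, 6)],
]
# The same points split badly, mixing the two groups.
bad_partition = [
    [(0, 0), (0, 1), (5, 5), (5, 6)],
    [(1, 0), (1, 1), (6, 5), (6, 6)],
]

good_dunn = dunn_index(good_partition)
bad_dunn = dunn_index(bad_partition)
good_db = davies_bouldin_index(good_partition)
bad_db = davies_bouldin_index(bad_partition)
```

On this toy data the well-separated partition scores a Dunn index of about 4.0 and a DB index of about 0.2, while the mixed partition scores a lower Dunn index and a higher DB index. This is the direction in which such measures would be read when comparing clusterings.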

Keywords

Clustering; BIRCH; DBSCAN; CLIQUE; Clustering feature; Web robots; Web server logs; Web sessions

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Dilip Singh Sisodia (1)
  • Rahul Borkar (1)
  • Hari Shrawgi (1)
  1. National Institute of Technology Raipur, Raipur, India
