Performance Evaluation of Computer and Communication Systems. Milestones and Future Challenges, pp. 227–239
Analysis of Web Logs: Challenges and Findings
Abstract
Web logs are an important source of information for describing and understanding the traffic that servers receive and its characteristics. Analyzing these logs is challenging because of the large volume of data and the complex relationships hidden in it. Our investigation focuses on the logs of two Web servers and identifies the main characteristics of their workload and the navigation profiles of the crawlers and human users visiting the sites. The classification of these visitors has shown some interesting similarities and differences in terms of traffic intensity and its temporal distribution. In general, crawlers tend to revisit the sites rather often, although, to reduce their impact on the servers' resources, they seldom send bursts of requests. The other clients are also characterized by periodic patterns that can be effectively represented by a few principal components.
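As a rough illustration of the preprocessing that this kind of analysis depends on, the sketch below parses log lines, separates crawler requests from the rest, and counts requests per hour of day for each group. It is not the authors' procedure. The Combined Log Format, the user-agent hints in CRAWLER_HINTS, the robots.txt heuristic, and the sample lines are all assumptions made for the example.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed input: Combined Log Format, i.e.
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical substrings; real crawler lists are much longer.
CRAWLER_HINTS = ("bot", "crawler", "spider", "slurp")


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def is_crawler(record):
    """Heuristic: the request targets /robots.txt or the agent looks like a robot."""
    asks_robots = record["request"].split(" ")[1:2] == ["/robots.txt"]
    agent = record["agent"].lower()
    return asks_robots or any(hint in agent for hint in CRAWLER_HINTS)


def hourly_profile(lines):
    """Count requests per hour of day, separately for crawlers and other clients."""
    profile = {"crawler": Counter(), "other": Counter()}
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip malformed lines
        kind = "crawler" if is_crawler(record) else "other"
        profile[kind][record["time"].hour] += 1
    return profile


# Made-up sample lines using documentation-only IP addresses.
SAMPLE = [
    '192.0.2.1 - - [10/Oct/2011:13:55:36 +0200] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [10/Oct/2011:13:55:40 +0200] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2011:14:02:11 +0200] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    'not a log line',
]

profile = hourly_profile(SAMPLE)
```

A user-agent heuristic of this kind misses crawlers that disguise themselves as browsers, so it is only a starting point for the classification. The per-hour counts could then serve as the input to a temporal analysis such as the principal component analysis mentioned above.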
Keywords
Spec Server; Hierarchical Cluster Technique; Malicious Purpose; Academic Server; Navigational Pattern