Analysis of Web Logs: Challenges and Findings

  • Maria Carla Calzarossa
  • Luisa Massari
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6821)

Abstract

Web logs are an important source of information for describing and understanding server traffic and its characteristics. Analyzing these logs is challenging because of the large volume of data and the complex relationships hidden in them. Our investigation focuses on the logs of two Web servers and identifies the main characteristics of their workload and the navigation profiles of the crawlers and human users visiting the sites. The classification of these visitors has shown some interesting similarities and differences in terms of traffic intensity and its temporal distribution. In general, crawlers tend to revisit the sites rather often, but they seldom send bursts of requests, so as to limit their impact on server resources. The other clients are also characterized by periodic patterns that can be effectively represented by a few principal components.
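
As a rough illustration of the kind of processing the abstract describes, the sketch below parses access-log lines, splits clients into crawlers and other visitors, and builds an hourly request profile for each group. It is not the authors' method, which the abstract does not detail. The log layout (Apache-style Combined Log Format), the crawler heuristic (a client that requests /robots.txt or whose user agent contains "bot" or "crawler"), and the names `parse_line` and `hourly_profiles` are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Assumed layout (Combined Log Format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return {
        "host": m.group("host"),
        "time": ts,
        "path": m.group("path"),
        "agent": m.group("agent") or "",
    }


def hourly_profiles(lines):
    """Count requests per hour of day for crawlers and for all other clients."""
    crawlers = set()
    counts = defaultdict(Counter)  # host -> {hour: number of requests}
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip malformed lines
        agent = rec["agent"].lower()
        # Hypothetical heuristic: robots.txt access or a bot-like user agent.
        if rec["path"] == "/robots.txt" or "bot" in agent or "crawler" in agent:
            crawlers.add(rec["host"])
        counts[rec["host"]][rec["time"].hour] += 1

    profiles = {"crawler": [0] * 24, "other": [0] * 24}
    for host, per_hour in counts.items():
        label = "crawler" if host in crawlers else "other"
        for hour, n in per_hour.items():
            profiles[label][hour] += n
    return profiles
```

Calling `hourly_profiles(open("access.log"))` on a log file (the file name is a placeholder) would return two 24-element lists. Profiles of this kind, computed per client rather than per group, are the sort of input to which a principal component analysis could then be applied.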

Keywords

Spec Server; Hierarchical Cluster Technique; Malicious Purpose; Academic Server; Navigational Pattern
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© IFIP International Federation for Information Processing 2011

Authors and Affiliations

  • Maria Carla Calzarossa (1)
  • Luisa Massari (1)
  1. Dipartimento di Informatica e Sistemistica, Università di Pavia, Pavia, Italy
