
A Probabilistic Reasoning Approach for Discovering Web Crawler Sessions

  • Athena Stassopoulou
  • Marios D. Dikaiakos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4505)

Abstract

In this paper we introduce a probabilistic-reasoning approach to distinguish Web robots (crawlers) from human visitors of Web sites. Our approach employs a naive Bayes network to classify the HTTP sessions of a Web-server access log as crawler-induced or human-induced. The Bayesian network combines various pieces of evidence that have been shown to distinguish between crawler and human HTTP traffic. The parameters of the Bayesian network are determined with machine learning techniques, and each session is assigned the class with the maximum posterior probability given the available evidence. Our method is applied to real Web logs and achieves a classification accuracy of 95%. The high accuracy with which our system detects crawler sessions demonstrates the robustness and effectiveness of the proposed methodology.
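To make the classification rule concrete, the following is a minimal sketch of naive Bayes classification by maximum posterior probability over discrete session features. It is not the authors' implementation: the feature names (`robots_txt`, `images`), their values, the four toy training sessions, and the use of Laplace smoothing (`alpha`) are all assumptions introduced here for illustration only.

```python
import math
from collections import Counter, defaultdict


def train(sessions, labels, alpha=1.0):
    """Estimate class priors and conditional probability tables (CPTs).

    Laplace smoothing (alpha) is an assumption of this sketch.
    """
    class_counts = Counter(labels)
    # counts[feature][label][value] -> number of training sessions
    counts = defaultdict(lambda: defaultdict(Counter))
    values_seen = defaultdict(set)
    for session, label in zip(sessions, labels):
        for feature, value in session.items():
            counts[feature][label][value] += 1
            values_seen[feature].add(value)

    priors = {c: n / len(labels) for c, n in class_counts.items()}
    cpts = {}
    for feature, values in values_seen.items():
        cpts[feature] = {}
        for c, n in class_counts.items():
            denom = n + alpha * len(values)
            cpts[feature][c] = {
                v: (counts[feature][c][v] + alpha) / denom for v in values
            }
    return priors, cpts


def posterior(session, priors, cpts):
    """Return P(class | evidence) for every class, computed in log space."""
    log_scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for feature, value in session.items():
            table = cpts.get(feature, {}).get(c, {})
            if value in table:  # unseen evidence is treated as missing
                score += math.log(table[value])
        log_scores[c] = score
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {c: u / total for c, u in unnormalised.items()}


def classify(session, priors, cpts):
    """Pick the class with the maximum posterior probability."""
    post = posterior(session, priors, cpts)
    return max(post, key=post.get)


# Hypothetical toy data, invented for this sketch (not from the paper).
train_sessions = [
    {"robots_txt": "yes", "images": "low"},
    {"robots_txt": "no", "images": "low"},
    {"robots_txt": "no", "images": "high"},
    {"robots_txt": "no", "images": "high"},
]
train_labels = ["crawler", "crawler", "human", "human"]

priors, cpts = train(train_sessions, train_labels)
label = classify({"robots_txt": "yes", "images": "low"}, priors, cpts)
```

The naive Bayes assumption is that the features are conditionally independent given the class, so the posterior is proportional to the class prior multiplied by one CPT entry per observed feature. Working in log space avoids numerical underflow when many pieces of evidence are combined.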

Keywords

Bayesian Network, Class Imbalance, Conditional Probability Table, Maximum Posterior Probability, Random Undersampling
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Athena Stassopoulou (1)
  • Marios D. Dikaiakos (2)
  1. Department of Computer Science, Intercollege, Cyprus
  2. Department of Computer Science, University of Cyprus, Cyprus
