Evaluation of a Distribution-Based Web Page Classification

Part of the Media Business and Innovation book series (MEDIA)


Since the invention of the World Wide Web, several approaches have been proposed for automatically classifying Web pages. These classifications often rely on the textual content of a page and therefore apply various methods of text analysis, ranging from bag-of-words representations based on word frequencies to more complex learning algorithms such as Support Vector Machines. In most cases, the structural information contained in the hypertext markup of Web pages is used as an additional input to the classification process.
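
As a rough illustration of the two kinds of input mentioned above, the following minimal sketch extracts a bag-of-words representation (word frequencies) and a simple structural signal (tag frequencies) from a page using only the Python standard library. The class name `PageFeatures`, the function `extract_features`, the tokenisation rule, and the sample page are assumptions made for this example and are not taken from the chapter.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class PageFeatures(HTMLParser):
    """Hypothetical helper: counts words (text) and start tags (structure)."""

    def __init__(self):
        super().__init__()
        self.words = Counter()  # bag of words: token -> frequency
        self.tags = Counter()   # structural signal: tag name -> frequency

    def handle_starttag(self, tag, attrs):
        # Every opening tag counts once towards the structural features.
        self.tags[tag] += 1

    def handle_data(self, data):
        # Assumed tokenisation: lowercase runs of ASCII letters only.
        self.words.update(re.findall(r"[a-z]+", data.lower()))


def extract_features(html):
    """Return (word_counts, tag_counts) for one HTML document."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return parser.words, parser.tags


# Made-up sample page, used only to exercise the sketch.
words, tags = extract_features(
    "<html><body><h1>Web news</h1><p>News about the web.</p>"
    "<a href='/more'>more news</a></body></html>"
)
print(words.most_common(2))
print(dict(tags))
```

Either counter could then be turned into a feature vector for a classifier of the reader's choice. The sketch stops at feature extraction and makes no claim about the method evaluated in the chapter.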



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Stuttgart Media University, Stuttgart, Germany
