An Experiment to Test URL Features for Web Page Classification

  • Inma Hernández (email author)
  • Carlos R. Rivero
  • David Ruiz
  • José Luis Arjona
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 157)


Web page classification has been extensively researched, using different types of features extracted from the page content, the page structure, or other pages that link to the page. Using features from the page itself means the page must be downloaded before it can be classified. We present an experiment to prove that URL tokens contain enough information to extract features for classifying web pages. A classifier based on these features can classify a web page without downloading it first, avoiding unnecessary downloads.
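The idea of classifying a page from its URL tokens alone can be illustrated with a minimal sketch. This is not the authors' experiment: the tokenisation rule, the naive Bayes model, the class labels and the `example.org` URLs are all assumptions chosen only to show how features might be drawn from a URL without downloading the page.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL's host, path and query into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


class UrlTokenClassifier:
    """Multinomial naive Bayes over URL tokens with add-one smoothing.

    An illustrative stand-in; the source does not state which model it uses.
    """

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()             # label -> number of URLs
        self.vocabulary = set()

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            tokens = url_tokens(url)
            self.token_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, url):
        tokens = url_tokens(url)
        total_urls = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_urls in self.label_counts.items():
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            # Log prior plus smoothed log likelihood of each URL token.
            score = math.log(n_urls / total_urls)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training URLs and labels, made up for this sketch.
TRAIN = [
    ("http://example.org/product/1234", "product"),
    ("http://example.org/product/5678?ref=home", "product"),
    ("http://example.org/category/books", "category"),
    ("http://example.org/category/music?page=2", "category"),
]

clf = UrlTokenClassifier().fit([u for u, _ in TRAIN], [l for _, l in TRAIN])
prediction = clf.predict("http://example.org/product/9999")  # no download needed
```

Only the URL string is inspected at prediction time, which is the property the abstract relies on: a crawler could apply such a classifier to a link before deciding whether to fetch the target page.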


Anchor Text, Tree Edit Distance, Academical Site, Page Structure, Maximum Entropy Classi
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Inma Hernández (1), email author
  • Carlos R. Rivero (1)
  • David Ruiz (1)
  • José Luis Arjona (2)
  1. University of Seville, Seville, Spain
  2. University of Huelva, Huelva, Spain
