An Experiment to Test URL Features for Web Page Classification

Hernández, Inma; Rivero, Carlos R.; Ruiz, David; Arjona, José Luis

doi:10.1007/978-3-642-28795-4_13

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 157)

Abstract

Web page classification has been extensively researched using different types of features, extracted from the page content, from the page structure, or from other pages that link to it. Using features from the page itself means the page must be downloaded before it can be classified. We present an experiment to prove that URL tokens contain enough information to extract features for classifying web pages. A classifier based on these features can classify a web page without downloading it first, avoiding unnecessary downloads.
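
The abstract does not say how URLs are tokenised or which classifier is used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simple tokeniser (host labels, path segments, query parameter names) and a naive relative-frequency score per class; the function names `url_tokens`, `train` and `classify` and the example.org URLs are hypothetical.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_tokens(url):
    """Split a URL into lowercase tokens from host, path and query names."""
    parts = urlsplit(url)
    tokens = []
    # Host labels, e.g. "example" and "org".
    tokens += [t for t in (parts.hostname or "").split(".") if t]
    # Path segments, further split on non-alphanumeric characters.
    for segment in parts.path.split("/"):
        tokens += [t for t in re.split(r"[^A-Za-z0-9]+", segment) if t]
    # Query parameter names only (assumption: values are page-specific).
    tokens += [name for name, _ in parse_qsl(parts.query, keep_blank_values=True)]
    return [t.lower() for t in tokens]


def train(labelled_urls):
    """Count how often each token occurs in the URLs of each class."""
    counts = defaultdict(Counter)
    for url, label in labelled_urls:
        counts[label].update(url_tokens(url))
    return counts


def classify(counts, url):
    """Return the class whose token frequencies best match the URL."""
    tokens = url_tokens(url)

    def score(label):
        total = sum(counts[label].values()) or 1  # avoid division by zero
        return sum(counts[label][t] / total for t in tokens)

    return max(counts, key=score)


# Hypothetical training data: only the URLs are used, no page is downloaded.
training = [
    ("http://example.org/author/view?id=12", "author"),
    ("http://example.org/author/view?id=99", "author"),
    ("http://example.org/paper/show?doc=7", "paper"),
    ("http://example.org/paper/show?doc=8", "paper"),
]
model = train(training)
predicted = classify(model, "http://example.org/paper/show?doc=123")
```

The point of the sketch is that `classify` only inspects the URL string, so a crawler could decide whether a link is worth following before issuing any request. A real experiment would need a proper learning algorithm and evaluation on labelled sites.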

Supported by the European Commission (FEDER), the Spanish and the Andalusian R&D&I programmes (grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, and TIN2010-09988-E).

Author information

Authors and Affiliations

University of Seville, Seville, Spain
Inma Hernández, Carlos R. Rivero & David Ruiz
University of Huelva, Huelva, Spain
José Luis Arjona

Corresponding author

Correspondence to Inma Hernández.

Editor information

Editors and Affiliations

Faculty of Science, Department of Computing Science, University of Salamanca, Plaza de la Merced S/N, Salamanca, 37008, Spain
Juan M. Corchado Rodríguez
Escuela Universitaria de Informática, Universidad Pontificia de Salamanca, Compañía 5, Salamanca, 37002, Spain
Javier Bajo Pérez
Poznan University of Technology, Strzelecka 11, Poznan, 60-965, Poland
Paulina Golinska
Faculté des Sciences, Département de mathématiques, Université de Sherbrooke, 2500 boul. Université, Sherbrooke, J1K 2R1, Canada
Sylvain Giroux
ETSI Informática, Avda. Reina Mercedes, s/n, Sevilla, 41012, Spain
Rafael Corchuelo

Cite this paper

Hernández, I., Rivero, C.R., Ruiz, D., Arjona, J.L. (2012). An Experiment to Test URL Features for Web Page Classification. In: Rodríguez, J., Pérez, J., Golinska, P., Giroux, S., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28795-4_13

DOI: https://doi.org/10.1007/978-3-642-28795-4_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28794-7
Online ISBN: 978-3-642-28795-4
eBook Packages: Engineering, Engineering (R0)
