A Tool for Link-Based Web Page Classification

Hernández, Inma; Rivero, Carlos R.; Ruiz, David; Corchuelo, Rafael

doi:10.1007/978-3-642-25274-7_45

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7023)

Included in the following conference series:

Conference of the Spanish Association for Artificial Intelligence

Abstract

Virtual integration systems require a crawler that navigates web sites automatically, looking for relevant information. This process runs online: while the system searches for the required information, the user is waiting for a response. Keeping the number of irrelevant pages downloaded to a minimum is therefore essential to crawler efficiency. Most crawlers must download a page to determine its relevance, so they end up downloading many irrelevant pages. In this paper, we propose a classifier that helps crawlers navigate web sites efficiently. The classifier determines whether a web page is relevant by analysing its URL alone, which minimises the number of irrelevant pages downloaded, improves crawling efficiency, and reduces bandwidth use, making it suitable for virtual integration systems.
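
The idea of judging relevance from a URL alone can be illustrated with a minimal sketch. It is not the classifier proposed in the paper, whose features and model are not described in this abstract. As a stand-in, it uses a multinomial naive Bayes scorer over URL tokens. The names `url_tokens` and `UrlClassifier`, the tokenisation scheme, and the `example.org` training URLs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def url_tokens(url):
    """Split a URL into path-segment and query-key tokens without fetching it."""
    parts = urlsplit(url)
    tokens = []
    for segment in parts.path.split("/"):
        if segment:
            # Collapse digit runs so /product/101 and /product/202 share tokens.
            tokens.append("path:" + re.sub(r"\d+", "<num>", segment.lower()))
    for key, _value in parse_qsl(parts.query, keep_blank_values=True):
        tokens.append("query:" + key.lower())
    return tokens


class UrlClassifier:
    """Hypothetical stand-in: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.token_counts = {True: Counter(), False: Counter()}
        self.url_counts = {True: 0, False: 0}

    def train(self, labelled_urls):
        # labelled_urls: iterable of (url, is_relevant) pairs.
        for url, relevant in labelled_urls:
            self.url_counts[relevant] += 1
            self.token_counts[relevant].update(url_tokens(url))

    def _log_score(self, tokens, label):
        vocab = set(self.token_counts[True]) | set(self.token_counts[False])
        total = sum(self.token_counts[label].values())
        n_urls = sum(self.url_counts.values())
        # Smoothed class prior plus smoothed per-token likelihoods;
        # the extra +1 in the denominator reserves mass for unseen tokens.
        score = math.log((self.url_counts[label] + 1) / (n_urls + 2))
        for token in tokens:
            count = self.token_counts[label][token]
            score += math.log((count + 1) / (total + len(vocab) + 1))
        return score

    def is_relevant(self, url):
        tokens = url_tokens(url)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


# Illustrative training set (made-up URLs, not data from the paper).
training = [
    ("http://example.org/product/101", True),
    ("http://example.org/product/202?ref=home", True),
    ("http://example.org/product/303", True),
    ("http://example.org/about", False),
    ("http://example.org/help/contact", False),
    ("http://example.org/login?next=/", False),
]

classifier = UrlClassifier()
classifier.train(training)
```

A crawler could call `classifier.is_relevant(link)` on each extracted link and download only those predicted relevant, so that no bandwidth is spent on a page before its relevance is estimated.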

Supported by the European Commission (FEDER), the Spanish and the Andalusian R&D&I programmes (grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, and TIN2010-09988-E).

Author information

Authors and Affiliations

University of Seville, Seville, Spain
Inma Hernández, Carlos R. Rivero, David Ruiz & Rafael Corchuelo

Editor information

Editors and Affiliations

Computer Science School, University of the Basque Country, Pº Manuel de Lardizabal 1, 20018, Donostia-San Sebastian, Spain
Jose A. Lozano
Computing Systems Department, University of Castilla-La Mancha, Campus Universitario s/n, 02071, Albacete, Spain
José A. Gámez
Dep. Statistics, O.R. and Computation, University of La Laguna, 38271, La Laguna, S.C. Tenerife, Spain
José A. Moreno

Cite this paper

Hernández, I., Rivero, C.R., Ruiz, D., Corchuelo, R. (2011). A Tool for Link-Based Web Page Classification. In: Lozano, J.A., Gámez, J.A., Moreno, J.A. (eds) Advances in Artificial Intelligence. CAEPIA 2011. Lecture Notes in Computer Science, vol 7023. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25274-7_45

DOI: https://doi.org/10.1007/978-3-642-25274-7_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25273-0
Online ISBN: 978-3-642-25274-7
eBook Packages: Computer Science, Computer Science (R0)
