A Tool for Link-Based Web Page Classification

  • Inma Hernández
  • Carlos R. Rivero
  • David Ruiz
  • Rafael Corchuelo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7023)


Virtual integration systems require a crawler that navigates web sites automatically in search of relevant information. This process runs online: while the system looks for the required information, the user is waiting for a response. Downloading as few irrelevant pages as possible is therefore essential to crawler efficiency. Most crawlers must download a page to determine its relevance, so they download a large number of irrelevant pages. In this paper, we propose a classifier that helps crawlers navigate web sites efficiently. The classifier determines whether a web page is relevant by analysing only its URL, which minimises the number of irrelevant pages downloaded, improves crawling efficiency, and reduces bandwidth usage, making it suitable for virtual integration systems.
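
To make the idea of classifying a page from its URL alone more concrete, the following is a minimal illustrative sketch in Python. It is not the classifier proposed in the paper, whose details the abstract does not give. It assumes a much simpler scheme: each URL is reduced to a coarse pattern, and a URL counts as relevant if its pattern matches one seen among labelled relevant examples. The names `url_pattern` and `UrlPatternClassifier`, the `<id>` placeholder, and the `example.org` URLs are all hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse pattern (hypothetical scheme, for illustration).

    Path segments containing a digit become '<id>', other segments are
    lower-cased, and only the names of query parameters are kept.
    """
    parts = urlsplit(url)
    segments = [
        "<id>" if re.search(r"\d", seg) else seg.lower()
        for seg in parts.path.split("/")
        if seg
    ]
    params = sorted(p.split("=", 1)[0] for p in parts.query.split("&") if p)
    pattern = "/" + "/".join(segments)
    if params:
        pattern += "?" + "&".join(params)
    return pattern


class UrlPatternClassifier:
    """Labels a URL as relevant if its pattern was seen on a relevant example."""

    def __init__(self):
        self.relevant = set()

    def fit(self, labelled_urls):
        # labelled_urls: iterable of (url, is_relevant) pairs.
        for url, is_relevant in labelled_urls:
            if is_relevant:
                self.relevant.add(url_pattern(url))
        return self

    def is_relevant(self, url: str) -> bool:
        # No page download is needed: only the URL string is inspected.
        return url_pattern(url) in self.relevant


training = [
    ("http://example.org/book/123", True),
    ("http://example.org/about", False),
]
classifier = UrlPatternClassifier().fit(training)
```

A crawler using such a classifier would call `classifier.is_relevant(link)` on each extracted link and download only those for which it returns `True`, which is where the bandwidth saving described above would come from.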


Keywords: Crawling, Web Page Classification, Virtual Integration





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Inma Hernández (1)
  • Carlos R. Rivero (1)
  • David Ruiz (1)
  • Rafael Corchuelo (1)
  1. University of Seville, Seville, Spain
