Web Page Classification Exploiting Contents of Surrounding Pages for Building a High-Quality Homepage Collection

  • Yuxin Wang
  • Keizo Oyama
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4312)

Abstract

We propose a web page classification method for building a high-quality collection of researchers’ homepages. We present a method that uses a recall-assured classifier and a precision-assured classifier to reduce the manual assessment needed to assure a given precision and recall. Each classifier is built with an SVM by tuning parameters and using textual features obtained from each page and its surrounding pages. The surrounding pages are grouped by connection type and relative URL hierarchy level, and features are extracted independently from each group. Experimental results show that the proposed features clearly improve classification performance and that the manual assessment required is significantly reduced.
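The two ideas in the abstract, grouping surrounding pages and combining two classifiers, can be illustrated with a minimal sketch. It is not the authors' implementation. The link-type labels (`"in-link"`, `"out-link"`), the level labels (`"upper"`, `"same"`, `"lower"`), the function names, the example URLs, and the zero thresholds are all assumptions made for illustration; the abstract does not specify them. The triage rule is one plausible reading of how a recall-assured and a precision-assured classifier could reduce manual work: pages rejected by the former are discarded, pages accepted by the latter are kept, and only the remainder is assessed by hand.

```python
from collections import defaultdict
from urllib.parse import urlparse


def directory_parts(url):
    """Return the directory components of a URL path, dropping a trailing file name."""
    path = urlparse(url).path
    parts = [p for p in path.split("/") if p]
    # Heuristic (assumption): a last component containing a dot is a file name.
    if parts and "." in parts[-1] and not path.endswith("/"):
        parts = parts[:-1]
    return parts


def relative_level(target_url, other_url):
    """Return 'upper', 'same', or 'lower' for other_url relative to target_url.

    Returns None for pages on another host or in an unrelated directory branch.
    """
    if urlparse(target_url).netloc != urlparse(other_url).netloc:
        return None
    target_dirs = directory_parts(target_url)
    other_dirs = directory_parts(other_url)
    if other_dirs == target_dirs:
        return "same"
    if len(other_dirs) < len(target_dirs) and target_dirs[:len(other_dirs)] == other_dirs:
        return "upper"
    if len(other_dirs) > len(target_dirs) and other_dirs[:len(target_dirs)] == target_dirs:
        return "lower"
    return None


def group_surrounding_text(target_url, neighbors):
    """Group neighbor page texts by (link_type, relative level).

    neighbors: iterable of (link_type, url, text) tuples, where link_type is a
    hypothetical label such as 'in-link' or 'out-link'. Each group would feed
    its own independent feature set.
    """
    groups = defaultdict(list)
    for link_type, url, text in neighbors:
        level = relative_level(target_url, url)
        if level is not None:
            groups[(link_type, level)].append(text)
    return dict(groups)


def triage(recall_score, precision_score, recall_threshold=0.0, precision_threshold=0.0):
    """Decide what to do with a page given two classifier scores.

    recall_score: decision value from the recall-assured classifier.
    precision_score: decision value from the precision-assured classifier.
    Thresholds are placeholders, not values from the paper.
    """
    if recall_score < recall_threshold:
        return "reject"   # very unlikely to be a homepage
    if precision_score >= precision_threshold:
        return "accept"   # very likely to be a homepage
    return "manual"       # uncertain: needs human assessment


if __name__ == "__main__":
    target = "http://www.example.ac.jp/lab/staff/taro/index.html"
    neighbors = [
        ("in-link", "http://www.example.ac.jp/lab/staff/index.html", "staff list"),
        ("out-link", "http://www.example.ac.jp/lab/staff/taro/pubs/list.html", "publications"),
        ("out-link", "http://other.example.org/conf.html", "conference site"),
    ]
    print(group_surrounding_text(target, neighbors))
    print(triage(0.5, -0.3))
```

In this sketch only pages labeled `"manual"` would need human assessment, which is where the reduction in manual effort would come from if the two classifiers meet their recall and precision targets.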

Keywords

Web page classification; SVM; Quality assurance

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Yuxin Wang (1)
  • Keizo Oyama (1, 2)
  1. School of Multidisciplinary Sciences, The Graduate University for Advanced Studies
  2. Research Organization of Information and Systems, National Institute of Informatics, Tokyo, Japan
