Chapter

Digital Libraries: Achievements, Challenges and Opportunities

Volume 4312 of the series Lecture Notes in Computer Science, pp. 515-518

Web Page Classification Exploiting Contents of Surrounding Pages for Building a High-Quality Homepage Collection

  • Yuxin Wang, affiliated with School of Multidisciplinary Sciences, The Graduate University for Advanced Studies
  • Keizo Oyama, affiliated with School of Multidisciplinary Sciences, The Graduate University for Advanced Studies; Research Organization of Information and Systems, National Institute of Informatics

Abstract

We propose a web page classification method for building a high-quality collection of researchers’ homepages. To reduce the manual assessment needed to guarantee a given precision and recall, the method combines a recall-assured classifier with a precision-assured classifier. Each classifier is an SVM with tuned parameters, trained on textual features obtained from the target page and its surrounding pages. The surrounding pages are grouped by connection type and relative URL hierarchy level, and features are extracted independently from each group. Experimental results show that the proposed features clearly improve classification performance and that the manual assessment effort is significantly reduced.
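
The two ideas in the abstract, grouping surrounding pages and combining two classifiers, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The connection labels (`in_link`, `out_link`), the three relative levels (`upper`, `same`, `lower`), the function names, the example URLs, and the zero thresholds are all assumptions made for illustration. The triage rule (reject what the recall-assured classifier rules out, accept what the precision-assured classifier is sure of, send the rest to a human) is one plausible reading of how manual assessment is reduced. The scores stand in for SVM decision values.

```python
from collections import defaultdict
from urllib.parse import urlparse


def url_depth(url):
    """Count directory levels in the URL path, excluding the file name."""
    path = urlparse(url).path
    return len([part for part in path.split("/")[:-1] if part])


def relative_level(target_url, neighbor_url):
    """Label a surrounding page as 'upper', 'same', or 'lower' (assumed labels)."""
    diff = url_depth(neighbor_url) - url_depth(target_url)
    if diff < 0:
        return "upper"
    if diff > 0:
        return "lower"
    return "same"


def group_surrounding(target_url, neighbors):
    """Group neighbor texts by (connection type, relative URL level).

    `neighbors` is an iterable of (connection, url, text) tuples; a separate
    feature vector would then be extracted from each group.
    """
    groups = defaultdict(list)
    for connection, url, text in neighbors:
        groups[(connection, relative_level(target_url, url))].append(text)
    return dict(groups)


def triage(recall_score, precision_score,
           recall_threshold=0.0, precision_threshold=0.0):
    """Route a page using two classifier scores (hypothetical decision rule)."""
    if recall_score < recall_threshold:
        return "reject"   # the recall-assured classifier rules the page out
    if precision_score >= precision_threshold:
        return "accept"   # the precision-assured classifier is confident
    return "manual"       # uncertain: needs human assessment


target = "http://example.org/lab/staff/wang/index.html"
neighbors = [
    ("in_link", "http://example.org/lab/staff/", "staff list"),
    ("out_link", "http://example.org/lab/staff/wang/pubs/list.html", "publications"),
    ("out_link", "http://example.org/lab/staff/wang/cv.html", "curriculum vitae"),
]
groups = group_surrounding(target, neighbors)
```

Only pages that `triage` routes to `"manual"` need human assessment, which is how the effort shrinks under these assumptions.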

Keywords

Web page classification, SVM, Quality assurance