A Heuristic Approach for Topical Information Extraction from News Pages

  • Yan Liu
  • Qiang Wang
  • QingXian Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4255)


Extracting topical information from news pages can facilitate news search and retrieval. A web page can be partitioned into multiple blocks, and these blocks differ in importance. Estimating block importance can therefore be framed as a classification problem. First, an adaptive vision-based page segmentation algorithm partitions a web page into semantic blocks. Then spatial features and content features are used to represent each block. Shannon’s information entropy is adopted to measure each feature’s discriminative ability. A weighted Naïve Bayes classifier estimates whether a block is important or not. Finally, a variation of TF-IDF is used to weight each keyword, and similar blocks are merged into a topical region. The approach is tested on several major English and Chinese news sites. Both recall and precision are greater than 96%.
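
The combination of entropy-based feature weighting and a weighted Naïve Bayes classifier can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so this example assumes each feature's weight is its information gain (the reduction in Shannon entropy of the block label after splitting on that feature) and uses the weight as an exponent on the feature's likelihood. The feature names (position, size, link density), the toy training blocks, and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def feature_weights(rows, labels):
    """Assumed weighting: information gain of each feature."""
    base = entropy(labels)
    weights = []
    for j in range(len(rows[0])):
        groups = defaultdict(list)
        for row, y in zip(rows, labels):
            groups[row[j]].append(y)
        # Expected label entropy after splitting on feature j.
        cond = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        weights.append(base - cond)
    return weights


def train(rows, labels):
    """Collect class counts, per-class feature value counts, and weights."""
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)      # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            counts[(j, y)][v] += 1
            values[j].add(v)
    return priors, counts, values, feature_weights(rows, labels)


def predict(model, row):
    """Return the class maximizing log P(c) + sum_j w_j * log P(x_j | c)."""
    priors, counts, values, weights = model
    total = sum(priors.values())
    best, best_score = None, -math.inf
    for y, n_y in priors.items():
        score = math.log(n_y / total)
        for j, v in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing the score.
            p = (counts[(j, y)][v] + 1) / (n_y + len(values[j]))
            score += weights[j] * math.log(p)
        if score > best_score:
            best, best_score = y, score
    return best


# Hypothetical blocks described by (position, size, link density).
rows = [
    ("center", "large", "low"),
    ("center", "large", "low"),
    ("center", "small", "low"),
    ("left", "small", "high"),
    ("top", "small", "high"),
    ("bottom", "small", "high"),
]
labels = ["important", "important", "important", "noise", "noise", "noise"]

model = train(rows, labels)
print(model[3])                                     # per-feature weights
print(predict(model, ("center", "large", "low")))
print(predict(model, ("bottom", "small", "high")))
```

In this toy data, position and link density separate the two classes perfectly and so receive the largest weight, while block size is less informative and contributes less to the final score. A real system would derive the features from the segmented page rather than from hand-written tuples.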


Keywords: Information Retrieval, Entropy, Naive Bayes classifier





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Yan Liu (1)
  • Qiang Wang (1)
  • QingXian Wang (1)
  1. Information Engineering Institute, Information Engineering University, Zhengzhou, P.R. China
