A Heuristic Approach for Topical Information Extraction from News Pages
Topical information extraction from news pages can facilitate news searching and retrieval. A web page can be partitioned into multiple blocks, and these blocks differ in importance. Estimating block importance can be framed as a classification problem. First, an adaptive vision-based page segmentation algorithm is used to partition a web page into semantic blocks. Spatial features and content features are then used to represent each block. Shannon's information entropy is adopted to measure each feature's ability to discriminate between blocks. A weighted Naïve Bayes classifier is used to estimate whether a block is important or not. Finally, a variation of TF-IDF is used to weight each keyword, so that similar blocks can be united into a topical region. The approach is tested on several major English and Chinese news sites. Both recall and precision exceed 96%.
Keywords: Information Retrieval, Entropy, Naive Bayes classifier
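The entropy-based feature weighting and the weighted Naïve Bayes step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the weight of a feature is taken to be its information gain divided by the class entropy, the weight is applied as an exponent on the class-conditional likelihood, Laplace smoothing is used, and the two discretised features (`position`, `link density`) and the toy training blocks are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def feature_weights(rows, labels):
    """Assumed weighting: information gain of each feature / class entropy."""
    base = entropy(labels)
    weights = []
    for f in range(len(rows[0])):
        groups = defaultdict(list)
        for row, y in zip(rows, labels):
            groups[row[f]].append(y)
        # Conditional entropy of the class given this feature's value.
        cond = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        weights.append((base - cond) / base if base > 0 else 0.0)
    return weights


def train(rows, labels):
    """Collect the counts needed by a categorical Naive Bayes model."""
    counts = Counter()
    for row, y in zip(rows, labels):
        for f, v in enumerate(row):
            counts[(y, f, v)] += 1
    return {
        "classes": sorted(set(labels)),
        "class_counts": Counter(labels),
        "counts": counts,
        "n_values": [len({row[f] for row in rows}) for f in range(len(rows[0]))],
        "weights": feature_weights(rows, labels),
        "total": len(labels),
    }


def predict(model, row):
    """Return the class maximising log P(c) + sum_f w_f * log P(x_f | c)."""
    best_class, best_score = None, -math.inf
    for c in model["classes"]:
        n_c = model["class_counts"][c]
        score = math.log(n_c / model["total"])
        for f, v in enumerate(row):
            # Laplace-smoothed likelihood of value v for feature f in class c.
            p = (model["counts"][(c, f, v)] + 1) / (n_c + model["n_values"][f])
            score += model["weights"][f] * math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical blocks: (position on page, link density); label 1 = important.
blocks = [("center", "low"), ("center", "low"), ("center", "high"),
          ("top", "high"), ("left", "high"), ("bottom", "high"),
          ("left", "low"), ("center", "low")]
important = [1, 1, 1, 0, 0, 0, 0, 1]

model = train(blocks, important)
print(model["weights"])
print(predict(model, ("center", "high")), predict(model, ("left", "low")))
```

In this toy data the position feature separates the two classes perfectly and so receives a larger weight than link density, which is the effect the entropy-based weighting is meant to capture. A real system would compute the features from the segmented blocks rather than from hand-written tuples.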