Hierarchical subtopic segmentation of web document

  • Web Information Mining and Retrieval
  • Published:
Wuhan University Journal of Natural Sciences

Abstract

This paper proposes a novel method for subtopic segmentation of Web documents. Subtopic segmentation can improve retrieval results. The proposed method segments subtopics hierarchically and identifies the boundary of each subtopic. Using a term frequency matrix, the method measures the similarity between adjacent blocks, such as paragraphs or passages. In an experiment on real-world samples, the macro-averaged precision and recall reach 73.4% and 82.5%, and the micro-averaged precision and recall reach 72.9% and 83.1%, respectively. Moreover, the method works equally well for other Asian languages such as Japanese and Korean, as well as for Western languages.
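To make the idea of comparing adjacent blocks concrete, here is a minimal sketch. The abstract does not state which similarity measure or boundary rule the paper uses, so cosine similarity, the fixed `threshold` value, the whitespace-style tokenizer, and all function names below are assumptions chosen for illustration. The sketch also finds only one flat level of boundaries, not the hierarchical segmentation the paper describes.

```python
import math
import re
from collections import Counter


def term_frequencies(block):
    """Count lowercase alphanumeric tokens in one block (paragraph or passage)."""
    return Counter(re.findall(r"[a-z0-9]+", block.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Counter returns 0 for terms it has not seen, so missing terms add nothing.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_similarities(blocks):
    """Similarity of each block to the block that follows it."""
    # Each Counter acts as one row of a sparse term-frequency matrix.
    rows = [term_frequencies(block) for block in blocks]
    return [cosine_similarity(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]


def find_boundaries(blocks, threshold=0.1):
    """Indices i where a boundary is assumed between block i and block i + 1.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    sims = adjacent_similarities(blocks)
    return [i for i, sim in enumerate(sims) if sim < threshold]
```

Given a list of paragraph strings, `find_boundaries(paragraphs)` returns the positions where neighbouring paragraphs share too little vocabulary to be treated as the same subtopic. The tokenizer here only handles Latin-script text; languages such as Chinese, Japanese, or Korean would need a different way of producing terms.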


Additional information

Foundation item: Supported by the National High Technology Research and Development Program of China (2002AA119050)

Biography: ZHANG Yun-tao (1971-), male, Lecturer, Ph.D. candidate, research direction: text information processing and data mining.

About this article

Cite this article

Yun-tao, Z., Ling, G. & Yong-cheng, W. Hierarchical subtopic segmentation of web document. Wuhan Univ. J. Nat. Sci. 11, 47–50 (2006). https://doi.org/10.1007/BF02831702

  • DOI: https://doi.org/10.1007/BF02831702
