An Evaluation of Passage-Based Text Categorization

Journal of Intelligent Information Systems

Abstract

Research in text categorization has been confined to whole-document-level classification, probably because of a lack of full-text test collections. However, the large quantities of full-length documents available today have renewed interest in text classification. A document is usually written in an organized structure to present its main topic(s). This structure can be expressed as a sequence of subtopic text blocks, or passages. To reflect the subtopic structure of a document, we propose a new passage-level, or passage-based, text categorization model, which segments a test document into several passages, assigns categories to each passage, and merges the passage categories into document categories. Compared with traditional document-level categorization, this model requires two additional steps: passage splitting and category merging. Using four subsets of the Reuters text categorization test collection and a full-text test collection whose documents range in size from tens to hundreds of kilobytes, we evaluate the proposed model, focusing on the effectiveness of various passage types and the importance of passage location in category merging. Our results show that simple windows are the best passage type for all test collections used in these experiments. We also found that passages contribute to the main topic(s) to different degrees, depending on their location in the test document.
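The three steps named in the abstract (passage splitting, per-passage categorization, category merging) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' method: the function names are hypothetical, the keyword-counting classifier is a toy stand-in for a trained classifier, and the weighted-average merge with a threshold is only one plausible way to combine passage scores. The optional `position_weight` callback shows how passage location could be taken into account during merging.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Sequence, Set

Scores = Dict[str, float]


def split_into_windows(tokens: Sequence[str], size: int, step: int) -> List[List[str]]:
    """Split tokens into fixed-size word windows (step < size gives overlap)."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
        start += step
    return windows


def keyword_classifier(keywords: Dict[str, Set[str]]) -> Callable[[List[str]], Scores]:
    """Toy stand-in for a trained classifier: score = fraction of keyword hits."""
    def classify(passage: List[str]) -> Scores:
        if not passage:
            return {}
        return {
            category: sum(token in words for token in passage) / len(passage)
            for category, words in keywords.items()
        }
    return classify


def merge_passage_categories(
    passage_scores: List[Scores], weights: List[float], threshold: float
) -> List[str]:
    """Weighted average of passage scores; keep categories at or above threshold."""
    if len(passage_scores) != len(weights):
        raise ValueError("one weight per passage is required")
    total = sum(weights)
    if total <= 0:
        return []
    document_scores: Dict[str, float] = defaultdict(float)
    for scores, weight in zip(passage_scores, weights):
        for category, score in scores.items():
            document_scores[category] += weight * score
    return sorted(c for c, s in document_scores.items() if s / total >= threshold)


def categorize_document(
    text: str,
    classify: Callable[[List[str]], Scores],
    size: int = 50,
    step: int = 50,
    threshold: float = 0.1,
    position_weight: Optional[Callable[[int, int], float]] = None,
) -> List[str]:
    """Split the document, classify each passage, then merge the results."""
    passages = split_into_windows(text.lower().split(), size, step)
    weights = [
        position_weight(i, len(passages)) if position_weight else 1.0
        for i in range(len(passages))
    ]
    return merge_passage_categories([classify(p) for p in passages], weights, threshold)


if __name__ == "__main__":
    classify = keyword_classifier(
        {"grain": {"wheat", "corn"}, "trade": {"tariff", "export"}}
    )
    text = "wheat corn wheat harvest tariff export talks resume"
    # Uniform passage weights.
    print(categorize_document(text, classify, size=4, step=4, threshold=0.2))
    # Hypothetical location weighting: the first passage counts double.
    lead_heavy = lambda i, n: 2.0 if i == 0 else 1.0
    print(categorize_document(text, classify, size=4, step=4, threshold=0.2,
                              position_weight=lead_heavy))
```

With uniform weights, both toy categories pass the threshold in this example. Doubling the weight of the first passage lowers the relative contribution of the second passage enough that only the first passage's category survives, which is the kind of location effect the abstract refers to.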

Cite this article

Kim, J., Kim, M.H. An Evaluation of Passage-Based Text Categorization. Journal of Intelligent Information Systems 23, 47–65 (2004). https://doi.org/10.1023/B:JIIS.0000029670.53363.d0

  • DOI: https://doi.org/10.1023/B:JIIS.0000029670.53363.d0
