
Classifying web documents in a hierarchy of categories: a comprehensive study

Journal of Intelligent Information Systems

Abstract

Most research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories, documents are organized in a hierarchy of categories to support thematic search by browsing topics of interest. Taking the hierarchical relationships among categories into account raises several additional issues in the development of methods for automated document classification. These issues concern the representation of documents, the learning process, the classification process and the evaluation criteria for experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework in which the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document in a category. In this work, three learning methods are considered for the construction of document classifiers, namely centroid-based, naïve Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1), which present a variety of situations in terms of hierarchical structure. Experimental results are reported, and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
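To make the idea of threshold-gated hierarchical classification concrete, here is a minimal Python sketch of a top-down, centroid-based classifier. It is an illustration under stated assumptions, not the WebClassIII implementation: the names `bow`, `cosine`, `centroid`, `Category` and `classify` are hypothetical, the toy documents are invented, internal categories are assumed to be trained on their descendants' documents, and the thresholds are hand-picked placeholders, whereas the paper determines them automatically.

```python
import math
from collections import Counter


def bow(text):
    """Turn raw text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(docs):
    """Average the term weights of a list of document vectors."""
    total = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in total.items()}


class Category:
    """A node in the category hierarchy (hypothetical structure)."""

    def __init__(self, name, docs=(), threshold=0.0, children=()):
        self.name = name
        self.children = list(children)
        self.threshold = threshold  # placeholder value, chosen by hand here
        # Assumption: a category is trained on its own documents
        # plus those of all its descendants.
        self.docs = list(docs) + [d for c in self.children for d in c.docs]
        self.centroid = centroid(self.docs) if self.docs else {}


def classify(root, doc):
    """Descend from the root while the best child's score passes its threshold."""
    current = root
    while current.children:
        scored = [(cosine(doc, c.centroid), c) for c in current.children]
        score, best = max(scored, key=lambda pair: pair[0])
        if score < best.threshold:
            break  # the document stays at the current (possibly internal) category
        current = best
    return current.name


# Toy hierarchy: root -> {sports -> {football, tennis}, science}
football = Category(
    "football",
    [bow("goal striker pitch goal"), bow("striker penalty goal")],
    threshold=0.6,
)
tennis = Category(
    "tennis",
    [bow("racket serve ace"), bow("serve volley racket")],
    threshold=0.6,
)
sports = Category("sports", threshold=0.2, children=[football, tennis])
science = Category(
    "science",
    [bow("atom energy physics"), bow("cell biology energy")],
    threshold=0.2,
)
root = Category("root", children=[sports, science])

print(classify(root, bow("striker scores goal")))  # football
print(classify(root, bow("goal serve")))           # sports
print(classify(root, bow("stock market")))         # root
```

The second and third calls show the effect of the thresholds: a document that matches no child well enough is assigned to an internal category rather than being forced down to a leaf. Any scorer that returns a degree of membership could stand in for `cosine` without changing `classify`.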


Author information

Corresponding author

Correspondence to Michelangelo Ceci.

About this article

Cite this article

Ceci, M., Malerba, D. Classifying web documents in a hierarchy of categories: a comprehensive study. J Intell Inf Syst 28, 37–78 (2007). https://doi.org/10.1007/s10844-006-0003-2

  • DOI: https://doi.org/10.1007/s10844-006-0003-2
