Classifying web documents in a hierarchy of categories: a comprehensive study

Ceci, Michelangelo; Malerba, Donato

doi:10.1007/s10844-006-0003-2

Published: 19 January 2007

Volume 28, pages 37–78 (2007)

Journal of Intelligent Information Systems

Michelangelo Ceci¹ &
Donato Malerba¹

Abstract

Most of the research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy of categories to support a thematic search by browsing topics of interest. The consideration of the hierarchical relationship among categories opens several additional issues in the development of methods for automated document classification. Questions concern the representation of documents, the learning process, the classification process and the evaluation criteria of experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework where the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document in a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naïve Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
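As a rough illustration of the idea summarized in the abstract, the following Python sketch shows one generic way a top-down hierarchical classifier with per-category score thresholds might work, using a centroid-based scorer. It is not the WebClassIII implementation: the `Node` class, the `classify` function, the toy documents and the hand-set threshold values are all assumptions made for this example, whereas the paper determines thresholds automatically and applies hierarchy-aware feature selection, neither of which is reproduced here.

```python
import math
from collections import Counter


def tf(text):
    """Bag-of-words term frequencies for a whitespace-tokenized string."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(docs):
    """Average term-weight vector of a non-empty list of documents."""
    total = Counter()
    for d in docs:
        total.update(d)
    return {t: w / len(docs) for t, w in total.items()}


class Node:
    """Hypothetical category node: a centroid plus a score threshold."""

    def __init__(self, name, docs=(), children=(), threshold=0.0):
        self.name = name
        self.children = list(children)
        self.threshold = threshold
        # An internal category is represented by all documents in its subtree.
        self.docs = list(docs) + [d for c in self.children for d in c.docs]
        self.centroid = centroid(self.docs) if self.docs else {}


def classify(root, doc):
    """Descend from the root; stop when the best child scores below its threshold."""
    node = root
    while node.children:
        best = max(node.children, key=lambda c: cosine(doc, c.centroid))
        if cosine(doc, best.centroid) < best.threshold:
            break  # the document stays at the current (possibly internal) category
        node = best
    return node.name


# Toy hierarchy; thresholds are hand-set assumptions, not learned values.
physics = Node("physics", [tf("quantum particle energy"), tf("particle energy field")], threshold=0.5)
biology = Node("biology", [tf("cell gene protein"), tf("gene cell organism")], threshold=0.5)
science = Node("science", children=[physics, biology], threshold=0.2)
sports = Node("sports", [tf("match team goal"), tf("team goal player")], threshold=0.2)
root = Node("root", children=[science, sports])

print(classify(root, tf("particle energy quantum")))
print(classify(root, tf("energy gene")))
print(classify(root, tf("weather forecast")))
```

In this sketch a document whose best child score falls below that child's threshold remains at the current node, so a document can end up in an internal category or be left at the root. Any classifier that returns a membership score could replace `cosine` against a centroid.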

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi di Bari, 70126, Bari, Italy
Michelangelo Ceci & Donato Malerba

Corresponding author

Correspondence to Michelangelo Ceci.

About this article

Cite this article

Ceci, M., Malerba, D. Classifying web documents in a hierarchy of categories: a comprehensive study. J Intell Inf Syst 28, 37–78 (2007). https://doi.org/10.1007/s10844-006-0003-2

Received: 11 July 2005
Revised: 07 December 2005
Accepted: 03 April 2006
Published: 19 January 2007
Issue Date: February 2007
DOI: https://doi.org/10.1007/s10844-006-0003-2
