Journal of Intelligent Information Systems, Volume 28, Issue 1, pp 37–78

Classifying web documents in a hierarchy of categories: a comprehensive study

  • Michelangelo Ceci
  • Donato Malerba

Abstract

Most of the research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy of categories to support a thematic search by browsing topics of interest. Taking the hierarchical relationships among categories into account raises several additional issues in the development of methods for automated document classification. These issues concern the representation of documents, the learning process, the classification process and the evaluation criteria for experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework in which the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document to a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naïve Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
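
The idea of passing a document down a category hierarchy, guided by membership scores and per-category thresholds, can be illustrated with a minimal sketch. This is not the WebClassIII implementation: the `Category` class, the `keyword_scorer` helper, the toy hierarchy and the hand-picked threshold values are all hypothetical, and the thresholds here are fixed by hand rather than determined automatically as in the proposed framework. Any classifier that returns a degree of membership could stand in for the scorer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Document = Dict[str, float]            # bag of words: term -> frequency
Scorer = Callable[[Document], float]   # returns a degree of membership


@dataclass
class Category:
    """Hypothetical node of a category hierarchy."""
    name: str
    scorer: Scorer       # assumed: any classifier returning a membership score
    threshold: float     # assumed: hand-set here, not learned
    children: List["Category"] = field(default_factory=list)


def keyword_scorer(keywords: set) -> Scorer:
    """Toy scorer: fraction of the document's term mass covered by keywords."""
    def score(doc: Document) -> float:
        total = sum(doc.values())
        if total == 0:
            return 0.0
        return sum(doc.get(term, 0.0) for term in keywords) / total
    return score


def classify_top_down(doc: Document, root: Category) -> List[str]:
    """Descend from the root while the best child's score reaches its threshold.

    The document stops at an internal category when no child is
    confident enough, so it need not reach a leaf.
    """
    path = [root.name]
    node = root
    while node.children:
        best = max(node.children, key=lambda child: child.scorer(doc))
        if best.scorer(doc) < best.threshold:
            break
        path.append(best.name)
        node = best
    return path


# Toy hierarchy (illustrative only).
root = Category("root", lambda doc: 1.0, 0.0, [
    Category("science", keyword_scorer({"physics", "biology", "quantum"}), 0.3, [
        Category("physics", keyword_scorer({"physics", "quantum"}), 0.3),
        Category("biology", keyword_scorer({"biology", "cell"}), 0.3),
    ]),
    Category("sports", keyword_scorer({"football", "match"}), 0.3),
])

leaf_path = classify_top_down({"quantum": 2, "physics": 1, "lecture": 1}, root)
internal_path = classify_top_down({"physics": 1, "biology": 1, "weather": 3}, root)
root_path = classify_top_down({"biology": 1, "weather": 3}, root)
```

In this sketch the first document reaches a leaf, the second stops at an internal category because neither child passes its threshold, and the third stays at the root. A flat classifier, by contrast, would score the document against every category at once and ignore the parent-child structure.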

Keywords

Text categorization; Hierarchical models; Supervised learning; Feature selection; Performance evaluation; Web content mining


Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Dipartimento di Informatica, Università degli Studi di Bari, Bari, Italy
