International Conference on Discovery Science

Discovery Science pp 236-250

Hierarchical Multidimensional Classification of Web Documents with MultiWebClass

  • Francesco Serafino
  • Gianvito Pio
  • Michelangelo Ceci
  • Donato Malerba
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9356)

Abstract

Most work on text categorization has focused on classifying documents into a set of categories with no relationships among them (flat classification). However, because many domains have an intrinsic structure, recent work has turned to more complex tasks, such as multi-label classification, hierarchical classification and multidimensional classification. In this paper, we propose the hierarchical multidimensional classification task, in which documents are classified according to different dimensions or viewpoints (e.g., topic, geographic area, time period) and the categories within each dimension can be organized hierarchically. In particular, we propose the system MultiWebClass, a multidimensional variant of the system WebClassIII, which discovers correlations among categories belonging to different dimensions and exploits them, according to two different strategies, to refine the set of features used during the learning process. Experimental evaluation on both synthetic and real datasets confirms that exploiting correlations among categories can improve classification accuracy, possibly reducing either specialization error or generalization error, depending on the strategy adopted to refine the feature sets.
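To make the idea of correlations among categories from different dimensions concrete, the following minimal sketch scores every pair of categories drawn from two dimensions by how often they label the same documents. The abstract does not say which correlation measure MultiWebClass uses, so the phi coefficient here is an assumption chosen for illustration, and the function names (`phi_coefficient`, `correlated_pairs`), the threshold, and the toy labels are all hypothetical. Hierarchies and the two feature-refinement strategies are not modelled.

```python
from itertools import product
from math import sqrt


def phi_coefficient(labels_a, labels_b, cat_a, cat_b):
    """Phi correlation between 'document is in cat_a' (dimension A)
    and 'document is in cat_b' (dimension B).

    labels_a[i] and labels_b[i] are the categories of document i
    in the two dimensions. Illustrative measure only.
    """
    n = len(labels_a)
    # Documents labelled with both categories.
    n11 = sum(1 for a, b in zip(labels_a, labels_b) if a == cat_a and b == cat_b)
    # Marginal counts for each category.
    n_a = sum(1 for a in labels_a if a == cat_a)
    n_b = sum(1 for b in labels_b if b == cat_b)
    denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    if denom == 0:
        # A category covering all or none of the documents carries no signal.
        return 0.0
    return (n * n11 - n_a * n_b) / denom


def correlated_pairs(labels_a, labels_b, threshold=0.5):
    """Return {(cat_a, cat_b): phi} for cross-dimension pairs with phi >= threshold."""
    result = {}
    for cat_a, cat_b in product(sorted(set(labels_a)), sorted(set(labels_b))):
        phi = phi_coefficient(labels_a, labels_b, cat_a, cat_b)
        if phi >= threshold:
            result[(cat_a, cat_b)] = phi
    return result


# Toy data: one label per document in each of two dimensions (hypothetical).
topic = ["sport", "sport", "politics", "politics"]
area = ["italy", "italy", "usa", "usa"]

pairs = correlated_pairs(topic, area)
for (cat_a, cat_b), phi in pairs.items():
    print(f"{cat_a} ~ {cat_b}: phi = {phi:.2f}")
```

In a system along the lines the abstract describes, pairs such as these would then guide which features are kept or added for each category; how that is done depends on the refinement strategy, which this sketch leaves out.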

Keywords

Structured output prediction, Text categorization, Hierarchical classification, Multidimensional classification


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Francesco Serafino (1)
  • Gianvito Pio (1)
  • Michelangelo Ceci (1)
  • Donato Malerba (1)
  1. Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
