Abstract
Most of works on text categorization have focused on classifying documents into a set of categories with no relationships among them (flat classification). However, due to the intrinsic structure that can be found in many domains, recent works are focusing on more complex tasks, such as multi-label classification, hierarchical classification and multidimensional classification. In this paper, we propose the hierarchical multidimensional classification task, where documents can be classified according to different dimensions/viewpoints (e.g., topic, geographic area, time period, etc.), where in each dimension categories can be organized hierarchically. In particular, we propose the system MultiWebClass, a multidimensional variant of the system WebClassIII, which discovers correlations among categories belonging to different dimensions and exploits them, according to two different strategies, to refine the set of features used during the learning process. Experimental evaluation performed on both synthetic and real datasets confirms that the exploitation of correlations among categories can lead to better results in terms of classification accuracy, possibly reducing specialization error or generalization error, depending on the strategy adopted for the refinement of the feature sets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
- 2.
In the case of synthetic datasets, results do not depend on the specific dimension.
References
Apté, C., Damerau, F., Weiss, S.M.: Automated learning of decision rules for text categorization. ACM Trans. Inf. Syst. (TOIS) 12(3), 233–251 (1994)
Bielza, C., Li, G., Larranaga, P.: Multi-dimensional classification with bayesian networks. Int. J. Approximate Reasoning 52(6), 705–727 (2011)
Ceci, M.: Hierarchical text categorization in a transductive setting. In: Workshops Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), Pisa, Italy, 15–19 December, 2008, pp. 184–191. IEEE Computer Society (2008)
Ceci, M., Malerba, D.: Hierarchical Classification of HTML Documents with WebClassII. In: Sebastiani, F. (ed.) ECIR 2003. LNCS, vol. 2633, pp. 57–72. Springer, Heidelberg (2003)
Ceci, M., Malerba, D.: Classifying web documents in a hierarchy of categories: a comprehensive study. J. Intell. Inf. Syst. 28(1), 37–78 (2007)
Han, E.-H.S., Karypis, G.: Centroid-Based Document Classification: Analysis and Experimental Results. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 424–431. Springer, Heidelberg (2000)
Hernández, J., Sucar, L.E., Morales, E.F.: Multidimensional hierarchical classification. Expert Syst. Appl. 41(17), 7671–7677 (2014)
Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: Rcv1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)
Manber, U., Wu, S., et al.: Glimpse: A tool to search through entire file systems. In: Usenix Winter, pp. 23–32 (1994)
Mitchell, T.M.: Machine Learning. McGraw Hill series in computer science. McGraw-Hill, Tom Mitchell (1997)
Platt, J., et al.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods Support Vector Learning, 3 (1999)
Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
Schapire, R.E., Singer, Y.: Boostexter: A boosting-based system for text categorization. Mach. Learn. 39(2/3), 135–168 (2000)
Shatkay, H., Pan, F., Rzhetsky, A., Wilbur, W.J.: Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioinformatics 24(18), 2086–2093 (2008)
Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for association patterns. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 32–41. ACM (2002)
Theeramunkong, T., Lertnattee, V.: Multi-dimensional text classification. In: Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–7. Association for Computational Linguistics (2002)
Wilson, E.B.: Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22(158), 209–212 (1927)
Acknowledgements
We would like to acknowledge the support of the European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Serafino, F., Pio, G., Ceci, M., Malerba, D. (2015). Hierarchical Multidimensional Classification of Web Documents with MultiWebClass. In: Japkowicz, N., Matwin, S. (eds) Discovery Science. DS 2015. Lecture Notes in Computer Science(), vol 9356. Springer, Cham. https://doi.org/10.1007/978-3-319-24282-8_20
Download citation
DOI: https://doi.org/10.1007/978-3-319-24282-8_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24281-1
Online ISBN: 978-3-319-24282-8
eBook Packages: Computer ScienceComputer Science (R0)