Hierarchical Multidimensional Classification of Web Documents with MultiWebClass

  • Conference paper
Discovery Science (DS 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9356)

Abstract

Most work on text categorization has focused on classifying documents into a set of categories with no relationships among them (flat classification). However, because many domains have an intrinsic structure, recent work addresses more complex tasks, such as multi-label classification, hierarchical classification and multidimensional classification. In this paper, we propose the hierarchical multidimensional classification task, in which documents are classified according to different dimensions or viewpoints (e.g., topic, geographic area, time period) and the categories within each dimension can be organized hierarchically. In particular, we propose the system MultiWebClass, a multidimensional variant of the system WebClassIII, which discovers correlations among categories belonging to different dimensions and exploits them, according to two different strategies, to refine the set of features used during the learning process. Experimental evaluation on both synthetic and real datasets confirms that exploiting correlations among categories can improve classification accuracy, possibly reducing either specialization error or generalization error, depending on the strategy adopted to refine the feature sets.
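
To make the idea of correlations among categories in different dimensions concrete, the following is a minimal sketch and not the MultiWebClass implementation. The abstract does not state which correlation measure the system uses, so the phi coefficient is an assumption made here for illustration. The toy documents, the dimension names `topic` and `area`, the slash-separated category paths, and the helper names `ancestors` and `phi` are all hypothetical.

```python
from math import sqrt

# Hypothetical toy corpus: each document has one hierarchical category
# per dimension, written as a slash-separated path from the root.
docs = [
    {"topic": "sports/football", "area": "europe/italy"},
    {"topic": "sports/football", "area": "europe/italy"},
    {"topic": "politics/elections", "area": "america/usa"},
    {"topic": "politics/elections", "area": "america/usa"},
]


def ancestors(path):
    """Return the category and all its ancestors: 'a/b' -> ['a', 'a/b']."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def phi(docs, dim_a, cat_a, dim_b, cat_b):
    """Phi coefficient between membership in cat_a (dimension dim_a)
    and membership in cat_b (dimension dim_b).

    A document belongs to a category if the category is its label or an
    ancestor of its label, which respects the hierarchy in each dimension.
    The choice of phi is an assumption, not taken from the paper.
    """
    n11 = n10 = n01 = n00 = 0
    for d in docs:
        in_a = cat_a in ancestors(d[dim_a])
        in_b = cat_b in ancestors(d[dim_b])
        if in_a and in_b:
            n11 += 1
        elif in_a:
            n10 += 1
        elif in_b:
            n01 += 1
        else:
            n00 += 1
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A zero denominator means one category is constant over the corpus,
    # so no correlation can be measured; report 0.0 in that case.
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


print(phi(docs, "topic", "sports", "area", "europe"))
print(phi(docs, "topic", "sports", "area", "america"))
```

In a sketch like this, a strongly correlated pair of categories could suggest sharing or pruning features between the corresponding classifiers; how MultiWebClass actually refines its feature sets is described in the paper itself, not here.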

Notes

  1. http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html

  2. In the case of synthetic datasets, results do not depend on the specific dimension.

Acknowledgements

We would like to acknowledge the support of the European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944).

Corresponding author

Correspondence to Francesco Serafino.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Serafino, F., Pio, G., Ceci, M., Malerba, D. (2015). Hierarchical Multidimensional Classification of Web Documents with MultiWebClass. In: Japkowicz, N., Matwin, S. (eds) Discovery Science. DS 2015. Lecture Notes in Computer Science, vol 9356. Springer, Cham. https://doi.org/10.1007/978-3-319-24282-8_20

  • DOI: https://doi.org/10.1007/978-3-319-24282-8_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24281-1

  • Online ISBN: 978-3-319-24282-8

  • eBook Packages: Computer Science, Computer Science (R0)
