Hierarchical Classification of Web Documents by Stratified Discriminant Analysis

  • Conference paper
Multidisciplinary Information Retrieval (IRFC 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7356)

Abstract

In this work we present and evaluate a methodology for classifying web documents into a predefined hierarchy using their textual content. Hierarchical classification over taxonomies with thousands of categories is hard because training data are scarce. It is one of the rare settings where a large amount of available data does not help: as more documents become available, more classes are also added to the hierarchy. As a result, most categories lack training data, which produces poor individual classification models and tends to bias classification toward dense categories. Here we propose a novel feature extraction technique called Stratified Discriminant Analysis (sDA) that reduces the dimensionality of the text-content features of the web documents along the different levels of the hierarchy. The sDA model is intended to reduce the effects of data scarcity by better grouping and identifying the categories with few training examples, leading to more robust classification models for those categories. Results from classifying web pages from the Kids&Teens branch of the DMOZ directory show that our model extracts features that are well suited to grouping web pages by category and to representing categories with few training examples.
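
The abstract does not spell out how sDA is computed, so the following is only a minimal sketch of the setting it describes, not the authors' method. It assumes each training document carries a slash-separated category path (the toy paths and the helper `labels_by_level` are hypothetical) and shows how the labels can be stratified by hierarchy level, and how the number of training examples per category shrinks at deeper levels.

```python
from collections import Counter


def labels_by_level(paths):
    """Map each hierarchy depth to the category prefixes seen at that depth."""
    levels = {}
    for path in paths:
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            # The label at a given depth is the path truncated to that depth.
            levels.setdefault(depth, []).append("/".join(parts[:depth]))
    return levels


# Hypothetical toy category paths, one per training document.
docs = [
    "Arts/Music",
    "Arts/Music",
    "Arts/Dance",
    "Games/Puzzles",
    "Games/Puzzles/Sudoku",
]

levels = labels_by_level(docs)
# Count training examples per category at every level of the hierarchy.
counts = {depth: Counter(labels) for depth, labels in levels.items()}

for depth in sorted(counts):
    print(depth, dict(counts[depth]))
```

In this toy data the top level has two categories with at least two documents each, while the deepest level has a single category with one document. A level-wise feature extractor would be fitted on each of these label sets in turn; which discriminant criterion to use at each level is left open here.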


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gomez, J.C., Moens, M.-F. (2012). Hierarchical Classification of Web Documents by Stratified Discriminant Analysis. In: Salampasis, M., Larsen, B. (eds) Multidisciplinary Information Retrieval. IRFC 2012. Lecture Notes in Computer Science, vol 7356. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31274-8_8

  • DOI: https://doi.org/10.1007/978-3-642-31274-8_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31273-1

  • Online ISBN: 978-3-642-31274-8

  • eBook Packages: Computer Science, Computer Science (R0)