Regularization for Unsupervised Classification on Taxonomies

  • Diego Sona
  • Sriharsha Veeramachaneni
  • Nicola Polettini
  • Paolo Avesani
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4203)


We study unsupervised classification of text documents into a taxonomy of concepts, each annotated by only a few keywords. Our central claim is that the structure of the taxonomy encapsulates background knowledge that can be exploited to improve classification accuracy. Under our hierarchical Dirichlet generative model for the document corpus, we show that the unsupervised classification algorithm provides robust estimates of the classification parameters by performing regularization, and that the algorithm can be interpreted as a regularized EM algorithm. We also propose a technique for automatically choosing the regularization parameter. In addition, we propose a regularization scheme for K-means on hierarchies. We experimentally demonstrate that both of our regularized clustering algorithms achieve higher classification accuracy than simple models such as minimum distance, Naïve Bayes, EM and K-means.
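The abstract does not spell out the regularization scheme, so the following is only a minimal sketch of the general idea of regularizing K-means with a taxonomy, not the authors' algorithm. It assumes that each node's reference vector is seeded from its keywords, that documents are assigned by cosine similarity, and that regularization means shrinking each node's mean toward its parent's mean with a fixed weight `lam`. The function names, the shrinkage form, and the value of `lam` are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def regularized_kmeans(docs, seeds, parent, lam=0.3, iters=10):
    """Illustrative taxonomy-regularized K-means (an assumption, not the paper's method).

    docs:   list of term-count vectors
    seeds:  {node: keyword-based initial reference vector}
    parent: {node: parent node, or None for the root}
    lam:    hypothetical shrinkage weight toward the parent, in [0, 1]
    """
    refs = {node: list(vec) for node, vec in seeds.items()}
    labels = []
    for _ in range(iters):
        # Assignment step: each document goes to the most similar node.
        labels = [max(refs, key=lambda n: cosine(d, refs[n])) for d in docs]

        # Update step: plain mean of the documents assigned to each node.
        means = {}
        for node in refs:
            members = [d for d, lab in zip(docs, labels) if lab == node]
            if members:
                means[node] = [sum(col) / len(members) for col in zip(*members)]
            else:
                means[node] = list(refs[node])  # keep the old vector if empty

        # Regularization step: shrink every non-root node toward its parent.
        refs = {}
        for node, mean in means.items():
            p = parent.get(node)
            if p is None:
                refs[node] = mean
            else:
                refs[node] = [(1 - lam) * a + lam * b
                              for a, b in zip(mean, means[p])]
    return labels, refs
```

With `lam = 0` this reduces to ordinary keyword-seeded K-means; larger values tie sparsely populated nodes more strongly to their parents, which is the kind of robustness the abstract attributes to regularization.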


Keywords (added by machine, not by the authors): Parameter Vector; Regularization Parameter; Regularization Scheme; Reference Vector; Unsupervised Classification





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Diego Sona (1)
  • Sriharsha Veeramachaneni (1)
  • Nicola Polettini (1)
  • Paolo Avesani (1)
  1. ITC-IRST, Povo, Trento, Italy
