Regularization for Unsupervised Classification on Taxonomies
We study unsupervised classification of text documents into a taxonomy of concepts, each annotated with only a few keywords. Our central claim is that the structure of the taxonomy encapsulates background knowledge that can be exploited to improve classification accuracy. Under our hierarchical Dirichlet generative model for the document corpus, we show that the unsupervised classification algorithm provides robust estimates of the classification parameters by performing regularization, and that the algorithm can be interpreted as a regularized EM algorithm. We also propose a technique for automatically choosing the regularization parameter. In addition, we propose a regularization scheme for K-means on hierarchies. We demonstrate experimentally that both of our regularized clustering algorithms achieve higher classification accuracy than simple models such as minimum distance, Naïve Bayes, EM, and K-means.
Keywords: Parameter Vector, Regularization Parameter, Regularization Scheme, Reference Vector, Unsupervised Classification
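The abstract does not spell out the regularization scheme for K-means, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that each node's reference vector is initialized from its keywords, that documents are assigned by cosine similarity, and that the update step shrinks each node's centroid toward its parent's reference vector with a pseudo-count `lam`. The function names, the `parent` map, the toy vocabulary, and the choice of shrinkage target are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def regularized_kmeans(docs, init_refs, parent, lam=1.0, n_iter=10):
    """Illustrative K-means on a taxonomy with shrinkage toward the parent node.

    docs      : list of term-frequency vectors
    init_refs : dict node -> initial reference vector (e.g. built from keywords)
    parent    : dict node -> parent node, or None for the root
    lam       : assumed regularization strength (pseudo-count given to the parent)
    """
    refs = {k: list(v) for k, v in init_refs.items()}
    nodes = sorted(refs)
    dim = len(docs[0])
    labels = None
    for _ in range(n_iter):
        # Assignment step: each document goes to the most similar reference vector.
        new_labels = [max(nodes, key=lambda k: cosine(d, refs[k])) for d in docs]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: centroid of the members, pulled toward the parent's vector.
        new_refs = {}
        for k in nodes:
            members = [d for d, lab in zip(docs, labels) if lab == k]
            total = [sum(col) for col in zip(*members)] if members else [0.0] * dim
            # The root has no parent, so it is shrunk toward its own previous vector.
            anchor = refs[parent[k]] if parent[k] is not None else refs[k]
            denom = len(members) + lam
            if denom == 0:
                new_refs[k] = refs[k]  # nothing to average; keep the old vector
            else:
                new_refs[k] = [(t + lam * a) / denom for t, a in zip(total, anchor)]
        refs = new_refs
    return labels, refs


# Toy vocabulary (hypothetical): [sport, football, goal, tennis, racket]
parent = {"sports": None, "football": "sports", "tennis": "sports"}
init_refs = {
    "sports": [1, 0, 0, 0, 0],
    "football": [0, 1, 0, 0, 0],
    "tennis": [0, 0, 0, 1, 0],
}
docs = [
    [0, 3, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [2, 0, 0, 0, 0],
]
labels, refs = regularized_kmeans(docs, init_refs, parent, lam=1.0)
print(labels)
```

With `lam = 0` the update reduces to the plain K-means centroid; larger values keep sparsely populated nodes close to their parent, which is the kind of robustness the abstract attributes to regularization.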