Hierarchical classification with a topic taxonomy via LDA
Abstract
Large-scale hierarchical classification is the problem of assigning documents to a predefined taxonomy with thousands of categories. Because the distribution of documents over categories is skewed, that is, most categories have very few labeled documents, data sparseness in the rare categories leads to low classification performance. In this paper, we study web-page classification over the topic taxonomy of the DMOZ directory. For this hard task, we propose a hierarchical classification model based on latent Dirichlet allocation (LDA). We use the LDA model as a feature extraction technique: it extracts latent topics to reduce the effects of data sparseness, and we construct topic feature vectors from the corpus to train more robust classification models for rare categories. Experiments were conducted on a dataset of web pages from the Chinese Simplified branch of the DMOZ directory. The results show that our method improves performance for rare categories over hierarchical classification methods based on full-term and feature-word representations, and further improves performance over the whole topic taxonomy.
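To make the feature-extraction step concrete, the following is a minimal sketch of how LDA can turn sparse bag-of-words documents into dense topic feature vectors. It is an illustration under stated assumptions, not the paper's implementation: the inference method (collapsed Gibbs sampling), the function name `lda_topic_features`, the hyperparameters `alpha` and `beta`, the number of topics, and the toy documents are all hypothetical choices made here.

```python
import random


def lda_topic_features(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed inference method) and
    return one smoothed topic-proportion vector per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed document-topic proportions serve as the topic feature vectors.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        features.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return features


# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["football", "league", "goal", "team"],
    ["team", "coach", "goal", "match"],
    ["stock", "market", "bank", "price"],
    ["bank", "loan", "price", "market"],
]
features = lda_topic_features(docs, num_topics=2)
```

Each row of `features` has one entry per topic and sums to 1, so a category with only a handful of labeled pages is described by a few dense topic weights rather than by thousands of mostly-zero term counts. In a hierarchical setup, vectors of this kind would be passed to the classifier trained at each node of the taxonomy; the choice of classifier is not specified in the abstract and is left open here.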
Keywords
Text categorization · Hierarchical classification · Topic taxonomy · Latent Dirichlet allocation (LDA) · Rare category