An Improvement for Naive Bayes Text Classification Applied to Online Imbalanced Crowdsourced Corpuses
To take advantage of public corpuses such as Wikipedia for classification by hierarchically structured topics with a large number of classes, we propose an improvement to Naive Bayes based text classification algorithms, which we call Semantically-Aware Hierarchical Balancing (SAHB). SAHB addresses two issues in this use case, which has real-world applications: the large number of topic labels to classify against, and the lack of balance in the hierarchy of the corpuses. This meta-algorithm achieves better accuracy than straightforward Naive Bayes text classification methods or specific document weighting techniques, with log-time complexity, while taking equivalent time to train. This makes it more efficient and scalable enough to process and classify big data.
Keywords: Class Imbalance, Cluster Node, Class Imbalance Problem, Topic Label, Topic Node