An Improvement for Naive Bayes Text Classification Applied to Online Imbalanced Crowdsourced Corpuses

  • Robin M. E. Swezey
  • Shun Shiramatsu
  • Tadachika Ozono
  • Toramatsu Shintani
Conference paper
Part of the Studies in Computational Intelligence book series (SCI, volume 431)

Abstract

To take advantage of public corpuses such as Wikipedia for classification by hierarchically structured topics with a large number of classes, we propose an improvement to Naive Bayes-based text classification algorithms, which we call Semantically-Aware Hierarchical Balancing (SAHB). SAHB addresses two issues that arise in this use case and its real-world applications: the large number of topic labels to classify against, and the lack of balance in the hierarchy of the corpuses. This meta-algorithm achieves better accuracy than straightforward Naive Bayes text classification methods or specific document weighting techniques and classifies in logarithmic time, while taking an equivalent time to train, which makes it more efficient and scalable for processing and classifying big data.
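The abstract does not describe SAHB's internals, so the following is only a generic illustration of why classifying against a topic hierarchy can cost far less than scoring every topic label. It is a minimal sketch of greedy top-down multinomial Naive Bayes over a topic tree, not the authors' method: it performs no balancing, assumes uniform class priors and Laplace smoothing, and the names `Node`, `fit`, `log_likelihood`, and `classify` are hypothetical.

```python
import math
from collections import Counter


class Node:
    """A topic node. Leaves hold training documents as lists of tokens."""

    def __init__(self, label, children=None, docs=None):
        self.label = label
        self.children = children or []
        self.docs = docs or []
        self.counts = Counter()  # word counts pooled from this node's subtree
        self.total = 0           # total number of tokens in the subtree


def fit(node):
    """Aggregate word counts bottom-up so every node gets a multinomial model."""
    for doc in node.docs:
        node.counts.update(doc)
    for child in node.children:
        fit(child)
        node.counts.update(child.counts)
    node.total = sum(node.counts.values())


def log_likelihood(node, tokens, vocab_size):
    """Log-likelihood of the tokens under the node's model (Laplace smoothing)."""
    denom = node.total + vocab_size
    return sum(math.log((node.counts[t] + 1) / denom) for t in tokens)


def classify(root, tokens):
    """Descend from the root, keeping the best-scoring child at each level.

    Returns the chosen leaf label and the number of nodes that were scored.
    """
    vocab_size = len(root.counts)
    node, scored = root, 0
    while node.children:
        scored += len(node.children)
        node = max(
            node.children,
            key=lambda child: log_likelihood(child, tokens, vocab_size),
        )
    return node.label, scored


# Toy hierarchy (illustrative data only, not taken from the paper).
root = Node("root", [
    Node("science", [
        Node("physics", docs=[["quantum", "energy", "particle"]]),
        Node("biology", docs=[["cell", "gene", "protein"]]),
    ]),
    Node("sports", [
        Node("football", docs=[["goal", "league", "match"]]),
        Node("tennis", docs=[["serve", "racket", "match"]]),
    ]),
])
fit(root)
label, scored = classify(root, ["gene", "cell"])
print(label, scored)
```

In a balanced tree with branching factor b and N leaf topics, this descent scores roughly b·log_b(N) nodes instead of N, which is where a logarithmic classification cost can come from. The toy tree is too small to show a saving, and greedy descent cannot recover from a wrong choice near the root; imbalance in the hierarchy, which the abstract names as one of the two issues SAHB targets, is not handled here.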

Keywords

Class Imbalance, Cluster Node, Class Imbalance Problem, Topic Label, Topic Node
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Robin M. E. Swezey (1)
  • Shun Shiramatsu (1)
  • Tadachika Ozono (1)
  • Toramatsu Shintani (1)
  1. Graduate School of Engineering, Nagoya Institute of Technology, Nagoya, Japan
