A Sentence Vector Based Over-Sampling Method for Imbalanced Emotion Classification

  • Tao Chen
  • Ruifeng Xu
  • Qin Lu
  • Bin Liu
  • Jun Xu
  • Lin Yao
  • Zhenyu He
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8404)


Imbalanced training data poses a serious problem for supervised learning based text classification. The problem becomes more serious in emotion classification tasks with multiple emotion categories, where the training data can be quite skewed. This paper presents a novel over-sampling method that forms additional sum sentence vectors for minority classes in order to improve emotion classification on imbalanced data. First, a large corpus is used to train a continuous skip-gram model that learns word vectors, using the word/POS pair as the unit of each word vector. The sentence vectors of the training data are then constructed as the sums of their word/POS vectors. New minority class training samples are generated by randomly adding two sentence vectors from the corresponding class until every class has the same number of training samples, so that the classifiers can be trained on a fully balanced training dataset. Evaluations on the NLP&CC2013 Chinese micro blog emotion classification dataset show that the obtained classifier achieves 48.4% average precision, an improvement of 11.9 percentage points over the state-of-the-art performance on this dataset (36.5%). This result shows that the proposed over-sampling method can effectively address the problem of data imbalance and thus achieve much improved performance for emotion classification.
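The over-sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sentence_vector` and `oversample_by_sum`, the `word_vectors` lookup table, the choice to skip out-of-vocabulary units, and the choice to draw pairs of distinct original sentences only are all assumptions made for illustration. The skip-gram training itself is not shown; the sketch assumes the word/POS vectors already exist.

```python
import random
from collections import defaultdict


def sentence_vector(tokens, word_vectors, dim):
    """Sum the vectors of the word/POS units of one sentence."""
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            # Assumption: units missing from the lookup table are skipped.
            continue
        total = [a + b for a, b in zip(total, vec)]
    return total


def oversample_by_sum(vectors, labels, seed=0):
    """Add summed pairs of same-class vectors until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)

    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_vectors, out_labels = list(vectors), list(labels)
    for label, originals in by_class.items():
        if len(originals) < 2:
            # Assumption: a class with a single sample cannot form a pair
            # and is left unchanged.
            continue
        for _ in range(target - len(originals)):
            # Assumption: pairs are drawn from the original samples only,
            # never from previously generated ones.
            first, second = rng.sample(originals, 2)
            out_vectors.append([x + y for x, y in zip(first, second)])
            out_labels.append(label)
    return out_vectors, out_labels
```

Calling `oversample_by_sum(vectors, labels)` returns the original samples followed by the generated ones, so a classifier can be trained directly on the returned lists. Whether generated samples should themselves be eligible for pairing is a design choice the abstract does not settle.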


Emotion classification, Imbalanced training data, Over-sampling, Sentence vector





Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Tao Chen (1)
  • Ruifeng Xu (1)
  • Qin Lu (2)
  • Bin Liu (1)
  • Jun Xu (1)
  • Lin Yao (3)
  • Zhenyu He (1)
  1. Key Laboratory of Network Oriented Intelligent Computation, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China
  2. Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
  3. Peking University Shenzhen Graduate School, Shenzhen, China
