Imbalanced Chinese Text Classification Based on Weighted Sampling

  • Conference paper
Trustworthy Computing and Services (ISCTCS 2013)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 426)

Abstract

Traditional text classification methods assume that the dataset is balanced. In practice, however, many datasets are imbalanced, and traditional classification methods do not produce satisfactory results on them. Based on a comprehensive analysis of existing research on imbalanced data classification, we propose a data rebalancing method based on weighted sampling. The method assigns a weight to each class by calculating the ratio between the sizes of the different categories. Each class is then sampled at its own ratio using weighted sampling. Experimental results on a real Chinese text dataset show that the proposed method can effectively improve classification accuracy.
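
The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each class is resampled toward the mean class size, so that the per-class sampling ratio is the mean size divided by that class's size. The function names `sampling_ratios` and `rebalance` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def sampling_ratios(labels):
    """Per-class sampling ratio: (mean class size) / (size of this class).

    Assumption for illustration only: classes are pulled toward the mean
    class size. The paper's actual weighting may differ.
    """
    counts = Counter(labels)
    target = len(labels) / len(counts)  # mean class size
    return {c: target / n for c, n in counts.items()}

def rebalance(samples, labels, seed=0):
    """Resample each class at its own ratio and return (sample, label) pairs."""
    rng = random.Random(seed)
    ratios = sampling_ratios(labels)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    out = []
    for c, items in by_class.items():
        k = round(ratios[c] * len(items))
        if k <= len(items):
            # Ratio <= 1: undersample without replacement.
            chosen = rng.sample(items, k)
        else:
            # Ratio > 1: keep all originals, then oversample with replacement.
            chosen = items + rng.choices(items, k=k - len(items))
        out.extend((x, c) for x in chosen)
    rng.shuffle(out)
    return out

# Toy data: 90 documents in class "a", 10 in class "b".
docs = [f"doc{i}" for i in range(100)]
labels = ["a"] * 90 + ["b"] * 10
balanced = rebalance(docs, labels, seed=1)
class_sizes = Counter(y for _, y in balanced)
```

With this toy input, the majority class is undersampled and the minority class is oversampled until both reach the mean class size. Oversampling by duplication adds no new information, which is one reason synthetic methods such as SMOTE exist as alternatives.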

Notes

  1. The corpus is available at http://www.nlpir.org/download.

  2. http://www.ictclas.org

  3. http://weka.wikispaces.com

Acknowledgements

The authors’ work was sponsored by the National Program on Key Basic Research Project (973 Program) of China (2013CB329601, 2013CB329602), the National High Technology Research and Development Program (863 Program) of China (2011AA010702, 2012AA01A401 and 2012AA01A402), the Natural Science Foundation of China (60933005, 91124002), the Support Science and Technology Project of China (2012BAH38B04, 2012BAH38B06), and the China Postdoctoral Science Foundation Program (2012M520114).

Author information

Corresponding author

Correspondence to Hu Li.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, H., Zou, P., Han, W., Xia, R. (2014). Imbalanced Chinese Text Classification Based on Weighted Sampling. In: Yuan, Y., Wu, X., Lu, Y. (eds) Trustworthy Computing and Services. ISCTCS 2013. Communications in Computer and Information Science, vol 426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43908-1_5

  • DOI: https://doi.org/10.1007/978-3-662-43908-1_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-43907-4

  • Online ISBN: 978-3-662-43908-1

  • eBook Packages: Computer Science, Computer Science (R0)
