Imbalanced Chinese Text Classification Based on Weighted Sampling

Li, Hu; Zou, Peng; Han, WeiHong; Xia, Rongze

doi:10.1007/978-3-662-43908-1_5


Part of the book series: Communications in Computer and Information Science (CCIS, volume 426)

Included in the following conference series:

International Conference on Trustworthy Computing and Services


Abstract

Traditional text classification methods assume that the dataset is balanced. In practice, however, many datasets are imbalanced, and traditional classification methods do not achieve satisfactory results on them. Based on a comprehensive analysis of existing research on imbalanced data classification, we propose a data rebalancing method based on weighted sampling. The method assigns a weight to each class by calculating the ratio between different categories. Each class is then sampled at a different ratio using weighted sampling methods. Experimental results on a real Chinese text dataset show that the proposed method can effectively improve classification accuracy.
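The rebalancing idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the one used here (mean class size divided by the class's own size) is an assumption for illustration, not the authors' method. The function names `class_weights` and `weighted_resample` and the toy labels are likewise hypothetical. Classes with a weight below 1 are undersampled and classes with a weight above 1 are oversampled by duplication.

```python
import random
from collections import Counter, defaultdict


def class_weights(labels):
    """Assumed formula: weight = mean class size / size of this class."""
    counts = Counter(labels)
    mean_size = sum(counts.values()) / len(counts)
    return {label: mean_size / n for label, n in counts.items()}


def weighted_resample(docs, labels, seed=0):
    """Resample each class at its own ratio, given by its class weight."""
    rng = random.Random(seed)
    weights = class_weights(labels)

    # Group documents by class label.
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)

    out_docs, out_labels = [], []
    for label, items in by_class.items():
        target = max(1, round(weights[label] * len(items)))
        if target <= len(items):
            # Majority-type class: draw a subset without replacement.
            sampled = rng.sample(items, target)
        else:
            # Minority-type class: keep every document, then add random duplicates.
            sampled = items + rng.choices(items, k=target - len(items))
        out_docs.extend(sampled)
        out_labels.extend([label] * len(sampled))
    return out_docs, out_labels


if __name__ == "__main__":
    # Toy corpus: 90 documents in one class, 10 in the other.
    docs = [f"doc{i}" for i in range(100)]
    labels = ["sports"] * 90 + ["finance"] * 10
    new_docs, new_labels = weighted_resample(docs, labels)
    print(Counter(new_labels))
```

With this particular formula every class ends up at the mean class size, which is only one possible choice; a softer weight (for example, the square root of the ratio) would reduce the imbalance without removing it entirely.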


Notes

1. The corpus is available at http://www.nlpir.org/download.
2. http://www.ictclas.org
3. http://weka.wikispaces.com


Acknowledgements

The authors’ work was sponsored by the National Program on Key Basic Research Project (973 Program) of China (2013CB329601, 2013CB329602), the National High Technology Research and Development Program (863 Program) of China (2011AA010702, 2012AA01A401 and 2012AA01A402), the Natural Science Foundation of China (60933005, 91124002), the Support Science and Technology Project of China (2012BAH38B04, 2012BAH38B06), and the China Postdoctoral Science Foundation Program (2012M520114).

Author information

Authors and Affiliations

College of Computer, National University of Defense Technology, Changsha, 410073, Hunan Province, People’s Republic of China
Hu Li, Peng Zou, WeiHong Han & Rongze Xia


Corresponding author

Correspondence to Hu Li.

Editor information

Editors and Affiliations

Beijing University of Posts and Telecommunications, Beijing, China
Yuyu Yuan
Beijing University of Posts and Telecommunications, Beijing, China
Xu Wu
Beijing University of Posts and Telecommunications, Beijing, China
Yueming Lu


Li, H., Zou, P., Han, W., Xia, R. (2014). Imbalanced Chinese Text Classification Based on Weighted Sampling. In: Yuan, Y., Wu, X., Lu, Y. (eds) Trustworthy Computing and Services. ISCTCS 2013. Communications in Computer and Information Science, vol 426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43908-1_5


DOI: https://doi.org/10.1007/978-3-662-43908-1_5
Published: 27 June 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43907-4
Online ISBN: 978-3-662-43908-1
eBook Packages: Computer Science, Computer Science (R0)
