Transfer synthetic over-sampling for class-imbalance learning with limited minority class data
The problem of limited minority class data arises in many class-imbalanced applications but has received little attention. Synthetic over-sampling, a popular family of class-imbalance learning methods, can introduce considerable noise when the minority class has limited data, because the synthetic samples are not i.i.d. samples of the minority class. Most sophisticated synthetic sampling methods tackle this problem by denoising or by generating samples more consistent with the ground-truth data distribution, but their assumptions about the true noise or the ground-truth distribution may not hold. To adapt synthetic sampling to the problem of limited minority class data, the proposed Traso framework treats synthetic minority class samples as an additional data source and exploits transfer learning to transfer knowledge from them to the minority class. As an implementation, the TrasoBoost method first generates synthetic samples to balance the class sizes. Then, in each boosting iteration, the weights of misclassified synthetic samples decrease while the weights of misclassified original data increase; correctly classified samples keep their weights unchanged. Misclassified synthetic samples are potential noise and thus have less influence in subsequent iterations. In addition, the weights of minority class instances change more than those of majority class instances, making the minority class more influential, and only the original data are used to estimate the error rate so that it is immune to noise in the synthetic samples. Finally, since the synthetic samples are highly related to the minority class, all of the weak learners are aggregated for prediction. Experimental results show that TrasoBoost outperforms many popular class-imbalance learning methods.
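To make the boosting procedure concrete, the following is a minimal Python sketch of a TrasoBoost-style training loop, reconstructed solely from the description above. The discount factor for misclassified synthetic samples (borrowed from TrAdaBoost), the minority-class scaling factor `k`, and all function names are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch of a TrasoBoost-style loop: synthetic samples act as a source
# domain whose misclassified instances are discounted, while misclassified
# original instances are up-weighted as in AdaBoost. Constants and helper
# names are assumptions, not the published algorithm.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def trasoboost_fit(X_orig, y_orig, X_syn, y_syn, T=20, k=2.0, minority=1):
    """X_syn, y_syn: synthetic minority samples (e.g., produced by SMOTE)."""
    X = np.vstack([X_orig, X_syn])
    y = np.concatenate([y_orig, y_syn])
    is_orig = np.arange(len(y)) < len(y_orig)
    w = np.ones(len(y)) / len(y)
    # Fixed discount for misclassified synthetic samples (TrAdaBoost-style).
    beta_syn = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(y_syn)) / T))
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = h.predict(X) != y
        # Error rate is estimated on the ORIGINAL data only, so noisy
        # synthetic samples cannot distort it.
        eps = np.clip(w[is_orig & miss].sum() / w[is_orig].sum(),
                      1e-10, 0.5 - 1e-10)
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        # Minority-class weights change more than majority-class weights;
        # here via a multiplicative exponent k > 1 (an assumption).
        scale = np.where(y == minority, k, 1.0)
        w[is_orig & miss] *= np.exp(alpha * scale[is_orig & miss])  # originals up
        w[~is_orig & miss] *= beta_syn ** scale[~is_orig & miss]    # synthetics down
        # Correctly classified samples keep their weights unchanged.
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas  # all weak learners are kept for prediction

def trasoboost_predict(learners, alphas, X, classes=(0, 1)):
    # Weighted vote over ALL weak learners, since the synthetic samples are
    # closely related to the minority class.
    votes = sum(a * np.where(h.predict(X) == classes[1], 1.0, -1.0)
                for h, a in zip(learners, alphas))
    return np.where(votes >= 0, classes[1], classes[0])
```

The sketch only captures the asymmetric treatment of synthetic versus original data and the original-data-only error estimate described in the abstract; the published update magnitudes may differ.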
Keywords: machine learning, data mining, class imbalance, over-sampling, boosting, transfer learning
The authors wish to thank the associate editor and anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Key R&D Program of China (2017YFB1002801), the National Natural Science Foundation of China (Grant Nos. 61473087, 61573104), the Natural Science Foundation of Jiangsu Province (BK20141340), and partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.