Improving Generalization by Data Categorization
In most learning algorithms, examples in the training set are treated equally. Some examples, however, carry more reliable or critical information about the target than others, and some may carry wrong information. According to their intrinsic margin, examples can be grouped into three categories: typical, critical, and noisy. We propose three methods, namely the selection cost, the SVM confidence margin, and the AdaBoost data weight, to automatically group training examples into these three categories. Experimental results on artificial datasets show that, although the three methods are quite different in nature, they give similar and reasonable categorizations. Results on real-world datasets further demonstrate that treating the three data categories differently during learning can improve generalization.
Keywords: Support Vector Machine, Learning Algorithm, Target Function, Selection Cost, Intrinsic Function
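To illustrate the SVM confidence margin idea described in the abstract, the following is a minimal sketch: train an SVM, compute each training example's signed confidence margin y·f(x), and bucket examples by threshold. The thresholds and the linear kernel are assumptions for illustration; the paper's exact criteria may differ.

```python
# Hedged sketch: categorize training examples by signed SVM confidence
# margin. Thresholds below are hypothetical, not taken from the paper.
import numpy as np
from sklearn.svm import SVC

def categorize(X, y, noisy_thresh=0.0, critical_thresh=1.0):
    """Group examples by the signed confidence margin y * f(x):
    margin < 0 suggests a noisy (likely mislabeled) example,
    a small positive margin suggests a critical (near-boundary) example,
    and a large margin suggests a typical example."""
    clf = SVC(kernel="linear").fit(X, y)
    margin = y * clf.decision_function(X)  # signed confidence margin
    return np.where(margin < noisy_thresh, "noisy",
                    np.where(margin < critical_thresh, "critical", "typical"))

# Toy linearly separable data with one deliberately flipped label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
y[0] = 1  # inject label noise deep inside the negative cluster
cats = categorize(X, y)
print(cats[0])  # the flipped example ends up on the wrong side of f(x)
```

The flipped example sits well inside the opposite class's cluster, so its decision value disagrees with its label and its signed margin is negative, placing it in the noisy category; most correctly labeled points far from the boundary fall into the typical category.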