Full Border Identification for Reduction of Training Sets
Border identification (BI) was previously proposed to help learning systems focus on the most relevant portion of the training set and thereby improve learning accuracy. This paper argues that the traditional BI implementation suffers from a serious limitation: it can identify only partial borders. We propose a new BI method, Progressive Border Sampling (PBS), which addresses this limitation by borrowing ideas from recent research on Progressive Sampling. PBS progressively learns optimal borders from the entire training set by, first, identifying a full border, thus avoiding the limitation of the traditional BI method, and, second, incrementing the size of that border until it converges to an optimal sample that is smaller than the original training set. Because PBS identifies the full border, it is expected to discover samples closer to optimal than those found by traditional BI. Our experimental results on 30 benchmark datasets selected from the UCI repository show that, in the context of classification, PBS is indeed more successful than traditional BI at reducing the size of the training sets and optimizing accuracy.
Keywords: Border Identification, Progressive Sampling, Convergence Detection, Learning Curve
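The two-step procedure outlined in the abstract (identify a border, then grow it until it converges) can be illustrated with a minimal sketch. This is not the authors' implementation. The border rule (each point contributes its nearest unselected neighbour from another class), the convergence test (1-NN accuracy on the full training set stops improving), and all function names and the `tol` parameter are assumptions chosen only to make the idea concrete.

```python
import math


def nearest_other_class(point, label, data, excluded):
    """Index of the closest point with a different label that is not in `excluded`."""
    best, best_d = None, math.inf
    for j, (q, lq) in enumerate(data):
        if lq == label or j in excluded:
            continue
        d = math.dist(point, q)
        if d < best_d:
            best, best_d = j, d
    return best


def border_layer(data, selected):
    """Assumed border rule: every point nominates its nearest unselected
    neighbour from another class; the nominees form one border layer."""
    layer = set()
    for p, lp in data:
        j = nearest_other_class(p, lp, data, selected)
        if j is not None:
            layer.add(j)
    return layer


def one_nn_accuracy(sample_idx, data):
    """Accuracy of a 1-NN classifier built on the sample, scored on all of `data`."""
    sample = [data[i] for i in sample_idx]
    hits = 0
    for p, lp in data:
        _, pred = min(sample, key=lambda s: math.dist(p, s[0]))
        hits += pred == lp
    return hits / len(data)


def progressive_border_sample(data, tol=1e-9):
    """Grow the border layer by layer until accuracy stops improving
    (an assumed stand-in for convergence detection on a learning curve)."""
    selected, best_acc = set(), -1.0
    while len(selected) < len(data):
        layer = border_layer(data, selected)
        if not layer:
            break
        candidate = selected | layer
        acc = one_nn_accuracy(candidate, data)
        if acc <= best_acc + tol:
            break  # plateau: keep the previous, smaller sample
        selected, best_acc = candidate, acc
    return sorted(selected), best_acc


# Toy data: two 1-D clusters, class 0 at x = 0..3 and class 1 at x = 5..8.
data = [((float(x),), 0) for x in range(4)] + [((float(x),), 1) for x in range(5, 9)]
idx, acc = progressive_border_sample(data)
print(idx, acc)
```

On this toy set the first layer holds the two points facing each other across the class gap. Adding a second layer does not raise the accuracy, so the loop stops with a sample much smaller than the training set. The pure-Python distance loops are quadratic in the number of points and are meant for illustration only.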