Full Border Identification for Reduction of Training Sets

  • Guichong Li
  • Nathalie Japkowicz
  • Trevor J. Stocki
  • R. Kurt Ungar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5032)


Border identification (BI) was previously proposed to help learning systems focus on the most relevant portion of the training set and thereby improve learning accuracy. This paper argues that the traditional BI implementation suffers from a serious limitation: it can identify only partial borders. The paper proposes a new BI method, Progressive Border Sampling (PBS), which addresses this limitation by borrowing ideas from recent research on Progressive Sampling. PBS progressively learns optimal borders from the entire training set in two steps: first, it identifies a full border, avoiding the limitation of the traditional BI method; second, it increments the size of that border until it converges to an optimal sample that is smaller than the original training set. Because PBS identifies the full border, it is expected to discover more optimal samples than traditional BI. Experimental results on 30 selected benchmark datasets from the UCI repository show that, in the context of classification, PBS is indeed more successful than traditional BI at reducing the size of the training set and optimizing accuracy.
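The two steps named in the abstract (identify a border drawn from every class, then grow it until accuracy stops improving) can be illustrated with a minimal sketch. This is not the authors' implementation; it rests on several assumptions made only for illustration: a border layer is taken to be the set of points that are the nearest opposite-class neighbour of some remaining point, convergence is judged by the accuracy of a 1-nearest-neighbour classifier on the full training set, and the function names and the `tol` parameter are hypothetical.

```python
import math


def nearest_opposite(i, points, labels, pool):
    """Index in `pool` of the point closest to i with a different label, or None."""
    best, best_d = None, math.inf
    for j in pool:
        if labels[j] != labels[i]:
            d = math.dist(points[i], points[j])
            if d < best_d:
                best, best_d = j, d
    return best


def border_layer(points, labels, pool):
    """Assumed border definition: every point that is the nearest
    opposite-class neighbour of some point in `pool`. Looping over all
    points, whatever their class, yields border points on every side."""
    layer = set()
    for i in pool:
        j = nearest_opposite(i, points, labels, pool)
        if j is not None:
            layer.add(j)
    return layer


def one_nn_accuracy(points, labels, sample):
    """Accuracy on the whole training set of a 1-NN classifier that only
    stores `sample` (a stand-in learner, assumed for illustration)."""
    correct = 0
    for i in range(len(points)):
        nearest = min(sample, key=lambda j: math.dist(points[i], points[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(points)


def progressive_border_sample(points, labels, tol=1e-9):
    """Add border layers one at a time; stop when accuracy no longer improves."""
    remaining = set(range(len(points)))
    sample, best_acc = set(), -1.0
    while remaining:
        layer = border_layer(points, labels, remaining)
        if not layer:
            break  # only one class is left, so no further border exists
        candidate = sample | layer
        acc = one_nn_accuracy(points, labels, candidate)
        if acc <= best_acc + tol:
            break  # converged: the extra layer did not help
        sample, best_acc = candidate, acc
        remaining -= layer
    return sorted(sample), best_acc


# Toy data: two well-separated classes on a line.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (5.0,), (6.0,), (7.0,), (8.0,)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(progressive_border_sample(points, labels))  # -> ([3, 4], 1.0)
```

On this toy set the first layer (the two points facing each other across the class gap) already classifies every training point correctly, so the second layer brings no gain and the loop stops with a sample of two points out of eight.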


Keywords: Border Identification, Progressive Sampling, Convergence Detection, Learning Curve





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Guichong Li (1)
  • Nathalie Japkowicz (1)
  • Trevor J. Stocki (2)
  • R. Kurt Ungar (2)
  1. Computer Science, University of Ottawa
  2. Radiation Protection Bureau, Health Canada, Ottawa, Canada
