Instance Selection by Border Sampling in Multi-class Domains

  • Guichong Li
  • Nathalie Japkowicz
  • Trevor J. Stocki
  • R. Kurt Ungar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5678)


Instance selection is a pre-processing technique for machine learning and data mining. The main problem is that previous approaches still struggle to produce effective samples for training classifiers. Recently, a new sampling technique called Progressive Border Sampling (PBS) was proposed to produce a small sample from the original labelled training set by identifying and augmenting border points. However, border sampling in multi-class domains is not a trivial issue, and training sets in practical applications contain much redundancy and noise. In this work, we discuss several issues related to PBS and show that it can produce effective samples for training classifiers by removing redundancy and noise from training sets. We compare this new technique with previous instance selection techniques for learning classifiers, especially Naïve Bayes-like classifiers, on multi-class domains, together with one binary case drawn from a practical application.
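To make the idea of border sampling concrete, the sketch below shows one simple way to identify "border points" in a labelled training set. This is an illustrative sketch only, not the authors' PBS algorithm: the pairing rule used here (mutual nearest neighbours across classes, in the style of Tomek links) and all function names are assumptions chosen for illustration.

```python
# Illustrative sketch of border-point identification, NOT the PBS algorithm
# from the paper. Assumption: a point lies on the class border if it and its
# nearest "enemy" (closest point with a different label) are mutually nearest
# enemies, as in Tomek links.

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_enemy(i, X, y):
    """Index of the closest point to X[i] carrying a different label."""
    best, best_d = None, float("inf")
    for j in range(len(X)):
        if y[j] != y[i]:
            d = euclidean(X[i], X[j])
            if d < best_d:
                best, best_d = j, d
    return best

def border_points(X, y):
    """Indices of points forming mutual cross-class nearest-enemy pairs,
    i.e. points lying near a class boundary."""
    border = set()
    for i in range(len(X)):
        j = nearest_enemy(i, X, y)
        if j is not None and nearest_enemy(j, X, y) == i:
            border.update((i, j))
    return sorted(border)

# Toy 1-D example: two classes meeting between x = 2 and x = 3.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
print(border_points(X, y))  # -> [2, 3], the pair straddling the boundary
```

A progressive scheme such as PBS would then grow the sample from such border pairs until classifier performance converges, rather than stopping at a single pass as this sketch does.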


Keywords: Instance Selection, Border Sampling, Multi-class Domains, Class Binarization method





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Guichong Li (1)
  • Nathalie Japkowicz (1)
  • Trevor J. Stocki (2)
  • R. Kurt Ungar (2)
  1. School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada
  2. Radiation Protection Bureau, Health Canada, Ottawa, Canada
