Combine Vector Quantization and Support Vector Machine for Imbalanced Datasets

  • Ting Yu
  • John Debenham
  • Tony Jan
  • Simeon Simoff
Part of the IFIP International Federation for Information Processing book series (IFIPAICT, volume 217)


On extremely imbalanced, high-dimensional datasets, standard machine learning techniques tend to be overwhelmed by the majority classes. This paper rebalances skewed datasets by compressing the majority class. The proposed approach, VQ-SVM, combines Vector Quantization with a Support Vector Machine to rebalance datasets without significant information loss. Issues such as distortion and support vectors are discussed to address the trade-off between information loss and undersampling. Experiments compare VQ-SVM with a standard SVM on imbalanced datasets with varied imbalance ratios; the results show that VQ-SVM outperforms the standard SVM, especially on extremely imbalanced large datasets.
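The compression step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain Lloyd-style (k-means) iteration as the quantizer, Euclidean distance, and toy Gaussian data, and the names `vector_quantize` and `rebalance` are hypothetical. The majority class is replaced by a codebook of `k` codevectors, the minority class is kept intact, and the mean squared distortion is reported as a rough measure of the information lost by compression.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, codebook):
    """Index of the codevector closest to `point`."""
    return min(range(len(codebook)), key=lambda j: sq_dist(point, codebook[j]))


def vector_quantize(points, k, iters=20, seed=0):
    """Lloyd-style quantizer (an assumption, not necessarily the paper's).

    Returns (codebook, mean squared distortion).
    """
    rng = random.Random(seed)
    codebook = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign every point to the cell of its nearest codevector.
        cells = [[] for _ in range(k)]
        for p in points:
            cells[nearest(p, codebook)].append(p)
        # Move each codevector to the centroid of its cell.
        for j, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codevector
                codebook[j] = [sum(c) / len(cell) for c in zip(*cell)]
    distortion = sum(
        sq_dist(p, codebook[nearest(p, codebook)]) for p in points
    ) / len(points)
    return codebook, distortion


def rebalance(majority, minority, k):
    """Replace the majority class by k codevectors; keep the minority as is."""
    codebook, distortion = vector_quantize(majority, k)
    X = codebook + [list(p) for p in minority]
    y = [-1] * len(codebook) + [1] * len(minority)
    return X, y, distortion


# Toy data (assumed for illustration): 500 majority vs. 10 minority points.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

X, y, distortion = rebalance(majority, minority, k=10)
ratio_before = len(majority) / len(minority)
ratio_after = y.count(-1) / y.count(1)
print("imbalance ratio before:", ratio_before)
print("imbalance ratio after:", ratio_after)
print("mean squared distortion:", distortion)
```

The rebalanced pair `(X, y)` would then be passed to an SVM trainer of the reader's choice. A smaller `k` gives a more balanced training set but a higher distortion, which is the trade-off between undersampling and information loss that the abstract refers to.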


Keywords: Vector Quantization; Minority Class; Imbalanced Data; Imbalanced Dataset; Imbalance Ratio



Copyright information

© International Federation for Information Processing 2006

Authors and Affiliations

  • Ting Yu (1, 2)
  • John Debenham (1, 2)
  • Tony Jan (1, 2)
  • Simeon Simoff (1, 2)
  1. Institute for Information and Communication Technologies, Faculty of Information Technology, University of Technology, Sydney, Australia
  2. Capital Markets Cooperative Research Centre, Australia
