FISA: Feature-Based Instance Selection for Imbalanced Text Classification

  • Aixin Sun
  • Ee-Peng Lim
  • Boualem Benatallah
  • Mahbub Hassan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3918)


Support Vector Machine (SVM) classifiers are widely used in text classification tasks, and these tasks often involve imbalanced training data. In this paper, we specifically address the cases where negative training documents significantly outnumber the positive ones. A generic algorithm known as FISA (Feature-based Instance Selection Algorithm) is proposed to select only a subset of the negative training documents for training an SVM classifier. With a smaller, carefully selected training set, an SVM classifier can be trained more efficiently while delivering comparable or better classification accuracy. In our experiments on the 20-Newsgroups dataset, using only 35% of the negative training examples and 60% of the learning time, methods based on FISA delivered much better classification accuracy than methods using all negative training documents.
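The abstract does not spell out how FISA scores negative documents, but the core idea (use features selected from the positive class to decide which negative documents are worth keeping) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function name `select_negatives`, the use of raw term frequency in positive documents as the feature-selection metric (a real implementation would use a proper metric such as chi-square or information gain), and the overlap-count scoring of negatives. This is not the authors' algorithm, only a hedged approximation of the feature-based selection idea.

```python
from collections import Counter

def select_negatives(pos_docs, neg_docs, num_features, keep_ratio):
    """Hypothetical sketch of feature-based negative instance selection.

    Rank each negative document by how many of the positive class's
    salient terms it contains, and keep only the highest-scoring
    fraction for SVM training.
    """
    # Crude stand-in for a feature-selection technique: take the terms
    # that appear in the most positive documents.
    doc_freq = Counter(t for d in pos_docs for t in set(d.split()))
    features = {t for t, _ in doc_freq.most_common(num_features)}

    # Negatives sharing more positive-class features are assumed to lie
    # nearer the class boundary, and are therefore more informative.
    scored = sorted(neg_docs,
                    key=lambda d: len(features & set(d.split())),
                    reverse=True)
    keep = max(1, int(len(neg_docs) * keep_ratio))
    return scored[:keep]

# Toy usage: keep roughly a third of the negatives.
pos = ["svm text classification", "svm kernel text"]
neg = ["svm related negative doc", "cooking recipe pasta", "football match report"]
print(select_negatives(pos, neg, num_features=3, keep_ratio=0.34))
```

The reduced negative set could then be concatenated with the full positive set and fed to any SVM trainer; the abstract's reported gains (35% of negatives, 60% of training time) refer to the authors' actual method, not to this sketch.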


Keywords: Support Vector Machine · Training Time · Instance Selection · Feature Selection Technique · Positive Training
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Aixin Sun ¹
  • Ee-Peng Lim ¹
  • Boualem Benatallah ²
  • Mahbub Hassan ²
  1. School of Computer Engineering, Nanyang Technological University, Singapore
  2. School of Computer Science and Engineering, University of New South Wales, Australia