An Empirical Analysis of Under-Sampling Techniques to Balance a Protein Structural Class Dataset

  • Marcilio C. P. de Souto
  • Valnaide G. Bittencourt
  • Jose A. F. Costa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4234)


There has been a great deal of research on learning from imbalanced datasets. Among the methods proposed to address this problem, the most common rely on either under-sampling or over-sampling the original dataset. In this work, we evaluate several under-sampling methods, such as Tomek Links, with the goal of improving the performance of classifiers generated by different ML algorithms (decision trees and support vector machines, among others) applied to the problem of determining the structural similarity of proteins.
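
As an illustration of the kind of under-sampling mentioned above, the following is a minimal sketch of Tomek-link removal, not the authors' implementation. It assumes Euclidean distance over numeric feature tuples. The function names (`tomek_links`, `undersample_tomek`) and the toy data are hypothetical, chosen only for this example. A Tomek link is a pair of examples from different classes that are each other's nearest neighbor; the sketch drops the majority-class member of every such pair.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def undersample_tomek(points, labels, majority_label):
    """Remove the majority-class example of each Tomek link."""
    drop = {k
            for pair in tomek_links(points, labels)
            for k in pair
            if labels[k] == majority_label}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Toy data (hypothetical): three majority and three minority examples.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0),
          (1.05, 1.0), (3.0, 3.0), (3.2, 3.0)]
labels = ["maj", "maj", "maj", "min", "min", "min"]

balanced_points, balanced_labels = undersample_tomek(points, labels, "maj")
```

In this toy set the only Tomek link joins the majority point at (1.0, 1.0) with the minority point at (1.05, 1.0), so only that majority point is removed. The brute-force neighbor search is quadratic in the number of examples, which is acceptable for a sketch but would likely need an index structure on larger datasets.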


Keywords: Support Vector Machine; Majority Class; Minority Class; Neighbor Rule; Imbalanced Dataset





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Marcilio C. P. de Souto (1)
  • Valnaide G. Bittencourt (2)
  • Jose A. F. Costa (3)
  1. Department of Informatics and Applied Mathematics, Federal University of Rio Grande do Norte, Natal-RN, Brazil
  2. Department of Computing and Automation, Federal University of Rio Grande do Norte, Natal-RN, Brazil
  3. Department of Electric Engineering, Federal University of Rio Grande do Norte, Natal-RN, Brazil
