The Journal of Supercomputing, Volume 72, Issue 10, pp 3708–3728

Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms

  • Jinyan Li
  • Simon Fong (Email author)
  • Sabah Mohammed
  • Jinan Fiaidhi


Classification, a popular supervised machine learning method, has many applications in computational biology, where data samples are automatically assigned to predefined labels with the aid of data mining. Often the training samples contain very few instances of interest (e.g., medical anomalies, rare diseases in a population, and unusual syndromes) but many normal instances. Such an imbalanced distribution of data among the target labels hampers the efficacy of classification algorithms, because the induced model has not been trained with a sufficient number of instances of the interesting label(s) and is instead overwhelmed by ordinary training records. Traditional remedies attempt to rebalance the data distributions of the target classes by artificially inflating the interesting instances, by reducing the majority of common instances, or by a combination of both. Though the fundamental concept is effective, there is no clear guideline on how to strike a balance between fabricating rare samples and reducing the normal ones so as to maximize classification accuracy. In this paper, an optimization model using different swarm strategies (the Bat-inspired algorithm and particle swarm optimization, PSO) is proposed for adaptively balancing the increase and decrease of the class distributions, depending on the properties of the biological datasets. The optimization is also extended to achieve the highest possible accuracy and Kappa statistic at the same time. The optimization model is tested on five imbalanced medical datasets, which are sourced from lung surgery logs and virtual screening of bioassay data. Computer simulation results show that the proposed optimization model outperforms other class balancing methods in medical data classification.
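
The abstract names accuracy and the Kappa statistic as joint optimization targets but gives no formulas. As a minimal sketch, assuming the standard definition of Cohen's kappa (observed agreement corrected for chance agreement), the following shows why accuracy alone can mislead on imbalanced data. The function name `accuracy_and_kappa` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label sequences must be non-empty and equal in length")
    # Observed agreement is the plain classification accuracy.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return observed, 0.0  # degenerate case: a single label on both sides
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical toy data: 95 normal samples (0) and 5 rare samples (1).
y_true = [0] * 95 + [1] * 5

# A model that never predicts the rare class.
majority_only = [0] * 100
acc_a, kappa_a = accuracy_and_kappa(y_true, majority_only)

# A model that finds 4 of the 5 rare samples but mislabels 5 normal ones.
rare_aware = [0] * 90 + [1] * 5 + [1] * 4 + [0] * 1
acc_b, kappa_b = accuracy_and_kappa(y_true, rare_aware)

print(f"majority-only: accuracy={acc_a:.3f}, kappa={kappa_a:.3f}")
print(f"rare-aware:    accuracy={acc_b:.3f}, kappa={kappa_b:.3f}")
```

In this toy case the majority-only model has the higher accuracy yet a kappa of zero, while the rare-aware model trades a little accuracy for a clearly positive kappa. This is one plausible reading of why the two measures are optimized together.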


Imbalanced biological data · Medical classification · Swarm algorithm · Parameter optimization



The authors are thankful for the financial support from the research grant “Temporal Data Stream Mining by Using Incrementally Optimized Very Fast Decision Forest (iOVFDF)”, Grant No. MYRG2015-00128-FST, offered by the University of Macau, FST, and RDAO.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Jinyan Li (1)
  • Simon Fong (1, Email author)
  • Sabah Mohammed (2)
  • Jinan Fiaidhi (2)
  1. Department of Computer and Information Science, University of Macau, Taipa, Macau SAR
  2. Department of Computer Science, Lakehead University, Thunder Bay, Canada
