Data Mining for Bioinformatics: Design with Oversampling and Performance Evaluation

  • Original Article
Journal of Medical and Biological Engineering

Abstract

Research into cancer prediction has applied various machine learning algorithms, such as neural networks, genetic algorithms, and particle swarm optimization, either to identify the features that distinguish illnesses or cancer types or to adapt traditional statistical prediction models so that they differentiate effectively between cancer types, with the aim of building prediction models that allow early detection and treatment. Such models are trained on data from existing patients and then used to classify new patient samples. In such data, the classes are often imbalanced. This issue has attracted considerable attention in the field of data mining, and scholars have proposed various methods (e.g., random sampling and feature selection) to address class imbalance and achieve a re-balanced class distribution, thus improving the effectiveness of classifiers trained on limited data. Although resampling methods can quickly deal with imbalanced samples, they give more importance to the data in the majority class and neglect potentially important data in the minority class, which limits classification effectiveness. Based on patterns discovered in imbalanced medical data sets, this research uses the synthetic minority oversampling technique (SMOTE) to mitigate the class imbalance problem. It also compares the resampling performance of various methods based on machine learning, soft computing, and bio-inspired computing on three UCI medical data sets.
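
To make the oversampling idea in the abstract concrete, the following is a minimal sketch of the general synthetic minority oversampling technique: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority-class neighbours. It is not the authors' implementation. The function name `smote`, its parameters, and the toy Gaussian data are assumptions chosen for illustration, and only NumPy is used.

```python
import numpy as np


def smote(X_min, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE sketch: interpolate between minority-class neighbours.

    X_min is an array of minority-class samples (at least two rows).
    The name and signature are hypothetical, not taken from the article.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a sample cannot have more neighbours than other samples

    # Pairwise squared Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbour

    # Indices of the k nearest minority neighbours of every minority sample.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its neighbours for each synthetic sample.
    base = rng.integers(0, n, size=n_synthetic)
    picks = neighbours[base, rng.integers(0, k, size=n_synthetic)]

    # Place the synthetic sample at a random position between the two.
    gap = rng.random((n_synthetic, 1))
    return X_min[base] + gap * (X_min[picks] - X_min[base])


# Toy data (assumed for illustration): 90 majority vs. 10 minority samples.
rng = np.random.default_rng(42)
X_majority = rng.normal(0.0, 1.0, size=(90, 2))
X_minority = rng.normal(3.0, 0.5, size=(10, 2))

# Generate 80 synthetic minority samples so both classes have 90 rows.
X_new = smote(X_minority, n_synthetic=80)
X_balanced_minority = np.vstack([X_minority, X_new])
print(X_majority.shape, X_balanced_minority.shape)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by the minority class, in contrast to random oversampling, which only duplicates existing rows. In an evaluation, oversampling of this kind would normally be applied to the training split only, so that synthetic samples do not leak into the test data.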




Corresponding author

Correspondence to Shyr-Shen Yu.


Cite this article

Tsai, MF., Yu, SS. Data Mining for Bioinformatics: Design with Oversampling and Performance Evaluation. J. Med. Biol. Eng. 35, 775–782 (2015). https://doi.org/10.1007/s40846-015-0094-8


  • DOI: https://doi.org/10.1007/s40846-015-0094-8
