Data Mining for Bioinformatics: Design with Oversampling and Performance Evaluation

Tsai, Meng-Fong; Yu, Shyr-Shen

doi:10.1007/s40846-015-0094-8

Original Article
Published: 19 November 2015

Volume 35, pages 775–782, (2015)

Journal of Medical and Biological Engineering

Meng-Fong Tsai¹ &
Shyr-Shen Yu¹

Abstract

Research on cancer prediction has applied various machine learning algorithms, such as neural networks, genetic algorithms, and particle swarm optimization, either to identify the features that characterize an illness or cancer or to adapt traditional statistical prediction models so that they differentiate effectively between types of cancer. The goal is to build prediction models that allow early detection and treatment. Such models are trained on data from existing patients and then used to classify samples from new patients. Class imbalance in such training data has attracted considerable attention in the field of data mining, and scholars have proposed various methods (e.g., random sampling and feature selection) to re-balance the class distribution and thus improve the effectiveness of classifiers trained on limited data. Although resampling methods can deal quickly with unbalanced samples, they give more weight to data in the majority class and neglect potentially important data in the minority class, which limits classification effectiveness. Based on patterns discovered in imbalanced medical data sets, this research uses the synthetic minority oversampling technique (SMOTE) to address the imbalance. It also compares the resampling performance of various methods based on machine learning, soft computing, and bio-inspired computing on three UCI medical data sets.
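
To make the oversampling idea in the abstract concrete, the following is a minimal sketch of the synthetic minority oversampling technique: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. It is an illustration only, not the authors' implementation. The function name `smote`, its parameters, and the toy data are assumptions, and picking the base sample at random is a simplification of the per-sample loop in the original formulation.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples
    and their k nearest minority-class neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        # Simplification: choose the base sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made up): 4 minority samples against an assumed majority of 10.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority_count = 10
new_points = smote(minority, majority_count - len(minority), k=3)
balanced_minority = minority + new_points
```

Because every synthetic point is an interpolation between two real minority samples, it stays inside the region the minority class already occupies, which is what distinguishes this approach from simply duplicating minority records.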

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Chung Hsing University, Taichung, 402, Taiwan
Meng-Fong Tsai & Shyr-Shen Yu

Corresponding author

Correspondence to Shyr-Shen Yu.

About this article

Cite this article

Tsai, MF., Yu, SS. Data Mining for Bioinformatics: Design with Oversampling and Performance Evaluation. J. Med. Biol. Eng. 35, 775–782 (2015). https://doi.org/10.1007/s40846-015-0094-8

Received: 10 April 2015
Accepted: 03 June 2015
Published: 19 November 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s40846-015-0094-8
