1 Introduction

Today, the use of machine learning methods to diagnose diseases has become widespread [1]. Breast cancer is one of the most common malignant cancers among women worldwide and accounts for 25.1% of all cancers [2]. Breast cancer spreads to other organs over time. Research has also shown that breast cancer is more common in women with an average age of 47 years than in women with an average age of 63 years [3, 4].

Cancerous tumors are divided into malignant and benign. Benign tumors are non-invasive, whereas malignant tumors are invasive and can spread to other parts of the body. Correct diagnosis of the tumor type is therefore essential for treatment. Recurrence of breast cancer can occur 1–20 years after treatment of the primary cancer, and cancer patients often face treatment complications. The recurrence of breast cancer can be predicted by examining factors such as the size of the primary tumor, the number of damaged lymph nodes, the area of the tumor, and similar factors. In recent years, machine learning models have been used in medicine to diagnose cancer and accurately classify benign and malignant tumors in a reasonable time [5, 6].

With the advancement of technology, different types of high-dimensional data are produced. The data produced in the field of medicine, including cancer data, have many dimensions and variables. When the dimensionality of the data is high, the classification results may contain more errors and data analysis becomes difficult. High-dimensional data also pose challenges in terms of the search space, time, and computational costs [7]. Machine learning has been successful in diagnosing diseases such as COVID-19 [1], cardiovascular disease, and diabetes mellitus and in analyzing their data. Using dimension reduction and feature selection (FS) methods makes disease diagnosis faster, easier, and less expensive, and makes the data easier to store and classify [8], because feature selection produces fewer features and reduces computational costs [9]. The goal of feature selection is to reduce the number of features in the dataset and select the most effective ones, so that the highest possible classification accuracy can be achieved with the fewest possible features. Machine learning methods are widely used in medical studies and in the automatic diagnosis of cancers such as breast cancer. Many successful detection and prediction methods have been reported, especially in studies using the Coimbra, Wisconsin Diagnostic Breast Cancer (WDBC), and Wisconsin Prognostic Breast Cancer (WPBC) datasets. These predictions are made using the dimensions and other features of tumors [10].

The innovation of this research is the combination of ten different machine learning algorithms, including ensemble learning methods, with the PSO feature selection algorithm to select the most effective features for breast cancer diagnosis, implemented on three well-known breast cancer datasets: Coimbra, WPBC, and WDBC. The PSO algorithm is used to select the features that are most effective for disease diagnosis and to reduce the size of the dataset; it is applied to the Coimbra, WDBC, and WPBC datasets to diagnose breast cancer using the most common machine learning methods, and the performance of each algorithm is analyzed. These machine learning methods are AdaBoost (ADB), Decision Tree (DT), k Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Logistic Regression (LGR), Linear Regression (LR), Naïve Bayes (NB), Artificial Neural Networks (ANN), Random Forest (RF), and Support Vector Machine (SVM). In addition, a comparative analysis was performed between the performance evaluation criteria of the machine learning methods on the original datasets and on the datasets consisting of the features selected by the PSO feature selection algorithm.

The remainder of this article is organized as follows. The second section examines related works, including studies on breast cancer and the use of machine learning algorithms and other techniques to diagnose breast cancer and compare their accuracy. The third section describes the Coimbra, WDBC, and WPBC datasets used to evaluate the proposed method, as well as the machine learning classification methods and the PSO algorithm. The fourth section describes the theory and calculation of the proposed method. The fifth section presents the results obtained in this research, and the sixth section is devoted to the discussion. The seventh section provides conclusions and suggestions for future research.

2 Related Works

Meta-heuristic algorithms are widely used in feature selection because they are highly efficient, easy to implement, and can manage large-scale data. Swarm Intelligence (SI) algorithms form a branch of meta-heuristic algorithms. These algorithms imitate the social behavior of animal groups such as insects (e.g., ants and bees), birds, and fish [11, 12]. Feature selection, in turn, is an important and challenging task in machine learning, and one of its goals is to maximize classification accuracy [12]. For instance, in [13], the authors used SVM methods to diagnose breast cancer and the kNN, NB, and DT algorithms to detect the type of cancer cells. In [14], a hybrid model based on concepts from neural networks and fuzzy systems was presented. This model could manipulate data collected in medical examinations and detect patterns in healthy individuals and individuals with breast cancer with an acceptable level of accuracy. These intelligent techniques have made it possible to create expert systems based on logical rules of the IF/THEN type. According to its results, the hybrid model has a good capacity to predict breast cancer and analyze the characteristics of this cancer. In [10], the DT method was used to diagnose breast cancer using parameters related to blood analysis. In this method, the importance of the attributes is determined by the Gini coefficient, and the reported accuracy is 90.5%. In [15], an analytical evaluation of machine learning methods on breast cancer datasets was performed. Some initial processing was performed on the input datasets using WEKA software, and its overall effect on prediction accuracy was determined. In that research, a filter feature selection method was used. The results show that correct feature selection can identify the best features and increase prediction speed and accuracy. RF had the best accuracy, 69% before using the filter method and 98% after using it. Similarly, LR came second with 96% accuracy after using the filter and 68% without it, followed by NB with 91% after using the filter method and 71% without it. The authors in [16] diagnosed breast cancer using four factors: resistin, glucose, age, and BMI. They used three machine learning methods, namely SVM, RF, and LGR, and evaluated the results with the Monte Carlo cross-validation method at the 95% confidence level. Their results indicate the superiority of the SVM method over the other two. In [17], four methods, namely Extreme Learning Machine (ELM), SVM, kNN, and ANN, were used to diagnose breast cancer using the Wisconsin dataset, which includes blood analysis data of patients and healthy individuals. The accuracy obtained was 80% by ELM, 79.4% by ANN, 77.5% by kNN, and 73.5% by SVM. In [18], the authors presented an SVM-based ensemble learning method that was applied to two breast cancer datasets: the Wisconsin dataset and a breast cancer dataset registered in the United States. This method showed a 33.34% increase in diagnostic accuracy compared to the best individual SVM method.

3 Materials and Methods

This section examines the datasets, machine learning methods, and the PSO algorithm used to increase the accuracy of breast cancer diagnosis. In this research, datasets available in the repository of the University of California, Irvine (UCI), USA, have been used. The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an FTP archive in 1987 by David Aha and fellow graduate students at UC Irvine [19].

3.1 Description of Datasets

The datasets used in this study were Coimbra, WDBC, and WPBC. The number of samples and features of each dataset is shown in Table 1.

Table 1 List of used datasets

3.1.1 Coimbra Dataset

The Coimbra dataset contains 116 samples. It is related to a study performed at the Department of Obstetrics and Gynecology of Coimbra University Hospital between 2009 and 2013, during which its data were collected. Patients were diagnosed with breast cancer based on mammography results and the diagnosis of specialist physicians, and the data were obtained before treatment. The dataset has ten attributes: one binary attribute, which indicates the presence or absence of cancer, and nine continuous attributes. The attributes are anthropometric parameters and data obtained from blood analysis [20]. The attributes and statistical parameters of the Coimbra dataset are shown in Table 2.

Table 2 Features and statistical parameters of the Coimbra dataset

3.1.2 WDBC Dataset

Another dataset is WDBC, which has 569 samples. Each instance contains 32 attributes: the first attribute is a unique identification number, the second is the diagnosis label (357 benign/212 malignant), and the remaining 30 are features computed from a digitized image of a Fine Needle Aspirate (FNA) test. Ten real-valued attributes, shown in Table 3, were computed for each cell nucleus. For each image, the mean, standard error, and largest (worst) values of these ten features were computed, giving 30 features [8, 16, 21, 22]. The ten real-valued attributes and statistical parameters of the WDBC dataset are shown in Table 3.

Table 3 Features and statistical parameters of the WDBC dataset

This dataset was collected by Dr. Wolberg from patients since 1984, and there is no evidence of metastasis to other parts of the body in these patients [23].

3.1.3 WPBC Dataset

The WPBC dataset is another dataset used in this study. It contains 198 samples (151 non-recurrent/47 recurrent), each with 33 attributes. The first attribute is a unique identification number, the second is the prognosis status (non-recurrence or recurrence), and the third is the recurrence time. Thirty further features were obtained by the process explained in the previous section. The two remaining features are the diameter of the excised tumor in centimeters and the number of axillary lymph nodes evaluated as positive during surgery [21, 24]. The ten real-valued features and the other features, with statistical parameters of the WPBC dataset, are shown in Table 4.

Table 4 Attributes and statistical values of WPBC dataset

3.2 Classification Methods

Ten machine learning classification methods were used with the breast cancer datasets to diagnose breast cancer, its recurrence or non-recurrence, and whether a tumor is benign or malignant. Each algorithm is briefly described below.

3.2.1 AdaBoost

The ADB algorithm is an ensemble learning algorithm. In each round t = 1, …, T, a weak classifier is added and the sample weights are updated based on the importance of the samples: the weight of misclassified samples is increased and the weight of correctly classified samples is reduced. As a result, each new classifier focuses on the examples that are harder to learn [25].
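To make the reweighting concrete, the following is a minimal NumPy sketch of one round of discrete AdaBoost in its standard textbook form, assuming labels in {−1, +1}; it is an illustration only, not the exact implementation used in this study (scikit-learn's AdaBoostClassifier with default parameters was used, per Sect. 4.4).

```python
import numpy as np

def adaboost_reweight(w, y_true, y_pred):
    """One reweighting round of discrete AdaBoost (labels in {-1, +1})."""
    miss = y_true != y_pred                        # samples the weak learner misclassified
    err = np.sum(w[miss]) / np.sum(w)              # weighted error of the weak learner
    alpha = 0.5 * np.log((1 - err) / err)          # weight of this learner in the final vote
    w = w * np.exp(np.where(miss, alpha, -alpha))  # raise misclassified, lower correct weights
    return w / w.sum(), alpha                      # renormalize weights to a distribution
```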

3.2.2 Decision Tree

The DT algorithm can perform classification operations on data [26]. A decision tree is made up of nodes. The topmost node, called the "root", has no incoming edges; every other node has exactly one incoming edge. A node with outgoing edges is called an internal node or test node, and the remaining nodes are called decision nodes or leaves. In a decision tree, each internal node divides the sample space into two or more subspaces according to the values of the input attributes [27].

3.2.3 k Nearest Neighbors

The KNN algorithm is a classification algorithm in which the class of a sample is determined by a majority vote of its k nearest neighbors. K is a positive integer and is generally small. If k = 1, the sample is simply assigned to the class of its nearest neighbor. K is usually chosen to be odd so that tied votes are avoided [28].

3.2.4 Linear Discriminant Analysis

LDA involves finding the hyperplane that minimizes the variance within each class and maximizes the distance between the projected class means. As a result, the linear combination of attributes that best separates two or more classes of objects can be found. LDA is closely related to analysis of variance and regression analysis [29].

3.2.5 Logistic Regression

The LGR algorithm models the relationship between a categorical dependent variable Y and a set of attributes X. In LGR, there is one dependent variable and usually a set of independent variables, which may be categorical, quantitative, or a combination of the two. The qualitative dependent variable takes two values, zero or one [30].

3.2.6 Linear Regression

The LR algorithm measures the effect of an independent variable on a dependent variable and the correlation between them [31]. The common method for estimating the parameters is the least-squares method, in which the parameters are obtained by minimizing the sum of squared errors [32].

3.2.7 Naïve Bayes

The NB algorithm is a family of simple classifiers that apply Bayes' theorem with a strong (naïve) independence assumption. In this method, by determining the probabilities of the outcomes, the uncertainty about the model is quantified. This classifier is named after Thomas Bayes (1702–1761), who proposed the theorem [8, 33].

3.2.8 Artificial Neural Network

The ANN algorithm is a biologically inspired model. A neural network enables a computer to learn a problem it has not seen before, and the functioning of an ANN mimics that of the human brain to some extent [34]. ANNs are made up of small units called neurons, where the strength of the connection between neurons is determined by synaptic weights. Each neuron computes its output value from the weighted sum of its inputs [35].

3.2.9 Random Forest

The RF algorithm is an ensemble learning algorithm in which a set of DTs together forms a forest for better decision-making. Bagging is the technique used to generate the training data in this algorithm: none of the selected data is deleted from the input dataset, and the data can be reused to generate the next subset. In the bagging method, N new datasets are sampled from the original dataset and a randomized tree is built on each. The final model is created by averaging or voting across the trees. Each tree is constructed such that, at each split, a few variables are randomly selected and the best of them is chosen [26]. Samples not selected for training a tree in the bagging process form its out-of-bag subset and can be used to evaluate performance [36].

3.2.10 Support Vector Machine

The SVM algorithm is a classification method that selects the decision boundary with the maximum-margin hyperplane. The input space transformation is done implicitly using a kernel function. If the data cannot be separated by a line, a hyperplane is created in the transformed space to separate the data, such that this hyperplane has the largest possible margin to the samples of the classes [13].
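As a sketch of how these ten methods can be instantiated in scikit-learn with default parameters (which, per Sect. 4.4, is how they were configured in this study); note that LinearRegression produces continuous outputs that must be thresholded, e.g. at 0.5, to yield class labels:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

classifiers = {
    "ADB": AdaBoostClassifier(),          # ensemble of weak learners, reweighted each round
    "DT":  DecisionTreeClassifier(),      # recursive partitioning of the sample space
    "KNN": KNeighborsClassifier(),        # majority vote of the k nearest neighbors
    "LDA": LinearDiscriminantAnalysis(),  # linear projection separating the classes
    "LGR": LogisticRegression(),          # binary outcome modeled with the logistic function
    "LR":  LinearRegression(),            # least-squares fit; output thresholded for a label
    "NB":  GaussianNB(),                  # Bayes' theorem with the naive independence assumption
    "NN":  MLPClassifier(),               # feed-forward artificial neural network
    "RF":  RandomForestClassifier(),      # bagged ensemble of randomized decision trees
    "SVM": SVC(),                         # maximum-margin hyperplane with a kernel function
}
```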

3.3 Particle Swarm Optimization Algorithm

The PSO algorithm is one of the meta-heuristic algorithms. It was first developed in 1995 by Kennedy and Eberhart [11] and was initially used to simulate the flocking of birds. After simplifying the initial algorithm, a kind of optimization behavior was observed. Optimization is the process of improving performance toward an optimal point or points [37, 38].

The PSO algorithm uses candidate solutions and a simple update formula to solve the optimization problem, exploring the search space of the problem to obtain the optimal solution [11]. The usual aim of the PSO algorithm is to find the maximum or minimum of a function defined on a multidimensional vector space: for example, find \(x^{*}\) such that \(f\left( {x^{*} } \right) \le f\left( x \right)\) for all d-dimensional real vectors x. The objective function \(f:R^{d} \to R\) is called the fitness function. PSO is a swarm intelligence meta-heuristic inspired by the group behavior of animals, for example bird flocks or fish schools. Similarly to genetic algorithms (GAs), it is a population-based method; that is, it represents the state of the algorithm by a population, which is iteratively modified until a termination criterion is satisfied. In PSO algorithms, the population \(P = \left\{ {p_{1} , \ldots ,p_{n} } \right\}\) of feasible solutions is called a swarm, and the feasible solutions \(p_{1} , \ldots ,p_{n}\) are called particles. The PSO algorithm views the set \(R^{d}\) of feasible solutions as a space in which the particles move. For solving practical problems, the number of particles is usually chosen between 10 and 50.

3.3.1 Characteristics of Particle i at Iteration t

\(X_{i}^{\left( t \right)}\): The position (a d-dimensional vector) of the ith particle at iteration t.

\(Pbest_{i}^{\left( t \right)}\): The best solution achieved so far by particle i (personal best).

\(Gbest^{\left( t \right)}\): The best solution that has been achieved so far by the entire swarm (global best).

\(V_{i}^{\left( t \right)}\): The velocity of particle i.

At the beginning of the algorithm, the particle positions are randomly initialized, and the velocities are set to 0, or to small random values.

3.3.2 Parameters of the Algorithm

\(W^{(t)}\): Inertia weight; a damping factor, usually decreasing from around 0.9 to around 0.4 during the computation.

\(c_{1} , \, c_{2}\): The acceleration coefficients; usually between 0 and 2.

3.3.3 Update of the Speed and the Positions of the Particles

The velocity of particle i at iteration t + 1 is obtained by updating the velocity of that particle at the previous iteration according to Eq. (1).

$$V_{i}^{{\left( {t + 1} \right)}} = w^{(t)} V_{i}^{\left( t \right)} + c_{1} u_{1} \left( {Pbest_{i}^{\left( t \right)} - X_{i}^{\left( t \right)} } \right) + c_{2} u_{2} \left( {Gbest^{\left( t \right)} - X_{i}^{\left( t \right)} } \right).$$
(1)

The symbols \(u_{1}\) and \(u_{2}\) represent random variables with the \(U\left( {0,1} \right)\) distribution. The first part of the velocity formula is called the "inertia" term, the second the "cognitive (personal) component", and the third the "social (neighborhood) component". The position of particle i at iteration t + 1 is updated according to Eq. (2).

$$X_{i}^{{\left( {t + 1} \right)}} = X_{i}^{\left( t \right)} + V_{i}^{{\left( {t + 1} \right)}} .$$
(2)

At each iteration, every particle searches both around the best point it has found so far and around the best point found by the entire swarm. In other words, in addition to the positions and velocities of the particles, \(Pbest_{i}^{\left( t \right)}\) and \(Gbest^{\left( t \right)}\) are updated during the iterations.
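A minimal NumPy sketch of one swarm iteration of Eqs. (1) and (2) might look as follows (a generic form of the canonical update, not the exact code used in this study):

```python
import numpy as np

def pso_step(X, V, pbest, gbest, w, c1, c2):
    """One iteration of Eqs. (1) and (2) for n particles in d dimensions.

    X, V, pbest are (n, d) arrays; gbest is a (d,) array.
    """
    n, d = X.shape
    u1 = np.random.rand(n, d)  # U(0,1) factors for the cognitive (personal) component
    u2 = np.random.rand(n, d)  # U(0,1) factors for the social (neighborhood) component
    V = w * V + c1 * u1 * (pbest - X) + c2 * u2 * (gbest - X)  # Eq. (1)
    X = X + V                                                  # Eq. (2)
    return X, V
```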

3.3.4 Stopping Rule

The algorithm is terminated after a given number of iterations, or once the fitness values of the particles, or the particles themselves, are close enough in some sense. Figure 1 shows the flowchart of the PSO algorithm.

Fig. 1 Flowchart of PSO algorithm

3.3.5 Advantages of PSO Algorithm

In the PSO algorithm, a number of particles are randomly generated, and they search for better solutions by moving within the domain of the problem; in this respect PSO resembles the genetic algorithm [39, 40]. The algorithm has very few parameters. The fitness function may be non-differentiable, since only its values are used. The method can be applied to optimization problems of large dimension. The PSO algorithm is insensitive to the scaling of design variables and provides a very efficient global search.

4 Proposed PSO Feature Selection Algorithm

The aim of this research is to improve the accuracy of breast cancer diagnosis based on the proposed PSO feature selection algorithm and machine learning classification methods. Figure 3 shows the flow chart of the proposed PSO feature selection algorithm. In the data collection stage, three breast cancer datasets, the Coimbra, WDBC, and WPBC datasets, were used; the process is carried out for one dataset at a time. In the next step, after data pre-processing, the PSO feature selection algorithm selects the features most effective for breast cancer diagnosis and produces a new dataset that has fewer features, and therefore lower dimensionality, than the original dataset. Then, breast cancer diagnosis is performed with the machine learning algorithms described in Sect. 3.2 on this new dataset, which consists only of the features selected by the PSO feature selection algorithm, classifying the data into two categories: sick or healthy. The same classification is also performed with the machine learning algorithms on the original datasets, without the PSO feature selection algorithm. Finally, the performance of the machine learning methods with and without the PSO feature selection algorithm is obtained and compared. The steps of the proposed PSO feature selection algorithm are described in detail below.

4.1 Data Pre-processing

In the data pre-processing phase, the dataset is transformed into understandable data. The tasks performed in the data preprocessing stage include:

  1. Detection and replacement of missing data
  2. Detection and replacement of outlier data
  3. Data normalization
  4. Data splitting.

The datasets were evaluated for missing values and outliers. Missing and outlier values were replaced with the mean value of the corresponding variable.

Importing data without normalization reduces the accuracy and performance of machine learning models. For this reason, all input data were normalized between 0 and 1 using Eq. (3) [41].

$${X}_{N}=\frac{{X}_{i}-{\text{min}}(x)}{{\text{max}}\left(x\right)-{\text{min}}(x)}.$$
(3)

In this equation, min(x), max(x), and \(X_N\) are the minimum, maximum, and normalized values of the input data, respectively.
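As an illustration, a pandas sketch of the imputation and normalization steps described above; the paper does not state its outlier criterion, so the 3-sigma rule used below is an assumption:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Mean-impute missing/outlier values, then min-max normalize (Eq. (3))."""
    df = df.copy()
    for col in df.columns:
        x = df[col]
        z = (x - x.mean()) / x.std()      # standard scores of the column
        x = x.mask(z.abs() > 3)           # assumed outlier rule: beyond 3 sigma -> missing
        x = x.fillna(x.mean())            # replace missing/outlier values with the column mean
        df[col] = (x - x.min()) / (x.max() - x.min())  # Eq. (3): scale to [0, 1]
    return df
```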

In the data splitting phase, the data were divided into two parts, training and test, in the proportion of 80% for training and 20% for testing. To increase the generalizability of the model and prevent overfitting during training, the cross-validation (CV) technique was applied. In this technique, the training data are divided into parts; in each iteration of the cross-validation process, one part is used for validation and the remaining parts for training. The type of cross-validation performed was the K-fold method, in which the training dataset is randomly divided into k folds of equal size; at each iteration, k − 1 of these folds are used as the training set and one as the validation set [42]. Figure 2 shows a tenfold cross-validation. In this study, a K-fold CV with four folds was implemented.

Fig. 2 A tenfold cross validation [43]
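With scikit-learn, the 80/20 split followed by four-fold cross-validation on the training part can be sketched as follows (synthetic data stands in for a preprocessed dataset; SVC is used only as an example classifier):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((116, 9))            # placeholder for a preprocessed Coimbra-sized dataset
y = rng.integers(0, 2, size=116)    # placeholder binary labels (sick/healthy)

# 80% training / 20% test split, then 4-fold CV on the training part.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
fold_accuracy = cross_val_score(SVC(), X_train, y_train, cv=4, scoring="accuracy")
print(fold_accuracy.mean())
```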

Then, with the machine learning methods described in Sect. 3.2, the datasets were evaluated once without applying the PSO feature selection algorithm and once with it, using the more effective features selected by the PSO algorithm to diagnose breast cancer. Afterwards, the results of the two approaches were compared (Fig. 3).

Fig. 3 Flow chart of the proposed method

4.2 PSO Feature Selection Algorithm

The purpose of this research is to increase the accuracy of breast cancer diagnosis. The objective function \(F(x)\) of the PSO feature selection is shown in Eq. (4). In this equation, the vector x contains the inputs of the objective function, namely the effective features selected by the PSO feature selection algorithm, and the output of the objective function is the breast cancer diagnosis accuracy, a weighted sum of the accuracies on the training and validation datasets. The accuracy formula is described in the next section. Alpha and beta are the coefficients of the breast cancer diagnosis accuracy on the training and validation datasets; their values were set to 0.7 and 0.3, respectively. In each iteration of the PSO feature selection algorithm, different features are selected by the PSO algorithm, and its fitness value, the breast cancer diagnosis accuracy, is obtained from the objective function. Finally, after the iterations terminate, the features selected by the PSO feature selection algorithm that led to the maximum accuracy are taken as the effective features for breast cancer diagnosis.

$$\begin{gathered} x = \left[ {x_{1} ,x_{2} ,x_{3} , \ldots ,x_{d} } \right], \hfill \\ {\text{Accuracy}} = \left( {{\text{TP}} + {\text{TN}}} \right)/\left( {{\text{TP}} + {\text{FN}} + {\text{FP}} + {\text{TN}}} \right), \hfill \\ F\left( x \right) = \alpha \left( {{\text{Accuracy}}_{{{\text{TR}}}} } \right) + \beta \left( {{\text{Accuracy}}_{{{\text{val}}}} } \right). \hfill \\ \end{gathered}$$
(4)
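A sketch of Eq. (4) as a fitness function: a particle proposes a subset of the d features, and the masked dataset is scored by a classifier. The paper does not spell out how continuous particle positions are binarized into feature masks; thresholding at 0.5 is a common choice and is assumed here.

```python
import numpy as np
from sklearn.metrics import accuracy_score

ALPHA, BETA = 0.7, 0.3  # weights of training and validation accuracy (Sect. 4.2)

def fitness(position, clf, X_tr, y_tr, X_val, y_val):
    """Eq. (4): fitness of the feature subset encoded by a particle's position."""
    mask = position > 0.5                 # assumed binarization of the continuous position
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0                        # an empty feature subset cannot classify
    clf.fit(X_tr[:, cols], y_tr)
    acc_tr = accuracy_score(y_tr, clf.predict(X_tr[:, cols]))
    acc_val = accuracy_score(y_val, clf.predict(X_val[:, cols]))
    return ALPHA * acc_tr + BETA * acc_val
```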

4.3 Performance Evaluation Parameters

To compare the performance of the different machine learning methods and of the different features selected by the PSO feature selection algorithm, the performance evaluation parameters accuracy, sensitivity, specificity, F1-score, positive predictive value (precision), negative predictive value, and area under the curve (AUC) were considered [44]. Figure 4 and Eqs. (5)–(10) show the confusion matrix for two-class classification and the evaluation criteria derived from it, respectively. True positives (TP) are cases correctly classified as patients, false positives (FP) are cases incorrectly classified as patients, true negatives (TN) are cases correctly classified as healthy, and false negatives (FN) are cases incorrectly classified as healthy. From the confusion matrix, accuracy, sensitivity, specificity, and F1-score are evaluated using the following equations.

Fig. 4 Confusion matrix for classification of two classes

$${\text{Accuracy}}=\frac{{\text{TP}}+{\text{TN}}}{{\text{TP}}+{\text{FN}}+{\text{FP}}+{\text{TN}}},$$
(5)
$${\text{Sensitivity}}\ \left({\text{TPR}}\right)=\frac{{\text{TP}}}{{\text{TP}}+{\text{FN}}},$$
(6)
$${\text{Specificity}}\ \left({\text{TNR}}\right)=\frac{{\text{TN}}}{{\text{TN}}+{\text{FP}}},$$
(7)
$$F1\_{\text{score}}=\frac{2{\text{TP}}}{2{\text{TP}}+{\text{FP}}+{\text{FN}}},$$
(8)
$$\mathrm{Positive\ predictive\ value}\ \left({\text{precision}}\right)=\frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}},$$
(9)
$$\mathrm{Negative\ predictive\ value}=\frac{{\text{TN}}}{{\text{FN}}+{\text{TN}}}.$$
(10)
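These criteria can be computed directly from the confusion matrix; a sketch using scikit-learn, whose confusion_matrix for binary labels returns [[TN, FP], [FN, TP]]:

```python
from sklearn.metrics import confusion_matrix

def evaluation_metrics(y_true, y_pred):
    """Eqs. (5)-(10) computed from the two-class confusion matrix of Fig. 4."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),  # Eq. (5)
        "sensitivity": tp / (tp + fn),                   # Eq. (6), TPR
        "specificity": tn / (tn + fp),                   # Eq. (7), TNR
        "f1_score":    2 * tp / (2 * tp + fp + fn),      # Eq. (8)
        "precision":   tp / (tp + fp),                   # Eq. (9), PPV
        "npv":         tn / (fn + tn),                   # Eq. (10), NPV
    }
```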

4.4 Sensitivity Analysis of Parameters of PSO Algorithm and Machine Learning Methods

In this research, the default values available in the Sklearn library of the Python programming language were used for the parameters of the machine learning algorithms. To choose appropriate values of the PSO algorithm parameters, a sensitivity analysis based on accuracy, sensitivity, and specificity was performed using one of the machine learning methods. Tables 5, 6, 7 and 8 describe the sensitivity of the PSO algorithm to variations in its parameters.

Table 5 C1 parameter
Table 6 C2 parameter
Table 7 W parameter
Table 8 Population size parameter

According to the results in the above tables, the values of the parameters of the PSO feature selection algorithm were chosen as described in Table 9.

Table 9 Parameters of PSO feature selection algorithm

The entire breast cancer diagnosis process and the PSO feature selection algorithm, together with the machine learning methods, were implemented in the Python 3 programming language on a PC with an Intel(R) Core(TM) i5-540 2.53 GHz CPU and 8.00 GB of RAM.

5 Results

This section presents the breast cancer diagnosis results obtained from the classification evaluation with and without PSO feature selection for the three datasets Coimbra, WDBC and WPBC.

5.1 Classification Evaluation for Coimbra Dataset

Without PSO feature selection, ten machine learning models were constructed with all features of the Coimbra dataset; with PSO feature selection, the ten models were constructed with the features selected by the PSO algorithm. The results of these two phases are shown in Figs. 5 and 6, respectively. It is clear from Figs. 5 and 6 that PSO feature selection can improve the diagnosis of breast cancer by increasing accuracy, sensitivity, and specificity.

Fig. 5 Performance of ML classification models without PSO feature selection (Coimbra dataset)

Fig. 6 Performance of ML classification models with PSO feature selection (Coimbra dataset)

In Fig. 7, the AUC evaluation metric is compared with and without PSO feature selection. According to the results, PSO feature selection improved the AUC value.

Fig. 7 AUC with and without PSO feature selection (Coimbra dataset)

Breast cancer diagnosis results without and with PSO feature selection are shown in Tables 10 and 11, respectively. Using the proposed PSO feature selection algorithm, the accuracy of the ADB, DT, and LDA models improved by 3% and the accuracy of the RF, LR, and LGR models improved by 2%. The accuracy of the KNN, NB, NN, and SVM models improved by 19%, 8%, 6%, and 4%, respectively, compared to the case without PSO feature selection.

Table 10 Breast cancer diagnosis results without PSO feature selection (Coimbra dataset)
Table 11 Breast cancer diagnosis results with PSO feature selection (Coimbra dataset)

According to Fig. 8, the experimental results show that the prediction model based on the SVM classifier and the PSO feature selection algorithm achieves a higher accuracy percentage in predicting breast cancer than the other algorithms.

Fig. 8 Accuracy (%) with and without PSO feature selection (Coimbra Dataset)

Figure 9 shows the number of features selected with and without PSO feature selection for each ML model on the Coimbra dataset. This dataset consists of nine features. Using the PSO feature selection algorithm, the number of features of the ADB and SVM models was reduced to four, the number of features of the DT, KNN, LDA, LR, and NN models was reduced to six, and the number of features of the LGR, NB, and RF models was reduced to five.

Fig. 9 Comparison of the number of features with and without PSO feature selection (Coimbra dataset)

In addition, classification evaluation metrics are given for the ten ML models in Fig. 10.

Fig. 10 Performance comparison for machine learning methods with and without PSO feature selection (Coimbra dataset)

5.2 Classification Evaluation for WDBC Dataset

As for the previous dataset, ten machine learning models were constructed without and with PSO feature selection on the WDBC dataset. The results of these two phases are shown in Figs. 11 and 12, respectively.

Fig. 11 Performance of ML classification methods without PSO feature selection (WDBC dataset)

Fig. 12 Performance of ML classification models with PSO feature selection (WDBC dataset)

Figure 13 compares the AUC with and without PSO feature selection. This comparison shows that the AUC value improved with PSO feature selection.

Fig. 13 AUC with and without PSO feature selection (WDBC dataset)

The diagnostic results for benign or malignant breast cancer with all features (without PSO feature selection) and with PSO feature selection are given in Tables 12 and 13, respectively. Using the PSO feature selection algorithm, the accuracy of the ADB, DT, KNN, LR, NN, and RF models improved by 1%, and the accuracy of the NB and SVM models improved by 3% and 2%, respectively. The accuracy of the LDA and LGR models was the same with and without the PSO feature selection algorithm, but for the LDA model the sensitivity improved by 1% with PSO feature selection. The classification results for the LGR model were identical with and without PSO feature selection.

Table 12 Breast cancer classification results without PSO feature selection (WDBC dataset)
Table 13 Breast cancer classification results with PSO feature selection (WDBC dataset)

According to Fig. 14, the ADB classifier with the PSO feature selection algorithm achieves a higher accuracy percentage than the other ML models.

Fig. 14 Accuracy (%) with and without PSO feature selection (WDBC dataset)

Figure 15 shows the number of features selected with and without PSO feature selection for each ML model on the WDBC dataset. This dataset consists of 30 features. Using the PSO feature selection algorithm, the number of features of the ADB and LR models was reduced to 21 and that of the DT and LDA models to 14. The number of features of the KNN and SVM models was reduced to 16, and the number of features of the LGR, NB, NN, and RF models was reduced to 9, 15, 17, and 19, respectively.

Fig. 15 Comparison of the number of features with and without PSO feature selection (WDBC dataset)

In addition, classification evaluation metrics are given for the ten ML models in Fig. 16.

Fig. 16 Performance comparison for machine learning methods with and without PSO feature selection (WDBC dataset)

5.3 Classification Evaluation for WPBC Dataset

As for the previous two datasets, ten machine learning models were constructed without and with PSO feature selection. The results of these two phases are shown in Figs. 17 and 18, respectively.

Fig. 17 Performance of ML classification methods without PSO feature selection (WPBC dataset)

Fig. 18 Performance of ML classification models with PSO feature selection (WPBC dataset)

In Fig. 19, the AUC with and without PSO feature selection is compared. The AUC improved when PSO feature selection was used.

Fig. 19 AUC with and without PSO feature selection (WPBC dataset)

The diagnosis results for breast cancer recurrence or non-recurrence with all features (without PSO feature selection) and with PSO feature selection are given in Tables 14 and 15, respectively.

Table 14 Breast cancer classification results without PSO feature selection (WPBC dataset)
Table 15 Breast cancer classification results with PSO feature selection (WPBC dataset)

Using the proposed PSO feature selection algorithm, the accuracy of the DT, LGR, NN, RF, and SVM models improved by 2%. The accuracy of the ADB and NB models improved by 3%, the accuracy of the LDA and LR models by 1%, and the accuracy of the KNN model by 4%.

According to Fig. 20, the SVM classifier with the PSO feature selection algorithm achieves a higher accuracy percentage for recurrence diagnosis than the other ML models.

Fig. 20 Accuracy (%) with and without PSO feature selection (WPBC dataset)

Figure 21 shows the number of features selected with and without PSO feature selection for each ML model on the WPBC dataset. This dataset consists of 33 features. Using the PSO feature selection algorithm, the number of features of the ADB, DT, NB, and RF models was reduced to 20, and the number of features of the LGR and NN models was reduced to 18. The number of features of the KNN, LDA, LR, and SVM models was reduced to 19, 21, 23, and 17, respectively.

Fig. 21 Comparison of the number of features with and without PSO feature selection (WPBC dataset)

Classification evaluation metrics are given for the ten machine learning models in Fig. 22.

Fig. 22 Performance comparison for machine learning models with and without PSO feature selection (WPBC dataset)

6 Discussion

In this study, performance evaluation metrics including accuracy, sensitivity, specificity, precision, and AUC have been used.

According to the results, all classifiers achieve a better accuracy level with PSO feature selection than without it. For some classifiers, the accuracy obtained with and without PSO feature selection is close or equal, but with PSO feature selection the results were obtained with a smaller number of features than the full feature sets of the datasets (Tables 16, 17).

Table 16 Performance of ADB, DT, KNN, LDA, LGR classifiers
Table 17 Performance of LR, NB, NN, RF, SVM classifiers

The classifiers also achieve a higher level of precision with PSO feature selection than without it (Figs. 23, 24 and 25).

Fig. 23 Precision (%) with and without PSO feature selection (Coimbra dataset)

Fig. 24 Precision (%) with and without PSO feature selection (WDBC dataset)

Fig. 25 Precision (%) with and without PSO feature selection (WPBC dataset)

With PSO feature selection, results better than or equal to those obtained with all features are achieved. Figure 26 compares the accuracy with PSO feature selection across the three datasets. According to this figure, the best accuracy for breast cancer diagnosis on the Coimbra dataset was obtained by the SVM classifier with 91%, followed by ADB with 88%. On the WDBC dataset, the best result was obtained by the RF classifier with an accuracy of 100%, followed by the ADB classifier with 99%. On the WPBC dataset, the best result was obtained by the SVM classifier with an accuracy of 96%, followed by the RF classifier with 95%. Therefore, feature selection reduces the size of the data and improves the results.

Fig. 26 A comparison of accuracy between the WPBC, WDBC, and Coimbra datasets with PSO feature selection

6.1 Comparing the Results of Classifiers in Each Dataset

In this section, a comparative analysis of all evaluation metrics for each classifier on each dataset, with and without PSO feature selection, is presented.

When PSO feature selection is used, the accuracy of all classifiers on the three datasets improves. For some classifiers the accuracy with PSO feature selection may equal that without it, but in these cases the accuracy is obtained with significantly fewer features than the full feature set, which is a major advantage of PSO feature selection.

On the Coimbra dataset with 9 features, the PSO feature selection algorithm improved the accuracy of breast cancer diagnosis by 2, 3, 10, 9, 8, 2, 6, 2, 4% for the ADB, DT, KNN, LDA, LGR, LR, NB, NN, RF, and SVM classifiers, selecting 4, 6, 6, 6, 5, 6, 5, 6, 5, and 4 features, respectively (Fig. 27).

Fig. 27 Performance of classification models with and without PSO feature selection (Coimbra dataset)

On the WDBC dataset with 30 features, the PSO feature selection algorithm improved the accuracy of benign/malignant breast cancer diagnosis by 1, 1, 1, 0, 0, 1, 1, 1, 2% for the ADB, DT, KNN, LDA, LGR, LR, NB, NN, RF, and SVM classifiers, selecting 21, 14, 16, 14, 9, 21, 15, 17, 19, and 16 features, respectively (Fig. 28).

Fig. 28 Performance of classification models with and without PSO feature selection (WDBC dataset)

On the WPBC dataset with 33 features, the PSO feature selection algorithm improved the accuracy of breast cancer recurrence diagnosis by 3, 2, 4, 1, 2, 1, 3, 2, 3, 2% for the ADB, DT, KNN, LDA, LGR, LR, NB, NN, RF, and SVM classifiers, selecting 20, 20, 19, 21, 18, 23, 20, 18, 20, and 17 features, respectively (Fig. 29).

Fig. 29 Performance of classification models with and without PSO feature selection (WPBC dataset)

The other evaluation metrics also showed that the PSO feature selection algorithm improves their results.

6.2 Comparison of the Results with Related Works

Table 18 shows the results of related works along with the parameters and models used in them. This research achieved a higher level of accuracy than these similar works.

Table 18 A comparison of related works with this research

7 Conclusion and Future Work

Breast cancer is one of the leading causes of death among women in the world. A large number of features are involved in correctly diagnosing this disease. For this reason, reducing the number of these features and finding the features that are effective for breast cancer diagnosis can help patients and speed up the diagnosis process. This article focused on improving the accuracy of ten machine learning classification methods by integrating the Particle Swarm Optimization (PSO) feature selection algorithm into them. With the PSO feature selection (FS) algorithm, the machine learning classification methods are improved and the number of features is reduced. The proposed method was evaluated on three common breast cancer datasets, the Coimbra, WDBC, and WPBC datasets. The results showed that on the Coimbra dataset with 9 features, the PSO feature selection algorithm improved the accuracy of breast cancer diagnosis by 2, 3, 10, 9, 8, 2, 6, 2, 4% for the AdaBoost (ADB), Decision Tree (DT), k Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Logistic Regression (LGR), Linear Regression (LR), Naïve Bayes (NB), Artificial Neural Networks (NN), Random Forest (RF), and Support Vector Machine (SVM) classifiers, selecting 4, 6, 6, 6, 5, 6, 5, 6, 5, and 4 features, respectively. On this dataset, the best accuracy of 91% was obtained by the SVM classifier integrated with the PSO feature selection algorithm. On the WDBC dataset with 30 features, the PSO feature selection algorithm improved the accuracy of benign/malignant diagnosis by 1, 1, 1, 0, 0, 1, 1, 1, 2% for the ADB, DT, KNN, LDA, LGR, LR, NB, NN, RF, and SVM classifiers, selecting 21, 14, 16, 14, 9, 21, 15, 17, 19, and 16 features, respectively. On this dataset, the best accuracy of 100% was obtained by the RF classifier integrated with the PSO feature selection algorithm. On the WPBC dataset with 33 features, the PSO feature selection algorithm improved the accuracy of recurrence diagnosis by 3, 2, 4, 1, 2, 1, 3, 2, 3, 2% for the ADB, DT, KNN, LDA, LGR, LR, NB, NN, RF, and SVM classifiers, selecting 20, 20, 19, 21, 18, 23, 20, 18, 20, and 17 features, respectively. On this dataset, the best accuracy of 96% was obtained by the SVM classifier integrated with the PSO feature selection algorithm. The results of this research indicate that the PSO feature selection algorithm can identify the features most effective for breast cancer diagnosis, so the disease can be diagnosed with higher accuracy in a shorter time, helping to save patients' lives. For future work, it is recommended to use other machine learning methods, such as other ensemble techniques, and other meta-heuristic feature selection algorithms, and to compare their results with those of the PSO feature selection algorithm.