Social Group Optimization–Assisted Kapur’s Entropy and Morphological Segmentation for Automated Detection of COVID-19 Infection from Computed Tomography Images

The coronavirus disease (COVID-19), caused by the novel coronavirus SARS-CoV-2, has been declared a global pandemic. Due to its infection rate and severity, it has emerged as one of the major global threats of the current generation. To support the current combat against the disease, this research proposes a machine learning–based pipeline to detect COVID-19 infection using lung computed tomography scan images (CTI). The implemented pipeline consists of a number of sub-procedures ranging from segmenting the COVID-19 infection to classifying the segmented regions. The initial part of the pipeline segments the COVID-19–affected CTI using social group optimization–based Kapur’s entropy thresholding, followed by k-means clustering and morphology-based segmentation. The next part of the pipeline implements feature extraction, selection, and fusion to classify the infection. A principal component analysis–based serial fusion technique is used to fuse the features, and the fused feature vector is then employed to train, test, and validate four different classifiers, namely Random Forest, K-Nearest Neighbors (KNN), Support Vector Machine with Radial Basis Function, and Decision Tree. Experimental results using benchmark datasets show a high accuracy (> 91%) for the morphology-based segmentation task; for the classification task, the KNN offers the highest accuracy among the compared classifiers (> 87%). However, it should be noted that this method still awaits clinical validation, and therefore should not be used to clinically diagnose ongoing COVID-19 infection.


Introduction
Lung infection caused by coronavirus disease (COVID-19) has emerged as one of the major diseases and has affected over 8.2 million people globally, irrespective of their race, gender, and age. The infection and morbidity rates caused by this novel coronavirus are increasing rapidly [1,2]. Due to its severity and progression rate, the World Health Organization (WHO) has declared it a pandemic [3]. Even though an extensive number of precautionary schemes have been implemented, the occurrence rate of COVID-19 infection is rising rapidly due to various circumstances.
COVID-19 is caused by a virus called severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2), and the outbreak initially started in Wuhan, China, in December 2019 [4]. The outbreak of COVID-19 has become a worldwide problem, and a considerable number of research works are already in progress to determine solutions to manage the disease infection rate and spread. Furthermore, the recently proposed research works on (i) COVID-19 infection detection [5][6][7][8], (ii) handling of the infection [9,10], and (iii) COVID-19 progression and prediction [11][12][13] have helped get more information regarding the disease.
Earlier research and medical findings show that COVID-19 begins as an infection of the human respiratory tract and can develop into severe acute pneumonia. Existing research also confirms that the early indications of COVID-19 are subclinical, so a dedicated medical procedure is needed to detect and confirm the illness. The usual clinical-grade analysis involves collecting samples from infected persons and confirming COVID-19 with a sample-based reverse transcription-polymerase chain reaction (RT-PCR) test, together with image-guided assessment using lung computed tomography scan images (CTI) and chest X-rays [14][15][16][17]. When a patient is admitted with COVID-19 infection, the doctor initiates the prescribed treatment protocol, which aims to reduce the impact of the pneumonia.
Usually, experts recommend a series of investigative tests to identify the cause, location, and severity of pneumonia. Preliminary examinations, such as blood tests and pleural-fluid assessment, are performed clinically to detect the severity of the infection [18][19][20]. Image-assisted methods are also frequently used to map the disease in the lung; the images can then be examined by an expert physician or a computerized system to assess the severity of the pneumonia. Compared with chest X-ray, CTI is frequently preferred because of its advantages, including the 3-D view. The research published on COVID-19 also confirms the benefit of CT in detecting the disease in the respiratory tract and the associated pneumonia [21][22][23].
Recently, more COVID-19 detection methods have been proposed for identifying the progression stage of COVID-19 using RT-PCR and imaging methods. Most of these existing works combined RT-PCR with the imaging procedure to confirm and treat the disease. The recent work of Rajinikanth et al. [8] developed a computer-supported method to assess the COVID-19 lesion using lung CTI. That work implemented a few operator-assisted steps to achieve superior outcomes during the COVID-19 evaluation.
The presented work aims to:
- Propose a machine learning (ML)-driven pipeline to extract and detect the COVID-19 infection from lung CTI with improved accuracy.
- Develop a procedural sequence for the automated extraction of the COVID-19 infection from a benchmark lung CTI dataset.
- Put forward an appropriate sequence of techniques, namely tri-level thresholding using social group optimization (SGO)-based Kapur's entropy (KE), or SGO-KE, K-Means Clustering (KMC)-based separation, and morphology-based segmentation, to accurately extract the COVID-19 infection from lung CTI.
A comparison of the extracted COVID-19 infection information from the CTI using the proposed pipeline with the ground truth (GT) images confirms the segmentation accuracy of the proposed method. The proposed pipeline achieves mean segmentation and classification accuracy of more than 91% and 87% respectively using 78 images from a benchmark dataset.
This research is arranged as follows: Section "Motivation" presents the motivation, and Section "Methodology" presents the methodological details of the proposed scheme. Section "Results and Discussion" outlines the attained results and discussions. Section "Conclusion" presents the conclusion of the present research work.

Motivation
The proposed research work is motivated by earlier image examination works in the literature [35][36][37][38]. During a mass disease screening operation, the amount of medical data gradually increases; to reduce the data burden, it is essential to employ an image segregation system that categorizes the existing medical data into two or more classes and assigns priority during treatment. Recent works in the literature confirm that feature-fusion-based methods improve classification accuracy without employing complex methodologies [39][40][41]. Classification using the features of both the original image and the region-of-interest (ROI) has offered superior results on some image classification problems, and this procedure is recommended when the normal and disease class images are highly similar [24,26,31,42,43]. Hence, for such similar images, it is necessary to employ a segmentation technique to extract the ROI from the disease class image with better accuracy [26]. Finally, the features of the actual image and the ROI are fused to attain enhanced classification accuracy.

Methodology
This section of the work presents the methodological details of the proposed scheme. Like the former approaches, this work also implemented two different phases to improve the detection accuracy.

Proposed Pipeline
This work consists of the following two stages, as depicted in Fig. 1:
- Implementation of an image segmentation method to extract the COVID-19 infection.
- Execution of an ML scheme to classify the considered lung CTI database into normal/COVID-19 class.
The details of these two stages are given below.
Stage 1: Figure 2 depicts the image processing system proposed to extract the pneumonia infection in the lung due to COVID-19. Initially, the required 2D slices of the lung CTI are collected from an open-source database [44]. All the collected images are resized to 256 × 256 × 1 pixels and the normalized images are then considered for evaluation. In this work, an SGO-KE-based tri-level threshold is initially applied to enhance the lung section (see "Social Group Optimization and Kapur's Function" for details). Then, KMC is employed to segregate the thresholded image into background, artifact, and the lung segment. The unwanted lung sections are then removed using a morphological segmentation procedure, and the extracted binary image of the lung is compared with its related GT provided in the database. Finally, the essential performance measures are computed, and the performance of the proposed COVID-19 system is validated based on them.
Stage 2: Figure 3 presents the proposed ML scheme to separate the considered lung CTI into normal/COVID-19 class. This system is constructed using two different images: (i) the original test image (normal/COVID-19 class) and (ii) the binary form of the COVID-19 section.

Segmentation of COVID-19 Infection
This procedure is implemented only for the CTI associated with the COVID-19 pneumonia infection. The complete details on various stages involved in this process are depicted in Fig. 1. The series of procedures implemented in this figure are used to extract the COVID-19 infection from the chosen test image with better accuracy. The pseudo-code of the implemented procedure is depicted in Algorithm 1.

Image Thresholding
Initially, the enhancement of the infected pneumonia section is achieved by implementing a tri-level threshold based on SGO and the KE. In this operation, the role of the SGO is to randomly adjust the threshold values of the chosen image until the KE is maximized. The threshold that offers the maximized KE is considered the optimal threshold. The related information on the SGO-KE implemented in this work can be found in [45]. The SGO parameters discussed in Dey et al. [46] are considered in this work.

Social Group Optimization and Kapur's Function
SGO is a heuristic technique proposed by Satapathy and Naik [47] by mimicking the knowledge sharing concepts in humans. This algorithm employs two phases: (i) the enhancing phase, which coordinates the arrangement of people (agents) in a group, and (ii) the knowledge gaining phase, which allows the agents to find the best solution for the task. In this paper, an agent is considered a member of a social population who is generated based on the features/parameters. The mathematical description of the SGO is as follows: let $X_I$ denote the original knowledge of the $I$-th agent of a group, with $I = 1, 2, ..., N$. If the number of variables to be optimized is represented as $D$, then the initial knowledge can be expressed as $X_I = (x_{I1}, x_{I2}, ..., x_{ID})$. For a chosen problem, the objective function can be defined as $F_J$, with $J = 1, 2, ..., N$.
The updating function in SGO is:

$$X^{new}_{i,j} = \zeta \, X^{old}_{i,j} + R \, (g^{best}_{j} - X^{old}_{i,j})$$

where $X^{new}_{i,j}$ is the updated knowledge, $X^{old}_{i,j}$ is the original knowledge, $\zeta$ denotes the self-introspection parameter (assigned as 0.2), $R$ is a random number in [0,1], and $g^{best}_{j}$ is the global best knowledge. In this work, the SGO is employed to find the optimal threshold by maximizing the KE value, and this operation is defined below.
Entropy in an image is the measure of its irregularity, and for a considered image, Kapur's thresholding can be used to identify the optimal threshold by maximizing its entropy value.
Let $T_h = [t_1, t_2, ..., t_{n-1}]$ denote the threshold vector of the chosen image of a fixed dimension, and assume this image has $L$ gray levels (0 to $L - 1$) with a total of $Z$ pixels. If $f(j)$ represents the frequency of the $j$-th intensity level, then the pixel distribution of the image will be:

$$Z = f(0) + f(1) + \dots + f(L - 1)$$

The probability of the $j$-th intensity level is given by:

$$p(j) = \frac{f(j)}{Z}$$

Then, during the threshold selection, the pixels of the image are separated into groups according to the assigned threshold values, with one more group than the number of thresholds in $T_h$. After the image is separated as per the selected thresholds, the entropy of each group is computed separately and the entropies are combined to get the final entropy. The KE to be maximized is given by Eq. 4:

$$F_{KE}(T_h) = \max_{T_h} \sum_{i} G_i^C \tag{4}$$

For a tri-level thresholding problem, the expression will be given by Eq. 5:

$$F_{KE}(t_1, t_2, t_3) = \max_{t_1, t_2, t_3} \sum_{i} G_i^C \tag{5}$$

where $G_i^C$ is the entropy of the $i$-th group, given by:

$$G_i^C = -\sum_{j \in \text{group } i} \frac{P_j^C}{w_{i-1}^C} \ln\!\left(\frac{P_j^C}{w_{i-1}^C}\right)$$

where $P_j^C$ is the probability of intensity level $j$, $C$ is the image class ($C = 1$ for a grayscale image), and $w_{i-1}^C$ is the probability of occurrence of the $i$-th group.
During the tri-level thresholding, a chosen approach is employed to find $F_{KE}(T_h)$ by randomly varying the thresholds ($T_h = \{t_1, t_2, t_3\}$). In this research, the SGO is employed to adjust the thresholds to find $F_{KE}(T_h)$.
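The thresholding step can be illustrated with a short Python sketch. It is not the authors' MATLAB implementation: the function names (`kapur_entropy`, `sgo_maximize`, `sgo_kapur_thresholds`), the population size, the iteration count, and the greedy acceptance rule are assumptions made for illustration, and the acquiring phase follows the general published description of SGO rather than anything stated in this paper. The sketch also assumes that the thresholds split the histogram into one more class than the number of thresholds, which may differ in detail from the formulation used in [45].

```python
import math
import random


def kapur_entropy(hist, thresholds):
    """Sum of the entropies of the classes obtained by splitting hist."""
    total = sum(hist)
    probs = [h / total for h in hist]
    # Class boundaries: [0, t1], (t1, t2], ..., (t_last, L - 1].
    edges = [0] + [t + 1 for t in sorted(thresholds)] + [len(hist)]
    entropy = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        w = sum(probs[lo:hi])  # probability of occurrence of this class
        if w <= 0.0:
            continue  # an empty class contributes nothing
        for p in probs[lo:hi]:
            if p > 0.0:
                entropy -= (p / w) * math.log(p / w)
    return entropy


def sgo_maximize(fitness, dim, low, high, agents=20, iters=60, zeta=0.2, seed=0):
    """Maximize fitness over [low, high]^dim with a basic two-phase SGO."""
    rng = random.Random(seed)

    def clip(v):
        return min(max(v, low), high)

    def best_of(pop, fit):
        return list(pop[max(range(len(pop)), key=fit.__getitem__)])

    pop = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(agents)]
    fit = [fitness(x) for x in pop]
    for _ in range(iters):
        gbest = best_of(pop, fit)
        # Improving phase: X_new = zeta * X_old + R * (gbest - X_old).
        for i in range(agents):
            cand = [clip(zeta * x + rng.random() * (g - x))
                    for x, g in zip(pop[i], gbest)]
            f = fitness(cand)
            if f > fit[i]:  # greedy acceptance (assumption)
                pop[i], fit[i] = cand, f
        gbest = best_of(pop, fit)
        # Acquiring phase: interact with a random peer and the global best.
        for i in range(agents):
            r = rng.choice([k for k in range(agents) if k != i])
            sign = 1.0 if fit[i] > fit[r] else -1.0
            cand = [clip(xi + sign * rng.random() * (xi - xr)
                         + rng.random() * (g - xi))
                    for xi, xr, g in zip(pop[i], pop[r], gbest)]
            f = fitness(cand)
            if f > fit[i]:
                pop[i], fit[i] = cand, f
    k = max(range(agents), key=fit.__getitem__)
    return pop[k], fit[k]


def sgo_kapur_thresholds(hist, n_thresholds=3, **sgo_options):
    """Return sorted integer thresholds and the entropy they achieve."""
    def fitness(x):
        return kapur_entropy(hist, [int(round(v)) for v in x])

    best, value = sgo_maximize(fitness, n_thresholds, 0, len(hist) - 2,
                               **sgo_options)
    return sorted(int(round(v)) for v in best), value
```

For a 256-level grayscale slice, `hist` would be the list of 256 intensity counts, and `sgo_kapur_thresholds(hist)` would return three thresholds together with the entropy value they reach. Because the search is stochastic, a different `seed` can give slightly different thresholds.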

Segmentation Based on KMC and Morphological Process
The COVID-19 infection in the enhanced CTI is then separated using the KMC technique, which segregates the image into various regions [48]. In this work, the enhanced image is separated into three sections: the background, the normal image section, and the COVID-19 infection. The essential information on KMC and the morphology-based segmentation can be found in [49]. The extracted COVID-19 section is associated with artifacts; hence, the morphological enhancement and segmentation discussed in [49,50] are implemented to extract the pneumonia infection with better accuracy. KMC splits $u$ observations into $K$ groups. For a given set of observations of dimension $d$, KMC tries to split them into $K$ groups $Q = \{Q_1, Q_2, ..., Q_K\}$ (with $K \le u$) so as to minimize the within-cluster sum of squares, as depicted by Eq. 9:

$$\arg\min_{Q} \sum_{i=1}^{K} \sum_{O \in Q_i} \lVert O - \mu_i \rVert^2 \tag{9}$$

where $O$ is an observation, $Q$ is the set of splits, and $\mu_i$ is the mean of the points in $Q_i$.
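A minimal Python sketch of this step is given here, under several assumptions that the text does not confirm: clustering is done on pixel intensity alone, the infection is taken to be the brightest of the three clusters, and the morphological step is a 3 × 3 opening (erosion followed by dilation). The names `kmeans_1d`, `opening`, and `segment_brightest` are hypothetical, and the procedures in [49,50] may differ.

```python
def kmeans_1d(values, k=3, iters=50):
    """Cluster scalar intensities into k groups; returns labels and centers."""
    lo, hi = min(values), max(values)
    # Deterministic start: centers spread evenly over the intensity range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2)
                  for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return labels, centers


def _morph(mask, combine):
    """Apply a 3 x 3 neighborhood rule; out-of-image pixels count as 0."""
    h, w = len(mask), len(mask[0])
    return [[int(combine(
                0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx] == 1
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
             for x in range(w)] for y in range(h)]


def opening(mask):
    """Erosion (all neighbors set) followed by dilation (any neighbor set)."""
    return _morph(_morph(mask, all), any)


def segment_brightest(image, k=3):
    """Binary mask of the brightest intensity cluster, cleaned by an opening."""
    flat = [p for row in image for p in row]
    labels, centers = kmeans_1d(flat, k)
    target = max(range(k), key=lambda c: centers[c])
    w = len(image[0])
    mask = [[int(labels[y * w + x] == target) for x in range(w)]
            for y in range(len(image))]
    return opening(mask)
```

The opening removes isolated bright pixels smaller than the 3 × 3 structuring element while leaving larger connected regions largely intact, which is one simple way to suppress the artifacts mentioned above.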

Performance Computation
The outcome of the morphological segmentation is a binary image. This binary image is compared against the binary form of the GT, and the essential performance measures, such as accuracy, precision, sensitivity, specificity, and F1-score, are computed. The same procedure is applied to all 78 images of the benchmark COVID-19 database, and the mean values of these measures are then used to confirm the segmentation accuracy of the proposed technique. The essential information on these measures is presented in [51,52].
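The comparison between a segmented mask and its GT can be sketched as a pixel-wise count of agreements and disagreements. This is an illustrative sketch only; the function names `confusion_counts` and `mean_pixel_accuracy` are hypothetical, and masks are assumed to be equally sized nested lists of 0/1 values.

```python
def confusion_counts(pred_mask, gt_mask):
    """Pixel-wise TP, TN, FP, FN between a predicted mask and its GT."""
    tp = tn = fp = fn = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1   # infection pixel found in both masks
            elif p:
                fp += 1   # predicted infection, GT says background
            elif g:
                fn += 1   # missed infection pixel
            else:
                tn += 1   # background in both masks
    return tp, tn, fp, fn


def mean_pixel_accuracy(mask_pairs):
    """Average per-image pixel accuracy over (predicted, GT) mask pairs."""
    scores = []
    for pred_mask, gt_mask in mask_pairs:
        tp, tn, fp, fn = confusion_counts(pred_mask, gt_mask)
        scores.append((tp + tn) / (tp + tn + fp + fn))
    return sum(scores) / len(scores)
```

Averaging per image, as done here, weights every image equally regardless of how large its infected region is; pooling the counts over all images first would be a different (also reasonable) design choice.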

Implementation of Machine Learning Scheme
The ML procedure implemented in this research is outlined in this section. This scheme implements a series of procedures on the original CTI (normal/COVID-19 class) and the segmented binary form of the COVID-19 infection as depicted in Fig. 2. The main objective of this ML scheme is to segregate the considered CTI database into normal/COVID-19 class images. The process is shown in Algorithm 2.

Initial Processing
The initial processing of the considered image dataset is executed individually for the test image and the segmented COVID-19 infection. It involves extracting the image features using a chosen methodology and forming a one-dimensional feature vector (FV) from the chosen dominant features.

Feature Vector 1 (FV1):
The accuracy of disease detection using the ML technique depends mainly on the considered image information. In the literature, a number of image feature extraction procedures are discussed to examine a class of medical images [35][36][37][39][40][41][42]. In this work, well-known image feature extraction methods, namely the Complex-Wavelet-Transform (CWT), Discrete-Wavelet-Transform (DWT), and Empirical-Wavelet-Transform (EWT), are considered in the 2-D domain to extract the features of the normal/COVID-19 class grayscale images. The CWT, DWT, and EWT are discussed in earlier works [52]. After extracting the essential features using these methods, a statistical evaluation and Student's t test-based validation is implemented to select the dominant features and create the essential FVs, namely $FV_{CWT}$ (34 features), $FV_{DWT}$ (32 features), and $FV_{EWT}$ (3 features). These are combined into the principal FV1 set (FV1 = 69 features) by sorting and arranging the features based on their p value and t value. The feature selection process and FV1 creation are implemented as discussed in [52].
- CWT: This function was derived from the Fourier transform and is represented using a complex-valued scaling function and a complex-valued wavelet as defined below:

$$\psi_C(t) = \psi_R(t) + j\,\psi_I(t)$$

where $\psi_C(t)$, $\psi_R(t)$, and $\psi_I(t)$ represent the complex, real, and imaginary parts, respectively.
- DWT: This approach evaluates non-stationary information. When a wavelet has the function $\psi(t) \in W^2(r)$, then its DWT (denoted by $DWT(a, b)$) can be written as:

$$DWT(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \tag{11}$$

where $x(t)$ is the signal being analyzed, $\psi(t)$ is the principal wavelet, the symbol $*$ denotes the complex conjugate, and $a$ and $b$ ($a, b \in R$) are the dilation and translation parameters, respectively.
- EWT: The Fourier spectrum of EWT in the range 0 to $\pi$ is segmented into $M$ regions. Each limit is denoted as $\omega_m$ (where $m = 1, 2, ..., M$), in which the starting limit is $\omega_0 = 0$ and the final limit is $\omega_M = \pi$. The transition phase $T_m$ centered around $\omega_m$ has a width of $2\tau_m$, where $\tau_m = \lambda\omega_m$ for $0 < \lambda < 1$. Other information on EWT can be found in [53].
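As a concrete illustration of wavelet-based feature extraction, the following Python sketch applies a single-level 2-D Haar transform and summarizes each sub-band by its mean and mean energy. The Haar wavelet, the single decomposition level, and the two statistics are assumptions chosen for brevity; the paper does not state which wavelet or statistics were used, and the function names are hypothetical. Image dimensions are assumed to be even (as with 256 × 256 inputs).

```python
def haar_step(signal):
    """One level of the orthonormal Haar transform of a 1-D sequence."""
    s = 2 ** 0.5
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / s for a, b in pairs]   # low-pass (averages)
    detail = [(a - b) / s for a, b in pairs]   # high-pass (differences)
    return approx, detail


def haar_dwt2(image):
    """Single-level 2-D Haar DWT: transform rows first, then columns."""
    def along_columns(block):
        cols = [haar_step(list(col)) for col in zip(*block)]
        approx = [list(row) for row in zip(*(a for a, _ in cols))]
        detail = [list(row) for row in zip(*(d for _, d in cols))]
        return approx, detail

    rows = [haar_step(row) for row in image]
    # First letter: filter along rows; second letter: filter along columns.
    ll, lh = along_columns([a for a, _ in rows])
    hl, hh = along_columns([d for _, d in rows])
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}


def subband_features(image):
    """Mean and mean energy of each sub-band (eight values in total)."""
    bands = haar_dwt2(image)
    feats = []
    for name in ("LL", "LH", "HL", "HH"):
        coeffs = [c for row in bands[name] for c in row]
        feats.append(sum(coeffs) / len(coeffs))
        feats.append(sum(c * c for c in coeffs) / len(coeffs))
    return feats
```

Because the Haar filters used here are orthonormal, the total energy of the four sub-bands equals the energy of the input image, which is a convenient sanity check when experimenting with such features.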

Feature Vector 2 (FV2):
The essential information from the binary form of the COVID-19 infection image is extracted using the feature extraction procedure discussed in Bhandary et al. [35], which obtains the essential binary features using the Haralick and Hu techniques. This method yields 27 features ($F_{Haralick}$ = 18 features and $F_{Hu}$ = 9 features), and the combination of these features gives the 1D FV2 (FV2 = 27 features).
- Haralick features: Haralick features are computed using a Gray Level Co-occurrence Matrix (GLCM). The GLCM is a matrix in which the number of rows and columns depends on the number of gray levels ($G$) of the image. The matrix component $P(i, j \mid \Delta x, \Delta y)$ is the relative frequency with which two pixels separated by a pixel distance $(\Delta x, \Delta y)$ occur with gray levels $i$ and $j$. If $\mu_x$ and $\mu_y$ represent the means and $\sigma_x$ and $\sigma_y$ represent the standard deviations of $P_x$ and $P_y$, then:

$$\mu_x = \sum_{i=0}^{G-1} i\,P_x(i), \quad \mu_y = \sum_{j=0}^{G-1} j\,P_y(j), \quad \sigma_x^2 = \sum_{i=0}^{G-1} (i - \mu_x)^2 P_x(i), \quad \sigma_y^2 = \sum_{j=0}^{G-1} (j - \mu_y)^2 P_y(j)$$

where $P_x(i)$ and $P_y(j)$ are the $i$-th and $j$-th entries of the marginal distributions of the matrix. These parameters can be used to extract the essential texture and shape features from the considered grayscale image.
- Hu moments: For a two-dimensional (2D) image, the 2D $(i + j)$-th order moments can be defined as:

$$M_{ij} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^i y^j f(x, y)\, dx\, dy \tag{13}$$

for $i, j = 0, 1, 2, ...$ If the image function $f(x, y)$ is piecewise continuous, then the moments of all orders exist and the moment sequence $M_{ij}$ is uniquely determined. Other information on Hu moments can be found in [35].
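The two building blocks described above, the GLCM and the raw image moments, can be sketched in a few lines of Python. This is an illustration under assumptions rather than the procedure of [35]: the function names are hypothetical, the image is assumed to hold integer gray levels in `0..levels-1`, the offset is assumed to fit inside the image, and only one Haralick-style statistic (contrast) is shown alongside the marginal means.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix P(i, j | dx, dy) of a gray-level image."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(image), len(image[0])
    total = 0
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[image[y][x]][image[y2][x2]] += 1
                total += 1
    return [[c / total for c in row] for row in counts]


def glcm_marginal_means(p):
    """Means mu_x and mu_y of the row and column marginals of the GLCM."""
    mu_x = sum(i * sum(row) for i, row in enumerate(p))
    mu_y = sum(j * sum(col) for j, col in enumerate(zip(*p)))
    return mu_x, mu_y


def glcm_contrast(p):
    """Contrast: co-occurrence frequencies weighted by squared level gap."""
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))


def raw_moment(image, i, j):
    """Discrete raw moment M_ij = sum over pixels of x^i * y^j * f(x, y)."""
    return sum((x ** i) * (y ** j) * value
               for y, row in enumerate(image)
               for x, value in enumerate(row))
```

For a binary mask, `raw_moment(mask, 0, 0)` is the area of the region, and dividing `raw_moment(mask, 1, 0)` and `raw_moment(mask, 0, 1)` by that area gives its centroid; the Hu invariants are built from moments centered on this point.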

Fused Feature Vector (FFV):
In this work, the original test image provides FV1 and the binary form of the COVID-19 infection provides FV2. To implement a classifier, it is essential to have a single feature vector with a pre-defined dimension.
In this work, a principal component analysis (PCA)-based serial fusion is implemented to attain a 1D FFV (69 + 27 = 96 features) by combining FV1 and FV2, and this feature set is then used to train, test, and validate the classifier system implemented in this study. The complete information on feature fusion based on serial fusion can be found in [35,54].
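Serial fusion amounts to concatenating the two vectors, as the short Python sketch here shows. The function names are hypothetical, and the column centering is included only because mean-centering is the usual first step of PCA; the PCA projection itself, and however it is combined with the fusion in [35,54], is not reproduced.

```python
def serial_fuse(fv1, fv2):
    """Concatenate two feature vectors into one fused vector."""
    return list(fv1) + list(fv2)


def center_columns(vectors):
    """Subtract the per-feature mean across a list of fused vectors."""
    n = len(vectors)
    means = [sum(col) / n for col in zip(*vectors)]
    return [[v - m for v, m in zip(vec, means)] for vec in vectors]


# With the dimensions reported in the text, the fused vector has 96 entries.
ffv = serial_fuse([0.0] * 69, [0.0] * 27)
print(len(ffv))
```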

Classification
Classification is one of the essential parts of a variety of ML and deep learning (DL) techniques implemented to examine a class of medical datasets. The role of the classifier is to segregate the considered medical database into two-class or multi-class information. In the proposed work, the Random-Forest (RF), Support Vector Machine-Radial Basis Function (SVM-RBF), K-Nearest Neighbors (KNN), and Decision Tree (DT) classifiers are considered. The essential information on the implemented classifier units can be found in [35,36,45,52]. A fivefold cross-validation is implemented and the best result among the trials is chosen as the final classification result.
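To make the evaluation protocol concrete, the following Python sketch runs a plain KNN classifier under fivefold cross-validation. It is a minimal illustration, not the authors' MATLAB setup: the function names, `k = 3`, the Euclidean distance, the shuffling seed, and the interleaved fold assignment are all assumptions.

```python
import random


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training vectors (Euclidean)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    votes = {}
    for i in order[:k]:
        votes[train_y[i]] = votes.get(train_y[i], 0) + 1
    return max(votes, key=votes.get)


def cross_validate(xs, ys, folds=5, k=3, seed=0):
    """Return the accuracy obtained on each of the held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        test = idx[f::folds]              # every folds-th shuffled sample
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                      for i in test)
        scores.append(correct / len(test))
    return scores
```

Following the text, `max(cross_validate(xs, ys))` would be reported as the final result; reporting the mean over the folds instead is a common, more conservative alternative.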

Validation
From the literature, it can be noted that the performance of ML and DL-based data analysis is normally confirmed by computing the essential performance measures [35,36]. In this work, the common performance measures, namely accuracy (14), precision (15), sensitivity (16), specificity (17), F1-score (18), and negative predictive value (NPV) (19), are computed.
The mathematical expressions for these measures are as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{14}$$

$$Precision = \frac{TP}{TP + FP} \tag{15}$$

$$Sensitivity = \frac{TP}{TP + FN} \tag{16}$$

$$Specificity = \frac{TN}{TN + FP} \tag{17}$$

$$F1\text{-}score = \frac{2TP}{2TP + FP + FN} \tag{18}$$

$$NPV = \frac{TN}{TN + FN} \tag{19}$$

where $TP$ = true positive, $TN$ = true negative, $FP$ = false positive, and $FN$ = false negative.
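These measures follow directly from the four counts. The Python sketch here uses the standard definitions; the function name is hypothetical, and the counts in the usage line are made-up numbers for illustration, not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard measures computed from the four confusion-matrix counts."""
    def ratio(num, den):
        # Undefined measures (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1_score": ratio(2 * tp, 2 * tp + fp + fn),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical counts for a 100-image test set (illustration only).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics["accuracy"], metrics["sensitivity"], metrics["specificity"])
```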

COVID-19 Dataset
The clinical-level diagnosis of the COVID-19 pneumonia infection is normally assessed using an imaging procedure. In this research, lung CTI are considered for the examination, and these images are resized to 256 × 256 × 1 pixels to reduce the computational complexity. This work considered 400 grayscale lung CTI (200 normal and 200 COVID-19 class images) for the assessment. This research initially considered the benchmark COVID-19 database of [44]. This dataset consists of 100 2D lung CTI along with their GT; in this research, only 78 images are considered for the assessment, and the remaining 22 images are discarded due to their poor resolution and the associated artifacts. The remaining COVID-19 CTI (122 images) are collected from the Radiopaedia database [55] from cases 3 [56], 8 [57], 23 [58], 10 [59], 27 [60], 52 [61], 55 [62], and 56 [63].
The normal class images of the 2D lung CTI have been collected from The Lung Image Database Consortium-Image Database Resource Initiative (LIDC-IDRI) [64][65][66] and The Reference Image Database to Evaluate therapy Response-The Cancer Imaging Archive (RIDER-TCIA) [66,67] database and the sample images of the collected dataset are depicted in Figs. 4 and 5. Figure 4 presents the test image and the related GT of the benchmark CTI. Figure 5 depicts the images of the COVID-19 [55] and normal lung [64,67] CTI considered for the assessment.

Results and Discussion
The experimental results obtained in the proposed work are presented and discussed in this section. The developed system is executed in MATLAB (www.mathworks.com) on a workstation with an Intel i5 2.GHz processor, 8GB RAM, and 2GB VRAM. Experimental results of this study confirm that this scheme requires a mean time of 173 ± 11 s to process the considered CTI dataset, and the processing time can be reduced by using a workstation with higher computational capability. The advantage of this scheme is that it is fully automated and does not require operator assistance during execution.
The proposed research initially executes the COVID-19 infection segmentation task using the benchmark dataset of [44]. The results attained using a chosen trial image are depicted in Fig. 6. Figure 6a depicts the sample image of dimension 256 × 256 × 1, and Fig. 6b and c depict the actual and the binary forms of the GT image. The result attained with the SGO-KE-based tri-level threshold is depicted in Fig. 6d. Later, the KMC is employed to segregate Fig. 6d into three different sections and the separated images are shown in Fig. 6e-
A similar procedure is implemented for the other images of this dataset, and the mean performance measures attained for the whole benchmark database (78 images) are depicted in Fig. 7. From this figure, it is evident that the segmentation accuracy attained for this dataset is higher than 91%, and in the future
To improve the detection accuracy, the feature vector size is increased by considering the FFV (1 × 96 features) and a similar procedure is repeated. The obtained results (as in Table 1, bottom three rows) with the FFV confirm that the increase in features improves the detection accuracy considerably, and the KNN classifier offers an improved accuracy (higher than 87%) compared with the RF, SVM-RBF, and DT. The precision and the F1-score offered by the RF are superior compared with the alternatives.
The experimental results attained with the proposed ML scheme show that this methodology helps achieve better classification accuracy on the considered lung CTI dataset. The accuracy attained with the chosen classifiers for FV1 and FFV is depicted in Fig. 8. The future scope of the proposed method includes (i) implementing the proposed ML scheme to test clinically obtained CTI of COVID-19 patients; (ii) enhancing the performance of the implemented ML technique by considering other feature extraction and classification procedures existing in the literature; (iii) implementing and validating the performance of the proposed ML scheme against other ML techniques existing in the literature; and (iv) implementing an appropriate DL architecture to attain better detection accuracy on the benchmark as well as the clinical-grade COVID-19 infected lung CTI.

Conclusion
The aim of this work has been to develop an automated detection pipeline to recognize the COVID-19 infection from lung CTI. This work proposes an ML-based system to achieve this task. The proposed system executes a sequence of procedures ranging from image pre-processing to classification to develop a better COVID-19 detection tool. The initial part of the work implements an image segmentation procedure with SGO-KE thresholding, KMC-based separation, morphology-based COVID-19 infection extraction, and a comparison between the extracted COVID-19 sections and the GT. The segmentation achieved an overall accuracy higher than 91% on a benchmark CTI dataset. Later, an ML scheme with essential procedures such as feature extraction, feature selection, feature fusion, and classification is implemented on the considered data, and the proposed scheme with the KNN classifier achieved an accuracy higher than 87%.

Acknowledgments
The authors of this paper would like to thank Medicalsegmentation.com and Radiopaedia.org for sharing the clinical-grade COVID-19 images.

Author Contributions
This work was carried out in close collaboration between all co-authors. ND, VR, MSK, and MM first defined the research theme and contributed an early design of the system. ND and VR further implemented and refined the system development. ND, VR, SJF, MSK, and MM wrote the paper. All authors have contributed to, seen, and approved the final manuscript.

Compliance with Ethical Standards
Conflict of Interest
All authors declare that they have no conflict of interest.

Ethical Approval
All procedures reported in this study were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Informed Consent
This study used secondary data; therefore, the informed consent does not apply.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.