1 Introduction

Multiple sclerosis (MS) is an inflammatory disease in which the myelin sheaths of nerve cells in the brain and spinal cord are damaged [1,2,3,4,5,6]. This damage can disrupt the parts of the nervous system responsible for communication and causes many signs and symptoms, including physical disabilities [7]. Regional estimates suggest that the prevalence of the disease is moderate and falls within the range reported for European and Far East countries [8]. Symptoms of MS appear in several forms; new symptoms occur either in discrete attacks that partially or fully remit (relapsing forms) or accumulate gradually over time (progressive forms) [9,10,11]. Initially, recovery from attacks is almost complete, but over time each attack leaves behind neuropsychiatric disabilities of varying degrees [12]. MRI is the most practical method for detecting the lesions left in the brain and can greatly assist the specialist. However, MR brain images are sometimes open to differing interpretations, so an accurate statistical analysis of the underlying signals is necessary. Moreover, owing to the inherent nature of MRI, it is not always possible to distinguish lesions caused by MS from those of Alzheimer's disease and other common brain disorders. If MS lesions can be detected at an early stage, the disease can be treated more effectively.

Detecting lesions manually demands speed, time, and high accuracy from the neurologist, and as the number of brain MR images grows, diagnostic efficiency drops significantly. Inadequate contrast and image clarity, together with the similarity of disease-induced lesions to other brain tissues, lead to differing interpretations of brain MRI images; a precise analysis of the acquired images is therefore required. Employing a robust automated approach can thus provide satisfactory outcomes while also reducing the time needed for diagnosis. The major contributions of this MR processing work lie in applying feature extraction and classification to distinguish lesions from non-lesions, as shown in Fig. 1.

Fig. 1

Processing steps utilized in the lesions and non-lesions categorization

Several strategies have been proposed to date to automate intelligent brain disease diagnosis, especially for MS. Among the studies on the detection, separation, and classification of MS lesions, Ballin et al. [13] mapped three-dimensional MS lesions onto the MR image and performed classification by combining methods. Zhang et al. [14] used texture analysis to diagnose MS, extracting features from brain MR images and then selecting the best ones. The best features were analyzed with linear methods and statistical analysis; with an intelligent texture analysis method, the diagnostic accuracy on MR images reached up to 88%.

Roy et al. [15] provided an automated method using texture features and support vector machines, in which brain lesion separation was performed first and classification followed. The strength of their work was the use of the cumulative distribution function and the normalization of the region of interest to improve image quality. Texture features, local brightness, and initial spatial information served as the main features; in fact, classification was carried out on image pixels rather than on the image as a whole.

Elliott et al. [16] proposed consistent probabilistic detection of new MS lesions in MR images using a series of sequential MR scans. Their work was a two-stage classification: first, a classifier estimated the probability of each voxel being a lesion, and the candidate lesions were then refined in a second stage. Cabezas et al. [17] suggested carrying out the classification with a boosting technique applied to a set of features. They prepared MR images from 45 subjects over three recording periods. This set of techniques led to a typically precise separation, revealing the parts of the MR image with disease potential.

Sweeney et al. [18] compared different machine learning methods to build the best feature vector, using a multi-sequence structure of MR images. This resulted in a fairly precise separation of the images and of the MS lesions. Ardakani et al. [19] investigated MS diagnosis through texture analysis. They extracted features from MR images of 50 patients, initially by analyzing the principal components. They then performed classification with a linear separation method, obtaining a sensitivity of 100% and an area under the ROC curve of 1. Separating a region of interest and classifying three classes were among the innovations of their work.

Liu et al. [20] diagnosed MS by removing confounding factors that carried no information about the lesion area, using constrained clustering. Weygandt et al. [21] pursued biomarkers based on segmentations such as thresholding and on combining atlas images with brain MR to locate brain lesions, making it possible to identify the early stages of lesion formation from the images. Karimaghaloo et al. [22] proposed a system based on conditional random field classification to separate areas suspected of containing MS-related lesions, which could distinguish even small regions considered as lesions.

Brosch et al. [23] used three-dimensional deep convolutional encoder networks to detect MS lesions. Zhang et al. [24] relied on methods such as the stationary wavelet transform together with entropy and statistical characteristics. They suggested the decision tree, the k-NN classifier, and support vector machines for the classification process, achieving an accuracy of 97.9% on their data sets, with sensitivity and specificity values above 95%. Two-dimensional separation followed by 3D mapping and final detection through deep convolutional neural networks has also been proposed [25].

Recently, Gheshlaghi et al. [26] proposed a superpixel-segmentation-based technique for multiple sclerosis lesion detection. Their study uses an SVM with polynomial kernels as an effective classifier to better distinguish the specified decision classes. They also use the discrete wavelet transform (DWT) to extract local features from the analyzed MR images.

Valverde et al. [27] improved automated multiple sclerosis lesion segmentation with a cascaded 3D convolutional neural network approach. They propose an automated white matter (WM) lesion segmentation method for MS patient images; their approach relies on a cascade of two 7-layer convolutional neural networks. A comparison of previous approaches over the last few years is shown in Table 1.

Table 1 The comparison of the previous approaches

In this study, the different lesions in the MR images are segmented using a clustering method, and features of the segmented areas are then extracted. The research pursues the following objectives: (1) detection of MS lesions from magnetic resonance images in two classes, disease-free and disease incidence; (2) improvement of the feature vector structure by aggregating results from descriptors such as the fractal descriptor and PZM. Additionally, we aim to demonstrate the following hypotheses: (1) compatible features extracted by efficient descriptors decrease the time and space complexity of building the prediction model; moreover, feature subset selection by a differential evolution algorithm improves the prediction accuracy of the ELM classifier and reduces the false prediction ratio; (2) appropriate ELM configuration parameters, such as the number of neurons in the hidden layer and the training error, can be optimized by the memetic meta-heuristic SFLA. The performance of the proposed methods is analyzed with a focus on the dimensionality reduction of the extracted features and the improvement of classification accuracy on MR images.

The rest of this paper is organized as follows: Sect. 2 describes the proposed method and the different parts of the algorithm. Section 3 covers the implementation of the algorithm and the experimental results. Finally, the conclusion of the paper is presented in Sect. 4.

2 Materials and methods

Separation is an image processing concept applied within the image segmentation process. We first describe the MRI data used to evaluate the proposed segmentation and classification approach. Then, the different steps of the proposed classification process are detailed. The proposed stages are summarized in Fig. 2.

Fig. 2

Proposed model steps for brain lesion and non-lesion classification

2.1 Pre-processing

In the preprocessing stage, two steps are applied to the slices received from each subject: histogram equalization and segmentation by csFCM.

2.1.1 Histogram equalization

The multi-scale Retinex transformation, an enhancement method with acceptable preprocessing results, was used for histogram equalization. In this transform, the image brightness pattern is first computed for each pixel as the weighted mean of its neighbors within a specified radius. The multi-scale Retinex transformation is expressed by Eq. (1) [37]:

$$L_{(i_{cent} ,j_{cent} )} = \frac{\sum\limits_{(i,j)\, \in \,neighborhood\,of\,cent} I(i,j) \cdot W(i,j)}{\sum\limits_{(i,j)\, \in \,neighborhood\,of\,cent} W(i,j)}$$
(1)

where I is the input image received from the MRI device and W is the weight matrix. The neighborhood weights (NW) are calculated for each pixel (icent, jcent) from its neighboring pixels according to Eq. (2) [37]:

$$NW_{i,j} = e^{ - \left( {Dist_{i,j} /Radius} \right)^{2} } ,\quad Dist_{i,j} = \left[ {(i_{cent} - i)^{2} + (j_{cent} - j)^{2} } \right]^{0.5}$$
(2)

The subscript cent refers to the coordinates of the pixel whose brightness is being calculated, and Radius is the default neighborhood radius, for which different values can be adopted. Finally, the output image is obtained by subtracting the brightness pattern from the original image in the logarithmic domain. In other words, R and L are, respectively, the reflection component and the brightness component of the image:

$$Log(R) = Log(I) - Log(L)$$
(3)
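For illustration, the following sketch applies Eqs. (1) to (3) with a pure-NumPy weighted neighborhood mean. The radius value and the eps stabilizer are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def retinex_normalize(img, radius=15, eps=1e-6):
    """Estimate illumination as a weighted neighborhood mean (Eqs. 1-2)
    and return log(R) = log(I) - log(L) (Eq. 3)."""
    # Weight matrix W of Eq. (2): w = exp(-(dist / radius)^2)
    ax = np.arange(-radius, radius + 1)
    w = np.exp(-(ax[None, :] ** 2 + ax[:, None] ** 2) / radius ** 2)

    padded = np.pad(img.astype(float), radius, mode="reflect")
    h, wd = img.shape
    num = np.zeros((h, wd))
    # Weighted mean over each pixel's neighborhood (Eq. 1)
    for di in range(2 * radius + 1):
        for dj in range(2 * radius + 1):
            num += w[di, dj] * padded[di:di + h, dj:dj + wd]
    L = num / w.sum()
    return np.log(img + eps) - np.log(L + eps)  # Eq. (3)
```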

2.1.2 csFCM

The conditional spatial fuzzy C-means (csFCM) method is a clustering technique for segmenting magnetic resonance images [38]; it works well on noisy images and provides good results. The csFCM procedure used for region-of-interest (ROI) segmentation is displayed in Algorithm 1.

In the equations of Algorithm 1, xk denotes pixel k of the magnetic resonance image. The conditional spatial FCM algorithm with parameters p and q is denoted csFCMp,q; note that csFCM1,0 is identical to the conventional FCM algorithm.
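Algorithm 1 itself is given in [38]; as a rough illustration of the spatial weighting idea, the sketch below implements a simplified spatial FCM in the spirit of csFCMp,q, where memberships u are combined with a neighborhood function h so that u' is proportional to u^p h^q, and p = 1, q = 0 reduces to plain FCM. The 3 × 3 neighborhood and intensity-only features are simplifying assumptions; the full conditional formulation of [38] is richer.

```python
import numpy as np

def spatial_fcm(img, c=3, p=1, q=1, m=2.0, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    x = img.astype(float).ravel()
    centers = rng.uniform(x.min(), x.max(), c)
    h_, w_ = img.shape
    for _ in range(n_iter):
        # Standard FCM memberships: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        d = np.abs(x[None, :] - centers[:, None]) + 1e-9
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=0, keepdims=True)
        # Spatial function: summed membership over a 3x3 neighborhood
        um = u.reshape(c, h_, w_)
        hp = np.pad(um, ((0, 0), (1, 1), (1, 1)), mode="edge")
        hfun = sum(hp[:, i:i + h_, j:j + w_]
                   for i in range(3) for j in range(3))
        # Combine memberships: u' ~ u^p * h^q  (p=1, q=0 gives plain FCM)
        u2 = (um ** p) * (hfun ** q)
        u2 /= u2.sum(axis=0, keepdims=True)
        u = u2.reshape(c, -1)
        centers = (u ** m @ x) / (u ** m).sum(axis=1)
    return u.reshape(c, h_, w_), centers
```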

2.2 Feature extraction and selection

In the feature extraction and selection process, the feature vector is constructed from the pre-processed images by combining the features extracted with pseudo-Zernike moments and the fractal algorithm; a modified DE algorithm then reduces the dimension of the extracted vector.

2.2.1 Pseudo-Zernike moments (PZm)

Pseudo-Zernike moments (PZMs) are based on a set of complex polynomials that form an orthogonal set over the unit disk (x2 + y2 ≤ 1). The polynomials are denoted Vnm(x, y) and defined by Eq. (4) [39]:

$$V_{nm} (x,y) = V(\rho ,\theta ) = R_{nm} (\rho )e^{jm\theta }$$
(4)

where j = (− 1)0.5, θ = tan−1(y/x), |ρ| ≤ 1, n ≥ 0, |m| ≤ n, n − |m| is even, and R is a radial polynomial. Here ρ is the length of the vector from the origin to the point (x, y), and θ is the counterclockwise angle between the vector ρ and the x axis. In this relation, n is a non-negative integer giving the polynomial order, and m is a positive or negative integer giving the order of the angular repetition; its absolute value is always smaller than or equal to n, and n − |m| is always even. The radial polynomials Rnm are calculated according to Eqs. (5) and (6) [39]:

$$R_{nm} (x,y) = \sum\limits_{s = 0}^{(n - \left| m \right|)/2} {S_{n,\left| m \right|,s} \,(x^{2} + y^{2} )^{(n - 2s)/2} }$$
(5)
$$S_{n,\left| m \right|,s} = ( - 1)^{s} \frac{{(n - s)\text{!}}}{{s\text{!}\left( {\frac{n + \left| m \right|}{2} - s} \right)\text{!}\left( {\frac{n - \left| m \right|}{2} - s} \right)\text{!}}}$$
(6)

Zernike moments (ZMs) map the image onto a set of complex Zernike polynomials. An important property of Zernike moments is their orthogonality; for this reason, image features can be represented without redundancy or overlap between moments. Rotating an image has no effect on the magnitudes of its Zernike moments. The complex Zernike moment of order n with repetition m is calculated using Eq. (7) [39]:

$$ZM_{nm} = \frac{n + 1}{\pi }\sum\limits_{x}^{{}} {\sum\limits_{y}^{{}} {f(x,y)V_{nm}^{ * } } } (x,y)$$
(7)

where f(x, y) is the brightness intensity of the digital image at location (x, y) and the * sign denotes the complex conjugate.
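A hedged sketch of Eqs. (4) to (7) in Python follows: it evaluates the radial polynomial of Eqs. (5) and (6) and the moment of Eq. (7) on an image mapped to the unit disk. The grid mapping and masking conventions are common practice rather than specifics from the paper, and the code follows the Zernike form given above; the pseudo-Zernike variant differs only in the radial coefficients.

```python
import numpy as np
from math import factorial

def radial_poly(n, m, rho):
    """R_nm of Eq. (5) with coefficients S of Eq. (6)."""
    m = abs(m)
    r = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        coef = ((-1) ** s * factorial(n - s)
                / (factorial(s)
                   * factorial((n + m) // 2 - s)
                   * factorial((n - m) // 2 - s)))
        r += coef * rho ** (n - 2 * s)
    return r

def zernike_moment(img, n, m):
    """ZM_nm of Eq. (7) for an image mapped onto the unit disk."""
    h, w = img.shape
    y, x = np.mgrid[-1:1:h * 1j, -1:1:w * 1j]
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    mask = rho <= 1.0                                    # unit disk only
    v = radial_poly(n, m, rho) * np.exp(1j * m * theta)  # Eq. (4)
    return (n + 1) / np.pi * np.sum(img * np.conj(v) * mask)
```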

2.2.2 Fractal

The fractal descriptor, one of the texture descriptors of the image, is implemented in two stages based on segmentation [40,41,42]. First, the gray-level image is decomposed into a group of binary images by the two-threshold binary decomposition algorithm. This algorithm takes the image I(x, y) as input and returns binary images: for each pair of thresholds selected from T, the segmentation is carried out according to Eq. (8) [40]:

$$I_{b} (x,y) = \left\{ {\begin{array}{*{20}l} 1 & {if\;t_{l} < I(x,y) \le t_{u} } \\ 0 & {otherwise} \\ \end{array} } \right.$$
(8)

where tl and tu represent the lower and upper threshold values, respectively. By applying Eq. (8), the binary images are obtained from threshold pairs taken from T ∪ {nl}: on the one hand, consecutive pairs of thresholds, and on the other, all pairs {t, nl}, t ∈ T, where nl is the maximum possible gray level in I(x, y). The number of binary images is therefore 2nt, and the value of nt is set to 8. After applying the two-threshold binary decomposition, the fractal texture feature vector is constructed from the size of each binary region, its mean gray level, and the fractal dimension of its boundaries. The boundaries of the image regions Ib(x, y) form marginal segments, represented by Δ(x, y) and calculated as in Eq. (9) [40]:

$$\Delta (x,y) = \left\{ {\begin{array}{*{20}l} 1 & {if\;\exists \,(x^{\prime},y^{\prime}) \in N_{B} [(x,y)]:\;I_{b} (x^{\prime},y^{\prime}) = 0\; \wedge \;I_{b} (x,y) = 1} \\ 0 & {otherwise} \\ \end{array} } \right.$$
(9)

where NB[(x, y)] represents the set of pixels 8-connected to (x, y). Thus Δ(x, y) takes the value 1 if the pixel at position (x, y) has the value 1 in the binary image Ib and has at least one neighboring pixel equal to zero. Figure 3 indicates the implementation stages of the fractal algorithm. In this figure, the area features (A1, A2,…, An) and the mean gray-level features (V1, V2,…, Vn) are computed directly from the binary images, while the fractal dimension features (D1, D2,…, Dn) are computed from the border (margin) images. We select n equal to 8 because this results in a lower error rate.

Fig. 3

The implementation stages of an algorithm for extracting features from MRI slices
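The sketch below illustrates this pipeline: the two-threshold decomposition of Eq. (8), the border rule of Eq. (9), and a box-counting estimate of each border's fractal dimension. The uniform threshold spacing stands in for the multi-level threshold selection of [40] and is a simplifying assumption.

```python
import numpy as np

def borders(b):
    """Eq. (9): a pixel is border if it is 1 and any 8-neighbor is 0."""
    pad = np.pad(b, 1, constant_values=0)
    shifts = [pad[i:i + b.shape[0], j:j + b.shape[1]]
              for i in range(3) for j in range(3) if (i, j) != (1, 1)]
    has_zero_nb = np.any([s == 0 for s in shifts], axis=0)
    return (b == 1) & has_zero_nb

def box_dim(border):
    """Box-counting fractal dimension of a binary border image."""
    sizes, counts = [], []
    k, n = 2, min(border.shape)
    while k < n // 2:
        h = (border.shape[0] // k) * k
        w = (border.shape[1] // k) * k
        blocks = border[:h, :w].reshape(h // k, k, w // k, k)
        counts.append(max(blocks.any(axis=(1, 3)).sum(), 1))
        sizes.append(k)
        k *= 2
    slope, _ = np.polyfit(np.log(sizes), np.log(counts), 1)
    return -slope

def sfta_features(img, nt=8):
    """Area, mean gray level, and fractal dimension per binary image."""
    ts = np.linspace(img.min(), img.max(), nt + 2)[1:-1]  # nt thresholds
    tlist = list(ts) + [float(img.max())]
    pairs = list(zip(tlist[:-1], tlist[1:])) \
        + [(t, float(img.max())) for t in ts]             # 2*nt pairs
    feats = []
    for tl, tu in pairs:
        b = ((img > tl) & (img <= tu)).astype(int)        # Eq. (8)
        area = b.sum()
        mean_gray = img[b == 1].mean() if area else 0.0
        feats += [area, mean_gray, box_dim(borders(b))]
    return np.array(feats)
```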

2.2.3 Differential Evolution (DE) algorithm

We conduct feature selection with a modified differential evolution (DE) algorithm. In the feature selection problem, the most effective feature subset, i.e., the one yielding the better solution, is chosen. The general differential evolution algorithm is shown in Fig. 4.

Fig. 4

Steps of the DE algorithm

To increase speed while preserving the heuristic nature of the algorithm and preventing premature convergence, we establish a balance between local and global search in both the crossover and mutation vectors, using uniform random numbers in the range of zero to one. In addition, extra control parameters are eliminated, so tuning the DE algorithm for feature selection is greatly simplified without losing the automatic, stochastic character of the algorithm. In the proposed DE algorithm, the mutation vector adapts the mutation effect using feedback, and this self-adjustment increases the accuracy and speed of the algorithm. In other words, when the average fitness approaches the best fitness, the algorithm is nearing its best final value; at this point the mutation effect should be reduced so that the algorithm does not move away from the final solution. Conversely, when the difference between the average fitness and the best fitness is large, the mutation effect must be increased so that the algorithm keeps exploring the problem space.

An ELM with a wavelet kernel is used to calculate the feature selection error. In the fitness function, the initial training data are divided into new training data and validation data with K-fold cross-validation (K = 5), and the optimal subset is selected after a limited number of iterations. The error from classifying the validation data, together with the number of selected features to be minimized, forms the output of the cost function.
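A compact sketch of the adaptive DE selector described above: genomes are continuous vectors thresholded into binary feature masks, and the mutation factor F is fed back from the gap between the mean and best fitness. The specific form of F and the crossover-rate range are illustrative assumptions; per the paper, `fitness` would wrap the 5-fold validation error of the wavelet-kernel ELM plus a subset-size penalty.

```python
import numpy as np

def de_feature_select(n_feat, fitness, pop=30, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((pop, n_feat))          # continuous genomes in [0, 1]
    masks = X > 0.5                        # > 0.5 means "feature kept"
    fit = np.array([fitness(m) for m in masks])
    for _ in range(iters):
        best, mean = fit.min(), fit.mean()
        # Adaptive mutation: small F near convergence, larger otherwise
        F = 0.4 + 0.5 * (mean - best) / (abs(mean) + 1e-12)
        for i in range(pop):
            a, b, c = rng.choice([j for j in range(pop) if j != i],
                                 3, replace=False)
            v = X[a] + F * (X[b] - X[c])                # mutation
            cr = rng.random(n_feat) < rng.uniform(0.1, 0.9)
            u = np.where(cr, v, X[i]).clip(0, 1)        # crossover
            m = u > 0.5
            f = fitness(m)
            if f < fit[i]:                              # greedy selection
                X[i], fit[i], masks[i] = u, f, m
    return masks[fit.argmin()], fit.min()
```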

2.3 Evolutionary extreme learning machine

The Extreme Learning Machine (ELM) is a single-hidden-layer configuration in which the weights between the input layer and the hidden layer are assigned randomly, while the weights of the second (output) layer are computed in closed form, loosely mirroring the functioning of the nervous system. The ELM output with m hidden neurons and activation function f can be written as follows:

$$o_{j} = \sum\limits_{i = 1}^{m} {\beta_{i} f(l_{i} x_{r} + b_{i} )}$$
(10)

The algorithm is fast and can offer good overall performance [43,44,45]. To reduce learning errors in the ELM algorithm, the learning error and the output weights should be minimized simultaneously; the overall performance of the neural network then increases:

$$\begin{aligned} & Min\;\left\| {AS - C} \right\| \\ & Min\;\left\| S \right\| \\ \end{aligned}$$
(11)

the solution of which can be written as:

$$S = A^{T} (E^{ - 1} + AA^{T} )^{ - 1} C$$
(12)

where, E is the adjustment coefficient, A is the output matrix of the hidden layer, and C is the expected output matrix of the samples. Therefore, the output function of the ELM algorithm can be written as Eq. (13):

$$u(r) = v(r)A^{T} \left( {\frac{1}{E} + AA^{T} } \right)^{ - 1} C$$
(13)

If the feature vector v(r) is unspecified, the ELM kernel matrix can be rewritten based on the Mercer conditions:

$$D = AA^{T} :\quad D_{jz} = v(r_{j} ) \cdot v(r_{z} ) = b(r_{j} ,r_{z} )$$
(14)

where, u(r) is the output function of the wavelet kernel for ELM, which can be represented as Eq. (15):

$$u_{r} = [b(r,r_{1} ), \ldots ,b(r,r_{M} )]\left( {\frac{1}{E} + D} \right)^{ - 1} C$$
(15)

where D = AAT and b(r, g) is the ELM kernel function. Several kernel functions exist, including the linear, polynomial, Gaussian, and exponential kernels, but the wavelet kernel function is advantageous in simulations and performance [14]:

$$b(r,g) = \cos \,(w \times \left\| {r - g} \right\| \times x^{ - 1} )\,\,\exp ( - \left\| {r - g} \right\|^{2} \times y^{ - 1} )$$
(16)
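A minimal kernel-ELM sketch following Eqs. (12) to (16): the kernel matrix D plays the role of AAT, the output weights are obtained in closed form with regularizer E, and Eq. (16) supplies the wavelet kernel. All parameter values here (w, the x and y scales, E) are placeholders to be tuned, e.g., by SFLA as proposed below.

```python
import numpy as np

def wavelet_kernel(R, G, w=1.75, x_scale=1.0, y_scale=2.0):
    """Eq. (16): cos(w * ||r - g|| / x) * exp(-||r - g||^2 / y)."""
    d2 = ((R[:, None, :] - G[None, :, :]) ** 2).sum(-1)
    d = np.sqrt(d2)
    return np.cos(w * d / x_scale) * np.exp(-d2 / y_scale)

class KernelELM:
    def fit(self, X, C, E=100.0, **kw):
        """X: (n, d) training samples; C: (n, k) one-hot targets."""
        self.X, self.kw = X, kw
        K = wavelet_kernel(X, X, **kw)        # D = A A^T, Eq. (14)
        n = len(X)
        # alpha = (I/E + D)^{-1} C, so u(r) = k(r)^T alpha, Eq. (15)
        self.alpha = np.linalg.solve(np.eye(n) / E + K, C)
        return self

    def predict(self, Xq):
        return wavelet_kernel(Xq, self.X, **self.kw) @ self.alpha
```

For classification, the predicted class of a query row is the argmax of the returned output vector.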

Determining these parameters greatly affects the achievable accuracy. To find the best ELM structure, we let the Shuffled Frog-Leaping Algorithm (SFLA) perform the optimization. The steps of SFLA for finding the best wavelet kernel parameters of the ELM classifier are as follows:

A. Global search

In SFLA, shuffling increases the quality of the memes influenced by the different subgroups. Global and local searches alternate until the convergence condition is met. The balance between global information exchange and local search allows the algorithm to escape local minima easily and continue until the optimum is reached. Fast convergence is one of the features of the SFLA. The global search for the best parameters is conducted according to Algorithm 2.

Algorithm 2. Global search by SFLA

Step 1) Initialization: select m and n, where m is the number of memeplexes and n is the number of frogs per memeplex; the total population in the pool is thus F = m·n.

Step 2) Virtual population generation: F virtual frogs U(1), U(2),…, U(F) are sampled from the feasible space. The fitness value f(i) of each frog U(i) = (U1i, U2i,…, Udi) is calculated, where d is the number of decision variables.

Step 3) Frog ranking: frogs are sorted by fitness in descending order and stored in the array X = {U(i), f(i), i = 1,…, F}. The best frog position PX in the entire population is selected (PX = U(1)).

Step 4) Partitioning frogs into memeplexes: the array X is divided into m memeplexes (Y1, Y2,…, Ym), each containing n frogs.

Step 5) Memetic evolution in each memeplex: each memeplex (Yk, k = 1,…, m) evolves by local search.

Step 6) Shuffling memeplexes: after a given number of memetic evolutions within each memeplex, the memeplexes (Y1,…, Ym) are replaced into X so that X = {Yk, k = 1,…, m}. The best frog position PX in the population is updated. If the convergence conditions are met, the algorithm stops; otherwise it returns to step 4 of the global search.

B. Local search

In the fifth step of the global search, the evolution of each memeplex is performed n times independently. After the memeplexes have evolved, the algorithm returns to the global search for shuffling. The details of the local search within each memeplex are described below. Weights are assigned with a triangular probability distribution according to Eq. (17):

$$p_{j} = \frac{{2\left( {n + 1 - j} \right)}}{{n\left( {n + 1} \right)}}\quad j \, = \, 1, \ldots , \, n$$
(17)

where j is the index of the j-th member and n is the number of elements. To construct the submemeplex array, q frogs are randomly selected from the n frogs of each memeplex. The frogs in the submemeplex are arranged in descending order of fitness. The positions of the best and worst frogs in the submemeplex are denoted PB and Pw, respectively. The new position of the worst frog in the submemeplex, i.e., the frog with the worst performance, is calculated through Eq. (18), where S is the step size (mutation rate) of the frog, calculated as in Eq. (19):

$$U\left( q \right) = P_{w} + S$$
(18)
$$S = max\left[ {round\left( {rand \times (P_{B} - P_{w} )} \right),\;S_{max} } \right]$$
(19)

Also, we can conduct the local search based on Algorithm 3.

Algorithm 3. Local search by SFLA

Step 1) Initialization: the counters im and in are initialized to zero, where im counts the memeplexes and in counts the evolution steps.

Step 2) Generating submemeplexes: the frogs move toward optimal positions by improving their memes. The submemeplex is selected by assigning higher weights to frogs with higher performance and lower weights to frogs with lower performance; weights follow the triangular probability distribution of Eq. (17).

Step 3) Correcting the position of the worst frog according to Eq. (18): if the new position is better than the previous one, the new U(q) replaces the previous U(q) and the search goes to step 6 of the local search; otherwise it goes to step 4.

Step 4) The step size S of the frog is calculated by Eq. (19). If U(q) lies within the feasible space, the new fitness value f(q) is computed; if the new f(q) is better than the previous one, U(q) replaces the previous U(q) and the search goes to step 6. Otherwise it goes to step 5.

Step 5) Calculating the step size from PX: if no better result was obtained in step 3, the step size of the frog is recalculated using the global best PX, and the new position U(q) is obtained via Eq. (19).

Step 6) Censorship: if the new position is not feasible or is not better than the previous one, a new frog r is randomly generated at a feasible position and replaces the frog whose position could not be improved; f(r) is calculated, U(q) is set to r, and f(q) to f(r).

Step 7) Upgrading the memeplex: after the meme of the worst frog in the submemeplex has been changed, the frogs in Z are returned to their original positions in Yim, and Yim is sorted in descending order of performance. If in < n, go to step 3 of the local search; if im < m, go to step 1 of the local search; otherwise return to the global search to shuffle the memeplexes.
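The sketch below condenses Algorithms 2 and 3 into one routine: the sorted population is dealt into m memeplexes, the worst frog of each memeplex leaps toward the local best (Eqs. 18 and 19), falls back to the global best, and is finally censored (randomly reset) if neither move helps. Submemeplex sampling via the triangular distribution of Eq. (17) is omitted for brevity, so this is a simplified reading rather than the exact procedure.

```python
import numpy as np

def sfla(cost, dim, lo, hi, m=5, n=6, local_iters=10, shuffles=20, seed=0):
    rng = np.random.default_rng(seed)
    F = m * n                                   # total frogs in the pool
    P = rng.uniform(lo, hi, (F, dim))
    f = np.array([cost(p) for p in P])
    for _ in range(shuffles):
        order = np.argsort(f)                   # best (lowest cost) first
        P, f = P[order], f[order]
        for k in range(m):                      # memeplex k: ranks k, k+m, ...
            idx = np.arange(k, F, m)
            for _ in range(local_iters):
                mem = idx[np.argsort(f[idx])]
                pb, pw = P[mem[0]], P[mem[-1]]  # local best / worst frog
                step = rng.random(dim) * (pb - pw)          # Eq. (19)
                cand = np.clip(pw + step, lo, hi)           # Eq. (18)
                fc = cost(cand)
                if fc >= f[mem[-1]]:            # retry toward global best
                    step = rng.random(dim) * (P[0] - pw)
                    cand = np.clip(pw + step, lo, hi)
                    fc = cost(cand)
                if fc >= f[mem[-1]]:            # censorship: random reset
                    cand = rng.uniform(lo, hi, dim)
                    fc = cost(cand)
                P[mem[-1]], f[mem[-1]] = cand, fc
    return P[np.argmin(f)], f.min()
```

In our setting, `cost` would wrap the cross-validated error of the wavelet-kernel ELM for a candidate parameter vector such as (w, x, y, E).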

3 Experimental results

3.1 Data

The main data comprise 10 to 15 important slices selected by a specialist from each subject, drawn from a variety of people with different levels of MS, and were collected over a year and a half (2016 to 2018) at Vasei Hospital in Sabzevar, Iran. A sample of the MRI slices selected from MS patients is shown in Fig. 5. Healthy individuals are also included in this research as controls. Patients were divided into two groups, with and without a history of inflammatory pain in the head region, and various magnetic resonance imaging centers provided imaging reports. Subjects with different stages of MS had long been monitored by a physician, and the definitive symptoms of the disease were recorded. Sampling was performed randomly according to the specialist's opinion. The personal characteristics of the patients were recorded after imaging in questionnaires and stored alongside the images using DICOMDIR and the RadiAnt software, and a complete neurological and visual examination was performed on the patients. The data gathering followed a two-group cross-sectional analytical design.

Fig. 5

Two physicians definitively identified the MS lesions and even estimated their probable locations in our MR images

The study sample comprised 125 subjects (each with at least 10 slices, according to the physician's diagnosis) under 50 years of age referred to the MS clinic, the neurology ward, and the magnetic resonance imaging center of Vasei Hospital in Sabzevar; they were examined in two groups of 64 and 61 people. The 64 patients had a definitive diagnosis of MS, while the 61 remaining subjects were healthy and without complications. Two experts definitively assessed the disease and even estimated its probable location. FLAIR images, and often T1 and T2 modes, were studied.

3.1.1 Evaluations

Three measures, accuracy, sensitivity and specificity, introduced to assess the diagnostic accuracy of the proposed system, are calculated according to Eqs. (20) to (22) and used to evaluate the system:

$$Accuracy = \left( {\frac{{N_{TP} + N_{TN} }}{{N_{TP} + N_{FN} + N_{TN} + N_{FP} }}} \right)$$
(20)
$$Sensitivity = \left( {\frac{{N_{TP} }}{{N_{TP} + N_{FN} }}} \right)$$
(21)
$$Specificity = \left( {\frac{{N_{TN} }}{{N_{TN} + N_{FP} }}} \right)$$
(22)

where NTP is the number of brain MRI images containing an MS-induced lesion in which the proposed algorithm correctly diagnosed the disease. NTN denotes the number of brain MRI images without MS-induced lesions in which the proposed algorithm correctly detected the absence of disease. NFP is the number of brain MRI images that did not contain any MS-induced lesion but for which the proposed algorithm erroneously diagnosed the disease. Finally, NFN denotes the number of brain MRI images that contained an MS-induced lesion but for which the proposed algorithm erroneously ruled out the disease.
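In code, the three benchmarks follow directly from the four confusion counts:

```python
def benchmarks(n_tp, n_tn, n_fp, n_fn):
    accuracy = (n_tp + n_tn) / (n_tp + n_fn + n_tn + n_fp)  # Eq. (20)
    sensitivity = n_tp / (n_tp + n_fn)                       # Eq. (21)
    specificity = n_tn / (n_tn + n_fp)                       # Eq. (22)
    return accuracy, sensitivity, specificity
```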

3.2 Model initializing

Because the number of areas suspected of MS lesions may vary at the preprocessing stage (cases resembling lesions caused by Alzheimer's disease, tumors, pathological damage, or masses other than the target class), the experiment was repeated 10 times. All images, with their initial dimensions after pre-processing, are then fed to the optimized decision-maker network as input.

The parameters of the DE algorithm for feature selection are set based on the cost function, namely the classification error of the ELM (wavelet kernel) neural network. Table 2 lists the parameters of the DE algorithm used to select the best features.

Table 2 Initialization for an adaptive DE algorithm in feature selection

3.3 Results

The proposed algorithm was implemented in Matlab; the system used for data processing included a 2.6 GHz Duo CPU with 4 GB of RAM, and a run of the algorithm took between 10 and 20 s. The variation in accuracy stems from resizing the images; in this research three different image-resizing models were considered. As mentioned, the first-stage benchmarks obtained with the K-fold validation method are presented in Tables 3, 4 and 5 for the different dimensions. Each table comprises train and test stages, and accuracy varies between 90% and 98% as the dimensions change. Figures 6, 7, 8 and 9 show the error reduction and convergence obtained by feature selection and kernel parameter regularization, with roughly one-third of the total features of the model retained.

Table 3 The assessment of benchmarks of MS recognition model without feature selection and kernel parameters optimization for different image dimensions. The bold values are best accuracies
Table 4 The assessment of benchmarks of MS recognition model with feature selection for different image dimensions. The bold values are best accuracies
Table 5 The assessment of benchmarks of MS recognition model with feature selection and kernel parameters optimization for different image dimensions. The bold values are best accuracies
Fig. 6

The error rate and convergence of the proposed model after 500 iterations for images with 32 × 32 dimension

Fig. 7

The error rate and convergence of the proposed model after 500 iterations for images with 64 × 64 dimension

Fig. 8

The error rate and convergence of the proposed model after 500 iterations for images with 128 × 128 dimension

Fig. 9

The error rate and convergence of the proposed model after 500 iterations for images with 256 × 256 dimension

4 Discussion

Experiments were carried out in the feature selection phase, in which subsets comprising 10%, 15%, 20%, 30%, 40%, 50%, 60%, 65%, 70%, 80%, 90%, and 100% of the total features were evaluated, so that finally all features entered the calculation; Table 6 reports the resulting accuracy changes over 10 repetitions (the final 5-fold output). In the feature selection step, two 5-fold cross-validation stages were used in the cost function to generate the train, validation, and test data.

Table 6 The effect of dimension reduction in the calculated accuracy for the train and test steps. The bold values are best accuracies

Figure 10 compares feature selection approaches, each reporting its output at the lowest error rate under 5-fold cross-validation; the DE method of this study was compared with the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Ant Colony Optimization (ACO) over four repetitions, with the most optimal features computed for each. DE performance, expressed by the Kappa coefficient of Eq. (23), was more appropriate than the other solutions at the feature selection level.

Fig. 10

The performance and comparison of the DE algorithm versus GA, PSO, and ACO algorithms in the feature selection step for four random sections of test data

$$Kappa = \frac{{N\sum\nolimits_{i = 1}^{r} {x_{ii} } - \sum\nolimits_{i = 1}^{r} {(x_{i + } \times x_{ + i} )} }}{{N^{2} - \sum\nolimits_{i = 1}^{r} {(x_{i + } \times x_{ + i} )} }}$$
(23)

where N is the total number of samples, r is the number of classes, xii is the main diagonal of the confusion matrix, xi+ is the marginal sum of the rows, and x+i is the marginal sum of the columns.
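Equation (23) reads directly off the confusion matrix; a small sketch:

```python
import numpy as np

def kappa(conf):
    """Cohen's kappa from an r x r confusion matrix, as in Eq. (23)."""
    conf = np.asarray(conf, dtype=float)
    N = conf.sum()
    observed = np.trace(conf)                     # sum of x_ii
    expected = (conf.sum(1) * conf.sum(0)).sum()  # sum of x_i+ * x_+i
    return (N * observed - expected) / (N ** 2 - expected)
```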

Using other evolutionary algorithms such as PSO, ACO, or GA instead of SFLA for optimizing and training the wavelet kernel of the ELM did not yield lower errors. The main reason is the ability of SFLA to find global and local optima and its persistence in reaching the optimal solution. Additionally, the processing time of SFLA is more favorable than that of the other solutions, and its few parameters and simple implementation are the main reasons for choosing it to improve the ELM. Hence, in Fig. 11 the SFLA optimizer is also compared with the other evolutionary optimizers (PSO, ACO and GA) for improving ELM performance; on every type of data, the alternatives showed higher error due to their lower optimality.

The overall performance of the ELM algorithm after improvement by SFLA is shown in Fig. 11, where the error converges to its minimum values. Figure 12 compares the ELM classifier with linear, RBF, polynomial, and wavelet kernels as decision makers for each class of sample slices, with and without MS; the final result indicates the good performance of the wavelet kernel in distinguishing healthy individuals from MS patients. The ELM classifier with each kernel was optimized by the SFLA method.

Fig. 11

The parameters regularization of wavelet kernel by SFLA versus GA, PSO, and ACO algorithms for four random sections of test data

Fig. 12

The performance of ELM classifier with linear, RBF, polynomial and wavelet kernels for four random sections of test data

The use of the fractal and PZM descriptors for feature extraction yielded optimal responses. The primary reason for employing the proposed feature extraction methods is their high accuracy in the analysis of MR images, in addition to their novelty and limited previous application in this domain. Second, the same set of images was analyzed with comparable methods, including GLCM, LBP, and HOG, which were compared with the proposed method in terms of feature extraction. Figure 13 shows that computing performance accuracy on different test data with the fractal and PZM aggregation produces more desirable responses than conventional methods such as GLCM, LBP and HOG. The third reason is that, owing to their nature, as explained in [39, 46,47,48], these descriptors can extract features even from small-scale or rotated images or where the mass has been altered [47]. The fractal descriptor yielded high accuracy in Lahmiri's research on the detection of Alzheimer's lesions, and it has been shown that fractal features correlate well with human perception of surface roughness.

Fig. 13

Comparison between conventional feature extraction methods in MR images to detect the MS lesions for four random sections of test data

In addition to the comparison of descriptors, the repeatability of the algorithm was tested on four random datasets by calculating the accuracy, sensitivity, and specificity factors. Figure 14 shows the algorithm's repeatability, with minimal dispersion in the outcomes.

Fig. 14

The accuracy, sensitivity, and specificity factors of proposed model for four random sections of test data

The algorithm's performance was evaluated by calculating the evaluation factors and their mean values, as follows:

(A) Processing model: information processing was performed by comparing the magnetic resonance image analysis solutions at the appropriate accuracy.

(B) Repetition: the variance of the responses is low, and the un-modeled uncertainty is manageable.

(C) Processing time: the time reduction was modest during the design and evaluation phases; however, data processing accounts for half of the overall time, so the saving can be important.

With features described by the fractal and PZM methods, feature selection by the DE algorithm, and wavelet kernel optimization by SFLA, the ELM performs better than other similar classifiers. Overall, we achieved accuracies above 97% in both the train and test stages when dividing the data by K-fold cross-validation.

5 Conclusion

Considering the nature of brain magnetic resonance images, cancerous masses, brain lesions and their areas, this paper has proposed a method for MS diagnosis based on automatic processing techniques. MS-related lesions often occupy abnormal positions in MRI and, because of compression within their slice, exhibit a different brightness. Using csFCM, pixels containing probable MS lesions were separated from the others. Extracting features from these separated pixels, and from the type and shape of the lesion, with the two descriptors makes it possible to build a proper feature vector for the slices. The DE algorithm removed part of the data from the feature vector; its cost function was evaluated by partitioning the data with 5-fold cross-validation and computing the ELM classification error. The ELM classifier optimized by SFLA has better classification capability, and the stated hypotheses were confirmed. On repeating the test, a slight difference in the responses was observed, indicating un-modeled uncertainty. Undoubtedly, further improvements can be made in learning and processing time, and the error can be reduced further. We suggest feeding a larger volume of slices per subject into the classifier using modeling methods such as deep learning.