1 Introduction

Although some tumors are benign, malignant tumors are also possible and greatly increase the mortality rate [1, 2]. Breast cancer is one of the cancers that causes high mortality among women as a result of malignant masses [3]. In some European, African and Asian countries, the mortality rate caused by this disease is increasing [4], and according to 2011 statistics, 110 women die from breast cancer every day worldwide [5]. Studies have shown that, because its causes are unknown, prevention of this disease is very complicated, but a diagnostic process can be applied in the early stages of formation [6]. Early detection and diagnosis is therefore one of the most important factors in the treatment of this disease. Breast cancer is the leading cause of mortality among women and is responsible for one-fifth of all deaths [7]. For instance, the number of patients with breast cancer has been increasing in Asian countries, where the age of disease onset is 10 years lower than in Western countries [8]. This cancer is clearly very common and dangerous among women, which has urged researchers and physicians to look for ways to identify and contain it. For early diagnosis, an intelligent system that detects cancerous masses with high accuracy is therefore highly important. Breast cancer is usually diagnosed via surgical biopsy, which has the highest accuracy among the existing methods but is an invasive, time-consuming and expensive procedure [9].
Mammography is currently the most common and popular method for early diagnosis of this disease, and early detection through mammography has decreased the mortality rate to 25% [10]. Nevertheless, interpreting mammography images is very difficult: according to official figures of the National Cancer Institute in the US, 10–30% of glands in the patient's breast are indistinguishable by radiologists in mammography images [11]. In addition, 30% of breast cancers are not recognized properly in mammography because mass locations are not detected precisely [12, 13]. Therefore, employing computer-aided diagnosis (CAD) in mammography can help specialists make more accurate interpretations. CAD can be specifically helpful in the intelligent detection of diseases from medical images. Previously proposed systems and other studies show that CAD systems with powerful feature extraction, feature selection and classification yield better results in the diagnosis of cancerous masses [14,15,16]. Methods based on image processing can greatly increase the chance of detection in mammography [17]. Overall, the utilization of CAD systems leads to 80–90% detection accuracy [18].

The problem is that most previous methods in this field only identify the presence or absence of a tumor; as a result, little research has been carried out on automated approaches for recognizing the masses in mammography images [19]. Segmentation of mammography images is an important problem and a vital step for gathering information about the masses [20, 21].

Some researchers have worked exclusively on two types of masses, benign and malignant, or on micro-calcifications. In most cases, these methods have taken advantage of intelligent system models capable of learning for inference and for recognition of appropriate patterns [22]. A substantial number of earlier methods have relied on texture analysis or shape-based attributes [23,24,25,26,27,28,29,30,31,32,33].

Efficient tools from various fields of image analysis have also been used to extract features in mammographic image processing, including local binary patterns [34], Gabor features [35, 36], histograms [37], principal component analysis [38], and geometric and statistical characteristics [31]. In a number of methods, Zernike moments have been employed for the description and extraction of features [22, 39,40,41].

Using additional features from other tools, such as texture and shape characteristics, can greatly improve the accuracy with which the tumor type is identified. Various studies [23,24,25, 28] have utilized texture features alongside non-texture features. Kabbadj et al. [31] employed geometric and statistical features, while Beura et al. [42] employed the wavelet transform along with the gray-level co-occurrence matrix (GLCM) in the diagnosis of masses.

A number of other studies also tried to optimize sample classification, such as Singh et al. [22], who employed an optimized adaptive differential evolution wavelet neural network (Ada-DEWNN) model. Dheeba et al. [5, 35] and Raghavendra et al. [43] applied an optimized neural network model and chose Gabor filters as the feature extraction tool for mammography images. Recently, the conventional neural network has been used as the core of an integrated belief concept for dealing with the assortment problem or feature extraction in breast lesion detection and classification [44,45,46,47]. Xie et al. [48] used the extreme learning machine (ELM) as a powerful classifier for breast mass classification in digital mammography. In terms of classifier type, k-NN [26, 27, 49], support vector machines [25, 29,30,31, 34, 36, 38, 40], artificial neural networks [22, 32, 35, 39, 41,42,43], fuzzy inference systems [23] and other efficient classifiers such as ANFIS [50] have been used frequently. Despite the desired accuracy of these studies and the small dimensionality of the extracted texture features, the selection method and the number and type of masses recognized remain ambiguous. In addition to data collected by the researchers themselves, databases such as MIAS, DDSM, DBT and IRMA are used for data analysis in this field. Some of these databases state the number of disease classes in the images.

Other researchers have exclusively explored other masses such as micro-calcifications [29, 31, 51, 52], and some have tried to recognize benign and malignant tumors alongside images without symptoms [42]. The number of classes in a mammography image may exceed 10 types of masses; for instance, in the DDSM database, there are about 12 different classes of benign, malignant and similar masses, while in the Mini-MIAS database, the maximum number of masses does not exceed 7 [42]. In addition, measures such as accuracy, sensitivity and specificity can be considered appropriate benchmarks for assessing the research.

In this paper, we present a framework to detect multi-mass breast cancer based on hybrid descriptors and memetic meta-heuristic learning. The novelty of our study is the analysis of mammography images using hybrid descriptors such as Pseudo–Zernike moments and the wavelet transform. Furthermore, we optimize the ANFIS classifier using the memetic shuffled frog-leaping algorithm (SFLA). The remainder of this paper is organized as follows. The framework of the proposed algorithm is presented in Sect. 2. In Sect. 3, the experimental results of the simulation are presented and compared with other similar methods. Finally, the overall conclusion on system performance is presented in Sect. 5.

2 Overview of the proposed system

Implementation steps include applying some basic steps in pre-processing of mammography input image, extraction and selection of the best features from the set of aggregated features and finally, the classification based on the ANFIS model. The suggested steps are shown in Fig. 1.

Fig. 1 Block diagram of the proposed scheme for multi-classification of mammograms

2.1 Pre-processing step

Pre-processing comprises three basic steps: (a) removing redundant information from the mammography image, (b) removing the pectoral muscle from the breast region, and (c) separating the masses using K-means clustering.

2.1.1 Pectoral muscle

In the pre-processing section, the simulation and mapping masks of previous methods [53,54,55] were utilized because of their higher accuracy in separating images, especially those from the MIAS database. First, a probability density function was used to allocate each part of the image to a region, and the image was divided into three regions: background, breast and pectoral muscle.

$$A_{R} (x) = \frac{n(x \in R)}{N}$$
(1)

where \(A_{R}(x)\) is the probability density function that gives the degree to which pixel position x belongs to region R, \(n(x \in R)\) is the number of times position x falls in region R, and N is the total number of analyzed images. To define the probability density function further, light intensity information is used to create the processed masks [55].

$$I_{R} (i) = \frac{{H_{R} (i)}}{{\sum\nolimits_{j = 1}^{3} {H_{j} } (i)}}$$
(2)

where \(I_{R}(i)\) is the probability of light intensity i in region R and \(H_{R}\) is the intensity histogram of that region. Finally, a label is assigned to each pixel through the correspondence between LBP codes and their histogram, which serves as the texture descriptor of the region. Therefore, it is possible to assign a probability to each LBP code associated with the three regions [55].

$$T_{R} (t) = \frac{{LBP_{R} (t)}}{{\sum\nolimits_{j = 1}^{3} {LBP_{j} } (t)}}$$
(3)

where \(T_{R}(t)\) refers to the texture information and \(LBP_{R}\) is the histogram of LBP codes in region R. The final probabilistic model used for separation combines the three terms [55]:

$$P_{R} (p(x,i,t)) = A_{R} (x)I_{R} (i)T_{R} (t)$$
(4)

Eventually, pectoral segmentation was realized based on the logical operator method with initial masks that had been defined manually by radiologists. According to logical operators of image processing, the AND operator is applied with the corresponding original mammogram and the pectoral area becomes segmented.
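As a minimal sketch of how Eqs. (1)–(4) combine for a single pixel, the following Python function multiplies the three terms for the three regions. The array layouts and the name `region_probabilities` are assumptions made for illustration only; the paper does not describe its data structures.

```python
import numpy as np

def region_probabilities(spatial, intensity_hist, lbp_hist, x, i, t):
    """Evaluate Eq. (4) for one pixel and all three regions.

    spatial:        (3, H, W) array holding A_R(x) from Eq. (1)
    intensity_hist: (3, n_levels) raw intensity histograms H_R
    lbp_hist:       (3, n_codes) raw LBP-code histograms LBP_R
    x: (row, col) position, i: gray level, t: LBP code of the pixel
    """
    eps = 1e-12  # guards against intensities or codes never seen in any region
    A = spatial[:, x[0], x[1]]
    I = intensity_hist[:, i] / (intensity_hist[:, i].sum() + eps)  # Eq. (2)
    T = lbp_hist[:, t] / (lbp_hist[:, t].sum() + eps)              # Eq. (3)
    return A * I * T                                               # Eq. (4)
```

The pixel would then be labeled with the region of highest probability, for example with `np.argmax` over the returned vector.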

2.1.2 Region clustering

After eliminating redundant information, the obtained image is clustered using the K-means method. By selecting the appropriate cluster or clusters, the masses in the mammography image can be separated. If the input patterns are a set of N vectors \(\{ \vec{x}_{1} ,\vec{x}_{2} , \ldots ,\vec{x}_{N} \}\) and the Euclidean distance is used as a measure of similarity, then K-means clustering can be formulated as finding K cluster centers \(\vec{c}_{1} , \ldots ,\vec{c}_{K}\) that minimize the total squared error E [56]:

$$E(\vec{c}_{1} ,\vec{c}_{2} , \ldots ,\vec{c}_{K} ) = \sum\limits_{k = 1}^{K} {\{ (1/N} )\sum\limits_{i = 1}^{N} {m_{ki} \left\| {\vec{x}_{i} - \vec{c}_{k} } \right\|^{2} } \}$$
(5)

where \(m_{ki} = 1\) if \(\vec{x}_{i}\) belongs to cluster k, and \(m_{ki} = 0\) otherwise. The notation \(\left\| \cdot \right\|\) denotes the Euclidean norm. When the training patterns are generated from a probability density \(p(\vec{x})\) defined on an input space S, the cost function of the K-means algorithm becomes:

$$E(\vec{c}_{1} ,\vec{c}_{2} , \ldots ,\vec{c}_{K} ) = \sum\limits_{k = 1}^{K} {\int_{S} {m_{k} (\vec{x})\left\| {\vec{x} - \vec{c}_{k} } \right\|^{2} } } p(\vec{x})d\vec{x}$$
(6)

where \(m_{k}(\vec{x}) = 1\) if \(\vec{x}\) belongs to cluster k, and \(m_{k}(\vec{x}) = 0\) otherwise. For expectation maximization and standard K-means algorithms, the Forgy method of initialization is preferable. With this clustering, pixels can be divided into a maximum of 255 clusters. Here, based on the results of Salvador, the proposed number of clusters is 4–7 [57].
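The clustering step can be illustrated with a short NumPy sketch of Lloyd's K-means iteration with Forgy initialization (initial centers drawn from the data). This is a generic illustration, not the authors' Matlab code; the function name `kmeans` and its default parameters are assumptions.

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Lloyd's K-means on the rows of x with Forgy initialization."""
    rng = np.random.default_rng(seed)
    # Forgy: pick k distinct samples as the initial cluster centers
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared Euclidean distance from every sample to every center
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # final assignment and the error of Eq. (5)
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    error = d2.min(axis=1).mean()
    return labels, centers, error
```

For a gray-level image `img`, the pixels could be clustered with `kmeans(img.reshape(-1, 1).astype(float), k)` for k between 4 and 7, and the labels reshaped back to the image size.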

2.2 Feature aggregation

The feature set is created by aggregating features extracted by several texture and statistical descriptors that have a desirable effect on the accuracy of diagnosis. These features are composed of three parts.

2.2.1 Texture features

The GLCM is a square matrix whose elements correspond to the relative frequency of occurrence of a pair of gray values at a certain distance and in a determined direction. The elements of a co-occurrence matrix with dimensions G × G and displacement vector d = (dx, dy) are defined in (7):

$$P_{d} (i,j) = \left| {\{ ((r,s),(t,v)):I(r,s) = i,\;I(t,v) = j,\;(t,v) = (r + dx,s + dy)\} } \right|$$
(7)

where I(·,·) represents an image with dimensions N × N and G gray levels. The GLCM thus describes the frequencies \(P_{ij}\) with which two pixels separated by d, one with gray intensity i and the other with gray intensity j, occur within a given neighborhood in the image. The GLCM is therefore a square matrix whose size depends on the maximum gray intensity in the image. Each element \(P_{ij}\) counts the occurrences of this structure: a pixel with intensity i at distance d from a pixel with intensity j. If d = 1, the four possible orientations between the two pixels are 0, 45, 90 and 135°, as shown in Fig. 2.

Fig. 2 Feature extraction by using different angles in GLCM
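To make the definition concrete, the following sketch counts co-occurring gray-level pairs at d = 1 for the four angles and derives two Haralick-type features. The offset convention (image rows increasing downward) and the function names are assumptions; the image is expected to hold integer gray levels in the range 0 to `levels - 1`.

```python
import numpy as np

# (row, col) offsets for d = 1 at 0, 45, 90 and 135 degrees,
# assuming row indices increase downward
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def glcm(image, levels, angle, normalize=True):
    """Count pairs (i, j) of gray levels separated by the offset for `angle`."""
    dr, dc = OFFSETS[angle]
    rows, cols = image.shape
    P = np.zeros((levels, levels), dtype=float)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[image[r, c], image[r2, c2]] += 1
    if normalize and P.sum() > 0:
        P /= P.sum()  # turn counts into relative frequencies
    return P

def contrast(P):
    i, j = np.indices(P.shape)
    return ((i - j) ** 2 * P).sum()

def energy(P):
    return (P ** 2).sum()
```

Computing `glcm` for each of the four angles and concatenating the resulting features gives one direction-aware texture descriptor per image region.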

2.2.2 Pseudo–Zernike moments (PZMs)

PZMs are employed to extract features that are invariant, non-redundant, and robust to noise and to the visual form of the investigated image. However, their most striking feature is their multi-level representation ability [58]. The Zernike polynomials are a set of orthogonal polynomials defined within the unit circle (x² + y² ≤ 1); they are denoted \(V_{nm}(x, y)\) and defined in (8):

$$V_{nm} (x,y) = \, V(\rho ,\theta ) = R_{nm} (\rho )e^{jm\theta }$$
(8)

In this equation, \(j = \sqrt { - 1}\), \(\theta = \tan^{ - 1} \left( {\frac{y}{x}} \right)\), |ρ| ≤ 1, n ≥ 0 and |m| ≤ n. Here ρ is the length of the vector from the origin to the point (x, y), and θ is the angle between this vector and the x-axis in the anticlockwise direction. The non-negative integer n is the order of the polynomial, and m is the order of angular repetition, whose absolute value is less than or equal to n; for the classical Zernike polynomials, n − |m| must additionally be even. \(R_{nm}\) is the radial polynomial, calculated according to (9) and (10) [40, 41]:

$$R_{nm} (x,y) = \sum\limits_{s = 0}^{n - |m|} {S_{n,\left| m \right|,s} \,(x^{2} + y^{2} )^{{\frac{n - s}{2}}} }$$
(9)
$$S_{n,\left| m \right|,s} = ( - 1)^{s} \frac{{(2n + 1 - s)!}}{{s!\,(n + \left| m \right| + 1 - s)!\,(n - \left| m \right| - s)!}}$$
(10)

Zernike moments (ZMs) are the projections of an image onto the set of complex Zernike polynomials. An important property of Zernike moments is their orthogonality, which allows image features to be represented without redundancy or overlap of information between the moments. The complex Pseudo-Zernike moments of order n and repetition m are calculated using (11):

$$PZM_{nm} = \frac{n + 1}{\pi }\sum\limits_{x}^{{}} {\sum\limits_{y}^{{}} {f(x,y)V_{nm}^{ * } } } (x,y)$$
(11)

where f(x, y) represents the brightness intensity function of the digital mammography image at location (x, y) and the symbol * denotes the complex conjugate. Furthermore, it should be noted that pixels that fall outside the unit circle after mapping are not used in calculating the moments. The Pseudo-Zernike polynomials in a unit circle are shown in Fig. 3.

Fig. 3 The first 21 Zernike polynomials, ordered vertically by radial degree and horizontally by azimuthal degree

Furthermore, the Pseudo-Zernike moments of a digital image with dimensions N × N, for pixels with \(0 \le \rho_{rc} \le 1\), are computed according to (12):

$$\begin{aligned} PZM_{nm} & = \frac{n + 1}{\pi } \, \sum\limits_{r = 0}^{N - 1} {\sum\limits_{c = 0}^{N - 1} { \, f(r,c) \, V_{n,m}^{ * } } } (r,c) \\ & = \frac{n + 1}{\pi } \, \sum\limits_{r = 0}^{N - 1} {\sum\limits_{c = 0}^{N - 1} { \, f(r,c)\,R_{n,m} } } (\rho_{rc} )\,e^{{ - jm\theta_{rc} }} \\ \end{aligned}$$
(12)
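A minimal sketch of Eq. (12) in Python follows. It assumes the standard pseudo-Zernike radial coefficients, \((-1)^s (2n+1-s)! / [s!\,(n+|m|+1-s)!\,(n-|m|-s)!]\), and one particular mapping of pixel centers onto the square [−1, 1] × [−1, 1]; both the mapping and the function names are illustrative choices, not details given in the paper.

```python
import numpy as np
from math import factorial

def radial_poly(n, m, rho):
    """Pseudo-Zernike radial polynomial R_nm evaluated at radius rho."""
    m = abs(m)
    R = np.zeros_like(rho, dtype=float)
    for s in range(n - m + 1):
        coef = ((-1) ** s * factorial(2 * n + 1 - s)
                / (factorial(s) * factorial(n + m + 1 - s) * factorial(n - m - s)))
        R += coef * rho ** (n - s)
    return R

def pzm(image, n, m):
    """Pseudo-Zernike moment of order n and repetition m of a square image."""
    N = image.shape[0]
    coords = (2 * np.arange(N) + 1 - N) / N   # pixel centers mapped to (-1, 1)
    x, y = np.meshgrid(coords, coords)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0                       # pixels outside the unit circle are ignored
    v_conj = radial_poly(n, m, rho) * np.exp(-1j * m * theta)
    return (n + 1) / np.pi * np.sum(image[inside] * v_conj[inside])
```

The magnitude `abs(pzm(image, n, m))` is the quantity usually used as a feature, since a rotation of the image only changes the phase of the moment.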

2.2.3 Wavelet transform

The multi-level decomposition of a signal x[n] is carried out with two filters. Each step of the process consists of two digital filters followed by downsampling by a factor of 2. The filter g[·] is the discrete mother wavelet and is inherently high-pass, while h[·] is its mirror version and is inherently low-pass. The downsampled outputs of the high-pass and low-pass filters at the first level are the detail coefficients D1 and the approximation coefficients A1, respectively. The approximation coefficients A1 are then decomposed further at the next level. All wavelet transforms can be specified in terms of a low-pass filter h that satisfies (13):

$$H(z)H(z^{ - 1} ) + H( - z)H( - z^{ - 1} ) = 1$$
(13)

where H(z) is the z-transform of the filter h. The complementary high-pass filter can then be stated as in (14):

$$G(z) = zH( - z^{ - 1} )$$
(14)

A series of filters with increasing length (with index i) can be obtained according to (15):

$$\begin{array}{*{20}l} {H_{i + 1} (z) = H(z^{{2^{i} }} )H_{i} (z)} \hfill \\ {G_{i + 1} (z) = G(z^{{2^{i} }} )H_{i} (z)} \hfill \\ \end{array} \quad i = 0, \ldots ,I - 1$$
(15)

where \(H_{0}(z) = 1\) is the initial condition. The two-scale relationship in the time domain can be expressed as in (16):

$$\begin{aligned} h_{i + 1} (k) & = [h]_{{ \uparrow 2^{i} }} * h_{i} (k) \\ g_{i + 1} (k) & = [g]_{{ \uparrow 2^{i} }} * h_{i} (k) \\ \end{aligned}$$
(16)

where \(\left[ . \right]_{{ \uparrow 2^{i} }}\) represents upsampling by a factor of \(2^{i}\) and k is the discrete sampling time. The normalized scaling and wavelet basis functions \(\varphi_{i,l}(k)\) and \(\psi_{i,l}(k)\) can be defined as in (17):

$$\begin{aligned} \varphi_{i,l} (k) & = 2^{{{i \mathord{\left/ {\vphantom {i 2}} \right. \kern-0pt} 2}}} h_{i} (k - 2^{i} l) \\ \psi_{i,l} (k) & = 2^{{{i \mathord{\left/ {\vphantom {i 2}} \right. \kern-0pt} 2}}} g_{i} (k - 2^{i} l) \\ \end{aligned}$$
(17)

where the factor \(2^{i/2}\) results from normalization of the inner product, and i and l are the scale and translation parameters, respectively. The discrete wavelet decomposition is expressed in (18):

$$\begin{aligned} a_{(i)} (l) & = x(k) * \varphi_{i,l} (k) \\ d_{(i)} (l) & = x(k) * \psi_{i,l} (k) \\ \end{aligned}$$
(18)

where \(a_{(i)}(l)\) and \(d_{(i)}(l)\) represent the approximation coefficients and detail coefficients at level i [59]. With this calculation, the signal can be decomposed and subsequently reconstructed in practice. After applying the transform to the input mammogram, statistical characteristics of the distribution in the time–frequency domain become available.
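The filter-and-downsample scheme can be sketched in one dimension as follows. The Haar filters with orthonormal scaling are an assumption, since the paper does not name the wavelet it uses, and the helper names are illustrative. Each level splits the current approximation into a new approximation and a detail sub-band, and the mean and standard deviation of each sub-band serve as features.

```python
import numpy as np

# Haar analysis filters, orthonormal scaling (assumed; the wavelet is not named in the text)
H = np.array([1.0, 1.0]) / np.sqrt(2)    # low-pass  h[.]
G = np.array([1.0, -1.0]) / np.sqrt(2)   # high-pass g[.]

def dwt_step(x):
    """One level: filter with h and g, then keep every second sample."""
    a = np.convolve(x, H)[1::2]   # approximation coefficients
    d = np.convolve(x, G)[1::2]   # detail coefficients
    return a, d

def wavedec(x, levels):
    """Return [D1, D2, ..., D_levels, A_levels] for an even-length signal."""
    coeffs = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = dwt_step(a)
        coeffs.append(d)
    coeffs.append(a)
    return coeffs

def subband_features(coeffs):
    """Mean and standard deviation of every sub-band."""
    return [(c.mean(), c.std()) for c in coeffs]
```

For images, the same step is applied along rows and then along columns, which produces the two-dimensional sub-bands.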

2.3 Feature subset selection

The basic premise of feature subset selection algorithms is that the set of extracted data contains both redundant information and irrelevant features, so that these can be removed without incurring much loss of information. Heuristic algorithms are powerful techniques for eliminating both redundant information and irrelevant features; they can be used for optimized feature subset selection according to the error of a cost function based on unsupervised classifiers. Simulated annealing (SA) is a search algorithm for large spaces, defined as a probabilistic technique for approximating the global optimum of a given function.

In the annealing process, a metal is heated to a high temperature and then cooled gradually. The increase in temperature raises the speed at which the atoms move, and the gradual decrease in temperature then causes certain patterns to form in the positions of the atoms. We applied this property to find optimum solutions, that is, the best aggregated features. The best features are found based on the value of the cost function, which is the minimum classification error of a neural network. Generally, the process is as follows:

  1. Choosing a random feature subset for search and fitting by the neural network

  2. Setting the starting temperature

  3. Producing a new point to achieve an efficient feature subset

  4. Evaluating the new point to accept or reject it as an optimal feature subset

  5. If the produced feature subset is better than the current feature subset, it is accepted; otherwise it is accepted with a probability that depends on the temperature and on the energy of the two states.

  6. The temperature is lowered and steps 3–6 are repeated until the minimum temperature is reached.

The steps for choosing the best properties among the aggregated characters in the algorithm are shown in Algorithm 1.

Algorithm 1
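The six steps listed above can be sketched as follows. This is an illustrative outline, not a transcription of Algorithm 1: the geometric cooling schedule, the single-bit-flip neighbor move and all parameter values are assumptions, and `cost` stands in for the neural-network classification error used in the paper.

```python
import math
import random

def anneal_features(n_features, cost, t_start=1.0, t_min=1e-3, alpha=0.95, seed=0):
    """Simulated annealing over boolean feature masks; lower cost is better."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]   # step 1: random subset
    current_cost = cost(current)
    best, best_cost = current[:], current_cost
    t = t_start                                                  # step 2: start temperature
    while t > t_min:
        candidate = current[:]                                   # step 3: new point
        k = rng.randrange(n_features)
        candidate[k] = not candidate[k]                          # toggle one feature
        cand_cost = cost(candidate)                              # step 4: evaluate
        delta = cand_cost - current_cost
        # step 5: always accept improvements, sometimes accept worse subsets
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        t *= alpha                                               # step 6: cool down
    return best, best_cost
```

In use, `cost(mask)` would train or evaluate the classifier on the features where `mask` is true and return its hold-out error.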

2.4 Classification step

The ANFIS approach is an efficient tool for identifying the association between variables; it combines fuzzy systems with a structure and configuration similar to those of neural networks. ANFIS training is carried out using either the back-propagation algorithm or a hybrid algorithm that combines least-squares estimation with error back-propagation to estimate the parameters of the fuzzy membership functions. Assuming that the fuzzy system has two inputs x and y and one output z, the rules are written as shown in (19):

$$\begin{aligned} & \underbrace {{If\;x\;is\;A_{1} \;and\;y\;is\;B_{1} }}_{Rule\,1}\mathop{\longrightarrow}\limits{Then}\;f_{1} = P_{1} x + Q_{1} y \\ & \underbrace {{If\;x\;is\;A_{2} \;and\;y\;is\;B_{2} }}_{Rule\,2}\mathop{\longrightarrow}\limits{Then}\;f_{2} = P_{2} x + Q_{2} y \\ \end{aligned}$$
(19)

If the weighted-average (center-average) method is used for defuzzification, the output is as follows:

$$\begin{aligned} f & = \frac{{w_{1} f_{1} + w_{2} f_{2} }}{{w_{1} + w_{2} }} = \bar{w}_{1} f_{1} + \bar{w}_{2} f_{2} \\ \bar{w}_{1} & = \frac{{w_{1} }}{{w_{1} + w_{2} }},\quad \bar{w}_{2} = \frac{{w_{2} }}{{w_{1} + w_{2} }} \\ \end{aligned}$$
(20)
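A forward pass through the two-rule system of Eqs. (19) and (20) can be sketched as below. The Gaussian membership functions and the product used to obtain the firing strengths \(w_1\) and \(w_2\) are assumptions, since the text does not fix either choice; the function names are illustrative.

```python
import numpy as np

def gauss(v, c, sigma):
    """Gaussian membership function (assumed shape)."""
    return np.exp(-0.5 * ((v - c) / sigma) ** 2)

def anfis_two_rules(x, y, premise, consequent):
    """premise:    [(cA1, sA1, cB1, sB1), (cA2, sA2, cB2, sB2)]
    consequent: [(P1, Q1), (P2, Q2)]"""
    # firing strength of each rule: membership of x in A_k times membership of y in B_k
    w = np.array([gauss(x, cA, sA) * gauss(y, cB, sB) for cA, sA, cB, sB in premise])
    # rule outputs f_k = P_k x + Q_k y, Eq. (19)
    f = np.array([P * x + Q * y for P, Q in consequent])
    w_bar = w / w.sum()                 # normalized firing strengths
    return float(np.dot(w_bar, f))      # Eq. (20)
```

Training then amounts to adjusting the premise parameters (centers and widths) and the consequent parameters (P, Q) so that the output error is minimized.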

2.5 Shuffled frog-leaping algorithm (SFLA)

The SFLA optimization algorithm, inspired by the movement of frogs, proposes a strategy for searching for improved parameters whose effectiveness in finding a solution is considerable compared with other approaches. Using this optimization procedure, an ANFIS structure is found that has the least mean square error (MSE) in the network output. In other words, a configuration of network weights can be found that leads to the best classification with negligible error.

The choice of these parameters strongly influences the accuracy obtained. To discover the best structure of ANFIS, we propose that the SFLA algorithm perform the optimization. The steps of SFLA for finding the best parameters of the ANFIS classifier are as follows [60]:

  • Step 1: Initialization. H frogs are randomly generated to construct the initial population. The position of the hth frog is encoded as \(X_{h} = [x_{h1} ,x_{h2} , \ldots ,x_{hd} , \ldots ,x_{hD} ]\), h = 1, …, H, where D is the dimension of the optimization space. Each \(X_{h}\) represents a possible solution, and each possible solution corresponds to a value \(f(X_{h})\) of the optimization cost function.

  • Step 2: Ranking and grouping. The H frogs are ranked in descending order of performance according to the cost function. The position \(P_{x} = [P_{x1} ,P_{x2} , \ldots ,P_{xd} , \ldots ,P_{xD} ]\) of the best frog in the population, based on the cost function output, is recorded. The population is divided into α memeplexes with c frogs in each memeplex, defined as:

    $$\begin{aligned} & M_{{o_{1} }} = \left\{ {X_{{o_{1} + \alpha (o_{2} - 1)}} \in Population|1 \le o_{2} \le c} \right\} \\ & (1 \le o_{1} \le \alpha ) \\ \end{aligned}$$
    (21)
  • Step 3: Local search. Inside each memeplex, the local optimization procedure is repeated for the desired number of iterations.

    • Step 3-1 The positions of the best and the worst frogs in the memeplex are denoted \(P_{b} = [P_{b1} ,P_{b2} , \ldots ,P_{bd} , \ldots ,P_{bD} ]\) and \(P_{w} = [P_{w1} ,P_{w2} , \ldots ,P_{wd} , \ldots ,P_{wD} ]\), respectively. \(P_{w}\) is updated based on:

      $$\begin{aligned} D_{{s_{d} }} & = \left\{ {\begin{array}{*{20}c} {\hbox{min} [INT(r \times (P_{bd} - P_{wd} )),D_{d}^{\hbox{max} } ]} & {P_{bd} - P_{wd} \ge 0} \\ {\hbox{max} [INT(r \times (P_{bd} - P_{wd} )), - D_{d}^{\hbox{max} } ]} & {P_{bd} - P_{wd} < 0} \\ \end{array} } \right. \\ d & = 1,2, \ldots ,D \\ \end{aligned}$$
      (22)
      $$P^{\prime}_{wd} = P_{wd} + D_{{s_{d} }}$$
      (23)
      $$P^{\prime}_{wd} = \left\{ {\begin{array}{*{20}l} {Z_{d}^{\hbox{max} } } \hfill & {P^{\prime}_{wd} > Z_{d}^{\hbox{max} } } \hfill \\ {P^{\prime}_{wd} } \hfill & {Z_{d}^{\hbox{min} } \le P^{\prime}_{wd} \le Z_{d}^{\hbox{max} } } \hfill \\ {Z_{d}^{\hbox{min} } } \hfill & {P^{\prime}_{wd} < Z_{d}^{\hbox{min} } } \hfill \\ \end{array} } \right.$$
      (24)

      where r is a random value in the interval [0, 1], \(D_{{s_{d} }}\) is the step of the dth decision variable, and \(D_{d}^{\max }\) is the maximum step allowed for the dth decision variable. Also, \(P^{\prime}_{wd}\) is the updated position of the dth decision variable, which Eq. (24) restricts to the range \([Z_{d}^{\min } ,Z_{d}^{\max } ]\).

    • Step 3-2 If the performance value of \(P^{\prime}_{w} = [P^{\prime}_{w1} , \ldots ,P^{\prime}_{wd} , \ldots ,P^{\prime}_{wD} ]\) is better than that of \(P_{w}\), then \(P_{w} = P^{\prime}_{w}\); otherwise, \(P_{b}\) in Eq. (22) is replaced with \(P_{x}\) and the position update is repeated.

    • Step 3-3 If the cost function value of \(P_{w}\) is still better than that of \(P^{\prime}_{w}\), then \(P_{w}\) is replaced by a random frog position.

  • Step 4: Shuffling and global search. After the local search step, all memeplexes are mixed to form an updated population. The frogs are ranked again and the best frog \(P_{x}\) is identified. The grouping and local search are then repeated until the predetermined number of global iterations is completed.
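Steps 1–4 can be sketched for a generic cost function as below. This is a simplified continuous variant for illustration: the INT(·) rounding of Eq. (22) is dropped, one maximum step `d_max` and one pair of bounds `lo`, `hi` are shared by all dimensions, and all default values are assumptions. In the paper's setting, `cost` would be the MSE of ANFIS for a given parameter vector.

```python
import numpy as np

def sfla(cost, dim, lo, hi, n_frogs=20, n_memeplexes=4, local_iters=5,
         global_iters=30, d_max=1.0, seed=0):
    """Shuffled frog-leaping minimization of `cost` over [lo, hi]^dim."""
    rng = np.random.default_rng(seed)
    frogs = rng.uniform(lo, hi, size=(n_frogs, dim))            # Step 1
    for _ in range(global_iters):
        # Step 2: rank the frogs (best first) and record the global best
        order = np.argsort([cost(f) for f in frogs])
        frogs = frogs[order]
        p_x = frogs[0].copy()
        for o1 in range(n_memeplexes):
            idx = np.arange(o1, n_frogs, n_memeplexes)          # memeplex of Eq. (21)
            for _ in range(local_iters):                        # Step 3
                costs = np.array([cost(frogs[i]) for i in idx])
                b, w = idx[costs.argmin()], idx[costs.argmax()]
                improved = False
                # Step 3-1 uses the memeplex best, Step 3-2 falls back to the global best
                for leader in (frogs[b], p_x):
                    step = np.clip(rng.random() * (leader - frogs[w]), -d_max, d_max)
                    cand = np.clip(frogs[w] + step, lo, hi)     # Eqs. (22)-(24)
                    if cost(cand) < costs.max():
                        frogs[w] = cand
                        improved = True
                        break
                if not improved:                                # Step 3-3
                    frogs[w] = rng.uniform(lo, hi, size=dim)
        # Step 4: all frogs live in one array, so the next ranking reshuffles them
    best = min(frogs, key=cost)
    return best, cost(best)
```

Because only the worst frog of each memeplex is ever moved, the best cost in the population cannot get worse from one iteration to the next.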

3 Experimental results

The evaluation criteria were analyzed using mammography images from the Mini-MIAS database [61]. The images downloaded from this database were scanned in LJPEG format as three-channel images at a pixel size of 50 microns, and the resolution of the downloaded images is 200 µm. The downloaded images have a depth of 8 bits and dimensions of 1024 × 1024. Although the files have three channels, the nature of the mammography imaging device means that the images are recorded in gray-level form. Because the images are high-dimensional, we resized them to 256 × 256 to reduce the computational complexity. In the database annotations, the first column gives the reference for each image and the second column gives the background tissue of the image. The third column contains seven different classes, as follows:

  1. CALC: Calcification

  2. CIRC: Well-defined/circumscribed masses

  3. SPIC: Spiculated masses

  4. MISC: Other, ill-defined masses

  5. ARCH: Architectural distortion

  6. ASYM: Asymmetry

  7. NORM: Normal

Another column gives the severity of the abnormality, coded with the letters B and M for benign and malignant, respectively. To evaluate the masses, the output classes of the ANFIS classifier are based on the 7 listed classes. The proposed algorithm was constructed in the Matlab programming environment by combining and integrating the presented solutions, in three experimental steps.

3.1 Setting

In the feature extraction step, 59 features were extracted from the image region containing the location and condition of the mass. From the Haralick matrix (\(C_{m \times n}\)), the most important features included contrast, energy, entropy, variance (sum of squares), correlation, etc. Table 1 shows some of these features along with the relationships that describe them.

Table 1 Some features extracted by the GLCM algorithm

In the first step of describing features for each scale, the combination matrix of that scale was formed by placing all sub-bands together. Thereafter, the Co-occurrence matrix of that scale was constructed with the parameters d = 1 (pixel resolution distance) and angles 0, 45, 90 and 135°.

In the second step, feature extraction from the mammographic image was carried out using Pseudo-Zernike moments, with the blocks divided as follows:

  1. Block feature sets are inscribed on the whole picture

  2. Block feature sets are inscribed on one fourth of the picture (the image is divided into four equal portions)

  3. Block feature sets are inscribed on one third of the image transversely (the image is divided horizontally into three equal parts)

  4. Block feature sets are inscribed on one third of the image longitudinally (the image is divided vertically into three equal parts).

Table 2 shows the level 8 Pseudo-Zernike moments. In the discrete wavelet transform, each mammogram can be decomposed to 3, 4 or 5 levels. The number of sub-bands differs between levels. For three levels, the number of sub-bands is 18 (1 + 16 + 1), and for four levels it is 50 (1 + 16 + 32 + 1), where the terms relate to levels 1, 2, 3 and 4, respectively. The coefficients produced by the wavelet transform repeat every 180 degrees. As a result, half of the sub-bands were considered sufficient for levels 2 and 3. Therefore, in the four-level simulation, 26 sub-bands (1 + 8 + 16 + 1) of wavelet coefficients are generated, and each sub-band is a set of coefficients. For feature extraction, the mean and standard deviation of the coefficients within each of the 26 available sub-bands were measured. Each of the measured parameters takes a small value.

Table 2 Level 8 PseudoZernike moments

The hold-out method was used to select the data for the fitness function of the SA method. Back-propagation neural networks with 6 neurons in the input layer, 8 neurons in the hidden layer and 4 neurons in the output layer were used to select an efficient feature subset. The best number of features selected from the aggregated set was between 18 and 25, which gave the lowest classification error. The features were therefore stored in 6 groups of 4, and the selected features became the inputs of ANFIS. The network configuration used to simulate ANFIS is shown in Fig. 4. The process of improving ANFIS by the shuffled frog-leaping algorithm is shown schematically in Fig. 5.

Fig. 4 The proposed structure of ANFIS classifier

Fig. 5 Process of the improved ANFIS classifier by SFLA

3.2 Assessments

The results of the pre-processing step for four sample mammography images are shown in Fig. 6. For the sample images, the neighborhood radius was set to eight and redundant parts were eliminated from the image sets. The first row of images shows examples of removing unwanted elements. The eight-neighborhood was selected on the basis of the analysis of the entire image set and applies to all images. Using the masks obtained from the research by Oliver et al. [55] and the benchmark measures in (25) and (26), the pixels belonging to the three regions (background, breast and pectoral muscle) can be compared with one another. Since the goal is to separate the breast area from the remaining regions, we have:

$$OL_{Br - Pe} = \frac{{2\left| {Br \cap Pe} \right|}}{{\left| {Br} \right| + \left| {Pe} \right|}}$$
(25)
$$OL_{Br - BG} = \frac{{2\left| {Br \cap BG} \right|}}{{\left| {Br} \right| + \left| {BG} \right|}}$$
(26)
Fig. 6 Results of the algorithm on 4 images. First row: a section of an original mammography image. Second row: first pre-processing step, removing unwanted objects. Third row: result of pectoral segmentation. Fourth row: result of region clustering by the K-means algorithm

Equations (25) and (26) give the proportion of overlapping pixels between the breast and pectoral areas (|Br ∩ Pe|) and between the breast and background (|Br ∩ BG|), respectively. Table 3 shows the statistical results of segmentation in the pre-processing step for 322 image samples. At the next level, with a choice of 4–6 spikes for mammography, the mass location and its appearance can be segmented. Skilled radiologists were asked to identify the location of masses of different shapes in the mammograms and to map the locations of possible masses precisely. They were blind to the database information and also predicted the type of mass in each image. There was a significant relationship between these predictions and the clustering results at the end of the pre-processing step.
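The overlap measures in (25) and (26) have the same form, twice the size of the intersection divided by the sum of the two region sizes. The following is a minimal sketch of that computation, assuming each region is available as a set of pixel coordinates; the function name `overlap` and the toy masks are hypothetical and are not taken from the study's data.

```python
def overlap(region_a, region_b):
    """Overlap 2|A ∩ B| / (|A| + |B|) between two sets of pixel coordinates."""
    a, b = set(region_a), set(region_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty regions
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical toy masks given as (row, col) pixel coordinates.
breast = {(0, 0), (0, 1), (1, 0), (1, 1)}
pectoral = {(1, 1), (1, 2)}
background = {(2, 2), (2, 3)}

ol_br_pe = overlap(breast, pectoral)    # one shared pixel out of 4 + 2
ol_br_bg = overlap(breast, background)  # no shared pixels
```

A value of 0 indicates that the two regions share no pixels, which is the desired outcome when the breast area has been cleanly separated from the pectoral and background regions.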

Table 3 Statistical results of segmentation in the pre-processing step

In the evaluation stage, the data were divided into training and test sets by K-fold validation with K = 5. The training and test results are shown in Tables 4 and 5, respectively. In these tables, the accuracy is computed from the confusion matrix for 7 classes of breast cancer. Fig. 7 shows the Receiver Operating Characteristic (ROC) curves; the AUC values reported there suggest that the system performs well in recognizing the different masses in mammography images.
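The K-fold protocol with K = 5 and the accuracy derived from a confusion matrix can be sketched as follows. This is an illustrative outline only: the helper names, the random seed, and the shuffling strategy are assumptions, and the classifier itself is omitted.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal, disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def accuracy_from_confusion(matrix):
    """Overall accuracy: sum of the diagonal divided by the sum of all cells."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# 322 is the number of image samples mentioned in the text.
splits = list(k_fold_indices(322, k=5))
```

Each sample appears in exactly one test fold, so every image contributes once to the test results and four times to the training results.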

Table 4 The accuracy results of applying the proposed algorithm by K-fold evaluation in train steps
Table 5 The accuracy results of applying the proposed algorithm by K-fold evaluation in test steps
Fig. 7 ROC curves. The first plot shows one class versus the other masses; the remaining plots show six randomly selected one-against-all ROC curves for various masses. First column: ROC curves for random training groups. Second column: ROC curves for random test groups

For an accurate comparison, the first curve uses normal and abnormal images. We randomly split the data set into two equal parts, with 50% used to train the proposed algorithm and the other 50% used as a hold-out validation set to plot the ROC curve. As in many multi-class problems, the idea is to carry out pairwise comparisons, such as one class versus all other classes or one class versus another class. For example, we computed and plotted the ROC curve for class 2 against classes 3, 4, etc.; in the next step, class 3 is compared and plotted against classes 2, 4, etc.
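The one-against-all comparison can be illustrated with a small AUC computation. The sketch below uses the rank-based (Mann-Whitney) form of the AUC, which equals the area under the ROC curve; the function name, the labels, and the scores are hypothetical and do not come from the study.

```python
def auc_one_vs_rest(labels, scores, positive):
    """AUC for class `positive` against all other classes (Mann-Whitney form)."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for "class 2" on six hold-out samples.
labels = [2, 2, 3, 4, 3, 4]
scores_class2 = [0.9, 0.6, 0.7, 0.2, 0.1, 0.3]

auc_class2 = auc_one_vs_rest(labels, scores_class2, positive=2)
```

Repeating the call with the scores for class 3, class 4, and so on gives one AUC per class, mirroring the class-by-class comparison described above.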

4 Discussion

We tested several numbers of clusters and calculated the results of each experiment separately. Better outputs were obtained when the number of K-means clusters was between 3 and 5. This is shown in Fig. 8, where the number of clusters is varied from 2 to 6 and the final classification accuracy and AUC are calculated for each value.

Fig. 8 Best value for the parameter K of K-means clustering

The features obtained from the different descriptors were aggregated and the performance of the descriptors was compared. All feature extraction procedures are shown in Fig. 9. Although the GLCM, PZM, and wavelet descriptors account for more of the suitable texture features, feature aggregation led to better results.

Fig. 9 Comparison between feature extraction schemes for mammogram images across multiple experiments

Because it uses all the data in the error matrix, the Kappa factor is used to assess both the classification accuracy and the fitness function. It is defined in (27):

$$Kappa = \frac{{N\sum\nolimits_{i = 1}^{r} {x_{ii} } - \sum\nolimits_{i = 1}^{r} {(x_{i + } \times x_{ + i} )} }}{{N^{2} - \sum\nolimits_{i = 1}^{r} {(x_{i + } \times x_{ + i} )} }}$$
(27)

where N is the total number of samples, r is the number of classes, x_ii denotes the elements on the main diagonal of the error matrix, x_i+ is the marginal sum of row i, and x_+i is the marginal sum of column i. As shown in Fig. 10, when the ANFIS classifier is optimized with SFLA, the resulting Kappa factor is more acceptable than with other optimizers such as GA, PSO, and ACO. The performance of this algorithm in finding the global optimum is satisfactory, and it can be used as an ANFIS optimizer in the classification of various breast cancer masses.
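As a worked illustration of (27), the following sketch computes the Kappa factor directly from an error (confusion) matrix given as a list of rows. The function name and the example matrix are hypothetical and are not results from the study.

```python
def kappa(matrix):
    """Kappa factor of a square error matrix, following Eq. (27)."""
    r = len(matrix)
    n = sum(sum(row) for row in matrix)                 # N: all samples
    diag = sum(matrix[i][i] for i in range(r))          # sum of x_ii
    row_sums = [sum(matrix[i]) for i in range(r)]       # x_i+
    col_sums = [sum(matrix[j][i] for j in range(r)) for i in range(r)]  # x_+i
    chance = sum(row_sums[i] * col_sums[i] for i in range(r))
    return (n * diag - chance) / (n * n - chance)


# Hypothetical two-class error matrix with 50 samples.
example = [[20, 5],
           [10, 15]]
k_value = kappa(example)  # (50*35 - 1250) / (2500 - 1250)
```

A perfectly diagonal matrix gives a Kappa of 1, while agreement no better than chance gives a value near 0, which is why the measure is stricter than plain accuracy.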

Fig. 10 Comparison between classifier optimization approaches for mammogram images across multiple experiments

The database provides its own labels, but the images were also labeled as healthy or unhealthy, and two radiologists were asked to review the masses. Compared with previous methods, the total precision is at an appropriate level (Table 6). The AUC and sensitivity calculations yielded 94.14% for all masses and 96.89% for the benign and malignant states, respectively. When the model is not optimized, the accuracy on the training data is higher than on the test data. The main reason is over-fitting; to prevent it, the model is tuned with the SFLA algorithm, which helped improve the accuracy of the test step. The results of the training and test steps for the Mini-MIAS data are therefore reported separately, to show that over-fitting has been addressed for a large number of classes.

Table 6 Comparison among proposed method and other breast cancer classification techniques

The algorithm is effective at identifying and separating benign and malignant masses compared with methods such as [22, 30,31,32,33,34, 36, 39,40,41,42]. Although its total precision is lower than that of methods such as [30, 31, 50], if the algorithm is applied in two stages to two categories of data, in line with those procedures, the binary classification accuracy and AUC (i.e., healthy versus unhealthy) exceed 98.6%. Thus, in distinguishing healthy individuals from patients, this algorithm performs better than prior algorithms [30, 31, 33, 39,40,41,42, 50]. Once patients are distinguished from healthy people, the statistical population is limited to 121 images: 25 with circumscribed masses, 19 with spiculated masses, 15 with ill-defined masses, 19 with architectural distortion, 15 with asymmetry, and 28 with calcification.

We then applied the algorithm to the disease classes alone and obtained good precision (above 92%) for the six classes. Although methods such as [22, 32, 33, 36, 39,40,41,42, 50] achieve a favorable level of accuracy and sensitivity, they have been criticized for their lack of integrity in discriminating and segmenting all masses. The first difference, and a key advantage, of the presented solution is therefore its recognition of the different kinds of masses.

In identifying micro-calcification, methods [29, 31, 51] report performance of 87%, 90%, and 99.60%, respectively, but in those studies the F1 score is not derived from sensitivity and specificity, and the time taken by the procedure in [31] is unclear because that method uses two categories of features. Unlike the former methods [22, 32, 33, 36, 39,40,41,42, 50], which did not assess data conclusiveness or the agreement between algorithm performance and radiologist opinion, in this study the p value showed a significant correlation between the output of the proposed algorithm and the radiologists' opinion (p < 0.05). The comparison of the results with the radiologists and with other similar methods supports rejecting the null hypothesis (H0): because the test result did not fall in the acceptance region of H0, H0 is rejected (α = 0.05 and thus −Z_{1−α} = −1.65). This means that the confidence level is more than 95% and that, despite the large number of mammography images and the K-fold evaluation, the outputs are closer to reality.
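The critical value of −1.65 quoted for α = 0.05 can be checked with the standard normal distribution. The sketch below assumes a lower-tailed z-test, which is how the negative critical value is read here; the function names and the test statistic of −2.0 are hypothetical and are not reported in the study.

```python
from statistics import NormalDist


def lower_tail_critical_z(alpha=0.05):
    """Value z such that P(Z < z) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(alpha)


def reject_h0(z_stat, alpha=0.05):
    """Lower-tailed z-test: reject H0 when the statistic is below the critical value."""
    return z_stat < lower_tail_critical_z(alpha)


z_crit = lower_tail_critical_z(0.05)  # about -1.645, commonly quoted as -1.65
decision = reject_h0(-2.0)            # hypothetical statistic, not from the paper
```

Any test statistic below the critical value falls outside the acceptance region of H0, which is the situation described in the text.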

The algorithm also has strengths as a recognition tool. In some studies, segmentation based on tumor location is combined with recognition, whereas here the breast, the pectoral region, and the ROI were isolated in the pre-processing step with appropriate segmentation techniques. Only a few studies use a heuristic algorithm to reduce the dimensions and choose the best subset of features. The accuracy of the algorithm also demonstrates its capability in analyzing the histological and statistical features of mammography images. Some other studies used aggregated descriptors to identify the best features; for instance, Beura et al. [42] used GLCM and wavelet features, whereas the present work uses three statistical and texture descriptors. Furthermore, the ANFIS classifier, an adaptive model that combines a fuzzy inference system with a neural network, is particularly suited to multi-class classification [39, 48, 62, 63].

5 Conclusion

Recognition and early detection of breast cancer in women can be a strategy for fighting this disease, so an efficient system that can automatically separate the different classes of masses is needed. This paper first presented a combined method that extracts features from mammography images by aggregating the characteristics of three texture and statistical descriptors and then selects efficient features with the SA algorithm. After the target region in the mammography image was separated, the memetic meta-heuristic adaptive neuro-based fuzzy inference system (MM-ANFIS) classifier was employed for classification and attained an accuracy above 90%. Although different algorithms have been proposed for estimating the presence or absence of this disease, the form and shape of the suspected masses can also be useful in combating and preventing it at an early stage. In future work, the authors plan to optimize the feature extraction pattern, the feature selection, and the adaptive features of the classifier so as to reduce the feature dimensions and the data processing time while increasing precision.