1 Introduction

“A picture is worth a thousand words” is a famous idiom which signifies that processing an image may reveal more information than processing textual data. In computer vision, image segmentation is a prime research area which corresponds to partitioning an image into its constituent objects or regions of interest (ROIs). Generally, it assembles the image pixels into similar regions. It is a pre-processing phase of many image-based applications like biometric identification, medical imaging, object detection and classification, and pattern recognition [91]. Some of the prominent applications are as follows.

  • Content-based image retrieval: It corresponds to the searching of query-relevant digital images from large databases. The retrieval results are obtained as per the contents of the query image. To extract the contents from an image, image segmentation is performed.

  • Machine vision: It is the image-based technology for robotic inspection and analysis, especially at the industrial level. Here, segmentation extracts the information from the captured image related to a machine or processed material.

  • Medical imaging: Today, image segmentation helps medical science in a number of ways from medical diagnosis to medical procedures. Some of the examples include segmentation of tumors for locating them, segmenting tissue to measure the corresponding volumes, and segmentation of cells for performing various digital pathological tasks like cell count, nuclei classification and many others.

  • Object recognition and detection: Object recognition and detection is an important application of computer vision. Here, an object may be a pedestrian, a face, or an aerial object such as roads, forests, or crops. This application depends on image segmentation, as the intended object must first be extracted from the image.

  • Video surveillance: Here, a video camera captures the movements of the regions of interest and analyzes them to perform an intended task, such as identifying the action being performed in the captured video, controlling traffic movement, or counting objects. To perform the analysis, the region of interest must first be segmented.

Though segmenting an image into its constituent ROIs may be a trivial task for humans, it is relatively complex from the perspective of computer vision. There are a number of challenges which may affect the performance of an image segmentation method. Figure 1 depicts three major challenges of image segmentation, which are discussed below.

  • Illumination variation: It is a fundamental problem in image segmentation and has severe effects on pixels. This variation occurs due to the different lighting conditions during the image capturing. Figure 1a shows an image that is captured in different illumination conditions. It can be observed that the corresponding pixels in each image contain varying intensity values which pose difficulties in image segmentation.

  • Intra-class variation: One of the major issues in this field is that the region of interest may exist in a number of different forms or appearances. Figure 1b depicts an example of chairs that are shown in different shapes, each having a different appearance. Such intra-class variation often makes the segmentation procedure difficult. Thus, a segmentation method should be invariant to such variations.

  • Background complexity: An image with a complex background is a major challenge, since the regions of interest may blend into the complex environment. Figure 1c illustrates an example of such an image, an H&E stained breast cancer histology image. The dark blue regions in the image represent the nuclei, which are generally defined as the regions of interest in histopathological applications like nuclei counting or cancer detection. It can be observed that the background is so complex that the nuclei regions do not have clearly defined boundaries. Therefore, such background complexities degrade the performance of segmentation methods.

Fig. 1 Challenges in image segmentation: a Illumination variation [2], b Intra-class variation [2], c Background complexity [3]

Further, the essence of image segmentation is to represent an image with a few significant segments instead of thousands of pixels. Moreover, image segmentation may be viewed as a clustering approach in which the pixels that satisfy a criterion are grouped into one cluster while the pixels that do not are placed in different groups. To exemplify this, consider the images in Fig. 2. The first image consists of some animals in a field. To extract the animals from the background, the ideal result would be to group all the pixels belonging to the animals into one cluster and the background pixels into another, as presented in Fig. 2b. However, the pixel labelling is unknown. Moreover, it can be observed that the first image presents a complex scenario in terms of the different color intensities and shapes of each animal, which correspond to illumination and intra-class variations respectively. To mitigate these challenges and learn from such unlabelled data, the common approach is to group the data based on certain similarity or dissimilarity measures and then label each group. This approach of grouping the data is generally termed clustering.

Fig. 2 Example of clustering based image segmentation: a Original image, b Segmented image

Generally, clustering has been used in different areas of real-world applications like market analysis, social network analysis, online query search, recommendation systems, and image segmentation [66]. The main objective of a clustering method is to classify the unlabelled pixels into homogeneous groups, i.e. to achieve maximum similarity within the clusters and maximum dissimilarity among the clusters. Mathematically, the clustering procedure on an image (X) of size (m × n), defined over d dimensions, generates K clusters {C1,C2,⋯ ,CK} subject to the following conditions:

  • \(C_{i} \neq \emptyset \), for i = 1,2,⋯ ,K

  • \(C_{i} \cap C_{j} = \emptyset \), for i and j = 1,2,⋯ ,K and i ≠ j

  • \(\cup _{i=1}^{K} C_{i} = X\)

The first condition ensures that there will be at least one pixel in every formed cluster. The next condition implies that all the formed clusters will be mutually exclusive, i.e. a pixel will not be assigned to two clusters. The last condition states that the data values assigned to all the clusters will represent the complete image.
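The three conditions can be checked mechanically. The following is a minimal sketch, assuming pixels are identified by hashable keys such as (row, column) tuples and clusters are given as Python sets; the function name `is_valid_partition` is illustrative and not part of any method discussed here.

```python
def is_valid_partition(pixels, clusters):
    """Check the three clustering conditions for clusters C_1..C_K over X.

    pixels   -- iterable of hashable pixel identifiers, e.g. (row, col) tuples
    clusters -- list of sets of pixel identifiers
    """
    # Condition 1: every cluster contains at least one pixel.
    if any(len(c) == 0 for c in clusters):
        return False
    # Condition 2: clusters are pairwise disjoint, so their sizes must add
    # up exactly to the size of their union.
    union = set().union(*clusters)
    if sum(len(c) for c in clusters) != len(union):
        return False
    # Condition 3: the union of all clusters covers the complete image.
    return union == set(pixels)


# Hypothetical 2 x 2 image addressed by (row, col).
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
good = [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}]
overlapping = [{(0, 0), (0, 1), (1, 0)}, {(1, 0), (1, 1)}]
incomplete = [{(0, 0)}, {(1, 1)}]
```

Here `good` satisfies all three conditions, `overlapping` violates the second, and `incomplete` violates the third.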

In the literature, there are a number of clustering algorithms for image segmentation. However, there is no precise definition of a cluster, and different clustering methods have defined their own techniques to group the data. Therefore, based on the cluster formation, the clustering methods may broadly be classified into two main categories, namely hierarchical and partitional [85]. A taxonomy of the clustering methods is presented in Fig. 3. The following section presents each category of clustering with respect to image segmentation.

Fig. 3 Classification of clustering based image segmentation methods [85]

2 Hierarchical clustering

In hierarchical clustering, the grouping of data is performed at different similarity levels which is schematized by a tree-like structure termed as a dendrogram. Generally, the hierarchical splitting follows two approaches, namely divisive and agglomerative [66]. Figure 4 depicts an example of the dendrogram along with the hierarchical approaches.

Fig. 4 Types of hierarchical clustering [66]

Table 1 summarizes the methods belonging to these two sub-categories. In divisive clustering, recursive hierarchical data splitting is performed in a top-down fashion to generate the clusters. All the data items belong to a single cluster initially. This single cluster is further split into smaller clusters until a termination criterion is satisfied or until each data item forms its own cluster. Divisive clustering (DIVCLUS-T) [17] and divisive hierarchical clustering with diameter criterion (DHCDC) [31] are popular methods of this category. On the other hand, agglomerative clustering is performed in a bottom-up fashion where data points are merged hierarchically to produce clusters. Initially, each data item forms its own cluster; these clusters are then merged into bigger clusters until a termination criterion is met or until a single cluster consisting of all the data items is formed. Some popular agglomerative methods are balanced iterative reducing and clustering using hierarchies (BIRCH) [94], clustering using representatives (CURE) [32], and chameleon [41].

Table 1 Hierarchical clustering methods for image segmentation

In general, divisive clustering is more complex than the agglomerative approach, as it partitions the data until each cluster contains a single data item. Divisive clustering is computationally efficient only if the clusters are not partitioned all the way down to individual data leaves. The time complexity of naive agglomerative clustering is \(O(n^{3})\), which can be reduced to \(O(n^{2})\) using optimization algorithms. In contrast, the time complexity of divisive clustering is \({\Omega }(n^{2})\) [66]. Moreover, divisive clustering is also more accurate, since its top-level partitioning considers the global distribution of the data. The pseudocode of the divisive and agglomerative clustering approaches is presented in Algorithms 1 and 2 respectively.

Algorithm 1 Divisive clustering (pseudocode figure)
Algorithm 2 Agglomerative clustering (pseudocode figure)
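The bottom-up procedure can be illustrated with a short sketch. It is not a transcription of Algorithm 2: it assumes single linkage (the distance between two clusters is that of their closest members) and squared Euclidean distance, and it stops when a requested number of clusters remains. The function name and parameters are illustrative.

```python
def agglomerative(points, num_clusters):
    """Naive bottom-up clustering with single linkage (an assumption here).

    points       -- list of equal-length numeric tuples
    num_clusters -- stop merging when this many clusters remain
    Returns a list of clusters, each a sorted list of point indices.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Start with every data item in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > max(1, num_clusters):
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(sq_dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair; b > a, so removing b keeps index a valid.
        clusters[a] = sorted(clusters[a] + clusters.pop(b))
    return clusters


# Hypothetical grey levels: three dark pixels and two bright ones.
grey = [(10,), (12,), (11,), (200,), (205,)]
result = agglomerative(grey, 2)
```

The repeated scan over all cluster pairs is what makes the naive approach expensive on image-sized inputs.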

Among the divisive hierarchical clustering algorithms, DIVCLUS-T follows the monothetic bipartitional approach, which allows the constructed dendrogram to be read as a decision tree. This method can handle both categorical and numerical data. In DHCDC, the cluster with the largest diameter is selected and partitioned into two clusters. BIRCH is generally used for very large datasets due to its ability to discover a good clustering result in a single scan of the dataset. It constructs a clustering feature tree to discover clusters. However, BIRCH can only produce spherical and convex clusters. Therefore, as an alternative, the CURE clustering algorithm is used. CURE employs a random sampling approach to cluster the data items, and the partial results are then combined to create the final result. ROCK is an improved version of CURE which is suitable for data of enumeration type. Chameleon first divides the original data into smaller graphs and then merges these small graphs to create the final clusters.

Generally, hierarchical methods employ a greedy approach and do not reconsider a data item after it has been assigned to a cluster. As a result, they cannot correct a misclassified data item. Therefore, these methods lack robustness, especially in the presence of noise and outliers. They do not optimize an objective function while forming the clusters. Moreover, they perform poorly when clusters overlap. While generating clusters for a particular problem, knowledge about the number of clusters is required. Moreover, they tend to form spherical clusters, and reversals in the hierarchical structure may distort it. Additionally, time complexity is one of the major issues while clustering data through the hierarchical approach, especially on high-dimensional datasets like images. Hierarchical methods are computationally expensive, with a time complexity of at least \(O(n^{2})\) for most of them [66], where n corresponds to the number of data points. Therefore, to overcome this limitation, partitional clustering methods are preferred, which are discussed in the following section.

3 Partitional clustering

Partitional clustering is relatively popular and preferred over hierarchical clustering, especially for large datasets, due to its computational efficiency [13]. In this clustering approach, the notion of similarity is used as the measurement parameter. Generally, partitional clustering groups the data items into clusters according to some objective function such that the data items in a cluster are more similar to each other than to the data items in other clusters. To achieve this, the similarity of each data item to every cluster is measured. Moreover, in partitional clustering, the objective function is generally the minimization of a within-cluster dissimilarity criterion, which is usually computed using Euclidean distance. The objective function expresses the goodness of each formed cluster and returns the best representation from the generated clusters. As in hierarchical clustering, the number of clusters to be formed needs to be defined in advance. Moreover, methods based on partitional clustering ensure that each data item is assigned to a cluster even if it is quite far from the respective cluster centroid. This sometimes distorts the cluster shapes or produces false results, especially in the presence of noise or outliers.

Table 2 lists partitional methods which have been used in various areas, such as image segmentation, robotics, wireless sensor networks, web mining, business management, and medical sciences. Each application domain has a different data distribution and complexity. Thus, a single method of partitional clustering might not fit all problems, and a suitable method is selected based on the problem and dataset. As shown in Table 2, the partitional clustering methods are categorized into two classes, namely soft and hard clustering methods, which are discussed in the following subsections.

Table 2 Partitional clustering methods for image segmentation

3.1 Soft clustering methods

Soft clustering methods iteratively assign each data item to two or more clusters with a degree of belongingness (or membership). The degree of belongingness illustrates the level of association among data more reasonably. The belongingness of a data item to a cluster is a continuous value in the interval [0, 1] and depends upon the objective function. Usually, the minimization of the sum of squared Euclidean distances between the data and the formed clusters is considered. Some popular methods are fuzzy c-means (FCM) [10], fuzzy c-shells (FCS) [27], and the mountain method [88]. In FCS, each cluster is considered as a multidimensional hypersphere and the distance function is defined accordingly. The mountain method uses the mountain function to find the cluster centres. In particular, FCM is the most widely used and popular method of this approach [53]. It returns a set of K fuzzy clusters by minimizing the objective function defined in (1).

$$ \sum\limits_{i=1}^{N} \sum\limits_{k=1}^{K} \mu_{ik}^{m}\parallel x_{i} - v_{k}\parallel^{2}, m \geq 1. $$
(1)

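where μik ∈ [0,1] is the membership degree of the ith pixel (xi) in the kth cluster, vk is the centre of the kth cluster, N is the number of pixels, and m is the exponent of the fuzzy partition matrix. Equation (1) is optimized iteratively by updating μik and vk according to (2) and (3) respectively.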

$$ \mu_{ik}=\frac{1}{{\sum}_{j=1}^{K}\big(\frac{\parallel x_{i} - v_{k}\parallel}{\parallel x_{i} - v_{j}\parallel}\big)^{\frac{2}{m-1}}} $$
(2)
$$ v_{k} = \frac{{\sum}_{i=1}^{N}\mu_{ik}^{m} x_{i}}{{\sum}_{i=1}^{N}\mu_{ik}^{m}} $$
(3)

Normally, the exponent of the fuzzy partition matrix (m) is kept as m > 1. This regulates the number of pixels that can have membership with more than one cluster. The pseudo-code of FCM is presented in Algorithm 3.

Algorithm 3 Fuzzy c-means (pseudocode figure)
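As an illustration of updates (2) and (3), the following minimal sketch alternates the membership and centre updates until the centres stop moving. It is not a transcription of Algorithm 3. Several details are assumptions made for this example: the initial centres are K data items spread evenly over the input order, a pixel that coincides with a centre receives full membership in that cluster, and iteration stops when the largest centre shift drops below `tol`.

```python
def fuzzy_c_means(data, K, m=2.0, max_iter=100, tol=1e-6):
    """Minimal FCM sketch following updates (2) and (3); requires m > 1.

    data -- list of equal-length numeric tuples (the pixels x_i)
    Returns (centres v_k, memberships mu[i][k]).
    """
    n, dim = len(data), len(data[0])
    # Assumption: deterministic initial centres taken from the data.
    centres = [list(data[i * n // K]) for i in range(K)]
    p = 2.0 / (m - 1.0)
    mu = []
    for _ in range(max_iter):
        # Membership update, Eq. (2).
        mu = []
        for x in data:
            dist = [sum((a - b) ** 2 for a, b in zip(x, v)) ** 0.5
                    for v in centres]
            if 0.0 in dist:
                # x coincides with a centre: full membership there (assumed).
                hit = dist.index(0.0)
                mu.append([1.0 if k == hit else 0.0 for k in range(K)])
            else:
                mu.append([1.0 / sum((dist[k] / dist[j]) ** p for j in range(K))
                           for k in range(K)])
        # Centre update, Eq. (3).
        new_centres = []
        for k in range(K):
            w = [mu[i][k] ** m for i in range(n)]
            total = sum(w)
            if total > 0:
                new_centres.append([sum(w[i] * data[i][d] for i in range(n)) / total
                                    for d in range(dim)])
            else:
                new_centres.append(centres[k])  # keep a centre with no weight
        shift = max(abs(a - b)
                    for v, nv in zip(centres, new_centres) for a, b in zip(v, nv))
        centres = new_centres
        if shift < tol:
            break
    return centres, mu


# Hypothetical grey levels forming two well-separated groups.
levels = [(1.0,), (2.0,), (3.0,), (101.0,), (102.0,), (103.0,)]
centres, mu = fuzzy_c_means(levels, K=2)
```

Each row of `mu` sums to one, so a hard segmentation can be obtained afterwards by assigning every pixel to the cluster in which its membership is largest.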

The inclusion of the degree of belongingness benefits soft clustering methods in a number of ways, such as relatively high clustering accuracy, faster generation of approximate solutions, and efficient handling of incomplete or heterogeneous data. This has resulted in wide applicability of these approaches, ranging from discovering association rules to image retrieval. However, there are certain limitations. These methods require prior knowledge about the number of clusters to be formed, may return locally optimal solutions, have low scalability, and are sensitive to initial parameter settings, noise, outliers, and the number of formed clusters.

3.2 Hard clustering methods

Hard clustering methods iteratively partition the data into disjoint clusters according to the objective function. Generally, the objective function is the sum of squared Euclidean distances between the data and the associated centroids, which is to be minimized. Usually, the centre of the clustered data is considered as the centroid of the cluster in these methods. Moreover, in contrast to soft clustering, hard clustering assigns each data item to a single cluster only, i.e., each data item has a degree of belongingness of either 0 or 1. The hard clustering approach is relatively simple and scalable with high computational efficiency. Moreover, it works well for datasets whose clusters are spherical and well separated. However, it suffers from a number of demerits: the formed cluster centroids are relatively poor cluster descriptors, the results are sensitive to initial parameter settings, and the number of clusters to be formed must be known in advance. The various hard clustering methods may be grouped into three categories, namely Kmeans-based methods, histogram-based thresholding methods, and meta-heuristics based methods, as depicted in Fig. 5.

Fig. 5 Classification of hard clustering methods

3.2.1 Kmeans-based methods

In Kmeans-based methods, the cluster centroid is updated by taking the mean of all the data items assigned to the corresponding cluster. This is repeated until some defined convergence criterion is met. Although methods of this category have merits like relatively low time complexity, simplicity, and guaranteed convergence, there are a number of limitations which need to be handled: the number of clusters to be formed needs to be known in advance; solution quality depends on the initial clusters and the number of formed clusters; the methods are not appropriate for data having a non-convex distribution; they follow a hill-climbing strategy and hence are usually trapped in local optima; and they are relatively sensitive to outliers, noise, and the initialization phase.

Generally, this category includes all the methods which are inspired by the K-means method, the simplest partitional clustering method. It is a gold standard for clustering in the literature [66]. An overview of the K-means method is detailed below.

  1. K-means:

    K-means [48] partitions a set of data points, X = {x1,⋯ ,xn}, into a number of clusters (k). It performs partition based on the similarity criteria which is usually the sum of squared error defined in (4).

$$ J = \sum\limits_{i=1}^{k} \sum\limits_{x_{j} \in c_{i}} {|| x_{j} - m_{i} ||}^{2} $$
(4)

where, mi is the centroid of cluster i which is collectively represented as M = {m1,⋯ ,mk} for corresponding clusters, C = {c1,⋯ ,ck}. This method iteratively minimizes the criterion function J. Further, the formed clusters C and corresponding centroids M are updated as given by (5) and (6) respectively.

$$ x_{j} \in c_{i} , \ \text{if} \ i = \arg\min_{1 \leq l \leq k} {|| x_{j} - m_{l} ||}^{2} $$
(5)
$$ m_{i} = \frac{{\sum}_{x_{j} \in c_{i}} x_{j}}{| c_{i} |} $$
(6)

for 1 ≤ j ≤ n and 1 ≤ i ≤ k.

K-means has a time complexity of O(nkt), where n, k, and t correspond to the number of data items, the number of clusters to be formed, and the maximum number of iterations respectively. However, this method is sensitive to the initial cluster centroids and is often trapped in local minima. Moreover, solutions vary with the number of clusters. The pseudo-code of the K-means method is presented in Algorithm 4.

Algorithm 4 K-means (pseudocode figure)
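The assignment step (5) and the centroid update (6) can be sketched as follows. This is an illustration rather than a transcription of Algorithm 4. It assumes deterministic initial centroids (k data items spread evenly over the input order), whereas random initialisation is more common in practice, and it leaves an empty cluster's centroid unchanged. The helper `sse` evaluates the criterion J of (4).

```python
def k_means(data, k, max_iter=100):
    """Minimal K-means sketch: assignment step (5) and centroid update (6).

    data -- list of equal-length numeric tuples; k -- number of clusters
    Returns (centroids, labels), where labels[j] is the cluster of data[j].
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n, dim = len(data), len(data[0])
    # Assumption: initial centroids are k data items spread over the input.
    centroids = [list(data[i * n // k]) for i in range(k)]
    labels = [-1] * n
    for _ in range(max_iter):
        # Eq. (5): assign each item to its nearest centroid.
        new_labels = [min(range(k), key=lambda l: sq_dist(x, centroids[l]))
                      for x in data]
        if new_labels == labels:
            break  # no assignment changed, so J cannot decrease further
        labels = new_labels
        # Eq. (6): move each centroid to the mean of its members.
        for l in range(k):
            members = [data[j] for j in range(n) if labels[j] == l]
            if members:  # keep the old centroid if a cluster became empty
                centroids[l] = [sum(p[d] for p in members) / len(members)
                                for d in range(dim)]
    return centroids, labels


def sse(data, centroids, labels):
    """Criterion J of Eq. (4): squared distance of items to their centroid."""
    return sum(sum((x - c) ** 2 for x, c in zip(data[j], centroids[labels[j]]))
               for j in range(len(data)))


# Hypothetical two-feature pixels forming two compact groups.
pixels = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = k_means(pixels, 2)
```

Each pass costs one distance evaluation per item and centroid, which matches the O(nkt) behaviour noted above.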

Some other popular methods of this category are bisecting K-means [72], sort-means [62], K-harmonic means [92], K-modes [16], K-medoids [61], partition around medoids (PAM) [40], and clustering large applications based upon randomized search (CLARANS) [55]. The K-medoids method is a variant of K-means which defines the cluster centroid as the data point nearest to the centre of the cluster. This method is suited to discrete data. PAM is a medoid-based method with two phases, namely build and swap. It is a greedy search method which aims to minimize the dissimilarity between the objects and their closest medoids. In the build phase, a set of data points is chosen as medoids, which defines the selected objects set, and the remaining data points are kept in another set, termed the unselected objects. The swap phase tries to improve the cluster quality by swapping data in the selected objects set with data in the unselected objects set. Further, CLARANS is an efficient clustering method, especially for large datasets. It uses the concept of a graph to find the medoids for the given data. It applies PAM on the complete data but considers only a subset of the possible swaps between the selected and unselected objects.
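The build and swap phases of PAM can be sketched as below. This is a simplified illustration under stated assumptions: the build phase greedily adds the item that lowers the total dissimilarity the most, and the swap phase accepts any exchange between a selected and an unselected item that strictly lowers it. The function name, the dissimilarity callback `dist`, and the sample data are hypothetical.

```python
def pam(data, k, dist):
    """Minimal PAM sketch with a greedy build phase and a swap phase.

    data -- list of items; dist -- function(a, b) -> non-negative dissimilarity
    Returns a sorted list of medoid indices into data.
    """
    n = len(data)

    def cost(medoids):
        # Total dissimilarity of every item to its closest medoid.
        return sum(min(dist(data[i], data[m]) for m in medoids)
                   for i in range(n))

    # Build phase: greedily add the item that lowers the total cost the most.
    selected = []
    while len(selected) < k:
        candidates = [i for i in range(n) if i not in selected]
        selected.append(min(candidates, key=lambda i: cost(selected + [i])))

    # Swap phase: exchange a selected and an unselected item while it helps.
    best = cost(selected)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in selected:
                    continue
                trial = selected[:pos] + [cand] + selected[pos + 1:]
                trial_cost = cost(trial)
                if trial_cost < best:
                    selected, best, improved = trial, trial_cost, True
    return sorted(selected)


# Hypothetical grey levels; dissimilarity is the absolute difference.
values = [1, 2, 3, 10, 11, 12]
medoids = pam(values, 2, lambda a, b: abs(a - b))
```

In this example the build phase first picks the values 3 and 11, and the swap phase then replaces 3 with 2, which lowers the total dissimilarity from 5 to 4. Because medoids are always actual data items, the method only needs a dissimilarity function, not a mean.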

3.2.2 Histogram-based methods

This category includes methods which perform segmentation by constructing a histogram of the frequency of the intensity values in an image. These methods identify a set of optimal threshold values which partition the histogram. This partitioning groups the intensity values into different clusters. Therefore, histogram-based segmentation methods can be considered a clustering approach. Generally, a single feature of an image, usually the intensity, is considered to define a 1-dimensional (1D) histogram. For segmentation, histogram information such as peaks, valleys, and curvatures is analyzed. Figure 6 illustrates a 1D histogram of an image taken from the Berkeley segmentation dataset and benchmark (BSDS) 300 [49], where the x-axis represents the grey levels while the y-axis depicts the frequency. It can be noticed from the histogram that the grey levels vary from 20 to 220. Additionally, higher peaks can be observed for grey levels between 50 and 100, and the most frequently occurring intensity value is 80. This signifies that many regions are dark, which can also be observed in the considered image.

Fig. 6 1D view of grey level histogram of an image taken from BSDS300 [49]: a Image, b 1D histogram

To understand the mathematical formulation of the 1D histogram-based segmentation method, consider an RGB image of N pixels, each having an intensity from {0,1,2,⋯ ,L − 1} in each plane. Suppose ni is the number of pixels with the ith intensity level. Then the probability (pi) of the ith intensity level is the probability of occurrence of image pixels with that intensity level, formulated as (7).

$$ p_{i}=\frac{n_{i}}{N}, \ 0\leq i \leq L-1 $$
(7)

The mean intensity of an image plane is determined by (8).

$$ \mu= \sum\limits_{i=0}^{L-1}{ip_{i}} $$
(8)

To group the image pixels into n clusters {C1,C2,⋯ ,Cn}, n − 1 thresholds, (t1,t2,⋯ , tn− 1), are required which are represented as (9).

$$ v(x,y)=\begin{cases} 0, & v(x,y)\leq t_{1}\\ \frac{t_{1}+t_{2}}{2}, & t_{1} < v(x,y) \leq t_{2}\\ &.\\ &.\\ &.\\ \frac{t_{n-2}+t_{n-1}}{2}, & t_{n-2} < v(x,y) \leq t_{n-1}\\ L-1, & v(x,y)> t_{n-1} \end{cases} $$
(9)

where v(x,y) corresponds to the pixel intensity at location (x,y) of an M × N image. The partition Cj, 1 ≤ j ≤ n, consists of pixels with intensity value greater than tj− 1 and less than or equal to tj. The frequency of cluster Cj for each plane is computed as (10).

$$ w_{j}= \sum\limits_{i=t_{j-1}+1}^{t_{j}}{p_{i}} $$
(10)

The mean for the cluster Cj can be calculated by (11) and the inter-class variance is formulated in (12).

$$ \mu_{j}= \sum\limits_{i=t_{j-1}+1}^{t_{j}}{ip_{i}/w_{j}} $$
(11)
$$ \sigma^{2}= \sum\limits_{j=1}^{n}{w_{j}{(\mu_{j}-\mu)}^{2}} $$
(12)

To group the pixels into clusters based on intensity, the inter-class variance (12) is maximized. Therefore, the thresholds are chosen to maximize the fitness function defined in (13):

$$ \phi= \max_{1< t_{1}< {\cdots} <t_{n-1}<L}\{\sigma^{2}(t)\} $$
(13)
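Equations (7) to (13) can be tied together in a short sketch that evaluates the inter-class variance for a given set of thresholds and searches for the best thresholds exhaustively. The exhaustive search is only practical for a small number of thresholds, which is one reason metaheuristics are often applied to this objective. The outer bounds of the first and last cluster (−1 and L − 1) and the candidate threshold range 0 to L − 2 are assumptions made for this example.

```python
from itertools import combinations


def between_class_variance(hist, thresholds):
    """Inter-class variance of Eq. (12) for a grey-level histogram.

    hist       -- list of pixel counts n_i for levels 0..L-1
    thresholds -- increasing thresholds t_1 < ... < t_{n-1}
    """
    L = len(hist)
    N = sum(hist)
    p = [c / N for c in hist]                      # Eq. (7)
    mu = sum(i * p[i] for i in range(L))           # Eq. (8)
    # Cluster j covers levels t_{j-1}+1 .. t_j; the outer bounds are assumed
    # to be -1 and L-1 so that every level belongs to exactly one cluster.
    bounds = [-1] + list(thresholds) + [L - 1]
    variance = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        levels = range(lo + 1, hi + 1)
        w = sum(p[i] for i in levels)              # Eq. (10)
        if w == 0:
            continue                               # empty cluster adds nothing
        mu_j = sum(i * p[i] for i in levels) / w   # Eq. (11)
        variance += w * (mu_j - mu) ** 2           # Eq. (12)
    return variance


def best_thresholds(hist, n_clusters):
    """Exhaustive search for Eq. (13); feasible only for few thresholds."""
    levels = range(len(hist) - 1)
    return max(combinations(levels, n_clusters - 1),
               key=lambda t: between_class_variance(hist, t))


# Hypothetical 8-level histogram with a dark and a bright mode.
hist = [4, 6, 2, 1, 1, 2, 6, 4]
t = best_thresholds(hist, 2)
```

For this histogram the single best threshold is level 3, which separates the two modes symmetrically.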

The performance of 1D histogram-based methods is generally unsatisfactory, as they consider only a single feature, like the intensity level information of an image, and do not deal with the spatial correlation among the pixels, which is an important parameter for segmentation. To address this, a new histogram, termed the 2D histogram, has been introduced for image segmentation and has been reported to perform better. The concept of 2D histogram-based thresholding was introduced by Abutaleb [2], where the original grey-level information is integrated with local averaging information of pixels to form the grey-local 2D histogram. The experimental results of this method are promising in comparison to 1D histogram-based methods.

A 2D-histogram considers two features of an image at a time. In literature, these features are selected from a set of three image features, namely pixel intensity, average pixel intensity, and pixel gradient. Based on the considered features, two forms of 2D-histograms based image segmentation methods exist, namely grey-local 2D histogram [65] and grey-gradient 2D histogram [87].

Figure 7 presents a grey-local 2D histogram for the sample image taken from the BSDS300 dataset [49]. In the figure, the 2D-histogram is constructed by considering the intensity values of a grey-scale image on the x-axis while the y-axis consists of the average intensity values generated by using the local mean filter on the same image. This representation corresponds to the spatial correlation among the pixels.

Fig. 7 Grey-local 2D histogram of the image depicted in Fig. 6a: a Three dimensional view, b Two dimensional view
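The construction of a grey-local 2D histogram can be sketched as follows. The example assumes a square local-mean window (3 × 3 by default) that is clipped at the image border and a local mean that is rounded down to an integer level; these choices, like the function name, are illustrative and may differ from the cited methods.

```python
def grey_local_histogram(image, levels, radius=1):
    """Build a grey-local 2D histogram from a grey-scale image.

    image  -- list of rows of integer intensities in 0..levels-1
    levels -- number of grey levels L
    radius -- half-width of the square local-mean window (1 gives 3 x 3)
    Returns hist where hist[g][a] counts pixels with intensity g whose
    neighbourhood mean, rounded down, equals a.
    """
    rows, cols = len(image), len(image[0])
    hist = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            # Local mean over the window, clipped at the image border.
            window = [image[rr][cc]
                      for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                      for cc in range(max(0, c - radius), min(cols, c + radius + 1))]
            local_mean = sum(window) // len(window)
            hist[image[r][c]][local_mean] += 1
    return hist


# Hypothetical 2 x 2 images with four grey levels.
flat = grey_local_histogram([[2, 2], [2, 2]], levels=4)
checker = grey_local_histogram([[0, 3], [3, 0]], levels=4)
```

For the uniform image all four pixels fall on the diagonal cell (2, 2), whereas the checkerboard pixels land off the diagonal because their intensities differ from the neighbourhood mean.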

The grey-local 2D histogram takes the mean value of a group of pixels surrounding a target pixel to smooth the image, which results in a loss of information as fine details such as points, lines, and edges get blurred. To minimize this loss of information, Mittal et al. [51] employed non-local means filtering instead of local filtering and proposed the non-local means 2D histogram, which yields better post-filtering clarity. Moreover, Guang et al. [87] proposed a grey-gradient 2D histogram to consider the edge information. In a grey-gradient 2D histogram, the x-axis represents the grey intensity values while the y-axis defines the gradient values of the same image. However, this approach often generates results inferior to those of the grey-local 2D histogram. Figure 8 represents the grey-gradient 2D histogram for the sample image considered in Fig. 6a.

Fig. 8 Grey-gradient 2D histogram of the image depicted in Fig. 6a: a Three dimensional view, b Two dimensional view

Generally, histogram-based segmentation methods are efficient, as they require only a single pass through the pixels to construct the histogram. However, these methods have flaws: prior knowledge about the number of clusters is required, they are highly sensitive to noise, and in the overlapping regions of the histogram it is quite difficult to identify significant peaks and valleys. Moreover, 2D histograms have the further limitations that the diagonal regions are not smooth and that the off-diagonal information does not bring any improvement.

3.2.3 Metaheuristic-based methods

This category involves the use of metaheuristic-based approaches to obtain optimal clusters by updating random solutions according to a mathematical formulation and an optimality criterion (or objective function). It is a special branch of soft computing that has been leveraged to solve complex real-world problems in reasonable time as compared to classical methods. Broadly, algorithms of this category belong to the class of optimization algorithms which can tackle computationally hard problems, like NP-complete problems [8]. However, no single meta-heuristic algorithm is efficient at solving every problem, as stated by the No Free Lunch Theorem [84]. Generally, a meta-heuristic method solves an optimization problem whose objective is either the maximization or the minimization of a cost function (f(x)) under a given set of constraints. The mathematical formulation of a maximization problem is presented in (14), with its constraints in (15) to (17).

$$ Maximize_{\{x \in \mathbb{R}^{d}\}} f(x) $$
(14)
$$ such \ that: \ a_{i} (x)\geq0, $$
(15)
$$ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ b_{j}(x)=0, $$
(16)
$$ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ c_{k} (x)\leq0 $$
(17)

where x = (x1,x2,x3,⋯ ,xd)T is the vector of decision variables defined over d dimensions and \( \mathbb {R}^{d}\) is the search space of the problem. ai(x), bj(x), and ck(x) correspond to the different constraints applicable to an optimization problem. The actual constraints depend on the considered optimization problem.

Over the last three decades, more than sixty meta-heuristic algorithms have been proposed based on inspiration from nature to provide the optimal solution. Each meta-heuristic algorithm mimics a particular natural phenomenon, which may be evolutionary, physical, or biological. In the literature, two common aspects that are often found in these algorithms are exploration and exploitation. Exploration represents diversification in the search space, wherein the existing solutions are updated with the intention of exploring the search space. This helps in discovering new solutions, prevents stagnation, and is responsible for reaching the global solution. Exploitation, which corresponds to intensification of the current solution, performs a local search around the currently generated solutions. Here, the goal is to exploit the search space, and it is responsible for convergence to the optimal solution. Generally, meta-heuristic algorithms may broadly be classified into two categories, namely evolutionary and swarm algorithms, as depicted in Fig. 5. Evolutionary algorithms are based on evolution theories such as Darwin’s evolutionary theory. They work on the principle of generating better individuals over the course of generations by combining the best individuals of the current generation. Genetic algorithm (GA) [59], evolutionary strategy (ES) [6], differential evolution (DE) [64], biogeography-based optimization (BBO) [58], and probability-based incremental learning (PBIL) [25] are some examples of evolutionary algorithms. On the other hand, swarm-based algorithms behave like a swarm of agents, such as fishes or birds, to achieve optimal results. Some algorithms of this category are particle swarm optimization (PSO) [42], ant colony optimization (ACO) [29], gravitational search algorithm (GSA) [54], spider monkey optimization (SMO) [7], grey-wolf optimizer (GWO) [75], cuckoo search (CS) [60], and military dog based optimizer (MDO) [76].

Meta-heuristic algorithms are able to provide promising results for solving the image clustering problem. Generally, these clustering-based methods are better than other clustering methods in terms of independence from the initial parameter settings and the ability to return a global optimal solution [52]. As these methods tend to give better solutions, they have been widely used in clustering [38]. The basic approach of using meta-heuristic algorithms as clustering algorithms was introduced by Selim and Alsultan [69], using simulated annealing. Thereafter, Bezdek et al. [11] presented a genetic algorithm based data clustering method, which was the first evolutionary method of data clustering. The first swarm-based clustering algorithm was introduced by Lumer et al. [44] using ant colony optimization. A literature review of the existing metaheuristic-based clustering for image segmentation is presented below. Moreover, the pseudo-code for a metaheuristic-based clustering method is presented in Algorithm 5.

Algorithm 5 Metaheuristic-based clustering (pseudocode figure)
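To make the general scheme concrete, the following sketch uses PSO as one possible metaheuristic; it is not a transcription of Algorithm 5 nor of any cited method. Each particle encodes k candidate centroids, and the objective is the sum of squared distances of the data items to their nearest centroid. The inertia weight `w`, the coefficients `c1` and `c2`, the swarm size, and the initialisation at random data items are all illustrative assumptions.

```python
import random


def nearest_centroid_sse(data, centroids):
    """Objective: sum of squared distances to the nearest centroid."""
    return sum(min(sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids)
               for p in data)


def pso_clustering(data, k, particles=10, iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO-based clustering sketch; each particle encodes k centroids.

    Returns (best centroids, history of the best objective per iteration).
    """
    rng = random.Random(seed)
    dim = len(data[0])
    # Each position is a list of k centroids, initialised at random data items.
    pos = [[list(rng.choice(data)) for _ in range(k)] for _ in range(particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(particles)]
    pbest = [[c[:] for c in p] for p in pos]
    pbest_fit = [nearest_centroid_sse(data, p) for p in pos]
    g = min(range(particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = [c[:] for c in pbest[g]], pbest_fit[g]
    history = [gbest_fit]
    for _ in range(iters):
        for i in range(particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Velocity blends inertia, personal best and global best.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = nearest_centroid_sse(data, pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = [c[:] for c in pos[i]], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = [c[:] for c in pos[i]], fit
        history.append(gbest_fit)
    return gbest, history


# Hypothetical two-feature pixels forming two compact groups.
pixels = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
best, history = pso_clustering(pixels, 2)
```

The recorded history never increases, because the global best is replaced only by strictly better solutions. The same loop can optimize other objectives, such as the inter-class variance of (12), by swapping the fitness function.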

Evolutionary-based clustering

Krishna et al. [43] explored a genetic algorithm combined with K-means, in which the crossover operation of the genetic algorithm was performed by K-means. Subsequently, Maulik et al. [50] introduced an evolutionary clustering method using the genetic algorithm. Moreover, Schwefel [9] presented the recent works in the area of clustering using evolutionary strategy. It has been observed that hybrid evolutionary strategy based algorithms perform better than the standard evolutionary strategy on the clustering problem for the majority of the UCI benchmark datasets. In continuation, Sheng et al. [70] introduced a hybrid algorithm based on K-medoids and the genetic algorithm for clustering large-scale datasets. In general, hybrid evolutionary algorithms, which are formed by merging the good features of two or more algorithms, have been observed to outperform their parent algorithms. Lue et al. [46] proposed a gene transposition based algorithm for efficient clustering in which the immune cells (population) are initialized with a vector of K cluster centroids. Likewise, Ye et al. [73] presented a GA and PSO based application for image segmentation. Jiao et al. [37] applied a memetic clustering algorithm to the segmentation of remote sensing images. Further, Lu et al. [47] combined multiple clusterings using various measures of the partitions and proposed a fast simulated annealing algorithm for efficient clustering. Agusti et al. [4] explored the grouping genetic algorithm to solve clustering problems. The summation of intra-cluster distances was used as the fitness function for partition-based clustering, while the FCM objective was used as the fitness function for fuzzy clustering. Further, Pal et al. [58] used the biogeography-based optimizer (BBO) for clustering.

Swarm-based clustering

Merve et al. [81] proposed a swarm-based algorithm for partitional clustering using PSO. Chuang et al. [19] introduced a chaotic PSO clustering algorithm in which the conventional parameters of PSO were replaced with chaotic operators. Along the same lines, Tsai and Cao [77] presented a PSO based clustering algorithm with selective particle regeneration. Further, the shuffled frog leaping based clustering algorithm was successfully utilized for image segmentation [12]. Hybrid variants of PSO with K-means [81], K-harmonic means [90] and rough set theory [34] have also been introduced. Zhang et al. [93] introduced a variant based on possibilistic c-means and PSO for image segmentation. Furthermore, some researchers proposed hybrid algorithms that combine swarm and evolutionary methods for effective data clustering. Xu et al. [86] combined DE with PSO and reported efficient results. Similarly, PSO was integrated with GA and simulated annealing to improve the clustering over the conventional PSO. Zou et al. [95] introduced cooperative artificial bee colony (ABC) based clustering. Hybrid algorithms for clustering have also been proposed by a number of researchers and utilized for image segmentation. In addition, a clustering algorithm based on bacterial foraging optimization was introduced by Lie et al. [83]. Senthilnath et al. [89] developed a firefly based algorithm for the efficient analysis of clusters and tested it on the UCI datasets. Besides this, an invasive weed optimization based clustering algorithm was presented by Chowdhury et al. [18]. Subsequently, Liu et al. [45] proposed a multi-objective invasive weed optimization based algorithm for efficient clustering. Hatamlou et al. [33] introduced a GSA based clustering algorithm in which the cluster centroids are initialized with K-means.

In general, meta-heuristic methods have a number of merits: they are applicable to a wide set of real-world optimization problems, their search behaviour is stochastic, and they are able to find approximate solutions to NP-complete problems. However, these algorithms also have demerits: convergence to the global optimum is only probabilistic, they sometimes get trapped in local optima, and their computational complexity is comparatively high.

4 Performance evaluation parameters

Performance evaluation is necessary to assure the validity of a method. This section lists various performance measures that researchers have found useful for the quantitative evaluation of an image segmentation method [51]. Table 3 tabulates the formulation of the various performance parameters.

  1. Confusion matrix (CM): It is a widely used representation for assessing the efficacy of a classification method [66]; however, it can also be used to analyze the results of a clustering method. A confusion matrix of size N × N corresponds to N classes (or clusters). In the context of clustering, the number of correctly clustered patterns (true positive (TP), true negative (TN)) and wrongly clustered patterns (false positive (FP), false negative (FN)) can be easily identified from the confusion matrix. Table 4 illustrates a typical confusion matrix for 2 clusters, i.e. positive and negative. In the table, TP and TN depict the number of data items which are correctly predicted as positive and negative respectively, while FP and FN correspond to the number of data items which are wrongly predicted as positive and negative respectively. Based on Table 4, precision, recall and accuracy can be computed using (18) to (20).

    $$ Precision= \frac{TP}{TP+FP} $$
    (18)
    $$ Recall= \frac{TP}{TP+FN} $$
    (19)
    $$ Accuracy= \frac{TP+TN}{TP+TN+FP+FN} $$
    (20)
  2. Intersection over Union (IoU): Intersection over Union is the ratio of the number of pixels common to X and Y to the number of pixels in the union of X and Y. Here, X and Y correspond to the segmented image and the ground truth respectively. The formulation of IoU is given in (21).

    $$ IoU=\frac{|X \cap Y|}{|X \cup Y|}=\frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} $$
    (21)
  3. Dice-Coefficient (DC): Dice-Coefficient is defined as twice the number of common pixels divided by the total number of pixels in X and Y, where X corresponds to the segmented image and Y is the ground truth. DC is mathematically defined as (22).

    $$ DC=\frac{2 |X \cap Y|}{|X| + |Y|} $$
    (22)
  4. Boundary Displacement Error (BDE): This parameter computes the average displacement error of boundary pixels between two segmented images, as depicted in (23). The error of one boundary pixel is defined as its distance from the closest pixel in the other boundary image.

    $$ \mu_{LA}(u,v)=\begin{cases}\frac{u-v}{L-1} & 0 < u-v \\0 & u-v \le 0\end{cases} $$
    (23)
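  5. Probabilistic Rand Index (PRI): The Probabilistic Rand Index measures the labelling consistency between the segmented image and its ground truth. It computes the fraction of pixel pairs whose labels are consistent between the two, and averages the result across all ground truths of a given image, as shown in (24). Here, a and b are the numbers of pixel pairs that are placed in the same segment in both labellings and in different segments in both labellings respectively, c and d are the numbers of pairs on which the two labellings disagree, and n is the number of pixels.

    $$ R= \frac{a+b}{a+b+c+d} = \frac{a+b}{\binom{n}{2}} $$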
    (24)
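  6. Variation of Information (VOI): Variation of Information, shown in (25), measures the distance between two segmentations in terms of the information that is lost and gained when moving from one segmentation to the other. Here, H denotes the entropy of a segmentation and I the mutual information between the two segmentations.

    $$ VOI(X,Y) = H(X) + H(Y) - 2I(X,Y) $$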
    (25)
  7. Global Consistency Error (GCE): GCE measures the extent to which one segmentation can be viewed as a refinement of the other, as given in (26). Segmentations related in this way are considered consistent, since both can represent the same natural image segmentation at different scales. Here, E(s1, s2, pi) is the local refinement error at pixel pi and n is the number of pixels.

    $$ GCE = \frac{1}{n} \min\left\{ \sum\limits_{i}E(s_{1},s_{2},p_{i}), \sum\limits_{i}E(s_{2},s_{1},p_{i}) \right\} $$
    (26)
  8. Structural Similarity Index (SSIM): Structural Similarity Index measures the similarity between two images x and y, taking the initial uncompressed or distortion-free image as the reference, and is computed as (27). It incorporates important perceptual phenomena such as luminance masking and contrast masking. Here, the barred quantities are the mean intensities, σx² and σy² are the variances, σxy is the covariance of the two images, and c1 and c2 are small constants that stabilize the division.

    $$ SSIM= \frac{(2\bar{x}\bar{y}+c_{1})(2\sigma_{xy}+c_{2})}{(\bar{x}^{2}+\bar{y}^{2}+c_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+c_{2})} $$
    (27)
  9. Feature Similarity Index (FSIM): Feature Similarity Index is a quality score that uses the phase congruency (PC), which is a dimensionless measure and shows the significance of a local structure. It is calculated by (28).

    $$ FSIM = \frac{\sum_{x\in\varOmega}S_{L}(x)\cdot PC_{m}(x)}{\sum_{x\in\varOmega}PC_{m}(x)} $$
    (28)
  10. Root Mean Squared Error (RMSE): The root-mean-squared error measures the difference between the value predicted by a model or an estimator and the actual value. The formulation is shown in (29).

    $$ \operatorname{RMSE}(\hat{\theta}) = \sqrt{\operatorname{MSE}(\hat{\theta})} = \sqrt{\operatorname{E}((\hat{\theta}-\theta)^{2})} $$
    (29)
  11. Peak Signal to Noise Ratio (PSNR in dB): Peak signal-to-noise ratio is defined as the ratio between the maximum possible power of a signal and the power of the corrupting noise, and is calculated using (30), where n is the number of bits per pixel. It is defined via the MSE, and a higher value of PSNR generally represents a higher-quality reconstruction.

    $$ PSNR = 10\log_{10}\frac{(2^{n}-1)^{2}}{MSE} $$
    (30)
  12. Normalized Cross-Correlation (NCC): Normalized cross-correlation is used for template matching, where the images are first normalized to compensate for lighting and exposure conditions. It is calculated by subtracting the means of the original and segmented images from the corresponding images and dividing by their standard deviations. Let t(x,y) be the segmented image of f(x,y) and n the number of pixels; then NCC is calculated by (31).

    $$ NCC = \frac{1}{n} \sum\limits_{x,y}\frac{1}{\sigma_{f}\sigma_{t}}\left( f(x,y)-\overline{f}\right)\left( t(x,y)-\overline{t}\right) $$
    (31)
  13. Average Difference (AD): This parameter represents the average difference between the pixel values and is computed by (32).

    $$ AD=\frac{1}{MN} \sum\limits_{i=1}^{M} \sum\limits_{j=1}^{N}(x(i,j)-y(i,j)) $$
    (32)
  14. Maximum Difference (MD): This parameter finds the maximum of the error signal, i.e. the largest absolute difference between the original image and the segmented image, and is defined in (33).

    $$ MD = \max_{i,j}\mid x(i,j) - y(i,j)\mid $$
    (33)
  15. Normalized Absolute Error (NAE): The normalized absolute difference between the original and corresponding segmented image gives NAE and is calculated by (34).

    $$ NAE =\frac{{\sum}_{i=1}^{M}{\sum}_{j=1}^{N}\mid x(i,j)-y(i,j)\mid}{{\sum}_{i=1}^{M}{\sum}_{j=1}^{N}\mid x(i,j)\mid} $$
    (34)
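For binary masks, the overlap-based measures in items 1 to 3 reduce to simple pixel counts. The following Python sketch is illustrative only: it assumes that both masks are flat sequences of 0/1 labels of equal length, it uses the standard intersection-over-union definition for IoU, and it returns 0.0 when a denominator is zero, which is a convention chosen for this example. The function names are hypothetical.

```python
def confusion_counts(pred, truth):
    """Return (TP, TN, FP, FN) for two equally long sequences of 0/1 labels."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of pixels")
    pairs = list(zip(pred, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    return tp, tn, fp, fn


def overlap_metrics(pred, truth):
    """Precision, recall, accuracy, IoU and Dice for a binary segmentation."""
    tp, tn, fp, fn = confusion_counts(pred, truth)

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        # |X ∩ Y| = TP and |X ∪ Y| = TP + FP + FN.
        "iou": ratio(tp, tp + fp + fn),
        # |X| + |Y| = (TP + FP) + (TP + FN).
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

For the masks `[1, 1, 1, 0, 0, 0, 1, 0]` (segmented) and `[1, 1, 0, 0, 0, 1, 1, 0]` (ground truth), the counts are TP = 3, TN = 3, FP = 1, and FN = 1, which gives a precision, recall, accuracy, and Dice of 0.75 and an IoU of 0.6.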
Table 3 Various performance parameters for the quantitative evaluation of an image segmentation method [51]
Table 4 Confusion matrix
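The pair-counting and information-theoretic measures (PRI and VOI) can be sketched in the same way. The Python code below is a minimal illustration under stated assumptions: segmentations are flat sequences of segment labels, the natural logarithm is used for the entropies, and PRI is estimated as the mean Rand index over the available ground truths. The function names are hypothetical.

```python
from collections import Counter
from math import log


def rand_index(seg, truth):
    """Fraction of pixel pairs on which two labelings agree."""
    n = len(seg)
    if n != len(truth) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2 pixels)")
    total_pairs = n * (n - 1) // 2

    def same_pairs(counter):
        # Number of pixel pairs that share a label.
        return sum(c * (c - 1) // 2 for c in counter.values())

    a = same_pairs(Counter(zip(seg, truth)))      # together in both labelings
    together_seg = same_pairs(Counter(seg))       # a + c
    together_truth = same_pairs(Counter(truth))   # a + d
    b = total_pairs - together_seg - together_truth + a  # apart in both
    return (a + b) / total_pairs


def probabilistic_rand_index(seg, truths):
    """Mean Rand index of one segmentation against several ground truths."""
    return sum(rand_index(seg, t) for t in truths) / len(truths)


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def variation_of_information(seg, truth):
    """VOI = H(X) + H(Y) - 2 I(X, Y), in nats."""
    n = len(seg)
    p_seg, p_truth = Counter(seg), Counter(truth)
    mutual_info = sum(
        (c / n) * log((c / n) / ((p_seg[s] / n) * (p_truth[t] / n)))
        for (s, t), c in Counter(zip(seg, truth)).items()
    )
    return entropy(seg) + entropy(truth) - 2 * mutual_info
```

Both measures depend only on how pixels are grouped, not on the label values, so a segmentation whose labels are permuted relative to the ground truth still obtains a Rand index of 1 and a VOI of 0.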

Among the above-mentioned parameters, IoU, DC, SSIM, FSIM, PRI, PSNR, and NCC indicate better segmentation at higher values, i.e. a high value indicates that the segmented image is closer to the ground truth. These parameters quantify the agreement between the segmented image and the ground truth, for instance through the region of intersection, which corresponds to the number of matching pixels. On the contrary, the error-based indices such as BDE, VOI, GCE, RMSE, MD, and NAE indicate better segmentation at lower values, as they measure the error between the segmented image and the ground truth.
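The pixel-wise error measures can be computed together in a single pass, which also makes their opposite orientation visible: RMSE, MD, and NAE shrink as the segmented image approaches the reference, whereas PSNR grows. The Python sketch below is illustrative and rests on a few assumptions: images are flat lists of pixel values, PSNR uses the standard definition with the MSE in the denominator and 8 bits per pixel by default, and the function name is hypothetical.

```python
from math import log10, sqrt


def error_metrics(x, y, bits=8):
    """Pixel-wise error measures between a reference x and a segmented image y."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and of equal size")
    n = len(x)
    diffs = [a - b for a, b in zip(x, y)]
    mse = sum(d * d for d in diffs) / n
    peak_power = (2 ** bits - 1) ** 2
    reference_sum = sum(abs(a) for a in x)
    return {
        "rmse": sqrt(mse),
        # PSNR is unbounded when the two images are identical.
        "psnr": float("inf") if mse == 0 else 10 * log10(peak_power / mse),
        "ad": sum(diffs) / n,
        "md": max(abs(d) for d in diffs),
        # Assumed convention: NAE is reported as 0.0 for an all-zero reference.
        "nae": sum(abs(d) for d in diffs) / reference_sum if reference_sum else 0.0,
    }
```

For the reference `[100, 150, 200, 250]` and the segmented values `[100, 140, 210, 250]`, the differences are 0, 10, -10, and 0, so the MSE is 50, AD is 0, MD is 10, and NAE is 20/700. The example also shows that AD can be zero for a non-identical pair because positive and negative differences cancel.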

5 Image segmentation benchmark datasets

The dataset considered is key to a fair analysis of an image segmentation method. Validating a segmentation method against benchmark datasets tests its performance against the challenges posed by those datasets. Generally, a benchmark dataset consists of a variety of images that vary in a number of aspects. Some of the common challenges which segmentation methods need to handle are illumination variation, intra-class variation, and background complexity. Table 5 lists some of the popular benchmark datasets which are publicly available for the image segmentation task; they are briefly described below.

  • Aberystwyth Leaf Evaluation Dataset: It is a dataset of time-lapse images of Arabidopsis thaliana (Arabidopsis) plants for leaf-level segmentation. It consists of original Arabidopsis plant images with 56 annotated ground truth images [1].

  • ADE20K: It is a scene parsing segmentation dataset with around 22K hierarchically segmented images. For each image, there is a mask to segment objects and different parts of the objects [3].

  • Berkeley Segmentation Dataset and Benchmark (BSDS): This is a benchmark dataset for the evaluation of image segmentation methods. It is a collection of 12K manually-labelled segmentations performed on 1K images from the Corel dataset by 30 human subjects. There are two versions of this dataset, BSDS300 [74] and BSDS500 [78], which consist of 300 and 500 images respectively.

  • Brain MRI dataset: It is a segmentation dataset of MRI images along with manual fluid-attenuated inversion recovery (FLAIR) abnormality segmentation masks [14].

  • CAD120 Affordance Dataset: This segmentation dataset provides binary masks for each affordance in the RGB images captured from the Cornell Activity Dataset (CAD) 120. The affordance annotation is performed in the context of humans [15].

  • CoVID-19 CT-images Segmentation Dataset: This dataset consists of 100 CT-scan images of more than 40 CoVID-19 patients along with segmented ground truths. The segmented images were annotated by an expert radiologist [21].

  • Crack Detection Dataset: It consists of five datasets where each dataset consists of pavement images and associated segmented images for crack detection [5].

  • Daimler Pedestrian Segmentation Benchmark: This dataset contains pedestrian images along with corresponding segmented ground truth to perform pedestrian segmentation [24].

  • Epithelium Segmentation dataset: There are 42 ER+ breast cancer images scanned at the magnification of 20x. The ground truth contains manually annotated epithelium region by an expert pathologist [80].

  • EVIMO Dataset: This is a motion segmentation dataset, along with egomotion estimation and tracking, captured through an event camera. It provides pixel-level masks for motion segmentation [30].

  • Liver Tumor Segmentation (LITS): This is a publicly available benchmark dataset provided by CodaLab for the segmentation of liver tumors. It contains 3D CT scans of the liver along with the segmented liver and liver tumors [39].

  • Materials in Context (MINC): This dataset is intended for the segmentation and recognition of materials from the image. It consists of point annotations for 23 different types of materials [57].

  • Nuclei Segmentation dataset: This dataset contains 143 ER+ breast cancer images scanned at 40x. In total, there are around 12,000 nuclei segmented manually [79].

  • Objects with Thin and Elongated Parts: It is a database of three datasets where each image consists of objects having thin and elongated parts. In total, there are 280 images of birds and insects with corresponding ground truth images [20].

  • OpenSurfaces: This dataset includes tens of thousands of segmented surfaces. The dataset is constructed from interior photographs and annotated with information like material type, texture, and context [56].

  • Oxford-IIIT Pet Dataset: This dataset consists of pet images with their pixel-level trimap segmentation. There are 37 categories of pets and each category contains 200 images [82].

  • PetroSurf3D Dataset: This dataset contributes to the area of rock art. It contains a total of 26 high-resolution 3D scans with pixel-wise labelling for the segmentation of petroglyphs [35].

  • Segmentation Evaluation Database: This dataset contains 200 gray-level images along with manually generated ground-truth images by three human subjects. Each image consists of either one or two objects in the foreground and the ground truth images are annotated into two or three classes [67].

  • Sky Dataset: This dataset consists of images taken from the Caltech Airplanes Side dataset. The images, containing sky regions, were selected and their ground truths were prepared. In total, there are 60 images along with ground truth to perform the segmentation of sky [71].

  • TB-roses-v1 Dataset: This is a small dataset consisting of 35 images of rose bushes and corresponding manually segmented rose stems [26].

  • Tsinghua Road Markings (TRoM): It is a dataset containing road images along with ground truths for the purpose of segmentation of road markings. There are 19 categories of road markings [63].

Table 5 Benchmark datasets for image segmentation [23]

6 Findings and conclusion

This paper reviews different clustering-based methods in the field of image segmentation. The clustering methods may be categorized into two broad classes, namely hierarchical and partitional clustering. Hierarchical clustering methods perform data grouping at different similarity levels. They are suitable for both convex and arbitrarily shaped data since these algorithms construct a hierarchy of data elements. However, these methods fail when clusters overlap and are computationally expensive, especially on high-dimensional datasets like images, as they scan the N × N distance matrix for the lowest distance in each of the N-1 iterations. On such datasets, partitional clustering methods have performed better; they partition the data into the required number of clusters based on a criterion and are preferable for convex-shaped data. Therefore, this study includes an extensive survey of partitional clustering methods. In the literature, there exist a number of partitional clustering methods, which belong to either soft or hard clustering approaches. Further, the hard partitional clustering methods are categorized into three broad classes, namely K-means-based methods, histogram-based methods, and metaheuristic-based methods. A concise overview of the different clustering-based image segmentation methods is presented in Table 6. From the table, the following guidelines are extracted.

  • The selection of an appropriate clustering method generally depends on certain parameters such as type of data, sensitivity towards noise, scalability, and number of clusters.

  • Hierarchical clustering algorithms are suitable for datasets with arbitrary shape and attributes of arbitrary type. There are a number of applications in which hierarchical clustering algorithms are used. For example, all files and folders on a hard disk are organized in a hierarchy, which can be easily managed with the help of hierarchical clustering. In hierarchical clustering, storage and time requirements grow faster than linearly; therefore, these methods cannot be directly applied to large datasets like images, micro-arrays, etc. The BIRCH clustering method is a computationally efficient hierarchical clustering method; however, it generates low-quality clusters when applied to large datasets. Among all the hierarchical methods, the CURE clustering method is preferred for clustering high-dimensional datasets. The computational complexity of the CURE method is O(n² log n), where n corresponds to the number of data points.

  • Partitional clustering methods generate disjoint clusters according to some predefined objective function. They are preferred due to their high computing efficiency and low time complexity [66], and for large datasets they are preferred over other clustering methods. Partitional clustering algorithms are used when the number of clusters to be created is known in advance. For example, partitional clustering is used to group the visitors to a website by their age. Though partitional clustering is computationally efficient, the number of clusters needs to be defined beforehand. Moreover, partitional clustering methods are sensitive to outliers and sometimes get trapped in local optima. In particular, FCM and K-means are preferable for clustering spherical-shaped data, while histogram-based methods may be advantageous when clustering is to be performed irrespective of the data distribution.

  • Metaheuristic-based methods are scalable for clustering on large data, especially when the data distribution is complex. Meta-heuristic clustering mitigates the issue of getting trapped in local optima by using the exploration and exploitation properties. However, existing meta-heuristic methods fail to maintain an efficient balance between exploitation and exploration, which greatly affects the performance and the convergence behaviour of the meta-heuristic method.
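The cost argument for hierarchical clustering can be made concrete with a naive agglomerative procedure. The Python sketch below is only an illustration under stated assumptions: it uses single linkage on one-dimensional values, recomputes the inter-cluster distances in every iteration instead of caching a distance matrix, and the function name is hypothetical. Each merge scans all remaining cluster pairs, which is why the cost grows much faster than linearly with the number of data points.

```python
def single_linkage(points, k):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        # Scan every pair of clusters for the smallest inter-cluster distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i and remove cluster j.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(c) for c in clusters]
```

With N input values, the loop performs N - k merges, and the pair scan alone visits on the order of N² cluster pairs per merge, so even a small image with tens of thousands of pixels is already impractical for this naive form.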

Table 6 An overview of the different clustering-based image segmentation methods [85]

Apart from the overview of clustering-based image segmentation methods tabulated in Table 6, the different performance parameters required for the quantitative evaluation of a segmentation method are discussed along with their formulation. In addition, the publicly available image segmentation datasets, along with links that are active to date, are listed to facilitate research in the domain of image segmentation. A dataset consists of a variety of images that vary in a number of aspects such as region of interest, image size, applicability of the image, and many more. This results in a number of challenges for the segmentation method such as illumination variation, intra-class variation, and background complexity.

In future, the clustering algorithms can be compared on different factors such as stability, accuracy, normalization, and volume of the dataset. The analysis of these factors may reveal quality and performance aspects of the clustering algorithms. Secondly, the applicability of the compared clustering methods to real-time applications needs to be analysed. Thirdly, different cluster validity indices [53] can be considered for the comparison. An efficient variant of the existing meta-heuristic methods may also improve the performance. Further, based on the findings of this paper, similar work on the remaining issues can be taken up in future.