Computing, Volume 96, Issue 5, pp 381–402

Automatic image annotation approach based on optimization of classes scores

  • Nashwa El-Bendary
  • Tai-hoon Kim
  • Aboul Ella Hassanien
  • Mohamed Sami

Abstract

This article presents an automatic image-level annotation approach that takes advantage of both the context and the semantics present in segmented images. The proposed approach is based on the optimization of classes’ scores using particle swarm optimization. In addition, the random forest classifier and the normalized cuts algorithm have been applied for automatic image classification, annotation, and clustering. For the proposed approach, each input image is segmented using the normalized cuts segmentation algorithm in order to create a descriptor for each segment. Two parameter selection models have been selected for the particle swarm optimization algorithm and several voting techniques have been implemented to find the most suitable set of annotation words per image. Experimental results, using the Corel5k benchmark annotated image dataset, demonstrate that applying optimization algorithms along with the random forest classifier achieves a noticeable increase in image annotation performance measures compared to related research on the same dataset.

Keywords

Image annotation · Segmentation · Particle swarm optimization · Random forest classifier

Mathematics Subject Classification

68Uxx 

1 Introduction

Recently, rapid advances have taken place in the technology industry. This development has generated new problems and challenges that need to be solved in order to make our lives easier and to provide new channels for retrieving information and exchanging knowledge quickly and easily. Nowadays, there are hundreds of images on our personal computers in addition to thousands of gigabytes of images in digital libraries on the Internet. Manual annotation of images is a very time-consuming and impractical process, so there are two solutions for handling that large amount of images. The first one is applying content-based image retrieval (CBIR) models. In CBIR, the main idea is converting the two-dimensional (2D) images into vectors of numbers (visual feature vectors), then choosing the visual features that preserve the content of the images and that are not sensitive to noise, scaling, or translation. There are three main types of visual features, namely color, texture, and shape features. However, it is impractical and almost impossible to retrieve images by providing keywords if there are no keywords or metadata associated with those images. Thus, CBIR techniques present a number of limitations, such as not being suitable for non-professional users who prefer to find images using keywords instead of using image specifications or segments as queries. Another limitation is employing visual features in queries, which leads to the so-called “semantic gap” that results from using one type of visual feature to differentiate between different objects without considering the semantics present in the images. These limitations have led researchers to develop another solution for image retrieval: automatic image annotation (AIA) models, which have become a hot research topic in recent years [1, 2, 3, 4, 5, 6, 7, 8].

In this article we propose an automatic image annotation approach for image-level annotation using the particle swarm optimization algorithm and the random forest classifier. A number of voting techniques and integrations are used and evaluated on the benchmark dataset Corel5k [9]. The rest of this article is organized as follows. Related research work is surveyed in Sect. 2. Section 3 introduces a brief description of the techniques used in the proposed approach. Section 4 describes, in detail, the proposed approach and the algorithms presented in it. In Sect. 5, experimental results of using the proposed AIA model on the Corel5k benchmark are introduced and discussed. Finally, Sect. 6 addresses conclusions and discusses future work.

2 Related work

Much research has been done in the field of AIA. Hironobu et al. [10] used a co-occurrence model to establish associations between words and images. Hironobu’s model is applied for region labeling; however, it needs large training data to calculate correct and accurate probabilities. Duygulu et al. [11] proposed a machine translation image annotation model. They treated the AIA problem as learning a lexicon, where images are first segmented using the normalized cuts (N-cuts) segmentation algorithm and then feature vectors are extracted for each resulting region. The expectation maximization (EM) algorithm is used for building a probability lexicon table. The main disadvantage of the approach proposed in [11] is that the applied EM algorithm requires high computational time, especially for large datasets.

Another research work, proposed by Jeon et al. in [12], uses an AIA model based on a probabilistic model named the cross-media relevance model (CMRM). In their experiments, they used the same dataset as in [11] and their results showed that the performance of their model is almost six times as good as the co-occurrence word-blob model and twice as good as the machine translation model proposed in [11], in terms of mean precision. Moreover, Lavrenko et al. [13] built on the work by Jeon et al. [12], however applying a continuous relevance model (CRM) instead of the discrete CMRM. Considering performance results, it has been observed that the CRM model achieved better results than the CMRM model. Also, Feng et al. [14] used a probabilistic model, but applied a multi-Bernoulli distribution instead of the multi-normal distribution used in the CRM and the CMRM. The model proposed in [14] is named the multiple-Bernoulli relevance model (MBRM) and it achieved better results compared to the results provided by CRM and CMRM. Moreover, Claudio et al. [15] proposed an automatic image annotation approach using support vector machines (SVM), applying multi-class voting techniques to classify regions of seven classes. The best accuracy using that model was achieved when using a non-linear multi-class SVM along with HSV histogram features. Furthermore, Zhu et al. [16] proposed a model based on a multi-instance learning algorithm and a Gaussian mixture model. Experiments done on the Corel5k benchmark achieved better results in only one performance measure compared to related research on the same dataset. Also, Lei Wang and Latifur Khan [17] used k-means clustering and histogram weighting for feature vectors. In [18], Lu et al. used genetic algorithms (GA) for MPEG-7 image feature selection, coupled with a k-nearest neighbor (K-NN) classifier, for image annotation.

Generally, according to the previously presented image annotation research works, various contributions focused on weighting features for AIA approaches have been presented, however only considering the image level, with no segmentation or partitioning of the input images to apply annotation at the region level. In this article, an automatic image-level annotation approach is presented and the impact of applying optimization algorithms along with the random forest classifier is experimentally investigated.

3 Preliminaries

Due to space limitations, we can provide only a brief explanation of basic technologies used in this article including N-cuts segmentation algorithm, particle swarm optimization (PSO) algorithm, and random forest classifier. A more comprehensive review can be found in sources such as [20, 21, 22, 23, 24, 25, 26, 27].

3.1 Normalized cuts (N-cuts) segmentation algorithm

In this article, we aim to automatically annotate images based on their regions’ visual features. Therefore, the need for an image segmentation phase is evident for the presented approach. The N-cuts segmentation algorithm [20] is based on graph theory and on grouping the graph nodes using similarity criteria. This algorithm formulates image segmentation as a graph partitioning problem and uses the normalized cut value between different graph groups. In the employed graph-theoretic formulation of grouping, a set of points in an arbitrary feature space is represented as a weighted undirected graph \(G(V, E)\), where the nodes of the graph are the points in the feature space, and an edge is formed between every pair of nodes. The weight on each edge \(w(i,j)\) is a function of the similarity between nodes \(i\) and \(j\). A graph \(G(V, E)\) is partitioned into two disjoint sets \(A\) and \(B\), where \(A \bigcup B=V\) and \(A \bigcap B=\phi \), by removing edges and hence performing a cut. The cut between \(A\) and \(B\) is defined as shown in Eq. (1).
$$\begin{aligned} cut(A,B)=\sum _{u \in A, v \in B} w(u,v) \end{aligned}$$
(1)
The optimal partitioning is the one that minimizes the cut value. There are many studies on finding the minimum cut value. Some techniques are based on clustering the graph into k sub-graphs such that the cut value across the partitioned groups is minimized. The contribution of this algorithm is avoiding an unnatural bias towards partitioning out small sets of points: instead of looking at the total edge weights between the two parts, the fraction of the cut cost relative to the total edge connections to all nodes in the graph is calculated. This quantity is called N-cut and is given by Eq. (2).
$$\begin{aligned} Ncut(A,B)= \frac{Cut(A,B)}{assoc(A,V)} + \frac{Cut(A,B)}{assoc(B,V)} \end{aligned}$$
(2)
where \(assoc(A,V)= \sum _{u \in A,\, t \in V} w(u,t)\) is the total connection from nodes in group \(A\) to all nodes in the graph, and \(assoc(B,V)\) is similarly defined.
With this definition of the disassociation between the groups, the problem of an unnatural bias towards partitioning out small sets of points no longer exists: for a small set, the cut value will mostly be a large percentage of the total connections from that set to all other nodes, so its N-cut value will not be small. Following the same technique, the total normalized association within groups of the same partition can be calculated according to Eq. (3).
$$\begin{aligned} Nassoc(A,B)= \frac{assoc(A,A)}{assoc(A,V)} + \frac{assoc(B,B)}{assoc(B,V)} \end{aligned}$$
(3)
where \(assoc(A,A)\) and \(assoc(B,B)\) are total weights of edges connecting nodes within \(A\) and \(B\). The N-cuts segmentation steps are illustrated in Algorithm 1. Figure 1 shows some examples for normalized cuts segmentation results.
Fig. 1

Normalized cuts segmentation results for fruit images: an illustrative example

In Algorithm 1, for step (7), there are several ways to choose a splitting point, such as taking \(0\), taking the \(median\), or searching for the splitting point that minimizes \(Ncut(A, B)\). The splitting point that minimizes the \(Ncut\) value also minimizes the expression given by Eq. (7):
$$\begin{aligned} \frac{y^T (D-W)y}{y^T Dy} \end{aligned}$$
(7)
where \(y=(1+x)-b(1-x),\,b=k/(1-k),\,k=\frac{\sum _{x_i>0}d_i}{\sum _i d_i},\,x\) is an \(N\) dimensional indicator vector, where \(x_i= 1\) if node \(i\) is in \(A\) and \(x_i= -1\) otherwise.
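To make Eqs. (1) and (2) concrete, the following sketch computes the cut and N-cut values for a small hypothetical graph; the four-node weight matrix and the groupings are invented purely for illustration.

```python
import numpy as np

# Toy similarity (weight) matrix for a 4-node graph: nodes 0,1 are
# strongly connected to each other, as are nodes 2,3; the two pairs
# are only weakly connected.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.9],
              [0.0, 0.1, 0.9, 0.0]])

def cut(W, A, B):
    # cut(A,B): sum of edge weights between the two groups (Eq. 1)
    return sum(W[i, j] for i in A for j in B)

def assoc(W, A):
    # assoc(A,V): total connection from nodes in A to all nodes
    return sum(W[i, j] for i in A for j in range(len(W)))

def ncut(W, A, B):
    # Ncut(A,B) = cut/assoc(A,V) + cut/assoc(B,V) (Eq. 2)
    c = cut(W, A, B)
    return c / assoc(W, A) + c / assoc(W, B)

print(ncut(W, [0, 1], [2, 3]))  # natural split -> small Ncut (0.2)
print(ncut(W, [0], [1, 2, 3]))  # isolating one node -> large Ncut
```

Note how isolating a single node yields a large N-cut even though its plain cut value could be small; this is exactly the bias that normalization removes.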

3.2 Particle swarm optimization (PSO) algorithm

The PSO algorithm is inspired by bird swarms, but not uniquely based on them. The goal of this algorithm is to find an optimal solution in a given search space; however, there is no guarantee of finding this intended optimal solution. The concept of particle swarms, although initially introduced for simulating human social behaviors, has become very popular as an efficient search and optimization technique. Particle swarm optimization [21, 22, 23] does not require any gradient information about the function to be optimized; instead, it uses primitive mathematical operators and is conceptually very simple. PSO has attracted the attention of many researchers, which has resulted in a large number of variants of the basic algorithm as well as many parameter automation strategies.

The canonical PSO model consists of a swarm of particles, which is initialized with a population of random candidate solutions. The particles move iteratively through the \(d\)-dimensional problem space to search for new solutions, where the fitness, \(f\), can be calculated as a certain quality measure. Each particle has a position represented by a position vector \({\varvec{x}_{i}}\) and a velocity represented by a velocity vector \({\varvec{v}_{i}}\), where \(i\) is the index of that particle. Each particle remembers its own best position so far in a vector \(\varvec{x}_{i}^{\#}\), whose \(j\)th dimensional value is \(x_{ij}^{\#}\). The best position vector among the swarm so far is stored in a vector \(\varvec{x}^{*}\), whose \(j\)th dimensional value is \(x_{j}^{*}\). During iteration \(t\), the update from the previous velocity to the new velocity is determined by Eq. (8). The new position is then determined by the sum of the previous position and the new velocity, as shown in Eq. (9).
$$\begin{aligned} {v}_{ij}(t+1)&= w{v}_{ij}(t)+c_{1}r_{1}(x_{ij}^{\#} (t)-{x}_{ij}(t))+c_{2}r_{2}(x_{j}^{*}(t)-{x}_{ij}(t))\end{aligned}$$
(8)
$$\begin{aligned} {x}_{ij}(t+1)&= {x}_{ij}(t)+{v}_{ij}(t+1) \end{aligned}$$
(9)
where \(w\) is the inertia factor that governs how much of the velocity from the previous time step is retained, and \(r_{1}\) and \(r_{2}\) are random numbers used to maintain the diversity of the population; they are uniformly distributed in the interval [0,1] for the \(j\)th dimension of the \(i\)th particle. \(c_{1}\) is a positive constant representing the coefficient of the self-recognition component and \(c_{2}\) is a positive constant called the coefficient of the social component. According to Eq. (9), each particle decides where to move next considering its own experience, which is the memory of its best past position, and the experience of the most successful particle in the swarm. In the particle swarm model, each particle searches for solutions in the problem space within a range \([-s, s]\) (if the range is not symmetrical, it can be translated to the corresponding symmetrical range). In order to guide the particles effectively in the search space, the maximum moving distance during one iteration must be clamped to the maximum velocity range \([-v_{max}, v_{max}]\), as given in Eq. (10).
$$\begin{aligned} {v}_{ij} =sign({v}_{ij})min(\left| v_{ij} \right| \!,v_{max}) \end{aligned}$$
(10)
The value of \(v_{max}\) is \(p \times s\), with \(0.1\le p\le 1.0\), and is usually chosen to be \(s\), i.e. \(p=1\). The stopping criterion is usually one of the following: (1) maximum number of iterations, (2) number of iterations without improvement, or (3) minimum objective function error. Much parameter selection research has been done for the PSO algorithm to achieve better performance. In this article, we used the parameter selection models named common PSO, Trelea-1, Trelea-2, and Clerc [21, 22, 23]. In Trelea [22], graphical parameter selection of \(a\) and \(b\) was used, resulting in two parameter sets. For parameter set one, \(a\) equals 0.6 and \(b\) equals 1.7; for parameter set two, \(a\) equals 0.729 and \(b\) equals 1.494. These parameters take part in calculating the velocity of the particles in the search domain, as in Eq. (11).
$$\begin{aligned} velocity = A1 + A2 \end{aligned}$$
(11)
where
$$\begin{aligned} A1= a*vel_{prev} + b*randnum1 * (pbest - pos) \end{aligned}$$
and
$$\begin{aligned} A2=b*randnum2*(gbest-pos) \end{aligned}$$
where \(vel_{prev}\) is the previous velocity, \(gbest\) is the best position over all iterations, and \(pbest\) is the best position among the current positions. Both \(randnum1\) and \(randnum2\) are random numbers bounded by the dimension of the PSO problem. The \(pos\) parameter is the current position of a specific particle. For Type 1 PSO, Clerc [21] developed a generalized model of the algorithm, adding a set of coefficients to control the system, as shown in Eq. (12).
$$\begin{aligned} velocity = B1+B2 \end{aligned}$$
(12)
where
$$\begin{aligned} B1= Chi*vel_{prev} + ac1 * randnum1*(pbest-pos) \end{aligned}$$
and
$$\begin{aligned} B2=ac2*randnum2*(gbest-pos) \end{aligned}$$
where the parameter \(Chi\) is Clerc’s constriction coefficient and \(ac1\) and \(ac2\) are the acceleration coefficients. A flowchart illustrating the steps of the PSO algorithm is shown in Fig. 2.
Fig. 2

Particle swarm optimization algorithm
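The update rules of Eqs. (8)–(10) can be sketched in a few lines. The test function and parameter values below are illustrative choices, not the configuration used in the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5,
        s=5.0, vmax=5.0):
    """Minimal PSO following Eqs. (8)-(10): velocity update, position
    update, and velocity clamping to [-vmax, vmax]."""
    x = rng.uniform(-s, s, (n_particles, dim))        # positions
    v = np.zeros((n_particles, dim))                  # velocities
    pbest = x.copy()                                  # personal bests x_i^#
    pbest_f = np.array([f(p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()            # global best x^*
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Eq. (8)
        v = np.clip(v, -vmax, vmax)                                # Eq. (10)
        x = x + v                                                  # Eq. (9)
        fx = np.array([f(p) for p in x])
        improved = fx < pbest_f
        pbest[improved] = x[improved]
        pbest_f[improved] = fx[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, float(pbest_f.min())

# Minimizing the sphere function should drive the best position near 0.
best, best_f = pso(lambda p: float(np.sum(p ** 2)), dim=3)
```

The stopping criterion here is a fixed iteration count, the first of the three criteria listed above.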

3.3 Random forest classifier

The random forest classifier [24, 25, 26, 27] is an ensemble classifier that consists of several decision trees. The output of this classifier is the class that occurs most frequently among the outputs of the individual decision trees. The main idea of decision trees is to predict the value of a target variable based on a group of input data. Decision trees are also named classification trees, where the tree leaves represent the class labels and the branches represent the conjunctions of feature values that lead to class labels. As depicted in Fig. 3, each interior node represents an input feature and each node has children corresponding to another input feature. The training of a decision tree is based on a process called recursive partitioning, a recursive process in which the input dataset is split into subsets. The recursion stops when all the tree nodes have the same output targets. For the approach proposed in this article, the targets are the annotation words.
Fig. 3

Decision tree to differentiate between melon and orange classes: an illustrative example

Generally, there are two types of decision trees:
  1. Regression tree: where the proposed targets are real numbers (for example, the price of a car).
  2. Classification tree: where the proposed targets are specific classes (for example, female or male).
The classification decision tree is the type used for the approach proposed in this article. There is also an important concept in decision tree learning called “decision tree pruning”. Tree pruning is a process that aims to reduce the size of a decision tree by removing parts of the tree that contribute little to the classification. This technique has the advantage of reducing the size and the complexity of the produced tree, in addition to reducing over-fitting in some cases. However, the random forest algorithm does not use this technique. Instead, the random forest classifier takes \(n_{tree}\) as a parameter, which corresponds to the number of decision trees that will be created in the ensemble bagged forest classifier. Algorithm 2 shows the random forest training for each decision tree.
The random forest error rate depends on two factors:
  1. Correlation: the correlation between any two trees in the forest. The error rate increases as the correlation increases.
  2. Strength: the strength of each tree in the forest. The strength is measured by the error rate; a tree with a low error rate is a strong tree. The forest error rate decreases as the strength of the decision trees increases.
One of the advantages of the random forest classifier is that it is among the most accurate classifiers. On the other hand, it has been observed to over-fit on some datasets with noisy classification tasks.
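As a toy illustration of the bagging and majority-voting ideas above, the sketch below trains an ensemble of depth-1 trees (stumps) on bootstrap samples and votes over their predictions. The stumps stand in for the full trees grown in Algorithm 2, and real random forests additionally sample a subset of features at each split; the data is synthetic.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

def train_stump(X, y):
    """Depth-1 decision tree: pick the feature/threshold pair with the
    lowest training error (a simplified stand-in for full tree growth)."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            lc = Counter(left).most_common(1)[0][0]   # majority class left
            rc = Counter(right).most_common(1)[0][0]  # majority class right
            err = np.sum(left != lc) + np.sum(right != rc)
            if best is None or err < best[0]:
                best = (err, f, t, lc, rc)
    return best[1:]

def predict_stump(stump, x):
    f, t, lc, rc = stump
    return lc if x[f] <= t else rc

def random_forest(X, y, n_tree=25):
    """Bagging: each tree sees a bootstrap sample of the training data."""
    n = len(X)
    return [train_stump(X[(idx := rng.integers(0, n, n))], y[idx])
            for _ in range(n_tree)]

def predict_forest(forest, x):
    votes = [predict_stump(s, x) for s in forest]
    return Counter(votes).most_common(1)[0][0]   # majority vote

# Two well-separated classes in 2-D.
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
forest = random_forest(X, y)
print(predict_forest(forest, np.array([4.0, 4.0])))  # -> 1
```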

4 The proposed automatic image annotation (AIA) approach

For the proposed approach, each image is first segmented using the N-cuts segmentation algorithm into a number of segments, then a group of visual features is extracted from each region. In this research we used the Corel5k benchmark annotated image dataset [9], in which each region is described by a 33-dimensional feature vector including the region dominant color in RGB form, region size, region x–y location, standard deviation, first moment, and 12 mean oriented energy filters, which are window filters rotated each time by 30\(^{\circ }\). At the end of the feature extraction phase, a 33-dimensional feature vector is generated for each image region. Then, in the training phase, a classifier was trained in order to differentiate between all the classes while, at the same time, an optimization algorithm was integrated for weighting the classes’ scores. The testing phase uses the output of the training phase, which is a weight for each class in one vector together with a trained classifier or a group of trained classifiers. For any non-annotated image, segmentation and feature extraction are applied, then a random forest classifier is used. Figure 4 introduces a general AIA model. In the proposed model we used random forest as the classifier and PSO as the optimization algorithm. Figure 5 shows the proposed random forest classifier based AIA approach, covering both the training and testing phases.
Fig. 4

General automated image annotation model

Fig. 5

Random forest classifier based AIA model

Firstly, in the training phase, image regions are clustered into K clusters and, for each cluster, a random forest classifier is trained. Thus, K random forest classifiers result at the end of the training phase. Each forest classifies the words (classes) existing in its cluster. In the proposed model, we have taken advantage of the well-known good accuracy of clustering. To find the words existing in any cluster, the regions in this cluster, and accordingly the related images, are found using the Euclidean distance. Any label for an image means that all the image regions contribute to this label word (class). In other words, we aimed to find the images that exist in each cluster and their related words. Consequently, each word has been trained from its related images in that cluster. The forests of all clusters have an equal number of decision trees.
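A condensed sketch of this per-cluster organization follows, under stated assumptions: a dependency-free K-means stand-in is used (the article only specifies Euclidean distance, not a particular clustering implementation), the region features and word labels are random placeholders, and the per-cluster classifier training itself is left as the data grouping step.

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, K, iters=20):
    """Plain K-means with Euclidean distance; a minimal stand-in for
    whatever clustering implementation the authors used."""
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Assign each region to its nearest center.
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign

# Hypothetical region feature vectors (33-D) and their word labels.
X = rng.normal(size=(200, 33))
y = rng.integers(0, 5, 200)

K = 4
centers, assign = kmeans(X, K)

# Group data per cluster; one random forest (Sect. 3.3) would then be
# trained on each group, over only the words occurring in that group.
per_cluster_data = {k: (X[assign == k], y[assign == k]) for k in range(K)}
```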

Secondly, in the testing phase, for each unlabeled image, every region feature vector is assigned, using the Euclidean distance, to the nearest cluster, and is then classified using that cluster’s forest classifier. The output of the region’s forest classification contains a score per word (class). The sorted words enter a voting process for image labeling. The voting process takes into consideration different parameters, including region sizes, word frequencies, and word correlations. The region classification classes are sorted in descending order according to the forest scores per class, using the parameter \(WindowSize\). The word frequencies are calculated using Eq. (13).
$$\begin{aligned} FreqTrainWords(n)= \frac{TotalTrainingAnnotation}{WordRepeats(n)*C} \end{aligned}$$
(13)
where \(WordRepeats\) is an array of the number of annotations per class in the training dataset, \(TotalTrainingAnnotation\) is the total number of annotations in the training dataset, and \(C\) is a constant. The main purpose of this constant is to decrease the effect of \(FreqTrainWords\) in the voting process, since \(TotalTrainingAnnotation\) is a large number. Subsequently, \(FreqTrainWords\) is used in Algorithm 3, which shows the details of the annotation steps for unlabeled images. Table 1 presents a description of the parameters used in Algorithm 3.
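Eq. (13) can be sketched directly; the annotation counts below are hypothetical, and the names follow the article’s notation. Rare words receive larger frequency scores than common ones, so they can still win votes during annotation.

```python
def freq_train_words(word_repeats, C=10_000):
    """Eq. (13): word_repeats holds the annotation count per class over
    the training set; C damps the effect of the large total count."""
    total = sum(word_repeats)                 # TotalTrainingAnnotation
    return [total / (r * C) for r in word_repeats]

counts = [500, 50, 5]       # hypothetical annotation counts per word
print(freq_train_words(counts))
```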
Table 1 Parameters used in Algorithm 3

  Parameter                  | Description
  T                          | Total number of test images
  N                          | Total number of regions per image
  W                          | Total number of test data
  ClstrNumber (\(CN\))       | The cluster number of a region feature vector
  ClssScores (\(CS\))        | The output scores for a region vector
  CorrletionLabels (\(CL\))  | The word-to-word co-occurrence counts
  WrdCorr (\(WC\))           | The words correlated with the highest-scoring class word
  FreqTrainWords (\(FTW\))   | The per-word frequencies over the training dataset
  WindowSize (\(WZ\))        | Controls the size of the part taken from the sorted labels per region
  ComVotes (\(CV\))          | A variable containing the cumulative votes for each class
  NewLabels (\(NL\))         | The output labels for an unknown image

In Algorithm 3, for each image, the index of the closest cluster is found for each of its regions using the Euclidean distance and stored in the \(CN\) variable. The number in \(CN\) is used to find the index of the cluster classifier \(RndmTree\). Then, the selected classifier is used to get the classes’ scores of the current feature vector. After that, a descending sort by the associated random forest scores is applied. In the \(WC\) variable, the word correlations for the highest-scoring word (class) are stored, to be used subsequently in the voting calculations. Likewise, the \(CV\) variable represents the sum of all votes for the current image and is reset to zero for each new image. Finally, based on the \(CV\) content, any new image is annotated with the \(NumOfLabels\) words (classes) that have the highest votes. Figure 6 shows a visualization of the image-level annotation algorithm using only classes’ scores.
Fig. 6

Visualization for random forest voting per image
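A simplified sketch of the per-image voting just described follows, with hypothetical inputs: region_scores is one forest score vector per region, region_sizes holds the relative region sizes, and ftw is the \(FreqTrainWords\) array; the word-correlation term of Algorithm 3 is omitted to keep the sketch short.

```python
import numpy as np

def annotate(region_scores, region_sizes, ftw, window_size=9, num_labels=5):
    """Accumulate votes (ComVotes) over all regions of one image and
    return the num_labels classes with the highest votes (NewLabels)."""
    n_classes = len(ftw)
    votes = np.zeros(n_classes)            # ComVotes, reset per image
    for scores, size in zip(region_scores, region_sizes):
        # Only the top window_size classes of each region cast votes.
        top = np.argsort(scores)[::-1][:window_size]
        for cls in top:
            votes[cls] += scores[cls] * size * ftw[cls]
    return np.argsort(votes)[::-1][:num_labels]

rng = np.random.default_rng(3)
region_scores = rng.random((6, 20))        # 6 regions, 20 classes
region_sizes = rng.random(6)
ftw = rng.random(20)
print(annotate(region_scores, region_sizes, ftw))
```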

The PSO algorithm has been applied to the random forest classifiers in order to weight the classes’ scores. The length of each particle is calculated according to Eq. (14).
$$\begin{aligned} Length= Number of classes \end{aligned}$$
(14)
Algorithm 4 shows the updated part of Algorithm 3, which has been used for calculating the PSO fitness function, where the \(Scores\) variable contains all the random forest scores for the classes that exist in a specific cluster and the \(ClassWeight\) variable includes the generated PSO weights. The average precision measure has been used as the fitness value and the classes’ scores have been used for voting, for simplicity reasons. The PSO class weights are multiplied by the classification output scores of each region’s feature vector. Figure 7 shows a visualization of weighting the random forest class scores using the PSO output for each image region.
Fig. 7

Weighting the random forest scores using PSO’s weights output
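The way a PSO particle (one weight per class, Eq. (14)) plugs into the pipeline can be sketched as below. The element-wise multiplication is the step shown in Fig. 7; the article’s average precision fitness is replaced here by a simpler precision stand-in, and all input data is hypothetical.

```python
import numpy as np

def fitness(class_weights, region_scores_per_image, ground_truth,
            num_labels=5):
    """Fitness of one PSO particle: weight every region's class scores,
    vote, and score the resulting annotations with a simple precision
    stand-in for the article's average precision measure."""
    correct = total = 0
    for region_scores, true_labels in zip(region_scores_per_image,
                                          ground_truth):
        # Fig. 7 step: multiply each region's scores by the class weights.
        weighted = [s * class_weights for s in region_scores]
        votes = np.sum(weighted, axis=0)
        predicted = set(np.argsort(votes)[::-1][:num_labels].tolist())
        correct += len(predicted & set(true_labels))
        total += num_labels
    return correct / total

# One hypothetical image whose single region clearly favors classes 0-4.
scores = [[np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0])]]
truth = [[0, 1, 2, 3, 4]]
print(fitness(np.ones(6), scores, truth))  # -> 1.0
```

During optimization, PSO proposes new weight vectors, this fitness is evaluated for each, and the best weight vector found is kept for the testing phase.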

5 Experimental results and analysis

In this section, obtained experimental results of the proposed automatic image annotation approach are presented.

In our experiments, we used the Corel5k [9] dataset. The Corel5k dataset consists of 5,000 images from 50 Corel Stock Photo CDs. Each CD includes 100 images on the same topic and each image is associated with 1–5 keywords. This dataset is divided into 4,500 images for training and 500 images for testing. In the training dataset there are 371 words; we consider each word a class, as previously explained in Sect. 4. Each image is segmented using the normalized cuts segmentation algorithm, then the regions with sizes larger than a certain threshold are selected. Each image has between 5 and 10 regions, and there are 42,379 regions in the training dataset. For each region, a 33-dimensional feature vector is extracted, including segment size, location, convexity, first moment, region color, and region average orientation energy, and the regions are clustered into 500 clusters. The testing data consists of 500 images and includes only 263 words.

In order to evaluate our experiments, we used the same measures applied in previous works on the Corel5k benchmark dataset. These measures are well known in the field of automatic image annotation. The first measure is the precision, which is the ratio of the number of correct annotations to the total number of annotations produced. The second measure is the recall, which is the ratio of the number of correct annotations to the total number of positive annotated samples. Equations (15) and (16) show the calculation of the precision and recall measures, respectively.
$$\begin{aligned} Precision= \frac{B}{A}\end{aligned}$$
(15)
$$\begin{aligned} Recall= \frac{B}{C} \end{aligned}$$
(16)
where \(A\) is the number of images annotated by some keyword, \(B\) is the number of images annotated correctly, and \(C\) is the number of images annotated by that keyword in the whole dataset. Another measure is \(NumWords\), which counts the keywords that are used to correctly annotate at least one image. This statistical measure reflects the coverage of keywords achieved by the different proposed methods.
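Eqs. (15) and (16) can be computed per keyword as sketched below; the tiny predicted/ground-truth dictionaries are invented for illustration.

```python
def precision_recall(predicted, truth, keyword):
    """Per-keyword precision and recall, following the article's
    definitions: A = images the system annotated with the keyword,
    B = those of A that are correct, C = images truly carrying the
    keyword in the whole (test) set."""
    A = [img for img, words in predicted.items() if keyword in words]
    B = [img for img in A if keyword in truth[img]]
    C = [img for img, words in truth.items() if keyword in words]
    precision = len(B) / len(A) if A else 0.0
    recall = len(B) / len(C) if C else 0.0
    return precision, recall

predicted = {1: {"sky", "sea"}, 2: {"sky"}, 3: {"tree"}}
truth = {1: {"sky"}, 2: {"tree"}, 3: {"sky", "tree"}}
print(precision_recall(predicted, truth, "sky"))  # -> (0.5, 0.5)
```

Averaging these values over all keywords and counting the keywords with nonzero recall yields the average precision, average recall, and \(NumWords\) figures reported below.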
For all the clusters, \(n_{tree}\) equal to 50 has been used. As the number of decision trees per forest increases, the classification accuracy increases accordingly; however, \(n_{tree}\) has not been increased beyond 50 due to memory limitations. As a result of the experiments conducted on the Corel5k benchmark, Fig. 8a–c show some statistics for the random forest based AIA model. These figures show the relationship between the number of labels given to an image and the accuracy measures.
Fig. 8

Avg-precision, Avg-recall and NumWords results for random forest AIA with different voting windowSize

In Fig. 8a, the average precision and average recall against the number of labels for the random forest approach are depicted for a voting window size equal to 9. The best average precision value achieved in this case is \(0.1862\), when using four labels. The recall curve increases as the number of labels increases and it intersects the precision curve when the number of labels equals 5. At five labels, the average precision equals \(0.1795\), the average recall equals 0.1577, and \(NumWords\) equals 92. These results were obtained considering that each test image is labeled with 1–4 words. When the voting window size equals \(5\), the results differ a little, as shown in Fig. 8b. In this case, the maximum average precision equals \(0.1731\) at three labels and the maximum average recall equals \(0.2263\) at ten labels. At five labels, the average precision equals \(0.1459\), the average recall equals \(0.1644\), and \(NumWords\) equals \(90\) words. From this case we can notice that decreasing the voting window size from \(9\) to \(5\) has a negative effect on all the measures used in our experiments. Similar to Fig. 8a and b, Fig. 8c shows the results when using a voting window size equal to \(15\). In this case, when each image is labeled with \(5\) words, the average precision equals \(0.1599\), the average recall equals \(0.1436\), and \(NumWords\) equals \(88\) words.

Figure 9 illustrates the change in the average precision and recall values when applying changes to the per-image voting technique. In Fig. 9, we compared six different cases. (1) The “No Correlation” case, where the word relations are removed from the voting part of the forest AIA algorithm. (2) The “Local Correlations” case, which means using the word correlations within each cluster instead of the correlations over all the training images. (3) The “Region only” case, which means using only the region size in the voting technique. (4) The “Region + Correlations” case, which means removing the word frequency part from the voting algorithm. (5) The “Divide by 4,000” case, which means dividing the whole word frequency array by \(C=4{,}000\), taking into consideration that in all the previous figures and tests the value of \(C\) was \(10{,}000\). Finally, (6) the “Vote + 1” case, which means adding 1 for each occurrence of a class without using region sizes or word frequency values. Applying normalization to the word frequency array achieved lower accuracy compared to using constant division.
Fig. 9

Different voting techniques for random forest with clustering (RFC)

From Fig. 9, one can notice that the best results are obtained by applying case (5), “Divide by 4,000”, where the average precision equals 0.18, the average recall equals 0.18, and \(NumWords\) equals 104. The best results have been obtained for the case where \(windowSize\) equals 9; the reason is that this value matches the average of the word correlations within clusters.

Table 2 compares the forest based approach proposed in this article with previous traditional annotation models, such as COM [10], TM [11], CMRM [12], CRM [13], MBRM [14], and MIL [16]. The proposed model is marked as \(RFC\), which stands for random forest with clustering.
Table 2  Performance of various annotation models on Corel5k vs RFC

  Model       Average precision   Average recall   NumWords
  COM [10]    0.03                0.02             19
  TM [11]     0.06                0.04             49
  CMRM [12]   0.10                0.09             66
  CRM [13]    0.16                0.19             107
  MBRM [14]   0.24                0.25             122
  MIL [16]    0.20                0.22             124
  RFC         0.18                0.18             104

Table 2 shows that the proposed random forest model is among the most accurate AIA models on Corel5k, although not the best one. To enhance the accuracy of RFC, the PSO algorithm has been applied to the random forest approach: the classes' scores are weighted, with the average precision acting as the fitness value in the first RFC–PSO experiment and the sum of average precision and average recall in the second RFC–PSO experiment. The \(windowSize\) in these two experiments equals \(9\). Table 3 presents the results before and after using PSO weights. The PSO configuration used \(400\) iterations, \(200\) particles, and a velocity step of 2. RFC with PSO (Trelea1) achieved the best results with \(windowSize\) equal to \(9\) and \(15\). In Table 3, for RFC, the random forest scores are used without any weighting; in the \(RFC+PSO-1\) case, PSO weights are applied with the average precision as fitness value; and in the \(RFC+PSO-2\) case, the sum of the average precision and average recall is used as fitness value.
Table 3  Accuracy results for optimized forest AIA approach

  Models                         Average precision   Average recall   NumWords
  RFC (w = 5)                    0.1386              0.1407           72
  RFC (w = 9)                    0.1494              0.1335           70
  RFC (w = 15)                   0.1482              0.1356           69
  RFC + PSO-1 (w = 15)           0.2207              0.1437           86
  RFC + PSO-2 (w = 15)           0.2068              0.2052           100
  RFC + PSO (Trelea1) (w = 9)    0.2510              0.2170           108
  RFC + PSO (Trelea1) (w = 15)   0.2571              0.2182           109
  RFC + PSO (Clerc) (w = 15)     0.2643              0.2160           108
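The PSO weighting described above searches for a weight vector over the classes' scores that maximizes the chosen fitness (average precision, or precision plus recall). A minimal global-best PSO sketch using Trelea's first parameter set [22] (\(w = 0.6\), \(c_1 = c_2 = 1.7\)) is shown below; the fitness here is a stand-in toy function, and all names are illustrative assumptions rather than the authors' code.

```python
import random

# Minimal global-best PSO sketch with Trelea's first parameter set
# (w = 0.6, c1 = c2 = 1.7). In the article the fitness would evaluate
# the annotation average precision (or precision + recall) produced by
# the candidate class-score weights; here it is an arbitrary callable.

def pso_optimize(fitness, dim, n_particles=200, iters=400,
                 w=0.6, c1=1.7, c2=1.7, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit
```

The Clerc variant of Table 3 would instead apply the constriction-factor update of [21]; only the velocity rule changes, not the overall loop.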

It is clear that merging the PSO algorithm with the RFC approach had a major impact on the overall performance of the RFC approach. Table 4 compares the proposed RFC with PSO to the related research models. Based on the experimental results shown in Table 4, our proposed approach achieved a competitive accuracy. Table 5 shows three sample test images with the annotation results for RFC and RFC with PSO (Trelea1). Figure 10 shows the generated PSO weights for the random forest with the Trelea PSO optimization algorithm and \(windowSize\) equal to 15.
Table 4  Performance of various annotation models on Corel5k vs RFC with PSO (Trelea1)

  Model       Average precision   Average recall   NumWords
  COM [10]    0.03                0.02             19
  TM [11]     0.06                0.04             49
  CMRM [12]   0.10                0.09             66
  CRM [13]    0.16                0.19             107
  MBRM [14]   0.24                0.25             122
  MIL [16]    0.20                0.22             124
  RFC–PSO     0.26                0.22             109

Table 5

Annotation example on samples from Corel5k benchmark

Fig. 10

PSO (Trelea)–random forest AIA model weights with windowSize = 15

6 Conclusions and future work

In this article, an automatic image annotation approach based on the random forest classifier and the particle swarm optimization algorithm has been proposed and tested. The proposed approach showed that applying the PSO algorithm with random forest increased the average precision from \(0.1482\) to \(0.2207\) when using the average precision as the fitness value for PSO. An accuracy of precision \(= 0.26\) and recall \(= 0.22\) was achieved using PSO (Trelea1) with the random forest classifier, where the sum of the average precision and average recall is the fitness value for PSO. Changing the window size in the voting technique has a noticeable impact on the overall performance. The result curves clearly show the inverse relationship between the average precision and the average recall as the number of annotation labels increases. For the proposed random forest model, any error occurring in the clustering stage propagates to the classification output as a cumulative error. In addition, there is no direct correspondence between image regions and classes in the Corel5k dataset; that is, an image used for the class 'sky' may also be used for the class 'tree'. A few classes are in some cases represented by a single image in the Corel5k training set, which makes classification for these classes hard. Creating one tree to classify all the classes was infeasible due to memory limitations. For future work, testing different numbers of clusters may have a noticeable impact on the overall performance. Also, applying feature selection and weighting techniques, or using features other than the ones provided with the Corel5k dataset, is another direction of research. The ImageCLEF benchmark is a well-known dataset on which the proposed approach could also be applied. Moreover, changing the number of decision trees used in the random forest classifier may lead to variations in annotation accuracy.

References

  1. Chen Z, Hou J, Zhang D, Qin X (2012) An annotation rule extraction algorithm for image retrieval. Pattern Recognit Lett 33(10):1257–1268
  2. Zhang D, Islam M, Lu G (2012) A review on automatic image annotation techniques. Pattern Recognit 45(1):346–362
  3. Wang Y, Mei T, Gong S, Hua X (2009) Combining global, regional and contextual features for automatic image annotation. Pattern Recognit 42(2):259–266
  4. Yao J, Zhang Z, Antani S, Long R, Thoma G (2008) Automatic medical image annotation and retrieval. Neurocomputing 71(10):2012–2022
  5. Yu N, Hua K, Cheng H (2012) A multi-directional search technique for image annotation propagation. J Vis Commun Image Represent 23(1):237–244
  6. Gao Y, Yin Y, Uozumi T (2012) A hierarchical image annotation method based on SVM and semi-supervised EM. Acta Automatica Sinica 36(7):960–967
  7. Qi X, Han Y (2007) Incorporating multiple SVM for automatic image annotation. Pattern Recognit 40(2):728–741
  8. Li R, Lu J, Zhang Y, Zhao T (2010) Dynamic adaboost learning with feature selection based on parallel genetic algorithm for image annotation. Knowl Based Syst 23(3):195–201
  9. Corel5k dataset website (online) (2012) http://kobus.ca/research/data/eccv_2002/. Accessed July 2012
  10. Hironobu YM, Takahashi H, Oka R (1999) Image-to-word transformation based on dividing and vector quantizing images with words. Boltzmann machines. Neural Netw 405–409
  11. Duygulu P, Barnard K, Freitas J, Forsyth D (2002) Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In: 7th European conference on computer vision (ECCV 2002). Springer, LNCS, Copenhagen, 28–31 May 2002, pp 97–112
  12. Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM, Toronto, 28 July–1 August 2003, pp 119–126
  13. Lavrenko V, Manmatha R, Jeon J (2003) A model for learning the semantics of pictures. In: 16th conference on advances in neural information processing systems (NIPS 16), Vancouver. MIT Press, Canada, 8–13 December 2003
  14. Feng SL, Manmatha R, Lavrenko V (2004) Multiple Bernoulli relevance models for image and video annotation. In: IEEE computer society conference on computer vision and pattern recognition (CVPR '04). IEEE, Washington, 27 June–2 July 2004, pp 1002–1009
  15. Cusano C, Ciocca G, Schettini R (2004) Image annotation using SVM. Internet Imaging IV 5304(1):330–338
  16. Zhu S, Tan X (2011) A novel automatic image annotation method based on multi-instance learning. Procedia Eng 15:3439–3444
  17. Wang L, Khan L (2006) Automatic image annotation and retrieval using weighted feature selection. Multimed Tools Appl 29(1):55–71
  18. Lu J, Zhao T, Zhang Y (2008) Feature selection based-on genetic algorithm for image annotation. Knowl Based Syst J 21(8):887–891
  19. Sun F, He JP (2009) A normalized cuts based image segmentation method. In: 2nd International conference on information and computing science. IEEE, Manchester, 21–22 May 2009, pp 333–336
  20. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905
  21. Clerc M, Kennedy J (2003) The particle swarm—explosion, stability, and convergence in a multidimensional complex space. IEEE Trans Evol Comput 6(1):58–73
  22. Trelea IC (2003) The particle swarm optimization algorithm: convergence analysis and parameter selection. Inf Process Lett 85(6):317–325
  23. Kennedy J, Eberhart R (1995) Particle swarm optimization. In: IEEE International conference on neural networks, vol. 4. IEEE, Perth, 27 November–1 December 1995, pp 1942–1948
  24. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
  25. Ho TK (1995) Random decision forests. In: 3rd International conference on document analysis and recognition (ICDAR 1995), vol. 1. IEEE Computer Society, Montreal, 14–15 August 1995, pp 278–282
  26. Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
  27. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman and Hall, New York

Copyright information

© Springer-Verlag Wien 2013

Authors and Affiliations

  • Nashwa El-Bendary (1, 2)
  • Tai-hoon Kim (3)
  • Aboul Ella Hassanien (4, 2)
  • Mohamed Sami (2, 4)

  1. Arab Academy for Science, Technology, and Maritime Transport, Cairo, Egypt
  2. Scientific Research Group in Egypt (SRGE), Cairo, Egypt
  3. Hannam University, Daejeon, Korea
  4. Faculty of Computers and Information, Cairo University, Cairo, Egypt
