
1 Introduction

Conventional image classification techniques were limited in their ability to process natural image data in its raw form: it required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (the pixel values of an image) into a suitable internal representation or feature vector from which the classifier could recognize patterns in the input [6]. In contrast, deep-learning methods such as ConvNets learn representations directly from the data.

ConvNets are now the most commonly used large-scale image classification models. As early as 1990, ConvNets were trained to classify low-resolution images of handwritten digits [1]. However, with limited computation and training data, ConvNets did not become popular until recent years. Krizhevsky et al. [4] trained a large, deep ConvNet (AlexNet) that won the ILSVRC-2012 competition over the other contestants, the first time a model performed so well on the ImageNet dataset. Zeiler and Fergus [17] explained much of the intuition behind ConvNets and showed how to use a multi-layered deconvnet [18] to visualize the filters and weights. Simonyan and Zisserman [12] created the simple and deep (19-layer) VGG Net, which reinforced the notion that ConvNets benefit from deep stacks of layers. Szegedy et al. [13] introduced the Inception module and built GoogLeNet, showing that creative structuring of layers can lead to improved performance and computational efficiency. He et al. [3] presented a residual learning framework to ease the training of deep networks and built a 152-layer ResNet, which won ILSVRC 2015 with an error rate of 3.6% (lower than the human error rate of around 5–10%). Larsson et al. [5] advanced ResNet and built FractalNet, which shows that explicit residual learning is not a requirement for building ultra-deep neural networks. All in all, the development of ConvNets has been remarkable.

In this paper, we use ConvNets for large-scale commodity image classification. However, a single ConvNet is limited in its ability to treat all classes fairly, and combining the predictions of many different models is a very successful way to reduce test errors. Therefore, we propose a Class Grouping algorithm based on feedback to learn the similarity between classes, and train several deep ConvNets to become experts on different types of commodity images.

Schmidhuber et al. [10] declared that only winner neurons are trained; they trained several deep neural columns to become experts on inputs preprocessed in different ways and averaged their predictions. Simonyan and Zisserman [11] proposed a two-stream ConvNet architecture that incorporates spatial and temporal networks; each stream was implemented as a deep ConvNet, and their softmax scores were combined by late fusion.

Inspired by the works above, we build a cascade ConvNets model for large-scale image classification. Going beyond those works, our model introduces the theory of three-way decisions (3WD) to simulate the human decision process. We build a 3WD-based cascade model with three layers: the first layer is a base ConvNet for all images, the third layer consists of several expert ConvNets trained for groups of similar classes, and the most important second layer is a three-way decision layer that controls whether data flows from the first layer to the third layer.

The notion of three-way decisions was originally motivated by the need to explain the three regions of probabilistic rough sets. A theory of three-way decisions is constructed based on the notions of acceptance, rejection, and noncommitment [15]: whenever it is impossible to make an acceptance or a rejection decision, the third, noncommitment decision is made [14]. Three-way decisions play a key role in everyday decision-making and have been widely used in many fields and disciplines. Three-way spam filtering systems [19], for example, add a suspected folder that lets users further examine suspicious emails, thereby reducing the chance of misclassification. Three-way decisions are also commonly used in medical decision making [8, 9]. In the threshold approach to clinical decision making proposed in [9], by comparing the probability of disease with a pair of thresholds, a “testing” threshold and a “test-treatment” threshold, doctors make one of three decisions: (a) no treatment and no further testing; (b) no treatment but further testing; (c) treatment without further testing.

This paper extends the application of three-way decisions to image classification. When there is doubt about the classification result (the first layer), our model makes a noncommitment decision (the second layer) and gathers more information from expert classifiers (the third layer) to make the final prediction.

The structure of this paper is as follows. In Sect. 2 we describe our model. Next, in Sect. 3 we show the experiments and analyze the results. Last, we summarize our work in Sect. 4.

2 Model

We use the GoogLeNet model throughout the paper. GoogLeNet, as defined in [13], uses 9 Inception modules in its architecture, following the network-in-network structure [7]. By using Inception modules, the network is deeper and wider, and performs better than previous models.

In Sect. 2.1, we define some symbols that will be referred to later. Next, in Sect. 2.2, we introduce the Class Grouping algorithm based on feedback. Then, in Sect. 2.3, we introduce the three-way decision cascade model. Last, Sect. 2.4 presents the final form of our model, the CRL-supervised 3WD cascade model.

2.1 Symbols Definition

Before going further into our model, we define some symbols.

Img, the input image.

CAT \(=\left\{ c_1, c_2, \cdots , c_i, \cdots , c_C \right\} \), the class set, including C classes.

\(P=\left\{ p_1, p_2, \cdots , p_i, \cdots , p_C \right\} \), the classification result of a ConvNet model, where \(p_i\) is the probability that Img is classified as class \(c_i\).

Conf \(=(n_{ij})_{C \times C}\), the confusion matrix of the ConvNet test result, where \(n_{ij}\) is the number of images of class \(c_{i}\) classified as class \(c_{j}\). The larger \(n_{ij}\) is, the more easily images of class \(c_{i}\) are misclassified as class \(c_{j}\).

Threshold of possible classes (Th-pos). Obviously, we do not need to consider the possibility that Img belongs to class \(c_i\) if \(p_i\) is very small. Therefore, we need a threshold to determine the possible classes of Img: if \(p_i\) is no less than Th-pos, we consider that Img may belong to class \(c_i\). We stipulate that Th-pos is no less than \(\frac{1}{C}\).

Top-1 class (referred to as \(c_{top}\)), the class the model considers most probable; its probability is \(p_{top}\).
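To make the notation concrete, the following minimal Python sketch (illustrative values and variable names of our own, not from the paper) derives the possible classes and the top-1 class \(c_{top}\) from a classifier output P and the threshold Th-pos.

```python
import numpy as np

# Hypothetical softmax output P of a ConvNet over C classes (values are illustrative only).
P = np.array([0.05, 0.62, 0.01, 0.22, 0.10])
C = len(P)
th_pos = max(0.1, 1.0 / C)              # Th-pos is stipulated to be no less than 1/C

# Possible classes: every c_i with p_i >= Th-pos.
possible_classes = [i for i in range(C) if P[i] >= th_pos]

# Top-1 class c_top and its probability p_top.
c_top = int(np.argmax(P))
p_top = float(P[c_top])
print(possible_classes, c_top, p_top)   # [1, 3] 1 0.62
```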

2.2 Class Grouping Algorithm Based on Feedback

For commodity images on web-based platforms, many commodity classes are similar to each other, making it difficult for both humans and machines to distinguish them (see Fig. 1). Therefore, a classifier trained on all the classes is not enough to distinguish those similar classes; we need more specialized classifiers trained on certain groups of similar classes. In this paper, we propose a Class Grouping (CG) algorithm (see Algorithm 1) based on the feedback of the classification results.

Fig. 1. Experimental data, 4 samples per class. Classes \(c_{1}\), \(c_{5}\), \(c_{10}\) and \(c_{14}\) have similar features: they are all kinds of sweaters. Classes \(c_{3}\), \(c_{18}\), \(c_{29}\), \(c_{30}\), \(c_{31}\) and \(c_{32}\) are all kinds of trousers.

Algorithm 1. Class Grouping (CG) algorithm based on feedback.

We define \(s_{ij}\) as the similarity between class \(c_{i}\) and class \(c_{j}\) (see Formula (1)).

$$\begin{aligned} s_{ij}=\frac{n_{ij}}{\sum _{t\,=\,1}^Cn_{it}}\times \frac{n_{ji}}{\sum _{t\,=\,1}^Cn_{jt}} \end{aligned}$$
(1)

After running the CG algorithm, we obtain K clusters: \(cat-1 \), \(cat-2 \), \(\cdots \), \(cat-K \). We train an expert ConvNet for each cluster (see Sect. 3.2). These ExpConvNets are used to build the third layer of the cascade model.

Based on the class grouping results, we introduce the notion of similar-classes. If classes \(c_{i}\) and \(c_{j}\) belong to the same subset cat-k, we say that \(c_{i}\) and \(c_{j}\) are similar-classes. Similar-classes have a high probability of being misclassified as each other and therefore require expert judgment. A code sketch of the grouping step is given below.
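The exact pseudocode of Algorithm 1 is given as a figure in the original, so the Python sketch below is only a plausible reconstruction: it computes the pairwise similarity of Formula (1) from the confusion matrix and then groups classes greedily by descending similarity. The function names and the single-linkage merging strategy are our assumptions, not the paper's.

```python
import numpy as np

def class_similarity(conf):
    """Pairwise similarity s_ij from the confusion matrix Conf (Formula (1)).

    Assumes every class has at least one test image (nonzero row sums).
    """
    row_sums = conf.sum(axis=1, keepdims=True).astype(float)
    rates = conf / row_sums          # fraction of class i images predicted as class j
    return rates * rates.T           # s_ij = rate_ij * rate_ji

def group_classes(conf, k):
    """Greedily merge the most similar classes until only k clusters remain.

    This is a hypothetical reconstruction of the CG algorithm, not its
    published pseudocode.
    """
    c = conf.shape[0]
    sim = class_similarity(conf)
    clusters = [{i} for i in range(c)]            # start with one cluster per class
    while len(clusters) > k:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(sim[i, j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] |= clusters[b]                # merge the most similar pair
        del clusters[b]
    return clusters                               # the subsets cat-1 .. cat-K
```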

2.3 Three-Way Decision Based Cascade Model

A theory of 3WD is constructed based on the notions of acceptance, rejection, and noncommitment. Inspired by 3WD theory, we no longer directly accept the classification result of the base classifier. Instead, we make one of two decisions: (a) accept it if it is reliable; (b) opt for noncommitment if it is not. Since this is a multiclass classification problem rather than a binary-decision problem, there is no “reject” option. We judge the classification result to be unreliable if two conditions are met: (i) there exists a class \(c_a\) (\(c_a\not =c_{top}\)) with \(p_a\ge Th-pos \); (ii) \(c_a\) and \(c_{top}\) belong to the same subset cat-k. Condition (i) guarantees that Img may belong to class \(c_a\); condition (ii) guarantees that \(c_a\) and \(c_{top}\) are similar-classes. We define these conditions under the hypothesis that meeting both of them means the classifier is confused: it considers that Img could be predicted as either \(c_a\) or \(c_{top}\). Thus, we need to put the image into the expert classifier ExpConvNet-k for further judgment. Formula (2) describes the 3WD process.

$$\begin{aligned} 3WD= {\left\{ \begin{array}{ll} \text {delay,}&{} \text {if conditions (i) and (ii) are satisfied}\\ \text {accept,}&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)
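As a concrete illustration of Formula (2), the following Python sketch checks conditions (i) and (ii) and returns either “delay” or “accept”; the interface and argument names are assumptions of ours, not taken from the paper.

```python
def three_way_decision(P, c_top, groups, th_pos):
    """Formula (2): decide whether to delay or accept a base-classifier result.

    P      : probability vector p_1..p_C from the base ConvNet
    c_top  : index of the top-1 class
    groups : list of sets of class indices (the subsets cat-1 .. cat-K)
    """
    cat_k = next((g for g in groups if c_top in g), None)   # subset containing c_top
    if cat_k is not None:
        for c_a, p_a in enumerate(P):
            # (i) c_a is a possible class other than c_top;
            # (ii) c_a and c_top are similar-classes (same subset cat-k).
            if c_a != c_top and p_a >= th_pos and c_a in cat_k:
                return "delay"
    return "accept"
```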

The ability of a single classifier is limited; combining several different models is a way to reduce the error rate [4]. Cascading [2] is a special case of ensemble learning. The basic idea of a cascade is to connect multiple classifiers in sequence: information is passed between layers, and the output of the upper classifier serves as additional information for the next classifier.

Under the guidance of 3WD theory, we establish a self-adaptive cascade ConvNets model with three layers (see Algorithm 2). The first layer is a base classifier (a base ConvNet), the second layer is a 3WD layer, and the third layer is an expert layer consisting of several expert classifiers (ExpConvNets). We put Img into the first layer and send the classification result \(P^{1}\) to the 3WD layer. The 3WD layer then makes one of two decisions: (a) accept \(P^{1}\) as the final P; (b) opt for noncommitment and put Img into ExpConvNet-k (whose classification result is \(P^{2}\)). Finally, we calculate the probability P from \(P^{1}\) and \(P^{2}\), as given in Formula (3).

$$\begin{aligned} p_i= {\left\{ \begin{array}{ll} p_{c_i}^2\text {, }&{} \text {if }c_i\in cat-k \\ p_{c_i}^1\text {, }&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

where \(p_{c_i}^1\) represents the probability of the image being predicted as class \(c_i\) by the base classifier and \(p_{c_i}^2\) represents the probability of the image being predicted as class \(c_i\) by the expert classifier.
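Putting the pieces together, the sketch below shows how the whole cascade could be run for one image, reusing the three_way_decision sketch above and applying Formula (3). The callables base_net and expert_nets[k] are assumptions: they stand for the base ConvNet and the ExpConvNets, returning probability vectors over all C classes and over the classes of subset cat-k (in sorted class order), respectively.

```python
import numpy as np

def cascade_predict(img, base_net, expert_nets, groups, th_pos):
    """Illustrative sketch of the 3WD-based cascade (3WD-CM), not the paper's code."""
    P1 = base_net(img)                                    # first layer: base ConvNet
    c_top = int(np.argmax(P1))
    if three_way_decision(P1, c_top, groups, th_pos) == "accept":
        return P1                                         # accept P^1 as the final P
    # Noncommitment: forward Img to the expert ConvNet of c_top's subset cat-k.
    k = next(i for i, g in enumerate(groups) if c_top in g)
    P2 = expert_nets[k](img)
    P = P1.copy()
    for local_idx, c_i in enumerate(sorted(groups[k])):   # Formula (3)
        P[c_i] = P2[local_idx]                            # expert probability for c_i in cat-k
    return P
```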

Algorithm 2. 3WD-based cascade model (3WD-CM).

2.4 CRL-supervised 3WD Cascade Model

3WD decides which images may need expert judgment. However, the experimental results (see Table 2) tell us that blindly following the decision of 3WD is not a good idea. Suppose the base classifier predicts Img as class \(c_i\), the 3WD layer delays this result, and after expert judgment the cascade model finally predicts Img as class \(c_j\). Experimental experience shows that, depending on the class, one of two situations dominates: (1) in most cases \(c_i\) is the correct class while \(c_j\) is wrong, so 3WD turns a right result into a wrong one; (2) on the contrary, in most cases \(c_i\) is wrong while \(c_j\) is the correct class, so 3WD turns a wrong result into a right one. Obviously, we welcome the latter situation. Thus, we need to supervise the 3WD process.

We define the Correcting Reliability Level (CRL) to supervise the 3WD process. A high CRL means that the 3WD has a high probability of making the right decision. We define \(TF_i\) (Truth to False) to describe situation (1) above and \(FT_i\) (False to Truth) to describe situation (2) above. CRL is computed by Formula (4).

$$\begin{aligned} CRL_i= {\left\{ \begin{array}{ll} \frac{N_{FT_i}-N_{TF_i}}{N_{FT_i}+N_{TF_i}}, &{} \text {if } N_{FT_i}>N_{TF_i} \text { and } N_{FT_i}>0 \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

We use a random function

$$\begin{aligned} R(p)=binomial(1,p) \end{aligned}$$
(5)

to map the CRL value to a Boolean value: “True” means following the 3WD decision and putting Img into the expert classifier, while “False” means accepting the result of the base classifier. Here p is the probability of returning “True”; \(R(CRL_i)\), for instance, returns “True” with probability \(CRL_i\). Therefore, the larger the CRL value, the more likely Img is to be passed into the expert classifier.
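A minimal sketch of Formulas (4) and (5) follows, assuming the per-class counts \(N_{FT_i}\) and \(N_{TF_i}\) have already been collected from a 3WD-CM test run; the function names and the example numbers are ours, not the paper's.

```python
import numpy as np

def compute_crl(n_ft, n_tf):
    """Formula (4): CRL_i from the counts N_FT_i and N_TF_i of class c_i."""
    if n_ft > n_tf and n_ft > 0:
        return (n_ft - n_tf) / float(n_ft + n_tf)
    return 0.0

def follow_3wd(crl_i, rng=np.random.default_rng()):
    """Formula (5): R(CRL_i) returns True with probability CRL_i (binomial draw)."""
    return bool(rng.binomial(1, crl_i))

# Placeholder counts (not the paper's): 40 corrections vs. 10 spoiled predictions.
crl = compute_crl(n_ft=40, n_tf=10)   # 0.6, so the 3WD decision is followed 60% of the time
```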

On the basis of 3WD-CM, we add a CRL table after the 3WD layer. The CRL table is used to determine whether Img warrants expert judgment (see Algorithm 3).

Algorithm 3. CRL-supervised 3WD cascade model (CRL-CM).

3 Experiments

In this section, we present the experimental details. In Sect. 3.1, we introduce the experimental dataset. Next, in Sect. 3.2, we describe the class grouping process and ExpConvNets training. Then, in Sect. 3.3, we show the results of the 3WD-based cascade model. Finally, in Sect. 3.4, we analyze the results of the CRL-supervised 3WD cascade model.

3.1 JD Clothing Dataset

The experimental data of this paper is the JD clothing dataset; see Fig. 1 for examples. JD is one of the most famous B2C shopping sites in China and the first large-scale integrated e-commerce platform to be listed in the United States. JD has a strong market share and has therefore accumulated a large amount of commodity image data, which provides researchers with rich resources. Our experimental dataset contains about 400,000 clothing images covering 37 classes. The dataset is divided into training, validation, and test sets at a ratio of 8:1:1.

In this paper, we report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model [4].
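For clarity, a small Python sketch of how the top-k error rate can be computed from predicted probabilities; the array shapes and names are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def top_k_error(probs, labels, k=5):
    """Fraction of test images whose true label is not among the k most
    probable classes; probs is (N, C), labels is (N,) with class indices."""
    top_k = np.argsort(probs, axis=1)[:, -k:]        # indices of the k largest p_i per image
    hit = (top_k == labels[:, None]).any(axis=1)     # True where the label is in the top k
    return 1.0 - float(hit.mean())
```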

3.2 Class Grouping and ExpConvNets Training

In this paper, we run experiments on the Caffe deep learning framework. Yosinski et al. [16] pointed out that fine-tuning is better than randomly initializing parameters, and our experimental results confirm this. We start from a pre-trained model (GoogLeNet trained on ImageNet LSVRC-2014) and fine-tune it. The resulting top-1 error rate is 44.59% and the top-5 error rate is 9.40%, which are better than random initialization (top-1 error rate 57.11%, top-5 error rate 21.16%). We call the fine-tuned model the Base Model (BM) below; subsequent experiments fine-tune models on the basis of it.

The test results confirm that images of many classes are easily misclassified as each other, for example classes \(c_{1}\) and \(c_{10}\) (see Fig. 2). Thus, we group those similar classes into the same subset with the CG algorithm introduced in Sect. 2.2.

Fig. 2. Stacked bar chart of the BM test results. Taking class \(c_1\) as an example, about 20% of its test images are misclassified as class \(c_{10}\); similarly, about 40% of the test images of class \(c_{10}\) are misclassified as class \(c_{1}\).

We set \(K=5\) and divide CAT into 5 subsets. After class grouping, we train an expert ConvNet for each subset. We fine-tune the BM, keeping most of the architecture (changing only the number of outputs of the last layer) and resuming training from the BM weights. Table 1 shows the class grouping results and the ExpConvNets error rates.

Table 1. Class grouping results and ExpConvNets error rates.

3.3 3WD-based Cascade Model

In order to obtain the CRL table, we first need to test images with 3WD-CM. We set Th-pos to 0.1 in this experiment. The top-1 error rate of 3WD-CM is 44.401%, a reduction of 0.189% compared with BM; the top-5 error rate is 9.475%, an increase of 0.075% compared with BM. These results are unsatisfactory: the top-1 error rate decreases only slightly, and the top-5 error rate increases rather than decreases.

Table 2 shows the counts of TF and FT. We can see that there are 1,195 cases of situation (1) and 1,273 cases of situation (2) in total. In effect, therefore, only 78 samples are corrected by 3WD-CM, and the accuracy increases by only 0.189%. When \(c_1\) is considered as \(c_{top}\) by the base classifier, a total of 613 images are judged by 3WD to need expert judgment; of these, 135 images are modified correctly and 478 incorrectly, a net loss of 343 images. This shows that when \(c_1\) is considered as \(c_{top}\), we had better ignore the 3WD decision that Img needs expert judgment and instead accept \(c_1\) as the prediction result. On the contrary, if the base classifier considers \(c_{10}\) as \(c_{top}\), we had better follow the 3WD and obtain an expert judgment for Img, because for \(c_{10}\) the number of images modified correctly is greater than the number modified incorrectly. Therefore, we use CRL to supervise the 3WD process.

Table 2. Classification results of 3WD-CM

3.4 CRL-supervised 3WD Cascade Model

Now we calculate the CRL of each class (see Table 3) and establish the CRL-supervised 3WD cascade model (CRL-CM). Table 4 shows the classification performance of CRL-CM under different Th-pos values (0.1, 0.2, 0.3 and 0.4). We test 30 times for each Th-pos value and take the average error rate. Compared with BM, the top-1 error rate is reduced by about 1.09% when \(Th-pos = 0.1\); the top-5 error rate does not decrease noticeably. As Th-pos increases, the error-rate reduction shrinks, because the greater Th-pos is, the harsher the condition for 3WD to judge a sample as “unsure”, so misclassified samples lose the chance of being corrected. The experimental results show that CRL-CM can effectively reduce the classification error rate compared with a single base ConvNet.

Table 3. CRL table.
Table 4. Average error rates of CRL-CM under different Th-pos.

4 Conclusion

In this paper, we integrate several different ConvNets to build a CRL-supervised 3WD cascade model. Experimental results show that our model can effectively reduce the error rate compared with a single ConvNet. The contributions of this paper are: (i) simulating the human decision process by using 3WD to construct a cascade model with several ExpConvNets that become experts on different groups of similar classes; (ii) introducing CRL to supervise the 3WD process, which effectively reduces the error rate. In future work, we will conduct more experiments on public datasets such as ImageNet to further validate our model.