1 Introduction

Banana is one of the most widely consumed fruits globally, contributing about 16% of the world’s fruit production. According to the FAO (Food and Agriculture Organization of the United Nations), more than 114 million tons of bananas were produced worldwide in 2014, and China has ranked second among the top banana producers in recent years. Banana is also the largest tropical fruit crop in China, covering an area of 392,000 ha with a production of 11,791,900 tons in 2014. More than 110 banana cultivars are grown in China, among which Musa AAA Cavendish cv. Brazil is the most important for both the Chinese national and international markets [1].

Both quality control and consumer acceptance require bananas to be at an optimum ripening stage [2, 3]. In recent decades, many studies on the ripening assessment of bananas have been conducted. The presented methods can be classified into two categories: biochemical or physicochemical property-based methods and computer vision-based techniques. A plethora of methods have been presented to understand the ripening process of bananas through various bioindicators, including starch content, soluble solid content, sugar content, and firmness. The primary objective of these methods is to discover the relationship between the ripening process and the physical, biochemical, or nutritional transformations [4–7]. These methods can supply fine-grained classification results of bananas during the ripening process. For example, in [4] a starch staining instrument is used to differentiate banana maturity through the disappearance of starch. With an accurate determination of starch content, the ripening process of a banana can be divided into more than 50 stages. However, most of these methods are laborious and expensive because they involve invasive or destructive techniques.

Many computer vision-based approaches have also been proposed to classify the ripening stages of bananas based on their appearance and various computer-based algorithms [8]. These computer vision techniques [9–11] can potentially provide an automated and non-destructive tool for classifying ripening bananas. However, none of them has been widely used, owing to several limitations. First, they have rarely paid attention to fine-grained classification and are therefore incapable of differentiating subtle differences among subordinate classes of ripening bananas. Owing to the limitations of relying on the skin color of the banana and of the computer vision systems themselves, these methods are usually able to divide the ripening process into only seven stages [12]. Second, these methods exploit hand-crafted features, resulting in limited performance because of the difficulty of designing features manually. Third, most of them deteriorate for lower-grade bananas because of the difficulty of modeling skin defects [10].

In the field of artificial intelligence, recent advances in deep learning [13–15] have led to breakthroughs in long-standing vision tasks such as feature extraction [16, 17], image segmentation [18–20], and image classification [21–26]. Among these techniques, the convolutional neural network (CNN) is one of the most successful [13] and has found broad application in image classification. Recently, CNN-based models for fine-grained image classification have made tremendous advances in recognizing subtle differences among subordinate classes, including traffic classification [27, 28], medical image classification [29], plant classification [30], and food classification [31].

In this paper, we present a deep indicator for the fine-grained, image-based classification of the precise ripening stages of bananas. It is built on a novel CNN architecture designed specifically for the unique characteristics of banana appearance. The proposed CNN framework takes image triplets as input, from which the triplet loss (similarity loss) is computed. Through the joint optimization of classification accuracy and similarity loss, the proposed technique can effectively learn fine-grained feature representations of the ripening process of a banana.

The proposed technique bears at least three advantages. First, it leverages a training process to automatically extract multi-scale image features that combine both the global and local features of the banana image. The mapping learned from these fine-grained features provides a function that automatically specifies the ripening stage of a banana given a new input image. Experimental results on a large set of 17,312 banana images show that the proposed CNN architecture achieves an impressive classification accuracy of 94.4% at the laboratory level. Second, as samples of bananas with imperfections are also included in the experiments, the results validate that our method can be applied to the fine-grained classification of bananas at different ripening stages regardless of whether the banana peel bears defects. Third, unlike some existing methods [32], our indicator does not incorporate any information from the physiochemical or biological changes during the ripening process of the banana, yet it still achieves competitive classification accuracy, as validated by our experiments.

Our work offers at least four significant contributions as below:

  (I) To the best of our knowledge, this paper is the first attempt to introduce CNNs into banana ripening assessment, whereas previously proposed methods are mainly based on traditional machine vision algorithms.

  (II) We propose a novel CNN architecture adapted to fine-grained feature representations for a subtle classification of banana ripening stages, whereas state-of-the-art methods mainly aim at a coarse classification of banana ripeness.

  (III) The proposed approach is data-driven and employs an integrated system that combines a set of learning functions for both the features and the feature-to-class mapping.

  (IV) Our approach clearly outperforms state-of-the-art techniques.

The rest of this paper is organized as follows. In Section 2, we present the details of the materials we used and our approach. Section 3 contains our experimental results and discussions. In Section 4, we provide our conclusion and vision for the future.

2 Materials and methods

2.1 Fruit selection and sampling

Twenty batches of bananas (Musa AAA Cavendish cv. Brazil) were purchased from a local wholesale market. To reduce sample variability, 197 bananas of similar size, color, and weight (200–245 g) were selected for the subsequent analysis.

The bananas were sanitized with 1% NaOCl for 15 min, rinsed with distilled water, and dried at ambient temperature. During the experiments, they were stored in darkness at 25 °C and 75% relative humidity for 14 days in a stability chamber (Sailham 523000, China). The bananas at selected ripening stages are shown in Fig. 1.

Fig. 1

Image of the bananas at selected ripening stages (from left to right). Each banana represents a stage during the ripening process

2.2 Computer vision system

A computer vision system (shown in Fig. 2) was built to capture images of the bananas during the ripening process. The system includes a lighting system with a ring fluorescent lamp (FGR series, Wordop, China) to avoid inhomogeneous lighting. The camera (Canon EOS 760D, Japan) was operated in manual mode with an exposure level of 0.0 and without zoom or flash. The white balance was set to white fluorescent light. The samples (bananas) were placed on a piece of white paper on a steady table. The color space used for the subtle classification of ripening stages is sRGB (standard RGB). The captured images, 3200×2400 pixels in size (0.1 mm/pixel), were stored on the server in PNG (Portable Network Graphics) format.

Fig. 2

The proposed computer vision system. The system consists of a camera, a ring fluorescent lamp, the white paper under the banana, and a steady table

2.3 The proposed CNN architecture

A CNN is typically a multilayer, hierarchical neural network, but it bears at least three principal factors that distinguish it from a generic neural network: local receptive fields, weight sharing, and spatial pooling layers [33]. A CNN employs a local receptive field rather than a global one; similar to how the brain captures the local structure of an image, each neuron is constrained to depend only on a spatially local subset of the neurons in the preceding layer. Moreover, weights are shared across different neurons in the same layer, which is equivalent to evaluating the same filter over all local windows of the input image. Spatial pooling in a CNN divides the image into an array of blocks and then evaluates a pooling function over the responses in each block. The goal of pooling is to reduce the dimensionality of the convolutional responses and to enforce a small degree of translational invariance in the model. In the case of max pooling, the response for each block is taken to be the maximum value over all response values within the block. A typical CNN consists of multiple layers, alternating between convolution and pooling. Compared with shallow CNN architectures, a deep CNN has more hidden layers. Lower layers, defined as the ones closer to the input, construct low-level convolutional filters, which can be thought of as providing a low-level encoding of the input image. In contrast, higher layers learn more complicated structures. In a CNN, the stride length specifies the number of pixels by which the local receptive field is moved to the right (or down). These building blocks are illustrated in the sketch below.
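The following toy example (our own illustration in TensorFlow/Keras, not the authors' implementation; the layer sizes are arbitrary) shows how a local receptive field with weight sharing, a stride, and max pooling interact:

```python
import tensorflow as tf

# One 7x7 kernel (local receptive field) is shared across all spatial
# windows of the input; a stride of 2 moves the window 2 px at a time,
# and max pooling takes the maximum response over each 2x2 block.
x = tf.random.normal([1, 32, 32, 3])  # one 32x32 RGB image
conv = tf.keras.layers.Conv2D(filters=8, kernel_size=7, strides=2,
                              padding="same", activation="relu")
pool = tf.keras.layers.MaxPooling2D(pool_size=2)

y = pool(conv(x))
print(y.shape)  # (1, 8, 8, 8): spatial size halved by the stride, then by pooling
```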

To classify bananas at subtle ripening stages, we propose a novel CNN architecture that predicts the class probabilities of the input image. It is composed of convolutional layers with rectified linear units (ReLU), max pooling, and a fully connected layer with ReLU, as shown in Fig. 3. For each image, a positive image (at the same ripening stage as the original image) and a negative image (not at the same ripening stage) are chosen from the captured banana images. The three images and the label of the original image are then jointly fed into our proposed CNN structure. Three parameter-sharing CNNs handle the original, positive, and negative image, respectively. The structure of the CNN is shown in Fig. 3.

Fig. 3

The proposed CNN architecture. The original image, positive image, negative image, and the label of the original image are taken as input by the framework. The yellow, red, and blue circles represent the l2-normalized vectors of the original, positive, and negative images, respectively. The structured feature represents the features extracted by the CNN framework. Two types of losses are exploited to obtain the fine-grained classification results

  • Convolutional layer 1. 48 kernels of size 3×7×7 (3 denotes the number of RGB channels) with a stride of 2 are applied to the input banana image in the first layer, combined with ReLU. A max pooling layer follows this convolutional layer.

  • Convolutional layer 2. 128 kernels of size 3×5×5 with a stride of 2 are applied in the second layer, combined with ReLU. A max pooling layer follows this convolutional layer.

  • Convolutional layer 3. 128 kernels of size 3×3×3 with a stride of 2 are applied in the third layer, combined with ReLU.

  • Fully connected layer. 512 neurons combined with ReLU, used to perform high-level reasoning as in conventional neural networks. A minimal sketch of this stack is given below.
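A minimal TensorFlow/Keras sketch of one parameter-sharing branch follows. The kernel counts, kernel sizes, strides, and the 512-neuron fully connected layer come from the list above; the pooling sizes, padding, and the l2-normalized embedding head are our assumptions, so this is an illustration rather than the authors' exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_branch(num_classes=7):
    """One parameter-sharing branch of the triplet network (a sketch)."""
    inp = layers.Input(shape=(256, 256, 3))
    x = layers.Conv2D(48, 7, strides=2, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)                    # pooling after conv layer 1
    x = layers.Conv2D(128, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)                    # pooling after conv layer 2
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    feat = layers.Dense(512, activation="relu")(x)   # fully connected layer
    emb = layers.Lambda(lambda t: tf.nn.l2_normalize(t, axis=1))(feat)
    logits = layers.Dense(num_classes)(feat)         # classification head
    return tf.keras.Model(inp, [emb, logits])

branch = build_branch()
# Weight sharing: the same branch is applied to all three images, e.g.
# emb_o, logits_o = branch(original); emb_p, _ = branch(positive); emb_n, _ = branch(negative)
```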

The structured feature in Fig. 3 displays the features extracted from the original, positive, and negative images by the proposed CNN framework. Our architecture serves as a baseline that naturally embeds label structures without sacrificing classification accuracy.

A softmax loss operation is located at the end of the CNN channel for the original image. The corresponding softmax loss function is given as

$$ L_{s}=-\sum_{i=1}^{N} \log P\left(\omega_{k} \mid L_{i}\right), $$
(1)

where N denotes the number of input images and $P(\omega_{k} \mid L_{i})$ indicates the probability that the ith image, with ground-truth label $L_{i}$, is correctly classified into class $\omega_{k}$.
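Under the standard cross-entropy reading of Eq. (1), this loss can be computed as follows (a sketch, not the authors' code; we average rather than sum over the batch):

```python
import tensorflow as tf

def softmax_loss(logits, labels):
    """Eq. (1): negative log-probability of the true class for each of
    the N original images (averaged here instead of summed)."""
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, logits, from_logits=True))
```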

Besides the softmax loss function, which has been exploited in other CNNs, a triplet loss is also introduced in our proposed CNN architecture:

$$ L_{t}=\frac{1}{2N}\sum_{i=1}^{N} {\text{max}}\left(0,D(o_{i},p_{i})-D(o_{i},n_{i})+m\right), $$
(2)

where $D(\cdot,\cdot)$ is the squared Euclidean distance between two l2-normalized vectors; $o_{i}$, $p_{i}$, and $n_{i}$ denote the l2-normalized vectors of the original, positive, and negative images, respectively, as shown in Fig. 3; and m denotes the margin hyper-parameter, which keeps the value of Eq. (2) above zero unless the distance between the original and negative images exceeds the distance between the original and positive images by at least m.
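A direct transcription of Eq. (2) in TensorFlow might look as follows (a sketch; the margin value m = 0.2 is a hypothetical choice, not taken from the paper):

```python
import tensorflow as tf

def triplet_loss(o, p, n, m=0.2):
    """Eq. (2). o, p, n are l2-normalized embedding batches of shape
    [N, d]; m = 0.2 is a hypothetical margin."""
    d_pos = tf.reduce_sum(tf.square(o - p), axis=1)  # D(o_i, p_i)
    d_neg = tf.reduce_sum(tf.square(o - n), axis=1)  # D(o_i, n_i)
    return tf.reduce_mean(tf.maximum(0.0, d_pos - d_neg + m)) / 2.0
```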

As shown in Eq. (3), the two losses are integrated to obtain the fine-grained classification indicator:

$$ L=\lambda L_{s}+(1-\lambda)L_{t}, $$
(3)

where λ denotes the weight that controls the trade-off between the softmax loss ($L_{s}$) and the triplet loss ($L_{t}$). A sketch of the combined objective is given below.
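Building on the two loss sketches above, Eq. (3) amounts to a weighted sum (λ = 0.7 is a hypothetical value; Section 3.4 only indicates that λ > 0.5 is reasonable):

```python
# Assuming logits_o, labels and the embeddings emb_o, emb_p, emb_n come
# from the branch sketched in Section 2.3:
lam = 0.7  # hypothetical value; the paper only suggests lambda > 0.5
loss = lam * softmax_loss(logits_o, labels) \
       + (1.0 - lam) * triplet_loss(emb_o, emb_p, emb_n)
```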

3 Results and discussion

3.1 Dataset and preprocessing

We collected 17,312 images of bananas at different ripening stages (the whole process lasts 14 days), with about 30 images captured for every banana every day. To mitigate potential over-fitting, we enlarged the original dataset with data augmentation methods, including translations (varying from 10 to 100 pixels in steps of 10 pixels) and vertical and horizontal reflections. After the data augmentation, the images were resized to 256×256. For each image, a positive image and a negative image, as described in Section 2.3, were chosen from the captured images to form a triplet. A sketch of this augmentation is given below.
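A minimal sketch of such an augmentation pipeline (our own illustration; the wrap-around behavior of tf.roll is a simplification of the translation described above):

```python
import tensorflow as tf

def augment(image):
    """Random translation of 10-100 px (steps of 10) plus horizontal and
    vertical flips, followed by resizing to 256x256 (a sketch)."""
    shift = 10 * tf.random.uniform([2], minval=1, maxval=11, dtype=tf.int32)
    image = tf.roll(image, shift=shift, axis=[0, 1])  # wrap-around translation (simplification)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    return tf.image.resize(image, [256, 256])
```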

3.2 Training and evaluating

We manually labeled the samples into 7 categories and 14 categories (according to the date of image capture), respectively. Fifty percent of the images were taken as the training dataset, 30% were chosen for the evaluation dataset, and the remaining 20% were used for testing. In the training process, the proposed framework is refined with the back-propagation mechanism, which iteratively minimizes the squared difference between the classification ground truth and the corresponding output prediction. The training was performed on a high-performance GPU and implemented in TensorFlow [34]; it takes $10^{5}$ iterations, each costing about 0.5 s.
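The 50/30/20 split can be realized, for example, as follows (a sketch; the fixed random seed and image-level splitting are our assumptions):

```python
import numpy as np

num_images = 17312                     # dataset size reported above
rng = np.random.default_rng(0)         # hypothetical fixed seed
idx = rng.permutation(num_images)
n_train, n_val = int(0.5 * num_images), int(0.3 * num_images)
train_idx = idx[:n_train]                      # 50% training
val_idx = idx[n_train:n_train + n_val]         # 30% evaluation
test_idx = idx[n_train + n_val:]               # 20% testing
```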

3.3 Experimental results

After the training process, we conducted experiments to verify the performance of our proposed CNN architecture. We chose several state-of-the-art image classification methods [35–38] for comparison with our proposed method. These SVM-based methods exploit different features, including single and combined features.

As shown in Fig. 4, the accuracy of our method for the 7-category classification is 94.4% after 31 iterations, while the training loss decreases to 0.15. Note that the decrease of the training loss indicates the convergence of the CNN architecture.

Fig. 4

The classification accuracy for seven ripening stages of banana. The orange line denotes the classification accuracy during the training and validation process. The green and blue lines represent the training loss and validation loss, respectively. Epoch (the X axis) represents the number of iterations in the evaluation; loss and accuracy (the Y axis) represent the error and the accuracy during the evaluation process

Figure 5 shows the classification results of our approach for distinguishing the 14 ripening stages. The accuracy obtained by our method is 92.4% after 34 iterations, and the training loss decreases to 0.22. We also find that the proposed method performs well in the fine-grained classification of bananas during the ripening process.

Fig. 5

The classification accuracy for 14 ripening stages of banana. The orange line denotes the classification accuracy during the training and validation process. The green and blue lines represent the training loss and validation loss, respectively. Epoch (the X axis) represents the number of iterations in the evaluation; loss and accuracy (the Y axis) represent the error and the accuracy during the evaluation process

Meanwhile, to further compare the performance of our method with that of the state-of-the-art methods, the precision and recall were calculated with the following two equations:

$$ {\text{Precision}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}}, $$
(4)
$$ {\text{Recall}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FN}}}, $$
(5)

where TP, FP, and FN denote true positive, false positive, and false negative, respectively.
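For example, with hypothetical counts of 90 true positives, 5 false positives, and 7 false negatives, Eqs. (4) and (5) give:

```python
def precision_recall(tp, fp, fn):
    """Eqs. (4) and (5)."""
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(90, 5, 7)  # hypothetical counts
print(f"precision = {p:.3f}, recall = {r:.3f}")  # precision = 0.947, recall = 0.928
```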

As revealed by the results for 7- and 14-category classifications in Table 1, the proposed deep indicator outperforms state-of-the-art methods.

Table 1 Performances on classifying 7 and 14 ripening stages of banana

We show the feature maps extracted by our CNN architecture in Fig. 6. We noticed that the extracted features combine information from color, shape, and texture of the banana.

Fig. 6

(left) The feature maps extracted from convolutional layer 1. (right) The feature maps obtained after convolutional layer 3. Both layers are shown in Fig. 3

Meanwhile, we chose images with severe defects, as shown in Fig. 7, to test the performance of our proposed method in relatively extreme situations. The precision and recall values generated from these images are shown in Table 2. The receiver operating characteristic (ROC) curves of our method and the state-of-the-art methods are shown in Fig. 8, and the corresponding areas under the curves (AUCs) are given in Table 3. ROC and AUC are classical measurements used to assess classification results [39]; a sketch of their computation is given below. We used a Z test to assess the statistical difference in AUC between the state-of-the-art methods and our approach, as shown in Table 3.
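ROC curves and AUCs can be computed, for instance, with scikit-learn (a sketch with hypothetical labels and scores for illustration only; the actual per-method values are reported in Table 3):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical binary labels and classifier scores
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3])

fpr, tpr, _ = roc_curve(y_true, y_score)  # false/true positive rates
print("AUC =", auc(fpr, tpr))
```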

Fig. 7

Two example images with severe defects

Fig. 8

ROC curves generated by the compared methods for the fine-grained classification of bananas with flaws

Table 2 Performances on images of bananas with severe defects
Table 3 The AUCs and AUC group testing performance of the compared methods

There are misclassifications due to the severe blemishes in these images; most are caused by extreme viewing conditions, e.g., too many black spots appearing in the image. Nevertheless, the experimental results show that even under such extreme situations, our method still obtains more satisfactory performance than the state-of-the-art methods.

3.4 Discussion

From the results shown above, we can see that the proposed CNN architecture presents a deep indicator (i.e., it indicates the ripeness of a banana with a deep CNN architecture) that performs accurately for both coarse and fine-grained classification of banana ripening stages. This indicator outperforms state-of-the-art techniques for bananas with or without severe defects, and its improvements are significant (as shown in Table 3).

Similar to the human visual system, our proposed approach can extract the global and local features of banana images. The global features combined with the local features form a layout for each category of bananas. Figure 6 shows that the first convolutional layer extracts global features, including the shape of the banana; the two following convolutional layers then hierarchically extract other features, including color and texture. Unlike the manually extracted features used in [35–38], the global and local features of the banana image are extracted automatically by our proposed CNN architecture, as illustrated in Fig. 6. The mapping between the input banana image and the output classification result is obtained through these extracted features.

Our proposed CNN framework significantly enhances image classification performance by jointly optimizing the classification loss (between the label of the original image and the output classification result) and the similarity loss (among the original, positive, and negative images). The parameter λ in Eq. (3), which balances the softmax loss and the triplet loss, plays an essential role in the proposed CNN architecture. When λ is set to 1 or 0, the structure degenerates to the softmax loss or the triplet loss alone, respectively. According to our trial-and-error process, it is reasonable to assign λ a value greater than 0.5, which indicates that the softmax loss should carry more weight than the triplet loss in our proposed CNN architecture. Note that the triplet loss contributes substantially to the image classification thanks to the complementary information from both the positive and negative images; it also encourages intra-class similarity and inter-class difference at the same time. The proposed CNN architecture is well suited to the characteristics of bananas: by combining the softmax loss with the newly presented triplet loss, the subtle intra-category and inter-category differences among bananas at different ripening stages can be distinguished.

4 Conclusions

In this paper, we present a deep indicator of banana ripening stages based on a novel CNN architecture, which offers a unique tool for automated, non-destructive, fine-grained classification of banana maturity. The proposed deep indicator integrates the capabilities of accurate fine-grained classification and non-invasive examination: the former can currently be achieved by bioindicators but not by computer vision systems, while the latter is an advantage of current computer vision systems but not of bioindicators. In the proposed CNN architecture, three parameter-sharing CNNs handle the three input images: the original, positive, and negative images. At the end of the CNN framework, the structured feature of the triplet input is obtained, and a softmax loss integrated with a triplet loss is then used to implement the fine-grained classification. To evaluate the performance of our method, we use a large image dataset consisting of 17,312 images of bananas with and without defects.

This paper offers several contributions. First, this is probably the first attempt to introduce a deep learning strategy into the fine-grained classification of bananas at different ripening stages. Second, the triplet loss introduced by our method positively affects image classification performance; to the best of our knowledge, this is also an early application of the similarity among the original, positive, and negative images in a CNN classifier. Third, similar to the human visual system, our proposed CNN framework can extract multi-scale features, including both the global and local features of banana images. Finally, our approach clearly outperforms state-of-the-art image classification techniques.

In future work, we will delve into the construction of different CNN architectures and explore the process of implicit feature extraction within CNNs. Meanwhile, we will investigate applications of our proposed CNN-based image classification method to other tasks, e.g., the classification of other fruits, medical image analysis [40], and industrial products. Furthermore, to leverage multi-modality images, including RGB and infrared, we will continue to study the application of CNNs to multi-modality image processing [41].