Introduction

Given the significance of food in our daily lives, fruit grading is crucial but time-consuming. Automatic grading using computerised approaches is regarded as a solution to this problem, as it saves human labour. There is evidence that, when fruit deteriorates, it goes through a series of biochemical transformations that lead to changes in its physical condition and chemical composition, e.g., changes in nutrition.

Fruit grading methods are grouped into two categories: non-visual and visual approaches. Non-visual grading approaches mainly concentrate on aroma, chemicals, and tactile impression. Fruit spoilage in nature is a biochemical process in which natural pigments are transformed through various reactions into other chemicals, resulting in changes of colour. Identifying fruit spoilage is an innate ability of the human perception system. It reflects the desirability and acceptability of a portion for consumption, and it assists in identifying whether the given fruits are edible or not [1].

Research shows that there is a strong relationship between bacteria and fruit spoilage. The organisms involved encompass aerobic psychrotrophic Gram-negative bacteria, which secrete extracellular hydrolytic enzymes that break down plant cell walls, as well as heterofermentative lactobacilli, spore-forming bacteria, yeasts, and moulds. Fruit degeneration is a consequence of biochemical reactions involving pectin, i.e., a structural acidic heteropolysaccharide found in the cell walls of terrestrial plants, chiefly consisting of galacturonic acid. Starch/amylum and sugar (i.e., polymeric carbohydrates with the same purposes) are then metabolised, producing lactic acid (i.e., an acid that is a metabolic intermediate and the end product of glycolysis, releasing energy anaerobically) and ethanol [2]. Colonisation and lesions induced by microbe dissemination are frequently observed, and infestation is a primary cause of spoilage in postharvest fruits [3].

Besides, a lack of nutrients results in the growth of dark spots, e.g., insufficient calcium leads to apple cork spots [4]. Exposure to oxygen is another determinant, as an enzyme known as polyphenol oxidase (PPO) triggers a chain of biochemical reactions involving proteins, pigments, fatty acids, and lipids, which lead to fading of the fruit colours as well as degradation to an undesirable taste and smell [5].

Established research evidence shows that, when fruit deterioration occurs, the fruit goes through a series of biochemical transformations that lead to changes in its physical condition, e.g., visual features including colour and shape. Most of these features can be extracted from images. A computer vision-based approach is therefore regarded as the most economical solution.

Previously, the scale-invariant feature transform (SIFT), along with the colour and shape of fruits [21], has been applied to fruit recognition. K-nearest neighbours (KNN) and support vector machine (SVM) were employed for the classification. Despite attaining high accuracy, this approach takes input images of size \(90\times 90\), which is low, so information might be lost. A low-resolution image implies that an individual pixel may contribute significantly to the final result, which makes the prediction sensitive to noise. It is well known that KNN and SVM are vulnerable to the curse of dimensionality, where the growth of feature dimensions has a massive impact on performance; meanwhile, high-resolution images are likely to have rich visual features.

Given the advancement of deep learning, fruit grading algorithms should produce satisfactory accuracies in a timely manner [6, 7]. State-of-the-art computer vision work on automatic fruit/vegetable grading falls into several categories [8]: detection of fruit/vegetable diseases and defects caused by foreign biological invasion [10], fruit/vegetable classification for assorted horticultural products [11], estimation of fruit/vegetable nitrogen [12], real-time fruit/vegetable object tracking [13], etc. Most approaches to fruit grading are based on pattern classification. Pertaining to fruit quality grading, the focus is not only on freshness, but also on the overall visual changes. Despite the recent rise in popularity of deep learning, more than half of the work [14] did not use deep learning methods.

Fruit recognition using computer vision and deep learning is an interesting research field [14]. Golden Delicious apples have been graded using SVM + KNN [19]. However, that project had just two classes, healthy and defective, and only one class of fruit was taken into consideration. Another project was developed for tomato grading [20] using the texture, colour, and shape of digital images; it likewise proposed a binary classification of the recognised fruits.

Deep learning, which is extremely powerful in image recognition and classification [23], was found useful in identifying conditions of citrus leaves [22]. YOLO [24] was adopted for fruit and vegetable recognition. YOLO is faster than other approaches, achieving 40 fps (i.e., frames per second) on videos, which makes it applicable to real-time applications. However, that work is constrained to fruits that remain connected to their biological hosts, e.g., hanging on branches. It does not take into account the scenarios where fruits have been taken off the trees and are in the ongoing process of decaying. VGG was used for fruit recognition [25], and the result shows that a convolutional neural network, when going deep, can achieve high accuracy. In contrast, a shallow convolutional neural network [26] consisting of four convolution and pooling layers, followed by two fully connected layers, was suggested. However, the source images in that experiment are simple, and all fruits are placed ideally at a static position against a pure white background.

An automatic grading system was developed for olives using discrete wavelet transform and statistical texture [27]. Another work [28] successfully addressed raspberry recognition using deep learning, namely, a nine-layer neural network consisting of three convolutional and pooling layers, plus one input layer, one dense layer, and one output layer.

Mandarin decay is affected by a disease caused by the fungus Penicillium digitatum; there is a contribution [29] dedicated to early detection of this disease by examining the visual features of decay. The visual features are captured and processed by a combination of decision trees. However, these experiments were conducted on only one class of fruit. Another problem is that the grading mechanism is a classification model which treats the fruit as being either fresh or rotten/defective. We believe that, because the decay process occurs gradually, the final predictive layer should regress the output rather than perform a classification task.

Motivated by the related work, a novel fruit freshness grading method based on deep learning is proposed in this article. We create a dataset for fruit freshness grading. The dataset comprises selected frames from recorded videos and covers six classes of fruits. After data collection, the images are resized and labelled (regions of interest, object classes, and freshness grades). Four typical augmentations are used: adjusted contrast, sharpness, rotation, and added noise. Our experiments start from statistics in which the visual appearance of the observed objects is discussed, followed by the implementation of a hierarchical deep learning model: YOLO + Regression-CNN for fruit freshness grading. The experiments take into account four base networks: VGG [15], AlexNet [16], ResNet [17], and GoogLeNet [18]. The main contributions of this paper are:

  • We propose a new approach to grading fruit freshness. Fruit freshness is generally tackled through pattern classification. In this article, a regression CNN is applied to fruit freshness grading. We detect and classify a given fruit as a visual object for freshness grading.

  • We inject noise into fruit images, so that the developed model is capable of resisting noise introduced in real applications.

The work related to fruit freshness grading has been reviewed above. The remainder of this article is organised as follows: our dataset is described in Sect. Data Collection, our method is detailed in Sect. Methods, the experimental results are presented in Sect. Experimental Results, and the conclusion of this paper is drawn in Sect. Conclusion.

Data Collection

Different from the existing work, in this article, we propose deep learning algorithms for fruit freshness grading. Deep learning is promising for the freshness grading of multiclass fruits, which will significantly reduce human labour. In this section, we provide a detailed description of how we collected the visual data and conducted data augmentation before fruit freshness classification. Given the novelty of this research project, suitable fruit data are not available at present, and thus, we have to collect the data ourselves. We illustrate the process by which we obtained the fruit data and provide empirical evidence on how the dataset accurately represents fruit freshness.

Datasets

The collected dataset consists of six classes of fruits: apple, banana, dragon fruit, orange, pear, and Kiwi fruit. The fruits appear at a wide variety of locations in the images, with various noises, irrelevant adjacent objects, and lighting conditions. We first analyse the relationship between fruit appearance and freshness. Fresh apple peel is low in chlorophyll and carotenoid concentrations [30], and spoilage leads to a gradual degradation of the constituent pigments, which reflect different wavelengths in spectrophotometry. The bright yellow colour of a ripe banana is likely a result of carotenoid accumulation [31]. Excluding water, which represents 60–90% of the weight, the main components of orange peel and flesh are pectin, cellulose, and hemicellulose [32, 33]; the pigments are mostly carotenoids and flavonoids, which generate the red appearance of oranges. The exotic exterior look of dragon fruit is due to red-violet betacyanins and yellow betaxanthins [34]. The green colour of Kiwi fruits is a visual manifestation of chlorophylls; their degradation gives rise to the formation of pheophytins and pyropheophytins, which render an olive-brown colour to the fruit [35]. The green/yellow peel of a pear is a result of congregated chlorophylls; once degradation occurs, the chlorophylls degenerate, and blue-black pheophytins and pyropheophytins are produced [35].

In total, we have collected approximately 4,000 images, with around 700 for each class of fruits. We split the dataset into training and validation sets at the ratio of \(9:1\) (90% for training and 10% for validation). The freshness is graded from 0 to 10.0, with 0 indicating totally rotten (i.e., the fruit colour and smell are stable and will not worsen any further, e.g., the fruit is not edible and should be thrown away) and 10.0 indicating complete freshness, as shown in Table 1. In this article, we define the particular moment when the fruit is harvested as absolute freshness, with the grade 10.0. However, the definition of total corruption lacks a definitive degree. From the fruit decay experiments, we see that a definitive fruit freshness grade is not available, and decayed fruits may carry fungus and produce toxins. We treat whether the fruit is edible as the primary condition of being recognised.

Table 1 The means and standard deviations for fruit freshness grading

In this project, we invited ten people to participate in the labelling work. We first sampled a few images (i.e., three images for each class of fruits at different decay stages) and asked the participants to give their grades. We calculated the mean and standard deviation of the proposed freshness grades. For fruit images with significant grade gaps, e.g., a standard deviation greater than or equal to three, we invited the participants for a second round of grading to narrow down the disagreement. We kept the labels unchanged if the grades proposed by the participants were close to what we had initially labelled, and we adjusted the labels according to the participants’ recommendations if the initially proposed freshness grade was far from the mean. It is assumed that, for each sampled image, there is a set of images in which the fruits have a similar freshness grade. We grouped the similar images, and if the freshness level of a sampled image needed adjusting, the associated images were adjusted accordingly. Table 1 shows 18 fruit images. Our dataset is available at: https://www.kaggle.com/datasets/dcsyanwq/fuit-freshness.
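The agreement check described above can be sketched in a few lines. This is a minimal illustration rather than the labelling tool actually used: the function name `summarise_grades`, the constant `DISAGREEMENT_STD`, the choice of the population standard deviation, and the two lists of grades are all assumptions introduced for the example.

```python
from statistics import mean, pstdev

# Threshold from the text: a standard deviation of three or more
# sends the image to a second grading round.
DISAGREEMENT_STD = 3.0


def summarise_grades(grades):
    """Return (mean, standard deviation, needs_second_round) for one image."""
    mu = mean(grades)
    sigma = pstdev(grades)  # assumption: population standard deviation
    return mu, sigma, sigma >= DISAGREEMENT_STD


# Illustrative grades from ten participants; these numbers are made up.
agreed = [8.0, 8.5, 9.0, 8.0, 8.5, 9.0, 8.5, 8.0, 9.0, 8.5]
disputed = [2.0, 9.0, 3.0, 8.0, 1.0, 9.5, 2.5, 8.5, 1.5, 9.0]

for name, grades in (("agreed", agreed), ("disputed", disputed)):
    mu, sigma, again = summarise_grades(grades)
    print(f"{name}: mean={mu:.2f}, std={sigma:.2f}, second round={again}")
```

Only the second list exceeds the threshold, so only that image would be returned to the participants.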

Image Quality Enhancement

Many of the source images have low quality, e.g., they are blurred or poorly exposed. Thus, in this article, image enhancement approaches are applied to ensure the quality of the images.

Given a three colour-channel image \(I\left( {x,y,z} \right)\) with pixel \(v\left( {x,y,z} \right)\), there exists a contrast factor \({f}_{\mathrm{contrast}}\) which renders every pixel equal to the average pixel intensity of the whole image when \({f}_{\mathrm{contrast}}=0\), and keeps the intensity unchanged if \({f}_{\mathrm{contrast}}=1\). The intensity variation increases as \({f}_{\mathrm{contrast}}\) rises. The relationship between \({f}_{\mathrm{contrast}}\) and the input/output pixel is described as

$$v_{{x_{{{\text{new}}}} ,y_{{{\text{new}}}} ,z_{{{\text{new}}}} }} = f_{{{\text{contrast}}}} v_{x,y,z} .$$
(1)

We denote \(v_{\min i}\) as the minimum value and \(v_{\max i}\) as the maximum value in the input image, and \(v_{\min o}\) and \(v_{\max o}\) as the minimum and maximum intensity in the output image, respectively. Thus, we have

$$v_{{x_{{{\text{new}}}} ,y_{{{\text{new}}}} ,z_{{{\text{new}}}} }} = \left( {v_{x,y,z} - v_{\min i} } \right) \times \left( {\frac{{v_{\max o} - v_{\min o} }}{{v_{\max i} - v_{\min i} }}} \right) + v_{\min o} .$$
(2)

Here, \({f}_{\mathrm{contrast}}=1.2\) is chosen by human judgement of the degree to which the contrast-adjusted images retain the necessary visual features while being enhanced enough to render fine details that ease neural network training.
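As a minimal sketch of the two operations, the code below implements the verbal description of the contrast factor (a blend between the mean-intensity image at 0 and the source at 1) and the min-max stretch of Eq. (2). The blend form, the function names `adjust_contrast` and `stretch`, the 8-bit clipping range, and the random demo image are assumptions made for illustration and are not stated in the text.

```python
import numpy as np


def adjust_contrast(image, f_contrast):
    """Blend between the mean-intensity image (f=0) and the source (f=1).

    Assumption: this interpolation form follows the verbal description of
    the contrast factor; values above 1 widen the spread around the mean.
    """
    img = image.astype(np.float64)
    mean = img.mean()
    out = mean + f_contrast * (img - mean)
    return np.clip(out, 0.0, 255.0)  # assumption: 8-bit intensity range


def stretch(image, v_min_o=0.0, v_max_o=255.0):
    """Min-max stretch following Eq. (2)."""
    img = image.astype(np.float64)
    v_min_i, v_max_i = img.min(), img.max()
    if v_max_i == v_min_i:  # flat image: avoid division by zero
        return np.full_like(img, v_min_o)
    return (img - v_min_i) * (v_max_o - v_min_o) / (v_max_i - v_min_i) + v_min_o


rng = np.random.default_rng(0)
demo = rng.integers(50, 200, size=(4, 4, 3))  # synthetic stand-in for a photo
enhanced = adjust_contrast(demo, 1.2)
print(demo.std(), enhanced.std())
```

With a factor above 1 the spread of intensities around the mean grows, which is the effect the value 1.2 is meant to produce.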

We evaluated the quality of the contrast-adjusted images subjectively. In this experiment, we found many photos that are blurry. The blur is reduced through image sharpening, after which fine details are more evident than in the image before sharpening; sharpened images are believed to render better visual results. Interpolation and extrapolation are utilized in the sharpening [36]. We first define a filter for image smoothing

$${\mathrm{kernel}}_{\mathrm{smooth}}=\frac{1}{13}\left( \begin{array}{ccc}1& 1& 1\\ 1& 5& 1\\ 1& 1& 1\end{array} \right).$$
(3)

Pertaining to any source image \({I}_{\mathrm{source}}\), the convolution result \({I}_{\mathrm{smooth}}\) is expressed as

$${I}_{\mathrm{smooth}}={I}_{\mathrm{source}}*{\mathrm{kernel}}_{\mathrm{smooth}}.$$
(4)

Table 2 The data augmentation with contrast and sharpening

Here, \(*\) denotes the convolution operator. Similar to the contrast process, we define a sharpness factor \({f}_{\mathrm{sharpen}}\), and the derived image \({I}_{\mathrm{blend}}\) is obtained as [37] (Table 2)

$${I}_{\mathrm{blend}}=\left(1.0-{f}_{\mathrm{sharpen}}\right){I}_{\mathrm{smooth}}+{f}_{\mathrm{sharpen}}{I}_{\mathrm{source}},$$
(5)

where \({f}_{\mathrm{sharpen}}\) controls the blended result \({I}_{\mathrm{blend}}\) relative to the source image \({I}_{\mathrm{source}}\). In other words, \({f}_{\mathrm{sharpen}}=0\) renders an image completely blurred by \({\mathrm{kernel}}_{\mathrm{smooth}}\), while \({f}_{\mathrm{sharpen}}=1.0\) keeps the image unaltered. Interpolation with \({f}_{\mathrm{sharpen}}\in (0, 1)\) partially blurs the image \({I}_{\mathrm{source}}\), whereas extrapolation with \({f}_{\mathrm{sharpen}}\in (1.0, +\infty )\) inverts the smoothing into sharpening. Since decreasing \({f}_{\mathrm{sharpen}}\) within \((0, 1)\) renders increasingly blurry effects, linear extrapolation with \({f}_{\mathrm{sharpen}}\in (-\infty , 0)\) blurs the image several times more strongly than a single application of \({\mathrm{kernel}}_{\mathrm{smooth}}\).
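The smoothing and blending steps of Eqs. (3) to (5) can be sketched as follows for a single colour channel. The names `smooth` and `blend`, the edge-replicated border handling, and the absence of clipping are assumptions of this illustration, not details given in the text.

```python
import numpy as np

# Smoothing kernel of Eq. (3); its entries sum to one.
KERNEL_SMOOTH = np.array([[1, 1, 1],
                          [1, 5, 1],
                          [1, 1, 1]], dtype=np.float64) / 13.0


def smooth(channel):
    """Convolve one 2D channel with KERNEL_SMOOTH, as in Eq. (4).

    The kernel is symmetric, so correlation and convolution coincide.
    Assumption: borders are handled by replicating the edge pixels.
    """
    padded = np.pad(channel.astype(np.float64), 1, mode="edge")
    h, w = channel.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += KERNEL_SMOOTH[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def blend(channel, f_sharpen):
    """Eq. (5): interpolate (f < 1) or extrapolate (f > 1) from the blur."""
    src = channel.astype(np.float64)
    return (1.0 - f_sharpen) * smooth(src) + f_sharpen * src


# A small synthetic channel with one bright pixel in the centre.
channel = np.zeros((5, 5))
channel[2, 2] = 100.0
print(blend(channel, 0.0)[2, 2], blend(channel, 1.0)[2, 2], blend(channel, 2.0)[2, 2])
```

The centre value grows as the factor moves from 0 to 2, showing how extrapolation beyond 1 exaggerates local detail; in practice the output would still need clipping to the valid intensity range.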

Image Augmentation

Image augmentation is a methodology that transforms source images into ones with additional information, including scaling, rotating, cropping, and adding random noise. We experimented with a rich assortment of augmentation methods, as shown in Table 3. Based on our observations, we decided to adopt the following augmentation approaches: rotating and adding random noise. All images are rotated by an angle of 120°. We denote an image \(I\) as a 2D matrix with coordinates \((x, y)\) and pixel value \(v\)

$$I\left( {x,y} \right) = v_{x,y} .$$
(6)

Table 3 The examples of image augmentations

We denote a rotation matrix as \(R\), and thus, we have

$$R = \left[ {\begin{array}{*{20}c} {\cos \theta } & { - \sin \theta } \\ {\sin \theta } & {\cos \theta } \\ \end{array} } \right].$$
(7)

For any \(\theta\) degree rotation, we have

$$\left[ {x_{{{\text{new}}}} ,y_{{{\text{new}}}} } \right] = \left[ {x,y} \right]\left[ {\begin{array}{*{20}c} {\cos \theta } & { - \sin \theta } \\ {\sin \theta } & {\cos \theta } \\ \end{array} } \right],$$
(8)

and

$$v_{{x_{{{\text{new}}}} ,y_{{{\text{new}}}} }} = v_{x,y}$$
(9)

for each new location \(({x}_{\mathrm{new}},{y}_{\mathrm{new}})\), which keeps the same pixel intensity. The new image \({I}_{\mathrm{new}}\) is

$${I}_{\mathrm{new}}\left(x, y\right)=I\left[{x}_{\mathrm{new}},{y}_{\mathrm{new}}\right].$$
(10)
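The coordinate mapping of Eqs. (7) and (8) can be checked numerically with a short sketch. The helper names `rotation_matrix` and `rotate_points` are hypothetical, and the sketch only transforms coordinates; rotating a real image would additionally require resampling onto the pixel grid, which is not covered here.

```python
import math

import numpy as np


def rotation_matrix(theta_deg):
    """Rotation matrix R of Eq. (7) for an angle given in degrees."""
    t = math.radians(theta_deg)
    return np.array([[math.cos(t), -math.sin(t)],
                     [math.sin(t), math.cos(t)]])


def rotate_points(points, theta_deg):
    """Apply Eq. (8): row vectors [x, y] multiplied on the right by R."""
    return np.asarray(points, dtype=np.float64) @ rotation_matrix(theta_deg)


corners = [[1.0, 0.0], [0.0, 1.0]]
print(rotate_points(corners, 120.0))
```

Because the row vector is multiplied on the right, the point (1, 0) maps to (cos θ, −sin θ); applying the 120° rotation three times returns every point to its starting position.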

The source images are matrices with three colour channels. For an RGB-encoded image \({I}_{\mathrm{rgb}}\) with \(z=3\), the rotation matrix is applied to all three channels. All images are supplemented with random noise consisting of arbitrary changes of brightness, contrast, and saturation, and erosion of ten image regions. The random noise is added in the following order: random brightness adjustment, random contrast adjustment, and random erosion filtering for the ten image regions. Given an image \(I(x,y,z)\) and each pixel \(v(x, y, z)\), we define a brightness factor \({f}_{\mathrm{brightness}}({v}_{\mathrm{add}})\), where

$${f}_{\mathrm{brightness}}\left({v}_{\mathrm{add}}\right)=\frac{v\left(x,y,z\right)+{v}_{\mathrm{add}}}{v\left(x,y,z\right)},$$
(11)

so that the level of brightness adjustment is proportional to the pixel value; we have \({f}_{\mathrm{brightness}}\left({v}_{\mathrm{add}}\right)\in [0.9, 1.5]\). Thus

$${v}_{{x}_{\mathrm{new}},{y}_{\mathrm{new}},{z}_{\mathrm{new}}}={f}_{\mathrm{contrast}}{v}_{x,y,z},$$
(12)

where \({f}_{\mathrm{contrast}}\) denotes the level of contrast of an image. In this article, we set \({f}_{\mathrm{contrast}}\in [0.9, 1.5]\) randomly.

Randomly removing image regions [37] is an image augmentation that addresses generalisation issues. Assume an input image \(I\) with width \({w}_{I}\) and height \({h}_{I}\). We define two integers \({x}_{\mathrm{start}}\in [0, {w}_{I}]\) and \({y}_{\mathrm{start}}\in [0, {h}_{I}]\) as the starting point \(({x}_{\mathrm{start}},{y}_{\mathrm{start}})\), and we define the width and height of a removal region through the proportion \({r}_{b}\). In this article, we set \({r}_{b}=0.15\). The two points, namely, bottom left \(({x}_{\mathrm{start}}, {y}_{\mathrm{start}})\) and top right \(({x}_{\mathrm{end}}, {y}_{\mathrm{end}})\) of the removed region, are defined as

$$\left({x}_{\mathrm{start}}, {y}_{\mathrm{start}}\right)=\left(\mathrm{random}\left(0, {w}_{I}\right), \mathrm{random}\left(0, {h}_{I}\right)\right)$$
(13)
$$\left({x}_{\mathrm{end}}, {y}_{\mathrm{end}}\right)=\left({x}_{\mathrm{start}}+{r}_{b} {w}_{I}, {y}_{\mathrm{start}}+{r}_{b} {h}_{I}\right).$$
(14)

The random selection process repeats ten times, and the results are shown in Table 3. In summary, the dataset contains six classes of fruits with various decay grades, and data augmentation is applied extensively. For each image, there are four variants: contrast-adjusted, sharpened, rotated, and with random noise. There are two kinds of labels for fruit objects: the fruit freshness grade and the location of the fruit in a given image, in the VOC format [38].
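The region removal of Eqs. (13) and (14) can be sketched as below. The function name `erase_regions`, filling the removed region with zeros, and clipping the region at the image border are assumptions made for this illustration; the text does not specify the fill value or the border behaviour.

```python
import numpy as np


def erase_regions(image, r_b=0.15, repeats=10, rng=None):
    """Remove `repeats` random rectangles whose sides are r_b of the image."""
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    h, w = out.shape[:2]
    box_w, box_h = int(r_b * w), int(r_b * h)
    for _ in range(repeats):
        # Eq. (13): starting point drawn uniformly from [0, w] x [0, h].
        x_start = int(rng.integers(0, w + 1))
        y_start = int(rng.integers(0, h + 1))
        # Eq. (14): opposite corner, clipped so it stays inside the image.
        x_end = min(x_start + box_w, w)
        y_end = min(y_start + box_h, h)
        out[y_start:y_end, x_start:x_end] = 0  # assumption: fill with zeros
    return out


image = np.ones((100, 200, 3), dtype=np.uint8)
erased = erase_regions(image, rng=np.random.default_rng(1))
print(int((erased[..., 0] == 0).sum()), "pixels removed")
```

Since the starting point may fall near the right or lower border, some removed regions are smaller than the nominal 15% of each side, and overlapping regions are counted only once.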

Methods

In this section, a YOLO neural network for fruit classification is considered as the first stage of a hierarchical deep learning model, whose results are fed into a regression CNN for fruit grading. For comparison with the deep learning method, we first treat a linear model based on the texture and colour of the images; the relevant analysis explains why a deep learning approach should be implemented.

A Linear Proposal

Simple ambient noise refers to an image background with few distractions, usually a plain black or white colour. In such an environment, fruit localisation and freshness grading become easy, as simple pixel-based manipulation can render satisfactory results. The primary advantage of this approach is fast computation for fruit grading.

In this project, we propose a simple solution that locates a fruit in a digital image and automatically grades its freshness based on the texture appearance of the fruit itself. Since most fruits have distinct appearances when the background has a plain or pure colour, a simple threshold can be applied to segment a fruit object from an image. Image regions within the thresholds are selected, while others are masked. The contour of the selected image regions is traced to determine the bounding boxes for object detection.

Denote an image \(I\) comprising pixels \({v}_{x, y, z}\), where \(x\in [1, \mathrm{width}]\), \(y\in [1, \mathrm{height}]\), and \(z\) is the value in an (r, g, b) colour channel; for example, an 8-bit RGB image has \(z\in [1, 256]\). We have a binary mask

$${\text{mask}}\left( {x,y,z} \right) = \left\{ {\begin{array}{*{20}c} {1,} & {z < {\text{threshold}}} \\ {0,} & {{\text{otherwise}}} \\ \end{array} } \right.,$$
(15)

where the \(\mathrm{threshold}\) is the pixel intensity of a fruit image. Pertaining to apples, the most observed colours are beige and crimson with RGB colours \((166, 123, 91)\) and \(\left(220, 20, 60\right),\) respectively. Thus, the colour thresholds are defined as

$${\text{threshold}}_{r} = \left[ {166 \pm 20,220 \pm 20} \right]$$
(16)
$${\text{threshold}}_{g} = \left[ {20 \pm 20,123 \pm 20} \right]$$
(17)
$${\text{threshold}}_{b} = \left[ {60 \pm 20,91 \pm 20} \right].$$
(18)
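A minimal sketch of the colour thresholding and bounding-box step is given below. It assumes that a pixel is selected only when all three channels lie within ±20 of the same reference colour, which is one possible reading of Eqs. (15) to (18); the names `colour_mask` and `bounding_box` and the synthetic test image are likewise introduced only for illustration.

```python
import numpy as np

# Reference apple colours from the text: beige and crimson (RGB).
APPLE_COLOURS = [(166, 123, 91), (220, 20, 60)]
TOLERANCE = 20


def colour_mask(image):
    """Boolean mask of pixels within +/-TOLERANCE of a reference colour."""
    img = image.astype(np.int32)
    mask = np.zeros(img.shape[:2], dtype=bool)
    for colour in APPLE_COLOURS:
        diff = np.abs(img - np.array(colour))
        # Assumption: all three channels must match the same reference colour.
        mask |= np.all(diff <= TOLERANCE, axis=-1)
    return mask


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the masked pixels, or None."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Synthetic image: white background with a crimson patch.
image = np.full((10, 10, 3), 255, dtype=np.uint8)
image[2:5, 3:7] = (220, 20, 60)
print(bounding_box(colour_mask(image)))
```

On a plain background the box tightly encloses the coloured patch; any background object whose colour falls inside the thresholds would enlarge the box, which is the weakness discussed for this linear proposal.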

For freshness grading, we treat the brightness and the pixel intensity within a bounding box as the two conditions. Generally, for a rotten fruit, the number of brown/dark spots grows. This change in appearance results in an increase of pixel intensity and a decrease of brightness. The entropy of a given image \(I\) with histogram \({h}_{i}\) is

$${\text{entropy}}\left( I \right) = - \sum_{i} \left( {h_{i} \times \log \left( {h_{i} } \right)} \right).$$
(19)

For a given image \(I\) with pixels \({p}_{i}\), where \(i=1, 2, \dots , n\) and \(n\) represents the number of pixels that comprise the image, we have

$$\mathrm{brightness}\left(I\right)=\frac{1}{n}{\sum }_{i}\left({p}_{i}\right).$$
(20)

The freshness is calculated using Eq. (21)

$$\mathrm{freshness}={k}_{e} \mathrm{entropy}\left(I\right)+{k}_{b} \mathrm{brightness}\left(I\right)+b,$$
(21)

where \({k}_{e}\) and \({k}_{b}\) are weight adjustment parameters, and \(b\) is the bias. These parameters are determined via linear regression. Assume a regression output \({y}_{i}\) and a data sample \({x}_{i}\), where \({x}_{i}\) consists of \(n\) dimensions; thus

$${\widehat{y}}_{i}={\beta }_{0}+{\beta }_{1}{ x}_{i,1}+{\beta }_{2}{ x}_{i,2}+\dots +{\beta }_{n}{ x}_{i,n}.$$
(22)

The loss function for linear regression is

$${\mathrm{Loss}=\sum_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}.$$
(23)

We minimise the loss

$$\hat{\vec{\beta }} = \arg \min_{\vec{\beta }} {\text{Loss}}\left( {X,\vec{\beta }} \right)$$
(24)

Therefore, we have \(\hat{\vec{\beta }} = \left\{ {b, - k_{e} ,k_{b} } \right\}\). We selected a few fruit images with various decay levels and calculated the entropy as well as the brightness within the detected bounding box, from which \({k}_{e}\) and \({k}_{b}\) are determined. We observed the entropy and brightness, and adjusted \({k}_{e}\) and \({k}_{b}\) so that the weighted sum of the entropy and brightness is close to the corresponding freshness level.
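The features of Eqs. (19) and (20) and the least-squares fit of Eq. (21) can be sketched as follows. The normalised 256-bin histogram, the natural logarithm, the greyscale input, and the function names are assumptions of this illustration; the numbers in the usage lines are synthetic and serve only to show that the fit recovers known coefficients.

```python
import numpy as np


def entropy(gray, bins=256):
    """Eq. (19) on a normalised intensity histogram (assumption: natural log)."""
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing to the sum
    return float(-(p * np.log(p)).sum())


def brightness(gray):
    """Eq. (20): mean pixel value."""
    return float(np.mean(gray))


def fit_freshness(features, grades):
    """Least-squares fit of grade = k_e * entropy + k_b * brightness + b."""
    X = np.column_stack([features, np.ones(len(features))])
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(grades, dtype=float), rcond=None)
    k_e, k_b, b = coeffs
    return float(k_e), float(k_b), float(b)


# Synthetic (entropy, brightness) pairs with grades from a known linear rule.
features = [[1.0, 100.0], [2.0, 150.0], [3.0, 120.0], [0.5, 200.0]]
grades = [-2.0 * e + 0.01 * br + 9.0 for e, br in features]
print(fit_freshness(features, grades))
```

The fit minimises the squared loss of Eq. (23) in closed form, so no manual adjustment of the weights is needed once labelled samples are available.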

A Hierarchical Deep Learning Model

In this section, we propose a hierarchical deep learning model, YOLO + Regression-CNN, in which the bounding boxes predicted by YOLO are fed to a regression CNN for freshness grading. YOLO is used to classify the object/fruit as well as to estimate the bounding box, which locates the visual object in an image. A regression CNN is trained for each class of fruits; since we work with six classes of fruits, six regression CNNs are trained. The regression CNN corresponding to the predicted class is then applied for freshness grading. The framework is illustrated in Fig. 1, the regression model is shown in Fig. 2, and the pipeline of this model is shown in Fig. 3.

Fig. 1 YOLO + regression CNN model

Fig. 2 The customized regression model (VGG)

Fig. 3 The pipeline of the proposed hierarchical deep learning model

The source images are fed into YOLO for object recognition, where the central point, width, and height of the bounding box are determined. Based on the YOLO prediction, the model maps the predicted class of the detected fruit to the corresponding regression neural network. The detected object is cropped out of its background and used as the input image to the regression CNN.

The YOLO model in this article has the same structure as YOLOv3 [39]. We thus define a set of input data \(D\), in which we have

$$D=\left\{{I}_{1},{I}_{2}, \dots , {I}_{n}\right\},$$
(25)

where \(n\) is the total number of input images, and \({I}_{i}\) is the \(i\) th image, \(i=1, 2, 3, \dots ,n\). Our input images are encoded using RGB channels; thus, each image \({I}_{i}\) is three-dimensional, and all images have the same size. For each colour channel, the image \({I}_{i}\) is defined as a 2D matrix

$${I}_{i}=\left[\begin{array}{cccc}{p}_{1,1} & {p}_{1,2} & \cdots & {p}_{1,w} \\ \vdots & \vdots & \ddots & \vdots \\ {p}_{h,1} & {p}_{h,2} & \cdots & {p}_{h,w} \end{array}\right].$$
(26)

To prevent overfitting, additional random flips are considered. After YOLOv3 takes the source data and starts the computation [39], at the time \(t\) we have a prediction \(\widehat{Y}=\{\widehat{{y}_{1}},\widehat{{y}_{2}}, \dots , \widehat{{y}_{n}}\}\)

$$\widehat{Y}\left(t\right)=P\left({\xi }_{YOLO}\left(t\right)|D\right).$$
(27)

For the YOLO model \({\xi }_{YOLO}\), each prediction consists of a bounding box and the associated object class

$$\widehat{{y}_{i}}=\left\{\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}},\widehat{{c}_{i}}\right\}.$$
(28)

According to the predicted class \(\widehat{{c}_{i}}\) and the anchored box (\(\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}}\)), the source image \({I}_{i}\) is cropped. The new image is

$$\mathrm{new}{I}_{i}=\mathrm{crop}\left({I}_{i},\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}}\right),$$
(29)

where \(\widehat{{x}_{i}}\) and \(\widehat{{y}_{i}}\) are the coordinates of the central point of the predicted bounding box. The operation \(\mathrm{crop}\left({I}_{i},\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}}\right)\) for the \(i\) th image \({I}_{i}\) is expressed as

$$\mathrm{crop}\left({I}_{i},\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}}\right)= \left[\begin{array}{ccc}{p}_{\widehat{{x}_{i}}-\frac{\widehat{{w}_{i}}}{2},\widehat{{y}_{i}}-\frac{\widehat{{h}_{i}}}{2}} & \cdots & {p}_{\widehat{{x}_{i}}+\frac{\widehat{{w}_{i}}}{2},\widehat{{y}_{i}}-\frac{\widehat{{h}_{i}}}{2}} \\ \vdots & \ddots & \vdots \\ {p}_{\widehat{{x}_{i}}-\frac{\widehat{{w}_{i}}}{2},\widehat{{y}_{i}}+\frac{\widehat{{h}_{i}}}{2}}& \cdots & {p}_{\widehat{{x}_{i}}+\frac{\widehat{{w}_{i}}}{2},\widehat{{y}_{i}}+\frac{\widehat{{h}_{i}}}{2}}\end{array}\right].$$
(30)
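The cropping operation of Eq. (30) can be sketched with array slicing. This is only an illustration: the function name `crop`, the rounding of box edges, the clipping of the box to the image border, and the half-open slice (the far edge is excluded, unlike the inclusive indices in Eq. (30)) are assumptions introduced here.

```python
import numpy as np


def crop(image, x_c, y_c, w, h):
    """Crop a centre-format box (x_c, y_c, w, h) from an H x W x C image."""
    img_h, img_w = image.shape[:2]
    # Convert centre and size into corner coordinates, clipped to the image.
    x0 = max(int(round(x_c - w / 2)), 0)
    y0 = max(int(round(y_c - h / 2)), 0)
    x1 = min(int(round(x_c + w / 2)), img_w)
    y1 = min(int(round(y_c + h / 2)), img_h)
    return image[y0:y1, x0:x1]  # rows index y, columns index x


image = np.zeros((100, 200, 3), dtype=np.uint8)
print(crop(image, 50, 40, 20, 10).shape)
print(crop(image, 5, 5, 20, 20).shape)
```

The second call shows a box that extends past the top-left corner; clipping keeps only the part inside the image, which is a common guard when a detector predicts boxes near the border.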

The cropped image \(\mathrm{new}{I}_{i}\) is fed into a regression neural network. We define the regression neural network \({\xi }_{regr}\left(t\right)\) at the training epoch \(t\) and the cropped image dataset \(\mathrm{new}D=\{\mathrm{new}{I}_{1}, \mathrm{new}{I}_{2},\dots ,\mathrm{new}{I}_{m}\}\), so that

$$\widehat{R}=P\left({\xi }_{regr},\widehat{{c}_{i}}|\mathrm{new}D\right),$$
(31)

where \(\widehat{R}\) is the result of fruit freshness regression, and we have

$$\widehat{R}=\left\{{\mathrm{r}}_{1}, {\mathrm{r}}_{2}, \dots , {\mathrm{r}}_{\mathrm{n}}\right\}.$$
(32)

Hence, the hierarchical model is expressed as

$$\widehat{Y}\left(t\right)=P\left({\xi }_{YOLO}\left(t\right),{\xi }_{regr}\left(t\right)|D\right).$$
(33)

For each prediction \(\widehat{{y}_{i}}\), we have

$$\widehat{{y}_{i}}=\left\{\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}},\widehat{{c}_{i}},\widehat{{r}_{i}}\right\}.$$
(34)
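The flow from detection to class-specific regression, ending in the combined prediction of Eq. (34), can be sketched with stand-in callables. Everything named here (`Detection`, `grade_image`, `fake_detector`, and the constant-valued regressors) is hypothetical; a real system would wrap the trained YOLO model and the six trained regression CNNs behind the same interfaces.

```python
from typing import NamedTuple

import numpy as np


class Detection(NamedTuple):
    """One detector output in the form of Eq. (28): centre, size, class."""
    x: float
    y: float
    w: float
    h: float
    cls: str


def grade_image(image, detector, regressors):
    """Detect fruits, crop each box, and grade it with its class regressor."""
    graded = []
    img_h, img_w = image.shape[:2]
    for det in detector(image):
        x0 = max(int(det.x - det.w / 2), 0)
        y0 = max(int(det.y - det.h / 2), 0)
        x1 = min(int(det.x + det.w / 2), img_w)
        y1 = min(int(det.y + det.h / 2), img_h)
        patch = image[y0:y1, x0:x1]
        # The predicted class selects which regression model is applied.
        grade = float(regressors[det.cls](patch))
        graded.append((det.x, det.y, det.w, det.h, det.cls, grade))  # Eq. (34)
    return graded


# Stand-ins for trained models (hypothetical, for illustration only).
def fake_detector(image):
    return [Detection(50.0, 40.0, 20.0, 10.0, "apple")]


regressors = {"apple": lambda patch: 7.5, "banana": lambda patch: 9.0}
print(grade_image(np.zeros((100, 200, 3)), fake_detector, regressors))
```

Keeping one regressor per class in a dictionary mirrors the design choice in the text: the detector decides the class, and only the matching regression network sees the cropped patch.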

In this article, we experimented with four base networks, i.e., AlexNet, VGG, ResNet, and GoogLeNet, for regression on the six classes of fruits. Each class of fruits is likely to have unique features distinct from the others, which should be processed by its own regression convolutional neural network.

For each of the base networks AlexNet, VGG, ResNet, and GoogLeNet, we modified the number of neurons in the fully connected layers to fit our fruit freshness regression. We designed four additional layers for the fully connected neural network.

Experimental Results

In this article, our model is constructed hierarchically, consisting of a classification and localisation model (YOLO) and a set of regression CNNs, one for each fruit type. Besides, we make a comparison with the linear model.

The Linear Proposal

We calculate the average brightness and entropy for a video frame. For images with complicated background noise, the fruit location is very hard to find, and the brightness/entropy approach does not converge as expected. The defined freshness function is shown in Eq. (35)

$$\widehat{y}={k}_{1}J+{k}_{2}B+b,$$
(35)

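where \(J\) and \(B\) denote the entropy and brightness of the image, as in Eq. (21). In practice, the freshness does not have a linear relationship with the entropy and brightness of the image. The fitted parameters of the linear regressor are

  • \({k}_{1}\): −2.7701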

  • \({k}_{2}\): 0.00367

  • \(b\): 9.0004.
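As a worked illustration of Eq. (35), the sketch below evaluates the fitted regressor for a pair of feature values. The function names, the clipping to the 0 to 10.0 grading range, and the sample inputs are assumptions introduced for this example; the scales on which entropy and brightness were measured are not stated in the text.

```python
# Fitted parameters reported for Eq. (35).
K1, K2, B0 = -2.7701, 0.00367, 9.0004


def linear_freshness(entropy, brightness):
    """Evaluate Eq. (35) for one image."""
    return K1 * entropy + K2 * brightness + B0


def clipped_freshness(entropy, brightness):
    """Assumption: clamp the output to the 0-10.0 grading range."""
    return min(max(linear_freshness(entropy, brightness), 0.0), 10.0)


# Hypothetical feature values, chosen only to exercise the formula.
print(round(linear_freshness(1.0, 100.0), 4))
print(clipped_freshness(5.0, 128.0))
```

The negative weight on entropy dominates quickly: moderately textured inputs already push the raw output below zero, which is consistent with the observation that this regressor is sensitive to anything that adds texture, such as a cluttered background.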

However, this approach is sensitive to background noise; even a minor change of background might result in significant errors. During this experiment, we set up different physical backgrounds while acquiring images of the fruits, including varying the lighting conditions and placing adjacent foreign objects.

For fresh fruits such as apples and bananas, we do observe correlations between entropy/brightness levels and decay stages if the background is static. However, for other fruits such as Kiwi fruits and oranges, this assumption hardly holds.

This preliminary approach through entropy/brightness computations reveals the complexity of fruit freshness grading. Fruits have their own processes of decaying; for each decay characteristic, there is no apparent relationship between static visual features (i.e., a set of defined rules of pixel statistics) and freshness levels. Based on these findings, we decided to treat each class of fruit individually rather than to use a single comprehensive approach.

YOLO + GoogLeNet

With GoogLeNet, the regression accuracy of freshness grading varies across the fruit classes. Bananas are the most accurately predicted class, while Kiwi fruits are the most difficult one. Apple freshness grading appears to be the most unstable in the validation set. For example, decayed apples appear with fungus and brown/dark spots, whereas other fruits, e.g., dragon fruits, are usually covered by yellowish dark spots (Table 4).

Table 4 The metrics for evaluating the performance of GoogLeNet

YOLO + AlexNet

The performance of AlexNet on the six classes of fruits is similar to that of the other base networks in terms of which classes are prone to deviating from the ground truth. Apples, Kiwi fruits, and pears are the three most challenging classes to regress, while bananas are the most accurately graded. Fruits with relatively large errors also tend to be less stable, as measured by the standard deviation. This is evident in both the training and validation sets for all classes of fruits. The average MSE over the six classes of fruits is \(3.500\) for the training dataset and \(4.099\) for the validation dataset. In terms of regression stability, the model yields a standard deviation of \(1.480\) for the training set and \(1.248\) for the validation set (Table 5).
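
The form of these averages can be illustrated with a short sketch. It assumes that the MSE and the standard deviation are computed per fruit class and then averaged across classes, and that the standard deviation is taken over the per-sample errors; the text does not spell out either choice. The function names and the toy labels are hypothetical.

```python
import statistics
from typing import Dict, List, Tuple

Samples = Tuple[List[float], List[float]]  # (ground-truth labels, predictions)


def class_metrics(y_true: List[float], y_pred: List[float]) -> Tuple[float, float]:
    """Return (MSE, standard deviation of the errors) for one fruit class."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = statistics.fmean(e * e for e in errors)
    std = statistics.pstdev(errors)  # population standard deviation (assumption)
    return mse, std


def average_metrics(per_class: Dict[str, Samples]) -> Tuple[float, float]:
    """Average the per-class MSE and standard deviation across all classes."""
    results = [class_metrics(t, p) for t, p in per_class.values()]
    return (
        statistics.fmean(m for m, _ in results),
        statistics.fmean(s for _, s in results),
    )


# Made-up freshness labels and predictions, for illustration only.
toy = {
    "banana": ([8.0, 6.0, 4.0], [8.0, 6.0, 4.0]),  # perfect predictions
    "kiwi": ([8.0, 6.0, 4.0], [10.0, 4.0, 4.0]),   # errors of +2, -2, 0
}
avg_mse, avg_std = average_metrics(toy)
```

In this toy case the class with the larger errors also has the larger spread, which is the pattern described above for apples, Kiwi fruits, and pears.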

Table 5 The metrics for evaluating the performance of AlexNet

YOLO + ResNet

ResNet-152 is the top-performing as well as the deepest network in the ResNet family. Again, ResNet fails to deliver reliable results for three particular classes of fruits: apples, Kiwi fruits, and pears. The regression error stems largely from the Kiwi fruit dataset, in both the training and validation sets. For pears, there is a possibility of overfitting: the validation set shows \(6.057\), while the training set reports \(3.984\). Banana freshness grading is the most accurate. In terms of regression stability, pears are the least stable, while oranges are generally the most stable, judged on both the training and validation sets. On average, the MSE values of the training and validation sets for ResNet-152 are \(3.582\) and \(4.058\), respectively. For stability measurement, the standard deviation is \(1.329\) for the training set and \(1.842\) for the validation set (Table 6).

Table 6 The metrics for evaluating the performance of ResNet

YOLO + VGG

We tested the VGG-11 model. Again, bananas are graded most accurately in the freshness regression, while apples, Kiwi fruits, and pears are the most difficult classes. However, VGG-11 tends to suffer less from overfitting: the gaps between the training and validation results are small compared with what we observed for the other base networks. VGG-11 also displays high stability in regression; for apples, Kiwi fruits, and pears, both the training and validation sets show robust regression output, as indicated by the standard deviation. The average training and validation MSE values are close to those of the other three base networks, \(3.665\) and \(3.934\), respectively. The standard deviations are \(1.361\) and \(1.266\), respectively (Table 7).
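
The overfitting remark can be checked with simple arithmetic on the average MSE values quoted in the text. The sketch below covers only the three networks whose averages are stated in the prose (GoogLeNet is omitted for that reason); the variable names are hypothetical.

```python
# Average MSE values quoted in the text as (training, validation).
reported_mse = {
    "AlexNet": (3.500, 4.099),
    "ResNet-152": (3.582, 4.058),
    "VGG-11": (3.665, 3.934),
}

# Gap between validation and training error; a larger gap hints at overfitting.
gaps = {name: round(val - train, 3) for name, (train, val) in reported_mse.items()}
smallest_gap = min(gaps, key=gaps.get)
```

The gaps are 0.599 for AlexNet, 0.476 for ResNet-152, and 0.269 for VGG-11, so VGG-11 has the smallest gap among these three.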

Table 7 The metrics for measuring the performance of VGG-11

Comparisons

The four deep learning models perform similarly. On the training set, AlexNet achieves the lowest MSE, while VGG achieves the lowest error on the validation set. Table 8 summarises the overall regression performance of the proposed models, measured in MSE.

Table 8 A summary of performance of the proposed schemes in MSE

Conclusion

In this paper, we constructed a linear regression model to detect and measure fruit freshness by judging the darkness of the fruit skin and variations of colour transitions. Our observations affirm that fruit spoilage involves biochemical reactions that result in visual fading. Hence, we proposed a deep learning solution.

Deep learning has been used for fruit freshness grading across multiple classes of fruits (i.e., apple, banana, dragon fruit, Kiwi fruit, orange, and pear). We have developed a hierarchical approach in which the fruits are detected and classified with real-time object detection, the regions of interest are cropped from the source images and fed into CNN models for regression, and the freshness level is finally graded. We independently trained the convolutional neural networks for four well-known models, i.e., GoogLeNet, ResNet, AlexNet, and VGG-11. Our experimental results show the strong performance of deep learning algorithms in resolving this problem [9, 40, 41].