In this section, we consider a hierarchical deep learning model in which a YOLO network classifies the fruit and feeds its results into a regression CNN for fruit grading. Before the deep learning method, we first treat a linear model based on the texture and colour of the images; the relevant analysis explains why a deep learning approach should be implemented.
A Linear Proposal
A simple ambient background refers to an image background with few distractions, usually a plain black or white colour. In such an environment, fruit localisation and freshness grading become easy, as simple pixel-based manipulation can produce satisfactory results. The primary advantage of this approach is fast computation for fruit grading.
In this project, we propose a simple solution that locates a fruit in a digital image and automatically grades its freshness based on the texture of the fruit itself. Since most fruits have a distinct appearance when the background has a plain or pure colour, a simple threshold can be applied to segment the fruit from the image. Image regions within the thresholds are selected, while the others are masked. The contours of the selected regions are then traced to determine the bounding boxes for object detection.
Denote an image \(I\) comprising pixels \({v}_{x, y, z}\), where \(x\in [1, \mathrm{width}]\), \(y\in [1, \mathrm{height}]\), and \(z\) indexes the (r, g, b) colour channels; for an 8-bit RGB image, each pixel value satisfies \({v}_{x,y,z}\in [0, 255]\). We have a binary mask
$$\mathrm{mask}\left(x,y,z\right)=\left\{\begin{array}{ll} 1, & {v}_{x,y,z}\in {\mathrm{threshold}}_{z} \\ 0, & \mathrm{otherwise} \end{array}\right.,$$
(15)
where \({\mathrm{threshold}}_{z}\) is a range of pixel intensities characteristic of the fruit. Pertaining to apples, the most commonly observed colours are beige and crimson, with RGB values \((166, 123, 91)\) and \(\left(220, 20, 60\right),\) respectively. Thus, the colour thresholds are defined as
$${\text{threshold}}_{r} = \left[ {166 \pm 20,220 \pm 20} \right]$$
(16)
$${\text{threshold}}_{g} = \left[ {20 \pm 20,123 \pm 20} \right]$$
(17)
$${\text{threshold}}_{b} = \left[ {60 \pm 20,91 \pm 20} \right].$$
(18)
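As a minimal sketch of this segmentation step, the snippet below builds the binary mask of Eq. (15) with the colour ranges of Eqs. (16)–(18) and derives a bounding box from the contours of the masked regions. The function names are our own, and OpenCV's contour routines stand in for the contour tracing described above; this is an illustration of the idea, not the authors' exact implementation.

```python
import numpy as np
import cv2  # OpenCV is used here only for contour tracing and bounding boxes

def fruit_mask(image_rgb: np.ndarray) -> np.ndarray:
    """Binary mask of Eq. (15): 1 where every channel lies in one of the
    colour ranges of Eqs. (16)-(18), 0 otherwise."""
    r, g, b = image_rgb[..., 0], image_rgb[..., 1], image_rgb[..., 2]
    # Beige apples: RGB (166, 123, 91) +/- 20 per channel.
    beige = ((146 <= r) & (r <= 186) &
             (103 <= g) & (g <= 143) &
             (71 <= b) & (b <= 111))
    # Crimson apples: RGB (220, 20, 60) +/- 20 per channel.
    crimson = ((200 <= r) & (r <= 240) &
               (g <= 40) &
               (40 <= b) & (b <= 80))
    return (beige | crimson).astype(np.uint8)

def fruit_bounding_box(image_rgb: np.ndarray):
    """Trace the contours of the masked regions and return the largest
    bounding box as (x, y, w, h), or None if no fruit-coloured pixel exists."""
    contours, _ = cv2.findContours(fruit_mask(image_rgb),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return cv2.boundingRect(max(contours, key=cv2.contourArea))
```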
For freshness grading, we treat the entropy and the brightness of the pixels within a bounding box as the two conditions. Generally, as a fruit rots, the number of brown/dark spots grows. This appearance change increases the variation of pixel intensities, and hence the entropy, while decreasing the brightness. The entropy of a given image \(I\) with normalised histogram \({h}_{i}\) is
$${\text{entropy}}\left( I \right) = - \sum_{i} \left( {h_{i} \times \log \left( {h_{i} } \right)} \right).$$
(19)
For a given image \(I\) with pixels \({p}_{i}\), \(i=1, 2, \dots , n\), where \(n\) is the number of pixels comprising the image, we have
$$\mathrm{brightness}\left(I\right)=\frac{1}{n}{\sum }_{i}\left({p}_{i}\right).$$
(20)
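A minimal sketch of the two descriptors is given below, assuming the bounding-box crop is an 8-bit NumPy array; the histogram is normalised so that the \(h_i\) of Eq. (19) sum to one, and empty bins are skipped to avoid \(\log(0)\). The function names are illustrative.

```python
import numpy as np

def entropy(image: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy of the intensity histogram (Eq. 19)."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 255))
    h = hist / hist.sum()   # normalise so the histogram values sum to one
    h = h[h > 0]            # drop empty bins to avoid log(0)
    return float(-np.sum(h * np.log(h)))

def brightness(image: np.ndarray) -> float:
    """Mean pixel intensity over the n pixels of the image (Eq. 20)."""
    return float(image.mean())
```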
The freshness is calculated using Eq. (21)
$$\mathrm{freshness}={k}_{e} \mathrm{entropy}\left(I\right)+{k}_{b} \mathrm{brightness}\left(I\right)+b,$$
(21)
where \({k}_{e}\) and \({k}_{b}\) are weight adjustment parameters, and \(b\) is the bias. These parameters are determined via linear regression, assuming a regression output \({y}_{i}\) and a data sample \({x}_{i}\), where \({x}_{i}\) consists of \(n\) dimensions, and thus
$${\widehat{y}}_{i}={\beta }_{0}+{\beta }_{1}{ x}_{i,1}+{\beta }_{2}{ x}_{i,2}+\dots +{\beta }_{n}{ x}_{i,n}.$$
(22)
The loss function for linear regression is
$${\mathrm{Loss}=\sum_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}.$$
(23)
We minimise the loss
$$\hat{\vec{\beta}}=\underset{\vec{\beta}}{\arg\min}\,\mathrm{Loss}\left(X,\vec{\beta}\right).$$
(24)
Therefore, we have \(\hat{\vec{\beta}}=\left\{b, -{k}_{e}, {k}_{b}\right\}\). We selected several fruit images at various decay levels and calculated the entropy and brightness of the detected bounding boxes; from these samples, \({k}_{e}\) and \({k}_{b}\) are determined. We observed the entropy and brightness and adjusted \({k}_{e}\) and \({k}_{b}\) so that the weighted sum of entropy and brightness is close to the corresponding freshness level.
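The weights can be fitted in closed form by ordinary least squares. The sketch below, with illustrative function names, minimises the loss of Eq. (23) over the sampled bounding boxes and then evaluates Eq. (21) for a new crop; it assumes the freshness levels of the sample images are available as numeric labels.

```python
import numpy as np

def fit_freshness_weights(entropies, brightnesses, freshness_labels):
    """Least-squares fit of Eq. (21), minimising the loss of Eq. (23).

    The three arrays hold, per sample bounding box, its entropy, brightness,
    and the manually assigned freshness level. Returns (k_e, k_b, b).
    """
    X = np.column_stack([entropies, brightnesses, np.ones(len(entropies))])
    beta, *_ = np.linalg.lstsq(X, np.asarray(freshness_labels, dtype=float), rcond=None)
    k_e, k_b, b = beta
    return k_e, k_b, b

def freshness(entropy_value, brightness_value, k_e, k_b, b):
    """Evaluate Eq. (21) for a new bounding-box crop."""
    return k_e * entropy_value + k_b * brightness_value + b
```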
A Hierarchical Deep Learning Model
In this section, we propose a hierarchical deep learning model in which a YOLO network performs fruit classification and its results are fed into a second network, a regression CNN, for freshness grading. The bounding boxes predicted by YOLO are passed to the regression CNN, and a separate regression CNN is trained for each class of fruit. In this article, we work with six classes of fruit, so six regression CNNs are trained. YOLO classifies the object/fruit and estimates the bounding box that locates the visual object in the image; the regression CNN corresponding to that class is then applied for freshness grading. The framework is illustrated in Fig. 1, the regression model is shown in Fig. 2, and the pipeline of the model is shown in Fig. 3.
The source images are fed into YOLO for object recognition, where the central point, width, and height of the bounding box are determined. Based on the YOLO prediction, the model maps the predicted class of the detected fruit onto the corresponding regression neural network. The detected object is cropped out of its background and used as the input image to the regression CNN.
The YOLO model in this article has the same structure as YOLOv3 [39]. We define the set of input data \(D\) as
$$D=\left\{{I}_{1},{I}_{2}, \dots , {I}_{n}\right\},$$
(25)
where \(n\) is the total number of input images and \({I}_{i}\) is the \(i\)th image, \(i=1, 2, 3, \dots ,n\). Our input images are encoded using RGB channels; thus, each image \({I}_{i}\) is three-dimensional (height \(\times\) width \(\times\) channels), and all images have the same size. For a single channel, the image \({I}_{i}\) is defined as a 2D matrix
$${I}_{i}=\left[\begin{array}{cccc}{p}_{1,1} & {p}_{1,2} & \cdots & {p}_{1,w} \\ \vdots & \vdots & \ddots & \vdots \\ {p}_{h,1} & {p}_{h,2} & \cdots & {p}_{h,w} \end{array}\right].$$
(26)
To prevent overfitting, additional random flips are applied for data augmentation. After YOLOv3 takes the source data and starts the computation [39], at the time \(t\) we have a prediction \(\widehat{Y}=\{\widehat{{y}_{1}},\widehat{{y}_{2}}, \dots , \widehat{{y}_{n}}\}\)
$$\widehat{Y}\left(t\right)=P\left({\xi }_{YOLO}\left(t\right)|D\right).$$
(27)
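The random flips mentioned above must also move the ground-truth boxes. A minimal sketch is given below, assuming YOLO-style boxes whose centre coordinates are normalised to [0, 1]; the normalisation and the function name are our assumptions, not details stated in the text.

```python
import numpy as np

def random_horizontal_flip(image: np.ndarray, boxes: np.ndarray, p: float = 0.5):
    """Randomly mirror an image and its bounding boxes left-right.

    image : H x W x 3 array; boxes : N x 4 array of (x_center, y_center, w, h),
    with centre coordinates normalised to [0, 1].
    """
    if np.random.rand() < p:
        image = image[:, ::-1, :].copy()     # mirror the pixel columns
        boxes = boxes.copy()
        boxes[:, 0] = 1.0 - boxes[:, 0]      # mirror the box centres horizontally
    return image, boxes
```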
For the YOLO model \({\xi }_{YOLO}\), each prediction consists of a bounding box and the associated object class
$$\widehat{{y}_{i}}=\left\{\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}},\widehat{{c}_{i}}\right\}.$$
(28)
According to the predicted class \(\widehat{{c}_{i}}\) and the predicted bounding box \((\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}})\), the source image \({I}_{i}\) is cropped. The new image is
$$\mathrm{new}{I}_{i}=\mathrm{crop}\left({I}_{i},\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}}\right),$$
(29)
where \(\widehat{{x}_{i}}\) and \(\widehat{{y}_{i}}\) are the central point of the predicted bounding box, and \(\mathrm{crop}({I}_{i},\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}})\) for the \(i\)th image \({I}_{i}\) is expressed as
$$\mathrm{crop}\left({I}_{i},\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}}\right)= \left[\begin{array}{ccc}{p}_{\widehat{{x}_{i}}-\frac{\widehat{{w}_{i}}}{2},\widehat{{y}_{i}}-\frac{\widehat{{h}_{i}}}{2}} & \cdots & {p}_{\widehat{{x}_{i}}+\frac{\widehat{{w}_{i}}}{2},\widehat{{y}_{i}}-\frac{\widehat{{h}_{i}}}{2}} \\ \vdots & \ddots & \vdots \\ {p}_{\widehat{{x}_{i}}-\frac{\widehat{{w}_{i}}}{2},\widehat{{y}_{i}}+\frac{\widehat{{h}_{i}}}{2}}& \cdots & {p}_{\widehat{{x}_{i}}+\frac{\widehat{{w}_{i}}}{2},\widehat{{y}_{i}}+\frac{\widehat{{h}_{i}}}{2}}\end{array}\right].$$
(30)
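A minimal sketch of the crop in Eq. (30) is given below, assuming the predicted centre, width, and height are in pixel coordinates; the rounding and the clipping to the image borders are our additions, and the function name is illustrative.

```python
import numpy as np

def crop(image: np.ndarray, x_c: float, y_c: float, w: float, h: float) -> np.ndarray:
    """Crop the predicted bounding box of Eq. (30) from an H x W x 3 image.

    (x_c, y_c) is the predicted box centre and (w, h) its width and height,
    all in pixel coordinates.
    """
    height, width = image.shape[:2]
    x1 = max(int(round(x_c - w / 2)), 0)
    y1 = max(int(round(y_c - h / 2)), 0)
    x2 = min(int(round(x_c + w / 2)), width)
    y2 = min(int(round(y_c + h / 2)), height)
    return image[y1:y2, x1:x2]   # rows are indexed by y, columns by x
```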
The cropped image \(\mathrm{new}{I}_{i}\) is fed into a regression neural network. We define the regression neural network \({\xi }_{regr}\left(t\right)\) at the training epoch \(t\); the cropped image dataset is \(\mathrm{new}D=\{\mathrm{new}{I}_{1}, \mathrm{new}{I}_{2},\dots ,\mathrm{new}{I}_{m}\}\), and
$$\widehat{R}=P\left({\xi }_{regr},\widehat{{c}_{i}}|\mathrm{new}D\right),$$
(31)
where \(\widehat{R}\) is the result of fruit freshness regression, and we have
$$\widehat{R}=\left\{{r}_{1}, {r}_{2}, \dots , {r}_{m}\right\}.$$
(32)
Hence, the hierarchical model is expressed as
$$\widehat{Y}\left(t\right)=P\left({\xi }_{YOLO}\left(t\right),{\xi }_{regr}\left(t\right)|D\right).$$
(33)
For each prediction \(\widehat{{y}_{i}}\), we have
$$\widehat{{y}_{i}}=\left\{\widehat{{x}_{i}},\widehat{{y}_{i}},\widehat{{w}_{i}},\widehat{{h}_{i}},\widehat{{c}_{i}},\widehat{{r}_{i}}\right\}.$$
(34)
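The hierarchical prediction of Eqs. (33)–(34) can be sketched as the following loop; `yolo_model(image)` is assumed to return \((\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i, \hat{c}_i)\) tuples and `regressors` to map each of the six fruit classes to its trained regression CNN. Both interfaces are illustrative assumptions, not the authors' actual API, and `crop` is the sketch given after Eq. (30).

```python
def grade_image(image, yolo_model, regressors):
    """Hierarchical prediction of Eqs. (33)-(34): detect, crop, then grade."""
    predictions = []
    for x_c, y_c, w, h, cls in yolo_model(image):
        crop_img = crop(image, x_c, y_c, w, h)          # Eqs. (29)-(30)
        r = regressors[cls](crop_img)                   # Eq. (31), class-specific regression CNN
        predictions.append((x_c, y_c, w, h, cls, r))    # Eq. (34)
    return predictions
```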
In this article, we experimented with four base networks, i.e., AlexNet, VGG, ResNet, and GoogLeNet, for regression over the six classes of fruits. Each class of fruit likely has unique features distinct from the others, which should be processed by its own regression convolutional neural network.
For each base network (AlexNet, VGG, ResNet, and GoogLeNet), we modified the number of neurons in the fully connected layers to fit our fruit freshness regression and designed four additional layers for the fully connected part of the network.
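As one concrete instance, the sketch below attaches such a regression head to a ResNet-18 backbone using PyTorch/torchvision (a recent version is assumed). The choice of ResNet-18 and the hidden-layer sizes (512–256–64) are illustrative assumptions; the article only states that four additional fully connected layers are added and that the output is a single freshness grade.

```python
import torch
import torch.nn as nn
from torchvision import models

class FreshnessRegressor(nn.Module):
    """A base CNN whose classifier is replaced by four fully connected layers
    ending in a single freshness score."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any of the four base networks could be used
        in_features = backbone.fc.in_features
        backbone.fc = nn.Identity()                # drop the original classification head
        self.backbone = backbone
        self.head = nn.Sequential(                 # four additional fully connected layers
            nn.Linear(in_features, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),                      # freshness grade r_i
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x)).squeeze(-1)
```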