Keywords

1 Introduction

The quality of crayfish is mainly determined by the three links of breeding, processing and storage, all of which are capable of significantly affecting its quality score [1]. Therefore, the use of traditional methods such as sanitary inspection, sensory evaluation, and physical and chemical analysis. They not only require professional testing, but also have the disadvantages of being too subjective and having long operation cycle [2].

The NIR spectroscopy is a green non-destructive detection with the advantages of low cost, high analytical efficiency, high speed and good reproducibility [3], and has been widely used in various fields such as food, pharmaceutical and clinical medicine [4], biochemical [5], textile [6], and environmental science. Modern NIR spectroscopy must rely on chemometric methods to complete spectral pre-processing and model building. Spectral pre-processing methods include smoothing algorithms, multivariate scattering correction, wavelet transform, etc.; Commonly used multivariate correction methods include linear correction methods such as principal component regression and partial least square, and nonlinear correction methods such as artificial neural networks and support vector machines [7].

In this study, we experimentally analyze four algorithms in crayfish quality detection and compare their prediction rates. Although PCL PLS and BP neural network have achieved better results in experiments, there is still room for improvement compared to support vector machines. Support vector machine has high generalization ability and can better handle practical problems such as small samples, nonlinearity, ambiguity, and high dimensionality [8]. The crayfish classification model with high stability and high accuracy in near-infrared spectroscopy using support vector machines, aiming to provide reference for subsequent research.

The second part of this paper gives a brief introduction of each model as well as its derivation; the third part selects the optimal model from the above machine algorithms through experimental analysis and comparison; the fourth part is an analysis of the advantages and disadvantages of the algorithms and a summary.

2 Theoretical Approach to Modeling

In this paper, four different machine classification algorithms will be used to predict crayfish quality, namely: SVM, PLS-SVM, PCA-SVM, and BP neural network. Firstly, we process the original data and divide the training set and test set according to a certain proportion. The training set is used as the input for training, and the classification model is obtained by adjusting the optimization parameters of each algorithm. Then the test set is used as the input. Finally, compare the accuracy of the four classifiers and find the appropriate optimal model.

2.1 Support Vector Machines

The basic idea of SVM is to find the support vector which constructs the optimal classification hyperplane in the training sample set which means that samples of different categories are correctly classified and the hyperplane interval is maximized. The mathematical form of the problem is:

$$y_{i} [(w^{T} x_{i} + b)] \ge 1, i = 1,2,3,...,N$$
(1)

For linear indivisibility, there is a certain classification error that does not satisfy Eq. (1). Therefore, a slack variable is introduced in the optimization objective function. At this time, The problem of finding the optimal classification hyperplane will be converted into a convex optimization problem with constraints for solving:

$$\left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\min } & {\frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{N} {\zeta_{i} } } \\ {s.t.} & {y_{i} (w^{T} x_{i} + b) \ge 1 - \zeta_{i} } \\ \end{array} } & {i = 1,2,3,...,N} \\ \end{array} } \right.$$
(2)

In the Eq. (2): C is called the penalty parameter. If the value of C is larger, the penalty for misclassification is larger. And the smaller C is, the smaller the penalty for misclassification is [9].

The classifier discriminant model function in n-dimensional space . At this time, the problem of the linear indivisible support vector machine becomes a convex quadratic programming problem. And we can use the Lagrangian function to solve it.

When the sample is non-linear, we can choose the kernel function to solve. In this paper, we mainly use RBF for SVM. The corresponding classification decision function is:

$$f(x) = {\text{sgn}} (\sum\limits_{i = 1}^{N} {\lambda_{i} y_{i} K(x,x_{i} ) + b} )$$
(3)

2.2 Partial Least Square

Partial least square is a dimensionality reduction technique that maximizes the covariance between the prediction matrix composed of each element in the space and the predicted matrix [10]. It concentrates the features of principal component analysis, typical correlation analysis and linear regression analysis in the modeling process. Therefore, it can provide richer and deeper systematic information [11]. The partial least square model is developed as follows:

Pre-process the prediction matrix and the predicted matrix to make them mean and centered, and then decompose them:

$$\left\{ {\begin{array}{*{20}c} {X = AP^{T} + B} \\ {Y = TQ^{T} + E} \\ \end{array} } \right.$$
(4)

where \(Y \in R^{n * m}\) and \(X \in R^{n * m}\) are the predicted matrix, \(A \in R^{n * a}\) and \(T \in R^{n * a}\) are the score matrix, \(P \in R^{m * a}\) and \(Q \in R^{m * a}\) are the load matrix, \(B \in R^{m * n}\) and \(E \in R^{n * m}\) are the residual matrix.

The matrix product \(AP^{T}\) can be expressed as the sum of the products of the score vector \(t\) and the load vector \(P_{j}\), then we have:

$$X = \sum\limits_{j = 1}^{a} {t_{j} p_{j}^{T} + B}$$
(5)

The matrix product \(TQ^{T}\) can also be expressed as the sum of the products of the score vector \(u_{j}\) and the load vector \(q_{j}\), so it can be expressed as:

$$Y = \sum\limits_{h = 1}^{a} {u_{j} q_{{_{j} }}^{T} } + E$$
(6)

Let \(u_{j} = b_{j} t_{j}\), where \(b_{j}\) is the regression coefficient, then \(U = AH\), \(H \in R^{aa}\) is the regression matrix:

$$Y = AHQ^{T} + E$$
(7)

2.3 Principal Component Analysis Method

Principal component analysis is a mathematical transformation method in multivariate statistics that uses the idea of dimensionality reduction to transform the original multiple variables into a few integrated variables with most important information [12]. These integrated variables reduce the complexity of data processing, and reflect the maximization of the content contained in the original variable, reduce the interference of error factors, and reflect The relationship between the variables within the matter.

For the raw data, we can extract the intrinsic features among the data by some transformations, and one of the methods is to go through a linear transformation to achieve [13]. This process can be expressed as follows:

$$Y = wX$$
(8)

Here \(w\) is a transformation value, which can be used as a basic transformation matrix to extract the features of the original data by this transformation. Let \(x\) denote the \(m\) dimensional random vector. Assume that the mean value is zero, that is:

$$E[X] = 0$$
(9)

Let \(w\) be denoted as an m dimensional unit vector \(x\) and make it project on \(x\). This projection is defined as the inner product of the vectors \(x\) and \(x\), it is denoted as:

$$Y = \sum\limits_{k = 1}^{n} {w_{k} x_{k} = w^{T} x}$$
(10)

In the above equation, the following constraints are to be satisfied:

$$\left\| w \right\| = (w^{T} w)^{{{1 \mathord{\left/ {\vphantom {1 2}} \right. \kern-\nulldelimiterspace} 2}}} = 1$$
(11)

The principal component analysis method is to find a vector of weights \(E[y^{2} ]\), which enables the expression to take the maximum value [14].

2.4 BP Neural Network

BP neural networks simulate the human brain by simulating the structure and function of neurons. And it has the ability to solve complex problems quickly, accurately and in parallel. When the training samples are large enough, the BP neural network makes the error very small and makes the prediction result accurate enough. Compared to other neural network algorithms, BP neural networks are able to propagate the error backwards from the output to the input layer by using hidden layers. And modify the weights and threshold values during the back propagation process using the fastest descent method to make the error function converge quickly, which has fast training speed [15].

3 Experimental Results and Analysis

3.1 Support Vector Machine Classification Model

In supervised learning theory, two data sets are included. One is used to build the model, called the training sample set; the other is used to test the quality of the built model, called the test sample set. After preprocessing the data, we select half of the experimental data as the training set randomly, and use them to build the model. Finally, the remaining half of the experimental data are used as a test set and input them to the established model for classification and identification of crayfish.

LIBSVM is chosen as the training and testing tool for this model, and Gaussian kernel is chosen as the kernel function. We can search for parameters (c, g) by 10-fold cross-validation, and calculate the optimal value of 10-fold cross-validation accuracy. The set of (c, g) with the highest cross-validation accuracy is taken as the best parameter, obtaining c = 0.1, g = 4, as shown in Fig. 1.

As shown in Fig. 2, according to the comparison between the model and the actual sitution, where all samples are correctly classified with an accuracy rate of 100%, and it shows that the model has an extremely strong generalization ability and has a very high accuracy in high dimensionality.

Fig. 1.
figure 1

Optimization parameters by grid searching technique.

Fig. 2.
figure 2

The Sample error in the SVM Model.

3.2 Principal Component Analysis for Clustering Crayfish

In order to remove the overlapping information in the NIR spectra and the information lacking correlation with the sample properties as much as possible, we reduced the original data matrix from 800 × 215 to 800 × 3 (3 principal components) by PCA. Since the principal component score plots of the samples can reflect the internal characteristics and clustering information of the samples, we obtained the contribution rate plots of the first three principal components as shown in Fig. 3 and the three-dimensional score distribution plots of the first three principal components as shown in Fig. 4.

Fig. 3.
figure 3

Contribution of the top three principal components.

Fig. 4.
figure 4

3D score distribution of the top three principal components.

Figure 6 is a plot of the scores of principal component 1, 2, 3 for 800 crayfish, where the x y z axis represent the first principal component score, the second principal component score and the third principal component score respectively. From the figure, we can see that crayfish are clearly classified into 8 categories, indicating that components 1, 2, and 3 have a significant impact on crayfish with a better clustering effect. To describe the classification results quantitatively, we build a classification model for principal components using SVM.

We randomly select one-half of the standardized sample data as the training set to train the model, and the remaining one-half as the test set. The first 5 principal component score data are taken as the data features for identification. As shown in Table 1.

Table 1. Reliabilities of principal compoents.

After that, we obtain a classification accuracy equal to 98.75% for this experiment by SVM.

3.3 Partial Least Squares Regression Analysis

It is especially important to determine the number of principal components in the PLS model. As the number of principal components increases, the degree of importance gradually decreases and represents less and less effective information. If too few principal components are selected, the characteristics of the sample are not fully reflected thus reducing the accuracy of the model prediction, this situation called under-fitting; if too many principal components are selected, some noisy information will be used as the characteristics of the sample, making the prediction ability of the model lower, this situation called overfitting [16]. Therefore, in order to reasonably determine the principal component score of the model, we derived a principal component score of 3 by taking the sum of squared prediction residuals [17] as the evaluation criterion.

Fig. 5.
figure 5

Contribution of the top three Comparison of predicted values and actual values.

Fig. 6.
figure 6

Error analysis of S content in PLS.

The SVM model is built by the LIBSVM toolbox, and the comparison chart between predicted and reference values is shown in Fig. 5, and the error analysis is shown in Fig. 6. We came up with an accuracy rate of 99.5%.

3.4 BP Neural Network Model

The crawfish classification BP network model uses a three-layer network structure, namely input layer, implicit layer, and output layer, and the layers are interconnected. Among them, the number of neurons in the input layer is 215 features of the samples. the number of labels of the samples in the output layer is 1 layer, and the number of implicit neurons is 20 layers. The weights of the BP neural network model are set to default, the learning step is set to 0.01, the maximum number of training sessions is 1000, and the expected error is 0.001. We normalize the 800-group sample as the input term,after several training sessions, if the error meets our expectation, then the neural network model is valid and can be applied.

Figure 7 shows the performance curve of the training, indicating its variance variation. After four cycles, the network achieves convergence with a mean squared error of 0.00089, which is less than the set expectation error target of 0.001. The whole curve decreases faster, indicating the appropriate size of the learning rate.

Fig. 7.
figure 7

Performance curve of BP neural neural network.

Fig. 8.
figure 8

Sample error plot in the BP network training.

Figure 8 shows the regression plot corresponding to the BP regression function, from which the fit of the training data, validation data, test data and the whole data,we know the correlation coefficient \(R\) are 0.99604, 0.99415, 0.99342 and 0.9955 respectively with high correlation, indicating the model fit. Through the above analysis, the BP neural network model has a good prediction effect with strong generalization ability in this study. Finally, we measured the accuracy rate of 97%.

4 Conclusion

Crayfish quality is affected by several factors, and it is necessary to ensure a reasonable classification of crawfish quality for objective evaluation of all aspects. This paper introduces SVM, PCA, PLS and BP neural networks in crayfish quality detection, leading to the following conclusions:

  1. (1)

    To ensure the comparability of the model, 800 learning samples were selected with 215 feature vectors as input and classification label level as output. The results of the validation data show that the classification accuracy is all greater than 95%, which meets the accuracy requirement of mine environment evaluation.

  2. (2)

    Compared with the BP neural network algorithm, the SVM algorithm shows more obvious superiority: the SVM model introduces the cross-validation method to program the automatic optimal selection of parameters, which overcomes the disadvantage that the neurons in the hidden layer of the BP neural network are not easily determined, and thus has a higher accuracy rate.

  3. (3)

    Compared with PCA, the PLS algorithm can not only solve the problem of variable multicollinearity, but also solve the regression problem of multiple dependent variables with independent variables, reducing the influence of overlapping information.

In summary, the support vector machine model is chosen to be more suitable for the classification of crayfish quality, which has high accuracy and low time at high dimensionality and fuzziness.