Introduction

Wheat is the second major food crop in the world and is a main source of energy, protein and trace minerals [1]. Grain quality assessment is important for its breeding, transportation, warehouse storage and quality rating [2]. Wheat protein is commonly measured by nitrogen determination; however, such conventional chemical analyses are time-consuming and have a number of other shortcomings [3]. Therefore, rapid quality detection techniques are needed to meet the requirements of modern storage and of efficient, rapid grading [4, 5].

Hyperspectral technology can provide such information from the target sample accurately and in a timely manner without destructive methods. This technology has been successfully used in crop growth monitoring, canopy nutrition diagnosis, and estimation of grain starch and protein contents [6,7,8]. Spectral estimation based on grain powder provides accurate estimates of grain quality [9]. However, the grain milling process is not resource efficient and consumes a lot of energy and time, which limits its wide application in production practice [10]. Moreover, hyperspectral estimation from milled grain powder is not suited to repeated measurements for periodic quality assessments. By contrast, direct quantitative estimation of grain protein from whole grains using hyperspectral technology might be valuable for providing real-time, rapid and repeated assessments to ensure proper storage and maintenance of quality standards.

In addition, environmental factors and grain status influence the spectrum during acquisition; preprocessing can reduce the influence of such external factors, thereby improving the signal-to-noise ratio of the spectral information and the modeling accuracy [11,12,13]. However, modeling methods differ considerably in their principles, which leads to different performances of the optimal model for wheat quality traits [14,15,16]. When the preprocessing method and the modeling method differ, there is no unified conclusion on how characteristic bands influence modeling accuracy. Previous research has shown that models based on characteristic bands can greatly simplify model construction and further improve prediction efficiency [17]. Therefore, it is necessary to study the combined effects of pretreatment methods, modeling methods, and the choice between the full spectrum and characteristic wavelengths on model accuracy, in order to obtain the optimal combination.

In this study, we analyzed winter wheat samples and collected grain and powder hyperspectral data to estimate grain protein content. Different preprocessing methods and multivariate modeling methods were used to elucidate the relationship between winter wheat protein and the hyperspectral data sets. Based on this information, we constructed a hyperspectral quantitative monitoring model for the precise and real-time assessment of winter wheat grain protein.

Materials and methods

Sample collection

Samples were collected in Linfen City, Shanxi Province (36.0882° N, 111.5196° E), and Yuncheng City, Shanxi Province (35.0263° N, 111.0070° E) (Fig. 1).

Fig. 1 Sampling sites and distribution of sampling points across both sites

Both areas are located in the middle reaches of the Yellow River Basin, in the typical landform area of the Loess Plateau, with an altitude of 500–1000 m, a temperate continental monsoon semi-arid climate, and an annual precipitation of about 500 mm [18]. Farmer fields across southern Shanxi Province were selected for experimentation, and wheat grain samples harvested at maturity were analyzed for grain protein content.

Spectral data acquisition

Wheat grain samples were evenly placed in a black plastic box (diameter 90 mm, height 15 mm), and the surface of the sample was kept flat. An ASD FieldSpec 3 portable spectrometer was then used to obtain the spectral data. The probe of the instrument was equipped with a simulated solar light source, and spectral data were recorded in the range of 350–2500 nm. To limit the influence of external light on the spectral information during acquisition, the spectrometer probe was placed directly on the sample. Each sample was measured using the five-point method, and 9 spectral curves were recorded at each point, giving 45 spectral curves per sample. Finally, the 45 spectral curves were averaged to give the spectrum of each sample.
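The averaging step can be illustrated with a minimal sketch. It assumes each curve is a list of reflectance values on a common wavelength grid; the function name `average_spectra` and the toy values are hypothetical and are not taken from the study.

```python
from statistics import fmean


def average_spectra(curves):
    """Average equally long reflectance curves band by band."""
    if not curves:
        raise ValueError("need at least one curve")
    n_bands = len(curves[0])
    if any(len(c) != n_bands for c in curves):
        raise ValueError("all curves must cover the same bands")
    # zip(*curves) groups the values of all curves at each band
    return [fmean(band) for band in zip(*curves)]


# Toy data (made-up values): 5 measuring points x 9 scans = 45 curves, 3 bands each
points = [
    [[0.10 + 0.01 * p, 0.20 + 0.01 * p, 0.30 + 0.01 * p] for _ in range(9)]
    for p in range(5)
]
all_curves = [scan for point in points for scan in point]
sample_spectrum = average_spectra(all_curves)
```

Here `sample_spectrum` plays the role of the single averaged curve that represents one sample in the later analysis.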

Determination of protein contents

The nitrogen content of the grains was determined using the Kjeldahl method [19], and grain protein content (P) was calculated as follows:

$$P = \frac{5.7 \times C \times 100\% }{{2M}}$$
(1)

where M is the weight of the sample after crushing (g) and C is the nitrogen content in the digestate (mg/L).

Spectral data processing

Outliers were removed from the collected spectral data using ViewSpec Pro software, and the remaining spectra were averaged to give the final data. The original spectra were then preprocessed to improve the signal-to-noise ratio using Savitzky–Golay smoothing (SG), derivatives [first derivative (FD) and second derivative (SD)], standard normal variate (SNV), and multiplicative scattering correction (MSC). Characteristic bands were then extracted with the continuous projection algorithm (SPA).

Savitzky–Golay smoothing (SG)

Savitzky–Golay smoothing is one of the most basic and commonly used spectral preprocessing methods. Its principle is to move a window of fixed width along the spectrum, fit a low-order polynomial to the data in each window by least squares, and replace the central point with the fitted value, thereby eliminating noise. Smoothing reduces the spikes in the spectral curve and makes it smoother, while largely preserving the shape and width of spectral features [20].
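A minimal sketch of the smoothing step is given below. It assumes a 5-point window with a quadratic fit, for which the standard Savitzky–Golay convolution coefficients are (-3, 12, 17, 12, -3)/35; the window width and polynomial order actually used in this study are not stated, and leaving the two points at each end unchanged is a simplification made here.

```python
# Standard 5-point quadratic/cubic Savitzky-Golay smoothing coefficients
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def sg_smooth5(y):
    """Smooth a spectrum with a 5-point Savitzky-Golay filter.

    The first and last two points are copied unchanged (assumption:
    no special edge handling is attempted in this sketch).
    """
    half = len(SG5) // 2
    out = list(y)
    for i in range(half, len(y) - half):
        out[i] = sum(c * y[i + k - half] for k, c in enumerate(SG5))
    return out


# A quadratic signal passes through unchanged, while an isolated spike is damped
quadratic = [float(x * x) for x in range(6)]
spiky = [1.0, 1.0, 1.0, 6.0, 1.0, 1.0, 1.0]
smoothed_quadratic = sg_smooth5(quadratic)
smoothed_spiky = sg_smooth5(spiky)
```

The quadratic case shows why the method preserves peak shape better than a plain moving average: any signal that is locally a low-order polynomial is reproduced by the fit.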

Derivative

The derivative transformation method can enhance the spectral differences between samples and compensate for the baseline shift caused by light scattering, thereby improving the recognition of different sample spectra [21, 22]. Among them, the first derivative mainly reduces some linear or near-linear noise in the target spectrum, while the second derivative can eliminate baseline drift and interference. Both methods can improve the model monitoring accuracy of the sample [23, 24].
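The effect of the derivative transforms on a baseline can be sketched with finite differences. This is an illustration under stated assumptions: central differences on an evenly spaced wavelength grid are used here, the function names are hypothetical, and the study does not state which derivative scheme (for example a gap or Savitzky–Golay derivative) was applied.

```python
def first_derivative(wavelengths, reflectance):
    """Central-difference first derivative at the interior bands."""
    return [
        (reflectance[i + 1] - reflectance[i - 1])
        / (wavelengths[i + 1] - wavelengths[i - 1])
        for i in range(1, len(reflectance) - 1)
    ]


def second_derivative(wavelengths, reflectance):
    """Central-difference second derivative (assumes uniform band spacing)."""
    h = wavelengths[1] - wavelengths[0]
    return [
        (reflectance[i + 1] - 2 * reflectance[i] + reflectance[i - 1]) / h**2
        for i in range(1, len(reflectance) - 1)
    ]


# Made-up spectrum consisting only of an offset plus a linear baseline
wavelengths = [350, 351, 352, 353, 354, 355]
baseline = [0.5 + 0.001 * w for w in wavelengths]
fd = first_derivative(wavelengths, baseline)
sd = second_derivative(wavelengths, baseline)
```

For this purely linear baseline the first derivative is a constant (the offset has vanished) and the second derivative is zero, which is the sense in which differentiation suppresses baseline effects.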

Standard normal variate (SNV)

When measuring the spectrum of a sample, if there are particles on the surface of the sample or the particles are unevenly distributed, scattering will occur. Standard Normal Variate (SNV) is often used to eliminate the scattering error and the interference of particle size [20].
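SNV operates on each spectrum separately: the spectrum is centered on its own mean and scaled by its own standard deviation. The sketch below assumes the sample standard deviation (n - 1 denominator); the name `snv` and the toy values are illustrative only.

```python
from statistics import fmean, stdev


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by itself."""
    mean = fmean(spectrum)
    sd = stdev(spectrum)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("a flat spectrum cannot be scaled")
    return [(x - mean) / sd for x in spectrum]


# Made-up spectrum and a copy distorted by a multiplicative and additive effect
base = [0.2, 0.4, 0.3, 0.5]
scattered = [1.5 * x + 0.1 for x in base]
snv_base = snv(base)
snv_scattered = snv(scattered)
```

Both spectra map to the same SNV result, which shows how a scatter effect that acts as a gain and offset on the whole spectrum is removed.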

Multiplicative scattering correction (MSC)

The basic principle of multiplicative scattering correction (MSC) is to take the average spectrum of the calibration set as an approximation of the ideal spectrum and to correct each sample spectrum by linear regression against it, so as to realize the scattering correction of the spectral data [25]. This method can separate chemical light absorption from physical light scattering, and can eliminate the scattering influence caused by uneven particle distribution and different particle sizes [26].
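A minimal sketch of MSC in its common form: each spectrum is regressed on a reference spectrum (here the mean of the supplied set) and then corrected with the fitted intercept and slope. The function name `msc` and the toy data are hypothetical; how the reference was chosen in this study beyond the description above is not stated.

```python
from statistics import fmean


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    Each spectrum s is modelled as s = intercept + slope * reference and
    corrected to (s - intercept) / slope. Returns (corrected, reference).
    """
    if reference is None:
        reference = [fmean(band) for band in zip(*spectra)]
    ref_mean = fmean(reference)
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = fmean(s)
        slope = sum((r - ref_mean) * (x - s_mean) for r, x in zip(reference, s)) / sxx
        intercept = s_mean - slope * ref_mean
        corrected.append([(x - intercept) / slope for x in s])
    return corrected, reference


# Made-up spectra that differ only by a gain and an offset
base = [0.2, 0.4, 0.3, 0.5]
spectra = [base, [1.2 * x + 0.05 for x in base], [0.8 * x - 0.02 for x in base]]
corrected, reference = msc(spectra)
```

In this idealized case all corrected spectra collapse onto the reference. When the correction is applied to a validation set, the reference computed from the calibration set would normally be passed in through the `reference` argument.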

Continuous projection algorithm (SPA)

The continuous projection algorithm, more commonly known as the successive projections algorithm (SPA), is a forward variable selection method that minimizes collinearity in the vector space. It uses vector projection analysis to eliminate redundant information in the spectrum as far as possible and selects a set of characteristic bands from the whole spectral range. This not only reduces the number of spectral bands involved in modeling, but also ensures minimum collinearity between the characteristic bands, thus improving modeling efficiency. The calibration set and validation set used for extracting feature bands with SPA in this article are the same as the calibration set and validation set used for building the model [11, 27].
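The projection step at the core of SPA can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it assumes mean-centered band vectors, a fixed starting band and a fixed number of bands to select, whereas a full SPA run also compares starting bands and subset sizes by validation error. The name `spa_select` and the toy matrix are hypothetical.

```python
from statistics import fmean


def spa_select(X, n_select, start=0):
    """Select band indices by successive orthogonal projections.

    X has one row per sample and one column per band.
    """
    # One mean-centered vector per band (working copy that gets projected)
    cols = [[v - fmean(c) for v in c] for c in zip(*X)]
    if not 0 < n_select <= min(len(cols), len(X)):
        raise ValueError("n_select out of range")
    selected = [start]
    while len(selected) < n_select:
        ref = cols[selected[-1]]
        ref_sq = sum(v * v for v in ref)
        if ref_sq == 0:
            raise ValueError("reference band has no variation left")
        best, best_norm = None, -1.0
        for j, col in enumerate(cols):
            if j in selected:
                continue
            # Remove the component of band j that lies along the last selected band
            coef = sum(a * b for a, b in zip(col, ref)) / ref_sq
            cols[j] = [a - coef * b for a, b in zip(col, ref)]
            norm = sum(v * v for v in cols[j])
            if norm > best_norm:
                best, best_norm = j, norm
        # Keep the band with the largest remaining (least collinear) component
        selected.append(best)
    return selected


# Made-up data: band 1 is exactly twice band 0, band 2 carries new information
X = [[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1]]
chosen = spa_select(X, n_select=2, start=0)
```

Starting from band 0, the collinear band 1 is left with a zero projection and band 2 is chosen, which is the behavior that keeps the selected bands as non-redundant as possible.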

Model method and evaluation index

We used back propagation neural network (BPNN), partial least squares (PLS), random forest (RF), and support vector machine (SVM) to construct multivariate statistical models [28]. The coefficient of determination (R2), root mean square error (RMSE), and residual predictive deviation (RPD) were used to evaluate the effectiveness of each model, calculated as follows:

$$R^{2} = \frac{{\mathop \sum \nolimits_{i} (\widehat{{y_{i} }} - \overline{y})^{2} }}{{\mathop \sum \nolimits_{i} \left( {y_{i} - \overline{y}} \right)^{2} }}$$
(2)
$${\text{RMSE}} = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \widehat{{y_{i} }}} \right)^{2} }$$
(3)
$${\text{RPD}} = \frac{{{\text{SD}}}}{{{\text{RMSE}}}}$$
(4)

where n is the number of samples, \(\widehat{{y}_{i}}\) is the predicted value of the sample, \({y}_{i}\) is the measured value of the sample, \(\overline{y }\) is the average measured value of the samples, and SD is the standard deviation of the measured values.
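Equations (2) to (4) translate directly into code. The sketch below follows the formulas as written, so R2 is computed as the ratio of the explained to the total sum of squares; the use of the sample standard deviation (n - 1 denominator) for SD is an assumption, and the function names and toy values are illustrative only.

```python
from math import sqrt
from statistics import fmean, stdev


def r2(measured, predicted):
    """Eq. (2): explained sum of squares over total sum of squares."""
    y_bar = fmean(measured)
    explained = sum((p - y_bar) ** 2 for p in predicted)
    total = sum((y - y_bar) ** 2 for y in measured)
    return explained / total


def rmse(measured, predicted):
    """Eq. (3): root mean square error."""
    return sqrt(fmean([(y - p) ** 2 for y, p in zip(measured, predicted)]))


def rpd(measured, predicted):
    """Eq. (4): standard deviation of measured values over RMSE."""
    return stdev(measured) / rmse(measured, predicted)  # sample SD (assumption)


# Made-up measured and predicted values
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
scores = (r2(measured, predicted), rmse(measured, predicted), rpd(measured, predicted))
```

Higher R2 and RPD together with lower RMSE indicate a better model, which is how the models are compared in the results.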

Data processing software

In this study, The Unscrambler 9.7 and Excel 2019 were used to preprocess the spectral data of the collected samples, Matlab R2010a was used to extract the characteristic bands and build the models, and Origin 2021 was used to draw the figures.

Results and analysis

Descriptive statistical analysis of winter wheat protein content

Using the concentration gradient method, the 185 samples were divided into a calibration set (139) and a validation set (46) at a ratio of 3:1. The descriptive statistics of winter wheat grain protein content in each data set are presented in Fig. 2. The protein content of winter wheat ranged from 37 to 284 mg/g, with an average of 127 mg/g. The ranges of the calibration set and validation set were 247 and 208 mg/g, respectively. The division of the sample data was reasonable, and the skewness of each sample set was less than 1, indicating that the grain protein content in each data set was approximately normally distributed.
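One common way to implement a concentration gradient split is to sort the samples by protein content and assign every fourth sample to the validation set. The sketch below makes that assumption (the exact rule used in the study is not stated); the name `gradient_split` and the synthetic protein values are hypothetical.

```python
def gradient_split(values, every=4):
    """Split sample indices 3:1 along the sorted concentration gradient.

    Samples are ranked by value; every `every`-th ranked sample goes to
    the validation set and the rest to the calibration set.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    calibration, validation = [], []
    for rank, idx in enumerate(order):
        (validation if rank % every == every - 1 else calibration).append(idx)
    return calibration, validation


# Synthetic protein contents (mg/g) for 185 samples, spread over 37-284
protein = [37 + (i * 53 % 185) * 247 / 184 for i in range(185)]
calibration_idx, validation_idx = gradient_split(protein)
```

With 185 samples this rule yields 139 calibration and 46 validation samples, and both the lowest and the highest protein contents fall into the calibration set, so the validation range lies within the calibration range.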

Fig. 2 Descriptive statistics of protein content in winter wheat

Hyperspectral response of protein content in winter wheat

The spectra were averaged within each of four protein content levels to obtain the mean spectral reflectance of grains and powders at each level (Fig. 3). The overall trends of the spectral curves of the two types of samples were similar, with obvious peaks and troughs. However, the spectral reflectance of winter wheat powder was markedly higher than that of winter wheat grain. In addition, the spectral reflectance of both grain and powder samples increased with grain protein content, indicating a positive correlation between protein content and spectral reflectance. Furthermore, the differences in spectral reflectance between protein levels were most evident at 750–1200 nm and 1450–2400 nm for grain, and at 500–750 nm, 1450–1800 nm, and 1900–2400 nm for powder.

Fig. 3 Spectral reflectance of grain and powder at different protein contents. The samples were divided into four groups by quartile of protein content, corresponding to 0–25%, 25–50%, 50–75%, and 75–100% of the total sample data

Correlation between grain and powder spectrum of winter wheat

The correlations between the preprocessed spectra of grain and powder and the protein content of winter wheat are presented in Fig. 4. For grain, the correlation between the original spectrum (R) and protein content increased with wavelength and finally became stable. For powder, the correlation between R and protein content initially declined, then increased, and became stable with increasing wavelength. Overall, the trend of the correlation between spectrum and protein content under SG treatment was the same as that of the original spectrum for both types of samples. In addition, the spectra of both types of samples under SG pretreatment were positively correlated with grain protein content. In the powder state, the correlations after SG pretreatment were highest at 1750 nm and 2000–2500 nm. The fluctuation of the correlation between sample spectrum and protein content under FD pretreatment was similar to that under SD: after both transformations, the correlation fluctuated sharply between positive and negative values. Moreover, the FD correlation of grain samples increased gradually after 1400 nm, while that of powder samples decreased gradually after 1500 nm. For both types of samples, the correlations between spectra and protein content under SNV pretreatment were consistent with those under MSC pretreatment. In addition, powder spectra under MSC and SNV pretreatment exhibited negative correlations with protein content at 700–1200 nm and 1800–2500 nm, and positive correlations at 1300–1800 nm.
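A correlation curve of this kind can be computed band by band. The sketch below assumes the Pearson correlation coefficient, which the text does not name explicitly; the function names and toy data are hypothetical.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def band_correlations(spectra, protein):
    """Correlate the reflectance at each band with protein content.

    `spectra` has one row per sample and one column per band.
    """
    return [pearson(band, protein) for band in zip(*spectra)]


# Made-up data: band 0 rises with protein content, band 1 falls
spectra = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
protein = [100.0, 110.0, 120.0, 130.0]
correlations = band_correlations(spectra, protein)
```

Plotting `correlations` against wavelength for each preprocessing method gives curves analogous to those described above, with sign changes marking the transition between positively and negatively correlated regions.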

Fig. 4 Correlation analysis between the preprocessed spectral reflectance of grain and powder and protein content

Feature band extraction for winter wheat grain protein

We selected the characteristic bands and the best spectral bands using SPA (Table 1). The distributions of the characteristic bands obtained under the six spectral treatments were plotted (Fig. 5). For the seed spectra, the bands carrying information on protein content were concentrated in the ranges of 350–450 nm, 900–1160 nm, 1300–1500 nm, and 1901–2100 nm, whereas for the powder spectra they were concentrated in the ranges of 330–430 nm, 550–600 nm, 1300–1400 nm, and 1990–2050 nm. A comparison of the two distributions showed that, beyond 500 nm, the characteristic bands of the powder spectra covered a narrower range than those of the seed spectra. Overall, the characteristic bands of the two types of samples overlapped, and these overlapping bands are likely to be the characteristic bands common to seed and powder samples.

Table 1 Important bands for winter wheat kernel protein extracted by the SPA method

Fig. 5 Extraction and distribution of the important spectral wavelengths for grain protein based on the SPA method. A is grain, B is powder. The shaded part represents the main distribution areas of the characteristic bands of grain protein obtained by SPA under different spectral pretreatments. The solid line represents the spectral reflectance of all grain and powder samples

Hyperspectral estimation model for protein content of winter wheat seeds

Table 2 presents the best-performing model for each modeling method, for both the full spectrum and the characteristic bands. Comparing the R2, RMSE and RPD of the validation sets for seed samples, the optimal full-band model was SG-RF (Rv2 = 0.779, RMSE = 0.026, RPD = 2.125) and the optimal characteristic-band model was SG-SVM (Rv2 = 0.789, RMSE = 0.026, RPD = 2.177). For powder samples, the optimal full-band model was FD-RF and the optimal characteristic-band model was SG-BPNN. Among these four models (SG-RF, SG-SVM, FD-RF, and SG-BPNN), SG-BPNN (Rv2 = 0.814, RMSE = 0.024, RPD = 2.318) had the highest accuracy on the validation set. For a more visual comparison, Fig. 6 shows the 1:1 fit of measured and predicted winter wheat seed protein content for the four models: seed full-band SG-RF, seed characteristic-band SG-SVM, powder full-band FD-RF, and powder characteristic-band SG-BPNN.

Table 2 Estimated model performance of winter wheat grain protein under different spectral pretreatment
Fig. 6 1:1 fit of the measured and predicted values of the optimal estimation models for grain protein content in winter wheat

Discussions

In this study, the spectra of winter wheat grain showed obvious protein-related absorption peaks at 851, 1443, 1458, 1476, and 2246 nm. The original spectrum of grain was positively correlated with protein content, and the correlation was strongest at 851–1476 nm, while the original spectrum of powder was negatively correlated with protein content at 714–1154 nm and positively correlated with protein content at 1407–2500 nm. The characteristic bands of wheat grain protein extracted by SPA were mainly distributed in 350–450 nm, 900–1160 nm, 1300–1500 nm, and 1901–2100 nm. The characteristic bands of wheat grain powder protein were mainly distributed in 330–430 nm, 550–600 nm, 1300–1400 nm, and 1990–2050 nm. The basic unit of protein is the amino acid, which is mainly composed of C, H, O, and N, so the information provided by hyperspectral reflectance mainly comes from the overtone absorption of C–H, O–H, and N–H groups; the region around 800 nm is related to the third overtone of C–H and N–H [29]. In addition, the region between 1200 and 1500 nm may be related to the triple-frequency (overtone) absorption of C–H and the O–H stretching vibration, and the region around 2000 nm to the combination-band absorption of the N–H stretching vibration [30, 31]. In summary, the spectral regions of 350–430 nm, 851–1154 nm, 1300–1476 nm, and 1990–2050 nm were closely related to winter wheat protein.

We found that models built on the hyperspectral reflectance of powder samples were more accurate than those built on seed samples, and that the correlation between the powder spectral data and protein content was higher. The reason is that seed and powder samples have different particle sizes, so their spectral reflectance differs considerably, and protein content is itself measured on the powder [10, 32]; the correlation and model accuracy were therefore higher for powder than for seeds. However, preparing powder samples destroys the seeds and seed coats, so when the model for powder samples is of similar accuracy to that for seed samples, it is more practical to choose seeds. In this study, the difference in validation R2 between SG-BPNN in the powder state and SG-SVM in the seed state was 0.025; considering its practical value, the seed-state model is also a feasible choice for monitoring the protein content of wheat. Nevertheless, the mechanism by which sample treatment affects the accuracy of spectral quality prediction, and the explanation of such subtle differences, remain to be further investigated.

In the correlation analysis between the preprocessed spectra and protein content, the highest correlation was obtained with FD [33]. FD is reported to remove linear and near-linear components from the original spectrum, highlighting the rates of increase and decrease of spectral reflectance. It can also capture the inflection points and extreme points of the original spectral curve and accurately locate the peak and valley features of protein absorption [34]. Moreover, FD can separate the absorption characteristics and trends of the protein spectrum in the infrared region, achieving better prediction results than the original spectrum [35, 36]. However, the optimal model in this paper is SG-BPNN based on characteristic bands in the powder state, and its preprocessing method is SG rather than FD. Several factors may explain this. The influence of the seed coat on spectral information is greatly reduced in the powder state [37,38,39]. BPNN has a strong nonlinear fitting ability, which allows it to use rich data sets effectively to simulate the complex relationships between variables, greatly improving the accuracy of the model [40, 41]. A model trained on characteristic bands is not trained on invalid information, which improves recognition accuracy [42,43,44,45]. Therefore, the combination of sample state, band selection, and model algorithm may have a great impact on the accuracy of the model.

Conclusion

We established quantitative estimation models for the protein content of winter wheat grains based on the spectral preprocessing methods SG, FD, SD, SNV, and MSC, combined with BPNN, PLS, RF, and SVM. The hyperspectral reflectance of both types of samples, winter wheat seeds and seed powder, was positively correlated with protein content, and the spectral regions 350–430 nm, 851–1154 nm, 1300–1476 nm, and 1990–2050 nm were closely related to winter wheat protein. Among the hyperspectral estimation models constructed for the protein content of winter wheat, SG-BPNN based on the characteristic bands of powder was the best (Rc2 = 0.851, RMSEc = 0.021, RPDc = 2.594; Rv2 = 0.814, RMSEv = 0.024, RPDv = 2.318). However, SG-SVM based on the characteristic bands of seeds was of more practical value (Rc2 = 0.073, RMSEc = 0.025, RPDc = 2.097; Rv2 = 0.789, RMSEv = 0.026, RPDv = 2.177). We found that the protein content of winter wheat seeds could be effectively monitored using hyperspectral techniques, and that the choice of spectral preprocessing method affected the accuracy of the quantitative estimation models. In short, this study provides a reference and technical support for the future use of hyperspectral techniques for accurate and rapid estimation of wheat grain protein.