SUnSeT: spectral unmixing of hyperspectral images for phenotyping soybean seed traits

Jeong, Seok Won; Lyu, Jae Il; Jeong, HwangWeon; Baek, Jeongho; Moon, Jung-Kyung; Lee, Chaewon; Choi, Myoung-Goo; Kim, Kyoung-Hwan; Park, Youn-Il

doi:10.1007/s00299-024-03249-0

SUnSeT: spectral unmixing of hyperspectral images for phenotyping soybean seed traits

Original Article
Open access
Published: 09 June 2024

Volume 43, article number 164, (2024)
Cite this article

Download PDF

You have full access to this open access article

Plant Cell Reports Aims and scope Submit manuscript

SUnSeT: spectral unmixing of hyperspectral images for phenotyping soybean seed traits

Download PDF

Seok Won Jeong¹^na1,
Jae Il Lyu²^na1,
HwangWeon Jeong²,
Jeongho Baek²,
Jung-Kyung Moon³,
Chaewon Lee⁴,
Myoung-Goo Choi⁵,
Kyoung-Hwan Kim² &
…
Youn-Il Park ORCID: orcid.org/0000-0003-2912-2821¹

722 Accesses
1 Altmetric
Explore all metrics

Abstract

Key message

Hyperspectral features enable accurate classification of soybean seeds using linear discriminant analysis and GWAS for novel seed trait genes.

Abstract

Evaluating crop seed traits such as size, shape, and color is crucial for assessing seed quality and improving agricultural productivity. The introduction of the SUnSet toolbox, which employs hyperspectral sensor-derived image analysis, addresses this necessity. In a validation test involving 420 seed accessions from the Korean Soybean Core Collections, the pixel purity index algorithm identified seed- specific hyperspectral endmembers to facilitate segmentation. Various metrics extracted from ventral and lateral side images facilitated the categorization of seeds into three size groups and four shape groups. Additionally, quantitative RGB triplets representing seven seed coat colors, averaged reflectance spectra, and pigment indices were acquired. Machine learning models, trained on a dataset comprising 420 accession seeds and 199 predictors encompassing seed size, shape, and reflectance spectra, achieved accuracy rates of 95.8% for linear discriminant analysis model. Furthermore, a genome-wide association study utilizing hyperspectral features uncovered associations between seed traits and genes governing seed pigmentation and shapes. This comprehensive approach underscores the effectiveness of SUnSet in advancing precision agriculture through meticulous seed trait analysis.

Hyperspectral Image-Based Variety Discrimination of Maize Seeds by Using a Multi-Model Strategy Coupled with Unsupervised Joint Skewness-Based Wavelength Selection Algorithm

Article 11 July 2016

Experimental data manipulations to assess performance of hyperspectral classification models of crop seeds and other objects

Article Open access 03 June 2022

Hyperspectral imaging for seed quality and safety inspection: a review

Article Open access 08 August 2019

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Seeds are an important source of food for human beings and confer genetic backgrounds for crop yield (TeKrony and Egli 1991). Various phenotyping tools have been used to evaluate seed quality in a quantitative manner. Among these tools, manual inspection has been de facto standard. However, this is notorious for being error-prone, laborious, and time-consuming, especially for massive number of seeds. To amend this, several high-throughput phenotyping methods have been recently developed to generate more precise and reproducible quantitative measurements of seeds.

One prominent phenotyping method is using image sensors and computer-based analysis tools, which has largely increased accuracy particularly for plants and specifically for seeds. Utilizing 2D and 3D images primarily captured by red-green-blue (RGB) cameras, various computer vision algorithms have been employed to extract diverse agricultural traits from seeds. Noteworthy software applications such as SmartGrain (Tanabata et al. 2012) and SeedExtractor (Zhu et al. 2021) have been developed for seed phenotyping, focusing on aspects such as seed shape and color. To complement RGB-based images, which capture only surface properties, X-ray-based imaging methods (Gagliardi and Marcos-Filho 2011; Gomes-Junior et al. 2012) have been introduced to examine seed structure and germination capacity. This addresses the limitation of RGB cameras, which provide low spectral resolution by acquiring only three wavelengths (R, G, B), neglecting other wavelengths rich in chemical composition (Yang et al. 2015). Despite the benefits, X-ray imaging is often time-consuming, rendering it unsuitable for high-throughput phenotyping (Dumont et al. 2015).

Hyperspectral imaging sensors provide both spatial and spectral information and enough throughput for real-world scenarios. With a hyperspectral sensor, reflectance spectra across a broad range of wavelengths, including ultraviolet (UV), visible (VIS), and near-infrared (NIR) spectra, can be obtained using various platforms (Bock et al. 2010). Due to its ease of operation and scalability, hyperspectral imaging has become a widely accepted tool, supplanting RGB cameras. Some examples are assessing the viability of wheat seeds (Zhang et al. 2018), classifying rice seed cultivars (Qiu et al. 2018), and determining the starch content of corn seeds (Liu et al. 2020). A recent development in this domain is the HyperSeed platform, designed to provide hyperspectral information for seeds (Gao et al. 2021). This software offers a comprehensive solution for high-throughput seed phenotyping with HSI techniques. It includes a lab-based imaging system coupled with an application featuring a graphical user interface (GUI) implemented in MATLAB, an open-source platform available to researchers with an institutional license.

The data representing phenotypic characteristics, extracted from hyperspectral images (HSI), has been integrated with genotype data typically comprising genome-wide single nucleotide polymorphisms (SNPs) identified through various methods such as resequencing, genotyping-by-sequencing, or array-based genotyping. This integration facilitates several genetic analyses including genomic prediction (Krause et al. 2019; Sandhu et al. 2021), as hyperspectral patterns and spectra indices are genetically linked with target phenotypes. Moreover, it enables genetic inference studies such as genome-wide association (GWA), heritability, and genetic correlation analyses (Feng et al. 2017; Sun et al. 2019; Barnaby et al. 2020; Wu et al. 2021; Massahiro Yassue et al. 2023) to explore allelic variation.

Among the wide variety of crop seeds, soybean (Glycine max (L.) Merr.) stands as a crucial food and feed crop, being the world’s most cultivated annual herbaceous legume. Recognizing its significance, efforts have been made to construct soybean core collections (Oliveira et al. 2010; Jeong et al. 2019; Kim et al. 2022; Jo et al. 2023; Nair et al. 2023). Core collection is a subset that represents the genetic and phenotypic diversities of the entire collection (Frankel and Frankel 1984). Utilizing a core collection is advantageous for conserving and harnessing the genetic variability of soybeans in research and breeding programs. Among several soybean core collections, the Korean soybean core collection (KSCC) was meticulously assembled, leveraging genotypic and phenotypic data, as well as population structure analysis. The KSCC comprises 430 accessions carefully selected from a pool of 2,872 collections, based on Affymetrix Axiom®180k SoyaSNP array data (Jeong et al. 2019).

Despite the paramount importance of seed traits, the initial focus was primarily on investigating morphological characteristics, seed weights (Baek et al. 2020; Nair et al. 2023), shape and colors (Baek et al. 2020; Kim et al. 2022; Nair et al. 2023). Unfortunately, in the previous reports, seed shape and colors were not characterized in a whole core-collection wide, which hampers prediction or classification of core collection using machine learning algorithms. In previous approaches to soybean seed phenotyping, seed morphological traits were extracted from 2D RGB images using the software ImageJ. However, this method proved to be laborious, time-consuming, and error-prone, resulting in lower reproducibility. Additionally, seed coat colors for all KSCC accessions were not classified, even through manual inspection. To overcome these challenges, there is a need for a precise and reproducible phenotyping method based on a command line interface (CLI) for classification.

Machine learning, a subdomain of Artificial Intelligence (AI), encompasses algorithms capable of deriving valuable insights from data and utilizing this information in self-learning processes for effective classification or prediction. Advancements in hardware and software components within machine vision systems have propelled the popularity of machine learning algorithms with fair reliability and accuracy. These algorithms process data swiftly and deliver reliable decisions in minimal time. Various techniques like Artificial Neural Networks (ANN), Fuzzy logic, decision trees, Naïve Bayes, k-means clustering, support vector machines (SVM), random forest (RF), k-Nearest Neighbor (k-NN), among others, have found extensive use in agricultural-related fields (Rehman et al. 2019; Saha and Manickavasagan 2021).

Linear Discriminant Analysis (LDA) is a linear transformation method that reduces the dimensions within a dataset to identify a feature subspace optimizing class separability. While LDA assumes the normal distribution of data and statistical independence of features, it can be applied to datasets even when these assumptions are not entirely met (Raschka and Mirjalili 2017). Primarily used for feature extraction, LDA enhances computational efficiency and reduces overfitting. Its robustness has led to widespread applications in agricultural product classification based on hyperspectral data (Mahesh et al. 2008; Liu et al. 2010; Delwiche et al. 2011; Qin et al. 2020). Most studies employing LDA for hyperspectral imaging-based classification report an average accuracy exceeding 90%, affirming the classifier’s robustness.

According to UPOV guidelines, soybean seed classification includes three main categories: seed weight, seed shape, and seed testa colors, with 3, 4, and 7 different classes, respectively (UPOV, https://www.upov.int/edocs/mdocs/upov/en/twa_46/tg_80_7_proj_3.pdf). This theoretically yields 84 input features. However, in an initial investigation, machine learning of KSCC seeds classification using these features led to underfitting with a lower accuracy of 25%. Therefore, additional features or predictors are required for machine learning.

In this study, we aimed to address this challenge by conducting seed classification based on features extracted from hyperspectral analysis using 420 KSCC accessions as the test case. We overcame the limitations of manual and GUI-dependent methods in quantifying seeds with morphological and biochemical changes by extracting four endmembers using the pixel purity index (PPI) algorithm. These endmembers helped segment seeds from backgrounds and distinguish the seed testa region from the hilum. RGB triplet codes and grayscale values for seed accessions were then obtained from the segmented seed testa. A total of 199 features were extracted from both morphological and hyperspectral traits for classification. We applied LDA to classify the 420 KSCC accessions, achieving high accuracy compared to other machine learning classifiers. Averaged seed coat reflectance in the range of 450 to 900 nm may be useful for characterizing soybean seed accessions. We have developed a toolbox called SUnSet (Spectral Unmixing of hyperspectral images for phenotyping soybean Seed Traits). This toolbox encompasses spectral unmixing, seed segmentation, feature extraction, model production using LDA classifier, and screening genes that determine these seed traits. SUnSet serves as a potential tool for comprehending genetic variation and expediting agricultural trait enhancement of soybean seeds by identifying genes associated with controlling seed traits.

Materials and methods

Workflow

The workflow of the current work is illustrated in Fig. 1. Each KSCC accession with 15 to 50 seeds is placed under the illumination of two halogen lamps (750W, ARRILITE Plus 750), and HSI are generated in the form of a hypercube. Seed-specific endmembers were extracted and then used to segment seeds from the background. Seed-based morphological and hyperspectral features were extracted for further analysis. Seed weight and morphological features such as length, area, and shape of seeds, extracted from reconstituted RGB images from hypercubes using ImageJ software (ImageJ 1.54d, http://imagej.org), were used for validation. Machine learning algorithms were utilized to classify KSCC seeds into their respective accessions. The entire process, from data analysis to machine learning, was conducted using the SUnSet toolbox, which is implemented in MATLAB (The MathWorks, Inc., Natick, USA). Additionally, Genome-Wide Association Study (GWAS) was employed to screen candidate genes controlling seed traits.

Sample preparation and HSI acquisition

Seeds were harvested from 420 KSCC accessions grown at National Institute Agricultural Science (NIAS, Jeonju, Korea) during the 2022 harvest season. For each seed accession, HSI were acquired for the ventral or lateral sides of the seeds. Fifteen or fifty seeds per accession were used to extract morphological and hyperspectral features, respectively. The hyperspectral camera and light source were mounted on a camera tripod stand (Manfrotto 155), with a lamp case attached to the stand (ARRILITE 750 Plus, München, Germany). A high-performance line-scan image spectrograph, SPECIM IQ (Specim Ltd, Oulu, Finland), was fixed on top of the stand, covering the spectral range from 400 to 1,000 nm with a 2.9 nm spectral resolution (Behmann et al. 2018). The fore optic of the camera provided a 31$^\circ$ x 31$^\circ$ field of view, and the minimum working distance of the camera lens was 150 mm. Two lighting units with a 750 W tungsten-halogen bulb (Ushio HPL X Plus Long Life Lamp, 750W/230V, color temperature 3,050K) were fixed on top of the frame to illuminate the seed samples, and the light intensity on the top of the seeds was 1,700 $\mu$mol m^-2s^-1.

Image acquisitions were performed in a dark room at room temperature. The distance and exposure time were set to 16 cm and 1 ms, respectively, resulting in a spatial resolution of 0.17 mm and 0.20 mm on the targets for ventral and lateral view images, respectively. On average, it took 30 s to capture one image in the form of three-dimensional hypercubes of 512 x 512 pixels x 204 wavelengths. For the measurement of the white reference, a 99% barium sulfate white panel was recorded simultaneously with the seeds.

Spectral unmixing for identifying KSCC seed specific endmembers

HSI, compatible with the ENVI format and consisting of pairs of raw images (*.raw) and header (*.hdr) files, were radiometrically corrected using the corresponding software Specim IQ Studio (Behmann et al. 2018). After data preprocessing, the remaining procedures were executed using Matlab R2023b, along with the Computer Vision Toolbox, Imaging Process Toolbox, and Parallel Computing Toolbox. First, hypercubes were generated, visualized in RGB images, and then regions of interest (ROIs) containing seeds were cropped out, excluding the whiteboard panel. The number of endmembers present in a hyperspectral data cube was determined by using the noise-whitened Harsanyi-Farrand-Chang (NWHFC) method with a probability false alarm value of $10^{-3}$ (Chang and Du 2004).

Pixel Purity Index (PPI) algorithm that is used to identify the most spectrally pure (extreme) pixels in remote-sensed images was applied to the Minimum Noise Fraction (MNF) Rotation transform output image to select spectral endmembers (Boardman et al. 1995). Spectral variability of endmembers caused by variable illumination and environmental, atmospheric, and temporal conditions (Somers et al. 2011) was computed from the mean-squared error (MSE) values obtained for each endmember spectrum with respect to all the endmembers from initially extracted endmember dataset.

$$\begin{aligned} \text {MSE} = \frac{1}{n} \sum _{i=1}^{n} (m_i - \hat{m}_i)^2 \end{aligned}$$

(1)

where $m_i$ and $\hat{m}_i$ are reference and test endmember i, and n is the number of wavebands. Endmembers with spectrally similar characteristics were chosen based on the MSE values within the rage 0 < MSE < 5 x 10$^\text {-5}$. The resulting averaged spectra was then regarded as the set of endmembers for HSI analysis. Subsequently, the proportion of each endmember present in the spectra of each pixel was estimated as abundance maps. For a hyperspectral data cube of spatial dimensions M-by-N containing P endmembers, there exist P abundance maps, each of size M-by-N.

To quantify the accuracy of the reconstruction over the original image, the relative root mean square error (rRMSE) value was determined using the formula:

$$\begin{aligned} \text {rRMSE(X)} = \frac{\sqrt{\frac{1}{n} \sum _{i=1}^{n}(x_i - x_i^{\text {ref}})^2}}{\text {max}(\text {X}^{\text {ref}}) - \text {min}(\text {X}^{\text {ref}})} x 100\% \end{aligned}$$

(2)

where X is a reconstructed image, X$^{\text {ref}}$ is an original reference image, n is the number of pixels, max and min is the maximum and minimum value, respectively (Chen et al. 2012).

To assess similarity among endmembers, the spectral angle mapper (SAM) classification algorithm (Kruse et al. 1993), the spectral information divergence (SID) technique (Chang 2000), and the normalized spectral similarity score method (NS3) (Nidamanuri and Zbell 2011) were utilized. Given the test spectra t and a reference spectra r of length n, SAM, SID and NS3 scores that computes spectral similarity based on the Euclidean and SAM distances between two spectra are calculated as following equations

$$\begin{aligned} \text {SAM}= & {} \cos ^{-1} \left( \frac{\sum _{i=1}^{n} t_i r_i}{\sqrt{\sum _{i=1}^{n} t_i^2} \sqrt{\sum _{i=1}^{n} r_i^2}} \right) \end{aligned}$$

(3)

$$\begin{aligned} \text {SID}= & {} \sum _{i=1}^{n} t_i \log \left( \frac{t_i}{r_i}\right) + \sum _{i=1}^{n} r_i \log \left( \frac{r_i}{t_i}\right) \end{aligned}$$

(4)

$$\begin{aligned} \text {Euclidean}= & {} \sqrt{\frac{1}{n} \sum _{i=1}^{n} (t_i - r_i)^2} \end{aligned}$$

(5)

$$\begin{aligned} \text {NS3}= & {} \sqrt{\text {Euclidean}^2 + (1 - \cos (\text {SAM}))^2} \end{aligned}$$

(6)

where $t_{i}$ and $r_{i}$ are the $\textit{i}$-th elements of the vectors t and r, respectively.

Vegetation indices for vegetation (NDVI; Schnell 1974), chlorophyll (TCARI; Haboudane et al. 2004) and anthocyanin (ARI2; Gitelson et al. 2001) were obtained by formula as followings.

$$\begin{aligned} \text {NDVI}= & {} \frac{R_{800} - R_{680}}{R_{800} + R_{680}} \end{aligned}$$

(7)

$$\begin{aligned} \text {ARI2}= & {} R_{800} \left( \frac{1}{R_{550}} - \frac{1}{R_{700}} \right) \end{aligned}$$

(8)

$$\begin{aligned} \text {TCARI}= & {} 3\left[ (R_{700} - R_{670}) - 0.2(R_{700} - R_{550}) \left( \frac{R_{700}}{R_{670}}\right) \right] \end{aligned}$$

(9)

where the symbols $R_{800}$, $R_{680}$, $R_{550}$, $R_{700}$, $R_{670}$ represent reflectance at wavelengths 800nm, 680nm, 550nm, 700nm, and 670nm, respectively.

Seed segmentation

Intensity thresholding techniques, commonly employed in imaging analysis, have been extensively used for background removal in seed segmentation (Baek et al. 2020; Gao et al. 2021). In preliminary study, the hypercube image was initially reduced to a 3D RGB image, segmented using GUI-dependent color thresholders or CLI-dependent single or multiple thresholding values. However, applying these thresholding techniques for KSCC seed segmentation posed challenges due to difficulty in distinguishing background colors overlapping with seed testa or hilum colors, exacerbated by unavoidable shadowing caused during image capture. Additionally, the underutilization of spectral information in the NIR regions, despite using relatively expensive hyperspectral sensors, rendered these techniques less justifiable.

In this study, specific endmembers for seed foregrounds or backgrounds were visually inspected and grouped based on the abundance maps. Pixels associated with foreground endmembers underwent further processing. To isolate overlapped seeds into individual entities, RGB codes were converted into grayscale(GS) using a weighted average formula ($GS = 0.299R + 0.587G + 0.114B$). Subsequently, gradient magnitude in both horizontal and vertical directions was computed using a Sobel kernel (Kanopoulos et al. 1988), and Otsu’s method (Otsu 1979) was applied to convert the grayscale image into a binary image using a global threshold. Morphological operation using opening filter (van den Boomgaard and van Balen 1992) were then conducted to first shrink and then expand foreground objects to approximately their original size using a square structuring element.

Following this, MATLAB’s regionprops function was utilized to measure properties such as area, major axis, minor axis, and centroid for each 8-connected component in the binary image. After filtering components out based on their size compared to the median value, a series of masks representing individual seeds were obtained. This process achieved a ratio of 1.0 between the number of imaged seeds and the number of segmented seeds, indicating minimal misinterpretation of background pixels as seeds. Additionally, the potential issue of multiple seeds being treated as a single entity due to overlapping was successfully mitigated. Seed shape indices, including width/height and thickness/width ratios, were calculated from ventral side area, width, and height, along with lateral side thickness measurements.

RGB triplet and hyperspectral data extraction

Soybean seeds are categorized into eight different classes based on the color of the seed testa, specifically yellow, yellow-green, green, light brwon, intermediate brown, dark brown, and black seeds, following the UPOV guideline. An RGB triplet in the range [0, 1] is a three-element row vector specifying the intensities of the red, green, and blue components of the chosen color. To obtain the RGB triplet for each seed accession, an initial visual inspection of seed colors was conducted by ten individuals. In cases of disagreement on classification, the class with the majority of votes was assigned to the seed coat color.

For the segmentation of the seed coat from the hilum, pixels with the most abundant endmember were selected. These refined seed segmentation take the form of binary masks, consisting of zero and non-zero values. Subsequently, these masks were applied to the x-y plane of the hypercube. Pixels with zero values in the mask were set to zero, while all others remained unchanged in the output image. Additionally, due to the low sensitivity of camera sensors at the spectral extremes, outliers caused by random noise often appear. For improved accuracy, 5% of bands at the beginning and end were removed in this study, resulting in the selection of spectral bands with wavelengths in the range of 426–970 nm. Following the refinement of the hypercube, RGB triplets and averaged reflectance spectra of pixels within each segmented seed were extracted. This process ensures a more accurate representation of soybean seed characteristics, considering both color information and spectral reflectance.

Data transformation

Reflectance spectra in close-range imaging are often influenced by the illumination and background environment. This influence has been effectively mitigated by employing derivative spectra (Zhang et al. 2006; Prasad and Gnanappazham 2014). First-order derivative spectra (dR) reveal the peak characteristics of the spectrum, reflecting waveform changes caused by the absorption of light by chlorophyll and other substances in plants (Becker et al. 2005). Additionally, logarithmic transformation (logR) (can enhance spectral differences in the visible region and reduce the impact of multiplicative factors caused by changes in illumination conditions (Pu and Gong 2011). dR and logR values were obtained by the following formulas.

$$\begin{aligned} \text {dR} &= \left[ \frac{r_3 - r_1}{\Delta \lambda }, \frac{r_4 - r_2}{\Delta \lambda }, \ldots , \frac{r_n - r_{n-2}}{\Delta \lambda } \right] \end{aligned}$$

(10)

$$\begin{aligned} \text {logR}= & {} [\log (r_1), \log (r_2), \ldots , \log (r_n)] \end{aligned}$$

(11)

where ri denotes reflectance at the i-th wavelength, n denotes the number of wavebands, and $\Delta \lambda$ denotes the waveband intervals (nm).

Seed classification

The classification of KSCC accessions based on seed size, shape, and colors was conducted using machine learning models. Seed sizes were categorized into three classes–small, medium, and large–utilizing weight per seed. As weight-dependent seed size classification is not feasible using 2D images, the variables of area, width, height, and thickness were employed instead. Seed shapes were classified into four types–spherical, spherical flattened, elongated, and elongated flattened–based on the ratios of width/height and thickness/width. Seed coat colors for the core collection were manually categorized into seven different colors: yellow, yellow-green, green, light brown, medium brown, dark brown, and black, respectively. Quantitative RGB triplet codes for the KSCC accessions were obtained from seed testa segmentation based on testa-abundant endmembers. Additionally, averaged reflectance values at given wavelengths and spectral indices such as NDVI, ARI2, and TCARI of every seed were included to produce features for machine learning.

The final dataset comprised 420 response classes (KSCC seed accessions) and 199 predictors, including seed size, shape, color code, RGB triplet, averaged entire spectrum, and pigment chlorophyll and anthocyanin indices derived from reflectance spectra. This dataset was divided into two parts: 80% of seeds for training and 20% of seeds for testing, using random selection. To classify KSCC accession seeds, the entire set of methods in the classification learner toolbox, consisting of 31 machine learning algorithms available in MATLAB 2023b software, was utilized. The selected parameters are default parameters of the MATLAB 2023b classification learner tool; therefore, no classifiers were optimized to achieve higher classification accuracy.

Five-fold cross validation with default settings is used with these classifiers to obtain numerical results. Models obtained as such are evaluated using four performance metrics on the test samples: Accuracy, Precision, Recall, and F1-score. Usually higher values represent better performance and formula are given as Eqs. (9)–(12).

$$\begin{aligned} \text {Accuracy (A)}= & {} \frac{(\text {TP} + \text {FP)}}{(\text {TP + TN + FP + FN})} \end{aligned}$$

(12)

$$\begin{aligned} \text {Precision (P)}= & {} \frac{\text {TP}}{\text {(TP + FP)}} \end{aligned}$$

(13)

$$\begin{aligned} \text {Recall (R)}= & {} \frac{\text {TP}}{\text {(TP + FN)}} \end{aligned}$$

(14)

$$\begin{aligned} \text {F1-score}= & {} \frac{\text {2PR}}{\text {(P + R)}} \end{aligned}$$

(15)

where TP, FP, TN, and FN are true positives, false positives, true negatives and false negatives, respectively. Moreover, the model is also evaluated using seed group prediction accuracy for a fair comparison, which presents the percentage of correctly predicted seeds in the test set.

Feature importance analysis

Linear discriminant analysis (LDA) is a dimensionality reduction and classification technique widely used in supervised learning. In the context of classification, it seeks to discover a linear combination of features that maximizes class or category separation within a dataset. By leveraging dimensionality reduction to enhance classification, LDA identifies the most discriminative features in the dataset (Siqueira et al. 2017; Kaznowska et al. 2018).

Importance of the predictors for classification was obtained by running MATLAB’s built-in-function, fitcdiscr, that yields the DeltaPredictor property. Usually, the ’Delta’ property is set to provide a threshold for which predictors to use or not. At the same time, the most informative bands of the hyperspectral data cube were selected by using orthogonal space projection method (Du and Yang 2008).

$$\begin{aligned} \text{Y} = \text{X} - \text{P}_{B}{(X)} \end{aligned}$$

(16)

where $Y$ is the orthogonal space to subspace $B$, $X$ is the original data matrix, and $P_B(X)$ is the projection of $X$ onto subspace $B$. In MATLAB’s built-in-function selectBands, only 10% of the pixel values in the hyperspectral data cube is computed to reduce the computational complexity.

GWAS analysis

A Genome-Wide Association Study (GWAS) analysis was conducted to explore the relationship between genetic variations and phenotypic traits in KSCC accessions. A total of 420 KSCC accessions were analyzed, with phenotypic data representing 23 features, including seed size, shape, color, endmember abundance proportion, spectral indices, as well as chlorophyll and anthocyanin contents. Genotypic data, comprising of 169,029 Single Nucleotide Polymorphisms (SNPs), were obtained from the 180K Axiom SoySNP Array (Lee et al. 2015; Jeong et al. 2019).

The GWAS analysis employed the Mixed Linear Model (MLM) algorithm implemented in the GAPIT tool (https://zzlab.net/GAPIT/) (Lipka et al. 2012). SNPs with a False Discovery Rate (FDR) threshold of 0.05 or lower were considered significant, while those with a Minor Allele Frequency (MAF) below 5% were excluded from further analysis. Gene information associated with the identified SNPs was obtained from the genetic annotation database, Soybase DB (https://www.soybase.org/). Relevant genes were then subjected to enrichment analysis using the Gene Ontology (GO) analysis tool accessible in the web version of the Soybean Genome Database (SoybeanGDB, https://venyao.xyz/soybeangdb/) (Li et al. 2023).

Anthocyanin and chlorophyll content

To analyze anthocyanin content, five beans were individually placed in separate tubes and incubated overnight in 300 $\mu$L of methanol acidified with 1% HCl. Following the addition of 200 $\mu$L of distilled water, anthocyanins were separated from chlorophyll by the addition of 500 $\mu$L of chloroform. The concentration of total anthocyanins was determined by spectrophotometric measurements at absorbance values of 530 nm (A530) and 657 nm (A657) in the aqueous phase. The relative anthocyanin content per gram (g) was calculated by subtracting the A657 absorbance from the A530 absorbance (Neff and Chory 1998).

To analyze chlorophyll content, tubes containing a single bean from a set of five were filled with 1 mL of N,N’-dimethylformamide. The samples were incubated overnight to ensure complete chlorophyll extraction. Absorbance at 647 nm and 664 nm was measured using a spectrophotometer. Chlorophyll content was calculated according to the formula by Porra et al. (1989), and then normalized to the weight of the beans.

Results

Classification of morphological and color traits of seeds using manual inspection

In total, 420 soybean germplasms were classified based on seed weight, seed shape and coat colors using the criteria provided by UPOV. The single seed weight ranged from 0.0575 to 0.439 g, with a mean of 0.194 g (± 0.063, n = 420) (Fig. 2A). These values are quite similar to soybean germplasms reported earlier (Kim et al. 2022). The lowest and the highest seed weights were observed in accession no. cmj078 and cmj088, respectively. When accessions were classified based on UPOV guideline, the number of accessions belong to small (< 0.13 g), medium (0.13–0.24 g), and large (> 0.24 g) were 85, 262, and 69, respectively, showing respective average weights of 0.11 (±0.02), 0.19 (±0.03), and 0.29 (±0.05) g (Fig. 2B and C).

To determine seed shape, RGB images were generated from the HSI, and then ImageJ software was used to measure area, major- and minor-axis lengths for height, width, and thickness, respectively. Seed height ranged from 1.42 to 10.89 mm, with a mean of 4.08 mm (± 1.59) (Fig. 3A). Seed width ranged from 3.23 to 8.05 mm, with a mean of 5.85 mm (± 0.83) (Fig. 3A). Seed thickness ranged from 4.63 to 9.71 mm, with a mean of 7.08 mm (± 0.93) (Fig. 3A). For the classification of seed shape based on UPOV guideline (i.e., spherical (Width/Length > 0.90 and Thickness/Width > 0.85), spherical-flattened (Width/Length > 0.90 and Thickness/Width > 0.84), elongated (Width/Length < 0.89 and Thickness/Width > 0.85), and elongated-flattened (Width/Length < 0.89 and Thickness/Width < 0.84), 414 seeds belong to the category no. 3 (elongated), and the remaining accessions, cmj063 and cmj348, belong to the spherical and elongated-flattened, respectively (Fig. 3B and C). No flattened (Thickness/Width ratio $\le$ 0.84) germplasm was found.

The seed coat colors of soybeans are diverse including yellow, green, brown and black (Baek et al. 2020; Kim et al. 2022). In comparison, only two colors, yellow and green observed in cotyledons (Fig. 4A). Through visual inspection, the seed coat colors of 420 accessions were classified into seven different categories: (1) yellow, (2) yellow green, (3) green, (4) light brown, (5) medium brown, (6) dark brown, and (8) black, as shown in Fig. 4B. Purple-colored seed accessions were not found in the KSCC germplasm, as previously reported from the other source of soybean germplasm (Kim et al. 2022). Yellow (n = 244) was the most frequent classification, followed by black (n = 85), yellow-green (n=35), green(n = 22), light brown (n = 12), medium brown (n = 12), and dark brown (n = 6) (Fig. 4B and C). For bi- and multiple-color seed coats such as black with white stripes or dots and brown with white stripes or dots, the colors were classified as black or brown, respectively. However, the averaged RGB triplet and grey scale of medium- and dark-brown seed coats were [0.41 0.32 0.26 0.30] and [0.39 0.29 0.24 0.31], being statistically not different (p < 0.5, Fig. 4D). Therefore, we propose that KSCC accessions are now classified as 6 categories, i.e., yellow, yellow green, green, light brown, brown, and dark.

Seed endmembers extraction and seed segmentation

Seed segmentation using RGB triplet codes based on global or multiple thresholds in the current study was not successful, resulting in low accuracy (number of seeds segmented / number of seeds imaged x 100) for separating seed images from the background images as well as neighboring seeds (4.3%). This low accuracy was mainly caused by shadows that overlapped with neighboring seeds, making it challenging to separate individual seeds. Therefore, we aimed to filter out shadow pixels using spectral unmixing technology. Additionally, pixels are not pure; instead, they are subject to mixed pixel issues (Sanjeevi and Barnsley 2000), which involve the mixture of multiple pure endmembers. Thus, spectral unmixing was applied to extract endmembers and quantify their abundance proportion. In the present study, mixed pixels were unmixed using the linear mixing model, PPI algorithm. Out of 420 HSI of seed accessions, a total of 2,567 endmembers were initially isolated, with a mean of 6.17 (± 1.84) endmembers per accession. After filtering out non-redundant endmembers based on the mean-squared error (MSE) among endmembers using a cutoff value of 5 x $10^{-5}$, 8 endmembers were finally selected (Fig. 5A). Unmixing accuracy evaluated using rRMSE was 2.12% (± 0.37, n = 420). Furthermore, seed segmentation accuracy was 98.8%, significantly enhanced compared to that based on RGB images.

Based on visual inspection of abundance maps (Suppl. Fig. S1), these endmembers could be assigned to either the foreground target (named hereafter as EM#1, 2, 3, and 4) or background pixels (named hereafter as EM#5, 6, 7, and 8). The first group endmembers are refined to the seed coat and hilum, while the 2nd group mostly represents background pixels such as shadows and white cloth (Suppl. Fig. S1). Vegetation indices of these endmembers, such as ARI2 index for anthocyanin pigment (Gitelson et al. 2003) and TOCARI index for chlorophyll pigment (Haboudane et al. 2004) (Fig. 5B), were different among endmembers. Index values of foregrounds showed greater values than those of the background. Roughly speaking, endmembers #1, 2, and 4, and 3 are correlated to respective yellow, brown and green, and black seed colors.

Next, we carried out spectral similarity analysis between test spectra and the specified reference spectrum using SAM classification algorithm (Kruse et al. 1993), SID technique (Chang 2000), and NS3 (Nidamanuri and Zbell 2011), respectively. As shown in Fig. 5C and Suppl. Fig. S2A and B, spectral dissimilarity was found within foreground endmembers as well as between those of foregrounds and backgrounds. Figure 5D and E and Supplementary Fig. S2C and D show abundance proportions of endmembers for the ventral and lateral sides of seeds, respectively. In both side images, EM#4 (n = 292 for ventral side, and 275 for lateral side) was the most abundant endmember, followed by EM#2 (n = 88 and 89), EM#3 (n = 52 and 34), and EM#1 (n = 1 for both). Here, the endmember proportion in the seed image represents the abundances of a given endmember unmixed from the other target endmember classes. For instance, a proportion value of 0.80 from the abundance image indicates that 80% of the target consists of seed testa classified by a given endmember. For both ventral and lateral sides, EM#1 was the most abundant endmember, and hence likely contributed to distinguish seed coat from the other parts. For the case of EM#1 abundance value is > 0.80, the hilum’s spectral property seems very similar to that of seed coat.

Correlation between seed matrices acquired through SUnSet and those obtained manually

To evaluate the effectiveness of the algorithm on seed segmentation with different morphological traits, morphological statistics such as vertical area, seed height, width, and thickness were compared with those obtained from ImageJ. A total of 5,382 seeds from 420 accessions, with a mean of 12.93 (± 0.58, no.of seeds/accession), were measured and compared in terms of area, height, and width. The correlation coefficient values ($r^2$) of these traits between seeds segmented from the hyperspectral unmixing algorithm and ImageJ were 0.98, 0.94, and 0.93, respectively, indicating that the size and shape seeds are accurately segmented by SUnSet toolbox.

Spectral analysis

The averaged hyperspectral reflectance of seed accessions was obtained by averaging the reflectance of segmented seeds. As shown in Suppl. Fig. S3, each curve represents the averaged reflectance of approximately 50 seeds in the corresponding group. The average reflectance curves showed great diversity, including similar patterns of vegetation with the red-edge region, which is the position of the main inflection point of the red-NIR slope (670–780 nm) (Curran et al. 1995; Van Der Meer 2004). At the same time, stronger or weaker red-edge inflections were also observed. It is known that strong chlorophyll absorption in the red spectrum and scattering in the NIR cause the Red-Edge Position (REP) phenomenon (Dawson and Curran 1998). The differences in the spectral response in the visible light region are indicative of seed pigments. For instance, seeds with green colors would consequently have higher reflectance within the green reflectance spectral region. Consistently, the averaged hyperspectral reflectance of yellow, yellow-green, and green seed colors show greater inflection in the green light region, while this inflection is the lowest in the light brown seed coat group (Fig. 6A).

Logarithmic (logR; Fig. 6B) and first derivative (dR; Fig. 6C) transformations also clearly yielded spectral variations among different seed coat color groups. Furthermore, the seeds from the KSCC collection were subjected to additional categorization through hierarchical clustering. Spectral similarity among the averaged reflectance of KSCC accession seeds were assessed using SAM, SID, and NS3 algorithms. Subsequently, linkage and cluster functions were applied to group these spectra based on their similarity values obtained from pairwise comparisons. As a result, the KSCC accession seeds were grouped into 20, 25, and 27 subclusters (Suppl. Fig. S3), highlighting the potential of hyperspectral reflectance as a predictive factor for seed classification when employing machine learning classifiers.

Supervised learning based classification of KSCC seeds

To determine seed accessions using machine learning classifiers (Zhang et al. 2018; Zhu et al. 2021), we utilized 199 predictors, including morphological features such as seed weight, shape, and size (as shown in Figs. 2, 3, and 4), chlorophyll and anthocyanin pigment contetns, RGB triplets, and hyperspectra-derived features. Hyperspectral features comprised the averaged reflectance of each seed coat and hilum segmented by endmembers (Figs. 5 and 6), along with their log transformation (logR) and 1st derivatives (dR), as well as vegetation indices for anthocyanin and chlorophyll contents. For RGB and hyperspectral features, each band or wavelength was considered a feature. The response classes consisted of 420 KSCC accession seeds and the number of training samples matched the total number of seeds. A total of 19,933 or 19,934 seeds were randomly split into training and test samples–15,947 or 15,948 for training and 3,986 for testing–using four different random number generator algorithms, i.e., twister (mt19937ar)(Marsaglia and Tsang 2000), combRecursive (mrg32k3a) (L’Ecuyer 1999), multFibonacci (mlfg6331_64) (Mascagni and Srinivasan 2004) and philox (philox4x32_10) (Salmon et al. 2011).

The model’s accuracy in classifying seeds into their respective accession numbers on the training samples varied in the range of 9.7% to 95.8%, with the highest accuracies observed for LDA (95.8%) and its subspace LDA Ensemble (95.4%) (Fig. 7). Prediction accuracies using morphological and RGB features resulted in 24.7%, while hyperspectral features achieved 94.3%, suggesting that prediction accuracy is primarily related to hyperspectral features. Accuracies using logR and dR derivatives of averaged spectra had minimal impact on prediction matrices, obtaining 94.3% and 94.1% prediction accuracies, respectively. In the present study, principle component analysis (PCA)-dependent reduction of spectral bands did not significantly increase prediction accuracy, if at all. Metrics for LDA classification are shown in Table 1.

Table 1 Classification model performance assessment

Full size table

Major wavelengths determining seed classification

The extracted spectrum includes 183 wavelengths after filtering out both end bands. To identify the bands that are determining features for accurate prediction, bands were selected using Matlab’s built-in function selectBands, resulting in 1,664 bands from 420 seed accessions. Out of the 183 bands, the 46 bands exhibited a selection frequency of more than once, whereas the remaining 139 bands were not chosen at all (Fig. 8). This highlights that 25.1% of the wavelengths play a pivotal role in seed classification based on averaged reflectance spectra. Among the selected bands, three bands – no. 1 (corresponding to 427 nm), 100 (721 nm), and 181 (973 nm) – were found in every seed accession. Other bands were mostly located in three regions, centering around green (554 nm), red (672 nm), and NIR (823 nm).

Next, we used the reflectance values of the selected bands as predictors to test if the 46 features significantly contributed to seed classification. The LDA classifier, using the 46 selected features, resulted in an overall accuracy of 62.6% (± 13.08, n = 4), which is higher than those obtained by morphological and RGB features but lower than those of full spectral features.

GWAS analysis using hyperspectral features reveals genes regulating seed shape, size and pigmentation

We processed to assess whether features extracted from HSI analysis using SUnSet accurately represent KSCC accession seed traits. Therefore, GWAS analysis was conducted to extract and validate these endmember features associated with seed traits, followed by establishing their correlation with actual soybean traits (Fig. 9). In the present study, features obtained from the SUnSet toolbox include seed shape, represented by the ratio of thickness/width and length/width, seed coat and hilum colors grouped visually or by their RGB triplet codes and grayscale values obtained from their respective color groups, chlorophyll and anthocyanin pigment contents, endmembers abundance proportions, and spectral indices of vegetation (NDVI), chlorophyll (TCARI), and anthocyanin (ARI2) from both the ventral and lateral sides of the seed (Supplementary Table S1).

With an FDR threshold of < 0.05 and MAF 5%, GWAS analysis revealed significant SNP associations for 23 features used in the machine learning LDA analysis. Specifically, associations were identified for 191 SNPs with seed shape, 145 with anthocyanin contents, 5 with seed coat color category, 225 with seed coat RGB code and grayscale, 962 with seed coat and hilum abundant endmembers and their proportionality, and spectral indices, resulting in a total of 1,615 SNPs (Supplementary Table S2 and Table 2). Chromosome regions displaying significant associations with seed coat and hilum colors were identified on Chromosomes 1, 6, 8, 9, and 20 (Fig. 9, Supplementary Fig. S4 and Table 3), consistent with previous research findings (Sonah et al. 2015; Vuong et al. 2015; Zhou et al. 2015; Fang et al. 2017). Additionally, 13 additional regions related to seed color and 4 regions related to seed shape were newly identified on Chromosomes 4, 6, 7, 8, 14, 16, 17, 18, and 19, as well as Chromosomes 6, 7, 8, and 18, respectively. Notably, four regions simultaneously associated with both seed shape and color were discovered on Chromosomes 6 and 8.

Table 2 Summary of GWAS analysis results

Full size table

Consistently, among 229 genes associated with these SNPs (Supplementary Table S3), genes associated with the phenylpropanoid pathway are prevalent (Supplementary Table S4), suggesting a significant correlation between the SUnSet-derived input phenotype features and the identified genes. Furthermore, quantitative trait loci (QTLs) linked to seed coat and hilum traits are located on chromosome 8 in cultivated soybean recombinant inbred lines (RILs) and natural populations (Yang et al. 2023), emphasizing its relevance in this study.

Table 3 Chromosome regions displaying significant associations with seed color and shape traits either independently or collectively

Full size table

Discussion

The proposed SUnSet toolbox offers a comprehensive solution for hyperspectral imaging of seeds, covering the entire process from acquisition to analysis. Unlike previous approaches that predominantly rely on analyzing mixed pixels, this system is specifically based on a linear unmixing algorithm for hyperspectral imaging analysis. As a result, it has demonstrated high segmentation and classification accuracies for seeds across 420 different accessions. Furthermore, it has identified several candidate chromosome regions that are potentially associated with seed shape determination and the coloration of seed coat and hilum.

Seed size has mostly been classified based on the 100-seed weight (UPOV guideline), hinting that 2D RGB image analysis makes it hard to assess this 3D parameter. Contrary to this assumption, seed area, height and width on the vertical side quantified by SUnSet or ImageJ software resulted in high linear correlation to the seed size ($r^2$ = 0.92 ~ 0.98). However, $r^2$ value was lower when seed thickness were compared with seed weight ($r^2$ = 0.68). Thus, the ventral area, height and width of the seed seems highly likely to represent seed weight.

Seed coat or hilum color has been classified by trained visual inspection, which is quite qualitative and subject to variations as the apparent color is a result of the combination of genetic and environmental factors. So far, RGB codes have been assigned only to the representative soybean core collections (Baek et al. 2020; Kim et al. 2022), but those for every core collection accession have not been attempted on a whole-core collection scale. Averaged RGB triplets for seven seed coat colors obtained in the present study revealed medium and dark brown seed coats are statistically indistinguishable. However, averaged reflectance spectra of seeds, and their log and 1st derivative transformants clearly distinguish two colors due to differences in the green and red bands along with the NIR region (Fig. 6). Thus, RGB triplet codes obtained in the present study could be used as reference values for classifying RGB- or hyperspectral-images of soybean seeds specifically and other crop seeds with diverse genetic diversity in seed colors in general.

Spectral unmixing done in the present study yielded four target seed endmembers and their seed coat- and hilum-abundance maps (Fig. 5 and Suppl. Fig. 1), which is practically hard to obtain manually. By simply multiplying abundance maps and ventral areas, seed coat and hilum sizes are easily obtainable. Therefore, spectral unmixing could be a useful analysis tool to assess seed coat and hilum size. Of course, seeds with either seed coat and hilum are indistinguishable visually or striped- or dotted-coat seeds; this way of image segmentation is not applicable.

Morphological functions using grey or color thresholders are widely used in seed segmentation for both RGB and HSI, but seed segmentation accuracies in the present study are lower than 25%, hampering further image analysis. This lower accuracy was significantly enhanced up to 99.8% by the directional gradient of gray images that are targeted to the endmembers-selected foregrounds. Thus, spectral unmixing turns out to be a critical step for seed image segmentation and hence reliable extraction method of seed morphological, color, and spectral features. Indeed, morphological traits such as area, width, and height of seeds obtained based on the current study almost perfectly match those from ImageJ software.

The classification accuracy of KSCC seeds is best performed by LDA, a dimensionality reduction and classification technique commonly used in machine learning. LDA aims to reduce the dimensionality of the feature space while retaining as much class-separability information as possible. It has proven successful in discriminating between Fusarium damaged wheat kernels and undamaged ones (Delwiche et al. 2011). In the domain of seed classification using LDA, hyperspectral features have shown superior performance compared to morphological and RGB color features. This improved performance is likely attributed to the inclusion of features in the near-infrared (NIR) range, which are absent in RGB-based features. Indeed, NIR bands were among the 46 selected bands in addition to green and red bands (Fig. 8). Similarly, in the classification of red- and white-wheat kernels, reflectance primarily from O-H and C-H bands in the short-wave (SW) NIR range (700 - 1,100 nm) plays a crucial role in predicting the color class of grains (Archibald et al. 1999).

Spectral transformations and vegetation indices play a crucial role in mitigating the impact of brightness variations on ground-based spectral measurements. Previous studies have demonstrated the effectiveness of classification performance using first derivative and log-transformed datasets, as well as selected hyperspectral vegetation indices, in discriminating between seed and vegetation species (Polder et al. 2002; Gao et al. 2021). However, in the present study, despite utilizing log transformation and first derivative spectra for classification, the accuracy of LDA classification showed minimal improvement. Additionally, classification accuracies using features derived from chlorophyll and anthocyanin indices were even lower than those obtained using morphological features. This disparity could be attributed to the successful filtering out of brightness variations during the spectral unmixing process. Consequently, spectral unmixing of hyperspectral imaging proves to be effective for soybean core collection seed identification, with LDA serving as an efficient classification technique.

The integration of large-scale hyperspectral data into GWAS has emerged as a potent strategy for elucidating the genetic foundations of agricultural traits and expediting crop enhancement (Feng et al. 2017; Sun et al. 2019; Barnaby et al. 2020; Wu et al. 2021; Massahiro Yassue et al. 2023). In our present investigation, we reinforce this perspective by employing the SUnSet toolbox. As a result, the discerned broad-spectrum endmember features exhibit considerable efficacy in pinpointing valuable genetic loci in soybeans. This alignment with known anthocyanin Quantitative Trait Locus (QTL) regions on chromosome 8 underscores the significance of our findings and bolsters the effectiveness of the broad-spectrum endmember features identified in our study. Moreover, beyond the identification of genes related to phenylpropanoid metabolism, several candidate genes implicated in regulating seed size, shape, and colors including auxin signaling and cell expansion, carotenoid, and chlorophyll metabolisms are identified (Supplementary Tables S3 and S5).

Conclusion

We present an end-to-end software solution named SUnSet designed to process HSI of soybean seeds, with applicability to seeds from various plant species. The spectral unmixing step in the software effectively extracts hyperspectral reflectance data for the segmented seeds as a whole, as well as for seed parts such as testa and hilum. The software’s output includes morphological traits based on seeds, RGB color codes, and averaged reflectance data for each seed. The classification of seeds involves the utilization of extracted reflectance data from 420 KSCC accession seeds. Among various machine learning models compared, LDA demonstrated high accuracy in seed classification. Furthermore, the spectral curves of the seeds were meticulously analyzed, leading to the identification of wavelengths with significant importance and genes controlling complex seed trait.

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

Archibald D, Thai C, Dowell F (1999) Development of short-wavelength near-infrared spectral imaging for grain color classification, vol 3543
Baek J, Lee E, Kim N, Kim SL, Choi I, Ji H, Kim KH (2020) High throughput phenotyping for various traits on soybean seeds using image analysis. Sensors 20(1):248. https://doi.org/10.3390/s20010248
Article PubMed PubMed Central Google Scholar
Barnaby JY, Huggins TD, Lee H, McClung AM, Pinson SRM, Oh M, Edwards JD (2020) Vis/nir hyperspectral imaging distinguishes sub-population, production environment, and physicochemical grain properties in rice. Sci Rep 10:9284. https://doi.org/10.1038/s41598-020-65999-7
Article CAS PubMed PubMed Central Google Scholar
Becker BL, Lusch DP, Qi J (2005) Identifying optimal spectral bands from in situ measurements of great lakes coastal wetlands using second-derivative analysis. Remote Sens Environ 97(2):238–248. https://doi.org/10.1016/j.rse.2005.04.020
Article Google Scholar
Behmann J, Acebron K, Emin D, Bennertz S, Matsubara S, Thomas S, Rascher U (2018) Specim iq: evaluation of a new, miniaturized handheld hyperspectral camera and its application for plant phenotyping and disease detection. Sensors 18(2):441. https://doi.org/10.3390/s18020441
Article CAS PubMed PubMed Central Google Scholar
Boardman JW, Kruse FA, Green RO (1995) Mapping target signatures via partial unmixing of aviris data: in summaries
Bock C, Poole G, Parker P, Gottwald T (2010) Plant disease severity estimated visually, by digital photography and image analysis, and by hyperspectral imaging. Crit Rev Plant Sci 29(2):59–107. https://doi.org/10.1080/07352681003617285
Article Google Scholar
Chang CI (2000) An information-theoretic approach to spectral variability, similarity, and discrimination for hyperspectral image analysis. IEEE Trans Inf Theory 46(5):1927–1932. https://doi.org/10.1109/18.857802
Article Google Scholar
Chang CI, Du Q (2004) Estimation of number of spectrally distinct signal sources in hyperspectral imagery. IEEE Trans Geosci Remote Sens 42(3):608–619. https://doi.org/10.1109/TGRS.2003.819189
Article Google Scholar
Chen GH, Theriault-Lauzier P, Tang J, Nett B, Leng S, Zambelli J, Rowley H (2012) Time-resolved interventional cardiac c-arm cone-beam ct: an application of the piccs algorithm. IEEE Trans Med Imaging 31(4):907–923. https://doi.org/10.1109/TMI.2011.2172951
Article PubMed Google Scholar
Curran PJ, Windham WR, Gholz HL (1995) Exploring the relationship between reflectance red edge and chlorophyll concentration in slash pine leaves. Tree Physiol 15(3):203–206
Article CAS PubMed Google Scholar
Dawson T, Curran P (1998) Technical note a new technique for interpolating the reflectance red edge position. Int J Remote Sens 19(11):2133–2139
Article Google Scholar
Delwiche SR, Kim MS, Dong Y (2011) Fusarium damage assessment in wheat kernels by vis/nir hyperspectral imaging. Sens Instrum Food Qual Saf 5:63–71. https://doi.org/10.1007/s11694-011-9112-x
Article Google Scholar
Du Q, Yang H (2008) Similarity-based unsupervised band selection for hyperspectral image analysis. IEEE Geosci Remote Sens Lett 5(4):564–568. https://doi.org/10.1109/LGRS.2008.2000619
Article Google Scholar
Dumont J, Hirvonen T, Heikkinen V, Mistretta M, Granlund L, Himanen K, Keinänen M (2015) Thermal and hyperspectral imaging for norway spruce (picea abies) seeds screening. Comput Electron Agric 116:118–124. https://doi.org/10.1016/j.compag.2015.06.010
Article Google Scholar
Fang C, Ma Y, Wu S, Liu Z, Wang Z, Yang R, Tian Z (2017) Genome-wide association studies dissect the genetic networks underlying agronomical traits in soybean. Genome Biol 18:161
Article PubMed PubMed Central Google Scholar
Feng H, Guo Z, Yang W, Huang C, Chen G, Fang W, Liu Q (2017) An integrated hyperspectral imaging and genome-wide association analysis platform provides spectral and genetic insights into the natural variation in rice. Scie Rep. https://doi.org/10.1038/s41598-017-04668-8
Article Google Scholar
Frankel O, Frankel A (1984) Plant genetic resources today: a critical appraisal. In: Crop genetic resources: conservation and evaluation, pp 249-257
Gagliardi B, Marcos-Filho J (2011) Relationship between germination and bell pepper seed structure assessed by the x-ray test. Sci Agric 68(4):411–416. https://doi.org/10.1590/S0103-90162011000400004
Article Google Scholar
Gao T, Chandran AKN, Paul P, Walia H, Yu H (2021) Hyperseed: an end-to-end method to process hyperspectral images of seeds. Sensors 21(24):8184. https://doi.org/10.3390/s21248184
Article PubMed PubMed Central Google Scholar
Gitelson A, Merzlyak M, Chivkunova O (2001) Optical properties and nondestructive estimation of anthocyanin content in plant leaves. Photochem Photobiol 71:38–45
Article Google Scholar
Gitelson AA, Gritz Y, Merzlyak MN (2003) Relationships between leaf chlorophyll content and spectral reflectance and algorithms for non-destructive chlorophyll assessment in higher plant leaves. J Plant Physiol 160(3):271–282
Article CAS PubMed Google Scholar
Gomes-Junior F, Yagushi J, Belini U, Cicero S, Filho M (2012) X-ray densitometry to assess internal seed morphology and quality. Seed Sci Technol 40:102–107. https://doi.org/10.15258/sst.2012.40.1.11
Article Google Scholar
Haboudane D, Miller JR, Pattey E, Zarco-Tejada PJ, Strachan IB (2004) Hyperspectral vegetation indices and novel algorithms for predicting green lai of crop canopies: modeling and validation in the context of precision agriculture. Remote Sens Environ 90(3):337–352. https://doi.org/10.1016/j.rse.2003.12.013
Article Google Scholar
Jeong N, Kim KS, Jeong S, Kim JY, Park SK, Lee JS, Choi MS (2019) Korean soybean core collection: genotypic and phenotypic diversity population structure and genome-wide association study. PLoS ONE 14(10):1–16. https://doi.org/10.1371/journal.pone.0224074
Article CAS Google Scholar
Jo H, Ha BK, Park SK, Jeong SC, Lee JD, Moon JK (2023) Genetic diversity of Korean wild soybean core collections and genome-wide association study for days to flowering. Plants 12(6):1305. https://doi.org/10.3390/plants12061305
Article CAS PubMed PubMed Central Google Scholar
Kanopoulos N, Vasanthavada N, Baker R (1988) Design of an image edge detection filter using the sobel operator. IEEE J Solid-State Circuits 23(2):358–367
Article Google Scholar
Kaznowska E, Depciuch J, Łach K, Kołodziej M, Koziorowska A, Vongsvivut J, Cebulski J (2018) The classification of lung cancers and their degree of malignancy by ftir, pca-lda analysis, and a physics-based computational model. Talanta 186:337–345. https://doi.org/10.1016/j.talanta.2018.04.083
Article CAS PubMed Google Scholar
Kim SH, Subramanian P, Hahn BS, Ha BK (2022) High-throughput phenotypic characterization and diversity analysis of soybean roots (glycine max L.). Plants. https://doi.org/10.3390/plants11152017
Article PubMed PubMed Central Google Scholar
Krause MR, González-Pérez L, Crossa J, Pérez-Rodríguez P, Montesinos-López O, Singh RP, Mondal S (2019) Hyperspectral reflectance-derived relationship matrices for genomic prediction of grain yield in wheat. G3 Genes Genomes Genet 9(4):1231–1247. https://doi.org/10.1534/g3.118.200856
Article Google Scholar
Kruse F, Lefkoff A, Boardman J, Heidebrecht K, Shapiro A, Barloon P, Goetz A (1993) The spectral image processing system (sips)-interactive visualization and analysis of imaging spectrometer data. Remote Sens Environ 44(2):145–163. https://doi.org/10.1016/0034-4257(93)90013-N
Article Google Scholar
Lee Y, Jeong N, Kim J, Lee K, Kim K, Pirani A, Jeong S (2015) Development, validation and genetic analysis of a large soybean snp genotyping array. Plant J 81:625–36. https://doi.org/10.1111/tpj.12755
Article CAS PubMed Google Scholar
Li H, Chen T, Jia L, Wang Z, Li J, Wang Y, Wang Y (2023) Soybeangdb: a comprehensive genomic and bioinformatic platform for soybean genetics and genomics. Comput Struct Biotechnol J 21:3327–3338. https://doi.org/10.1016/j.csbj.2023.06.012
Article CAS PubMed PubMed Central Google Scholar
Lipka A, Tian F, Wang Q, Peiffer J, Li M, Bradbury P, Zhang Z (2012) Gapit: genome association and prediction integrated tool. Bioinformatics 28:2397–2399. https://doi.org/10.1093/bioinformatics/bts444
Article CAS PubMed Google Scholar
Liu C, Huang W, Yang G, Wang Q, Li J, Chen L (2020) Determination of starch content in single kernel using near-infrared hyperspectral images from two sides of corn seeds. Infrared Phys Technol 110:103462. https://doi.org/10.1016/j.infrared.2020.103462
Article CAS Google Scholar
Liu L, Ngadi M, Prasher S, Gariépy C (2010) Categorization of pork quality using gabor filter-based hyperspectral imaging technology. J Food Eng 99(3):284–293
Article Google Scholar
L’Ecuyer P (1999) Good parameter sets for combined multiple recursive random number generators. Oper Res 47:159–164
Article Google Scholar
Mahesh S, Manickavasagan A, Jayas D, Paliwal J, White N (2008) Feasibility of near-infrared hyperspectral imaging to differentiate Canadian wheat classes. Biosyst Eng 101(1):50–57
Article Google Scholar
Marsaglia G, Tsang W (2000) The ziggurat method for generating random variables. J Stat Softw 5:1–7
Article Google Scholar
Mascagni M, Srinivasan A (2004) Parameterizing parallel multiplicative lagged-fibonacci generators. Parallel Comput 30:899–916
Article Google Scholar
Massahiro Yassue R, Galli G, James Chen CP, Fritsche-Neto R, Morota G (2023) Genome-wide association analysis of hyperspectral reflectance data to dissect the genetic architecture of growth-related traits in maize under plant growth-promoting bacteria inoculation. Plant Direct 7(4):e492. https://doi.org/10.1002/pld3.492
Article CAS PubMed PubMed Central Google Scholar
Nair R, Yan M, Vemula A, Rathore A, van Zonneveld M, Schafleitner R (2023) Development of core collections in soybean on the basis of seed size. Legume Sci 5(1):e158. https://doi.org/10.1002/leg3.158
Article CAS Google Scholar
Neff M, Chory J (1998) Genetic interactions between phytochrome a, phytochrome b, and cryptochrome 1 during arabidopsis development. Plant Physiol 118:27–35
Article CAS PubMed PubMed Central Google Scholar
Nidamanuri R, Zbell B (2011) Normalized spectral similarity score (ns3) as an efficient spectral library searching method for hyperspectral image classification. Select Top Appl Earth Observ Remote Sens IEEE J 4:226–240. https://doi.org/10.1109/JSTARS.2010.2086435
Article Google Scholar
Oliveira MF, Nelson RL, Geraldi IO, Cruz CD, de Toledo JFF (2010) Establishing a soybean germplasm core collection. Field Crops Res 119(2):277–289
Article Google Scholar
Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9:62–66. https://doi.org/10.1109/TSMC.1979.4310076
Article Google Scholar
Polder G, van der Heijden G, Young I (2002) Spectral image analysis for measuring ripeness of tomatoes. Trans Am Soc Agric Eng. https://doi.org/10.13031/2013.9924
Porra R, Thompson W, Kriedmann P (1989) Determination of accurate extinction coefficients and simultaneous equations for assaying chlorophylls a and b extracted with four different solvents: verification of the concentration of chlorophyll standards by atomic absorption spectroscopy. Biochem Biophys Acta 975:384–394
CAS Google Scholar
Prasad A, Gnanappazham L (2014) Species discrimination of mangroves using derivative spectral analysis. ISPRS Ann Photogramm Remote Sens Spat Inf Sci II 8:45–52. https://doi.org/10.5194/isprsannals-II-8-45-2014
Article Google Scholar
Pu P, Gong R (2011) Hyperspectral remote sensing of vegetation bioparameters. Advances in environmental remote sensing: Sensors, algorithms, and applications. CRC Press, Boca Raton, pp 101–142. https://doi.org/10.1201/b10599
Chapter Google Scholar
Qin J, Vasefi F, Hellberg RS, Akhbardeh A, Isaacs RB, Yilmaz AG, Kim MS (2020) Detection of fish fillet substitution and mislabeling using multimode hyperspectral imaging techniques. Food Control 114:107234
Article CAS Google Scholar
Qiu Z, Chen J, Zhao Y, Zhu S, He Y, Zhang C (2018) Variety identification of single rice seed using hyperspectral imaging combined with convolutional neural network. Appl Sci 8(2):212. https://doi.org/10.3390/app8020212
Article Google Scholar
Raschka S, Mirjalili V (2017) Python machine learning (second). Birmingham b3 2pb, UK. Packt Publishing Ltd. Retrieved from https://www.packtpub.com
Rehman TU, Mahmud MS, Chang YK, Jin J, Shin J (2019) Current and future applications of statistical machine learning algorithms for agricultural machine vision systems. Comput Electron Agric 156:585–605. https://doi.org/10.1016/j.compag.2018.12.006
Article Google Scholar
Saha D, Manickavasagan A (2021) Machine learning techniques for analysis of hyperspectral images to determine quality of food products: a review. Curr Res Food Sci 4:28–44. https://doi.org/10.1016/j.crfs.2021.01.002
Article CAS PubMed PubMed Central Google Scholar
Salmon JK, Moraes MA, Dror RO, Shaw DE (2011) Parallel random numbers: as easy as 1, 2, 3
Sandhu K, Patil SS, Pumphrey M, Carter A (2021) Multitrait machine- and deep-learning models for genomic selection using spectral information in a wheat breeding program. Plant Genome 14(3):e20119. https://doi.org/10.1002/tpg2.20119
Article CAS PubMed Google Scholar
Sanjeevi S, Barnsley M (2000) Spectral unmixing of compact airborne spectrographic imager (casi) data for quantifying sub-pixel proportions of biophysical parameters in a coastal dune system. J Indian Soc Remote Sens 28:187–204. https://doi.org/10.1007/BF02989903
Article Google Scholar
Schnell J (1974) Monitoring the vernal advancement and retrogradation (greenwave effect) of natural vegetation. Nasa/gsfct Type Final Report
Siqueira LF, Araújo Júnior RF, de Araújo AA, Morais CL, Lima KM (2017) Lda vs. qda for ft-mir prostate cancer tissue classification. Chemometr Intell Lab Syst 162:123–129. https://doi.org/10.1016/j.chemolab.2017.01.021
Article CAS Google Scholar
Somers B, Asner GP, Tits L, Coppin P (2011) Endmember variability in spectral mixture analysis: a review. Remote Sens Environ 115(7):1603–1616. https://doi.org/10.1016/j.rse.2011.03.003
Article Google Scholar
Sonah H, O’Donoughue L, Cober E, Rajcan I, Belzile F (2015) Identification of loci governing eight agronomic traits using a gbs-gwas approach and validation by qtl mapping in soya bean. Plant Biotechnol J 13(2):211–221
Article CAS PubMed Google Scholar
Sun D, Cen H, Weng H, Wan L, Abdalla A, El-Manawy AI, He Y (2019) Using hyperspectral analysis as a potential high throughput phenotyping tool in gwas for protein content of rice quality. Plant Methods. https://doi.org/10.1186/s13007-019-0432-x
Article PubMed PubMed Central Google Scholar
Tanabata T, Shibaya T, Hori K, Ebana K, Yano M (2012) Smartgrain: high-throughput phenotyping software for measuring seed shape through image analysis. Plant Physiol 160(4):1871–1880. https://doi.org/10.1104/pp.112.205120
Article CAS PubMed PubMed Central Google Scholar
TeKrony D, Egli D (1991) Relationship of seed vigor to crop yield: a review. Crop Sci 31:816–822
Article Google Scholar
van den Boomgaard R, van Balen R (1992) Methods for fast morphological image transforms using bitmapped binary images. CVGIP Graphi Models Image Process 54(3):252–258. https://doi.org/10.1016/1049-9652(92)90055-3
Article Google Scholar
Van Der Meer F (2004) Analysis of spectral absorption features in hyperspectral imagery. Int J Appl Earth Obs Geoinf 5(1):55–68
Google Scholar
Vuong TD, Sonah H, Meinhardt CG, Deshmukh R, Kadam S, Nelson RL, Nguyen HT (2015) Genetic architecture of cyst nematode resistance revealed by genome-wide association study in soybean. BMC Genom 16:593
Article CAS Google Scholar
Wu X, Feng H, Wu D, Yan S, Zhang P, Wang W, Dai M (2021) Using high-throughput multiple optical phenotyping to decipher the genetic architecture of maize drought tolerance. Genome Biol 22:185. https://doi.org/10.1186/s13059-021-02377-0
Article CAS PubMed PubMed Central Google Scholar
Yang X, Hong H, You Z, Cheng F (2015) Spectral and image integrated analysis of hyperspectral data for waxy corn seed variety classification. Sensors 15(7):15578–15594. https://doi.org/10.3390/s150715578
Article PubMed PubMed Central Google Scholar
Yang Y, Zhao T, Wang F, Liu L, Liu B, Zhang K, Qiao Y (2023) Identification of candidate genes for soybean seed coat-related traits using qtl mapping and gwas. Front Plant Sci. https://doi.org/10.3389/fpls.2023.1190503
Article PubMed PubMed Central Google Scholar
Zhang J, Rivard B, Sánchez-Azofeifa A, Castro-Esau K (2006) Intra- and inter-class spectral variability of tropical tree species at la selva, costa rica: Implications for species identification using hydice imagery. Remote Sens Environ 105(2):129–141. https://doi.org/10.1016/j.rse.2006.06.010
Article Google Scholar
Zhang T, Wei W, Zhao B, Wang R, Mingliu L, Yang L, Sun Q (2018) A reliable methodology for determining seed viability by using hyperspectral data from two sides of wheat seeds. Sensors 18:813. https://doi.org/10.3390/s18030813
Article CAS PubMed PubMed Central Google Scholar
Zhou Z, Jiang Y, Wang Z, Gou Z, Lyu J, Li W, Tian Z (2015) Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean. Nat Biotechnol 33:408–414
Article CAS PubMed Google Scholar
Zhu F, Paul P, Hussain W, Wallman K, Dhatt BK, Sandhu J, Walia H (2021) Seedextractor: an open-source gui for seed image analysis. Front Plant Sci. https://doi.org/10.3389/fpls.2020.581546
Article PubMed PubMed Central Google Scholar

Download references

Funding

This research was supported by NIAS Research Program for Agricultural Science and Technology Development (RS-2023-00223126), RDA and in part by KRIBB Research Initiative Program (KGM1002311), Korea.

Author information

Seok Won Jeong and Jae Il Lyu have contributed equally to this work.

Authors and Affiliations

Biological Sciences, Chungnam National University, 99 Daehagro, Youseong, Daejon, 34134, Korea
Seok Won Jeong & Youn-Il Park
Gene Engineering Division, National Institute of Agricultural Sciences, 370 Nongsaengmyeongro, Jeonju, Jeollabuk-do, 54874, Korea
Jae Il Lyu, HwangWeon Jeong, Jeongho Baek & Kyoung-Hwan Kim
Crop Foundation Research Division, National Institute of Crop Sciences, 181 Hyeoksinro, Wanju, Jeollabuk-do, 55365, Korea
Jung-Kyung Moon
Crop Cultivation and Environment Research Division, National Institute of Crop Sciences, 54 Seohoro, Suwon, Kyounggi-do, 16613, Korea
Chaewon Lee
Wheat Research Team, National Institute of Crop Sciences, RDA, 181 Hyeoksinro, Wanju, Jeollabuk-do, 55365, Korea
Myoung-Goo Choi

Authors

Seok Won Jeong
View author publications
You can also search for this author in PubMed Google Scholar
Jae Il Lyu
View author publications
You can also search for this author in PubMed Google Scholar
HwangWeon Jeong
View author publications
You can also search for this author in PubMed Google Scholar
Jeongho Baek
View author publications
You can also search for this author in PubMed Google Scholar
Jung-Kyung Moon
View author publications
You can also search for this author in PubMed Google Scholar
Chaewon Lee
View author publications
You can also search for this author in PubMed Google Scholar
Myoung-Goo Choi
View author publications
You can also search for this author in PubMed Google Scholar
Kyoung-Hwan Kim
View author publications
You can also search for this author in PubMed Google Scholar
Youn-Il Park
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

YIP and KHK: designed the project; SWJ, JIL, and CL: performed the experiments; HWJ, JHB, JKM, and MGC: contributed to data analysis; YIP and JIL: wrote the manuscript. All authors reviewed and approved the manuscript. SWJ and JIL contributed equally to this work.

Corresponding author

Correspondence to Youn-Il Park.

Ethics declarations

Conflict of interest

The authors declare no competing financial interest.

Additional information

Communicated by Byeong-ha Lee.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 2181 KB)

Supplementary file 2 (xlsx 458 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Jeong, S.W., Lyu, J.I., Jeong, H. et al. SUnSeT: spectral unmixing of hyperspectral images for phenotyping soybean seed traits. Plant Cell Rep 43, 164 (2024). https://doi.org/10.1007/s00299-024-03249-0

Download citation

Received: 03 April 2024
Accepted: 06 May 2024
Published: 09 June 2024
DOI: https://doi.org/10.1007/s00299-024-03249-0

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

SUnSeT: spectral unmixing of hyperspectral images for phenotyping soybean seed traits

Abstract

Key message

Abstract

Similar content being viewed by others

Hyperspectral Image-Based Variety Discrimination of Maize Seeds by Using a Multi-Model Strategy Coupled with Unsupervised Joint Skewness-Based Wavelength Selection Algorithm

Experimental data manipulations to assess performance of hyperspectral classification models of crop seeds and other objects

Hyperspectral imaging for seed quality and safety inspection: a review

Introduction

Materials and methods

Workflow

Sample preparation and HSI acquisition

Spectral unmixing for identifying KSCC seed specific endmembers

Seed segmentation

RGB triplet and hyperspectral data extraction

Data transformation

Seed classification

Feature importance analysis

GWAS analysis

Anthocyanin and chlorophyll content

Results

Classification of morphological and color traits of seeds using manual inspection

Seed endmembers extraction and seed segmentation

Correlation between seed matrices acquired through SUnSet and those obtained manually

Spectral analysis

Supervised learning based classification of KSCC seeds

Major wavelengths determining seed classification

GWAS analysis using hyperspectral features reveals genes regulating seed shape, size and pigmentation

Discussion

Conclusion

Data availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher’s Note

Supplementary Information

Supplementary file 1 (pdf 2181 KB)

Supplementary file 2 (xlsx 458 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation