Introduction

Leek (Allium porrum L. or Allium ampeloprasum var. porrum) is an important vegetable crop cultivated outdoors all over the world [1,2,3]. Leek is a biennial herb related to onion and garlic, and it is commonly cultivated as an annual crop [4]. The edible parts of leek are the leaves and the bulb. The inflorescence shoot can reach a height of 200 cm. The two colors of the stem (a white shaft with a milder flavor and a green shaft with a spicy taste) are related to the presence of different amounts of essential oils [5]. Leek is rich in methyl furan, pentanol, glucosinolates, polysaccharides [6], folic acid, nicotinic acid, lutein, zeaxanthin, vegetable protein, fats, fiber, sulfide oil, minerals (e.g., calcium, zinc, phosphorus, potassium, magnesium, iron, copper, manganese, and sodium), and vitamins such as A, B, C, E, and K [5]. As an Allium species, leek can be a rich source of secondary metabolites, e.g., polyphenolic compounds including flavonoids, phenolic acids, and flavonoid polymers with health benefits. The health-promoting properties of Allium species are associated with organosulfur compounds responsible for the characteristic aroma, taste, and lachrymatory effects [3]. Leek has anticancer, antifungal, and antibacterial effects [7]. The consumption of Allium vegetables can reduce the risk of colorectal, prostate, breast, or stomach cancer [8]. Due to the presence of various bioactive substances, these vegetables have antioxidant properties [9, 10]. Owing to its nutritional and medicinal value, leek is both a culinary and a medicinal vegetable [11]. Leek can impart a slight spiciness to a dish and improve its taste. It is a flavor enhancer used in meal preparations and ready-to-heat products. It can be used as a tissue-based system, such as cut leek, or as a disrupted system, such as mixed puree-like systems or soups. Leek can be added to bread, pasta, fermented sausages, and traditional Greek sausages. Dried leek can be added to salads, sauces, soups, meat dishes, and casseroles, and can be an alternative to fresh vegetables [12, 13].

Leek is characterized by phenotypic and genetic diversity. Cultivars vary in leaf type, leaf color, and stem thickness, which can result in different plant morphology and productivity [14]. The growth, yield, and seed characteristics of leek can be affected by self- and cross-pollination. The properties of leek seeds can also depend on the genotype [1]. Allium seeds are black, with a rhomboidal or spheroidal shape [15]. Seeds of different cultivars can have different properties. Seed quality depends on cultivar purity and distinctness, as well as on seed physiological characteristics. Batch purity of high-quality seed cultivars can be essential in marketed species. Various seed quality control methods are useful for cultivar discrimination. Popular sensitive tests, e.g., DNA-based genotyping, can be destructive, labor-intensive, complex, and expensive. Therefore, quality evaluation can be performed only for seeds randomly selected from a batch. In contrast, spectroscopy is considered a non-destructive and high-throughput technique for seed evaluation [16], and the combination of spectroscopy and chemometric methods may be a promising approach to seed cultivar classification [17]. Furthermore, procedures combining spectroscopic data with machine learning methods have been successfully used for seed quality classification [18].

Fluorescence spectroscopy in the food industry is widely used for quantitative analysis. It is sensitive and specific enough to detect even small concentrations of the compounds [19, 20]. Through it, for example, changes in the structures of proteins, carbohydrates, and lipids in oils can be detected. This is useful for verifying the authenticity of food products [21, 22]. Advances in fiber-optic technology offer outstanding opportunities for the development of a wide range of highly sensitive fiber-optic sensors in many new application areas. Fiber-optic components are successfully adapted to assemblies with micro-optic elements such as lenses, mirrors, prisms, gratings, and others [23, 24]. Fluorescence spectroscopy in agricultural sciences is applied to the analysis of tomatoes [25] and cereals [26]. Their characterization through this technique is performed by grouping objects with similar characteristics to establish methods related to their classification. Until now, the principles of modern optoelectronics have not been used to analyze leek planting material. In the last few years, the demand for high-quality varieties and hybrids of leeks has increased significantly. Therefore, it is important to use non-destructive methods for quality monitoring of leek planting material such as fluorescence spectroscopy [27].

The objective of this study was to distinguish leek seed cultivars using an innovative approach combining fluorescence spectroscopy and machine learning. The application of traditional machine learning algorithms from different groups for the development of models based on selected spectroscopic data to classify leek seed varieties and breeding lines was a great novelty of the present study.

The contributions and prominent features of this manuscript are indicated as follows:

  • The application of non-destructive fluorescence spectroscopy for distinguishing leek seed variety and breeding lines.

  • Using traditional machine learning algorithms for the classification.

  • The development of successful classification models for distinguishing leek seed samples.

  • Obtaining high classification accuracies.

Materials and methods

Materials

The samples studied are seeds of Starozagorski kamush, breeding line 4, and breeding line 39. Starozagorski kamush is a Bulgarian variety grown throughout the country. It is distinguished by its longer, thin and delicate cylindrical false stem reaching up to 70 cm in height. The leaves are narrow, long, light green and upright. Breeding line 4 was created by the inbreeding method in a population of the variety Starozagorski 72. It is characterized by a longer false stem of 100 cm. The leaves are narrow, long, light green and upright. Breeding line 39 was created by breeding offspring of the Starozagorski 72 variety and has a longer false stem of 90 cm. The leaves are narrow, long, dark green and upright.

The seeds were produced at Maritsa Vegetable Crops Research Institute. After the leeks are removed at the beginning of November, the cuttings are cleaned by variety, and plants typical of the variety are selected. They are then planted in the field in mid-November to ensure good rooting. The cuttings are cut at a height of 25 cm and planted in furrows according to the scheme 70/15 for one plant, or 70/30 for three plants in a nest, and completely covered with soil. During the growing season, the crop is fed and watered, and its phytosanitary status is monitored. Plants develop a single flower stalk. They bloom in July and ripen in September. The seeds are threshed by machine and then cleaned, washed and dried. Drying is carried out in dryers at a temperature of 25–30°C. 20–30 kg/day of seeds are obtained per hectare.

Fluorescence spectroscopy

The study was performed with a fiber-optic spectrometer, which allows the recording of fluorescent emission signals from 200 to 1200 nm. The apparatus is used for performing fluorescence spectroscopy of solid samples at a photosensitive area of 1.9968 × 1.9968 mm. The experimental setup includes a laser diode (emission wavelength 285 nm, optical power 16 mW, DC) and a portable spectrometer, model AvaSpec-ULS2048CL-EVO, which is designed for field measurements. The device allows the detection of fluorescent emission signals in any environment regardless of its illumination. By configuring the spectrometer in AvaSoft8, the light signals from the working environment are eliminated and only the useful emission signal of the investigated sample remains. Because of this advantage, the AvaSpec-ULS2048CL-EVO places no requirements on the environment or the illumination level in which fluorescence measurements are conducted. The sample is placed on a duralumin stand, which allows a U-shaped optical fiber to receive the emission signal from it at an angle of less than 180°. This reduces aberrations and yields a better-quality fluorescence emission signal (Fig. 1). The resolution of the spectrometer can be in the range of 0.06–20 nm, and the setting used for our experiment is 0.06 nm. Since the fluorescence is often very weak and is emitted in all directions, and in order not to saturate the receiver, the useful fluorescence signal is measured in a direction at less than 180° to the excitation radiation. A laser diode is preferable as a source in the circuit, as its spectral width is very small. The LED used in the experiment has a relatively wide spectral width of about 30–40 nm, and its radiation is distributed over a large angular range of ±30°. The sensitivity of the spectrometer is in the range of 200 nm to 1200 nm. Its resolution is δλ = 5 nm.
The fluorescence-based spectral installation makes it possible to record both the emission spectrum and the spectrum of the excitation source. The emission spectrum is the wavelength distribution of the emission measured at a constant excitation wavelength. The excitation spectrum is the dependence of the emission intensity, measured at a single emission wavelength, on the scanned excitation wavelength. Each spectrum is represented as the dependence of the light intensity incident on the photodetector in the spectrometer on the wavelength of the light.

Fig. 1 General view of the experimental installation used for fluorescence spectroscopy

For the specific circuit, the photodetector is of the CMOS type model S9132. Its sensitivity is in the range of 200–1200 nm. Its resolution is δλ = 5 nm. S9132 was chosen because it can detect emission radiation from a sample of garlic with a very low loss of water content due to a false stem grown from a vegetative bud of very short length.

The laser radiation is emitted by the source and falls on the sample. The samples represent 5 g of planting material from five different packets containing seeds of Starozagorski kamush, breeding line 4 and breeding line 39, placed on an area with a radius of 1 cm at a distance of 2 cm from the optical fiber. After the sample fluoresces, the emission signal falls on a U-shaped step-index optical fiber with a core diameter of 200 µm and a numerical aperture of 0.22. The same U-shaped fiber was used to detect all emission signals from the planting material samples and to lead the signal to the detector. In the spectrometer, the light signal is converted to a digital electrical signal, transferred via a USB 2.0 cable to a computer running AvaSoft8 software, and exported to Excel. This allows analysis, processing and visualization of the results of the study.
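The collection geometry described above can be checked with a short calculation. The following Python sketch estimates the acceptance half-angle of a fiber with a numerical aperture of 0.22 and the radius of the area it views at a distance of 2 cm. It assumes that the fiber end faces the sample head-on and that the medium between them is air (refractive index 1.0); neither assumption is stated in the text, and the function names are illustrative only.

```python
import math

def acceptance_half_angle_deg(numerical_aperture, n_medium=1.0):
    """Half-angle of the acceptance cone, from NA = n * sin(theta)."""
    return math.degrees(math.asin(numerical_aperture / n_medium))

def viewed_radius_cm(numerical_aperture, distance_cm,
                     core_diameter_um=200.0, n_medium=1.0):
    """Radius of the circle on the sample plane that lies inside the cone.

    Assumes the fiber axis is perpendicular to the sample plane
    (an assumption of this sketch, not a detail given in the text).
    """
    theta = math.asin(numerical_aperture / n_medium)
    core_radius_cm = (core_diameter_um / 2.0) * 1e-4  # micrometres to cm
    return core_radius_cm + distance_cm * math.tan(theta)

# Values taken from the text: NA = 0.22, fiber-to-sample distance = 2 cm
half_angle_deg = acceptance_half_angle_deg(0.22)
spot_radius_cm = viewed_radius_cm(0.22, 2.0)

print(f"acceptance half-angle: {half_angle_deg:.1f} degrees")
print(f"viewed radius at 2 cm: {spot_radius_cm:.2f} cm")
```

Under these assumptions the half-angle is close to 13° and the viewed radius is a little under 0.5 cm, so the fiber would collect light from within the 1 cm sample area rather than from its surroundings.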

Five replicates of emission fluorescence signal detection were performed for each seed type: Starozagorski kamush, breeding line 4 and breeding line 39. There are five graphs each of Starozagorski kamush, breeding line 4 and breeding line 39 (Fig. 2). A difference in the emission fluorescence signal of Starozagorski kamush and breeding line 4 as well as breeding line 39 is clearly observed. The spectral wavelength shift and signal intensity level are due to a difference in the content of biologically active substances of a specific variety.

Fig. 2 Difference in emission wavelength for Starozagorski kamush, breeding line 4 and breeding line 39
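The wavelength shift and intensity difference between samples can be quantified from the exported spectra by locating the emission maximum in each replicate and averaging over replicates. The Python sketch below illustrates one possible way to do this. The spectra are synthetic Gaussian bands with invented peak positions and heights; they are not the measured data, and the sample names `sample_a` and `sample_b` are placeholders.

```python
import math

def emission_peak(wavelengths_nm, intensities):
    """Return (peak wavelength, peak intensity) of one emission spectrum."""
    idx = max(range(len(intensities)), key=intensities.__getitem__)
    return wavelengths_nm[idx], intensities[idx]

def mean_peak(replicates):
    """Average the peak position and peak height over replicate spectra."""
    peaks = [emission_peak(wl, inten) for wl, inten in replicates]
    n = len(peaks)
    return sum(p[0] for p in peaks) / n, sum(p[1] for p in peaks) / n

def synthetic_spectrum(center_nm, height, width_nm=20.0):
    """Hypothetical Gaussian band sampled every 1 nm from 300 to 600 nm."""
    wl = [float(x) for x in range(300, 601)]
    inten = [height * math.exp(-((x - center_nm) ** 2) / (2 * width_nm ** 2))
             for x in wl]
    return wl, inten

# Five synthetic replicates per sample (illustrative values only)
sample_a = [synthetic_spectrum(440 + d, 1000 + 10 * d) for d in range(5)]
sample_b = [synthetic_spectrum(455 + d, 800 + 10 * d) for d in range(5)]

peak_a = mean_peak(sample_a)
peak_b = mean_peak(sample_b)
shift_nm = peak_b[0] - peak_a[0]          # wavelength shift between samples
intensity_ratio = peak_b[1] / peak_a[1]   # relative signal level

print(f"mean peak A: {peak_a[0]:.1f} nm, B: {peak_b[0]:.1f} nm, "
      f"shift: {shift_nm:.1f} nm, intensity ratio: {intensity_ratio:.2f}")
```

With real exported spectra, the two lists passed to `emission_peak` would be the wavelength and intensity columns of one replicate.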

Leek seed classification

The leek seeds belonging to the variety Starozagorski kamush, breeding line 4 and breeding line 39 were classified using the WEKA application (Machine Learning Group, University of Waikato, Hamilton, New Zealand) [28,29,30] based on the fluorescence spectroscopic data. The classification models were built for all three classes together and for pairwise comparisons, namely Starozagorski kamush vs. breeding line 4, Starozagorski kamush vs. breeding line 39, and breeding line 4 vs. breeding line 39. The applied procedure is presented in Fig. 3.

Fig. 3 The procedure applied to classify leek seed variety and breeding lines based on fluorescence spectroscopic data using machine learning algorithms

For each dataset, attribute selection using the Best First search method with the CFS (correlation-based feature selection) evaluator was carried out to choose the spectroscopic data most useful for distinguishing leek seed samples. The models were built based on the selected data using tenfold cross-validation and the following traditional machine learning algorithms: PART (group of Rules), Logistic (group of Functions), Naive Bayes (group of Bayes), Random Forest (group of Trees), IBk (group of Lazy), and Filtered Classifier (group of Meta). The following main parameters of the algorithms were used:

PART-confidenceFactor: 0.25; doNotCheckCapabilities: False; debug: False; batchSize: 100; unpruned: False; minNumObj: 2; numFolds: 3; useMDLcorrection: True; seed: 1,

Logistic-debug: False; doNotCheckCapabilities: False; batchSize: 100; useConjugateGradientDescent: False; ridge: 1.0E-8,

Naive Bayes-debug: False; doNotCheckCapabilities: False; batchSize: 100; displayModelInOldFormat: False; useSupervisedDiscretization: False; useKernelEstimator: False,

Random Forest-doNotCheckCapabilities: False; batchSize: 100; breakTiesRandomly: False; debug: False; numIterations: 100; numExecutionSlots: 1; seed: 1,

IBk-KNN: 1; doNotCheckCapabilities: False; batchSize: 100; nearestNeighbourSearchAlgorithm: LinearNNSearch, distanceFunction: EuclideanDistance -R first-last; debug: False; meanSquared: False; windowSize: 0,

Filtered Classifier-batchSize: 100; classifier: J48 -C 0.25 -M 2; doNotCheckCapabilities: False; debug: False; filter: Discretize -R first-last -precision 6; seed: 1.
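To illustrate how one of these configurations behaves, the following Python sketch implements a nearest-neighbour classifier with one neighbour and Euclidean distance (the settings listed for IBk) and evaluates it with tenfold cross-validation. It is a plain re-implementation for illustration and not the WEKA code: the fold assignment is a simple deterministic, class-balanced scheme without the seeded shuffling that WEKA applies, and the two features per sample are invented, well-separated values rather than selected spectroscopic attributes.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train_X, train_y, x):
    """Label of the single nearest training sample (one neighbour)."""
    nearest = min(range(len(train_X)),
                  key=lambda i: euclidean(train_X[i], x))
    return train_y[nearest]

def balanced_folds(labels, n_folds):
    """Assign sample indices to folds, cycling within each class."""
    folds = [[] for _ in range(n_folds)]
    seen = {}
    for idx, label in enumerate(labels):
        k = seen.get(label, 0)
        folds[k % n_folds].append(idx)
        seen[label] = k + 1
    return folds

def cross_validate_1nn(X, y, n_folds=10):
    """Overall accuracy of the 1-NN classifier under n-fold cross-validation."""
    correct = 0
    for test_idx in balanced_folds(y, n_folds):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        correct += sum(predict_1nn(train_X, train_y, X[i]) == y[i]
                       for i in test_idx)
    return correct / len(X)

# Hypothetical data: 10 samples per class, two invented features each
class_centers = {
    "Starozagorski kamush": (1000.0, 400.0),
    "breeding line 4": (700.0, 600.0),
    "breeding line 39": (720.0, 640.0),
}
X, y = [], []
for name, (f1, f2) in class_centers.items():
    for k in range(10):
        X.append((f1 + 3.0 * k, f2 - 2.0 * k))
        y.append(name)

cv_accuracy = cross_validate_1nn(X, y, n_folds=10)
print(f"tenfold cross-validation accuracy: {cv_accuracy:.2%}")
```

Because the synthetic classes do not overlap, this example classifies every sample correctly; the value says nothing about the accuracies obtained on the real spectra.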

The number of correctly and incorrectly classified cases, average accuracy, and the values of precision, F-measure, MCC (Matthews correlation coefficient), ROC (receiver operating characteristic) area, and Kappa statistic were determined [31,32,33] (Eqs. 1–8).

$$Accuracy= \frac{(TP+TN)}{TP+TN+FN+FP},$$
(1)
$$Precision= \frac{TP}{TP+FP},$$
(2)
$$Recall=TPR = \frac{TP}{TP+FN},$$
(3)
$$FPR= \frac{FP}{FP +TN},$$
(4)
$$F\text{-}measure= \frac{2*Precision*Recall}{\left(Precision+Recall\right)},$$
(5)
$$MCC= \frac{\left(TP*TN-FP*FN\right)}{\sqrt{\left(\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)\right)}},$$
(6)
$$ROC\ area=\text{area under the TPR vs. FPR curve},$$
(7)
$$Kappa= \frac{Accuracy-{p}_{e}}{1-{p}_{e}}, \quad {p}_{e}= \frac{\left(TP+FP\right)\left(TP+FN\right)+\left(TN+FP\right)\left(TN+FN\right)}{{\left(TP+TN+FP+FN\right)}^{2}},$$
(8)

where TP: true positive; TN: true negative; FP: false positive; FN: false negative.
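As a worked illustration of these metrics, the Python sketch below evaluates them for a two-class case. The counts (9 correct and 1 misclassified case in each class) are hypothetical and chosen only to make the arithmetic easy to follow. Kappa is computed in the usual Cohen form, as the observed agreement corrected for the agreement expected by chance.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Classification metrics from the four cells of a 2 x 2 confusion matrix."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate (TPR)
    fpr = fp / (fp + tn)               # false positive rate (FPR)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Agreement expected by chance from the row and column totals
    p_e = ((tp + fp) * (tp + fn) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {
        "accuracy": accuracy, "precision": precision, "recall": recall,
        "fpr": fpr, "f_measure": f_measure, "mcc": mcc, "kappa": kappa,
    }

# Hypothetical counts: 10 cases per class, 1 error in each class
metrics = binary_metrics(tp=9, tn=9, fp=1, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts, accuracy, precision, recall, and F-measure are all 0.900, while MCC and Kappa are both 0.800.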

Results

This section discusses machine learning-based analysis of leek seed spectroscopic data. In this context, leek seeds of two breeding lines (4 and 39) and the Starozagorski kamush variety were distinguished using six different machine learning algorithms. To examine the performance of the approaches for sorting leek seeds, the tables below include the confusion matrix, precision, Kappa statistic, ROC area, F-measure, MCC, and average accuracy metrics.

For distinguishing all three samples (two breeding lines and one variety), the performances of the PART, Logistic, Naive Bayes, Random Forest, IBk, and Filtered Classifier machine learning algorithms are shown in Table 1. The average accuracy reached 93.33% for the classification models built using the IBk and Filtered Classifier algorithms. The Kappa statistic was equal to 0.90. Ten percent of cases from breeding line 4 were classified as breeding line 39, and 10% of cases belonging to breeding line 39 were classified as breeding line 4. The model developed using IBk correctly classified all cases of Starozagorski kamush. For the Filtered Classifier, all leek seeds belonging to Starozagorski kamush and breeding line 39 were correctly classified, whereas 20% of cases from the actual class of breeding line 4 were incorrectly included in the predicted class of breeding line 39. The greatest mixing of cases between classes was observed for the PART algorithm, which achieved an average accuracy of 86.67% for distinguishing the three samples of leek seeds. The confusion matrix for the classification conducted using PART showed that all of the Starozagorski kamush samples were accurately distinguished. However, the PART classifier incorrectly classified 10% of leek seeds from breeding line 4 as Starozagorski kamush and 20% of breeding line 4 as breeding line 39. In addition, it incorrectly classified 10% of leek seeds from breeding line 39 as breeding line 4. For the other machine learning algorithms, the Starozagorski kamush seeds were correctly distinguished from the other classes and the mixing of cases occurred between the breeding lines. The values of precision, F-measure, MCC, and ROC area were the highest for Starozagorski kamush and reached 1.000 in the case of Logistic, Naive Bayes, Random Forest, IBk, and Filtered Classifier.

Table 1 The performance metrics of classification of leek seeds belonging to Starozagorski kamush, breeding line 4 and breeding line 39 based on fluorescence spectroscopic data
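The relation between a three-class confusion matrix, the average accuracy, and the Kappa statistic can be illustrated with a short Python sketch. The matrix below is hypothetical: it assumes 10 cases per class (an assumption suggested by the 10% steps in the reported percentages, not a figure stated in the text) with one case exchanged in each direction between the two breeding lines.

```python
def accuracy_and_kappa(confusion):
    """Accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of cases of actual class i
    that were predicted as class j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return observed, (observed - expected) / (1 - expected)

# Rows and columns: Starozagorski kamush, breeding line 4, breeding line 39.
# Hypothetical counts, assuming 10 cases per class.
example_matrix = [
    [10, 0, 0],
    [0, 9, 1],
    [0, 1, 9],
]

example_accuracy, example_kappa = accuracy_and_kappa(example_matrix)
print(f"accuracy: {example_accuracy:.2%}, kappa: {example_kappa:.2f}")
```

For this matrix, 28 of 30 cases lie on the diagonal, giving an accuracy of 93.33% and a Kappa of 0.90.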

The results of distinguishing Starozagorski kamush and breeding line 4 are presented in Table 2. Completely correct classification, with an accuracy of 100%, a Kappa statistic of 1.00, and precision, F-measure, MCC, and ROC area of 1.000, was obtained for the models built using Logistic, Naive Bayes, Random Forest, and IBk. The lowest average accuracy, 85%, was obtained for the model built using PART. Ten percent of cases belonging to Starozagorski kamush were incorrectly classified as breeding line 4, and 20% of cases from breeding line 4 were incorrectly classified as Starozagorski kamush.

Table 2 The classification of leek seeds Starozagorski kamush and breeding line 4

Slightly higher accuracies were obtained for the classification of leek seeds of Starozagorski kamush and breeding line 39 (Table 3). For the models developed using Logistic, Naive Bayes, Random Forest, and IBk, both classes were distinguished with 100% accuracy. The lowest average accuracy of 90% was observed for the model built using the Filtered Classifier. In this model, 90% of the cases of each class were correctly classified, and 10% of the cases belonging to each class were incorrectly assigned to the other class. The Kappa statistic was equal to 0.80. The values of precision of 0.900, F-measure of 0.900, MCC of 0.800, and ROC area of 0.900 were determined for both classes.

Table 3 The classification of leek seeds Starozagorski kamush and breeding line 39

The least correct classification and the greatest mixing of cases were found for distinguishing breeding line 4 from breeding line 39 (Table 4). An average accuracy reaching 80% and a Kappa statistic equal to 0.60 were found for the models developed using Logistic, Naive Bayes, Random Forest, and Filtered Classifier. The highest accuracy for an individual class was 90% for breeding line 4 in the case of Naive Bayes and Random Forest, whereas 70% of the leek seeds belonging to breeding line 39 were correctly classified by these algorithms. The remaining cases were incorrectly assigned to the other class. The model built using IBk was characterized by the lowest average accuracy of 70%, with 70% of the cases of each breeding line correctly classified. The Kappa statistic was 0.40, and a precision of 0.700, F-measure of 0.700, MCC of 0.800, and ROC area of 0.700 were observed for both classes.

Table 4 The classification of leek seeds belonging to breeding line 4 and breeding line 39

The performed study revealed the usefulness of fluorescence spectroscopic data for distinguishing leek seed varieties and breeding lines using machine learning models. Spectroscopic techniques are effective in seed research and can be used in various aspects. For example, da Silva Medeiros et al. [34] applied near-infrared spectroscopy (NIRS) to discriminate Brassica seed species with correctness of 94.9%. The usefulness of near-infrared (NIR) spectroscopy for sexing papaya seeds with an F-score value equal to 0.81 was also confirmed [35]. Furthermore, NIR spectroscopy proved to be an effective technique for the assessment of insect infestation and protein content of maize seeds [36] and the quantification of phenolic content and total flavonoids in raw peanut seeds [37]. Terahertz spectroscopy was used for the recognition of transgenic cotton seeds [38] and the identification of the adulteration of rice seeds [39]. In view of the promising literature data, various spectroscopic techniques can be used in future studies in various directions of leek seed quality assessment.

Conclusions

The study involved a novel approach combining fluorescence spectroscopy and traditional machine learning algorithms to distinguish leek seed varieties and breeding lines. The applied procedure was innovative against the background of the available literature on the assessment of leek seed quality. Machine learning models built based on spectroscopic data allowed for the classification of three seed types with an accuracy reaching 93.33%. The most effective algorithms were IBk and Filtered Classifier. Each model distinguished the Starozagorski kamush variety from breeding lines 39 and 4 with the highest accuracy, whereas the greatest mixing of leek seeds was found between the breeding lines. Future studies can also involve other spectroscopic techniques or other directions of leek seed quality assessment, including the discrimination of species, varieties, and breeding lines, as well as the estimation of the content of chemical compounds in seeds.