Discrimination of tomato seeds belonging to different cultivars using machine learning

This study aimed to develop discriminant models for distinguishing tomato seeds based on texture parameters of the outer seed surface calculated from images (scans) converted to the individual color channels R, G, B, L, a, b, X, Y, Z. The seeds of tomatoes ‘Green Zebra’, ‘Ożarowski’, ‘Pineapple’, Sacher F1 and Sandoline F1 were discriminated in pairs. The highest accuracies were observed for models built on sets of textures selected individually from color channels R, L and X and on sets of textures selected from all color channels. In all cases, the tomato seeds ‘Green Zebra’ and ‘Ożarowski’ were discriminated with the highest average accuracy: 97% for the Multilayer Perceptron classifier and 96.25% for Random Forest for color channel R, 95.25% (Multilayer Perceptron) and 95% (Random Forest) for color channel L, 93% (Multilayer Perceptron) and 95% (Random Forest) for color channel X, and 99.75% (Multilayer Perceptron) and 99.5% (Random Forest) for a set of textures selected from all color channels (R, G, B, L, a, b, X, Y, Z). The highest average accuracies for other pairs of cultivars reached 98.25% for ‘Ożarowski’ vs. Sacher F1, 95.75% for ‘Pineapple’ vs. Sandoline F1, 97.5% for ‘Green Zebra’ vs. Sandoline F1 and 97.25% for Sacher F1 vs. Sandoline F1 for models built on textures selected from all color channels. The obtained results may be used in practice to identify the cultivar of tomato seeds. The developed models make it possible to distinguish tomato seed cultivars objectively and quickly using digital image processing. The results confirmed the usefulness of texture parameters of the outer surface of tomato seeds for classification purposes. The discriminative models provide a very high probability of correct discrimination and may be applied to authenticate seeds and detect seed adulteration.


Introduction
The origin of tomato (Solanum lycopersicum L.) is not confirmed by archaeological evidence, but on the basis of DNA sequence analyses of plants currently found in Latin America, Peru and Ecuador are most often indicated as the place of origin. It is estimated that about 7000 years ago the selection of wild plants led to the development of the domestic tomato. For centuries, yield has been the main criterion in tomato breeding, which has led to a significant enlargement of the fruit with a simultaneous reduction in sugar and aroma content [1]. In 2019, the global cultivated area was 6.1 million hectares and total production was around 243.6 million tons [2], making tomato one of the world's main food crops. The common tomato produces fruits in a large variety of shapes, colors, and sizes. Tomato quality factors for fresh consumption are overall appearance, firmness and taste, whereas the quality of tomatoes for processing is determined by total solids content, color, pH and firmness [3]. For consumers, color is an indicator of maturity level, and in many cases this feature of the fruit has a decisive influence on the choice. The color of tomato fruits depends mainly on lycopene content. The second-most important compound affecting the color of the fruit is β-carotene [4]. The nutritional value of tomatoes is mainly due to their content of bioactive compounds (carotenoids, polyphenols, ascorbic acid), minerals (Ca, Mg, Cu, Zn, K, Fe) and fiber [5]. In comparison to other vegetables, tomato fruits have intermediate levels of carotenoids; however, high dietary intake makes them a very important source. It has recently been shown that the content of lycopene and β-carotene in the fresh weight of tomato fruits is in the range 0.02-422 mg/100 g and 0.01-4.44 mg/100 g, respectively [6]. A similar situation applies to ascorbic acid content.
Compared with other vegetables, its concentration remains at an average level, but the large quantities consumed make tomatoes great contributors of this nutrient to the diet [3]. Large differences in ascorbic acid levels have been reported among tomato cultivars and growing conditions; the concentration in tomato fruits was estimated at between 1 and 64 mg/100 g of fresh weight [6]. The level of phenolic compounds in tomato fruit is influenced by a large number of factors (variety, cultivation method, weather conditions, degree of ripeness), and therefore the average content may not be representative [6,7]. However, the content of polyphenolic compounds is not high: the fruits were reported to contain flavonoids in the concentration range 2.57-4.37 mg/100 g [8] and phenolic acids (5-caffeoylquinic acid and caffeoylquinic acid derivatives) at 10.5 mg/100 g fresh weight [5]. Epidemiological evidence indicates an association between the consumption of tomatoes and reduced cardiovascular risk. Lycopene administered at 200 mg/day has a significant effect on normalizing the blood lipid profile [9]. Tomato intake was found to reduce LDL, total cholesterol and TG levels and to increase HDL levels [10].
Spectral and image analysis, performed using various acquisition methods, provides valuable information for classifying the physiological condition of seeds, detecting defects invisible to the eye, and discriminating varieties. Nowadays, non-destructive, rapid classification methods based on imaging, tomography and near-infrared (NIR) spectroscopy are under development for such use. For example, excellent results were obtained when the possibility of using near-infrared spectroscopy (reflection spectrum) to classify damaged and undamaged tomato seeds was investigated. The study showed that these discrimination models can be used to differentiate thermally damaged seeds. Total classification accuracy for the validation sample was 96.7% when five factors were selected for partial least squares discriminant analysis [11]. The potential of NIR spectroscopy for discrimination of tomato seed quality (viable and non-viable) using spectral analysis was also demonstrated. The ability of the model to correctly identify the positive samples (sensitivity) and to reject the negative samples (specificity) in the prediction of viable and non-viable seeds was 0.94 in both cases [12]. Rapid non-destructive grading of tomato seeds was also developed based on the hyperspectral technique. The area, circularity and average gray level of seeds were analyzed and correlated with standard germination test performance. An image acquisition system equipped with a line-scanning spectrometer gave good results when the wavelength of 713 nm was selected for prediction analysis. The accuracy for the calibration and validation data sets was above 90.00% [13]. In another case, X-ray image analysis of the physiological maturity of tomato seeds proved to be an effective method for selecting high-quality seeds. The internal features of the seeds, namely embryo morphology and the presence of free areas (which represent the physiological potential of the seeds), were analyzed on the radiographic images [14].
The present research proposes the application of image textures for cultivar discrimination of tomato seeds. The available literature lacks information on models based on textures extracted from the color channels R, G, B, L, a, b, X, Y, Z of digital color images acquired using a flatbed scanner for distinguishing tomato seed cultivars. The experiments performed were intended to fill this gap.
The objective of this study was to develop the discriminant models for distinguishing the tomato seeds based on texture parameters of the outer surface of seeds calculated from the images (scans) converted to individual color channels R, G, B, L, a, b, X, Y, Z.

Materials
The tomatoes belonging to cultivars 'Green Zebra', 'Ożarowski', 'Pineapple', Sacher F1 and Sandoline F1 were used in the experiments. The tomatoes were purchased from a local manufacturer. The seeds were manually prepared for the image acquisition. The individual tomato fruits were cut into quarters. Then, the seed chambers were emptied. The extracted seeds were covered with a protective tissue (mucilaginous gel) which was removed to obtain clean seeds. During the process of seed extraction, the seeds were rinsed in a sieve under tap water. In the next step, the mucilaginous gel was removed mechanically by sponge on absorption paper. Before scanning, the seeds were dried with paper towels.

Image analysis
The tomato seed images were obtained with the use of a flatbed scanner. The outer surfaces of seeds belonging to tomatoes 'Green Zebra', 'Ożarowski', 'Pineapple', Sacher F1 and Sandoline F1 were scanned on a black background, which facilitated the segmentation and ROI (region of interest) identification. For each cultivar, two scans were acquired. One scan included one hundred seeds. Therefore, the images of two hundred seeds were obtained for each tomato cultivar. The images were characterized by the following parameters: 800 dpi resolution, TIFF format. After the image acquisition, the Mazda software (Łódź University of Technology, Institute of Electronics, Poland) [15] was applied for image processing. Before image analysis using the Mazda application, the images were converted to BMP format. Then, the tomato seed images were converted to the individual color channels R, G, B, L, a, b, X, Y, Z. Exemplary images of tomato seeds are presented in Fig. 1. The individual seeds were separated from the background and the ROI was overlaid on each seed. For one image from each color channel, almost 200 textures based on the run-length matrix, histogram, co-occurrence matrix, autoregressive model and gradient map [15] were computed for each ROI (one seed), and these features were passed to the attribute selection stage. Of these approximately 200 features, those with the highest discriminatory power were selected and used to build models to distinguish tomato seed cultivars. In this study, the texture parameters of tomato seeds were calculated from images based on the spatial variation of pixel brightness intensities. Analysis of textures can provide numerical data about the structure of objects and can thus reveal changes that are difficult to notice with the naked eye. Images of objects with the same color histograms and number of pixels can differ in textures if they have dissimilar spatial color distributions [16-18].
Texture parameters have been successfully applied to distinguish seed cultivars [19-22]. The proposed procedure of cultivar discrimination of tomato seeds is presented in Fig. 2.
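The channel conversion and the simplest (histogram-based) textures can be sketched as follows. This is only an illustration under stated assumptions, not the Mazda implementation: it assumes the scans are sRGB with a D65 white point and uses the standard sRGB to XYZ to Lab formulas, whereas the constants used internally by Mazda may differ. The function names and the made-up ROI pixel values are hypothetical.

```python
import math


def srgb_to_xyz(r, g, b):
    """Convert 8-bit sRGB values to CIE XYZ (assumed D65), scaled so white has Y = 100."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    x = 0.4124 * rl + 0.3576 * gl + 0.1805 * bl
    y = 0.2126 * rl + 0.7152 * gl + 0.0722 * bl
    z = 0.0193 * rl + 0.1192 * gl + 0.9505 * bl
    return 100 * x, 100 * y, 100 * z


def xyz_to_lab(x, y, z, white=(95.047, 100.0, 108.883)):
    """Convert CIE XYZ to CIE L*a*b* relative to the given white point."""
    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / white[0]), f(y / white[1]), f(z / white[2])
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def channel_values(pixels):
    """Split a list of (R, G, B) pixels from one ROI into the nine channels."""
    channels = {name: [] for name in ("R", "G", "B", "L", "a", "b", "X", "Y", "Z")}
    for r, g, b in pixels:
        x, y, z = srgb_to_xyz(r, g, b)
        l_star, a_star, b_star = xyz_to_lab(x, y, z)
        values = (r, g, b, l_star, a_star, b_star, x, y, z)
        for name, value in zip(channels, values):
            channels[name].append(value)
    return channels


def histogram_features(values):
    """Histogram-type textures of one channel: mean, variance and percentiles."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    ordered = sorted(values)

    def percentile(p):
        # Nearest-rank percentile; other definitions give slightly different values.
        rank = max(1, math.ceil(p / 100 * n))
        return ordered[rank - 1]

    return {
        "Mean": mean,
        "Variance": variance,
        "Perc01": percentile(1),
        "Perc50": percentile(50),
        "Perc99": percentile(99),
    }


# Made-up ROI pixels for illustration only (not data from the study).
roi_pixels = [(120, 90, 60), (130, 95, 62), (110, 85, 58), (125, 92, 61)]
for name, values in channel_values(roi_pixels).items():
    print(name, histogram_features(values))
```

The run-length, co-occurrence, autoregressive and gradient textures are computed from the spatial arrangement of pixels and are therefore not covered by this histogram-only sketch.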

Discriminant analysis
The discrimination of tomato seeds belonging to different cultivars was carried out with the use of the WEKA 3.9 application (Machine Learning Group, University of Waikato) [23]. The cultivars were discriminated in pairs: 'Green Zebra' vs. 'Ożarowski', 'Green Zebra' vs. 'Pineapple', 'Green Zebra' vs. Sacher F1, 'Green Zebra' vs. Sandoline F1, 'Ożarowski' vs. 'Pineapple', 'Ożarowski' vs. Sacher F1, 'Ożarowski' vs. Sandoline F1, 'Pineapple' vs. Sacher F1, 'Pineapple' vs. Sandoline F1, Sacher F1 vs. Sandoline F1. Additionally, all five tomato seed cultivars were discriminated together, and one cultivar was discriminated from the other cultivars. The discriminative models were developed individually for the sets of selected textures. The textures were selected using the Best First search with the CFS (Correlation-based Feature Selection) subset evaluator. In the case of pair comparison, 10 textures were selected for individual color channels and 30 for all color channels for each pair of tomato seed cultivars. This was the optimal number of features that provided high correctness of discrimination and a short analysis time. For the classification of all five cultivars, more textures were selected: about 15 for each color channel and 35 for the model built on a set including textures selected from all color channels. This suggests that more features may be needed to distinguish a larger number of cultivars from each other. For example, in the case of color channel R, the following textures were selected: RHMean, RHVariance, RHPerc01, RHPerc50, RHPerc99, RSGSkewness, RS5SH1DifVarnc, RS5SV1SumVarnc, RS4RHLngREmph, RS4RVGLevNonU, RS4RVLngREmph, RS4RZRLNonUni, RAArea, RATeta2, RASigma. For color channel X, the selected textures were: XHMean, XHVariance, XHPerc01, XHPerc10, XHPerc50, XHPerc99, XSGArea, XS5SV1SumVarnc, XS5SN1DifEntrp, XS5SZ3AngScMom, XS4RVGLevNonU, XS4RZRLNonUni, XAArea, XATeta2, XASigma.
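The pairwise design and the idea behind correlation-based feature selection can be illustrated with a minimal sketch. It is not the WEKA implementation: it assumes the general CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), uses the absolute Pearson correlation as a stand-in correlation measure (WEKA's evaluator may use a different one), and replaces the Best First search with a plain greedy forward search without backtracking. The function names and the toy feature columns are hypothetical.

```python
import math
from itertools import combinations

CULTIVARS = ["Green Zebra", "Ożarowski", "Pineapple", "Sacher F1", "Sandoline F1"]
PAIRS = list(combinations(CULTIVARS, 2))  # the 10 pairwise comparisons


def pearson(u, v):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def cfs_merit(columns, labels, subset):
    """CFS merit: rewards correlation with the class, penalizes redundancy."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    if k == 1:
        return r_cf
    feature_pairs = list(combinations(subset, 2))
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in feature_pairs)
    r_ff /= len(feature_pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, labels, max_features):
    """Greedy forward selection (a simplification of Best First search)."""
    selected, best = [], 0.0
    while len(selected) < max_features:
        candidates = [f for f in columns if f not in selected]
        if not candidates:
            break
        merit, feature = max(
            (cfs_merit(columns, labels, selected + [f]), f) for f in candidates
        )
        if merit <= best + 1e-12:  # stop when no candidate improves the merit
            break
        selected.append(feature)
        best = merit
    return selected, best


# Toy data: labels 0/1 stand for the two cultivars of one pair.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "informative": [1, 2, 3, 7, 8, 9],
    "redundant": [2, 4, 6, 14, 16, 18],  # scaled copy of "informative"
    "noise": [5, 1, 4, 2, 5, 3],
}
print(len(PAIRS), greedy_cfs(columns, labels, max_features=2))
```

In this toy example the redundant copy adds no merit once the informative feature is chosen, which is the behavior that lets a small subset (such as the 10 textures per channel used here) replace the full set of about 200 textures.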
The discrimination was performed using different classifiers from the groups of Functions, Decision Trees, Lazy and Rules available in the WEKA application. The tenfold cross-validation mode was applied for the discrimination [24]. In the case of each pair, the discriminant models were built separately for individual color channels R, G, B, L, a, b, X, Y, Z from color spaces RGB, Lab, XYZ, respectively, using different classifiers. The main criterion for the evaluation of the analysis performance and selection of classifiers was the highest average accuracy (%). The accuracies of classification (%) for individual tomato seed cultivars were also evaluated. The highest discrimination accuracies were obtained for the Multilayer Perceptron and Random Forest classifiers and for the color channels R, L and X when the models were built for sets of textures selected individually for each color channel. Therefore, the results for these discriminative classifiers and individual color channels are presented in this paper.
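The tenfold cross-validation procedure can be sketched as below. This is an illustrative outline rather than the WEKA code: the stratified assignment of samples to folds is an assumption, and a nearest-centroid rule is used only as a simple stand-in classifier so that the example is self-contained. All function names and the synthetic one-feature data are hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_centroids(rows, labels):
    """Stand-in classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, row):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(row, model[label])),
    )


def cross_validate(features, labels, train_fn, predict_fn, k=10, seed=0):
    """Return the average accuracy (%) over k held-out folds."""
    correct = 0
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = train_fn(
            [features[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct += sum(predict_fn(model, features[i]) == labels[i] for i in test_idx)
    return 100.0 * correct / len(labels)


# Synthetic, well-separated data: 200 seeds per cultivar, one feature each.
labels = ["Green Zebra"] * 200 + ["Ożarowski"] * 200
features = [[(i % 7) * 0.1] for i in range(200)]
features += [[10 + (i % 7) * 0.1] for i in range(200)]
print(cross_validate(features, labels, train_centroids, predict_centroid))
```

Each seed is used for testing exactly once, so the reported average accuracy is computed over all 400 seeds of a pair while every prediction comes from a model that did not see that seed during training.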
Random Forest is one of the classifier algorithms from a group of Decision Trees. The function of this classifier is to build random forests by bagging ensembles of randomized decision trees. The Multilayer Perceptron classifier is a neural network belonging to a group of Functions that uses backpropagation for training [24].
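The bagging idea behind Random Forest can be illustrated with a deliberately simplified sketch. It is not WEKA's Random Forest: each "tree" here is a one-split stump on a single randomly chosen feature, trained on a bootstrap sample, and the ensemble predicts by majority vote. It assumes a two-class problem, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, rng):
    """One-split tree on a random feature; threshold midway between class means."""
    feature = rng.randrange(len(rows[0]))
    means = {}
    for label in sorted(set(labels)):
        values = [row[feature] for row, lab in zip(rows, labels) if lab == label]
        means[label] = sum(values) / len(values)
    low = min(means, key=means.get)
    high = max(means, key=means.get)
    threshold = (means[low] + means[high]) / 2
    return feature, threshold, low, high


def predict_stump(stump, row):
    feature, threshold, low, high = stump
    return low if row[feature] <= threshold else high


def train_bagged_stumps(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        forest.append(
            train_stump([rows[i] for i in sample], [labels[i] for i in sample], rng)
        )
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy two-feature data for two cultivars (made-up values).
rows = [[i * 0.1, 1 - i * 0.05] for i in range(10)]
rows += [[5 + i * 0.1, 6 - i * 0.05] for i in range(10)]
labels = ["Sacher F1"] * 10 + ["Sandoline F1"] * 10
forest = train_bagged_stumps(rows, labels)
print(predict_forest(forest, [0.3, 0.8]), predict_forest(forest, [5.5, 5.7]))
```

A full Random Forest grows deeper trees and draws a random subset of features at every split; the sketch keeps only the two ingredients named above, bootstrap sampling and randomization of the trees.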
Additionally, the models for sets of textures selected from all color channels were developed. The presented results include the confusion matrices for the pairs of cultivars and all five cultivars of tomato seeds, the average accuracies for each pair of cultivars and all five cultivars, and the TP (True Positive) Rate, Precision, F-Measure, ROC (Receiver Operating Characteristic) Area and PRC (Precision-Recall) Area. The values of these metrics were computed using WEKA. However, Precision, TP Rate and F-Measure may also be calculated manually using the following equations [22]:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\mathrm{TP\ Rate} = \frac{TP}{TP + FN} \tag{2}$$

$$\mathrm{F\text{-}Measure} = \frac{2\,TP}{2\,TP + FP + FN} \tag{3}$$

where TP is the number of true positives; FP is the number of false positives; FN is the number of false negatives.
The interpretation of the results was based on the average accuracy (%) of classification of the tomato seeds belonging to different cultivars, the accuracies of classification (%) for individual cultivars and the values of other performance metrics, such as TP Rate, Precision, F-Measure, ROC Area and PRC Area. The higher the accuracies and the values of other metrics, the better the classification efficiency.
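As a worked illustration of how these metrics follow from a confusion matrix, the sketch below uses the standard definitions TP Rate = TP/(TP + FN), Precision = TP/(TP + FP) and F-Measure = 2TP/(2TP + FP + FN). The counts are not copied from the tables; they are reconstructed, as an assumption, from the per-cultivar accuracies reported in the Results for color channel X with Random Forest (97% for 'Ożarowski' and 93% for 'Green Zebra', with 200 seeds per cultivar). The function names are hypothetical.

```python
def ratio(numerator, denominator):
    """Division that returns 0.0 when a class was never predicted or never present."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn):
    """Return (TP Rate, Precision, F-Measure) for one class."""
    tp_rate = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * tp, 2 * tp + fp + fn)
    return tp_rate, precision, f_measure


def per_class_metrics(confusion, classes):
    """confusion[i][j] = number of seeds of actual class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(classes)))
    report = {"average_accuracy": 100.0 * correct / total}
    for i, name in enumerate(classes):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                # actual class i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp  # other classes predicted as i
        report[name] = binary_metrics(tp, fp, fn)
    return report


# Counts reconstructed from the reported 93% and 97% of 200 seeds (assumption).
classes = ["Green Zebra", "Ożarowski"]
confusion = [
    [186, 14],  # actual 'Green Zebra'
    [6, 194],   # actual 'Ożarowski'
]
report = per_class_metrics(confusion, classes)
print(report["average_accuracy"])
for name in classes:
    print(name, [round(value, 3) for value in report[name]])
```

With these reconstructed counts, the average accuracy is 95%, the TP Rate for 'Ożarowski' is 0.970, the Precision for 'Green Zebra' is 0.969 and the F-Measure for 'Ożarowski' is 0.951, which agrees with the values reported for this pair. ROC Area and PRC Area cannot be derived from a single confusion matrix because they depend on the ranking of the predicted class probabilities.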

Results and discussion
The cultivar discrimination of tomato seeds was performed for pairs of cultivars. In the case of each pair, the discriminant models were built separately for color channels R, L, X based on selected textures, and two discriminative classifiers (Multilayer Perceptron and Random Forest) were applied for classification. In the case of the model built based on textures selected from color channel R of images of tomato seeds belonging to cultivars 'Green Zebra' and 'Ożarowski', very satisfactory discrimination accuracies were obtained: an average of 97% for the Multilayer Perceptron and 96.25% for Random Forest. [...] gave slightly better results with an obtained average accuracy of 79.5%.
In the next step of the analysis, the discriminant models were built for the sets of textures selected from color channel L (Table 2). [...] In the case of discrimination of the pairs of tomato cultivars based on seed textures from the images converted to color channel X, the average accuracies were very high and reached 95% for 'Green Zebra' vs. 'Ożarowski' for the Random Forest classifier (Table 3). The seeds of tomato 'Ożarowski' were classified with a correctness of 97%, and the seeds of 'Green Zebra' with a correctness of 93%. The TP Rate reached 0.970 ('Ożarowski'), Precision was up to 0.969 ('Green Zebra'), F-Measure was up to 0.951 ('Ożarowski'), ROC Area was up to 0.987 ('Ożarowski', 'Green Zebra') and PRC Area was up to 0.989 ('Green Zebra'). The Multilayer Perceptron also gave a very satisfactory average accuracy of 93% (93% for 'Green Zebra' and 93% for 'Ożarowski'). An average accuracy of 92.75% was obtained for the discrimination of seeds 'Pineapple' and Sandoline F1, for both Multilayer Perceptron and Random Forest. Additionally, the seeds of tomatoes 'Green Zebra' vs. Sandoline F1 were distinguished with very high accuracies of 92% (Multilayer Perceptron) and 91.25% (Random Forest). The seeds of tomatoes 'Green Zebra' vs. Sacher F1 were discriminated with accuracies of 90.25% (Multilayer Perceptron) and 91.5% (Random Forest). A fairly high accuracy of 91.75% was also observed for seeds 'Ożarowski' vs. Sandoline F1 with Random Forest. Whereas the tomato seeds 'Pineapple' vs. Sacher [...] (Table 3).
The correctness increased when the textures from all color channels R, G, B, L, a, b, X, Y, Z were combined in the discriminative models (Table 4). The tomato [...] The average accuracies of discrimination of all five cultivars of tomato seeds were slightly lower than for pair comparisons. The tomato seeds 'Green Zebra', 'Ożarowski', 'Pineapple', Sacher F1 and Sandoline F1 were correctly discriminated with average accuracies reaching 83.6% (Random Forest) for the model developed using a set of textures selected from all color channels R, G, B, L, a, b, X, Y, Z and 73.7% (Random Forest) for color channel R in the analysis performed for individual color channels. In the case of individual cultivars, the tomato seeds 'Ożarowski' were discriminated with the highest accuracy of up to 93.5% (Random Forest classifier, textures selected from a set of all color channels R, G, B, L, a, b, X, Y, Z) (Table 5). The other performance metrics for the discrimination of five cultivars of tomato seeds reached 0.935 for TP Rate, 0.912 for F-Measure, 0.990 for ROC Area, 0.960 for PRC Area for 'Ożarowski' and 0.894 for Precision for Sandoline F1 in the case of a set of textures selected from all color [...] 'Ożarowski' ('Green Zebra', 'Pineapple', Sacher F1 and Sandoline F1) and was equal to 0.976 for both curves (Fig. 4a, b). The values of PRC Area reached 0.927 for tomato seeds 'Ożarowski' (Fig. 4c) and 0.993 for tomato seeds other than 'Ożarowski' (Fig. 4d).
Computer vision systems can be of great practical importance for cultivar classification. Correct cultivar identification is needed to authenticate seeds and to avoid adulteration and the mixing of cultivars with different properties and applications [25]. Computer vision systems can ensure objective, accurate and reproducible quality evaluation [26,27]. The application of image processing can provide distributors, producers and consumers with important information about both the cultivar and the quality of seeds, as well as the identification of aberrant seeds [28]. Seed classification based on images can be important for crops (both fruit and vegetables), for disease recognition, or for archaeobotanical reasons related to obtaining specific feature information [29]. Image analysis is non-destructive and easier than other techniques for distinguishing tomato cultivars reported in the available literature, e.g., those based on genetics [30]. Besides, image analysis of seeds may be more advantageous than manual analysis because it speeds up the process, automates classification using image pixel values and reduces distortions caused by natural light [31]. Image analysis and machine learning may replace labor-intensive and time-consuming human visual procedures and can be used by seed laboratories or in the nursery industry for inspections of tomato seeds and evaluation of their germination rate [32]. The non-destructive cultivar discrimination of tomato seeds can also be useful for registration programs, protection of plant cultivars and management of plant genetic resources [33]. The present research demonstrated the usefulness of images obtained with the use of a flatbed scanner for cultivar discrimination of tomato seeds based on selected texture parameters extracted from color channels R, G, B, L, a, b, X, Y, Z. The selection of textures made it possible to build innovative discriminative models that provided high correctness.
The developed non-destructive, objective, fast and inexpensive procedure can be of great practical importance for distinguishing tomato seeds.

Conclusion
The obtained results indicated that tomato seeds belonging to different cultivars can be discriminated with a very high probability using the selected features calculated from the images. Therefore, the usefulness of textures of the outer surface for seed discrimination was confirmed. The models built on sets of combined textures selected from all color channels proved to be more useful for tomato seed discrimination than the models built separately for each color channel. The average accuracy reached 99.75% for distinguishing seeds 'Green Zebra' and 'Ożarowski'; in this case, the seeds 'Ożarowski' were discriminated with a correctness of 100% and the seeds 'Green Zebra' with a correctness of 99.5%. These results are very satisfactory. Image analysis can therefore be applied to confirm the authenticity of the seed cultivar and to avoid adulteration, which may be useful in various industries, e.g., tomato seed processing and tomato cultivation.