Introduction

Measurement of the geometric features of grain and other crops is of fundamental importance to the processing industry. Knowledge of the basic dimensions can be used in designing machines for sorting, washing and grinding, or transporting devices of different kinds (e.g., conveyors). Such knowledge, combined with knowledge of chemical composition, may help producers or processors to quickly evaluate the technological usability of a batch of grain [6]. Before video systems were developed, shape and dimensions could only be measured with rulers or different kinds of sieves. Currently, owing to the use of video systems, 2D or 3D images can be fed into a computer and even the most complicated geometrical parameters can be determined automatically. The first studies of the subject were conducted as early as the 1980s [18]. Paliwal et al. [8] and Luo et al. [4] used a video system to identify grain damage caused by diseases. Luo et al. [4] determined 68 geometric attributes of shape, which were used to identify diseases or damage with 90–100% accuracy. Shouche et al. [12] described the use of shape indexes and moments of inertia to characterize wheat varieties. Measurement of geometric attributes can be useful in automatic cultivar classification. Various grading systems using different morphological features for the classification of different cereal grains and varieties have been reported in the literature [5]. The authors of [5] applied image processing techniques to the identification of Australian wheat varieties. They determined 23 geometric features, 10 of which were used in cultivar classification. The classification accuracy ranged from 97 to 100%. Only grain harvested in 1994, at a single humidity level, was taken into account in the experiment. There is no information on how the model performed in subsequent years of harvest. Pablo et al. [7] and Visen et al. [16] classified grain varieties using their color along with geometric features. They developed models with the use of neural networks, which resulted in a classification accuracy of 40–96%. Paliwal et al. [9] used a combination of geometric features, grain surface texture and color to develop a statistical model that used selected variables to identify varieties. They selected 20 features in each group. Depending on the species or the number of variables in the model, the accuracy of classification ranged from 88 to 98%. That experiment was also conducted within 1 year, and there is no information on whether the model can be used in subsequent years of cultivation. Similar studies using geometric parameters, together with texture and color of grain or seeds of papilionaceous plants, were conducted by other authors [1, 2, 10, 11, 14, 15].

However, the reports do not provide any information about the operation of the statistical model on the data obtained in successive years of cultivation. Therefore, the aim of this study is to develop a statistical model to classify grain of spring and winter wheat harvested in three consecutive years of cultivation, at three humidity levels of 12, 14 and 16%. The system is based on the use of images acquired with a flat scanner, and the method of arranging grains on the measurement scene made it possible to reduce the time of analysis to just a few minutes.

Materials and methods

Grain samples

The experimental material comprised cleaned grain of common spring* and winter wheat of four quality classes (elite wheat: Torka*, prime quality wheat: Nawra*, Koksa*, Zyta, Sukces, Tonacja, Fregata, bread wheat: Cytra*, Soraja, Nutka, forage wheat: Symfonia). The study covered three cultivation years (2005, 2006, 2007), and 11 varieties (seven winter and four spring varieties) were analyzed each year at three moisture content levels—12, 14 and 16%. Initial moisture content was determined in two replications using the drying method according to Polish standard PN-71A-75101. The samples were ground and placed in a laboratory dryer at a temperature of 100 °C for 4 h. Samples characterized by low initial moisture content values were hydrated. Water was added, grain was stirred for 24 h, and it was placed in tight plastic containers and stored for 48 h at room temperature to ensure equal moisture distribution through the sample. Moisture content was again determined after the applied hydration treatment.

Image analysis

The image acquisition and image analysis workstation consisted of an Epson Perfection 4490 Photo flat scanner connected to a graphics workstation based on an Intel Pentium D 830 processor. SilverFast Epson v 6.4.3 software was used. Before each series of images was acquired, the scanner was calibrated with an IT8.7/2 template supplied with the scanner software. Grains were arranged on the measurement scene in 24 rows and 23 columns, so 552 grains could be scanned simultaneously. Grains were arranged with the use of a specially designed matrix, and it took about 5 min to arrange one scene. In total, over 6,500 grains were scanned and analyzed in each cultivar for each year and humidity level. Before the image analysis proper was performed, an image segmentation algorithm was developed. This is an issue of special importance, because an incorrectly established segmentation threshold can significantly affect the results. The segmentation algorithm incorporates morphological and non-linear filters. The analytical procedure involved the following successive steps: scanner calibration, kernel arrangement in the matrix, matrix removal, scanning and image saving (2,673 × 4,031 resolution, 400 DPI, 24-bit color depth, TIFF format). The next step was image segmentation to generate a mask for the original image. At the final stage, a 1-bit mask of the original image was obtained, and the area occupied by pixels identified as belonging to a single kernel was subjected to a geometric analysis.
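The last two steps (mask generation and per-kernel measurement) can be illustrated with a minimal sketch. It is not the segmentation algorithm used in the study, which relies on morphological and non-linear filters; here a plain global threshold is swapped in, kernels are assumed to be brighter than the background, and the function names and the toy image are hypothetical. Only the 400 DPI resolution is taken from the text.

```python
from collections import deque

MM_PER_PIXEL = 25.4 / 400  # pixel size at the 400 DPI stated in the text


def threshold_mask(gray, threshold):
    """Return a 1-bit mask: 1 where a pixel is brighter than the threshold.

    Assumption: kernels are brighter than the background.
    """
    return [[1 if value > threshold else 0 for value in row] for row in gray]


def label_kernels(mask):
    """Group foreground pixels into 4-connected regions (one per kernel)."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


def projection_area_mm2(pixels):
    """Projection area of one region: pixel count times the area of a pixel."""
    return len(pixels) * MM_PER_PIXEL ** 2


# Hypothetical 5 x 7 grey-level image with two bright "kernels".
toy = [
    [10, 10, 10, 10, 10, 10, 10],
    [10, 200, 210, 10, 10, 190, 10],
    [10, 205, 220, 10, 10, 195, 10],
    [10, 10, 10, 10, 10, 185, 10],
    [10, 10, 10, 10, 10, 10, 10],
]
mask = threshold_mask(toy, threshold=100)
kernels = label_kernels(mask)
areas = [projection_area_mm2(k) for k in kernels]
```

Because the grains are laid out on a regular grid and do not touch, connected-component labeling of this kind is enough to separate them; touching kernels would need an additional splitting step.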

The methodology developed in this study allowed an unlimited number of images to be analyzed automatically. The computer-aided image analysis was performed by MaZda 4.3 software [13].

Each grain was described by 54 geometric variables that include linear measurements and shape indexes (Table 1).

Table 1 Listing of calculated linear dimensions and shape indexes

Statistical analysis of results

The analysis of results was performed in several stages. Initially, the histogram of each individual variable was checked. At the next stage, in order to reduce the dispersion of results with respect to the mean value, randomly taken values were averaged to give one value. Next, the variables were reduced to a set of the 20 best ones, using both supervised and unsupervised selection. At the last stage, multidimensional analysis was performed in order to discriminate the varieties. To that end, the usability of several methods of discrimination was analyzed and seven were chosen, based on decision trees, Bayes classifiers and lazy classifiers.

Variables reduction

As each case was described by over 50 geometric variables, and the discriminant power of some variables could cancel out that of others, the variables were reduced to a set of the 20 best ones. Seven methods of selection were analyzed: methods based on genetic algorithms and methods based on Class Ranker and Class RankerSearch. In the first one, the selected attributes were evaluated by the InfoGainAttributeEval method, which evaluates attributes by measuring their information gain with respect to the class. It first discretizes numeric attributes using the MDL-based discretization method (it can be set to binarize them instead). This method, along with the next three, can treat a missing value as a separate value or distribute the counts among other values in proportion to their frequency [17]. Another method was based on the chi-squared statistic: ChiSquaredAttributeEval evaluates attributes by computing the chi-squared statistic with respect to the class. GainRatioAttributeEval evaluates attributes by measuring their gain ratio with respect to the class. SymmetricalUncertAttributeEval evaluates an attribute A by measuring its symmetric uncertainty with respect to the class C [17]. In the Class RankerSearch method, the quality of attributes was evaluated by the CfsSubsetEval and ConsistencySubsetEval methods. Statistical analysis was performed with the use of WEKA v. 3.7 software [3].
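The idea of ranking attributes by their chi-squared statistic with respect to the class can be sketched as follows. This is an illustration only, not WEKA's implementation: equal-width binning is swapped in for the MDL-based discretization mentioned above, and all function names and toy values are hypothetical.

```python
from collections import Counter


def equal_width_bins(values, n_bins=2):
    """Discretize numeric values into equal-width bins (a simplification)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid zero width for constant data
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def chi_squared(attribute_bins, classes):
    """Pearson chi-squared statistic of the bin x class contingency table."""
    n = len(classes)
    observed = Counter(zip(attribute_bins, classes))
    bin_totals = Counter(attribute_bins)
    class_totals = Counter(classes)
    statistic = 0.0
    for b, b_count in bin_totals.items():
        for c, c_count in class_totals.items():
            expected = b_count * c_count / n
            statistic += (observed[(b, c)] - expected) ** 2 / expected
    return statistic


def rank_attributes(table, classes, n_bins=2):
    """Return (attribute, score) pairs sorted from most to least informative."""
    scores = {
        name: chi_squared(equal_width_bins(values, n_bins), classes)
        for name, values in table.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: "length" separates the classes, "noise" does not.
classes = ["A", "A", "A", "B", "B", "B"]
table = {
    "noise": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    "length": [6.3, 6.4, 6.5, 7.4, 7.5, 7.6],
}
ranked = rank_attributes(table, classes)
```

A ranker of this kind scores each attribute independently, which is why a fixed number of top-ranked variables can be kept afterwards; subset evaluators instead score groups of attributes jointly.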

Multidimensional analysis

Once the variables had been selected, the multidimensional analysis was started. Cultivar classification was performed with the use of 6 classification methods, i.e., Bayes, Lazy, Meta classifiers, Decision tree and Stepwise discriminant analysis. Stepwise progressive discriminant analysis was performed with the use of the Statistica v 9.0 (StatSoft. Inc) statistical package; the other analyses were performed with the use of WEKA v3.7. The strategy adopted in developing the statistical model involved dividing the data set into two subsets: the test set accounted for 30% of the whole and the training set for 70%. At that stage of the analysis, a method was sought that would ensure the minimal error in the classification of 11 varieties of wheat grain in successive years of cultivation and at specified humidity levels.
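The 70/30 split and the resulting classification error can be illustrated with a small sketch. It assumes a simple random split and uses a 1-nearest-neighbour rule as one example of a lazy (instance-based) classifier; it does not reproduce the WEKA or Statistica procedures, and the data and names are hypothetical.

```python
import math
import random


def split_70_30(cases, seed=0):
    """Shuffle the cases and return (training set, test set) in a 70/30 ratio."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def nearest_neighbour_predict(training, features):
    """Predict the label of the closest training case (Euclidean distance)."""
    nearest = min(training, key=lambda case: math.dist(case[0], features))
    return nearest[1]


def classification_error(training, test_set):
    """Fraction of test cases whose predicted label differs from the true one."""
    wrong = sum(
        1
        for features, label in test_set
        if nearest_neighbour_predict(training, features) != label
    )
    return wrong / len(test_set)


# Hypothetical cases: (length in mm, width in mm) with a class label.
cases = [((6.3 + 0.01 * i, 3.2), "spring-like") for i in range(10)]
cases += [((7.5 + 0.01 * i, 3.7), "winter-like") for i in range(10)]

training, test_set = split_70_30(cases)
error = classification_error(training, test_set)
```

In practice the features would be standardized before computing distances, since the geometric variables differ widely in scale.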

Results and discussion

Statistical characteristics of the study material

Selected results of measurements of the geometric features are shown in Table 2. The grain length (Geo_L) ranged from 6.30 to 7.58 mm. Grains of the spring varieties were shorter than those of the winter ones. No consistent trend in grain length across the years of cultivation was observed. In the CYTRA cultivar, the length increased every year, whereas the reverse tendency was observed in the ZYTA cultivar. The grain width (Geo_S) ranged from 3.15 to 3.77 mm. The average width of the spring varieties differed by 0.49 mm from that of the winter ones. It is noteworthy that the grain width of the spring varieties changed significantly in 2007, when it was the same as that of the winter varieties. The smallest width in the winter varieties was recorded in 2006, a year when the weather conditions were adverse and the grain was not as well developed as in 2005 or 2007. The same tendency was observable in the spring varieties. The projection area (Geo_F) of the spring variety grains differed by 3.88 mm² from that of the winter ones. As with the grain width of the spring varieties, 2007 was significantly different from 2005 and 2006. The projection area in the winter varieties decreased in the successive years of cultivation. The shape index Geo_W6 describes the extent to which an object surface is folded; its value for a circle is 1. The value of the Geo_Rb index, on the other hand, is not sensitive to a change of the object scale and describes the shape precisely. The perimeter of the projection area of the spring variety grains was more folded than in the winter varieties, and the variability of the index between the years was greater. The values of the indexes for the winter varieties were more stable, and the value of the Geo_W6 index showed that the grains of the winter varieties were more oblong and less folded. The CYTRA cultivar had the most irregular and the least stable shape.
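Two properties mentioned here, a value of 1 for a circle and insensitivity to object scale, can be demonstrated with a generic shape index. The sketch below uses the isoperimetric ratio P²/(4πA), which is an assumption made for illustration and not necessarily the definition of Geo_W6 or Geo_Rb used in the study; the ellipse stands in for a grain outline and its semi-axes are hypothetical.

```python
import math


def isoperimetric_ratio(perimeter, area):
    """P^2 / (4 * pi * A): equals 1 for a circle, grows as the outline elongates or folds."""
    return perimeter ** 2 / (4 * math.pi * area)


def ellipse_area(a, b):
    """Exact area of an ellipse with semi-axes a and b."""
    return math.pi * a * b


def ellipse_perimeter(a, b):
    """Ramanujan's approximation of the perimeter of an ellipse."""
    h = ((a - b) / (a + b)) ** 2
    return math.pi * (a + b) * (1 + 3 * h / (10 + math.sqrt(4 - 3 * h)))


# A circle of radius 2 mm, and a grain-like ellipse at two different scales.
circle = isoperimetric_ratio(ellipse_perimeter(2.0, 2.0), ellipse_area(2.0, 2.0))
grain = isoperimetric_ratio(ellipse_perimeter(3.5, 1.7), ellipse_area(3.5, 1.7))
grain_doubled = isoperimetric_ratio(ellipse_perimeter(7.0, 3.4), ellipse_area(7.0, 3.4))
```

The index depends only on the ratio of the semi-axes, so the two ellipses give the same value although one is twice the size of the other.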

Table 2 The average values of selected linear dimensions and shape indexes for grain with a humidity of 12%

In order to assess the usability of the geometric dimensions in cultivar discrimination, the histogram of each variable was analyzed. Ideally, the intervals of dispersion of a variable for different varieties should not overlap. Figure 1 shows histograms for the year 2005 for selected varieties and geometric variables. The histograms were normal in shape but, unfortunately, their intervals overlapped. This showed that there were grains in each cultivar whose linear dimensions or shape indexes were not statistically different from those in other experimental groups. This had a negative impact on the discriminant power of individual variables. Table 3 shows the results for selected methods of discriminant analysis for a set of “raw” data. Discrimination of individual varieties was highly unsatisfactory, with a classification error ranging from 47 to 70%. For this reason, the number of cases was reduced by averaging every 25 cases into 1. The operation reduced the set of data to 280 cases for each cultivar and, more importantly, reduced the diversity within each cultivar. Figure 1 also shows the histograms after the cases were averaged. This procedure resulted in more distinct “clusters” of cases for each cultivar, and the histograms did not overlap as they did for the original data.
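The effect of averaging 25 randomly taken cases into one can be illustrated with a short simulation. The grain lengths below are synthetic (a normal distribution with assumed mean and spread), and the count of 7,000 grains is an assumption chosen so that the averaged set contains 280 cases; the function name is hypothetical.

```python
import random
import statistics


def average_in_groups(values, group_size=25, seed=0):
    """Shuffle the values, then replace each block of `group_size` by its mean."""
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    usable = len(shuffled) - len(shuffled) % group_size  # drop an incomplete block
    return [
        statistics.fmean(shuffled[i:i + group_size])
        for i in range(0, usable, group_size)
    ]


# Synthetic grain lengths in mm for one cultivar (assumed mean and spread).
rng = random.Random(1)
lengths = [rng.gauss(6.8, 0.4) for _ in range(7000)]

averaged = average_in_groups(lengths)
raw_spread = statistics.stdev(lengths)
averaged_spread = statistics.stdev(averaged)
```

For independent values, the standard deviation of a mean of 25 cases is about one fifth of the original, which is why the per-cultivar histograms become much narrower while their centres stay in place.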

Fig. 1 Dispersion of histograms of selected variables before (left) and after (right) the cases were averaged

Table 3 The results of discriminant analysis for the raw data. Training set: 1,840 cases; test set: 785 cases; method of selection: Ranker + ChiSquaredAttributeEval

Variables reduction

The variables for discrimination were selected from the set of data for the years 2005–2007 and from all three humidity levels. At least 10 variables were selected in each experimental setting. This produced a group of 63 sets of variables from all the experimental groups (3 years, 3 humidity levels and 7 methods of selection). In the next step, the number of sets was reduced from 63 to 21, by combining sets of variables obtained at a specific humidity level and all the methods of selection. The sets were combined by selecting the variables that were first on the list. Table 4 summarizes the numbers of variables which were chosen for further discriminant analysis. The variables are listed according to the frequency of their occurrence (methods of reduction). The most frequently chosen included the following: Geo_Er2 (average square distance from gravity center), Geo_Fd2 (area of circumscribing circle), Geo_Fmax (maximal Feret’s diameter) and Geo_L (length). The majority of these are variables determined from linear dimensions. The most frequently chosen shape indexes were Danielsson’s index (Geo_RD) and Rc2. Of all the 54 analyzed geometric variables, 17 parameters were chosen more than 4 times, including most of those which are not shape indexes.
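Counting how often each variable is chosen across selection runs can be sketched as below. The three selection lists are made up for illustration (only the variable names come from the text), and the helper names are hypothetical; the actual counts are those reported in Table 4.

```python
from collections import Counter


def count_selections(selected_sets):
    """Count in how many selection runs each variable appears."""
    counts = Counter()
    for variables in selected_sets:
        counts.update(set(variables))  # count a variable once per run
    return counts


def chosen_more_than(counts, minimum):
    """Variables chosen more than `minimum` times, most frequent first."""
    kept = (name for name, n in counts.items() if n > minimum)
    return sorted(kept, key=lambda name: (-counts[name], name))


# Hypothetical outputs of three selection runs.
selected_sets = [
    ["Geo_Er2", "Geo_Fd2", "Geo_L"],
    ["Geo_Er2", "Geo_Fmax", "Geo_L"],
    ["Geo_Er2", "Geo_Fd2", "Geo_S"],
]
counts = count_selections(selected_sets)
frequent = chosen_more_than(counts, minimum=1)
```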

Table 4 The results of discriminant analysis for the raw data setting of selected variables

Subsequently, the usability of different sets for cultivar discrimination was tested on the set of data from the year 2005 and the humidity level of 12%. Table 5 shows the collective results of the multidimensional analysis conducted by 6 discriminant methods. The multidimensional analysis proper was conducted on the entire data set with the sets of variables obtained by two selection methods: Ranker + ChiSquaredAttributeEval and Best First.

Table 5 The results of discriminant analysis for the raw data method of selection

Multidimensional analysis

Preliminary multidimensional analysis

Due to a complicated experimental setting (number of years, levels of humidity, varieties) and the methods of selection of variables and multidimensional analyses, a preliminary evaluation of the usability of the classification methods was carried out. The results are shown in Table 5. The analysis was performed on the set of data from the years 2005–2007 and the humidity level of 12%. The cumulative error of classification, depending on the method applied, ranged from 56 to 99%. The worst results were achieved for the Bayes Net and Naive Bayes methods, whereas the best results were achieved for the methods: Meta MultiClass Classifier and Discriminant analysis. For this reason, those two methods of discrimination were chosen for further analysis. Regardless of the applied method of selection of variables and of classification, the lowest error of varieties discrimination was achieved in 2005, followed by 2006 and the worst was in 2007.

Main multidimensional analysis

In the final stage of the analysis, the discrimination of 11 grain varieties from three successive years of cultivation and at three levels of humidity was conducted. As described previously, the variables were selected by the Ranker + ChiSquaredAttributeEval and Best First methods, whereas classification was conducted by the Meta MultiClass Classifier and Discriminant analysis methods.

Discrimination by the Meta MultiClass Classifier Ranker method

The results of the classification analysis are shown in Table 6, where the training set comprised 1,840 cases and the test set comprised 785 cases. The cumulative classification accuracy ranged from 90 to 99%, depending on the year of cultivation. Classification quality tended to decrease with the year of cultivation. The best results were achieved for the year 2005 and the worst for 2007, although the differences were not significant and did not exceed 8%. With respect to varieties, the worst results were achieved for Nutka and Tonacja. In the case of the latter, the maximum accuracy of classification achieved was 98%. For the majority of varieties, the accuracy of classification ranged from 96 to 100%. No negative effect of humidity on discrimination was observed in any of the varieties.

Table 6 The results of multidimensional analysis for 11 varieties of grain, the year of cultivation: 2005, 2006, 2007 and the grain humidity of 12, 14 and 16%

Discrimination by the stepwise progressive method

This method of discrimination employed stepwise progressive analysis, which was continued until all the variables had been introduced into the model or until the Wilks’ lambda statistic reached a minimum value of 0.00001.
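For a single variable, Wilks’ lambda reduces to the ratio of the within-group sum of squares to the total sum of squares, so values near 0 indicate good separation and values near 1 indicate none. The sketch below shows this univariate case and how it could be used to pick the first variable in a forward stepwise procedure; it is a simplified illustration, not the Statistica implementation, and the data and names are hypothetical.

```python
import statistics


def wilks_lambda_univariate(groups):
    """Within-group sum of squares divided by the total sum of squares."""
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.fmean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_within = sum(
        (v - statistics.fmean(group)) ** 2 for group in groups for v in group
    )
    return ss_within / ss_total


def first_variable_to_enter(variables):
    """Pick the variable with the smallest univariate Wilks' lambda."""
    return min(variables, key=lambda name: wilks_lambda_univariate(variables[name]))


# Hypothetical measurements of two varieties for two variables.
variables = {
    "length": [[6.3, 6.4, 6.5], [7.4, 7.5, 7.6]],  # well separated
    "noise": [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],   # identical groups
}
separated = wilks_lambda_univariate(variables["length"])
overlapping = wilks_lambda_univariate(variables["noise"])
first = first_variable_to_enter(variables)
```

In the multivariate case the same ratio is formed from the determinants of the within-group and total scatter matrices, and each step adds the variable that lowers lambda the most.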

Figure 2 shows diagrams of dispersion of canonical variables for the varieties under analysis. As in the method discussed earlier, the accurate classification index ranged from 92 to 97%. The worst results were again achieved for the Nutka and Tonacja varieties. The decreasing tendency in discrimination quality was observed depending on the cultivation year; 2005 was the best, and 2007 was the worst.

Fig. 2 Diagrams of dispersion of canonical variables for grain humidity of 12, 14 and 16% (from left to right) and the years of cultivation 2005, 2006 and 2007. Stepwise progressive analysis

The discrimination analyses conducted made it possible to distinguish between spring and winter varieties. The Cytra, Torka, Koksa and Nawra varieties occupied a distinct area in the dispersion diagram (Fig. 2). A winter cultivar—Zyta—was also included in the same space, but only in 2005. In the other years, the winter varieties were separate from the spring ones.

Conclusions

The experiment and the proposed methodology have resulted in a statistical model that can classify 11 wheat varieties with an accuracy of 90–100%, depending on the method applied, year of cultivation, humidity and cultivar. Cultivar discrimination was based on a model in which 20 geometric variables were implemented, most of which were calculated from linear dimensions, with shape indexes being less important. The proposed model also distinguished winter varieties from spring varieties. No effect of grain humidity on the discrimination quality was observed. Of the varieties analyzed, Nutka and Tonacja lowered the quality of classification. After they were removed from the model, the cumulative accuracy of classification ranged from 99 to 100%. Further studies should result in developing such a universal statistical model for successive years of cultivation. Most publications dealing with the issue propose models verified only on data from the year in which they were developed.