1 Introduction

One of the many purposes of X-ray scanning is to provide quality control and assurance in the food production industry. X-rays provide a non-destructive means of examining food items, and the data can be used to verify that the content is free of anomalies or foreign objects. Multispectral X-ray scanning has been used successfully to detect explosives [12], and compares well to an X-ray dual-energy sandwich detector [8]. Foreign objects found in food items consist mostly of insects, wood chips, stone pebbles, sand/dust and plastic. These objects might be present in the raw materials or accidentally introduced during the manufacturing process [6], where organic materials pose the main challenge for detection. Grating-based imaging techniques [9, 10], which measure the attenuation, scattering and refraction of X-ray beams, have shown great promise in detecting organic foreign objects [6]. Although grating-based methods are promising, they have not yet been scaled to a production line. Multispectral X-ray scanners exist with a conveyor belt setup, where a single acquisition takes around 5 s on the setup we used. Certain foreign objects can be detected in multispectral X-ray images of food items using a sparse classifier, which offers the potential for storing less data and making classification and acquisition faster.

According to the Beer-Lambert Law (BLL), the intensity of an X-ray beam decreases exponentially as it passes through matter, with the decay depending on the distance traveled and on the medium.

$$\begin{aligned} I = I_0 e^{-\mu \rho d} \end{aligned}$$
(1)

\(I_0\) in Eq. 1 corresponds to the initial X-ray intensity, \(\mu \) and \(\rho \) together form the linear absorption coefficient, where \(\mu \) corresponds to mass absorption and \(\rho \) to density, and finally d corresponds to the distance traveled by the beam. For a simple homogeneous material, this equation allows us to infer either the thickness or the type of material, provided the other is known. In our case, we are interested in food items, which are particularly challenging since their shape and material composition can vary greatly. There is considerable specimen-to-specimen variation, along with potentially inhomogeneous materials, which makes inference about the material composition difficult.
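
As a quick numeric illustration of Eq. 1 (with an assumed, not measured, absorption coefficient), the transmitted fraction decays exponentially with thickness; a minimal sketch in R:

```r
## Minimal numeric illustration of Eq. 1 with hypothetical values.
mu_rho <- 0.5          # combined linear absorption coefficient [1/cm] (assumed)
d      <- c(1, 2, 4)   # material thickness in cm
I0     <- 1            # normalized initial intensity
I      <- I0 * exp(-mu_rho * d)
round(I / I0, 3)       # 0.607 0.368 0.135: doubling d squares the decay factor
```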

Instead of trying to model the signal according to the BLL, we apply a data-driven approach: we train a classifier to recognize whether foreign objects are present, using multispectral X-ray image samples of a given food product and of foreign objects in the food. The dimensionality of the data is high and poses problems for data storage and processing speed. We thus seek a sparse classifier, in order to examine whether decreasing the data dimensionality will still result in reasonable accuracy and a low false discovery rate. We choose sparse linear discriminant analysis (SDA) [4] for this task, since it fits these requirements well. This classification method performs variable selection and dimensionality reduction as part of the optimization process, which allows us to identify which spectra in the images are relevant for the given classification task. This is achieved via an elastic net regularizer [14], which also allows for the construction of a Tikhonov regularization matrix that can be further tailored to the specific classification task. Knowing which spectra are relevant for the classification task allows for compressing the data to only the relevant spectra, and it could also provide domain knowledge on which spectra differ most between certain materials. A similar method that could also be considered for this task is sparse partial least squares [3].

To generate the data for the classifier, we first preprocess it. The main step is normalizing each pixel w.r.t. its maximum intensity, which gives better contrast between different materials. These steps are further explained in the next section. Note that at most 6 scans (images) were used for generating labels for training, a process that takes around 10 s per image, so generating the training data and training the classifier is fast. This process can be further automated for a given target application.

We will examine to what extent we can detect foreign objects in two types of food materials and report which objects we detect.

The paper is outlined as follows. First, we describe the data and the acquisition process, including the scanner setup, the properties of the data we obtain and the preprocessing. Next, we describe how the data sets are prepared for the classifier, followed by a description of the classifier. Finally, we present results to evaluate the performance, along with some discussion.

2 Data and Acquisition

The scanner used for data acquisition is a MULTIX multispectral X-ray scanner with three daisy-chained detection modules, providing line scans of \(3\times 128 = 384\) pixels with a pixel side length of 800 \(\upmu \)m. The energy of the photons is measured over 128 energy bins (also referred to as channels), and the source voltage for our experiments is set to 90 kV. This spectrometric scanner combines a semiconductor crystal with advanced electronics, capable of measuring the energy of every incident X-ray photon. The material signature is acquired in real time and stored in raw format [2, 12]. This scanning technology has been compared to a dual-energy sandwich detector (where two detector modules are used with a single-shot exposure), showing better detection of explosive materials and a lower false discovery rate (FDR) [8]. We aim to examine which part of the spectrum is best suited for foreign object detection in food, and to what extent we can detect foreign objects in organic food items using a sparse classifier.

The MULTIX scanner provides images in a binary format, where pixel intensities are encoded as 16-bit unsigned integers. To make sure that there are no scaling differences between samples, we take an average measurement over a patch of air in the image and find the peak value over all the channels. We then scale all values by the inverse of this peak value, such that the air peak corresponds to one; this ensures that there are no scaling differences between different images. The different attenuation profiles in an image of spring rolls can be seen in Fig. 1. Foreign objects that are not “inside” a food item give a very different attenuation profile from the ones that are inside the food items. The attenuation profile of the food items naturally varies a lot with the thickness of the item, which makes it more difficult to work with products whose thickness varies greatly.
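
To make this concrete, here is a minimal R sketch of the air-peak scaling; the array layout img[row, column, channel] and the corner patch indices are our assumptions for illustration, not the authors' code:

```r
## Hedged sketch of the air-peak scaling described above.
## `img` is assumed to be an array [rows, cols, 128] of raw 16-bit intensities.
air_patch  <- img[1:50, 1:50, ]          # a corner patch known to contain only air
air_mean   <- apply(air_patch, 3, mean)  # average air spectrum (128 values)
peak_value <- max(air_mean)              # peak intensity over all channels
img_scaled <- img / peak_value           # the air peak now corresponds to 1
```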

Fig. 1. Intensity profiles for different materials in an image of spring rolls with foreign objects. The green line above the blue one corresponds to foreign objects that are not superimposed on food items, while the bottommost purple line corresponds to foreign objects that are superimposed on top of food items. The data is scaled such that the peak for air/no item is at 1. The profiles are further normalized (not depicted here) such that the maximum value of every pixel is 1. This is a crude normalization for depth/thickness and gives us data that better represents differences between materials. (Color figure online)

After this initial scaling, we remove line artifacts. These artifacts appear as strides in the image where two different detector modules are attached (see Fig. 3). To remove them, we start by creating an average air profile. We use a patch of pixels from a corner of the image where no detector modules overlap and where we are certain there are no items. The mean over these samples gives us an average air profile, similar to the red one seen in Fig. 1, i.e. 128 values that represent no items; we call this vector \(\varvec{\mu _{\text {Air}}}\in \mathbb {R}^{128\times 1}\). For each column in the scanning direction of the image, we then take the mean of the first 50 pixels. This gives another profile, specific to that particular column, i.e. a local mean profile, which we call \(\varvec{\mu _{\text {local}}}\in \mathbb {R}^{128\times 1}\). We now need to find the scaling difference between \(\varvec{\mu _{\text {Air}}}\) and \(\varvec{\mu _{\text {local}}}\), that is, the vector \(\mathbf {s}\in \mathbb {R}^{128\times 1}\) that solves the following equation, where the left-hand side is an elementwise multiplication.

$$\begin{aligned} \mathbf {s} \odot \varvec{\mu _{\text {local}}} =\varvec{\mu _{\text {Air}}} \end{aligned}$$
(2)

The solution to Eq. 2 is found via elementwise division of the mean vectors, i.e. \(s_i = \mu _{\text {Air},i}/\mu _{\text {local},i}\) for each of the 128 channels. For each pixel in the column, we then multiply its 128 values elementwise with the scaling vector \(\mathbf {s}\). This process is depicted in Fig. 2, where the strides have been removed in the middle image.
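
The column-wise correction can be written as a short R sketch, continuing the array layout assumed above (the patch locations are again placeholders):

```r
## Sketch of the stride removal (Eq. 2), continuing from `img_scaled`.
mu_air <- apply(img_scaled[1:50, 1:50, ], 3, mean)   # reference air spectrum
for (j in seq_len(dim(img_scaled)[2])) {
  # local air profile: mean spectrum of the first 50 pixels of column j
  mu_local <- apply(img_scaled[1:50, j, , drop = FALSE], 3, mean)
  s <- mu_air / mu_local                             # elementwise solution of Eq. 2
  # rescale every pixel in column j channelwise by s
  img_scaled[, j, ] <- sweep(img_scaled[, j, ], 2, s, `*`)
}
```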

Finally, we perform a crude normalization for depth. For every pixel in the image, we find the maximum \(c_{\text {max}}\) of its 128 channels and multiply each of the 128 values by \(1/c_{\text {max}}\), such that all values in the pixel lie between 0 and 1 and the maximum is 1. This should give us data that better represents differences between materials rather than thickness, since different materials have their maximum intensities at different channels (see Fig. 1, where this scaling has not been applied and the maximum intensity appears in different channels). All the preprocessing steps are depicted in Fig. 2.
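
A minimal sketch of this last step, under the same assumed array layout:

```r
## Per-pixel depth normalization: divide each pixel's 128-channel
## spectrum by its own maximum, so its values lie in (0, 1].
dims <- dim(img_scaled)
for (i in seq_len(dims[1])) {
  for (j in seq_len(dims[2])) {
    c_max <- max(img_scaled[i, j, ])
    img_scaled[i, j, ] <- img_scaled[i, j, ] / c_max
  }
}
```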

We test our approach on two datasets, one with spring rolls and another with minced meat. The spring rolls dataset is challenging in the sense that the thickness varies considerably and the spring rolls can overlap. The minced meat data also varies greatly in thickness, since it contains strings/filaments of meat that overlap.

The objects listed in Table 1 were used for the imaging; they were superimposed on top of the food material.

Fig. 2. An illustration of the preprocessing steps for the images. The leftmost image shows channel 10 in the raw data (minced meat). The middle image shows channel 10 after the removal of the strides/line artifacts. The rightmost image shows channel 10 after each pixel has been scaled by the maximum value over its channels. The final normalization step gives more contrast between different materials, although the contrast between meat and no item is lower in this particular channel. The intensities are scaled linearly from black (lowest) to white (highest) in the shown images.

Table 1. Foreign objects used for the scans of minced meat. The items used for the spring rolls are the same excluding the last seven items. Set 3 consists of the same items as used in [6]. PTFE is an acronym for Polytetrafluoroethylene, more commonly referred to as Teflon.

2.1 Spring Rolls

The spring rolls data set consists of scans of 8 different bags of spring rolls. Each bag was scanned 20 times, then refrozen and scanned another 20 times a day later. The foreign objects were superimposed on the bags. The foreign objects were also scanned individually 10 times, and each bag was scanned 2 times without any foreign objects. Figure 3 shows four of the image channels in grayscale. Most of the contrast between the different materials is present in the first channels, which can also be seen in Fig. 1. The different scans provide variation in position and rotation of the food items, and the different bags provide shape differences for the dataset. The spring rolls were contained in a plastic bag.

Fig. 3. Raw grayscale images of different channels from a spring roll sample (top row) and minced meat (bottom row), generated with the MULTIX scanner. From left to right are channels 2, 20, 50 and 100. The contrast decreases and the measurement variation increases the higher we go in the channels. The foreign objects can be seen as small black dots, most visible in the second image (and better visible in Fig. 4). The strides have been removed in the meat data to show the difference from the spring rolls images, where the line artifacts are still present. The intensities are scaled linearly from black (lowest) to white (highest). The individual pixels have not been scaled by their maximum value.

2.2 Minced Meat

A single plastic box containing 1 kg of minced meat was used for all the scans. First, 5 empty scans were produced with no items at all, then the meat was scanned 5 times without any foreign objects. Finally, the meat was scanned with 3 sets of foreign objects, 10 times for each set. The types of foreign objects in each of the three sets are described in Table 1, and a sample image from the data can be seen in Fig. 3.

3 Method

For a given scanned food item, we would like to classify which parts of the image contain food and which contain foreign objects. To achieve this we first need to construct a dataset for training a classifier.

For the spring rolls dataset, we manually select four regions from five scans: in each scan, a region containing no items, one containing spring rolls and finally two regions with the most visually distinct foreign objects. This selection process is depicted in Fig. 4. To encode neighborhood information, we treat a single observation of a given pixel as the \(5\times 128\) values from the pixel itself and from the pixels directly above, below, to the left and to the right, so each observation contains \(128\times 5=640\) variables. This should make the detection of different materials more robust.
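
A minimal R sketch of this neighborhood encoding, assuming a preprocessed array img[row, column, channel] and an interior pixel (border handling omitted; the function name is ours):

```r
## One 640-dimensional observation for pixel (i, j): its own 128-channel
## spectrum and those of its four neighbors, concatenated.
make_observation <- function(img, i, j) {
  c(img[i, j, ],      # center pixel
    img[i - 1, j, ],  # above
    img[i + 1, j, ],  # below
    img[i, j - 1, ],  # left
    img[i, j + 1, ])  # right; total length 5 * 128 = 640
}
```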

Fig. 4. Selection of data used for training the classifier. Three classes are selected, the enclosed region of the green box represents the no item or air class, the blue region represents the food item and the red boxes represent the foreign objects. For this illustration, the red boxes are a little bit larger than in practice. (Color figure online)

After selecting the regions from five scans, we can generate a matrix where rows represent observations and the \(p=640\) variables are represented as columns. Each image yields around 30 pixels of foreign objects; the other classes (spring rolls and air) are randomly subsampled such that we have an equal number of observations in each class. We thus end up with a matrix \(\mathbf {X}\) of dimension \(n\times p = n\times 640\), where n is around 500 to 600 pixels. The labels are represented in an indicator matrix \(\mathbf {Y}\), which has the same number of rows as \(\mathbf {X}\), but whose number of columns equals the number of classes K, in this case three. If observation i belongs to class j, then \(\mathbf {Y}_{ij}\) is 1 and the other values in the same row are zero. After the data is set up, we normalize it by subtracting the mean and scaling the variables to unit variance. The same procedure is applied to the minced meat data, so we have two different data sets.
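
A sketch of this assembly step in R; the data frame `regions` of selected pixel coordinates and labels is a hypothetical stand-in for the manual selection:

```r
## Assemble training matrices; `regions` is assumed to be a data.frame with
## pixel coordinates i, j and a class label in {1, 2, 3}.
n <- nrow(regions)
X <- t(vapply(seq_len(n),
              function(r) make_observation(img_scaled, regions$i[r], regions$j[r]),
              numeric(640)))                 # n x 640 data matrix
Y <- diag(3)[regions$label, ]                # n x 3 indicator: Y[i, j] = 1 iff class j
X <- scale(X)                                # center, scale each variable to unit variance
```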

The R programming language [11] was used for all the processing of the image data and for classification. The package imager [1] was used for manually extracting regions from the images; imager is an R interface to the CImg C++ library [13].

3.1 Sparse Linear Discriminant Analysis

We apply SDA [4] to solve the present classification problem. SDA is a statistical learning method [7] that falls under the category of supervised classifiers and is a sparse version of the classical linear discriminant analysis (LDA). The method can handle many classes, and with regularization it can also handle the case of more variables than observations, i.e. \(p\gg n\) problems. The underlying problem can be formulated in different ways, but we approach it via sparse optimal scoring.

$$\begin{aligned} \begin{array}{rl} \displaystyle {(\varvec{\theta }_k, \varvec{\beta }_k) = \mathop {\text {argmin}}\limits _{\varvec{\theta } \in \mathbb {R}^K,\,\varvec{\beta } \in \mathbb {R}^p} \Vert \mathbf {Y} \varvec{\theta } -\mathbf {X} \varvec{\beta }\Vert ^2 + \lambda _1 \varvec{\beta }^T\varvec{\varOmega }\varvec{\beta } + \lambda _2\Vert \varvec{\beta }\Vert _1} \\ \text {s.t.} \frac{1}{n} \varvec{\theta }^T \mathbf {Y}^T\mathbf {Y} \varvec{\theta } = 1, \;\; \varvec{\theta }^T \mathbf {Y}^T \mathbf {Y} \varvec{\theta }_l = 0 \; \forall l < k, \end{array} \end{aligned}$$
(3)

In the sparse optimal scoring formulation (Eq. 3), the data matrix \(\mathbf {X}\) and indicator matrix \(\mathbf {Y}\) are the same as described in the last section. We seek the discriminant vectors \(\varvec{\beta }_k,\, k \in \{1,2,...,K-1\}\), which we use to project the data into a lower dimensional space. Classification is performed in this lower dimensional space by assigning an observation to the class of the nearest centroid, where the labeled data is used to estimate the centroids. \(\varvec{\theta }\) serves to avoid the masking problem, i.e. it ensures that class centroids are not collinear in the lower dimensional representation; it is needed only during training, not for classifying new observations. The second and third terms in the minimization problem form an elastic net penalty [14], which serves as a regularizer and allows us to solve the problem when there are more variables than observations. The parameters \(\lambda _1\) and \(\lambda _2\) are selected via cross-validation. In our case, the \(\varvec{\varOmega }\) in the first part of the elastic net penalty is a diagonal \(p\times p\) matrix, which penalizes the magnitude of the coefficients in the \(\varvec{\beta }_k\) vectors.
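
For reference, an implementation of sparse optimal scoring is available in the CRAN package sparseLDA by the authors of [4]. The sketch below is hedged: the argument names follow the package documentation as we recall it and should be verified against the installed version, and the parameter values are arbitrary illustrations, not the paper's settings.

```r
library(sparseLDA)  # CRAN package implementing sparse discriminant analysis

## Hedged sketch: X is the n x 640 standardized data matrix and Y the n x 3
## indicator matrix from the previous section. A negative `stop` requests
## roughly that many non-zero variables; `lambda` is the ridge (Tikhonov)
## weight of the elastic net. Both values here are illustrative only.
fit <- sda(x = X, y = Y, lambda = 1e-6, stop = -300)
## The fitted object contains the sparse discriminant vectors (component
## `beta` in the package documentation) used to project new data.
```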

The SDA method (without the elastic net penalty) is a linear map to a lower dimensional representation, like principal component analysis (PCA), but in SDA we project the data such that we maximize the variance between classes and minimize the variance within classes, for optimal linear separation. In this respect SDA is more akin to the sparse versions of PCA [5]. The centroids in the lower dimensional space can be thought of as the means of different multivariate normal distributions that share a common covariance structure; we therefore get linear decision boundaries, as in classical LDA [7]. Other classifiers can also be used on the projected data. An example of such projected data is depicted in Fig. 5, which shows the training data used for the minced meat dataset.
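
The nearest-centroid rule itself is simple to write down; below is a plain R sketch under our own naming (B for the p x (K-1) matrix of discriminant vectors and labels for the factor of training classes, neither taken from the authors' code):

```r
## Nearest-centroid classification in the discriminant space.
Z <- X %*% B                                          # project training data
centroids <- apply(Z, 2, function(z) tapply(z, labels, mean))  # K x (K-1) class means

classify <- function(x_new, B, centroids) {
  z <- drop(x_new %*% B)                              # project a new observation
  d <- apply(centroids, 1, function(m) sum((z - m)^2))  # squared distance per class
  rownames(centroids)[which.min(d)]                   # label of the nearest centroid
}
```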

Fig. 5. Visualization of the training data used for the classification of the minced meat data, after projection with the discriminant vectors. The classes are almost perfectly separated along the first discriminant direction, although the foreign objects and meat lie rather close. The second discriminant vector further separates the meat and the foreign objects.

4 Results

The classification results for the classifier trained on the spring rolls data are summarized in Table 2, where we report the number of detected pixels in images not contained in the training or validation set. The SDA method was trained on only 6 images from dataset 1, and the training images are not included in the table. The final training set was balanced and consisted of 519 observations with 640 variables. The training error is 0% at 48% sparsity, i.e. only 48% of the values in the discriminant vectors are non-zero. This means that roughly half the variables are irrelevant for this particular classification task. The non-zero values consistently corresponded to the same channels across the five pixels that make up the 640 variables (the center pixel and its four neighbors). 10-fold cross-validation was used to tune the sparsity parameter, and it should be noted that a sparsity of 5% yielded only 5% error on the training data, meaning that a small subset of the variables is far more critical than the rest.

One thing to note about the results in Table 2 is that consistently fewer foreign object pixels were detected in the scans which contained only foreign objects. This is because the training set only contained foreign objects superimposed on the spring rolls. A very low number of false positives was detected in the data set consisting of only spring rolls and no foreign objects (4.21 pixels on average). In most of these images only 0, 1, 2 or 3 pixels were detected, except for a single outlier where 42 pixels were detected, which inflates the standard deviation.

Table 2. Results for the spring rolls dataset, where we present the number of pixels detected as foreign objects on average in 4 different datasets. Dataset 1 consists of 8 bags of spring rolls with foreign objects superimposed. Dataset 2 consists of the same 8 bags, where the spring rolls have been refrozen and scanned a day later. The other datasets are scans of only the spring rolls or only the foreign objects.

The classification results for the classifier trained on the minced meat dataset are summarized in Table 3, where we report the number of pixels detected in images that were not part of the training or validation set. The main difference from the spring rolls dataset is that no false positives were detected in the scans containing no foreign objects. Otherwise, the number of detected foreign object pixels varies consistently between datasets. Another difference is that the sparsity regularization parameter chosen in the cross-validation process yields 17% sparsity. This means that we could certainly get away with storing far less data for this approach.

Table 3. Results for the minced meat dataset, where we present the number of pixels detected as foreign objects on average in 5 different datasets. The first dataset is 5 scans of nothing, i.e. empty scans. Datasets 1, 2 and 3 consist of 10 scans each, two of which were used for training. The foreign objects contained in each dataset are listed in Table 1. The last dataset is meat without any foreign objects.

Some example results are presented in Fig. 6. The foreign objects detected were mostly metals, along with some stone pebbles, quartz and glass. The smallest objects detected are 2–3 mm in diameter.

Fig. 6. Example results from the classifier on both datasets. The white color indicates foreign objects. The top row shows the spring rolls; the bottom row shows, from left to right, an example from datasets 1, 2 and 3. The foreign objects detected were stone pebbles, metals, quartz and glass, which were the foreign objects used for training.

5 Discussion

We achieve good detection of the types of foreign objects that are represented in the training set, but we do not manage to generalize to all the scanned foreign objects. This is mainly because of the low signal-to-noise ratio, especially for objects with low absorption. The undetected items are also not represented in the training set; if they were included, we would potentially get more false positives, because they are not as distinct as the metals, stone pebbles and quartz, thus moving the decision boundary closer to the food item. One approach to obtaining a more general detector would be to train with as many different types of detectable foreign objects as possible and find the discriminant vectors from SDA, or to use a semi-supervised approach. New data could then be projected as in Fig. 5, and the air and food item classes could be described there with normal distributions or other ways of encapsulating the two classes; everything outside those classes would be classified as a foreign object. Each type of foreign object could also be modeled on its own; then we could encapsulate what is known, and everything outside the known classes would belong to an unknown class.

One way to augment the current measurements would be to estimate the thickness of the materials being scanned. This would be a good additional variable for a data-driven approach, or it could be used for normalization. It is already done in some commercial products, using a laser to map the height of the food product.

For the data sets used here, we can get away with storing half of the data or less. However, this is both application and material dependent: different applications could yield different foreign objects, and different materials have different intensity profiles, meaning that some variables/channels are redundant in some cases and useful in others. This would have to be assessed for each specific application.

6 Conclusion

We have demonstrated robust detection of certain foreign objects in the data sets used in this work. This was done in a completely data-driven manner by applying a sparse classifier to the normalized data. There is great potential in approaches similar to the one we present, which could help with storing less data and processing the results faster.