
1 Introduction

Baggage inspection using X-ray screening is a priority task that reduces the risk of crime, terrorist attacks and propagation of pests and diseases [1]. Security and safety screening with X-ray scanners has become an important process in public spaces and at border checkpoints [2]. However, inspection is a complex task because threat items are very difficult to detect when placed in closely packed bags, occluded by other objects, or rotated, thus presenting an unrecognizable view [3]. Manual detection of threat items by human inspectors is extremely demanding [4]. It is tedious because very few bags actually contain threat items, and it is stressful because the work of identifying a wide range of objects, shapes and substances (metals, organic and inorganic substances) takes a great deal of concentration. In addition, human inspectors receive only minimal technological support. Furthermore, during rush hours, they have only a few seconds to decide whether or not a bag contains a threat item [5]. Since each operator must screen many bags, the likelihood of human error becomes considerable over a long period of time even with intensive training. The literature suggests that detection performance is only about 80–90 % [6]. In baggage inspection, automated X-ray testing remains an open question due to: (i) loss of generality, which means that approaches developed for one task may not transfer well to another; (ii) deficient detection accuracy, which means that there is a fundamental tradeoff between false alarms and missed detections; (iii) limited robustness given that requirements for the use of a method are often met for simple structures only; and (iv) low adaptiveness in that it may be very difficult to accommodate an automated system to design modifications or different specimens.

There have been several contributions in computer vision for X-ray testing, including applications in the inspection of castings, welds, food, cargo and baggage screening [7]. For this work, it is worth reviewing the advances in baggage screening that have taken place over the course of this decade. They can be summarized as follows: Some approaches attempt to recognize objects using a single view of mono-energy X-ray images (e.g., the adapted implicit shape model based on visual codebooks [8]) or dual-energy X-ray images (e.g., Gabor texture features [9], bag of words based on SURF features [10] and pseudo-color, texture, edge and shape features [11]). More complex approaches that deal with multiple X-ray images have been developed as well. In the case of mono-energy imaging, see for example the recognition of regular objects using data association in [12] and active vision [13], where a second-best view is estimated. In the case of dual-energy imaging, see the use of visual vocabularies and SVM classifiers in [14]. Progress has also been made in the area of computed tomography. For example, in order to improve the quality of CT images, metal artifact reduction and de-noising techniques [15] were suggested. Many methods based on 3D features for 3D object recognition have been developed (see, for example, RIFT and SIFT descriptors [16], 3D Visual Cortex Modeling, 3D Zernike descriptors and the histogram of shape index [17]). There are also contributions using well-known recognition techniques (see, for example, bag of words [18] and random forest [19]). As we can see, progress in automated baggage inspection is modest and still very limited compared to what is needed, since X-ray screening systems are still operated by human inspectors. Automated recognition in baggage inspection is far from being perfected, given that recognizing the object of interest can become extremely difficult due to problems of (self-)occlusion, noise, acquisition artifacts, clutter, etc.

We believe that algorithms based on sparse representations can be used for this general task because, in many computer vision applications, state-of-the-art results have been significantly improved under the assumption that natural images admit a sparse decomposition [20]. Thus, it is possible to cast the recognition problem into a supervised form with X-ray images and class labels (e.g., objects to be recognized), using features learned in an unsupervised way. In the sparse representation approach, a dictionary is built from the training X-ray images, and matching is done by reconstructing the query image using a sparse linear combination of the dictionary atoms. Usually, the query image is assigned to the class with the minimal reconstruction error.

Reflecting on the problems confronting object recognition, we believe there are some key ideas that should be present in newly proposed solutions. First, certain parts of the objects provide no information about the class to be recognized (for example, occluded parts). Such parts should be detected and excluded from the recognition algorithm. Second, in recognizing any class, some parts of the object are more relevant than others (for example, the sharp parts when recognizing sharp objects like knives). Relevant parts are therefore class-dependent and can be found using unsupervised learning. Third, in a real-world environment, where X-ray images are not perfectly aligned and the distance between detector and objects can vary from capture to capture, analysis of fixed parts can lead to misclassification. Feature extraction should therefore not be restricted to fixed positions; it can instead be performed at several random positions, combined with a selection criterion that keeps only the best regions. Fourth, an object present in a query image can be subdivided into ‘sub-objects’ corresponding to its different parts (e.g., in the case of a handgun: trigger, muzzle, grip, etc.). When searching for images of the same class, it is therefore helpful to search for image parts across all training images instead of for whole similar training images.

Inspired by these key ideas, we propose a method for the recognition of objects using X-ray images. The three main contributions of our approach are: (1) A new general algorithm that is able to recognize regular objects: it has been evaluated in the recognition of four different objects. (2) A new representation of the classes to be recognized using random patches: this is based on representative dictionaries learned for each class of the training images, which correspond to a rich collection of representations of selected relevant parts that are particular to a specific class. (3) A new representation of the query X-ray image: this is based on (i) a discriminative criterion that selects the ‘best’ test patches extracted randomly from the query image and (ii) an ‘adaptive’ sparse representation of the selected patches computed from the ‘best’ representative dictionary of each class. Using these new representations, the proposed method (XASR+) can achieve high recognition performance under many complex conditions, as shown in our experiments.

2 Proposed Method

The proposed XASR+ method consists of two stages: learning and testing (see Fig. 1). In the learning stage, several random patches are extracted from the training images of each object class and described in order to build representative dictionaries. In the testing stage, random test patches of the query image are extracted and described, and for each test patch a dictionary is built by concatenating the ‘best’ representative dictionary of each object. Using this adapted dictionary, each test patch is classified in accordance with the Sparse Representation Classification (SRC) methodology [22]. Afterwards, patches are selected according to a discriminative criterion. Finally, the query image is classified by a majority vote over the selected patches. Both stages are explained in further detail in this section.

Fig. 1.

Overview of the proposed method. The figure illustrates the recognition of three different classes: clips, razor blades and springs. There are two stages: learning and testing. The stop-list is used to filter out patches that are not discriminative for these classes. The stopped patches are considered neither in the dictionaries of each class nor in the testing stage.

2.1 Model Learning

In the training stage, a set of n images for each of k objects is available, where \(\mathbf{I}_j^i\) denotes X-ray image j of object i (for \(i=1 \dots k\) and \(j=1 \dots n\)) as illustrated in Fig. 2. In each image \(\mathbf{I}_j^i\), m patches \({\mathcal P}_{jp}^i\) of size \(w \times w\) pixels (for \(p=1 \dots m\)) are randomly extracted. They are centered at \((x_{jp}^i,y_{jp}^i)\). In this work, a patch \({\mathcal P}\) is described by the vector:

$$\begin{aligned} \mathbf{p} = [ \ \mathbf{z} \ ; \ \alpha r ] \in \mathcal {R}^{d+1} \end{aligned}$$
(1)

where \(\mathbf{z} = g({\mathcal P}) \in \mathcal {R}^d\) is a descriptor of patch \({\mathcal P}\) (i.e., a local descriptor of d elements extracted from the patch); r is the distance from the center of the patch \((x_{jp}^i,y_{jp}^i)\) to the center of the image; and \(\alpha \) is a weighting factor between descriptor and location. The descriptor \(\mathbf{z}\) must be rotation invariant because the object can appear in any orientation. Patch \({\mathcal P}\) is described using a vector that has been normalized to unit length:

$$\begin{aligned} \mathbf{y} = f(\mathcal{P}) = \frac{\mathbf{p}}{|| \mathbf{p} ||} \in \mathcal{R}^{d+1} \end{aligned}$$
(2)
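As a concrete illustration, the following minimal sketch (not from the paper) computes the description of Eqs. (1)-(2); the local descriptor function g is a placeholder standing in for, e.g., the LBP descriptor used later, and the coordinate conventions are our assumptions.

```python
# Minimal sketch of Eqs. (1)-(2), assuming a descriptor function `g`
# (a placeholder) and the paper's weighting factor alpha.
import numpy as np

def describe_patch(patch, patch_center, image_center, g, alpha=1.0):
    z = g(patch)                                   # local descriptor z, in R^d
    r = np.linalg.norm(np.subtract(patch_center, image_center))  # distance to image center
    p = np.concatenate([z, [alpha * r]])           # p = [z ; alpha*r], Eq. (1)
    return p / np.linalg.norm(p)                   # y = p/||p||, Eq. (2)
```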
Fig. 2.

Extraction and description of m patches of training image j of object i.

In order to eliminate non-discriminative patches, a stop-list is computed from a visual vocabulary. The visual vocabulary is built using all descriptors \(\mathbf{Z} = \{ \mathbf{z}_{jp}^i \} \in \mathcal {R}^{d \times knm}\), for \(i=1 \dots k\), \(j=1 \dots n\) and \(p=1 \dots m\). Array \(\mathbf{Z}\) is clustered using a k-means algorithm into \(N_v\) clusters. Thus, a visual vocabulary containing \(N_v\) visual words is obtained. In order to construct the stop-list, the term frequency \(t_f\) is computed: \(t_f(d,v)\) is defined as the number of occurrences of word v in document d, for \(d = 1 \dots K\), \(v=1 \dots N_v\). In our case, a document corresponds to an X-ray image, and \(K=kn\) is the number of images in the training dataset. Afterwards, the document frequency \(d_f\) is computed: \(d_f(v) = \sum _d \{ t_f(d,v)>0 \}\), i.e., the number of images in the training dataset that contain word v, for \(v=1 \dots N_v\). The stop-list is built using the words with the highest and lowest \(d_f\) values: on one hand, visual words with the highest \(d_f\) values are not discriminative because they occur in almost all images; on the other hand, visual words with the lowest \(d_f\) values are so unusual that in most cases they correspond to noise. Usually, the top 5 % and bottom 10 % are stopped [23]. Those patches of \(\mathbf{Z}\) that belong to the stopped clusters are not considered in the following steps of our algorithm.
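A minimal sketch of this stop-list construction follows; the vocabulary size \(N_v\) and the use of scikit-learn's KMeans are our assumptions, while the 5 %/10 % stop fractions are those quoted above.

```python
# Sketch of the stop-list: cluster all descriptors into a visual
# vocabulary, compute document frequencies, and stop the most common
# and the rarest words. `Z` is (n_patches x d); `doc_id` is an integer
# array giving the index of the training image each patch comes from.
import numpy as np
from sklearn.cluster import KMeans

def build_stop_list(Z, doc_id, Nv=500, top=0.05, bottom=0.10):
    words = KMeans(n_clusters=Nv, n_init=10).fit_predict(Z)
    df = np.array([len(np.unique(doc_id[words == v])) for v in range(Nv)])
    order = np.argsort(df)                        # ascending document frequency
    stopped = set(order[:int(bottom * Nv)])       # too rare: mostly noise
    stopped |= set(order[Nv - int(top * Nv):])    # too common: not discriminative
    keep = ~np.isin(words, list(stopped))         # mask of surviving patches
    return keep, stopped
```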

Using (2), all extracted patches are described as \(\mathbf{y}_{jp}^i = f({\mathcal P}_{jp}^i)\). Thus, for object i, an array with the descriptions of all patches is defined as \(\mathbf{Y}^i = \{ \mathbf{y}_{jp}^i \} \in \mathcal {R}^{(d+1) \times nm}\) (for \(j=1 \dots n\) and \(p=1 \dots m\)).

Fig. 3.

Representative dictionaries of object i for \(Q=32\) (only \(q=1 \dots 7\) are shown) and \(R=20\). The left column shows the centroids \(\mathbf{c}^i_q\) of the parent clusters. The right columns (orange rectangle, called \(\mathbf{D}^i\)) show the centroids \(\mathbf{c}^i_{qr}\) of the child clusters. \(\mathbf{\bar{A}}^i_q\) is row q of \(\mathbf{D}^i\), i.e., the centroids of the child clusters of parent cluster q (Color figure online).

The description \(\mathbf{Y}^i\) of object i is clustered using the k-means algorithm into Q clusters that will be referred to as parent clusters:

$$\begin{aligned} \mathbf{c}_q^i = \text {kmeans} (\mathbf{Y}^i, Q) \end{aligned}$$
(3)

for \(q = 1 \dots Q\), where \(\mathbf{c}_q^i \in \mathcal {R}^{(d+1)}\) is the centroid of parent cluster q of object i. We define \(\mathbf{Y}_q^i\) as the array with all samples \(\mathbf{y}_{jp}^i\) that belong to the parent cluster with centroid \(\mathbf{c}_q^i\). In order to select a reduced number of samples, each parent cluster is clustered again into R child clusters:

$$\begin{aligned} \mathbf{c}_{qr}^i = \text {kmeans} (\mathbf{Y}_q^i, R) \end{aligned}$$
(4)

for \(r = 1 \dots R\), where \(\mathbf{c}_{qr}^i \in \mathcal {R}^{(d+1)}\) is the centroid of child cluster r of parent cluster q of object i. All centroids of child clusters of object i are arranged in an array \(\mathbf{D}^i\), and specifically for parent cluster q are arranged in a matrix:

$$\begin{aligned} \mathbf{\bar{A}}_q^i = [ \ \mathbf{c}_{q1}^i \ \dots \ \mathbf{c}_{qr}^i \ \dots \ \mathbf{c}_{qR}^i \ ] \in \mathcal {R}^{(d+1) \times R} \end{aligned}$$
(5)

Thus, this array contains R representative samples of parent cluster q of object i, as illustrated in Fig. 3. The set of all centroids of child clusters of object i, \(\mathbf{D}^i\), represents Q representative dictionaries with R descriptions \(\{ \mathbf{c}^i_{qr} \}\) each, for \(q=1 \dots Q, r = 1 \dots R\).
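The two-level clustering of Eqs. (3)-(5) could be implemented as in the following sketch; scikit-learn's KMeans and the guard for small parent clusters are our assumptions.

```python
# Sketch of Eqs. (3)-(5): Q parent clusters, each refined into R child
# clusters; the child centroids of parent q form the matrix Abar_q^i.
# `Yi` is (n*m x (d+1)), one row per patch description of object i.
import numpy as np
from sklearn.cluster import KMeans

def learn_dictionary(Yi, Q=32, R=20):
    labels = KMeans(n_clusters=Q, n_init=10).fit_predict(Yi)  # parents, Eq. (3)
    D = []                                    # D^i: one centroid matrix per parent
    for q in range(Q):
        Yq = Yi[labels == q]
        r = min(R, len(Yq))                   # guard against small clusters
        children = KMeans(n_clusters=r, n_init=10).fit(Yq)    # Eq. (4)
        D.append(children.cluster_centers_)   # Abar_q^i, stored row-wise, Eq. (5)
    return D
```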

Fig. 4.

Adaptive dictionary \(\mathbf{A}\) of patch \(\mathbf{y}\). In this example there are \(k=4\) objects in the training dataset. For this patch, only \(k'=3\) objects are selected. Dictionary \(\mathbf{A}\) is built from those objects by selecting, for each object, all child clusters of the parent cluster (see blue rectangles) that contains the child cluster with the smallest distance to the patch (see green squares). In this example, object 2 has no child clusters that are similar enough, i.e., \(h^2(\mathbf{y},{\hat{q}}^2)>\theta \) (Color figure online).

2.2 Testing

In the testing stage, the task is to determine the identity of the query image \(\mathbf{I}^t\) given the model learned in the previous section. From the test image, s selected test patches \({\mathcal P}_{p}^t\) of size \(w \times w\) pixels are extracted and described using (2) as \(\mathbf{y}_{p}^t = f({\mathcal P}_{p}^t)\) (for \(p=1 \dots s\)). The selection criterion of a test patch will be explained later in this section.

For each selected test patch with description \(\mathbf{y} = \mathbf{y}_{p}^t\), a distance to each parent cluster q of each object i of the training dataset is measured:

$$\begin{aligned} h^i(\mathbf{y},q) = \text {distance} (\mathbf{y} , \mathbf{\bar{A}}_q^i ). \end{aligned}$$
(6)

We tested several distance metrics. The best performance, however, was obtained by:

$$\begin{aligned} h^i(\mathbf{y},q) = \text {min}_r ||\mathbf{y} - \mathbf{c}_{qr}^i|| \ \ \text {for } r=1 \dots R, \end{aligned}$$
(7)

which is the smallest Euclidean distance to centroids of child clusters of parent cluster q as illustrated in Fig. 4. For \(\mathbf{y}\) and \(\mathbf{c}_{qr}^i\) normalized to unit \(\ell _2\) norm, the following distance can be used based on (7):

$$\begin{aligned} h^i(\mathbf{y},q) = \text {min}_r ( 1 - \langle \mathbf{y} , \mathbf{c}_{qr}^i \rangle ) \ \ \text {for } r=1 \dots R, \end{aligned}$$
(8)

where \(\langle \bullet \rangle \) denotes the scalar product, which provides a similarity (cosine of the angle) between vectors \(\mathbf{y}\) and \(\mathbf{c}_{qr}^i\). The parent cluster with the minimal distance is sought:

$$\begin{aligned} {\hat{q}}^i = \mathop {\text {argmin}}\limits _{q} \ h^i(\mathbf{y},q), \end{aligned}$$
(9)

whose minimal distance is \(h^i(\mathbf{y},{{\hat{q}}^i})\).
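A sketch of Eqs. (7) and (9) under the data layout returned by the dictionary-learning sketch above (each \(\mathbf{\bar{A}}_q^i\) stored row-wise):

```python
# Distance of a test patch y to each parent cluster (Eq. (7)) and the
# best parent cluster (Eq. (9)); `D` is the per-object list of child
# centroid matrices from the learning sketch.
import numpy as np

def best_parent(y, D):
    h = [np.min(np.linalg.norm(A_q - y, axis=1)) for A_q in D]  # h^i(y,q), Eq. (7)
    q_hat = int(np.argmin(h))                                   # Eq. (9)
    return q_hat, h[q_hat]
```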

For patch \(\mathbf{y}\), we select those training objects whose minimal distance is less than a threshold \(\theta \) in order to ensure similarity between the test patch and representative object patches. If \(k'\) objects fulfill the condition \(h^i(\mathbf{y},{{\hat{q}}^i}) < \theta \) for \(i=1 \dots k\), with \(k' \le k\), we can build a new index \(v_{i'}\) that indicates the index of the \(i'\)-th selected object, for \(i'=1 \dots k'\). For instance, in a training dataset with \(k=4\) objects, if \(k'=3\) objects are selected (e.g., objects 1, 3 and 4), then the indices are \(v_1=1\), \(v_2=3\) and \(v_3=4\), as illustrated in Fig. 4. The selected object \(i'\) for patch \(\mathbf{y}\) has its dictionary \(\mathbf{D}^{v_{i'}}\), and the corresponding parent cluster is \(u_{i'} = {{\hat{q}}^{v_{i'}}}\), whose child clusters are stored in row \(u_{i'}\) of \(\mathbf{D}^{v_{i'}}\), i.e., in \(\mathbf{A}^{i'} := \mathbf{\bar{A}}_{u_{i'}}^{v_{i'}}\).

Therefore, a dictionary for patch \(\mathbf{y}\) is built using the best representative patches as follows (see Fig. 4):

$$\begin{aligned} \mathbf{A}(\mathbf{y}) = [ \ \mathbf{A}^1 \dots \mathbf{A}^{i'} \dots \mathbf{A}^{k'} \ ] \in \mathcal {R}^{(d+1) \times Rk'} \end{aligned}$$
(10)
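The adaptive dictionary of Eq. (10) can then be assembled as in the following sketch; the threshold value shown for \(\theta\) is a placeholder, and the per-object structures are those of the learning sketch above.

```python
# Sketch of Eq. (10): concatenate the 'best' child-centroid blocks of
# all objects whose minimal distance to the patch is below theta.
# `dictionaries` is the list D^1 ... D^k from the learning sketch.
import numpy as np

def adaptive_dictionary(y, dictionaries, theta=0.5):
    blocks, selected = [], []
    for i, D in enumerate(dictionaries):
        h = [np.min(np.linalg.norm(A_q - y, axis=1)) for A_q in D]  # Eq. (7)
        q_hat = int(np.argmin(h))                                   # Eq. (9)
        if h[q_hat] < theta:                    # similarity condition
            blocks.append(D[q_hat].T)           # A^{i'}: child centroids as columns
            selected.append(i)                  # v_{i'}: original object index
    A = np.hstack(blocks) if blocks else None   # (d+1) x (R k'), Eq. (10)
    return A, selected
```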

With this adaptive dictionary \(\mathbf{A}\), built for patch \(\mathbf{y}\), we can use the Sparse Representation Classification (SRC) methodology [22]. That is, we look for a sparse representation of \(\mathbf{y}\) using the \(\ell _1\)-minimization approach:

$$\begin{aligned} \mathbf{\hat{x}} = \mathop {\text {argmin}}\limits _{\mathbf{x}} ||\mathbf{x}||_1 \ \ \ \text {subject to} \ \ \ \mathbf{A}{} \mathbf{x} = \mathbf{y} \end{aligned}$$
(11)

The residual of the reconstruction is calculated for each selected object \({i'}=1 \dots k'\):

$$\begin{aligned} r_{i'}(\mathbf{y}) = || \mathbf{y} - \mathbf{A} \delta _{i'} (\mathbf{\hat{x}}) || \end{aligned}$$
(12)

where \(\delta _{i'} (\mathbf{\hat{x}})\) is a vector of the same size as \(\mathbf{\hat{x}}\) whose only nonzero entries are the entries of \(\mathbf{\hat{x}}\) corresponding to class \(v(i')=v_{i'}\). Thus, the class of selected test patch \(\mathbf{y}\) is the class with the minimal residual, that is,

$$\begin{aligned} {\hat{i}} (\mathbf{y}) = v({\hat{i'}}) \end{aligned}$$
(13)

where \(\hat{i'} = \text {argmin}_{i'} r_{i'}(\mathbf{y})\).

Finally, the identity of the query object will be the majority vote of the classes assigned to the s selected test patches \(\mathbf{y}^t_p\), for \(p=1 \dots s\):

$$\begin{aligned} \text {identity} ( \mathbf{I}^t ) = \text {mode} ( \hat{i}(\mathbf{y}^t_1), \dots \hat{i}(\mathbf{y}^t_p), \dots \hat{i}(\mathbf{y}^t_s)) \end{aligned}$$
(14)
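The following sketch illustrates Eqs. (11)-(14); since the exact \(\ell _1\) solver is not specified here, it uses scikit-learn's Lasso as a common relaxation of (11), which is our substitution, and it assumes R atoms per selected class, as produced by the adaptive-dictionary sketch.

```python
# SRC-style classification of one test patch (Eqs. (11)-(13)) and the
# majority vote over patches (Eq. (14)). The Lasso relaxation replaces
# the exact l1 program of Eq. (11).
import numpy as np
from sklearn.linear_model import Lasso

def classify_patch(y, A, selected, R, lam=1e-3):
    x_hat = Lasso(alpha=lam, fit_intercept=False,
                  max_iter=10000).fit(A, y).coef_       # sparse code, Eq. (11)
    residuals = []
    for ip in range(len(selected)):                     # one block of R atoms per class
        delta = np.zeros_like(x_hat)
        delta[ip * R:(ip + 1) * R] = x_hat[ip * R:(ip + 1) * R]
        residuals.append(np.linalg.norm(y - A @ delta)) # Eq. (12)
    return selected[int(np.argmin(residuals))]          # Eq. (13)

def vote(patch_classes):
    return np.bincount(patch_classes).argmax()          # mode, Eq. (14)
```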

The selection of the s patches of the query image proceeds as follows:

  • (i) From query image \(\mathbf{I}^t\), m patches are randomly extracted and described using (2): \(\mathbf{y}^t_j\), for \(j=1 \dots m\), with \(m \ge s\).

  • (ii) Each patch \(\mathbf{y}^t_j\) is represented by \(\mathbf{\hat{x}}^t_j\) using the adaptive sparse representation of (11).

  • (iii) The sparsity concentration index (SCI) of each patch is computed in order to evaluate how spread out its sparse coefficients are [22]. SCI is defined by

    $$\begin{aligned} S_j := \text {SCI}(\mathbf{y}^t_j) = \frac{k \cdot \max _{i'} || \delta _{i'} (\hat{\mathbf{x}}^t_j) ||_1 / || \hat{\mathbf{x}}^t_j ||_1 - 1}{k-1} \end{aligned}$$
    (15)

    If a patch is discriminative enough, its SCI is expected to be large. Note that we use k instead of \(k'\) because the concentration of the coefficients must be measured with respect to all k classes.

  • (iv) The array \(\{ S_j \}_{j=1}^m\) is sorted in descending order.

  • (v) The first s patches in this sorted list whose SCI values are greater than a threshold \(\tau \) are selected (see the sketch after this list). If only \(s'\) patches are selected, with \(s' < s\), then the majority vote decision in (14) is taken with these \(s'\) patches.
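A minimal sketch of the SCI computation and of selection steps (iii)-(v); the values shown for s and \(\tau\) are placeholders, not the paper's settings.

```python
# Eq. (15) and steps (iii)-(v): compute the SCI of each patch's sparse
# code and keep the s best patches above the threshold tau. `codes` is
# the list of sparse codes x_hat (R atoms per selected class); k is the
# total number of classes.
import numpy as np

def sci(x_hat, R, k):
    l1 = np.abs(x_hat).sum()
    if l1 == 0:
        return 0.0
    per_class = [np.abs(x_hat[i * R:(i + 1) * R]).sum()   # ||delta_i'(x)||_1
                 for i in range(len(x_hat) // R)]
    return (k * max(per_class) / l1 - 1) / (k - 1)        # Eq. (15)

def select_patches(codes, R, k, s=20, tau=0.1):
    scores = np.array([sci(x, R, k) for x in codes])
    order = np.argsort(-scores)                           # step (iv): descending
    return [j for j in order if scores[j] > tau][:s]      # step (v)
```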

3 Experiments

Our method was tested on the recognition of five classes in baggage screening: handguns, shuriken (ninja stars), clips, razor blades and background (see some samples in Fig. 5). In our experiments, there are 100 X-ray images per class. All images were resized to 128 \(\times \) 128 pixels. We defined the following protocol: from each class, 50 images were randomly chosen for training and one for testing. In order to obtain a better confidence level in the estimation of recognition accuracy, the test was repeated 100 times by randomly selecting 51 new images per class each time (50 for training and 1 for testing). The accuracy reported in all of our experiments is the average over the 100 tests.
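This protocol can be summarized by the following sketch, where `images`, `train_xasr` and `classify_xasr` are hypothetical stand-ins for the dataset and for the learning and testing stages of XASR+.

```python
# Sketch of the evaluation protocol: 100 random trials, 50 training and
# 1 test image per class each time. `images` maps class -> list of 100
# images; train_xasr/classify_xasr are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)
accuracies = []
for trial in range(100):
    train_set, test_set = {}, {}
    for cls, imgs in images.items():
        idx = rng.permutation(len(imgs))
        train_set[cls] = [imgs[i] for i in idx[:50]]   # 50 training images
        test_set[cls] = imgs[idx[50]]                  # 1 test image
    model = train_xasr(train_set)
    hits = sum(classify_xasr(model, img) == cls for cls, img in test_set.items())
    accuracies.append(hits / len(test_set))
print(np.mean(accuracies))                             # average over 100 tests
```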

The descriptor used by our method was LBP\(_{8,1}^{ri}\), i.e., the rotation-invariant Local Binary Pattern with 8 samples and radius 1 [25]. This yields a 36-bin descriptor (\(d=36\)). The size of the patch was 24 \(\times \) 24 pixels (\(w=24\)).
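A sketch of this descriptor using scikit-image (our choice of library) is shown below; with method='ror' there are exactly 36 distinct rotation-invariant codes for 8 samples, which gives the 36-bin histogram.

```python
# 36-bin rotation-invariant LBP_{8,1} histogram of a patch, using
# scikit-image's 'ror' (rotation-invariant) mapping.
import numpy as np
from skimage.feature import local_binary_pattern

def _min_rotation(v, P=8):
    return min(((v << i) | (v >> (P - i))) & (2**P - 1) for i in range(P))

RI_CODES = sorted({_min_rotation(v) for v in range(256)})  # the 36 codes

def lbp_descriptor(patch):
    codes = local_binary_pattern(patch, P=8, R=1, method='ror').ravel()
    hist = np.array([(codes == c).sum() for c in RI_CODES], dtype=float)
    return hist / hist.sum()                               # d = 36 bins
```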

Table 1. Accuracy [%] of each experiment

In order to evaluate robustness against occlusion, we corrupted the test images with a square of random gray value and size \(a \times a\) pixels, located randomly, for \(a=15, 30, 50, 70\) (see example in Table 1). The obtained results are given in the first row of Table 1 (see XASR+’s row). We observe that the accuracy was more than 95 % in each class when there was no occlusion, and more than 80 % when the object was occluded by less than 30 %.
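The corruption used in this experiment can be reproduced with a sketch like the following; the random square position and uniform gray value reflect our reading of the setup.

```python
# Paste an a x a square of random uniform gray value at a random
# location of a test image (robustness-to-occlusion experiment).
import numpy as np

def occlude(img, a, rng=np.random.default_rng()):
    out = img.copy()
    y = rng.integers(0, img.shape[0] - a + 1)
    x = rng.integers(0, img.shape[1] - a + 1)
    out[y:y + a, x:x + a] = rng.integers(0, 256)   # random gray value
    return out
```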

In order to evaluate the effectiveness of the stop-list, we repeated the same experiment without considering this step. The results are given in the second row of Table 1 (see XASR’s row). We observe that the use of a stop-list can increase the accuracy significantly.

In addition, we compared our method with four well-known methods that can be used in object recognition: (i) SIFT [26], (ii) sparse representation classification (SRC) [22] with SIFT descriptors, (iii) efficient visual search based on an information retrieval approach (Vgoogle) [23], and (iv) bag of words [27] using KNN (BoW-KNN) and random forest (BoW-RF) [28] with SIFT descriptors. We implemented these methods according to the specifications given by the authors in their papers, and set the parameters so as to obtain the best performance. The results are summarized in the corresponding rows of Table 1. They show that XASR+ deals well with unconstrained conditions in every experiment, achieving high recognition performance and obtaining similar or better results than the other representative methods in the literature.

The computing time depends on the size of the dictionary, which is proportional to the number of classes to be detected. In our experiments with 5 classes, the testing stage takes about 0.2 s per test image on a Mac Mini Server (OS X 10.10.1, 2.6 GHz Intel Core i7 with 4 cores, 16 GB 1600 MHz DDR3 RAM).

Fig. 5.

Images used in our experiments. The five classes are: handguns, shuriken, razor blades, clips and background.

4 Conclusions

In this paper, we have presented XASR+, an algorithm that is able to recognize objects automatically under less constrained conditions, including variations in contrast, pose, intra-class appearance, image size and focal distance. We tested the effectiveness of our method on the detection of four different objects: razor blades, shuriken (ninja stars), handguns and clips. In our experiments, the recognition rate was more than 95 % in every class. The robustness of our algorithm is due to three reasons: (i) the dictionaries learned for each class in the learning stage correspond to a rich collection of representations of relevant parts, which were selected and clustered; (ii) the testing stage is based on adaptive sparse representations of several random patches, using the dictionaries estimated in the previous stage that provide the best match with the patches; and (iii) a visual vocabulary and a stop-list are used to reject non-discriminative patches in both the learning and testing stages.