
1 Introduction

Identity fraud is a major issue in today's societies, with serious consequences: the threats range from petty fraud to organized crime and terrorist actions. The work presented in this paper is part of the IDFRAud research project, which proposes a platform for identity document verification. The first step classifies the query document according to its type and country of origin, in order to prepare the verification of specific security features, fake detection, document archiving, etc. These latter processes are out of the scope of this paper.

For any supervised classification problem, the first task is to collect annotated data. In our application, these are specimens of the various document types from the various issuing countries. Obtaining such data in large quantities is not always possible, and we should therefore take this limitation into account. Some classes have more samples than others, yielding unbalanced datasets. Moreover, query images vary from high-quality scans to poor-quality mobile phone photos with complex backgrounds, various orientations, occlusions, or flares, see Fig. 1.

There are two main approaches in the document classification literature. Methods based on the layout are mainly used when documents are composed of text blocks, figures, tables, etc. This is the case for journal articles, publications, books, or invoices. Documents are described by their spatial layout, i.e. the structure of text blocks, figures, and tables. Such descriptions are then used to perform classification [2, 16] or to compute similarities [8, 24]. The second type of approach is based on text. These methods build a description of the text content (extracted with an OCR in the case of scanned documents), such as bag of words or Word2Vec, which is given as input to classifiers [34]. More recently, Recurrent Neural Networks (RNN) have been applied to classify documents [17].

Fig. 1. Sample images from the databases

Section 2 explains why we have discarded these two classical approaches in favour of an alternative based on the visual content of the identity documents. Image recognition covers a large spectrum of tasks, with applications in search engines, object detection, and image categorization/classification, and has been extensively studied over the last decades. The availability of large and/or complex datasets as well as regular international challenges has spurred a large variety of image classification methods. We propose to apply these approaches to identity document classification.

This choice is not obvious as there are few graphical elements in identity documents. Moreover, the portrait photo of the owner is uninformative for classification. However, recent work on Convolutional Neural Networks (CNN) showed that they provide generic visual descriptions which are transferable to a large variety of image recognition tasks, such as fine-grain image classification. Thus, our paper studies a wide range of image classification methods as well as the transfer capabilities of CNN to the specific task of identity document classification.

2 Previous Works

The introduction presented document recognition through three main trends: layout-based, text-based, and visual-based methods. We now explain why we chose the last one.

Identity documents contain textual and graphical information with a given layout. For such well-structured documents, one could expect to base their classification on the layout. However, the layout is not always discriminant. Some classes share a very similar structure: this is especially the case for different versions of passports or ID cards issued by the same country. Other methods are based on text transcription. Unfortunately, such methods are not adapted to our application due to the following difficulties: the document is not localized a priori in the query image, and background information might disturb the OCR tasks, see Fig. 1. Indeed, text information is difficult to extract before knowing the type of the document and where it is localized in the image. Moreover, a large part of the text is specific to the owner of the document and not to the class. Therefore, we prefer to rely on the graphical content of the identity document and we turn towards image classification techniques in search for robustness and diversity.

Image classification has received large attention from the scientific community, e.g. see the abundant literature related to the Pascal VOC [9] and ImageNet [7] challenges. A large part of the modern approaches follow the bag-of-words (BOW) approach [6], represented by a three-step pipeline: (1) extraction of local image features, (2) encoding of local image descriptors and pooling of these encoded descriptors into a global image representation, (3) training and classification of global image descriptors for the purpose of object recognition. Local feature points, such as SIFT [21], are widely used as local features due to their description capabilities. Regarding the second step, image encoding, BOW was originally used to encode the feature point distribution into a global image representation [12, 16]. Fisher vectors and VLAD later showed improvements over BOW [14, 23]. Pooling has also witnessed many improvements: for instance, spatial and feature space pooling techniques have been widely investigated [18, 32]. Finally, regarding the last step of the pipeline, discriminative classifiers such as linear Support Vector Machines (SVM) are widely accepted as the reference in terms of classification performance [4].

Recently, deep CNN approaches have been successfully applied to large-scale image classification datasets, such as ImageNet [7, 15], obtaining state-of-the-art results significantly above Fisher vector or bag-of-words schemes. These networks have a much deeper structure than standard representations, including several convolutional layers followed by fully connected layers, resulting in a very large number of parameters that have to be learned from training data. By learning these network parameters on large image datasets, a structured representation can be extracted at an intermediate to high level [22, 35]. Furthermore, deep CNN representations have recently been combined with VLAD [1, 11] or Fisher vector [5, 19] encodings.

Fig. 2. Classification pipelines composed of (1) feature extraction on the first row, (2) feature encoding on the second row, and (3) classification on the final row.

It is worth mentioning that other approaches have been proposed in Computer Vision with the aim of building mid-level descriptions [29] or learning a set of discriminative parts to model classes [10, 25, 28]. They are highly effective in similar fine-grain classification scenarios but are extremely costly.

3 A Plurality of Methods

To perform image classification, we first follow the BOW-based pipeline. SIFT keypoints are extracted in either a dense fashion or by interest point detection. Dense extraction tends to offer better performance in classification, while interest points are rotation invariant [26]. Then, these features are encoded with BOW, VLAD or Fisher vectors and are used to classify images with SVM.
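As an illustration of the first step, the following is a minimal sketch of detected and dense SIFT extraction with OpenCV (assuming a version where SIFT is available); the grid step and patch size are illustrative choices, not the exact settings used in our experiments.

import cv2
import numpy as np

def detected_sift(gray):
    # interest point detection + description
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray, None)
    return desc  # (n_points, 128), or None if nothing was detected

def dense_sift(gray, step=8, size=16):
    # description on a regular grid of keypoints
    sift = cv2.SIFT_create()
    h, w = gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for y in range(step, h - step, step)
                 for x in range(step, w - step, step)]
    _, desc = sift.compute(gray, keypoints)
    return desc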

Secondly, we study CNN-based features, where intermediate transferable representations are computed from pre-trained networks. Descriptors are computed using various networks, layers, orientations, and scales. Finally, a VLAD aggregation of activation maps across orientations and scales is proposed. These image descriptors are similarly given as input to SVM to perform classification, see Fig. 2.

3.1 Bag-of-Words

Assume that the local description outputs vectors in \(\mathbb {R}^d\). The Bag of visual Words aims at encoding local image descriptors based on a partition of the feature space \(\mathbb {R}^d\) into regions. This partition is usually obtained by running the k-means algorithm on a training set of feature points. It yields a set \(\mathcal {V}\), the so-called visual vocabulary, of k centroids \(\{\mathbf {v}_i\}_{i=1}^k\), named visual words. The regions are the Voronoi cells of the centroids. This process is performed offline, once and for all.

The local descriptors of an image \(\{{\mathbf {x}}_t\}_{t=1}^T\) are quantized onto the visual vocabulary \(\mathcal {V}\):

$$\begin{aligned} \mathsf {NN}({\mathbf {x}}_t) = \arg \min _{1\le i\le k}\Vert {\mathbf {x}}_t-\mathbf {v}_i\Vert . \end{aligned}$$
(1)

The histogram of frequencies of these mappings becomes the global image description whose size is k.
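The following is a minimal sketch of this encoding, with the vocabulary learned offline by k-means (here with scikit-learn's MiniBatchKMeans; k = 1024 is an illustrative size, not the paper's setting).

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_vocabulary(train_descriptors, k=1024):
    # train_descriptors: (N, d) array of local descriptors from the training set
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(train_descriptors)

def bow_encode(descriptors, vocabulary):
    # nearest-centroid assignment of Eq. (1), then a k-bin frequency histogram
    assignments = vocabulary.predict(descriptors)
    hist = np.bincount(assignments, minlength=vocabulary.n_clusters)
    return hist / max(len(descriptors), 1)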

3.2 Fisher Vectors

Fisher vectors also start from a visual vocabulary \(\mathcal {V}\), but use it as a Gaussian Mixture Model (GMM). The distribution of the local descriptors is assumed to be a mixture of k Gaussians \(\mathcal {N}(\mathbf {v}_i,\mathsf {diag}(\varvec{\sigma }_i^2))\) with weights \(\{\omega _i\}\). Covariance matrices are assumed to be diagonal; the variance vectors \(\{\varvec{\sigma }_i^2\}\) and weights \(\{\omega _i\}\) are learned from the training set as well.

Fisher vectors consider the log-likelihood of the local descriptors of the image \(\{{\mathbf {x}}_t\}_{t=1}^T\) w.r.t. this GMM. They are composed of two gradients of this quantity per Gaussian distribution: the gradient \(G^X_\mu \) w.r.t. \(\mathbf {v}_i\) and the gradient \(G^X_\sigma \) w.r.t. the variance vector \(\varvec{\sigma }_i^2\):

$$\begin{aligned} G_{\mu ,i}^X = \frac{1}{T \sqrt{\omega _i}} \sum _{t=1}^T \gamma _t(i) \mathsf {diag}(\varvec{\sigma }_i)^{-1}({\mathbf {x}}_t-\mathbf {v}_i), \end{aligned}$$
(2)
$$\begin{aligned} G_{\sigma ,i}^X = \frac{1}{T \sqrt{2\omega _i}} \sum _{t=1}^T \gamma _t(i) [\mathsf {diag}(\varvec{\sigma }_i^2)^{-1}({\mathbf {x}}_t-\mathbf {v}_i)^2 -\mathbf {1}_d ], \end{aligned}$$
(3)

where \(\gamma _t(i)\) represents the soft assignment term, i.e. the probability that descriptor \({\mathbf {x}}_t\) derives from the i-th Gaussian distribution [23], and \(\mathbf {a}^2\) denotes the vector whose components are the square of the components of \(\mathbf {a}\). The concatenation of these gradients results in a global descriptor of 2kd components.
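A minimal sketch of Eqs. (2)-(3), assuming a diagonal-covariance GMM fitted offline with scikit-learn (the number of Gaussians, k = 64, is illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

def learn_gmm(train_descriptors, k=64):
    return GaussianMixture(n_components=k, covariance_type='diag',
                           random_state=0).fit(train_descriptors)

def fisher_vector(X, gmm):
    # X: (T, d) local descriptors of one image
    T, d = X.shape
    gamma = gmm.predict_proba(X)                      # (T, k) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = X[:, None, :] - mu[None, :, :]             # (T, k, d)
    G_mu = (gamma[:, :, None] * diff / np.sqrt(var)[None]).sum(0)
    G_mu /= (T * np.sqrt(w))[:, None]                 # Eq. (2)
    G_sig = (gamma[:, :, None] * (diff ** 2 / var[None] - 1.0)).sum(0)
    G_sig /= (T * np.sqrt(2.0 * w))[:, None]          # Eq. (3)
    return np.concatenate([G_mu.ravel(), G_sig.ravel()])   # 2*k*d components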

3.3 VLAD

VLAD is similar to Fisher vectors [14], but it only aggregates the differences between the local descriptors and their hard-assigned centroid from the visual vocabulary:

$$\begin{aligned} \mathbf {d}_i = \sum _{{\mathbf {x}}_t:\mathsf {NN}({\mathbf {x}}_t)=i} {\mathbf {x}}_t - \mathbf {v}_i. \end{aligned}$$
(4)

The global descriptor \((\mathbf {d}_1^{\top },\ldots ,\mathbf {d}_k^{\top })^\top \) has a size of dk. A power law, \(l_2\) normalization, and/or PCA reduction are usually performed on Fisher and VLAD [23].
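A minimal sketch of Eq. (4), reusing the k-means vocabulary of Sect. 3.1 and applying the power-law and l2 normalizations mentioned above (the exponent 0.5 is an illustrative choice):

import numpy as np

def vlad_encode(X, vocabulary, alpha=0.5):
    k, d = vocabulary.cluster_centers_.shape
    assignments = vocabulary.predict(X)
    v = np.zeros((k, d))
    for i in range(k):
        members = X[assignments == i]
        if len(members):
            v[i] = (members - vocabulary.cluster_centers_[i]).sum(0)   # Eq. (4)
    v = v.ravel()
    v = np.sign(v) * np.abs(v) ** alpha                # power-law normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v                 # l2 normalization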

3.4 Convolutional Neural Networks

Deep Convolutional Neural Networks [15] are composed of convolutional layers followed by fully connected ones, with normalization and/or pooling performed in between layers. There is a large variety of network architectures [30, 31], but a usual choice is 5 convolutional layers followed by 3 fully connected layers. The layer parameters are learned from training data.

The works [22, 33] showed that extracting intermediate layers produces mid-level generic representations, which can be used for various recognition tasks and a wide range of data [27]. In our case, we use a fast network and a very deep network, both trained on ImageNet ILSVRC data. The fast network from [3] is similar to [15], while the deep network stacks more convolutional layers (19 layers in total) with smaller convolutional filters [30].

Following previous works [22, 28, 33], image representations are computed by either taking the output of the fully connected intermediate layers or by performing pooling on the output of the last convolutional layer [33].
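As a minimal sketch of such transferred descriptors, the snippet below extracts the average-pooled last convolutional layer (c5) and the first two fully connected layers (fc6, fc7) from a VGG-19 pre-trained on ImageNet, as provided by torchvision (a recent version is assumed); the 'fast' network of [3] would be handled analogously from its own weights.

import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def cnn_descriptors(pil_image):
    x = preprocess(pil_image).unsqueeze(0)            # (1, 3, 224, 224)
    conv = vgg.features(x)                            # c5 feature maps, (1, 512, 7, 7)
    c5 = conv.mean(dim=(2, 3)).squeeze(0)             # average pooling over locations
    h = torch.flatten(vgg.avgpool(conv), 1)
    fc6 = vgg.classifier[0](h)                        # first fully connected layer
    fc7 = vgg.classifier[3](vgg.classifier[2](vgg.classifier[1](fc6)))
    return c5, fc6.squeeze(0), fc7.squeeze(0)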

Unfortunately, rotation invariance cannot be obtained with such networks. Thus, we enrich our datasets using flipped and rotated versions of each image to artificially enforce such invariance.
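A minimal sketch of this enrichment, producing the 8 variants (the four 90-degree rotations and their horizontal flips) used later in Sect. 4.2:

from PIL import Image

def eight_orientations(pil_image):
    variants = []
    for angle in (0, 90, 180, 270):
        rotated = pil_image.rotate(angle, expand=True)
        variants.append(rotated)                                    # rotation only
        variants.append(rotated.transpose(Image.FLIP_LEFT_RIGHT))   # rotation + flip
    return variants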

Recent works showed that fully connected layers can be kernelized to obtain a fully convolutional network [20]. Such a transformation allows inputs of various sizes, which is shown to be beneficial for classification in [13].
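A minimal sketch of this kernelization for a VGG-style fc6 layer, whose 25088-to-4096 linear map becomes a 7x7 convolution over 512 channels that can slide over larger feature maps (the channel count and kernel size are those of VGG; other networks would differ):

import torch.nn as nn

def fc_to_conv(fc, in_channels=512, kernel_size=7):
    # fc: an nn.Linear whose input is a flattened (in_channels, k, k) feature map
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=kernel_size)
    conv.weight.data = fc.weight.data.view(fc.out_features, in_channels,
                                           kernel_size, kernel_size)
    conv.bias.data = fc.bias.data
    return conv

Applied at every location of a larger last-convolutional-layer output, its responses can then be max pooled into a single fc6-like descriptor, as done in Sect. 4.2.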

After showing the benefits of using several scales and orientations, we propose to aggregate multi-scale information using VLAD across a fixed set of scales and orientations. Specifically, each activation of the feature map is considered as a local descriptor, and all of them are aggregated with equal weights. Unlike the similar NetVLAD [1], our method allows aggregation over several scales and orientations. To our knowledge, such use of VLAD aggregation over scales and orientations of the activations of various layers has not yet been proposed.
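A minimal sketch of this aggregation, reusing the vlad_encode sketch of Sect. 3.3; the PCA (to 128 dimensions) and the small activation vocabulary are assumed to be learned offline on training activations.

import numpy as np

def activations_to_descriptors(feature_map):
    # feature_map: (channels, h, w) activations for one scale/orientation
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T             # (h*w, c) local descriptors

def multiscale_vlad(feature_maps, pca, vocabulary, alpha=0.5):
    # feature_maps: list of activation maps over all scales and orientations
    X = np.vstack([activations_to_descriptors(f) for f in feature_maps])
    X = X - X.mean(axis=0)                             # centering
    X = pca.transform(X)                               # reduction to 128 dimensions
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12   # l2 normalization
    return vlad_encode(X, vocabulary, alpha=alpha)     # VLAD + power normalization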

4 Experiments

4.1 Datasets

There is no publicly available dataset of identity documents as they hold sensitive and personal information. Three private datasets are provided by our industrial partner. Images are collected from a variety of sources (scans, mobile photos) and no constraint is imposed. Thus, the documents have any dimension, any orientation, and might be surrounded by complex backgrounds. Figure 1 shows examples of such images.

Preliminary experiments are carried out on a dataset of 9 classes of French documents (FRA), namely identity card (front), identity card (back), passport (old), passport (new), residence card (old front), residence card (old back), residence card (new front), residence card (new back), and driving licence. A total of 527 samples are divided into train and test, ranging from 26 to 136 images per class. Then, a larger dataset (Extended-FRA or E-FRA) of the same types of documents with a total of 2399 images (86 to 586 per class) is used. The last dataset consists of 446 samples (8 to 110 per class) of 10 Belgian identity documents (BEL), namely identity card 1 (front), identity card 1 (back), identity card 2 (front), identity card 2 (back), residence card (old front), residence card (old back), residence card (new front), residence card (new back), passport (new), and passport (old).

Table 1. Evaluation of BOW, VLAD, and Fisher in terms of mAP for detected and dense features, on the FRA dataset.
Table 2. Performance of several CNN-based features, on the FRA dataset.

4.2 Results

An extensive evaluation is carried out on the image datasets. Three measures are calculated: mean average precision (mAP), overall mean accuracy, and averaged accuracy per class.
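A minimal sketch of these three measures, assuming one-vs-rest decision scores from the SVM (scores is an (n_samples, n_classes) array aligned with the class list; the averaged accuracy per class is computed as the balanced accuracy):

import numpy as np
from sklearn.metrics import (average_precision_score, accuracy_score,
                             balanced_accuracy_score)
from sklearn.preprocessing import label_binarize

def evaluate(y_true, y_pred, scores, classes):
    Y = label_binarize(y_true, classes=classes)        # (n_samples, n_classes)
    mAP = np.mean([average_precision_score(Y[:, i], scores[:, i])
                   for i in range(len(classes))])
    return {'mAP': mAP,
            'overall accuracy': accuracy_score(y_true, y_pred),
            'mean accuracy per class': balanced_accuracy_score(y_true, y_pred)}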

First, SIFT-based methods are evaluated on the FRA dataset, see Table 1. This comprises BOW, VLAD, and Fisher vector encodings with several visual vocabulary sizes, computed from detected or dense SIFT local descriptors. We note that SIFT descriptors are square-rooted and PCA is applied to obtain 64-dimensional vectors. We observe that Fisher vectors perform better than VLAD, which performs better than BOW. This is expected: the more refined the encoding, the longer the global descriptor, and the better the performance. Even when comparing similar global descriptor dimensions, Fisher vectors offer the best performance. Note that Fisher vectors do not improve beyond 64 Gaussians. Secondly, dense local description overall outperforms detected features, except in the case of BOW encoding. These results agree with general observations made in computer vision for classification tasks [26].
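A minimal sketch of this descriptor post-processing, assuming the usual RootSIFT-style square-rooting (l1 normalization followed by an element-wise square root) and a PCA reduction to 64 dimensions fitted offline on training descriptors:

import numpy as np
from sklearn.decomposition import PCA

def root_sift(desc, eps=1e-12):
    desc = desc / (np.abs(desc).sum(axis=1, keepdims=True) + eps)   # l1 normalization
    return np.sqrt(desc)

pca64 = PCA(n_components=64)   # fit offline: pca64.fit(root_sift(train_descriptors))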

Then, we evaluate CNN-based descriptors on the same FRA dataset, see Table 2. Two architectures are compared: the ‘fast’ network [3] and the deep ‘vd19’ network [30]. Descriptors are obtained by extracting the output of the first two fully connected layers (fc6 and fc7), as well as the last convolutional layer (c5). Average and max pooling of c5 are evaluated as well. Surprisingly, the fast network outperforms vd19. Average pooling is also shown to outperform max pooling for the convolutional layer and is preferred in the following experiments. Overall, c5 outperforms fc6, which outperforms fc7. In fact, lower layers (c5) encode lower-level and more generic information, which is less sensitive to the network's training data.

Table 3. Orientation invariance of CNN features, on the FRA dataset.
Table 4. Performance using various combinations of the FRA and E-FRA datasets with orientation invariance. Tr, Te, and E represent the training set of FRA, the testing set of FRA, and E-FRA, respectively.

Since the CNN features do not have any rotation invariance mechanism, we propose to enrich the training data collection by adding rotated and flipped images (ending up with 8 distinct descriptors per image), see Table 3. Such a process offers a consistent improvement for every descriptor.

Further experiments are carried out on the larger E-FRA dataset, see Table 4. Unlike for the FRA dataset alone, we observe that fc6 outperforms c5. Unsurprisingly, the more training data, the better the performance, reaching up to 99% mAP and more than 96% accuracy when training on E-FRA. More experiments are performed on the BEL dataset, see Table 5. We divide the dataset into three folds, then learn on two thirds and test on the last one. Scores are obtained on all permutations and finally averaged. As for E-FRA, the sixth fully connected layer offers the best performance. Performances on the BEL dataset are also much lower because some classes (residence card (old/front), residence card (old/back), residence card (new/back)) have very few (5 to 12) training samples.
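A minimal sketch of this 3-fold protocol with a linear SVM (the stratified split and the classifier settings are illustrative, not necessarily those used in our experiments):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

def three_fold_accuracy(X, y):
    folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        clf = LinearSVC().fit(X[train_idx], y[train_idx])   # learn on two thirds
        scores.append(clf.score(X[test_idx], y[test_idx]))  # test on the last one
    return np.mean(scores)                                   # average over permutations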

Table 5. Results obtained on the BEL dataset using 3 folds.
Table 6. Varying scales CNN features with orientation invariance, on the FRA dataset.

The very recent work of [13] highlighted how the input dimension of the image can have a large impact on performance. Therefore, we experiment with various input sizes (\(s1 = 224\times 224, s2 = 544\times 544, s3 = 864\times 864\)), see Table 6. Concerning convolutional layers, the feature maps are average pooled as earlier. However, since fully connected layers require a fixed input feature map dimension, we kernelize the layers so that they are applied at every location of the larger feature map output by the last convolutional layer, and finally perform max pooling. We observe a stable gain for every layer, with c5 and s3 offering the best performance on the FRA dataset. We further note that higher dimensionalities (\(1184\times 1184\), \(1504\times 1504\)) offer worse results in our experiments.

Since larger scales and multiple orientations encapsulate more precise information, we decide to aggregate the activations of several scales (\(s1 = 224\times 224, s2 = 544\times 544, s3 = 864\times 864\), and \(s4 = 1184\times 1184\)) and 8 orientations together in a VLAD descriptor. Each activation is centered, PCA reduced to 128 dimensions, and l2-normalized. Once concatenated in the VLAD, the final vector is power normalized. Table 7 shows the final performance on the FRA dataset: we observe a stable improvement for all layers, reaching a very high performance of around 99% mAP and 98% mean accuracy.

Our application requires fast processing of the scanned documents. We report the computation times of SIFT and CNN feature extraction in Table 8. Execution times are measured on a single-threaded i7 core at 2.6 GHz. Note that image dimensions remained unchanged for SIFT features, while images are resized to \(224 \times 224\) for CNN features using the fast network. CNN features are much faster than SIFT, and keypoint detection is quite slow, especially for high-resolution images.

Table 7. VLAD aggregation over scales and orientations, on the FRA dataset.
Table 8. Computation time for detected SIFT, dense SIFT, and CNN features (extracted from \(224 \times 224\) images) on the FRA train/test sets.

To conclude, CNNs generate highly effective and compact descriptions, largely outperforming earlier SIFT-based encoding schemes both in classification performance and in run time. Secondly, our evaluation provides insight regarding the amount and balance of data required to reach very high performance. Finally, the proposed VLAD aggregation across scales and orientations shows superior performance.

5 Conclusion

This paper addressed the problem of identity document classification as an image classification task. Several image classification methods are evaluated. We show that CNN features extracted from pre-trained networks can be successfully transferred to produce image descriptors which are fast to compute, compact, and highly performing.