A Performance Evaluation of Classic Convolutional Neural Networks for 2D and 3D Palmprint and Palm Vein Recognition

Palmprint recognition and palm vein recognition are two emerging biometrics technologies. In the past two decades, many traditional methods have been proposed for palmprint recognition and palm vein recognition, and have achieved impressive results. However, research on deep learning-based palmprint recognition and palm vein recognition is still very preliminary. In this paper, in order to investigate the problem of deep learning-based 2D and 3D palmprint recognition and palm vein recognition in depth, we conduct a performance evaluation of seventeen representative and classic convolutional neural networks (CNNs) on one 3D palmprint database, five 2D palmprint databases and two palm vein databases. Extensive experiments have been carried out under different network structures, different learning rates, and different numbers of network layers. We have also conducted experiments in both the separate data mode and the mixed data mode. Experimental results show that these classic CNNs can achieve promising recognition results, and the recognition performance of recently proposed CNNs is better. In particular, one of the recently proposed classic CNNs, EfficientNet, achieves the best recognition accuracy. However, the recognition performance of classic CNNs is still slightly worse than that of some traditional recognition methods.


Introduction
In the network and digital society, personal authentication is becoming a basic social service. It is well known that biometrics technology is one of the most effective solutions for personal authentication [1]. In recent years, two emerging biometrics technologies, palmprint recognition and palm vein recognition, have attracted a wide range of attention [2−6]. Generally speaking, there are three subtypes of palmprint recognition technology: 2D low-resolution palmprint recognition, 3D palmprint recognition and high-resolution palmprint recognition. High-resolution palmprint recognition is usually used for forensic applications, while 2D low-resolution palmprint recognition and 3D palmprint recognition are mainly used for civil applications. In this paper, we focus only on civil applications of biometrics; therefore, the problem of high-resolution palmprint recognition will not be investigated.
Many effective methods have been proposed for 2D low-resolution palmprint recognition (called 2D palmprint recognition for short in the rest of this paper), 3D palmprint recognition and palm vein recognition. These methods can be divided into two groups, i.e., traditional methods and deep learning-based methods.
In the past decade, deep learning has become the most important technology in the field of artificial intelligence. It has brought breakthrough performance to many applications [7, 8], such as speech recognition, natural language processing, computer vision, image and video analysis, and multimedia. In the field of biometrics, especially face recognition, deep learning has become the mainstream technology [9]. However, research on deep learning-based 2D and 3D palmprint recognition and palm vein recognition is still very preliminary [9, 10].
The convolutional neural network (CNN) is one of the most important branches of deep learning, and has been widely used in various image processing and computer vision tasks, such as object detection, semantic segmentation and pattern recognition. For image-based biometrics, the CNN is the most commonly used deep learning technique. Up to now, many classic CNNs have been proposed and impressive results have been achieved in many recognition tasks. However, the recognition performance of these classic CNNs for 2D and 3D palmprint recognition and palm vein recognition has not been systematically studied. For example, existing deep learning-based palmprint recognition and palm vein recognition works only used simple networks and did not provide in-depth analysis. With the rapid development of CNNs, the recognition accuracy of new CNNs will be continuously improved, and it can be predicted that CNNs will become one of the most important techniques for 2D and 3D palmprint recognition and palm vein recognition. Therefore, it is very important to systematically investigate the recognition performance of classic CNNs for these tasks. To this end, this paper evaluates the performance of classic CNNs on 2D and 3D palmprint recognition and palm vein recognition. In particular, seventeen representative and classic CNNs are exploited for performance evaluation.
It should be noted that the samples in the above databases were captured in two different sessions separated by a certain time interval. For traditional recognition methods, some samples captured in the first session are usually used as the training set, while all samples captured in the second session are used as the test set. However, in existing deep learning-based palmprint recognition and palm vein recognition methods, the training set often contains samples from both sessions, which makes it easy to obtain a high recognition accuracy. If the training samples come only from the first session and the test samples come from the second session, we call this experimental setting the separate data mode. If the training samples come from both sessions, we call it the mixed data mode. We conduct experiments in both the separate data mode and the mixed data mode to observe the recognition performance of classic CNNs in these two different settings.
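The two protocols above can be sketched as follows. This is an illustrative split only: the tuple layout and the number of training samples drawn per session are our assumptions for the sketch, not the exact settings used in the experiments.

```python
# Each sample is a hypothetical tuple (subject_id, session, index_within_session).

def split_separate(samples):
    """Separate data mode: train only on session-1 samples, test on session-2 samples."""
    train = [s for s in samples if s[1] == 1]
    test = [s for s in samples if s[1] == 2]
    return train, test

def split_mixed(samples, train_per_session=2):
    """Mixed data mode: the training set draws samples from both sessions."""
    train, test = [], []
    for subject, session, index in samples:
        if index < train_per_session:  # first few samples of each session go to training
            train.append((subject, session, index))
        else:
            test.append((subject, session, index))
    return train, test

# Toy database: 2 subjects, 2 sessions, 4 samples per session.
samples = [(subj, sess, idx) for subj in range(2) for sess in (1, 2) for idx in range(4)]
tr_sep, te_sep = split_separate(samples)
tr_mix, te_mix = split_mixed(samples)
```

In the separate data mode the training set contains no session-2 samples, while in the mixed data mode both sessions contribute training samples, which explains the optimistic accuracies reported under that protocol.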
The main contributions of our work are as follows.
1) We briefly summarize the classic CNNs, which can help the readers to better understand the development history of CNNs for image classification tasks.
2) We evaluate the performance of classic CNNs for 2D and 3D palmprint recognition. To the best of our knowledge, this is the first time such an evaluation has been conducted.
3) We evaluate the performance of classic CNNs on the Hefei University of Technology (HFUT) cross-sensor palmprint database. To the best of our knowledge, this is the first time the problem of palmprint recognition across different devices has been investigated using deep learning technology.
4) We investigate the recognition performance of CNNs in both the separate data mode and the mixed data mode.
The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 briefly introduces seventeen classic CNNs. Section 4 introduces the 2D and 3D palmprint and palm vein databases used for evaluation. Extensive experiments are conducted and reported in Section 5. Section 6 offers the concluding remarks.

Traditional 2D palmprint recognition methods
For 2D palmprint recognition, researchers have proposed many traditional methods, which can be divided into different sub-categories, such as palm line-based, texture-based, orientation coding-based, correlation filter-based, and subspace learning-based methods [3].
Because palm lines are the basic features of palmprints, some methods exploiting palm line features for recognition have been proposed. Huang et al. [18] proposed the modified finite Radon transform (MFRAT) to extract principal lines, and designed a pixel-to-area algorithm to match the principal lines of two palmprints. Palma et al. [19] used a morphological top-hat filtering algorithm to extract principal lines, and proposed a dynamic matching algorithm involving a positive linear dynamical system.
The texture-based method is also very effective for pattern recognition, and some local texture descriptors were designed and used for palmprint recognition [20]. Replacing the gradient with the responses of Gabor filters in the local descriptor of histogram of oriented gradients (HOG), Jia et al. [21] proposed the descriptor of histogram of oriented lines (HOL) for palmprint recognition. Later, Luo et al. [22] proposed the descriptor of local line directional pattern (LLDP) using the modulation of two orientations. Motivated by LLDP, Li and Kim [23] proposed the descriptor called the local micro-structure tetra pattern (LMTrP). To fully utilize the different direction information of a pixel and explore the most discriminant direction representation, Fei et al. proposed the local discriminant direction binary pattern (LDDBP) [24], the discriminant direction binary palmprint descriptor (DDBPD) [25], and the apparent and latent direction code (ALDC)-based descriptor [26]. The scale-invariant feature transform (SIFT) is a powerful descriptor and has also been applied to palmprint recognition. Using SIFT, Wu and Zhao [27] tried to solve the problem of deformed palmprint matching.
Orientation is a robust feature of palmprints, and many orientation coding-based methods have been proposed. These methods have high accuracy and fast matching speed. Generally, orientation coding-based methods first detect the orientation of each pixel, then encode the orientation index into a bit string, and finally use the Hamming distance for matching. Jia et al. [13] summarized orientation coding-based methods. Typical orientation coding-based methods include the competitive code [11], ordinal code [28], robust line orientation code (RLOC) [29], binary orientation co-occurrence vector (BOCV) [30], double-orientation code (DOC) [31], etc.
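The encode-then-Hamming-match pipeline described above can be sketched schematically. The 3-bit binary assignment below is our illustrative choice; published coding schemes such as the competitive code use carefully designed encodings so that the Hamming distance reflects angular distance between orientations.

```python
NUM_ORIENTATIONS = 6  # e.g., six Gabor filter orientations, as in competitive code

def to_bits(o):
    """Encode one orientation index (0-5) as 3 bits (illustrative assignment)."""
    return [(o >> k) & 1 for k in (2, 1, 0)]

def encode(orientation_map):
    """Flatten a per-pixel orientation map into a bit string."""
    bits = []
    for o in orientation_map:
        bits.extend(to_bits(o))
    return bits

def hamming_distance(a, b):
    """Normalized Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

probe = encode([0, 3, 5, 1])
gallery = encode([0, 3, 5, 1])
score = hamming_distance(probe, gallery)  # 0.0 for identical codes
```

Because matching reduces to XOR-and-count over bit strings, it can be implemented with bitwise machine instructions, which is why these methods achieve very fast matching speeds.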
Recently, correlation-based methods have been successfully used in biometrics. Jia et al. [13] proposed to use a band-limited phase-only correlation (BLPOC) filter for palmprint recognition.
Subspace learning has been one of the important techniques for pattern recognition. Some subspace learning-based methods have been used for palmprint recognition, including principal component analysis (PCA) [32], linear discriminant analysis (LDA) [33], kernel PCA (KPCA) [34], etc. However, the recognition performance of subspace learning-based methods is sensitive to illumination changes and other image variations.

Traditional 3D palmprint recognition methods
For 3D palmprint recognition, researchers have proposed many traditional methods [5, 10, 16, 35]. Generally, 3D palmprint data preserves the depth information of the palm surface. The originally captured 3D palmprint data consists of small positive or negative float values, which are usually transformed into grey-level values for practical feature extraction. In previous research, the original 3D palmprint data is usually transformed into curvature-based data. The two most important curvatures are the mean curvature (MC) and the Gaussian curvature (GC), and their corresponding gray images are called the mean curvature image (MCI) and the Gaussian curvature image (GCI) [35]. In the recognition process, researchers extract features from the MCI or GCI for 3D palmprint recognition. Besides GC and MC, researchers have also proposed other 2D representations of 3D palmprints. Based on GC and MC, Yang et al. [36] proposed a new grey-level image representation called the surface index image (SI). Recently, Fei et al. [37] proposed a simple yet effective compact surface type (CST) to represent the surface features of a 3D palmprint. Since the MCI, GCI, SI and CST representations depict a 3D palmprint as a 2D grey-level image, 2D palmprint recognition methods can also be used for 3D palmprint recognition. Li et al. [16] extracted the competitive code, an important orientation coding method, from the MCI for 3D palmprint recognition. Zhang et al. [5] proposed a blockwise statistics-based ST vector for 3D palmprint feature representation, and used collaborative representation-based classification (CRC) as the classifier. Fei et al. [38] proposed a complete binary representation (CBR) for 3D palmprint recognition by combining descriptors extracted from both MCI and CST. Fei et al. [39] proposed the precision direction code (PDC) to depict 2D texture-based features, and combined it with CST to form the PDCST descriptor, which represents the multi-level and multi-dimensional features of 3D palmprint images.
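The MCI/GCI transformation above can be sketched with the standard differential-geometry formulas for a surface z = f(x, y); the exact smoothing and normalization steps used in the cited works may differ from this minimal version.

```python
import numpy as np

def curvature_images(depth):
    """Return (MCI, GCI) gray-level images from a 2D depth array, using the
    standard mean-curvature and Gaussian-curvature formulas for z = f(x, y)."""
    fy, fx = np.gradient(depth)          # first derivatives (rows = y, cols = x)
    fxx = np.gradient(fx, axis=1)        # second derivatives
    fxy = np.gradient(fx, axis=0)
    fyy = np.gradient(fy, axis=0)
    g = 1.0 + fx**2 + fy**2
    # Mean curvature H and Gaussian curvature K of the depth surface.
    H = ((1 + fy**2) * fxx - 2 * fx * fy * fxy + (1 + fx**2) * fyy) / (2 * g**1.5)
    K = (fxx * fyy - fxy**2) / g**2

    def to_gray(c):
        # Min-max normalize the curvature map to an 8-bit gray image.
        return (255 * (c - c.min()) / (np.ptp(c) + 1e-12)).astype(np.uint8)

    return to_gray(H), to_gray(K)

# Toy "palm surface": a smooth paraboloid standing in for real 3D palmprint data.
yy, xx = np.mgrid[0:32, 0:32] / 31.0
depth = xx**2 + yy**2
mci, gci = curvature_images(depth)
```

Once the depth map is reduced to these gray images, any 2D descriptor (competitive code, HOL, etc.) can be run on them unchanged, which is exactly the point made in the paragraph above.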

Traditional palm vein recognition methods
For palm vein recognition, traditional methods can also be divided into the following categories: vein line-based, texture-based, orientation coding-based, and subspace learning-based methods.
To extract palm vein lines, Zhang et al. [40] and Kang and Wu [6] proposed two typical methods. In Zhang's method, multiscale Gaussian matched filters are exploited to extract vein lines [40]. In Kang's method, the normalized gradient-based maximal principal curvature (MPC) algorithm is exploited to extract vein lines [6].
Kang and Wu [6] also proposed a texture-based method, in which a mutual foreground-based local binary pattern (LBP) was exploited for texture feature extraction. Mirmohamadsadeghi and Drygajlo [41] also proposed a texture-based method, in which two texture descriptors, LBP and local derivative patterns (LDP), were used for palm vein recognition. ManMohan et al. [42] proposed a palm vein recognition method using local tetra patterns (LTP). Kang et al. [43] investigated a SIFT-based method for palm vein recognition.
Zhou and Kumar [44] presented an orientation coding-based method for palm vein recognition, named neighborhood matching Radon transform (NMRT), which is similar to the RLOC method proposed for palmprint recognition. Experimental results showed that the recognition performance of NMRT is much better than that of other methods such as Hessian phase, ordinal code, competitive code, and SIFT.

The brief development history of classic CNNs
Fig. 1 shows the chronology of events in the development history of classic CNNs for image classification tasks. In 1998, the first CNN, LeNet, was proposed by Lecun et al. [49] However, LeNet did not have a widespread impact due to various restrictions. In 2012, AlexNet was proposed by Hinton and his student Krizhevsky, and won the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC 2012) [50]. AlexNet demonstrated the effectiveness of CNNs in complex tasks. As a result, the excellent performance of AlexNet attracted the attention of researchers and promoted the further development of CNNs. In 2013, ZFNet was proposed by Zeiler and Fergus [51], who also explained the role of each layer of the neural network through visualization techniques. In the same year, network in network (NIN) was proposed, which made two important contributions: global average pooling and the use of 1×1 convolution layers [52]. In 2014, VGG was proposed by the Oxford Visual Geometry Group [53] and was the runner-up in ILSVRC 2014. Compared with AlexNet, VGG made two important improvements: a smaller kernel size and a deeper network. Another important CNN of 2014 is GoogLeNet (Inception_v1) [54], the champion of ILSVRC 2014. Later, the subsequent versions of GoogLeNet, i.e., Inception_v2 [55], Inception_v3 [56] and Inception_v4 [57], were successively proposed in 2016 and 2017. Inception_ResNet_v1 and Inception_ResNet_v2, improved versions of Inception_v4, were proposed in the same paper [57]. In 2015, ResNet was proposed by He et al. [58] and won ILSVRC 2015. The ResNet paper received the best paper award of CVPR 2016. The emergence of ResNet is an important event in the history of deep learning, because it made it possible to train networks with hundreds of layers, and it greatly improved the performance of image classification and other computer vision tasks. In 2016, DenseNet was proposed by Huang et al. [59] As the best paper of CVPR 2017, DenseNet broke away from the stereotyped thinking of improving network performance by deepening (ResNet) or widening (Inception) the network structure. From the point of view of features, DenseNet not only greatly reduces the number of network parameters, but also alleviates the gradient vanishing problem to a certain extent through feature reuse and bypass settings. In the same year, SqueezeNet, the first lightweight network, was proposed by Iandola et al. [60]; it compresses the number of feature maps using 1×1 convolution kernels to accelerate the network. Since then, other important lightweight networks such as MobileNets [61−63], ShuffleNets [64, 65], Xception [66] and SqueezeNeXt [67] have been proposed in turn. To address the problem of poor information flow, MobileNets use point-wise convolution, ShuffleNets use channel shuffle, Xception uses modified depth-wise convolution, and SqueezeNeXt improves speed from the perspective of hardware. In 2017, Xie et al. [68] proposed ResNeXt, which combines ResNet and Inception and does not require complex structural details to be designed manually. In particular, ResNeXt uses the same topology for each branch, the essence of which is group convolution. In 2018, SENet, the winner of the final ILSVRC image classification competition, was proposed by Hu et al. [69] SENet consists of squeeze and excitation operations, in which the former compresses the model and the latter predicts the importance of each channel. In addition, the SE block can be plugged into any network to improve recognition performance. In 2019, EfficientNet was proposed by Google [70]; it relies on AutoML and compound scaling to achieve state-of-the-art accuracy without compromising resource efficiency. In 2020, the team of Huawei Noah′s Ark Lab proposed a lightweight network, GhostNet, which achieves better recognition performance than MobileNet_v3 with similar computational cost [71]. Researchers from Facebook AI Research (FAIR) developed RegNet, which outperforms EfficientNet while being up to 5× faster on GPUs [72]. The work on RegNet presented a new network design paradigm, which combines the advantages of manual network design and neural architecture search (NAS). By stacking split-attention blocks, Zhang et al. [73] proposed a new ResNet variant, ResNeSt, which has better recognition performance than ResNet.
Table 3 summarizes the existing CNN-based palm vein recognition methods. Hassan and Abdulrazzaq [91] proposed to use a CNN for palm vein recognition, in which they designed a simple CNN and used data augmentation to obtain more training data. Zhang et al. [17] released a new touchless palm vein database and used the PalmRCNN method for palm vein recognition. Lefkovits et al. [92] applied four CNNs to palm vein identification, including AlexNet, VGG-16, ResNet-50, and SqueezeNet. Thapar et al. [93] proposed the PVSNet method, in which a Siamese network was trained using triplet loss. Chantaf et al. [94] applied Inception_v3 and SmallerVggNet to palm vein recognition.
1) AlexNet
The network structure of AlexNet is shown in Fig. 2. AlexNet is based on LeNet and uses several new techniques, such as the rectified linear unit (ReLU), dropout and local response normalization (LRN), for the first time [50]. Due to the limitation of hardware capability at the time, the training of AlexNet used distributed computing to spread the network over two GPUs, with each GPU storing half of the parameters. The GPUs can communicate with each other and access each other's memory. Therefore, AlexNet is divided into upper and lower parts, each corresponding to a single GPU. In AlexNet, data augmentation techniques, such as random cropping and horizontal flipping of the raw data, are used to improve the generalization of the network while reducing over-fitting.
2) VGG
VGG is a further improvement of AlexNet, which makes the network deeper [53]. The structure of VGG is shown in Fig. 3. Because all convolutional kernels have a size of 3×3, the structure of VGG is neat and its topology is simple. The small convolution kernel size also brings other benefits, such as making it possible to increase the number of layers. VGG expands the number of layers of a CNN to more than 10, enhancing the expressive ability of the network and facilitating subsequent modifications of the network structure.
3) Inception_v3
Based on Inception_v2, Inception_v3 further decomposed the convolution [56]. That is, any n×n convolution can be replaced by a 1×n convolution followed by an n×1 convolution (see Fig. 4(a)), which greatly reduces the number of parameters, helps avoid over-fitting, and strengthens the nonlinear expression ability. In addition, Szegedy et al. [56] carefully designed three types of Inception modules, as shown in Fig. 4(b).
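The parameter saving of this factorization can be checked with simple arithmetic: for C input and C output channels, an n×n kernel needs n²C² weights, while a 1×n followed by an n×1 needs only 2nC² (bias terms omitted in this sketch).

```python
# Weights of a full n x n convolution vs. the 1 x n + n x 1 factorization,
# for c input channels and c output channels at every stage.

def params_full(n, c):
    return n * n * c * c

def params_factorized(n, c):
    return (1 * n * c * c) + (n * 1 * c * c)

n, c = 7, 64
savings = 1 - params_factorized(n, c) / params_full(n, c)
# For n = 7 the factorized pair keeps only 2/7 of the weights (about 71% saved).
```

The ratio 2n/n² = 2/n shows why the saving grows with kernel size, which is why Inception_v3 applies the factorization to its larger kernels.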

4) ResNet
As the depth of networks continued to increase, the vanishing gradient and exploding gradient problems became more and more difficult to solve, making deep networks hard to train. ResNet overcomes this difficulty [58]. ResNet relies on a shortcut connection structure called the residual module. Multiple residual modules are sequentially stacked to form ResNet, as shown in Fig. 5(a). The shortcut connection performs identity mapping, and its output is added to the output of the stacked layers. This simple computation does not increase the number of parameters or the computational complexity, and it improves performance and speeds up training. The residual module thus learns a residual mapping F(x), and its output is F(x) + x.
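A minimal residual module can be sketched in PyTorch as below, assuming equal input and output channels so the identity shortcut needs no projection; this illustrates the idea rather than the exact block of any specific ResNet configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers plus an identity shortcut: output = relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)  # the shortcut adds the input unchanged

x = torch.randn(1, 64, 32, 32)
y = ResidualBlock(64)(x)  # same shape as the input, so blocks stack freely
```

Because the shortcut is a plain addition, gradients can flow directly through it during backpropagation, which is what makes very deep stacks of these blocks trainable.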

5) Inception_v4
Inception_v4 is an improved version of Inception_v3 [57]. Compared with regular network structures such as VGG and ResNet, Inception_v4 is mainly composed of one input stem, three Inception modules and two reduction modules, each of which is designed separately. The overall structure of Inception_v4 and the structure of each module are shown in Fig. 6.

6) Inception_ResNet_v2
While designing Inception_v4, Szegedy et al. [57] introduced residual modules into Inception_v3 and Inception_v4, respectively, resulting in Inception_ResNet_v1 and Inception_ResNet_v2. The overall structures of Inception_ResNet_v1 and Inception_ResNet_v2 are the same; the difference lies in the modules within the network. Fig. 7 shows the overall structure and module structure of Inception_ResNet_v2.
7) DenseNet
DenseNet can be seen as an extreme version of ResNet [59]. DenseNet introduces short connections from any layer to all following layers. However, DenseNet combines features by concatenating them instead of summing them before they are passed to a layer, which enables the network to make better use of features. Fig. 8(a) illustrates a five-layer dense block, in which the output of each layer is connected to every subsequent layer. Dense blocks are continuously stacked to form DenseNet. The structure of DenseNet is depicted in Fig. 8(b).
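The concatenation-based feature reuse can be sketched as follows; the growth-rate value and layer composition are illustrative, not DenseNet's exact configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the channel-wise concatenation of all earlier outputs."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # The i-th layer receives in_channels + i * growth_rate channels.
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenation (not summation) is the key difference from ResNet.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

x = torch.randn(2, 16, 8, 8)
y = DenseBlock(16, growth_rate=12, num_layers=5)(x)  # 16 + 5 * 12 = 76 channels out
```

Each layer adds only `growth_rate` new channels, which is why DenseNet keeps the parameter count small despite the dense connectivity.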
8) Xception
Xception is another improved version of Inception_v3 [66]. It is based on the assumption that spatial convolution (convolution along the horizontal and vertical directions of the feature map) and channel convolution (convolution along the channel direction of the feature map) can be performed independently, so that the convolution can be separated. As shown in Fig. 9, for the M feature maps of the previous layer, 1×1 convolutions are first used to linearly combine the feature maps, and then an n×n convolution is applied separately to each channel, where M is the number of channels of the feature maps, N is the number of convolutions (or output channels), and n is the size of the convolution kernel.
In fact, the Inception module can be simplified as follows: all the 1×1 convolutions in Inception can be reformulated as one large 1×1 convolution, after which spatial convolutions are applied separately on every output channel, forming the extreme Inception, as shown in Fig. 10. Extreme Inception is consistent with the initial assumption and achieves the decoupling of the convolution. This kind of extreme Inception is named Xception.

9) MobileNet_v2 & 10) MobileNet_v3
In order to meet the needs of embedded devices such as mobile phones, the research team of Google proposed a compact neural network named MobileNet in 2017. MobileNet is based on depthwise separable convolution to reduce the number of parameters. Depthwise separable convolution splits the standard convolution into two steps: depthwise convolution, which applies a convolution to each channel of the feature map separately, and pointwise convolution, which uses 1×1 convolutions to combine the features, as shown in Fig. 11. In Fig. 11, M is the number of input channels, D_K is the convolution kernel size, and N is the number of convolution kernels. If the size of the feature map is D_F × D_F, then the computational cost of the standard convolution is D_F × D_F × M × N × D_K × D_K, while for depthwise separable convolution, the computational cost of the depthwise convolution is D_F × D_F × M × D_K × D_K and the computational cost of the pointwise convolution is D_F × D_F × M × N. Therefore, the reduction in computation is (D_F × D_F × M × D_K × D_K + D_F × D_F × M × N) / (D_F × D_F × M × N × D_K × D_K) = 1/N + 1/(D_K × D_K). For example, if a 3×3 convolution is used, the computational cost can be reduced by about 8 or 9 times.
In 2018, the research team of Google continued to improve MobileNet and designed MobileNet_v2 [62]. MobileNet_v2 introduces the shortcut connection of ResNet and DenseNet into the network. Since the output of the depthwise separable convolution is limited by the number of input channels and the characteristics of the bottleneck residual module, directly introducing the residual into the network without modification would compress the features at the beginning, leaving too few features available for the subsequent layers. Therefore, MobileNet_v2 proposes inverted residuals: the number of features is expanded first, then the features are extracted by convolution, and finally the features are compressed. In addition, MobileNet_v2 removes the ReLU at the end of the inverted residual, because ReLU sets all non-positive inputs to zero, and applying ReLU after feature compression loses feature information. The network structure of MobileNet_v2 is shown in Fig. 13.
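The cost comparison above can be made concrete with multiply-accumulate counts, using the notation of Fig. 11 (D_F spatial size, M input channels, N output channels, D_K kernel size); the layer dimensions chosen here are illustrative.

```python
# Multiply-accumulate counts for one layer: standard convolution vs.
# depthwise separable convolution (depthwise + 1x1 pointwise).

def standard_cost(df, m, n, dk):
    return df * df * m * n * dk * dk

def separable_cost(df, m, n, dk):
    depthwise = df * df * m * dk * dk   # one D_K x D_K filter per input channel
    pointwise = df * df * m * n         # 1x1 convolutions combine the channels
    return depthwise + pointwise

df, m, n, dk = 56, 64, 128, 3           # an illustrative layer configuration
ratio = separable_cost(df, m, n, dk) / standard_cost(df, m, n, dk)
# ratio equals 1/N + 1/D_K^2; for a 3x3 kernel it is slightly over 1/9,
# i.e., roughly the 8-9x reduction quoted above.
```

Since 1/N is usually small, the ratio is dominated by 1/D_K², which is where the "about 8 or 9 times" figure for 3×3 kernels comes from.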
11) SENet
SENet enhances important features by modeling the correlation between feature channels to improve accuracy [69]. The SE block is a substructure that can be embedded in other classification or detection models. In the ILSVRC 2017 competition, the SE block was applied together with ResNeXt to reduce the top-5 error on the ImageNet dataset to 2.251%, winning the classification task. The network structure of the SE block is shown in Fig. 15.
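A minimal squeeze-and-excitation block can be sketched as below: global average pooling "squeezes" each channel to a scalar, a small bottleneck MLP "excites" a per-channel weight, and the input is rescaled channel-wise. The reduction ratio of 16 is the commonly used default, taken here as an assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels by learned importance."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weight in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pool -> (b, c)
        return x * w.view(b, c, 1, 1)     # excitation: channel-wise rescale

x = torch.randn(2, 64, 8, 8)
y = SEBlock(64)(x)  # same shape as the input
```

Because the block's input and output shapes match, it can be dropped after any convolutional stage of an existing network, which is the "plug into any network" property noted above.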
12) ResNeXt
ResNeXt is an upgraded version of ResNet [68]. In order to improve model accuracy, some networks deepen and widen the network structure, which increases the number of network hyperparameters as well as the difficulty and computational cost of network design. ResNeXt, however, improves accuracy without increasing the complexity of the parameters, and even reduces the number of hyperparameters. ResNeXt has three equivalent network structures, as shown in Fig. 16. The original three-layer convolution block in ResNet is replaced by a block of parallel stacked topologies. The topologies are the same, but the hyperparameters are reduced, which facilitates model migration.

13) ShuffleNet_v2
In ResNeXt, group convolution is applied as a compromise strategy, and the pointwise convolution over the entire feature map restricts the performance of ResNeXt. An efficient strategy is to perform pointwise convolution within each group, but this is not conducive to information exchange between channels. To solve this problem, ShuffleNet_v1 proposed a channel shuffle operation. The structure of ShuffleNet_v1 is depicted in Fig. 17. In ShuffleNet_v2 [65], the researchers found that it is unreasonable to evaluate model performance only with the commonly-used FLOPs, because file I/O, memory access and GPU execution efficiency also need to be considered. Taking memory consumption and GPU parallelism into account, the researchers designed an efficient ShuffleNet_v2 model. This model is similar to DenseNet, but ShuffleNet_v2 has higher accuracy and faster speed. The network structure of ShuffleNet_v2 is shown in Fig. 18.
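The channel shuffle operation that enables cross-group information exchange is a simple reshape-transpose-reshape, sketched below.

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups: reshape to (groups, channels_per_group),
    transpose, and flatten back, so the next grouped convolution sees a mix of
    channels from every group."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

# With 6 channels and 2 groups, channel order [0,1,2 | 3,4,5]
# becomes the interleaving [0, 3, 1, 4, 2, 5].
x = torch.arange(6, dtype=torch.float32).view(1, 6, 1, 1)
y = channel_shuffle(x, groups=2)
```

Since the operation is pure data movement, it adds no parameters and negligible computation, which is why it fits the lightweight design goals of ShuffleNet.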

14) EfficientNet
EfficientNet was proposed in 2019 [70], and provides a more general idea for optimizing classification networks. Widening the network, deepening the network and increasing the input resolution are three common scaling strategies, which were applied independently in most previous networks. EfficientNet instead proposes a compound model scaling algorithm, which jointly optimizes network width, network depth and resolution to improve accuracy over existing classification networks, while greatly reducing the number of model parameters and computations. EfficientNet uses EfficientNet-b0 as the basic network to design eight network structures, called b0−b7, of which EfficientNet-b7 has the highest accuracy. The network structure of EfficientNet-b0 is shown in Fig. 19.
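The compound scaling rule can be sketched numerically: depth, width and resolution are scaled jointly by a single coefficient φ via d = α^φ, w = β^φ, r = γ^φ, with α·β²·γ² ≈ 2 so that each unit of φ roughly doubles the FLOPs. The constants below are those reported for the EfficientNet base model's grid search.

```python
# Compound scaling: one coefficient phi controls depth, width and resolution.
alpha, beta, gamma = 1.2, 1.1, 1.15  # reported base-model constants

def compound_scale(phi):
    """Return the (depth, width, resolution) multipliers for coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = compound_scale(2)                # e.g., depth multiplier 1.2^2 = 1.44
flops_growth = alpha * beta**2 * gamma**2  # ~2x FLOPs per unit of phi
```

Larger b-variants correspond to larger φ, which is why accuracy and cost grow together in a controlled way from b0 to b7.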
15) GhostNet
In GhostNet, Han et al. [71] proposed a novel ghost module, which can generate more feature maps with fewer parameters. Specifically, the convolution layer in a deep neural network is divided into two parts. The first part involves common convolutions, but their number is strictly controlled. Given the intrinsic feature maps produced by the first part, a series of simple linear operations are then applied to generate more feature maps. Compared with conventional CNNs, the total number of parameters and the computational complexity of the ghost module are lower without changing the size of the output feature map. Based on the ghost module, Han et al. [71] proposed GhostNet. Fig. 20 shows the ghost module.

16) RegNet
In RegNet, Radosavovic et al. [72] proposed a new network design paradigm, which aims to improve the understanding of network design. Radosavovic et al. [72] focused on the design of design spaces of parameterized networks. The whole process is similar to classic manual network design, but is promoted to the level of the design space. Using this paradigm, a simple low-dimensional network, i.e., RegNet, is obtained by searching. The core idea of the RegNet parameterization is that the width and depth of a good network can be explained by a quantized linear function. In particular, RegNet outperforms traditional available models and runs up to five times faster on GPUs.
17) ResNeSt
In ResNeSt, Zhang et al. [73] explored a simple architectural modification of ResNet, incorporating feature-map split attention within the individual network blocks. More specifically, each block divides the feature map into several groups (along the channel dimension) and finer-grained subgroups or splits, where the feature representation of each group is determined via a weighted combination of the representations of its splits (with weights chosen based on global contextual information). Zhang et al. [73] refer to the resulting unit as a split-attention block, which remains simple and modular. By stacking several split-attention blocks, a ResNet-like network called ResNeSt is created. The architecture of ResNeSt requires no more computation than existing ResNet variants and is easily adopted as a backbone for other vision tasks. ResNeSt performs better than all existing ResNet variants with the same computational efficiency, and achieves an even better speed-accuracy trade-off than the most advanced CNN models generated by NAS. Fig. 21 shows the ResNeSt block module.

2D and 3D palmprint and palm vein databases used for evaluation
In this paper, five 2D palmprint image databases, one 3D palmprint database and two palm vein databases are used for performance evaluation, including PolyU II [11] , PolyU M_B [12] , HFUT [13] , HFUT CS [14] , TJU-P [15] , PolyU 3D [15] , PolyU M_B [12] and TJU-PV [17] .After preprocessing, the ROI sub-images were cropped.The ROI size of all databases is 128×128.The detailed descriptions of above databases are listed in Table 4. Figs.22−25 depict some ROI images of four 2D palmprint databases.In Figs.22−25, the three images depicted in the first row were captured in the first session.The three images depicted the second row were captured in the second session.
Fig. 26 shows three original palmprints of the HFUT CS database and their corresponding ROI images. Fig. 27 shows three original 3D palmprint data samples of the PolyU 3D database. Fig. 28 shows four different 2D representations derived from one 3D palmprint, including MCI, GCI, ST and CST. Figs. 29 and 30 depict some ROI images of the two palm vein databases. In Figs. 29 and 30, the three images depicted in the first row were captured in the first session, and the three images depicted in the second row were captured in the second session.
PolyU II is a challenging palmprint database because the illumination changes obviously between the first session and the second session. HFUT CS is also a challenging palmprint database: as can be seen from Fig. 26, there are some differences between the palmprints captured by different devices.

Experimental configuration
In this section, we introduce the default experimental configuration. Since different networks require different input sizes, such as 227×227 for AlexNet, 299×299 for Inception_v3, and 224×224 for ResNet, the palmprint/palm vein ROI image needs to be upsampled to a suitable size before being input into the network. In order to enhance the stability of the network, we also added a random flip operation (only during the training phase), i.e., for a training image, there is a certain probability that the image is flipped horizontally before being input into the network. We do not initialize the model parameters with random parameter initialization; instead, we initialize them with the parameters of the model pretrained on ImageNet. The palmprint/palm vein ROI images in the databases are usually grayscale images, i.e., the number of image channels is 1, while the input of the model is an RGB image, so the grayscale channel is copied three times to form an RGB image.
The system configuration is as follows: Intel CPU i7 4.2 GHz, NVIDIA GTX 1080Ti GPU (EfficientNet runs on two parallel GTX 1080Ti GPUs), 16 GB memory and the Windows 10 operating system. All evaluation experiments are performed in PyTorch. The cross-entropy loss function (CrossEntropyLoss in PyTorch) and the Adam optimizer are used by default, and the batch size is 4.

Recognition performance on separate data mode
We first conduct evaluation experiments in the separate data mode, i.e., all samples captured in the first session are used for training, and all samples captured in the second session are used for testing.

Recognition results of ResNet18 and EfficientNet under different learning rates
Learning rate is a very important hyperparameter in model training, which affects the convergence of the loss function. If the learning rate is too small, the decrease of the loss along the gradient direction will be slow, and it will take a long time to reach the optimal solution. If the learning rate is too large, the optimal solution may be missed, and severe oscillation or even gradient problems may occur. Therefore, choosing a suitable learning rate is especially critical. Here, we only search for the initial learning rate, which is combined with a dynamic learning rate strategy in the actual experiments. In this sub-section, we select ResNet18 and EfficientNet for evaluation, because ResNet18 achieves a high recognition rate among early networks and EfficientNet is one of the representative networks proposed recently. The experimental results are listed in Tables 6 and 7.
The full and abbreviated names of the selected CNNs (Table 5) are: AlexNet, Krizhevsky et al. [50], 2012; VGG (VGG), Simonyan and Zisserman [53], 2015; Inception_v3 (IV3), Szegedy et al. [56], 2016; ResNet (Res), He et al. [58], 2016; Inception_v4 (IV4), Szegedy et al. [57], 2017; Inception_ResNet_v2 (IResV2), Szegedy et al. [57], 2017; DenseNet (Dense), Huang et al. [59], 2017; Xception (Xec), Chollet [66], 2017; ResNeXt (ResX), Xie et al. [68], 2017; MobileNet_v2 (MbV2), Howard et al. [62], 2018; ShuffleNet_v2 (ShuffleV2), Ma et al. [65], 2018; SENet (SE), Hu et al. [69], 2018; MobileNet_v3 (MbV3), Howard et al. [63], 2019; EfficientNet (Efficient), Tan et al. [70], 2019; GhostNet (Ghost), Han et al. [71], 2020; RegNet (Reg), Radosavovic et al. [72], 2020; ResNeSt (ResS), Zhang et al. [73], 2020.
From Tables 6 and 7, it can be seen that when the learning rate is 5×10−5, ResNet18 and EfficientNet achieve the best recognition rates. Thus, in the remaining experiments, we set the initial learning rate to 5×10−5. It should be noted that all our experiments use an initial learning rate of 5×10−5 and a learning rate decay step of 100 iterations (200 iterations for EfficientNet, due to its slow convergence), with a decay rate of 0.1. That is, the learning rate drops by ten times every 100 iterations, and the total number of iterations is 500.
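In PyTorch, the schedule described above corresponds to a StepLR scheduler. The sketch below uses a stand-in module rather than a real CNN and only shows the stated hyperparameters (initial learning rate 5×10−5, decay by 0.1 every 100 iterations, 500 iterations in total).

```python
# Sketch of the training schedule described above (stand-in model).
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(128 * 128, 500)            # stand-in for a CNN backbone
criterion = nn.CrossEntropyLoss()            # default loss in the experiments
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = StepLR(optimizer, step_size=100, gamma=0.1)  # 10x drop every 100 steps

for epoch in range(500):
    # ... one training pass using `criterion` would go here ...
    scheduler.step()
```

After the 500 iterations, the learning rate has been reduced by a factor of ten five times.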

Recognition results of ResNet and VGG with different numbers of layers
Some CNNs have versions with different numbers of layers. For example, ResNet has versions with 18, 34 and 50 layers. Using more layers may yield better recognition rates, but may also cause overfitting. Thus, the number of layers is an important factor for recognition. In this sub-section, we evaluate VGG and ResNet with different numbers of layers, verifying the impact of network depth on recognition performance. Since VGG is difficult to train on most databases when the learning rate is 5×10−5, we set the learning rate of VGG to 10−5. The recognition rates of VGG and ResNet with different numbers of layers are shown in Table 8.
The results in Table 8 indicate that: 1) For VGG, the recognition performance of VGG-16 is slightly better than that of VGG-19, and the recognition performances of VGG-16 and VGG-19 are close.
2) For ResNet, the recognition performance of Res-18 is better than those of Res-34 and Res-50. On challenging databases such as PolyU II, HFUT, HFUT CS, TJU-P, and TJU-PV, the recognition performance of Res-18 is obviously better than that of Res-50.
3) In all databases, the recognition rate of ResNet-18 is obviously better than those of VGG-16 and VGG-19.
According to the results listed in Table 8, for VGG and ResNet, we only use VGG-16 and Res-18 for evaluation in the remaining experiments.
For different CNNs, the number of layers that yields the best recognition rate is determined by many factors, such as the network structure, data size, and data type. Therefore, in practical applications, many experiments are needed to determine the optimal number of network layers for each CNN.

Recognition results of EfficientNet from b0 to b7
EfficientNet obtains the baseline network EfficientNet-b0 by grid search, and further scales different parameters to obtain EfficientNet-b1 to b7. The recognition results of EfficientNet from b0 to b7 are listed in Table 9. It can be seen that the recognition accuracy of EfficientNet gradually increases from b0 to b7, and EfficientNet-b7 achieves the best recognition accuracy. In the remaining experiments of this paper, for EfficientNet, we only use EfficientNet-b7 to conduct evaluation experiments. It should be noted that although EfficientNet-b7 performs well, it converges almost twice as slowly as the other networks. EfficientNet-b6 is also slow, but the speed of EfficientNet-b0 to b5 is normal.

Recognition results of selected CNNs on all databases
In this sub-section, we conduct experiments using all selected CNNs on all databases. The recognition results of the selected CNNs on the 2D palmprint and palm vein databases are listed in Table 10. The recognition results of the selected CNNs on the 3D palmprint database are listed in Table 11. Sometimes, when the learning rate is set to 5×10−5, AlexNet and VGG-16 are untrainable. In that case, we adjust the learning rate of AlexNet and VGG-16 to 10−5. In Table 10, AlexNet and VGG-16 therefore have two recognition rates: the former is the result under the learning rate of 5×10−5, and the latter is the result under the learning rate of 10−5. If AlexNet and VGG-16 are untrainable, we mark the result as U. From Tables 10 and 11, we have the following observations: 1) EfficientNet achieves the best recognition rate on most databases. The overall recognition result of ResNet is in second place.
2) As a representative of lightweight networks, the overall recognition performance of MobileNet_v3 is worse than that of EfficientNet, close to ResNet, but better than other CNNs.This demonstrates that MobileNet_v3 is effective.
3) The recognition performance of recently proposed CNNs is obviously better than that of early CNNs. For example, the recognition rates of AlexNet and VGG are rather low. Early CNNs such as AlexNet and VGG have relatively simple structures and a small number of layers. Thus, their recognition performance is not as good as that of recently proposed CNNs such as EfficientNet.
4) HFUT CS is a very challenging database. The recognition performance of most CNNs on the HFUT CS database is unsatisfactory. On this database, ResNeSt (ResS-50) achieves the highest recognition rate, which is 99.15%.
5) Apart from ResNeSt's good result on the HFUT CS database, several recently proposed networks, including GhostNet, RegNet and ResNeSt, have not achieved very good recognition results on the various databases. Perhaps the network structures of GhostNet, RegNet and ResNeSt are not well suited to palmprint recognition and palm vein recognition.
6) Among four 2D representations of the 3D palmprint, the recognition results obtained from MCI are the best.
7) For 3D palmprint recognition, based on MCI representation, EfficientNet achieved the recognition rate of 99.88%, which is a very promising result.

Recognition performance on mixed data mode
In the mixed data mode, the first image captured in the second session is added to the training data. That is, the training set of each palm contains all images captured in the first session plus the first image captured in the second session; accordingly, the total number of training images per palm is the number of images captured in the first session plus one. Here, we use EfficientNet to conduct the experiments. From Table 12, it can be seen that the recognition accuracy of EfficientNet gradually increases as the number of training samples increases.
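The two protocols can be summarized by a small helper. The function below is a hypothetical illustration, where each session is given as a list of per-palm image paths.

```python
# Hypothetical helper illustrating the separate and mixed data modes.
def split_samples(session1, session2, mode="separate"):
    """Return (train, test) lists of per-palm image paths."""
    if mode == "separate":
        # train on all of session 1, test on all of session 2
        return list(session1), list(session2)
    if mode == "mixed":
        # the first session-2 image joins the training set
        return list(session1) + session2[:1], session2[1:]
    raise ValueError(f"unknown mode: {mode}")
```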
We list the recognition rates of different CNNs in the mixed data mode in Tables 13 and 14. It can be seen that the recognition accuracies of all CNNs increase significantly. In particular, for 2D palmprint and palm vein recognition, EfficientNet achieves 100% recognition accuracy on PolyU II, PolyU M_B, HFUT, TJU-P, PolyU M_N and TJU-PV. For 3D palmprint recognition, Res-18 achieves the best recognition results, and all CNNs achieve their best recognition results with the MCI representation among the four 2D representations.
This experiment confirms that data sufficiency is very important for improving the recognition accuracy of deep learning. In the future, with the wide application of palmprint recognition and palm vein recognition, the volume of palmprint and palm vein data will increase continuously. In this way, the recognition accuracy of deep learning-based palmprint and palm vein recognition technology can be expected to reach a new level.

Performance comparison with other methods
For 2D palmprint and palm vein recognition, we compare the performance of the CNNs with that of other methods, including some traditional methods and one deep learning method, PalmNet [83]. Four traditional palmprint recognition methods, including competition code, sequence number, RLOC and LLDP, are selected for comparison. For the CNNs, we only list the results of MobileNet_v3 and EfficientNet, which have excellent performance. The performance comparison is conducted in both the separate data mode and the mixed data mode.
In the separate data mode, for the traditional methods, four images collected in the first session are used as the training data, and all images collected in the second session are exploited as the test data. For MobileNet_v3 and EfficientNet, all images collected in the first session are used as the training data and the second-session images are used as the test data (in the HFUT CS database, all images captured by the camera are used as the training data). The comparison results in the separate data mode are shown in Table 15.
From Table 15, it can be seen that, in the separate data mode, the performance of the traditional methods is better than that of the CNNs, even though the traditional methods use fewer training samples. Because the features of 2D and 3D palmprints and palm veins are relatively stable, hand-crafted features can represent them well, resulting in better recognition performance for the traditional methods. In addition, the classic CNNs used in this paper are designed for general image classification tasks rather than specifically for 2D and 3D palmprint recognition and palm vein recognition, so their accuracies are not satisfactory.
In the mixed data mode, for the traditional methods, four images collected in the first session are used as the training data, and the first image captured in the second session is added to the training set; the remaining images collected in the second session are exploited as the test data. For MobileNet_v3 and EfficientNet, all images collected in the first session are used as the training data, and we likewise add the first image captured in the second session to the training set; the remaining images collected in the second session are exploited as the test data. The comparison results in the mixed data mode are shown in Table 16.
From Table 16, it can be seen that, in the mixed data mode, the performance of the CNNs is nearly equal to that of the traditional methods. The scale of the 2D and 3D palmprint and palm vein databases is small, while deep learning methods rely heavily on learning from large-scale databases. If there are sufficient training samples, deep learning methods can achieve better performance.
For 3D palmprint recognition, we compare the performance of the CNNs with that of traditional methods in the separate data mode. Compared with ILSVRC, the scale of the palmprint and palm vein databases is small, and models with more layers may suffer from overfitting. 4) For 3D palmprint recognition, deep learning-based methods obtained promising results. Among the four 2D representations of 3D palmprints, MCI helps deep learning methods achieve the best recognition results. 5) In the separate data mode, the recognition performance of the classic CNNs is not satisfactory, and is worse than that of some traditional methods on the challenging databases. In the mixed data mode, the CNNs can achieve good recognition accuracy; for example, they achieved 100% recognition accuracy on most databases. In this work, many classic CNNs have been evaluated. However, these CNNs were designed manually by human experts. In the past two years, NAS technology has attracted more and more attention. The core idea of NAS is to use search algorithms to find better neural network structures, and thereby obtain better recognition performance. In our future work, we will try to exploit NAS technology for 2D and 3D palmprint and palm vein recognition, and we will also design special CNNs according to the characteristics of 2D and 3D palmprint recognition and palm vein recognition. In this way, better recognition performance of deep learning for 2D and 3D palmprint and palm vein recognition can be expected.

Fig. 1
Fig. 1 Chronology of classic CNNs for classification tasks

Fig. 3
Fig. 3 Structure of VGG
The internal modules of MobileNet_v3 inherit from MobileNet_v1, MobileNet_v2 and MnasNet, and the networks are searched via platform-aware NAS and NetAdapt. The computation in the final stage of the network is redesigned in MobileNet_v3, because it is expensive in MobileNet_v2. In addition, a new activation function, h-swish[x], is proposed to effectively improve the accuracy of the networks. MobileNet_v3 includes two versions: MobileNet_v3-small and MobileNet_v3-large. MobileNet_v3-small is faster, with accuracy similar to MobileNet_v2; MobileNet_v3-large has higher accuracy. Results of image classification, object detection and semantic segmentation experiments show the advantages of MobileNet_v3. The network structure of MobileNet_v3 is shown in Fig. 14.
11) SENet
Squeeze-and-Excitation network (SENet) is a new image recognition structure, which was proposed by the autopilot company Momenta in 2017.
Fig. 17(a) does not need a downsampling operation, while Fig. 17(b) does.

Fig. 22 Fig. 23
Fig. 22 Six palmprint ROI images of the PolyU II database. The three images of the first row were captured in the first session. The three images of the second row were captured in the second session.

Fig. 26
Fig. 26 Three original palmprint and ROI images of the HFUT CS database. The three images of the first row are the original palmprint images. The three images of the second row are the corresponding ROI images.

Fig. 25 Fig. 27 Fig. 28
Fig. 25 Six palmprint ROI images of the TJU-P database. The three images of the first row were captured in the first session. The three images of the second row were captured in the second session.

Fig. 29 Fig. 30
Fig. 29 Six palm vein ROI images of the PolyU M_N database. The three images of the first row were captured in the first session. The three images of the second row were captured in the second session.

6 Conclusions
This paper systematically investigated the recognition performance of classic CNNs for 2D and 3D palmprint recognition and palm vein recognition. Seventeen representative and classic CNNs were exploited for performance evaluation, including AlexNet, VGG, Inception_v3, ResNet, Inception_v4, Inception_ResNet_v2, DenseNet, Xception, ResNeXt, MobileNet_v2, ShuffleNet_v2, SENet, MobileNet_v3, EfficientNet, GhostNet, RegNet and ResNeSt. Five 2D palmprint databases, one 3D palmprint database and two palm vein databases were exploited for performance evaluation, including PolyU II, PolyU M_B, HFUT, HFUT CS, TJU-P, PolyU 3D, PolyU M_N and TJU-PV. These databases are very representative. For example, the PolyU II, PolyU M_B, PolyU M_N and HFUT databases were collected in a contact manner, while HFUT CS, TJU-P and TJU-PV were captured in a contactless manner. All databases were collected in two different sessions. In particular, HFUT CS is a rather challenging database because it was collected in two different sessions, in a contactless manner, and across three different sensors. We conducted many experiments on the above databases with different network structures, different learning rates and different numbers of network layers, in both the separate data mode and the mixed data mode, and we also compared the recognition performance of the CNNs with that of traditional methods. According to the experimental results, we have the following observations. 1) The performance of recently proposed CNNs such as EfficientNet and MobileNet_v3 is obviously better than that of earlier CNNs. In particular, EfficientNet achieves the best recognition accuracy. 2) Learning rate is an important hyperparameter that strongly influences the recognition performance of CNNs. For palmprint and palm vein recognition, 5×10−5 is an appropriate learning rate. 3) With more layers, VGG and ResNet did not obtain better recognition results.

Table 1
Summary of existing 2D palmprint recognition methods based on deep learning

Table 2
Summary of 3D palmprint recognition methods based on deep learning
MobileNet_v3 is an improved version of MobileNet_v1 and MobileNet_v2; ShuffleNet_v2 is a good compression network, and is a modified version of ShuffleNet_v1; SENet enhances important features to improve accuracy; EfficientNet, GhostNet, RegNet and ResNeSt are four representative CNNs proposed recently.

Table 3
Summary of palm vein recognition methods based on deep learning

Table 4
Details of 2D palmprint, 3D palmprint and palm vein databases

Table 5
Full name and its abbreviated name of selected CNNs

Table 6
Recognition rates of ResNet18 under different learning rates

Table 7
Recognition rates of EfficientNet under different learning rates

Table 8
Recognition rates of VGG and ResNet under different numbers of layers

Table 9
Recognition rates of EfficientNet from b0 to b7

Table 10
Recognition results of different CNNs on 2D palmprint and palm vein databases under the separate data mode

Table 12
Recognition rates of EfficientNet with different amounts of mixed training data
Table 17 lists the comparison results. It can be seen that the recognition accuracy of the CNNs is slightly better than that of the traditional methods.

Table 13
Recognition results of different CNNs on 2D palmprint and palm vein databases under the mixed data mode

Table 14
Recognition results of different CNNs on four 2D representations of 3D palmprint databases under the mixed data mode

Table 15
2D palmprint and palm vein recognition: Performance comparison between classic CNNs and other methods under the separate data mode

Table 16
2D palmprint and palm vein recognition: Performance comparison between classic CNNs and other methods under the mixed data mode

Table 17
3D palmprint recognition: Performance comparison between classic CNNs and other methods under the separate data mode