Introduction

Person identification can use several human body parts or traits, which are classified as primary and soft biometric traits [1]. Primary biometric traits include the fingerprint [2], hand [3], body [4], gait [5], face [6], and voice [24]. Soft biometric traits, such as androgenic (arm) hair patterns, gender, age, weight, skin marks, height, and color (of skin, hair, and eye), are used along with primary biometric traits to improve accuracy [1].

Often, the evidence collected is in the form of digital images captured in uncontrolled situations [7]. As most perpetrators cover their faces, the only available information in these images is often their hands. Though hands are a primary biometric trait, they show less variability than faces; facial features are generally more complex and visible, making the face a more robust trait for identification. With the advent of more sophisticated digital cameras and higher-resolution closed-circuit television (CCTV) cameras in public places, several security systems have used hand vein patterns and androgenic hair patterns for person identification [8].

There are several methods to recognize humans from primary and soft biometric traits. Afifi [9] used a two-stream convolutional neural network with a support vector machine classifier for hand-based person identification, but treated a subject's two hands as identical, which is unrealistic and less accurate. Baisa et al. [3] proposed global and part-aware deep feature representation learning for hand-based person identification. Similarly, several other deep learning architectures, such as the part-based convolutional baseline (PCB), multiple granularity network (MGN), pyramidal representations network (PyrNet), attentive but diverse network (ABD-Net), omni-scale network (OSNet), discriminative and generative learning network (DGNet), dual part-aligned representations network (P2Net), and interaction and aggregation network (IANet), have been used for person identification from digital images, but all of these networks must be retrained entirely whenever new data arrive for person re-identification [6, 8]. In serious crimes, new criminals are added over time, and retraining on the entire database for each new addition is very tedious.

There are very few works on person identification from arm's or androgenic hair patterns [10,11,12]. The existing methods used grayscale intensities, local binary patterns (LBP), and histograms of oriented gradients (HOG), i.e., hand-crafted features. It is evident from the literature that state-of-the-art deep learning techniques outperform such machine learning techniques based on hand-crafted features. Earlier methods for person re-ID (re-IDentification) extracted local descriptors, low-level features or high-level semantic attributes, and global representations through sophisticated but time-consuming hand-crafted features. In addition, hand-crafted feature representations failed to perform well when image variants such as occlusion, background clutter, pose, illumination, cultural and regional background, intra-class variations, cropped images, multiple viewpoints, and deformations were present in the data. Deep neural networks were introduced to person re-ID in 2014 and completely changed the feature extraction methodology: deep learned features perform better in end-to-end learning and are robust to these image variants. This improved feature representation makes deep learning more popular than machine learning methods for person re-ID [6, 8, 13, 14].

To address the above issues, we propose and implement a novel architecture based on Siamese networks to identify a person from their arm's hair patterns. Since no standard database dedicated to arm's hair pattern recognition exists, we created one and analyzed arm's hair pattern person identification with several state-of-the-art deep learning architectures.

The key contributions of this paper are as follows:

  • Person identification with a novel color threshold (CT)-twofold Siamese network architecture using arm’s androgenic hair patterns.

  • A new database of hand images for person identification, collected from Indian subjects.

The rest of the paper is organized as follows. The next section reviews existing methods of person identification. The third section describes the proposed methodology. The fourth section discusses the experimental results, and the last section concludes the paper with future directions.

Literature survey

Person re-identification from the arm's androgenic hair comes under closed-world person re-ID: a single modality is used with bounding boxes, sufficient and correctly annotated data exist, and the query exists in the gallery. The three standard components in a closed-world re-ID system are feature representation learning, deep metric learning, and ranking optimization [6].

Feature representation learning covers global, local, auxiliary, and video (temporal) features. Global feature learning captures the fine-grained cues for each subject present in the image. Single- and cross-image representation frameworks, trained using triplet loss, have been used [15], but they did not perform well on multiclass classification problems. ID-discriminative embeddings (IDE) were widely used to address multiclass classification but could not capture discriminative cues at different scales; Qian et al. [16] proposed multiscale deep representation learning models to address this issue. Attention models were then proposed to enhance robustness against misalignment and to mine feature relations across multiple images. Song et al. [17] proposed mask-guided contrastive attention for person identification, while Li et al. [18] and Wang et al. [19] proposed harmonious and multitask attention models, respectively, for person re-identification. However, none of these architectures were proposed for person identification from androgenic hair patterns.

Another way of making the model more robust against misalignment is local feature representation learning. Feature-level fusion techniques are used here; examples include the multi-channel part-based CNN of Cheng et al. [20], the deep context-aware features of Li et al. [21], and the feature decomposition and fusion of Zhao et al. [22]. Still, these did not perform well with multiple part-level classifiers and horizontally divided region features. State-of-the-art architectures that address these issues include the Siamese long short-term memory network (Varior et al. [23]), second-order non-local attention networks, interaction and aggregation networks, and AANet [6, 25,26,27,28,29]. But these have not been explored for person re-identification using hair patterns.

Table 1 Related literature on person re-identification from various features

Several other attributes, such as semantic attributes [30], viewpoint information [31], domain information [32], and generative adversarial networks (GANs) [33], are used as auxiliary features for person re-identification. Auxiliary feature representation learning also includes data augmentation for better performance [34]. Spatio-temporal attention cues are popular in video feature representation learning [35]. Though GANs are very popular, they are better suited to open-world person re-identification and perform better when primary biometric parameters are used. Training GANs is another issue, as it is time-consuming, and the generated samples often lack diversity, which leads to limited improvement or even performance degradation. The existing literature attributes this weakness to GANs' focus on pose variations and camera-style adaptation, which hinders them from modeling other important aspects, including viewpoint and background changes, and leaves the generated samples short of diversity. Self-supervised learning (SSL) was introduced to address this issue, but SSL trains and tests basic CNN architectures on huge amounts of data, and such methods resort to semi-supervised techniques to learn more discriminative features and to generalize [36].

Before the extensive use of deep learning, metric learning was popular [37]; its role has since been taken over by loss function design. Several loss functions are widely used, such as identity loss [15], verification loss [38], triplet loss [39], and online instance matching loss [40], depending on the data and the desired result. Along with the loss functions, training strategies such as batch sampling and identity sampling are used to address data imbalance [41].

In the testing phase, the retrieval performance can be improved using ranking optimization [42]. It can be performed using automatic gallery-to-gallery similarity mining, query-adaptive or human-interaction-based re-ranking, and rank or metric fusion. The common and popular evaluation metrics are the cumulative matching characteristics (CMC) and mean average precision (mAP). Recently, the mean inverse negative penalty (mINP) has also been used for smaller datasets; it avoids the domination of easy matches in mAP and CMC evaluations [6].

Most prior work addresses person re-identification from the face, alone or together with other attributes. In our setting (criminology), however, faces are covered most of the time. Table 1 summarizes works that use non-face data for the person re-identification problem, listing the features used in the literature along with the corresponding methodologies, datasets, and evaluation metrics. It is observed that no publicly available dataset on arm's androgenic hair patterns exists, and that androgenic hair pattern-based person re-identification has not used state-of-the-art deep learning architectures. This paper proposes a deep learning architecture based on Siamese networks and presents a new database of arm images with androgenic hair patterns.

Created dataset

Subjects, consent and image data

The hand images for the database were collected with a Nikon D5300 DSLR camera with a maximum resolution of 6000 \(\times \) 4000 pixels. Indians commonly have dense androgenic hair on the arms, and hence images were taken from at least three different angles to cover the entire hand. No strict posing guidelines were imposed; subjects were free to vary viewpoint, pose, and illumination. We ensured a clean background so that the hand is clearly visible. All other image variants were obtained using data augmentation techniques.

The subjects were of different ages, sexes, races, and cultures. A total of 50 subjects were considered in this study. Consent was obtained from each subject that their data would be used for research purposes only. For every subject, we took images of both the left and the right hand, with at least three different images per hand so that the collected images contain different image variants and cover the entire hand. The distance between the subject and the DSLR was approximately 1.2 m. Figures 1 and 2 show raw images collected from a subject for the right hand and left hand, respectively.

The collected images also contain skin marks, scars, and other skin features. The camera has a 23.5 mm \(\times \) 15.6 mm RGB CMOS (red, green, blue complementary metal oxide semiconductor) sensor with a 1.5\(\times \) FOV (field of view) crop, and a focal length of 55 mm was used. We adjusted the focal length, sensor pixel size, and resolution so that the focus in the collected images is more on the arm hair than on skin marks, scars, or other skin features. A minimum of three images per hand was obtained for all 50 subjects, but in some cases, we took more than three images per hand to cover all the androgenic hair patterns; these additional images were taken when the hair patterns were too dense or too sparse and for hand parts with tattoos or skin marks. Therefore, instead of 300 images (50 subjects \(\times \) 2 hands \(\times \) 3 images), we obtained 383 images at the end of this step. Though we adjusted the camera settings to avoid skin marks and tattoos, in a few cases manual cropping was also performed so that only the hand is visible.

Fig. 1 Sample right hand raw image of the created database

Fig. 2 Sample left hand raw image of the created database

The collected high-resolution images were reduced to lower resolutions of around 244 \(\times \) 244 depending on the preprocessing steps and the deep learning architecture used. Hence, this study's image resolutions range between 12.5 and 40 dpi (dots per inch). After reducing the resolution, we observed that the quality of some cropped images deteriorated drastically. To address this, we divided such images into two or more parts (with the same person ID), increasing the total number of images from 383 to 424. None of the resulting 424 images contains tattoos or external markings on the hands.

The naming convention for the collected images is shown in Fig. 3. The first three digits identify the subject and are therefore the unique part of each image name. The next two digits are either 00 or 11, representing the right and left hand, respectively. The last two digits are sequence numbers for the images taken of each hand. This naming convention makes training and validation smooth when using functions like data generators in deep learning frameworks.
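To make the convention concrete, here is a minimal Python helper; the seven-digit basename format is inferred from the sample name {0020003} used later in the paper and is therefore an assumption:

```python
def parse_image_name(filename: str) -> dict:
    """Parse the 7-digit naming convention, e.g. '0020003' ->
    subject 002, right hand (00), sequence 03."""
    stem = filename.split(".")[0]               # drop any file extension
    assert len(stem) == 7 and stem.isdigit(), "expected a 7-digit basename"
    return {
        "subject": stem[:3],                     # unique subject identifier
        "hand": "right" if stem[3:5] == "00" else "left",  # 00 = right, 11 = left
        "sequence": int(stem[5:7]),              # per-hand image sequence number
    }

print(parse_image_name("0020003"))
# {'subject': '002', 'hand': 'right', 'sequence': 3}
```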

Fig. 3 Naming convention used in the created database

Preprocessing

The database of criminals in forensic analysis is generally created in controlled environments. The created database likewise contains images from a controlled environment, but crime scene data come from uncontrolled situations and therefore vary in angle, resolution, illumination, and so on. To make both the database and the deep learning architecture more robust, we used data augmentation techniques and preprocessing.

Rotation range, height shift range, width shift range, zoom range, fill mode, horizontal flip, channel shift range, and ZCA whitening are the eight data augmentation techniques used in this study, with values/parameters of 40, 0.2, 0.2, 0.2, nearest, true, 20, and true, respectively. These techniques follow the standard literature [13, 14, 36, 44]. We used all the data augmentation techniques given in the TensorFlow documentation except color space transformations, which alter the hand's skin tone, one of the unique features, and are therefore not recommended for person re-ID in the existing literature. Regarding the augmentation values, we used the standard values from the literature and cross-verified them manually in an empirical study; the standard values also perform best for person re-ID.
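The eight settings map directly onto Keras' ImageDataGenerator; a minimal configuration sketch follows, where the generator class is from the TensorFlow/Keras API and only the listed parameter values come from this study:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The eight augmentation settings listed above.
datagen = ImageDataGenerator(
    rotation_range=40,        # degrees
    height_shift_range=0.2,   # fraction of total height
    width_shift_range=0.2,    # fraction of total width
    zoom_range=0.2,
    fill_mode="nearest",      # how newly created pixels are filled
    horizontal_flip=True,
    channel_shift_range=20,
    zca_whitening=True,       # note: requires datagen.fit(sample_images) first
)
```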

The proposed architecture takes both the color image and the thresholded image as input. The following steps were used to convert the color image to the thresholded image (a code sketch follows the list).

  • Step 1—Grayscale image: The input color image is first converted to a grayscale image. The Sobel operator is then used to smoothen the grayscale image.

  • Step 2—Black-hat transform: This morphological operation extracts small elements and details from an image; the hairs are highlighted as white objects on a dark background, as shown in Fig. 4. The settings used in this study, namely anchor, iterations, borderType, and borderValue, were set to Point\((-1,-1)\), 1, BORDER_CONSTANT, and morphologyDefaultBorderValue(), respectively.

  • Step 3—Binary thresholding: This produces the thresholded image, where a pixel is set to 255 if its value exceeds the threshold and to zero otherwise.
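A compact OpenCV (Python) sketch of these three steps is given below; the Sobel kernel size, structuring-element size, and threshold value are illustrative assumptions, while the black-hat settings mirror those listed in Step 2:

```python
import cv2

def color_to_thresholded(path: str, thresh: int = 20, se_size: int = 13):
    """Steps 1-3: grayscale + Sobel, black-hat transform, binary threshold.
    `thresh` and `se_size` are illustrative values, not taken from the paper."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)

    # Step 1: Sobel operator on the grayscale image (gradient magnitude).
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    sobel = cv2.convertScaleAbs(cv2.magnitude(gx, gy))

    # Step 2: black-hat transform highlights the dark hairs on brighter skin
    # (they appear white on a dark background, as in Fig. 4).
    se = cv2.getStructuringElement(cv2.MORPH_RECT, (se_size, se_size))
    blackhat = cv2.morphologyEx(sobel, cv2.MORPH_BLACKHAT, se,
                                anchor=(-1, -1), iterations=1,
                                borderType=cv2.BORDER_CONSTANT)

    # Step 3: binary thresholding (255 above the threshold, 0 otherwise).
    _, binary = cv2.threshold(blackhat, thresh, 255, cv2.THRESH_BINARY)
    return binary
```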

Fig. 4 Sample image snapshot for results of preprocessing steps

Figure 4 shows a sample snapshot of all the preprocessing steps. The output images in Fig. 4 correspond to a portion of a single input color image. After preprocessing, each output image is stored under the same name as the input color image in a separate folder. Figure 5 shows a sample thresholded image of subject 002.

Fig. 5 Sample thresholded image {subject : 0020003} after the preprocessing

The input image given to the preprocessing step is manually cropped to the hand region. Although the complete picture is not sent for preprocessing, to illustrate which parts were cropped and removed for the image shown in Fig. 5, the uncropped color image of the same subject was also passed through the pipeline; the output is shown in Fig. 6, with the parts not used for computation marked as cropped and unused. Only the part containing the arm's hair (the middle part of the image) is used for computation, as also shown in Fig. 5.

Fig. 6 Sample thresholded image {subject : 0020003} after the preprocessing for an uncropped color image as input

After preprocessing, we obtained another set of 424 thresholded images. We then applied the eight data augmentation techniques to the 424 color (actual) images and the 424 thresholded images, yielding a total of 6784 images ((424 color \(\times \) 8) + (424 thresholded \(\times \) 8)). Manual verification was performed by two different human observers to avoid bias (mainly in cropping and discarding unrelated areas, judging the similarity of two images after augmentation or thresholding, and discarding images distorted by augmentation or by lowering the resolution). We measured inter-rater reliability using Cohen's kappa and found that the two observers agree with \(\kappa = 0.96\) (this calculation covers all steps in which the human observers were involved).
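As a sketch, inter-rater agreement over such binary keep/discard decisions can be computed with scikit-learn; the labels below are hypothetical:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical keep(1)/discard(0) decisions by the two observers on the same images.
observer_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
observer_b = [1, 1, 0, 1, 1, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(observer_a, observer_b)
print(f"Cohen's kappa: {kappa:.2f}")   # values near 1 indicate strong agreement
```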

Table 2 Created database details

After data augmentation and thresholding, all the images were cross-verified manually again. Images without meaningful hair content were discarded; for example, some crops produced during augmentation cover the part of the hand close to the wrist, which has little hair. After discarding such images, 6500 images remained (284 images were discarded in this step). The complete details of the created database are given in Table 2.

Fig. 7 Complete flow of the proposed work

Proposed methodology

Person identification using visual features can be modeled as a similarity learning problem. Siamese architectures are extensively used in deep CNN models for similarity learning, since they require fewer parameters to be trained whenever a new entry is added to the database for person identification. From the literature, two types of input image perform well for person identification from arm's hair [8]: the thresholded image and the color image, and both are used in our proposed architecture.

Figure 7 shows the complete methodology of the proposed work. The proposed color threshold (CT)-twofold Siamese network is composed of two different CNN-based networks. The notations used in Fig. 7 are \(c\), \(c^t\), and X, which represent the color image, the thresholded image, and the search region, respectively. The size of \(c^t\) and X is \(W_t \times H_t \times 3\). X is a collection of image patches of the same dimension as c; the target has size \(W_s \times H_s \times 3\), where \(H_s<H_t\) and \(W_s<W_t\), and is located at the centre of \(c^t\). The C-Net and T-Net are not combined until testing time, similar to [46].

T-Net: The network that takes the thresholded images as input clones its architecture from the SiamFC (Siamese fully convolutional) network [47]. Its convolutional network extracts features from the thresholded image (denoted by \(f_{a}(.)\)); we call it the T-Net. The following equation gives the appearance branch response map, where corr(.) is the correlation operation:

$$\begin{aligned} h_{a}(c,X)=\mathrm{corr}\left( f_{a}(c),f_{a}(X)\right) . \end{aligned}$$
(1)

All the parameters of the T-Net are trained from scratch for similarity learning. The following logistic loss function is minimized to optimize the T-Net, where \(Y_i\) is the ground-truth response map for the ith training pair, \(\theta _a\) the parameters of the T-Net, and N the number of training samples:

$$\begin{aligned} \mathrm{arg} \,\mathrm{min}_{\theta _{a}}\frac{1}{N}\sum _{i=1}^{N}{L\left( h_{a}(c_{i},X_{i},\theta _{a}),Y_{i}\right) } \end{aligned}$$
(2)

C-Net: The second network takes color images as its input (the C-Net). Here, the Inception v3 architecture is used as the pre-trained network, and its parameters are updated only in the last two convolutional layers (all other layers are frozen). Low-level features are not extracted from the pre-trained network, as the layers provide different levels of abstraction. Each convolutional layer's features have a different spatial resolution and need to be concatenated (represented by f(.)). After feature extraction, a \(1\times 1\) ConvNet is used as a fusion module to make these features suitable for correlation; this fusion is performed within the features of the same layer. \(g(f_t(X))\) gives the feature vector for the search region after fusion.
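A minimal Keras sketch of such a backbone is shown below; the input size, the "last two layers" freezing cutoff, and the fused channel count are illustrative assumptions, not the authors' exact configuration:

```python
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3

def build_cnet(input_shape=(224, 224, 3), fused_channels=256):
    """Pre-trained backbone with all but the last two layers frozen,
    plus a 1x1 convolution as the fusion module."""
    backbone = InceptionV3(include_top=False, weights="imagenet",
                           input_shape=input_shape)
    for layer in backbone.layers[:-2]:   # freeze everything except the last two layers
        layer.trainable = False
    # 1x1 ConvNet fusion module: makes the extracted features suitable for correlation.
    fused = tf.keras.layers.Conv2D(fused_channels, kernel_size=1,
                                   name="fusion_1x1")(backbone.output)
    return tf.keras.Model(backbone.input, fused, name="c_net")
```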

In target processing, \(c^t\) is taken as the target input by the C-Net. This target input contains the contextual features, denoted by t. The features obtained from this module are high-level features that are robust to changes in the object; hence, they are more generalized and less discriminative. Channel attention modules are introduced to enhance the discriminative power of the architecture. The attention modules use \(c^t\) as the feature map instead of t, giving importance to the surrounding context along with the target. Channel-wise operations are used in the attention module, and the attention process for the \(i\mathrm{th}\) channel is shown in Fig. 8.

Several operations are performed; for example, a conv5 feature map has a \(22 \times 22\) spatial dimension and is divided into a \(3 \times 3\) grid, with the central grid cell covering the target at dimension \(6 \times 6\). Max pooling is performed within each grid cell, and a coefficient is then produced by a two-layer multi-layer perceptron whose weights are shared across the channels of the same convolutional layer. The final output \(w_i\) is obtained using a sigmoid function with a bias. A single crop operation on \(f_t(c^t)\) yields \(f_t(c)\). The output of the attention module is the channel weights \(w_i\), and the input is \(f_t(c^t)\). The following equation gives the response map, where w has the same dimension as \(f_t(c)\) and \(\cdot \) denotes the element-wise operation. Here, only the channel attention module and the fusion module are trained:

$$\begin{aligned} h_{t}\left( c^{t},X\right) =\mathrm{corr}\left( g \left( w \cdot f_{t}(c)\right) ,g\left( f_{t}(X)\right) \right) . \end{aligned}$$
(3)

The logistic loss function (Eq. 4) is minimized to optimize the response map. The training pairs are \(((c^t)_i, X_i)\) with ground-truth response maps \(Y_i\):

$$\begin{aligned} \mathrm{arg} \, \mathrm{min}_{\theta _{t}}\frac{1}{N}\sum _{i=1}^{N}{L\left( h_{t}\left( c_{i}^{t},X_{i},\theta _{t}\right) ,Y_{i}\right) }, \end{aligned}$$
(4)
Fig. 8 Attention process for \(i\mathrm{th}\) channel in channel-wise operations

where N denotes the number of training samples and \(\theta _t\) denotes the trainable parameters. A weighted average of heatmaps (Eq. 5) gives the overall heatmap of the two branches at test time. Here, \(\lambda \) is a weighting parameter estimated on the validation set. The best-matched location in re-ID is the one with the largest value of \(h(c^{t},X)\):

$$\begin{aligned} h\left( c^{t},X\right) =\lambda h_{a}(c,X) + (1- \lambda )h_{t}\left( c^{t},X\right) . \end{aligned}$$
(5)

A VGGNet (visual geometry group network)-like architecture is used as the base network for both the T-Net and the C-Net. As mentioned earlier, the T-Net is a replica of the SiamFC network, and the C-Net is loaded from a VGGNet pre-trained on ImageNet. The C-Net strides are adjusted so that the last layers of the C-Net and T-Net have the same dimension. To avoid channels being suppressed to zero in the attention module, a nine-dimensional vector is used for the pooled features of each layer. Therefore, the MLP (multi-layer perceptron) layers have nine neurons with a ReLU (rectified linear unit) non-linearity, followed by a sigmoid function with a bias of 0.5.
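The channel attention just described can be sketched in NumPy as follows; the MLP weight shapes follow the nine-neuron hidden layer above, while the learned weight values would, of course, come from training:

```python
import numpy as np

def channel_attention(feat, w1, b1, w2, b2, grid=3, bias=0.5):
    """feat: (H, W, C) target feature map (e.g. a 22 x 22 conv5 map).
    w1: (9, 9), b1: (9,), w2: (9, 1), b2: (1,) -- the shared two-layer MLP.
    Returns one weight per channel in the range (bias, 1 + bias)."""
    H, W, C = feat.shape
    hs = np.linspace(0, H, grid + 1, dtype=int)
    ws = np.linspace(0, W, grid + 1, dtype=int)
    # Max pooling within each cell of the 3 x 3 grid -> 9 values per channel.
    pooled = np.stack([
        feat[hs[i]:hs[i + 1], ws[j]:ws[j + 1], :].max(axis=(0, 1))
        for i in range(grid) for j in range(grid)
    ])                                            # shape (9, C)
    hidden = np.maximum(pooled.T @ w1 + b1, 0.0)  # shared MLP, ReLU hidden layer
    logits = hidden @ w2 + b2                     # shape (C, 1)
    # Sigmoid plus a 0.5 bias, so no channel is suppressed to zero.
    return (1.0 / (1.0 + np.exp(-logits)) + bias).ravel()
```

The resulting weights \(w_i\) rescale each channel of \(f_t(c)\) before the correlation in Eq. 3.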

Results and analysis

The standard metrics used in person re-identification are the cumulative matching characteristics (CMC) and mAP (mean average precision). These are generally used in biometric systems operating on closed-set identification tasks: the test images (templates) are compared with the annotated images in the database (biometric subjects) and ranked by similarity, and the CMC relates rank to identification rate. If each test identity has only one gallery instance (single-gallery-shot), then for every query the algorithm ranks the gallery samples, and the CMC top-k accuracy is the step function given in the following equation:

$$\begin{aligned} \mathrm{Acc}_k = \left\{ \begin{array}{ll} 1 &{}\quad \hbox {if the top-}k \hbox { ranked gallery samples} \\ &{}\quad \hbox {contain the query identity} \\ 0 &{}\quad \hbox {otherwise} \end{array}\right. \end{aligned}$$
(6)
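A direct implementation of this step function and its average over queries (the CMC curve) might look like the following sketch:

```python
import numpy as np

def cmc_top_k(ranked_gallery_ids, query_id, k):
    """Acc_k (Eq. 6): 1 if the top-k ranked gallery samples contain
    the query identity, 0 otherwise."""
    return int(query_id in ranked_gallery_ids[:k])

def cmc_curve(rankings, query_ids, max_rank=30):
    """Average Acc_k over all queries for k = 1..max_rank."""
    return np.array([
        np.mean([cmc_top_k(r, q, k) for r, q in zip(rankings, query_ids)])
        for k in range(1, max_rank + 1)
    ])
```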

Due to data augmentation and the use of multiple images of both hands of the same person, multiple instances of the same person can appear among the test samples (multi-gallery-shot setting). To evaluate this case with a better-suited metric, we also use mINP (mean inverse negative penalty) to check the model's performance on the created database. The negative penalty measures the penalty of the hardest correct match, as shown in the following equation, where \(|Q_j|\) is the total number of correct matches for query j and \(H_j^\mathrm{hard}\) is the rank position of the hardest match:

$$\begin{aligned} \mathrm{NP}_{j} = \frac{H^{\mathrm{hard}}_{j} - |Q_{j}|}{H^{\mathrm{hard}}_{j}}. \end{aligned}$$
(7)

The INP (inverse negative penalty) is the inverse of the NP, and we use mINP as shown in Eq. 8. CMC and mAP evaluations are dominated by easy matches, which mINP avoids. mINP is less informative for larger datasets, but since our dataset contains only 50 subjects, it serves well as a supplementary metric alongside the widely used CMC and mAP:

$$\begin{aligned} \mathrm{mINP} = \frac{1}{n} \sum _{j}^{}\left( 1-\mathrm{NP}_{j}\right) = \frac{1}{n} \sum _{j}^{}\frac{|Q_{j}|}{H^\mathrm{hard}_{j}} \end{aligned}$$
(8)
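Under the same conventions, NP and mINP can be sketched as:

```python
import numpy as np

def mean_inverse_negative_penalty(rankings, query_ids):
    """mINP (Eq. 8): average of |Q_j| / H_j^hard over all queries, where
    |Q_j| is the number of correct matches for query j and H_j^hard is the
    rank position of the hardest (last-found) correct match."""
    inps = []
    for ranked, q in zip(rankings, query_ids):
        hit_ranks = [r for r, gid in enumerate(ranked, start=1) if gid == q]
        if hit_ranks:                         # skip queries with no gallery match
            inps.append(len(hit_ranks) / hit_ranks[-1])   # |Q_j| / H_j^hard
    return float(np.mean(inps))
```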

Implementation details: A small weight \(\lambda \) combines the two branches and is tuned on the validation set; the grid search was performed from 0.1 to 0.9 with step 0.2, and \(\lambda = 0.3\) performed best. The attention module has one hidden layer with a 9-dimensional vector and ReLU as its non-linearity. From the empirical study, the proposed model performed best with a learning rate of 0.01. The average training speed of the CTTSN was 52 frames per second (fps). Three scales are searched to handle scale variations during evaluation and testing. Weight decay and momentum were empirically set to 0.0005 and 0.9, respectively.
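The fusion of Eq. 5 and the validation-set grid search for \(\lambda \) can be sketched as follows; the scoring function is a user-supplied stand-in, not part of the paper:

```python
import numpy as np

def fuse_heatmaps(h_a, h_t, lam):
    """Weighted average of the two branch heatmaps (Eq. 5)."""
    return lam * h_a + (1.0 - lam) * h_t

def grid_search_lambda(validation_score, grid=np.arange(0.1, 1.0, 0.2)):
    """Pick lambda in {0.1, 0.3, 0.5, 0.7, 0.9} maximizing a validation metric."""
    scores = {round(float(lam), 1): validation_score(lam) for lam in grid}
    return max(scores, key=scores.get), scores

# Toy usage with a stand-in metric that peaks at 0.3 (as reported above):
best_lam, _ = grid_search_lambda(lambda lam: -abs(lam - 0.3))
print(best_lam)   # 0.3
```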

Table 3 Comparison of created dataset results

Table 3 compares the CMC, mAP, and mINP results of the proposed and other methods. The proposed CT-twofold Siamese network (CTTSN) uses a VGG network pre-trained on ImageNet. We also swapped in other popular backbones such as Inception v4, ResNet, AlexNet, and Xception (all four likewise trained on ImageNet) and observed comparatively lower results, since the ImageNet weights are not a significant contributor compared to the other features in the data. SiamFC [47], the base Siamese architecture, and the DSiam (dynamic Siamese network) architecture for visual object tracking [48] were also tested and compared with the other architectures.

Fig. 9 Cumulative match curves comparison for the created database

Figure 9 shows the Rank-1 results of the proposed architecture compared with the Siamese network and modified versions of our network: CTTSN (color) contains only the C-Net and CTTSN (threshold) only the T-Net. Figure 9 shows that the proposed architecture with both C-Net and T-Net performs better (up to rank 30), underscoring the complementary features needed to strengthen the proposed model's performance.

We also used data augmentation and included those images during training. Figure 10 compares the results with and without data augmentation; it is evident that data augmentation increases performance, in line with the existing literature [44].

Fig. 10 Cumulative match curves comparison with and without data augmentation

The dataset contains thresholded, data-augmented, and color images. When these image types are provided separately in the testing phase, the performance of the proposed model is as shown in Figs. 11, 12, and 13 for color, thresholded, and augmented images, respectively.

Fig. 11 Cumulative match curves comparison for color images as input test image

Color images (Fig. 11) yield higher accuracy because the proposed architecture can also exploit other features such as hair color. The data-augmented images include different resolutions and cropped images, some of which cover parts of the hand with very little hair; hence they perform worse than color or thresholded images. We observed a similar drop when comparing male and female subjects: mAP and mINP dipped by approximately 7% for females, whose hair is sparser on some parts of the hand.

Fig. 12 Cumulative match curves comparison for thresholded images as input test image

Fig. 13 Cumulative match curves comparison for data augmented images as input test image

The results were also compared across input image resolutions, and the performance of the proposed method decreases with decreasing input image size, as shown in Fig. 14. The image sizes used for the comparison follow the standard sizes used in the literature [10] for comparison in the criminology department. The curves reflect the importance of clearly visible hair features, as performance degrades with decreasing image resolution.

Fig. 14 Cumulative match curves comparison for different input image resolutions

Grad-CAM (gradient-weighted class activation mapping): We used Grad-CAM class activation visualization following the Keras documentation. Grad-CAM depicts, via heat maps, the discriminative features responsible for person identification; it uses the gradient information flowing into the last convolutional layer of the proposed architecture to understand each neuron's contribution to a decision of interest [49]. A sample image from the database {subject: 0020003} is shown in Fig. 15, where the features from the highlighted part are responsible for person re-identification {values related to \(h(c^{t},X)\)}. In addition, Fig. 16 shows the heat map, with the part of the hand with the highest probability of person re-identification shown in yellow. From Figs. 15 and 16, it is evident that the proposed method effectively identifies discriminative features to re-identify the person and generalizes well in choosing the region of interest.

Fig. 15 Sample image after applying Grad-CAM

Fig. 16 Heatmap of the sample image after applying Grad-CAM

Apart from the detailed results shown in the graphs, we also performed ablation analyses of the proposed method on the created dataset.

  • Both T-Net and C-Net were trained with random initialization, which yielded an mAP of 81.5, indicating the need for a good initialization.

  • Using only color images in both the T-Net and C-Net did not give better performance (mAP 79.11), and similarly for thresholded images only (mAP 83.17). This confirms the importance of the complementary features in the proposed method.

  • Removing the channel attention module from the C-Net caused a drastic drop in performance (mAP 76.11), showing the importance of balancing the intra-layer and inter-layer channels.

  • In the proposed method, the two branches are trained separately. Training both network branches jointly instead yielded an mAP of 83.67 versus 84.21 for separate training, suggesting the importance of optimizing the multilevel features independently.

  • We inspected the channel weights for hand images of two different subjects. Since a sigmoid function with a bias of 0.5 is used, the weight distribution ranges from 0.5 to 1.5. We observed a different weight distribution in the conv4 and conv5 layers for each image, indicating the importance of channel weights in the proposed method.

From the results, the created database of 50 subjects performs best when both the C-Net and T-Net are trained with color, thresholded, and data-augmented images. Performance is highest when the test input consists of color images, and it is also better for male subjects and for high-resolution inputs. Thresholded images contain only hair patterns; since person re-identification from hair patterns is a soft biometric problem, the results with only thresholded images (Fig. 12) suggest the huge potential of arm's hair patterns for person re-identification from digital images.

Limitations

The created database contains Asian subjects only. The performance of the proposed model is insufficient when the input image comes from lower-resolution cameras (such as some CCTV cameras). Furthermore, when color images are provided during training, the deep learning architecture by default uses all the features present in the image, making it difficult to isolate the impact of the hair patterns alone.

Conclusion

In criminology, the data available for identifying a criminal are mostly collected in uncontrolled situations. Perpetrators generally wear a mask during the crime, and other body parts may not be as clearly visible as the hands. This paper presented a database of 6500 images of persons' hands using androgenic hair as a soft biometric parameter for the person re-identification problem. We proposed a CT-twofold Siamese network and analyzed its performance on the created database. The results show the potential to recognize a person from arm's androgenic hair patterns: the proposed model achieves a Rank-1 cumulative match of 93.5 on the created database. The proposed methodology primarily targets forensic psychiatric hospitals, where the subjects are generally non-cooperative. This class of person re-identification problems falls under closed-world re-ID and should work unobtrusively in real time with low training and testing cost; the method should train quickly and perform well whenever new data are obtained. The proposed CTTSN meets these requirements for closed-world person re-identification using androgenic hair patterns (a soft biometric). The training data are made robust through augmentation, and the methodology is both discriminative and generalized. The proposed method runs in real time at 52 fps on test images.

Future directions include using other intra-image modalities, such as skin color, skin marks, and veins, together with hair patterns to identify a person. We also plan to collect more diverse data to test the proposed model and improve its robustness. The architecture should further be tested in settings such as forensic psychiatric hospitals with different CCTV locations and subjects to check its robustness and identify improvements. Hospitals generally have CCTV installed; instead of barcoding, radio-frequency identification (RFID), or biometrics for tracking and identifying patients during rehabilitation, the proposed technique can be a cost-effective and unobtrusive alternative to the existing ones.