Introduction

Person identification can use several human body parts or traits, which are classified as primary and soft biometric traits [1]. Primary biometric traits include the fingerprint [2], hand [3], body [4], gait [5], face [6], and voice [24]. Soft biometric traits, such as androgenic (arm) hair patterns, gender, age, weight, skin marks, height, and color (of skin, hair, and eye), are used along with primary biometric traits to improve accuracy [1].

Often, the evidence collected is in the form of digital images captured in uncontrolled situations [7]. As most perpetrators cover their faces, the only available information in these images is often their hands. Though hands are a primary biometric trait, they show less variability than faces; facial features are generally more complex and visible, making the face a more robust trait for identification. With the advent of more sophisticated digital cameras and higher-resolution closed-circuit television (CCTV) cameras in public places, several security systems have used hand vein patterns and androgenic hair patterns for person identification [8].

There are several methods to recognize humans from primary and soft biometric traits. Afifi [9] used a two-stream convolutional neural network with a support vector machine classifier for hand-based person identification, but treated a subject's two hands as identical, which is unrealistic and less accurate. Baisa et al. [3] proposed global and part-aware deep feature representation learning for hand-based person identification. Similarly, several other deep learning architectures, such as the part-based convolutional baseline (PCB), multiple granularity network (MGN), pyramidal representations network (PyrNet), attentive but diverse network (ABD-Net), omni-scale network (OSNet), discriminative and generative learning network (DGNet), dual part-aligned representations network (P2Net), and interaction and aggregation network (IANet), have been used for person identification from digital images, but all of these networks must be retrained entirely whenever new data arrive for person re-identification [6, 8]. In serious crimes, new criminals are added over time, and retraining on the entire database for each new addition is very tedious.

There are very few works on person identification from arm's or androgenic hair patterns [10,11,12]. The existing methods used grayscale intensities, local binary patterns (LBP), and histograms of oriented gradients (HOG), i.e., hand-crafted features. It is evident from the literature that state-of-the-art deep learning techniques outperform such machine learning techniques based on hand-crafted features. Earlier methods for person re-ID (re-IDentification) extracted local descriptors, low-level features or high-level semantic attributes, and global representations through sophisticated but time-consuming hand-crafted features. In addition, hand-crafted feature representations failed to perform well when image variants such as occlusion, background clutter, pose, illumination, cultural and regional background, intra-class variations, cropped images, multiple viewpoints, and deformations were present in the data. Deep neural networks were introduced to person re-ID in 2014 and completely changed the feature extraction methodology: deep learned features perform better in end-to-end learning and are robust to these image variants. This improved feature representation makes deep learning more popular than machine learning methods for person re-ID [6, 8, 13, 14].

To address the above issues, we propose and implement a novel architecture based on Siamese networks to identify a person from their arm's hair patterns. Since no standard database dedicated to arm's hair pattern recognition exists, we created one and analyzed arm's hair pattern person identification with several state-of-the-art deep learning architectures.

The key contributions of this paper are as follows:

  • Person identification with a novel color threshold (CT)-twofold Siamese network architecture using arm’s androgenic hair patterns.

  • A new database of hand images for person identification, collected from Indian subjects.

The rest of the paper is organized as follows. The next section reviews existing methods of person identification. The third section describes the proposed methodology. The fourth section discusses the experimental results, and the last section concludes the paper with future directions.

Literature survey

Person re-identification from the arm's androgenic hair comes under closed-world person re-ID: a single modality is used with bounding boxes, sufficient and correctly annotated data exist, and the query exists in the gallery. The three standard components in a closed-world re-ID system are feature representation learning, deep metric learning, and ranking optimization [6].

Feature representation learning covers global, local, auxiliary, and video (temporal) features. Global feature learning captures the fine-grained cues for each subject present in the image. Single- and cross-image representation frameworks, trained using triplet loss, have been used [15], but they did not perform well on multiclass classification problems. ID-discriminative embeddings (IDE) were widely used to address multiclass classification but could not capture discriminative cues at different scales; Qian et al. [16] proposed multiscale deep representation learning models to address this issue. Attention models were then proposed to enhance robustness against misalignment and to mine feature relations across multiple images. Song et al. [17] proposed mask-guided contrastive attention for person identification, while Li et al. [18] and Wang et al. [19] proposed harmonious and multitask attention models, respectively, for person re-identification. However, none of these architectures were proposed for person identification from androgenic hair patterns.

Another way of making the model more robust against misalignment is local feature representation learning. Feature-level fusion techniques are used here; examples include the multi-channel part-based CNN of Cheng et al. [20], the deep context-aware features of Li et al. [21], and the feature decomposition and fusion of Zhao et al. [22]. Still, these did not perform well with multiple part-level classifiers and horizontally divided region features. State-of-the-art architectures that address these issues include the Siamese long short-term memory network (Varior et al. [23]), second-order non-local attention networks, interaction and aggregation networks, and AANet [6, 25,26,27,28,29]. But these have not been explored for person re-identification using hair patterns.

Table 1 Related literature on person re-identification from various features

Several other attributes, such as semantic attributes [30], viewpoint information [31], domain information [32], and generative adversarial networks (GANs) [33], are used as auxiliary features for person re-identification. Auxiliary feature representation learning also includes data augmentation for better performance [34]. Spatio-temporal attention cues are popular in video feature representation learning [35]. Though GANs are very popular, they are better suited to open-world person re-identification and perform better when primary biometric parameters are used. Training GANs is another issue, as it is time-consuming, and the generated samples often lack diversity, which leads to limited improvement or even performance degradation. The existing literature attributes this weakness to GANs' focus on pose variations and camera-style adaptation, which hinders them from modeling other important aspects, including viewpoint and background changes, and leaves the generated samples short of diversity. Self-supervised learning (SSL) was introduced to address this issue, but SSL trains and tests basic CNN architectures on huge amounts of data, and such methods resort to semi-supervised techniques to learn more discriminative features and to generalize [36].

Before the extensive use of deep learning, metric learning was popular [37]; its role has since been taken over by loss function design. Several loss functions are widely used, such as identity loss [15], verification loss [38], triplet loss [39], and online instance matching loss [40], depending on the data and the desired result. Along with the loss functions, training strategies such as batch sampling and identity sampling are used to address data imbalance [41].

In the testing phase, the retrieval performance can be improved using ranking optimization [42]. It can be performed using automatic gallery-to-gallery similarity mining, query-adaptive or human-interaction-based re-ranking, and rank or metric fusion. The common and popular evaluation metrics are the cumulative matching characteristics (CMC) and mean average precision (mAP). Recently, the mean inverse negative penalty (mINP) has also been used for smaller datasets; it avoids the domination of easy matches in mAP and CMC evaluations [6].

Most prior work addresses person re-identification from the face, alone or together with other attributes. In our setting (criminology), however, faces are covered most of the time. Table 1 summarizes works that use non-face data for the person re-identification problem, listing the features used in the literature along with the corresponding methodologies, datasets, and evaluation metrics. It is observed that no publicly available dataset on arm's androgenic hair patterns exists, and that androgenic hair pattern-based person re-identification has not used state-of-the-art deep learning architectures. This paper proposes a deep learning architecture based on Siamese networks and presents a new database of arm images with androgenic hair patterns.

Created dataset

Subjects, consent and image data

The hand images for the database were collected with a Nikon D5300 DSLR camera with a maximum resolution of 6000 \(\times \) 4000 pixels. Indians commonly have dense androgenic hair on the arms, and hence images were taken from at least three different angles to cover the entire hand. No strict posing guidelines were imposed; subjects were free to vary viewpoint, pose, and illumination. We ensured a clean background so that the hand is clearly visible. All other image variants were obtained using data augmentation techniques.

The subjects were of different ages, sexes, races, and cultures. A total of 50 subjects were considered in this study. Consent was obtained from each subject that their data would be used for research purposes only. For every subject, we took images of both the left and the right hand, with at least three different images per hand so that the collected images contain different image variants and cover the entire hand. The distance between the subject and the DSLR was approximately 1.2 m. Figures 1 and 2 show raw images collected from a subject for the right hand and left hand, respectively.

The collected images also contain skin marks, scars, and other skin features. The camera has a 23.5 mm \(\times \) 15.6 mm RGB CMOS (red, green, blue complementary metal oxide semiconductor) sensor with a 1.5\(\times \) FOV (field of view) crop, and a focal length of 55 mm was used. We adjusted the focal length, sensor pixel size, and resolution so that the focus in the collected images is more on the arm hair than on skin marks, scars, or other skin features. A minimum of three images per hand was obtained for all 50 subjects, but in some cases, we took more than three images per hand to cover all the androgenic hair patterns; these additional images were taken when the hair patterns were too dense or too sparse and for hand parts with tattoos or skin marks. Therefore, instead of 300 images (50 subjects \(\times \) 2 hands \(\times \) 3 images), we obtained 383 images at the end of this step. Though we adjusted the camera settings to avoid skin marks and tattoos, in a few cases manual cropping was also performed so that only the hand is visible.

Fig. 1 Sample right hand raw image of the created database

Fig. 2 Sample left hand raw image of the created database

The collected high-resolution images were reduced to lower resolutions of around 244 \(\times \) 244 depending on the preprocessing steps and the deep learning architecture used. Hence, this study's image resolutions range between 12.5 and 40 dpi (dots per inch). After reducing the resolution, we observed that the quality of some cropped images deteriorated drastically. To address this, we divided such images into two or more parts (with the same person ID), increasing the total number of images from 383 to 424. None of the resulting 424 images contains tattoos or external markings on the hands.

The naming convention for the collected images is shown in Fig. 3. The first three digits identify the subject and are therefore the unique part of each image name. The next two digits are either 00 or 11, representing the right and left hand, respectively. The last two digits are sequence numbers for the images taken of each hand. This naming convention makes training and validation smooth when using functions like data generators in deep learning frameworks.
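To make the convention concrete, here is a minimal Python helper; the seven-digit basename format is inferred from the sample name {0020003} used later in the paper and is therefore an assumption:

```python
def parse_image_name(filename: str) -> dict:
    """Parse the 7-digit naming convention, e.g. '0020003' ->
    subject 002, right hand (00), sequence 03."""
    stem = filename.split(".")[0]               # drop any file extension
    assert len(stem) == 7 and stem.isdigit(), "expected a 7-digit basename"
    return {
        "subject": stem[:3],                     # unique subject identifier
        "hand": "right" if stem[3:5] == "00" else "left",  # 00 = right, 11 = left
        "sequence": int(stem[5:7]),              # per-hand image sequence number
    }

print(parse_image_name("0020003"))
# {'subject': '002', 'hand': 'right', 'sequence': 3}
```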

Fig. 3 Naming convention used in the created database

Preprocessing

The database of criminals in forensic analysis is generally created in controlled environments. The created database likewise contains images from a controlled environment, but crime scene data come from uncontrolled situations and therefore vary in angle, resolution, illumination, and so on. To make both the database and the deep learning architecture more robust, we used data augmentation techniques and preprocessing.

Rotation range, height shift range, width shift range, zoom range, fill mode, horizontal flip, channel shift range, and ZCA whitening are the eight data augmentation techniques used in this study, with values/parameters of 40, 0.2, 0.2, 0.2, nearest, true, 20, and true, respectively. These techniques follow the standard literature [13, 14, 36, 44]. We used all the data augmentation techniques given in the TensorFlow documentation except color space transformations, which alter the hand's skin tone, one of the unique features, and are therefore not recommended for person re-ID in the existing literature. Regarding the augmentation values, we used the standard values from the literature and cross-verified them manually in an empirical study; the standard values also perform best for person re-ID.
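The eight settings map directly onto Keras' ImageDataGenerator; a minimal configuration sketch follows, where the generator class is from the TensorFlow/Keras API and only the listed parameter values come from this study:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The eight augmentation settings listed above.
datagen = ImageDataGenerator(
    rotation_range=40,        # degrees
    height_shift_range=0.2,   # fraction of total height
    width_shift_range=0.2,    # fraction of total width
    zoom_range=0.2,
    fill_mode="nearest",      # how newly created pixels are filled
    horizontal_flip=True,
    channel_shift_range=20,
    zca_whitening=True,       # note: requires datagen.fit(sample_images) first
)
```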

The proposed architecture takes both the color image and the thresholded image as input. The following steps were used to convert the color image to the thresholded image (a code sketch follows the list).

  • Step 1—Grayscale image: The input color image is first converted to a grayscale image. The Sobel operator is then used to smoothen the grayscale image.

  • Step 2—Black-hat transform: This morphological operation extracts small elements and details from an image; the hairs are highlighted as white objects on a dark background, as shown in Fig. 4. The settings used in this study, namely anchor, iterations, borderType, and borderValue, were set to Point\((-1,-1)\), 1, BORDER_CONSTANT, and morphologyDefaultBorderValue(), respectively.

  • Step 3—Binary thresholding: This produces the thresholded image, where a pixel is set to 255 if its value exceeds the threshold and to zero otherwise.
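A compact OpenCV (Python) sketch of these three steps is given below; the Sobel kernel size, structuring-element size, and threshold value are illustrative assumptions, while the black-hat settings mirror those listed in Step 2:

```python
import cv2

def color_to_thresholded(path: str, thresh: int = 20, se_size: int = 13):
    """Steps 1-3: grayscale + Sobel, black-hat transform, binary threshold.
    `thresh` and `se_size` are illustrative values, not taken from the paper."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)

    # Step 1: Sobel operator on the grayscale image (gradient magnitude).
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    sobel = cv2.convertScaleAbs(cv2.magnitude(gx, gy))

    # Step 2: black-hat transform highlights the dark hairs on brighter skin
    # (they appear white on a dark background, as in Fig. 4).
    se = cv2.getStructuringElement(cv2.MORPH_RECT, (se_size, se_size))
    blackhat = cv2.morphologyEx(sobel, cv2.MORPH_BLACKHAT, se,
                                anchor=(-1, -1), iterations=1,
                                borderType=cv2.BORDER_CONSTANT)

    # Step 3: binary thresholding (255 above the threshold, 0 otherwise).
    _, binary = cv2.threshold(blackhat, thresh, 255, cv2.THRESH_BINARY)
    return binary
```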

Fig. 4 Sample image snapshot for results of preprocessing steps

Figure 4 shows a sample snapshot of all the preprocessing steps. The output images in Fig. 4 correspond to a portion of a single input color image. After preprocessing, each output image is stored under the same name as the input color image in a separate folder. Figure 5 shows a sample thresholded image of subject 002.

Fig. 5 Sample thresholded image {subject : 0020003} after the preprocessing

The input image given to the preprocessing step is manually cropped to the hand region. Although the complete picture is not sent for preprocessing, to illustrate which parts were cropped and removed for the image shown in Fig. 5, the uncropped color image of the same subject was also passed through the pipeline; the output is shown in Fig. 6, with the parts not used for computation marked as cropped and unused. Only the part containing the arm's hair (the middle part of the image) is used for computation, as also shown in Fig. 5.

Fig. 6 Sample thresholded image {subject : 0020003} after the preprocessing for an uncropped color image as input

After preprocessing, we obtained another set of 424 thresholded images. We then applied the eight data augmentation techniques to the 424 color (actual) images and the 424 thresholded images, yielding a total of 6784 images ((424 color \(\times \) 8) + (424 thresholded \(\times \) 8)). Manual verification was performed by two different human observers to avoid bias (mainly in cropping and discarding unrelated areas, judging the similarity of two images after augmentation or thresholding, and discarding images distorted by augmentation or by lowering the resolution). We measured inter-rater reliability using Cohen's kappa and found that the two observers agree with \(\kappa = 0.96\) (this calculation covers all steps in which the human observers were involved).
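As a sketch, inter-rater agreement over such binary keep/discard decisions can be computed with scikit-learn; the labels below are hypothetical:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical keep(1)/discard(0) decisions by the two observers on the same images.
observer_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
observer_b = [1, 1, 0, 1, 1, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(observer_a, observer_b)
print(f"Cohen's kappa: {kappa:.2f}")   # values near 1 indicate strong agreement
```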

Table 2 Created database details

After data augmentation and thresholding, all the images were cross-verified manually again. Images without meaningful hair content were discarded; for example, some crops produced during augmentation cover the part of the hand close to the wrist, which has little hair. After discarding such images, 6500 images remained (284 images were discarded in this step). The complete details of the created database are given in Table 2.

Fig. 7 Complete flow of the proposed work

Proposed methodology

Person identification using visual features can be modeled as a similarity learning problem. Siamese architectures are extensively used in deep CNN models for similarity learning, since they require fewer parameters to be trained whenever a new entry is added to the database for person identification. From the literature, two types of input image perform well for person identification from arm's hair [8]: the thresholded image and the color image, and both are used in our proposed architecture.

Figure 7 shows the complete methodology of the proposed work. The proposed color threshold (CT)-twofold Siamese network is composed of two different CNN-based networks. The notations used in Fig. 7 are \(c\), \(c^t\), and X, which represent the color image, the thresholded image, and the search region, respectively. The size of \(c^t\) and X is \(W_t \times H_t \times 3\). X is a collection of image patches of the same dimension as c; the target has size \(W_s \times H_s \times 3\), where \(H_s<H_t\) and \(W_s<W_t\), and is located at the centre of \(c^t\). The C-Net and T-Net are not combined until testing time, similar to [46].

T-Net: The network that takes the thresholded images as input clones its architecture from the SiamFC (Siamese fully convolutional) network [47]. Its convolutional network extracts features from the thresholded image (denoted by \(f_{a}(.)\)); we call it the T-Net. The following equation gives the appearance branch response map, where corr(.) is the correlation operation:

$$\begin{aligned} h_{a}(c,X)=\mathrm{corr}\left( f_{a}(c),f_{a}(X)\right) . \end{aligned}$$
(1)

All the parameters of the T-Net are trained from scratch for similarity learning. The following logistic loss function is minimized to optimize the T-Net, where \(Y_i\) is the ground-truth response map for the ith training pair, \(\theta _a\) the parameters of the T-Net, and N the number of training samples:

$$\begin{aligned} \mathrm{arg} \,\mathrm{min}_{\theta _{a}}\frac{1}{N}\sum _{i=1}^{N}{L\left( h_{a}(c_{i},X_{i},\theta _{a}),Y_{i}\right) } \end{aligned}$$
(2)

C-Net: The second network takes color images as its input (the C-Net). Here, the Inception v3 architecture is used as the pre-trained network, and its parameters are updated only in the last two convolutional layers (all other layers are frozen). Low-level features are not extracted from the pre-trained network, as the layers provide different levels of abstraction. Each convolutional layer's features have a different spatial resolution and need to be concatenated (represented by f(.)). After feature extraction, a \(1\times 1\) ConvNet is used as a fusion module to make these features suitable for correlation; this fusion is performed within the features of the same layer. \(g(f_t(X))\) gives the feature vector for the search region after fusion.
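A minimal Keras sketch of such a backbone is shown below; the input size, the "last two layers" freezing cutoff, and the fused channel count are illustrative assumptions, not the authors' exact configuration:

```python
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3

def build_cnet(input_shape=(224, 224, 3), fused_channels=256):
    """Pre-trained backbone with all but the last two layers frozen,
    plus a 1x1 convolution as the fusion module."""
    backbone = InceptionV3(include_top=False, weights="imagenet",
                           input_shape=input_shape)
    for layer in backbone.layers[:-2]:   # freeze everything except the last two layers
        layer.trainable = False
    # 1x1 ConvNet fusion module: makes the extracted features suitable for correlation.
    fused = tf.keras.layers.Conv2D(fused_channels, kernel_size=1,
                                   name="fusion_1x1")(backbone.output)
    return tf.keras.Model(backbone.input, fused, name="c_net")
```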

In target processing, \(c^t\) is taken as the target input by the C-Net. This target input contains the contextual features, denoted by t. The features obtained from this module are high-level features that are robust to changes in the object; hence, they are more generalized and less discriminative. Channel attention modules are introduced to enhance the discriminative power of the architecture. The attention modules use \(c^t\) as the feature map instead of t, giving importance to the surrounding context along with the target. Channel-wise operations are used in the attention module, and the attention process for the \(i\mathrm{th}\) channel is shown in Fig. 8.

Several operations are performed; for example, a conv5 feature map has a \(22 \times 22\) spatial dimension and is divided into a \(3 \times 3\) grid, with the central grid cell covering the target at dimension \(6 \times 6\). Max pooling is performed within each grid cell, and a coefficient is then produced by a two-layer multi-layer perceptron whose weights are shared across the channels of the same convolutional layer. The final output \(w_i\) is obtained using a sigmoid function with a bias. A single crop operation on \(f_t(c^t)\) yields \(f_t(c)\). The output of the attention module is the channel weights \(w_i\), and the input is \(f_t(c^t)\). The following equation gives the response map, where w has the same dimension as \(f_t(c)\) and \(\cdot \) denotes the element-wise operation. Here, only the channel attention module and the fusion module are trained:

$$\begin{aligned} h_{t}\left( c^{t},X\right) =\mathrm{corr}\left( g \left( w \cdot f_{t}(c)\right) ,g\left( f_{t}(X)\right) \right) . \end{aligned}$$
(3)

The logistic loss function (Eq. 4) is minimized to optimize the response map. The training pairs are \(((c^t)_i, X_i)\) with ground-truth response maps \(Y_i\):

$$\begin{aligned} \mathrm{arg} \, \mathrm{min}_{\theta _{t}}\frac{1}{N}\sum _{i=1}^{N}{L\left( h_{t}\left( c_{i}^{t},X_{i},\theta _{t}\right) ,Y_{i}\right) }, \end{aligned}$$
(4)
Fig. 8 Attention process for \(i\mathrm{th}\) channel in channel-wise operations

where N denotes the number of training samples and \(\theta _t\) denotes the trainable parameters. A weighted average of heatmaps (Eq. 5) gives the overall heatmap of the two branches at test time. Here, \(\lambda \) is a weighting parameter estimated on the validation set. The best-matched location in re-ID is the one with the largest value of \(h(c^{t},X)\):

$$\begin{aligned} h\left( c^{t},X\right) =\lambda h_{a}(c,X) + (1- \lambda )h_{t}\left( c^{t},X\right) . \end{aligned}$$
(5)

A VGGNet (visual geometry group network)-like architecture is used as the base network for both the T-Net and the C-Net. As mentioned earlier, the T-Net is a replica of the SiamFC network, and the C-Net is loaded from a VGGNet pre-trained on ImageNet. The C-Net strides are adjusted so that the last layers of the C-Net and T-Net have the same dimension. To avoid channels being suppressed to zero in the attention module, a nine-dimensional vector is used for the pooled features of each layer. Therefore, the MLP (multi-layer perceptron) layers have nine neurons with a ReLU (rectified linear unit) non-linearity, followed by a sigmoid function with a bias of 0.5.
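The channel attention just described can be sketched in NumPy as follows; the MLP weight shapes follow the nine-neuron hidden layer above, while the learned weight values would, of course, come from training:

```python
import numpy as np

def channel_attention(feat, w1, b1, w2, b2, grid=3, bias=0.5):
    """feat: (H, W, C) target feature map (e.g. a 22 x 22 conv5 map).
    w1: (9, 9), b1: (9,), w2: (9, 1), b2: (1,) -- the shared two-layer MLP.
    Returns one weight per channel in the range (bias, 1 + bias)."""
    H, W, C = feat.shape
    hs = np.linspace(0, H, grid + 1, dtype=int)
    ws = np.linspace(0, W, grid + 1, dtype=int)
    # Max pooling within each cell of the 3 x 3 grid -> 9 values per channel.
    pooled = np.stack([
        feat[hs[i]:hs[i + 1], ws[j]:ws[j + 1], :].max(axis=(0, 1))
        for i in range(grid) for j in range(grid)
    ])                                            # shape (9, C)
    hidden = np.maximum(pooled.T @ w1 + b1, 0.0)  # shared MLP, ReLU hidden layer
    logits = hidden @ w2 + b2                     # shape (C, 1)
    # Sigmoid plus a 0.5 bias, so no channel is suppressed to zero.
    return (1.0 / (1.0 + np.exp(-logits)) + bias).ravel()
```

The resulting weights \(w_i\) rescale each channel of \(f_t(c)\) before the correlation in Eq. 3.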

Results and analysis

The standard metrics used in person re-identification are the cumulative matching characteristics (CMC) and mAP (mean average precision). These are generally used in biometric systems operating on closed-set identification tasks: the test images (templates) are compared with the annotated images in the database (biometric subjects) and ranked by similarity, and the CMC relates rank to identification rate. If each test identity has only one gallery instance (single-gallery-shot), then for every query the algorithm ranks the gallery samples, and the CMC top-k accuracy is the step function given in the following equation:

$$\begin{aligned} \mathrm{Acc}_k = \left\{ \begin{array}{ll} 1 &{}\quad \hbox {if the top-}k \hbox { ranked gallery samples} \\ &{}\quad \hbox {contain the query identity} \\ 0 &{}\quad \hbox {otherwise} \end{array}\right. \end{aligned}$$
(6)
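A direct implementation of this step function and its average over queries (the CMC curve) might look like the following sketch:

```python
import numpy as np

def cmc_top_k(ranked_gallery_ids, query_id, k):
    """Acc_k (Eq. 6): 1 if the top-k ranked gallery samples contain
    the query identity, 0 otherwise."""
    return int(query_id in ranked_gallery_ids[:k])

def cmc_curve(rankings, query_ids, max_rank=30):
    """Average Acc_k over all queries for k = 1..max_rank."""
    return np.array([
        np.mean([cmc_top_k(r, q, k) for r, q in zip(rankings, query_ids)])
        for k in range(1, max_rank + 1)
    ])
```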

Due to data augmentation and the use of multiple images of both hands of the same person, multiple instances of the same person can appear among the test samples (multi-gallery-shot setting). To evaluate this case with a better-suited metric, we also use mINP (mean inverse negative penalty) to check the model's performance on the created database. The negative penalty measures the penalty of the hardest correct match, as shown in the following equation, where \(|Q_j|\) is the total number of correct matches for query j and \(H_j^\mathrm{hard}\) is the rank position of the hardest match:

$$\begin{aligned} \mathrm{NP}_{j} = \frac{H^{\mathrm{hard}}_{j} - |Q_{j}|}{H^{\mathrm{hard}}_{j}}. \end{aligned}$$
(7)

The INP (inverse negative penalty) is the inverse of the NP, and we use mINP as shown in Eq. 8. CMC and mAP evaluations are dominated by easy matches, which mINP avoids. mINP is less informative for larger datasets, but since our dataset contains only 50 subjects, it serves well as a supplementary metric alongside the widely used CMC and mAP:

$$\begin{aligned} \mathrm{mINP} = \frac{1}{n} \sum _{j}^{}\left( 1-\mathrm{NP}_{j}\right) = \frac{1}{n} \sum _{j}^{}\frac{|Q_{j}|}{H^\mathrm{hard}_{j}} \end{aligned}$$
(8)
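Under the same conventions, NP and mINP can be sketched as:

```python
import numpy as np

def mean_inverse_negative_penalty(rankings, query_ids):
    """mINP (Eq. 8): average of |Q_j| / H_j^hard over all queries, where
    |Q_j| is the number of correct matches for query j and H_j^hard is the
    rank position of the hardest (last-found) correct match."""
    inps = []
    for ranked, q in zip(rankings, query_ids):
        hit_ranks = [r for r, gid in enumerate(ranked, start=1) if gid == q]
        if hit_ranks:                         # skip queries with no gallery match
            inps.append(len(hit_ranks) / hit_ranks[-1])   # |Q_j| / H_j^hard
    return float(np.mean(inps))
```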

Implementation details: A small weight \(\lambda \) combines the two branches and is tuned on the validation set; the grid search was performed from 0.1 to 0.9 with step 0.2, and \(\lambda = 0.3\) performed best. The attention module has one hidden layer with a 9-dimensional vector and ReLU as its non-linearity. From the empirical study, the proposed model performed best with a learning rate of 0.01. The average training speed of the CTTSN was 52 frames per second (fps). Three scales are searched to handle scale variations during evaluation and testing. Weight decay and momentum were empirically set to 0.0005 and 0.9, respectively.
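The fusion of Eq. 5 and the validation-set grid search for \(\lambda \) can be sketched as follows; the scoring function is a user-supplied stand-in, not part of the paper:

```python
import numpy as np

def fuse_heatmaps(h_a, h_t, lam):
    """Weighted average of the two branch heatmaps (Eq. 5)."""
    return lam * h_a + (1.0 - lam) * h_t

def grid_search_lambda(validation_score, grid=np.arange(0.1, 1.0, 0.2)):
    """Pick lambda in {0.1, 0.3, 0.5, 0.7, 0.9} maximizing a validation metric."""
    scores = {round(float(lam), 1): validation_score(lam) for lam in grid}
    return max(scores, key=scores.get), scores

# Toy usage with a stand-in metric that peaks at 0.3 (as reported above):
best_lam, _ = grid_search_lambda(lambda lam: -abs(lam - 0.3))
print(best_lam)   # 0.3
```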

Table 3 Comparison of created dataset results

Table 3 compares the CMC, mAP, and mINP results of the proposed and other methods. The proposed CT-twofold Siamese network (CTTSN) uses a VGG network pre-trained on ImageNet. We also swapped in other popular backbones such as Inception v4, ResNet, AlexNet, and Xception (all four likewise trained on ImageNet) and observed comparatively lower results, since the ImageNet weights are not a significant contributor compared to the other features in the data. SiamFC [47], the base Siamese architecture, and the DSiam (dynamic Siamese network) architecture for visual object tracking [48] were also tested and compared with the other architectures.

Fig. 9 Cumulative match curves comparison for the created database

Figure 9 shows the Rank-1 results of the proposed architecture compared with the Siamese network and modified versions of our network: CTTSN (color) contains only the C-Net and CTTSN (threshold) only the T-Net. Figure 9 shows that the proposed architecture with both C-Net and T-Net performs better (up to rank 30), underscoring the complementary features needed to strengthen the proposed model's performance.

We also used data augmentation and included those images during training. Figure 10 compares the results with and without data augmentation; it is evident that data augmentation increases performance, in line with the existing literature [44].

Fig. 10 Cumulative match curves comparison with and without data augmentation

The dataset contains thresholded, data-augmented, and color images. When these image types are provided separately in the testing phase, the performance of the proposed model is as shown in Figs. 11, 12, and 13 for color, thresholded, and augmented images, respectively.

Fig. 11 Cumulative match curves comparison for color images as input test image

Color images (Fig. 11) yield higher accuracy because the proposed architecture can also exploit other features such as hair color. The data-augmented images include different resolutions and cropped images, some of which cover parts of the hand with very little hair; hence they perform worse than color or thresholded images. We observed a similar drop when comparing male and female subjects: mAP and mINP dipped by approximately 7% for females, whose hair is sparser on some parts of the hand.

Fig. 12 Cumulative match curves comparison for thresholded images as input test image

Fig. 13 Cumulative match curves comparison for data augmented images as input test image

The results were also compared across input image resolutions, and the performance of the proposed method decreases with decreasing input image size, as shown in Fig. 14. The image sizes used for the comparison follow the standard sizes used in the literature [10] for comparison in the criminology department. The curves reflect the importance of clearly visible hair features, as performance degrades with decreasing image resolution.

Fig. 14 Cumulative match curves comparison for different input image resolutions

Grad-CAM (gradient-weighted class activation mapping): We used Grad-CAM class activation visualization following the Keras documentation. Grad-CAM depicts, via heat maps, the discriminative features responsible for person identification; it uses the gradient information flowing into the last convolutional layer of the proposed architecture to understand each neuron's contribution to a decision of interest [49]. A sample image from the database {subject: 0020003} is shown in Fig. 15, where the features from the highlighted part are responsible for person re-identification {values related to \(h(c^{t},X)\)}. In addition, Fig. 16 shows the heat map, with the part of the hand with the highest probability of person re-identification shown in yellow. From Figs. 15 and 16, it is evident that the proposed method effectively identifies discriminative features to re-identify the person and generalizes well in choosing the region of interest.

Fig. 15 Sample image after applying Grad-CAM

Fig. 16 Heatmap of the sample image after applying Grad-CAM

Apart from the detailed results shown in the graphs, we also performed ablation analyses of the proposed method on the created dataset.

  • Both T-Net and C-Net were trained with random initialization, which yielded an mAP of 81.5, indicating the need for a good initialization.

  • Using only color images in both the T-Net and C-Net did not give better performance (mAP 79.11), and similarly for thresholded images only (mAP 83.17). This confirms the importance of the complementary features in the proposed method.

  • Removing the channel attention module from the C-Net caused a drastic drop in performance (mAP 76.11), showing the importance of balancing the intra-layer and inter-layer channels.

  • In the proposed method, the two branches are trained separately. Training both network branches jointly instead yielded an mAP of 83.67 versus 84.21 for separate training, suggesting the importance of optimizing the multilevel features independently.

  • We inspected the channel weights for hand images of two different subjects. Since a sigmoid function with a bias of 0.5 is used, the weight distribution ranges from 0.5 to 1.5. We observed a different weight distribution in the conv4 and conv5 layers for each image, indicating the importance of channel weights in the proposed method.

From the results, the created database of 50 subjects performs best when both the C-Net and T-Net are trained with color, thresholded, and data-augmented images. Performance is highest when the test input consists of color images, and it is also better for male subjects and for high-resolution inputs. Thresholded images contain only hair patterns; since person re-identification from hair patterns is a soft biometric problem, the results with only thresholded images (Fig. 12) suggest the huge potential of arm's hair patterns for person re-identification from digital images.

Limitations

The created database contains Asian subjects only. The performance of the proposed model is insufficient when the input image comes from lower-resolution cameras (such as some CCTV cameras). Furthermore, when color images are provided during training, the deep learning architecture by default uses all the features present in the image, making it difficult to isolate the impact of the hair patterns alone.

Conclusion

In criminology, the data available for identifying a criminal are mostly collected in uncontrolled situations. Perpetrators generally wear a mask during the crime, and other body parts may not be as clearly visible as the hands. This paper presented a database of 6500 images of persons' hands using androgenic hair as a soft biometric parameter for the person re-identification problem. We proposed a CT-twofold Siamese network and analyzed its performance on the created database. The results show the potential to recognize a person from arm's androgenic hair patterns: the proposed model achieves a Rank-1 cumulative match of 93.5 on the created database. The proposed methodology primarily targets forensic psychiatric hospitals, where the subjects are generally non-cooperative. This class of person re-identification problems falls under closed-world re-ID and should work unobtrusively in real time with low training and testing cost; the method should train quickly and perform well whenever new data are obtained. The proposed CTTSN meets these requirements for closed-world person re-identification using androgenic hair patterns (a soft biometric). The training data are made robust through augmentation, and the methodology is both discriminative and generalized. The proposed method runs in real time at 52 fps on test images.

Future directions include using other intra-image modalities, such as skin color, skin marks, and veins, together with hair patterns to identify a person. We also plan to collect more diverse data to test the proposed model and improve its robustness. The architecture should further be tested in settings such as forensic psychiatric hospitals with different CCTV locations and subjects to check its robustness and identify improvements. Hospitals generally have CCTV installed; instead of barcoding, radio-frequency identification (RFID), or biometrics for tracking and identifying patients during rehabilitation, the proposed technique can be a cost-effective and unobtrusive alternative to the existing ones.