1 Introduction

Biometrics, a discipline dedicated to studying distinctive physical and behavioral traits that are difficult to imitate, encompasses many data types, including fingerprints, iris patterns, facial features, voice, gait, signature, EEG/ECG, palm, and more. These biometric data play a pivotal role in verification and person recognition across many facets of daily life and in academic research [1,2,3,4]. The global shift to remote activities during the pandemic prompted the development of diverse approaches for acquiring biometric data [5]. These approaches leverage hardware such as smart devices, 3D sensors, and stereo cameras, demonstrating the versatility of biometric applications in different domains. In recent studies, Mekruksavanich and Jitpattanakul [6] employed deep learning techniques using the accelerometers and gyroscope sensors of smart devices for biometric identification. Lewis, Lie, and Xie [7], in turn, utilized the dynamic time warping (DTW) method. Andersson and Araujo [8] explored biometric identification using the Kinect sensor to capture body movements, while Chiu, Hsieh, et al. [9] tracked handwriting in the air using a 3D sensor. Using these technologies, a wide range of person identification and verification applications employing diverse biometric characteristics have been developed. Despite this body of research, the required hardware can be difficult to obtain and is often costly. One motivation for this study is therefore to acquire biometric data with a camera, common and easy-to-use hardware for online activities. Our aim is to develop a model that demonstrates the usability of these biometric data for verification and recognition purposes.

One of the most widely used traits for biometric authentication is the signature. Signatures are a subset of behavioral biometrics and play a crucial role in the majority of official record verification processes. Consequently, numerous studies in the literature have examined the use of signature data for person verification and recognition. Khan et al. [10] built an architecture based on a Gaussian gated recurrent unit (GGRU) tailored for handwritten signature biometrics. Focusing on potential vulnerabilities in signature biometrics, Gonzalez-Garcia et al. [11] investigated various attack scenarios. Jain et al. [12] took a different approach by integrating geometrical features and neural networks for signature verification. There are also signature-based studies that employ traditional machine learning methods [13,14,15]. Signature biometrics can be static or dynamic. Static signature biometrics uses only the final image or signal of the signature, whereas dynamic signature biometrics also incorporates the time variable during data acquisition. Time-dependent data captured while signing, such as pen sound, pressure, and angle, have likewise been used for biometric authentication [16,17,18]. Meanwhile, owing to improvements in object tracking, the number of studies using in-air signatures has increased in recent years. Bailador et al. [19] collected in-air signatures with mobile phone sensors, Guerra Segura et al. [20] utilized a 3D sensor to acquire data for traditional machine learning methods, and Malik et al. [21] captured in-air signatures with a depth camera to build a deep learning model. Acquiring in-air signatures offers a significant advantage for remote settings: individuals can sign without being physically present at the location where their signature is needed. It also reduces cost, since capturing an in-air signature requires no additional sensors beyond a camera.

A single biometric method is often effective; nevertheless, various studies have shown that combining multiple systems produces better results [14, 22,23,24]. Multiple biometrics can involve the use of several sensors together (multi-sensor) [25], the use of different instances of the same biometric characteristic (multi-instance) [17], or the use of distinct biometric features together (multimodal) [14]. This paper proposes a multimodal solution for online security using facial and dynamic signature biometric data. Facial biometrics have been used for a long time and appear in a variety of fields, including health, business, security, and entertainment. Because of this prevalence in everyday life, they have been employed in a large body of related research [26,27,28]. Similarly, various studies utilizing signature biometrics have been carried out, and a significant number of related articles have been published in the literature, particularly in the past ten years. These include studies using paper signatures [29, 30], air signatures [20, 21], and time-varying physical conditions during signing [16,17,18, 31]. Moreover, there have also been studies in which these biometric techniques have been utilized in tandem [22].

The aim of this study is to develop a robust deep learning model for online activities, with particular attention to the vulnerabilities associated with facial and in-air signature biometrics. This is especially significant given that modern technology can spoof many biometric traits. To achieve this aim, we used a low-cost laptop camera to collect 70 instances of both facial and in-air signature data from each of the 25 participants, recording the coordinates and timestamps of the signature data during acquisition. Following a careful preprocessing stage, a total of 1750 pairs of facial images, in-air signature images, and corresponding in-air signature signals were used for training CNN, LSTM, GRU, and TCN networks, achieving an accuracy rate above 98%.

2 Methods and materials

Deep learning models are a cornerstone of contemporary machine learning, demonstrating remarkable effectiveness across various research domains. Notably, biometric data consolidation, categorization, and authentication have made significant strides by leveraging deep learning methodologies. This study examines the integration of diverse biometric features, encompassing facial and dynamic signature traits, and shows that their combined use can significantly enhance identification success rates. State-of-the-art deep learning architectures, namely CNN, LSTM, GRU, and TCN, were selected for their established success in handling the complex patterns and sequences inherent in biometric data. CNNs excel in image-related tasks, LSTMs and GRUs are adept at capturing temporal dependencies in sequential data, and TCNs offer an effective approach for modeling long-range dependencies in time series. This diverse set of architectures was chosen so that their individual strengths could jointly address the complex nature of biometric information, which led to a significant improvement in the overall performance of the system.

One of the most crucial points of this study is that practically every phone or computer has a basic camera that may be used to capture data. In this work, the data were acquired using a modest camera with two megapixels and a resolution of 1920 × 1080. The computational framework included an i7-11800H 2.30 GHz processor and 16 GB RAM.

2.1 Data collection and description

The perception of hand shape, movement, and hand joint points has become vital to the user experience in various technical disciplines and platforms [1, 32,33,34,35]. For instance, augmented reality apps, which have become increasingly popular in recent years, enable digital material to be presented in conjunction with the real environment [36, 37]. Beyond that, the perception of hand movements is used in areas such as sign language recognition and digital content presentation [38,39,40].

In the expansive landscape of hand and finger detection, numerous libraries have been developed across diverse software languages and platforms. Noteworthy examples include OpenPose, HandPose in TensorFlow.js, and the Leap Motion hardware and software solution. In this study, which requires precise coordinates for hand joint points, the MediaPipe Hands (MPH) tool was chosen. This high-performance Python library stands out for its proficiency in hand and finger tracking, using machine learning to extract 21 joints and endpoints with high precision. The MPH therefore provides a comprehensive and accurate representation of one or more hands in three-dimensional space [41,42,43], and it works on both desktop and mobile platforms. In the proposed air-signature application, the finger locations and the writing action are used to build and store the signature image. A rectangular region is established for the signature, and an image is obtained inside this region based on finger motions and positions. The user is initially prompted to show her or his hand in front of the camera at the depth at which the signature will be made, to prevent depth from distorting the metric measurement. In addition to the image, the coordinates of the points traced by the fingertip and the corresponding time information are stored in a txt file, as sketched below.
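
As an illustration of this acquisition step, the following minimal sketch shows how the index fingertip position and a timestamp might be logged per frame with MediaPipe Hands. The capture loop, confidence thresholds, output file name, and column layout are illustrative assumptions, not the exact implementation used in this study.

```python
import csv
import time

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# Minimal sketch: follow the index fingertip and log (x, y, t) for every frame
# in which a hand is detected. Thresholds and the output file are assumptions.
cap = cv2.VideoCapture(0)
t0 = time.time()
with mp_hands.Hands(max_num_hands=1,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands, \
     open("signature_signal.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["x", "y", "t"])
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input, while OpenCV delivers BGR frames.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            tip = lm[mp_hands.HandLandmark.INDEX_FINGER_TIP]
            h, w, _ = frame.shape
            # Landmarks are normalized to [0, 1]; convert to pixel coordinates.
            writer.writerow([tip.x * w, tip.y * h, time.time() - t0])
        cv2.imshow("signature capture", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # Esc ends the recording
            break
cap.release()
cv2.destroyAllWindows()
```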

Similarly, the MediaPipe Face (MPF) library was employed to obtain facial data from the camera. MPF detects facial landmarks and facial expressions in photographs and videos. Machine learning (ML) models are used for this purpose, and they can process both a single image and a continuous stream of images. The process generates 3D facial landmarks, blendshape scores to infer detailed facial surfaces in real time, and transformation matrices to carry out the changes necessary for effect rendering. Detecting 468 key points on the face with this library allows the basic shape of the face to be established. The camera captures the face while the signature is being captured.

The dataset used in this study was collected at Yildiz Technical University's (YTU) Cyber Security and Biometric Research Centre under a contract with the General Directorate of Development Agencies and with the ethical permission of the YTU Ethical Committee. Volunteers were asked to sign in the air in front of a screen, and their biometric data were recorded. As shown in Table 1, there were 25 volunteers, 11 male and 14 female.

Table 1 Dataset characteristics and participant statistics

Participants were first asked to show their hands at the location where they would sign so that the signing depth could be assessed. Here, the hand dimensions were determined, and a key on the keyboard was pressed to initiate the biometric data collection process. Afterward, the rectangular area in which the signature would be placed was displayed on the screen, and the volunteers were asked to sign there. During signing, both the key facial landmarks were displayed and the person's signature was drawn on the screen in real time (Fig. 1). The participants were requested to write on the screen with the tip of the index finger while signing. However, a rule was defined whereby the drawing process is halted whenever the distance between the tips of the index and middle fingers falls below a threshold value, which prevents the hand from starting to draw as soon as it enters the rectangular signing area. The threshold is approximately the distance between the index finger DIP joint and the tip of the index finger (a sketch of this rule is given after Fig. 1). Signing in this manner makes it possible to add distinct signature elements such as dots and separate strokes. At the end of the procedure, a designated key was pressed to record the facial image, the signature image, and the signature coordinate-time data.

Fig. 1
figure 1

a Screenshot taken following the collection of biometric data. b The view of the signature before preprocessing. c Signature view after the removal of spaces. d The final 256 × 256 pixel signature version of the image created for the model. e Acquired coordinate-time raw signals for signature data. f Subtracting the lowest values for each signal to make the signals start from the origin. g Oversampling to ensure that each participant’s signals have the same length. h Intersecting the points where the face and mask match. i Final 256 × 256 facial image
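
The fingertip-distance rule described above can be expressed compactly. The sketch below is a hedged illustration, assuming MediaPipe hand landmarks as input and using the index DIP-to-tip distance as the per-frame threshold; the function name and the use of 2D distances are illustrative choices, not the exact implementation.

```python
import math

import mediapipe as mp

mp_hands = mp.solutions.hands


def _dist(a, b):
    # Planar Euclidean distance between two normalized MediaPipe landmarks.
    return math.hypot(a.x - b.x, a.y - b.y)


def pen_is_down(landmarks):
    """Return True while the stroke should be drawn.

    Drawing is active only when the index and middle fingertips are farther
    apart than (approximately) the index DIP-to-tip distance, so the hand can
    enter the signing area without immediately starting to draw.
    """
    index_tip = landmarks[mp_hands.HandLandmark.INDEX_FINGER_TIP]
    index_dip = landmarks[mp_hands.HandLandmark.INDEX_FINGER_DIP]
    middle_tip = landmarks[mp_hands.HandLandmark.MIDDLE_FINGER_TIP]
    threshold = _dist(index_tip, index_dip)  # adapts to hand depth per frame
    return _dist(index_tip, middle_tip) > threshold
```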

In this way, a total of 1750 samples per biometric trait (face image (.jpg), signature image (.jpg), and signature signal (.csv)) were obtained, comprising 70 signatures and 70 facial images from each of the 25 participants.

2.2 Preprocessing

After the data capture, 1750 signatures and facial images from 25 people were prepared for processing. In addition to the signature image, the coordinates and timestamps of the points traced by the index fingertip were saved to a file with a csv extension.

On the 2D signature images, the unnecessary regions around the signatures were removed based on the size of the rectangle visible on the screen, and the image size was fixed to 256 × 256. A rotation step that aligned the signatures with the horizontal plane was initially applied but later removed because it had no significant positive effect on performance. For each signature signal, to make the coordinates start from the origin, the minimum values of the x and y signals were subtracted from the x and y values of each point, and the time of the first point, t0, was subtracted from all time values. Additionally, all signals were oversampled so that their lengths were identical: the target length was set to twice the length of the longest signal, and all signals were extended to this length (Fig. 2). The signature in Fig. 2a, consisting of 30 points created in the air anywhere within the rectangular area on the screen, was shifted to the origin of the coordinate axes and then oversampled to 200 points by linear interpolation (Eq. 1), as seen in Fig. 2b. In this equation, for two points of the signal, if \(y(n)\) and \(y(n+1)\) are the starting and ending points, respectively, then \(\hat{y}(n+h)\) is the estimated value at a distance \(h\) from point \(n\), where \(h\) lies in the interval [0,1]. The "interp1d" linear interpolation function of the Python SciPy library was used for this procedure (a code sketch of these steps is given after Fig. 2). Since there is no noise in the signature signal, preprocessing steps such as noise reduction and skeleton removal were not applied.

$$\hat{y}(n+h) = (1-h)\,y(n) + h\,y(n+1)$$
(1)
Fig. 2
figure 2

a 2D image of a 30-point signature before preprocessing b 2D image of a 200-point interpolated signature after preprocessing
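
The origin shift and oversampling described above can be summarized as follows. This is a minimal sketch assuming the raw x, y, and time samples are available as arrays and that every channel is resampled to 200 points with SciPy's interp1d; the function name and the common normalized index axis are illustrative choices.

```python
import numpy as np
from scipy.interpolate import interp1d


def preprocess_signal(x, y, t, target_len=200):
    """Shift a signature signal to the origin and oversample it to target_len."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    # Start the coordinates from the origin and the time axis from zero.
    x, y, t = x - x.min(), y - y.min(), t - t[0]
    # Resample each channel on a common normalized index axis; Eq. (1) is the
    # linear interpolation rule applied between neighboring samples.
    src = np.linspace(0.0, 1.0, num=len(x))
    dst = np.linspace(0.0, 1.0, num=target_len)
    resample = lambda s: interp1d(src, s, kind="linear")(dst)
    return np.stack([resample(x), resample(y), resample(t)], axis=-1)  # (200, 3)
```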

While acquiring the facial images, the FACE_OVAL feature of the MPF library was used to crop only the facial portion of each image and mask the rest with a black background, as sketched below. Therefore, no further preprocessing was needed.
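
A hedged sketch of this cropping step is given below. It assumes MediaPipe's FaceMesh solution and its FACEMESH_FACE_OVAL connection set (the constant name may differ between library versions), approximates the oval with a convex hull, and uses OpenCV for masking; the 256 × 256 output size follows the text, while the helper name and masking details are illustrative.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh


def crop_face_oval(bgr_image, out_size=256):
    """Keep only the face oval on a black background and resize to out_size."""
    h, w = bgr_image.shape[:2]
    with mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as mesh:
        result = mesh.process(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None
    lm = result.multi_face_landmarks[0].landmark
    # FACEMESH_FACE_OVAL is a set of landmark-index pairs outlining the face.
    oval_idx = {i for pair in mp_face_mesh.FACEMESH_FACE_OVAL for i in pair}
    pts = np.array([[int(lm[i].x * w), int(lm[i].y * h)] for i in oval_idx],
                   dtype=np.int32)
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillConvexPoly(mask, cv2.convexHull(pts), 255)  # approximate oval region
    face_only = cv2.bitwise_and(bgr_image, bgr_image, mask=mask)
    x, y, bw, bh = cv2.boundingRect(pts)
    return cv2.resize(face_only[y:y + bh, x:x + bw], (out_size, out_size))
```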

2.3 Applied deep learning methods

2.3.1 Convolutional neural network (CNN)

Convolutional neural networks (CNNs) have been a game-changing innovation in image processing and computer vision [44]. CNNs use layers of linked nodes to extract complex visual patterns, simulating the human visual system. Their design includes pooling layers to downsample feature maps, activation functions to induce nonlinearities, and convolutional layers to apply filters to input images. With the use of gradient descent optimization, CNNs are trained via backpropagation. CNNs do away with the necessity for human feature engineering since they have the potential to automatically learn optimum filters. They have been used in many different fields, including video analysis, object identification, and medical diagnostics. CNNs have transformed the study of visual data by providing formerly unobtainable degrees of accuracy. In this study, facial images and 2D signature images were included in the network model using the CNN algorithm because of the CNN architecture’s strength with images.

2.3.2 Long short-term memory (LSTM)

A specific kind of recurrent neural network (RNN) called long short-term memory (LSTM) has been developed to recognize long-term dependencies in sequential inputs [45]. Based on the relevance of the information, LSTMs selectively store, discard, and output it using memory cells, input gates, forget gates, and output gates. LSTMs may perform very well in applications such as speech recognition, natural language processing, and time series analysis due to their distinctive design. By modifying the weights they carry using gradient descent optimization, LSTMs are trained to learn from labeled sequences via backpropagation through time (BPTT). LSTM networks have greatly improved sequential modeling and have been applied in many different fields, boosting the capacity of neural networks to comprehend and model sequential inputs. Since it consistently produces positive outcomes in signal data, LSTM was one of the neural network techniques utilized to include signature signals in the model in this study.

2.3.3 Gated recurrent unit (GRU)

The gated recurrent unit (GRU) is a type of recurrent neural network (RNN) notable for its streamlined construction and for performance that rivals LSTM [46]. By including memory update and reset gates, GRU networks mitigate the vanishing gradient problem and enable improved information flow and faster training. Because these gates are merged, GRUs have a more streamlined architecture than LSTM, which lowers the amount of computation needed. Tasks including machine translation, conversational systems, and anomaly detection have been effectively addressed with GRUs. Their efficiency in handling sequential data, shorter training times, and lower computational complexity make GRUs useful in a variety of applications requiring time series analysis and natural language processing. In some studies, the GRU algorithm has been reported to yield better results than LSTM; it was therefore also used in this study for comparison with the LSTM results.

2.3.4 Temporal convolutional networks (TCN)

Temporal convolutional networks (TCNs) have become effective architectures for sequence modeling problems [47]. TCNs use dilated convolutions to efficiently model dependencies across many time steps, and their parallelizable and scalable nature makes them well suited to processing long sequences with reduced computational complexity. Applications including voice synthesis, music production, and video processing have had success with TCNs. TCNs have transformed sequence modeling by capturing a broad variety of relationships with fewer parameters and computations than conventional convolutional structures, which makes them essential tools in fields involving time series analysis and sequential data processing (the dilated-convolution idea is illustrated below). This algorithm, which yields effective results on signals, was also evaluated when including the signature signals in the study's model.
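
To make the dilated-convolution idea concrete, the following is a generic Keras sketch of a causal Conv1D stack with exponentially growing dilation rates. It illustrates the TCN principle only; it is not the exact TCN configuration trained in Sect. 2.4, and the filter count, kernel size, and input shape are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def tiny_tcn_block(seq_len=200, channels=3, filters=64):
    """Generic TCN-style block: causal Conv1D layers with growing dilation."""
    inputs = keras.Input(shape=(seq_len, channels))
    x = inputs
    for dilation in (1, 2, 4, 8):
        # Each layer doubles the dilation rate, so the receptive field grows
        # exponentially while the parameter count stays small.
        x = layers.Conv1D(filters, kernel_size=3, padding="causal",
                          dilation_rate=dilation, activation="relu")(x)
    return keras.Model(inputs, x)
```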

2.4 Biometric authentication

Three types of data were gathered during the data collection: facial images, signature images, and signature signals. The images consist of RGB intensities, while the signals comprise x and y coordinates and timestamp data. After the preprocessing phase, the identification success of the dataset was evaluated separately for each trait. The CNN architecture, whose performance on images is well established in the literature, was employed to evaluate the facial image and signature image traits. To do this, all images were first resized to 256 × 256 × 3. Before building the model, the images were partitioned into 80% training images and 20% test images. Using a similar structure, the CNN model (Fig. 3) was constructed separately for facial images and signature images. This structure employs 2D convolution layers with 32, 64, and 128 filters, each followed by a 2 × 2 max-pooling layer. The model ends with a flatten layer and two dense layers: the first with 512 units and a ReLU activation function, and the second with 25 units (the number of classes, i.e., participants) and a softmax function. Outputs P1 to P25 in Fig. 3 represent the classes, i.e., the participants. Various layer configurations were also tried at this stage, and this structure yielded the highest accuracy values (a Keras sketch of this branch is given after Fig. 3). Training was initiated with 50 epochs, and an early stopping criterion with a patience of five epochs (training stops if performance does not improve for five consecutive epochs) was applied. With tenfold cross-validation, training completed between 35 and 42 epochs for facial images and between 36 and 45 epochs for signature images. While fitting the model, batch sizes of 16, 32, and 64 were explored. With a batch size of 16, the model's accuracy did not change significantly, but training time increased; with a batch size of 64, hardware-related issues were encountered that disrupted the training process. As a compromise, the batch size was set to 32 to ensure both satisfactory performance and efficient hardware utilization. In conjunction with batch size tuning, the Adam optimizer, a popular choice for deep learning tasks, was employed; it adapts the learning rate during training, providing an adaptive and efficient optimization strategy. The initial learning rate was set to 0.001, and the optimizer's default parameters were used.

Fig. 3
figure 3

The model created with the CNN architecture using only facial images. The same architecture was also employed for signature images. P1 to P25 represent the classes, i.e., the participants
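
Under the assumptions that the convolution kernels are 3 × 3, that labels are integer-encoded (hence sparse categorical cross-entropy), and that early stopping monitors a validation metric with a patience of five epochs, the image branch described above might look as follows in Keras. Filter counts, dense sizes, optimizer, learning rate, and batch size follow the text; the rest is an illustrative sketch.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_image_cnn(num_classes=25, input_shape=(256, 256, 3)):
    """CNN branch for 256 x 256 facial or signature images (sketch)."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Illustrative training call with the batch size and early stopping from the text:
# model.fit(x_train, y_train, epochs=50, batch_size=32, validation_split=0.1,
#           callbacks=[keras.callbacks.EarlyStopping(patience=5,
#                                                    restore_best_weights=True)])
```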

LSTM, GRU, and TCN architectures were individually employed to analyze the signature signals. Prior to analysis, all signals, including their x, y, and time channels, were resized to a length of 200, which is twice the length of the longest signal. The preprocessed and length-equalized signals were first split into 80% training samples and 20% test samples, as with the images. The LSTM architecture contained two LSTM layers with linear activation, with 128 and 32 units, respectively, and was completed with two dense layers, one with 512 units and a ReLU activation and one with 25 units and a softmax activation (Fig. 4; a sketch is given after the figure). The GRU architecture was implemented in the same way, with GRU layers in place of the LSTM layers. For the TCN structure, two 1D convolution layers with 256 and 64 filters were used, followed by 1D max pooling and two dense layers.

Fig. 4
figure 4

Representation of the LSTM architecture applied to the signature signal
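
Assuming integer-encoded labels and default Adam settings (both assumptions, as is the loss), the signal branch with the layer sizes stated above could be sketched as follows.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_signal_lstm(num_classes=25, seq_len=200, channels=3):
    """LSTM branch for the (200, 3) signature signals (sketch)."""
    model = keras.Sequential([
        keras.Input(shape=(seq_len, channels)),
        layers.LSTM(128, activation="linear", return_sequences=True),
        layers.LSTM(32, activation="linear"),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# The GRU variant swaps the LSTM layers for GRU layers of the same sizes; the
# TCN variant uses Conv1D layers with 256 and 64 filters followed by 1D max
# pooling and the same dense layers, as described in the text.
```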

2.4.1 Signature signal-signature image pair model

The construction of the first pair model began with the signature image and its corresponding signature signal. Twenty percent of the preprocessed coordinate-time signals and signature images were set aside for testing, and the training procedure was applied to the remaining 80% of the data. First, a CNN+LSTM model was constructed to process the signature images and signals. For the CNN branch applied to the signature images, convolution and max-pooling operations were executed in three separate hidden layer blocks: each convolution, with 32, 64, and 128 filters, respectively, was followed by a 2 × 2 pooling layer, and the resulting feature maps were flattened with a flatten layer. The signature signals, in turn, passed through a 128-unit LSTM layer and a flatten layer. The CNN and LSTM blocks were then merged using the concatenate function of the Python Keras library, which joins the flattened outputs of both branches into a single representation so that the features extracted by the two architectures are fused. To complete the model, the merged block was connected to two dense layers: the first with 512 units and a ReLU activation function, and the second with 25 units and a softmax activation function (Fig. 5; a sketch of this fusion model is given after the figure). Training was started with 50 epochs; the model was developed with tenfold cross-validation, and early stopping terminated training between 34 and 48 epochs to prevent overfitting.

Fig. 5
figure 5

CNN + LSTM / CNN + GRU / CNN + TCN models for image + signature signal pairs
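
A possible Keras functional-API sketch of the two-branch fusion in Fig. 5 is shown below, assuming 3 × 3 kernels, return_sequences=True before the flatten on the signal branch, and sparse categorical cross-entropy; these details, like the compile settings, are assumptions rather than reported choices.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_fusion_model(num_classes=25):
    """CNN branch for an image fused with an LSTM branch for a signal (sketch)."""
    img_in = keras.Input(shape=(256, 256, 3), name="image")
    x = img_in
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)

    sig_in = keras.Input(shape=(200, 3), name="signature_signal")
    s = layers.LSTM(128, return_sequences=True)(sig_in)  # GRU or Conv1D in the other variants
    s = layers.Flatten()(s)

    merged = layers.concatenate([x, s])  # fuse the two branches
    out = layers.Dense(512, activation="relu")(merged)
    out = layers.Dense(num_classes, activation="softmax")(out)

    model = keras.Model(inputs=[img_in, sig_in], outputs=out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```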

When the GRU algorithm is used instead of the LSTM algorithm, a 128-unit GRU layer replaces the LSTM layer. When the TCN algorithm is used, a 128-filter 1D convolution layer is used in its place with the same parameters.

2.4.2 Signature image-face image pair model

The CNN algorithm was employed to generate a model that combines facial images and signature images and takes both traits into account. The same design was used for each branch: a CNN structure with three convolution layers of 32, 64, and 128 filters, max-pooling layers, and a flatten layer. Two dense layers with 512 and 25 units were added to the end of the model, and the model was trained in this configuration. Training of this two-branch CNN structure was started with 100 epochs under tenfold cross-validation, with an early stopping patience of ten epochs.

2.4.3 Signature signal-face image pair model

At this step, a model was trained using both facial images and signature signals with the CNN+LSTM, CNN+GRU, and CNN+TCN algorithms, as in the previous pairings. The model incorporates a CNN branch for the facial images, with max-pooling layers after the convolution layers (32, 64, and 128 filters). For the signature signals, LSTM (128 units), GRU (128 units), and TCN (128 filters) algorithms were employed individually (Fig. 5). Different numbers of layers were also tried at this stage, but no more successful results were obtained. The experimental findings demonstrate that using facial images together with time-dependent signature signals yields better outcomes than the other multi-biometric combinations.

3 Results and performance evaluation

In this study, features from facial data and from dynamic and static signature data were integrated and classified. Signatures are the most commonly utilized behavioral biometric in daily life. Instead of capturing the signature as it is written on paper, we capture the signature as it is formed in the air in front of the screen. This approach enables remote verification using both signature and facial data, eliminating the need for physical presence. The application demonstrated that using in-air signatures and facial images together improves identification success.

Assessing the performance of the examined traits involved diverse metrics: multiclass accuracy (Eq. 2), sensitivity (Eq. 3), precision (Eq. 4), F1 score (Eq. 5), and area under the curve (AUC). The calculations were based on true positive (TP), true negative (TN), false positive (FP), and false negative (FN) outcomes and employed the Python sklearn.metrics library (a sketch of this evaluation step is given after Eq. 5).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
(2)
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
(3)
$$\text{Precision} = \frac{TP}{TP + FP}$$
(4)
$$\text{F1 score} = \frac{2 \times \text{Sensitivity} \times \text{Precision}}{\text{Sensitivity} + \text{Precision}}$$
(5)
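
The evaluation step with sklearn.metrics might look as follows. Macro averaging over the 25 classes and the use of predicted class probabilities for the AUC are assumptions about how Eqs. (2)-(5) were aggregated; the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def evaluate(y_true, y_prob):
    """Compute accuracy, sensitivity, precision, F1, and AUC (sketch).

    y_true: integer class labels; y_prob: per-class probabilities with shape
    (n_samples, n_classes).
    """
    y_pred = np.argmax(y_prob, axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    }
```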

To compare multi-biometric performance, the traits were first trained individually. The tests yielded 90.39% accuracy and an 89.88% F1 score for signature images with the CNN. Facial images, the other trait classified with the CNN architecture, yielded 92.28% accuracy and a 91.33% F1 score. With the signature signals, 91.84% accuracy and a 91.96% F1 score were obtained with LSTM, 91.22% accuracy and a 91.12% F1 score with GRU, and 91.81% accuracy and a 90.33% F1 score with TCN. The other metrics can be found in Table 2.

Table 2 Results of the models when traits are considered individually

To improve performance in the identification process, multiple biometric traits were combined. The model that uses both the facial image and the signature image achieved 92.48% accuracy and a 91.47% F1 score (Table 3).

Table 3 Signature image and face image results

For the signature image and signature signal pair, the CNN+LSTM architecture achieved 81.49% accuracy and an 81.64% F1 score, the CNN+GRU architecture 80.45% accuracy and an 80.59% F1 score, and the CNN+TCN architecture 86.90% accuracy and an 87.22% F1 score (Table 4).

Table 4 Signature image and signature signal results

These results demonstrated that, even when features are combined, performance is not always improved. When face images and dynamic signature signals are trained together, the CNN+LSTM architecture achieves 96.22% accuracy and 96.02% F1 score, whereas the CNN+GRU architecture achieves 95.14% accuracy and 94.83% F1 score. On face images and signature signals, the CNN+TCN architecture performs with 98.01% accuracy and 97.89% F1 score values (Table 5). According to these results, the use of the CNN+TCN architecture with facial image and signature signal traits yielded the best results. These results indicate that modeling in-air signature and facial biometric attributes using a combination of different deep learning techniques yields effective results.

Table 5 Signature signal and face image results

4 Case study on deep learning-powered multimodal biometric authentication

In the rapidly evolving digital landscape, ensuring robust security in online transactions remains a paramount challenge for financial services. This case study aims to demonstrate the practical implementation and effectiveness of our proposed deep learning-powered multimodal biometric authentication system in a real-world setting. We selected a financial services company, grappling with the dual challenges of increasing online fraud and the need for a user-friendly authentication experience. This section outlines the implementation process, the challenges encountered, and the tangible benefits realized by integrating our innovative biometric solution into the company’s digital platforms. Through this case study, we aim to provide insights into the scalability and effectiveness of the model, and its potential to revolutionize security protocols in various industries.

The finance institution initiated the development of a bespoke application for its online platform, leveraging robust libraries for capturing facial and hand gesture data. Users are guided through straightforward procedures to capture their dynamic signature: they are instructed to sign in the air, following simple directives, and to repeatedly sign within a designated box displayed on the screen. Simultaneously, facial data are captured during the signature process. These collected data are subsequently used to construct a model that is integrated into the system.

Before initiating any client interaction, the financial institution requires the customer to perform a signature in front of the camera, ensuring that their face is also visible. This process is part of multimodal biometric system verification. Once the system successfully verifies both the signature and facial data, the client meeting can proceed. This dual verification method effectively mitigates the risk of system manipulation through the use of any application that can alter facial data, thereby enhancing the security and integrity of the authentication process.

The versatility of our deep learning-powered multimodal biometric authentication system extends beyond the financial sector, offering significant benefits in various fields. In health care, this approach can enhance patient identification, ensuring privacy and security. Retail businesses could leverage it for secure online transactions, minimizing fraud risks. Additionally, in the education sector, it can be used for verifying identities in online examinations and maintaining academic integrity. This system’s adaptability demonstrates its potential as a valuable tool in diverse industries, aligning with the evolving needs of a digital-first world.

5 Conclusion

Biometric traits are utilized in a variety of contexts for classifying and identifying individuals in daily life. Particularly with the increasing growth of online platforms, the demand for reliable person recognition has become paramount. In response to this need, our research introduces a high-performance classification model that leverages dynamic signature and facial biometric features for robust online systems. Tests with models built using various deep learning techniques show that effective online person recognition and verification systems can be constructed from dynamic signature and facial biometric traits.

Looking ahead, the applicability of our study extends to commercial settings, where the integration of our recognition and matching applications can thrive. The ability to gather signature and face information from participants holds significant promise, especially in the current climate where online platforms such as Zoom, Google Meet, and Skype are gaining widespread popularity. This not only underscores the relevance of our model in the contemporary digital landscape but also positions it as a valuable asset for companies seeking simple and enhanced security solutions.

Finally, we must note that, because signature data are used in numerous other applications, many participants supported the study by signing with their initials instead of their full signatures. In addition, only 25 volunteers were available for the study; it is anticipated that increasing this number will allow future studies to adopt better approaches.