1 Introduction

Biometrics, a discipline dedicated to studying distinctive physical and behavioral traits that are difficult to imitate, encompasses many data types, including fingerprints, iris patterns, facial features, voice, gait, signature, EEG/ECG, palm, and more. These biometric data play a pivotal role in verification and person recognition across many facets of daily life and in academic research [1,2,3,4]. The global shift to remote activities during the pandemic prompted the development of diverse approaches for acquiring biometric data [5]. These approaches leverage hardware such as smart devices, 3D sensors, and stereo cameras, demonstrating the versatility of biometric applications in different domains. In recent studies, Mekruksavanich and Jitpattanakul [6] employed deep learning techniques using the accelerometers and gyroscope sensors of smart devices for biometric identification. Lewis, Lie, and Xie [7], in turn, utilized the dynamic time warping (DTW) method. Andersson and Araujo [8] explored biometric identification using the Kinect sensor to capture body movements, while Chiu, Hsieh, et al. [9] tracked handwriting in the air using a 3D sensor. Using these technologies, a wide range of person identification and verification applications employing diverse biometric characteristics have been developed. Despite this body of research, the required hardware can be difficult to obtain and is often costly. One motivation for this study is therefore to acquire biometric data with a camera, common and easy-to-use hardware for online activities. Our aim is to develop a model that demonstrates the usability of these biometric data for verification and recognition purposes.

One of the most widely used traits for biometric authentication is the signature. Signatures are a subset of behavioral biometrics and play a crucial role in the majority of official record verification processes. Consequently, numerous studies in the literature have examined the use of signature data for person verification and recognition. Khan et al. [10] built an architecture based on a Gaussian gated recurrent unit (GGRU) tailored for handwritten signature biometrics. Focusing on potential vulnerabilities in signature biometrics, Gonzalez-Garcia et al. [11] investigated various attack scenarios. Jain et al. [12] took a different approach by integrating geometrical features and neural networks for signature verification. There are also signature-based studies that employ traditional machine learning methods [13,14,15]. Signature biometrics can be static or dynamic. Static signature biometrics uses only the final image or signal of the signature, whereas dynamic signature biometrics also incorporates the time variable during data acquisition. Time-dependent data captured while signing, such as pen sound, pressure, and angle, have likewise been used for biometric authentication [16,17,18]. Meanwhile, owing to improvements in object tracking, the number of studies using in-air signatures has increased in recent years. Bailador et al. [19] collected in-air signatures with mobile phone sensors, Guerra Segura et al. [20] utilized a 3D sensor to acquire data for traditional machine learning methods, and Malik et al. [21] captured in-air signatures with a depth camera to build a deep learning model. Acquiring in-air signatures offers a significant advantage for remote settings: individuals can sign without being physically present at the location where their signature is needed. It also reduces cost, since capturing an in-air signature requires no additional sensors beyond a camera.

A single biometric method is often effective; nevertheless, various studies have shown that combining multiple systems produces better results [14, 22,23,24]. Multiple biometrics can involve the use of several sensors together (multi-sensor) [25], the use of different instances of the same biometric characteristic (multi-instance) [17], or the use of distinct biometric features together (multimodal) [14]. This paper proposes a multimodal solution for online security using facial and dynamic signature biometric data. Facial biometrics have been used for a long time and appear in a variety of fields, including health, business, security, and entertainment. Because of this prevalence in everyday life, they have been employed in a large body of related research [26,27,28]. Similarly, various studies utilizing signature biometrics have been carried out, and a significant number of related articles have been published in the literature, particularly in the past ten years. These include studies using paper signatures [29, 30], air signatures [20, 21], and time-varying physical conditions during signing [16,17,18, 31]. Moreover, there have also been studies in which these biometric techniques have been utilized in tandem [22].

The aim of this study is to develop a robust deep learning model for online activities, with particular attention to the vulnerabilities associated with facial and in-air signature biometrics. This is especially significant given that modern technology can spoof many biometric traits. To achieve this aim, we used a low-cost laptop camera to collect 70 instances of both facial and in-air signature data from each of the 25 participants, recording the coordinates and timestamps of the signature data during acquisition. Following a careful preprocessing stage, a total of 1750 pairs of facial images, in-air signature images, and corresponding in-air signature signals were used for training CNN, LSTM, GRU, and TCN networks, achieving an accuracy rate above 98%.

2 Methods and materials

Deep learning models are a cornerstone of contemporary machine learning, demonstrating remarkable effectiveness across various research domains. Notably, biometric data consolidation, categorization, and authentication have made significant strides by leveraging deep learning methodologies. This study examines the integration of diverse biometric features, encompassing facial and dynamic signature traits, and shows that their combined use can significantly enhance identification success rates. State-of-the-art deep learning architectures, namely CNN, LSTM, GRU, and TCN, were selected for their established success in handling the complex patterns and sequences inherent in biometric data. CNNs excel in image-related tasks, LSTMs and GRUs are adept at capturing temporal dependencies in sequential data, and TCNs offer an effective approach for modeling long-range dependencies in time series. This diverse set of architectures was chosen so that their individual strengths could jointly address the complex nature of biometric information, which led to a significant improvement in the overall performance of the system.

One of the most crucial points of this study is that practically every phone or computer has a basic camera that may be used to capture data. In this work, the data were acquired using a modest camera with two megapixels and a resolution of 1920 × 1080. The computational framework included an i7-11800H 2.30 GHz processor and 16 GB RAM.

2.1 Data collection and description

The perception of hand shape, movement, and hand joint points has become vital to the user experience in various technical disciplines and platforms [1, 32,33,34,35]. For instance, augmented reality apps, which have become increasingly popular in recent years, enable digital material to be presented in conjunction with the real environment [36, 37]. Beyond that, the perception of hand movements is used in areas such as sign language recognition and digital content presentation [38,39,40].

In the expansive landscape of hand and finger detection, numerous libraries have been developed across diverse software languages and platforms. Noteworthy examples include OpenPose, HandPose in TensorFlow.js, and the Leap Motion hardware and software solution. In this study, which requires precise coordinates for hand joint points, the MediaPipe Hands (MPH) tool was chosen. This high-performance Python library stands out for its proficiency in hand and finger tracking, using machine learning to extract 21 joints and endpoints with high precision. The MPH therefore provides a comprehensive and accurate representation of one or more hands in three-dimensional space [41,42,43], and it works on both desktop and mobile platforms. In the proposed air-signature application, the finger locations and the writing action are used to build and store the signature image. A rectangular region is established for the signature, and an image is obtained inside this region based on finger motions and positions. The user is initially prompted to show her or his hand in front of the camera at the depth at which the signature will be made, to prevent depth from distorting the metric measurement. In addition to the image, the coordinates of the points traced by the fingertip and the corresponding time information are stored in a txt file, as sketched below.
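
As an illustration of this acquisition step, the following minimal sketch shows how the index fingertip position and a timestamp might be logged per frame with MediaPipe Hands. The capture loop, confidence thresholds, output file name, and column layout are illustrative assumptions, not the exact implementation used in this study.

```python
import csv
import time

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# Minimal sketch: follow the index fingertip and log (x, y, t) for every frame
# in which a hand is detected. Thresholds and the output file are assumptions.
cap = cv2.VideoCapture(0)
t0 = time.time()
with mp_hands.Hands(max_num_hands=1,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands, \
     open("signature_signal.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["x", "y", "t"])
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input, while OpenCV delivers BGR frames.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            tip = lm[mp_hands.HandLandmark.INDEX_FINGER_TIP]
            h, w, _ = frame.shape
            # Landmarks are normalized to [0, 1]; convert to pixel coordinates.
            writer.writerow([tip.x * w, tip.y * h, time.time() - t0])
        cv2.imshow("signature capture", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # Esc ends the recording
            break
cap.release()
cv2.destroyAllWindows()
```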

Similarly, the MediaPipe Face (MPF) library was employed to obtain facial data from the camera. MPF detects facial landmarks and facial expressions in photographs and videos. Machine learning (ML) models are used for this purpose, and they can process both a single image and a continuous stream of images. The process generates 3D facial landmarks, blendshape scores to infer detailed facial surfaces in real time, and transformation matrices to carry out the changes necessary for effect rendering. Detecting 468 key points on the face with this library allows the basic shape of the face to be established. The camera captures the face while the signature is being captured.

The dataset used in this study was collected at Yildiz Technical University's (YTU) Cyber Security and Biometric Research Centre under a contract with the General Directorate of Development Agencies and with the ethical permission of the YTU Ethical Committee. Volunteers were asked to sign in the air in front of a screen, and their biometric data were recorded. As shown in Table 1, there were 25 volunteers, 11 male and 14 female.

Table 1 Dataset characteristics and participant statistics

Participants were first asked to show their hands at the location where they would sign so that the signing depth could be assessed. Here, the hand dimensions were determined, and a key on the keyboard was pressed to initiate the biometric data collection process. Afterward, the rectangular area in which the signature would be placed was displayed on the screen, and the volunteers were asked to sign there. During signing, both the key facial landmarks were displayed and the person's signature was drawn on the screen in real time (Fig. 1). The participants were requested to write on the screen with the tip of the index finger while signing. However, a rule was defined whereby the drawing process is halted whenever the distance between the tips of the index and middle fingers falls below a threshold value, which prevents the hand from starting to draw as soon as it enters the rectangular signing area. The threshold is approximately the distance between the index finger DIP joint and the tip of the index finger (a sketch of this rule is given after Fig. 1). Signing in this manner makes it possible to add distinct signature elements such as dots and separate strokes. At the end of the procedure, a designated key was pressed to record the facial image, the signature image, and the signature coordinate-time data.

Fig. 1
figure 1

a Screenshot taken following the collection of biometric data. b The view of the signature before preprocessing. c Signature view after the removal of spaces. d The final 256 × 256 pixel signature version of the image created for the model. e Acquired coordinate-time raw signals for signature data. f Subtracting the lowest values for each signal to make the signals start from the origin. g Oversampling to ensure that each participant’s signals have the same length. h Intersecting the points where the face and mask match. i Final 256 × 256 facial image
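
The fingertip-distance rule described above can be expressed compactly. The sketch below is a hedged illustration, assuming MediaPipe hand landmarks as input and using the index DIP-to-tip distance as the per-frame threshold; the function name and the use of 2D distances are illustrative choices, not the exact implementation.

```python
import math

import mediapipe as mp

mp_hands = mp.solutions.hands


def _dist(a, b):
    # Planar Euclidean distance between two normalized MediaPipe landmarks.
    return math.hypot(a.x - b.x, a.y - b.y)


def pen_is_down(landmarks):
    """Return True while the stroke should be drawn.

    Drawing is active only when the index and middle fingertips are farther
    apart than (approximately) the index DIP-to-tip distance, so the hand can
    enter the signing area without immediately starting to draw.
    """
    index_tip = landmarks[mp_hands.HandLandmark.INDEX_FINGER_TIP]
    index_dip = landmarks[mp_hands.HandLandmark.INDEX_FINGER_DIP]
    middle_tip = landmarks[mp_hands.HandLandmark.MIDDLE_FINGER_TIP]
    threshold = _dist(index_tip, index_dip)  # adapts to hand depth per frame
    return _dist(index_tip, middle_tip) > threshold
```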

In this way, a total of 1750 samples per biometric trait (face image (.jpg), signature image (.jpg), and signature signal (.csv)) were obtained, comprising 70 signatures and 70 facial images from each of the 25 participants.

2.2 Preprocessing

After the data capture, 1750 signatures and facial images from 25 people were prepared for processing. In addition to the signature image, the coordinates and timestamps of the points traced by the index fingertip were saved to a file with a csv extension.

On the 2D signature images, the unnecessary regions around the signatures were removed based on the size of the rectangle visible on the screen, and the image size was fixed to 256 × 256. A rotation step that aligned the signatures with the horizontal plane was initially applied but later removed because it had no significant positive effect on performance. For each signature signal, to make the coordinates start from the origin, the minimum values of the x and y signals were subtracted from the x and y values of each point, and the time of the first point, t0, was subtracted from all time values. Additionally, all signals were oversampled so that their lengths were identical: the target length was set to twice the length of the longest signal, and all signals were extended to this length (Fig. 2). The signature in Fig. 2a, consisting of 30 points created in the air anywhere within the rectangular area on the screen, was shifted to the origin of the coordinate axes and then oversampled to 200 points by linear interpolation (Eq. 1), as seen in Fig. 2b. In this equation, for two points of the signal, if \(y(n)\) and \(y(n+1)\) are the starting and ending points, respectively, then \(\hat{y}(n+h)\) is the estimated value at a distance \(h\) from point \(n\), where \(h\) lies in the interval [0,1]. The "interp1d" linear interpolation function of the Python SciPy library was used for this procedure (a code sketch of these steps is given after Fig. 2). Since there is no noise in the signature signal, preprocessing steps such as noise reduction and skeleton removal were not applied.

$$\hat{y}(n+h) = (1-h)\,y(n) + h\,y(n+1)$$
(1)
Fig. 2
figure 2

a 2D image of a 30-point signature before preprocessing b 2D image of a 200-point interpolated signature after preprocessing
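
The origin shift and oversampling described above can be summarized as follows. This is a minimal sketch assuming the raw x, y, and time samples are available as arrays and that every channel is resampled to 200 points with SciPy's interp1d; the function name and the common normalized index axis are illustrative choices.

```python
import numpy as np
from scipy.interpolate import interp1d


def preprocess_signal(x, y, t, target_len=200):
    """Shift a signature signal to the origin and oversample it to target_len."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    # Start the coordinates from the origin and the time axis from zero.
    x, y, t = x - x.min(), y - y.min(), t - t[0]
    # Resample each channel on a common normalized index axis; Eq. (1) is the
    # linear interpolation rule applied between neighboring samples.
    src = np.linspace(0.0, 1.0, num=len(x))
    dst = np.linspace(0.0, 1.0, num=target_len)
    resample = lambda s: interp1d(src, s, kind="linear")(dst)
    return np.stack([resample(x), resample(y), resample(t)], axis=-1)  # (200, 3)
```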

While acquiring the facial images, the FACE_OVAL feature of the MPF library was used to crop only the facial portion of each image and mask the rest with a black background, as sketched below. Therefore, no further preprocessing was needed.
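
A hedged sketch of this cropping step is given below. It assumes MediaPipe's FaceMesh solution and its FACEMESH_FACE_OVAL connection set (the constant name may differ between library versions), approximates the oval with a convex hull, and uses OpenCV for masking; the 256 × 256 output size follows the text, while the helper name and masking details are illustrative.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh


def crop_face_oval(bgr_image, out_size=256):
    """Keep only the face oval on a black background and resize to out_size."""
    h, w = bgr_image.shape[:2]
    with mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as mesh:
        result = mesh.process(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None
    lm = result.multi_face_landmarks[0].landmark
    # FACEMESH_FACE_OVAL is a set of landmark-index pairs outlining the face.
    oval_idx = {i for pair in mp_face_mesh.FACEMESH_FACE_OVAL for i in pair}
    pts = np.array([[int(lm[i].x * w), int(lm[i].y * h)] for i in oval_idx],
                   dtype=np.int32)
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillConvexPoly(mask, cv2.convexHull(pts), 255)  # approximate oval region
    face_only = cv2.bitwise_and(bgr_image, bgr_image, mask=mask)
    x, y, bw, bh = cv2.boundingRect(pts)
    return cv2.resize(face_only[y:y + bh, x:x + bw], (out_size, out_size))
```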

2.3 Applied deep learning methods

2.3.1 Convolutional neural network (CNN)

Convolutional neural networks (CNNs) have been a game-changing innovation in image processing and computer vision [44]. CNNs use layers of linked nodes to extract complex visual patterns, simulating the human visual system. Their design includes pooling layers to downsample feature maps, activation functions to induce nonlinearities, and convolutional layers to apply filters to input images. With the use of gradient descent optimization, CNNs are trained via backpropagation. CNNs do away with the necessity for human feature engineering since they have the potential to automatically learn optimum filters. They have been used in many different fields, including video analysis, object identification, and medical diagnostics. CNNs have transformed the study of visual data by providing formerly unobtainable degrees of accuracy. In this study, facial images and 2D signature images were included in the network model using the CNN algorithm because of the CNN architecture’s strength with images.

2.3.2 Long short-term memory (LSTM)

A specific kind of recurrent neural network (RNN) called long short-term memory (LSTM) has been developed to recognize long-term dependencies in sequential inputs [45]. Based on the relevance of the information, LSTMs selectively store, discard, and output it using memory cells, input gates, forget gates, and output gates. LSTMs may perform very well in applications such as speech recognition, natural language processing, and time series analysis due to their distinctive design. By modifying the weights they carry using gradient descent optimization, LSTMs are trained to learn from labeled sequences via backpropagation through time (BPTT). LSTM networks have greatly improved sequential modeling and have been applied in many different fields, boosting the capacity of neural networks to comprehend and model sequential inputs. Since it consistently produces positive outcomes in signal data, LSTM was one of the neural network techniques utilized to include signature signals in the model in this study.

2.3.3 Gated recurrent unit (GRU)

The gated recurrent unit (GRU) is a type of recurrent neural network (RNN) notable for its streamlined construction and for performance that rivals LSTM [46]. By including memory update and reset gates, GRU networks mitigate the vanishing gradient problem and enable improved information flow and faster training. Because these gates are merged, GRUs have a more streamlined architecture than LSTM, which lowers the amount of computation needed. Tasks including machine translation, conversational systems, and anomaly detection have been effectively addressed with GRUs. Their efficiency in handling sequential data, shorter training times, and lower computational complexity make GRUs useful in a variety of applications requiring time series analysis and natural language processing. In some studies, the GRU algorithm has been reported to yield better results than LSTM; it was therefore also used in this study for comparison with the LSTM results.

2.3.4 Temporal convolutional networks (TCN)

Temporal convolutional networks (TCNs) have become effective architectures for sequence modeling problems [47]. TCNs use dilated convolutions to efficiently model dependencies across many time steps, and their parallelizable and scalable nature makes them well suited to processing long sequences with reduced computational complexity. Applications including voice synthesis, music production, and video processing have had success with TCNs. TCNs have transformed sequence modeling by capturing a broad variety of relationships with fewer parameters and computations than conventional convolutional structures, which makes them essential tools in fields involving time series analysis and sequential data processing (the dilated-convolution idea is illustrated below). This algorithm, which yields effective results on signals, was also evaluated when including the signature signals in the study's model.
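
To make the dilated-convolution idea concrete, the following is a generic Keras sketch of a causal Conv1D stack with exponentially growing dilation rates. It illustrates the TCN principle only; it is not the exact TCN configuration trained in Sect. 2.4, and the filter count, kernel size, and input shape are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def tiny_tcn_block(seq_len=200, channels=3, filters=64):
    """Generic TCN-style block: causal Conv1D layers with growing dilation."""
    inputs = keras.Input(shape=(seq_len, channels))
    x = inputs
    for dilation in (1, 2, 4, 8):
        # Each layer doubles the dilation rate, so the receptive field grows
        # exponentially while the parameter count stays small.
        x = layers.Conv1D(filters, kernel_size=3, padding="causal",
                          dilation_rate=dilation, activation="relu")(x)
    return keras.Model(inputs, x)
```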

2.4 Biometric authentication

Three types of data were gathered during the data collection: facial images, signature images, and signature signals. The images consist of RGB intensities, while the signals comprise x and y coordinates and timestamp data. After the preprocessing phase, the identification success of the dataset was evaluated separately for each trait. The CNN architecture, whose performance on images is well established in the literature, was employed to evaluate the facial image and signature image traits. To do this, all images were first resized to 256 × 256 × 3. Before building the model, the images were partitioned into 80% training images and 20% test images. Using a similar structure, the CNN model (Fig. 3) was constructed separately for facial images and signature images. This structure employs 2D convolution layers with 32, 64, and 128 filters, each followed by a 2 × 2 max-pooling layer. The model ends with a flatten layer and two dense layers: the first with 512 units and a ReLU activation function, and the second with 25 units (the number of classes, i.e., participants) and a softmax function. Outputs P1 to P25 in Fig. 3 represent the classes, i.e., the participants. Various layer configurations were also tried at this stage, and this structure yielded the highest accuracy values (a Keras sketch of this branch is given after Fig. 3). Training was initiated with 50 epochs, and an early stopping criterion with a patience of five epochs (training stops if performance does not improve for five consecutive epochs) was applied. With tenfold cross-validation, training completed between 35 and 42 epochs for facial images and between 36 and 45 epochs for signature images. While fitting the model, batch sizes of 16, 32, and 64 were explored. With a batch size of 16, the model's accuracy did not change significantly, but training time increased; with a batch size of 64, hardware-related issues were encountered that disrupted the training process. As a compromise, the batch size was set to 32 to ensure both satisfactory performance and efficient hardware utilization. In conjunction with batch size tuning, the Adam optimizer, a popular choice for deep learning tasks, was employed; it adapts the learning rate during training, providing an adaptive and efficient optimization strategy. The initial learning rate was set to 0.001, and the optimizer's default parameters were used.

Fig. 3
figure 3

The model created with the CNN architecture using only facial images. The same architecture was also employed for signature images. P1 to P25 represent the classes, i.e., the participants
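
Under the assumptions that the convolution kernels are 3 × 3, that labels are integer-encoded (hence sparse categorical cross-entropy), and that early stopping monitors a validation metric with a patience of five epochs, the image branch described above might look as follows in Keras. Filter counts, dense sizes, optimizer, learning rate, and batch size follow the text; the rest is an illustrative sketch.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_image_cnn(num_classes=25, input_shape=(256, 256, 3)):
    """CNN branch for 256 x 256 facial or signature images (sketch)."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Illustrative training call with the batch size and early stopping from the text:
# model.fit(x_train, y_train, epochs=50, batch_size=32, validation_split=0.1,
#           callbacks=[keras.callbacks.EarlyStopping(patience=5,
#                                                    restore_best_weights=True)])
```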

LSTM, GRU, and TCN architectures were individually employed to analyze the signature signals. Prior to analysis, all signals, including their x, y, and time channels, were resized to a length of 200, which is twice the length of the longest signal. The preprocessed and length-equalized signals were first split into 80% training samples and 20% test samples, as with the images. The LSTM architecture contained two LSTM layers with linear activation, with 128 and 32 units, respectively, and was completed with two dense layers, one with 512 units and a ReLU activation and one with 25 units and a softmax activation (Fig. 4; a sketch is given after the figure). The GRU architecture was implemented in the same way, with GRU layers in place of the LSTM layers. For the TCN structure, two 1D convolution layers with 256 and 64 filters were used, followed by 1D max pooling and two dense layers.

Fig. 4
figure 4

Representation of the LSTM architecture applied to the signature signal
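
Assuming integer-encoded labels and default Adam settings (both assumptions, as is the loss), the signal branch with the layer sizes stated above could be sketched as follows.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_signal_lstm(num_classes=25, seq_len=200, channels=3):
    """LSTM branch for the (200, 3) signature signals (sketch)."""
    model = keras.Sequential([
        keras.Input(shape=(seq_len, channels)),
        layers.LSTM(128, activation="linear", return_sequences=True),
        layers.LSTM(32, activation="linear"),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# The GRU variant swaps the LSTM layers for GRU layers of the same sizes; the
# TCN variant uses Conv1D layers with 256 and 64 filters followed by 1D max
# pooling and the same dense layers, as described in the text.
```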

2.4.1 Signature signal-signature image pair model

The construction of the first pair model began with the signature image and its corresponding signature signal. Twenty percent of the preprocessed coordinate-time signals and signature images were set aside for testing, and the training procedure was applied to the remaining 80% of the data. First, a CNN+LSTM model was constructed to process the signature images and signals. For the CNN branch applied to the signature images, convolution and max-pooling operations were executed in three separate hidden layer blocks: each convolution, with 32, 64, and 128 filters, respectively, was followed by a 2 × 2 pooling layer, and the resulting feature maps were flattened with a flatten layer. The signature signals, in turn, passed through a 128-unit LSTM layer and a flatten layer. The CNN and LSTM blocks were then merged using the concatenate function of the Python Keras library, which joins the flattened outputs of both branches into a single representation so that the features extracted by the two architectures are fused. To complete the model, the merged block was connected to two dense layers: the first with 512 units and a ReLU activation function, and the second with 25 units and a softmax activation function (Fig. 5; a sketch of this fusion model is given after the figure). Training was started with 50 epochs; the model was developed with tenfold cross-validation, and early stopping terminated training between 34 and 48 epochs to prevent overfitting.

Fig. 5
figure 5

CNN + LSTM / CNN + GRU / CNN + TCN models for image + signature signal pairs
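
A possible Keras functional-API sketch of the two-branch fusion in Fig. 5 is shown below, assuming 3 × 3 kernels, return_sequences=True before the flatten on the signal branch, and sparse categorical cross-entropy; these details, like the compile settings, are assumptions rather than reported choices.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_fusion_model(num_classes=25):
    """CNN branch for an image fused with an LSTM branch for a signal (sketch)."""
    img_in = keras.Input(shape=(256, 256, 3), name="image")
    x = img_in
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)

    sig_in = keras.Input(shape=(200, 3), name="signature_signal")
    s = layers.LSTM(128, return_sequences=True)(sig_in)  # GRU or Conv1D in the other variants
    s = layers.Flatten()(s)

    merged = layers.concatenate([x, s])  # fuse the two branches
    out = layers.Dense(512, activation="relu")(merged)
    out = layers.Dense(num_classes, activation="softmax")(out)

    model = keras.Model(inputs=[img_in, sig_in], outputs=out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```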

When the GRU algorithm is used instead of the LSTM algorithm, a 128-unit GRU layer replaces the LSTM layer. When the TCN algorithm is used, a 128-filter 1D convolution layer is used in its place with the same parameters.

2.4.2 Signature image-face image pair model

The CNN algorithm was employed to generate a model that combines facial images and signature images and takes both traits into account. The same design was used for each branch: a CNN structure with three convolution layers of 32, 64, and 128 filters, max-pooling layers, and a flatten layer. Two dense layers with 512 and 25 units were added to the end of the model, and the model was trained in this configuration. Training of this two-branch CNN structure was started with 100 epochs under tenfold cross-validation, with an early stopping patience of ten epochs.

2.4.3 Signature signal-face image pair model

At this step, a model was trained using both facial images and signature signals with the CNN+LSTM, CNN+GRU, and CNN+TCN algorithms, as in the previous pairings. The model incorporates a CNN branch for the facial images, with max-pooling layers after the convolution layers (32, 64, and 128 filters). For the signature signals, LSTM (128 units), GRU (128 units), and TCN (128 filters) algorithms were employed individually (Fig. 5). Different numbers of layers were also tried at this stage, but no more successful results were obtained. The experimental findings demonstrate that using facial images together with time-dependent signature signals yields better outcomes than the other multi-biometric combinations.

3 Results and performance evaluation

In this study, features from facial data and from dynamic and static signature data were integrated and classified. Signatures are the most commonly utilized behavioral biometric in daily life. Instead of capturing the signature as it is written on paper, we capture the signature as it is formed in the air in front of the screen. This approach enables remote verification using both signature and facial data, eliminating the need for physical presence. The application demonstrated that using in-air signatures and facial images together improves identification success.

Assessing the performance of the examined traits involved diverse metrics: multiclass accuracy (Eq. 2), sensitivity (Eq. 3), precision (Eq. 4), F1 score (Eq. 5), and area under the curve (AUC). The calculations were based on true positive (TP), true negative (TN), false positive (FP), and false negative (FN) outcomes and employed the Python sklearn.metrics library (a sketch of this evaluation step is given after Eq. 5).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
(2)
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
(3)
$$\text{Precision} = \frac{TP}{TP + FP}$$
(4)
$$\text{F1 score} = \frac{2 \times \text{Sensitivity} \times \text{Precision}}{\text{Sensitivity} + \text{Precision}}$$
(5)
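
The evaluation step with sklearn.metrics might look as follows. Macro averaging over the 25 classes and the use of predicted class probabilities for the AUC are assumptions about how Eqs. (2)-(5) were aggregated; the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def evaluate(y_true, y_prob):
    """Compute accuracy, sensitivity, precision, F1, and AUC (sketch).

    y_true: integer class labels; y_prob: per-class probabilities with shape
    (n_samples, n_classes).
    """
    y_pred = np.argmax(y_prob, axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    }
```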

To compare multi-biometric performance, the traits were first trained individually. The tests yielded 90.39% accuracy and an 89.88% F1 score for signature images with the CNN. Facial images, the other trait classified with the CNN architecture, yielded 92.28% accuracy and a 91.33% F1 score. With the signature signals, 91.84% accuracy and a 91.96% F1 score were obtained with LSTM, 91.22% accuracy and a 91.12% F1 score with GRU, and 91.81% accuracy and a 90.33% F1 score with TCN. The other metrics can be found in Table 2.

Table 2 Results of the models when traits are considered individually

To improve performance in the identification process, multiple biometric traits were combined. The model that uses both the facial image and the signature image achieved 92.48% accuracy and a 91.47% F1 score (Table 3).

Table 3 Signature image and face image results

For the signature image and signature signal pair, the CNN+LSTM architecture achieved 81.49% accuracy and an 81.64% F1 score, the CNN+GRU architecture 80.45% accuracy and an 80.59% F1 score, and the CNN+TCN architecture 86.90% accuracy and an 87.22% F1 score (Table 4).

Table 4 Signature image and signature signal results

These results demonstrated that, even when features are combined, performance is not always improved. When face images and dynamic signature signals are trained together, the CNN+LSTM architecture achieves 96.22% accuracy and 96.02% F1 score, whereas the CNN+GRU architecture achieves 95.14% accuracy and 94.83% F1 score. On face images and signature signals, the CNN+TCN architecture performs with 98.01% accuracy and 97.89% F1 score values (Table 5). According to these results, the use of the CNN+TCN architecture with facial image and signature signal traits yielded the best results. These results indicate that modeling in-air signature and facial biometric attributes using a combination of different deep learning techniques yields effective results.

Table 5 Signature signal and face image results

4 Case study on deep learning-powered multimodal biometric authentication

In the rapidly evolving digital landscape, ensuring robust security in online transactions remains a paramount challenge for financial services. This case study aims to demonstrate the practical implementation and effectiveness of our proposed deep learning-powered multimodal biometric authentication system in a real-world setting. We selected a financial services company, grappling with the dual challenges of increasing online fraud and the need for a user-friendly authentication experience. This section outlines the implementation process, the challenges encountered, and the tangible benefits realized by integrating our innovative biometric solution into the company’s digital platforms. Through this case study, we aim to provide insights into the scalability and effectiveness of the model, and its potential to revolutionize security protocols in various industries.

The finance institution initiated the development of a bespoke application for its online platform, leveraging robust libraries for capturing facial and hand gesture data. Users are guided through straightforward procedures to capture their dynamic signature: they are instructed to sign in the air, following simple directives, and to repeatedly sign within a designated box displayed on the screen. Simultaneously, facial data are captured during the signature process. These collected data are subsequently used to construct a model that is integrated into the system.

Before initiating any client interaction, the financial institution requires the customer to perform a signature in front of the camera, ensuring that their face is also visible. This process is part of multimodal biometric system verification. Once the system successfully verifies both the signature and facial data, the client meeting can proceed. This dual verification method effectively mitigates the risk of system manipulation through the use of any application that can alter facial data, thereby enhancing the security and integrity of the authentication process.

The versatility of our deep learning-powered multimodal biometric authentication system extends beyond the financial sector, offering significant benefits in various fields. In health care, this approach can enhance patient identification, ensuring privacy and security. Retail businesses could leverage it for secure online transactions, minimizing fraud risks. Additionally, in the education sector, it can be used for verifying identities in online examinations and maintaining academic integrity. This system’s adaptability demonstrates its potential as a valuable tool in diverse industries, aligning with the evolving needs of a digital-first world.

5 Conclusion

Biometric traits are utilized in a variety of contexts for classifying and identifying individuals in daily life. Particularly with the increasing growth of online platforms, the demand for reliable person recognition has become paramount. In response to this need, our research introduces a high-performance classification model that leverages dynamic signature and facial biometric features for robust online systems. Tests with models built using various deep learning techniques show that effective online person recognition and verification systems can be constructed from dynamic signature and facial biometric traits.

Looking ahead, the applicability of our study extends to commercial settings, where the integration of our recognition and matching applications can thrive. The ability to gather signature and face information from participants holds significant promise, especially in the current climate where online platforms such as Zoom, Google Meet, and Skype are gaining widespread popularity. This not only underscores the relevance of our model in the contemporary digital landscape but also positions it as a valuable asset for companies seeking simple and enhanced security solutions.

Finally, we must note that, because signature data are used in numerous other applications, many participants supported the study by signing with their initials instead of their full signatures. In addition, only 25 volunteers were available for the study; it is anticipated that increasing this number will allow future studies to adopt better approaches.