1 Introduction

Data sharing is fundamental for collaborative research, enabling cross-disciplinary collaboration and enhancing research quality by facilitating replication and building upon existing work [1, 2]. It expedites research processes, granting swift access to data, fostering rapid discoveries, and promoting transparency by sharing methodologies and findings with the public. Moreover, data sharing reduces research costs by pooling resources, freeing up funds for additional research [3].

Naturalistic driving studies (NDS) have become essential in understanding transportation safety, particularly through analyzing NDS data. NDS data offers various advantages, including improving driver safety by identifying risk factors and developing interventions like alerts to prevent accidents. Furthermore, NDS data aids in designing safer vehicles by identifying factors causing driver fatigue and distraction, leading to more comfortable and less distracting vehicle designs. Public policy benefits from NDS data as well, as it helps identify factors contributing to accidents, leading to the development of policies like road safety programs and stricter distracted driving laws. Moreover, NDS data supports research into driver behavior and the development of technologies such as advanced driver assistance systems, enhancing overall road safety [4, 5].

While sharing of NDS data has numerous benefits, it also presents significant ethical and privacy concerns that need careful consideration [6]. Publicly sharing data originating from human subjects raises the risk of misuse and unauthorized access, jeopardizing the privacy of individuals who participated in the research. Failure to protect personally identifiable information (PII) may violate institutional review board agreements and the commitments made in consent forms to safeguard participants’ privacy. Therefore, responsible data sharing practices must be established. Data deidentification techniques or privacy-focused data sharing methods are often used to ensure the integrity of scientific research [5, 7,8,9,10,11,12].

However, sharing face videos from NDS encounters a unique challenge due to the presence of PII within the data. These videos contain two types of PII: basic driver information and videos capturing drivers under various conditions. While basic information can be anonymized relatively easily, anonymizing video data is more complex. It requires the removal of identifying facial features while preserving the data’s scientific value, posing a significant challenge in balancing privacy protection and data utility [13]. This dilemma underscores the need for advanced techniques in deidentifying face videos, as traditional methods may fall short in preserving the meaningful insights offered by the data.

Recent advancements in computer vision have brought forth the emergence of generative adversarial network (GAN)-based techniques, particularly in the realm of face swapping for deidentifying face videos [14, 15]. This approach leverages GANs to replace the original facial features in videos with synthetic or anonymized counterparts, eliminating PII while preserving important human factors attributes [16, 17]. While the effectiveness of GAN-based deidentification of still images and videos has been widely researched, its efficacy on in-the-wild videos, such as the face videos in NDS data, remains largely unexplored. To address this gap, in this paper we present four main research goals:

  • Understand the requirements for data deidentification from a privacy standpoint.

  • Comparatively evaluate the effectiveness of GAN-based face swapping techniques for deidentifying drivers’ face videos from naturalistic driving scenarios with unconstrained head movements and lighting conditions.

  • Test and compare the effectiveness of different face deidentification algorithms in preserving key human factors attributes.

  • Test the effectiveness of using fake faces for the deidentification of naturalistic data.

To accomplish our objective, we conduct a comprehensive evaluation of GAN-based face swapping techniques applied to face videos obtained from NDS data. The evaluation encompasses rigorous experimentation and analysis performed on a substantial dataset of NDS face videos. By assessing the performance of these GAN-based techniques, we demonstrate their effectiveness in preserving the valuable human factors attributes present in the videos while ensuring the removal of PII. Both quantitative and qualitative analyses of the deidentified videos are performed to assess the effectiveness of GAN-based techniques in preserving key attributes that would help researchers study driver safety, attention, and fatigue while the face is deidentified. We also provide an error analysis plan and a framework for automated deidentification of drivers’ face videos. Furthermore, our research makes a noteworthy contribution by incorporating outdoor NDS data into the evaluation. This emphasis on outdoor data is particularly valuable as it reflects real-world driving conditions and scenarios, which is crucial for the practical application of deidentification techniques in transportation safety research.

This paper is structured into five main sections. Section 2 offers a comprehensive review of relevant literature and background information concerning the challenges and advancements in sharing drivers’ face videos while adhering to ethical guidelines. Section 3 outlines the research’s methodology and experimental setup, with a focus on GAN-based face swapping techniques for deidentification, providing insights into the rationale and parameters used. Section 4 presents the experimental results, including quantitative measurements and statistical analyses, shedding light on the effectiveness of the deidentification methods. Section 5 details the error analysis plan, systematically identifying, quantifying, and mitigating potential sources of error to ensure the reliability of the findings. Finally, Sect. 6 offers concluding remarks, summarizing key findings, addressing research objectives, acknowledging limitations, and suggesting avenues for future exploration.

2 Related works

Data sharing in transportation safety research enhances the reproducibility, reliability, and transparency of scientific findings [3, 18, 19]. By sharing data, researchers validate reproducibility, optimize resources, and maximize benefits [20]. However, with human subject data, privacy considerations are crucial [21, 22]. The US Department of Health and Human Services’ Common Rule provides comprehensive protections, including privacy provisions [23].

Deidentification is a delicate balance between data sharing and protecting participants’ identities. The process, while crucial for privacy, demands robust safeguards to prevent reidentification. Ethical guidelines intertwine with deidentification techniques, emphasizing the need for careful navigation. This section provides a comprehensive overview of the related works for this paper. Section 2.1 discusses the recommended steps for deidentification, Sect. 2.2 delves into the standard data deidentification practices, Sect. 2.3 discusses considerations specific to deidentifying NDS, and Sect. 2.4 explores the face swapping techniques.

2.1 Steps to deidentify data

The National Academy of Sciences (NAS) has devised a comprehensive twelve-step process to facilitate the sharing of data from clinical trials, as outlined in a 2015 National Academies Press publication [24]. However, given the specific context of sharing face video data collected in the NDS, some steps within the original process are not directly applicable to our scope. Consequently, we focus on five key steps that are particularly relevant and essential for this effort:

  • Determine the direct identifiers in the existing dataset.

  • Mask the identifiers in the dataset.

  • Perform threat modeling.

  • Determine minimal acceptable data utility.

  • Determine the reidentification risk threshold.

The initial step entails identifying direct identifiers, which can include not only traditional personal details like names and addresses but also unique challenges presented by multimedia content like facial images. To mitigate these challenges, masking techniques are applied as the second step. This can involve blurring faces or replacing names with participant IDs, ensuring that PII remains confidential. Next, threat modeling is essential to anticipate potential risks posed by various adversaries, from institutional competitors to individuals with malicious intent, considering both direct and additional available information. Additionally, determining the minimal acceptable data utility, the fourth step, depends on research objectives. For instance, studies on driver behavior may require preserving specific human factors attributes like eye and lip movements, while other research may not necessitate retaining facial features. Finally, researchers must define the reidentification risk threshold, which varies based on factors like dataset type and recipient reputation.

By following these steps, researchers can navigate the process of deidentifying face video data collected in NDS, ensuring privacy protection and maintaining data utility for scientific investigations. These steps lay the foundation for effective data sharing practices in transportation safety research while upholding ethical guidelines and participant confidentiality.

2.2 Standard deidentification practices

In the context of sharing data that includes PII, it has become a standard practice to remove PII through various deidentification techniques. This section discusses some commonly employed techniques:

  a. Pseudonymization: A widely used deidentification technique in which PII within the dataset is replaced with dummy identifiers or pseudonyms. By substituting identifiable information with unique but unrelated values, the privacy of individuals can be safeguarded [25]; a code sketch following this list illustrates pseudonymization alongside data suppression.

  b. Aggregation: Providing summary or aggregate information about the dataset instead of revealing individual-level details [26]. For instance, rather than disclosing the specific data of each customer, releasing average statistics about the customer base exemplifies aggregation.

  c. Data reduction: The complete removal of direct identifiers from the dataset to achieve deidentification. Quasi-identifiers that could potentially be used to reidentify individuals are also eliminated.

  d. Data suppression: Suppressing direct information that could lead to the identification of individuals [27]. Techniques such as providing data ranges instead of precise values, clustering data, or applying random rounding fall under the purview of data suppression [28, 29].

  e. Data masking: Obscuring direct identifiers to deidentify the data [30]. This can be achieved by introducing random noise or substituting values with random numbers. Data masking may incorporate pseudonymization as part of its process.
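To make these techniques concrete, the following minimal Python sketch illustrates pseudonymization and data suppression on hypothetical tabular records. The field names, the salted-hash pseudonym scheme, and the ten-year age bins are illustrative assumptions, not prescriptions from the literature cited above.

```python
import hashlib
import secrets

# Hypothetical example records; field names are illustrative only.
records = [
    {"name": "Alice Smith", "age": 34, "trip_minutes": 7.2},
    {"name": "Bob Jones", "age": 61, "trip_minutes": 9.8},
]

SALT = secrets.token_hex(16)  # kept secret; prevents dictionary attacks on hashes

def pseudonymize(name: str) -> str:
    """Replace a direct identifier with a stable but unrelated pseudonym."""
    return "P-" + hashlib.sha256((SALT + name).encode()).hexdigest()[:8]

def suppress_age(age: int) -> str:
    """Data suppression: report a coarse range instead of the exact value."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

deidentified = [
    {"participant_id": pseudonymize(r["name"]),
     "age_range": suppress_age(r["age"]),
     "trip_minutes": r["trip_minutes"]}
    for r in records
]
print(deidentified)
```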

Table 1 Popular deidentification techniques and examples

Limitations of existing deidentification techniques for NDS data While the deidentification techniques discussed earlier (Table 1), such as pseudonymization, aggregation, data reduction, data suppression, and data masking, are commonly used for general datasets, applying these techniques to video or image-based data presents additional challenges. Face video or image data contains rich visual information that can potentially lead to the identification of individuals, making it more difficult to achieve effective deidentification. However, researchers have developed specialized methods to address these challenges and protect privacy in video and image-based datasets.

2.2.1 Existing deidentification techniques for face images/videos

In the context of face video or image data, traditional deidentification techniques like pseudonymization or aggregation may not be directly applicable. Simply replacing personal identifiers or providing aggregate information might not be sufficient to prevent the identification of individuals, as visual attributes can be highly distinctive and unique to each person.

Instead, specific approaches tailored to video and image data are employed to deidentify individuals while preserving the utility of the data. Some widely used techniques are described below:

  a. Video blur: This involves applying a blur effect to the facial region in a video or image, as shown in Fig. 1b. The degree of blur can vary, ranging from a slight blurring of the facial features to a more intense blur that renders the face unrecognizable [31]. This technique works by reducing the level of detail in the facial region, making it difficult for individuals to be identified based on their facial characteristics; a code sketch following this list illustrates blur and pixelation.

  b. Pixelation: This works by replacing the original pixels in the facial region with larger, blocky pixels, as shown in Fig. 1c. The process creates a mosaic-like effect in which the facial features become less distinguishable [32]. The degree of pixelation can be adjusted to balance privacy protection and maintain the overall context of the video or image.

  c. Creating a composite face: Instead of simply blurring or pixelating the entire face, a composite face can be generated by combining facial features from different individuals in the dataset, as shown in Fig. 1d. This approach creates a mixed representation that does not correspond to any specific person, making it more difficult to identify individuals.

  d. Creating synthetic faces: Generating synthetic faces, such as avatars, is a method used to deidentify face video or image data [33]. Synthetic faces are computer-generated representations that do not correspond to real individuals. By replacing real faces with synthetic ones, the privacy of individuals is protected, as their identities are dissociated from the data [34]. Synthetic faces, as shown in Fig. 1e, are customizable, diverse, and offer anonymity.

  e. Blacking out/blocking identifying attributes: Another approach to deidentifying face video or image data is to black out or block identifying attributes, as shown in Fig. 1f. This involves obscuring specific features or areas of the face that could potentially lead to the identification of individuals. By selectively blocking or blacking out attributes such as the eyes, nose, mouth, or other distinguishing facial characteristics, the anonymity of individuals in the data is preserved [35].
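As a concrete illustration of the first two techniques, the following Python sketch applies Gaussian blur and pixelation to a facial region with OpenCV. The file name, bounding box, and parameter values are hypothetical; any face detector could supply the box.

```python
import cv2

def blur_face(frame, box, ksize=51):
    """Apply Gaussian blur to the facial region given as (x, y, w, h)."""
    x, y, w, h = box
    roi = frame[y:y + h, x:x + w]
    frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (ksize, ksize), 0)
    return frame

def pixelate_face(frame, box, blocks=8):
    """Pixelate by downsampling the face region and scaling it back up."""
    x, y, w, h = box
    roi = frame[y:y + h, x:x + w]
    small = cv2.resize(roi, (blocks, blocks), interpolation=cv2.INTER_LINEAR)
    frame[y:y + h, x:x + w] = cv2.resize(small, (w, h),
                                         interpolation=cv2.INTER_NEAREST)
    return frame

frame = cv2.imread("driver_frame.png")   # hypothetical input frame
face_box = (120, 60, 160, 160)           # (x, y, w, h) from any face detector
cv2.imwrite("blurred.png", blur_face(frame.copy(), face_box))
cv2.imwrite("pixelated.png", pixelate_face(frame.copy(), face_box))
```

Lower `blocks` or larger `ksize` values give stronger anonymization at the cost of visual detail, which is exactly the privacy–utility trade-off discussed above.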

Fig. 1: Various common techniques for deidentification of faces

Fig. 2: Various secondary behaviors of drivers during naturalistic driving scenarios

While video blur, pixelation, creating composite faces, generating synthetic faces, and blacking out identifying attributes are commonly used techniques to deidentify face video or image data, they may not effectively protect the human attributes necessary for transportation research. These traditional methods can sometimes result in the loss of crucial details and subtle facial expressions that are essential for analyzing driver behavior and emotions. Therefore, more robust techniques are needed to balance privacy protection and the preservation of valuable information. Next, we discuss the nature of naturalistic driving study data and the considerations needed for deidentifying NDS data, particularly drivers’ face videos.

2.3 Naturalistic driving study (NDS) data

Naturalistic driving study (NDS) data are collected from drivers in unconstrained, real-world scenarios (as shown in Fig. 2), with the aim of capturing their typical behavior during regular driving.

The driver-facing, in-cabin videos are particularly valuable as they provide essential insights into the behavior of drivers. These videos capture key information such as gaze patterns, secondary behavior, and interactions with the vehicular environment. Analyzing this data allows researchers to study factors like driver distraction, fatigue, cognitive load, and situational awareness, contributing to the development of effective interventions and technologies for road safety. The combination of vehicle kinematics, radar data, GPS data, and in-cabin videos provides a rich and valuable resource for studying driver actions, improving road safety, and developing advanced driving technologies.

Sharing face videos in naturalistic driving studies (NDS) alongside numerical data is crucial for several reasons. While numerical data derived from face videos offer valuable insights into driver behavior, they often rely on computer vision algorithms for calculations, such as head pose estimations [36,37,38]. As the field of artificial intelligence continually evolves, these algorithms improve over time, leading to more accurate and refined analyses. By releasing only numerical data calculated using algorithms that were state of the art at the time of data collection, there is a risk of missing out on advancements made in the field since then. Additionally, the full potential and utility of the dataset may not be realized initially, and other researchers may discover alternative or more effective use cases for the face videos. Therefore, providing access to the video dataset ensures that researchers can leverage the most current and comprehensive data for their analyses, facilitating ongoing advancements in transportation research and road safety measures.

2.3.1 Considerations for deidentification of driver face videos

When sharing drivers’ face videos, deidentification methods typically involve the removal of PII features. These features can include the eyes, skin characteristics, and facial features such as the nose and lips. However, it is crucial to ensure that the deidentification process does not result in the loss of human attributes necessary for building intelligent systems.

Preserving critical information related to driving behavior is essential when deidentifying driver face videos. This includes preserving the ambient lighting conditions and the driver’s exhibited behavior during driving. Driver behaviors, such as head, eye, and lip movements, contain important insights that can be used in various applications in transportation research. For example, preserving lip movements allows researchers to monitor and analyze behaviors like yawning, laughing, and talking during driving. Similarly, preserving information such as eye movements enables researchers to monitor driver attentiveness.

Deidentifying data involving drivers’ faces presents challenges in computer vision, primarily due to factors such as variations in ambient illumination, driver appearance, and posture. Additionally, drivers can engage in various behaviors, such as smoking, drinking water, changing FM stations, or gesturing to other drivers on the road. Occlusions may occur due to hand movements covering the face or eating behavior. Another common challenge is dealing with wearables worn by drivers during the data collection process.

Taking these considerations into account, the deidentification of driver face videos from NDS data requires specialized techniques that balance privacy protection while preserving crucial human attributes necessary for transportation research and the development of intelligent systems.

2.4 Face swapping using computer vision

In recent years, face swapping has gained significant attention and popularity in computer vision, thanks to advancements in deep learning and the emergence of deepfakes [39]. Deepfakes are synthetic media generated using deep neural networks, such as variational autoencoders (VAEs) and GANs, which have the ability to replace one person’s face with another in images or videos seamlessly.

2.4.1 Generative adversarial networks

GANs are a class of deep learning models that have gained significant attention in recent years [40]. GANs are designed to generate synthetic data that closely resembles real data samples from a given distribution. The GAN framework consists of two main components: a generator network and a discriminator network. The generator is a neural network that learns to generate synthetic data samples, while the discriminator is another neural network that tries to distinguish between real and synthetic data samples. Let us denote the generator as G and the discriminator as D. The generator takes as input a random noise vector z sampled from a prior distribution p(z) and produces a synthetic data sample x. The discriminator takes a data sample x as input and outputs a probability D(x), indicating the probability that the sample is real.

The objective of GANs is to find an equilibrium where the generator produces synthetic data samples that are indistinguishable from real data samples according to the discriminator. This equilibrium is achieved through an adversarial training process, where the generator and the discriminator play a min-max game.

The adversarial objective The training objective of GANs can be expressed using the following minimax objective function:

$$\begin{aligned} \min _{G} \max _{D} V(D, G) = {\mathbb {E}}_{x \sim p_{\text {data}}(x)} [\log D(x)] + {\mathbb {E}}_{z \sim p(z)} [\log (1 - D(G(z)))] \end{aligned}$$
(1)

In Eq. (1), the discriminator aims to maximize V(D, G) by correctly classifying real and synthetic samples. On the other hand, the generator aims to minimize V(D, G) by producing synthetic samples that the discriminator is likely to classify as real.

The training process The training process of GANs involves iteratively updating the parameters of the generator and the discriminator. At each iteration, a batch of real data samples \(\{x^{(1)}, x^{(2)},\ldots , x^{(m)}\}\) is sampled from the real data distribution \(p_{\text {data}}(x)\), and a batch of noise vectors \(\{z^{(1)}, z^{(2)},\ldots , z^{(m)}\}\) is sampled from the prior distribution p(z).

Discriminator update To update the discriminator, we first compute the discriminator loss \({\mathcal {L}}_D\) as the negative log-likelihood of the discriminator’s predictions:

$$\begin{aligned} {\mathcal {L}}_D = -\frac{1}{m} \sum _{i=1}^{m} [\log D(x^{(i)}) + \log (1 - D(G(z^{(i)})))] \end{aligned}$$
(2)

We then update the discriminator’s parameters by taking a gradient step in the direction that minimizes \({\mathcal {L}}_D\).

Generator Update To update the generator, we compute the generator loss \({\mathcal {L}}_G\) as the negative log-likelihood of the discriminator’s incorrect predictions on synthetic samples:

$$\begin{aligned} {\mathcal {L}}_G = -\frac{1}{m} \sum _{i=1}^{m} \log D(G(z^{(i)})) \end{aligned}$$
(3)

We update the generator’s parameters by taking a gradient step in the direction that minimizes \({\mathcal {L}}_G\).
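The following PyTorch sketch shows one iteration of this adversarial loop, implementing the discriminator update of Eq. (2) and the (non-saturating) generator update of Eq. (3). The toy multilayer perceptrons, dimensions, and random stand-in batch are illustrative only and are unrelated to the face swapping models discussed later.

```python
import torch
import torch.nn as nn

# Minimal sketch of one adversarial training iteration (Eqs. 2 and 3).
z_dim, x_dim, m = 64, 784, 128
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()    # supplies the log-likelihood terms stably

x_real = torch.randn(m, x_dim)  # stand-in for a batch from p_data(x)

# Discriminator update (Eq. 2): push D(x_real) toward 1 and D(G(z)) toward 0.
z = torch.randn(m, z_dim)
x_fake = G(z).detach()          # detach so the D step does not update G
loss_D = bce(D(x_real), torch.ones(m, 1)) + bce(D(x_fake), torch.zeros(m, 1))
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator update (Eq. 3): maximize log D(G(z)), i.e., push D(G(z)) toward 1.
z = torch.randn(m, z_dim)
loss_G = bce(D(G(z)), torch.ones(m, 1))
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```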

GANs have revolutionized the field of generative modeling by providing a powerful framework for generating realistic synthetic data samples [41]. The adversarial training process, involving the generator and discriminator playing a min-max game, enables GANs to learn the underlying distribution of the real data and generate high-quality synthetic samples [42]. GANs are widely used to generate deepfakes. Deepfakes have found applications in various domains, including entertainment, social media, and forensics [43]. These techniques leverage large datasets of images and videos to train deep neural networks to learn the underlying patterns and features of human faces, enabling the generation of highly realistic and believable fake content.

Fig. 3: Examples of deepfakes generation using identity swap and face reenactment techniques

2.4.2 Deepfake generation

There are different types of deepfakes that have been developed, each with its own approach and application [44]. Broadly, on the basis of techniques, deepfakes can be divided into four major groups:

  a. Identity swap/face swapping: Identity swap, or face swapping, using GANs involves training a GAN model to learn the mapping between faces of different individuals [45]. The GAN consists of two components: a generator and a discriminator. The generator takes in an input face image from one person and generates a face image that resembles the target person. The discriminator’s role is to distinguish between real face images and the synthesized face images generated by the generator. During training, the generator tries to fool the discriminator into classifying its synthesized faces as real, while the discriminator aims to correctly distinguish between real and fake faces. By iteratively training the generator and discriminator, the GAN learns to generate realistic face images that successfully swap the identity of the source face onto the target face, as shown in Fig. 3a.

  b. Face reenactment/face animation: Face reenactment using GANs involves capturing and transferring the facial expressions and movements of one person onto another person’s face [46]. The process typically starts with face alignment and landmark detection to obtain corresponding facial landmarks between the source and target faces [47]. These landmarks are used to extract the facial expressions from the source face. A GAN model is then trained to generate a target face that not only matches the target identity but also aligns with the facial expressions from the source face. The generator learns to generate face images that mimic the desired expressions, while the discriminator distinguishes between real and synthesized facial expressions. By training the GAN with pairs of aligned source and target faces, the model learns to reenact facial expressions realistically on the target face, as shown in Fig. 3b.

  c. Attribute manipulation/face editing: Attribute manipulation, or face editing, using GANs involves modifying specific attributes or characteristics of a face image while preserving the overall identity, as shown in Fig. 4a. GAN-based approaches allow for direct control over specific attributes by manipulating the latent space of the generator network [48]. By modifying specific dimensions or vectors in the latent space, attributes such as age, gender, or hairstyle can be manipulated [49]. For instance, by traversing the latent space in the direction associated with “age,” one can make the generated faces appear older or younger. GANs enable attribute manipulation by learning a disentangled representation of the face images, where specific dimensions or vectors in the latent space correspond to different attributes [50].

  d. Face generation/synthesis: Face generation, or synthesis, using GANs involves training a GAN model to generate entirely new and realistic face images [51, 52]. The GAN model typically consists of a generator and a discriminator [53]. The generator takes random noise as input and synthesizes face images, while the discriminator distinguishes between real face images and synthesized ones. Through adversarial training, the generator learns to generate increasingly realistic face images that can fool the discriminator. The generator’s objective is to synthesize faces that are indistinguishable from real faces (as shown in Fig. 4b), while the discriminator’s objective is to accurately classify real and fake faces.

Fig. 4: Examples of deepfakes generation using attribute manipulation and face generation techniques

In this paper, our focus will be on the technique of identity swap or face swapping using GANs. We aim to leverage this approach to address the deidentification challenges in driver face videos within the context of Naturalistic Driving Study (NDS) data. By applying face swapping techniques, we can effectively replace the faces of drivers in the NDS dataset with synthetic or imposter faces while preserving the essential facial attributes and expressions necessary for transportation research. This allows us to balance privacy protection with the preservation of valuable information related to driver behavior, such as head movements, eye gaze patterns, and lip movements.

Fig. 5: Headshot of participants in the ORNL dataset used

Table 2 Physical features of participants

3 Methods and experimental setup

Driver behavior analysis in transportation research has evolved with the advent of computer vision and machine learning, offering more objective and automated methods than traditional manual approaches. To assess the effectiveness of our face deidentification algorithms, we establish a robust experimental framework. This involves selecting an appropriate dataset, defining metrics for behavioral attributes, assessing perception and image quality metrics, and implementing advanced face swapping algorithms. We provide an overview of the dataset, including participant demographics and video statistics, and focus on extracting driver behavioral attributes like head pose, eye, and lip movements. We also introduce metrics for perception and image quality assessment and detail face swapping algorithms and experimental setup, ensuring the reliability and validity of our study’s outcomes. This section serves as a crucial foundation for the subsequent analysis and interpretation of the study’s results.

3.1 Dataset

In this section, we provide a brief review of the dataset used in our study, highlighting the importance of its diverse and representative nature. The dataset was primarily collected by the Oak Ridge National Laboratory (ORNL) with support from the Virginia Tech Transportation Institute (VTTI) [54]. It was gathered as part of the Exploratory Advanced Research Program of the Federal Highway Administration.

The dataset comprises naturalistic driving data, including driving videos and front-facing headshot images of the participants. Nine participants were involved in the data collection, with each participant undertaking a short driving trip lasting between 6 and 10 min. These driving videos capture real-world driving scenarios, enabling us to evaluate the robustness and effectiveness of face deidentification algorithms in practical applications.

Alongside the driving videos, the dataset includes front-facing headshot images of the participants captured from various angles. The front-facing headshots of participants are shown in Fig. 5. The first row (from left) has participants with Participant IDs 873, 886, 863, 897, and 876, respectively. Similarly, the second row (from left) has participants with Participant IDs 906, 883, 880, and 893, respectively. These images were utilized for calculating anthropometric measurements, providing insights into the facial characteristics of the participants. Each image and video in the dataset is assigned a unique identification number, facilitating easy referencing and analysis.

3.1.1 Participants and anthropometric measures

The ORNL dataset is rich in demographic information, encompassing a diverse range of participants in terms of age and gender. This diversity allows us to thoroughly evaluate the validity of face deidentification algorithms across different demographic groups. Table 2 provides detailed information on the physical features of the participants, including their age, gender, and glasses usage. The glasses are categorized into three types: T0 (spectacles with clear lenses), T1 (photochromic glasses that darken in response to ultraviolet light), and T2 (sunglasses that obscure the eyes).

To quantify the morphology of the human face, we employed anthropometric facial analysis using the DLIB library [55]. DLIB is a widely-used computer vision library that offers a range of pre-trained models and algorithms for facial analysis tasks. In our study, we utilized DLIB’s capabilities for face detection and facial landmark prediction. The first step in our analysis was face detection, which involved locating and identifying faces within the images or video frames. DLIB’s face detection algorithm is robust and can effectively handle variations in lighting, pose, and occlusion. It provided accurate face detection even in challenging conditions. Once the faces were detected, we moved on to facial landmark prediction. Facial landmarks are specific points on the face that serve as important references for analyzing facial features.

Fig. 6: The 68-point facial landmarks given by DLIB

Using DLIB, we were able to detect facial landmarks, such as the corners of the eyes, the tip of the nose, and the corners of the mouth, with high precision as shown in Fig. 6. These landmarks were essential for calculating anthropometric measurements.

In our study, we focused on two anthropometric measurements: the face width-to-height ratio (FWHR) and the cheek-to-jaw width ratio (CJWR). The FWHR represents the ratio of face width to face height and provides insights into the overall facial structure. Two settings were considered for the FWHR analysis: FWHR-Brow, the ratio of face width to face height measured up to the eyebrow, and FWHR-Lid, the ratio of face width to face height measured up to the eyelid. The CJWR, on the other hand, quantifies the ratio of cheek width to jaw width, offering additional information about facial proportions. Details of the anthropometric measures for the participants are given in Table 3.

Table 3 Anthropometric measures of the participants and statistics regarding video length
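The sketch below outlines how such anthropometric measurements can be derived from DLIB’s 68-point landmarks. The specific landmark indices chosen to approximate face width, eyebrow height, eyelid height, and cheek/jaw widths are our illustrative assumptions; the exact points used for FWHR and CJWR in this study may differ.

```python
import dlib
import numpy as np

# Landmark-based anthropometry with DLIB (indices are illustrative choices
# for the 68-point scheme, not necessarily the paper's exact definitions).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_rgb_image("headshot.png")   # hypothetical headshot image
face = detector(img, 1)[0]                  # assume exactly one face present
pts = np.array([(p.x, p.y) for p in predictor(img, face).parts()])

def dist(a, b):
    return float(np.linalg.norm(pts[a] - pts[b]))

face_width = dist(1, 15)                    # approx. bizygomatic width
brow_y = (pts[19][1] + pts[24][1]) / 2      # mid-eyebrow height
lid_y = (pts[37][1] + pts[44][1]) / 2       # approx. upper-eyelid height
chin_y = pts[8][1]                          # chin (menton)

fwhr_brow = face_width / (chin_y - brow_y)  # FWHR measured to the eyebrow
fwhr_lid = face_width / (chin_y - lid_y)    # FWHR measured to the eyelid
cjwr = dist(1, 15) / dist(4, 12)            # cheek width over jaw width
print(fwhr_brow, fwhr_lid, cjwr)
```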

By utilizing this diverse and representative dataset, we can accurately evaluate the performance and effectiveness of face deidentification algorithms in capturing important human factors such as head movements, mouth movements, and eye blinking.

3.1.2 Statistics of driving videos

The ORNL driving videos, which were recorded in naturalistic driving scenarios, had an average duration of 6.59 min. These videos captured real-world driving situations and provided valuable insights into driver behavior and interaction with the environment. The videos were captured at 30 frames per second, ensuring a high-quality recording of the driving events. Each video contained an average of 11,862 frames, capturing the fine-grained details of the driving experience.

The videos showcased the dynamic nature of naturalistic driving, with changing lighting conditions throughout the recordings. These variations in lighting can be attributed to realistic environmental factors, such as shade from trees or houses, sunlight intensity changes, or reflections from surrounding objects. By capturing these naturalistic lighting conditions, the videos provide a realistic representation of the challenges faced by drivers in real-world driving scenarios. The specific number of frames in each video is given in Table 3, which provides a comprehensive overview of the video durations and frame counts for individual recordings.

3.2 Driver’s behavioral attributes from face video

In the analysis of driver behavior, in-cabin videos captured by the NDS are instrumental. These videos provide valuable insights for crash analysis and driver attention monitoring. Transportation safety researchers rely on various facial attributes extracted from driver face videos to understand different behavioral attributes. For example, eye movements can indicate the driver’s level of attentiveness, while mouth movements can reveal actions like yawning, eating, drinking, or speaking. Recent advancements in transportation research have emphasized the importance of facial detection algorithms, gaze estimation, and analysis of eye and lip movements. Therefore, it is crucial to consider the preservation of these vital features during the deidentification process. The focus of this research is to prioritize the retention of head movements, eye and lip movements, and fiducial points.

This section explores key attributes widely utilized in transportation safety research, derived from drivers’ face videos. Additionally, the section discusses metrics used to evaluate perception and image quality, and algorithms for face swapping, and provides an overview of the experimental setup.

3.2.1 Head pose

Head pose is a fundamental factor in driver behavior monitoring. It plays a crucial role in studying gaze direction, patterns, and visual attention of the driver. In terms of pose estimation, the human face is typically considered a rigid body with three degrees of freedom: roll, pitch, and yaw. Roll refers to the tilting motion, yaw is the rotation left and right, and pitch represents the up and down movement of the head. These angles provide valuable insights into the driver’s behavior and attention while operating the vehicle (Fig. 7).

Fig. 7: Illustration of rotation angles associated with the head pose. Yaw refers to the rotation around the Y-axis, pitch corresponds to the rotation around the X-axis, and roll represents the rotation around the Z-axis
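One common way to estimate these angles from a face video, sketched below, is to fit a generic 3D face model to a few 2D landmarks with OpenCV’s solvePnP and decompose the resulting rotation. The 3D model points and the pinhole camera approximation are standard illustrative choices, not necessarily the estimator used in this study.

```python
import cv2
import numpy as np

# Generic 3D face model (arbitrary units); an illustrative assumption.
model_3d = np.array([
    (0.0, 0.0, 0.0),            # nose tip
    (0.0, -330.0, -65.0),       # chin
    (-225.0, 170.0, -135.0),    # left eye, outer corner
    (225.0, 170.0, -135.0),     # right eye, outer corner
    (-150.0, -150.0, -125.0),   # left mouth corner
    (150.0, -150.0, -125.0),    # right mouth corner
], dtype=np.float64)

def head_pose(landmarks_2d, frame_w, frame_h):
    """landmarks_2d: 6x2 array of image points matching model_3d."""
    focal = frame_w  # rough pinhole approximation
    cam = np.array([[focal, 0, frame_w / 2],
                    [0, focal, frame_h / 2],
                    [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(model_3d, landmarks_2d.astype(np.float64),
                                  cam, None, flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)
    # Decompose the rotation matrix into Euler angles (degrees).
    sy = np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arctan2(-R[2, 0], sy))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return pitch, yaw, roll
```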

3.2.2 Eye movements

Eye movements are essential indicators of driver behavior and can provide valuable information about their level of attentiveness and engagement in the driving task. By analyzing eye movements, researchers can gain insights into various aspects of driver behavior, including drowsiness, distraction, and visual attention. In this section, we focus on the eye aspect ratio (EAR) and pupil circularity (PUC) as parameters for evaluating eye movements.

Eye aspect ratio (EAR) EAR is a commonly used parameter derived from eye landmarks obtained through facial analysis techniques like the Dlib toolkit [55]. EAR is calculated by measuring the ratio of distances between specific landmarks around the eyes. This ratio can indicate the openness or closure of the eyes, providing valuable information about the driver’s level of attentiveness and potential signs of drowsiness.

Traditionally, methods like ellipse fitting have been employed to assess the state of the eyes. These methods involve segmenting the pupils and fitting an ellipse based on the size of the eye region [56]. However, these techniques may face challenges in accurately segmenting eyes, particularly in real-world driving scenarios with factors like glasses or varying lighting conditions.

Fig. 8
figure 8

Landmarks for the calculation of eye aspect ratio and pupil circularity

The eye height–width ratio (EHWR) is another frequently employed parameter [57] that assesses the eyes by calculating the ratio between their height and width. This ratio is determined using only four landmark points of the eye, which can result in inaccuracies, particularly in real-world driving scenarios. To address these challenges and ensure more reliable measurements, we adopted the EAR parameter based on facial landmarks. By relying on specific landmarks, EAR avoids the potential inaccuracies associated with traditional image segmentation techniques. Figure 6 illustrates the 68 facial landmarks, with six major landmark points dedicated to each eye (Fig. 8). These landmarks play a crucial role in calculating the EAR and accurately assessing the state of the eyes. The expression for the calculation of EAR is shown in Eq. (4).

$$\begin{aligned} \text {Eye aspect ratio (EAR)} = \frac{(d_{v1}^e + d_{v2}^e)}{2 \times d_{h}^e} \end{aligned}$$
(4)

where \(d_{v1}^e\) is the distance between P2 and P6 (from Fig. 8). Similarly, \(d_{v2}^e\) is the distance between P3 and P5. \(d_{h}^e\) is the horizontal length of eyes (distance between P1 and P4).
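A minimal implementation of Eq. (4), assuming the six eye landmarks are supplied in the P1–P6 order of Fig. 8, could look as follows.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: 6x2 array of landmarks P1..P6 (Fig. 8), in that order."""
    d_v1 = np.linalg.norm(eye[1] - eye[5])  # P2-P6
    d_v2 = np.linalg.norm(eye[2] - eye[4])  # P3-P5
    d_h = np.linalg.norm(eye[0] - eye[3])   # P1-P4
    return (d_v1 + d_v2) / (2.0 * d_h)      # Eq. (4)

# With DLIB's 68-point output, the two eyes occupy indices 36-41 and 42-47
# (0-indexed), so for a landmark array `pts` one can average both eyes:
# ear = (eye_aspect_ratio(pts[36:42]) + eye_aspect_ratio(pts[42:48])) / 2
```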

To establish the maximum and minimum bounds for the EAR, empirical observations were conducted on the anthropometric measures of drivers’ faces. A hypothesis was formulated that the width of the eye would be twice the mean of the distances \(d_{v1}^e\) and \(d_{v2}^e\). Based on this hypothesis, the maximum bound for EAR can be deduced using Eq. (5):

$$\begin{aligned} \text {EAR} = \frac{(d_{v1}^e + d_{v2}^e)}{2 \times d_{h}^e} \end{aligned}$$
(5)

Since \(d_{h}^e = 2 \times \frac{d_{v1}^e + d_{v2}^e}{2} = d_{v1}^e + d_{v2}^e\) at maximum EAR (i.e., when the eye is fully open), Eq. (5) becomes:

$$\begin{aligned} \text {EAR}_{\text {max}} = \frac{d_{v1}^e + d_{v2}^e}{2 \times (d_{v1}^e + d_{v2}^e)} \end{aligned}$$
(6)
$$\begin{aligned} \text {EAR}_{\text {max}} = \frac{1}{2} \end{aligned}$$
(7)

Hence, from Eq. (7), the upper bound for EAR is 0.5. By examining the maximum EAR values obtained from the frames analyzed, it was found that the highest EAR recorded was 0.47, which validates the hypothesized upper bound of 0.5.

EAR provides valuable insights into the driver’s attention levels, as changes in the ratio can indicate variations in eye movement patterns, such as blinking frequency and duration. By monitoring EAR, researchers can identify moments of decreased attentiveness, potential signs of fatigue or distraction, and take appropriate measures to ensure driver safety.

Pupil circularity (PUC) In addition to the EAR, another parameter that provides valuable insights into driver eye movements is PUC. PUC focuses specifically on the circularity of the pupil, offering complementary information to EAR [58].

PUC is calculated based on the shape of the pupil and is derived from Eq. (8). It measures the roundness of the pupil by considering the ratio of the pupil’s area to the square of its perimeter. A person with partially or almost closed eyes will have a significantly lower PUC value compared to someone with fully open eyes. This is due to the squared term in the denominator, making PUC a more sensitive metric for assessing the level of eye closure.

$$\begin{aligned} \text {Pupil circularity (PUC)} = \frac{4 \times \pi \times \text {Area}}{\text {Perimeter}^2} \end{aligned}$$
(8)

where \(\text {Area} = \left( \frac{d_r^p}{2}\right) ^2 \times \pi \), with \(d_r^p\) the distance between P2 and P5; \(\text {Perimeter} = d_{p1}^{p2} + d_{p2}^{p3} + d_{p3}^{p4} + d_{p4}^{p5} + d_{p5}^{p6} + d_{p6}^{p1}\); and \(d_{a}^{b}\) is the distance between points a and b from Fig. 8.

Table 4 Statistics of various human attributes for ORNL dataset used
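A corresponding sketch of Eq. (8), under the same P1–P6 landmark ordering, approximates the pupil diameter by the P2–P5 distance as defined above.

```python
import numpy as np

def pupil_circularity(eye):
    """eye: 6x2 array of landmarks P1..P6 (Fig. 8). Implements Eq. (8),
    approximating the pupil diameter by the P2-P5 distance."""
    diameter = np.linalg.norm(eye[1] - eye[4])          # d_r^p: P2-P5
    area = np.pi * (diameter / 2.0) ** 2
    perimeter = sum(np.linalg.norm(eye[i] - eye[(i + 1) % 6])
                    for i in range(6))                  # P1-P2, ..., P6-P1
    return (4.0 * np.pi * area) / perimeter ** 2
```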

Similar to EAR, a decrease in PUC can indicate drowsiness or reduced attentiveness. When a driver becomes drowsy, their eyes tend to exhibit less circularity, resulting in a lower PUC value. By monitoring PUC, researchers can identify potential signs of fatigue or drowsiness, enabling timely interventions to ensure driver safety.

The combination of EAR and PUC provides a comprehensive understanding of driver eye movements. While EAR captures the overall eye openness and attentiveness, PUC focuses specifically on the circularity of the pupil. By analyzing both parameters together, researchers can gain deeper insights into the driver’s visual attention and detect subtle changes that may indicate reduced alertness.

3.2.3 Lip movements

In addition to analyzing eye movements, lip movements also play a significant role in understanding driver behavior and ensuring transportation safety, particularly in the context of Advanced Driver Assistance Systems (ADAS). The detection and analysis of lip movements provide valuable insights into driver attentiveness, engagement, and potential distractions.

Fig. 9
figure 9

Landmarks for the calculation of lip aspect ratio

The lip landmarks are obtained using the Dlib library as shown in Fig. 9. These landmarks enable the measurement of the lip aspect ratio (LAR), which evaluates the shape and movements of the lips. The LAR is computed using Eq. (9). The numerator of the equation represents the vertical length of the mouth, while the denominator represents the horizontal length of the mouth. As a ratio of distances, the LAR is dimensionless.

$$\begin{aligned} \text {Lip aspect ratio (LAR)} = \frac{d_v^L}{d_h^L} \end{aligned}$$
(9)

where \(d_v^L\) is the vertical distance between L3 and L7 (from Fig. 9). Similarly, \(d_h^L\) is the horizontal distance between L1 and L5.
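A minimal implementation of Eq. (9), assuming eight outer-lip landmarks supplied in the L1–L8 order of Fig. 9, could look as follows.

```python
import numpy as np

def lip_aspect_ratio(lips):
    """lips: 8x2 array of outer-lip landmarks L1..L8 (Fig. 9)."""
    d_v = np.linalg.norm(lips[2] - lips[6])  # L3-L7: vertical mouth opening
    d_h = np.linalg.norm(lips[0] - lips[4])  # L1-L5: mouth width
    return d_v / d_h                         # Eq. (9), dimensionless
```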

By monitoring changes in the LAR, researchers can gain insights into various aspects of driver behavior. For instance, an increased LAR value may indicate movements associated with talking, eating, or engaging in secondary activities while driving, and it can also be used to assess yawning behavior. Understanding these lip movements thus helps in assessing the driver’s level of distraction and potential risks on the road. Hence, by considering both eye and lip movements, researchers can capture a broader range of behavioral attributes, enhancing their ability to detect signs of fatigue, distraction, or other safety-critical factors.

3.2.4 Statistics of various human attributes in ORNL dataset

In our study, we analyzed various attributes using the ORNL dataset. Table 4 presents the statistical information regarding these attributes, including the maximum, minimum, mean, and standard deviation values observed.

For the EAR and PUC, we specifically considered subjects without glasses, as well as subjects wearing T0 and T1 glasses. The maximum EAR observed in the dataset was 0.47, while the minimum EAR was 0.06. On average, the EAR value was found to be 0.26. Similarly, for PUC, the maximum value was 0.70, the minimum value was 0.21, and the mean value was 0.43. This wide range of EAR and PUC values allows us to robustly assess the effectiveness of algorithms designed to analyze eye movements under diverse conditions, including blinks, drowsiness, and other variations that might occur during real-world driving scenarios.

Analyzing the ORNL dataset, we observed a diverse range of LAR values. The maximum LAR recorded in the dataset was 0.63, while the minimum value was 0.0, indicating closed lips. On average, the LAR value across subjects and frames was found to be 0.06. This spectrum of LAR values enables a comprehensive evaluation of algorithms assessing mouth movements under varying circumstances, such as yawns, eating behavior, and instances of an open mouth during different driving scenarios.

Furthermore, we also examined the statistics of head movements, including pitch, roll, and yaw. The maximum pitch angle observed in the dataset was 45.53 degrees, while the minimum pitch angle was \(-\)54.49 degrees. The mean pitch angle was 1.71 degrees. For roll, the maximum value was 45.01 degrees, the minimum value was \(-\)38.40 degrees, and the mean roll angle was 0.17 degrees. Lastly, for yaw, the maximum observed angle was 87.77 degrees, the minimum angle was \(-\)89.19 degrees, and the mean yaw angle was \(-\)7.84 degrees. Using these statistical insights, we emphasize that the statistics of various human attributes in the ORNL dataset serve as foundational insights crucial for assessing the effectiveness of the proposed study. These statistics provide essential context regarding the range and variability of human attributes such as EAR, PUC, and LAR, which are fundamental for understanding driver behavior and facial expressions in naturalistic driving scenarios. By presenting the maximum, minimum, mean, and standard deviation values of these attributes, we establish a baseline understanding of the typical range of human behaviors observed in the dataset. This information enables us to gauge the robustness and generalizability of algorithms designed to analyze eye and mouth movements, head poses, and other human attributes across diverse driving conditions.

3.3 Metrics for assessment of human emotions

Understanding human emotions in drivers’ videos is crucial for several reasons. Emotions can significantly influence driving behavior, impacting factors such as attention, reaction time, and decision-making. For example, detecting signs of stress or fatigue in a driver’s facial expressions can alert systems to potential risks of accidents, prompting interventions like alerts or reminders. Additionally, recognizing positive emotions like happiness can provide insights into the overall driving experience, enabling the design of more user-friendly and engaging vehicle interfaces. From a practical standpoint, this understanding contributes to the development of advanced driver-assistance systems that enhance safety and overall driving performance. Thus, to ensure the utility of deidentified videos, it is essential to preserve facial emotions accurately. To evaluate the preservation of facial emotions, we employ a vision transformer-based algorithm [59] fine-tuned on the FER2013 dataset [60], comprising facial images categorized into seven emotions: angry, disgust, fear, happy, sad, surprise, and neutral.

Evaluation of results is conducted by comparing the predicted emotions of the deidentified video with ground truth emotions obtained from the original data. Common evaluation metrics such as accuracy, precision, and recall are employed for this purpose. These metrics are defined as follows (Eqs. 10–12):

$$\begin{aligned} \text {Accuracy} = \frac{\text {TP} + \text {TN}}{\text {TP} + \text {TN} + \text {FP} + \text {FN}} \end{aligned}$$
(10)
$$\begin{aligned} \text {Precision} = \frac{\text {TP}}{\text {TP} + \text {FP}} \end{aligned}$$
(11)
$$\begin{aligned} \text {Recall} = \frac{\text {TP}}{\text {TP} + \text {FN}} \end{aligned}$$
(12)

where TP (true positive) represents correctly identified emotions, TN (true negative) represents correctly rejected emotions, FP (false positive) represents incorrectly identified emotions, and FN (false negative) represents incorrectly rejected emotions. These metrics provide insights into the effectiveness of the deidentification process in preserving facial emotions accurately.
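In practice, these metrics can be computed per frame with standard tooling. The sketch below uses scikit-learn with hypothetical labels and macro-averages precision and recall over the seven emotion classes, which is one reasonable multiclass reduction of Eqs. (11) and (12).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical per-frame emotion labels: ground truth from the original
# video versus predictions on the deidentified video.
y_true = ["neutral", "happy", "neutral", "surprise", "sad"]
y_pred = ["neutral", "happy", "neutral", "neutral", "sad"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred,
                                    average="macro", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred,
                                 average="macro", zero_division=0))
```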

3.4 Metrics for perception and image qualities

In the field of image quality assessment, several metrics are commonly used to evaluate the similarity and quality of two images. When addressing tasks related to the deidentification of facial videos, the robustness of these metrics becomes crucial: frames whose metric values differ significantly from expected values must be scrutinized to ensure the privacy of human subjects. Because participant privacy cannot be compromised, multiple image quality metrics were used in this experiment. Using several metrics adds robustness to the identification of frames that should be scrutinized; moreover, no single metric aligns fully with human perception of image quality. Image quality is usually assessed using a full-reference metric, which directly compares the test image against an undistorted reference image.

3.4.1 Mean squared error

Mean squared error (MSE) is a full-reference metric [61]. It is a simple yet robust metric that measures the average squared difference between the original and deidentified pixel values. There is no absolute threshold for what MSE value is acceptable; it depends on the use case. The general rule of thumb is that the lower the value, the better the quality of the deidentified image, and an MSE of zero indicates that the two images are identical. Mathematically, MSE is formulated as shown in Eq. (13).

$$\begin{aligned} \text {MSE} = \frac{1}{{m \times n}} \sum _{i=0}^{m-1} \sum _{j=0}^{n-1} [I(i,j)-D(i,j)]^2 \end{aligned}$$
(13)

where m and n are the height and width of the image in pixels; i and j are the row and column indices of the given image; \(I(i, j)\) is the original image; and \(D(i, j)\) is the deidentified image.

3.4.2 Root mean squared error

Root mean squared error (RMSE) is another widely used metric to measure the differences between the original image and the deidentified image. It is simply the square root of the MSE [62]. Mathematically, RMSE is formulated as Eq. (14).

$$\begin{aligned} \text {RMSE} = \sqrt{\text {MSE}} \end{aligned}$$
(14)

3.4.3 Peak signal-to-noise ratio

The peak signal-to-noise ratio (PSNR) is the ratio between the maximum possible signal power and the power of the distorting noise that affects the quality of its representation [63]. PSNR is widely used to quantify reconstruction loss. Mathematically, PSNR is given by Eq. (15).

$$\begin{aligned} \text {PSNR} = 10 \log _{10} \left( \frac{{\text {peakval}}^2}{{\text {MSE}}} \right) \end{aligned}$$
(15)

where peakval is the maximum possible pixel value; for the 8-bit ORNL dataset used here, it is 255.
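The three error-based metrics of Eqs. (13)–(15) reduce to a few lines of NumPy; the sketch below assumes 8-bit images passed as arrays of equal shape.

```python
import numpy as np

def mse(I, D):
    """Eq. (13): mean squared error between original I and deidentified D."""
    return np.mean((I.astype(np.float64) - D.astype(np.float64)) ** 2)

def rmse(I, D):
    """Eq. (14): square root of the MSE."""
    return np.sqrt(mse(I, D))

def psnr(I, D, peakval=255.0):
    """Eq. (15): returns infinity for identical images (MSE of zero)."""
    m = mse(I, D)
    return np.inf if m == 0 else 10.0 * np.log10(peakval ** 2 / m)
```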

3.4.4 Universal image quality index

This image quality metric goes beyond traditional error metrics, which are mostly based on error summation. The universal image quality index (UIQI) instead assesses image quality as a combination of three factors: loss of correlation, luminance distortion, and contrast distortion [64]. Mathematically, the UIQI for a pair of images, with original image I and deidentified image D, \(Q_{(I/D)}\), is given by Eq. (16).

$$\begin{aligned} Q_{(I/D)} = \frac{1}{{n \times m}} \sum _{i=1}^n \sum _{j=1}^m Q_{ij} \end{aligned}$$
(16)

where

$$\begin{aligned} Q = \frac{\sigma _{xy}}{\sigma _x \sigma _y} \cdot \frac{2 {\bar{x}} {\bar{y}}}{{\bar{x}}^2 + {\bar{y}}^2} \cdot \frac{2 \sigma _x \sigma _y}{\sigma _x^2 + \sigma _y^2} \end{aligned}$$
(17)

and m and n are the height and width of the image in pixels; \(x = \{x_1, \ldots , x_n\}\) and \(y = \{y_1, \ldots , y_n\}\) are the original and test image signals; \({\bar{x}}\) and \({\bar{y}}\) are the means of x and y; \(\sigma _x\) and \(\sigma _y\) are the standard deviations of x and y; and \(\sigma _{xy}\) is the covariance of x and y.

In the context of the UIQI, the variables m and n represent the dimensions of the images being compared. Let us assume that image I has dimensions m\(\times \)n, which means it has m rows and n columns. Similarly, image D also has dimensions m\(\times \)n. In Eq. (16), the double summation is performed over the indices i and j, ranging from 1 to m and 1 to n, respectively. This means that the calculation of UIQI involves iterating over each pixel in the images I and D to compute the quality index for each pixel and then averaging them over the entire image.

3.4.5 Spectral angle mapper

The spectral angle mapper (SAM) metric assesses the similarities between the two images in terms of the spectral features. It is basically the cosine of the angle formed between the reference spectrum and the image spectrum. In this work, the reference is the deidentified image, and the image is the original image. The mathematical formulation for SAM is given as Eq. (18).

$$\begin{aligned} \cos (\alpha ) = \frac{{\sum XY}}{{\sqrt{{\sum X^2} \times {\sum Y^2}}}} \end{aligned}$$
(18)

3.4.6 Relative dimensionless global error synthesis

Erreur relative globale adimensionnelle de synthèse (ERGAS), which translates to “relative dimensionless global error synthesis,” determines the quality of images in terms of the normalized average error of each band of the processed image [65]. This image quality metric is highly sensitive to distortion: a higher value indicates more distortion in the deidentified image, whereas a lower value indicates less. Mathematically, ERGAS is given by Eq. (19).

$$\begin{aligned} \text {ERGAS} = 100 \times \frac{{b}}{{l}} \sum _{i=1}^N \left( \frac{{\text {RMSE}(i)}}{{\mu (i)}}\right) \end{aligned}$$
(19)

where b and l represent the high-spatial-resolution and low-spatial-resolution images, \(\mu (i)\) is the mean radiance of spectral band i, and N is the number of bands.

3.4.7 Cosine similarity

Cosine similarity is a metric used to measure the similarity between two vectors by calculating the cosine of the angle between them. In the context of image quality assessment, cosine similarity can be applied to compare the similarity between two images represented as vectors.

Mathematically, cosine similarity between two vectors A and B is calculated as follows:

$$\begin{aligned} \text {Cosine Similarity} = \frac{{A \cdot B}}{{|A| \times |B|}} \end{aligned}$$
(20)

where \(A \cdot B\) represents the dot product of vectors A and B, and |A| and |B| denote the Euclidean norms of vectors A and B, respectively.

In the context of image quality assessment, the images can be vectorized using techniques like histogram-based methods or deep learning-based feature extraction. Once the images are represented as vectors, cosine similarity can be calculated to determine their similarity. In this work, we utilized a pre-trained convolutional neural network (CNN), specifically VGG16, for calculating the cosine similarity between two face images. A cosine similarity value of 1 represents perfect similarity, while 0 indicates no similarity, and \(-1\) represents complete dissimilarity. Thus, the cosine similarity metric serves as a reliable measure of similarity, considering both the orientation and magnitude of the feature vectors. The purpose of our study was to assess the similarity between images and evaluate their quality based on the cosine similarity metric.
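The sketch below illustrates this pipeline with a pre-trained VGG16 from Keras, using ImageNet weights, global average pooling as the feature layer, and hypothetical file names; the exact layer and preprocessing used in this study may differ.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# Embed two face crops with VGG16 (no classifier head) and compare the
# resulting feature vectors with cosine similarity (Eq. 20).
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x, verbose=0)[0]

a = embed("original_face.png")       # hypothetical file names
b = embed("deidentified_face.png")
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cos_sim)  # 1 = identical features, 0 = no similarity, -1 = opposite
```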

These metrics provide a comprehensive assessment of image quality, allowing for the evaluation of deidentified frames in terms of their similarity to the original frames.

3.5 Algorithms for face swapping

In our implementation, we have used multiple algorithms to test their performance under different conditions. The purpose of using multiple algorithms is to assess which algorithm performs best under varying factors. By evaluating the algorithms using various performance metrics, we can determine the most effective algorithm for our specific application. This approach allows us to compare the algorithms’ performance in achieving accurate and realistic face swaps, handling different facial expressions and poses, maintaining identity consistency, and computational efficiency.

3.5.1 Face swapping GAN (FSGAN)

Face swapping GAN (FSGAN) [66] is a novel approach for face swapping and reenactment that is subject agnostic, meaning it can be applied to any pair of faces without requiring training on those faces. It has three main components: (i) a recurrent reenactment generator and segmentation generator, (ii) an inpainting generator, and (iii) a blending generator. The RNN-based face reenactment network adjusts for both pose and expression variations, allowing for more realistic and natural face swaps and reenactments. Similarly, the continuous interpolation method for face views allows smooth transitions between different face views in a video sequence, which is important for producing realistic and natural-looking videos of face swaps and reenactments. FSGAN combines multiple loss functions: a reconstruction loss [66], a domain-specific perceptual loss [67], and an adversarial loss [68]. For face blending, the network uses a novel Poisson blending loss that combines Poisson optimization with a perceptual loss, resulting in more accurate and realistic blending. While the method has advantages for face reenactment and swapping, it can sometimes produce artifacts, such as blurry edges or unnatural-looking skin, typically due to the difficulty of blending two faces together seamlessly. For this work, we used FSGAN in two settings: (i) without fine-tuning and (ii) with fine-tuning.

3.5.2 Generative high-fidelity one-shot transfer (GHOST)

Generative high-fidelity one-shot transfer (GHOST) is a one-shot pipeline for image-to-image and image-to-video face swapping [69]. The architecture of GHOST builds upon the Adaptive Embedding Integration Network (AEI-Net) [70] to generate high-fidelity face swapping results. AEI-Net consists of an identity encoder that extracts the identity from the source image, a multilevel attributes encoder that extracts attributes of the target image, and an adaptive attentional denormalization (AAD) generator that produces the swapped images. On top of these, a multiscale discriminator is used to improve the quality of the output by comparing real and fake images. Through GHOST, the authors address specific challenges of face swapping in video, such as face jittering and other distortions that can occur when the face is processed frame by frame; GHOST mitigates these with a video-specific face alignment method and a super-resolution block that improves the quality of the generated faces. The authors use various losses, including reconstruction loss [71], attribute loss, identity loss, and adversarial loss, and additionally introduce a new loss function, an eye loss, for preserving gaze information.

Table 5 Different algorithms used along with the loss functions used

3.5.3 InfoSwap

InfoSwap [72] is a method for disentangling identity and identity-irrelevant information in face images. It is based on the information bottleneck (IB) principle [73,74,75], which states that the optimal representation of a data point contains the maximum amount of information relevant to a particular task while minimizing the amount of irrelevant information. The IB principle can thus guide the learning of a disentangled representation by maximizing the information about identity in the representation while minimizing the information about other factors, such as pose, expression, and lighting.

To achieve this, InfoSwap first uses a pre-trained face recognition model to extract a latent representation of the source image. This representation is passed through an IB bottleneck, implemented as a fully connected layer with a small number of output units, which reduces the dimensionality of the representation while preserving the maximum amount of identity information. The resulting representation is then used to generate a face image with the identity of the source image but the appearance of the target image. The information about identity in the latent representation is measured as the mutual information between the representation and the identity label, while the information about other factors, such as pose, expression, and lighting, is measured as the mutual information between the representation and a set of nuisance factors. The bottleneck is optimized with a loss function that minimizes the information about nuisance factors while maximizing the information about identity.

InfoSwap also proposes a novel identity contrastive loss to further disentangle identity and identity-irrelevant information in the latent representation: the generated face image is required to have an identity similar to the source image but different from the target image, ensuring that the latent representation contains information specific to the identity of the source image. This information contrastive loss (ICL) forms part of the information bottleneck (IB) loss. In addition, InfoSwap uses a perceptual loss, a cycle consistency loss, and an adversarial loss, which together form the total objective function. In our implementation, we used InfoSwap in two settings: (i) without kernel smoothing and (ii) with kernel smoothing [76].

3.5.4 SimSwap

SimSwap is another face swapping framework, capable of transferring the identity of an arbitrary source face onto an arbitrary target face while preserving the attributes of the target face. Compared with previous face swapping techniques, SimSwap has two main novel contributions: (i) an ID injection module (IIM) and (ii) a weak feature matching loss. The IIM transfers the identity information of the source face into the target face at the feature level: features are first extracted from both the source and target faces, and the source identity features are injected into the target features by multiplying the target features by a matrix trained to represent the identity information of the source face. The weak feature matching loss helps the framework preserve the facial attributes of the target face in an implicit way, by encouraging the features of the generated face to be similar to those of the target face; the loss is deliberately weak so that it does not interfere with the identity transfer process. The method uses multiple loss functions, including identity loss, reconstruction loss, weak feature matching loss, adversarial loss, and gradient penalty [77, 78]. The loss functions of the different algorithms used in our work are given in Table 5.

3.6 Experimental setup

In this section, we describe the experimental setup for our study. We outline the steps involved in assessing the quality of the dataset, performing face swapping using multiple algorithms, planning for error analysis, and guidelines for large-scale deidentification. The high-level overview of the deidentification process and evaluation of deidentified videos is given in Fig. 10.

Fig. 10 Overview of the deidentification process and evaluation outline for videos and error analysis plan

Fig. 11 Qualitative comparison of various algorithms in terms of face swapping

3.6.1 Dataset quality assessment

Before conducting face swapping experiments to assess the effectiveness of algorithms, it is crucial to assess the quality of the dataset used. This assessment helps ensure that the dataset contains diverse facial images, which are essential for drawing insights into the effectiveness of face swapping in driving scenarios. The analysis of various physiological features and anthropometric measures is given in Sect. 3.1.

3.6.2 Face swapping

The face swapping process involves replacing the face in one image (target image) with the face from another image (source image or imposter image) while preserving the target image’s facial expressions and characteristics. Our implementation utilizes state-of-the-art face swapping algorithms, including SimSwap [71], FSGAN [66], InfoSwap [72], and GHOST [69]. Videos of drivers under naturalistic driving conditions are split into frames, face swapping is applied to each frame, and the resulting deidentified frames are stitched back together to create deidentified videos, as sketched below.
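A minimal OpenCV sketch of this split-swap-stitch pipeline follows; `swap_face` is a hypothetical placeholder for any of the evaluated models (SimSwap, FSGAN, InfoSwap, GHOST).

```python
import cv2

def deidentify_video(in_path: str, out_path: str, swap_face) -> None:
    """Split a driving video into frames, apply a face swapping model
    frame by frame, and stitch the deidentified frames back into a video."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(swap_face(frame))  # deidentified frame, same size as input
    cap.release()
    writer.release()
```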

3.6.3 Error analysis plan

In order to assess the effectiveness of face swapping algorithms, we analyze errors in the various human attributes described in Sect. 3.2 and the similarity measures described in Sect. 3.4. We perform a detailed analysis of these metrics to draw conclusions on the usefulness of the face swapping measures.

3.6.4 Large-scale deidentification

In addition to the individual face swapping experiments, we also provide guidelines for large-scale deidentification. This involves developing a systematic approach to applying face swapping algorithms to a large volume of videos, ensuring scalability and efficiency. We outline the steps mainly with regard to facilitating automated deidentification with human-in-the-loop validation.

4 Experiments and results

In this section, we present the experiments and corresponding results obtained. Figure 11 shows qualitative results of various algorithms for face swapping. Our experiments focus on assessing the effectiveness of face swapping algorithms for deidentification purposes, as well as analyzing errors in various human attributes and conducting a quantitative analysis of deidentified videos. Additionally, we explore the use of synthetic faces in deidentification.

4.1 Qualitative analysis of results

In order to assess the effectiveness of face swapping algorithms for deidentification purposes, we conducted a qualitative analysis of the results obtained from our experiments. To ensure a comprehensive and unbiased evaluation of the deidentified videos, we employed four human evaluators. Each evaluator independently assessed the videos based on the rating scale in Table 6, considering facial alignment, facial expression matching, skin texture and color, hair and facial features, and overall realism and naturalness. The use of multiple evaluators helped mitigate individual biases and provided a more robust assessment of the face swapping algorithms.

Table 6 presents a rating scale ranging from 1 to 5 for each of the evaluated aspects. A rating of 1 indicates poor performance with noticeable distortions, misalignments, and inconsistencies, while a rating of 5 indicates excellent performance with indistinguishable results from the original videos. The intermediate ratings represent varying levels of performance and quality.

The qualitative analysis of the results revealed valuable insights into the performance of the face swapping algorithms. The evaluators observed that the quality of facial alignment varied across the algorithms, with some algorithms exhibiting noticeable distortions, misplacements, or misalignments. Facial expression matching also showed variations, with some algorithms producing completely mismatched expressions or unnatural and inconsistent results.

The evaluators also assessed the skin texture and color, hair, and facial features of the deidentified videos. They identified issues such as blurry and inconsistent skin texture, mismatched color tones, and noticeable distortions and artifacts in hair and facial features. The overall realism and naturalness of the videos were also considered, with some algorithms producing completely unrealistic and unnatural results, while others achieved a high level of realism and naturalness.

Table 6 Rubric for assessing face swap quality

Figure 12 shows an overview of the quality of face swapping algorithms for naturalistic driving videos across the aspects listed in Table 6. The aspect-based analysis of face swapping outcomes highlighted significant disparities in the performance of the various algorithms. Notably, InfoSwap consistently received lower ratings across several categories, indicating poor facial alignment, facial expression matching, and skin texture and color preservation, as well as significant flaws in hair and facial feature preservation. In contrast, SimSwap consistently outperformed the other algorithms, achieving high ratings in these aspects and delivering overall realistic and natural results with only minor imperfections.

Fig. 12 Evaluation of face swapping in terms of various qualitative measures

4.1.1 Overall evaluation of face swapping results

Figure 13 presents the stacked plot of total scores, which provides an overview of each algorithm’s performance across all categories. The total scores represent the cumulative evaluation scores for facial alignment, facial expression matching, skin texture and color, hair and facial features, and overall realism and naturalness. The stacked plot allows a visual comparison of the total scores given by each evaluator for each algorithm. The SimSwap algorithm performs best among all the algorithms, with a cumulative score of 85 out of 100.

Fig. 13 Overall evaluation of face swapping results

4.2 Error analysis in head movements

Table 7 summarizes the error analysis for head movements across different algorithms. Lower values indicate better accuracy. SimSwap consistently outperformed other algorithms, achieving the lowest errors for roll, pitch, and yaw head movements, with a mean absolute error of 2.94 degrees. GHOST also performed well, particularly in roll and pitch movements. In contrast, FSGAN and InfoSwap exhibited higher errors in head movement reproduction. These results highlight SimSwap’s superior accuracy in preserving head movements in driver videos.
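As an illustration of how the per-axis errors in Table 7 can be computed, a minimal sketch follows; the array layout and function name are assumptions.

```python
import numpy as np

def head_pose_mae(angles_orig: np.ndarray, angles_deid: np.ndarray) -> dict:
    """Mean absolute error (degrees) per head-pose axis across all frames.
    Inputs are shaped (n_frames, 3) with columns roll, pitch, yaw."""
    mae = np.mean(np.abs(angles_orig - angles_deid), axis=0)
    return dict(zip(("roll", "pitch", "yaw"), np.round(mae, 2)))
```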

In order to provide more granular detail on how roll, pitch, and yaw angles were preserved across frames, Fig. 14 shows the angular error across the videos of the ORNL dataset using the SimSwap algorithm. The errors were calculated at the video level. Among the head movement errors around the different axes, yaw errors were the most prominent for SimSwap as well.

Figure 15 presents the error statistics for face swapping based on different gender combinations: FF (female face replaced by a female imposter face), FM (female face replaced by a male imposter face), MF (male face replaced by a female imposter face), and MM (male face replaced by a male imposter face). Figure 15 shows that the lowest error rates were observed when a female face in the video was replaced by a female imposter face. The errors slightly increased when a female face was replaced by a male imposter face (FM) or when a male face was replaced by a female imposter face (MF). The highest error rates were observed when a male face was replaced by a male imposter face (MM).

Overall, these findings suggest that, among all the face swapping algorithms, SimSwap achieved the best performance in preserving head movements. Errors were relatively low, particularly when female faces were replaced with female imposter faces. These results highlight the algorithm’s effectiveness in maintaining the naturalness and realism of head movements during the deidentification process.

4.3 Error analysis in driver’s eye and mouth movements

Table 8 presents the error analysis for eye and mouth movements using different algorithms and variations.

The error metrics used are EAR error, LAR error, and circularity error. Among the algorithms, SimSwap exhibits the lowest error for all three metrics, with an EAR error of 0.061, a LAR error of 0.075, and a circularity error of 0.059, indicating that SimSwap provides the most accurate preservation of eye and mouth movements. FSGAN and GHOST also demonstrate relatively low errors across the metrics, although slightly higher than SimSwap’s. InfoSwap, on the other hand, shows higher errors for all three metrics.

Table 4 shows that the average EAR across all frames is 0.26 and the average circularity is 0.43; the highest EAR and circularity are 0.47 and 0.70, respectively. From Fig. 16, the EAR error and circularity error are below 0.1 for most frames when face swapping with SimSwap; the results in Fig. 16 are computed at the frame level. LAR is similarly well preserved. This shows that even in the deidentified videos, EAR, LAR, and circularity are very well preserved, and these low errors indicate that the deidentified videos can readily be used to build safe-driving models, such as distracted driving detection models. The human attributes were well preserved.
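For reference, EAR and its frame-level error can be computed from six eye landmarks using the standard formulation; the landmark arrays `lm_orig` and `lm_deid` below are hypothetical.

```python
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """EAR from six eye landmarks p1..p6, shaped (6, 2): the two vertical
    distances (p2-p6, p3-p5) over twice the horizontal distance (p1-p4)."""
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return (v1 + v2) / (2.0 * h)

# Frame-level EAR error between an original and a deidentified frame
ear_error = abs(eye_aspect_ratio(lm_orig) - eye_aspect_ratio(lm_deid))
```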

Table 7 Error analysis (in degrees) for head movements for different algorithms

4.4 Temporal consistency of errors in head, eye, and lip movement

In our study, we sought to comprehensively assess the temporal dynamics of head, eye, and lip movements by utilizing both the average error metrics and the standard deviation (SD) of these errors. The average error presented in Table 8 provided a baseline measure of central tendency, indicating the typical extent of deviation from expected movement patterns. To gain deeper insights into the variability and consistency of these deviations, we calculated the standard deviation of the errors, which highlighted the spread or dispersion of error magnitudes around the mean value.

Further enhancing our analysis, we incorporated the standard deviation of the first derivatives of each error related to head, eye, and lip movements. This metric allowed us to examine not just the magnitude of the errors but also their dynamism over time. The first derivatives were calculated using the central difference method, implemented as a convolution operation, which captures the rate of change at each data point; the standard deviation of these derivatives was then computed to assess the variability and consistency of the errors throughout the observation period. A high standard deviation in this context indicates a high degree of variability, pointing to erratic or dynamic changes in the errors, which could suggest less stable or less predictable control of head, eye, and lip movements. Conversely, a low standard deviation indicates that the errors are uniform and steady over time, implying more stable and controlled movement dynamics. Such metrics are crucial for understanding the underlying stability and control mechanisms of head, eye, and lip movements, especially in applications requiring precise motion tracking and error correction in real-time systems.
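A minimal sketch of this computation, with the central-difference kernel applied via convolution (naming is ours):

```python
import numpy as np

def derivative_std(errors: np.ndarray) -> float:
    """Standard deviation of the first derivative of a per-frame error signal.
    Convolving with [1, 0, -1]/2 in 'valid' mode yields the central
    difference (x[i+1] - x[i-1]) / 2 at each interior frame."""
    kernel = np.array([1.0, 0.0, -1.0]) / 2.0
    d = np.convolve(errors, kernel, mode="valid")
    return float(np.std(d))
```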

Fig. 14 Violin plot for angular errors in head movements for the SimSwap algorithm. The detailed numerical values for the average and standard deviation of errors are given in Tables 7 and 9

Fig. 15 Error analysis for face swapping with different gender combinations using the SimSwap algorithm

Table 8 Error analysis for eye and mouth movements for different algorithms

Table 9 presents the standard deviations of head movement errors and their first derivatives across different face swapping algorithms. SimSwap consistently demonstrates superior results across all error metrics, exhibiting considerably lower standard deviations compared to other algorithms. This indicates its effectiveness in preserving the consistency and temporal dynamics of head movements, crucial for maintaining the fidelity of deidentified face videos in naturalistic driving studies. These findings suggest that SimSwap holds promise as a reliable tool for NDS video deidentification, ensuring the preservation of essential human factors attributes while anonymizing participants’ identities. Its ability to maintain stability and control mechanisms in head movements enhances its suitability for applications requiring precise motion tracking and error correction in NDS.

Fig. 16 Proportion of frames vs. absolute error for EAR, LAR, and circularity using SimSwap

Similarly, Table 10 provides standard deviations of eye and lip movement errors, along with their first derivatives, across the different face swapping algorithms. Each error metric represents the variability or dispersion of errors associated with EAR, LAR, and pupil circularity (PUC). For each algorithm, including InfoSwap with and without kernel smoothing (KS), FSGAN with and without fine-tuning (FT), GHOST, and SimSwap, the table shows the standard deviations of these error metrics. Lower standard deviation values indicate less variability and greater consistency in error magnitudes, suggesting more stable and controlled movement dynamics. Among the algorithms, SimSwap consistently demonstrates the lowest standard deviations across all error metrics, indicating superior performance in preserving the temporal dynamics of eye and lip movements. Conversely, InfoSwap generally exhibits higher standard deviations, suggesting more erratic or dynamic changes in error magnitudes over time. The error levels shown by SimSwap indicate no large or erratic variations, which affirms the applicability of the algorithm for face swapping in NDS videos.

4.5 Error analysis in preserving human emotions

Table 11 presents error metrics for preserving human emotions using different face swapping models. Each model is evaluated based on three key metrics: accuracy, precision, and recall. Accuracy measures the overall correctness of emotion preservation, precision assesses the proportion of correctly preserved emotions among all emotions predicted as preserved, and recall evaluates the proportion of correctly preserved emotions among all the actual preserved emotions.

Table 9 Standard deviations of head movement errors and their first derivatives across different face swapping algorithms
Table 10 Standard deviations of eye and lip movement errors and their first derivatives across different face swapping algorithms

The models include InfoSwap with and without kernel smoothing (KS), FSGAN with and without fine-tuning (FT), GHOST, and SimSwap. Among these, SimSwap demonstrates the highest accuracy, precision, and recall, indicating superior performance in preserving human emotions. GHOST also exhibits relatively high performance across all metrics, followed by InfoSwap and FSGAN, which show lower scores. These metrics provide insights into the effectiveness of each model in accurately preserving human emotions, crucial for maintaining the authenticity and utility of deidentified face videos in naturalistic driving studies. The results achieved by SimSwap are high and on par with those reported in the literature [79].

Table 11 Error metrics in preserving human emotions

4.6 Quantitative analysis of deidentified videos quality

We calculated error metrics for all deidentified videos against the original videos; the maximum, minimum, and mean values across all pairs are given in Table 12. The arrow next to each metric indicates the direction of higher similarity: for example, PSNR \(\uparrow \) means that a higher PSNR value for a given deidentified-original image pair indicates that the images are more similar.

Table 12 Error metric statistics across all frames of ORNL dataset

Table 12 shows that the RMSE and MSE metrics are sensitive and can range from the smallest to the largest errors. The UIQI metric has almost identical values for all the videos and is hence considered very insensitive. In contrast, the ERGAS metric appears very sensitive and can therefore indicate whether frames have been deidentified. For all deidentified frames, the error with respect to the original image should never be zero; if the error is zero, the deidentified and original frames are trivially identical, and human inspection is needed to check whether deidentification has actually occurred. We discuss the error analysis plan in detail in Sect. 5.

4.7 Analysis of secondary actions

In order to analyze the preservation of secondary actions in the deidentified videos, a qualitative analysis was conducted on various cases. Table 13 presents an overview of selected cases, showcasing the preservation of secondary behaviors. Specifically, the driver’s face was swapped with a face displaying a different racial profile to evaluate the effectiveness of the deidentification process. The deidentified videos were carefully examined to assess the extent to which secondary actions, such as facial expressions, eye movements, and mouth movements, were maintained.

The results obtained from the qualitative analysis indicate that the face swapping performed well in preserving secondary actions. Notably, examples of secondary behaviors were observed in various scenarios, including instances where the driver was wearing glasses, closing their eyes, speaking, parking, using features on the dashboard, and under harsh lighting conditions. These findings demonstrate the efficacy of the deidentification approach and highlight its potential for future directions and improvements.

Table 13 Qualitative analysis of face swapping during different secondary behaviors

4.8 Use of synthetic faces in deidentification

Fig. 17 Use of synthetic faces to replace the faces of drivers

In response to growing concerns regarding PII, utilizing the faces of actual people as source (imposter) images in face swapping can itself present challenges. To address this issue, we employed synthetic faces generated using StyleGAN [80] that do not correspond to real individuals. These faces also offer the opportunity to enhance diversity in the data, provided that the training dataset remains unbiased and encompasses a wide range of examples. Figure 17 showcases examples of these synthetic faces used to replace the faces of real drivers. Notably, the face swapping techniques remain effective even when drivers are wearing glasses.

Fig. 18 Error analysis for human attributes for face swapping using synthetic faces

Despite the utilization of synthetic faces, crucial aspects such as roll, pitch, and yaw angles, as well as human factors like EAR and LAR, are accurately preserved. Figure 18a demonstrates the minimal error in terms of roll, pitch, and yaw angles. Similarly, as shown in Fig. 18b, the errors associated with EAR, LAR, and circularity are comparable to those obtained when real human faces were used as replacements.

5 Discussion and further guidelines

This section presents a comprehensive discussion and provides guidelines based on the findings and outcomes of the research. It also aims to shed light on the implications and potential applications of the deidentification framework.

5.1 Error analysis plan

In order to benefit from face swapping for the deidentification of drivers’ face videos, large-scale processing is essential. The scalability of the framework proposed in this work can yield greater benefits in the curation of large datasets, but checkpoints are needed to ensure that deidentification follows proper guidelines, which in turn requires an error analysis plan with human-in-the-loop validation. Since the purpose of this study was the deidentification of driver face videos while preserving human factors attributes, the evaluation of the results was done in two steps. First, the image quality of the frames in the deidentified videos was evaluated with respect to the original videos. Second, human attributes useful in transportation safety research, such as head, lip, and eye movements, were evaluated. To automate the deidentification of large datasets using the provided framework, we suggest two major steps: (i) creating an error threshold with spot checking and (ii) spot checking frames with abrupt changes in metrics.

5.1.1 Creating error threshold and spot checking

From the experiments, we found that error bounds for the various error metrics need to be defined; by training on a larger dataset that captures greater nuance, a more robust error threshold could be established. In our experimentation with the ORNL dataset, we found that for properly deidentified images (verified by visual inspection and not recognized by a recognition algorithm), the error was nonzero for all metrics; for a given original and deidentified pair, the error should never be zero. For a frame to be considered deidentified, it should have acceptable errors across all metrics, and even a single instance of an error outside the acceptable range should trigger spot checking, as sketched below.
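A minimal sketch of such threshold-based flagging follows; the threshold values are placeholders, not the bounds derived in this study.

```python
# Placeholder acceptable ranges (min, max) per metric; real bounds would be
# derived from a larger, more nuanced training dataset.
THRESHOLDS = {"rmse": (1.0, 60.0), "ergas": (200.0, 16000.0)}

def needs_spot_check(metrics: dict) -> bool:
    """Flag a frame for human-in-the-loop review if any metric is zero
    (the frame may not have been deidentified at all) or falls outside
    its acceptable range."""
    for name, value in metrics.items():
        lo, hi = THRESHOLDS[name]
        if value == 0 or not (lo <= value <= hi):
            return True
    return False
```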

5.1.2 Spot checking for frames with abrupt changes in metrics

Most error metrics can detect unusual behavior. As shown in Fig. 19, unusual dips and peaks can be observed; upon close examination, the dip was found to be due to poor face swapping under harsh lighting conditions. Human-in-the-loop verification is suggested for such unusual dips and peaks. It is preferable to use a more sensitive metric such as ERGAS, which can range from 0 to tens of thousands; in our data, the maximum ERGAS is around 16,000, whereas the minimum is around 200.
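A simple jump detector over a per-frame ERGAS series might look as follows; the factor `k` is an illustrative choice, not a value tuned in this study.

```python
import numpy as np

def abrupt_change_frames(ergas_series: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Indices of frames whose frame-to-frame ERGAS jump exceeds the mean
    jump by k standard deviations; such frames are sent for human review."""
    jumps = np.abs(np.diff(ergas_series))
    outliers = np.where(jumps > jumps.mean() + k * jumps.std())[0]
    return outliers + 1  # index the later frame of each abrupt jump
```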

Fig. 19 Evaluation of face swapping in terms of preserving eye and mouth movements

5.2 Reidentification of deidentified faces

There are several fields of research where the reidentification of a person is a major topic of interest [81,82,83,84]. Reidentification is a legitimate research topic in its own right, but in the context of this paper it represents a risk for deidentified videos, and sufficient countermeasures should be developed and deployed to minimize its likelihood. In order to study how well the face swapping algorithms withstand reidentification, we performed reidentification on 16 face-swapped images.

Fig. 20 Test image and gallery of 20 images used for the identification task. The objective is to correctly identify, among the gallery images, the original image corresponding to the test image

5.2.1 Reidentification ranks

To evaluate the performance of face swapping algorithms in withstanding reidentification, we analyze the reidentification ranks of 16 face-swapped images. The reidentification rank represents the position at which the correct identity is found in the ranked list of potential matches. A lower rank indicates a better reidentification performance, as the correct identity is found earlier in the list. We evaluated the reidentification using 20 human evaluators and a cosine-similarity-based measure. In our analysis using cosine similarity, we consider the following steps:

  a. Preprocessing: Preprocess the face-swapped images and extract facial features or embeddings using a suitable algorithm or model.

  b. Gallery construction: The gallery of reference images is constructed to include a combination of original unswapped images and synthetic face images. In our calculation, we include a total of 20 gallery images, consisting of real and synthetic face images used for evaluation purposes. For a fair evaluation, face images were cropped using MediaPipe to show just the facial region. By incorporating a diverse range of images in the gallery, we aim to assess the reidentification performance of the face swapping algorithms under various scenarios and conditions. Example gallery images and a test image are shown in Fig. 20, where the objective is to accurately identify, among the gallery images, the original image corresponding to the given deidentified test image.

  c. Similarity calculation: Compute the similarity scores or distances between the features/embeddings of the face-swapped images and the gallery images. Common similarity metrics include cosine similarity, Euclidean distance, or other appropriate distance measures. In our implementation, we used cosine similarity as described in Sect. 3.4.7. Apart from cosine-based similarity, we also had human evaluators rank the images.

  d. Ranking: Rank the gallery images by similarity to the face-swapped image, from most to least similar (i.e., by descending similarity score or ascending distance).

  e. Evaluation: Determine the reidentification rank of each face-swapped image by identifying the position of the correct identity in the ranked list. For example, if the correct identity is found at rank 3, the algorithm reidentified the face-swapped image among the top 3 potential matches (a sketch of the ranking and CMC computation is given below).

Similarly, for the assessment of reidentification using human evaluators, we provide them with the same set of gallery images and face-swapped images.
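The ranking and CMC computation described in steps c, d, and e can be sketched as follows; the function names and embedding inputs are assumptions for illustration.

```python
import numpy as np

def reidentification_rank(test_emb: np.ndarray, gallery_embs: np.ndarray,
                          true_idx: int) -> int:
    """1-based rank of the true identity in the gallery, ordered by
    descending cosine similarity to the deidentified test embedding."""
    sims = gallery_embs @ test_emb / (
        np.linalg.norm(gallery_embs, axis=1) * np.linalg.norm(test_emb))
    order = np.argsort(-sims)  # most similar gallery image first
    return int(np.where(order == true_idx)[0][0]) + 1

def cmc_curve(ranks, n_gallery: int = 20) -> np.ndarray:
    """CMC: fraction of test images whose true identity appears at rank <= r."""
    ranks = np.asarray(ranks)
    return np.array([(ranks <= r).mean() for r in range(1, n_gallery + 1)])
```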

Fig. 21 Comparison of reidentification performance using the cumulative match characteristic (CMC) for the cosine-similarity-based technique and average human evaluation

In Fig. 21, we present the average performance of twenty human evaluators and the cosine-similarity-based technique. The positions of the curves on the cumulative match characteristic (CMC) plot indicate their relative reidentification performance: a higher curve at low ranks indicates better reidentification, as the correct identity is found earlier in the ranked list. The CMC curves show that, under both human evaluation and the cosine-similarity-based technique, the face swapping algorithms were highly effective in deidentifying face videos in NDS data.

5.2.2 Further considerations

Further considerations in the deidentification of drivers’ appearances go beyond facial features and involve non-biometric identifiers such as clothing, hairstyle, and other distinguishing characteristics. While this work primarily focuses on deidentifying drivers’ face videos, there is a need to address these additional identifiers to ensure comprehensive privacy protection. Non-biometric identifiers, such as clothing and hairstyle, can play a significant role in reidentification if individuals who are already familiar with the drivers have access to the deidentified videos (e.g., fleet managers). To mitigate this risk, it is important to extend the deidentification process to encompass these non-biometric identifiers as well.

Recent computer vision advancements offer algorithms that can detect and alter clothing styles, hairstyles, and other appearance features. Combined with face swapping, these methods strengthen deidentification by concealing non-biometric identifiers and reducing reidentification risks. However, integrating them adds complexity, requiring advanced computer vision techniques while preserving natural appearances. Additionally, exploring the impact of source image selection on deidentification, and automating this selection for optimized outcomes, is a promising avenue for future improvement.

5.3 Limitations

Despite the potential of AI-based face swapping algorithms in automating the deidentification of large datasets, it is important to acknowledge their limitations. The proposed framework relies on these algorithms, but certain factors can affect their accuracy and robustness, posing challenges to the deidentification process.

A key limitation lies in the sensitivity of the face detection algorithms used in the framework to factors such as lighting, camera angles, and facial expressions. Variations in lighting or camera angle can cause face detection failures, affecting LAR and EAR measurements, so enhancing face detection to ensure robust performance in diverse scenarios is crucial. Additionally, rapid eyelid or mouth movements can challenge face swapping algorithms, introducing inconsistencies or artifacts; algorithms capable of handling and compensating for such swift movements are needed to improve deidentification accuracy. Video resolution and compression also impact the framework: lower resolution and heavy compression degrade video quality, making it harder for algorithms to identify and manipulate facial features. Evaluating the framework across varied data sources with different resolutions and compression levels is essential for optimization.

In addition to the mentioned limitations, it is essential to test the dataset with varying resolutions and characteristics to ensure generalizability. While we conducted experiments using the ORNL dataset, which is readily available, we could not utilize other datasets due to privacy concerns. For instance, datasets like SHRP-2 impose restrictive terms on data usage, particularly concerning the publication of drivers’ face videos. Consequently, our experiments were limited to the ORNL dataset. The statistics presented in Sect. 3.2.4 demonstrate dynamic movements in terms of head, eye, and lip movements within the ORNL dataset. These findings suggest a degree of generalizability in our experiments, indicating that the insights derived from the ORNL dataset may extend to broader contexts. However, future studies should aim to include diverse datasets with varying resolutions and characteristics to further validate the robustness and applicability of our findings.

Addressing these limitations necessitates further research and development. By overcoming them, the framework can reliably and effectively automate deidentification for large datasets, enabling safer, privacy-conscious driver behavior and transportation safety research.

6 Conclusion

In conclusion, this paper has addressed the challenges and restrictions posed by PII in NDS and proposed a framework to mitigate these concerns. While previous research focuses on the deidentification of stationary face videos, we extensively evaluated in-the-wild driving videos. By leveraging recent advancements in computer vision, our primary focus has been on removing biometric identifiers from the face area of drivers, ensuring privacy while preserving important human factors attributes for safety research. Through extensive experimentation and analysis, we have demonstrated the feasibility of face deidentification techniques for NDS videos. We have also proposed ways to quantitatively assess the preservation of crucial human factors attributes, such as head, mouth, and eye movements; this quantitative assessment allows an objective evaluation of the extent to which the preserved data maintains its usefulness for subsequent safety research and analysis. Future work will explore the removal of non-biometric identifiers.