1 Introduction

In recent years, the development of smart vehicles has brought a safer and more efficient driving experience. Alongside these improvements, an important aspect of driving safety is driver distraction caused by controlling in-vehicle systems such as the entertainment system [1]. Driving is the primary task in the vehicle and requires full commitment, but drivers also need to perform concurrent secondary tasks [2]. By frequency of use, in-vehicle secondary tasks mainly include air conditioning settings, media settings, and vehicle function control [3]. Secondary tasks are usually controlled by traditional physical means, such as physical buttons or touch-screen taps, which tend to divert the driver's attention from the primary driving task toward the secondary task being performed [4]. A recent study by Metz et al. [5] found that, with a traditional direct touch interface, performing secondary tasks caused increased visual interference for 25 to 40% of drivers. Consequently, more and more researchers are looking for input modes other than physical keys and touch-screen taps. Bilius et al. [6] argued that mid-air gesture input has become one of the main methods for ensuring driving safety and reducing distraction while driving.

Over the past decade, mid-air gesture interfaces have become increasingly popular. Osen et al. [7] divided human–computer interaction gestures into two categories: touch-screen gestures and in-air gestures. To interact with a screen interface, the user must directly touch the screen of the device with a finger. In contrast, interaction based on mid-air gestures, also called contactless gestures, enables users to perform various spatial inputs without touching the screen; this type of interaction depends on the image recognition performance of the computer. Following the classification in [8] and for ease of description, contact-based or hand-busy interaction is referred to in this article as touch-based interaction (TBI), and mid-air gesture interaction is referred to as gesture-based interaction (GBI).

The triggering of secondary tasks in traditional vehicles relies on a direct touch interface. When using it, the driver must perform a series of operations: moving the field of view toward the interface, visually searching for control elements, performing the operation, waiting for system feedback, and returning the field of view to the road [9]. The traditional direct touch interface therefore demands more on-screen browsing time from the driver. Besides, bright screens may cause glare at night: when the driver's eyes move from the bright screen back to the dark road, more time is required to recover. In contrast, GBI promotes a comfortable, efficient and natural human–car interaction (HCI) experience, reducing the number of steps required to complete a task [10]. The emergence of natural human–computer interaction has shifted the focus of gesture-based interaction research from the accuracy of gesture recognition to the user's interactive experience [11]. Interface interaction gestures are often designed by computer science and design professionals for ease of management and automatic recognition rather than ease of use [12]. In such gesture design processes, the preferences and needs of end users are ignored. When performing target operations, users still invest considerable visual cognitive capacity and need on-screen prompts to recall the corresponding gesture actions, which is very dangerous while driving. Although gesture recognition has greatly improved in both the types of gestures recognized and the recognition rate, the usability and user acceptability of gestures/commands have become the focus of attention for researchers and designers [13].

Therefore, beyond improving the gesture recognition rate, it is very important for researchers to determine what kind of low-attention mid-air gesture interface users might expect.

2 Related work

In actual in-vehicle information interaction, avoiding driver distraction requires considering not only the influence of the interaction mode on the user's cognition and experience, but also the speed of interaction feedback. In this section, we review the development of mid-air gestures for in-vehicle interactive interfaces, user-elicitation studies, and gesture recognition algorithms.

2.1 Mid-air gestures for interactive interface

Mid-air gesture interaction is a contactless interaction between people and devices, which mainly estimates gestures based on computer-vision recognition technology. Contactless user interfaces enable end users to view, control and manipulate any digital content, such as objects, projects or scenes, without physically touching the user interface or devices [14]. Mid-air gesture interaction reduces the barriers of interaction and improves the flexibility of input between users and computers. It can therefore serve as an alternative or complementary modality in fields requiring touchless operation. For example, some scenarios use mid-air gestures instead of a physical mouse to perform clicking and drawing tasks [15]. Others use mid-air gestures to manipulate more complex activities, such as controlling virtual objects in VR scenes [16], controlling smart homes [17] or TVs [18], and so forth. With the development of modern technology, this interactive mode is becoming feasible.

The development of automobile technology calls for more efficient interactive interfaces. Touch-based in-vehicle interactive interfaces have been widely used in modern automobiles [19]. However, the user must make a directed movement toward the touchscreen, which inevitably requires substantial visual, cognitive and manual capacity [20]. A study by Doring et al. [21] confirmed that a multi-touch user interface system increased the visual demand on drivers. Because of unstable driving environments or road conditions, touch input is also highly susceptible to disturbance [22]. In addition to the touch-based interface mentioned above, speech interaction is another option for in-vehicle user interface design. However, microphone position, speaker position, and environmental noise strongly affect in-vehicle voice control. Recent research [23] shows that the character error rate of automotive speech recognition is 5.87% in the best case and 80.82% in the worst case. Jamson et al. [24] showed that voice interaction delayed the driver's response by 30%. Voice interaction therefore inevitably increases drivers' cognitive load on driving tasks and distracts their attention, reducing driving safety. Ohn-Bar et al. [25] found that gesture-based interaction is more natural to operate and can replace traditional touch-based interaction without degrading the driver's interactive experience, thus effectively reducing the driver's visual distraction.

May et al. [26] used a multimodal air gesture interface for vehicle menu navigation to reduce users' cognitive load during the driving task. Jahani et al. [2] argued that mid-air gesture-based interfaces can provide a less complex in-vehicle interface for a safer driving experience. More and more researchers are investigating mid-air gesture interfaces to show that they can reduce driver distraction and the risk of accidents.

2.2 User elicitation for interactive interface design

Various interactive mid-air gestures have been proposed in previous studies. Wu and Balakrishnan [27] described a group of multi-finger and full-hand gesture operations. Rekimoto [28] described a set of gesture actions, such as pan, zoom, and rotation, that can be used in intelligent management systems. Ringel et al. [29] proposed a set of camera-based gestures to augment mouse actions and editing actions. However, the interactive gestures used in these studies were usually designed without fully consulting the end users, and the usability of gestures was sometimes sacrificed for the ease of achieving interactive goals on the actual equipment.

For an air gesture interface to cause minimal distraction while driving, a more comprehensive and in-depth understanding of users' preferences and expectations is necessary. This can be achieved by using established participatory design methods, such as elicitation studies [10]. Elicitation is essentially an emerging participatory design technique that is widely used to collect the needs and expectations of end users for a target system. Morris et al. [30] provided empirical evidence supporting end-user participation in gesture design research. Mahr et al. [31] used an elicitation method to determine the values of specific parameters, such as size and duration, required by drivers to perform different "micro gestures". Angelini et al. [32] conducted an elicitation study on on-wheel gestures and concluded that some gestures may pose potential dangers when driving. Dong et al. [33] and Henze et al. [34] used elicitation methods to study user-defined gestures for controlling media menus. Related research showed that exploring users' needs and investigating users' evaluations led to higher task performance than the traditional touch interface. The user elicitation method not only makes it possible to design more effective interactive interfaces by understanding users' behaviors and preferences [35], but also yields gestures that are easier to remember and perform than gestures created by researchers or designers [10].

2.3 Gesture recognition algorithm

An important factor that affects the mid-air gesture interaction experience is the gesture recognition rate. Traditional gesture recognition systems based on hand features or electrical signals are susceptible to interference from various external factors, and the sensing equipment is relatively heavy and complex. The user must use a dedicated input device, such as a glove or remote control [36], or perform gestures within the range of a close-proximity sensor, such as a depth-sensing camera or infrared sensor that tracks the movement of the body [37]. Vision-based methods rely on computer vision technology; although such computation is accurate, it also brings privacy and light-interference problems. Therefore, accurately segmenting gestures from the interactive background and effectively extracting gesture features are prerequisites for improving the recognition rate, for example by using skin color or depth information for hand-image segmentation to accurately locate the hand in the image. In earlier studies, an effective gesture target detection method was the deformable part model (DPM) proposed by Felzenszwalb et al. [38], which was essentially a hand-crafted feature extraction method that did not consider the influence of illumination. With the advance of deep learning, fully convolutional target detection networks have gradually developed. The latest progress in target detection was driven by region proposal methods [39] and region-based convolutional neural networks (R-CNNs) [40], but region-based CNNs are computationally expensive. Jonathan Long et al. [41] proposed Fully Convolutional Networks (FCN), whose main idea was to replace the original linear layers in the network with convolutional layers so that the network could accept inputs of any size. Although FCN realized end-to-end segmentation, its segmentation of fine details was not good enough. Olaf Ronneberger et al. [42] proposed the U-Net network, which recovers the details of low-level features through multiple skip connections, thereby improving accuracy. The U-Net structure further improves on FCN and can accurately separate small targets of the same class from the background. Therefore, the U-Net network is used in this paper to segment small gesture targets, which avoids the influence of light interference on recognition, but its model is large and needs to be optimized.

Building on previous work, the research of this paper is as follows:

(1) Gesture is an active choice for in-vehicle interaction, and end-user-based gesture design is a key factor in reducing cognitive bias in interaction. Therefore, this paper evaluates a user-elicited vocabulary for an in-vehicle gestural interface covering major secondary tasks [43]. We explored users' expectations and preferences for in-vehicle media interaction gestures. A user-elicitation method was introduced and applied to in-vehicle media interaction, and the appropriateness of the resulting gesture consensus set was verified.

(2) Based on existing algorithms and combined with the target data set, the gesture recognition model was optimized. We introduced VGG16 as the feature extraction part of the U-net network and segmented RGB images into gray images to avoid the influence of light interference on subsequent recognition. We integrated the lightweight convolutional neural network MobileNet-V2 to reduce the running time of the model, taking advantage of its depthwise separable convolutions, small number of parameters, and low computation. The results show that our proposed network can be trained end-to-end while reducing redundant parameters.

(3) The gestures proposed by the users were evaluated in a simulation experiment. Based on eye-tracking technology, the subjects' gaze heat map on the target, the number of fixation points, target fixation time, and average fixation time were extracted as test indicators. We explored the difference in users' cognitive load between gesture-based interaction and touch-based interaction to evaluate the users' gesture interaction experience.

The potential applications of this research include the interactive design of typical in-vehicle information and entertainment functions. The research process framework is shown in Fig. 1.

Fig. 1 The overall theoretical framework of this paper

3 Materials and methods

3.1 In-vehicle secondary tasks classification and user elicitation

Understanding end users' gesture interaction preferences helps reduce cognitive bias during interaction and improves users' acceptance of gestures in human–computer interaction, thereby reducing cognitive load while driving. The method mainly elicits appropriate mid-air gestures for in-vehicle secondary driving tasks; in human–computer interaction, it is necessary to develop specific gestures for specific application scenarios [44]. In-vehicle secondary tasks were divided into three categories in [3]: A/C control tasks, car-control tasks, and media control tasks. However, the authors did not subdivide these tasks in studies related to driving distraction. Stutts et al. [45] conducted a week of video surveillance and analysis of 70 drivers and summarized the common distractions during driving. In this video record, the most distracting tasks were leaning on, searching for, and reaching for objects; in addition, participants adjusted their audio controls a total of 1539 times, an average of 7.4 times per hour of driving (1539/207.2 coded hours of driving). We classified the three categories of in-vehicle secondary tasks according to frequency of use on the vehicle interactive interface: 1. Frequently, functions that are frequently set while driving; 2. Occasionally, functions users might set while driving; 3. Never, functions that are basically never used while driving, as shown in Table 1. The frequency and process of in-vehicle secondary tasks affect users' distraction and cognitive capacity. Therefore, within the main secondary driving categories, we subdivided the specific tasks of media control and designed the gesture user elicitation according to the degree of distraction and frequency of use.

Table 1 The frequency of use of major driving secondary tasks modified from [3]

Wobbrock et al. [35] studied 1080 gesture proposals defined by 20 users covering 27 command sets. This set of commands covered a wide range of tasks commonly found in many applications, including open/close menus, next/previous, and pause/resume, as well as direct manipulation tasks usually associated with touch-based surfaces and interactive media, such as rotation, zooming, and panning. Wobbrock's gesture consensus set showed its potential for practical application in work by Microsoft and by Kuhnel et al. [46]. Wobbrock's consensus set mainly established a classification of surface touch gestures based on user behavior in order to capture and describe gesture design. However, because our paper focuses on in-vehicle secondary tasks, we needed to design a mid-air gesture consensus set that reduces users' cognition and distraction, rather than surface touch gestures. Wobbrock's gesture consensus set is therefore poorly suited to an in-vehicle gestural interface, and a more specific consensus set needs to be considered. By carefully examining the process by which Wobbrock developed the surface gesture consensus set, we found that its elicitation method can easily be applied to the design of mid-air gesture control scenarios.

In this paper, a user elicitation method based on command prompts and an end-user agreement scoring principle was applied. An agreement analysis was conducted to understand participants' level of consensus for each specific task/referent. The agreement score was measured using Eq. 1:

$$A_{r} = \sum\limits_{P_{i} \subseteq P_{r}} \left( \frac{\left| P_{i} \right|}{\left| P_{r} \right|} \right)^{2}$$
(1)

where \(P_{r}\) represents the set of all gestures proposed for referent \(r\), and \(P_{i}\) represents subsets of identical proposals from \(P_{r}\). The agreement score \(A_{r}\) ranges from \(\left| P_{r} \right|^{-1}\) (no agreement at all) to 1 (perfect agreement).
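For clarity, a minimal Python sketch of how Eq. 1 can be computed from the elicited proposals is given below; the gesture labels and their grouping into identical proposals are hypothetical illustrations, not data from this study.

```python
from collections import Counter

def agreement_score(proposals):
    """Agreement score A_r of Eq. 1 for a single referent.

    `proposals` is the multiset P_r of gesture labels proposed by all
    participants for the referent; identical labels form the subsets P_i.
    """
    n = len(proposals)
    counts = Counter(proposals)  # |P_i| for each group of identical proposals
    return sum((c / n) ** 2 for c in counts.values())

# Hypothetical example: 25 proposals for the referent 'Open'
proposals_open = ["one_finger_up"] * 14 + ["open_palm"] * 7 + ["thumb_up"] * 4
print(round(agreement_score(proposals_open), 3))  # 0.418
```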

3.2 Evaluation of user distraction for in-vehicle interface

The appropriateness of gestures is directly related to distraction in driving tasks. The key point of gesture design for a driving task is that the user can recall the gesture required by the target operation when interacting via mid-air gestures, which is also what allows it to surpass traditional direct touch interaction. Ma and Du [47] argued that appropriate gesture input should be "easy to learn and remember, but also intuitive, and comfortable". If the user has to look at prompts on the display screen to recall the corresponding command gestures when performing in-vehicle secondary tasks, this will cause the same visual-channel occupation and distraction as touch-screen clicking and will increase cognitive load. Normark et al. [4] determined that performing any in-vehicle secondary task draws on the same perceptual and cognitive resources as the driving task itself. Distraction during driving mainly comes from the high cognitive load of performing the primary secondary tasks. Chen et al. [48] reported that continuous tracking of pupil diameter and blink rate are effective techniques for measuring cognitive load. Therefore, to verify the cognitive load of user-defined gestures when actually performing in-vehicle secondary control tasks, we conducted eye movement experiments based on gesture interaction and touch interaction. By capturing the number of fixation points, fixation time and average fixation time of the user on the central control panel, defined as the area of interest (AOI) during driving, we specifically evaluated the difference in cognitive load generated during the operation of in-vehicle secondary tasks. In the experiment, we also evaluated the appropriateness of user-defined gestures with a five-point Likert scale on three criteria: memory, intuitiveness, and comfort. The purpose of this evaluation stage was to verify, through objective and subjective data, the effectiveness of the gesture consensus set obtained from our user elicitation experiment in reducing drivers' distraction and improving the interactive experience while driving.

3.3 Gesture segmentation and recognition

The performance of gesture detection and recognition determines the speed of interactive feedback and is also a prerequisite for improving the user's gesture interaction experience. Gesture target detection is a basic computer vision task combining target localization and recognition, which brings privacy and light-interference problems. Therefore, before gesture recognition, we segmented the contour edges of gestures. The purpose was to detect the edges of an accurate target frame and convert RGB images into gray images, so as to avoid the influence of light interference on the recognition network later. This is also a precondition for improving the gesture recognition rate in complex interactive tasks and the basis of gesture semantic segmentation. In this paper, we optimized the feature extraction part of the U-net segmentation network, simplified the network parameters, reduced the amount of computation, and integrated the lightweight MobileNet-V2 idea for gesture segmentation and recognition.

U-net is a segmentation network proposed for medical imaging on the basis of FCN, which can achieve good segmentation results even with small data sets. The biggest features of U-net are its U-shaped structure and skip connections: the left side of the U-shaped structure consists of convolution layers and the right side of upsampling layers. U-net has a large model and a large amount of computation; although it segments image details well, its segmentation speed is slow. The activation of each pooling stage in U-net's convolution layers is concatenated with the activation of the corresponding upsampling layer, so U-net's encoding process is a feature extraction process. In other words, if the encoder were rebuilt from part of VGG16's convolution structure into a stronger feature extractor and combined with the basic frame structure of U-net to reduce redundant parameters, the feature extraction effect and the experimental accuracy could be improved.

In this paper, the encoding process of U-net was taken out separately, and the VGG16 network was selected to extract the image features. VGG16 is a large-scale convolutional neural network trained on ImageNet in 2014 (1.4 million labeled images and 1,000 categories); for example, the Monkey data set provided by Kaggle has been used for feature extraction with the VGG16 model in Keras. The VGG16 network has 13 convolutional layers and 3 fully connected layers. The 13 convolutional layers are divided into 5 groups, each followed by a pooling layer, and 3 fully connected (FC) layers sit at the end of the network: the first two FC layers have 4096 neurons each, and the last has 1000. We replaced the original U-Net feature extraction network with the VGG16 model, replaced the overall features with partial features for training, reduced redundant convolution kernels, and used pretrained mature models to improve training efficiency and experimental accuracy. Specifically, first, we discarded the SoftMax layer of the VGG16 network. Second, we set the anchor box detection scales of VGG16 to [0.7, 1, 1.4] and reconstructed the five convolutional outputs conv1_2, conv2_2, conv3_3, conv4_3, and conv5_3 of VGG16 for multi-scale detection; the parameter settings are shown in Table 2. Finally, the output of the FC2 layer of the VGG16 network was used as the final image feature to form a feature map. The down-sampling part of the original U-Net was kept unchanged, and in each up-sampling stage the features were fused with the corresponding feature-extraction outputs of the same scale and channel number. The specific image segmentation process is shown in Fig. 2.

Table 2 The multi-scale analysis convolution parameters of VGG16
Fig. 2 Flowchart of VGG16 for feature extraction and U-net for segmentation
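As an illustration of the architecture described above, the following Keras sketch builds a U-net style decoder on top of a pretrained VGG16 encoder. The skip-connection layers, decoder filter counts, and single-channel mask output are reasonable assumptions rather than the exact configuration used in this paper, and the multi-scale anchor box settings of Table 2 are omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def vgg16_unet(input_shape=(224, 224, 3)):
    # VGG16 convolutional base (FC head discarded) serves as the U-net encoder
    vgg = VGG16(include_top=False, weights="imagenet", input_shape=input_shape)

    # Same-scale feature maps reused as skip connections in the decoder
    skip_names = ("block1_conv2", "block2_conv2", "block3_conv3", "block4_conv3")
    skips = [vgg.get_layer(name).output for name in skip_names]
    x = vgg.get_layer("block5_conv3").output  # bottleneck features (14 x 14)

    # U-net style decoder: upsample, concatenate the same-scale skip, convolve
    for skip, filters in zip(reversed(skips), (512, 256, 128, 64)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Single-channel hand/background mask (assumed output format)
    mask = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return models.Model(vgg.input, mask)

model = vgg16_unet()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
model.summary()
```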

A lightweight convolutional neural network (MobileNet-V2) [49] was used to extract features from the segmented gestures and recognize them. MobileNet-V2 improves on the MobileNet-V1 network; its main advantage is that it efficiently reduces the number of parameters and the computational complexity. In addition to the depthwise separable structure in the middle, it also includes an expansion layer and a projection layer. The projection layer uses a 1 × 1 convolution, similar to V1, to map high-dimensional features to a low-dimensional space; this improvement is called the bottleneck layer. Conversely, the expansion layer uses a 1 × 1 convolution to map from a low-dimensional space to a high-dimensional space, increasing the number of channels and obtaining more features. The expansion factor can be adjusted according to the actual situation, but the default value is 6. Therefore, the whole block structure of MobileNet-V2 changes from "projection layer → feature extraction convolution layer → expansion layer" to "expansion layer → feature extraction convolution layer → projection layer". Finally, a linear bottleneck is used for the output instead of the ReLU nonlinear activation function, to prevent the extracted features from being compressed and destroyed by ReLU. The network parameters of MobileNet-V2 are shown in Table 3. For MobileNet-V2, we used a learning rate of 0.0001, the Adam optimizer, and the cross-entropy loss function to analyze the accuracy and loss of gestures on the training set.

Table 3 Implementation process of MobileNetV2

Where H × W is the input feature map size (height × width), t is the channel expansion multiple, s is the stride, and K is the convolution kernel size. The first row is the dimension-increasing layer, the second row is the convolution layer, and the third row is the dimension-reducing layer.
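The inverted residual structure just described (expansion layer → depthwise feature-extraction convolution → linear projection layer) can be sketched in Keras as follows; the expansion factor of 6 follows the default mentioned above, the filter counts and stride are placeholders, and the complete network is available prebuilt as `tf.keras.applications.MobileNetV2`.

```python
from tensorflow.keras import layers

def inverted_residual(x, out_channels, stride=1, expansion=6):
    """MobileNet-V2 bottleneck block: expand -> depthwise conv -> linear project."""
    in_channels = x.shape[-1]

    # Expansion layer: 1x1 conv maps to t * in_channels (higher-dimensional space)
    h = layers.Conv2D(expansion * in_channels, 1, padding="same", use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)

    # Depthwise 3x3 convolution extracts features channel by channel
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)

    # Projection (linear bottleneck): 1x1 conv back to a low dimension with no ReLU,
    # so the compressed features are not destroyed by the activation
    h = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)

    # Residual connection only when spatial size and channel count are unchanged
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h
```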

4 Experiments

In order to cover the control of every target in frequent in-vehicle secondary tasks, we conducted two separate experiments with 35 participants in total, all of whom had held a legal driving license for more than 1 year. In Sect. 4.1, an elicitation activity with 25 participants (15 males, 10 females) mainly involved browsing a series of in-vehicle secondary task functions and proposing a right-hand mid-air gesture for each referent. In Sect. 4.2, we verified the performance of our optimized gesture recognition model. In Sect. 4.3, we conducted a simulated driving distraction experiment assessed by another 10 participants (5 males, 5 females); we mainly evaluated various eye movement indicators under GBI and TBI and analyzed the appropriateness of the user-defined gesture for each referent.

4.1 User elicitation design experiment

4.1.1 Participant

Whether research findings generalize to other domains and groups of people, like the generality of any participatory design activity, depends on the participants—especially their past experiences and cultural backgrounds [49]. Owing to the traffic laws and regulations of the country where this study was conducted, the participants were all right-handed and the steering wheels used in our experiment were all on the left side of the car. When participants needed to operate the in-vehicle user interface, they were used to controlling the steering wheel with their left hand and touching the interface with their right hand. Twenty-five healthy adult subjects (mean age 26 years, SD = 2.2 years) who had experience using interactive touch-screen technology volunteered for this study. Participants were ordinary users who had used in-vehicle interactive systems and had no experience in computer science, gesture design, or usability design.

4.1.2 Referents

Before determining the referents of this study, we refined every possible command in the frequently used media control tasks based on actual driving. Before the elicitation study, participants' acceptability (expressed as a percentage) of using mid-air gesture interaction for these tasks was investigated with a percentage-slider questionnaire. The results are shown in Sect. 5.1.

Eight commands commonly used for controlling the in-vehicle media system were selected for the experiment (Table 4). Following the terminology of [35], commands are denoted as referents. In line with Wobbrock's gesture consensus set, the set of referents contains frequently used commands for controlling in-vehicle media tasks, covering Open/Close (turn radio on/off), Volume+/Volume− (increase/decrease volume), Pause/Resume (mute radio), and Previous/Next (change song title). Each referent is denoted by T, together with the operation and notes on the direct touch interface for that referent. Although other commands could easily be included, we chose this specific and limited set for several reasons. First, when driving, drivers prefer to start the in-vehicle media system, especially the music player, and these music-related control tasks are frequently invoked. Second, to keep the experimental results objective, the number of commands participants had to remember was limited so as not to increase their cognitive load.

Table 4 Eight in-vehicle primary secondary tasks with notes

4.1.3 Procedure

The 25 participants (the same as above) were seated in a stationary car to simulate the driving environment. A display screen was placed in the center console at the front right of the subjects, and a series of referents and icons representing the different referents were displayed on the screen. Each referent had a short text description (such as 'Previous'), presented on a low-fidelity 1920 × 1080 prototype interface. The prototype program was written in Python and PyQt; as shown in Fig. 3, the prototype interface contained no interfering design elements. The left half of the prototype interface simulated the content of the music interface and arranged the 8 interactive tasks in order; the right half showed a short text description of each referent. First, participants familiarized themselves with the prototype interface for each referent for 30 s and understood its effect through the text description. Second, as each referent's icon looped on the screen, participants were asked to keep their left hand on the steering wheel to simulate the driving posture and to demonstrate with their right hand a gesture that could trigger the current command, such as T1 'Open'; they were allowed to further explain the gesture verbally if necessary. Finally, they were asked to identify one preferred gesture for the task. Participants were encouraged to think aloud while searching for the best gesture to match the task.

Fig. 3 The prototype interface of user elicitation procedure

During gesture elicitation, participants were asked to act out the gesture they wanted to propose with their right hand to get the best feel for it. This "imitation" step supported the subjects' decision-making and helped them propose their preferred gestures. Subjects could consider each command individually; that is, they did not need to consider whether they had chosen the single most suitable gesture for a command.

4.2 Verification of the optimized model

The detection and recognition of gestures affect the user's interactive experience with the system. We used OpenCV-Python 3.4.2.16, the TensorFlow 1.14.0 deep learning framework, and Keras 2.1.5. A server with an i7-9700K CPU, 16 GB of memory, and a 2070 Super graphics card was used to verify the algorithm proposed in this paper.

Because of the small sample size of the data set, we used the Keras ImageDataGenerator to expand it with random cropping, cropping to specification, flipping, denoising, adding random noise, histogram equalization, rotation, and contrast adjustment. A learning rate of 0.0001 and the SoftMax cross-entropy loss function were selected during training of our optimized model. The formula is as follows:

$$H(p,q) = - \sum\limits_{x} p(x) \log q(x)$$
(2)

where \(p(x)\) represents the true probability distribution, \(q(x)\) represents the predicted probability distribution output by the neural network, and \(H(p,q)\) represents the difference between the predicted probability and the true probability.

During network training, the Adam optimizer was used to continuously optimize the network parameters and narrow the gap between the predicted values and the ground truth. Accuracy (ACC), precision, and recall were selected as the evaluation criteria for the algorithm model. The formulas are as follows:

$$ACC = \frac{TP + TN}{TP + FN + TN + FP}$$
(3)
$$Precision = \frac{TP}{TP + FP}$$
(4)
$$Recall = \frac{TP}{TP + FN}$$
(5)

where TP is the number of positive samples correctly predicted (true label 0, predicted 0); TN is the number of negative samples correctly predicted (true label 1, predicted 1); FP is the number of samples incorrectly predicted as positive (true label 1, predicted 0); and FN is the number of samples incorrectly predicted as negative (true label 0, predicted 1).
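A hedged sketch of the training and evaluation setup described in this section is shown below: Keras ImageDataGenerator augmentation, a MobileNet-V2 based classifier for the 8 gestures, the Adam optimizer with a 0.0001 learning rate, softmax cross-entropy loss (Eq. 2), and precision/recall (Eqs. 4–5) on held-out images. The directory layout, image size, epoch count, and validation split are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation roughly matching the operations listed above (flips, rotation,
# shifts, brightness jitter); 'gestures/' is an assumed class-per-folder layout
datagen = ImageDataGenerator(rescale=1.0 / 255, rotation_range=15,
                             width_shift_range=0.1, height_shift_range=0.1,
                             brightness_range=(0.7, 1.3), horizontal_flip=True,
                             validation_split=0.0625)  # 400 of 6400 held out
train_gen = datagen.flow_from_directory("gestures/", target_size=(224, 224),
                                        class_mode="categorical", subset="training")
test_gen = datagen.flow_from_directory("gestures/", target_size=(224, 224),
                                       class_mode="categorical",
                                       subset="validation", shuffle=False)

# MobileNet-V2 backbone with an 8-way softmax head for the 8 referent gestures
base = MobileNetV2(include_top=False, weights="imagenet",
                   input_shape=(224, 224, 3), pooling="avg")
classifier = models.Model(base.input,
                          layers.Dense(8, activation="softmax")(base.output))
classifier.compile(optimizer=Adam(learning_rate=1e-4),
                   loss="categorical_crossentropy",  # softmax cross-entropy, Eq. 2
                   metrics=["accuracy"])              # ACC, Eq. 3
classifier.fit(train_gen, epochs=30)

# Precision (Eq. 4) and recall (Eq. 5) on the held-out images
y_true = test_gen.classes
y_pred = np.argmax(classifier.predict(test_gen), axis=1)
print(precision_score(y_true, y_pred, average="macro"),
      recall_score(y_true, y_pred, average="macro"))
```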

4.3 User distraction evaluation experiment on GBI and TBI

In order to assess the appropriateness of the gestures from the elicitation study, we combined qualitative measurement methods:

Five-point Likert scale: a subjective qualitative measurement method in which each elicited gesture is rated on questions from 1 (positive) to 5 (negative).

Memory: how well the participant remembered the gesture when using it again after it was elicited, from 1 (very easy to remember) to 5 (very hard to remember).

Intuitiveness: the participant's intuitive understanding of the elicited gesture for the referent, from 1 (very easy to understand) to 5 (very hard to understand).

Comfort: the participant's physical perception of comfort when producing the elicited gesture, from 1 (very comfortable) to 5 (very uncomfortable).

4.3.1 Participant

We recruited another 10 subjects (5 males, 5 females) to participate in the interactive simulation experiment with the in-vehicle music system (mean age 26 years, SD = 1.5 years). All subjects had held a legal driving license for more than 1 year and had relevant car driving experience as well as TBI experience with in-vehicle information systems. All subjects had normal or corrected-to-normal vision and were in good physical condition during the simulation.

4.3.2 Apparatus

In the experiment, the EyeSo Glasses EG100H wearable eye-tracking system produced by Braincraft Technology was used to collect the subjects' eye movement signals. The sampling frequency was 50 fps, and the eye-tracking range was 83° horizontally/52° vertically; calibration was completed with the hand-held physical-marker calibration method. The gesture recognition process ran on an in-vehicle simulation platform built in a Volkswagen car. We used a 10.1-inch 1920 × 1200 capacitive touch screen and a 5-megapixel Microsoft LifeCam camera with a frame rate of 30 fps to simulate the interactive platform of the in-vehicle information system. The display screen and camera were fixed at medium height, slightly below the subjects' shoulders. The gesture control simulation prototype ran on a workstation with an i7-10700 CPU and a 2060 graphics card, and the gesture recognition kit ran in the Keras and TensorFlow-GPU environment. Figure 4 shows the scene of our eye movement experiment.

Fig. 4 The scene of driving simulation experiment

4.3.3 Procedure

The experiment was carried out on a fixed road, and subjects were asked to complete in-vehicle music system tasks, including open, close, pause, resume, volume+, volume−, previous, and next, while driving in the powered-on simulated driving environment. The experimental flowchart is shown in Fig. 5. Specifically, the participants drove the car on a straight road at a speed of 10 km/h and needed to complete the 8 command-control tasks with both GBI and TBI. Before starting the experiment, the participants familiarized themselves with the road conditions, the operation of the in-vehicle media control system, and the related gestures, which were explained by the assistant experimenter. The participants were also asked to put on the eye tracker, adjust their seats, and fasten their seat belts within 20 s. During the simulated driving experiment, participants were asked to keep their head level with the marking position on the front windshield and as stable as possible while driving. The assistant experimenter sat in the co-driver's seat and explained to the subject the next interactive operation to be done, including the command prompt, the gesture for the command (the gesture elicited by the participants in Sect. 4.1), and the execution time. The participants first completed the control tasks with gesture-based interaction (GBI), with a 10 s interval for each command operation. After finishing a user-defined gesture control task, each referent was rated within 5 s on the three appropriateness indicators of the five-point Likert scale: memory, intuitiveness and comfort. After completing the eight gesture-based tasks, the subjects rested for 10 s, and the experimenter then began to announce the task operations for touch-based interaction (TBI). In this mode, except that the experimenter did not need to prompt participants with gestures and the subjects did not need to rate gesture appropriateness, all other steps were the same as in GBI mode. Throughout the experiment, we used the EyeSo Glasses EG100H to record the change of the driver's attention focus, which yielded the AOI heat map, the number of AOI fixations, the AOI fixation time, and the average AOI fixation time.

Fig. 5 The flowchart of eye movement experiment

5 Results and analysis

5.1 Preferred gesture and agreement rate

Generally speaking, the tasks that users use frequently are the main factors that distract drivers from driving [45]. The 25 participants' acceptability of using mid-air gesture-based interaction for the eight frequently used referents from the main media control secondary tasks is shown in Table 5. Acceptance of opening/closing the media system and changing song titles exceeded 90%, and about 80–90% of participants accepted gesture control for the volume-related tasks. These results are consistent with the research of Stutts et al. and show that participants had a high acceptance of using mid-air gestures for the 8 in-vehicle media tasks.

Table 5 The result of 8 frequently used media referents from participants’ acceptability

Regarding user-preferred gestures, participants generally showed a certain degree of consensus when choosing their preferred gestures. We report results from 25 (participants) × 8 (referents) = 200 gesture-command proposals. Using Eq. 1, an agreement analysis was performed to understand the degree of consensus among participants on each specific gesture/referent.

The mean agreement rate for our set was 0.397, which is comparable to that obtained by Wobbrock et al. [35] (AR = 0.32 for one hand). Our results verify that participants showed a certain degree of consensus in using mid-air gestures to control in-vehicle media interaction tasks. According to the research of R.-D. Vatavu et al. [35], agreement rates above 0.50 indicate robust proposals for commands, agreements above 0.25 indicate possibly useful commands, and values below 0.25 indicate the need for expert design. In our results, all referents received agreement above 0.25 but below 0.50. The individual agreement rate for each task is shown in Fig. 6. Participants' gesture proposals varied in the number of fingers and the direction of finger-pointing. For some tasks, such as 'Previous' or 'Next', finger-pointing direction was the relevant feature; other tasks clearly depended on the number of fingers, such as 'Open', 'Close' and 'Pause'. 'Open', 'Previous' and 'Next' had the highest agreement.

Fig. 6 Agreement rates for gesture proposals elicited with right-hand; NOTE: Referents are shown on the horizontal axis in descending order, error bars show 95% confidence intervals

The agreement analysis justified that, in forming the final gesture-command associations, the gesture occurring most frequently for a referent won the command. Table 6 shows the mid-air gesture finally proposed by the users as the best gesture for each referent, giving a total of 8 consensus gesture prototypes, together with the appropriateness assessment of the gestures for each referent in simulated driving.

Table 6 User-defined gesture consensus set and appropriateness evaluation

From the results of the user gesture elicitation and the gesture appropriateness evaluation, we conclude that the choice of gestures may be related to users' previous experience and memory [50]. Through observation during the experiment and subsequent analysis, we derived the following design directions for in-vehicle media control gestures.

(1) When the non-dominant hand (left hand) was fixed on the steering wheel during driving, users preferred right-hand gestures with the wrist in the vertical orientation. Compared with the horizontal wrist, the vertical-wrist gesture seemed more relaxed and comfortable when executing commands, which reduced the driver's energy consumption when performing secondary tasks.

(2) We also found that, once the task command was determined, users preferred gestures with fewer fingers and simpler actions as the best gestures, such as the gestures for referents T1, T2, T7, and T8. For 'Open' and 'Close', participants tended to perform 'Close' by closing the fingers of the hand, like the number 0, and used one stretched finger, similar to the number 1, to pop up the 'Open' menu. Participants considered this gesture intuitive, although it also requires high-performance sensing and recognition technology to capture such details.

(3) Pointing gestures were frequently used. For example, when executing the volume control command, participants were reminded by the logo with a volume bar: they imagined that the upward-pointing index finger represented the volume bar, while the left and right pointing of the thumb represented volume − and +. Accordingly, the referents T3 and T4 achieved high agreement, and the gestures were easy to remember and intuitive.

(4) Gestures related to semantic symbols and culture were also proposed, because they matched conventions that participants were already familiar with [51]. For 'Pause', participants tended to open the whole palm with five fingers; when asked, they explained that this was mainly related to the everyday experience of traffic police signaling vehicles to stop. For 'Previous' and 'Next', commands were executed by pointing left and right with the thumb, an idea that came from the semantic connection with the command's interface logo.

Our results show that participants preferred physically and conceptually simple gestures, while HCI researchers tend to create more complex gestures, such as those with more moving parts, precise timing, or spatial dependencies. This suggests that, even if participants do not know the source of each gesture, they prefer user-defined gestures to gesture sets defined by interaction designers.

5.2 The results of gesture segmentation and recognition

To address the influence of illumination and complex backgrounds on gesture recognition in actual interactive scenes, we verified the performance of the model. A total of 8 (referents) × 50 (images) = 400 gesture pictures of the 8 referents from the user elicitation were collected with web crawlers. Each picture was 224 × 224 and had a corresponding hand-marked label, and we then added brightness noise to each image based on the label, giving 400 original + 400 brightness-noise images = 800 images. After data augmentation of these 800 images with the ImageDataGenerator, we finally obtained a gesture data set of 800 (images) × 8 (augmentation operations) = 6400 images. Among them, 6000 images were used as the training set and 400 as the test set. The VGG16 + U-net network optimized in this paper and the original U-net network were trained and compared; the comparison results are shown in Table 7. Our image data were passed through the multi-scale analysis and detection of VGG16 to extract features and then through the up-sampling convolutions of the original U-net for feature fusion; the segmentation results are shown in Fig. 7. The segmentation curve converged, indicating that our segmentation model performs well, and the segmentation accuracy reached 98.6%, which is 3.4 percentage points higher than the 95.2% of the original U-net network.

Table 7 Comparison results of segmentation between VGG16 + U-net and U-net
Fig. 7 The result of gesture segmentation network

We fed the segmented images into the lightweight MobileNet-V2 network for classification and recognition; the result is shown in Fig. 8. As can be seen from Fig. 8, our image data set achieved the best prediction time after segmentation by the optimized network. The network stabilized after about 20 iterations, with accuracy settling above 95%. In other words, the algorithm in this paper can quickly complete the gesture recognition task while maintaining accuracy, reducing redundant parameters, making the model more lightweight, and saving gesture interaction time in practical applications.

Fig. 8 The classification result of MobileNet-V2

5.3 Eye movement index analysis

For a long time in ergonomics research, the selection, continuity and switching of visual attention have been widely used as evaluation indicators of workers' mental workload [52]. During driving, the driver's viewpoint changes with the continuously changing driving environment. Reichle et al. [53] used indicators such as gaze time, saccade time, and saccade distance to study the characteristics of the driver's moving viewpoint. In this paper, we used the Eyeso Studio eye movement experiment design and data analysis software to export the subjects' eye movement data as analysis indices. We extracted the heat map of the subjects' fixation events on the AOI (display screen) as shown in Fig. 9, the number of AOI fixations as shown in Table 8, and the AOI fixation time and average AOI fixation time as shown in Fig. 10.

Fig. 9 The heat map of AOI

Table 8 The fixation number of AOI
Fig. 10 The fixation time of AOI

Analysis of the changes in the subjects' gaze data on the AOI verified that our user-defined gesture consensus set can greatly reduce user distraction. The larger the red area of the AOI, the longer the user fixated on the in-vehicle central control panel, and hence the longer the driver's attention deviated from the road while driving. As shown in Fig. 9, the red area in TBI mode was larger than that in GBI mode, indicating that when the driver performed the in-vehicle media control tasks while driving, the user-defined gestures were less distracting than the familiar traditional touch-based interaction.

The number of AOI fixations (Table 8) and the AOI fixation time (Fig. 10) reflect the load that the interaction process places on drivers: a greater number of fixations and longer fixation time mean that drivers needed to spend more effort and time identifying the operational function [54].

Combined with Fig. 10 and Table 8, during gesture-based interaction the number of times the driver fixated on the AOI was smaller than with TBI, and most fixations served to confirm the interaction feedback and the gesture, which required little fixation time. Among our interactive commands for the in-vehicle media control tasks, the Previous and Next commands were the items for which subjects glanced at the display screen for confirmation after completing the task. Figure 10 shows the comparison of target fixation time and average fixation time between GBI and TBI. The average fixation time of TBI was 0.822 s (t (5.15 s) = 19.387, p < 0.001, d = 1.515), while the average fixation time of GBI was 0.406 s (t (10 s) = 4.05, p < 0.001, d = 0.01). Compared with TBI, the time the line of sight left the road during GBI operation was reduced by 50.6%. Specifically, the participants' maximum distraction averaged 5.15 s with touch-based interaction (TBI), whereas with mid-air gesture-based interaction (GBI), within the 10 s command interval, the maximum distraction was much smaller than with TBI. According to multiple-resource theory, the subjects' mental load is directly related to the amount of cognitive resources used and to cognitive-resource conflicts [55]. As a result, although the drivers had many years of experience with touch-screen interaction, they still consumed cognitive resources because of the command arrangement on the screen, resulting in distraction. These results show that user-defined gesture interaction has good application value: in practice, allowing users to define command gestures makes the interaction more consistent with users' cognitive habits and achieves better memory, intuitiveness and comfort.
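The paper does not spell out the exact statistical procedure behind the reported t values and effect sizes; the sketch below shows one reasonable way to run such a comparison, a paired t-test and Cohen's d on per-participant mean AOI fixation times, using placeholder data rather than the study's measurements.

```python
import numpy as np
from scipy import stats

# Placeholder per-participant mean AOI fixation times in seconds (10 subjects)
tbi = np.array([0.85, 0.79, 0.88, 0.81, 0.76, 0.83, 0.86, 0.80, 0.84, 0.80])
gbi = np.array([0.42, 0.39, 0.45, 0.40, 0.37, 0.43, 0.41, 0.38, 0.44, 0.37])

t_stat, p_value = stats.ttest_rel(tbi, gbi)      # paired t-test, TBI vs GBI
diff = tbi - gbi
cohens_d = diff.mean() / diff.std(ddof=1)        # effect size for a paired design
reduction = 1 - gbi.mean() / tbi.mean()          # relative drop in off-road gaze time

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}, "
      f"reduction = {reduction:.1%}")
```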

6 Conclusions

In this work, we focused on gesture interaction and the end-user experience. Directly operated interactive interfaces are becoming more and more common, and gesture design will play an important role in determining the success of these technologies. Accordingly, taking the frequently used media control system as the application context, we conducted a gesture elicitation experiment on users' preferences and obtained a gesture consensus set for in-vehicle media control tasks. Through eye tracking and an appropriateness evaluation, the user's degree of attention distraction and load when performing in-vehicle secondary tasks during actual driving were evaluated. Finally, we optimized the gesture recognition algorithm; the segmentation accuracy was 98.6% and the recognition accuracy was 95%, which showed that our algorithm was suitable for the collected image data and provided an algorithmic basis for an in-vehicle gesture interaction system. The main contributions of this work are as follows:

(1) Compared with TBI, we considered that the initial cognitive workload of GBI may be even higher for beginners, because people must learn which gestures complete which commands. Therefore, our research used user preferences to define mid-air gestures. We presented a user-defined mid-air gesture consensus set for performing the tasks open, close, volume+, volume−, pause, resume, previous, and next in an in-vehicle interface. The experiments showed that the gestures for the open, previous, and next commands had high agreement, and the mid-air gesture vocabulary had good applicability. By understanding users' preferences and expectations, the mid-air gesture vocabulary obtained in this study achieved good appropriateness. This method does not sacrifice usability and user acceptance simply to realize practical applications, but designs gestures based on the experimental users' gesture preferences.

(2) Aiming at the problem of gesture recognition in actual complex interactive tasks, we optimized the gesture segmentation and recognition network model to avoid poor recognition performance caused by illumination, and reduced redundant parameters and computation to obtain a lightweight model. The recognition results showed that the optimized model achieved better segmentation and recognition effects.

(3) We obtained 8 consensus gesture prototypes for controlling the frequently used media tasks while driving, and tested the application value of user-defined interactive gestures in an actual driving environment with eye movement experiments. The eye movement experiment evaluated the distraction gap between TBI and GBI while driving and proved that the user-defined gesture consensus set can reduce users' distraction in secondary-task interaction. This result opens the potential to develop a mid-air gesture recognition system with optimal physical affordance for in-vehicle applications.

Gesture elicitation research is linked with computer iconography, and a more systematic interface for the elicitation and recognition feedback of gestures for other tasks will be designed in the future. Although this method avoids biases of gesture commands in cognition, learning, and interactive applications, which are problems designers should pay attention to in gesture design, this research has some limitations. One limitation is that we investigated only a small number of gestures and potential causes, involving only the 8 interactive commands of the in-vehicle media control system; the precise design of these systems requires further research. In addition, the investigation should also be repeated under real driving conditions to ensure that these effects are not only observed in the simulated environment.