1 Introduction

1.1 Introduction to driver distraction and IVIS design

Imagine going on a road trip with a friend who is driving their car. The first thing you notice is how much attention your friend devotes to the touchscreen while navigating the In-Vehicle Infotainment System (IVIS). Alas, the touchscreen provides no auditory feedback. This vignette highlights several human factors problems common in the driving context: the driver is distracted visually (looking at the touchscreen), physically (reaching out to it), and cognitively (manipulating its content) (NHTSA 2010; [87]). Can the IVIS be designed differently to minimize these distractions? This question motivated our research.

Multiple resource theory holds that humans have multiple attentional resource pools. When people perform more than one task, performance suffers less if the tasks draw on different resource pools (e.g., visual type: focal vs. peripheral vision; modality: visual vs. auditory) than if they draw on the same pool [82]. For example, when the primary task (driving) requires visual resources, the secondary task (IVIS interaction) can be designed to rely on non-visual resources [77]. Because mid-air hand gestures are monitored through proprioception, they can be performed with minimal visual attention, an advantage over button or touchscreen controls [45]. Gesture-based interfaces have thus emerged as a promising alternative to touchscreen interfaces for IVIS interactions, significantly decreasing off-road glances and lowering driver workload [30, 71]. Along the same lines, auditory displays have been explored in driving environments and shown to decrease visual distraction [33, 44, 68]. Adopting both approaches, the present study aims to improve driving safety, usability, and workload in gesture-based menu navigation by investigating the effectiveness of adding auditory displays.

1.2 Optimizing IVIS interactions: the role of auditory displays

Research has shown that combining gesture-based interaction with auditory displays can be effective in reducing driver distraction [65, 71]. However, the choice of auditory display must be made carefully. Although well-implemented auditory systems can reduce visual distraction and keep a driver’s eyes on the road, poorly implemented systems may have the opposite effect, imposing high mental demand and requiring long glances away from the roadway to acquire information about system status. To realize their potential, auditory displays must be kept sufficiently simple while presenting accurate feedback, so as to reduce a driver’s cognitive demand [13]. In previous work, spearcons (compressed speech) [79] have outperformed other auditory displays, including earcons [6] and auditory icons [27]. However, the details of spearcon design for the in-vehicle gesture interface context have not yet been determined. The current study seeks to identify this design granularity to make it a truly user-centered interface.

1.3 In-vehicle air gesture menu navigation system

Before discussing the specific application of air gesture systems in vehicles, it is useful to consider the broader category of technologies that can be applied to the in-vehicle context. Spoken Dialogue Systems (SDS) represent a complex integration of human–computer interaction technologies designed to enable verbal communication between users and systems. These systems, which rely on speech recognition, natural language processing, and speech synthesis, facilitate tasks ranging from simple queries to complex procedural interactions [49, 86]. They are already pervasive in vehicles but remain imperfect, with problems including language coverage, recognition errors, noise sensitivity, and limited support for repetitive or precise control. Touchscreens with multi-touch capabilities are an alternative, allowing drivers to use gestures they are accustomed to from smartphones [31]. However, touchscreens still require drivers to reach out to the display, which adds physical demand, and research shows that well-designed gesture interfaces can reduce visual distraction compared to touchscreens [71].

Transitioning from these dialogue-focused interfaces to more directly controllable graphical user interfaces, many systems use the WIMP paradigm ('windows, icons, menus, pointer'), a framework in user interface design that uses graphical elements to facilitate human–computer interaction. This interaction style, which became the foundation for most graphical user interfaces, relies on windows to contain different tasks, icons to represent functions and files, menus for command selection, and a pointer to navigate and select items [11]. We chose a WIMP-based interaction style for our study because of its familiarity to most users and its proven efficiency in computer systems. While other interaction styles, such as those based on touch or voice commands, could also be considered, the WIMP approach provides a reliable and well-understood framework for extending traditional graphical user interfaces into the in-vehicle context, potentially easing the learning curve and enhancing the user's sense of control when interacting with the system.

Different menu types have also been investigated. For example, May et al. [47] designed a one-dimensional menu system in which users could scroll up or down to navigate a list of up to eight common in-vehicle functions. Sterkenburg et al. [71], on the other hand, developed two-dimensional grid menu navigation prototypes (2 × 2 or 4 × 4) in which users could move their hand along two axes (left-right and up-down), moving a visual cursor from one menu item to another. We followed the latter design and extended it to three pages.

Various studies have evaluated in-vehicle air gesture menu navigation. Researchers have compared gesture systems to existing touchscreen systems, e.g., Gable et al. [24], Graichen et al. [30], May et al. [47], Sterkenburg et al. [71], and Wu et al. [84]. For example, Walker and colleagues [24, 48, 84] showed that driving performance was equivalent between the two systems, but the air gesture system resulted in more short glances away from the road, and participants perceived higher overall workload when using the air gesture menu navigation system. In contrast, Sterkenburg et al. [71] showed that both systems resulted in comparable driving performance and driver workload. The auditory-supported air gestures allowed drivers to keep their visual focus on the road, but slightly decreased secondary task performance compared to the touchscreen. Given that people are already familiar with touchscreen systems, this secondary task outcome is understandable. Note that only the auditory-supported air gesture system led to improved visual attention, which demonstrates the importance of auditory displays in this context. In a subsequent experiment, Sterkenburg et al. (2023) evaluated control orientation: horizontal (mouse metaphor using the x and z axes) vs. vertical (direct manipulation using the x and y axes). Although there were no differences in performance, vertical controls showed significantly lower workload than horizontal controls. Thus, the present study adopted the vertical control method.

1.4 In-vehicle air gesture system with auditory displays

Research then converged towards evaluating feedback modalities, whether unimodal, bimodal, or trimodal. May et al. [47], Jaschinski et al. [32], Shakeri et al. [65, 66], and Sterkenburg et al. [70,71,72] evaluated the auditory modality. Large et al. [43] and Shakeri et al. [65, 66] evaluated the tactile modality. Roider and Raad [57] and Shakeri et al. [65, 66] evaluated the peripheral visual modality.

Among research efforts that evaluated the auditory modality, May et al. [47] provided fast but intelligible speech feedback, as well as non-speech sounds for system status, while users navigated a one-dimensional menu list. Sterkenburg and colleagues [70,71,72] conducted multiple evaluations examining the effects of speech displays on in-vehicle air gesture controls for a two-dimensional grid menu. They showed that combined auditory and visual feedback lowered the frequency of off-road glances and reduced driver workload. However, the addition of auditory displays did not have a significant impact on lane departures or secondary task performance. All three evaluations [70,71,72] showed a significant reduction in visual distraction without degrading driving performance, supporting the common inference that prototypes with auditory displays clearly improve on prototypes without them. Two other evaluations were conducted by Shakeri et al. [65, 66] to assess different types of feedback (visual, auditory, haptic, and peripheral visual; 2017) and bimodal feedback added to ultrasound feedback (2018) for a sequential gesture execution secondary task. In both evaluations, auditory feedback was presented as earcons directly mapped to six gestures and played after a gesture was executed. Shakeri et al. [65] showed that auditory feedback resulted in better secondary task performance than tactile feedback but worse than visual feedback, and significantly reduced time spent looking away from the road; however, all feedback conditions resulted in similar driving performance. Shakeri et al. [66] then provided corroborating results: the bimodal auditory-ultrasound condition produced less time looking away from the road than the visual and ultrasound-visual conditions, while driving performance was similar across all conditions. Additionally, auditory feedback resulted in the numerically highest secondary task performance, which was significantly better than unimodal ultrasound feedback, was preferred by 47% of participants, and showed significantly less physical demand than the visual conditions. Recently, Moustafa et al. (2023) compared auditory icons, earcons, spearcons, and a no sound condition in the context of in-vehicle air gesture menu navigation. They showed that spearcons produced the least visual distraction, the lowest workload, and the best system usability, and were favored by participants.

Although some mixed results exist, the potential of auditory displays to improve driving safety when multitasking with an IVIS air-gesture interface is evident. Whereas the use of auditory displays has shown benefits for air gesture IVIS operation in general and menu navigation in particular, there has been no in-depth analysis of how to design each auditory display type. Most auditory menus have explored the use of speech or earcons. Although Moustafa et al. (2023) investigated different auditory displays (auditory icons, earcons, spearcons), that study used only one design for each auditory cue. Ambiguity remains concerning how auditory displays should be designed and how different auditory supports for air gesture navigation affect primary driving safety and secondary task performance. Therefore, we aim to bridge this gap in the literature by conducting an exploratory study to ultimately provide informed design guidelines for spearcons, which showed the best outcomes in the literature.

1.5 Auditory displays in vehicles

Much research has been conducted on the use of auditory displays inside the vehicle, either to support the driving task such as with warning signals [29] or to support secondary tasks such as with the navigation of infotainment systems [33, 70]. Researchers commonly classify auditory displays under two labels: non-speech sounds such as earcons and auditory icons, and speech sounds.

Earcons [6] are non-verbal synthetic sounds, usually abstract musical tones or sound patterns, that can be used in structured combinations such as menus and typically have an arbitrary relationship with the referent item or action. Auditory icons [27] are brief non-verbal sounds associated with objects, functions, or actions; they use elements of the analogic sound of the referent. Auditory icons utilize familiar sounds from the environment, making them immediately recognizable and intuitive for conveying information or alerts [27]. They also offer a repertoire of sound options to map to the referent: they can directly represent the referent using a sound it produces, or can be indirectly related through a sound produced by a surrogate of the referent [36]. Spearcons (“Speech-based Earcons”) [80] are brief auditory cues created by running text through a text-to-speech (TTS) algorithm and then time-compressing the result to produce faster speech without altering its pitch. Spearcons provide a direct, non-arbitrary mapping to the item they represent. Although spearcons are based on speech, they can become unintelligible at high compression and are therefore often classified as non-speech auditory cues.

Sabic et al. [60] examined the recognition of auditory icons, spearcons at two compression speeds (40% and 60% of original length) and TTS as car warning signals. They showed that auditory icons had significantly lower recognition accuracy than TTS, while spearcons’ accuracy was not significantly different from either but numerically better than auditory icons. Auditory icons also had significantly slower reaction times compared to spearcons and TTS, and 40% spearcons produced significantly faster response times compared to TTS. Results also indicated no significant differences between the 40% spearcons and 60% spearcons in terms of accuracy, reaction time, perceived temporal demand, and perceived annoyance. Moreover, a trend in the reaction time data suggested a direct correlation with the compression rate of spearcons: a 40% compression yielded the fastest reaction times, whereas a 100% compression (full speech) resulted in the slowest. Sabic et al. [59] assessed the effectiveness of spearcons, TTS and auditory icons under various background noise conditions while driving in terms of recognition accuracy, reaction time and inverse-efficiency scores. Overall, auditory icons were the least efficient, and spearcons only outperformed TTS in quiet environments without added noise sources such as music or talk-radio.

In terms of air gesture menu navigation tasks, only one study has compared all three non-speech auditory cues: auditory icons, earcons, and spearcons (Tabbarah et al. 2023). It showed that spearcons reduced visual distraction and workload, led to the best system usability, and were most favored by participants, which led us to conduct the present study with spearcons.

The present study aims to improve driving safety, usability, and workload in gesture navigation by investigating the effectiveness of adding auditory displays. Specifically, it focuses on the effects of different spearcon compression rates on these factors in the context of in-vehicle air gesture menu navigation.

2 Current study and hypotheses

Although the use of spearcons has shown positive results, research on alternative spearcon designs is still scarce, with a few exceptions [16, 58, 60, 69]. Sabic and colleagues [58, 60] evaluated 40% and 60% compression speeds within a larger evaluation of auditory car warning display recognition (2017), and thoroughly examined spearcon recognition at compression speeds ranging from 100% (TTS) to 20% in 10% decrements (2016). As opposed to other evaluations of auditory displays [17, 59, 60, 75], Sabic and Chen [58] did not provide any training on the auditory displays before conducting their study. They evaluated participants’ ability to recognize a spearcon word without prior training and identified an intelligibility threshold of about 75% at 40% compression, below which identification rates declined rapidly. 40% spearcons and TTS resulted in similar recognition efficiency even though 40% spearcons were responded to significantly faster, suggesting a tradeoff between compression speed and reaction time on one hand and accuracy on the other. Srbinovska et al. [69] examined the impact of training on spearcon recognition at different compression rates. Training significantly increased recognition rates compared to untrained scenarios at 20% compression (82–89% versus 25–47%), 25% compression (84–95% versus 39–56%), and unintelligible compressions down to 10% (82–95% versus 6–16%) on a word-by-word basis. Trained participants expressed more confidence in their ability to recognize spearcons and found the recognition task less difficult as their familiarity increased. Finally, Davidson et al. [16] evaluated trained spearcon recognition in a dual-task environment involving linguistic tasks such as reading, saying, and listening. They found that sound-producing secondary tasks (saying and listening) worsened spearcon identification when multitasking, while non-sound-producing tasks (reading) did not, because the former create competition for the auditory modality and verbal processing resources, consistent with multiple resource theory.

From this background, the current study investigated in-vehicle air gesture menu navigation interfaces with a focus on alternative spearcon designs that vary the compression rate. A significant gap exists in understanding the effects of speech compression in dual-task conditions, especially while driving. To this end, we posed the following research questions.

RQ1: How does adding spearcons affect air-gesture IVIS interaction in terms of driving performance, eye glance behavior, secondary task performance, perceived workload, and user experience?

Hypothesis 1: Adding spearcons will improve driving performance, eye glance behavior, secondary task performance, perceived workload, and user experience compared to text-to-speech (TTS) or no auditory display condition.

The literature shows that spearcons have faster recognition times than TTS, and that 40% compression is the threshold at which participants understand about 75% of words. However, spearcon recognition has been evaluated only as a primary task, in which participants could allocate all of their cognitive resources to the stand-alone recognition task. It is therefore of interest to test whether recognizing 40% spearcons in a secondary task context requires more cognitive resources, affecting selection times and increasing visual demand during menu navigation. Because the defining design parameter of a spearcon is temporal, we also suspect that 40% spearcons may induce a sense of urgency and hence result in higher perceived temporal demand.

RQ2: How do different spearcon compression rates affect air-gesture IVIS in terms of driving performance, eye glance behavior, secondary task performance, perceived workload, and user experience?

Hypothesis 2: 70% spearcon will provide the most efficient secondary task performance with faster selection times than 40% spearcon and TTS, and less mental and temporal demand compared to 40% spearcon.

Hypothesis 3: 40% spearcons will result in higher visual distraction than 70% spearcons and TTS, but less than the no auditory display condition.

3 Methods

3.1 Menu and interaction design

We developed a 2 × 2 grid menu selection system (see Fig. 1) with four square targets, each measuring 10 × 10 cm in the air gesture space, inspired by the design by Sterkenburg et al. [71] that proved most efficient. To expand the number of menu items, we created two additional pages, allowing the user to access 3 (pages) × 4 (options) = 12 menu choices, which better represents the high-level main menu structures of real in-vehicle displays. To preserve fidelity, each of the 12 menu items represented an IVIS option present in commercial vehicles. For the four auditory display conditions, we generated four sets of equivalent menu items (Table 1).

Fig. 1

Air gesture navigation prototype in developer view and the menu display screen

Table 1 Menu sets for experiment

Our gesture menu selection system comprises four gestures, each mapped to an IVIS action: system activation, search and navigation within a menu page, switching between menu pages, and selection (see Table 2). We introduced an “activation gesture” to initiate the operator-system interaction and avoid accidental gestures due to inadvertent hand movements. To achieve stimulus-response compatibility, we used the familiar swiping motion to navigate between pages, similar to a finger swipe on touchscreens and smartphones. To keep the design simple, the menu system wrapped around with a unidirectional swipe: from page 1, participants could only swipe right to page 2, then to page 3, then back to page 1. A swipe-right gesture was therefore synonymous with a “next page” command. For selection, users tap on the desired menu item.

Table 2 Gesture and action library
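To make the wrap-around paging and single-selection rules concrete, the following is a minimal Python sketch of that logic. The menu item labels, class structure, and function names are illustrative only and are not taken from the actual prototype.

```python
# Illustrative sketch of the wrap-around page cycling and single-selection
# logic described above. Menu item names are hypothetical, not the exact
# labels used in the prototype.

MENU_PAGES = [
    ["Navigation", "Media", "Phone", "Climate"],        # page 1 (2 x 2 grid)
    ["Seat Heating", "Defrost", "Cruise", "Settings"],  # page 2
    ["Bluetooth", "Radio", "Messages", "Camera"],       # page 3
]

class GestureMenu:
    def __init__(self):
        self.active = False        # set by the activation gesture
        self.page = 0              # current page index
        self.selection_made = False

    def activate(self):
        self.active = True

    def swipe_right(self):
        """Unidirectional swipe: page 1 -> 2 -> 3 -> 1 (wrap-around)."""
        if self.active:
            self.page = (self.page + 1) % len(MENU_PAGES)

    def tap(self, cell):
        """Tap selects one of the four cells (0-3) on the current page.
        Only one selection is allowed per command; further taps are ignored."""
        if self.active and not self.selection_made:
            self.selection_made = True
            return MENU_PAGES[self.page][cell]
        return None

menu = GestureMenu()
menu.activate()
menu.swipe_right()        # move to page 2
print(menu.tap(3))        # -> "Settings"
```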

Visibility of system status is fundamental to interactive UI design. Particularly with in-vehicle interfaces, we aimed to provide efficient, continuous visibility of the gesture menu IVIS state. Air gesture interfaces present an additional challenge in communicating system status because the user must be informed about hand tracking and gesture recognition states [23, 28]. Accordingly, we included a visual display that communicates gesture recognition status through simple binary visual feedback noticeable in the user’s peripheral vision. A green background indicated that the system was activated, and a red highlight indicated the user’s hand position within the 3D menu, as depicted in Fig. 1. For all conditions, two sound cues provided feedback on successful gesture execution: a “swoosh” sound after a swiping gesture and a generic digital confirmation sound (“click”) after a selection was made.
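As an illustration of how the red position highlight could be driven, the sketch below maps a tracked palm position onto one of the four 10 × 10 cm cells. The coordinate origin and axis conventions are assumptions for the example, not details of the actual implementation.

```python
# Sketch of mapping a tracked palm position onto the 2 x 2 grid for the red
# highlight feedback. The 10 x 10 cm cell size comes from the menu design
# above; the origin and axis conventions below are assumptions.

CELL_CM = 10.0  # each square target measures 10 x 10 cm in the gesture space

def highlight_cell(palm_x_cm, palm_y_cm):
    """Return (row, col) of the cell under the palm, or None if outside
    the grid. Origin assumed at the lower-left corner of the 20 x 20 cm area."""
    if not (0 <= palm_x_cm < 2 * CELL_CM and 0 <= palm_y_cm < 2 * CELL_CM):
        return None
    col = int(palm_x_cm // CELL_CM)
    row = int(palm_y_cm // CELL_CM)
    return row, col

print(highlight_cell(4.0, 14.0))   # -> (1, 0): upper-left cell highlighted
```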

3.2 Spearcon design

Spearcons were created using an online text-to-speech (TTS) engine with an American male voice and then applying the SOLA (synchronized overlap-add) algorithm of the Spearcon Factory software [79] to generate spearcon WAV files. We chose to evaluate spearcons at two compression rates: 40% and 70% of the original TTS length. Sabic and Chen [58] identified the intelligibility threshold of spearcons at the 40% compression speed. Moreover, Sabic and Chen [58] and Sabic et al. [60] found that 40% spearcons were responded to significantly faster than TTS. 70% is the default compression rate of the Spearcon Factory [79]. In addition to showing the best performance in Tabbarah et al. (2023), 70% spearcons had numerically faster recognition times than TTS for unrelated words [58]. We consequently chose to evaluate 40% and 70% spearcons in the experiment. Participants neither received training on the spearcons nor were provided with a visual representation of the menu items within the menu structure. Training has been shown to increase spearcon recognition from 52 to 89% for 25% spearcons [69]. The absence of training encouraged recognition rather than recall and supports stronger inferences about the use of spearcons for larger menu structures containing more menu items.
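The study's compression step used the SOLA algorithm from the Spearcon Factory. As a rough illustration of the idea, the sketch below instead uses librosa's phase-vocoder-based time stretching, which likewise shortens duration without shifting pitch; the file names are placeholders.

```python
# Approximate sketch of the spearcon time-compression step. The study used
# the SOLA algorithm from the Spearcon Factory; librosa's time_stretch is
# used here only as an illustration. File names are placeholders.
import librosa
import soundfile as sf

def make_spearcon(tts_wav_path, out_path, target_fraction):
    """Compress a TTS clip to target_fraction of its original duration
    (e.g., 0.4 for a 40% spearcon, 0.7 for a 70% spearcon)."""
    y, sr = librosa.load(tts_wav_path, sr=None)
    # rate > 1 speeds playback up: compressing to 40% length needs rate = 1/0.4
    y_fast = librosa.effects.time_stretch(y, rate=1.0 / target_fraction)
    sf.write(out_path, y_fast, sr)

make_spearcon("navigation_tts.wav", "navigation_spearcon40.wav", 0.4)
make_spearcon("navigation_tts.wav", "navigation_spearcon70.wav", 0.7)
```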

3.3 Experimental design and independent variable

The study followed a within-subject repeated-measures design. Each participant engaged in a 90-minute session and experienced all four auditory display conditions: 40% spearcon, 70% spearcon, TTS, and no auditory display (see Table 3). The order in which participants experienced the auditory conditions was fully counterbalanced to minimize order effects.

Table 3 Experimental design
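Because four conditions yield 4! = 24 possible orders, full counterbalancing can assign one unique order to each of the 24 analyzed participants. A minimal sketch of generating such an order list follows; the assignment logic is illustrative.

```python
# Sketch of full counterbalancing: 4 conditions yield 4! = 24 unique orders,
# one per analyzed participant. Condition labels mirror the design.
from itertools import permutations

conditions = ["40% spearcon", "70% spearcon", "TTS", "No audio"]
orders = list(permutations(conditions))          # 24 unique orders
for participant_id, order in enumerate(orders, start=1):
    print(participant_id, " -> ".join(order))
```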

3.4 Dependent measures

The dependent measures for this study were classified into five categories: driving (primary task) performance, eye glance behavior, menu navigation (secondary task) performance, perceived workload, and user experience.

3.4.1 Driving performance

Driving behavior can be described as the actions taken by the driver to “maintain lateral and longitudinal control of the vehicle to safely move the occupants of a vehicle from one point to another” (Smith 2018). Four driving metrics were recorded, and their standard deviations served as dependent variables in this study. Standard deviation measures how dispersed the data are relative to the mean, which is indicative of driving consistency and of drivers’ ability to maintain control of their vehicle while performing a non-driving secondary task. The driving dependent measures are listed below (a computation sketch follows the list):

  • Standard deviation of following distance (the distance maintained by the driver between their vehicle and the leading vehicle directly ahead): indicative of longitudinal vehicle control

  • Standard deviation of lane deviation: indicative of lateral vehicle control

  • Standard deviation of steering wheel angle: indicative of lateral vehicle control

  • Standard deviation of vehicle speed: indicative of longitudinal vehicle control
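As referenced above, here is a minimal sketch of how these four standard deviation measures could be computed from a simulator log. The column names and CSV layout are hypothetical and do not reflect the MiniSim export format.

```python
# Sketch of computing the four driving-performance measures from a simulator
# log with one row per sample. Column names are hypothetical.
import pandas as pd

log = pd.read_csv("drive_log.csv")

driving_measures = {
    "sd_following_distance": log["following_distance_m"].std(),
    "sd_lane_deviation":     log["lane_deviation_m"].std(),
    "sd_steering_angle":     log["steering_wheel_angle_deg"].std(),
    "sd_speed":              log["speed_mph"].std(),
}
print(driving_measures)
```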

3.4.2 Eye glance behavior

NHTSA (2012) guidelines indicate that 85% of off-road eye glances should last less than 2 s. A naturalistic analysis conducted on the last five seconds prior to a near-crash incident discovered that drivers had an average longest off-road glance lasting 1 s [39]. Of those glances, 36% targeted the visual display of an IVIS or locations similarly away from the forward roadway. To understand what a glance is, we first need to define a gaze. A gaze can be explained as the direction towards which the eyes are directed. A glance is hence defined as the transition to or from the area of interest (AOI) and maintaining visual gaze within the boundaries of the AOI for at least one fixation.

Accordingly, eye glances were placed into three categories based on their duration: short (< 1 s), medium (1–2 s), and long (> 2 s). Four variables were hence evaluated (a classification sketch follows the list):

  • Frequency of short, medium, and long glances

  • Dwell Time: total glance duration for a single menu selection task
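As referenced above, a short sketch of how off-road glance durations can be binned into the three categories and aggregated into dwell time for one selection task; the example durations are made up.

```python
# Sketch of binning off-road glance durations (in seconds) into the three
# categories above and computing dwell time for one selection task.
# The glance durations listed are made-up example values.

def classify_glances(durations_s):
    counts = {"short": 0, "medium": 0, "long": 0}
    for d in durations_s:
        if d < 1.0:
            counts["short"] += 1
        elif d <= 2.0:
            counts["medium"] += 1
        else:
            counts["long"] += 1
    return counts

glances = [0.6, 0.8, 1.4]           # off-road glances during one selection task
print(classify_glances(glances))    # {'short': 2, 'medium': 1, 'long': 0}
print(sum(glances))                 # dwell time = 2.8 s
```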

3.4.3 Menu navigation performance

  • Selection accuracy: the percentage of correct selection tasks during a single driving scenario.

  • Selection time: time elapsed between the offset of the auditory selection command and the execution of a selection gesture.

3.4.4 Workload

Subjective workload was measured using the widely used NASA-TLX (Hart 1988). Participants rated their perceived workload on a 20-point scale for six subscales: mental demand, physical demand, temporal demand, effort, performance, and frustration. They then performed pairwise comparisons between the six subscales based on which contributed more to their overall workload. A weighted average was calculated to indicate perceived overall workload, and results are presented as scores out of a maximum of 100.
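For reference, a sketch of the standard weighted NASA-TLX computation follows. The rescaling of the 20-point ratings to a 0-100 score is our assumption about how the reported percentages were derived, and the example ratings and weights are made up.

```python
# Sketch of the weighted NASA-TLX computation. Ratings are on the 20-point
# scale used in the study; weights are the tallies from the 15 pairwise
# comparisons (they sum to 15). Rescaling ratings to 0-100 before weighting
# is an assumption about how the reported percentage scores were derived.

ratings = {   # example values on the 20-point scale
    "mental": 12, "physical": 6, "temporal": 9,
    "performance": 5, "effort": 10, "frustration": 4,
}
weights = {   # times each subscale was picked in the pairwise comparisons
    "mental": 4, "physical": 1, "temporal": 3,
    "performance": 2, "effort": 4, "frustration": 1,
}
assert sum(weights.values()) == 15

overall = sum(ratings[k] * 5 * weights[k] for k in ratings) / 15  # 0-100 scale
print(f"Overall weighted workload: {overall:.1f} / 100")          # -> 45.0
```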

3.4.5 User experience

  • System Usability Scale (SUS): widely used by usability practitioners to assess the usability of a product or service [9]. This “quick and dirty” survey consists of 10 questions answered on a 5-point Likert scale; the resulting SUS score ranges from 0 to 100 (a scoring sketch follows this list). Bangor et al. (2009; 2008) described two qualitative interpretations of SUS scores.

  • Sound user experience questionnaire: given only in the conditions that contained an auditory display. The answers were given on a 5-point Likert scale.

  • Preference choice among four sound conditions.
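As referenced in the SUS item above, a sketch of the standard SUS scoring procedure for a single respondent; the example responses are made up.

```python
# Sketch of the standard SUS scoring procedure for one participant.
# Responses are on the 5-point Likert scale (1 = strongly disagree,
# 5 = strongly agree); the ten example responses are made up.

responses = [4, 2, 5, 1, 4, 2, 5, 2, 4, 3]   # items 1-10

contributions = []
for item, score in enumerate(responses, start=1):
    if item % 2 == 1:                 # odd items are positively worded
        contributions.append(score - 1)
    else:                             # even items are negatively worded
        contributions.append(5 - score)

sus_score = sum(contributions) * 2.5  # 0-100 range
print(sus_score)                      # -> 80.0
```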

3.5 Apparatus

3.5.1 Driving simulator

A medium-fidelity National Advanced Driving Simulator (NADS) MiniSim (see Fig. 2) was used to present the driving scenarios. Each scenario lasted around six minutes and consisted of a car-following task on a suburban route with low to moderate traffic. The speed of the lead vehicle varied between 35 and 50 mph. Participants were instructed to maintain a uniform, safe distance while following the lead vehicle (Figs. 3, 4).

Fig. 2

Experimental setup (Driving simulator setup, visual display monitor, Leap motion, and representative participant wearing eye-tracker)

Fig. 3

Frequency of short glances (< 1 s) across different auditory display conditions. **p < 0.00833. Error bars denote standard errors

Fig. 4

Dwell time across different auditory display conditions. **p < 0.00833. Error bars denote standard errors

3.5.2 LEAP motion

A Leap Motion Controller (Model LM-010) was used to detect and track participants' hand movements within its interactive zone. This zone extends more than 60 cm (24 inches) from the device, within a field of view spanning approximately 140 × 120 degrees. The controller's software recognizes 27 distinct hand elements, such as bones and joints, and can maintain tracking even when these elements are partially occluded by other parts of the hand. The device is equipped with two near-infrared cameras, each with a resolution of 640 × 240 pixels and spaced 40 mm apart. These cameras operate within an 850 ± 25 nm spectral range and typically capture images at 120 frames per second, allowing for precise motion detection within 1/2000th of a second.

3.5.3 Eye tracker

A Tobii Pro Glasses 2 eye tracker (sampling rate of 50 Hz) was used to capture participants’ glance behavior during the study.

3.6 Participants

A power analysis determined that a sample size of 24 was needed to achieve 80% power with a medium effect size. Accordingly, a total of 26 participants were recruited. One participant was compensated and excused from the study because the eye-tracking device could not be calibrated to their eyes. Another participant’s driving data were corrupted and were excluded. The 24 participants included in the analysis (14 males and 10 females; age: M = 23.25, SD = 1.88) came from 12 different countries. Language proficiency was not a concern in this study, as all participants were fluent English speakers and students at Virginia Tech, whose admission requirements include English-language proficiency. Each session lasted at most 1 h and 30 min, and each participant was compensated with $15 for their time and contribution.

3.7 Procedure

Participants were first briefed about the experiment and signed a consent form, then completed a short driving scenario serving as training and as a simulation sickness test run [25]. Each participant watched a video tutorial on how to operate the menu-gesture system and was then offered as much time as needed to practice. Before each of the four driving scenarios, participants were introduced to the auditory display and given sufficient time to practice the hand gestures and become familiar with the interface. Familiarization with the auditory display was self-assessed, with participants proceeding only once they felt comfortable that they could navigate the system and interpret the auditory cues. During each data collection scenario, participants performed 12 trials of the secondary menu navigation task. Each trial consisted of a one-second command instructing the participant to select one of the twelve menu items. After the command, a timer started and the participant had 20 s to make a selection; if the 20 s elapsed without a selection, the trial was counted as a failed attempt. The menu gesture system allowed only one selection per command, so an inadvertent selection was also counted as a failed attempt. Secondary task commands were spaced 25–35 s apart. Following each driving scenario, participants completed the NASA-TLX workload assessment, responded to a subjective questionnaire, and filled out the System Usability Scale (SUS) questionnaire. Upon finishing all four scenarios, participants were asked about their preferred auditory condition.
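For illustration only, the sketch below mirrors the trial schedule just described (12 commands, a 20 s response window, and 25-35 s spacing). wait_for_selection() is a hypothetical placeholder for the gesture system's selection handling, not part of the actual experimental software.

```python
# Illustrative sketch of the secondary-task schedule within one driving
# scenario: 12 selection commands, a 20 s response window per command, and a
# random 25-35 s gap between commands.
import random
import time

N_TRIALS = 12
RESPONSE_WINDOW_S = 20

def wait_for_selection(timeout_s):
    """Hypothetical placeholder: block until a selection gesture or timeout.
    Returns the selected item, or None for a failed (timed-out) attempt."""
    time.sleep(timeout_s)   # stand-in; the real system listens to hand tracking
    return None

for trial in range(1, N_TRIALS + 1):
    # the one-second auditory command naming the target item would play here
    result = wait_for_selection(RESPONSE_WINDOW_S)
    outcome = "failed" if result is None else f"selected {result}"
    print(f"Trial {trial}: {outcome}")
    time.sleep(random.uniform(25, 35))   # inter-command spacing
```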

3.7.1 Leap motion controller and gesture interaction

While the Leap Motion Controller provides high-resolution real time tracking data, its performance can be affected by environmental factors such as lighting, hand orientation and hand proximity. These factors were mitigated through a controlled lab environment with consistent lighting and orientation guidance. Furthermore, an introduction video was provided to familiarize participants with the proper hand placement and movements required to interact effectively with the system. Additionally, participants engaged in a practice session to ensure comfort and reduce the likelihood of tracking errors during the actual experiment. The setup aimed for an error margin within acceptable limits defined by the accuracy needed for the hand gestures to be recognized by the system.

3.8 Data analysis

For all data, we planned a repeated-measures analysis of variance (ANOVA) with the auditory display condition as a within-subject variable. To ensure the reliability of the results, we checked the parametric assumptions of the repeated-measures ANOVA: normality of residuals and sphericity.

The normality assumption was checked qualitatively by inspecting the normal quantile plot and the frequency histogram, and quantitatively using the Shapiro–Wilk goodness-of-fit test with a significance level of 0.05.

Sphericity was checked using Mauchly’s Test of Sphericity with a significance level of 0.05.

Depending on whether the data violated the ANOVA assumptions, parametric or non-parametric analyses were performed. A one-way repeated-measures ANOVA was conducted on data conforming to the assumptions, and partial eta-squared was calculated to measure effect size. When significant main effects were present, post-hoc paired-samples t-tests were conducted with the Bonferroni adjustment to control Type-I error. All parametric tests were conducted using JMP 16.0 (SAS Institute Inc., 2020). When departures from the ANOVA assumptions were present, an appropriate transformation was applied to the data (e.g., logarithmic, square-root, exponential) to satisfy the assumptions. When no transformation was appropriate, non-parametric tests were conducted. For all non-parametric data, including ordinal data, the Friedman test was conducted, followed by Bonferroni-corrected Wilcoxon signed-rank tests for pairwise comparisons when applicable. All non-parametric tests were conducted using the 2022 QI Macros statistical add-in for Excel (KnowWare International Inc., 2022).
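The analyses themselves were run in JMP and QI Macros. As a rough Python equivalent of the decision flow described above, the sketch below assumes a hypothetical long-format table with participant, condition, and value columns; the sphericity check is noted but omitted.

```python
# Sketch of the analysis pipeline for one dependent measure, assuming a
# long-format table with columns "participant", "condition", "value".
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

df = pd.read_csv("dwell_time_long.csv")   # hypothetical long-format export

def is_normal(values, alpha=0.05):
    _, p = stats.shapiro(values)          # Shapiro-Wilk normality test
    return p > alpha

# Normality screened per condition here as a simple approximation of
# checking residual normality.
normal = all(is_normal(grp["value"]) for _, grp in df.groupby("condition"))

if normal:
    # One-way repeated-measures ANOVA (Mauchly's sphericity test, e.g. via
    # pingouin, omitted here for brevity)
    print(AnovaRM(df, depvar="value", subject="participant",
                  within=["condition"]).fit())
else:
    # Friedman test on the participant-by-condition matrix
    wide = df.pivot(index="participant", columns="condition", values="value")
    print(stats.friedmanchisquare(*[wide[c] for c in wide.columns]))
    # Bonferroni-corrected Wilcoxon signed-rank pairwise comparisons
    # (alpha = 0.05 / 6 = 0.0083 for four conditions) would follow.
```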

4 Results

4.1 Driving performance

A logarithmic transformation was performed on the mean and standard deviation of following distance to satisfy parametric assumptions. A repeated-measures ANOVA was conducted on the following distance, lane deviation, and vehicle speed data. The standard deviation of steering wheel angle did not meet parametric assumptions, so the Friedman test was conducted. ANOVA and Friedman test results are presented in Table 4; there were no significant differences between auditory display conditions for any driving performance metric.

Table 4 ANOVA and Friedman test results for driving performance metrics. p < 0.05 indicates significant main effect of the auditory display condition

4.2 Eye glance behavior

Table 5 presents a summary of the eye tracking data collected during the study. Glance frequency and dwell time results are per selection task. The number of glance-free selections is presented per driving scenario, in which each participant performed 12 selection tasks.

Table 5 Descriptive statistics of eye glance behavior. Glance Frequency and Dwell Time are per selection task. Number of glance-free selections is per driving scenario

4.2.1 Short glance frequency

ANOVA results revealed significant differences in short glance frequency between different auditory display conditions, F(3, 69) = 5.4113, p = 0.0021, η2p = 0.19. Post-hoc paired-samples t-tests were conducted, and the results are presented in Table 6.

Table 6 Results of the post-hoc paired-samples t-tests conducted for multiple pairwise comparisons of the frequency of short glances

4.2.2 Medium and long glance frequency

Medium and long glance frequency data did not meet parametric assumptions, so the Friedman test was conducted. Results from the Friedman tests for each dependent variable are presented in Table 7; there were no significant differences between auditory display conditions.

Table 7 Results of the Friedman tests on medium and long glance frequency. p < 0.05 indicates significant differences

4.2.3 Dwell time

An exponential transformation was performed on dwell time data to satisfy parametric assumptions. ANOVA results revealed significant differences in dwell time between different auditory display conditions, F(3, 69) = 6.0898, p = 0.0010, η2p = 0.21. Post-hoc paired-samples t-tests were conducted, and the results are presented in Table 8.

Table 8 Results of the post-hoc paired-samples t-tests conducted for multiple pairwise comparisons of dwell time

4.2.4 Number of glance-free selections per driving scenario

ANOVA results revealed significant differences in the number of glance-free selections between different auditory display conditions, F(3, 69) = 5.0984, p = 0.0030, η2p = 0.18. Post-hoc paired-samples t-tests were conducted, and the results are presented in Table 9.

Table 9 Results of the post-hoc paired-samples t-tests conducted for multiple pairwise comparisons of the number of glance-free selections

4.3 Menu navigation performance

4.3.1 Selection accuracy

An exponential transformation was conducted on the accuracy data to satisfy parametric assumptions. ANOVA results showed a significant main effect of auditory displays on selection accuracy, F(3, 68) = 3.642, p = 0.017, η2p = 0.14 (Fig. 5). Post-hoc paired t-tests with a Bonferroni-corrected alpha level of 0.0083 revealed that 70% spearcons (M = 93.75%, SD = 6.14%) resulted in significantly higher secondary task selection accuracy than TTS (M = 86.46%, SD = 10.66%), t(23) = 3.58, p = 0.0016. Although not significant at the Bonferroni-corrected alpha, 70% spearcons tended to show higher secondary task selection accuracy than 40% spearcons (M = 85.76%, SD = 15.64%), t(22) = 2.71, p = 0.0125.

Fig. 5

Selection accuracy across different auditory display conditions. **p < 0.00833. Error bars denote standard errors

4.3.2 Selection time

A logarithmic transformation of the data was performed to satisfy the parametric assumptions. An ANOVA on the transformed data revealed a significant main effect of auditory displays on selection time, F(3, 68) = 4.5551, p = 0.0057, η2p = 0.167 (Fig. 6). Post-hoc paired-samples t-tests with the Bonferroni-corrected alpha value of 0.0083 revealed that 70% spearcon (M = 9.91 s, SD = 2.33) resulted in significantly slower selection times than the no auditory display condition (M = 8.62 s, SD = 2.04), t(22) = 3.36, p = 0.0029. Although not reaching the 0.0083 significance threshold, 70% spearcon tended to show slower selection times than TTS (M = 9.01 s, SD = 2.04), t(23) = 2.79, p = 0.0104. Table 10 presents selection times across all four auditory display conditions.

Fig. 6

Selection time across different auditory display conditions. **p < 0.00833. Error bars denote standard errors

Table 10 Descriptive statistics of Selection Time

4.4 Perceived workload

ANOVA results revealed no significant effect of auditory displays on any of the six subscales of the NASA-TLX or on the overall workload score. NASA-TLX results are presented in Fig. 7, and ANOVA results are summarized in Table 11.

Fig. 7

Perceived workload self-reported by participants using the NASA-TLX tool. Error bars denote standard errors

Table 11 F*, P-value, and partial Eta-squared results of the ANOVA conducted on NASA-TLX results

4.5 User experience

4.5.1 System usability scale (SUS)

A summary of System Usability Scale (SUS) score results is presented in Table 12. Systems with a SUS score above 72.75 are described as “good”, and systems scoring above 70 are considered “acceptable”, as per Bangor et al. [4]. According to these standards, the auditory-supported air-gesture menu navigation interface was rated good and acceptable under all four auditory display conditions.

Table 12 SUS scores and adjective description for all auditory display conditions

4.5.2 Auditory display user experience questionnaire

Participants filled out a questionnaire about how they perceived each speech-based auditory display after interacting with it. Table 13 and Fig. 8 show how participants rated each auditory display on seven sound characteristics. The auditory display with the best outcome is highlighted in blue.

Table 13 Sound questionnaire descriptive statistics
Fig. 8

Mean scores for sound characteristics by auditory display condition. Error bars indicate standard errors. The patterns in the bars represent different auditory conditions (striped for 40% Spearcon, solid for 70% Spearcon, and dotted for TTS) to aid viewers with color vision deficiencies

4.5.3 User preference

At the end of the study, participants were asked to rank the four auditory display conditions based on their preference. The results are depicted in Fig. 9. Eleven participants chose 70% spearcon condition as their first choice.

Fig. 9

Auditory display condition preference

5 Discussion

5.1 Revisiting the results

5.1.1 Driving performance

The results of this study showed that the auditory display conditions had no influence on driving performance. These results agree with the findings of Sterkenburg et al. [70, 72] that speech-based feedback did not improve driving performance compared to the absence of any auditory display while navigating a 2 × 2 grid menu with air-gesture controls. The results also align with Tabbarah et al.’s previous study (2023), in which the 70% spearcon and no auditory display conditions did not differ significantly. This result further agrees with the previous study [65], in which there was no difference in driving performance among the multimodal conditions, including visual, auditory, and tactile feedback for the in-vehicle gesture interface.

5.1.2 Eye glance behavior

Results showed that the auditory display conditions significantly influenced eye glance behavior (short glance frequency, dwell time, and number of glance-free selections per driving scenario) compared to the no auditory display condition. Descriptive statistics for short glance frequency, dwell time, and number of glance-free selections suggest a consistent pattern: 70% spearcon induced the least visual distraction, closely followed by TTS, then by 40% spearcon, and lastly by the no auditory display condition, which induced the numerically highest visual distraction across all measures. Post-hoc pairwise comparisons between auditory display conditions strongly indicate that the use of 70% spearcon or TTS is visually safer than the no auditory display condition. A closer look at the direct comparison between spearcon conditions at different compression rates reveals a sizable difference: 70% spearcon induced 34% fewer short off-road glances than 40% spearcon. Participants using 70% spearcon spent on average 32% less time looking away from the road towards the visual display than when using 40% spearcon. Additionally, navigating the menu system with 70% spearcon feedback resulted in 44% more glance-free selections than with 40% spearcon. Even though only the frequency of short glances revealed differences at the conservative significance level of 0.0083, all other spearcon comparisons showed the same tendency (p < 0.05: p = 0.014 for dwell time and p = 0.022 for glance-free selections), demonstrating similar trends.

The pattern of visual distraction displayed by 70% spearcon, TTS, and the no auditory display condition is consistent with the results from Tabbarah et al. (2023) and with the literature. Sterkenburg and colleagues [70,71,72] found that TTS significantly reduced the frequency of off-road glances for in-vehicle air-gesture grid menu navigation, and Larsson and Niemand [44] found that spearcons reduced the frequency of off-road glances while decreasing dwell time compared to the absence of auditory displays. Both findings were replicated in our results. Nonetheless, no prior study has evaluated fast-paced spearcons in a driving context; our eye glance behavior results provide the first evidence on fast compressed speech in a driving secondary task.

5.1.3 Menu navigation performance

Selection time and accuracy results reveal a significant effect of the auditory display condition. The selection accuracy of 70% spearcon (93.75%) was numerically superior to the cluster of accuracies of 40% spearcon, TTS and the no auditory display condition (85.76%, 86.46%, and 87.32% respectively). The only significant difference in accuracy was, however, between 70% spearcon and TTS. Also, the mean selection time of 70% spearcon (9.91 s) was significantly slower than the mean selection time of the no auditory display condition (8.62 s). Selections using 70% spearcon were also numerically slower than selections using 40% spearcon (9.08 s) and TTS (9.01 s).

Both accuracy and selection time results for TTS and the no auditory display condition conform with the findings of [70, 71] that there is no significant difference between these auditory display conditions. Nonetheless, our results do not conform with Larsson and Niemand [44], who found no difference in selection times between the spearcon and no auditory display conditions while navigating a one-dimensional list menu while driving. However, there are major differences between the in-vehicle menu system in our study and Larsson and Niemand’s, notably the interaction modality (air-gesture vs. button) and the design of the menu structure (three-dimensional vs. one-dimensional list).

5.1.4 Perceived workload

There were no significant differences in any perceived workload measure across the auditory display conditions. However, an identifiable numerical trend in overall workload, mental demand, and effort indicates relatively higher perceived workload for the no auditory display condition, followed by 40% spearcon, then 70% spearcon, with TTS inducing the numerically lowest perceived workload. The limited body of research that has evaluated speech-based feedback in menu navigation agrees that speech-based feedback improves perceived mental demand and overall workload [34, 70,71,72]. Shakeri et al. [66] also showed that the auditory feedback condition produced significantly less physical demand than the visual condition. Therefore, the trend identified in our results conforms with the literature. The lack of significance, however, could be caused by a difference in experimental design: our design includes three speech-based auditory display conditions. The lack of a significant difference in temporal demand between the 40% and 70% spearcons also conforms with findings by Sabic et al. [60], who examined the recognition of 40% and 60% spearcons as in-vehicle warning signals.

5.2 Revisiting the research questions and hypotheses

To achieve the goal of this research, increasing understanding of how different attributes of auditory displays affect driving safety while interacting with in-vehicle information systems (IVISs) using mid-air gesture controls, the following research questions were devised.

RQ1: How does adding spearcons affect air-gesture IVIS interaction in terms of driving performance, eye glance behavior, secondary task performance, perceived workload, and user experience?

Hypothesis 1: Adding spearcons will improve driving performance, eye glance behavior, secondary task performance, perceived workload, and user experience compared to text-to-speech (TTS) or no auditory display condition.

RQ2: How do different spearcon compression rates affect air-gesture IVIS in terms of driving performance, eye glance behavior, secondary task performance, perceived workload, and user experience?

Hypothesis 2: 70% spearcon will provide the most efficient secondary task performance with faster selection times than 40% spearcon and TTS, and less mental and temporal demand compared to 40% spearcon.

Hypothesis 3: 40% spearcons will result in higher visual distraction than 70% spearcons and TTS, but less than the no auditory display condition.

Secondary task performance results show that menu selection using 70% spearcon resulted in the highest accuracy yet the slowest selection time. While using 70% spearcon, participants were significantly more accurate than when using TTS. Although not reaching statistical significance, 70% spearcon resulted in numerically higher accuracy than 40% spearcon and the no auditory display condition. As for selection time, 70% spearcon resulted in significantly slower selections than the no auditory display condition, and numerically slower selections than 40% spearcon and TTS. There is hence a tradeoff between selection accuracy and selection time while using 70% spearcon. Upon closer examination of participants’ subjective feedback, 70% spearcon was more comfortable to use than 40% spearcon and TTS. Multiple participants commented that 70% spearcon was easier to follow; it matched the speed of their hand movement or lined up with the speed at which they were reading. The slower selection time of the 70% spearcon condition might hence be associated with a higher level of comfort and ease of use. In terms of perceived workload, there were no significant differences between 40 and 70% spearcons. After careful consideration of all the results, we can infer that H2 is partially supported.

Statistically significant differences and numerical patterns in the eye glance behavior results strongly indicate that 40% spearcon results in more visual distraction than 70% spearcon. Although there were no significant differences between 40% spearcon and TTS, 40% spearcon resulted in numerically higher short glance frequency and dwell time, and a numerically lower number of glance-free selections. As for the comparison between 40% spearcon and the no auditory display condition, there were neither statistically significant nor numerical differences in the frequency of short glances or dwell time. Therefore, we can infer that H3 is also partially supported. After careful consideration of all dependent measures’ outcomes, we can infer that adding spearcons resulted in differences in a number of dependent measures (RQ1). However, spearcons showed a speed-accuracy trade-off, and the results also depend on the compression rate (RQ2). Therefore, taken together, H1 is also partially supported. In conclusion, 70% spearcons are recommended for in-vehicle air gesture menu navigation systems, with further research warranted.

6 Conclusion

This study explored in-vehicle air gesture menu navigation by adding spearcons at different compression rates, compared to TTS and no sound conditions. The results showed that 70% spearcons outperformed the other conditions in reducing visual distraction and improving menu navigation accuracy, and they were the auditory display most preferred by participants. However, 70% spearcons did not show any significant differences in driving performance or workload compared to the other conditions, and they exhibited a tradeoff between speed and accuracy in the menu navigation task, with slower selection times but higher accuracy compared to no sound.

Based on these findings, we recommend using 70% spearcons for in-vehicle air gesture interfaces. There are some limitations to the current study that can be addressed in future research:

  • The LEAP motion sensor used for hand tracking has its own errors and limitations. As hand tracking technologies improve, the air gesture interactions are expected to become more robust and reliable.

  • The current study used a three-dimensional menu system, whereas most previous studies used a one-dimensional menu navigation task. Investigating more complex and realistic menu structures with appropriate auditory displays for each context would help refine the design guidelines.

In summary, this research provides a granular analysis of one key variable in the design of auditory displays—spearcon compression rates—for in-vehicle air gesture interactions. We believe this type of detailed evaluation of specific design parameters will lead to more user-centered interfaces that are optimized for the driving context. This study was conducted with manual driving, but it would be interesting to explore these interactions in the context of different levels of vehicle automation, as the driver's roles and responsibilities change. Further research can build upon these findings to develop robust multimodal interfaces that maximize the potential benefits of air gesture controls and auditory displays in vehicles. Future studies could also explore dynamic spearcon compression, adjusting the compression rates in real-time, to potentially enhance reaction times while still maintaining satisfactory accuracy levels. Such adaptive auditory feedback mechanisms could optimize user interaction by calibrating to the user's performance and preferences over time.