1 Introduction

Pedestrian safety is a major concern: approximately 310,000 pedestrian fatalities are recorded annually worldwide, accounting for 23% of all road traffic fatalities (World Health Organization 2023). Pedestrian collisions frequently involve a motor vehicle as the impacting party; in the Netherlands, for example, motor vehicles are involved in 63% of such incidents (SWOV 2020). Analyses of crash casualties in the United States and Europe indicate that pedestrian behaviour is implicated in approximately two-thirds of pedestrian-vehicle collisions, and that 10 to 15% of collisions stem from obstructed views (Bálint et al. 2021; European Road Safety Observatory 2018; Hunter et al. 1996; Yue et al. 2020).

Improvements in pedestrian safety have been made through vehicle technologies such as radar and camera systems, which detect approaching pedestrians, alert the driver, and autonomously initiate braking or steering. These systems are being improved through pedestrian path prediction, which aims to identify pedestrians entering the road as early as possible (Rudenko et al. 2020), thereby increasing the time budget for the vehicle to respond. However, detecting a pedestrian or inferring a pedestrian's intention may not always be possible, especially when the pedestrian steps onto the road from between parked vehicles (Palffy et al. 2023).

Additionally, pedestrian warning systems have been developed that use mobile devices, such as smartphones and smartwatches, to alert pedestrians ahead of potential collisions (Bastani Zadeh et al. 2018; Liu et al. 2015; Won et al. 2020). In the current study, we extend this concept by investigating whether augmented reality (AR), presented via a head-mounted display (HMD), could offer a solution for pedestrians dealing with occlusion of approaching vehicles. More specifically, we explored (i) whether providing a live video stream of the occluded area can improve pedestrians' behaviour and perception, and (ii) the impact of diminished reality, in which environmental occlusion is substituted with direct visibility, on pedestrian behaviour and perception.

A fundamental question in the design of any AR system is how the augmented feedback, in this case the video feed, should be positioned. There are different reference frames to consider. Firstly, the information could be presented at a fixed position on the screen of the HMD, an approach known as head-locked presentation (Lebeck et al. 2017). It is also possible to present the information at a fixed distance from the user's torso, an approach referred to as body-locked (Klose et al. 2019) or surround-fixed (Feiner et al. 1993) presentation. An alternative is to tie the information to the world, an approach known as world-fixed (Feiner et al. 1993) or world-locked (Lebeck et al. 2017) presentation.

An advantage of head-locked and body-locked AR is the presence of information at an accessible position, which may induce a prompt response from the user, especially when information in the real world lies outside the user's immediate attention (Ghasemi et al. 2021; Schinke et al. 2010; Smith et al. 2021; Tabone et al. 2023). However, a risk is that users might respond to the cues presented without successfully integrating these with task-relevant real-world cues. This could occur either because the real-world cues have not yet been visually identified, or because of the challenge of switching cognitive and/or accommodative attention between the augmented and real-world cues (Chen et al. 2023; Dixon et al. 2014; Kerr et al. 2012). For similar reasons, head-locked information, or information that is otherwise not clearly locked to the world, can be difficult to use while walking and might induce discomfort, depending on whether the information is hard-locked or soft-locked to the user's head (Fukushima et al. 2020; Kaufeld et al. 2022; MagicLeap 2020).

World-locked AR, on the other hand, offers the advantage of projecting cues at a contextually relevant location. Specifically, when a world-locked display aligns closely with the task at hand, it can offer an intuitive and seamless mode of information processing (Bauerfeind et al. 2021; Kim et al. 2018; Robertson et al. 2008; Schankin et al. 2017; Wickens 2021; Zhao et al. 2023). One potential drawback of world-locked interfaces, however, is that locating the world-locked information might take the user more time than with head- or body-locked displays, where the information is directly accessible (De Oliveira Faria et al. 2020; Lee and Woo 2023; Tabone et al. 2023).

It should be noted that the classification into head-locked, body-locked, and world-locked reference frames does not fully capture the breadth of AR designs. Some researchers propose that it is more useful to envisage a continuum of 'naturalism' (Pijnenburg 2017) or 'local presence' (Rauschnabel et al. 2022), ranging from head-locked overlays, such as textual instructions or warning symbols, to the conformal presentation of virtual information that is seamlessly integrated into the real world (Kim et al. 2016; Wickens 2021). The placement on this continuum is influenced not only by the choice of reference frame but also by, among other factors, the quality of depth rendering of virtual objects (Rauschnabel et al. 2022).

Diminished reality (DR) refers to the process of removing or modifying specific elements of the environment, usually in real time (Cheng et al. 2022; Mann and Fung 2002; Mori et al. 2017). DR can be considered a subtype of AR: while AR is typically used to enhance the user's perception of the real world by overlaying virtual elements onto it, DR concerns the removal or attenuation of certain elements of the physical world. Various methods exist for achieving DR. One is inpainting, which involves removing an object from the current image and then filling the gap with plausible background details to 'guess' what the background should look like (Elharrouss et al. 2020). Another approach is video see-through, which uses a remote video camera to capture the scene beyond the occluding object and processes the video feed before displaying it to the user (e.g., Meerits and Saito 2015; Rameau et al. 2016). In automotive settings, several see-through displays have been demonstrated. In Samsung's Safety Truck, for example, live video images were displayed on the back of the truck, effectively allowing trailing drivers to 'see through' the truck (Samsung 2015; see also Zhang et al. 2018). Gomes et al. (2012) and Rameau et al. (2016) developed 'see-through car' systems, which provided drivers with an unobstructed view of the road. Other researchers have used the concept of transparency, primarily in virtual environments. In this regard, Yasuda and Ohama (2012) addressed the problem of poorly visible intersections by making a wall semi-transparent, so that an approaching car could be seen through it. Lindemann et al. (2019) examined a semi-transparent cockpit that allowed drivers to see parts of the environment usually blocked by the car body.

In the current study, we investigated a series of AR solutions aimed at solving the occlusion problem for pedestrians. The investigated designs ranged from a head-locked video feed of the occluded area (thus low local presence in the framework of Rauschnabel et al. 2022), and a solution in which the same video feed was presented at a fixed distance from the user on the opposite side of the road (medium local presence), to two DR solutions that blended in with the environment (high local presence). Inspired by the DR solutions in the automotive domain described above, two DR implementations were chosen, namely a see-through video feed displayed on the occluding vehicle and a solution in which the occluding vehicle was made semi-transparent. Given that the different positions on the local-presence dimension may carry distinct advantages and disadvantages, as delineated above, we refrained from formulating explicit hypotheses. Instead, our primary interest lay in identifying which AR conditions in the pedestrian-crossing scenario show the most and least favourable results in terms of perceived pedestrian safety, workload, comfort, and acceptance.

Due to the challenges associated with implementing AR in real-world settings, we chose to test the proposed solutions in a virtual environment using a high-end HMD. While this approach has its drawbacks, such as potential disparities in attentional switching compared to AR evaluated in real-world settings (Gabbard et al. 2019), the use of virtual reality offers various advantages. These include improved experimental control and the mitigation of technical issues, such as difficulties in precisely anchoring objects to the world (Walter et al. 2019; Wiesner 2019).

As indicated above, there is a challenge of divided attention between the task-intrinsic content in the world (i.e., the approaching car) and the augmented feedback (i.e., the video feed). Prior studies have focused on directing the user's view by means of arrows (Schinke et al. 2010) or attracting it by means of bounding boxes (Chen et al. 2015; Orlosky et al. 2019), attention funnels (Renner and Pfeiffer 2017), flickering (Schmitz et al. 2020; Waldin et al. 2017), and contrast enhancement (Lu et al. 2012) around points of interest. In the present study, we additionally investigated whether a cue in the form of a 3D arrow that continuously indicates the location of the task-intrinsic information (i.e., the approaching car) provides added value compared to the video feed alone.

2 Method

2.1 Participants

A total of 28 individuals, 23 of whom were male, aged between 19 and 32 years (M = 25.0, SD = 3.1), participated in the experiment. Participants were students and doctoral candidates from different faculties at Delft University of Technology. The recruitment process did not offer incentives, and welcomed individuals regardless of their driving experience, nationality, driving-side orientation, or age. The study was approved by the Delft Human Research Ethics Committee, approval no. 1817. Each participant provided written informed consent before the start of the experiment.

Demographic characteristics were recorded by means of a pre-experiment questionnaire. The sample predominantly comprised Dutch nationals (n = 19), but also included individuals of Indian, Italian, Belgian, Russian, and Maltese nationality. Most participants held a driver’s licence (n = 25), and among these, the average driving experience was 7.12 years (SD = 3.0). Furthermore, most participants were regular pedestrians in urban environments, with 12 participants walking every day and 10 walking 4–6 days per week. In terms of digital entertainment, 10 participants reported playing video games several times a week, 8 playing approximately once a month, and 10 rarely playing or not playing anymore.

2.2 Materials

The experiment was executed using Unity 2019.4.3f1 on an Alienware PC, paired with a Varjo VR-3 HMD. The PC was equipped with an Intel i7-9700K CPU and an NVIDIA GeForce RTX 2080 Ti GPU. Participants used the spacebar on the PC's keyboard to indicate their readiness to cross.

The Varjo VR-3 HMD featured a high-resolution display, offering a 90 Hz refresh rate and a 115° horizontal field of view. The focus area, spanning 27° × 27°, was rendered at 70 pixels per degree on a micro-OLED display, providing 1920 × 1920 pixels per eye. Meanwhile, the peripheral area was rendered at about 30 pixels per degree on an LCD, producing 2880 × 2720 pixels per eye. Additionally, the Varjo VR-3 offered foveated rendering via integrated eye-tracking.

Four SteamVR base stations enabled the positional tracking of the Varjo HMD. The frame rate of the simulation was set to 30 frames per second. Audio was delivered through a Jabra Evolve stereo headset. After each experimental session, all equipment was sanitised with alcohol wipes. The experimental setup is depicted in Fig. 1.

Fig. 1
figure 1

The experimental setup

2.3 Participant task

In the experiment, participants were instructed to press a response key when they felt safe to cross the road. More specifically, they were given the following task instructions: whenever they perceived crossing as safe, they were to press and hold the spacebar on the provided keyboard, sustaining the press for as long as crossing felt safe. Should they feel it was no longer safe to cross, they were to release the spacebar. They were allowed to engage and disengage the key as many times as they deemed necessary. Before each trial, participants received the verbal auditory instruction "Press now" to indicate that they should press and hold the key.

2.4 Virtual environment

The study used an open-source simulator (Bazilinskyy 2020; Bazilinskyy et al. 2020b), designed using the Unity game development platform. The simulated setting is an urban city centre, with two-lane roads and static elements like buildings, parked cars, and trees (Fig. 2).

Fig. 2
figure 2

The virtual street in which the experiment took place

In this study, the participant assumed the role of a pedestrian, standing on a curb behind a parked Nissan 1400 'Bakkie' and a Ford Mustang GTO. This positioning largely impeded the view of the road and of the approaching vehicle, a Smart Fortwo, which came from the pedestrian's left.

The in-game camera, representing the pedestrian's perspective, was set at a fixed height of 1.67 m, which is close to the 1.65 m average eye height for Dutch adults aged 20 to 30 years old (DINED 2020). The pedestrian stood on a 0.22 m-high curb, 2.5 m from the road edge measured orthogonally. A fixed camera position was used, and only the rotation of the head affected what participants viewed. This approach was adopted to maintain a consistent perspective and degree of occlusion of the approaching vehicle for all trials and participants. The pedestrian was represented by an avatar; this avatar was visible when the participant looked down. However, the avatar was static and did not respond to the participant's movements.

The road spanned 10 m in width. The obstructive vehicle, a Nissan pickup truck, had dimensions of 3.8 m in length, 1.7 m in height, and 1.67 m in width. The pedestrian was located 4.25 m behind the rear of the pickup truck and remained stationary throughout the duration of the simulation. A top-down overview of the distances is provided in Fig. 3.

Fig. 3
figure 3

Top-down overview of the virtual environment. In this instance, the vehicle is positioned at the stopping point in the yielding scenario

In all trials, the car started from a standstill. It accelerated and made a 90-degree left turn towards the participant; after 4.6 s, the car had completed this turn and reached an approach speed of 15 km/h. We opted for a low vehicle speed of 15 km/h, as a lower speed affords the pedestrian more time to respond, thus allowing for a more effective comparison of experimental conditions. Furthermore, a low speed introduces a degree of ambiguity regarding whether the vehicle will stop. In comparison, if the speed of the vehicle is high, it is evident that crossing in front of it is not a safe option, and additional explicit signals will therefore have relatively little influence on the pedestrian's crossing intentions (see Dey et al. 2021 and Onkhar et al. 2022 for similar argumentation).

In non-yielding scenarios, the vehicle maintained this 15 km/h speed until the end of the trial. In yielding scenarios, the vehicle initiated deceleration at a rate of 1 m/s2 at an elapsed time of 14.3 s, 11.64 m from the pedestrian. The vehicle stopped at 18.4 s, positioned 6.55 m away from the pedestrian. Following the halt, the vehicle remained stationary for 5 s before recommencing motion. Yielding trials had a total duration of 31.8 s, whereas non-yielding trials took 22.0 s to complete. A representation of the pedestrian-vehicle distance versus time relationship is shown in Fig. 4.
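
As a plausibility check, the braking profile can be reproduced from elementary kinematics. The following MATLAB sketch is our own reconstruction, with illustrative variable names:

```matlab
% Plausibility check of the yielding profile from elementary kinematics
v0     = 15 / 3.6;          % approach speed: 15 km/h = 4.17 m/s
a      = 1;                 % deceleration (m/s^2)
tBrake = 14.3;              % onset of braking (s)

tStop  = tBrake + v0 / a;   % standstill at 14.3 + 4.17 = ~18.5 s (reported: 18.4 s)
dBrake = v0^2 / (2 * a);    % along-track braking distance: ~8.7 m

% Note: the 11.64 m and 6.55 m reported above are Euclidean distances to the
% vehicle's centre point (cf. Fig. 4), not along-track distances, and are
% therefore not directly comparable to dBrake.
```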

Fig. 4
figure 4

Distance between the approaching vehicle and pedestrian as a function of time. Grey backgrounds represent intervals where the vehicle could not be seen in the Baseline condition because it was fully occluded by the parked vehicle. Note that the yielding vehicle started to decelerate at an elapsed time of 14.3 s, came to a full stop at 18.4 s, drove away at 23.4 s, and passed the pedestrian at 26.9 s. The non-yielding vehicle passed the pedestrian at an elapsed time of 17.2 s. Distances were calculated based on the Euclidean norm, using the pedestrian’s location and the centre point of the vehicle

Each trial began with the sound of a starting engine from the approaching vehicle. As the vehicle drove, it produced the humming sound of a combustion engine. The sound perceived by the participant depended on the distance to the vehicle and incorporated the Doppler effect.
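
For illustration (this is not taken from the study's implementation), the magnitude of the Doppler shift at this approach speed can be estimated as follows, assuming a speed of sound of 343 m/s:

```matlab
% Illustrative estimate of the Doppler shift for a source approaching at 15 km/h
c = 343;                      % speed of sound (m/s)
v = 15 / 3.6;                 % source speed (m/s)
dopplerFactor = c / (c - v);  % ~1.012, i.e. the engine pitch is ~1.2% higher
                              % while the vehicle approaches
```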

The approaching vehicle was rendered in cyan, as this colour bears no established connotations of yielding or non-yielding behaviour to pedestrians (Bazilinskyy et al. 2020a). To mimic a dart-out crossing scenario more closely, a zebra crossing was omitted from the design; the inclusion of such a crossing would implicitly suggest that it is safe for the pedestrian to cross (in the Netherlands, traffic law mandates stopping for pedestrians waiting at the curb).

2.5 Augmented reality designs

The experiment included a total of six AR designs, as listed in Table 1, with an additional condition without AR functioning as the baseline for comparisons against the other designs.

Two DR solutions were designed, a see-through display (VideoSeeThrough) and a semi-transparent parked vehicle (TransparentVehicle), along with two video feed interfaces, namely a head-locked (VideoHead) and a body- and world-locked version (VideoStreet). In two additional designs, the VideoHead and VideoStreet interfaces were supplemented with a continuous 'waypoint arrow' pointing at the moving vehicle, to guide the participants' view. The arrow was positioned in the central top portion of the video feeds. Figure 5 provides screenshots of the six AR designs and the Baseline tested in this experiment. Additionally, the online data repository includes a screen capture video illustrating the various AR designs from the participant's perspective.

Fig. 5
figure 5

The six augmented reality designs (screenshot resolution 2880 × 2720, consistent with the resolution of the displays in Varjo VR-3).

According to the local-presence dimension of Rauschnabel et al. (2022), VideoHead exhibits a low local presence: the video feed is projected into the field of view but is not embedded in the world. VideoStreet can be described as having a medium local presence: it is present as a floating screen at a fixed position on the other side of the street, and is thus not truly part of the world. The two DR designs demonstrate a high local presence, as the information provided is strongly embedded in the real world: VideoSeeThrough presents the video feed as if one could look through the parked vehicle, while TransparentVehicle renders the parked vehicle itself semi-transparent so that the road behind it becomes visible.

Table 1 Overview of the six augmented reality designs

The VideoHead display was positioned in the upper section of the field of view, as per recommendations by Klose et al. (2019). This upper placement enabled unobstructed viewing of the road and other objects in the lower field of view. The display measured 8 cm in both height and width and was positioned 36 cm from the midpoint between the pedestrian's eyes. This arrangement yielded a field of view angle, both vertical and horizontal, of 12.7°.

The VideoStreet display was similarly positioned above road level and above objects on the curb, including a bench and a waste bin, to ensure an unobstructed view. The choice of position was also informed by prior research on AR in road-crossing scenarios (Tabone et al. 2021), as well as by the contemporary traffic system, in which traffic signals and pedestrian crosswalks typically appear perpendicular to or across the road. We consider the VideoStreet presentation to be world-locked, because it occupied fixed coordinates in the virtual world, as well as body-locked, because the participant in our experiment did not translate within the environment and the display thus always remained at a constant distance from the participant. The VideoStreet screen was 300 cm high and 400 cm wide, and was positioned 12.7 m from the pedestrian, leading to field of view angles of 13.5° vertically and 17.9° horizontally.

The VideoSeeThrough display, incorporated into the rear of the pickup truck, was 36 cm in height and 89 cm in width. When the pedestrian was positioned 4.25 m from the rear of the truck, it encompassed a vertical and horizontal field of view with angles of 4.9° and 12.0°, respectively. As for the TransparentVehicle, the transparency coefficient (alpha value) was set to 0.49 to generate a semi-transparent visual presentation.
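
The reported visual angles follow directly from the display sizes and viewing distances via θ = 2·arctan(s / 2d), where s is the display extent and d the viewing distance. A minimal MATLAB check (our own reconstruction):

```matlab
% Visual angle (deg) of a display of extent s (m) viewed from distance d (m)
fov = @(s, d) 2 * atand(s ./ (2 * d));

fov(0.08, 0.36)          % VideoHead: 8 cm at 36 cm              -> 12.7 deg
fov([3 4], 12.7)         % VideoStreet: 3 x 4 m at 12.7 m        -> 13.5 and 17.9 deg
fov([0.36 0.89], 4.25)   % VideoSeeThrough: 36 x 89 cm at 4.25 m -> 4.9 and 12.0 deg
```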

The video feed was captured by a stationary camera unit, centrally positioned and mounted on the top of the windshield of the pickup truck (Fig. 6, top left). This camera was oriented to face forward in alignment with the road, and provided a field of view spanning 60°. Figure 6 shows examples of the camera view in various conditions.

Fig. 6
figure 6

Top left: Windscreen camera placement. Top right: Close-up view of the VideoHead+ concept. Bottom left: Close-up view of the VideoSeeThrough concept. Bottom right: Close-up view of the TransparentVehicle concept

2.6 Experimental design

This study used a within-subjects design, with each participant experiencing all seven experimental conditions. Conditions were grouped into blocks, each containing six trials in which yielding and non-yielding scenarios were randomly allocated, to offset expectancy effects. The sequence of block presentation was counterbalanced using Bradley's (1958) balanced Latin square method, illustrated below. In total, the experiment included 1176 trials: 28 participants each completed 6 trials for each of the 7 experimental conditions (6 AR designs + Baseline).
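
For illustration, a balanced Latin square for seven conditions can be constructed as follows (a minimal MATLAB sketch of the Williams-type construction that Bradley (1958) describes, not the authors' exact allocation code). Because the number of conditions is odd, the mirrored sequences are appended so that each condition precedes every other condition equally often, yielding 14 sequences for the 28 participants:

```matlab
n = 7;                              % number of conditions
firstRow = zeros(1, n);             % seed row: 0, 1, n-1, 2, n-2, ...
firstRow(1:2) = [0 1];
lo = 2; hi = n - 1;
for k = 3:n
    if mod(k, 2) == 1
        firstRow(k) = hi; hi = hi - 1;   % odd positions take high values
    else
        firstRow(k) = lo; lo = lo + 1;   % even positions take low values
    end
end
square    = mod(firstRow + (0:n-1)', n) + 1;  % 7 cyclically shifted rows
sequences = [square; fliplr(square)];         % 14 balanced sequences
```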

2.7 Questionnaires

Participants were provided with information regarding the study purpose, the procedural layout of the experiment, the data management process, and their right to withdraw at any point, via an informed consent document. This was followed by a pre-experimental questionnaire to obtain information on general demographics, commuting behaviour, and prior experience with VR and gaming.

Upon the completion of each trial block, participants were prompted to rate their current state of well-being using the Misery Scale (MISC; Bos et al. 2005). If a participant's MISC score was 4 or greater, a break was offered. Following this, the NASA-TLX questionnaire (Hart and Staveland 1988) was administered to assess workload. This questionnaire comprises six items: (1) mental demand, (2) physical demand, (3) temporal demand, (4) performance, (5) effort, and (6) frustration, each evaluated on a 21-point scale ranging from 'perfect' to 'failure' for performance, and from 'very low' to 'very high' for the other five items. Lastly, acceptance of the seven experimental conditions in terms of usefulness and satisfaction was evaluated using the questionnaire developed by Van der Laan et al. (1997), in which participants rated nine semantic-differential items on a five-point scale. Usefulness was calculated as the mean of five items: (1) useful–useless, (3) bad–good, (5) effective–superfluous, (7) assisting–worthless, and (9) raising alertness–sleep-inducing; satisfaction was calculated as the mean of four items: (2) pleasant–unpleasant, (4) nice–annoying, (6) irritating–likeable, and (8) undesirable–desirable.

Due to the logistical impracticalities associated with asking participants to disengage from the VR environment in order to complete questionnaires, all post-block questionnaires were conducted within the virtual environment, with the participant still wearing the HMD. Participants were directed to verbally provide their responses, which were then recorded by the experimenter via Google Forms on a laptop. Figure 7 shows the questionnaires used to evaluate well-being, workload, and acceptance.

Fig. 7
figure 7

Post-block questionnaires, displayed above the road in front of the participant. Note: This screenshot shows a zoomed-out external view. During the experiment, the participant could turn their body and head to direct visual attention to one of the three questionnaires

2.8 Data analysis

Data was collected at a frequency of 50 Hz using a data logging script within Unity. No trials were excluded from the analysis.

For the analysis of keypress data, plots were created depicting the mean percentage of trials in which the key was pressed over time.

The keypress percentage per AR design was computed by calculating the mean over the period from 8.86 to 14.30 s, namely from the moment the approaching vehicle was completely occluded by the parked vehicle until the moment the approaching vehicle in the yielding condition began to brake. The initial seconds of the trials are less relevant to our analysis, as the approaching vehicle was still accelerating and executing its turn, and participants pressed the key in response to the 'Press now' instruction rather than based on perceived safety. The percentages for each participant were averaged across the six trials to create a single score per participant and condition, as sketched below. This averaging helps meet the assumptions of independence and normality required for valid statistical analysis, which would not hold if the scores of individual trials were treated as separate observations.
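
A minimal MATLAB sketch of this computation (variable names are our own; keyLog is assumed to be a 6 × N logical matrix holding the 50 Hz keypress log of one participant's six trials in a given condition):

```matlab
fs  = 50;                                     % logging frequency (Hz)
win = round(8.86 * fs) : round(14.30 * fs);   % analysis window (samples)
pctPerTrial = 100 * mean(keyLog(:, win), 2);  % keypress percentage per trial
score       = mean(pctPerTrial);              % mean over six trials: one score
```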

For the NASA-TLX responses, the 21-point scores were converted into percentage values. A composite score was then obtained by averaging the scores across the six items (Byers et al. 1989). For the acceptance questionnaire, the responses to Items 1, 2, 4, 5, 7, and 9 were mirrored, so that a higher score corresponds to higher usefulness/satisfaction. Next, the scores were offset from the 1-to-5 scale (see Fig. 7) to correspond with the original scale of −2 to +2.
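
A sketch of this scoring in MATLAB (our own reconstruction; tlx and acc are assumed matrices with one row per observation, NASA-TLX items coded 1–21, and acceptance items coded 1–5):

```matlab
% NASA-TLX: convert 21-point items to percentages and average (Byers et al. 1989)
tlxPct   = (tlx - 1) / 20 * 100;
tlxScore = mean(tlxPct, 2);

% Acceptance scale (Van der Laan et al. 1997): mirror selected items,
% recode 1..5 to -2..+2, and average the usefulness/satisfaction items
acc(:, [1 2 4 5 7 9]) = 6 - acc(:, [1 2 4 5 7 9]);
acc = acc - 3;
usefulness   = mean(acc(:, [1 3 5 7 9]), 2);
satisfaction = mean(acc(:, [2 4 6 8]), 2);
```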

The seven interface conditions were compared per dependent variable using a linear mixed-effects model in MathWorks MATLAB R2023b. In this model, the interface condition was a categorical fixed effect, and the block number (i.e., a number from 1 to 7 indicating whether that condition was presented first, second, third, fourth, fifth, sixth, or seventh for that participant) was a fixed effect and covariate. The participant number (1 to 28) was submitted as a random effect. The model was estimated using the maximum likelihood method, and the number of degrees of freedom was determined using the residual method. The covariate 'block number' was included to determine whether there was a learning/experience effect in our counterbalanced design; including this covariate enables a more powerful assessment of the differences between the seven interface conditions.
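
A sketch of how such a model can be fitted in MATLAB (assumed table and variable names; the authors' exact code is not given):

```matlab
% tbl: one row per participant x condition, with columns Score, Condition,
% Block (1..7), and Participant (1..28)
tbl.Condition = categorical(tbl.Condition);
lme = fitlme(tbl, 'Score ~ Condition + Block + (1|Participant)', ...
    'FitMethod', 'ML');                % maximum likelihood estimation
anova(lme, 'DFMethod', 'residual')     % fixed-effects F-tests, residual df
```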

Differences between pairs of conditions were assessed by evaluating the overlap of 95% confidence intervals. To calculate these confidence intervals, a method for within-subjects designs was used, as described by Morey (2008). The scores per dependent variable were first linearly detrended to correct for the above-mentioned learning/experience effects over the seven blocks.
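
For illustration, a minimal MATLAB sketch of this procedure (our reconstruction of the Cousineau-Morey method; X is assumed to be the 28 × 7 participants-by-conditions matrix of detrended scores):

```matlab
[n, C] = size(X);
Xnorm = X - mean(X, 2) + mean(X(:));   % remove between-subject variability
se    = std(Xnorm) / sqrt(n);          % per-condition standard error
se    = se * sqrt(C / (C - 1));        % Morey's (2008) bias correction
ci    = tinv(0.975, n - 1) .* se;      % half-width of the 95% CI
errorbar(1:C, mean(X), ci, 'o')        % condition means with within-subject CIs
```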

3 Results

All 28 participants completed the experiment. Overall, the MISC scores were low, with averages of 0.50, 0.82, 1.14, 1.25, 0.96, 1.18, and 1.21 for Blocks 1 through 7, respectively. Out of the 196 post-block MISC measurements (28 participants × 7 blocks), a MISC score of 4 occurred 14 times and a MISC score of 5 occurred once. These scores were attributed to 7 of the 28 participants. Among them, 5 participants agreed to take a break after completing a trial, with the breaks lasting 3–5 min.

Figure 8 shows the keypress percentages as a function of time for the seven conditions in the non-yielding scenario. Upon the approach of the vehicle, the majority of participants felt safe to cross, as indicated by more than 80% of them keeping the key pressed. Participants felt safest to cross in the TransparentVehicle condition and least safe in the Baseline condition. In the Baseline condition, participants tended to release the response key as soon as the vehicle disappeared behind the parked Nissan. For completeness, the same figures for the entire trial and for yielding and non-yielding vehicles separately are available in the Supplementary Material (Figures S1 and S2).

Fig. 8
figure 8

Keypress percentages for the seven conditions in the non-yielding scenario. The grey background represents the interval where the approaching vehicle could not be seen in the Baseline condition because it was fully occluded by the parked vehicle. Across the depicted interval, the vehicle drove 15 km/h

Table 2 presents the results of the linear mixed effects model, distinguishing between the effect of interface condition and the learning/experience effect. Figure 9 also shows the means and 95% confidence intervals, after applying a linear detrending to correct for the learning/experience effect. The effects of the interface conditions are described below.

  • The perception of safety, as measured through keypresses across the 8.86–14.30 s interval, showed significant differences between the seven experimental conditions, with the highest score for the TransparentVehicle, and the lowest score for the Baseline condition (Fig. 9, top left). The performance of the VideoHead and VideoStreet designs, including their guidance-included variations, was equivalent on this metric.

  • Usefulness (Fig. 9, top middle) and satisfaction (Fig. 9, top right) showed significant differences between conditions. The TransparentVehicle was found to be the most useful and satisfying, followed by the VideoSeeThrough. A negligible difference was observed between the scores of the standard VideoStreet design and the one equipped with the guiding arrow, both of which attained positive scores on the satisfaction and usefulness scales. A similarly negligible difference was noted between the scores of the VideoHead design and its guidance-included version, although these two conditions scored negatively on the satisfaction scale. Importantly, there was no substantial difference in the usefulness scores between VideoHead and VideoStreet, either with or without the guiding arrow. The Baseline condition attained the lowest scores, with net negative values on both the satisfaction and usefulness scales.

  • Self-reported workload (Fig. 9, bottom left) also showed significant differences between conditions, with the TransparentVehicle having the lowest workload, and the VideoHead the highest.

  • The MISC scores (Fig. 9, bottom middle) showed significant differences, with the highest discomfort for the two VideoHead conditions.

  • Upon completion of the experiment, participants ranked the AR solutions. The mean ranks (Fig. 9, bottom right) showed significant differences between conditions. Participants preferred the Baseline condition over the VideoHead display, while the TransparentVehicle emerged as the most favoured by a considerable margin, followed by the VideoSeeThrough display. The VideoHead designs were the least preferred.

Table 2 Results of linear mixed model analysis for the six dependent variables

It should also be noted that the seven mean values shown in Fig. 9 were found to correlate with each other. For example, the mean keypress percentage correlated strongly with mean self-reported usefulness (r = 0.87). Additionally, the mean perceived satisfaction strongly correlated with mean perceived usefulness (r = 0.82), mean perceived workload (r = -0.92), and the mean preference rank (r = -0.97).

Fig. 9
figure 9

Means over the participants with 95% confidence intervals for the seven experimental conditions, for the dependent variables of this study. A linear trend was removed before calculating means and confidence intervals to compensate for block presentation order

3.1 Head movement analysis

In the conditions featuring high local presence, namely VideoSeeThrough and TransparentVehicle, as well as in the Baseline condition, participants predominantly glanced leftward. This is demonstrated by yaw angles of approximately 125 degrees, corresponding to the location of the DR interfaces within the virtual environment (see Fig. 10, left).

In contrast, engagement with the VideoStreet displays, which were positioned across the road, resulted in participants mainly facing straight ahead, with a yaw angle near 175 degrees, while intermittently looking leftward to spot the approaching vehicle (Fig. 10, right).

The histogram for the VideoHead conditions reveals less defined peaks, at approximately 130 degrees and 180 degrees, suggesting that participants predominantly focused on the parked vehicle or looked directly across the road (Fig. 10, middle).

Fig. 10
figure 10

Distribution of yaw head rotation angles during the vehicle's approach phase (0 to 14.3 s). Note that a yaw angle of 180 degrees indicates that the participant gazed straight across the road, with angles smaller or greater than 180 degrees indicating a leftward or rightward gaze, respectively
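
For reference, a distribution such as those in Fig. 10 can be obtained from the logged head orientation as follows (a minimal MATLAB sketch with assumed variable names; t and yaw hold the 50 Hz time stamps and yaw angles in degrees):

```matlab
approach = t >= 0 & t <= 14.3;            % vehicle approach phase
histogram(yaw(approach), 90:5:270, ...    % 5-degree bins around the road side
    'Normalization', 'probability')
xlabel('Head yaw angle (deg)')
ylabel('Proportion of samples')
```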

4 Discussion

This study used a virtual reality setup with an HMD to examine the impact of six distinct AR designs on participants' perceived safety when crossing. The results show that pedestrians' perceived safety can be increased by eliminating obstructions to their view. In particular, in the Baseline condition, participants released the key earlier than in the other conditions, reflecting a diminished sense of safety. This may be attributed to participants feeling cautious because the parked vehicle blocked their view. It should be acknowledged that the subjective feeling of safety does not necessarily imply objective safety. Nevertheless, the results clearly demonstrate that eliminating occlusion through augmented reality solutions has the potential to positively affect participants' perceptions, as measured both during the experimental trials via the response key and after trial completion via questionnaires.

All six AR designs can be considered effective compared to the Baseline, in terms of objective keypresses and subjective usefulness (see the non-overlapping confidence intervals in Fig. 9). That is, participants, to a greater or lesser extent, used the additional information that made the occluded vehicle visible. However, when considering the results along the dimension from low local presence (VideoHead) and medium local presence (VideoStreet) to high local presence (VideoSeeThrough, TransparentVehicle; Rauschnabel et al. 2022), higher local presence proved more advantageous. The transparent car, which was fully conformal and did not introduce new elements into the environment, received particularly high scores. This result is consistent with the proximity compatibility principle of interface design (Wickens and Carswell 1995), which posits that when task elements need to be processed concurrently, it is advantageous to present the corresponding cues together, in an integrated manner, rather than separately. In this respect, it was advantageous that the crossing judgement could be made by looking directly at the vehicle rather than elsewhere, such as at a separate screen.

The head-locked interfaces did not receive satisfactory ratings and appeared to be less preferred than the Baseline. This observation is in line with the literature discussed in the introduction, which indicated that head-locked displays may cause discomfort. In our case, the substantial degree of visual accommodation required for the VideoHead screen, which was presented at a close distance, could be an additional explanatory factor.

Our analysis of head movements supports the notion that AR designs with a high degree of local presence permit participants to focus their attention on a single location, thereby obviating the need to divide attention. While the VideoHead conditions somewhat alleviate the issue of divided attention between the video feed and the car (given that the video feed remains constantly in view), attentional switching was still evident. While displaying information on the opposite side of the road constitutes a familiar and useful approach, as evidenced by pedestrian traffic signals (Tabone et al. 2023), a limitation is that it requires pedestrians to divide their attention between the information across the road and the oncoming vehicle. The result is a pronounced bimodal distribution of head yaw angles (see Fig. 10, right).

Our study found that AR information fully integrated within the environment yielded superior results compared to supplementary displays in the environment. This outcome appears to contradict a study by Tabone et al. (2023), in which a head-locked display was preferred over cues projected on the road or on the approaching vehicle itself. A plausible explanation is that, in the study by Tabone and colleagues, the head-locked information comprised an unambiguous and dependable message (the text "danger! vehicle is approaching"/"safe to cross" or a red/green pedestrian traffic light) that pedestrians could use to decide whether or not to cross the road. Such explicit information could be particularly advantageous when the pedestrian is visually distracted or has not yet identified the approaching vehicle in the environment. In our case, however, the extra screens did not provide any information that was not available in the DR solutions. Another factor is that the head-locked displays in Tabone et al. (2023) were presented in a CAVE-based simulator rather than an HMD, which reduces the likelihood of discomfort (Kim et al. 2012; Pala et al. 2021).

Although previous research has demonstrated that directional arrows in AR may have a beneficial effect on performance, this has primarily been established in the context of tasks requiring navigation or spatial orientation (Gabbard et al. 2019; Liu et al. 2021; Markov-Vetter et al. 2020). In our experiment, the guiding arrow did not have any meaningful impact on the dependent variables. A plausible explanation is that the vehicle always came from the left, and the participant, who completed a total of 42 trials, soon became aware of the vehicle's origin. A second explanation concerns divided attention. As mentioned above, the augmented screens required the participant to divide attention between the screen and the approaching traffic. The introduction of an arrow, in addition to the screen, would further increase the need for divided attention, something participants may have preferred to avoid by ignoring the arrow.

The present experiment was executed in a virtual environment to ensure a controlled comparison of conditions and to minimise the risk of technological glitches. Designs of AR solutions with minimal local presence, such as the VideoHead design, would demand the least complex technology for real-life applications (Rauschnabel et al. 2022). More specifically, a live video stream would need to be relayed and then displayed on the HMD. This would require that (parked) vehicles be equipped with streaming cameras, which is a reasonable assumption given that automated vehicles already use cameras to perceive their surroundings. The VideoStreet design additionally requires visual tracking technology: an HMD-embedded camera would need to determine certain world features, such as the ground surface and road layout, so that the screen can be presented at a predetermined height above this surface. In turn, the implementation of the two DR solutions requires more complex tracking technology, where the parked vehicle is recognised by the HMD camera, after which the vehicle is either rendered transparent or augmented with a screen. This may require inpainting techniques (e.g., Ardino et al. 2021; Elharrouss et al. 2020; Liao et al. 2020) to estimate the image behind the parked car. The approaching vehicle could then be displayed or simulated on this inpainted image, with the position and orientation of the vehicle obtained from a wireless broadcast from the vehicle itself. Although previous DR applications have been demonstrated in real-world automotive contexts, and the concept of essentially rendering objects invisible is intriguing, the technological prerequisites, along with the potential risks of visual artefacts such as jitter or perspective distortions, are substantial (Overmeyer et al. 2023; Rameau et al. 2016; Wilmott et al. 2022).

A limitation of our study is that it examined fixed display sizes and positions. Investigating displays of different dimensions could provide additional insight, as the current video feed displays were relatively small. Furthermore, a more optimal positioning of the screen, especially in regard to the distance from the user, may aid in reducing feelings of discomfort. Additionally, in our experiment, the pedestrian remained stationary on the curb, only able to visually scan the environment. For future research, it is recommended to consider either a CAVE-based simulator (Kaleefathullah et al. 2022) or a motion suit combined with an HMD (Kooijman et al. 2019) as alternatives to the fixed camera position used in the present study. These setups may offer less constrained environments and allow for more extensive measurements of pedestrian behaviour, including walking and crossing actions, as part of the perception-action cycle. It should be noted that in these cases, extra thought needs to be given to the design of the VideoStreet and VideoSeeThrough designs, in regard to the perspective of the screens.

Another limitation is that the current study investigated attention distribution using head-tracking because eye-tracking data was not stored. The head-tracking data seemed sufficiently sensitive to differentiate between conditions (see Fig. 10), although our previous research with a Varjo VR-2 HMD showed that eye-tracking provides sharper peaks in the distribution of the yaw angle than head-tracking data (Mok et al. 2022). This can be explained by the fact that participants can rotate their eyes to focus on a target, and therefore head orientation is only a proxy of where participants focus their attention.

A final limitation is that only one approaching vehicle was used in the current experiment. Further increasing the realism of the crossing situation could involve incorporating other vehicles. Such vehicles would increase visual demands and require divided attention.

5 Conclusion

This study shows the promise of removing visual occlusions through AR in increasing pedestrians’ feeling of safety. The findings indicate a user preference for AR solutions that exhibit a high degree of ‘local presence’—in this case, diminished reality solutions using see-through video and transparent vehicle renderings. These designs also demonstrated the least demand for divided attention. It is important to consider, however, the technological feasibility and complexity of implementing such solutions in a real-world scenario.

This study also revealed relatively high workload and discomfort for the head-locked AR display, as participants had to switch attention between the nearby AR screen and real-world cues. The guiding arrows in the video feeds did not improve performance. Finally, although the virtual reality experiment ensured control and safety, it might not reflect the complexities of implementing AR in real-world settings.